In the first week of June (2025), our team at Igalia held our regular meeting about Chromium.
We talked about our technical projects, but also about where the Chromium project is heading, given all the investments going into AI, and this interesting initiative from the Linux Foundation to fund open development of Chromium.
We also held our annual Igalia meeting, filled with many special moments — one of them being when Valerie, who had previously shared how Igalia is co-operatively managed, spoke about her personal journey and involvement with other cooperatives.
In
the previous
post we set up a Zephyr development environment and
checked that we could build applications for multiple
different targets. In this one we'll work on a sample
application that we can use to showcase a few Zephyr features
and as a template for other applications with a similar
workflow.
We'll simulate a real work scenario and develop firmware for a hardware board (in this example it'll be a Raspberry Pi Pico 2W), and we'll set up a development workflow that supports the native_sim target, so we can do most of the programming and software prototyping in a simulated environment without having to rely on the hardware.
When developing for new hardware, it's common practice for software teams to start working on firmware and drivers before the hardware is available, so the initial stages of software development for new silicon and boards are often tested on software or hardware emulators.
Then, after the prototyping is done, we can deploy and test the firmware on the real board. We'll see how we can do a simple behavioral model of some of the devices we'll use in the final hardware setup and how we can leverage this workflow to unit-test and refine the firmware.
This post is a walkthrough of the whole application. You can
find the code here.
Application description
The application we'll build and run on the Raspberry Pi Pico
2W will basically just listen for a button press. When the
button is pressed the app will enqueue some work to be done by
a processing thread and the result will be published via I2C
for a controller to request. At the same time, it will
configure two serial consoles, one for message logging and
another one for a command shell that can be used for testing
and debugging.
These are the main features we'll cover with this experiment:
Support for multiple targets.
Target-specific build and hardware configuration.
Logging.
Multiple console output.
Zephyr shell with custom commands.
Device emulation.
GPIO handling.
I2C target handling.
Thread synchronization and message-passing.
Deferred work (bottom halves).
Hardware setup
Besides the target board and the development machine, we'll be using a Linux-based development board to communicate with the Zephyr board via I2C. Anything will do here; I used a very old Raspberry Pi Model B that I had lying around.
The only additional peripheral we'll need is a physical button connected to a couple of board pins. If we don't have any, a jumper cable and a steady pulse will also work. Optionally, to take full advantage of the two serial ports, a USB-TTL UART converter will be useful. Here's what the full setup looks like:
For additional info on how to set up the Linux-based
Raspberry Pi, see the appendix at the
end.
Setting up the application files
Before we start coding we need to know how we'll structure
the application. There are certain conventions and a file structure that the build system expects to find in some scenarios. This is how we'll structure the application
(test_rpi):
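A sketch of that layout (the exact board file names are assumptions based on the targets used later in the post):

test_rpi/
├── CMakeLists.txt
├── Kconfig
├── prj.conf
├── boards/
│   ├── native_sim.conf
│   ├── native_sim.overlay
│   ├── rpi_pico2_rp2350a_m33.conf
│   └── rpi_pico2_rp2350a_m33.overlay
└── src/
    ├── main.c
    ├── processing.c
    └── emul.c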
Some of the files there we already know from
the previous
post: CMakeLists.txt
and prj.conf. All the application code will be in
the src directory, and we can structure it as we
want as long as we tell the build system about the files we
want to compile. For this application, the main code will be
in main.c, processing.c will contain
the code of the processing thread, and emul.c
will keep everything related to the device emulation for
the native_sim target and will be compiled only
when we build for that target. We describe this to the build
system through the contents of CMakeLists.txt:
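A minimal sketch of what that CMakeLists.txt can look like (the exact option used to guard the native_sim-only file is an assumption):

cmake_minimum_required(VERSION 3.20.0)
find_package(Zephyr REQUIRED HINTS $ENV{ZEPHYR_BASE})
project(test_rpi)

target_sources(app PRIVATE src/main.c src/processing.c)
# Emulation code is compiled only when building for native_sim
target_sources_ifdef(CONFIG_BOARD_NATIVE_SIM app PRIVATE src/emul.c)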
In prj.conf we'll put the general Zephyr
configuration options for this application. Note that inside
the boards directory there are two additional
.conf files. These are target-specific options that will be
merged with the common ones in prj.conf depending
on the target we choose to build for.
Normally, most of the options we'll put in the .conf files will already be defined, but we can also define application-specific config options that we can later reference in the .conf files and the code. We can define them in the application-specific Kconfig file, which the build system will pick up as the main Kconfig file if it exists. For this application we'll define one additional config option that we'll use to configure the log level for the program, so this is what Kconfig looks like:
config TEST_RPI_LOG_LEVEL
        int "Default log level for test_rpi"
        default 4

source "Kconfig.zephyr"
Here we're simply prepending a config option to the rest of the main Zephyr Kconfig file. We'll see how to use this option later.
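As a preview, a sketch of the kind of use it gets (the log module name here is illustrative):

#include <zephyr/logging/log.h>

LOG_MODULE_REGISTER(test_rpi, CONFIG_TEST_RPI_LOG_LEVEL);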
Finally, the boards directory also contains
target-specific overlay files. These are regular device tree
overlays which are normally used to configure the
hardware. More about that in a while.
Main application architecture
The application flow is structured in two main threads: the
main
thread
and an additional processing thread that does its work
separately. The main thread runs the application entry point
(the main() function) and does all the software and device setup. Normally it doesn't need to do anything more: we can use it to start other threads and have them do the rest of the work while the main thread sits idle, but in this case we're doing some work with it instead of creating an
additional thread for that. Regarding the processing thread,
we can think of it as "application code" that runs on its
own and provides a simple interface to interact with the rest
of the system1.
Once the main thread has finished all the initialization
process (creating threads, setting up callbacks, configuring
devices, etc.) it sits in an infinite loop waiting for messages
in a message queue. These messages are sent by the processing
thread, which also runs in a loop waiting for messages in
another queue. The messages to the processing thread are sent, as
a result of a button press, by the GPIO ISR callback registered
(actually, by the bottom half triggered by it and run by a
workqueue
thread). Ignoring the I2C part for now, this is what the application flow looks like:
Once the button press is detected, the GPIO ISR calls a
callback we registered in the main setup code. The callback
defers the work (1) through a workqueue (we'll see why later),
which sends some data to the processing thread (2). The data
it'll send is just an integer: the current uptime in
seconds. The processing thread will then do some processing
using that data (convert it to a string) and will send the
processed data to the main thread (3). Let's take a look at
the code that does all this.
Thread creation
As we mentioned, the main thread will be responsible for,
among other tasks, spawning other threads. In our example it
will create only one additional thread.
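A minimal sketch of how this can look with k_thread_create(); the stack size and priority are illustrative, and the two message queues are the ones defined further below:

#define PROC_STACK_SIZE 1024            /* illustrative sizes */
#define PROC_PRIORITY   5

K_THREAD_STACK_DEFINE(proc_stack, PROC_STACK_SIZE);
static struct k_thread proc_thread;

/* In main(): spawn the processing thread, passing the input and output
 * message queues as its first two parameters */
k_thread_create(&proc_thread, proc_stack, K_THREAD_STACK_SIZEOF(proc_stack),
                data_process, &in_msgq, &out_msgq, NULL,
                PROC_PRIORITY, 0, K_NO_WAIT);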
We'll see what the data_process() function does
in a while. For now, notice we're passing two message queues,
one for input and one for output, as parameters for that
function. These will be used as the interface to connect the
processing thread to the rest of the firmware.
GPIO handling
Zephyr's device tree support greatly simplifies device
handling and makes it really easy to parameterize and handle
device operations in an abstract way. In this example, we
define and reference the GPIO for the button in our setup
using a platform-independent device tree node:
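A sketch of how that looks in the code, using the devicetree macros (the variable name is illustrative):

#define ZEPHYR_USER_NODE DT_PATH(zephyr_user)

/* Pulls the "button-gpios" property from the "zephyr,user" node */
static const struct gpio_dt_spec button =
        GPIO_DT_SPEC_GET(ZEPHYR_USER_NODE, button_gpios);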
This looks for a "button-gpios" property in
the "zephyr,user"
node in the device tree of the target platform and
initializes a gpio_dt_spec structure containing the GPIO pin information defined in the
device tree. Note that this initialization and the check for
the "zephyr,user" node are static and happen at compile time
so, if the node isn't found, the error will be caught by the
build process.
This is how the node is defined for the Raspberry Pi Pico
2W:
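A minimal sketch of such an overlay node, assuming GP2 of bank 0 (which the Pico routes to physical pin 4); the exact pin index is an assumption:

/ {
        zephyr,user {
                button-gpios = <&gpio0 2 (GPIO_PULL_UP | GPIO_ACTIVE_LOW)>;
        };
};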
This defines the GPIO to be used as the second GPIO from bank 0; it'll be set up with an internal pull-up resistor and will be active-low. See the device tree GPIO API for details on the specification format. On the board, that GPIO is routed to pin 4:
Now we'll use
the GPIO
API to configure the GPIO as defined and to add a callback
that will run when the button is pressed:
if (!gpio_is_ready_dt(&button)) {
        LOG_ERR("Error: button device %s is not ready",
                button.port->name);
        return 0;
}

ret = gpio_pin_configure_dt(&button, GPIO_INPUT);
if (ret != 0) {
        LOG_ERR("Error %d: failed to configure %s pin %d",
                ret, button.port->name, button.pin);
        return 0;
}

ret = gpio_pin_interrupt_configure_dt(&button,
                                      GPIO_INT_EDGE_TO_ACTIVE);
if (ret != 0) {
        LOG_ERR("Error %d: failed to configure interrupt on %s pin %d",
                ret, button.port->name, button.pin);
        return 0;
}

gpio_init_callback(&button_cb_data, button_pressed, BIT(button.pin));
gpio_add_callback(button.port, &button_cb_data);
We're configuring the pin as an input and then we're enabling
interrupts for it when it goes to logical level "high". In
this case, since we defined it as active-low, the interrupt
will be triggered when the pin transitions from the stable
pulled-up voltage to ground.
Finally, we're initializing and adding a callback function
that will be called by the ISR when it detects that this GPIO
goes active. We'll use this callback to start an action from
a user event. The specific interrupt handling is done
by the target-specific device driver2 and we don't have to worry about
that, our code can remain device-independent.
NOTE: The callback we'll define is meant as
a simple exercise for illustrative purposes. Zephyr provides an
input
subsystem to handle cases
like this
properly.
What we want to do in the callback is to send a message to
the processing thread. The communication input channel to the
thread is the in_msgq message queue, and the data
we'll send is a simple 32-bit integer with the number of
uptime seconds. But before doing that, we'll first de-bounce
the button press using a simple idea: scheduling the message delivery in a workqueue thread:
/*
 * Deferred irq work triggered by the GPIO IRQ callback
 * (button_pressed). This should run some time after the ISR, at which
 * point the button press should be stable after the initial bouncing.
 *
 * Checks the button status and sends the current system uptime in
 * seconds through in_msgq if the button is still pressed.
 */
static void debounce_expired(struct k_work *work)
{
        unsigned int data = k_uptime_seconds();

        ARG_UNUSED(work);

        if (gpio_pin_get_dt(&button))
                k_msgq_put(&in_msgq, &data, K_NO_WAIT);
}
static K_WORK_DELAYABLE_DEFINE(debounce_work, debounce_expired);
/*
 * Callback function for the button GPIO IRQ.
 * De-bounces the button press by scheduling the processing into a
 * workqueue.
 */
void button_pressed(const struct device *dev, struct gpio_callback *cb,
                    uint32_t pins)
{
        k_work_reschedule(&debounce_work, K_MSEC(30));
}
That way, every unwanted oscillation will cause a
re-scheduling of the message delivery (replacing any prior
scheduling). debounce_expired will eventually
read the GPIO status and send the message.
Thread synchronization and messaging
As I mentioned earlier, the interface with the processing
thread consists of two message queues, one for input and one for
output. These are defined statically with
the K_MSGQ_DEFINE macro:
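A sketch of those definitions, assuming PROC_MSG_SIZE is 8 as described below:

#define PROC_MSG_SIZE 8

/* One pending message per queue: a 32-bit integer in, an 8-byte string out */
K_MSGQ_DEFINE(in_msgq, sizeof(uint32_t), 1, 4);
K_MSGQ_DEFINE(out_msgq, PROC_MSG_SIZE, 1, 1);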
Both queues have space to hold only one message each. For the
input queue (the one we'll use to send messages to the
processing thread), each message will be one 32-bit
integer. The messages of the output queue (the one the
processing thread will use to send messages) are 8 bytes
long.
Once the main thread is done initializing everything, it'll
stay in an infinite loop waiting for messages from the
processing thread. The processing thread will also run a loop
waiting for incoming messages in the input queue, which are
sent by the button callback, as we saw earlier, so the message
queues will be used both for transferring data and for
synchronization. Since the code running in the processing
thread is so small, I'll paste it here in its entirety:
static char data_out[PROC_MSG_SIZE];

/*
 * Receives a message on the message queue passed in p1, does some
 * processing on the data received and sends a response on the message
 * queue passed in p2.
 */
void data_process(void *p1, void *p2, void *p3)
{
        struct k_msgq *inq = p1;
        struct k_msgq *outq = p2;

        ARG_UNUSED(p3);

        while (1) {
                unsigned int data;

                k_msgq_get(inq, &data, K_FOREVER);
                LOG_DBG("Received: %d", data);

                /* Data processing: convert integer to string */
                snprintf(data_out, sizeof(data_out), "%d", data);
                k_msgq_put(outq, data_out, K_NO_WAIT);
        }
}
I2C target implementation
Now that we have a way to interact with the program by
inputting an external event (a button press), we'll add a way
for it to communicate with the outside world: we're going to
turn our device into an I2C target that will listen for command requests from a controller and send data back to it. In our setup, the controller will be the Linux-based Raspberry Pi; see the diagram in the Hardware setup section above for details on how the boards are connected.
In order to define an I2C target we first need a
suitable device defined in the device tree. To abstract the
actual target-dependent device, we'll define and use an alias
for it that we can redefine for every supported target. For
instance, for the Raspberry Pi Pico 2W we define this alias in
its device tree overlay:
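A sketch of that overlay, assuming the target hangs off the board's i2c0 controller:

/ {
        aliases {
                i2ctarget = &i2c0;
        };
};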
So now in the code we can reference
the i2ctarget alias to load the device info and
initialize it:
/*
 * Get I2C device configuration from the devicetree i2ctarget alias.
 * Check node availability at build time.
 */
#define I2C_NODE DT_ALIAS(i2ctarget)
#if !DT_NODE_HAS_STATUS_OKAY(I2C_NODE)
#error "Unsupported board: i2ctarget devicetree alias is not defined"
#endif

const struct device *i2c_target = DEVICE_DT_GET(I2C_NODE);
To register the device as a target, we'll use
the i2c_target_register()
function, which takes the loaded device tree device and an
I2C target configuration (struct
i2c_target_config) containing the I2C
address we choose for it and a set of callbacks for all the
possible events. It's in these callbacks that we'll define the target's functionality:
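A sketch of what that registration can look like; the target address shown is illustrative:

#define I2C_ADDR 0x60                   /* illustrative address */

static struct i2c_target_callbacks target_callbacks = {
        .write_requested = write_requested_cb,
        .write_received  = write_received_cb,
        .read_requested  = read_requested_cb,
        .read_processed  = read_processed_cb,
        .stop            = stop_cb,
};

static struct i2c_target_config target_cfg = {
        .address   = I2C_ADDR,
        .callbacks = &target_callbacks,
};

/* In the setup code, after loading the device: */
i2c_target_register(i2c_target, &target_cfg);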
Each of those callbacks will be called in response to an event started by the controller. Depending on how we want to
define the target we'll need to code the callbacks to react
appropriately to the controller requests. For this application
we'll define a register that the controller can read to get a
timestamp (the firmware uptime in seconds) from the last time
the button was pressed. The number will be received as an
8-byte ASCII string.
If the controller is the Linux-based Raspberry Pi, we can use
the i2c-tools
to poll the target and read from it:
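For example (the bus number and target address are illustrative and depend on the board revision and the I2C address chosen in the firmware):

# On the Linux Raspberry Pi
i2cdetect -y 1              # the target should show up in the address map
i2cget -y 1 0x60 0x00       # select register 0 and read a byte back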
We basically want the device to react when the controller
sends a write request (to select the register and prepare the
data), when it sends a read request (to send the data bytes
back to the controller) and when it sends a stop
condition.
To handle the data to be sent, the I2C callback
functions manage an internal buffer that will hold the string
data to send to the controller, and we'll load this buffer
with the contents of a source buffer that's updated every time
the main thread receives data from the processing thread (a
double-buffer scheme). Then, when we program an I2C
transfer we walk this internal buffer sending each byte to the
controller as we receive read requests. When the transfer
finishes or is aborted, we reload the buffer and rewind it for
the next transfer:
typedef enum {
        I2C_REG_UPTIME,
        I2C_REG_NOT_SUPPORTED,
        I2C_REG_DEFAULT = I2C_REG_UPTIME
} i2c_register_t;

/* I2C data structures */
static char i2cbuffer[PROC_MSG_SIZE];
static int i2cidx = -1;
static i2c_register_t i2creg = I2C_REG_DEFAULT;
[...]
/*
 * Callback called on a write request from the controller.
 */
int write_requested_cb(struct i2c_target_config *config)
{
        LOG_DBG("I2C WRITE start");

        return 0;
}
/*
 * Callback called when a byte was received on an ongoing write request
 * from the controller.
 */
int write_received_cb(struct i2c_target_config *config, uint8_t val)
{
        LOG_DBG("I2C WRITE: 0x%02x", val);

        i2creg = val;
        if (val == I2C_REG_UPTIME)
                i2cidx = -1;

        return 0;
}
/*
 * Callback called on a read request from the controller.
 * If it's a first read, load the output buffer contents from the
 * current contents of the source data buffer (str_data).
 *
 * The data byte sent to the controller is pointed to by val.
 * Returns:
 *   0 if there's additional data to send
 *   -ENOMEM if the byte sent is the end of the data transfer
 *   -EIO if the selected register isn't supported
 */
int read_requested_cb(struct i2c_target_config *config, uint8_t *val)
{
        if (i2creg != I2C_REG_UPTIME)
                return -EIO;

        LOG_DBG("I2C READ started. i2cidx: %d", i2cidx);

        if (i2cidx < 0) {
                /* Copy source buffer to the i2c output buffer */
                k_mutex_lock(&str_data_mutex, K_FOREVER);
                strncpy(i2cbuffer, str_data, PROC_MSG_SIZE);
                k_mutex_unlock(&str_data_mutex);
        }

        i2cidx++;
        if (i2cidx == PROC_MSG_SIZE) {
                i2cidx = -1;
                return -ENOMEM;
        }

        *val = i2cbuffer[i2cidx];
        LOG_DBG("I2C READ send: 0x%02x", *val);

        return 0;
}
/*
 * Callback called on a continued read request from the
 * controller. We're implementing repeated start semantics, so this will
 * always return -ENOMEM to signal that a new START request is needed.
 */
int read_processed_cb(struct i2c_target_config *config, uint8_t *val)
{
        LOG_DBG("I2C READ continued");

        return -ENOMEM;
}
/*
 * Callback called on a stop request from the controller. Rewinds the
 * index of the i2c data buffer to prepare for the next send.
 */
int stop_cb(struct i2c_target_config *config)
{
        i2cidx = -1;
        LOG_DBG("I2C STOP");

        return 0;
}
The application logic is done at this point, and we were
careful to write it in a platform-agnostic way. As mentioned
earlier, all the target-specific details are abstracted away
by the device tree and the Zephyr APIs. Although we're
developing with a real deployment board in mind, it's very
useful to be able to develop and test using a behavioral model
of the hardware that we can program to behave as close to the
real hardware as we need and that we can run on our
development machine without the cost and restrictions of the
real hardware.
To do this, we'll rely on the native_sim board3, which implements the core OS
services on top of a POSIX compatibility layer, and we'll add
code to simulate the button press and the I2C
requests.
Emulating a button press
We'll use
the gpio_emul
driver as a base for our emulated
button. The native_sim device tree already defines
an emulated GPIO bank for this:
We'll model the button press as a four-phase event consisting of an initial status change caused by the press, then a
semi-random rebound phase, then a phase of signal
stabilization after the rebounds stop, and finally a button
release. Using the gpio_emul API it'll look like
this:
/*
 * Emulates a button press with bouncing.
 */
static void button_press(void)
{
        const struct device *dev = device_get_binding(button.port->name);
        int n_bounces = sys_rand8_get() % 10;
        int state = 1;
        int i;

        /* Press */
        gpio_emul_input_set(dev, 0, state);

        /* Bouncing */
        for (i = 0; i < n_bounces; i++) {
                state = state ? 0 : 1;
                k_busy_wait(1000 * (sys_rand8_get() % 10));
                gpio_emul_input_set(dev, 0, state);
        }

        /* Stabilization */
        gpio_emul_input_set(dev, 0, 1);
        k_busy_wait(100000);

        /* Release */
        gpio_emul_input_set(dev, 0, 0);
}
The driver will take care of checking if the state changes
need
to raise interrupts, depending on the GPIO configuration,
and will trigger the registered callback that we defined
earlier.
Emulating an I2C controller
As with the button emulator, we'll rely on an existing
emulated device driver for
this: i2c_emul. Again,
the device tree for the target already defines the node we
need:
So we can define a machine-independent alias that we can
reference in the code:
/ {
        aliases {
                i2ctarget = &i2c0;
        };
};
The events we need to emulate are the requests sent by the
controller: READ start, WRITE start and STOP. We can define
these based on
the i2c_transfer()
API function which will, in this case, use
the i2c_emul
driver implementation to simulate the transfer. As in the
GPIO emulation case, this will trigger the appropriate
callbacks. The implementation of our controller requests looks
like this:
/*
 * A real controller may want to continue reading after the first
 * received byte. We're implementing repeated-start semantics so we'll
 * only be sending one byte per transfer, but we need to allocate space
 * for an extra byte to process the possible additional read request.
 */
static uint8_t emul_read_buf[2];

/*
 * Emulates a single I2C READ START request from a controller.
 */
static uint8_t *i2c_emul_read(void)
{
        struct i2c_msg msg;
        int ret;

        msg.buf = emul_read_buf;
        msg.len = sizeof(emul_read_buf);
        msg.flags = I2C_MSG_RESTART | I2C_MSG_READ;
        ret = i2c_transfer(i2c_target, &msg, 1, I2C_ADDR);
        if (ret == -EIO)
                return NULL;

        return emul_read_buf;
}
static void i2c_emul_write(uint8_t *data, int len)
{
        struct i2c_msg msg;

        /*
         * NOTE: It's not explicitly said anywhere that msg.buf can be
         * NULL even if msg.len is 0. The behavior may be
         * driver-specific and prone to change so we're being safe here
         * by using a 1-byte buffer.
         */
        msg.buf = data;
        msg.len = len;
        msg.flags = I2C_MSG_WRITE;
        i2c_transfer(i2c_target, &msg, 1, I2C_ADDR);
}
/*
 * Emulates an explicit I2C STOP sent from a controller.
 */
static void i2c_emul_stop(void)
{
        struct i2c_msg msg;
        uint8_t buf = 0;

        /*
         * NOTE: It's not explicitly said anywhere that msg.buf can be
         * NULL even if msg.len is 0. The behavior may be
         * driver-specific and prone to change so we're being safe here
         * by using a 1-byte buffer.
         */
        msg.buf = &buf;
        msg.len = 0;
        msg.flags = I2C_MSG_WRITE | I2C_MSG_STOP;
        i2c_transfer(i2c_target, &msg, 1, I2C_ADDR);
}
Now we can define a complete request for an "uptime read"
operation in terms of these primitives:
/*
* Emulates an I2C "UPTIME" command request from a controller using
* repeated start.
*/
static void i2c_emul_uptime(const struct shell *sh, size_t argc, char **argv)
{
        uint8_t buffer[PROC_MSG_SIZE] = {0};
        i2c_register_t reg = I2C_REG_UPTIME;
        int i;

        i2c_emul_write((uint8_t *)&reg, 1);
        for (i = 0; i < PROC_MSG_SIZE; i++) {
                uint8_t *b = i2c_emul_read();

                if (b == NULL)
                        break;
                buffer[i] = *b;
        }
        i2c_emul_stop();

        if (i == PROC_MSG_SIZE) {
                shell_print(sh, "%s", buffer);
        } else {
                shell_print(sh, "Transfer error");
        }
}
OK, so now that we have implemented all the emulated operations we needed, we need a way to trigger them in the emulated environment. The Zephyr
shell is tremendously useful for cases like this.
Shell commands
The shell module in Zephyr has a lot of useful features that
we can use for debugging. It's quite extensive and talking
about it in detail is out of the scope of this post, but I'll
show how simple it is to add a few custom commands to trigger
the button presses and the I2C controller requests
from a console. In fact, for our purposes, the whole thing is
as simple as this:
SHELL_CMD_REGISTER(buttonpress, NULL, "Simulates a button press", button_press);
SHELL_CMD_REGISTER(i2cread, NULL, "Simulates an I2C read request", i2c_emul_read);
SHELL_CMD_REGISTER(i2cuptime, NULL, "Simulates an I2C uptime request", i2c_emul_uptime);
SHELL_CMD_REGISTER(i2cstop, NULL, "Simulates an I2C stop request", i2c_emul_stop);
We'll enable these commands only when building for
the native_sim board. With the configuration
provided, once we run the application we'll have the log
output in stdout and the shell UART connected to a pseudotty,
so we can access it in a separate terminal and run these
commands while we see the output in the terminal where we ran
the application:
$ ./build/zephyr/zephyr.exe
WARNING: Using a test - not safe - entropy source
uart connected to pseudotty: /dev/pts/16
*** Booting Zephyr OS build v4.1.0-6569-gf4a0beb2b7b1 ***
# In another terminal
$ screen /dev/pts/16
uart:~$
uart:~$ help
Please press the <Tab> button to see all available commands.
You can also use the <Tab> button to prompt or auto-complete all commands or its subcommands.
You can try to call commands with <-h> or <--help> parameter for more information.
Shell supports following meta-keys:
Ctrl + (a key from: abcdefklnpuw)
Alt + (a key from: bf)
Please refer to shell documentation for more details.
Available commands:
buttonpress : Simulates a button press
clear : Clear screen.
device : Device commands
devmem : Read/write physical memory
Usage:
Read memory at address with optional width:
devmem <address> [<width>]
Write memory at address with mandatory width and value:
devmem <address> <width> <value>
help : Prints the help message.
history : Command history.
i2cread : Simulates an I2C read request
i2cstop : Simulates an I2C stop request
i2cuptime : Simulates an I2C uptime request
kernel : Kernel commands
rem : Ignore lines beginning with 'rem '
resize : Console gets terminal screen size or assumes default in case
the readout fails. It must be executed after each terminal
width change to ensure correct text display.
retval : Print return value of most recent command
shell : Useful, not Unix-like shell commands.
To simulate a button press (ie. capture the current
uptime):
uart:~$ buttonpress
And the log output should print the enabled debug
messages:
Appendix: setting up the Linux-based Raspberry Pi
This is the process I followed to set up a Linux system on a Raspberry Pi (very old, model 1 B). There are plenty of instructions for this on the Web, and you can probably just pick up a pre-packaged and pre-configured Raspberry Pi OS and be done with it faster, so I'm adding this here for completeness and because I wanted finer-grained control of what I put into it.
The only hardware requirement is an SD card with two
partitions: a small (~50MB) FAT32 boot partition and the rest
of the space for the rootfs partition, which I formatted as
ext4. The boot partition should contain a specific set of
configuration files and binary blobs, as well as the kernel
that we'll build and the appropriate device tree binary. See
the official
docs for more information on the boot partition contents
and this
repo for the binary blobs. For this board, the minimum
files needed are:
bootcode.bin: the second-stage bootloader, loaded by the
first-stage bootloader in the BCM2835 ROM. Run by the GPU.
start.elf: GPU firmware, starts the ARM CPU.
fixup.dat: needed by start.elf. Used to configure the SDRAM.
kernel.img: this is the kernel image we'll build.
dtb files and overlays.
And, optionally but very recommended:
config.txt: bootloader configuration.
cmdline.txt: kernel command-line parameters.
In practice, pretty much all Linux setups will also have
these files. For our case we'll need to add one additional
config entry to the config.txt file in order to
enable the I2C bus:
dtparam=i2c_arm=on
Once we have the boot partition populated with the basic
required files (minus the kernel and dtb files), the two main
ingredients we need to build now are the kernel image and the
root filesystem.
There's nothing non-standard about how we'll generate this
kernel image, so you can search the Web for references on how
the process works if you need to. The only thing to take into account is that we'll pick
the Raspberry
Pi kernel instead of a vanilla mainline kernel. I also
recommend getting the arm-linux-gnueabi
cross-toolchain
from kernel.org.
After installing the toolchain and cloning the repo, we just
have to run the usual commands to configure the kernel, build
the image, the device tree binaries, the modules and have the
modules installed in a specific directory, but first we'll
add some extra config options:
cd kernel_dir
KERNEL=kernel
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- bcmrpi_defconfig
We'll need to add at least ext4 builtin support so that the
kernel can mount the rootfs, and I2C support for our
experiments, so we need to edit .config and add these:
CONFIG_EXT4_FS=y
CONFIG_I2C=y
And run the olddefconfig target. Then we can
proceed with the rest of the build steps:
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- olddefconfig
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- zImage modules dtbs -j$(nproc)
mkdir modules
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- INSTALL_MOD_PATH=./modules modules_install
Now we need to copy the kernel and the dtbs to the boot partition of the
sd card:
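A sketch of that step, assuming the boot partition is mounted at /mnt/boot (the dtb paths can vary between kernel versions):

cp arch/arm/boot/zImage /mnt/boot/kernel.img
cp arch/arm/boot/dts/*.dtb /mnt/boot/
mkdir -p /mnt/boot/overlays
cp arch/arm/boot/dts/overlays/*.dtbo /mnt/boot/overlays/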
(we really only need the dtb for this particular board, but
anyway).
Setting up a Debian rootfs
There are many ways to do this, but I normally use the classic
debootstrap
to build Debian root filesystems. Since I don't always know which
packages I'll need to install ahead of time, the strategy I
follow is to build a minimal image with the bare minimum
requirements and then boot it either on a virtual machine or
in the final target and do the rest of the installation and
setup there. So for the initial setup I'll only include the
openssh-server package:
mkdir bookworm_armel_raspi
sudo debootstrap --arch armel --include=openssh-server bookworm \
bookworm_armel_raspi http://deb.debian.org/debian
# Remove the root password
sudo sed -i '/^root/ { s/:x:/::/ }' bookworm_armel_raspi/etc/passwd
# Create a pair of ssh keys and install them to allow passwordless
# ssh logins
cd ~/.ssh
ssh-keygen -f raspi
sudo mkdir bookworm_armel_raspi/root/.ssh
cat raspi.pub | sudo tee bookworm_armel_raspi/root/.ssh/authorized_keys
Now we'll copy the kernel modules to the rootfs. From the kernel directory, and
based on the build instructions above:
cd kernel_dir
sudo cp -fr modules/lib/modules /path_to_rootfs_mountpoint/lib
If your distro provides qemu static binaries (eg. Debian:
qemu-user-static), it's a good idea to copy the qemu binary to the
rootfs so we can mount it locally and run apt-get on it:
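On Debian that boils down to something like:

sudo cp /usr/bin/qemu-arm-static bookworm_armel_raspi/usr/bin/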
Otherwise, we can boot a kernel on qemu and load the rootfs there to
continue the installation. Next we'll create and populate the filesystem
image, then we can boot it on qemu for additional tweaks or
dump it into the rootfs partition of the SD card:
# Make rootfs image
fallocate -l 2G bookworm_armel_raspi.img
sudo mkfs -t ext4 bookworm_armel_raspi.img
sudo mkdir /mnt/rootfs
sudo mount -o loop bookworm_armel_raspi.img /mnt/rootfs/
sudo cp -a bookworm_armel_raspi/* /mnt/rootfs/
sudo umount /mnt/rootfs
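A sketch of the final copy to the SD card (the dd flags are illustrative):

# Dump the image into the SD card rootfs partition
sudo dd if=bookworm_armel_raspi.img of=/dev/sda2 bs=4M status=progress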
(Substitute /dev/sda2 for the sd card rootfs
partition in your system).
At this point, if we need to do any extra configuration steps
we can either:
Mount the SD card and make the changes there.
Boot the filesystem image in qemu with a suitable kernel
and make the changes in a live system, then dump the
changes into the SD card again.
Boot the board and make the changes there directly. For
this we'll need to access the board serial console through
its UART pins.
Here are some of the changes I made. First, network
configuration. I'm setting up a dedicated point-to-point
Ethernet link between the development machine (a Linux laptop)
and the Raspberry Pi, with fixed IPs. That means I'll use a
separate subnet for this minimal LAN and that the laptop will
forward traffic between the Ethernet NIC and the WLAN
interface that's connected to the Internet. In the rootfs I
added a file
(/etc/systemd/network/20-wired.network) with the
following contents:
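A sketch of that file, assuming the wired interface is named eth0:

[Match]
Name=eth0

[Network]
Address=192.168.2.101/24
Gateway=192.168.2.100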
Where 192.168.2.101 is the address of the board NIC and
192.168.2.100 is that of the Ethernet NIC in my laptop. Then,
assuming we have access to the serial console of the board and
we logged in as root, we need to
enable systemd-networkd:
systemctl enable systemd-networkd
Additionally, we need to edit the ssh server configuration to
allow login as root. We can do this by
setting PermitRootLogin yes
in /etc/ssh/sshd_config.
On the development machine, I configured traffic forwarding to the WLAN interface:
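A sketch of that configuration, assuming eth0 is the wired NIC and wlan0 the Internet-facing interface:

sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
sudo iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT
sudo iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT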
1: Although in this case the thread is a regular
kernel thread and runs on the same memory space as the rest of
the code, so there's no memory protection. See
the User
Mode page in the docs for more
details.↩
2: As a reference, for the Raspberry Pi Pico 2W,
this
is where the ISR is registered
for enabled
GPIO devices,
and this
is the ISR that checks the pin status and triggers the
registered callbacks.↩
Adaptation of WPE WebKit targeting the Android operating system.
WPE-Android has been updated to use WebKit 2.48.5. Of particular interest for development on Android in this update is the support for using the system logd service, which can be configured using system properties. For example, the following will enable logging all warnings:
adb shell setprop debug.log.WPEWebKit all
adb shell setprop log.tag.WPEWebKit WARN
Stable releases of WebKitGTK 2.48.5 and WPE WebKit 2.48.5 are now available. These include the fixes and improvements from the corresponding 2.48.4 ones, and additionally solve a number of security issues. Advisory WSA-2025-0005 (GTK, WPE) covers the included security patches.
Ruby was re-added to the GNOME SDK, thanks to Michael Catanzaro and Jordan Petridis. So we're happy to report that the WebKitGTK nightly builds for GNOME Web Canary are now fixed and Canary updates were resumed.
August greetings, comrades! Today I want to bookend some recent work on
my Immix-inspired garbage
collector:
firstly, an idea with muddled results, then a slog through heuristics.
the big idea
My mostly-marking collector’s main space is called the “nofl space”.
Its name comes from its historical evolution from mark-sweep to mark-region:
instead of sweeping unused memory to freelists and allocating from those
freelists, sweeping is interleaved with allocation; “nofl” means
“no free-list”. As it finds holes, the collector bump-pointer allocates into those
holes. If an allocation doesn’t fit into the current hole, the collector sweeps
some more to find the next hole, possibly fetching another block. Space
for holes that are too small is effectively wasted as fragmentation;
mutators will try again after the next GC. Blocks with lots of
holes will be chosen for opportunistic evacuation, which is the heap
defragmentation mechanism.
Hole-too-small fragmentation has bothered me, because it presents a
potential pathology. You don’t know how a GC will be used or what the
user’s allocation pattern will be; if it is a mix of medium (say, a
kilobyte) and small (say, 16 bytes) allocations, one could imagine a
medium allocation having to sweep over lots of holes, discarding them in
the process, which hastens the next collection. Seems wasteful,
especially for non-moving configurations.
So I had a thought: why not collect those holes into a size-segregated
freelist? We just cleared the hole, the memory is core-local, and we
might as well. Then before fetching a new block, the allocator
slow-path can see if it can service an allocation from the second-chance
freelist of holes. This decreases locality a bit, but maybe it’s worth
it.
Thing is, I implemented it, and I don’t know if it’s worth it! It seems
to interfere with evacuation, in that the blocks that would otherwise be
most profitable to evacuate, because they contain many holes, are
instead filled up with junk due to second-chance allocation from the
freelist. I need to do more measurements, but I think my big-brained
idea is a bit of a wash, at least if evacuation is enabled.
heap growth
When running the new collector in Guile, we have a performance oracle in
the form of BDW: it had better be faster for Guile to compile a Scheme
file with the new nofl-based collector than with BDW. In this use case
we have an additional degree of freedom, in that unlike the lab tests of
nofl vs BDW, we don’t impose a fixed heap size, and instead allow
heuristics to determine the growth.
BDW’s built-in heap growth heuristics are very opaque. You give it a
heap multiplier, but as a divisor truncated to an integer. It’s very
imprecise. Additionally, there are nonlinearities: BDW is relatively
more generous for smaller heaps, because it attempts to model and amortize
tracing cost, and there are some fixed costs (thread sizes, static data
sizes) that don’t depend on live data size.
Thing is, BDW’s heuristics work pretty well. For example, I had a
process that ended with a heap of about 60M, for a peak live data size
of 25M or so. If I ran my collector with a fixed heap multiplier, it
wouldn’t do as well as BDW, because it collected much more frequently
when the heap was smaller.
I ended up switching from the primitive
“size the heap as a multiple of live data” strategy to live data plus a
square root factor; this is like what Racket ended up doing in its
simple implementation of MemBalancer. (I do have a proper implementation
of MemBalancer, with time measurement and shrinking and all, but I
haven’t put it through its paces yet.) With this fix I can meet BDW’s performance
for my Guile-compiling-Guile-with-growable-heap workload. It would be
nice to exceed BDW of course!
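In code, one way to read that sizing rule is the sketch below; the headroom constant is illustrative and not taken from the actual collector:

#include <math.h>
#include <stddef.h>

/* Heap target = live data plus a sublinear (square-root) headroom term. */
static size_t heap_target(size_t live_bytes, double headroom)
{
        return live_bytes + (size_t)(headroom * sqrt((double)live_bytes));
}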
parallel worklist tweaks
Previously, in parallel configurations, trace workers would each have a
Chase-Lev deque to which they could publish objects needing tracing.
Any worker could steal an object from the top of a worker’s public
deque. Also, each worker had a local, unsynchronized FIFO worklist,
some 1000 entries in length; when this worklist filled up, the worker
would publish its contents.
There is a pathology for this kind of setup, in which one worker can end
up with a lot of work that it never publishes. For example, if there
are 100 long singly-linked lists on the heap, and the worker happens to have them all on
its local FIFO, then perhaps they never get published, because the FIFO
never overflows; you end up not parallelising. This seems to be the case
in one microbenchmark. I switched to not have local worklists at all;
perhaps this was not the right thing, but who knows. Will poke in
future.
a hilarious bug
Sometimes you need to know whether a given address is in an object
managed by the garbage collector. For the nofl space it’s pretty easy,
as we have big slabs of memory; bisecting over the array of slabs is
fast. But for large objects whose memory comes from the kernel, we
don’t have that. (Yes, you can reserve a big ol’ region with
PROT_NONE and such, and then allocate into that region; I don’t do
that currently.)
Previously I had a splay tree for lookup. Splay trees are great but not
so amenable to concurrent access, and parallel marking is one place where we need to do this lookup. So I prepare a sorted array before marking, and then
bisect over that array.
Except a funny thing happened: I switched the bisect routine to return
the start address if an address is in a region. Suddenly, weird
failures started happening randomly. Turns out, in some places I was
testing if bisection succeeded with an int; if the region happened to be aligned on a 4 GiB (2^32-byte) boundary, then the nonzero 64-bit uintptr_t got truncated to its low 32 bits, which were zero. Yes, crusty reader, Rust would have caught this!
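A tiny illustration of that failure mode, with made-up values (on a 64-bit system):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uintptr_t region_start = 0x200000000UL; /* nonzero, but low 32 bits all zero */
        int found = region_start;               /* BUG: truncated to 32 bits -> 0 */

        printf("found? %d (should be nonzero)\n", found);
        return 0;
}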
fin
I want this new collector to work. Getting the growth heuristic good
enough is a step forward. I am annoyed that second-chance allocation
didn’t work out as well as I had hoped; perhaps I will find some time
this fall to give a proper evaluation. In any case, thanks for reading,
and hack at you later!
I recently started playing around
with Zephyr,
reading about it and doing some experiments, and I figured
I'd rather jot down my impressions and findings so that the me
in the future, who'll have no recollection of ever doing this,
can come back to it as a reference. And if it's helpful for
anybody else, that's a nice bonus.
It's been a really long time since I last dove into embedded
programming for low-powered hardware and things have changed
quite a bit, positively, both in terms of hardware
availability for professionals and hobbyists and in the
software options. Back in the day, most of the open source
embedded OSs1 I tried
felt like toy operating systems: enough for simple
applications but not really suitable for more complex systems
(eg. not having a proper preemptive scheduler is a serious
limitation). On the proprietary side things looked better and
there were many more options but, of course, those weren't
freely available.
Nowadays, Zephyr has filled that gap in the open source
embedded OSs field2,
even becoming the de facto OS to use, something like
a "Linux for embedded": it feels like a full-fledged OS, it's
feature rich, flexible and scalable, it has enormous traction in embedded, it's widely supported by many of the big names in the industry and it has plenty of available documentation, resources and a thriving community. Currently, if you need to pick an OS for embedded platforms, unless you're targeting very minimal hardware (8/16-bit microcontrollers), it's a no-brainer.
Noteworthy features
One of the most interesting qualities of Zephyr is its
flexibility: the base system is lean and has a small
footprint, and at the same time it's easy to grow a
Zephyr-based firmware for more complex applications thanks to
the variety of supported features. These are some of them:
Feature-rich
kernel core
services: for a small operating system, the amount of
core services available is quite remarkable. Most of the
usual tools for general application development are there:
thread-based runtime with preemptive and cooperative
scheduling, multiple synchronization and IPC mechanisms,
basic memory management functions, asynchronous and
event-based programming support, task management, etc.
SMP support.
Extensive core library: including common data structures,
shell support and a POSIX compatibility layer.
Logging and tracing: simple but capable facilities with
support for different backends, easy to adapt to the
hardware and application needs.
Native
simulation target
and device
emulation: allows building applications as native binaries that can run on the development platform for
prototyping and debugging purposes.
Now let's move on and get some actual hands-on experience
with Zephyr. The first thing we'll do is to set up a basic
development environment so we can start writing some
experiments and testing them. It's a good idea to keep a
browser tab open on
the Zephyr
docs, so we can reference them when needed or search for
more detailed info.
Development environment setup
The development environment is set up and contained within a
python venv. The Zephyr project provides
the west
command line tool to carry out all the setup and build
steps.
The basic tool requirements in Linux are CMake, Python3 and
the device tree compiler. Assuming they are installed and
available, we can then set up a development environment like
this:
python3 -m venv zephyrproject/.venv
. zephyrproject/.venv/bin/activate
# Now inside the venv
pip install west
west init zephyrproject
cd zephyrproject
west update
west zephyr-export
west packages pip --install
Some basic nomenclature: the zephyrproject
directory is known as a west "workspace". Inside it,
the zephyr directory contains the repo of Zephyr
itself.
The next step is to install the Zephyr SDK, ie. the toolchains and other host tools. I found this step a bit troublesome and it could have better defaults. By default it will install all the available SDKs (many of which we won't need) and then all the host tools (which we may not need either). Also, in my setup, the script that installs the host tools fails with a buffer overflow, so instead of relying on it to install the host tools (in my case I only needed qemu) I installed them myself. This has some drawbacks: we might be missing some features that are in the custom qemu binaries provided by the SDK, and west won't be able to run our apps on qemu automatically; we'll have to do that ourselves. Not ideal but not a dealbreaker either: I could figure it out and run that myself just fine.
So I recommend installing the SDK interactively so we can
select the toolchains we want and whether we want to install
the host tools or not (in my case I didn't):
cd zephyr
west sdk install -i
For the initial tests I'm targeting riscv64 on qemu; we'll pick up other targets later. In my case, since the host tools installation failed on my setup, I needed to provide qemu-system-riscv64 myself; you probably won't have to do that.
Now, to see if everything is set up correctly, we can try to
build the simplest example program there
is: samples/hello_world. To build it
for qemu_riscv64 we can use west
like this:
west build -p always -b qemu_riscv64 samples/hello_world
Where -p always tells west to do a pristine build, ie. build everything every time. We may not necessarily need that, but for now it's a safe flag to use.
Then, to run the app in qemu, the standard way is to
do west build -t run, but if we didn't install
the Zephyr host tools we'll need to run qemu ourselves:
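In my case that meant invoking qemu directly, roughly like this (the exact flags may vary):

qemu-system-riscv64 -nographic -machine virt -bios none \
        -kernel build/zephyr/zephyr.elf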
Architecture-specific note: we're
calling qemu-system-riscv64 with -bios
none to prevent qemu from loading OpenSBI into address
0x80000000. Zephyr doesn't need OpenSBI and the Zephyr image itself is loaded at that address, which is where qemu-riscv's ZSBL jumps to3.
Starting a new application
The Zephyr Example Application repo contains an example application that we can use as a reference for a workspace application (ie. an application that lives in the `zephyrproject` workspace we created earlier). Although we can use it as a reference, I didn't have a good experience with it (according to the docs, we can simply clone the example application repo into an existing workspace, but that doesn't seem to work, and it looks like the docs are wrong about that), so I recommend starting from scratch or taking the example applications in the zephyr/samples directory as templates as needed.
To create a new application, we simply have to make a
directory for it in the workspace dir and write a minimum set of
required files:
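A sketch of that minimal layout:

test_app/
├── CMakeLists.txt
├── prj.conf
├── README.rst
└── src/
    └── main.c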
where test_app is the name of the
application. prj.conf is meant to contain
application-specific config options and will be empty for
now. README.rst is optional.
Assuming the code in main.c is correct, we can
then build the application for a specific target with:
west build -p always -b <target> <app_name>
where <app_name> is the directory containing the
application files listed above. Note that west
uses CMake under the hood, so the build will be based on whatever build system CMake uses (apparently ninja by default), which means many of these operations can also be done at a lower level using the underlying build system commands (not recommended).
Building for different targets
Zephyr supports building applications for different target
types or abstractions. While the end goal will normally be to
have firmware running on a SoC, for debugging purposes, for
testing or simply to carry out most of the development without
relying on hardware, we can target qemu to run the application
on an emulated environment, or we can even build the app as a
native binary to run on the development machine.
The differences between targets can be abstracted through
proper use of APIs and device tree definitions so, in theory,
the same application (with certain limitations) can be
seamlessly built for different targets without modifications,
and the build process takes care of doing the right thing
depending on the target.
As an example, let's build and run
the hello_world sample program in three different
targets with different architectures: native_sim
(x86_64 with emulated devices), qemu (Risc-V64 with full system
emulation) and a real board,
a Raspberry
Pi Pico 2W (ARM Cortex-M33).
Before starting, let's clean up any previous builds:
west build -t clean
Now, to build and run the application as a native binary:
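A sketch of the native_sim build and run (the resulting binary is executed directly on the host):

west build -p always -b native_sim zephyr/samples/hello_world
./build/zephyr/zephyr.exe

And for the qemu_riscv64 target: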
west build -t clean
west build -p always -b qemu_riscv64 zephyr/samples/hello_world
[... omitted build output]
west build -t run
*** Booting Zephyr OS build v4.1.0-6569-gf4a0beb2b7b1 ***
Hello World! qemu_riscv64/qemu_virt_riscv64
For the Raspberry Pi Pico 2W:
west build -t clean
west build -p always -b rpi_pico2/rp2350a/m33 zephyr/samples/hello_world
[... omitted build output]
west flash -r uf2
In this case, flashing and checking the console output are
board-specific steps. Assuming the flashing process worked, if
we connect to the board UART0, we can see the output
message:
*** Booting Zephyr OS build v4.1.0-6569-gf4a0beb2b7b1 ***
Hello World! rpi_pico2/rp2350a/m33
Note that the application prints that line like this:
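In the sample's main() that is, roughly, a single print of the board target string taken from the build configuration:

printf("Hello World! %s\n", CONFIG_BOARD_TARGET);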
This shows we can easily build our applications using
hardware abstractions and have them working on different
platforms using the same code and build environment.
What's next?
Now that we're set and ready to work and the environment is
all set up, we can start doing more interesting things. In a
follow-up post I'll show a concrete example of an application
that showcases most of the features
listed above.
1: Most of them are generally labelled as RTOSs,
although the "RT" there is used rather
loosely.↩
2: ThreadX is now an option too, having become
open source recently. It brings certain features that are
more common in proprietary systems, such as security
certifications, and it looks like it was designed in a more
focused way. In contrast, it lacks the ecosystem and other
perks of open source projects (ease of adoption, rapid
community-based growth).↩
…and I immediately thought, This is a perfect outer-limits probe! By which I mean, if I hand a browser values that are effectively infinite by way of the infinity keyword, it will necessarily end up clamping to something finite, thus revealing how far it’s able or willing to go for that property.
The first thing I did was exactly what Andy proposed, with a few extras to zero out box model extras:
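Roughly, something like this (a reconstruction; the zeroed-out box model properties are the "extras"):

div {
  width: calc(infinity * 1px);
  height: calc(infinity * 1px);
  margin: 0;
  border: 0;
  padding: 0;
}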
Then I loaded the (fully valid HTML 5) test page in Firefox Nightly, Chrome stable, and Safari stable, all on macOS, and things pretty immediately got weird:
Element Size Results

Browser              Computed value       Layout value
Safari               33,554,428           33,554,428
Chrome               33,554,400           33,554,400
Firefox (Nightly)    19.2 / 17,895,700    19.2 / 8,947,840 †

† height / width
Chrome and Safari both get very close to 2^25-1 (33,554,431), with Safari backing off from that by just 3 pixels, and Chrome by 31. I can’t even hazard a guess as to why this sort of value would be limited in that way; if there was a period of time where 24-bit values were in vogue, I must have missed it. I assume this is somehow rooted in the pre-Blink-fork codebase, but who knows. (Seriously, who knows? I want to talk to you.)
But the faint whiff of oddness there has nothing on what’s happening in Firefox. First off, the computed height is 19.2px, which is the height of a line of text at default font size and line height. If I explicitly gave it line-height: 1, the height of the <div> changes to 16px. All this is despite my assigning a height of infinite pixels! Which, to be fair, is not really possible to do, but does it make sense to just drop it on the floor rather than clamp to an upper bound?
Even if that can somehow be said to make sense, it only happens with height. The computed width value is, as indicated, nearly 17.9 million, which is not the content width and is also nowhere close to any power of two. But the actual layout width, according to the diagram in the Layout tab, is just over 8.9 million pixels; or, put another way, one-half of 17,895,700 minus 10.
This frankly makes my brain hurt. I would truly love to understand the reasons for any of these oddities. If you know from whence they arise, please, please leave a comment! The more detail, the better. I also accept trackbacks from blog posts if you want to get extra-detailed.
For the sake of my aching skullmeats, I almost called a halt there, but I decided to see what happened with font sizes.
My skullmeats did not thank me for this, because once again, things got… interesting.
Font Size Results

Browser              Computed value       Layout value
Safari               100,000              100,000
Chrome               10,000               10,000
Firefox (Nightly)    3.40282e38           2,400 / 17,895,700 †

† line-height values of normal / 1
Safari and Chrome have pretty clearly set hard limits, with Safari’s an order of magnitude larger than Chrome’s. I get it: what are the odds of someone wanting their text to be any larger than, say, a viewport height, let alone ten or 100 times that height? What intrigues me is the nature of the limits, which are so clearly base-ten numbers that someone typed in at some point, rather than being limited by setting a register size or variable length or something that would have coughed up a power of two.
And speaking of powers of two… ah, Firefox. Your idiosyncrasy continues. The computed value is a 32-bit single-precision floating-point number. It doesn’t get used in any of the actual rendering, but that’s what it is. Instead, the actual font size of the text, as judged by the Box Model diagram on the Layout tab, is… 2,400 pixels.
Except, I can’t say that’s the actual actual font size being used: I suspect the actual value is 2,000 with a line height of 1.2, which is generally what normal line heights are in browsers. “So why didn’t you just set line-height: 1 to verify that, genius?” I hear you asking. I did! And that’s when the layout height of the <div> bloomed to just over 8.9 million pixels, like it probably should have in the previous test! And all the same stuff happened when I moved the styles from the <div> to the <body>!
I’ve started writing at least three different hypotheses for why this happens, and stopped halfway through each because each hypothesis self-evidently fell apart as I was writing it. Maybe if I give my whimpering neurons a rest, I could come up with something. Maybe not. All I know is, I’d be much happier if someone just explained it to me; bonus points if their name is Clarissa.
Since setting line heights opened the door to madness in font sizing, I thought I’d try setting line-height to infinite pixels and see what came out. This time, things were (relatively speaking) more sane.
Line Height Results

Browser              Computed value       Layout value
Safari               33,554,428           33,554,428
Chrome               33,554,400           33,554,400
Firefox (Nightly)    17,895,700           8,947,840
Essentially, the results were the same as what happened with element widths in the first example: Safari and Chrome were very close to 2^25-1, and Firefox had its thing of a strange computed value and a rendering size not quite half the computed value.
I’m sure there’s a fair bit more to investigate about infinite-pixel values, or about infinite values in general, but I’m going to leave this here because my gray matter needs a rest and possibly a pressure washing. Still, if you have ideas for infinitely fun things to jam into browser engines and see what comes out, let me know. I’m already wondering what kind of shenanigans, other than in z-index, I can get up to with calc(-infinity)…
Things happen in GNOME? Could have fooled me, right?
Of course, things happen in GNOME. After all, we have been releasing every
six months, on the dot, for nearly 25 years. Assuming we’re not constantly
re-releasing the same source files, then we have to come to the conclusion
that things change inside each project that makes GNOME, and thus things
happen that involve more than one project.
So let’s roll back a bit.
GNOME’s original sin
We all know Havoc Pennington’s essay on
preferences; it’s one of GNOME’s
foundational texts, we refer to it pretty much constantly both inside and
outside the contributors community. It has guided our decisions and taste
for over 20 years. As far as foundational text goes, though, it applies to
design philosophy, not to project governance.
When talking about the inception and technical direction of the GNOME project
there are really two foundational texts that describe the goals of GNOME, as
well as the mechanisms that are employed to achieve those goals.
The first one is, of course, Miguel’s announcement of the GNOME project
itself, sent to the GTK, Guile, and (for good measure) the KDE mailing lists:
We will try to reuse the existing code for GNU programs as
much as possible, while adhering to the guidelines of the project.
Putting nice and consistent user interfaces over all-time
favorites will be one of the projects.
— Miguel de Icaza, “The GNOME Desktop project.” announcement email
Once again, everyone related to the GNOME project is (or should be) familiar
with this text.
The second foundational text is not as familiar, outside of the core group
of people that were around at the time. I am referring to Derek Glidden’s
description of the differences between GNOME and KDE, written five years
after the inception of the project. I isolated a small fragment of it:
Development strategies are generally determined by whatever light show happens
to be going on at the moment, when one of the developers will leap up and scream
“I WANTITTOLOOKJUSTLIKETHAT” and then straight-arm his laptop against the
wall in an hallucinogenic frenzy before vomiting copiously, passing out and
falling face-down in the middle of the dance floor.
— Derek Glidden, “GNOME vs KDE”
What both texts have in common is subtle, but explains the origin of the
project. You may not notice it immediately, but once you see it you can’t
unsee it: it’s the over-reliance on personal projects and taste, to be
sublimated into a shared vision. A “bottom up” approach, with “nice and
consistent user interfaces” bolted on top of “all-time favorites”, with zero
indication of how those nice and consistent UIs would work on extant code
bases, all driven by somebody with a vision—drug-induced or otherwise—who
decides to lead the project towards its implementation.
It’s been nearly 30 years, but GNOME still works that way.
Sure, we’ve had a HIG for 25 years, and the shared development resources
that the project provides tend to mask this, to the point that everyone
outside the project assumes that all people with access to the GNOME commit
bit work on the whole project, as a single unit. If you are here, listening
(or reading) to this, you know it’s not true. In fact, it is so comically
removed from the lived experience of everyone involved in the project that
we generally joke about it.
Herding cats and vectors sum
During my first GUADEC, back in 2005, I saw a great slide from Seth Nickell,
one of the original GNOME designers. It showed GNOME contributors
represented as a jumble of vectors going in all directions, cancelling each
component out; and the occasional movement in the project was the result of
somebody pulling/pushing harder in their direction.
Of course, this is not the exclusive province of GNOME: you could take most
complex free and open source software projects and draw a similar diagram. I
contend, though, that when it comes to GNOME this is not emergent behaviour
but it’s baked into the project from its very inception: a loosey-goosey
collection of cats, herded together by whoever shows up with “a vision”,
but, also, a collection of loosely coupled projects. Over the years we tried
to put to rest the notion that GNOME is a box of LEGO, meant to be
assembled together by distributors and users in whatever way they like most;
while our software stack has graduated from the “thrown together at the last
minute” quality of its first decade, our community is still very much
following that very same model; the only way it seems to work is because
we have a few people maintaining a lot of components.
On maintainers
I am a software nerd, and one of the side effects of this terminal condition
is that I like optimisation problems. Optimising software is inherently
boring, though, so I end up trying to optimise processes and people. The
fundamental truth of process optimisation, just like software, is to avoid
unnecessary work—which, in some cases, means optimising away the people involved.
I am afraid I will have to be blunt, here, so I am going to ask for your
forgiveness in advance.
Let’s say you are a maintainer inside a community of maintainers. Dealing
with people is hard, and the lord forbid you talk to other people about what
you’re doing, what they are doing, and what you can do together, so you only
have a few options available.
The first one is: you carve out your niche. You start, or take over, a
project, or an aspect of a project, and you try very hard to make yourself
indispensable, so that everything ends up passing through you, and everyone
has to defer to your taste, opinion, or edict.
Another option: API design is opinionated, and reflects the thoughts of the
person behind it. By designing platform API, you try to replicate your
thoughts, taste, and opinions into the minds of the people using it, like the
eggs of a parasitic wasp; because if everybody thinks like you, then there
won’t be conflicts, and you won’t have to deal with details, like “how to
make this application work”, or “how to share functionality”; or, you know,
having to develop a theory of mind for relating to other people.
Another option: you try to reimplement the entirety of a platform by
yourself. You start a bunch of projects, which require starting a bunch of
dependencies, which require refactoring a bunch of libraries, which ends up
cascading into half of the stack. Of course, since you’re by yourself, you
end up with a consistent approach to everything. Everything is as it ought
to be: fast, lean, efficient, a reflection of your taste, commitment, and
ethos. You made everyone else redundant, which means people depend on you,
but also nobody is interested in helping you out, because you are now taken
for granted, on the one hand, and nobody is able to get a word in edgewise
about what you made, on the other.
I purposefully did not name names, even though we can all recognise somebody
in these examples. For instance, I recognise myself. I have been all of
these examples, at one point or another over the past 20 years.
Painting a target on your back
But if this is what it looks like from within a project, what it looks like
from the outside is even worse.
Once you start dragging other people, you raise your visibility; people
start learning your name, because you appear in the issue tracker, on
Matrix/IRC, on Discourse and Planet GNOME. Youtubers and journalists start
asking you questions about the project. Randos on web forums start
associating you with everything GNOME does, or does not do; to features, design,
and bugs. You become responsible for every decision, whether you are or not,
and this leads to being the embodiment of all evil the project does. You’ll
get hate mail, you’ll be harassed, your words will be used against you and
the project for ever and ever.
Burnout and you
Of course, that ends up burning people out; it would be absurd if it didn’t.
Even in the best case possible, you’ll end up burning out just by reaching
empathy fatigue, because everyone has access to you, and everyone has their
own problems and bugs and features and wouldn’t it be great to solve every
problem in the world? This is similar to working for non-profits, as opposed
to the typical corporate burnout: you get into a feedback loop where you
don’t want to distance yourself from the work you do because the work you do
gives meaning to yourself and to the people that use it; and yet working on
it hurts you. It also empowers bad faith actors to hound you down to the
ends of the earth, until you realise that turning sand into computers was a
terrible mistake, and we should have torched the first personal computer
down on sight.
Governance
We want to have structure, so that people know what to expect and how to
navigate the decision making process inside the project; we also want to
avoid having a sacrificial lamb that takes on all the problems in the world
on their shoulders until we burn them down to a cinder and they have to
leave. We’re 28 years too late to have a benevolent dictator, self-appointed
or otherwise, and we don’t want to have a public consultation every time we
want to deal with a systemic feature. What do we do?
Examples
What do other projects have to teach us about governance? We are not the
only complex free software project in existence, and it would be an appalling
measure of narcissism to believe that we’re special in any way, shape or form.
Python
We should all know what a Python PEP is, but if
you are not familiar with the process I strongly recommend going through it.
It’s well documented, and pretty much the de facto standard for any complex
free and open source project that has achieved escape velocity from a
centralised figure in charge of the whole decision making process. The real
achievement of the Python community is that it adopted this policy long
before their centralised figure called it quits. The interesting thing about
the PEP process is that it is used to codify the governance of the project
itself; the PEP template is a PEP; teams are defined through PEPs; target
platforms are defined through PEPs; deprecations are defined through PEPs;
all project-wide processes are defined through PEPs.
Rust
Rust has a similar process for language, tooling, and standard library
changes, called “RFC”. The RFC process
is more lightweight on the formalities than Python’s PEPs, but it’s still
very well defined. Rust, being a project that came into existence in a
Post-PEP world, adopted the same type of process, and used it to codify
teams, governance, and any and all project-wide processes.
Fedora
Fedora change proposals exist to discuss and document both self-contained
changes (usually fairly uncontroversial, given that they are proposed by the
same owners of the module being changed) and system-wide changes. The main
difference between them is that most of the elements of a system-wide change
proposal are required, whereas for self-contained proposals they can be
optional; for instance, a system-wide change must have a contingency plan, a
way to test it, and the impact on documentation and release notes, whereas
a self-contained change does not.
GNOME
Turns out that we once did have “GNOME Enhancement
Proposals” (GEP), mainly modelled on
Python’s PEP from 2002. If this comes as a surprise, that’s because they
lasted for about a year, mainly because it was a reactionary process to try
and funnel some of the large controversies of the 2.0 development cycle into
a productive outlet that didn’t involve flames and people dramatically
quitting the project. GEPs failed once the community fractured, and people
started working in silos, either under their own direction or, more likely,
under their management’s direction. What’s the point of discussing a
project-wide change, when that change was going to be implemented by people
already working together?
The GEP process mutated into the lightweight “module proposal” process,
where people discussed adding and removing dependencies on the desktop
development mailing list—something we also lost over the 2.x cycle, mainly
because the amount of discussions over time tended towards zero. The people
involved with the change knew what those modules brought to the release, and
people unfamiliar with them were either giving out unsolicited advice, or
were simply not reached by the desktop development mailing list. The
discussions turned into external dependency notifications, which also dried
up because apparently asking to compose an email to notify the release team
that a new dependency was needed to build a core module was far too much of
a bother for project maintainers.
The creation and failure of GEP and module proposals is both an indication
of the need for structure inside GNOME, and how this need collides with the
expectation that project maintainers have not just complete control over
every aspect of their domain, but that they can also drag out the process
until all the energy behind it has dissipated. Being in charge for the long
run allows people to just run out the clock on everybody else.
Goals
So, what should be the goal of a proper technical governance model for the
GNOME project?
Diffusing responsibilities
This should be goal zero of any attempt at structuring the technical
governance of GNOME. We have too few people in too many critical positions.
We can call it “efficiency”, we can call it “bus factor”, we can call it
“bottleneck”, but the result is the same: the responsibility for anything is
too concentrated. This is how you get conflict. This is how you get burnout.
This is how you paralyse a whole project. By having too few people in
positions of responsibility, we don’t have enough slack in the governance
model; it’s an illusion of efficiency.
Responsibility is not something to hoard: it’s something to distribute.
Empowering the community
The community of contributors should be able to know when and how a decision
is made; it should be able to know what to do once a decision is made. Right
now, the process is opaque because it’s done inside a million different
rooms, and, more importantly, it is not recorded for posterity. Random
GitLab issues should not be the only place where people can be informed that
some decision was taken.
Empowering individuals
Individuals should be able to contribute to a decision without necessarily
becoming responsible for a whole project. It’s daunting, and requires a
measure of hubris that cannot be allowed to exist in a shared space. In a
similar fashion, we should empower people that want to contribute to the
project by reducing the amount of fluff coming from people with zero stakes
in it, who are interested only in giving out an opinion on their perfectly
spherical, frictionless desktop environment.
It is free and open source software, not free and open mic night down at
the pub.
Actual decision making process
We say we work by rough consensus, but if a single person is responsible for
multiple modules inside the project, we’re just deceiving ourselves. I
should not be able to design something on my own, commit it to all projects
I maintain, and then go home, regardless of whether what I designed is good
or necessary.
Proposed GNOME Changes✝
✝ Name subject to change
PGCs
We have better tools than what the GEP used to use and be. We have better
communication venues in 2025; we have better validation; we have better
publishing mechanisms.
We can take a lightweight approach, with a well-defined process, and use it
not for actual design or decision-making, but for discussion and
documentation. If you are trying to design something and you use this
process, you are by definition Doing It Wrong™. You should have a design
ready, and a series of steps to achieve it, as part of a proposal. You should
already know the projects involved, and already have an idea of the effort
needed to make something happen.
Once you have a formal proposal, you present it to the various stakeholders,
and iterate over it to improve it, clarify it, and amend it, until you have
something that has a rough consensus among all the parties involved. Once
that’s done, the proposal is now in effect, and people can refer to it
during the implementation, and in the future. This way, we don’t have to ask
people to remember a decision made six months, two years, ten years ago:
it’s already available.
Editorial team
Proposals need to be valid, in order to be presented to the community at
large; that validation comes from an editorial team. The editors of the
proposals are not there to evaluate their contents: they are there to ensure
that the proposal is going through the expected steps, and that discussions
related to it remain relevant and constrained within the accepted period and
scope. They are there to steer the discussion, and avoid architecture
astronauts parachuting into the issue tracker or Discourse to give their
unwarranted opinion.
Once the proposal is open, the editorial team is responsible for its
inclusion in the public website, and for keeping track of its state.
Steering group
The steering group is the final arbiter of a proposal. They are responsible
for accepting it, or rejecting it, depending on the feedback from the
various stakeholders. The steering group does not design or direct GNOME as
a whole: they are the ones that ensure that communication between the parts
happens in a meaningful manner, and that rough consensus is achieved.
The steering group is also, by design, not the release team: it is made of
representatives from all the teams related to technical matters.
Is this enough?
Sadly, no.
Reviving a process for proposing changes in GNOME without addressing the
shortcomings of its first iteration would inevitably lead to a repeat of its results.
We have better tooling, but the problem is still that we’re demanding that
each project maintainer gets on board with a process that has no mechanism
to enforce compliance.
Once again, the problem is that we have a bunch of fiefdoms that need to be
opened up to ensure that more people can work on them.
Whither maintainers
In what was, in retrospect, possibly one of my least gracious and yet most
prophetic moments on the desktop development mailing list, I once said that,
if it were possible, I would have already replaced all GNOME maintainers
with a shell script. Turns out that we did replace a lot of what maintainers
used to do, and we used a large Python
service
to do that.
Individual maintainers should not exist in a complex project—for both the
project’s and the contributors’ sake. They are inefficiency made manifest, a
bottleneck, a point of contention in a distributed environment like GNOME.
Luckily for us, we almost made them entirely redundant already! Thanks to
the release service and CI pipelines, we don’t need a person spinning up a
release archive and uploading it into a file server. We just need somebody
to tag the source code repository, and anybody with the right permissions
could do that.
We need people to review contributions; we need people to write release
notes; we need people to triage the issue tracker; we need people to
contribute features and bug fixes. None of those tasks require the
“maintainer” role.
So, let’s get rid of maintainers once and for all. We can delegate the
actual release tagging of core projects and applications to the GNOME
release team; they are already releasing GNOME anyway, so what’s the point
in having them wait every time for somebody else to do individual releases?
All people need to do is to write down what changed in a release, and that
should be part of a change itself; we have centralised release notes, and we
can easily extract the list of bug fixes from the commit log. If you can
ensure that a commit message is correct, you can also get in the habit of
updating the NEWS file as part of a merge request.
Additional benefits of having all core releases done by a central authority
are that we get people to update the release notes every time something
changes; and that we can sign all releases with a GNOME key that downstreams
can rely on.
Embracing special interest groups
But it’s still not enough.
Especially when it comes to the application development platform, we have
already a bunch of components with an informal scheme of shared
responsibility. Why not make that scheme official?
Let’s create the SDK special interest group; take all the developers for
the base libraries that are part of GNOME—GLib, Pango, GTK, libadwaita—and
formalise the group of people that currently does things like development,
review, bug fixing, and documentation writing. Everyone in the group should
feel empowered to work on all the projects that belong to that group. We
already are, except we end up deferring to somebody that is usually too busy
to cover every single module.
Other special interest groups should be formed around the desktop, the core
applications, the development tools, the OS integration, the accessibility
stack, the local search engine, the system settings.
Adding more people to these groups is not going to be complicated, or
introduce instability, because the responsibility is now shared; we would
not be taking somebody that is already overworked, or even potentially new
to the community, and plopping them into the hot seat, ready for a burnout.
Each special interest group would have a representative in the steering
group, alongside teams like documentation, design, and localisation, thus
ensuring that each aspect of the project technical direction is included in
any discussion. Each special interest group could also have additional
sub-groups, like a web services group in the system settings group; or a
networking group in the OS integration group.
What happens if I say no?
I get it. You like being in charge. You want to be the one calling the
shots. You feel responsible for your project, and you don’t want other
people to tell you what to do.
If this is how you feel, then there’s nothing wrong with parting ways with
the GNOME project.
GNOME depends on a ton of projects hosted outside GNOME’s own
infrastructure, and we communicate with people maintaining those projects
every day. It’s 2025, not 1997: there’s no shortage of code hosting services
in the world, we don’t need to have them all on GNOME infrastructure.
If you want to play with the other children, if you want to be part of
GNOME, you get to play with a shared set of rules; and that means sharing
all the toys, and not hoarding them for yourself.
Civil service
What we really want GNOME to be is a group of people working together. We
already are, somewhat, but we can be better at it. We don’t want rule and
design by committee, but we do need structure, and we need that structure to
be based on expertise; to have distinct spheres of competence; to have
continuity across time; and to be based on rules. We need something
flexible, to take into account the needs of GNOME as a project, and be
capable of growing in complexity so that nobody can be singled out, brigaded
on, or burnt to a cinder on the sacrificial altar.
Our days of passing out in the middle of the dance floor are long gone. We
might not all be old—actually, I’m fairly sure we aren’t—but GNOME has long
ceased to be something we can throw together at the last minute just because
somebody assumed the mantle of a protean ruler, and managed to involve
themselves with every single project until they are the literal embodiment
of an autocratic force capable of dragging everybody else towards a goal,
until they burn out and have to leave for their own sake.
We can do better than this. We must do better.
To sum up
Stop releasing individual projects, and let the release team do it when needed.
Create teams to manage areas of interest, instead of single projects.
Create a steering group from representatives of those teams.
Every change that affects one or more teams has to be discussed and
documented in a public setting among contributors, and then published for
future reference.
None of this should be controversial because, outside of the publishing bit,
it’s how we are already doing things. This proposal aims at making it
official so that people can actually rely on it, instead of having to divine
the process out of thin air.
The next steps
We’re close to the GNOME 49 release, now that GUADEC 2025 has ended, so
people are busy working on tagging releases, fixing bugs, and the work on
the release notes has started. Nevertheless, we can already start planning
for an implementation of a new governance model for GNOME for the next cycle.
First of all, we need to create teams and special interest groups. We don’t
have a formal process for that, so this is also a great chance to introduce
the change proposal process as a mechanism for structuring the
community, just like the Python and Rust communities do. Teams will need
their own space for discussing issues, and for sharing the load. The first team
I’d like to start is an “introspection and language bindings” group, for all
bindings hosted on GNOME infrastructure; it would act as a point of
reference for all decisions involving projects that consume the GNOME
software development platform through its machine-readable ABI description.
Another group I’d like to create is an editorial group for the developer and
user documentation; documentation benefits from a consistent editorial
voice, while the process of writing documentation should be open to
everybody in the community.
A very real issue that was raised during GUADEC is bootstrapping the
steering committee; who gets to be on it, what is the committee’s remit, how
it works. There are options, but if we want the steering committee to be a
representation of the technical expertise of the GNOME community, it also
has to be established by the very same community; in this sense, the board
of directors, as representatives of the community, could work on
defining the powers and compositions of this committee.
There are many more issues we are going to face, but I think we can start
from these and evaluate our own version of a technical governance model that
works for GNOME, and that can grow with the project. In the next couple of
weeks I’ll start publishing drafts for team governance and the
power/composition/procedure of the steering committee, mainly for iteration
and comments.
Update on what happened in WebKit in the week from July 21 to July 28.
This week the trickle of improvements to the graphics stack continues with
more font handling improvements and tuning of damage information; plus the
WPEPlatform Wayland backend gets server-side decorations with some compositors.
Font synthesis properties (synthetic bold/italic) are now correctly
handled, so that fonts are rendered
bold or italic even when the font itself does not provide these variants.
A few minor improvements to the damage
propagation feature have landed.
The screen device scaling factor in use is now
shown in the webkit://gpu internal
information page.
WPE WebKit 📟
WPE Platform API 🧩
New, modern platform API that supersedes usage of libwpe and WPE backends.
The Wayland backend included with WPEPlatform has been taught how to request
server-side decorations using the XDG
Decoration protocol.
This means that compositors that support the protocol will provide window
frames and title bars for WPEToplevel instances. While this is a welcome
quality of life improvement in many cases, window decorations will not be shown
on Weston and Mutter (used by GNOME Shell among others), as they do not support
the protocol at the moment.
Somehow I internalized that my duty as a software programmer was to silently work
on a piece of code as if it were a magnum opus, until it’s finished, and then
release it to the world with no need of explanations, because it should speak
for itself. In other words, I tend to consider my work as a form of art, and
myself as an artist. But I’m not. There’s no magnum opus and there will never
be one. I’m rather a craftsman, in the sense of Richard Sennett: somebody who
cares about their craft, making small, quick but thoughtful and clean changes,
here and there, hoping that they will be useful to someone, now and in the
future. And those little efforts need to be exposed openly, in spaces as this
one and social media, as if I were a bazaar merchant.
This reflection invites me to add another task to my duties as a software
programmer: a periodic exposition of the work done. And this is the first
attempt to forge a (monthly) discipline in that direction, not in the sense of
bragging, or looking to overprice a product (in the sense of commodity
fetishism), but to build bridges with those who might find these pieces of
software useful.
We have been working lately on video encoding, and we wanted an easy way to test
our work, using common samples such as those shared by the Derf’s
collection. They are in a file format known
as YUV4MPEG2, more commonly known as y4m because of its file name
extension.
YUV4MPEG2 is a simple file format designed to
hold uncompressed frames of YUV video, formatted as YCbCr 4:2:0, YCbCr 4:2:2 or
YCbCr 4:4:4 data for the purpose of encoding. Instead of using raw YUV streams,
where the frame size and color format have to be provided out-of-band, these
metadata are embedded in the file.
There were already GStreamer elements for encoding and decoding y4m streams,
but y4mdec was in gst-plugins-bad while y4menc in gst-plugins-good.
Our first task was to fix and improve y4menc
[!8654],
then move y4mdec to gst-plugins-good
[!8719],
but that implied rewriting the element and adding unit tests, while adding more
features such as handling more color formats.
Heavily inspired by Fluster, a testing
framework written in Python for decoder conformance, we are sketching
Soothe, a script that aims to be a testing
framework for video encoders, using VMAF, a
perceptual video quality assessment algorithm.
This is the reason for the efforts described above: vulkanh264enc, an H.264
encoder using the Vulkan Video extension
[!7197].
One interesting side of this task was proposing a base class for
hardware-accelerated H.264 encoders, based on vah264enc, the GStreamer VA-API H.264
encoder. We talked about this base class in the GStreamer Conference
2024.
Now the H.264 encoder is merged, and it will be part of the upcoming
GStreamer 1.28 release.
GStreamer-VAAPI functionality has now been replaced with the VA plugin in
gst-plugins-bad. Still, it isn’t a full-featured replacement
[#3947], but
it’s complete and stable enough to be widely deployed. As Tim said in the
GStreamer Conference 2024: it just works.
So, the GStreamer-VAAPI subproject has been removed from the main branch of the git
repository
[!9200],
and its GitLab project has been archived.
We believe that the Vulkan Video extension will be one of the main APIs for video
encoding, decoding and processing. Igalia participates in the Vulkan Video
Technical Sub Group (TSG) and helps with the Conformance Test Suite (CTS).
The Vulkan Video extension is big and constantly updated. In order to keep track of
it we maintain a web page with the latest news about the specification,
proprietary drivers, open source drivers and open source applications, along
with articles and talks about it.
Last but not least, GStreamer
Planet has been updated and
overhauled.
Given that the old Planet script, written in Python 2, is unmaintained, we
worked on a new one in Rust:
planet-rs. It internally
uses tera for templates,
feed-rs for feed parsing, and
reqwest for HTTP handling. The planet
is generated using Gitlab scheduled CI pipelines.
Update on what happened in WebKit in the week from July 14 to July 21.
In this week we had a fix for the libsoup-based resource loader on platforms
without the shared-mime-info package installed, a fix for SQLite usage in
WebKit, ongoing work on the GStreamer-based WebRTC implementation including
better encryption for its default DTLS certificate and removal of a dependency,
and an update on the status of GNOME Web Canary version.
Cross-Port 🐱
ResourceLoader delegates local resource loading (e.g. gresources) to ResourceLoaderSoup, which in turn uses g_content_type_guess to identify their content type. On platforms where shared-mime-info is not available, this fails silently and reports "text/plain", breaking things such as PDFjs.
A patch was submitted to use MIMETypeRegistry to get the MIME type of these local resources, falling back to g_content_type_guess when that fails, making internal resource loading more resilient.
Fixed "PRAGMA incrementalVacuum" for SQLite, which is used to reclaim freed filesystem space.
Multimedia 🎥
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
Most web engines migrated from a default DTLS certificate signed with an RSA key to an ECDSA P-256 key almost a decade ago. GstWebRTC is now also signing its default DTLS certificate with that private key format. This improves compatibility with various SFUs, the Jitsi Video Bridge among them.
Adaptation of WPE WebKit targeting the Android operating system.
Changed libpsl to include built-in public-suffix data when building WPE for Android. Among other duties, having this working correctly is important for site isolation, resource loading, and cookie handling.
Releases 📦️
The GNOME Web Canary build has been stale for several weeks, since the GNOME nightly SDK was updated to freedesktop SDK 25.08beta which no longer ships one of the WebKitGTK build dependencies (Ruby). We will do our best to get the builds back to a working state, soon hopefully.
Some months ago, my colleague Madeeha Javed and I wrote a tool to convert QEMU disk images into qcow2, writing the result directly to stdout.
This tool is called qcow2-to-stdout.py and can be used for example to create a new image and pipe it through gzip and/or send it directly over the network without having to write it to disk first.
If you’re interested in the technical details, read on.
A closer look under the hood
QEMU uses disk images to store the contents of the VM’s hard drive. Images are often in qcow2, QEMU’s native format, although a variety of other formats and protocols are also supported.
I have written in detail about the qcow2 format in the past (for example, here and here), but the general idea is very easy to understand: the virtual drive is divided into clusters of a certain size (64 KB by default), and only the clusters containing non-zero data need to be physically present in the qcow2 image. So what we have is essentially a collection of data clusters and a set of tables that map guest clusters (what the VM sees) to host clusters (what the qcow2 file actually stores).
qemu-img is a powerful and versatile tool that can be used to create, modify and convert disk images. It has many different options, but one question that sometimes arises is whether it can use stdin or stdout instead of regular files when converting images.
The short answer is that this is not possible in general. qemu-img convert works by checking the (virtual) size of the source image, creating a destination image of that same size and finally copying all the data from start to finish.
Reading a qcow2 image from stdin doesn’t work because data and metadata blocks can come in any arbitrary order, so it’s perfectly possible that the information that we need in order to start writing the destination image is at the end of the input data¹.
Writing a qcow2 image to stdout doesn’t work either because we need to know in advance the complete list of clusters from the source image that contain non-zero data (this is essential because it affects the destination file’s metadata). However, if we do have that information then writing a new image directly to stdout is technically possible.
The bad news is that qemu-img won’t help us here: it uses the same I/O code as the rest of QEMU. This generic approach makes total sense because it’s simple, versatile and is valid for any kind of source and destination image that QEMU supports. However, it needs random access to both images.
If we want to write a qcow2 file directly to stdout we need new code written specifically for this purpose, and since it cannot reuse the logic present in the QEMU code this was written as a separate tool (a Python script).
The process itself goes like this:
Read the source image from start to finish in order to determine which clusters contain non-zero data. These are the only clusters that need to be present in the new image.
Write to stdout all the metadata structures of the new image. This is now possible because after the previous step we know how much data we have and where it is located.
Read the source image again and copy the clusters with non-zero data to stdout.
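To make the first step concrete, here is a minimal sketch in JavaScript (Node.js) of scanning a raw image in 64 KB clusters and recording which ones contain non-zero data. It is only an illustration of the idea, not code from the tool itself (which is a Python script), and "disk.raw" is a placeholder path:

import { openSync, readSync, fstatSync, closeSync } from "node:fs";

const CLUSTER_SIZE = 64 * 1024; // qcow2's default cluster size
const fd = openSync("disk.raw", "r");
const { size } = fstatSync(fd);
const buffer = Buffer.alloc(CLUSTER_SIZE);
const allocated = [];

for (let offset = 0; offset < size; offset += CLUSTER_SIZE) {
  const bytesRead = readSync(fd, buffer, 0, CLUSTER_SIZE, offset);
  // Only clusters containing at least one non-zero byte need to be present
  // in the qcow2 output; unallocated clusters read back as zeroes anyway.
  if (buffer.subarray(0, bytesRead).some((byte) => byte !== 0)) {
    allocated.push(offset / CLUSTER_SIZE);
  }
}
closeSync(fd);
console.log(`clusters with data: ${allocated.length} of ${Math.ceil(size / CLUSTER_SIZE)}`);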
Images created with this program always have the same layout: header, refcount tables and blocks, L1 and L2 tables, and finally all data clusters.
One problem here is that, while QEMU can read many different image formats, qcow2-to-stdout.py is an independent tool that does not share any of the code and therefore can only read raw files. The solution here is to use qemu-storage-daemon. This program is part of QEMU and it can use FUSE to export any file that QEMU can read as a raw file. The usage of qemu-storage-daemon is handled automatically and the user only needs to specify the format of the source file.
qcow2-to-stdout.py can only create basic qcow2 files and does not support features like compression or encryption. However, a few parameters can be adjusted, like the cluster size (-c), the width of the reference count entries (-r) and whether the new image is created with the input as an external data file (-d and -R).
And this is all, I hope that you find this tool useful and this post informative. Enjoy!
Acknowledgments
This work has been developed by Igalia and sponsored by Outscale, a Dassault Systèmes brand.
¹ This problem would not happen if the input data was in raw format but in this case we would not know the size in advance.
Update on what happened in WebKit in the week from July 7 to July 14.
This week saw a fix for IPv6 scope-ids in DNS responses, frame pointers
re-enabled in JSC developer builds, and a significant improvement to
emoji font selection.
The initial value is auto, which means the browser can determine the shape of the caret to follow platform conventions in different situations; however, so far this always uses a bar caret (|). Then you can decide to use either a block (█) or underscore (_) caret, which might be useful and give a nice touch to some kinds of applications, like a code editor.
Next you can see a very simple example which modifies the value of the caret-shape property so you can see how it works.
Screencast of the different caret-shape possible values
As you might have noticed, we’re only using the caret-shape: block property and not setting any particular color for it; in order to ensure the characters are still visible, the current Chromium implementation adds transparency to the block caret.
Let’s now combine the three CSS caret properties in a single example. Imagine we want a more fancy insertion caret that uses the block shape but blinks between two colors. To achieve something like this we have to use caret-color and also caret-animation so we can control how the caret is animated and change the color through CSS animations.
As you can see, we’re using caret-shape: block to define that we want a block insertion caret, and also caret-animation: manual, which makes the browser stop animating the caret. Thus we have to use our own animation that modifies caret-color to switch colors.
Screencast of a block caret that blinks between two colors
Screencast of a caret that switches between block and underscore shapes
These are just some quick examples of how to use these new properties; you can start experimenting with them, though caret-shape is still in the oven and its implementation is in active development. Remember that if you want to play with the linked examples you have to enable the experimental web platform features flag (via chrome://flags#enable-experimental-web-platform-features or passing -enable-experimental-web-platform-features).
Thanks to my colleagues Stephen Chenney and Ziran Sun, who have been working on the implementation of these features, and to Bloomberg for sponsoring this work as part of the ongoing collaboration with Igalia to improve the web platform.
Igalia and Bloomberg working together to build a better web
Specifically, this is the mostly-moving collector with conservative
stack scanning. Most collections will be marked in place. When the
collector wants to compact, it will scan ambiguous roots in the
beginning of the collection cycle, marking objects referenced by such
roots in place. Then the collector will select some blocks for evacuation, and
when visiting an object in those blocks, it will try to copy the object to one of the
evacuation target blocks that are held in reserve. If the collector runs out of space in the evacuation reserve, it falls
back to marking in place.
Given that the collector has to cope with failed evacuations, it is easy to give it the ability to pin any object in place. This proved
useful when making the needed modifications to Guile: for example, when
we copy a stack slice containing ambiguous references to a
heap-allocated continuation, we eagerly traverse that stack to pin the
referents of those ambiguous edges. Also, whenever the address of an
object is taken and exposed to Scheme, we pin that object. This happens
frequently for identity hashes (hashq).
Anyway, the bulk of the work here was a pile of refactors to Guile to
allow a centralized scm_trace_object function to be written, exposing
some object representation details to the internal object-tracing
function definition while not exposing them to the user in the form of
API or ABI.
bugs
I found quite a few bugs. Not many of them were in Whippet, but some
were, and a few are still there; Guile exercises a GC more than my test
workbench is able to. Today I’d like
to write about a funny one that I haven’t fixed yet.
So, small objects in this garbage collector are managed by a Nofl
space. During a collection, each
pointer-containing reachable object is traced by a global user-supplied
tracing procedure. That tracing procedure should call a
collector-supplied inline function on each of the object’s fields.
Obviously the procedure needs a way to distinguish between different
kinds of objects, to trace them appropriately; in Guile, we use the
low bits of the initial word of heap objects for this purpose.
Object marks are stored in a side table in associated 4-MB aligned
slabs, with one mark byte per granule (16 bytes). 4 MB is 0x400000, so
for an object at address A, its slab base is at A & ~0x3fffff, and the
mark byte is offset by (A & 0x3fffff) >> 4. When the tracer sees an
edge into a block scheduled for evacuation, it first checks the mark
byte to see if it’s already marked in place; in that case there’s
nothing to do. Otherwise it will try to evacuate the object, which
proceeds as follows...
But before you read, consider that there are a number of threads which
all try to make progress on the worklist of outstanding objects needing
tracing (the grey objects). The mutator threads are paused; though we
will probably add concurrent tracing at some point, we are unlikely to
implement concurrent evacuation. But it could be that two GC threads
try to process two different edges to the same evacuatable object at the
same time, and we need to do so correctly!
With that caveat out of the way, the implementation is
here.
The user has to supply an annoyingly-large state machine to manage the
storage for the forwarding word; Guile’s is
here.
Basically, a thread will try to claim the object by swapping in a busy
value (-1) for the initial word. If that worked, it will allocate space
for the object. If that failed, it first marks the object in place,
then restores the first word. Otherwise it installs a forwarding
pointer in the first word of the object’s old location, which has a
specific tag in its low 3 bits allowing forwarded objects to be
distinguished from other kinds of object.
I don’t know how to prove this kind of operation correct, and probably I
should learn how to do so. I think it’s right, though, in the sense
that either the object gets marked in place or evacuated, all edges get
updated to the tospace locations, and the thread that shades the object
grey (and no other thread) will enqueue the object for further tracing
(via its new location if it was evacuated).
But there is an invisible bug, and one that is the reason for me writing
these words :) Whichever thread manages to shade the object from white
to grey will enqueue it on its grey worklist. Let’s say the object is
on an block to be evacuated, but evacuation fails, and the object gets
marked in place. But concurrently, another thread goes to do the same;
it turns out there is a timeline in which thread A has
marked the object, published it to a worklist for tracing, but thread B
has briefly swapped out the object’s first word with the busy value
before realizing the object was marked. The object might then be traced
with its initial word stompled, which is totally invalid.
What’s the fix? I do not know. Probably I need to manage the state
machine within the side array of mark bytes, and not split between the
two places (mark byte and in-object). Anyway, I thought that readers of
this web log might enjoy a look in the window of this clown car.
next?
The obvious question is, how does it perform? Basically I don’t know
yet; I haven’t done enough testing, and some of the heuristics need
tweaking. As it is, it appears to be a net improvement over the
non-moving configuration and a marginal improvement over BDW, though
currently with more variance. I am deliberately imprecise here because I
have been more focused on correctness than performance; measuring
properly takes time, and as you can see from the story above, there are
still a couple correctness issues. I will be sure to let folks know
when I have something. Until then, happy hacking!
Update on what happened in WebKit in the week from June 30 to July 7.
Improvements to Sysprof and related dependencies, WebKit's usage of
std::variant replaced by mpark::variant, major WebXR overhauling,
and support for the logd service on Android, are all part of this
week's bundle of updates.
Cross-Port 🐱
The WebXR support in the GTK and WPE WebKit ports has been ripped out in preparation for an overhaul that will make it better fit WebKit's multi-process architecture.
Note these are the first steps on this effort, and there is still plenty to do before WebXR experiences work again.
Changed usage of std::variant in favor of an alternative implementation based on mpark::variant, which reduces the size of the built WebKit library—currently saves slightly over a megabyte for release builds.
Adaptation of WPE WebKit targeting the Android operating system.
Logging support is being improved to submit entries to the logd service on Android, and also to configure logging using a system property. This makes debugging and troubleshooting issues on Android more manageable, and is particularly welcome to develop WebKit itself.
While working on this feature, the definition of logging channels was simplified, too.
Community & Events 🤝
WebKit on Linux integrates with Sysprof and reports a plethora of marks. As we report more information to Sysprof, we eventually pushed Sysprof internals to their limits! To help with that, we're adding a new feature to Sysprof: hiding marks from view.
Hello everyone! As we have with the last bunch of meetings, we're excited to tell you about all the new discussions taking place in TC39 meetings and how we try to contribute to them. However, this specific meeting has an even more special place in our hearts since Igalia had the privilege of organising it in our headquarters in A Coruña, Galicia. It was an absolute honor to host all the amazing delegates in our home city. We would like to thank everyone involved and look forward to hosting it again!
Let's delve together into some of the most exciting updates.
Array.from, which takes a synchronous iterable and dumps it into a new array, is one of Array's most frequently used built-in methods, especially for unit tests or CLI interfaces. However, there was no way to do the equivalent with an asynchronous iterator. Array.fromAsync solves this problem, being to Array.from as for await is to for. This proposal has now been shipping in all JS engines for at least a year (which means it's Baseline 2024), and it has been highly requested by developers.
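As a quick illustration (a minimal sketch, not an example taken from the proposal itself; ticks is just a made-up async generator):

async function* ticks() {
  yield 1;
  yield 2;
  yield 3;
}

// Inside an async function, or in a module with top-level await:
const values = await Array.fromAsync(ticks()); // [1, 2, 3]
// Array.from(ticks()) would not work here: an async generator is not
// synchronously iterable, so there is nothing for Array.from to consume.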
From a bureaucratic point of view, however, the proposal was never really at stage 3. In September 2022 it advanced to stage 3 with the condition that all three ECMAScript spec editors signed off on the spec text, and the editors requested that a pull request be opened against the spec with the actual changes. However, this PR was not opened until recently. So in this TC39 meeting, the proposal advanced to stage 4, conditional on the editors actually reviewing it.
The Explicit Resource Management proposal introduces implicit cleanup callbacks for objects based on lexical scope. This is enabled through the new using x = declaration:
{
  using myFile = open(fileURL);
  const someBytes = myFile.read();

  // myFile will be automatically closed, and the
  // associated resources released, here at the
  // end of the block.
}
The proposal is now shipped in Chrome, Node.js and Deno, and it's behind a flag in Firefox. As such, Ron Buckton asked for (and obtained!) consensus to approve it for Stage 4 during the meeting.
Similarly to Array.fromAsync, it's not quite Stage 4 yet, as there is still something missing before including it in the ECMAScript standard: test262 tests need to be merged, and the ECMAScript spec editors need to approve the proposed specification text.
The Error.isError(objectToCheck) method provides a reliable way to check whether a given value is a real instance of Error. This proposal was originally presented by Jordan Harband in 2015, to address concerns about it being impossible to detect whether a given JavaScript value is actually an error object or not (did you know that you can throw anything, including numbers and booleans!?). It finally became part of the ECMAScript standard during this meeting.
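A minimal sketch of what this enables:

Error.isError(new TypeError("boom"));       // true
Error.isError({ message: "not an error" }); // false
Error.isError("just a string");             // false
// Unlike `instanceof Error`, this also recognises Error objects created in
// another realm (for example inside an iframe), which was part of the
// original motivation.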
Intl.Locale objects represent Unicode Locale identifiers; i.e., a combination of language, script, region, and preferences for things like collation or calendar type.
For example, de-DE-1901-u-co-phonebk means "the German language as spoken in Germany with the traditional German orthography from 1901, using the phonebook collation". They are composed of a language optionally followed by:
a script (i.e. an alphabet)
a region
one or more variants (such as "the traditional German orthography from 1901")
a list of additional modifiers (such as collation)
Intl.Locale objects already had accessors for querying multiple properties about the underlying locale, but were missing one for the variants due to an oversight, and the committee reached consensus on also exposing them in the same way.
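For illustration, here is how the existing accessors behave, together with the agreed addition (assuming it lands as a variants accessor mirroring the existing ones; the exact shape could still be adjusted):

const locale = new Intl.Locale("de-DE-1901-u-co-phonebk");
locale.language;  // "de"
locale.region;    // "DE"
locale.collation; // "phonebk"
locale.variants;  // "1901"  (the new accessor agreed on in this meeting)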
The Intl.Locale Info Stage 3 proposal allows JavaScript applications to query some metadata specific to individual locales. For example, it's useful to answer the question: "what days are considered weekend in the ms-BN locale?".
The committee reached consensus on a change regarding information about text direction: in some locales text is written left-to-right, in others it's right-to-left, and for some of them it's unknown. The proposal now returns undefined for unknown directions, rather than falling back to left-to-right.
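A rough sketch of the behaviour under discussion, using the getWeekInfo() and getTextInfo() methods from the current proposal draft (exact shapes and return values may still change before the proposal is finalized):

new Intl.Locale("ms-BN").getWeekInfo(); // e.g. { firstDay: 1, weekend: [5, 7], minimalDays: 1 }
new Intl.Locale("ar").getTextInfo();    // { direction: "rtl" }
// For a locale whose direction is simply not known, the proposal now reports
// undefined instead of silently falling back to left-to-right.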
Our colleague Philip Chimento presented a regular status update on Temporal, the upcoming proposal for better date and time support in JS. The biggest news is that Temporal is now available in the latest Firefox release! The Ladybird, Graal, and Boa JS engines all have mostly-complete implementations. The committee agreed to make a minor change to the proposal, to the interpretation of the seconds (:00) component of UTC offsets in strings. (Did you know that there has been a time zone that shifted its UTC offset by just 20 seconds?)
The Immutable ArrayBuffer proposal allows creating ArrayBuffers in JS from read-only data, and in some cases allows zero-copy optimizations. After last time, the champions hoped they could get the tests ready for this plenary and ask for stage 3, but they did not manage to finish that on time. However, they did make a very robust testing plan, which should make this proposal "the most well-tested part of the standard library that we've seen thus far". The champions will ask to advance to stage 3 once all of the tests outlined in the plan have been written.
The iterator sequencing Stage 2.7 proposal introduces a new Iterator.concat method that takes a list of iterators and returns an iterator yielding all of their elements. It's the iterator equivalent of Array.prototype.concat, except that it's a static method.
Michael Ficarra, the proposal's champion, was originally planning to ask for consensus on advancing the proposal to Stage 3: test262 tests had been written, and on paper the proposal was ready. However, that was not possible because the committee discussed some changes about re-using "iterator result" objects that require some changes to the proposal itself (i.e. should Iterator.concat(x).next() return the same object as x.next(), or should it re-create it?).
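A small sketch of the proposed API (the result-object reuse question above does not change this basic usage):

const letters = ["a", "b"].values();
const numbers = [1, 2].values();

for (const value of Iterator.concat(letters, numbers)) {
  console.log(value); // "a", "b", 1, 2
}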
The iterator chunking Stage 2 proposal introduces two new Iterator.prototype.* methods: chunks(size), which splits the iterator into non-overlapping chunks, and windows(size), which generates overlapping chunks offset by 1 element:
[1,2,3,4].values().chunks(2);  // [1,2] and [3,4]
[1,2,3,4].values().windows(2); // [1,2], [2,3] and [3,4]
The proposal champion was planning to ask for Stage 2.7, but that was not possible due to some changes about the .windows behaviour requested by the committee: what should happen when requesting windows of size n out of an iterator that has less than n elements? We considered multiple options:
Do not yield any array, as it's impossible to create a window of size n
Yield an array with some padding (undefined?) at the end to get it to the expected length
Yield an array with fewer than n elements
The committee concluded that there are valid use cases both for (1) and for (3). As such, the proposal will be updated to split .windows() into two separate methods.
AsyncContext is a proposal that allows having state persisted across async flows of control -- like thread-local storage, but for asynchronicity in JS. The champions of the proposal believe async flows of control should not only flow through await, but also through setTimeout and other web features, such as APIs (like xhr.send()) that asynchronously fire events. However, the proposal was stalled due to concerns from browser engineers about the implementation complexity of it.
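To make the discussion more concrete, here is a rough sketch of the API as currently drafted by the champions, with an AsyncContext.Variable whose value follows the async flow; handleRequest and the URL are just placeholders, and details of the API may still change:

const requestId = new AsyncContext.Variable();

function handleRequest(id) {
  // Illustrative only: run the callback with `id` bound to requestId.
  requestId.run(id, async () => {
    await fetch("/api/data");     // the value follows the async flow across the await...
    console.log(requestId.get()); // ...so this still logs `id`
  });
}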
In this TC39 session, we brainstormed about removing some of the integration points with web APIs: in particular, context propagation through events caused asynchronously. This would work fine for web frameworks, but not for tracing tools, which is the other main use case for AsyncContext in the web. It was pointed out that if the context isn't propagated implicitly through events, developers using tracing libraries might be forced to snapshot contexts even when they're not needed, which would lead to userland memory leaks. In general, the room seemed to agree that the context should be propagated through events, at the very least in the cases in which this is feasible to implement.
This TC39 discussion didn't do much to move the proposal along, and we weren't expecting it to do so -- browser representatives in TC39 are mostly engineers working on the core JS engines (such as SpiderMonkey or V8), while the concerns were coming from engineers working on web APIs. However, the week after this TC39 plenary, Igalia organized the Web Engines Hackfest, also in A Coruña, where we could resume this conversation with the relevant people in the room. As a result, we've had positive discussions with Mozilla engineers about a possible path forward for the proposal that did propagate the context through events, analyzing in more detail the complexity of some specific APIs where we expect the propagation to be more complex.
The Math.clamp proposal adds a method to clamp a numeric value between two endpoints of a range. This proposal reached stage 1 last February, and in this plenary we discussed and resolved some of the open issues it had:
One of them was whether the method should be a static method Math.clamp(min, value, max), or whether it should be a method on Number.prototype so you could do value.clamp(min, max). We opted for the latter, since in the former the order of the arguments might not be clear.
Another was whether the proposal should support BigInt as well. Since we're making clamp a method of Number, we opted to only support the JS number type. A follow-up proposal might add this on BigInt.prototype as well.
Finally, there was some discussion about whether clamp should throw an exception if min is not lower or equal to max; and in particular, how this should work with positive and negative zeros. The committee agreed that this can be decided during Stage 2.
With this, the Math.clamp (or rather, Number.prototype.clamp) proposal advanced to stage 2. The champion was originally hoping to get to Stage 2.7, but they ended up not proposing it due to the pending planned changes to the proposed specification text.
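A minimal sketch of the shape the committee settled on (still Stage 2, so names and details may change):

(5).clamp(0, 10);  // 5
(-3).clamp(0, 10); // 0
(42).clamp(0, 10); // 10
// Still open: what (5).clamp(10, 0) should do, and how positive and
// negative zero endpoints should be compared.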
As it stands, JavaScript's built-in functionality for generating (pseudo-)random numbers does not accept a seed, a piece of data that anchors the generation of random numbers at a fixed place, ensuring that repeated calls to Math.random, for example, produce a fixed sequence of values. There are various use cases for such numbers, such as testing (how can I lock down the behavior of a function that calls Math.random for testing purposes if I don't know what it will produce?). This proposal seeks to add a new top-level Object, Random, that will permit seeding of random number generation. It was generally well received and advanced to stage 2.
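The API surface is still being designed, but the idea looks roughly like this (the names here are illustrative, not final):

const prng = new Random.Seeded(42);
prng.random(); // always the same first value for the same seed
prng.random(); // and the same second value, and so on
// Tests can now lock down code that consumes these numbers, because the
// sequence is reproducible.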
Tab Atkins-Bittner, who presented the Seeded PRNG proposal, continued in a similar vein with "More random functions". The idea is to settle on a set of functions that frequently arise in all sorts of settings, such as shuffling an array, generating a random number in an interval, generating a random boolean, and so on. There are a lot of fun ideas that can be imagined here, and the committee was happy to advance this proposal to stage 1 for further exploration.
Eemeli Aro of Mozilla proposed a neat bugfix for two parts of JavaScript's internationalization API that handle numbers. At the moment, when a digit string, such as "123.456", is given to the Intl.PluralRules and Intl.NumberFormat APIs, the string is converted to a Number. This is generally fine, but what about digit strings that contain trailing zeroes, such as "123.4560"? At the moment, that trailing zero gets removed and cannot be recovered. Eemeli suggested that we keep such digits. They make a difference when formatting numbers and when using them to pluralize words, as in "1.0 stars". This proposal advanced to stage 1, with the understanding that some work needs to be done to clarify how some already-existing options in the NumberFormat and PluralRules APIs are to be understood when handling such strings.
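A small example of the current behaviour with English plural rules; the second line is only a present-day approximation of what the proposal is after, since the exact semantics for digit strings are what the stage 1 work will pin down:

// Today the digit string is coerced to a Number, so the trailing zero is
// lost and "1.0" pluralizes like exactly 1.
new Intl.PluralRules("en").select("1.0"); // "one"  -> "1 star"

// Stating the fraction digits explicitly shows the behaviour the trailing
// zero is meant to convey: "1.0" should pluralize like other decimals.
new Intl.PluralRules("en", { minimumFractionDigits: 1 }).select(1); // "other" -> "1.0 stars"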
We shared the latest developments on the Decimal proposal and its potential integration with Intl, focusing on the concept of amounts. These are lightweight wrapper classes designed to pair a decimal number with an integer "precision", representing either the number of significant digits or the number of fractional digits, depending on context. The discussion was a natural follow-on to the earlier discussion of keeping trailing zeroes in Intl.NumberFormat and Intl.PluralRules. In discussions about decimal, we floated the idea of a string-based version of amounts, as opposed to one backed by a decimal, but this was a new, work-in-progress idea. It seems that the committee is generally happy with the underlying decimal proposal but not yet convinced about the need for a notion of an amount, at least as it was presented. Decimal stays at stage 1.
Many JS environments today provide some sort of assertion functions. (For example, console.assert, Node.js's node:assert module, the chai package on NPM.) The committee discussed a new proposal presented by Jacob Smith, Comparisons, which explores whether this kind of functionality should be part of the ECMAScript standard. The proposal reached stage 1, so the investigation and scoping will continue: should it cover rich equality comparisons, should there be some sort of test suite integration, should there be separate debug and production modes? These questions will be explored in future meetings.
If you look at the specifications for HTML, the DOM, and other web platform features, you can't miss the Web IDL snippets in there. This IDL is used to describe all of the interfaces available in web browser JS environments, and how each function argument is processed and validated.
IDL does not only apply to the specifications! The IDL code is also copied directly into web browsers' code bases, sometimes with slight modifications, and used to generate C++ code.
Tooru Fujisawa (Arai) from Mozilla brought this proposal back to the committee after a long hiatus, and presented a vision of how the same thing might be done in the ECMAScript specification, gradually. This would lower maintenance costs for any JS engine, not just web browsers. However, the way that function arguments are generally handled differs sufficiently between web platform APIs and the ECMAScript specification that it wouldn't be possible to just use the same Web IDL directly.
Tooru presented some possible paths to squaring this circle: adding new annotations to the existing Web IDL or defining new syntax to support the ECMAScript style of operations.
The May 2025 plenary was packed with exciting progress across the JavaScript language and internationalization features. It was also a special moment for us at Igalia as proud hosts of the meeting in our hometown of A Coruña.
We saw long-awaited proposals like Array.fromAsync, Error.isError, and Explicit Resource Management reach Stage 4, while others continued to evolve through thoughtful discussion and iteration.
We’ll continue sharing updates as the work evolves. Until then, thanks for reading, and see you at the next meeting!
Update on what happened in WebKit in the week from June 24 to July 1.
This was a slow week, with the main highlight being new development
releases of WPE WebKit and WebKitGTK.
Cross-Port 🐱
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
Made some further progress bringing the 32-bit version of OMG closer to the 64-bit one.
Releases 📦️
WebKitGTK 2.49.3 and WPE WebKit 2.49.3 have been released. These are development snapshots intended to allow those interested to test the new features and improvements which will be part of the next stable release series. As usual, bug reports are welcome in the WebKit Bugzilla.
Back in September 2024 I wrote a piece about the history of attempts at standardizing some kind of Micropayments going back to the late 90s. Like a lot of things I write, it's the outcome of looking at history and background for things that I'm actively thinking about. An announcement the other day made me think that perhaps now is a good time for a follow up post.
As you probably already know if you're reading this, I write and think a lot about the health of the web ecosystem. We've even got a whole playlist of videos (lots of podcast episodes) on the topic on YouTube. Today, that's nearly all paid for, on all sides, by advertising. In several important respects, it's safe to say that the status quo is under many threats. In several ways it's also worth questioning if the status quo is even good.
When Ted Nelson first imagined Micropayments in the 1960s, he was imagining a fair economic model for digital publishing. We've had many ideas and proposals since then. Web Monetization is one idea which isn't dead yet. Its main ideas involve embedding a declarative link to a "payment pointer" (like a wallet address) where payments can be sent via Interledger. I say "sent", but "streamed" might be more accurate. Interledger is a novel idea which treats money as "packets" and routes small amounts around. Full disclosure: Igalia has been doing some prototype work in Chromium to help see what a native implementation would look like, what its architecture would be and what options this opens (or closes). Our work has been funded by the Interledger Foundation. It does not amount to an endorsement, and it does not mean something will ship. That said, it doesn't mean the opposite either.
You might know that Brave, another Chromium-based browser, has a system for creators too. In their model, publishers/creators sign up and verify their domain (or social accounts!), and people browsing those sites with Brave browsers sort of keep track of that locally; at the end of the month Brave can batch up and settle accounts of Basic Attention Tokens ("BAT"), which it can then pay out to creators in lump sums. As of the time of this writing, Brave has 88 million monthly active users (source) who could be paying its 1.67 million plus content creators and publishers (source).
Finally, in India, UPI offers most transactions free of charge and can also be used for micro payments - it's being used in $240 billion USD / month worth of transactions!
But there's also some "adjacent" stuff that doesn't claim to be micro transactions but somehow are similar:
If you've ever used Microsoft's Bing search engine, they also give you "points" (I like to call them "Bing Bucks") which you can trade in for other stuff (the payment is going in a different direction!). There was also Scroll, years ago, which was aimed to be a kind of universal service you could pay into to remove ads on many properties (it was bought by Twitter and shut down.)
Enter: Offerwall
Just the other day, Google Ad Manager gave a new idea a potentially really significant boost. I think it's worth looking at: Offerwall. Offerwall lets sites offer a few different ways to monetize content, and lets users choose the one that they prefer. For example, a publisher can set things up to allow reading their site in exchange for watching an ad (similar to YouTube's model). That's pretty interesting, but far more interesting to me is that it integrates with a third-party service called Supertab. Supertab lets people provide their own subscriptions - including a tiny fee for a single page, or access to the site for some timed pass - 4 hours, 24 hours, a week, etc. It does this with pretty friction-less wallet integration and by 'pooling' the funds until it makes sense to do a real, regular transaction. Perhaps the easiest thing is to look at some of their own examples.
Offerwall also allows other integrations, so maybe we'll see some of these begin to come together somehow too.
It's a very interesting way to split the difference and address a few complaints from micro transaction critics, and from people generally skeptical that something like this could gain significant traction. More than that even, it seems to me that by integrating with Google Ad Manager it's got about as much advantage as anyone could get (the vast majority of ads are already served with Google Ad Manager, and this actually tries to expand that).
I'm very keen to see how this all plays out! What do you think will happen? Share your thoughts with me on social media.
Multiple MediaRecorder-related improvements landed in main recently (1, 2, 3, 4), and also in GStreamer.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
JSC saw some fixes in i31 reference types when using Wasm GC.
WPE WebKit 📟
WPE now has support for analog gamepad buttons when using libwpe. Since version 1.16.2 libwpe has the capability to handle analog gamepad button events, but the support on the WPE side was missing. It has now been added, and will be enabled when the appropriate versions of libwpe are used.
I've been kicking the tires on various LLMs lately, and like many have been
quite taken by the pace of new releases especially of models with weights
distributed under open licenses, always with impressive benchmark results. I
don't have local GPUs so trialling different models necessarily requires using
an external host. There are various configuration parameters you can set when
sending a query that affect generation and many vendors document
recommended settings on the model card or associated documentation. For my own
purposes I wanted to collect these together in one place, and also confirm in
which cases common serving software like
vLLM will use defaults provided
alongside the model.
Main conclusions
If accessing a model via a hosted API you typically don't have much insight
into their serving setup, so explicitly setting parameters client-side is
probably your best bet if you want to try out a model and ensure any
recommended parameters are applied to generation.
Although recent versions of vLLM will take preferred parameters from
generation_config.json, not all models provide that file or if they do,
they may not include their documented recommendations in it.
Some model providers have very strong and clear recommendations about which
parameters to set to which values, for others it's impossible to find any
guidance one way or another (or even what sampling setup was used for their
benchmark results).
Sadly there doesn't seem to be a good alternative to trawling through the
model descriptions and associated documentation right now (though hopefully
this page helps!).
Even if every model starts consistently setting preferred parameters in
generation_config.json (and inference API providers respect this), and/or
a standard like model.yaml is adopted containing
these parameters, some attention may still be required if a model has
different recommendations for different use cases / modes (as Qwen3 does).
And of course there's a non-conclusion on how much this really matters. I
don't know. Clearly for some models it's deemed very important, for the
others it's not always clear whether it just doesn't matter much, or if the
model producer has done a poor job of documenting it.
Overview of parameters
The parameters supported by vLLM are documented
here,
though not all are supported in the HTTP API provided by different vendors.
For instance, the subset of parameters supported by models on
Parasail (an inference API provider I've been
kicking the tires on recently) is documented
here.
I cover just that subset below:
temperature: controls the randomness of sampling of tokens. Lower values are
more deterministic, higher values are more random. This is one of the
parameters you'll see spoken about the most.
top_p: limits the tokens that are considered. If set to e.g. 0.5 then only
consider the top most probable tokens whose summed probability doesn't
exceed 50%.
top_k: also limits the tokens that are considered, such that only the top
k tokens are considered.
frequency_penalty: penalises new tokens based on their frequency in the
generated text. It's possible to set a negative value to encourage
repetition.
presence_penalty: penalises new tokens if they appear in the generated text
so far. It's possible to set a negative value to encourage repetition.
repetition_penalty: This is documented as being a parameter that penalises
new tokens based on whether they've appeared so far in the generated text or
prompt.
Based on that description it's not totally obvious how it differs from the
frequency or presence penalties, but given the description talks about
values greater than 1 penalising repeated tokens and values less than 1
encouraging repeated tokens, we can infer it is applied as a multiplication
rather than an addition.
We can confirm this implementation by tracing through where penalties are
applied in vllm's
sampler.py,
which in turn calls the apply_penalties helper
function.
This confirms that the frequency and presence penalties are applied based
only on the output, while the repetition penalty takes the
prompt into account as well. Following the call-stack down to an
implementation of the repetition
penalty
shows that if the
logit
is positive, it divides by the penalty and otherwise multiplies by it.
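Putting that together, here's a simplified sketch of the semantics described above as applied to a single token's logit; this is an illustration only, not vLLM's actual implementation (which operates on tensors in Python inside apply_penalties):

// promptCount/outputCount: how many times the token appeared in the prompt
// and in the generated output so far.
function adjustLogit(logit, promptCount, outputCount,
                     repetitionPenalty, frequencyPenalty, presencePenalty) {
  // repetition_penalty is multiplicative and considers prompt + output tokens.
  if (promptCount + outputCount > 0) {
    logit = logit > 0 ? logit / repetitionPenalty : logit * repetitionPenalty;
  }
  // frequency/presence penalties are additive and consider output tokens only.
  logit -= frequencyPenalty * outputCount;              // grows with repeat count
  logit -= presencePenalty * (outputCount > 0 ? 1 : 0); // flat penalty once it has appeared
  return logit;
}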
This was a pointless sidequest as this is a vllm-specific parameter that
none of the models I've seen has a specific recommendation for.
Default vLLM behaviour
The above settings are typically exposed via the API, but what if you don't
explicitly set them? vllm
documents
that it will by default apply settings from generation_config.json
distributed with the model on HuggingFace if it exists (overriding its own
defaults), but you can ignore generation_config.json to just use vllm's own
defaults by setting --generation-config vllm when launching the server. This
behaviour was introduced in a PR that landed in early March this
year. We'll explore below
which models actually have a generation_config.json with their recommended
settings, but what about parameters not set in that file, or if that file
isn't present? As far as I can see, that's where
_DEFAULT_SAMPLING_PARAMS
comes in and we get temperature=1.0 and repetition_penalty, top_p, top_k and
min_p set to values that have no effect on the sampler.
Although Parasail use vllm for serving most (all?) of their hosted models,
it's not clear if they're running with a configuration that allows defaults to
be taken from generation_config.json. I'll update this post if that is
clarified.
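Given that uncertainty, a reasonable belt-and-braces approach is to pass the vendor-recommended values explicitly with every request. Here's a minimal sketch against an OpenAI-compatible endpoint; the URL, key handling, and model name are placeholders, and note that top_k is an extension accepted by vLLM-backed endpoints rather than part of the core OpenAI schema, so not every provider accepts it:

const response = await fetch("https://api.example.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.API_KEY}`, // placeholder key handling
  },
  body: JSON.stringify({
    model: "some-model-name", // placeholder
    messages: [{ role: "user", content: "Hello!" }],
    // Explicit sampling settings, e.g. values recommended on a model card:
    temperature: 0.6,
    top_p: 0.95,
    top_k: 20,
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);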
Recommended parameters from model vendors
As all of these models are distributed with benchmark results front and
center, it should be easy to at least find what settings were used for these
results, even if it's not an explicit recommendation on which parameters to
use - right? Let's find out. I've decided to step through models grouped by
their level of openness.
Recommendation: temperature=0.3 (specified on model card)
generation_config.json with recommended parameters: No.
Notes:
This model card is what made me pay more attention to these parameters -
DeepSeek went as far as to map a temperature of 1.0 via the API to
their recommended
0.3
(API temperatures between 0 and 1 are scaled down so that 1.0 maps to 0.3,
and they subtract 0.7 for temperatures between 1 and 2). So clearly they're
keen to override clients that default to setting temperature=1.0.
There's no generation_config.json and the V3 technical
report indicates they used
temperature=0.7 for some benchmarks. They also state "Benchmarks
containing fewer than 1000 samples are tested multiple times using varying
temperature settings to derive robust final results" (not totally clear if
results are averaged, or the best result is taken). There's no
recommendation I can see for other generation parameters, and to add some
extra confusion the DeepSeek API docs have a page on the temperature
parameter
with specific recommendations for different use cases and it's not totally
clear if these apply equally to V3 (after its temperature scaling) and R1.
Recommendation: temperature=0.6, top_p=0.95 (specified on model
card)
generation_config.json with recommended parameters: Yes.
Notes: They report using temperature=0.6 and top_p=0.95 for their
benchmarks (this is stated both on the model card and the
paper) and state that temperature=0.6
is the value used for the web chatbot interface. They do have a
generation_config.json that includes that
setting.
Notes: I saw that one of Mistral's API
methods
for their hosted models returns the default_model_temperature. Executing
curl --location "https://api.mistral.ai/v1/models" --header "Authorization: Bearer $MISTRAL_API_KEY" | jq -r '.data[] | "\(.name): \(.default_model_temperature)"' | sort gives some confusing results. The
mistral-small-2506 version isn't yet available on the API. But the older
mistral-small-2501 is, with a default temperature of 0.3 (differing
from the recommendation on the model
card).
mistral-small-2503 has null for its default temperature. Go figure.
generation_config.json with recommended parameters: No.
Notes: This is a fine-tune of Mistral-Small-3.1. There is no explicit
recommendation for temperature on the model card, but the example code
does use temperature=0.15. However, this isn't set in
generation_config.json
(which doesn't set any default parameters) and Mistral's API indicates a
default temperature of 0.0.
Recommendation: temperature=0.7, top_p=0.95 (specified on model
card)
generation_config.json with recommended parameters:
No (file exists, but parameters missing).
Notes: The model card has a very clear recommendation to use
temperature=0.7 and top_p=0.95 and this default temperature is also reflected
in Mistral's API mentioned above.
qwen3
family
including Qwen/Qwen3-235B-A22B, Qwen/Qwen3-30B-A3B, Qwen/Qwen3-32B, and
more.
Recommendation: temperature=0.6, top_p=0.95, top_k=20, min_p=0 for thinking mode
and for non-thinking mode temperature=0.7, top_p=0.8, top_k=20, min_p=0 (specified on model card)
generation_config.json with recommended parameters: Yes, e.g. for
Qwen3-32B
(uses the "thinking mode" recommendations). (All the ones I've checked
have this at least).
Notes: Unlike many others, there is a very clear recommendation under
the best practices section of each model
card, which for all
models in the family that I've checked makes the same recommendation. They
also suggest setting the presence_penalty between 0 and 2 to reduce
endless repetitions. The Qwen 3 technical
report notes the same parameters but
also states that for the non-thinking mode they set presence_penalty=1.5
and applied the same setting for thinking mode for the Creative Writing v3
and WritingBench benchmarks.
generation_config.json with recommended parameters:
Yes
(temperature=1.0 should be the vllm default anyway, so it shouldn't
matter that it isn't specified).
Notes: It was surprising to not see more clarity on this in the model
card or technical
report,
neither of which have an explicit recommendation. As noted above, the
generation_config.json does set top_k and top_p and the Unsloth
folks apparently had confirmation from the Gemma team on the recommended
temperature, though I couldn't find a public comment from them directly.
generation_config.json with recommended parameters:
Yes.
Notes: There was no discussion of recommended parameters in the model
card itself. I accessed generation_config.json via a third-party mirror
as providing name and DoB to view it on HuggingFace (as required by
Llama's restrictive access policy) seems ridiculous.
model.yaml
As it happens, while writing this blog post I saw Simon Willison blogged
about model.yaml.
Model.yaml is an initiative from the LM Studio folks
to provide a definition of a model and its sources that can be used with
multiple local inference tools. This includes the ability to specify preset
options for the model. It doesn't appear to be used by anyone else though, and
looking at the LM Studio model catalog, taking
qwen/qwen3-32b as an example:
although the Qwen3 series have very strongly recommended default settings, the
model.yaml only sets top_k and min_p, leaving temperature and top_p
unset.
The DRM GPU scheduler is a shared Direct Rendering Manager (DRM) Linux Kernel level component used by a number of GPU drivers for managing job submissions from multiple rendering contexts to the hardware. Some of the basic functions it can provide are dependency resolving, timeout detection, and most importantly for this article, scheduling algorithms whose essential purpose is picking the next queued unit of work to execute once there is capacity on the GPU.
Different kernel drivers use the scheduler in slightly different ways - some simply need the dependency resolving and timeout detection parts, with the actual scheduling happening in proprietary firmware, while others rely on the scheduler’s algorithms for choosing what to run next. The latter group is what the work described here is aiming to improve.
More details about the other functionality provided by the scheduler, including some low level implementation details, are available in the generated kernel documentation repository[1].
Three DRM scheduler data structures (or objects) are relevant for this topic: the scheduler, scheduling entities and jobs.
First we have the scheduler itself, which usually corresponds to some hardware unit which can execute certain types of work. For example, the render engine can often be a single hardware instance in a GPU and needs arbitration for multiple clients to be able to use it simultaneously.
Then there are scheduling entities, or in short entities, which broadly speaking correspond with userspace rendering contexts. Typically when a userspace client opens a render node, one such rendering context is created. Some drivers also allow userspace to create multiple contexts per open file.
Finally there are jobs which represent units of work submitted from userspace into the kernel. These are typically created as a result of userspace doing an ioctl(2) operation, which are specific to the driver in question.
Jobs are usually associated with entities and entities are then executed by schedulers. Each scheduler instance will have a list of runnable entities (entities with at least one queued job) and when the GPU is available to execute something it will need to pick one of them.
Typically every userspace client will submit at least one such job per rendered frame and the desktop compositor may issue one or more to render the final screen image. Hence, on a busy graphical desktop, we can find dozens of active entities submitting multiple GPU jobs, sixty or more times per second.
In order to select the next entity to run, the scheduler defaults to the First In First Out (FIFO) mode of operation where the selection criterion is the job submit time.
The FIFO algorithm in general has some well known disadvantages around the areas of fairness and latency, and because the selection criterion is based on job submit time, it couples the selection with the CPU scheduler. This is not desirable because it creates an artificial coupling between different schedulers, different sets of tasks (CPU processes and GPU tasks), and different hardware blocks.
This is further amplified by the lack of guarantee that clients are submitting jobs with equal pacing (not all clients may be synchronised to the display refresh rate, or not all may be able to maintain it), the fact that their per-frame submissions may consist of an unequal number of jobs, and, last but not least, the lack of preemption support. The latter is true both for the DRM scheduler itself and for many GPUs at the hardware level.
Apart from uneven GPU time distribution, the end result of the FIFO algorithm picking the sub-optimal entity can be dropped frames and choppy rendering.
Apart from the default FIFO scheduling algorithm, the scheduler also implements the round-robin (RR) strategy, which can be selected as an alternative at kernel boot time via a kernel argument. Round-robin, however, suffers from its own set of problems.
Whereas round-robin is typically considered a fair algorithm when used in systems with preemption support and ability to assign fixed execution quanta, in the context of GPU scheduling this fairness property does not hold. Here quanta are defined by userspace job submissions and, as mentioned before, the number of submitted jobs per rendered frame can also differ between different clients.
The final result can again be unfair distribution of GPU time and missed deadlines.
In fact, round-robin was the initial and only algorithm until FIFO was added to resolve some of these issues. More can be read in the relevant kernel commit. [2]
Another issue in the current scheduler design is the combination of priority queues and strict priority order execution.
Priority queues serve the purpose of implementing support for entity priority, which usually maps to userspace constructs such as VK_EXT_global_priority and similar. If we look at the wording for this specific Vulkan extension, it is described like this[3]:
The driver implementation *will attempt* to skew hardware resource allocation in favour of the higher-priority task. Therefore, higher-priority work *may retain similar* latency and throughput characteristics even if the system is congested with lower priority work.
As emphasised, the wording gives implementations leeway to not be entirely strict, while the current scheduler implementation only executes lower priorities when the higher priority queues are all empty. This over-strictness can lead to complete starvation of the lower priorities.
To solve both the issue of the weak scheduling algorithm and the issue of priority starvation we tried an algorithm inspired by the Linux kernel’s original Completely Fair Scheduler (CFS)[4].
With this algorithm the next entity to run will be the one with the least virtual GPU time spent so far, where virtual GPU time is calculated from the real GPU time scaled by a factor based on the entity priority.
Since the scheduler already manages a rbtree[5] of entities, sorted by the job submit timestamp, we were able to simply replace that timestamp with the calculated virtual GPU time.
When an entity has nothing more to run it gets removed from the tree and we store the delta between its virtual GPU time and the top of the queue. And when the entity re-enters the tree with a fresh submission, this delta is used to give it a new relative position considering the current head of the queue.
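In rough pseudo-code terms, the selection rule just described looks something like the sketch below. This is only an illustrative model, not the kernel implementation, and the priority weights are made up for the example:

// Real GPU time is scaled by a priority-based weight to produce virtual GPU
// time; the runnable entity with the least virtual time runs next.
function virtualTime(entity) {
  // Higher priority => smaller weight => virtual time grows more slowly,
  // so the entity gets a proportionally larger share of real GPU time.
  const weight = { high: 1, normal: 2, low: 4 }[entity.priority];
  return entity.gpuTimeUsed * weight;
}

function pickNext(runnableEntities) {
  // The scheduler keeps entities in an rbtree keyed by this value; a linear
  // scan is enough to express the idea here.
  return runnableEntities.reduce((a, b) => (virtualTime(a) <= virtualTime(b) ? a : b));
}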
Because the scheduler does not currently track GPU time spent per entity, this is something that we needed to add to make this possible. It did not, however, pose a significant challenge, apart from a slight weakness where the up-to-date utilisation can lag slightly behind the actual numbers due to some DRM scheduler internal design choices. But that is a different and wider topic which is out of the intended scope for this write-up.
The virtual GPU time selection criteria largely decouples the scheduling decisions from job submission times, to an extent from submission patterns too, and allows for more fair GPU time distribution. With a caveat that it is still not entirely fair because, as mentioned before, neither the DRM scheduler nor many GPUs support preemption, which would be required for more fairness.
Because priority is now consolidated into a single entity selection criterion we were also able to remove the per priority queues and eliminate priority based starvation. All entities are now in a single run queue, sorted by the virtual GPU time, and the relative distribution of GPU time between entities of different priorities is controlled by the scaling factor which converts the real GPU time into virtual GPU time.
Another benefit of being able to remove per priority run queues is a code base simplification. Going further than that, if we are able to establish that the fair scheduling algorithm has no regressions compared to FIFO and RR, we can also remove those two which further consolidates the scheduler. So far no regressions have indeed been identified.
As a first example we set up three demanding graphical clients, one of which was set to run with low priority (VK_QUEUE_GLOBAL_PRIORITY_LOW_EXT).
One client is the Unigine Heaven benchmark[6] which is simulating a game, while the other two are two instances of the deferredmultisampling Vulkan demo from Sascha Willems[7], modified to support running with a user-specified global priority. Those two are simulating very heavy GPU load running simultaneously with the game.
All tests are run on a Valve Steam Deck OLED with an AMD integrated GPU.
First we try the current FIFO based scheduler and we monitor the GPU utilisation using the gputop[8] tool. We can observe two things:
That the distribution of GPU time between the normal priority clients is not equal.
That the low priority client is not getting any GPU time.
Switching to the CFS inspired (fair) scheduler the situation changes drastically:
GPU time distribution between normal priority clients is much closer together.
Low priority client is not starved, but receiving a small share of the GPU.
Note that the absolute numbers are not static but represent a trend.
This proves that the new algorithm can make the low priority useful for running heavy GPU tasks in the background, similar to what can be done on the CPU side of things using the nice(1) process priorities.
Apart from experimenting with real world workloads, another functionality we implemented in the scope of this work is a collection of simulated workloads implemented as kernel unit tests based on the recently merged DRM scheduler mock scheduler unit test framework[9][10]. The idea behind those is to make it easy for developers to check for scheduling regressions when modifying the code, without the need to set up sometimes complicated testing environments.
Let us look at a few examples on how the new scheduler compares with FIFO when using those simulated workloads.
First an easy, albeit exaggerated, illustration of priority starvation improvements.
Here we have a normal priority client and a low priority client submitting many jobs asynchronously (only waiting for the submission to finish after having submitted the last job). We look at the number of outstanding jobs (queue depth - qd) on the Y axis and the passage of time on the X axis. With the FIFO scheduler (blue) we see that the low priority client is not making any progress whatsoever until all submissions of the normal priority client have been completed. Switching to the CFS inspired scheduler (red) this improves dramatically and we can see the low priority client making slow but steady progress from the start.
Second example is about fairness where two clients are of equal priority:
Here the interesting observation is that the lines graphed for the new scheduler are much straighter. This means that the GPU time distribution is more equal, or fair, because the selection criterion is decoupled from the job submission time and is instead based on each client’s GPU time utilisation.
For the final set of test workloads we will look at the rate of progress (aka frames per second, or fps) between different clients.
In both cases we have one client representing a heavy graphical load, and one representing an interactive, lightweight client. They are running in parallel, but we will only look at the interactive client in the graphs, because the goal is to see what frame rate the interactive client can achieve when competing for the GPU. In other words, we use that as a proxy for assessing the user experience of using the desktop while there is simultaneous heavy GPU usage from another client.
The interactive client is set up to spend 1ms of GPU time in every 10ms period, resulting in an effective GPU load of 10%.
First test is with a heavy client wanting to utilise 75% of the GPU by submitting three 2.5ms jobs back to back, repeating that cycle every 10ms.
We can see that the average frame rate the interactive client achieves with the new scheduler is much higher than under the current FIFO algorithm.
For the second test we made the heavy GPU load client even more demanding by making it want to completely monopolise the GPU. It is now submitting four 50ms jobs back to back, and only backing off for 1us before repeating the loop.
Again the new scheduler is able to give significantly more GPU time to the interactive client compared to what FIFO is able to do.
From all the above it appears that the experiment was successful. We were able to simplify the code base, solve the priority starvation and improve scheduling fairness and GPU time allocation for interactive clients. No scheduling regressions have been identified to date.
The complete patch series implementing these changes is available at[11].
Because this work has simplified the scheduler code base and introduced entity GPU time tracking, it also opens up possibilities for future experimenting with other modern algorithms. One example could be an EEVDF[12] inspired scheduler, given that this algorithm has recently improved upon the kernel’s CPU scheduler and looks potentially promising because it combines fairness and latency in one algorithm.
Connection with the DRM scheduling cgroup controller proposal
Another interesting angle is that, as this work implements scheduling based on virtual GPU time, which as a reminder is calculated by scaling the real time by a factor based on entity priority, it can be tied really elegantly to the previously proposed DRM scheduling cgroup controller.
There we had group weights already which can now be used when scaling the virtual time and lead to a simple but effective cgroup controller. This has already been prototyped[13], but more on that in a following blog post.
Update on what happened in WebKit in the week from May 27 to June 16.
After a short hiatus coinciding with this year's edition of the Web Engines
Hackfest, this issue covers a mixed bag of new API features, releases,
multimedia, and graphics work.
Cross-Port 🐱
A new WebKitWebView::theme-color property has
been added to the public API, along with a
corresponding webkit_web_view_get_theme_color() getter. Its value follows
that of the theme-color metadata
attribute
declared by pages loaded in the web view. Although applications may use the
theme color in any way they see fit, the expectation is that it will be used to
adapt their user interface (as in this
example) to
complement the Web content being displayed.
Multimedia 🎥
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
Damage propagation has been toggled on for the GTK
port: for now only a single rectangle
is passed to the UI process, which then is used to let GTK know which part of a
WebKitWebView has received changes since the last repaint. This is a first
step to get damage tracking code widely tested, with further improvements to be
enabled later when considered appropriate.
Adaptation of WPE WebKit targeting the Android operating system.
WPE-Android 0.2.0
has been released. The main change in this version is the update to WPE WebKit
2.48.3, which is the first that can be built for Android out of the box,
without needing any additional patching. Thanks to this, we expect that the WPE
WebKit version used will receive more frequent updates going forward. The
prebuilt packages available at the Maven Central
repository
have been updated accordingly.
Releases 📦️
WebKitGTK
2.49.2 and
WPE WebKit 2.49.2 have
been released. These are development snapshots and are intended to let those
interested test out upcoming features and improvements, and as usual issue
reports are welcome in Bugzilla.
The Yocto project has well-established OS update mechanisms available via third-party layers. But, did you know that recent releases of Yocto already come with a simple update mechanism?
The goal of this blog post is to present an alternative that doesn’t require a third-party layer and explain how it can be integrated with your Yocto-based image.
Enter systemd-sysupdate: a mechanism capable of automatically discovering, downloading, and installing A/B-style OS updates. In a nutshell, it provides:
Atomic updates for a collection of different resources (files, directories or partitions).
Updates from remote and local sources (HTTP/HTTPS and directories).
Parallel installed versions A/B/C/… style.
Relatively small footprint (~10 MiB, or roughly a 5% increase in our demo image).
Basic features are available since systemd 251 (released in May 2022).
Optional built-in services for updating and rebooting.
Optional DBus interface for application integration.
Optional grouping of resources to be enabled together as features.
Together with automatic boot assessment, systemd-boot, and other tools, we can turn this OS update mechanism into a comprehensive alternative for common scenarios.
In order for sysupdate to determine the current version of the OS, it looks for the os-release file and inspects it for an IMAGE_VERSION field. Therefore, the image version must be included in that file.
Resources that require updating must also be versioned with the image version. Following our previous assumptions:
The UKI filename is suffixed with the image version (e.g., uki_0.efi where 0 is the image version).
The rootfs partition is also versioned by suffixing the image version in its partition name (e.g., rootfs_0 could be the initial name of the partition).
To implement these changes in your Yocto-based image, the following recipes should be added or overridden:
Note that the value of IMAGE_VERSION can be hardcoded, provided by the continuous integration pipeline or determined at build-time (e.g., the current date and time).
In the above recipes, we’re adding the suffix to the UKI filename and partition name, and we’re also coupling our UKI directly to its corresponding rootfs partition.
By default, sysupdate is disabled in Yocto’s systemd recipe and there are no “default” transfer files for sysupdate. Therefore you must:
Override systemd build configuration options and dependencies.
Write transfer files for each resource that needs to be updated.
Extend the partitions kickstart file with an additional partition that must mirror the original rootfs partition. This is to support an A/B OS update scheme.
To implement these changes in your Yocto-based image, the following recipes should be added or modified:
Updates can be served locally via regular directories or remotely via a regular HTTP/HTTPS web server. For Over-the-air (OTA) updates, HTTP/HTTPS is the correct option. Any web server can be used.
When using HTTP/HTTPS, sysupdate will request a SHA256SUMS checksum file. This file acts as the update server’s “manifest”, describing what updated resources are available.
Over the past few months I had the chance to spend some time looking at an
interesting new FUSE feature. This feature, merged into the Linux kernel 6.14
release, has introduced the ability to perform the communication between the
user-space server (or FUSE server) and the kernel using io_uring. This means
that file systems implemented in user-space will get a performance improvement
simply by enabling this new feature.
But let's start with the beginning:
What is FUSE?
Traditionally, file systems in *nix operating systems have been implemented
within their (monolithic) kernels. From the BSDs to Linux, file systems were
all developed in the kernel. Obviously, the exceptions already existed since
the beginning as well. Micro-kernels, for example, could be executed in ring0,
while their file systems would run as servers at lower privilege levels. But
these were the exceptions.
There are, however, several advantages in implementing them in user-space
instead. Here are just a few of the most obvious ones:
It's probably easier to find people experienced in writing user-space code
than kernel code.
It is easier, generally speaking, to develop, debug, and test user-space
applications. Not because kernel code is necessarily more complex, but because
the kernel development cycle is slower, requiring specialised tools and knowledge.
There are more tools and libraries available in user-space. It's way easier
to just pick an already existing compression library to add compression in
your file system than having it re-implemented in the kernel. Sure, nowadays
the Linux kernel is already very rich in all sorts of library-like subsystems,
but still.
Security, of course! Code in user-space can be isolated, while in the kernel
it would be running in ring0.
And, obviously, porting a file system to a different operating system is
much easier if it's written in user-space.
And this is where FUSE can help: FUSE is a framework that provides the necessary
infrastructure to make it possible to implement file systems in user-space.
FUSE includes two main components: a kernel-space module, and a user-space
server. The kernel-space fuse module is responsible for getting all the
requests from the virtual file system layer (VFS), and redirecting them to the
user-space FUSE server. The communication between the kernel and the FUSE
server is done through the /dev/fuse device.
There's also a third optional component: libfuse. This is a user-space
library that makes life easier for developers implementing a file system as it
hides most of the details of the FUSE protocol used to communicate between user-
and kernel-space.
The diagram below illustrates the interaction between all these
components.
FUSE diagram
As the diagram shows, when an application wants to execute an operation on a
FUSE file system (for example, reading a few bytes from an open file), the
workflow is as follows:
The application executes a system call (e.g., read() to read data from an
open file) and enters kernel space.
The kernel VFS layer routes the operation to the appropriate file system
implementation, the FUSE kernel module in this case. However, if the
read() is done on a file that has been recently accessed, the data may
already be in the page cache. In this case the VFS may serve the request
directly and return the data immediately to the application without calling
into the FUSE module.
FUSE creates a new request to be sent to the user-space server, and
queues it. At this point, the application performing the read() is
blocked, waiting for the operation to complete.
The user-space FUSE file system server gets the new request from /dev/fuse
and starts processing it. This may include, for example, network
communication in the case of a network file system.
Once the request is processed, the user-space FUSE server writes the reply
back into /dev/fuse.
The FUSE kernel module will get that reply, return it to VFS and the
user-space application will finally get its data.
As we can see, there are a lot of blocking operations and context switches
between user- and kernel-space.
What's io_uring
io_uring is an API for performing asynchronous I/O, meant to replace, for
example, the old POSIX API (aio_read(), aio_write(), etc). io_uring can be
used instead of read() and write(), but also for a lot of other I/O
operations, such as fsync or poll, or even for network-related operations
such as the socket sendmsg() and recvmsg(). An application using this
interface will prepare a set of requests (Submit Queue Entries or SQE), add
them to Submission Queue Ring (SQR), and notify the kernel about these
operations. The kernel will eventually pick these entries, execute them and
add completion entries to the Completion Queue Ring (CQR). It's a simple
producer-consumer model, as shown in the diagram below.
io_uring diagram
What's FUSE over io_uring
As mentioned above, the usage of /dev/fuse for communication between the FUSE
server and the kernel is one of the performance bottlenecks when using
user-space file systems. Thus, replacing this mechanism by a block of memory
(ring buffers) shared between the user-space server and the kernel was expected
to result in performance improvements.
The implementation of FUSE over io_uring that was merged into the 6.14 kernel
includes a set of SQR/CQR queues per CPU core and, even if not all the low-level
FUSE operations are available through io_uring1, the performance
improvements are quite visible. Note that, in the future, this design of having
a set of rings per CPU may change and may become customisable. For example, it
may be desirable to have a set of CPUs dedicated to doing I/O on a FUSE file
system, keeping other CPUs for other purposes.
Using FUSE over io_uring
One awesome thing about the way this feature was implemented is that there is no
need to add any specific support to the user-space server implementations: as
long as the FUSE server uses libfuse, all the details are totally transparent
to the server.
In order to use this new feature one simply needs to enable it through a fuse
kernel module parameter, for example by doing:
echo 1 > /sys/module/fuse/parameters/enable_uring
And then, when a new FUSE file system is mounted, io_uring will be used. Note
that the above command needs to be executed before the file system is mounted,
otherwise it will keep using the traditional /dev/fuse device.
Unfortunately, as of today, the libfuse library support for this feature
hasn't been released yet. Thus, it is necessary to compile a version of this
library that is still under review. It can be obtained in the maintainer git
tree, branch uring.
After compiling this branch, it's easy to test io_uring using one of the
passthrough file system examples distributed with the library. For example,
one could use the following set of commands to mount a passthrough file system
that uses io_uring:
The graphs below show the results of running some very basic read() and
write() tests, using a simple setup with the passthrough_hp example file
system. The workload used was the standard I/O generator fio.
The graphs on the left are for read() operations, and the ones on the right
for write() operations; on the top the graphs are for buffered I/O and on
the bottom for direct I/O.
All of them show the I/O bandwidth on the Y axis and the number of jobs
(processes doing I/O) on the X axis. The test system used had 8 CPUs, and the
tests used 1, 2, 4 and 8 jobs. Also, for each operation different block sizes
were used. In these graphs only 4k and 32k block sizes are shown.
Reads
Writes
The graphs clearly show that the io_uring performance is better than when
using the FUSE /dev/fuse device. For the reads, the 4k block size io_uring
tests are even better than the 32k tests for the traditional FUSE device. That
doesn't happen for the writes, but io_uring is still better.
Conclusion
To summarise, today it is already possible to improve the performance of FUSE file
systems simply by explicitly enabling the io_uring communication between the
kernel and the FUSE server. libfuse still needs to be manually compiled, but
this should change very soon, once this library is released with support for
this new feature. And this proves once again that user-space file systems
are not necessarily "toy" file systems developed by "misguided" people.
Good evening, hackfolk. A quick note this evening to record a waypoint
in my efforts to improve Guile’s memory manager.
So, I got Guile running on top of the
Whippet API. This API can be
implemented by a number of concrete garbage collector implementations.
The implementation backed by the Boehm collector is fine, as expected.
The implementation that uses the bump-pointer-allocation-into-holes
strategy
is less good. The minor reason is heap sizing heuristics; I still get
it wrong about when to grow the heap and when not to do so. But the
major reason is that non-moving Immix collectors appear to have
pathological fragmentation characteristics.
Fragmentation, for our purposes, is memory under the control of the GC
which was free after the previous collection, but which the current cycle failed to use for allocation. I have the feeling
that for the non-moving Immix-family collector implementations,
fragmentation is much higher than for size-segregated freelist-based mark-sweep
collectors. For an allocation of, say, 1024 bytes, the collector might
have to scan over many smaller holes until it finds a hole that is big
enough. This wastes free memory. Fragmentation memory is not gone—it
is still available for allocation!—but it won’t be allocatable until
after the current cycle when we visit all holes again. In Immix, fragmentation wastes allocatable memory during a cycle, hastening collection and causing more frequent whole-heap traversals.
The value proposition of Immix is that if there is too much
fragmentation, you can just go into evacuating mode, and probably
improve things. I still buy it. However I don’t think that non-moving
Immix is a winner. I still need to do more science to know for sure. I
need to fix Guile to support the stack-conservative, heap-precise
version of the Immix-family collector which will allow for evacuation.
So that’s where I’m at: a load of gnarly Guile refactors to allow for
precise tracing of the heap. I probably have another couple weeks left until I can run some tests. Fingers crossed; we’ll see!
Last month the Embedded Recipes conference was held in Nice, France.
Igalia was sponsoring the event, and my colleague Martín and I were attending.
In addition we both delivered talks to a highly technical and engaged audience.
My presentation, unlike most other talks, was a high-level overview of how Igalia engineers contribute to SteamOS to shape the future of gaming on Linux, through our contracting work with Valve. Having joined the project recently, this was a challenge (the good kind) to me: it allowed me to gain a much better understanding of what all my colleagues who work on SteamOS do, through conversations I had with them when preparing the presentation.
The talk was well received and the feedback I got was overall very positive, and it was followed up by several interesting conversations.
I was apprehensive about the questions from the audience, as most of the work I presented wasn’t mine, and indeed some of them had to remain unanswered.
Martín delivered a lightning talk on how to implement OTA updates with systemd-sysupdate on Yocto-based distributions.
It was also well received, and followed up by conversations in the Yocto workshop that took place the following day.
I found the selection of presentations overall quite interesting and relevant, and there were plenty of opportunities for networking during lunch, coffee breaks that were splendidly supplied with croissants, fruit juice, cheese and coffee, and a dinner at a beach restaurant.
Many thanks to Kevin and all the folks at BayLibre for a top-notch organization in a relaxed and beautiful setting, to fellow speakers for bringing us these talks, and to everyone I talked to in the hallway track for the enriching conversations.
The Web Engines Hackfest 2025 is kicking off next Monday in A Coruña and among
all the interesting talks and
sessions about
different engines, there are a few that may be interesting to people involved
one way or another with WebKitGTK and WPE:
“Multimedia in
WebKit”, by Philippe
Normand (Tuesday 3rd at 12:00 CEST), will focus on the current status and
future plans for the multimedia stack in WebKit.
All talks will be live streamed and a Jitsi Meet link will be available for
those interested in participating remotely. You can find all the details at
webengineshackfest.org.
The release of GStreamer
1.26, last March, delivered
new features, optimization and improvements.
Igalia played its role as long
standing contributor, with 382 commits (194 merge requests) out of a total of 2666
commits merged in this release. This blog post takes a closer look at those contributions.
GstValidate
is a tool to check if elements are behaving as expected.
Added support for HTTP Testing.
Scenario fixes such as reset pipelines on expected errors to avoid
inconsistent states, improved error logging, and async action handling to
prevent busy loops.
GStreamer Base Plugins is a well-groomed and well-maintained collection of
plugins. It also contains helper libraries and base classes useful for writing
elements.
audiorate: respect tolerance property to avoid unnecessary sample
adjustments for minor gaps.
audioconvert: support reordering of unpositioned input channels.
videoconvertscale: improve aspect ratio handling.
glcolorconvert: added I422_10XX, I422_12XX, Y444_10XX, and Y444_16XX
color formats, and fixed caps negotiation for DMABuf.
glvideomixer: handle mouse events.
pbutils: added VVC/H.266 codec support
encodebasebin: parser fixes.
oggdemux: fixed seek to the end of files.
rtp: fixed precision for UNIX timestamp.
sdp: enhanced debugging messages.
parsebin: improved caps negotiation.
decodebin3: added missing locks to prevent race conditions.
GStreamer Bad Plug-ins is a set of plugins that aren’t up to par compared to the
rest. They might be close to being good quality, but they’re missing something,
be it a good code review, some documentation, a set of tests, etc.
dashsink: a lot of improvements and cleanups, such as unit tests, state
and event management.
h266parse: enabled vvc1 and vvi1 stream formats, improved codec data
parsing and negotiations, along with cleanups and fixes.
mpegtsmux and tsdemux: added support for VVC/H.266 codec.
vulkan:
Added compatibility for timeline semaphores and barriers.
Initial support of multiple GPU and dynamic element registering.
Vulkan image buffer pool improvements.
vulkanh264dec: support interlaced streams.
vulkanencoding: rate control and quality level adjustments, update
SPS/PPS, support layered DPBs.
webrtcbin:
Resolved duplicate payload types in SDP offers with RTX and multiple codecs.
Transceivers are now created earlier during negotiation to avoid linkage
issues.
Allow session level in setup attribute in SDP answer.
wpevideosrc:
code cleanups
cached SHM buffers are cleared after caps renegotiation.
handle latency queries and post progress messages on bus.
srtdec: fixes
jpegparse: handle avi1 tag for progressive images
va: improve encoders configuration when properties change at run-time,
especially rate control.
Earlier this month, Alex presented "Improvements to RISC-V vector code
generation in LLVM" at the RISC-V Summit Europe in Paris. This blog post
summarises that talk.
So RISC-V, vectorisation, the complexities of the LLVM toolchain and just 15
minutes to cover it in front of an audience with varying specialisations. I
was a little worried when first scoping this talk but the thing with compiler
optimisations is that the objective is often pretty clear and easy to
understand, even if the implementation can be challenging. I'm going to be
exploiting that heavily in this talk by trying to focus on the high level
objective and problems encountered.
Where are we today in terms of the implementation of optimisation of RISC-V
vector codegen? I'm oversimplifying the state of affairs here, but the list in
the slide above isn't a bad mental model. Basic enablement is done, it's been
validated to the point it's enabled by default, we've had a round of
additional extension implementation, and a large portion of ongoing work is on
performance analysis and tuning. I don't think I'll be surprising any of you
if I say this is a huge task. We're never going to be "finished" in the sense
that there's always more compiler performance tuning to be done, but there's
certainly phases of catching the more obvious cases and then more of a long
tail.
What is the compiler trying to do here? There are multiple metrics, but
typically we're focused primarily on performance of generated code. This isn't
something we do at all costs -- in a general purpose compiler you can't for
instance spend 10hrs optimising a particular input. So we need a lot of
heuristics that help us arrive at a reasonable answer without exhaustively
testing all possibilities.
The kinds of considerations for the compiler during compilation include:
Profitability. If you're transforming your code then for sure you want the
new version to perform better than the old one! Given the complexity of the
transformations from scalar to vector code and costs incurred by moving
values between scalar and vector registers, it can be harder than you might
think to figure out at the right time whether the vector route vs the scalar
route might be better. You're typically estimating the cost of either choice
before you've gone and actually applied a bunch of additional optimisations
and transformations that might further alter the trade-off.
More specific to RISC-V vectors: RVV has been described before as effectively
having a wider-than-32-bit instruction encoding, with the excess state held in
control and status registers. If you're too naive about it, you risk switching
the vtype CSR more often than necessary, adding unwanted overhead.
Spilling is when we load and store values to the stack. Minimising this is a
standard objective for any target, but the lack of callee-saved vector
registers in the standard ABI poses a challenge; more subtly, the lack of
immediate offsets for some vector instructions can put extra pressure on
scalar register allocation.
Or otherwise just ensuring that we're using the instructions available
whenever we can. One of the questions I had was whether I'm going to be
talking just about autovectorisation, or vector codegen where it's explicit
in the input (e.g. vector datatypes, intrinsics). I'd make the point that
they're not fully independent; in fact all of these considerations are
inter-related. Sometimes the compiler's cost modelling tells it vectorisation
isn't profitable: sometimes that's true, sometimes the model isn't detailed
enough, and sometimes it's only true for the compiler right now because it
could be doing a better job of choosing instructions. If I solve the issue of
suboptimal instruction selection then it benefits both autovectorisation (as
it's more likely to be profitable, or will be more profitable) and potentially
the more explicit path (as explicit uses of vectors benefit from the improved
lowering).
Just one final point of order I'll say once to avoid repeating myself again
and again. I'm giving a summary of improvements made by all LLVM contributors
across many companies, rather than just those by my colleagues at Igalia.
The intuition behind both this improvement and the one on the next slide is
actually exactly the same. Cast your minds back to 2015 or so when Krste was
presenting the vector extension. Some details have changed, but if you look at
the slides (or any RVV summary since then) you see code examples with simple
minimal loops even for irregularly sized vectors or where the length of a
vector isn't fixed at compile time. The headline is that the compiler now
generates output that looks a lot more like that handwritten code that better
exploits the features of RISC-V vector.
For non-power-of-two vectorisation, I'm talking about the case here where you
have a fixed known-at-compile time length. In LLVM this is handled usually by
what we call the SLP or Superword Level Parallelism
vectorizer. It needed
to be taught to handle non-power-of-two sizes like we support in RVV. Other
SIMD ISAs don't have the notion of vl and so generating non-power-of-two
vector types isn't as easy.
The example I show here has pixels with rgb values. Before it would do a very
narrow two-wide vector operation then handle the one remaining item with
scalar code. Now we directly operate on a 3-element vector.
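Roughly the kind of per-pixel helper in question (an illustrative reconstruction, not the exact code from the slide):
#include <cstdint>

struct Pixel {
    std::uint8_t r, g, b;
};

// The three channel updates can now be turned by the SLP vectorizer into a single
// operation on a 3-element vector, instead of a 2-wide vector op plus a scalar leftover.
void brighten(Pixel &p, std::uint8_t amount)
{
    p.r += amount;
    p.g += amount;
    p.b += amount;
}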
We are of course using simple code examples for illustration here. If you want
to brighten an image as efficiently as possible sticking the per-pixel
operation in a separate function like this perhaps isn't how you'd do it!
Often when operating on a loop, you have an input of a certain length and you
process it in chunks of some reasonable size. RISC-V vector gives us a lot
more flexibility about doing this. If our input length isn't an exact multiple
of our vectorization factor ("chunk size"), which is the calculated vector
length used per iteration, we can still process it in RVV using the same
vector code path. Other architectures, as you can see in the old code, have a
vector loop which may then branch to a scalar version to handle any remainder
(tail) elements. With RVV that's no longer necessary: LLVM's loop
vectorizer can
handle these cases properly and we get a single vectorised loop body. This
results in performance improvements on benchmarks like x264 where the scalar
tail is executed frequently, and improves code size even in cases where there
is no direct performance impact.
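The shape of loop in question looks something like this (an illustrative sketch, not the benchmark code):
#include <cstddef>

// With tail folding, LLVM's loop vectorizer can emit a single RVV loop whose vector
// length also covers the final partial iteration, instead of a vector body followed
// by a scalar remainder loop.
void saxpy(float *y, const float *x, float a, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}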
This one is a little bit simpler. It's common for the compiler to synthesise
its own version of memcpy/memset when it sees it can generate a more
specialised version based on information about alignment or size of the
operands. Of course when the vector extension is available the compiler
should be able to use it to implement these operations, and now it can.
This example shows how a small number of instructions expanded inline might be
used to implement memcpy and memcmp. I also note there is a RISC-V vector
specific consideration in favour of inlining operations in this case - as the
standard calling convention doesn't have any callee-saved vector registers,
avoiding the function call may avoid spilling vector registers.
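As a concrete illustration (my own example, not taken from the slides) of the kind of code affected:
#include <cstring>

struct Header {
    char magic[24];
};

// A small fixed-size, known-alignment copy like this is typically lowered to a
// compiler-synthesised memcpy; with the vector extension available it can now be
// expanded inline with vector loads/stores instead of calling into libc.
void copy_header(Header *dst, const Header *src)
{
    std::memcpy(dst, src, sizeof(*dst));
}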
Sometimes of course it's a matter of a new extension letting us do something
we couldn't before. We need to teach the compiler how to select instructions
in that case, and to estimate the cost. Half precision and bf16 floating point
is an interesting example where you introduce a small number of instructions
for the values of that type, but otherwise rely on widening to 32-bit. This is
of course better than falling back to a libcall or scalarising to use Zfh
instructions, but someone needs to put in the work to convince the compiler
of that!
The slide above has a sampling of other improvements. If you'd like to know
more about the VL optimizer, my colleague's presentation at EuroLLVM earlier
this year is now up on YouTube.
Another fun highlight is
llvm-exegesis, this
is a tool for detecting microarchitectural implementation details via probing,
e.g. latency and throughput of different operations that will help you write a
scheduling model. It now supports RVV which is a bit helpful for the one piece
of RVV 1.0 hardware we have readily available, but should be a lot more
helpful once more hardware reaches the market.
So, it's time to show the numbers. Here I'm looking at execution time for SPEC
CPU 2017 benchmarks (run using LLVM's harness) on a SpacemiT X60, compiled
with the options mentioned above. As you can see, 12 out of 16 benchmarks
improved by 5% or more, and 7 out of 16 by 10% or more. These are meaningful
improvements: a bit under 9% geomean improvement comparing Clang as of March
this year against Clang from 18 months prior.
There's more work going in as we speak, such as the optimisation work done by
my colleague Mikhail and written up on the RISE
blog.
Benchmarking done for that work comparing Clang vs GCC showed today's LLVM is
faster than GCC in 11 of the 16 tested SPEC benchmarks, slower in 3, and about
equal for the other two.
Are we done? Goodness no! But we're making great progress. As I say for all of
these presentations, even if you're not directly contributing compiler
engineering resources I really appreciate anyone able to contribute by
reporting any cases where they compile their code of interest and don't get
the optimisation expected. The more you can break it down and produce minimised
examples the better, and it means us compiler engineers can spend more time
writing compiler patches rather than doing workload analysis to figure out the
next priority.
Adding all these new optimisations is great, but we want to make sure the
generated code works and continues to work as these new code generation
features are iterated on. It's been really important to have CI coverage for
some of these new features including when they're behind flags and not enabled
by default. Thank you to RISE for supporting my work here, we have a nice
dashboard providing an easy view of just the RISC-V
builders.
Here's some directions of potential future work or areas we're already
looking. Regarding the default scheduling model, Mikhail's recent work on the
Spacemit X60 scheduling model shows how having at least a basic scheduling
model can have a big impact (partly as various code paths are pessimised in
LLVM if you don't at least have something). Other backends like AArch64 pick a
reasonable in-order core design on the basis that scheduling helps a lot for
such designs, and it's not harmful for more aggressive OoO designs.
To underline again, I've walked through progress made by a whole community of
contributors not just Igalia. That includes at least the companies mentioned
above, but more as well. I really see upstream LLVM as a success story for
cross-company collaboration within the RISC-V ecosystem. For sure it could be
better, there are companies doing a lot with RISC-V who aren't doing much with
the compiler they rely on, but a huge amount has been achieved by a
contributor community that spans many RISC-V vendors. If you're working on the
RISC-V backend downstream and looking to participate in the upstream
community, we run biweekly contributor calls (details are in the RISC-V category
on LLVM's Discourse), which may be a helpful way to get started.
Update on what happened in WebKit in the week from May 19 to May 26.
This week saw updates on the Android version of WPE, the introduction
of a new mechanism to support memory-mappable buffers which can lead
to better performance, a new gamepad API to WPE, and other improvements.
Cross-Port 🐱
Implemented support for the new 'request-close' command for dialog elements.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
Added support for using the GDB JIT API when dynamically generating code in JSC.
Graphics 🖼️
Added support for memory-mappable GPU buffers. This mechanism allows allocating linear textures that can be used from OpenGL and memory-mapped into CPU-accessible memory. This makes it possible to update the pixel data directly, bypassing the usual glCopyTexSubImage2D logic that may introduce implicit synchronization, perform staging copies, etc. (driver-dependent).
WPE WebKit 📟
WPE Platform API 🧩
New, modern platform API that supersedes usage of libwpe and WPE backends.
Landed a patch to add a gamepads API to WPE Platform with an optional default implementation using libmanette.
Linux 6.15 has just been released, bringing a lot of new features:
nova-core, the “base” driver for the new NVIDIA GPU driver, written in Rust. The nova project will eventually replace the Nouveau driver for all GSP-based GPUs.
RISC-V gained support for some extensions: BFloat16 floating-point, Zaamo, Zalrsc and ZBKB.
The fwctl subsystem has been merged. This new family of drivers acts as a transport layer between userspace and complex firmware. To understand more about its controversies and how it got merged, check out this LWN article.
Support for MacBook touch bars, both as a DRM driver and input source.
Support for Adreno 623 GPU.
As always, I suggest having a look at the Kernel Newbies summary. Now, let’s have a look at Igalia’s contributions.
DRM wedged events
In 3D graphics APIs such as Vulkan and OpenGL, there are mechanisms that applications can rely on to check whether the GPU has been reset (you can read more about this in the kernel documentation). However, there was no generic mechanism to inform userspace that a GPU reset has happened. This is useful because in some cases the reset affects not only the app involved in it, but the whole graphics stack, and some action is then needed to recover, like doing a module rebind or even a bus reset to bring the hardware back. For this release, we helped add a userspace event for this, so a daemon or the compositor can listen to it and trigger some recovery measure after the GPU has been reset. Read more in the kernel docs.
DRM scheduler work
In the DRM scheduler area, in preparation for the future scheduling improvements, we worked on cleaning up the code base, better separation of the internal and external interfaces, and adding formal interfaces at places where individual drivers had too much knowledge of the scheduler internals.
General GPU/DRM stack
In the wider GPU stack area we optimised the most frequent dma-fence single fence merge operation to avoid memory allocations and array sorting. This should slightly reduce the CPU utilisation with workloads which use the DRM sync objects heavily, such as the modern composited desktops using Vulkan explicit sync.
Some releases ago, we helped to enable async page flips in the atomic DRM uAPI. So far, this feature was only enabled for the primary plane. In this release, we added a mechanism for the driver to decide which plane can perform async flips. We used this to enable overlay planes to do async flips in AMDGPU driver.
We also fixed a bug in the DRM fdinfo common layer which could cause use after free after driver unbind.
Intel Xe driver improvements
On the Intel GPU specific front we worked on adding better Alderlake-P support to the new Intel Xe driver by identifying and adding missing hardware workarounds, fixed the workaround application in general and also made some other smaller improvements.
sched_ext
When developing and optimizing a sched_ext-based scheduler, it is important to understand the interactions between the BPF scheduler and the in-kernel sched_ext core. If there is a mismatch between what the BPF scheduler developer expects and how the sched_ext core actually works, such a mismatch could often be the source of bugs or performance issues.
To address this problem, we added a mechanism to count and report the internal events of the sched_ext core. This significantly improves the visibility of subtle edge cases, which might otherwise easily slip by. So far, eight events have been added, and they can be monitored through a BPF program, sysfs, and a tracepoint.
A few less bugs
As usual, as part of our work on diverse projects, we keep an eye on automated test results to look for potential security and stability issues in different kernel areas. We’re happy to have contributed to making this release a bit more robust by fixing bugs in memory management, network (SCTP), ext4, suspend/resume and other subsystems.
This is the complete list of Igalia’s contributions for this release:
It was a pleasant surprise how easy it was to switch—from the user’s
point of view, you just pass --with-gc=heap-conservative-parallel-mmc
to Guile’s build (on the wip-whippet branch); when developing I also pass --with-gc-debug, and I
had a couple bugs to fix—but, but, there are still some issues. Today’s
note thinks through the ones related to heap sizing heuristics.
growable heaps
Whippet has three heap sizing
strategies:
fixed, growable, and adaptive
(MemBalancer). The adaptive policy
is the one I would like in the long term; it will grow the heap for processes with a high allocation rate, and shrink when they go
idle. However I won’t really be able to test heap shrinking until I get
precise tracing of heap edges, which will allow me to evacuate sparse
blocks.
So for now, Guile uses the growable policy, which attempts to size the
heap so it is at least as large as the live data size, times some
multiplier. The multiplier currently defaults to 1.75×, but can be set
on the command line via the GUILE_GC_OPTIONS environment variable.
For example to set an initial heap size of 10 megabytes and a 4×
multiplier, you would set
GUILE_GC_OPTIONS=heap-size-multiplier=4,heap-size=10M.
Anyway, I have run into problems! The fundamental issue is
fragmentation. Consider a 10MB growable heap with a 2× multiplier,
consisting of a sequence of 16-byte objects followed by 16-byte holes.
You go to allocate a 32-byte object. This is a small object (8192 bytes
or less), and so it goes in the Nofl space. A Nofl mutator holds on to
a block from the list of sweepable blocks, and will sequentially scan
that block to find holes. However, each hole is only 16 bytes, so we
can’t fit our 32-byte object: we finish with the current block, grab
another one, repeat until no blocks are left and we cause GC. GC runs,
and after collection we have an opportunity to grow the heap: but the
heap size is already twice the live object size, so the heuristics say
we’re all good, no resize needed, leading to the same sweep again,
leading to a livelock.
I actually ran into this case during Guile’s bootstrap, while allocating
a 7072-byte vector. So it’s a thing that needs fixing!
observations
The root of the problem is fragmentation. One way to solve the problem
is to remove fragmentation; using a semi-space collector comprehensively
resolves the issue,
modulo
any block-level fragmentation.
However, let’s say you have to live with fragmentation, for example
because your heap has ambiguous edges that need to be traced conservatively. What can we do?
Raising the heap multiplier is an effective mitigation, as it increases
the average hole size, but for it to be a comprehensive solution in
e.g. the case of 16-byte live objects equally interspersed with holes,
you would need a multiplier of 512× to ensure that the largest 8192-byte
“small” objects will find a hole. I could live with 2× or something,
but 512× is too much.
We could consider changing the heap organization entirely. For example,
most mark-sweep collectors (BDW-GC included) partition the heap into
blocks whose allocations are of the same size, so you might have some
blocks that only hold 16-byte allocations. It is theoretically possible
to run into the same issue, though, if each block only has one live
object, and the necessary multiplier that would “allow” for more empty
blocks to be allocated is of the same order (256× for 4096-byte blocks
each with a single 16-byte allocation, or even 4096× if your blocks are
page-sized and you have 64kB pages).
My conclusion is that practically speaking, if you can’t deal with
fragmentation, then it is impossible to just rely on a heap multiplier
to size your heap. It is certainly an error to live-lock the process,
hoping that some other thread mutates the graph in such a way to free up
a suitable hole. At the same time, if you have configured your heap to
be growable at run-time, it would be bad policy to fail an allocation,
just because you calculated that the heap is big enough already.
It’s a shame, because we lose a mooring on reality: “how big will my
heap get” becomes an unanswerable question because the heap might grow
in response to fragmentation, which is not deterministic if there are
threads around, and so we can’t reliably compare performance between
different configurations. Ah well. If reliability is a goal, I think
one needs to allow for evacuation, one way or another.
for nofl?
In this concrete case, I am still working on a solution. It’s going to
be heuristic, which is a bit of a disappointment, but here we are.
My initial thought has two parts. Firstly, if the heap is growable but
cannot defragment, then we need to reserve some empty blocks after each
collection, even if reserving them would grow the heap beyond the
configured heap size multiplier. In that way we will always be able to
allocate into the Nofl space after a collection, because there will
always be some empty blocks. How many empties? Who knows. Currently
Nofl blocks are 64 kB, and the largest “small object” is 8kB. I’ll
probably try some constant multiplier of the heap size.
The second thought is that searching through the entire heap for a hole
is a silly way for the mutator to spend its time. Immix will reserve a
block for overflow allocation: if a medium-sized allocation (more than
256B and less than 8192B) fails because no hole in the current block is
big enough—note that Immix’s holes have 128B granularity—then the
allocation goes to a dedicated overflow block, which is taken from the
empty block set. This reduces fragmentation (holes which were not used
for allocation because they were too small).
Nofl should probably do the same, but given its finer granularity, it
might be better to sweep over a variable number of blocks, for example
based on the logarithm of the allocation size; one could instead sweep
over clz(min-size)–clz(size) blocks before taking from the empty block list, which would at least bound the
sweeping work of any given allocation.
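As a small sketch of that bound (an illustration of the formula above, not Whippet code):
#include <bit>
#include <cstddef>

// Number of partially-swept blocks to try for an allocation of `size` bytes before
// taking a block from the empty list: clz(min_size) - clz(size), i.e. roughly
// log2(size / min_size), so larger allocations are allowed to sweep more blocks.
std::size_t blocks_to_sweep(std::size_t size, std::size_t min_size = 16)
{
    if (size < min_size)
        size = min_size;
    return static_cast<std::size_t>(std::countl_zero(min_size) - std::countl_zero(size));
}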
fin
Welp, just wanted to get this out of my head. So far, my experience
with this Nofl-based heap configuration is mostly colored by live-locks,
and otherwise its implementation of a growable heap sizing policy seems
to be more tight-fisted regarding memory allocation than BDW-GC’s
implementation. I am optimistic though that I will be able to get
precise tracing sometime soon, as measured in development time; the
problem as always is fragmentation, in that I don’t have a hole in my
calendar at the moment. Until then, sweep on Wayne, cons on Garth,
onwards and upwards!
There’s a layout type that web designers have been using for a long time now, and yet can’t be easily done with CSS: “masonry” layout, sometimes called “you know, like Pinterest does it” layout. Masonry sits sort of halfway between flexbox and grid layout, which is a big part of why it’s been so hard to formalize. There are those who think of it as an extension of flexbox, and others who think it’s an extension of grid, and both schools of thought have pretty solid cases.
But then, maybe you don’t actually need to explore the two sides of the debate, because there’s a new proposal in town. It’s currently being called Item Flow (which I can’t stop hearing sung by Eddie Vedder, please send help) and is explained in some detail in a blog post from the WebKit team. The short summary is that it takes the flow and packing capabilities from flex and grid and puts them into their own set of properties, along with some new capabilities.
As an example, here’s a thing you can currently do with flexbox:
Now you might be thinking, okay, this just renames some flex properties to talk about items instead and you also get a shorthand property; big deal. It actually is a big deal, though, because these item-* properties would apply in grid settings as well. In other words, you would be able to say:
display: grid;
item-flow: wrap column;
Hold up. Item wrapping… in grid?!? Isn’t that just the same as what grid already does? Which is an excellent question, and not one that’s actually settled.
However, let’s invert the wrapping in grid contexts to consider an example given in the WebKit article linked earlier, which is that you could specify a single row of grid items that equally divide up the row’s width to size themselves, like so:
In that case, a row of five items would size each item to be one-fifth the width of the row, whereas a row of three items would have each item be one-third the row’s width. That’s a new thing, and quite interesting to ponder.
The proposal includes the properties item-pack and item-slack, the latter of which makes me grin a little like J.R. “Bob” Dobbs but the former of which I find a lot more interesting. Consider:
This would act with flex items much the way text-wrap: balance acts with words. If you have six flex items of roughly equal size, they’ll balance between two rows to three-and-three rather than five-and-one. Even if your flex items are of very different sizes, item-pack: balance would always automatically do its best to get the row lengths as close to equal as possible, whether that’s two rows, three rows, four rows, or however many rows. Or columns! This works just as well either way.
There are still debates to be had and details to be worked out, but this new direction does feel fairly promising to me. It covers all of the current behaviors that flex and grid flowing already permit, plus it solves some longstanding gripes about each layout approach while also opening some new doors.
The prime example of a new door is the aforementioned masonry layout. In fact, the previous code example is essentially a true masonry layout (because it resembles the way irregular bricks are laid in a wall). If we wanted that same behavior, only vertically like Pinterest does it, we could try:
display: flex;
item-direction: column; /* could also be `flex-direction` */
item-wrap: wrap; /* could also be `flex-wrap` */
item-pack: balance;
That would be harder to manage, though, since for most writing modes on the web, the width is constrained and the height is not. In other words, to make that work with flexbox, we’d have to set an explicit height. We also wouldn’t be able to nail down the number of columns. Furthermore, that would cause the source order to flow down columns and then jump back to the top of the next column. So, instead, maybe we’d be able to say:
If I’ve read the WebKit article correctly, that would allow Pinterest-style layout with the items actually going across the columns in terms of source order, but being laid out in packed columns (sometimes called “waterfall” layout, which is to say, “masonry” but rotated 90 degrees).
That said, it’s possible I’m wrong in some of the particulars here, and even if I’m not, the proposal is still very much in flux. Even the property names could change, so values and behaviors are definitely up for debate.
As I pondered that last example, the waterfall/Pinterest layout, I thought: isn’t this visual result essentially what multicolumn layout does? Not in terms of source order, since multicolumn elements run down one column before starting again at the top of the next. But that seems an easy enough thing to recreate like so:
That’s a balanced set of three equally wide columns, just like in multicol. I can use gap for the column gaps, so that’s handled. I wouldn’t be able to set up column rules — at least, not right now, though that may be coming thanks to the Edge team’s gap decorations proposal. But what I would be able to do, that I can’t now, is vary the width of my multiple columns. Thus:
Is that useful? I dunno! It’s certainly not a thing we can do in CSS now, though, and if there’s one thing I’ve learned in the past almost three decades, it’s that a lot of great new ideas come out of adding new layout capabilities.
So, if you’ve made it this far, thanks for reading and I strongly encourage you to go read the WebKit team’s post if you haven’t already (it has more detail and a lovely summary matrix near the end) and think about what this could do for you, or what it looks like it might fall short of making possible for you.
As I’ve said, this feels promising to me, as it enables what we thought was a third layout mode (masonry/waterfall) by enriching and extending the layout modes we already have (flex/grid). It also feels like this could eventually lead to a Grand Unified Layout Platform — a GULP, if you will — where we don’t even have to say whether a given layout’s display is flex or grid, but instead specify the exact behaviors we want using various item-* properties to get just the right ratio of flexible and grid-like qualities for a given situation.
…or, maybe, it’s already there. It almost feels like it is, but I haven’t thought about it in enough detail yet to know if there are things it’s missing, and if so, what those might be. All I can say is, my Web-Sense is tingling, so I’m definitely going to be digging more at this to see what might turn up. I’d love to hear from all y’all in the comments about what you think!
In April, many colleagues from Igalia participated in a TC39 meeting organized remotely to discuss proposed features for the JavaScript standard alongside delegates from various other organizations.
Let's delve together into some of the most exciting updates!
In 2020, the Intl.NumberFormat Unified API proposal added a plethora of new features to Intl.NumberFormat, including compact and other non-standard notations. It was planned that Intl.PluralRules would be updated to work with the notation option to make the two complement each other. This normative change achieved this by adding a notation option to the PluralRules constructor.
Given the very small size of this Intl change, it didn't go through the staging process for proposals and was instead directly approved to be merged into the ECMA-402 specification.
Our colleague Philip Chimento presented a regular status update on Temporal, the upcoming proposal for better date and time support in JS.
Firefox is at ~100% conformance with just a handful of open questions. The next most conformant implementation, in the Ladybird browser, dropped from 97% to 96% since February — not because they broke anything, but just because we added more tests for tricky cases in the meantime. GraalJS at 91% and Boa at 85% have been catching up.
Completing the Firefox implementation has raised a few interoperability questions which we plan to solve with the Intl Era and Month Code proposal soon.
Dan Minor of Mozilla reported on a tricky case with the proposed using keyword for certain resources. The feature is essentially completely implemented in SpiderMonkey, but Dan highlighted an ambiguity about using the new keyword in switch statements. The committee agreed on a resolution of the issue suggested by Dan, including those implementers who have already shipped this stage 3 feature.
The JavaScript iterator and async iterator protocols power all modern iteration methods in the language, from for of and for await of to the rest and spread operators, to the modern iterator helpers proposals...
One less-well-known part of these protocols, however, is the optional .throw() and .return() methods, which can be used to influence the iteration itself. In particular, .return() indicates to the iterator that the iteration is finished, so it can perform any cleanup actions. For example, this is called in for of/for await of when the iteration stops early (due to a break, for example).
When using for await of with a sync iterator/iterable, such as an array of promises, each value coming from the sync iterator is awaited. However, a bug was found recently where if one of those promises coming from the sync iterator rejects, the iteration would stop, but the original sync iterator's .return() method would never be called. (Note that in for of with sync iterators, .return() is always called after .next() throws).
In the January TC39 plenary we decided to make it so that such a rejection would close the original sync iterator. In this plenary, we decided that since Array.fromAsync (which is currently stage 3) uses the same underlying spec machinery for this, it also would affect that API.
The Immutable ArrayBuffer proposal allows creating ArrayBuffers in JS from read-only data, and in some cases allows zero-copy optimizations. After advancing to stage 2.7 last time, there is work underway to write conformance tests. The committee considered advancing the proposal to stage 3 conditionally on the tests being reviewed, but decided to defer that to the next meeting.
Champions: Mark S. Miller, Peter Hoddie, Richard Gibson, Jack-Works
The notion of "upserting" a value into an object for a key is a great match for a common use case: is it possible to set a value for a property on an object, but, if the object already has that property, update the value in some way? To use CRUD terminology, it's a fusion of inserting and updating. This proposal is proceeding nicely; it just recently achieved stage 2, and achieved stage 2.7 at this plenary, since it has landed a number of test262 tests. This proposal is being worked on by Dan Minor with assistance from a number of students at the University of Bergen, illustrating a nice industry-academia collaboration.
JavaScript objects can be made non-extensible using Object.preventExtensions: the value of the properties of a non-extensible object can be changed, but you cannot add new properties to it.
"use strict";
let myObj = { x: 2, y: 3 };
Object.preventExtensions(myObj);
myObj.x = 5; // ok
myObj.z = 4; // error!
However, this only applies to public properties: you can still install new private fields on the object thanks to the "return it from super() trick".
class AddPrivateField extends function (x) { return x } {
  #foo = 2;
  static hasFoo(obj) { return #foo in obj; }
}

let myObj = { x: 2, y: 3 };
Object.preventExtensions(myObj);
AddPrivateField.hasFoo(myObj); // false
new AddPrivateField(myObj);
AddPrivateField.hasFoo(myObj); // true
This new proposal, which went all the way to Stage 2.7 in a single meeting, attempts to make new AddPrivateField(myObj) throw when myObj is non-extensible.
The V8 team is currently investigating the web compatibility of this change.
Champions: Mark Miller, Shu-yu Guo, Chip Morningstar, Erik Marks
Records and Tuples was a proposal to support composite primitive types, similar to object and arrays, but that would be deeply immutable and with recursive equality. They also had similar syntax as objects and arrays, but prefixed by #:
The proposal reached stage 2 years ago, but then got stuck due to significant performance concerns from browsers:
changing the way === works would risk making every existing === usage a little bit slower
JavaScript developers were expecting === on these values to be fast, but in reality it would have required either a full traversal of the two records/tuples or complex interning mechanisms
Ashley Claymore, working at Bloomberg, presented a new simpler proposal that would solve one of the use cases of Records and Tuples: having Maps and Sets whose keys are composed of multiple values. The proposal introduces composites: some objects that Map and Set would handle specially for that purpose.
const myMap = new Map();
myMap.set(["foo", "bar"], 3);
myMap.has(["foo", "bar"]); // false, it's a different array with just the same contents
AsyncContext is a proposal that allows storing state which is local to an async flow of control (roughly the async equivalent of thread-local storage in other languages), which was impossible in browsers until now. We had previously opened a Mozilla standards position issue about AsyncContext, and it came back negative. One of the main issues they had is that AsyncContext has a niche use case: this feature would be mostly used by third-party libraries, especially for telemetry and instrumentation, rather than by most developers. And Mozilla reasoned that making those authors' lives slightly easier was not worth the additional complexity to the web platform.
However, we should have put more focus on the facts that AsyncContext would enable libraries to improve the UX for their users, and that AsyncContext is also incredibly useful in many front-end frameworks. Not having access to AsyncContext leads to confusing and hard-to-debug behavior in some frameworks, and forces other frameworks to transpile all user code. We interviewed the maintainers for a number of frameworks to see their use cases, which you can read here.
Mozilla was also worried about the potential for memory leaks, since in a previous version of this proposal, calling .addEventListener would store the current context (that is, a copy of the value for every single AsyncContext.Variable), which would only be released in the corresponding .removeEventListener call -- which almost never happens. As a response we changed our model so that .addEventListener would not store the context. (You can read more about the memory aspects of the proposal here.)
A related concern is developer complexity, because in a previous model some APIs and events used the "registration context" (for events, the context in which .addEventListener is called) while others used the "dispatch context" (for events, the context that directly caused the event). We explained that in our newer model, we always use the dispatch context, and that this model would match the context you'd get if the API was internally implemented in JS using promises -- but that for most APIs other than events, those two contexts are the same. (You can read more about the web integration of AsyncContext here.)
After the presentation, Mozilla still had concerns about how the web integration might end up being a large amount of work to implement, and it might still not be worth it, even when the use cases were clarified. They pointed out that the frameworks do have use cases for the core of the proposal, but that they don't seem to need the web integration.
In post-Temporal JavaScript, non-Gregorian calendars can be used beyond just internationalization, with a much higher level of detail. Some of this work is relatively uncharted and therefore needs standardization. One of these small but highly significant details is the string IDs for eras and months in various calendars. This stage 2 update brought the committee up to speed on some of the design directions of the effort and justified the rationale behind certain tradeoffs, including favoring human-readable era codes and removing the requirement that they be globally unique, as well as some of the challenges we have faced with standardizing and programmatically implementing Hijri calendars.
Originally created as part of the import defer proposal, deferred re-exports allow, well... deferring re-export declarations.
The goal of the proposal is to reduce the cost of unused export ... from statements, as well as providing a minimum basis for tree-shaking behavior that everybody must implement and can be relied upon.
Now, when users do import { add } from "./my-library.js", my-library/sets.js will not be loaded and executed: the decision whether it should actually be imported or not has been deferred to my-library's user, who decided to only import what was necessary for the add function.
In the AsyncContext proposal, you can't set the value of an AsyncContext.Variable. Instead, you have the .run method, which takes a callback, runs it with the updated state, and restores the previous value before returning. This offers strong encapsulation, making sure that no mutations can be leaked out of the scope. However, this also adds inflexibility in some cases, such as when refactoring a scope inside a function.
The disposable AsyncContext.Variable proposal extends the AsyncContext proposal by adding a way to set a variable without entering a new function scope, which builds on top of the explicit resource management proposal and its using keyword:
const asyncVar = new AsyncContext.Variable();

function* gen() {
  // This code with `.run` would need heavy refactoring,
  // since you can't yield from an inner function scope.
  using _ = asyncVar.withValue(createSpan());
  yield computeResult();
  yield computeResult2();
  // The scope of `_` ends here, so `asyncVar` is restored
  // to its previous value.
}
One issue with this is that if the return value of .withValue is not used with a using declaration, the context will never be reset at the end of the scope; so when the current function returns, its caller will see an unexpected context (the context inside the function would leak to the outside). The strict enforcement of using proposal (currently stage 1) would prevent this from happening accidentally, but deliberately leaking the context would still be possible by calling Symbol.enter but not Symbol.dispose. (Note that context leaks are not memory leaks.)
The champions of this proposal explored how to deal with context leaks, and whether it's worth it, since preventing them would require changing the internal using machinery and would make composition of disposables non-intuitive. These leaks are not "unsafe" since you can only observe them with access to the same AsyncContext.Variable, but they are unexpected and hard to debug, and the champions do not know of any genuine use case for them.
The committee resolved on advancing this proposal to stage 1, indicating that it is worth spending time on, but the exact semantics and behaviors still need to be decided.
We presented the results of recent discussions in the overlap between the measure and decimal proposals having to do with what we call an Amount: a container for a number (a Decimal, a Number, a BigInt, a digit string) together with precision. The goal is to be able to represent a number that knows how precise it is. The presentation focused on how the notion of an Amount can solve the internationalization needs of the decimal proposal while, at the same time, serving as a building block on which the measure proposal can build by slotting in a unit (or currency). The committee was not quite convinced by this suggestion, but neither did they reject the idea. We have an active biweekly champions call dedicated to the topic of JS numerics, where we will iterate on these ideas and, in all likelihood, present them again to committee at the next TC39 plenary in May at Igalia headquarters in A Coruña. Stay tuned!
Champions: Jesse Alama, Jirka Maršík, Andrew Paprocki
String encoding in programming languages has come a long way since the Olden Times, when anything not 7-bit ASCII was implementation-defined. Now we have Unicode. 32 bits per character is a lot though, so there are various ways to encode Unicode strings that use less space. Common ones include UTF-8 and UTF-16.
You can tell that JavaScript encodes strings as UTF-16 by the fact that string indexing s[0] returns the first 2-byte code unit. Iterators, on the other hand, iterate through Unicode characters ("code points"). Explained in terms of pizza:
> '🍕'[0]           // code unit indexing
'\ud83c'
> '🍕'.length       // length in 2-byte code units
2
> [...'🍕'][0]      // code point indexing (by using iteration)
'🍕'
> [...'🍕'].length  // length in code points
1
It's currently possible to compare JavaScript strings by code units (the < and > operators and the array sort() method) but there's no facility to compare strings by code points. It requires writing complicated code yourself. This is unfortunate for interoperability with non-JS software such as databases, where comparisons are almost always by code point. Additionally, the problem is unique to UTF-16 encoding: with UTF-8 it doesn't matter if you compare by unit or point, because the results are the same.
This is a completely new proposal and the committee decided to move it to stage 1. There's no proposed API yet, just a consensus to explore the problem space.
Champions: Mathieu Hofman, Mark S. Miller, Christopher Hiller
This proposal discusses a taxonomy of possible errors that can occur when a JavaScript host runs out of memory (OOM) or space (OOS). It generated much discussion about how much can be reasonably expected of a JS host, especially when under such pressure. This question is particularly important for JS engines that are, by design, working with rather limited memory and space, such as embedded devices. There was no request for stage advancement, so the proposal stays at stage 1. A wide variety of options and ways in which to specify JS engine behavior under these extreme conditions were presented, so we can expect the proposal champions to iterate on the feedback they received and come back to plenary with a more refined proposal.
Champions: Mark S. Miller, Peter Hoddie, Zbyszek Tenerowicz, Christopher Hiller
Enums have been a staple of TypeScript for a long time, providing a type that represents a finite domain of named constant values. The reason to propose enums in JavaScript after all this time is that some modes of compilation, such as the "type stripping" mode used by default in Node.js, can't support enums unless they're also part of JS.
enum Numbers {
  zero = 0,
  one = 1,
  two = 2,
  alsoTwo = two,          // self-reference
  twoAgain = Numbers.two, // also self-reference
}

console.log(Numbers.zero); // 0
One notable difference with TS is that all members of the enum must have a provided initializer, since automatic numbering can easily cause accidental breaking changes. Having auto-initializers seems to be highly desirable, though, so some ways to extend the syntax to allow them are being considered.
Update on what happened in WebKit in the week from May 12 to May 19.
This week focused on infrastructure improvements, new releases that
include security fixes, and featured external projects that use the
GTK and WPE ports.
Cross-Port 🐱
Multimedia 🎥
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
Fixed
a reference cycle in the mediastreamsrc element, which prevented its disposal.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
Added an internal class that will be
used to represent Temporal Duration objects in a way that allows for more
precise calculations. This is not a user-visible change, but will enable future
PRs to advance Temporal support in JSC towards completion.
WPE WebKit 📟
WPE Platform API 🧩
New, modern platform API that supersedes usage of libwpe and WPE backends.
Added
an initial demo application to the GTK4 WPEPlatform implementation.
Releases 📦️
WebKitGTK
2.48.2 and
WPE WebKit 2.48.2 have
been released. These are paired with a security advisory (WSA-2025-0004:
GTK,
WPE), and therefore it is
advised to update.
On top of security fixes, these releases also include correctness fixes, and
support for CSS Overscroll
Behaviour
is now enabled by default.
Community & Events 🤝
GNOME Web has
gained a
preferences page that allows toggling WebKit features at run-time. Tech Preview
builds of the browser will show the settings page by default, while in regular
releases it is hidden and may be enabled with the following command:
gsettings set org.gnome.Epiphany.ui webkit-features-page true
This should allow frontend developers to test upcoming features more easily. Note that the settings for WebKit features are not persistent, and they will be reset to their default state on every launch.
Infrastructure 🏗️
Landed an improvement to error
reporting in the script within WebKit that runs test262 JavaScript tests.
The WebKit Test Runner (WKTR) will no longer
crash if invalid UTF-8 sequences
are written to the standard error stream (e.g. from 3rd party libraries'
debugging options).
Experimentation is ongoing to un-inline String::find(), which saves ~50 KiB
of binary size worth of repeated implementations of the SIMD “find character
in UTF-16” and “find character in UTF-32” algorithms. Notably, the
algorithm for “find character in ASCII string” was not even part of the
inlining.
Added the LLVM
repository to the
WebKit container SDK. Now it is possible to easily install Clang 20.x with
wkdev-setup-default-clang --version=20.
Figured out that a performance bug related to jump threading optimization in
Clang 18 resulted in a bottleneck adding up to five minutes of build time in
the container SDK. This may be fixed by updating to Clang 20.x.
This week, I reviewed the last available version of the Linux KMS Color
API.
Specifically, I explored the proposed API by Harry Wentland and Alex Hung
(AMD), their implementation for the AMD display driver and tracked the parallel
efforts of Uma Shankar and Chaitanya Kumar Borah
(Intel)
in bringing this plane color management to life. With this API in place,
compositors will be able to provide better HDR support and advanced color
management for Linux users.
To get a hands-on feel for the API’s potential, I developed a fork of
drm_info compatible with the new color properties. This allowed me to
visualize the display hardware color management capabilities being exposed. If
you’re curious and want to peek behind the curtain, you can find my exploratory
work on the
drm_info/kms_color branch.
The README there will guide you through the simple compilation and installation
process.
Note: You will need to update libdrm to match the proposed API. You can find
an updated version in my personal repository
here. To avoid
potential conflicts with your official libdrm installation, you can compile
and install it in a local directory. Then, use the following command: export
LD_LIBRARY_PATH="/usr/local/lib/"
In this post, I invite you to familiarize yourself with the new API that is
about to be released. You can start doing as I did below: just deploy a custom
kernel with the necessary patches and visualize the interface with the help of
drm_info. Or, better yet, if you are a userspace developer, you can start
developing user cases by experimenting with it.
The more eyes the better.
KMS Color API on AMD
The great news is that AMD’s driver implementation for plane color operations
is being developed right alongside their Linux KMS Color API proposal, so it’s
easy to apply to your kernel branch and check it out. You can find details of
their progress in
the AMD’s series.
I just needed to compile a custom kernel with this series applied,
intentionally leaving out the AMD_PRIVATE_COLOR flag. The
AMD_PRIVATE_COLOR flag guards driver-specific color plane properties, which
experimentally expose hardware capabilities while we don’t have the generic KMS
plane color management interface available.
If you don’t know or don’t remember the details of AMD driver specific color
properties, you can learn more about this work in my blog posts
[1][2][3].
As driver-specific color properties and KMS colorops are redundant, the driver
only advertises one of them, as you can see in
AMD workaround patch 24.
So, with the custom kernel image ready, I installed it on a system powered by
AMD DCN3 hardware (i.e. my Steam Deck). Using
my custom drm_info,
I could clearly see the Plane Color Pipeline with eight color operations as
below:
Note that Gamescope is currently using
AMD driver-specific color properties
implemented by me, Autumn Ashton and Harry Wentland. It doesn’t use this KMS
Color API, and therefore COLOR_PIPELINE is set to Bypass. Once the API is
accepted upstream, all users of the driver-specific API (including Gamescope)
should switch to the KMS generic API, as this will be the official plane color
management interface of the Linux kernel.
KMS Color API on Intel
On the Intel side, the driver implementation available upstream was built upon
an earlier iteration of the API. This meant I had to apply a few tweaks to
bring it in line with the latest specifications. You can explore their latest
work
here.
For a more simplified handling, combining the V9 of the Linux Color API,
Intel’s contributions, and my necessary adjustments, check out
my dedicated branch.
I then compiled a kernel from this integrated branch and deployed it on a
system featuring Intel TigerLake GT2 graphics. Running
my custom drm_info
revealed a Plane Color Pipeline with three color operations as follows:
Observe that Intel’s approach introduces additional properties like “HW_CAPS”
at the color operation level, along with two new color operation types: 1D LUT
with Multiple Segments and 3x3 Matrix. It’s important to remember that this
implementation is based on an earlier stage of the KMS Color API and is
awaiting review.
A Shout-Out to Those Who Made This Happen
I’m impressed by the solid implementation and clear direction of the V9 of the
KMS Color API. It aligns with the many insightful discussions we’ve had over
the past years. A huge thank you to Harry Wentland and Alex Hung for their
dedication in bringing this to fruition!
Beyond their efforts, I deeply appreciate Uma and Chaitanya’s commitment to
updating Intel’s driver implementation to align with the freshest version of
the KMS Color API. The collaborative spirit of the AMD and Intel developers in
sharing their color pipeline work upstream is invaluable. We’re now gaining a
much clearer picture of the color capabilities embedded in modern display
hardware, all thanks to their hard work, comprehensive documentation, and
engaging discussions.
Finally, thanks to all the userspace developers, color science experts, and kernel
developers from various vendors who actively participate in the upstream
discussions, meetings, workshops, each iteration of this API, and the crucial
code review process. I’m happy to be part of the final stages of this long
kernel journey, but I know that when it comes to colors, one step is completed
only for new challenges to be unlocked.
Looking forward to meeting you at this year’s Linux Display Next hackfest,
organized by AMD in Toronto, to further discuss HDR, advanced color management,
and other display trends.
In this tutorial I’m using a Raspberry Pi 5 with a
Camera Module 3. Be careful to use the
right cable as the default white cable shipped with the camera is
for older models of the Raspberry Pi.
In order not to have to switch keyboard, mouse, screen, or any cables between the device and the development machine, the
idea is to do the whole development remotely. Obviously, you can also follow the whole tutorial by developing directly
on the Raspberry Pi itself as, once configured, local or remote development is totally transparent.
In my own configuration I only have the Raspberry Pi connected to its power cable and to my local Wifi network.
I’m using Visual Studio Code with the
Remote-SSH extension on the
development machine. In reality the device may be located anywhere in the world as Visual Studio Code uses an SSH
tunnel to manage the remote connection in a secure way.
Basically, once Raspberry Pi OS is installed and the device is connected to the
network, you can install the needed development tools (clang or gcc, git, meson, ninja, etc.) and that’s all.
Everything else is done from the development machine where you will install Visual Studio Code and the Remote-SSH
extension. The first time the IDE connects to the device through SSH, it will automatically install the required
tools. The detailed installation process is described here.
Once the IDE is connected to the device you can choose which extensions to install locally on the device (like the
C/C++ or
Meson extensions).
Some useful tricks:
Append your public SSH key content (located by default in ~/.ssh/id_rsa.pub) to the device's
~/.ssh/authorized_keys file. It will allow you to connect to the device through ssh without having to enter a
password each time.
Configure your ssh client (in the ~/.ssh/config file) to forward the ssh agent. It will allow you to securely use
your local ssh keys to access remote git repositories from the remote device. A typical configuration block would be
something like:
Host berry [the friendly name that will appear in Visual Studio Code]
HostName berry.local [the device hostname or IP address]
User cam [the username used to access the device with ssh]
ForwardAgent yes
With those simple tricks, just executing ssh berry is enough to connect to the device without any password and then
you can access any git repository locally just like if you were on the development machine itself.
You should also change the build directory name in the Meson extension configuration in Visual Studio Code,
replacing the default builddir with just build, because if you are not using
IntelliSense but another extension like
clangd, it will not find
the compile_commands.json file automatically. To update it directly, add this entry to the
~/.config/Code/User/settings.json file:
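With the current vscode-meson extension, the relevant setting should be mesonbuild.buildFolder (the exact key is an assumption on my part and may differ between extension versions):
"mesonbuild.buildFolder": "build"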
And the basic main.cpp file with the libcamera initialization code:
#include <libcamera/libcamera.h>

using namespace libcamera;

int main()
{
    // Initialize the camera manager.
    auto camManager = std::make_unique<CameraManager>();
    camManager->start();

    return 0;
}
You can configure and build the project by calling:
meson setup build
ninja -C build
or by using the tools integrated into Visual Studio Code through the Meson extension.
In order to debug the executable inside the IDE, add a .vscode/launch.json file with this content:
{"version":"0.2.0","configurations":[{"name":"Debug","type":"cppdbg","request":"launch","program":"${workspaceFolder}/build/cam-and-berry","cwd":"${workspaceFolder}","stopAtEntry":false,"externalConsole":false,"MIMode":"gdb","preLaunchTask":"Meson: Build all targets"}]}
Now, just pressing F5 will build the project and start the debug session on the device while being driven remotely from
the development machine.
If everything has worked well so far, you should see the libcamera logs on stderr, something like:
[5:10:53.005657356][4366] ERROR IPAModule ipa_module.cpp:171 Symbol ipaModuleInfo not found
[5:10:53.005916466][4366] ERROR IPAModule ipa_module.cpp:291 v4l2-compat.so: IPA module has no valid info
[5:10:53.005942225][4366] INFO Camera camera_manager.cpp:327 libcamera v0.4.0+53-29156679
[5:10:53.013988595][4371] INFO RPI pisp.cpp:720 libpisp version v1.1.0 e7974a156008 27-01-2025 (21:50:51)
[5:10:53.035006731][4371] INFO RPI pisp.cpp:1179 Registered camera /base/axi/pcie@120000/rp1/i2c@88000/imx708@1a to CFE device /dev/media0 and ISP device /dev/media1 using PiSP variant BCM2712_D0
You can disable those logs by adding this line at the beginning of the main function:
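One possibility, assuming the logging helpers from <libcamera/logging.h>, is to switch the logging target off:

// Disable libcamera log output entirely (assumes <libcamera/logging.h>).
logSetTarget(LoggingTargetNone);

Alternatively, the LIBCAMERA_LOG_LEVELS environment variable can be used to raise the log threshold without touching the code.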
While running (after start() has been called), the
libcamera::CameraManager initializes and then
keeps up to date a vector of libcamera::Camera
instances, updating it each time a physical camera is connected to or removed from the system. In our case we can consider that the
Camera Module 3 will always be present, as it is connected to the Raspberry Pi’s internal connector.
We can list the available cameras at any moment by calling:
...

int main()
{
    ...

    // List cameras
    for (const auto &camera : camManager->cameras()) {
        std::cout << "Camera found: " << camera->id() << std::endl;
    }

    return 0;
}
This should give an output like:
Camera found: /base/axi/pcie@120000/rp1/i2c@88000/imx708@1a
Each retrieved camera has a list of specific properties and controls (which can be different for every model of
camera). This information can be listed using the camera properties() and controls() getters.
The idMap() getter in the libcamera::ControlList
class returns a map associating each property ID to a property description defined in a
libcamera::ControlId instance, which allows retrieving the property name and its general characteristics.
Using this information we can now have a complete description of the camera properties, available controls and their
possible values:
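As a rough sketch (the exact code lives in the repository, the iteration pattern below is an approximation), dumping this information for each camera returned by camManager->cameras() could look like:

// List the camera properties and the available controls with their accepted
// values. A ControlList iterates as (numeric id, value) pairs and a
// ControlInfoMap as (ControlId pointer, ControlInfo) pairs.
const auto &cameraProperties = camera->properties();
const auto *propertyIds = cameraProperties.idMap();
for (const auto &[id, value] : cameraProperties) {
    std::cout << "Property: " << propertyIds->at(id)->name()
              << " = " << value.toString() << std::endl;
}

for (const auto &[controlId, controlInfo] : camera->controls()) {
    std::cout << "Control: " << controlId->name()
              << " (possible values: " << controlInfo.toString() << ")" << std::endl;
}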
We are now going to see how we can extract frames from the camera. The camera does not produce frames by itself; the
extraction process works on demand: you first need to send a request to the camera to ask for a new frame.
The libcamera library provides a queue to process all those requests. So, basically, you need to create some requests
and push them to this queue. When the camera is ready to take an image, it will pop out the next request from the queue
and fill its associated buffer with the image content. Once the image is ready, the camera sends a signal to the
application to inform that the request has been completed.
If you want to take a simple photo you only need to send one request, but if you want to display or stream some live
video you will need to recycle and re-queue the requests once the corresponding frame has been processed. In the
following code this is what we are going to do as it will be easy to adapt the code to only take one photo.
flowchart TB
A(Acquire camera) --> B(Choose configuration)
B --> C(Allocate buffers and requests)
C --> D(Start camera)
D --> E
subgraph L [Frames extraction loop]
E(Push request) -->|Frame produced| F(("Request completed
callback"))
F --> G(Process frame)
G --> E
end
L --> H(Stop camera)
H --> I(Free buffers and requests)
I --> J(Release camera)
In all cases, there are some steps to follow before sending requests to the camera.
Let’s consider that we have a camera available and that we selected it during the earlier camera listing. Our selected
camera is called selectedCamera and it’s a std::shared_ptr<Camera>.
We just have to call selectedCamera->acquire(); to get exclusive access to this camera. When we have finished
with it, we can release it by calling selectedCamera->release();.
Once the camera is acquired for exclusive access, we need to configure it. In particular, we need to choose the frame
resolution and pixel format. This is done by creating a camera configuration that will be tweaked, validated and
applied to the camera.
// Lock the selected camera and choose a configuration for video display.
selectedCamera->acquire();

auto camConfig = selectedCamera->generateConfiguration({StreamRole::Viewfinder});
if (camConfig->empty()) {
    std::cerr << "No suitable configuration found for the selected camera" << std::endl;
    return -2;
}
The libcamera::StreamRole
allows pre-configuring the returned stream configurations depending on the intended usage: taking photos (in raw mode
or not), capturing video for streaming or recording (which may provide encoded streams if the camera is able to do
it), or capturing video for local display.
generateConfiguration() returns the default camera configuration for each requested stream role.
The returned default configuration may be tweaked with user values. Once modified, the configuration must be validated.
The camera may refuse those changes or adjust them to fit the device limits. Once validated, the configuration is
applied to the selected camera.
auto &streamConfig = camConfig->at(0);
std::cout << "Default camera configuration is: " << streamConfig.toString() << std::endl;

streamConfig.size.width = 1920;
streamConfig.size.height = 1080;
streamConfig.pixelFormat = formats::RGB888;

if (camConfig->validate() == CameraConfiguration::Invalid) {
    std::cerr << "Invalid camera configuration" << std::endl;
    return -3;
}

std::cout << "Targeted camera configuration is: " << streamConfig.toString() << std::endl;

if (selectedCamera->configure(camConfig.get()) != 0) {
    std::cerr << "Failed to update the camera configuration" << std::endl;
    return -4;
}

std::cout << "Camera configured successfully" << std::endl;
Allocate the buffers and requests for frames extraction
The memory for the frame buffers and requests is owned by the user. The frame content itself is allocated
through DMA buffers, for which the
libcamera::FrameBuffer instance holds the
file descriptors.
The frame buffers are allocated through a
libcamera::FrameBufferAllocator instance.
When this instance is deleted, all buffers in the internal pool are also deleted, including the associated DMA buffers.
So, the lifetime of the FrameBufferAllocator instance must be longer than the lifetime of all the requests associated
with buffers from its internal pool.
The same FrameBufferAllocator instance is used to allocate buffer pools for the different streams of the same
camera. In our case we are only using a single stream and so we will do the allocation only for this stream.
// Allocate the buffers pool used to fetch frames from the camera.
Stream *stream = streamConfig.stream();
auto frameAllocator = std::make_unique<FrameBufferAllocator>(selectedCamera);
if (frameAllocator->allocate(stream) < 0) {
    std::cerr << "Failed to allocate buffers for the selected camera stream" << std::endl;
    return -5;
}

auto &buffersPool = frameAllocator->buffers(stream);
std::cout << "Camera stream has a pool of " << buffersPool.size() << " buffers" << std::endl;
Once the frame buffers are allocated, we can create the corresponding requests and associate each buffer with a
request. So when the camera receives the request it will fill the associated frame buffer with the next image content.
// Create the requests used to fetch the actual camera frames.
std::vector<std::unique_ptr<Request>> requests;
for (auto &buffer : buffersPool) {
    auto request = selectedCamera->createRequest();
    if (!request) {
        std::cerr << "Failed to create a frame request for the selected camera" << std::endl;
        return -6;
    }

    if (request->addBuffer(stream, buffer.get()) != 0) {
        std::cerr << "Failed to add a buffer to the frame request" << std::endl;
        return -7;
    }

    requests.push_back(std::move(request));
}
If the camera supports multistream, additional buffers can be added to a single request (using
libcamera::Request::addBuffer)
to capture frames for the other streams. However, only one buffer per stream is allowed in the same request.
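For illustration, with a hypothetical second stream it could look like the following (the stream and buffer variables are placeholders, as this tutorial only uses a single stream):

auto request = selectedCamera->createRequest();
request->addBuffer(viewfinderStream, viewfinderBuffer.get());
request->addBuffer(recordingStream, recordingBuffer.get());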
Now that we have a pool of requests, each one with its associated frame buffer, we can send them to the camera for
processing. Each time the camera has finished with a request, by filling the associated buffer with the actual image,
it calls a requestCompleted callback and then continues with the next request in the queue.
When we receive the requestCompleted signal, we can extract the image content from the request buffer and process it.
Once the image processing is finished, we recycle the buffer and re-queue the request for the next
frames. To take a single photo we would only need one buffer and one request, and we would queue this request only once.
// Connect the requests execution callback, it is called each time a frame
// has been produced by the camera.
selectedCamera->requestCompleted.connect(selectedCamera.get(), [&selectedCamera](Request *request) {
    if (request->status() == Request::RequestCancelled) {
        return;
    }

    // We can directly take the first request buffer as we are managing
    // only one stream. In case of multiple streams, we should iterate
    // over the BufferMap entries or access the buffer by stream pointer.
    auto buffer = request->buffers().begin()->second;
    auto &metadata = buffer->metadata();
    if (metadata.status == FrameMetadata::FrameSuccess) {
        // As we are using a RGB888 color format we have only one plane, but
        // in case of using a multiplanes color format (like YUV420) we
        // should iterate over all the planes.
        std::cout << "Frame #" << std::setw(2) << std::setfill('0') << metadata.sequence
                  << ": time=" << metadata.timestamp << "ns, size=" << metadata.planes().begin()->bytesused
                  << ", fd=" << buffer->planes().front().fd.get() << std::endl;
    } else {
        std::cerr << "Invalid frame received" << std::endl;
    }

    // Reuse the request buffer and re-queue the request.
    request->reuse(Request::ReuseBuffers);
    selectedCamera->queueRequest(request);
});
Before queueing the first request we need to start the camera and we must stop it when we’ve finished with the frames
extraction. The lifetime of all the requests pushed to the camera must be longer than this start/stop loop. Once the
camera is stopped, we can delete the corresponding requests as they will not be used anymore.
This implies that the FrameBufferAllocator instance must also outlive this same start/stop loop. If you try to delete
the requests vector or the frameAllocator instance before stopping the camera, you will naturally trigger a
segmentation fault.
// Start the camera streaming loop and run it for a few seconds.
selectedCamera->start();
for (const auto &request : requests) {
    selectedCamera->queueRequest(request.get());
}

std::this_thread::sleep_for(1500ms);

selectedCamera->stop();
At the end we clean up the resources. Here it is not strictly needed, as the destructors will do the job automatically.
But if you were building a more complex architecture and needed to explicitly free the resources, this is
the order to follow.
With the current code the only important point is to explicitly stop the camera before leaving the main
function (which implicitly triggers the destructor calls); otherwise the frameAllocator instance would be destroyed while
the camera is still processing the associated requests, leading to a segmentation fault.
// Cleanup the resources. In fact those resources are automatically released
// when the corresponding destructors are called. The only compulsory call
// to make is selectedCamera->stop() as the camera streaming loop MUST be
// stopped before releasing the associated buffers pool.
frameAllocator.reset();
selectedCamera->release();
selectedCamera.reset();
camManager->stop();
If everything has worked well so far, you should see the following output:
In this part, we are going to display the extracted frames using a small OpenGL ES application. This application will
show a rotating cube with a metallic aspect displaying, on each face, the live video stream from the Raspberry Pi 5
camera with an orange/red shade, like in the following video:
For this, we need a little bit more code to initialize the window, the OpenGL context and manage the drawing. The full
code is available at the code repository or
you can download it here.
We are using the GLFW library to manage the
EGL and
OpenGL ES contexts and the GLM
library to manage the 3D vectors and matrices. Those libraries are included as Meson wraps in the subprojects folder.
So, just like with the previous code, to build the project you only need to execute:
meson setup build
ninja -C build
All the 3D rendering is out of the scope of this tutorial; the corresponding classes have been grouped in the
src/rendering subfolder to keep the focus on the Camera and CameraTexture classes. If you are also interested in
3D rendering you can find a lot of interesting material on the Web and, in particular,
Anton’s OpenGL 4 Tutorials or Learn OpenGL.
The Camera class is basically a wrapper of the code explained in the previous parts. In this case we are configuring
the camera to use a pixel format aligned on 32 bits (XRGB8888) to be compatible with the hardware accelerated
rendering.
// We need to choose a pixel format with a stride aligned on 32 bits to be
// compatible with the GLES renderer. We only need 2 buffers, while one
// buffer is used by the GLES renderer, the other one is filled by the
// camera next frame and then both buffers are swapped.
streamConfig.size.width = captureWidth;
streamConfig.size.height = captureHeight;
streamConfig.pixelFormat = libcamera::formats::XRGB8888;
streamConfig.bufferCount = 2;
We are also using 2 buffers as one buffer will be rendered on screen while the other buffer will receive the next
camera frame, and then we’ll switch both buffers. We already know that when the
requestCompleted
signal is triggered, the corresponding buffer has finished being written with the next camera frame. This is our
synchronization point to send this buffer to the rendering.
On the rendering side, we know that when the OpenGL buffers are swapped, the displayed image has been fully rendered.
This is our synchronization point to recycle the buffer back to the camera capture loop.
A specific wrapper class, Camera::Frame, is used to exchange those buffers between the camera and the renderer. It is
passed around as a std::unique_ptr to ensure exclusive access from either the camera or the renderer. When the instance is
destroyed, it automatically recycles the underlying buffer to make it available for the next camera frame.
When Camera::startCapturing is called, the camera starts producing frames continuously (like in the code from the
previous parts). Each new frame replaces the previous one which is automatically recycled during its destruction:
void Camera::onRequestCompleted(libcamera::Request *request)
{
    if (request->status() == libcamera::Request::RequestCancelled) {
        return;
    }

    // We can directly take the first request buffer as we are managing
    // only one stream. In case of multiple streams, we should iterate
    // over the BufferMap entries or access the buffer by stream pointer.
    auto buffer = request->buffers().begin()->second;
    if (buffer->metadata().status == libcamera::FrameMetadata::FrameSuccess) {
        // As we are using a XRGB8888 color format we have only one plane, but
        // in case of using a multiplanes color format (like YUV420) we
        // should iterate over all the planes.
        std::unique_ptr<Frame> frame(new Frame(this, request, buffer->cookie()));

        std::lock_guard<std::mutex> lock(m_nextFrameMutex);
        m_nextFrame = std::move(frame);
    } else {
        // Reuse the request buffer and re-queue the request.
        request->reuse(libcamera::Request::ReuseBuffers);
        m_selectedCamera->queueRequest(request);
    }
}
Camera::Frame::~Frame()
{
    auto camera = m_camera.lock();
    if (camera && m_request) {
        m_request->reuse(libcamera::Request::ReuseBuffers);
        camera->m_selectedCamera->queueRequest(m_request);
    }
}
At any moment the renderer can fetch this frame to render it:
void onRender(double time) noexcept override
{
    if (m_camera) {
        // We are fetching the next camera produced frame that is ready to
        // be drawn. If there is no new frame available, we are just
        // keeping on drawing the same frame.
        auto cameraFrame = m_camera->getNextFrame();
        if (cameraFrame) {
            // We need to keep a reference to the current drawn frame in
            // order to not have the Camera class recycle the underlying
            // dma-buf while the GLES renderer is still using it for
            // drawing. This is the Camera::Frame destructor which ensures
            // proper synchronization. When reaching this point, the
            // previous m_currentCameraFrame has been fully drawn (the GLES
            // buffers swap has just occurred on the previous onRender
            // call), when the unique_ptr is replaced the previous
            // Camera::Frame is destroyed which triggers the recycling of
            // its FrameBuffer (for the next camera frame capture), while
            // the new frame is locked for drawing until it is itself
            // replaced.
            m_currentCameraFrame = std::move(cameraFrame);

            // We can directly fetch and bind the corresponding GLES
            // texture from the FrameBuffer cookie.
            auto textureIndex = m_currentCameraFrame->getCookie();
            m_textures[textureIndex]->bind();

            // The texture mix value is only used to reuse the same shader
            // without and with a camera frame. Now that we have a frame to
            // draw we can show it.
            m_shader->setCameraTextureMix(1.0f);
        }
    }

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    glm::mat4 modelMatrix =
        glm::rotate(glm::mat4(1.0f), 1.5f * static_cast<float>(time), glm::vec3(0.8f, 0.5f, 0.4f));
    m_shader->setModelMatrix(modelMatrix);

    m_cube->draw();
}
As we have only 2 buffers and access to each buffer is exclusive, the camera and renderer speeds will naturally
adjust to each other. The underlying frame buffer is only recycled when its wrapping Camera::Frame is destroyed, which only
happens when it is replaced by the next available frame.
N.B. The Camera::onRequestCompleted callback is called from a libcamera capturing thread while the
AppRenderer::onRender is called on the application main thread. The call to
libcamera::Camera::queueRequest
is thread-safe, but access to the std::unique_ptr must be protected by a mutex when handing the frame over to the
render thread.
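For reference, the consumer side of that handoff can be as small as the following sketch (the member and method names match the snippets above; the actual implementation is in the repository):

std::unique_ptr<Camera::Frame> Camera::getNextFrame()
{
    // Take ownership of the latest produced frame, if any, under the same
    // mutex used by the libcamera capturing thread in onRequestCompleted.
    std::lock_guard<std::mutex> lock(m_nextFrameMutex);
    return std::move(m_nextFrame);
}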
// Create an EGLImage from the camera FrameBuffer.
// In our case we are using a packed color format (XRGB8888), so we
// only need the first buffer plane. In case of using a multiplanar color
// format (like YUV420 for example), we would need to iterate over all the
// color planes in the buffer and fill the EGL_DMA_BUF_PLANE[i]_FD_EXT,
// EGL_DMA_BUF_PLANE[i]_OFFSET_EXT and EGL_DMA_BUF_PLANE[i]_PITCH_EXT for
// each plane.
const auto &plane = buffer.planes().front();
const EGLAttrib attrs[] = {
    EGL_WIDTH,
    streamConfiguration.size.width,
    EGL_HEIGHT,
    streamConfiguration.size.height,
    EGL_LINUX_DRM_FOURCC_EXT,
    streamConfiguration.pixelFormat.fourcc(),
    EGL_DMA_BUF_PLANE0_FD_EXT,
    plane.fd.get(),
    EGL_DMA_BUF_PLANE0_OFFSET_EXT,
    (plane.offset != libcamera::FrameBuffer::Plane::kInvalidOffset) ? plane.offset : 0,
    EGL_DMA_BUF_PLANE0_PITCH_EXT,
    streamConfiguration.stride,
    EGL_NONE};

EGLImage eglImage = eglCreateImage(eglDisplay, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT, nullptr, attrs);
if (!eglImage) {
    return nullptr;
}
N.B. It is important to use a pixel format compatible with the rendering device, else the eglCreateImage
function will fail with eglGetError()
returning EGL_BAD_MATCH.
Then, the EGLImage can be attached to an external OpenGL ES texture using the
OES_EGL_image_external OpenGL
extension:
// Create the GLES texture and attach the EGLImage to it.
glGenTextures(1, &texture->m_texture);
glBindTexture(GL_TEXTURE_EXTERNAL_OES, texture->m_texture);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glEGLImageTargetTexture2DOES(GL_TEXTURE_EXTERNAL_OES, eglImage);
glBindTexture(GL_TEXTURE_EXTERNAL_OES, 0);

// Now that the EGLImage is attached to the texture, we can destroy it. The
// underlying dma-buf will be released when the texture is deleted.
eglDestroyImage(eglDisplay, eglImage);
The corresponding texture can be used like any other kind of texture by binding it to the GL_TEXTURE_EXTERNAL_OES
target. Still, the shader will need to use the same extension and a specific sampler to use this external texture
target:
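For illustration, a minimal GLES fragment shader using this external target could look like the following (embedded as a C++ raw string; the uniform and varying names are placeholders, and the real shader in the repository also applies the metallic/orange shading seen in the video):

// Minimal fragment shader sampling the external camera texture through the
// OES_EGL_image_external extension. Names are illustrative only.
static const char *kCameraFragmentShader = R"glsl(
    #extension GL_OES_EGL_image_external : require
    precision mediump float;
    uniform samplerExternalOES u_cameraTexture;
    varying vec2 v_texCoord;
    void main()
    {
        gl_FragColor = texture2D(u_cameraTexture, v_texCoord);
    }
)glsl";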
Although the dma-buf is wrapped by two layers (EGLImage and Texture), its content is never copied or transferred to the
system CPU memory (RAM). This is the same memory space, allocated in a dedicated hardware memory, that is used to
receive the camera frame content and display it on screen, allowing the kernel to optimize the corresponding resources.
The libcamera library allocates the dma-bufs needed to store the captured frame content when calling
libcamera::FrameBufferAllocator::allocate.
So, we can create the corresponding external textures right after the Camera instance creation:
m_camera = Camera::create(m_width, m_height);
if (m_camera) {
    // Create one texture per available camera buffer.
    for (const auto &request : m_camera->getRequests()) {
        // We know that we are only using one stream and one buffer per
        // request. If we were using multiple streams at once, we
        // should iterate on the request BufferMap.
        auto [stream, buffer] = *request->buffers().begin();
        auto texture = CameraTexture::create(eglDisplay, stream->configuration(), *buffer);
        if (!texture) {
            std::cerr << "Failed to create a camera texture" << std::endl;
            m_textures.clear();
            m_camera.reset();
            m_shader.reset();
            m_cube.reset();
            return false;
        }

        // We are using the associated buffer cookie to store the
        // corresponding texture index in the internal vector. This way
        // it will be easy to fetch the right texture when a frame
        // buffer is ready to be drawn.
        m_textures.push_back(std::move(texture));
        buffer->setCookie(m_textures.size() - 1);
    }

    m_camera->startCapturing();
}
We’re happy to have released gst-dots-viewer, a new development tool that makes it easier to visualize and debug GStreamer pipelines. This tool, included in GStreamer 1.26, provides a web-based interface for viewing pipeline graphs in real-time as your application runs, and lets you easily request all pipelines to be dumped at any time.
What is gst-dots-viewer?
gst-dots-viewer is a server application that monitors a directory for .dot files generated by GStreamer’s pipeline visualization system and displays them in your web browser. It automatically updates the visualization whenever new .dot files are created, making it simpler to debug complex applications and understand the evolution of the pipelines at runtime.
Key Features
Real-time Updates: Watch your pipelines evolve as your application runs
Interactive Visualization:
Click nodes to highlight pipeline elements
Use Shift-Ctrl-scroll or w/s keys to zoom
Drag-scroll support for easy navigation
Easily deployable in cloud based environments
How to Use It
Start the viewer server:
gst-dots-viewer
Open your browser at http://localhost:3000
Enable the dots tracer in your GStreamer application:
GST_TRACERS=dots your-gstreamer-application
The web page will automatically update whenever new pipelines are dumped, and you can also request a dump of all pipelines directly from the web page.
New Dots Tracer
As part of this release, we’ve also introduced a new dots tracer that replaces the previous manual approach to specify where to dump pipelines. The tracer can be activated simply by setting the GST_TRACERS=dots environment variable.
Interactive Pipeline Dumps
The dots tracer integrates with the pipeline-snapshot tracer to provide real-time pipeline visualization control. Through a WebSocket connection, the web interface allows you to trigger pipeline dumps. This means you can dump pipelines exactly when you need them during debugging or development, from your browser.
Future Improvements
We plan on adding more features and have this list of possibilities:
Additional interactive features in the web interface
Enhanced visualization options
Integration with more GStreamer tracers to provide comprehensive debugging information. For example, we could integrate the newly released memory-tracer and queue-level tracers to plot graphs of memory usage at any time.
This could transform gst-dots-viewer into a more complete debugging and monitoring dashboard for GStreamer applications.
Hey all, just a lab notebook entry today. I’ve been working on the
Whippet GC library for about three
years now, learning a lot on the way. The goal has always been to
replace Guile’s use of the Boehm-Demers-Weiser
collector
with something more modern and maintainable. Last year I finally got to
the point that I felt Whippet was
feature-complete,
and taking into account the old adage about long arses and brief videos,
I think that wasn’t too far off. I carved out some time this spring and for the
last month have been integrating Whippet into Guile in anger, on the
wip-whippet
branch.
the haps
Well, today I removed the last direct usage of the BDW collector’s API
by Guile! Instead, Guile uses Whippet’s API any time it needs to
allocate an object, add or remove a thread from the active set, identify
the set of roots for a collection, and so on. Most tracing is still
conservative, but this will move to be more precise over time. I
haven’t had the temerity to actually try one of the Nofl-based
collectors yet, but that will come soon.
Code-wise, the initial import of Whippet added some 18K lines to Guile’s
repository, as counted by git diff --stat, which includes
documentation and other files. There was an unspeakable amount of autotomfoolery to get Whippet in Guile’s ancient build system. Changes to Whippet during the course of
integration added another 500 lines or so. Integration of Whippet
removed around 3K lines of C from Guile. It’s not a pure experiment, as
my branch is also a major version bump and so has the freedom to
refactor and simplify some things.
Things are better but not perfect. Notably, I switched to building weak
hash tables in terms of buckets and chains where the links are
ephemerons, which give me concurrent lock-free reads and writes but not
resizable tables. I would like to somehow resize these tables in
response to GC, but haven’t wired it up yet.
Accessibility in the free and open source world is somewhat of a sensitive topic.
Given the principles of free software, one would think it would be the best possible place to advocate for accessibility. After all, there’s a collection of ideologically motivated individuals trying to craft desktops to themselves and other fellow humans. And yet, when you look at the current state of accessibility on the Linux desktop, you couldn’t possibly call it good, not even sufficient.
It’s a tough situation that’s forcing people who need assistive technologies out of these spaces.
I think accessibility on the Linux desktop is in a particularly difficult position due to a combination of poor incentives and historical factors:
The dysfunctional state of accessibility on Linux makes it so that the people who need it the most cannot even contribute to it.
There is very little financial incentive for companies to invest in accessibility technologies. Often, and historically, companies invest just enough to tick some boxes on government checklists, then forget about it.
Volunteers, especially those who contribute for fun and self-enjoyment, often don’t go out of their way to make the particular projects they’re working on accessible, or to check whether their contributions regress the accessibility of the app.
The nature of accessibility makes it such that the “functional progression” is not linear. If only 50% of the stack is working, that’s practically a 0%. Accessibility requires almost every part of the stack to be functional for even the most basic use cases.
There’s almost nobody contributing to this area anymore. Expertise and domain knowledge are almost entirely lost.
In addition to that, I feel like work on accessibility is invisible, in the sense that most people are simply apathetic to the work and contributions done in this area. Maybe due to the dynamics of social media that often favor negative engagement? I don’t know. But it sure feels unrewarding. Compare:
Now, I think if I stopped writing here, you dear reader might feel that the situation is mostly gloomy, maybe even get angry at it. However, against all odds, and fighting a fight that seems impossible, there are people working on accessibility. Often without any kind of reward, doing this out of principle. It’s just so easy to overlook their effort!
So as we prepare for the Global Accessibility Awareness Day, I thought it would be an excellent opportunity to highlight these fantastic contributors and their excellent work, and also to talk about some ongoing work on GNOME.
If you consider this kind of work important and relevant, and/or if you need accessibility features yourself, I urge you: please donate to the people mentioned here. Grab these people a coffee. Better yet, grab them a monthly coffee! Contributors who accept donations have a button beneath their avatars. Go help them.
Calendar
GNOME Calendar, the default calendaring app for GNOME, has been slowly but surely progressing towards being minimally accessible. This is mostly thanks to the amazing work from Hari Rana and Jeff Fortin Tam!
Back when I was working on fixing accessibility on WebKitGTK, I found the lack of modern tools to inspect the AT-SPI bus a bit off-putting, so I wrote a little app to help me through. Didn’t think much of it, really.
Of course, almost nothing I’ve mentioned so far would be possible if the toolkit itself didn’t have support for accessibility. Thanks to Emmanuele Bassi GTK4 received an entirely new accessibility backend.
Over time, more people picked up on it, and continued improving it and filling in the gaps. Matthias Clasen and Emmanuele continue to review contributions and keep things moving.
One particular contributor is Lukáš Tyrychtr, who has implemented the Text interface of AT-SPI in GTK. Lukáš contributes to various other parts of the accessibility stack as well!
On the design side, one person in particular stands out for a series of contributions on the Accessibility panel of GNOME Settings: Sam Hewitt. Sam introduced the first mockups of this panel in GitLab, then kept on updating it. More recently, Sam introduced mockups for text-to-speech (okay technically these are in the System panel, but that’s in the accessibility mockups folder!).
Please join me in thanking Sam for these contributions!
Having apps and toolkits exposing the proper amount of accessibility information is a necessary first step, but it would be useless if there was nothing to expose to.
Thanks to Mike Gorse and others, the AT-SPI project keeps on living. AT-SPI is the service that receives and manages the accessibility information from apps. It’s the heart of accessibility in the Linux desktop! As far as my knowledge about it goes, AT-SPI is really old, dating back to Sun days.
Samuel Thibault continues to maintain speech-dispatcher and Accerciser. Speech dispatcher is the de facto text-to-speech service for Linux as of now. Accerciser is a venerable tool to inspect AT-SPI trees.
Eitan Isaacson is shaking up the speech synthesis world with libspiel, a speech framework for the desktop. Orca has experimental support for it. Eitan is now working on a desktop portal so that sandboxed apps can benefit from speech synthesis seamlessly!
One of the most common screen readers for Linux is Orca. Orca maintainers have been keeping it up and running for a very long time. Here I’d like to point out that we at Igalia significantly fund Orca development.
I would like to invite the community to share a thank you for all of them!
I tried to reach out to everyone nominally mentioned in this blog post. Some people preferred not to be mentioned. And I’m sure there are others involved in related projects whom I’ve never got to learn about.
I guess what I’m trying to say is, this list is not exhaustive. There are more people involved. If you know some of them, please let me encourage you to pay them a tea, a lunch, a boat trip in Venice, whatever you feel like; or even just reach out to them and thank them for their work.
If you contribute or know someone who contributes to desktop accessibility, and wishes to be here, please let me know. Also, please let me know if this webpage itself is properly accessible!
A Look Into The Future
Shortly after I started to write this blog post, I thought to myself: “well, this is nice and all, but it isn’t exactly robust.” Hm. If only there was a more structured, reliable way to keep investing on this.
Coincidentally, at the same time, we were introduced to our new executive director Steven. With such a blast of an introduction, and seeing Steven hanging around in various rooms, I couldn’t resist asking about it. To my great surprise and joy, Steven swiftly responded to my inquiries and we started discussing some ideas!
Conversations are still ongoing, and I don’t want to create any sort of hype in case things end up not working, but… maaaaaaybe keep in mind that there might be an announcement soon!
Huge thanks to the people above, and to everyone who helped me write this blog post
¹ – Jeff doesn’t accept donations for himself, but welcomes marketing-related business
Update on what happened in WebKit in the week from May 5 to May 12.
This week saw one more feature enabled by default, additional support to
track memory allocations, continued work on multimedia and WebAssembly.
Cross-Port 🐱
The Media Capabilities API is now enabled by default. It was previously available as a run-time option in the WPE/WebKitGTK API (WebKitSettings:enable-media-capabilities), so this is just a default tweak.
Landed a change that integrates malloc heap breakdown functionality with non-Apple ports. It works similarly to Apple’s, but in the non-Apple ports the per-heap memory allocation statistics are, for now, printed to stdout periodically. In the future this functionality will be integrated with Sysprof.
Multimedia 🎥
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
Support for WebRTC RTP header extensions was improved: an RTP header extension for video orientation metadata handling was introduced, and several simulcast tests are now passing.
Progress is ongoing on resumable player suspension, which will eventually allow us to handle websites with lots of simultaneous media elements better in the GStreamer ports, but this is a complex task.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
The in-place Wasm interpreter (IPInt) port to 32-bits has seen some more work.
Fixed a bug in OMG caused by divergence with the 64-bit version. Further syncing is underway.
Releases 📦️
Michael Catanzaro has published a writeup on his blog about how the WebKitGTK API versions have changed over time.
Infrastructure 🏗️
Landed some improvements in the WebKit container SDK for Linux, particularly in error handling.
In my work on RISC-V LLVM, I end up working with the llvm-test-suite a lot,
especially as I put more effort into performance analysis, testing, and
regression hunting.
suite-helper is a
Python script that helps with some of the repetitive tasks when setting up,
building, and analysing LLVM test
suite builds. (Worth noting for
those who aren’t LLVM regulars: llvm-test-suite is a separate repository from
LLVM and includes execution tests and benchmarks, as opposed to the
targeted unit tests included in the LLVM monorepo).
As always, it scratches an itch for me. The design target is to provide a
starting point that is hopefully good enough for many use cases, but it's easy
to modify (e.g. by editing the generated scripts or emitted command lines) if
doing something that isn't directly supported.
The main motivation for putting this script together came from my habit of
writing fairly detailed "lab notes" for most of my work. This typically
includes a listing of commands run, but I've found such listings rather
verbose and annoying to work with. This presented a good opportunity for
factoring out common tasks into a script, resulting in suite-helper.
Functionality overview
suite-helper has the following subtools:
create
Checkout llvm-test-suite to the given directory. Use the --reference
argument to reference git objects from an existing local checkout.
add-config
Add a build configuration using either the "cross" or "native" template.
See suite-helper add-config --help for a listing of available options.
For a build configuration 'foo', a _rebuild-foo.sh file will be created
that can be used to build it within the build.foo subdirectory.
status
Gives a listing of suite-helper managed build configurations that were
detected, attempting to indicate if they are up to date or not (e.g.
spotting if the hash of the compiler has changed).
run
Run the given build configuration using llvm-lit, with any additional
options passed on to lit.
match-tool
A helper that is used by suite-helper reduce-ll but may be useful in
your own reduction scripts. When looking at generated assembly or
disassembly of an object file/binary and an area of interest, your natural
inclination may well be to try to carefully craft logic to match something
that has equivalent/similar properties. Credit to Philip Reames for
underlining to me just how unreasonably effective it is to completely
ignore that inclination and just write something that naively matches a
precise or near-precise assembly sequence. The resulting IR might include
some extraneous stuff, but it's a lot easier to cut down after this
initial minimisation stage, and a lot of the time it's good enough. The
match-tool helper takes a multiline sequence of glob patterns as its
argument, and will attempt to find a match for them (a sequential set of
lines) on stdin. It also normalises whitespace.
get-ll
Query ninja and process its output to try to produce and execute a
compiler command that will emit a .ll for the given input file (e.g. a .c
file). This is a common first step for llvm-reduce, or for starting to
inspect the compilation of a file with debug options enabled.
reduce-ll
For me, it's fairly common to want to produce a minimised .ll file that
produces a certain assembly pattern, based on compiling a given source
input. This subtool automates that process, using get-ll to retrieve the
ll, then llvm-reduce and match-tool to match the assembly.
Usage example
suite-helper isn't intended to avoid the need to understand how to build the
LLVM test suite using CMake and run it using lit, rather it aims to
streamline the flow. As such, a good starting point might be to work through
some llvm-test-suite builds yourself and then look here to see if anything
makes your use case easier or not.
All of the notes above may seem rather abstract, so here is an example of
using the helper while investigating some poorly canonicalised
instructions and testing my work-in-progress patch to address them.
suite-helper create llvmts-redundancies --reference ~/llvm-test-suite

for CONFIG in baseline trial; do
  suite-helper add-config cross $CONFIG \
    --cc=~/llvm-project/build/$CONFIG/bin/clang \
    --target=riscv64-linux-gnu \
    --sysroot=~/rvsysroot \
    --cflags="-march=rva22u64 -save-temps=obj" \
    --spec2017-dir=~/cpu2017 \
    --extra-cmake-args="-DTEST_SUITE_COLLECT_CODE_SIZE=OFF -DTEST_SUITE_COLLECT_COMPILE_TIME=OFF"
  ./_rebuild-$CONFIG.sh
done

# Test suite builds are now available in build.baseline and build.trial, and
# can be compared with e.g. ./utils/tdiff.py.

# A separate script had found a suspect instruction sequence in sqlite3.c, so
# let's get a minimal reproducer.
suite-helper reduce build.baseline ./MultiSource/Applications/sqlite3/sqlite3.c \
  'add.uw a0, zero, a2
   subw a4, a4, zero' \
  --reduce-bin=~/llvm-project/build/baseline/bin/llvm-reduce \
  --llc-bin=~/llvm-project/build/baseline/bin/llc \
  --llc-args=-O3
Hey peoples! Tonight, some meta-words. As you know I am fascinated by
compilers and language implementations, and I just want to know all the
things and implement all the fun stuff: intermediate representations,
flow-sensitive source-to-source optimization passes, register
allocation, instruction selection, garbage collection, all of that.
It started long ago with a combination of curiosity and a hubris to satisfy
that curiosity. The usual way to slake such a thirst is structured
higher education followed by industry apprenticeship, but for whatever
reason my path sent me through a nuclear engineering bachelor’s program
instead of computer science, and continuing that path was so distasteful
that I noped out all the way to rural Namibia for a couple years.
Fast-forward, after 20 years in the programming industry, and having
picked up some language implementation experience, a few years ago I
returned to garbage collection. I have a good level of language
implementation chops but never wrote a memory manager, and Guile’s
performance was limited by its use of the Boehm collector. I had been
on the lookout for something that could help, and when I learned of
Immix it seemed to me that the only thing missing was an appropriate
implementation for Guile, and hey I could do that!
whippet
I started with the idea of an MMTk-style
interface to a memory manager that was abstract enough to be implemented
by a variety of different collection algorithms. This kind of
abstraction is important, because in this domain it’s easy to convince
oneself that a given algorithm is amazing, just based on vibes; to stay
grounded, I find I always need to compare what I am doing to some fixed
point of reference. This GC implementation effort grew into
Whippet, but as it did so a funny
thing happened: the mark-sweep collector that I
prototyped
as a direct replacement for the Boehm collector maintained mark bits in
a side table, which I realized was a suitable substrate for
Immix-inspired bump-pointer allocation into holes. I ended up building
on that to develop an Immix collector, but without lines: instead each
granule of allocation (16 bytes for a 64-bit system) is its own line.
regions?
The Immix paper is
funny, because it defines itself as a new class of mark-region
collector, fundamentally different from the three other fundamental
algorithms (mark-sweep, mark-compact, and evacuation). Immix’s
regions are blocks (64kB coarse-grained heap divisions) and lines
(128B “fine-grained” divisions); the innovation (for me) is the
optimistic evacuation discipline by which one can potentially
defragment a block without a second pass over the heap, while also
allowing for bump-pointer allocation. See the papers for the deets!
However what, really, are the regions referred to by mark-region? If
they are blocks, then the concept is trivial: everyone has a
block-structured heap these days. If they are spans of lines, well, how
does one choose a line size? As I understand it, Immix’s choice of 128
bytes was to be fine-grained enough to not lose too much space to
fragmentation, while also being coarse enough to be eagerly swept during
the GC pause.
This constraint was odd, to me; all of the mark-sweep systems I have
ever dealt with have had lazy or concurrent sweeping, so the lower bound
on the line size to me had little meaning. Indeed, as one reads papers
in this domain, it is hard to know the real from the rhetorical; the
review process prizes novelty over nuance. Anyway. What if we cranked
the precision dial to 16 instead, and had a line per granule?
That was the process that led me to Nofl. It is a space in a collector
that came from mark-sweep with a side table, but instead uses the side
table for bump-pointer allocation. Or you could see it as an Immix
whose line size is 16 bytes; it’s certainly easier to explain it that
way, and that’s the tack I took in a recent paper submission to
ISMM’25.
paper??!?
Wait what! I have a fine job in industry and a blog, why write a paper?
Gosh I have meditated on this for a long time and the answers are very
silly. Firstly, one of my language communities is Scheme, which was a
research hotbed some 20-25 years ago, which means many
practitioners—people I would be pleased to call peers—came up
through the PhD factories and published many interesting results in
academic venues. These are the folks I like to hang out with! This is
also what academic conferences are, chances to shoot the shit with
far-flung fellows. In Scheme this is fine, my work on Guile is enough
to pay the intellectual cover charge, but I need more, and in the field
of GC I am not a proven player. So I did an atypical thing, which is to
cosplay at being an independent researcher without having first been a
dependent researcher, and just solo-submit a paper. Kids: if you see
yourself here, just go get a doctorate. It is not easy but I can only
think it is a much more direct path to goal.
And the result? Well, friends, it is this blog post :) I got the usual
assortment of review feedback, from the very sympathetic to the less so,
but ultimately people were confused by leading with a comparison to
Immix but ending without an evaluation against Immix. This is fair and
the paper does not mention that, you know, I don’t have an Immix lying
around. To my eyes it was a good paper, an 80%
paper, but, you know, just a
try. I’ll try again sometime.
In the meantime, I am driving towards getting Whippet into Guile. I am
hoping that sometime next week I will have excised all the uses of the
BDW (Boehm GC) API in Guile, which will finally allow for testing Nofl
in more than a laboratory environment. Onwards and upwards!
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
The GstWPE2 GStreamer plugin landed in GStreamer
main,
and it makes use of the WPEPlatform API. It will ship in GStreamer 1.28. Compared
to GstWPE1 it provides the same features, but improved support for NVIDIA
GPUs. The main regression is lack of audio support, which is work-in-progress,
both on the WPE and GStreamer sides.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
Work on enabling the in-place Wasm interpreter (IPInt) on 32-bits has progressed nicely.
The JSC tests runner can now guard against a pathological failure mode.
In JavaScriptCore's implementation of
Temporal,
Tim Chevalier fixed the parsing of RFC
9557 annotations in date
strings to work according to the standard. So now syntactically valid but
unknown annotations [foo=bar] are correctly ignored, and the ! flag in an
annotation is handled correctly. Philip Chimento expanded the test suite
around this feature and fixed a couple of crashes in Temporal.
Math.hypot(x, y, z) received a fix for a corner case.
WPE WebKit 📟
WPE now uses the new pasteboard API, aligning it with the GTK port, and enabling features that were previously disabled. Note that the new features work only with WPEPlatform, because libwpe-based backends are limited to accessing clipboard text items.
WPE Platform API 🧩
New, modern platform API that supersedes usage of libwpe and WPE backends.
Platform backends may add their own clipboard handling, with the Wayland one being the first to do so, using wl_data_device_manager.
This continues the effort to close the feature gap between the “traditional” libwpe-based WPE backends and the new WPEPlatform ones.
Community & Events 🤝
Carlos García has published a blog post about the optimizations introduced in
the WPE and GTK WebKit
ports
since the introduction of Skia replacing Cairo for 2D rendering. Plus, there
are some hints about what is coming next.
Cast your mind back to the late 2000s and one thing you might remember is the
excitement about netbooks. You
sacrifice something in raw computational power, but get a lightweight, low
cost and ultra-portable system. Their popularity peaked and started to wane maybe
15 years ago now, but I was pleased to discover that the idea lives on in the
form of the Chuwi MiniBook X
N150 and have
been using it as my daily driver for about a month now. Read on for some notes
and thoughts on the device as well as more information than you probably want
about configuring Linux on it.
The bottom line is that I enjoy it, I'd buy it again. But there are real
limitations to keep in mind if you're considering following suit.
Background
First a little detour. As many of my comments are made in reference to my
previous laptops it's probably worth fleshing out that history a little. The
first thing to understand is that my local computing needs are relatively
simple and minimal. I work on large C/C++ codebases (primarily LLVM) with
lengthy compile times, but I build and run tests on a remote machine. This
means I only need enough local compute to comfortably navigate codebases, do
whatever smaller local projects I want to do, and use any needed browser based
tools like videoconferencing or GDocs.
Looking back at my previous two laptops (oldest first):
Intel
N5000
processor, 4GiB RAM (huge weak point even then), 256GB SSD, 14" 1920x1080
matte screen.
Fanless and absolutely silent.
A big draw was the long battery life. Claimed 17h by the manufacturer,
tested at ~12h20m 'light websurfing' in one
review
which I found to be representative, with runtimes closer to 17h possible
if e.g. mostly doing text editing when traveling without WiFi.
Three USB-A ports, one USB-C port, 3.5mm audio jack, HDMI, SD card slot.
Charging via proprietary power plug.
1.30kg weight and 32.3cm x 22.8cm dimensions.
Took design stylings of rather more expensive devices, with a metal
chassis, the ability to fold flat, and a large touchpad.
Claimed battery life reduced to 15h. I found it very similar in practice.
But the battery has degraded significantly over time.
Two USB-A ports, one USB-C port, 3.5mm audio jack, HDMI. Charging via
proprietary power plug.
1.30kg weight and 32.3cm x 21.2cm dimensions.
Still a metal chassis, though sadly designed without the ability to fold
the screen completely flat and the size of the touchpad was downgraded.
I think you can see a pattern here.
As for the processors, the N5000 was part of Intel "Gemini
Lake"
which used the Goldmont Plus microarchitecture. This targets the same market
segment as earlier Atom branded processors (as used by many of those early
netbooks) but with substantially higher performance and a much more
complicated microarchitecture than the early Atom (which was dual issue, in
order with a 16 stage pipeline). The
best reference I can see for the microarchitectures used in the N5000 and
N6000 is AnandTech's Tremont microarchitecture
write-up
(matching the
N6000),
which makes copious reference to differences vs previous iterations. Both the
N5000 and N6000 have a TDP of 6W and 4 cores (no hyperthreading). Notably,
all these designs lack AVX support.
The successor to Tremont was the Gracemont
microarchitecture,
this time featuring AVX2 and seeing much wider usage due to being used as the
"E-Core" design throughout Intel's chips pairing some number of more
performance-oriented P-Cores with energy efficiency optimised E-Cores. Low
TDP chips featuring just E-Cores were released such as the N100 serving
as a successor to the
N6000
and later the N150 added as a slightly higher clocked version. There have been
further iterations on the microarchitecture since Gracemont with
Crestmont and
Skymont,
but at the time of writing I don't believe these have made it into similar
E-Core only low TDP chips. I'd love to see competitive devices at similar
pricepoints using AMD or Arm chips (and one day RISC-V of course), but this
series of Intel chips seems to have really found a niche.
28.8Wh battery, seems to give 4-6h battery depending on what you're doing
(possibly more if offline and text editing, I've not tried to push to the
limits).
Two USB-C ports (both supporting charging via USB PD), 3.5mm audio jack.
0.92kg weight and 24.4cm x 16.6cm dimensions.
Display is touchscreen, and can fold all the way around for tablet-style
usage.
Just looking at the specs the key trade-offs are clear. There's a big drop in
battery life, but a newer faster processor and fun mini size.
Overall, it's a positive upgrade but there are definitely some downsides. Main
highlights:
Smol! Reasonably light. The 10" display works well at 125% zoom.
The keyboard is surprisingly pleasant to use. The trackpad is obviously
small given size constraints, but again it works just fine for me. It feels
like this is the smallest size where you can have a fairly normal experience
in terms of display and input.
With a metal chassis, the build quality feels good overall. Of course the
real test is how it lasts.
Charging via USB-C PD! I am so happy to be free of laptop power bricks.
The N150 is a nice upgrade vs the N5000 and N6000. AVX2 support means
we're much more likely to hit optimised codepaths for libraries that make
use of it.
But of course there's a long list of niggles or drawbacks. As I say, overall
it works for me, but if it didn't have these drawbacks I'd probably move
more towards actively recommending it without lots of caveats:
Battery life isn't fantastic. I'd be much happier with 10-12h. Though given
the USB-C PD support, it's not hard to reach this with an external battery.
I miss having a silent fanless machine. The fan doesn't come on frequently
in normal usage, but of course it's noticeable when it does. My unit also
suffers from some coil whine which is audible sometimes when scrolling.
Neither is particularly loud but there is a huge difference between never
being able to hear your computer vs sometimes being able to hear it.
Some tinkering needed for initial Linux setup. Depending on your mindset,
this might be a pro! Regardless, I've documented what I've done down below.
I should note that all the basic hardware does work including the
touchscreen, webcam, and microphone. The fact the display is rotated is
mostly an easy fix, but I haven't checked if the fact it shows as 1200x1920
rather than 1920x1080 causes problems for e.g. games.
In-built display is 50Hz rather than 60Hz and I haven't yet succeeded at
overriding this in Linux (although it seems possible in Windows).
It's unfortunate there's no ability to limit charging at e.g. 80% as
supported by some charge controllers as a way of extending battery lifetime.
It charges relatively slowly (~20W draw), which is a further incentive to
have an external battery if out and about.
It's a shame they went with the soldered on Intel AX101 WiFi module rather
than spending a few dollars more for a better module from Intel's line-up.
I totally understand why Chuwi don't/can't have different variants with
different keyboards, but I would sure love a version with a UK key layout!
Screen real estate is lost to the bezel. Additionally, the rounded corners
of the bezel cutting off the corner pixels is annoying.
Do beware that the laptop ships with a 12V/3A charger with a USB-C connection
that apparently will use that voltage without any negotiation. It's best not
to use it at all due to the risk of plugging in something that can't handle
12V input.
Conclusion: It's not a perfect machine but I'm a huge fan of this form
factor. I really hope we get future iterations or competing products.
Appendix A: Accessories
YMMV, but I picked up the following with the most notable clearly being the
replacement SSD. Prices are the approximate amount paid including any
shipping.
Installation was trivial. Undo 8 screws on the MiniBook underside and it
comes off easily.
The spec is overkill for this laptop (PCIe Gen4 when the MiniBook only
supports Gen3 speeds). But the price was good meaning it wasn't very
attractive to spend a similar amount for a slower last-generation drive
with worse random read/write performance.
Unlike the MiniBook itself, charges very quickly. Also supports
pass-through charging so you can charge the battery while also charging
downstream devices, through a single wall socket.
Goes for a thin but wider squared shape vs many other batteries that are
quite thick, though narrower. For me this is more convenient in most
scenarios.
Despite being designed for the Steam Deck, this actually works really nicely
for holding it vertically. The part that holds the device is adjustable and
comfortably holds it without blocking the air vents. I use this at my work
desk and just need to plug in a single USB-C cable for power, monitor, and
peripherals (and additionally the 3.5mm audio jack if using speakers).
I'd wondered if I might have to instead find some below-desk setup to keep
cables out of the way, but placing this at the side of my desk and using
right-angled cables (or adapters) that go straight down off the side seems
to work fairly well for keeping the spider's web of cables out of the
way.
Supports 20V 1.75A when only a USB-C cable is connected, which is more than
enough for charging the MiniBook.
Given all my devices when traveling are USB, I was interested in
something compact that avoids the need for separate adapter plugs. This
seems to fit the bill.
Case: 11" Tablet
case (~£2.50 when
bought with some other things)
Took a gamble but this fits remarkably well, and has room for extra cables
/ adapters.
Appendix B: Arch Linux setup
As much for my future reference as for anything else, here are notes on
installing and configuring Arch Linux on the MiniBook X to my liking, and
working through as many niggles as I can. I'm grateful to Sonny Piers' GitHub
repo for some pointers on dealing
with initial challenges like screen rotation.
Initial install
Download an Arch Linux install image and
write to a USB drive. Enter the BIOS by pressing F2 while booting and disable
secure boot. I found I had to do this, then save and exit for
it to stick. Then enter BIOS again on a subsequent boot and select the option
to boot straight into it (under the "Save and Exit" menu).
In order to have the screen rotated correctly, we need to set the boot
parameter video=DSI-1:panel_orientation=right_side_up. Do this by pressing
e at the boot menu and manually adding it.
Then connect to WiFi (iwctl then station wlan0 scan, station wlan0 get-networks, station wlan0 connect $NETWORK_NAME and enter the WiFi
password). It's likely more convenient to do the rest of the setup via ssh,
which can be done by setting a temporary root password with passwd and then
connecting with
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@archiso.
Set the SSD sector size to 4k:
# Confirm 4k sector sizes are available and supported.
nvme id-ns -H /dev/nvme0n1
# Shows:
# LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
# LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better
nvme format --lbaf=1 /dev/nvme0n1
Now partition disks and create filesystems (with encrypted rootfs):
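The exact commands aren't reproduced here; a minimal sketch of the general shape (an EFI system partition plus a LUKS-encrypted ext4 root; device names and sizes are assumptions to adapt) would be:
# Sketch only: EFI partition + LUKS-encrypted root. Adjust device names and sizes.
sgdisk --zap-all /dev/nvme0n1
sgdisk -n1:0:+512M -t1:ef00 -n2:0:0 -t2:8309 /dev/nvme0n1
mkfs.fat -F32 /dev/nvme0n1p1
cryptsetup luksFormat /dev/nvme0n1p2
cryptsetup open /dev/nvme0n1p2 cryptroot
mkfs.ext4 /dev/mapper/cryptroot
mount /dev/mapper/cryptroot /mnt
mount --mkdir /dev/nvme0n1p1 /mnt/boot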
The touchscreen input also needs to be rotated to work properly. See
here for guidance
on the transformation matrix for xinput and confirm the name to match with
xinput list.
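As a sketch, the transformation can be applied with xinput (the device name below is a placeholder; the matrix shown is the standard one for a 90° clockwise rotation):
# Rotate touchscreen input to match the right-rotated panel.
# Replace the device name with the one reported by `xinput list`.
xinput set-prop "<touchscreen device name>" 'Coordinate Transformation Matrix' \
    0 1 0 -1 0 1 0 0 1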
Install the yay AUR helper and a couple of AUR packages:
git clone https://aur.archlinux.org/yay.git && cd yay
makepkg -si
cd .. && rm -rf yay
yay xautolock
yay ps_mem
Use UK keymap and in X11 use caps lock as escape:
localectl set-keymap uk
localectl set-x11-keymap gb "" "" caps:escape
The device has a US keyboard layout, which has one fewer key than the UK
layout and several keys in different places. As I regularly use a UK-layout
external keyboard, rather than just getting used to this I set a UK layout
and use AltGr keycodes for backslash (AltGr+-) and pipe (AltGr+`).
For audio support, I didn't need to do anything other than get rid of
excessive microphone noise by opening alsamixer and turning "Internal Mic
Boost" down to zero.
Suspend rather than shutdown when pressing power button
It's too easy to accidentally hit the power button, especially when
plugging/unplugging USB-C devices, so let's make it just suspend rather than
shut down.
See the Arch
wiki
for a discussion. s2idle and deep are reported as supported from
/sys/power/mem_sleep, but the discharge rate leaving the laptop suspended
overnight feels higher than I'd like. Let's enable deep sleep in the hope it
reduces it.
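One way to do that is with a systemd sleep.conf drop-in (the file name below is arbitrary):
# /etc/systemd/sleep.conf.d/deep.conf
[Sleep]
MemorySleepMode=deep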
Check last sleep mode used with sudo journalctl | grep "PM: suspend" | tail -2. And check the current sleep mode with cat /sys/power/mem_sleep.
Checking the latter after boot, you're likely to be worried to see that s2idle
is still the default. But try suspending and then checking the journal, and
you'll see that systemd switches it just prior to suspending (i.e. the setting
works as expected, even if it's only applied lazily).
I haven't done a reasonably controlled test of the impact.
Changing DPI
The strategy is to use xsettingsd to
update applications that support it on the fly, and otherwise update Xft.dpi
in Xresources. I've found a DPI of 120 works well for me. So add systemctl --user restart xsettingsd to .xinitrc as well as a call to this set_dpi
script with the desired DPI:
#!/bin/sh
DPI="$1"
if [ -z "$DPI" ]; then echo "Usage: $0 <dpi>"; exit 1; fi
CONFIG_FILE="$HOME/.config/xsettingsd/xsettingsd.conf"
mkdir -p "$(dirname "$CONFIG_FILE")"
if ! [ -e "$CONFIG_FILE" ]; then touch "$CONFIG_FILE"; fi
if grep -q 'Xft/DPI' "$CONFIG_FILE"; then
    sed -i "s|Xft/DPI.*|Xft/DPI $(($DPI*1024))|" "$CONFIG_FILE"
else
    echo "Xft/DPI $(($DPI*1024))" >> "$CONFIG_FILE"
fi
systemctl --user restart xsettingsd.service
echo "Xft.dpi: $DPI" | xrdb -merge
echo "DPI set to $DPI"
If attaching to an external display where a different DPI is desirable, just
call set_dpi as needed.
Enabling a Jabra bluetooth headset
sudo systemctl enable --now bluetooth.service
Follow instructions in https://wiki.archlinux.org/title/bluetooth_headset to
pair
Remember to do the 'trust' step so it automatically reconnects
Automatically enabling/disabling display outputs upon plugging in a monitor
The srandrd tool provides a handy way of
listening for changes in the plug/unplugged status of connections and
launching a shell script. First try it out with the following to observe
events:
yay srandrd
cat - <<'EOF' > /tmp/echo.sh
echo $SRANDRD_OUTPUT $SRANDRD_EVENT $SRANDRD_EDID
EOF
chmod +x /tmp/echo.sh
srandrd -n /tmp/echo.sh
# You should now see the events as you plug/unplug devices.
So is this simple, then - we just write a shell script that srandrd will
invoke, which calls xrandr as desired when the device with the target EDID is
connected or disconnected? Almost. There are two problems I needed to work around:
The monitor I use for work is fairly bad at picking up a 4k60Hz input
signal. As far as I can tell this is independent of the cable used or input
device. What does seem to reliably work is to output a 1080p signal, wait a
bit, and then reconfigure to 4k60Hz.
The USB-C cable I normally plug into in my sitting room is also connected
to the TV via HDMI (I often use this for my Steam Deck). I noticed occasional
graphical slowdowns and after more debugging found I could reliably see this
in hiccups / reduced measured frame rate in glxgears that correspond with
recurrent plug/unplug events. The issue disappears completely if video output
via the cable is configured once and then unconfigured again. Very weird, but
at least there's a way round it.
Both of the above can readily be addressed by producing a short sequence of
xrandr calls rather than just one. Except these xrandr calls themselves
trigger new events that cause srandrd to reinvoke the script. So I added a
mechanism to have the script ignore events received in short succession. We
end up with the following:
#!/usr/bin/sh
EVENT_STAMP=/tmp/display-change-stamp
# Recognised displays (as reported by $SRANDRD_EDID).
WORK_MONITOR="720405518350B628"
TELEVISION="6D1E82C501010101"

msg() {
  printf "display-change-handler: %s\n" "$*" >&2
}

# Call xrandr, but refresh $EVENT_STAMP just before doing so. This causes
# connect/disconnect events generated by the xrandr operation to be skipped at
# the head of this script. Call xrefresh afterwards to ensure windows are
# redrawn if necessary.
wrapped_xrandr() {
  touch $EVENT_STAMP
  xrandr "$@"
  xrefresh
}

msg "received event '$SRANDRD_OUTPUT: $SRANDRD_EVENT $SRANDRD_EDID'"
# Suppress event if within 2 seconds of the timestamp file being updated.
if [ -f $EVENT_STAMP ]; then
  cur_time=$(date +%s)
  file_time=$(stat -c %Y $EVENT_STAMP)
  if [ $((cur_time - file_time)) -le 2 ]; then
    msg "suppressing event (exiting)"
    exit 0
  fi
fi
touch $EVENT_STAMP

is_output_outputting() {
  xrandr --query | grep -q "^$1 connected.*[0-9]\+x[0-9]\++[0-9]\++[0-9]\+"
}

# When connecting the main 'docked' display, disable the internal screen. Undo
# this when disconnecting.
case "$SRANDRD_EVENT $SRANDRD_EDID" in
  "connected $WORK_MONITOR")
    msg "enabling 1920x1080 output on $SRANDRD_OUTPUT, disabling laptop display, and sleeping for 10 seconds"
    wrapped_xrandr --output DSI-1 --off --output $SRANDRD_OUTPUT --mode 1920x1080
    sleep 10
    msg "switching up to 4k output"
    wrapped_xrandr --output DSI-1 --off --output $SRANDRD_OUTPUT --preferred
    msg "done"
    exit
    ;;
  "disconnected $WORK_MONITOR")
    msg "re-enabling laptop display and disabling $SRANDRD_OUTPUT"
    wrapped_xrandr --output DSI-1 --preferred --rotate right --output $SRANDRD_OUTPUT --off
    msg "done"
    exit
    ;;
  "connected $TELEVISION")
    # If we get the 'connected' event and a resolution is already configured
    # and being emitted, then do nothing as the event was likely generated by
    # a manual xrandr call from outside this script.
    if is_output_outputting $SRANDRD_OUTPUT; then
      msg "doing nothing as manual reconfiguration suspected"
      exit 0
    fi
    msg "enabling then disabling output $SRANDRD_OUTPUT which seems to avoid subsequent disconnect/reconnects"
    wrapped_xrandr --output $SRANDRD_OUTPUT --mode 1920x1080
    sleep 1
    wrapped_xrandr --output $SRANDRD_OUTPUT --off
    msg "done"
    exit
    ;;
  *)
    msg "no handler for $SRANDRD_EVENT $SRANDRD_EDID"
    exit
    ;;
esac
Outputting to in-built screen at 60Hz (not yet solved)
The screen is unfortunately limited to 50Hz out of the box, but at least on
Windows it's possible to use Custom Resolution
Utility
to edit the EDID and add a 1200x1920 60Hz mode (reminder: the display is
rotated to the right which is why width x height is the opposite order to
normal). To add the custom mode in Custom Resolution Utility:
Open CRU
Click to "add a detailed resolution"
Select "Exact reduced" and enter Active: 1200 horizontal pixels, Vertical
1920 lines, and Refresh rate: 60.000 Hz. This results in Horizontal:
117.000kHz and pixel clock 159.12MHz. Leave interlaced unticked.
I exported this to a file with the hope of reusing on Linux.
As is often the case, the Arch Linux wiki has some relevant
guidance
on configuring an EDID override on Linux. I tried to follow the guidance by:
Copying the exported EDID file to
/usr/lib/firmware/edid/minibook_x_60hz.bin.
Adding drm.edid_firmware=DSI-1:edid/minibook_x_60hz.bin (DSI-1 is the
internal display) to the kernel commandline using efibootmgr.
Confirming this shows up in the kernel command line in dmesg but there are
no DRM messages regarding EDID override or loading the file. I also verify
it shows up in cat /sys/module/drm/parameters/edid_firmware.
Attempting to add /usr/lib/firmware/edid/minibook_x_60hz.bin to FILES in
/etc/mkinitcpio.conf and regenerating the initramfs. No effect.
Over the past eight months, Igalia has been working through RISE on the LLVM compiler, focusing on its RISC-V target. The goal is to improve the performance of generated code for application-class RISC-V processors, especially where there are gaps between LLVM and GCC on RISC-V.
The result? A set of improvements that reduces execution time by up to 15% on our SPEC CPU® 2017-based benchmark harness.
In this blog post, I’ll walk through the challenges, the work we did across different areas of LLVM (including instruction scheduling, vectorization, and late-stage optimizations), and the resulting performance gains that demonstrate the power of targeted compiler optimization for the RISC-V architecture on current RVA22U64+V and future RVA23 hardware.
First, to understand the work involved in optimizing the RISC-V performance, let’s briefly discuss the key components of this project: the RISC-V architecture itself, the LLVM compiler infrastructure, and the Banana Pi BPI-F3 board as our target platform.
RISC-V is a modern, open-standard instruction set architecture (ISA) built around simplicity and extensibility. Unlike proprietary ISAs, RISC-V’s modular design allows implementers to choose from base instruction sets (e.g., RV32I, RV64I) and optional extensions (e.g., vector ops, compressed instructions). This flexibility makes it ideal for everything from microcontrollers to high-performance cores, while avoiding the licensing hurdles of closed ISAs. However, this flexibility also creates complexity: without guidance, developers might struggle to choose the right combination of extensions for their hardware.
Enter RISC-V Profiles: standardized bundles of extensions that ensure software compatibility across implementations. For the BPI-F3’s CPU, the relevant profile is RVA22U64, which includes:
Mandatory: RV64GC (64-bit with general-purpose + compressed instructions), Zicsr (control registers), Zifencei (instruction-fetch sync), and more.
Optional: The Vector extension (V) v1.0 (for SIMD operations) and other accelerators.
We chose to focus our testing on two configurations: RVA22U64 (scalar) and RVA22U64+V (vector), since they cover a wide variety of hardware. It's also important to note that code generation for vector-capable systems (RVA22U64+V) differs significantly from scalar-only targets, making it crucial to optimize both paths carefully.
RVA23U64, which mandates the vector extension, was not chosen because the BPI-F3 doesn’t support it.
LLVM is a powerful and widely used open-source compiler infrastructure. It's not a single compiler but rather a collection of modular and reusable compiler and toolchain technologies. LLVM's strength lies in its flexible and well-defined architecture, which allows it to efficiently compile code written in various source languages (like C, C++, Rust, etc.) for a multitude of target architectures, including RISC-V. A key aspect of LLVM is its optimization pipeline. This series of analysis and transformation passes works to improve the generated machine code in various ways, such as reducing the number of instructions, improving data locality, and exploiting target-specific hardware features.
The Banana Pi BPI-F3 is a board featuring a SpacemiT K1 8-core RISC-V chip that integrates 2.0 TOPS of AI computing power. It ships with 2/4/8/16 GB DDR and 8/16/32/128 GB eMMC onboard, 2x GbE Ethernet ports, 4x USB 3.0, PCIe for an M.2 interface, and support for HDMI and dual MIPI-CSI cameras.
Most notably, the RISC-V CPU supports the RVA22U64 Profile and 256-bit RVV 1.0 standard.
Let's define the testing environment. We use the training dataset of our SPEC CPU® 2017-based benchmark harness to measure the impact of changes to the LLVM codebase. We do not use the reference dataset for practical reasons, i.e., the training dataset finishes in hours instead of days.
The benchmarks were executed on the BPI-F3, running Arch Linux and Kernel 6.1.15. The configuration of each compiler invocation is as follows:
LLVM at the start of the project (commit cd0373e0): SPEC benchmarks built with optimization level 3 (-O3), and LTO enabled (-flto). We’ll show the results using both RVA22U64 (-march=rva22u64) and the RVA22U64+V profiles (-march=rva22u64_v).
LLVM today (commit b48c476f): SPEC benchmarks built with optimization level 3 (-O3), LTO enabled (-flto), tuned for the SpacemiT-X60 (-mcpu=spacemit-x60), and IPRA enabled (-mllvm -enable-ipra -Wl,-mllvm,-enable-ipra). We’ll also show the results using both RVA22U64 (-march=rva22u64) and the RVA22U64+V profile (-march=rva22u64_v). A sketch of such an invocation is shown after this list.
GCC 14.2: SPEC benchmarks built with optimization level 3 (-O3), and LTO enabled (-flto). GCC 14.2 doesn't support profile names in -march, so a functionally equivalent ISA naming string was used (skipping the assortment of extensions that don't affect codegen and aren't recognised by GCC 14.2) for both RVA22U64 and RVA22U64+V.
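As a rough sketch, the "LLVM today" RVA22U64+V invocations combine the flags above along these lines (the source and output names are placeholders, not the actual SPEC build commands):
clang -O3 -flto -march=rva22u64_v -mcpu=spacemit-x60 \
      -mllvm -enable-ipra -Wl,-mllvm,-enable-ipra \
      -o benchmark benchmark.c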
The following graph shows the improvements in execution time of the SPEC benchmarks from the start of the project (light blue bar) to today (dark blue bar) using the RVA22U64 profile, on the BPI-F3. Note that these include not only my contributions but also the improvements of all other individuals working on the RISC-V backend. We also include the results of GCC 14.2 for comparison (orange bar). Our contributions will be discussed later.
The graph is sorted by the execution time improvements brought by the new scheduling model. We see improvements across almost all benchmarks, from small gains in 531.deepsjeng_r (3.63%) to considerable ones in 538.imagick_r (19.67%) and 508.namd_r (25.73%). There were small regressions in the execution time of 510.parest_r (-3.25%); however, 510.parest_r results vary greatly in daily tests, so it might be just noise. Five benchmarks are within 1% of previous results, so we assume there was no impact on their execution time.
When compared to GCC, LLVM today is faster in 11 out of the 16 tested benchmarks (up to 23.58% faster than GCC in 541.leela_r), while being slower in three benchmarks (up to 6.51% slower than GCC in 510.parest_r). Current LLVM and GCC are within 1% of each other in the other two benchmarks. Compared to the baseline of the project, GCC was faster in ten benchmarks (up to 26.83% in 508.namd_r) while being slower in only five.
Similarly, the following graph shows the improvements in the execution time of SPEC benchmarks from the start of the project (light blue bar) to today (dark blue bar) on the BPI-F3, but this time with the RVA22U64+V profile, i.e., the RVA22U64 plus the vector extension (V) enabled. Again, GCC results are included (orange bar), and the graph shows all improvements gained during the project.
The graph is sorted by the execution time improvements brought by the new scheduling model. The results for RVA22U64+V follow a similar trend, and we see improvements in almost all benchmarks, from 4.91% in 500.perlbench_r to (again) a considerable 25.26% improvement in 508.namd_r. Similar to the RVA22U64 results, we see a couple of regressions: 510.parest_r (-3.74%) and 523.xalancbmk_r (-6.01%). As with the RVA22U64 results, 523.xalancbmk_r and 510.parest_r vary greatly in daily tests on RVA22U64+V, so these regressions are likely noise. Four benchmarks are within 1% of previous results, so we assume there was no impact on their execution time.
When compared to GCC, LLVM today is faster in 10 out of the 16 tested benchmarks (up to 23.76% faster than GCC in 557.xz_r), while being slower in three benchmarks (up to 5.58% slower in 538.imagick_r). LLVM today and GCC are within 1-2% of each other in the other three benchmarks. Compared to the baseline of the project, GCC was faster in eight benchmarks (up to 25.73% in 508.namd_r) while being slower in five.
Over the past eight months, our efforts have concentrated on several key areas within the LLVM compiler infrastructure to specifically target and improve the efficiency of RISC-V code generation. These contributions have involved delving into various stages of the compilation process, from instruction selection to instruction scheduling. Here, we'll focus on three major areas where substantial progress has been made:
Introducing a scheduling model for the hardware used for benchmarking (SpacemiT-X60): LLVM had no scheduling model for the SpacemiT-X60, leading to pessimistic and inefficient code generation. We added a model tailored to the X60’s pipeline, allowing LLVM to better schedule instructions and improve performance. Longer term, a more generic in-order model could be introduced in LLVM to help other RISC-V targets that currently lack scheduling information, similar to how it’s already done for other targets, e.g., AArch64. This contribution alone brings up to 15.76% improvement on the execution time of SPEC benchmarks.
Improved Vectorization Efficiency: LLVM’s SLP vectorizer used to skip over entire basic blocks when calculating spill costs, leading to inaccurate estimations and suboptimal vectorization when functions were present in the skipped blocks. We addressed this by improving the backward traversal to consider all relevant blocks, ensuring spill costs were properly accounted for. The final solution, contributed by the SLP Vectorizer maintainer, was to fix the issue without impacting compile times, unlocking better vectorization decisions and performance. This contribution brings up to 11.87% improvement on the execution time of SPEC benchmarks.
Register Allocation with IPRA Support: enabling Inter-Procedural Register Allocation (IPRA) to the RISC-V backend. IPRA reduces save/restore overhead across function calls by tracking which registers are used. In the RISC-V backend, supporting IPRA required implementing a hook to report callee-saved registers and prevent miscompilation. This contribution brings up to 3.42% improvement on the execution time of SPEC benchmarks.
The biggest contribution so far is the scheduler modeling tailored for the SpacemiT-X60. This scheduler is integrated into LLVM's backend and is designed to optimize instruction ordering based on the specific characteristics of the X60 CPU.
The scheduler was introduced in PR 137343. It includes detailed scheduling models that account for the X60's pipeline structure, instruction latencies for all scalar instructions, and resource constraints. The current scheduler model does not include latencies for vector instructions, but it is a planned future work. By providing LLVM with accurate information about the target architecture, the scheduler enables more efficient instruction scheduling, reducing pipeline stalls and improving overall execution performance.
The graph is sorted by the execution time improvements brought by the new scheduling model. The introduction of a dedicated scheduler yielded substantial performance gains. Execution time improvements were observed across several benchmarks, ranging from 1.04% in 541.leela_r to 15.76% in 525.x264_r.
Additionally, the scheduler brings significant benefits even when vector extensions are enabled, as shown above. The graph is sorted by the execution time improvements brought by the new scheduling model. Execution time improvements range from 3.66% in 544.nab_r to 15.58% in 508.namd_r, with notable code size reductions as well, e.g., a 6.47% improvement in 519.lbm_r (due to decreased register spilling).
Finally, the previous graph shows the comparison between RVA22U64 and RVA22U64+V, both with the X60 scheduling model enabled. The only difference is 525.x264_r: it is 17.48% faster on RVA22U64+V.
A key takeaway from these results is the critical importance of scheduling for in-order processors like the SpacemiT-X60. The new scheduler effectively closed the performance gap between the scalar (RVA22U64) and vector (RVA22U64+V) configurations, with the vector configuration now outperforming only in a single benchmark (525.x264_r). On out-of-order processors, the impact of scheduling would likely be smaller, and vectorization would be expected to deliver more noticeable gains.
SLP Vectorizer Spill Cost Fix + DAG Combiner Tuning #
One surprising outcome in early benchmarking was that scalar code sometimes outperformed vectorized code, despite RISC-V vector support being available. This result prompted a detailed investigation.
Using profiling data, we noticed increased cycle counts around loads and stores in vectorized functions; the extra cycles were due to register spilling, particularly around function call boundaries. Digging further, we found that the SLP Vectorizer was aggressively vectorizing regions without properly accounting for the cost of spilling vector registers across calls.
To understand how spill cost miscalculations led to poor vectorization decisions, consider this simplified function, and its graph representation:
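(The original listing isn't reproduced here; the following hand-written LLVM IR sketch, with invented names and types, matches the shape being described.)
; Sketch: two loads in the entry block, a call to @g() on both branches,
; and two stores in the join block.
define void @f(ptr %p, ptr %q, i1 %c) {
entry:
  %a = load double, ptr %p
  %p1 = getelementptr inbounds double, ptr %p, i64 1
  %b = load double, ptr %p1
  br i1 %c, label %foo, label %bar
foo:
  call void @g()
  br label %baz
bar:
  call void @g()
  br label %baz
baz:
  store double %a, ptr %q
  %q1 = getelementptr inbounds double, ptr %q, i64 1
  store double %b, ptr %q1
  ret void
}
declare void @g()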
This function loads two values from %p, conditionally calls @g() (in both foo and bar), and finally stores the values to %q. Previously, the SLP vectorizer only analyzed the entry and baz blocks, ignoring foo and bar entirely. As a result, it missed the fact that both branches contain a call, which increases the cost of spilling vector registers. This led LLVM to vectorize loads and stores here, introducing unprofitable spills across the calls to @g().
To address the issue, we first proposed PR 128620, which modified the SLP vectorizer to properly walk through all basic blocks when analyzing cost. This allowed the SLP vectorizer to correctly factor in function calls and estimate the spill overhead more accurately.
The results were promising: execution time dropped by 9.92% in 544.nab_r, and code size improved by 1.73% in 508.namd_r. However, the patch also increased compile time in some cases (e.g., +6.9% in 502.gcc_r), making it unsuitable for upstream merging.
Following discussions with the community, Alexey Bataev (SLP Vectorizer code owner) proposed a refined solution in PR 129258. His patch achieved the same performance improvements without any measurable compile-time overhead and was subsequently merged.
The graph shows execution time improvements from Alexey’s patch, ranging from 1.49% in 500.perlbench_r to 11.87% in 544.nab_r. Code size also improved modestly, with a 2.20% reduction in 508.namd_r.
RVA22U64 results are not shown since this is an optimization tailored to prevent the spill of vectors. Scalar code was not affected by this change.
Finally, PR 130430 addressed the same issue in the DAG Combiner by preventing stores from being merged across call boundaries. While this change had minimal impact on performance in the current benchmarks, it improves code correctness and consistency and may benefit other workloads in the future.
IPRA (Inter-Procedural Register Allocation) Support #
Inter-Procedural Register Allocation (IPRA) is a compiler optimization technique that aims to reduce the overhead of saving and restoring registers across function calls. By analyzing the entire program, IPRA determines which registers are used across function boundaries, allowing the compiler to avoid unnecessary save/restore operations.
In the context of the RISC-V backend in LLVM, enabling IPRA required implementing a hook in LLVM. This hook informs the compiler that callee-saved registers should always be saved in a function, ensuring that critical registers like the return address register (ra) are correctly preserved. Without this hook, enabling IPRA would lead to miscompilation issues, e.g., 508.namd_r would never finish running (probably stuck in an infinite loop).
To understand how IPRA works, consider the following program before IPRA. Let’s assume function foo uses s0 but doesn't touch s1:
# Function bar calls foo and conservatively saves all callee-saved registers.
bar:
addi sp, sp, -32
sd ra, 16(sp) # Save return address (missing before our PR)
sd s0, 8(sp)
sd s1, 0(sp) # Unnecessary spill (foo won't clobber s1)
call foo
ld s1, 0(sp) # Wasted reload
ld s0, 8(sp)
ld ra, 16(sp)
addi sp, sp, 32
ret
After IPRA (optimized spills):
# bar now knows foo preserves s1: no s1 spill/reload.
bar:
addi sp, sp, -16
sd ra, 8(sp) # Save return address (missing before our PR)
sd s0, 0(sp)
call foo
ld s0, 0(sp)
ld ra, 8(sp)
addi sp, sp, 16
ret
By enabling IPRA for RISC-V, we eliminated redundant spills and reloads of callee-saved registers across function boundaries. In our example, IPRA reduced stack usage and cut unnecessary memory accesses. Crucially, the optimization maintains correctness: preserving the return address (ra) while pruning spills for registers like s1 when provably unused. Other architectures like x86 already support IPRA in LLVM, and we enabled IPRA for RISC-V in PR 125586.
IPRA is not enabled by default due to a bug, described in issue 119556; however, it does not affect the SPEC benchmarks.
The graph shows the improvements achieved by this transformation alone, using the RVA22U64 profile. There were execution time improvements ranging from 1.57% in 505.mcf_r to 3.16% in 519.lbm_r.
The graph shows the improvements achieved by this transformation alone, using the RVA22U64+V profile. We see similar gains, with execution time improvements of 1.14% in 505.mcf_r and 3.42% in 531.deepsjeng_r.
While we initially looked at code size impact, the improvements were marginal. Given that save/restore sequences tend to be a small fraction of total size, this isn't surprising, and code size wasn't the main goal of this optimization.
Setting Up Reliable Performance Testing. A key part of this project was being able to measure the impact of our changes consistently and meaningfully. For that, we used LNT, LLVM’s performance testing tool, to automate test builds, runs, and result comparisons. Once set up, LNT allowed us to identify regressions early, track improvements over time, and visualize the impact of each patch through clear graphs.
Reducing Noise on the BPI-F3. Benchmarking is noisy by default, and it took considerable effort to reduce variability between runs. These steps helped:
Disabling ASLR: To ensure a more deterministic memory layout (see the sketch after this list).
Running one benchmark at a time on the same core: This helped eliminate cross-run contention and improved result consistency.
Multiple samples per benchmark: We collected 3 samples to compute statistical confidence and reduce the impact of outliers.
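As a concrete sketch of the first two measures (the core number and binary name are placeholders):
# Disable ASLR for the benchmarking session (resets on reboot).
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
# Pin a benchmark run to a single core.
taskset -c 4 ./benchmark_binary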
These measures significantly reduced noise, allowing us to detect even small performance changes with confidence.
Interpreting Results and Debugging Regressions. Another challenge was interpreting performance regressions or unexpected results. Often, regressions weren't caused by the patch under test, but by unrelated interactions with the backend. This required:
Cross-checking disassembly between runs.
Profiling with hardware counters (e.g., using perf).
Identifying missed optimization opportunities due to incorrect cost models or spill decisions.
Comparing scalar vs vector codegen and spotting unnecessary spills or register pressure.
My colleague Luke Lau also set up a centralized LNT instance that runs nightly tests. This made it easy to detect and track performance regressions (or gains) shortly after new commits landed. When regressions did appear, we could use the profiles and disassembly generated by LNT to narrow down which functions were affected, and why.
Using llvm-exegesis (sort of). At the start of the project, llvm-exegesis, the tool LLVM provides to measure instruction latencies and throughput, didn’t support RISC-V at all. Over time, support was added incrementally across three patches: first for basic arithmetic instructions, then load instructions, and eventually vector instructions. This made it a lot more viable as a tool for microarchitectural analysis on RISC-V. However, despite this progress, we ultimately didn’t use llvm-exegesis to collect the latency data for our scheduling model. The results were too noisy, and we needed more control over how measurements were gathered. Instead, we developed an internal tool to generate the latency data, something we plan to share in the future.
Notable Contributions Without Immediate Benchmark Impact. While some patches may not have led to significant performance improvements in benchmarks, they were crucial for enhancing the RISC-V backend's robustness and maintainability:
Improved Vector Handling in matchSplatAsGather (PR #117878): This patch updated the matchSplatAsGather function to handle vectors of different sizes, enhancing code generation for @llvm.experimental.vector.match on RISC-V.
Addition of FMA Cost Model (PRs #125683 and #126076): These patches extended the cost model to cover the FMA instruction, ensuring accurate cost estimations for fused multiply-add operations.
Generalization of vp_fneg Cost Model (PR #126915): This change moved the cost model for vp_fneg from the RISC-V-specific implementation to the generic Target Transform Info (TTI) layer, promoting consistent handling across different targets.
Late Conditional Branch Optimization for RISC-V (PR #133256): Introduced a late RISC-V-specific optimization pass that replaces conditional branches with unconditional ones when the condition can be statically evaluated. This creates opportunities for further branch folding and cleanup later in the pipeline. While performance impact was limited in current benchmarks, it lays the foundation for smarter late-stage CFG optimizations.
These contributions, while not directly impacting benchmark results, laid the groundwork for future improvements.
This project significantly improved the performance of the RISC-V backend in LLVM through a combination of targeted optimizations, infrastructure improvements, and upstream contributions. We tackled key issues in vectorization, register allocation, and scheduling, demonstrating that careful backend tuning can yield substantial real-world benefits, especially on in-order cores like the SpacemiT-X60.
Future Work:
Vector latency modeling: The current scheduling model lacks accurate latencies for vector instructions.
Further scheduling model fine-tuning: This would impact the largest number of users and would align RISC-V with other targets in LLVM.
Improve vectorization: The similar performance between scalar and vectorized code suggests we are not fully exploiting vectorization opportunities. Deeper analysis might uncover missed cases or necessary model tuning.
Improvements to DAGCombine: after PR 130430, Philip Reames created issue 132787 with ideas to improve the store merging code.
This work was made possible thanks to support from RISE, under Research Project RP009. I would like to thank my colleagues Luke Lau and Alex Bradbury for their ongoing technical collaboration and insight throughout the project. I’m also grateful to Philip Reames from Rivos for his guidance and feedback. Finally, a sincere thank you to all the reviewers in the LLVM community who took the time to review, discuss, and help shape the patches that made these improvements possible.
Recent presentations at BlinkOn strike some familiar notes. It seems a common theme: ideas come back.
Since I joined Igalia in 2019, I don't think I've missed a BlinkOn. This year, however, there was a conflict with the W3C AC meetings and we felt that it was more useful that I attend those, since Igalia already had a sizable contingent at BlinkOn itself and my Web History talk with Chris Lilley was pre-recorded.
When I returned, and videos of the event began landing, I was keen to see what people talked about. There were lots of interesting talks, but one jumped out at me right away: Bramus gave one called "CSS Parser Extensions", which I wasn't familiar with, so I was keen to see it. It turns out it was just the very beginnings of his exploring ideas to make CSS polyfillable.
This talk made me sit up and pay attention because, actually, it's really how I came to be involved in standards. It's the thing that started a lot of the conversations that eventually became the Extensible Web Community Group and the Extensible Web Manifesto, and ultimately Houdini, a joint Task Force of the W3C TAG and CSS Working Group (in fact, I am also the one who proposed the name ✨). In his talk, he hit on many of the same notes that led me there too.
Polyfills are really interesting when you step back and look at them. They can be used to make standards development, feedback, and rollout so much better. But CSS has historically been almost hostile to that approach, because it just throws away anything it doesn't understand. That means if you want to polyfill something you've got to re-implement lots of stuff that the browser already does: you've got to re-fetch the stylesheet (if you can!) as text, then bring your own parser to parse it, and then... well, you still can't actually realistically implement many things.
But what if you could?
Houdini has stalled. In my mind, this is mainly due to when it happened and what it chose to focus on shipping first. One of the first things that we all agreed to in the first Houdini meeting was that we would expose the parser. This is true for all of the reasons Bramus discussed, and more. But that effort got hung up on the fact that there was a sense we first needed a typed OM. I'm not sure how true that really is. Other cool Houdini things were, I think, also hung up on lots of things that were being reworked at the time, and on resource competition. But I think that the thing that really killed it was just what shipped first. It was not something that might be really useful for polyfilling, like custom functions or custom media queries or custom pseudo-classes, or, very ambitiously, something like custom layouts --- but custom paint. The CSS Working Group doesn't publish a lot of new "paints". There are approximately 0 named background images; there's no background-image: checkerboard;, for example. But the working group does publish lots of those other things, like functions or pseudo-classes. See what I mean? Those other things were part of the real vision - they can be used to make cow paths. Or they can be used to show that, actually, nobody wants that cow path. Or, if it isn't wanted, it can instead rapidly inspire better solutions.
Anyway, the real challenge with most polyfills is performance. Any time we're going to step out of "60 fps scrollers" into JS land, that's iffy... But it's not impossible, and if we're honest, our current attempts at polyfilling are definitely worse than something closer to native. With effort, surely we can at least improve things by looking at where there are some nice "joints" where we can cleave the problem.
This is why in recent years I've suggested that perhaps what would really benefit us is a few custom things (like functions) and then just enabling CSS-like languages, which can handle the fetch/parse problems and perhaps give us some of the most basic primitive ideas.
So, where will all of this go? Who knows - but I'm glad some others are interested and talking about some of it again.
This blog post might interest you if you want to try the bleeding-edge NVK driver, which allows decoding H.264/5 video with the power of the Vulkan extensions VK_KHR_video_decode_h26x.
This is a summary of the instructions provided in the MR. This work needs a recent kernel with new features, so I'll describe the steps to add this feature and build this new kernel on an Ubuntu-based system.
To run the NVK driver, you need a custom patch applied on top of the Nouveau driver. This patch applies to kernels 6.12 and newer, so you'll need to build a new kernel unless your distribution already ships a bleeding-edge one (which I doubt). Here is the method I used to build this kernel.
The next step will be to configure the kernel. The best option I can recommend is to copy the kernel config your distribution ships with. On Ubuntu you can find it in /boot, with a name like config-6.8.0-52-generic.
$ cp /boot/config-6.8.0-52-generic .config
Then, to get the default config your kernel will use, including the specific options coming with Ubuntu, you'll have to run:
$ make defconfig
This will set up the build and make it ready to compile with this version of the kernel, auto-configuring the new features.
The two options CONFIG_SYSTEM_TRUSTED_KEYS and CONFIG_SYSTEM_REVOCATION_KEYS must be disabled to avoid compilation errors with missing certificates. For that you can change them within menuconfig, or edit .config and set these values to "".
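Alternatively, the kernel's scripts/config helper can clear them from the command line (a sketch, run from the top of the kernel source tree):
./scripts/config --set-str SYSTEM_TRUSTED_KEYS ""
./scripts/config --set-str SYSTEM_REVOCATION_KEYS ""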
Then you should be ready to go for a break ☕ (short or long, depending on your machine) while it cooks the brand new kernel, Debian-packaged and ready to use:
$ make clean
$ make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-custom
The process should end up with a new package named linux-image-6.12.8-custom_6.12.8-3_amd64.deb in the parent folder, which can then be installed alongside your previous kernel.
The linux-image package will replace your current default menu entry in GRUB upon installation. This means that if you install it, the next time you reboot you'll boot into that kernel.
Mesa depends on various system packages, in addition to Python modules and the Rust toolchain. So first we'll have to install the required packages, which are all present in Ubuntu 24.04:
Now that the kernel and the Mesa driver have been built and are available for your machine, you should be able to decode your first h264 stream with the NVK driver.
As you might have used the Nvidia driver first, installed with your regular kernel, you might hit a weird error when invoking vulkaninfo, such as:
ERROR: [Loader Message] Code 0 : setup_loader_term_phys_devs: Failed to detect any valid GPUs in the current config
ERROR at ./vulkaninfo/./vulkaninfo.h:247:vkEnumeratePhysicalDevices failed with ERROR_INITIALIZATION_FAILED
Indeed, the nouveau driver cannot live alongside the Nvidia driver, so you'll have to uninstall the Nvidia driver first to be able to use nouveau properly along with the Vulkan extensions.
Another solution is to boot into your new custom kernel and modify the file /etc/modprobe.d/nvidia-installer-disable-nouveau.conf to get something like:
# generated by nvidia-installer
#blacklist nouveau
options nouveau modeset=1
In that case the modeset=1 option will enable the driver and allow you to use it.
Then you'll have to reboot with this new configuration.
As you may have noticed, during the configure stage we chose to install the artifacts of the build in a folder named mesa/builddir/install.
Here is a script, run-nvk.sh, which you can use before calling any binary; it uses this folder as a base to set the environment variables dedicated to the NVK Vulkan driver.
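The script itself isn't reproduced here; a minimal sketch of what it needs to do could look like this (the install path and the ICD file name are assumptions to check against your build):
#!/bin/sh
# run-nvk.sh (sketch): point the Vulkan loader at the locally built NVK driver.
MESA_INSTALL="$HOME/mesa/builddir/install"
export LD_LIBRARY_PATH="$MESA_INSTALL/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH"
export VK_DRIVER_FILES="$MESA_INSTALL/share/vulkan/icd.d/nouveau_icd.x86_64.json"
exec "$@"
It can then be used as a prefix, e.g. ./run-nvk.sh vulkaninfo.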
Now it's time to run a real application exploiting the power of Vulkan to decode multimedia content. For that I'll recommend using GStreamer, which ships with Vulkan decoding elements in version 1.24.2, as bundled in Ubuntu 24.04.
First of all, you'll have to install the Ubuntu packages for GStreamer.
If you can see this list of elements, you should be able to run a GStreamer pipeline with the Vulkan Video extensions. Here is a pipeline to decode a piece of content:
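As a sketch (the file name is a placeholder; vulkanh264dec and vulkandownload are the Vulkan elements as shipped in 1.24, and extra conversion elements may be needed depending on your setup):
./run-nvk.sh gst-launch-1.0 filesrc location=video.mp4 ! qtdemux ! h264parse ! \
    vulkanh264dec ! vulkandownload ! videoconvert ! autovideosink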
Today, some more words on memory management, on the practicalities of a
system with conservatively-traced references.
The context is that I have finally started banging
Whippet into
Guile, initially in a configuration that
continues to use the conservative Boehm-Demers-Weiser (BDW) collector
behind the scenes. In that way I can incrementally migrate over all of
the uses of the BDW API in Guile to use Whippet API instead, and then if
all goes well, I should be able to switch Whippet to use another GC
algorithm, probably the mostly-marking collector
(MMC).
MMC scales better than BDW for multithreaded mutators, and it can
eliminate fragmentation via Immix-inspired optimistic evacuation.
problem statement: how to manage ambiguous edges
A garbage-collected heap consists of memory, which is a set of
addressable locations. An object is a disjoint part of a heap, and is
the unit of allocation. A field is memory within an object that may
refer to another object by address. Objects are nodes in a directed graph in
which each edge is a field containing an object reference. A root is an
edge into the heap from outside. Garbage collection reclaims memory from objects that are not reachable from the graph
that starts from a set of roots. Reclaimed memory is available for new
allocations.
In the course of its work, a collector may want to relocate an object,
moving it to a different part of the heap. The collector can do so if
it can update all edges that refer to the object to instead refer to its
new location. Usually a collector arranges things so all edges have the
same representation, for example an aligned word in memory; updating an
edge means replacing the word’s value with the new address. Relocating
objects can improve locality and reduce fragmentation, so it is a good
technique to have available. (Sometimes we say evacuate, move, or compact
instead of relocate; it’s all the same.)
Some collectors allow ambiguous edges: words in memory whose value
may be the address of an object, or might just be scalar data.
Ambiguous edges usually come about if a compiler doesn’t precisely
record which stack locations or registers contain GC-managed objects.
Such ambiguous edges must be traced conservatively: the collector adds
the object to its idea of the set of live objects, as if the edge were a
real reference. This tracing mode isn’t supported by all collectors.
Any object that might be the target of an ambiguous edge cannot be
relocated by the collector; a collector that allows conservative edges
cannot rely on relocation as part of its reclamation strategy.
Still, if the collector can know that a given object will not be the referent
of an ambiguous edge, relocating it is possible.
How can one know that an object is not the target of an ambiguous edge?
We have to partition the heap somehow into
possibly-conservatively-referenced and
definitely-not-conservatively-referenced. The two ways that I know to
do this are spatially and temporally.
Spatial partitioning means that regardless of the set of root and
intra-heap edges, there are some objects that will never be
conservatively referenced. This might be the case for a type of object
that is “internal” to a language implementation; third-party users that
may lack the discipline to precisely track roots might not be exposed to
objects of a given kind. Still, link-time optimization tends to weather
these boundaries, so I don’t see it as being too reliable over time.
Temporal partitioning is more robust: if all ambiguous references come
from roots, then if one traces roots before intra-heap edges, then any
object not referenced after the roots-tracing phase is available for
relocation.
kinds of ambiguous edges in guile
So let’s talk about Guile! Guile uses BDW currently, which considers
edges to be ambiguous by default. However, given that objects carry
type tags, Guile can, with relatively little effort, switch to precisely
tracing most edges. “Most”, however, is not sufficient; to allow for
relocation, we need to eliminate intra-heap ambiguous edges, to
confine conservative tracing to the roots-tracing phase.
Conservatively tracing references from C stacks or even from static data
sections is not a problem: these are roots, so, fine.
Guile currently traces Scheme stacks almost-precisely: its compiler
emits stack maps for every call site, which uses liveness analysis to
only mark those slots that are Scheme values that will be used in the
continuation. However it’s possible that any given frame is marked
conservatively. The most common case is when using the BDW collector
and a thread is pre-empted by a signal; then its most recent stack frame
is likely not at a safepoint and indeed is likely undefined in terms of
Guile’s VM. It can also happen if there is a call site within a VM
operation, for example to a builtin procedure, if it throws an exception
and recurses, or causes GC itself. Also, when per-instruction
traps
are enabled, we can run Scheme between any two Guile VM operations.
So, Guile could change to trace Scheme stacks fully precisely, but this
is a lot of work; in the short term we will probably just trace Scheme
stacks as roots instead of during the main trace.
However, there is one more significant source of ambiguous roots, and
that is reified continuation objects. Unlike active stacks, these have
to be discovered during a trace and cannot be partitioned out to the
root phase. For delimited continuations, these consist of a slice of
the Scheme stack. Traversing a stack slice precisely is less
problematic than for active stacks, because it isn’t in motion, and it
is captured at a known point; but we will have to deal with stack frames
that are pre-empted in unexpected locations due to exceptions within
builtins. If a stack map is missing, probably the solution there is to
reconstruct one using local flow analysis over the bytecode of the stack
frame’s function; time-consuming, but it should be robust as we do it
elsewhere.
Undelimited continuations (those captured by call/cc) contain a slice
of the C stack also, for historical reasons, and there we can’t trace it
precisely at all. Therefore either we disable relocation if there are
any live undelimited continuation objects, or we eagerly pin any object
referred to by a freshly captured stack slice.
fin
If you want to follow along with the Whippet-in-Guile work, see the
wip-whippet
branch in Git. I’ve bumped its version to 4.0 because, well, why the
hell not; if it works, it will certainly be worth it. Until next time,
happy hacking!
Some short thoughts on recent antitrust and the future of the web platform...
Last week, in a 115-page US antitrust ruling, a federal judge in Virginia found that Google had two more monopolies, this time in relation to advertising technologies. Previously, you'll recall, we had rulings related to search. There are still more open cases related to Android. And it's not only in the US that similar actions are playing out.
All of these cases kind of mention one another because the problems themselves are all deeply intertwined - but this one is really at the heart of it: That sweet, sweet ad money. I think that you could argue, reasonably, that pretty much everything else was somehow in service of that.
Initially, they made a ton of money showing ads every time someone searches, and pretty quickly signed a default search deal with Mozilla to drive the number of searches way up.
Why make a browser of your own? To drive the searches that show the ads, but also keep more of the money.
Why make OSes of your own, and deals around things that need to be installed? To guarantee that all of those devices drive the searches to show the ads.
And so on...
For a long time now, I've been trying to discuss what, to me, is a rather worrying problem: That those default search dollars are, in the end, what funds the whole web ecosystem. Don't forget that it's not just about the web itself, it's about the platform which provides the underlying technology for just about everything else at this point too.
Between years of blog posts, a podcast series, several talks, and experiments like Open Prioritization, I have been thinking about this a lot. Untangling it all is going to be quite complex.
In the US, the government's proposed remedies touch just about every part of this. I've been trying to think about how I could sum up my feelings and concerns, but it's quite complex. Then, the other day, an article on Ars Technica contained an illustration which seemed pretty perfect.
A "game" board that looks like the game Operation, but instead of pieces of internal anatomy there are logos for chrome, gmail, ads, adsense, android and on the board it says "Monoperation: Skill game where you are the DOJ" and the person is removing chrome, and a buzzer is going ff
This image (credited to Aurich Lawson) kind of hit the nail on the head for me: I deeply hope they will be absolutely surgical about this intervention, because the patient I'm worried about isn't Google, it's the whole Web Platform.
If this is interesting to you, my colleague Eric Meyer and I posted an Igalia Chats podcast episode on the topic: Adpocalypse Now?
Notes on AI for Mathematics and Theoretical Computer Science
In April 2025 I had the pleasure to attend an intense week-long workshop at the Simons Institute for the Theory of Computing entitled AI for Mathematics and Theoretical Computer Science. The event was organized jointly with the Simons Laufer Mathematical Sciences Institute (SLMath, for short). It was an intense time (five fully-packed days!) for learning a lot about cutting-edge ideas in this intersection of formal mathematics (primarily in Lean), AI, and powerful techniques for solving mathematical problems, such as SAT solvers and decision procedures (e.g., the Walnut system). Videos of the talks (but not of the training sessions) have been made available.
Every day, several dozen people were in attendance. Judging from the array of unclaimed badges (easily another several dozen), quite a lot more had signed up for the event but didn't come for one reason or another. It was inspiring to be in the room with so many people involved in these ideas. The training sessions in the afternoon had a great vibe, since so many people were learning and working together simultaneously.
It was great to connect with a number of people, of all stripes. Most of the presenters and attendees were coming from academia, with a minority, such as myself, coming from industry.
The organization was fantastic. We had talks in the morning and training in the afternoon. The final talk in the morning, before lunch, was an introduction to the afternoon training. The training topics were:
The links above point to the tutorial git repos for following along at home.
In the open discussion on the final afternoon, I raised my hand and outed myself as someone coming to the workshop from an industry perspective. Although I had already met a few people in industry prior to Friday, I was able to meet even more by raising my hand and inviting fellow practitioners to discuss things.
The talks were fascinating; the selection of speakers and topics was excellent. Go ahead and take a look at the list of videos, pick out one or two of interest, grab a beverage of your choice, and enjoy.
With the release of GStreamer 1.26, we now have playback support for Versatile Video Coding (VVC/H.266). In this post, I’ll describe the pieces of the puzzle that enable this, the contributions that led to it, and hopefully provide a useful guideline to adding a new video codec in GStreamer.
With GStreamer 1.26 and the relevant plugins enabled, one can play multimedia files containing VVC content, for example, by using gst-play-1.0:
gst-play-1.0 vvc.mp4
By using gst-play-1.0, a pipeline using playbin3 will be created and the appropriate elements will be auto-plugged to decode and present the VVC content. Here’s what such a pipeline looks like:
Although the pipeline is quite large, the specific bits we’ll focus on in this blog are inside parsebin and decodebin3:
qtdemux → ... → h266parse → ... → avdec_h266
I’ll explain what each of those elements is doing in the next sections.
To store multiple kinds of media (e.g. video, audio and captions) in a way that keeps them synchronized, we typically make use of container formats. This process is usually called muxing, and in order to play back the file we perform de-muxing, which separates the streams again. That is what the qtdemux element is doing in the pipeline above, by extracting the audio and video streams from the input MP4 file and exposing them as the audio_0 and video_0 pads.
Support for muxing and demuxing VVC streams in container formats was added to:
qtmux and qtdemux: for ISOBMFF/QuickTime/MP4 files (often saved with the .mp4 extension)
mpegtsmux and tsdemux: for MPEG transport stream (MPEG-TS) files (often saved with the .ts extension)
Besides the fact that the demuxers are used for playback, by also adding support to VVC in the muxer elements we are then also able to perform remuxing: changing the container format without transcoding the underlying streams.
Some examples of simplified re-muxing pipelines (only taking into account the VVC video stream):
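For instance (file names are placeholders; depending on the source, extra caps or parsers may be needed):
# MP4 -> MPEG-TS
gst-launch-1.0 filesrc location=in.mp4 ! qtdemux ! h266parse ! mpegtsmux ! filesink location=out.ts
# Matroska -> MPEG-TS
gst-launch-1.0 filesrc location=in.mkv ! matroskademux ! h266parse ! mpegtsmux ! filesink location=out.ts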
But why do we need h266parse when re-muxing from Matroska to MPEG-TS? That’s what I’ll explain in the next section.
Parsing and converting between VVC bitstream formats #
Video codecs like H.264, H.265, H.266 and AV1 may have different stream formats, depending on which container format is used to transport them. For VVC specifically, there are two main variants, as shown in the caps for h266parse:
Pad Templates:
  SINK template: 'sink'
    Availability: Always
    Capabilities:
      video/x-h266
byte-stream or so-called Annex-B format (as in Annex B from the VVC specification): it separates the NAL units by start code prefixes (0x000001 or 0x00000001), and is the format used in MPEG-TS, or also when storing VVC bitstreams in files without containers (so-called “raw bitstream files”).
ℹ️ Note: It’s also possible to play raw VVC bitstream files with gst-play-1.0. That is achieved by the typefind element detecting the input file as VVC and playbin taking care of auto-plugging the elements.
vvc1 and vvi1: those formats use length field prefixes before each NAL unit. The difference between the two formats is the way that parameter sets (e.g. SPS, PPS, VPS NALs) are stored, and reflected in the codec_data field in GStreamer caps. For vvc1, the parameter sets are stored as container-level metadata, while vvi1 allows for the parameter sets to be stored also in the video bitstream.
The alignment field in the caps signals whether h266parse will collect multiple NALs into an Access Unit (AU) for a single GstBuffer, where an AU is the smallest unit for a decodable video frame, or whether each buffer will carry only one NAL.
That explains why we needed the h266parse when converting from MKV to MPEG-TS: it’s converting from vvc1/vvi1 to byte-stream! So the gst-launch-1.0 command with more explicit caps would be:
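Something along these lines (a sketch; the exact caps string depends on the input file):
gst-launch-1.0 filesrc location=in.mkv ! matroskademux ! h266parse ! \
    video/x-h266,stream-format=byte-stream,alignment=au ! mpegtsmux ! filesink location=out.ts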
FFmpeg 7.1 has a native VVC decoder which is considered stable. In GStreamer 1.26, we have allowlisted that decoder in gst-libav, and it is now exposed as the avdec_h266 element.
Intel has added the vah266dec element in GStreamer 1.26, which enables hardware-accelerated VVC decoding on Intel Lunar Lake CPUs. However, it still has a rank of 0 in GStreamer 1.26, so in order to test it out, one would need to, for example, manually set GST_PLUGIN_FEATURE_RANK.
Similar to h266parse, initially vah266dec was added with support for only the byte-stream format. I implemented support for the vvc1 and vvi1 modes in the base h266decoder class, which fixes the support for them in vah266dec as well. However, it hasn’t yet been merged and I don’t expect it to be backported to 1.26, so likely it will only be available in GStreamer 1.28.
Here’s a quick demo of vah266dec in action on an ASUS ExpertBook P5. In this screencast, I perform the following actions:
Run vainfo and display the presence of VVC decoding profile
gst-inspect vah266dec
export GST_PLUGIN_FEATURE_RANK='vah266dec:max'
Start playback of six simultaneous 4K@60 DASH VVC streams. The stream in question is the classic Tears of Steel, sourced from the DVB VVC test streams.
Run nvtop, showing GPU video decoding & CPU usage per process.
A tool that is handy for testing the new decoder elements is Fluster. It simplifies the process of testing decoder conformance and comparing decoders by using test suites that are adopted by the industry. It’s worth checking it out, and it’s already common practice to test new decoders with this test framework. I added the GStreamer VVC decoders to it: vvdec, avdec_h266 and vah266dec.
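As a rough idea of how a decoder can be exercised with it (the test-suite and decoder names below are assumptions; the real ones can be listed with ./fluster.py list):
./fluster.py list
./fluster.py run -ts JVET-VVC_draft6 -d GStreamer-H.266-Libav-Gst1.0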
We’re still missing the ability to encode VVC video in GStreamer. I have a work-in-progress branch that adds the vvenc element, by using VVenC and safe Rust bindings (similarly to the vvdec element), but it still needs some work. I intend to work on it during the GStreamer Spring Hackfest 2025 to make it ready to submit upstream 🤞
2025 was my first year at FOSDEM, and I can say it was an incredible experience
where I met many colleagues from Igalia who live around
the world, and also many friends from the Linux display stack who are part of
my daily work and contributions to DRM/KMS. In addition, I met new faces and recognized others with whom I had interacted on online forums, and we had good, long conversations.
During FOSDEM 2025 I had the opportunity to present
about kworkflow in the kernel devroom. Kworkflow is a
set of tools that help kernel developers with their routine tasks and it is the
tool I use for my development tasks. In short, every contribution I make to the
Linux kernel is assisted by kworkflow.
The goal of my presentation was to spread the word about kworkflow. I aimed to
show how the suite consolidates good practices and recommendations of the
kernel workflow in short commands. These commands are easily configurable and easy to memorize for your current work setup, or for your multiple setups.
For me, Kworkflow is a tool that accommodates the needs of different agents in
the Linux kernel community. Active developers and maintainers are the main
target audience for kworkflow, but it is also inviting for users and user-space
developers who just want to report a problem and validate a solution without
needing to know every detail of the kernel development workflow.
Something I didn't emphasize during the presentation, and would like to correct here, is that the main author and developer of kworkflow is my colleague at Igalia, Rodrigo Siqueira. To be honest, my contributions are mostly requesting and validating new features, fixing bugs, and sharing scripts to increase feature coverage.
So, the video and slide deck of my FOSDEM presentation are available for
download
here.
And, as usual, in this blog post you will find the script of the presentation and a more detailed explanation of the demo presented there.
Kworkflow at FOSDEM 2025: Speaker Notes and Demo
Hi, I'm Melissa, a GPU kernel driver developer at Igalia, and today I'll be giving a very inclusive talk about not letting your motivation go, by saving time with kworkflow.
So, you’re a kernel developer, or you want to be a kernel developer, or you
don’t want to be a kernel developer. But you’re all united by a single need:
you need to validate a custom kernel with just one change, and you need to
verify that it fixes or improves something in the kernel.
And that’s a given change for a given distribution, or for a given device, or
for a given subsystem…
Look at this diagram and try to figure out the number of subsystems and related work trees you may need to handle in the kernel.
So, whether you are a kernel developer or not, at some point you may come
across this type of situation:
There is a userspace developer who wants to report a kernel issue and says:
Oh, there is a problem in your driver that can only be reproduced by running this specific distribution.
And the kernel developer asks:
Oh, have you checked if this issue is still present in the latest kernel version of this branch?
But the userspace developer has never compiled and installed a custom kernel
before. So they have to read a lot of tutorials and kernel documentation to
create a kernel compilation and deployment script. Finally, the reporter
managed to compile and deploy a custom kernel and reports:
Sorry for the delay, this is the first time I have installed a custom kernel.
I am not sure if I did it right, but the issue is still present in the kernel
of the branch you pointed out.
And then the kernel developer needs to reproduce this issue on their side, but they have never worked with this distribution, so they just create yet another script, essentially the same one the reporter had already written.
What’s the problem of this situation? The problem is that you keep creating new
scripts!
Every time you change distribution, architecture, hardware, or project - even in the same company, the development setup may change when you switch to a different project - you create another script for your new kernel development workflow!
You know, you have a lot of babies, you have a collection of “my precious
scripts”, like Sméagol (Lord of the Rings) with the precious ring.
Instead of creating and accumulating scripts, save yourself time with
kworkflow. Here is a typical script that many of you may have. This is a
Raspberry Pi 4 script and contains everything you need to memorize to compile
and deploy a kernel on your Raspberry Pi 4.
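A hedged sketch of what such a script tends to look like (paths, host names and configuration values below are purely illustrative):
# One-off Raspberry Pi 4 build & deploy script of the kind kworkflow replaces
export ARCH=arm64
export CROSS_COMPILE=aarch64-linux-gnu-
make bcm2711_defconfig            # defconfig name as in the Raspberry Pi downstream tree
make -j"$(nproc)" Image modules dtbs
make INSTALL_MOD_PATH=./modules-staging modules_install
scp arch/arm64/boot/Image pi@raspberrypi:/boot/kernel-custom.img
scp arch/arm64/boot/dts/broadcom/*.dtb pi@raspberrypi:/boot/
rsync -a ./modules-staging/lib/modules/ pi@raspberrypi:/lib/modules/
ssh pi@raspberrypi "echo kernel=kernel-custom.img | sudo tee -a /boot/config.txt && sudo reboot"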
With kworkflow, you only need to memorize two commands, and those commands are not specific to the Raspberry Pi. They are the same commands for any architecture, kernel configuration, or target device.
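For example, a minimal sketch using the same commands that appear in the demo later in this post:
kw b            # build the kernel for the currently selected target
kw d --reboot   # deploy it to the target device and reboot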
What is kworkflow?
Kworkflow is a collection of tools and software combined to:
Optimize Linux kernel development workflow.
Reduce time spent on repetitive tasks, since we are spending our lives
compiling kernels.
Standardize best practices.
Ensure reliable data exchange across the kernel workflow. For example, when two people describe the same setup but are not seeing the same thing, kworkflow can ensure that both are actually running the same kernel, with the same modules and options enabled.
I don’t know if you will get this analogy, but kworkflow is for me a megazord
of scripts. You are combining all of your scripts to create a very powerful
tool.
What are the main features of kworkflow?
There are many, but these are the most important for me:
Build & deploy custom kernels across devices & distros.
Handle cross-compilation seamlessly.
Manage multiple architectures, settings and target devices in the same work tree.
Organize kernel configuration files.
Facilitate remote debugging & code inspection.
Standardize Linux kernel patch submission guidelines. You don't need to double-check the documentation, nor does Greg need to tell you that you are not following the Linux kernel guidelines.
Upcoming: Interface to bookmark, apply and “reviewed-by” patches from
mailing lists (lore.kernel.org).
This is the list of commands you can run with kworkflow.
The first subset is to configure your tool for various situations you may face
in your daily tasks.
We have some tools to manage and interact with target machines.
# Manage and interact with target machines
kw ssh (s) - SSH support
kw remote (r) - Manage machines available via ssh
kw vm - QEMU support
To inspect and debug a kernel.
# Inspect and debug
kw device - Show basic hardware information
kw explore (e) - Explore string patterns in the work tree and git logs
kw debug - Linux kernel debug utilities
kw drm - Set of commands to work with DRM drivers
To automate best practices for patch submission, like code style, maintainers, and the correct list of recipients and mailing lists for a change, ensuring we send the patch to the people interested in it.
# Automate best practices for patch submission
kw codestyle (c) - Check code style
kw maintainers (m) - Get maintainers/mailing list
kw send-patch - Send patches via email
And the last one, the upcoming patch hub.
# Upcoming
kw patch-hub - Interact with patches (lore.kernel.org)
How can you save time with Kworkflow?
So how can you save time building and deploying a custom kernel?
First, you need a .config file.
Without kworkflow: You may be manually extracting and managing .config files from different targets and saving them with different suffixes to tie each kernel to its target device or distribution, or with any descriptive suffix that helps identify which is which, or even copying and pasting them from somewhere.
With kworkflow: you can use the kernel-config-manager command, or simply
kw k, to store, describe and retrieve a specific .config file very easily,
according to your current needs.
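A minimal sketch of how that might look (the option names are assumptions; check kw kernel-config-manager --help for the real ones):
kw k --save rpi4-config -d "Raspberry Pi 4, arm64, Raspbian"   # store the current .config with a description
kw k --list                                                    # list the stored configs
kw k --get rpi4-config                                         # retrieve it into the current worktree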
Then you want to build the kernel:
Without kworkflow: You are probably now memorizing a combination of
commands and options.
With kworkflow: you just need kw b (kw build) to build the kernel with
the correct settings for cross-compilation, compilation warnings, cflags,
etc. It also shows some information about the kernel, like the number of modules.
Finally, you want to deploy the kernel to a target machine.
Without kworkflow: You might be doing things like: SSH connecting to the
remote machine, copying and removing files according to distributions and
architecture, and manually updating the bootloader for the target distribution.
With kworkflow: you just need kw d, which does a lot of things for you, like: deploying the kernel, preparing the target machine for the new installation, listing available kernels and uninstalling them, creating a tarball, rebooting the machine after deploying the kernel, etc.
You can also save time on debugging kernels locally or remotely.
Without kworkflow: you SSH in, manually set up and enable traces, and copy & paste logs.
With kworkflow: you get more straightforward access to debug utilities: events, trace, dmesg.
You can save time on managing multiple kernel images in the same work tree.
Without kworkflow: you may be cloning the same repository multiple times so you don't lose compiled files when changing the kernel configuration or compilation options, and manually managing build and deployment scripts for each copy.
With kworkflow: you can use kw env to isolate multiple contexts in the
same worktree as environments, so you can keep different configurations in
the same worktree and switch between them easily without losing anything from
the last time you worked in a specific context.
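A minimal sketch of that flow (the environment names are just examples, and the --create option name is an assumption; --list and --use also appear in the demo below):
kw env --create RPI_64   # create an isolated environment in this worktree
kw env --list            # list the environments available here
kw env --use RPI_64      # switch to it without losing the state of the other environments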
Finally, you can save time when submitting kernel patches. In kworkflow, you
can find everything you need to wrap your changes in patch format and submit
them to the right list of recipients, those who can review, comment on, and
accept your changes.
This is a demo that the lead developer of the kw patch-hub feature sent me.
With this feature, you will be able to check out a series on a specific mailing
list, bookmark those patches in the kernel for validation, and when you are
satisfied with the proposed changes, you can automatically submit a reviewed-by
for that whole series to the mailing list.
Demo
Now a demo of how to use kw environments to deal with different devices, architectures and distributions in the same work tree without losing compiled files, build and deploy settings, the .config file, remote access configuration, and other settings specific to each of the three devices that I have.
Setup
Three devices:
laptop (debian|x86|intel|local)
SteamDeck (steamos|x86|amd|remote)
RaspberryPi 4 (raspbian|arm64|broadcomm|remote)
Goal: To validate a change on DRM/VKMS using a single kernel tree.
Kworkflow commands:
kw env
kw d
kw bd
kw device
kw debug
kw drm
Demo script
In the same terminal and worktree.
First target device: Laptop (debian|x86|intel|local)
$ kw env --list # list environments available in this work tree
$ kw env --use LOCAL # select the environment of local machine (laptop) to use: loading pre-compiled files, kernel and kworkflow settings.
$ kw device # show device information
$ sudo modinfo vkms # show VKMS module information before applying kernel changes.
$ <open VKMS file and change module info>
$ kw bd # compile and install kernel with the given change
$ sudo modinfo vkms # show VKMS module information after kernel changes.
$ git checkout -- drivers
Second target device: RaspberryPi 4 (raspbian|arm64|broadcomm|remote)
$ kw env --use RPI_64 # move to the environment for a different target device.
$ kw device # show device information and kernel image name
$ kw drm --gui-off-after-reboot # set the system to not load graphical layer after reboot
$ kw b # build the kernel with the VKMS change
$ kw d --reboot # deploy the custom kernel in a Raspberry Pi 4 with Raspbian 64, and reboot
$ kw s # connect with the target machine via ssh and check the kernel image name
$ exit
Third target device: SteamDeck (steamos|x86|amd|remote)
$ kw env --use STEAMDECK # move to the environment for a different target device
$ kw device # show device information
$ kw debug --dmesg --follow --history --cmd="modprobe vkms" # run a command and show the related dmesg output
$ kw debug --dmesg --follow --history --cmd="modprobe -r vkms" # run a command and show the related dmesg output
$ <add a printk with a random msg to appear on dmesg log>
$ kw bd # build and deploy the custom kernel to the target device
$ kw debug --dmesg --follow --history --cmd="modprobe vkms" # run a command and show the related dmesg output after build and deploy the kernel change
Q&A
Most of the questions raised at the end of the presentation were actually suggestions and requests for new kworkflow features.
The first participant, who is also a kernel maintainer, asked about two features: (1) automating the retrieval of patches from patchwork (or lore) and triggering the process of building, deploying and validating them using the existing workflow, and (2) bisecting support. Both are very interesting features. The first one fits well within the patch-hub subproject, which is under development, and I had actually made a similar request a couple of weeks before the talk. The second is an already existing request in the kworkflow GitHub project.
Another request was to use kexec and avoid rebooting the kernel for testing.
Reviewing my presentation, I realized I wasn't very clear that kworkflow doesn't support kexec. As I replied, what it does is install the modules so you can load/unload them for validation, but for built-in parts you need to reboot into the new kernel.
Another two questions: one about Android Debug Bridge (ADB) support instead of SSH, and another about support for alternative ways of booting when the custom kernel ends up broken and you only have a single kernel image on the device. Kworkflow doesn't manage this yet, but I agree it would be a very useful feature for embedded devices. On the Raspberry Pi 4, kworkflow mitigates this issue by preserving the distro kernel image and using the config.txt file to select a custom kernel for booting. There is no ADB support either, and since I don't currently see kw users working with Android, I don't think we will have it any time soon, unless we find new volunteers and increase the pool of contributors.
The last two questions were regarding the status of b4 integration, which is under development, and other debugging features that the tool doesn't support yet.
Finally, when Andrea and I were swapping places on the stage, he suggested adding support for virtme-ng to kworkflow. So I opened an issue to track this feature request in the project's GitHub.
With all these questions and requests, I could see the general need for a tool that integrates the variety of kernel developer workflows, as proposed by kworkflow. There are also still many cases left for kworkflow to cover. Despite the high demand, this is a completely voluntary project, and it is unlikely that we will be able to meet all these needs given our limited resources. We will keep trying our best, in the hope that we can also increase the pool of users and contributors.
In my previous post, when I introduced the switch to Skia for 2D rendering, I explained that we replaced Cairo with Skia while keeping mostly the same architecture. This alone was an important performance improvement, but the graphics implementation was still designed for Cairo and CPU rendering. Once we considered the switch to Skia stable, we started working on changes to take more advantage of Skia and GPU rendering to improve performance even further. In this post I'm going to present some of those improvements, and others not directly related to Skia and GPU rendering.
Explicit fence support
This is related to the DMA-BUF renderer used by the GTK port and WPE when using the new API. The composited buffer is shared as a DMA-BUF between the web and UI processes. Once the web process finished the composition we created a fence and waited for it, to make sure that when the UI process was notified that the composition was done the buffer was actually ready. This approach was safe, but slow. In 281640@main we introduced support for explicit fencing to the WPE port. When possible, an exportable fence is created, so that instead of waiting for it immediately, we export it as a file descriptor that is sent to the UI process as part of the message that notifies that a new frame has been composited. This unblocks the web process as soon as composition is done. When supported by the platform, for example in WPE under Wayland when the zwp_linux_explicit_synchronization_v1 protocol is available, the fence file descriptor is passed to the platform implementation. Otherwise, the UI process asynchronously waits for the fence by polling the file descriptor before passing the buffer to the platform. This is what we always do in the GTK port since 281744@main. This change improved the score of all MotionMark tests, see for example multiply.
Enable MSAA when available
In 282223@main we enabled support for MSAA when possible, in the WPE port only, because it is more important for embedded devices, where we use 4 samples to provide good enough quality with better performance. This change improved the MotionMark tests that use 2D canvas, like canvas arcs, paths and canvas lines. You can see here the change in paths when run on a Raspberry Pi 4 with 64-bit WPE.
Avoid texture copies in accelerated 2D canvas
As I also explained in the previous post, when 2D canvas is accelerated we now use a dedicated layer that renders into a texture that is copied to be passed to the compositor. In 283460@main we changed the implementation to use a CoordinatedPlatformLayerBufferNativeImage to handle the canvas texture and avoid the copy, directly passing the texture to the compositor. This improved the MotionMark tests that use 2D canvas. See canvas arcs, for example.
Introduce threaded GPU painting mode
In the initial implementation of the GPU rendering mode, layers were painted in the main thread. In 287060@main we moved the rendering task to a dedicated thread when using the GPU, with the same threaded rendering architecture we have always used for CPU rendering, but limited to 1 worker thread. This improved the performance of several MotionMark tests like images, suits and multiply. See images.
Update default GPU thread settings
Parallelization is not as important for GPU rendering as it is for CPU rendering, but we still realized that we got better results by slightly increasing the number of worker threads when doing GPU rendering. In 290781@main we increased the limit of GPU worker threads to 2 for systems with at least 4 CPU cores. This improved mainly images and suits in MotionMark. See suits.
Hybrid threaded CPU+GPU rendering mode
We used to have either GPU or CPU worker threads for layer rendering. In systems with 4 CPU cores or more we now have 2 GPU worker threads. When those 2 threads are busy rendering, why not use the CPU to render other pending tiles? The same applies when doing CPU rendering: when all the workers are busy, could we use the GPU to render other pending tasks? We tried it, and it turned out to be a good idea, especially on embedded devices. In 291106@main we introduced the hybrid mode, giving priority to GPU or CPU workers depending on the default rendering mode, and also taking into account special cases like HiDPI, where we are always scaling and therefore always prefer the GPU. This improved multiply, images and suits. See images.
Use Skia API for display list implementation
When rendering with Cairo and threaded rendering enabled, we use our own implementation of display lists specific to Cairo. When switching to Skia we thought it was a good idea to use the WebCore display list implementation instead, since it's a cross-platform implementation shared with other ports. But we realized this implementation is not yet ready to support multiple threads, because it holds references to WebCore objects that are not thread safe. The main thread might change those objects before they have been processed by the painting threads. So we decided to try the Skia API (SkPicture), which supports recording in the main thread and replaying from worker threads. In 292639@main we replaced the WebCore display list usage with SkPicture. This was expected to be a neutral change in terms of performance, but it surprisingly improved several MotionMark tests like leaves, multiply and suits. See leaves.
Use Damage to track the dirty region of GraphicsLayer
Every time there's a change in a GraphicsLayer and it needs to be repainted, it's notified and the area that changed is included, so that we only render the parts of the layer that changed. That's what we call the layer dirty region. When there are many small updates in a layer we can end up with lots of dirty regions on every layer flush. We used to have a limit of 32 dirty regions per layer, so that when more than 32 were added we just united them into the first dirty area. This limit was removed because we always unite the dirty areas for the same tiles when processing the updates to prepare the rendering tasks. However, we also tried to avoid handling the same dirty region twice, so every time a new dirty region was added we iterated the existing regions to check if it was already present. Without the 32 regions limit, that means we ended up iterating a potentially very long list on every dirty region addition. The damage propagation feature uses a Damage class to efficiently handle dirty regions, so we thought we could reuse it to track the layer dirty region, bringing back the limit but uniting the areas in a more efficient way than always using the first dirty area of the list. It also allowed us to remove the check for duplicated areas in the list. This change was added in 292747@main and improved the performance of the MotionMark leaves and multiply tests. See leaves.
Record all dirty tiles of a layer once
After the switch to SkPicture for the display list implementation, we realized that this API would also allow us to record the graphics layer once, using the bounding box of the dirty region, and then replay it multiple times on worker threads, once for every dirty tile. Recording can be a very heavy operation, especially when there are shadows or filters, and it was previously done for every tile due to the limitations of the old display list implementation. In 292929@main we introduced this change, with improvements in the MotionMark leaves and multiply tests. See multiply.
MotionMark results
I've shown here the improvements of these changes in some of the MotionMark tests. I have to say that some of those changes also introduced small regressions in other tests, but the global improvement is still noticeable. Here is a table with the scores of all tests before these improvements and on the current main branch, run with WPE MiniBrowser on a Raspberry Pi 4 (64-bit).
Test           Score July 2024   Score April 2025
Multiply       501.17            684.23
Canvas arcs    140.24            828.05
Canvas lines   1613.93           3086.60
Paths          375.52            4255.65
Leaves         319.31            470.78
Images         162.69            267.78
Suits          232.91            445.80
Design         33.79             64.06
What’s next?
There’s still quite a lot of room for improvement, so we are already working on other features and exploring ideas to continue improving the performance. Some of those are:
Damage tracking: this feature is already present, but disabled by default because it's still work in progress. We currently use the damage information to paint only the areas of every layer that changed. But then we always compose a whole frame inside WebKit that is passed to the UI process to be presented on screen. It's possible to use the damage information to improve both the composition inside WebKit and the presentation of the composited frame on the screen. For more details about this feature, read Pawel's awesome blog post about it.
Use DMA-BUF for tile textures to improve pixel transfer operations: We currently use DMA-BUF buffers to share the composited frame between the web and UI processes. We are now exploring the idea of using DMA-BUF also for the textures used by the WebKit compositor to generate the frame. This would allow us to improve the performance of pixel transfer operations; for example, when doing CPU rendering we need to upload the dirty regions from main memory to a compositor texture on every composition. With DMA-BUF backed textures we can map the buffer into main memory and paint with the CPU directly into the mapped buffer.
Compositor synchronization: We plan to try to improve the synchronization of the WebKit compositor with the system vblank and the different sources of composition (painted layers, video layers, CSS animations, WebGL, etc.).