We’re happy to have released gst-dots-viewer, a new development tool that makes it easier to visualize and debug GStreamer pipelines. This tool, included in GStreamer 1.26, provides a web-based interface for viewing pipeline graphs in real time as your application runs and lets you easily request that all pipelines be dumped at any time.
What is gst-dots-viewer?
gst-dots-viewer is a server application that monitors a directory for .dot files generated by GStreamer’s pipeline visualization system and displays them in your web browser. It automatically updates the visualization whenever new .dot files are created, making it simpler to debug complex applications and understand the evolution of the pipelines at runtime.
Key Features
Real-time Updates: Watch your pipelines evolve as your application runs
Interactive Visualization:
Click nodes to highlight pipeline elements
Use Shift-Ctrl-scroll or w/s keys to zoom
Drag-scroll support for easy navigation
Easily deployable in cloud-based environments
How to Use It
Start the viewer server:
gst-dots-viewer
Open your browser at http://localhost:3000
Enable the dots tracer in your GStreamer application:
GST_TRACERS=dots your-gstreamer-application
The web page will automatically update whenever new pipelines are dumped, and you will be able to dump all pipelines from the web page.
New Dots Tracer
As part of this release, we’ve also introduced a new dots tracer that replaces the previous manual approach of specifying where pipelines should be dumped. The tracer is activated simply by setting GST_TRACERS=dots in the environment.
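For example, with the viewer running you can enable the tracer for any GStreamer application; the gst-launch-1.0 pipeline below is only an illustrative stand-in for your own application:

# Terminal 1: start the viewer, then open http://localhost:3000
gst-dots-viewer

# Terminal 2: run any GStreamer application with the dots tracer enabled
GST_TRACERS=dots gst-launch-1.0 videotestsrc ! autovideosink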
Interactive Pipeline Dumps
The dots tracer integrates with the pipeline-snapshot tracer to provide real-time pipeline visualization control. Through a WebSocket connection, the web interface allows you to trigger pipeline dumps. This means you can dump pipelines exactly when you need them during debugging or development, from your browser.
Future Improvements
We plan on adding more features, and have this list of possibilities:
Additional interactive features in the web interface
Enhanced visualization options
Integration with more GStreamer tracers to provide comprehensive debugging information. For example, we could integrate the newly released memory-tracer and queue-level tracers so as to plot graphs of memory usage at any time.
This could transform gst-dots-viewer into a more complete debugging and monitoring dashboard for GStreamer applications.
Hey all, just a lab notebook entry today. I’ve been working on the
Whippet GC library for about three
years now, learning a lot on the way. The goal has always been to
replace Guile’s use of the Boehm-Demers-Weiser
collector
with something more modern and maintainable. Last year I finally got to
the point that I felt Whippet was
feature-complete,
and taking into account the old adage about long arses and brief videos,
I think that wasn’t too far off. I carved out some time this spring and for the
last month have been integrating Whippet into Guile in anger, on the
wip-whippet
branch.
the haps
Well, today I removed the last direct usage of the BDW collector’s API
by Guile! Instead, Guile uses Whippet’s API any time it needs to
allocate an object, add or remove a thread from the active set, identify
the set of roots for a collection, and so on. Most tracing is still
conservative, but this will move to be more precise over time. I
haven’t had the temerity to actually try one of the Nofl-based
collectors yet, but that will come soon.
Code-wise, the initial import of Whippet added some 18K lines to Guile’s
repository, as counted by git diff --stat, which includes
documentation and other files. There was an unspeakable amount of autotomfoolery to get Whippet into Guile’s ancient build system. Changes to Whippet during the course of
integration added another 500 lines or so. Integration of Whippet
removed around 3K lines of C from Guile. It’s not a pure experiment, as
my branch is also a major version bump and so has the freedom to
refactor and simplify some things.
Things are better but not perfect. Notably, I switched to building weak
hash tables in terms of buckets and chains where the links are
ephemerons, which give me concurrent lock-free reads and writes but not
resizable tables. I would like to somehow resize these tables in
response to GC, but haven’t wired it up yet.
Accessibility in the free and open source world is somewhat of a sensitive topic.
Given the principles of free software, one would think it would be the best possible place to advocate for accessibility. After all, there’s a collection of ideologically motivated individuals trying to craft desktops for themselves and their fellow humans. And yet, when you look at the current state of accessibility on the Linux desktop, you couldn’t possibly call it good, not even sufficient.
It’s a tough situation that’s forcing people who need assistive technologies out of these spaces.
I think accessibility on the Linux desktop is in a particularly difficult position due to a combination of poor incentives and historical factors:
The dysfunctional state of accessibility on Linux makes it so that the people who need it the most cannot even contribute to it.
There is very little financial incentive for companies to invest in accessibility technologies. Often, and historically, companies invest just enough to tick some boxes on government checklists, then forget about it.
Volunteers, especially those who contribute for fun and self-enjoyment, often don’t go out of their way to make the particular projects they’re working on accessible, or to check whether their contributions regress the accessibility of the app.
The nature of accessibility makes it such that the “functional progression” is not linear. If only 50% of the stack is working, that’s practically 0%. Accessibility requires almost every part of the stack to be functional for even the most basic use cases.
There’s almost nobody contributing to this area anymore. Expertise and domain knowledge are almost entirely lost.
In addition to that, I feel like work on accessibility is invisible, in the sense that most people are simply apathetic to the work and contributions done in this area. Maybe due to the dynamics of social media that often favor negative engagement? I don’t know. But it sure feels unrewarding.
Now, I think if I stopped writing here, you dear reader might feel that the situation is mostly gloomy, maybe even get angry at it. However, against all odds, and fighting a fight that seems impossible, there are people working on accessibility. Often without any kind of reward, doing this out of principle. It’s just so easy to overlook their effort!
So as we prepare for the Global Accessibility Awareness Day, I thought it would be an excellent opportunity to highlight these fantastic contributors and their excellent work, and also to talk about some ongoing work on GNOME.
If you consider this kind of work important and relevant, and/or if you need accessibility features yourself, I urge you: please donate to the people mentioned here. Grab these people a coffee. Better yet, grab them a monthly coffee! Contributors who accept donations have a button beneath their avatars. Go help them.
Calendar
GNOME Calendar, the default calendaring app for GNOME, has been slowly but surely progressing towards being minimally accessible. This is mostly thanks to the amazing work from Hari Rana and Jeff Fortin Tam¹!
Back when I was working on fixing accessibility on WebKitGTK, I found the lack of modern tools to inspect the AT-SPI bus a bit off-putting, so I wrote a little app to help me through. Didn’t think much of it, really.
Of course, almost nothing I’ve mentioned so far would be possible if the toolkit itself didn’t have support for accessibility. Thanks to Emmanuele Bassi, GTK4 received an entirely new accessibility backend.
Over time, more people picked up on it, and continued improving it and filling in the gaps. Matthias Clasen and Emmanuele continue to review contributions and keep things moving.
One particular contributor is Lukáš Tyrychtr, who has implemented the Text interface of AT-SPI in GTK. Lukáš contributes to various other parts of the accessibility stack as well!
On the design side, one person in particular stands out for a series of contributions on the Accessibility panel of GNOME Settings: Sam Hewitt. Sam introduced the first mockups of this panel in GitLab, then kept on updating it. More recently, Sam introduced mockups for text-to-speech (okay technically these are in the System panel, but that’s in the accessibility mockups folder!).
Please join me in thanking Sam for these contributions!
Having apps and toolkits exposing the proper amount of accessibility information is a necessary first step, but it would be useless if there was nothing to expose it to.
Thanks to Mike Gorse and others, the AT-SPI project keeps on living. AT-SPI is the service that receives and manages the accessibility information from apps. It’s the heart of accessibility in the Linux desktop! As far as my knowledge about it goes, AT-SPI is really old, dating back to Sun days.
Samuel Thibault continues to maintain speech-dispatcher and Accerciser. Speech dispatcher is the de facto text-to-speech service for Linux as of now. Accerciser is a venerable tool to inspect AT-SPI trees.
Eitan Isaacson is shaking up the speech synthesis world with libspiel, a speech framework for the desktop. Orca has experimental support for it. Eitan is now working on a desktop portal so that sandboxed apps can benefit from speech synthesis seamlessly!
One of the most common screen readers for Linux is Orca. Orca maintainers have been keeping it up and running for a very long time. Here I’d like to point out that we at Igalia significantly fund Orca development.
I would like to invite the community to share a thank you for all of them!
I tried to reach out to everyone nominally mentioned in this blog post. Some people preferred not to be mentioned. And I’m pretty sure there are others involved in related projects whom I never got to learn about.
I guess what I’m trying to say is, this list is not exhaustive. There are more people involved. If you know some of them, please let me encourage you to treat them to a tea, a lunch, a boat trip in Venice, whatever you feel like; or even just reach out to them and thank them for their work.
If you contribute or know someone who contributes to desktop accessibility and would like to be mentioned here, please let me know. Also, please let me know if this webpage itself is properly accessible!
A Look Into The Future
Shortly after I started to write this blog post, I thought to myself: “well, this is nice and all, but it isn’t exactly robust.” Hm. If only there was a more structured, reliable way to keep investing in this.
Coincidentally, at the same time, we were introduced to our new executive director Steven. With such a blast of an introduction, and seeing Steven hanging around in various rooms, I couldn’t resist asking about it. To my great surprise and joy, Steven swiftly responded to my inquiries and we started discussing some ideas!
Conversations are still ongoing, and I don’t want to create any sort of hype in case things end up not working, but… maaaaaaybe keep in mind that there might be an announcement soon!
Huge thanks to the people above, and to everyone who helped me write this blog post
¹ – Jeff doesn’t accept donations for himself, but welcomes marketing-related business
Update on what happened in WebKit in the week from May 5 to May 12.
This week saw one more feature enabled by default, additional support to
track memory allocations, continued work on multimedia and WebAssembly.
Cross-Port 🐱
The Media Capabilities API is now enabled by default. It was previously available as a run-time option in the WPE/WebKitGTK API (WebKitSettings:enable-media-capabilities), so this is just a default tweak.
Landed a change that integrates malloc heap breakdown functionality with non-Apple ports. It works similarly to Apple's implementation, but in the case of non-Apple ports the per-heap memory allocation statistics are, for now, periodically printed to stdout. In the future this functionality will be integrated with Sysprof.
Multimedia 🎥
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
Support for WebRTC RTP header extensions was improved, an RTP header extension for video orientation metadata handling was introduced, and several simulcast tests are now passing.
Progress is ongoing on resumable player suspension, which will eventually allow us to handle websites with lots of simultaneous media elements better in the GStreamer ports, but this is a complex task.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
The in-place Wasm interpreter (IPInt) port to 32-bits has seen some more work.
Fixed a bug in OMG caused by divergence with the 64-bit version. Further syncing is underway.
Releases 📦️
Michael Catanzaro has published a writeup on his blog about how the WebKitGTK API versions have changed over time.
Infrastructure 🏗️
Landed some improvements in the WebKit container SDK for Linux, particularly in error handling.
In my work on RISC-V LLVM, I end up working with the llvm-test-suite a lot,
especially as I put more effort into performance analysis, testing, and
regression hunting.
suite-helper is a
Python script that helps with some of the repetitive tasks when setting up,
building, and analysing LLVM test
suite builds. (Worth noting for those who aren't LLVM regulars: llvm-test-suite is a separate repository to LLVM and includes execution tests and benchmarks, which is different to the targeted unit tests included in the LLVM monorepo).
As always, it scratches an itch for me. The design target is to provide a
starting point that is hopefully good enough for many use cases, but it's easy
to modify (e.g. by editing the generated scripts or emitted command lines) if
doing something that isn't directly supported.
The main motivation for putting this script together came from my habit of
writing fairly detailed "lab notes" for most of my work. This typically
includes a listing of commands run, but I've found such listings rather
verbose and annoying to work with. This presented a good opportunity for
factoring out common tasks into a script, resulting in suite-helper.
Functionality overview
suite-helper has the following subtools:
create
Check out llvm-test-suite to the given directory. Use the --reference
argument to reference git objects from an existing local checkout.
add-config
Add a build configuration using either the "cross" or "native" template.
See suite-helper add-config --help for a listing of available options.
For a build configuration 'foo', a _rebuild-foo.sh file will be created
that can be used to build it within the build.foo subdirectory.
status
Gives a listing of suite-helper managed build configurations that were
detected, attempting to indicate if they are up to date or not (e.g.
spotting if the hash of the compiler has changed).
run
Run the given build configuration using llvm-lit, with any additional
options passed on to lit.
match-tool
A helper that is used by suite-helper reduce-ll but may be useful in
your own reduction scripts. When looking at generated assembly or
disassembly of an object file/binary and an area of interest, your natural
inclination may well be to try to carefully craft logic to match something
that has equivalent/similar properties. Credit to Philip Reames for
underlining to me just how unreasonably effective it is to completely
ignore that inclination and just write something that naively matches a
precise or near-precise assembly sequence. The resulting IR might include
some extraneous stuff, but it's a lot easier to cut down after this
initial minimisation stage, and a lot of the time it's good enough. The
match-tool helper takes a multiline sequence of glob patterns as its
argument, and will attempt to find a match for them (a sequential set of
lines) on stdin. It also normalises whitespace.
get-ll
Query ninja and process its output to try to produce and execute a
compiler command that will emit a .ll for the given input file (e.g. a .c
file). This is a common first step for llvm-reduce, or for starting to
inspect the compilation of a file with debug options enabled.
reduce-ll
For me, it's fairly common to want to produce a minimised .ll file that
produces a certain assembly pattern, based on compiling a given source
input. This subtool automates that process, using get-ll to retrieve the
ll, then llvm-reduce and match-tool to match the assembly.
Usage example
suite-helper isn't intended to avoid the need to understand how to build the
LLVM test suite using CMake and run it using lit; rather, it aims to
streamline the flow. As such, a good starting point might be to work through
some llvm-test-suite builds yourself and then look here to see if anything
makes your use case easier or not.
All of the notes above may seem rather abstract, so here is an example of
using the helper while investigating some poorly canonicalised
instructions and testing my work-in-progress patch to address them.
suite-helper create llvmts-redundancies --reference ~/llvm-test-suite

for CONFIG in baseline trial; do
  suite-helper add-config cross $CONFIG \
    --cc=~/llvm-project/build/$CONFIG/bin/clang \
    --target=riscv64-linux-gnu \
    --sysroot=~/rvsysroot \
    --cflags="-march=rva22u64 -save-temps=obj" \
    --spec2017-dir=~/cpu2017 \
    --extra-cmake-args="-DTEST_SUITE_COLLECT_CODE_SIZE=OFF -DTEST_SUITE_COLLECT_COMPILE_TIME=OFF"
  ./_rebuild-$CONFIG.sh
done

# Test suite builds are now available in build.baseline and build.trial, and
# can be compared with e.g. ./utils/tdiff.py.
# A separate script had found a suspect instruction sequence in sqlite3.c, so
# let's get a minimal reproducer.
suite-helper reduce-ll build.baseline ./MultiSource/Applications/sqlite3/sqlite3.c \
  'add.uw a0, zero, a2
subw a4, a4, zero' \
  --reduce-bin=~/llvm-project/build/baseline/bin/llvm-reduce \
  --llc-bin=~/llvm-project/build/baseline/bin/llc \
  --llc-args=-O3
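As an aside, match-tool can also be used on its own, e.g. in a hand-written llvm-reduce interestingness test. A rough sketch reusing the pattern and llc path from above (reduced.ll is a placeholder name):

# Hypothetical standalone use of match-tool: check whether the suspect
# instruction sequence appears in the assembly llc generates for a candidate file.
~/llvm-project/build/baseline/bin/llc -O3 reduced.ll -o - | \
  suite-helper match-tool 'add.uw a0, zero, a2
subw a4, a4, zero'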
Hey peoples! Tonight, some meta-words. As you know I am fascinated by
compilers and language implementations, and I just want to know all the
things and implement all the fun stuff: intermediate representations,
flow-sensitive source-to-source optimization passes, register
allocation, instruction selection, garbage collection, all of that.
It started long ago with a combination of curiosity and a hubris to satisfy
that curiosity. The usual way to slake such a thirst is structured
higher education followed by industry apprenticeship, but for whatever
reason my path sent me through a nuclear engineering bachelor’s program
instead of computer science, and continuing that path was so distasteful
that I noped out all the way to rural Namibia for a couple years.
Fast-forward, after 20 years in the programming industry, and having
picked up some language implementation experience, a few years ago I
returned to garbage collection. I have a good level of language
implementation chops but never wrote a memory manager, and Guile’s
performance was limited by its use of the Boehm collector. I had been
on the lookout for something that could help, and when I learned of
Immix it seemed to me that the only thing missing was an appropriate
implementation for Guile, and hey I could do that!
whippet
I started with the idea of an MMTk-style
interface to a memory manager that was abstract enough to be implemented
by a variety of different collection algorithms. This kind of
abstraction is important, because in this domain it’s easy to convince
oneself that a given algorithm is amazing, just based on vibes; to stay
grounded, I find I always need to compare what I am doing to some fixed
point of reference. This GC implementation effort grew into
Whippet, but as it did so a funny
thing happened: the mark-sweep collector that I
prototyped
as a direct replacement for the Boehm collector maintained mark bits in
a side table, which I realized was a suitable substrate for
Immix-inspired bump-pointer allocation into holes. I ended up building
on that to develop an Immix collector, but without lines: instead each
granule of allocation (16 bytes for a 64-bit system) is its own line.
regions?
The Immix paper is
funny, because it defines itself as a new class of mark-region
collector, fundamentally different from the three other fundamental
algorithms (mark-sweep, mark-compact, and evacuation). Immix’s
regions are blocks (64kB coarse-grained heap divisions) and lines
(128B “fine-grained” divisions); the innovation (for me) is the
optimistic evacuation discipline by which one can potentially
defragment a block without a second pass over the heap, while also
allowing for bump-pointer allocation. See the papers for the deets!
However what, really, are the regions referred to by mark-region? If
they are blocks, then the concept is trivial: everyone has a
block-structured heap these days. If they are spans of lines, well, how
does one choose a line size? As I understand it, Immix’s choice of 128
bytes was to be fine-grained enough to not lose too much space to
fragmentation, while also being coarse enough to be eagerly swept during
the GC pause.
This constraint was odd, to me; all of the mark-sweep systems I have
ever dealt with have had lazy or concurrent sweeping, so the lower bound
on the line size to me had little meaning. Indeed, as one reads papers
in this domain, it is hard to know the real from the rhetorical; the
review process prizes novelty over nuance. Anyway. What if we cranked
the precision dial to 16 instead, and had a line per granule?
That was the process that led me to Nofl. It is a space in a collector
that came from mark-sweep with a side table, but instead uses the side
table for bump-pointer allocation. Or you could see it as an Immix
whose line size is 16 bytes; it’s certainly easier to explain it that
way, and that’s the tack I took in a recent paper submission to
ISMM’25.
paper??!?
Wait what! I have a fine job in industry and a blog, why write a paper?
Gosh I have meditated on this for a long time and the answers are very
silly. Firstly, one of my language communities is Scheme, which was a
research hotbed some 20-25 years ago, which means many
practitioners—people I would be pleased to call peers—came up
through the PhD factories and published many interesting results in
academic venues. These are the folks I like to hang out with! This is
also what academic conferences are, chances to shoot the shit with
far-flung fellows. In Scheme this is fine, my work on Guile is enough
to pay the intellectual cover charge, but I need more, and in the field
of GC I am not a proven player. So I did an atypical thing, which is to
cosplay at being an independent researcher without having first been a
dependent researcher, and just solo-submit a paper. Kids: if you see
yourself here, just go get a doctorate. It is not easy but I can only
think it is a much more direct path to goal.
And the result? Well, friends, it is this blog post :) I got the usual
assortment of review feedback, from the very sympathetic to the less so,
but ultimately people were confused by leading with a comparison to
Immix but ending without an evaluation against Immix. This is fair and
the paper does not mention that, you know, I don’t have an Immix lying
around. To my eyes it was a good paper, an 80%
paper, but, you know, just a
try. I’ll try again sometime.
In the meantime, I am driving towards getting Whippet into Guile. I am
hoping that sometime next week I will have excised all the uses of the
BDW (Boehm GC) API in Guile, which will finally allow for testing Nofl
in more than a laboratory environment. Onwards and upwards!
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
The GstWPE2 GStreamer plugin landed in GStreamer
main; it makes use of the WPEPlatform API and will ship in GStreamer 1.28. Compared to GstWPE1 it provides the same features, but with improved support for NVIDIA
GPUs. The main regression is lack of audio support, which is work-in-progress,
both on the WPE and GStreamer sides.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
Work on enabling the in-place Wasm interpreter (IPInt) on 32-bits has progressed nicely.
The JSC tests runner can now guard against a pathological failure mode.
In JavaScriptCore's implementation of
Temporal,
Tim Chevalier fixed the parsing of RFC
9557 annotations in date
strings to work according to the standard. So now syntactically valid but
unknown annotations [foo=bar] are correctly ignored, and the ! flag in an
annotation is handled correctly. Philip Chimento expanded the test suite
around this feature and fixed a couple of crashes in Temporal.
Math.hypot(x, y, z) received a fix for a corner case.
WPE WebKit 📟
WPE now uses the new pasteboard API, aligning it with the GTK port, and enabling features that were previously disabled. Note that the new features work only with WPEPlatform, because libwpe-based backends can only access clipboard text items.
WPE Platform API 🧩
New, modern platform API that supersedes usage of libwpe and WPE backends.
Platform backends may add their own clipboard handling, with the Wayland one being the first to do so, using wl_data_device_manager.
This continues the effort to close the feature gap between the “traditional” libwpe-based WPE backends and the new WPEPlatform ones.
Community & Events 🤝
Carlos García has published a blog post about the optimizations introduced in
the WPE and GTK WebKit
ports
since Skia replaced Cairo for 2D rendering. Plus, there
are some hints about what is coming next.
Cast your mind back to the late 2000s and one thing you might remember is the
excitement about netbooks. You
sacrifice something in raw computational power, but get a lightweight, low
cost and ultra-portable system. Their popularity peaked and started to wane maybe
15 years ago now, but I was pleased to discover that the idea lives on in the
form of the Chuwi MiniBook X
N150 and have
been using it as my daily driver for about a month now. Read on for some notes
and thoughts on the device as well as more information than you probably want
about configuring Linux on it.
The bottom line is that I enjoy it and I'd buy it again. But there are real
limitations to keep in mind if you're considering following suit.
Background
First a little detour. As many of my comments are made in reference to my
previous laptops it's probably worth fleshing out that history a little. The
first thing to understand is that my local computing needs are relatively
simple and minimal. I work on large C/C++ codebases (primarily LLVM) with
lengthy compile times, but I build and run tests on a remote machine. This
means I only need enough local compute to comfortably navigate codebases, do
whatever smaller local projects I want to do, and use any needed browser based
tools like videoconferencing or GDocs.
Looking back at my previous two laptops (oldest first):
Intel
N5000
processor, 4GiB RAM (huge weak point even then), 256GB SSD, 14" 1920x1080
matte screen.
Fanless and absolutely silent.
A big draw was the long battery life. Claimed 17h by the manufacturer,
tested at ~12h20m 'light websurfing' in one
review
which I found to be representative, with runtimes closer to 17h possible
if e.g. mostly doing text editing when traveling without WiFi.
Three USB-A ports, one USB-C port, 3.5mm audio jack, HDMI, SD card slot.
Charging via proprietary power plug.
1.30kg weight and 32.3cm x 22.8cm dimensions.
Took design stylings of rather more expensive devices, with a metal
chassis, the ability to fold flat, and a large touchpad.
Claimed battery life reduced to 15h. I found it very similar in practice.
But the battery has degraded significantly over time.
Two USB-A ports, one USB-C port, 3.5mm audio jack, HDMI. Charging via
proprietary power plug.
1.30kg weight and 32.3cm x 21.2cm dimensions.
Still a metal chassis, though sadly designed without the ability to fold
the screen completely flat and the size of the touchpad was downgraded.
I think you can see a pattern here.
As for the processors, the N5000 was part of Intel "Gemini
Lake"
which used the Goldmont Plus microarchitecture. This targets the same market
segment as earlier Atom branded processors (as used by many of those early
netbooks) but with substantially higher performance and a much more
complicated microarchitecture than the early Atom (which was dual issue, in
order with a 16 stage pipeline). The
best reference I can see for the microarchitectures used in the N5000 and
N6000 is AnandTech's Tremont microarchitecture
write-up
(matching the
N6000),
which makes copious reference to differences vs previous iterations. Both the
N5000 and N6000 have a TDP of 6W and 4 cores (no hyperthreading). Notably,
all these designs lack AVX support.
The successor to Tremont was the Gracemont
microarchitecture,
this time featuring AVX2 and seeing much wider usage due to being used as the
"E-Core" design throughout Intel's chips pairing some number of more
performance-oriented P-Cores with energy efficiency optimised E-Cores. Low
TDP chips featuring just E-Cores were released such as the N100 serving
as a successor to the
N6000
and later the N150 added as a slightly higher clocked version. There have been
further iterations on the microarchitecture since Gracemont with
Crestmont and
Skymont,
but at the time of writing I don't believe these have made it into similar
E-Core only low TDP chips. I'd love to see competitive devices at similar
pricepoints using AMD or Arm chips (and one day RISC-V of course), but this
series of Intel chips seems to have really found a niche.
28.8Wh battery, seems to give 4-6h battery depending on what you're doing
(possibly more if offline and text editing, I've not tried to push to the
limits).
Two USB-C ports (both supporting charging via USB PD), 3.5mm audio jack.
0.92kg weight and 24.4cm x 16.6cm dimensions.
Display is touchscreen, and can fold all the way around for tablet-style
usage.
Just looking at the specs the key trade-offs are clear. There's a big drop in
battery life, but a newer faster processor and fun mini size.
Overall, it's a positive upgrade but there are definitely some downsides. Main
highlights:
Smol! Reasonably light. The 10" display works well at 125% zoom.
The keyboard is surprisingly pleasant to use. The trackpad is obviously
small given size constraints, but again it works just fine for me. It feels
like this is the smallest size where you can have a fairly normal experience
in terms of display and input.
With a metal chassis, the build quality feels good overall. Of course the
real test is how it lasts.
Charging via USB-C PD! I am so happy to be free of laptop power bricks.
The N150 is a nice upgrade vs the N5000 and N6000. AVX2 support means
we're much more likely to hit optimised codepaths for libraries that make
use of it.
But of course there's a long list of niggles or drawbacks. As I say, overall
it works for me, but if it didn't have these drawbacks I'd probably move
more towards actively recommending it without lots of caveats:
Battery life isn't fantastic. I'd be much happier with 10-12h. Though given
the USB-C PD support, it's not hard to reach this with an external battery.
I miss having a silent fanless machine. The fan doesn't come on frequently
in normal usage, but of course it's noticeable when it does. My unit also
suffers from some coil whine which is audible sometimes when scrolling.
Neither is particularly loud but there is a huge difference between never
being able to hear your computer vs sometimes being able to hear it.
Some tinkering needed for initial Linux setup. Depending on your mindset,
this might be a pro! Regardless, I've documented what I've done down below.
I should note that all the basic hardware does work including the
touchscreen, webcam, and microphone. The fact the display is rotated is
mostly an easy fix, but I haven't checked if the fact it shows as 1200x1920
rather than 1920x1080 causes problems for e.g. games.
In-built display is 50Hz rather than 60Hz and I haven't yet succeeded at
overriding this in Linux (although it seems possible in Windows).
It's unfortunate there's no ability to limit charging at e.g. 80% as
supported by some charge controllers as a way of extending battery lifetime.
It charges relatively slowly (~20W draw), which is a further incentive to
have an external battery if out and about.
It's a shame they went with the soldered-on Intel AX101 WiFi module rather
than spending a few dollars more for a better module from Intel's line-up.
I totally understand why Chuwi don't/can't have different variants with
different keyboards, but I would sure love a version with a UK key layout!
Screen real estate is lost to the bezel. Additionally, the rounded corners
of the bezel cutting off the corner pixels is annoying.
Do beware that the laptop ships with a 12V/3A charger with a USB-C connection
that apparently will use that voltage without any negotiation. It's best not
to use it at all due to the risk of plugging in something that can't handle
12V input.
Conclusion: It's not a perfect machine, but I'm a huge fan of this form
factor. I really hope we get future iterations or competing products.
Appendix A: Accessories
YMMV, but I picked up the following with the most notable clearly being the
replacement SSD. Prices are the approximate amount paid including any
shipping.
Installation was trivial. Undo 8 screws on the MiniBook underside and it
comes off easily.
The spec is overkill for this laptop (PCIe Gen4 when the MiniBook only
supports Gen3 speeds). But the price was good meaning it wasn't very
attractive to spend a similar amount for a slower last-generation drive
with worse random read/write performance.
Unlike the MiniBook itself, charges very quickly. Also supports
pass-through charging so you can charge the battery while also charging
downstream devices, through a single wall socket.
Goes for a thin but wider squared shape vs many other batteries that are
quite thick, though narrower. For me this is more convenient in most
scenarios.
Despite being designed for the Steam Deck, this actually works really nicely
for holding it vertically. The part that holds the device is adjustable and
comfortably holds it without blocking the air vents. I use this at my work
desk and just need to plug in a single USB-C cable for power, monitor, and
peripherals (and additionally the 3.5mm audio jack if using speakers).
I'd wondered if I might have to instead find some below-desk setup to keep
cables out of the way, but placing this at the side of my desk and using
right-angled cables (or adapters) that go straight down off the side seems
to work fairly well for keeping the spider's web of cables out of the
way.
Supports 20V 1.75A when only a USB-C cable is connected, which is more than
enough for charging the MiniBook.
Given all my devices when traveling are USB, I was interested in
something compact that avoids the need for separate adapter plugs. This
seems to fit the bill.
Case: 11" Tablet
case (~£2.50 when
bought with some other things)
Took a gamble but this fits remarkably well, and has room for extra cables
/ adapters.
Appendix B: Arch Linux setup
As much for my future reference as for anything else, here are notes on
installing and configuring Arch Linux on the MiniBook X to my liking, and
working through as many niggles as I can. I'm grateful to Sonny Piers' GitHub
repo for some pointers on dealing
with initial challenges like screen rotation.
Initial install
Download an Arch Linux install image and
write to a USB drive. Enter the BIOS by pressing F2 while booting and disable
secure boot. I found I had to do this, then save and exit for
it to stick. Then enter BIOS again on a subsequent boot and select the option
to boot straight into it (under the "Save and Exit" menu).
In order to have the screen rotated correctly, we need to set the boot
parameter video=DSI-1:panel_orientation=right_side_up. Do this by pressing
e at the boot menu and manually adding.
Then connect to WiFi (iwctl then station wlan0 scan, station wlan0 get-networks, station wlan0 connect $NETWORK_NAME and enter the WiFi
password). It's likely more convenient to do the rest of the setup via ssh,
which can be done by setting a temporary root password with passwd and then
connecting with
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@archiso.
Set the SSD sector size to 4k:
# Confirm 4k sector sizes are available and supported.
nvme id-ns -H /dev/nvme0n1
# Shows:
# LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
# LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better
nvme format --lbaf=1 /dev/nvme0n1
Now partition disks and create filesystems (with encrypted rootfs):
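A minimal sketch of one way to do this (EFI system partition plus LUKS-encrypted ext4 root; device names and sizes are assumptions rather than a prescription):

# Sketch only: ESP + LUKS-encrypted ext4 root. Adjust devices/sizes to taste.
sgdisk --zap-all -n 1:0:+512M -t 1:ef00 -n 2:0:0 -t 2:8309 /dev/nvme0n1
mkfs.fat -F32 /dev/nvme0n1p1
cryptsetup luksFormat /dev/nvme0n1p2
cryptsetup open /dev/nvme0n1p2 cryptroot
mkfs.ext4 /dev/mapper/cryptroot
mount /dev/mapper/cryptroot /mnt
mount --mkdir /dev/nvme0n1p1 /mnt/boot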
The touchscreen input also needs to be rotated to work properly. See
here for guidance
on the transformation matrix for xinput and confirm the name to match with
xinput list.
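For a display rotated to the right, the resulting command looks something like the following (the device name is a placeholder; substitute whatever xinput list reports for the touchscreen):

# 90-degree clockwise rotation matrix for the touchscreen input.
# The device name below is a placeholder; check `xinput list`.
xinput set-prop 'Goodix Capacitive TouchScreen' \
  'Coordinate Transformation Matrix' 0 1 0 -1 0 1 0 0 1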
git clone https://aur.archlinux.org/yay.git && cd yay
makepkg -si
cd .. && rm -rf yay
yay xautolock
yay ps_mem
Use UK keymap and in X11 use caps lock as escape:
localectl set-keymap uk
localectl set-x11-keymap gb "" "" caps:escape
The device has a US keyboard layout which has one less key than the UK
layout and
several keys in different places. As I regularly use a UK layout external
keyboard, rather than just get used to this I set a UK layout and use AltGr
keycodes for backslash (AltGr+-) and pipe
(AltGr+`).
For audio support, I didn't need to do anything other than get rid of
excessive microphone noise by opening alsamixer and turning "Internal Mic
Boost" down to zero.
Suspend rather than shutdown when pressing power button
It's too easy to accidentally hit the power button especially when
plugging/unplugging USB-C devices, so let's make it just suspend rather than
shutdown.
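A sketch of the standard systemd-logind way of doing this (via a drop-in; editing /etc/systemd/logind.conf directly works just as well):

# Map the power key to suspend; takes effect after restarting systemd-logind
# or rebooting.
sudo mkdir -p /etc/systemd/logind.conf.d
printf '[Login]\nHandlePowerKey=suspend\n' | \
  sudo tee /etc/systemd/logind.conf.d/10-power-button.conf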
See the Arch
wiki
for a discussion. s2idle and deep are reported as supported from
/sys/power/mem_sleep, but the discharge rate leaving the laptop suspended
overnight feels higher than I'd like. Let's enable deep sleep in the hope it
reduces it.
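One way to do that with a reasonably recent systemd (a sketch, assuming systemd-sleep's MemorySleepMode rather than the mem_sleep_default kernel parameter):

# Ask systemd-sleep to use deep (S3) suspend instead of s2idle.
sudo mkdir -p /etc/systemd/sleep.conf.d
printf '[Sleep]\nMemorySleepMode=deep\n' | \
  sudo tee /etc/systemd/sleep.conf.d/10-deep-sleep.conf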
Check last sleep mode used with sudo journalctl | grep "PM: suspend" | tail -2. And check the current sleep mode with cat /sys/power/mem_sleep.
Checking the latter after boot you're likely to be worried to see that s2idle
is still default. But try suspending and then checking the journal and you'll
see systemd switches it just prior to suspending. (i.e. the setting works as
expected, even if it's only applied lazily).
I haven't done a reasonably controlled test of the impact.
Changing DPI
The strategy is to use xsettingsd to
update applications on the fly that support it, and otherwise update Xft.dpi
in Xresources. I've found a DPI of 120 works well for me. So add systemctl --user restart xsettingsd to .xinitrc as well as a call to this set_dpi
script with the desired DPI:
#!/bin/sh
DPI="$1"
if [ -z "$DPI" ]; then
  echo "Usage: $0 <dpi>"
  exit 1
fi
CONFIG_FILE="$HOME/.config/xsettingsd/xsettingsd.conf"
mkdir -p "$(dirname "$CONFIG_FILE")"
if ! [ -e "$CONFIG_FILE" ]; then
  touch "$CONFIG_FILE"
fi
if grep -q 'Xft/DPI' "$CONFIG_FILE"; then
  sed -i "s|Xft/DPI.*|Xft/DPI $(($DPI*1024))|" "$CONFIG_FILE"
else
  echo "Xft/DPI $(($DPI*1024))" >> "$CONFIG_FILE"
fi
systemctl --user restart xsettingsd.service
echo "Xft.dpi: $DPI" | xrdb -merge
echo "DPI set to $DPI"
If attaching to an external display where a different DPI is desirable, just
call set_dpi as needed.
Enabling Jabra bluetooth headset
sudo systemctl enable --now bluetooth.service
Follow instructions in https://wiki.archlinux.org/title/bluetooth_headset to
pair
Remember to do the 'trust' step so it automatically reconnects
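Condensed, the bluetoothctl flow from the wiki looks roughly like this (commands are entered at the interactive bluetoothctl prompt; the MAC address is a placeholder for the headset's address):

bluetoothctl
# then, at the [bluetooth]# prompt:
power on
scan on                  # wait for the headset to show up
pair 00:11:22:33:44:55
trust 00:11:22:33:44:55  # the 'trust' step enables automatic reconnection
connect 00:11:22:33:44:55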
Automatically enabling/disabling display outputs upon plugging in a monitor
The srandrd tool provides a handy way of
listening for changes in the plug/unplugged status of connections and
launching a shell script. First try it out with the following to observe
events:
yay srandrd
cat - <<'EOF' > /tmp/echo.sh
echo $SRANDRD_OUTPUT $SRANDRD_EVENT $SRANDRD_EDID
EOF
chmod +x /tmp/echo.sh
srandrd -n /tmp/echo.sh
# You should now see the events as you plug/unplug devices.
So this is simple - we just write a shell script that srandrd will invoke
which calls xrandr as desired when connect/disconnect of the device with the
target EDID happens? Almost. There are two problems I need to work around:
The monitor I use for work is fairly bad at picking up a 4k60Hz input
signal. As far as I can tell this is independent of the cable used or input
device. What does seem to reliably work is to output a 1080p signal, wait a
bit, and then reconfigure to 4k60Hz.
The USB-C cable I normally plug into in my sitting room is also connected
to the TV via HDMI (I often use this for my Steam Deck). I noticed occasional
graphical slowdowns and after more debugging found I could reliably see this
in hiccups / reduced measured frame rate in glxgears that correspond with
recurrent plug/unplug events. The issue disappears completely if video output
via the cable is configured once and then unconfigured again. Very weird, but
at least there's a way round it.
Solving both of the above can readily be addressed by producing a short
sequence of xrandr calls rather than just one. Except these xrandr calls
themselves trigger new events that cause srandrd to reinvoke the
script. So I add a mechanism to
have the script ignore events if received in short succession. We end up with
the following:
#!/usr/bin/sh
EVENT_STAMP=/tmp/display-change-stamp

# Recognised displays (as reported by $SRANDRD_EDID).
WORK_MONITOR="720405518350B628"
TELEVISION="6D1E82C501010101"

msg() {
  printf "display-change-handler: %s\n" "$*" >&2
}

# Call xrandr, but refresh $EVENT_STAMP just before doing so. This causes
# connect/disconnect events generated by the xrandr operation to be skipped at
# the head of this script. Call xrefresh afterwards to ensure windows are
# redrawn if necessary.
wrapped_xrandr() {
  touch $EVENT_STAMP
  xrandr "$@"
  xrefresh
}

msg "received event '$SRANDRD_OUTPUT: $SRANDRD_EVENT $SRANDRD_EDID'"

# Suppress event if within 2 seconds of the timestamp file being updated.
if [ -f $EVENT_STAMP ]; then
  cur_time=$(date +%s)
  file_time=$(stat -c %Y $EVENT_STAMP)
  if [ $((cur_time - file_time)) -le 2 ]; then
    msg "suppressing event (exiting)"
    exit 0
  fi
fi

touch $EVENT_STAMP

is_output_outputting() {
  xrandr --query | grep -q "^$1 connected.*[0-9]\+x[0-9]\++[0-9]\++[0-9]\+"
}

# When connecting the main 'docked' display, disable the internal screen. Undo
# this when disconnecting.
case "$SRANDRD_EVENT $SRANDRD_EDID" in
  "connected $WORK_MONITOR")
    msg "enabling 1920x1080 output on $SRANDRD_OUTPUT, disabling laptop display, and sleeping for 10 seconds"
    wrapped_xrandr --output DSI-1 --off --output $SRANDRD_OUTPUT --mode 1920x1080
    sleep 10
    msg "switching up to 4k output"
    wrapped_xrandr --output DSI-1 --off --output $SRANDRD_OUTPUT --preferred
    msg "done"
    exit
    ;;
  "disconnected $WORK_MONITOR")
    msg "re-enabling laptop display and disabling $SRANDRD_OUTPUT"
    wrapped_xrandr --output DSI-1 --preferred --rotate right --output $SRANDRD_OUTPUT --off
    msg "done"
    exit
    ;;
  "connected $TELEVISION")
    # If we get the 'connected' event and a resolution is already configured
    # and being emitted, then do nothing as the event was likely generated by
    # a manual xrandr call from outside this script.
    if is_output_outputting $SRANDRD_OUTPUT; then
      msg "doing nothing as manual reconfiguration suspected"
      exit 0
    fi
    msg "enabling then disabling output $SRANDRD_OUTPUT which seems to avoid subsequent disconnect/reconnects"
    wrapped_xrandr --output $SRANDRD_OUTPUT --mode 1920x1080
    sleep 1
    wrapped_xrandr --output $SRANDRD_OUTPUT --off
    msg "done"
    exit
    ;;
  *)
    msg "no handler for $SRANDRD_EVENT $SRANDRD_EDID"
    exit
    ;;
esac
Outputting to in-built screen at 60Hz (not yet solved)
The screen is unfortunately limited to 50Hz out of the box, but at least on
Windows it's possible to use Custom Resolution
Utility
to edit the EDID and add a 1200x1920 60Hz mode (reminder: the display is
rotated to the right which is why width x height is the opposite order to
normal). To add the custom resolution in CRU:
Open CRU
Click to "add a detailed resolution"
Select "Exact reduced" and enter Active: 1200 horizontal pixels, Vertical
1920 lines, and Refresh rate: 60.000 Hz. This results in Horizontal:
117.000kHz and pixel clock 159.12MHz. Leave interlaced unticked.
I exported this to a file with the hope of reusing on Linux.
As is often the case, the Arch Linux wiki has some relevant
guidance
on configuring an EDID override on Linux. I tried to follow the guidance by:
Copying the exported EDID file to
/usr/lib/firmware/edid/minibook_x_60hz.bin.
Adding drm.edid_firmware=DSI-1:edid/minibook_x_60hz.bin (DSI-1 is the
internal display) to the kernel commandline using efibootmgr.
Confirming this shows up in the kernel command line in dmesg but there are
no DRM messages regarding EDID override or loading the file. I also verify
it shows up in cat /sys/module/drm/parameters/edid_firmware.
Attempt adding /usr/lib/firmware/edid/minibook_x_60hz.bin to FILES in
/etc/mkinitcpio.conf and regenerating the initramfs. No effect.
Over the past eight months, Igalia has been working through RISE on the LLVM compiler, focusing on its RISC-V target. The goal is to improve the performance of generated code for application-class RISC-V processors, especially where there are gaps between LLVM and GCC for RISC-V.
The result? A set of improvements that reduces execution time by up to 15% on our SPEC CPU® 2017-based benchmark harness.
In this blog post, I’ll walk through the challenges, the work we did across different areas of LLVM (including instruction scheduling, vectorization, and late-stage optimizations), and the resulting performance gains that demonstrate the power of targeted compiler optimization for the RISC-V architecture on current RVA22U64+V and future RVA23 hardware.
First, to understand the work involved in optimizing the RISC-V performance, let’s briefly discuss the key components of this project: the RISC-V architecture itself, the LLVM compiler infrastructure, and the Banana Pi BPI-F3 board as our target platform.
RISC-V is a modern, open-standard instruction set architecture (ISA) built around simplicity and extensibility. Unlike proprietary ISAs, RISC-V’s modular design allows implementers to choose from base instruction sets (e.g., RV32I, RV64I) and optional extensions (e.g., vector ops, compressed instructions). This flexibility makes it ideal for everything from microcontrollers to high-performance cores, while avoiding the licensing hurdles of closed ISAs. However, this flexibility also creates complexity: without guidance, developers might struggle to choose the right combination of extensions for their hardware.
Enter RISC-V Profiles: standardized bundles of extensions that ensure software compatibility across implementations. For the BPI-F3’s CPU, the relevant profile is RVA22U64, which includes:
Mandatory: RV64GC (64-bit with general-purpose + compressed instructions), Zicsr (control registers), Zifencei (instruction-fetch sync), and more.
Optional: The Vector extension (V) v1.0 (for SIMD operations) and other accelerators.
We chose to focus our testing on two configurations: RVA22U64 (scalar) and RVA22U64+V (vector), since they cover a wide variety of hardware. It's also important to note that code generation for vector-capable systems (RVA22U64+V) differs significantly from scalar-only targets, making it crucial to optimize both paths carefully.
RVA23U64, which mandates the vector extension, was not chosen because the BPI-F3 doesn’t support it.
LLVM is a powerful and widely used open-source compiler infrastructure. It's not a single compiler but rather a collection of modular and reusable compiler and toolchain technologies. LLVM's strength lies in its flexible and well-defined architecture, which allows it to efficiently compile code written in various source languages (like C, C++, Rust, etc.) for a multitude of target architectures, including RISC-V. A key aspect of LLVM is its optimization pipeline. This series of analysis and transformation passes works to improve the generated machine code in various ways, such as reducing the number of instructions, improving data locality, and exploiting target-specific hardware features.
The Banana Pi BPI-F3 is a board featuring a SpacemiT K1 8-core RISC-V chip. It integrates 2.0 TOPS of AI computing power, with 2/4/8/16 GB DDR and 8/16/32/128 GB eMMC onboard, 2x GbE Ethernet ports, 4x USB 3.0, PCIe for an M.2 interface, and support for HDMI and dual MIPI-CSI cameras.
Most notably, the RISC-V CPU supports the RVA22U64 Profile and 256-bit RVV 1.0 standard.
Let's define the testing environment. We use the training dataset of our SPEC CPU® 2017-based benchmark harness to measure the impact of changes to the LLVM codebase. We do not use the reference dataset for practical reasons, i.e., the training dataset finishes in hours instead of days.
The benchmarks were executed on the BPI-F3, running Arch Linux and Kernel 6.1.15. The configuration of each compiler invocation is as follows:
LLVM at the start of the project (commit cd0373e0): SPEC benchmarks built with optimization level 3 (-O3), and LTO enabled (-flto). We’ll show the results using both RVA22U64 (-march=rva22u64) and the RVA22U64+V profiles (-march=rva22u64_v).
LLVM today (commit b48c476f): SPEC benchmarks built with optimization level 3 (-O3), LTO enabled (-flto), tuned for the SpacemiT-X60 (-mcpu=spacemit-x60), and IPRA enabled (-mllvm -enable-ipra -Wl,-mllvm,-enable-ipra). We’ll also show the results using both RVA22U64 (-march=rva22u64) and the RVA22U64+V profile (-march=rva22u64_v).
GCC 14.2: SPEC benchmarks built with optimization level 3 (-O3), and LTO enabled (-flto). GCC 14.2 doesn't support profile names in -march, so a functionally equivalent ISA naming string was used (skipping the assortment of extensions that don't affect codegen and aren't recognised by GCC 14.2) for both RVA22U64 and RVA22U64+V.
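Putting the flags of the "LLVM today" configuration together, a single invocation has roughly this shape (file names are placeholders and the actual SPEC harness adds further options of its own; cross-compiling for riscv64-linux-gnu is assumed, and a native build on the board would drop --target):

# Approximate shape of the "LLVM today" RVA22U64+V configuration described above.
clang --target=riscv64-linux-gnu -O3 -flto \
  -march=rva22u64_v -mcpu=spacemit-x60 \
  -mllvm -enable-ipra -Wl,-mllvm,-enable-ipra \
  benchmark.c -o benchmark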
The following graph shows the improvements in execution time of the SPEC benchmarks from the start of the project (light blue bar) to today (dark blue bar) using the RVA22U64 profile, on the BPI-F3. Note that these include not only my contributions but also the improvements of all other individuals working on the RISC-V backend. We also include the results of GCC 14.2 for comparison (orange bar). Our contributions will be discussed later.
The graph is sorted by the execution time improvements brought by the new scheduling model. We see improvements across almost all benchmarks, from small gains in 531.deepsjeng_r (3.63%) to considerable ones in 538.imagick_r (19.67%) and 508.namd_r (25.73%). There were small regressions in the execution time of 510.parest_r (-3.25%); however, 510.parest_r results vary greatly in daily tests, so it might be just noise. Five benchmarks are within 1% of previous results, so we assume there was no impact on their execution time.
When compared to GCC, LLVM today is faster in 11 out of the 16 tested benchmarks (up to 23.58% faster than GCC in 541.leela_r), while being slower in three benchmarks (up to 6.51% slower than GCC in 510.parest_r). Current LLVM and GCC are within 1% of each other in the other two benchmarks. Compared to the baseline of the project, GCC was faster in ten benchmarks (up to 26.83% in 508.namd_r) while being slower in only five.
Similarly, the following graph shows the improvements in the execution time of SPEC benchmarks from the start of the project (light blue bar) to today (dark blue bar) on the BPI-F3, but this time with the RVA22U64+V profile, i.e., the RVA22U64 plus the vector extension (V) enabled. Again, GCC results are included (orange bar), and the graph shows all improvements gained during the project.
The graph is sorted by the execution time improvements brought by the new scheduling model. The results for RVA22U64+V follow a similar trend, and we see improvements in almost all benchmarks. From 4.91% in 500.perlbench_r to (again) a considerable 25.26% improvement in 508.namd_r. Similar to the RVA22U64 results, we see a couple of regressions: 510.parest_r with (-3.74%) and 523.xalancbmk_r (-6.01%). Similar to the results on RVA22U64, 523.xalancbmk_r, and 510.parest_r vary greatly in daily tests on RVA22u64+V, so these regressions are likely noise. Four benchmarks are within 1% of previous results, so we assume there was no impact on their execution time.
When compared to GCC, LLVM today is faster in 10 out of the 16 tested benchmarks (up to 23.76% faster than GCC in 557.xz_r), while being slower in three benchmarks (up to 5.58% slower in 538.imagick_r). LLVM today and GCC are within 1-2% of each other in the other three benchmarks. Compared to the baseline of the project, GCC was faster in eight benchmarks (up to 25.73% in 508.namd_r) while being slower in five.
Over the past eight months, our efforts have concentrated on several key areas within the LLVM compiler infrastructure to specifically target and improve the efficiency of RISC-V code generation. These contributions have involved delving into various stages of the compilation process, from instruction selection to instruction scheduling. Here, we'll focus on three major areas where substantial progress has been made:
Introducing a scheduling model for the hardware used for benchmarking (SpacemiT-X60): LLVM had no scheduling model for the SpacemiT-X60, leading to pessimistic and inefficient code generation. We added a model tailored to the X60’s pipeline, allowing LLVM to better schedule instructions and improve performance. Longer term, a more generic in-order model could be introduced in LLVM to help other RISC-V targets that currently lack scheduling information, similar to how it’s already done for other targets, e.g., Aarch64. This contribution alone brings up to 15.76% improvement on the execution time of SPEC benchmarks.
Improved Vectorization Efficiency: LLVM’s SLP vectorizer used to skip over entire basic blocks when calculating spill costs, leading to inaccurate estimations and suboptimal vectorization when functions were present in the skipped blocks. We addressed this by improving the backward traversal to consider all relevant blocks, ensuring spill costs were properly accounted for. The final solution, contributed by the SLP Vectorizer maintainer, was to fix the issue without impacting compile times, unlocking better vectorization decisions and performance. This contribution brings up to 11.87% improvement on the execution time of SPEC benchmarks.
Register Allocation with IPRA Support: enabling Inter-Procedural Register Allocation (IPRA) to the RISC-V backend. IPRA reduces save/restore overhead across function calls by tracking which registers are used. In the RISC-V backend, supporting IPRA required implementing a hook to report callee-saved registers and prevent miscompilation. This contribution brings up to 3.42% improvement on the execution time of SPEC benchmarks.
The biggest contribution so far is the scheduler modeling tailored for the SpacemiT-X60. This scheduler is integrated into LLVM's backend and is designed to optimize instruction ordering based on the specific characteristics of the X60 CPU.
The scheduler was introduced in PR 137343. It includes detailed scheduling models that account for the X60's pipeline structure, instruction latencies for all scalar instructions, and resource constraints. The current scheduler model does not include latencies for vector instructions, but it is a planned future work. By providing LLVM with accurate information about the target architecture, the scheduler enables more efficient instruction scheduling, reducing pipeline stalls and improving overall execution performance.
The graph is sorted by the execution time improvements brought by the new scheduling model. The introduction of a dedicated scheduler yielded substantial performance gains. Execution time improvements were observed across several benchmarks, ranging from 1.04% in 541.leela_r to 15.76% in 525.x264_r.
Additionally, the scheduler brings significant benefits even when vector extensions are enabled, as shown above. The graph is sorted by the execution time improvements brought by the new scheduling model. Execution time improvements range from 3.66% in 544.nab_r to 15.58% in 508.namd_r, with notable code size reductions as well, e.g., a 6.47% improvement in 519.lbm_r (due to decreased register spilling).
Finally, the previous graph shows the comparison between RVA22U64 vs RVA22U64+V, both with the X60 scheduling model enabled. The only difference is 525.x264_r: it is 17.48% faster on RVA22U64+V.
A key takeaway from these results is the critical importance of scheduling for in-order processors like the SpacemiT-X60. The new scheduler effectively closed the performance gap between the scalar (RVA22U64) and vector (RVA22U64+V) configurations, with the vector configuration now clearly outperforming the scalar one only in a single benchmark (525.x264_r). On out-of-order processors, the impact of scheduling would likely be smaller, and vectorization would be expected to deliver more noticeable gains.
SLP Vectorizer Spill Cost Fix + DAG Combiner Tuning
One surprising outcome in early benchmarking was that scalar code sometimes outperformed vectorized code, despite RISC-V vector support being available. This result prompted a detailed investigation.
Using profiling data, we noticed increased cycle counts around loads and stores in vectorized functions; the extra cycles were due to register spilling, particularly around function call boundaries. Digging further, we found that the SLP Vectorizer was aggressively vectorizing regions without properly accounting for the cost of spilling vector registers across calls.
To understand how spill cost miscalculations led to poor vectorization decisions, consider this simplified function, and its graph representation:
This function loads two values from %p, conditionally calls @g() (in both foo and bar), and finally stores the values to %q. Previously, the SLP vectorizer only analyzed the entry and baz blocks, ignoring foo and bar entirely. As a result, it missed the fact that both branches contain a call, which increases the cost of spilling vector registers. This led LLVM to vectorize loads and stores here, introducing unprofitable spills across the calls to @g().
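As a rough C sketch of that shape (hypothetical function and variable names, not the actual test case from the PRs), the situation looks like this:

/* Hypothetical sketch: two loads and two stores the SLP vectorizer would
   like to combine, with a call to g() on both paths in between. */
void g(void);

void f(long *p, long *q, int c) {
  long a = p[0];   /* "entry" block: two adjacent scalar loads   */
  long b = p[1];
  if (c)
    g();           /* "foo" block: contains a call               */
  else
    g();           /* "bar" block: also contains a call          */
  q[0] = a;        /* "baz" block: two adjacent scalar stores    */
  q[1] = b;
  /* If the loads/stores are vectorized, the vector holding {a, b} must be
     spilled and reloaded around the call to g() on either path, which is
     exactly the cost the old analysis failed to see. */
}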
To address the issue, we first proposed PR 128620, which modified the SLP vectorizer to properly walk through all basic blocks when analyzing cost. This allowed the SLP vectorizer to correctly factor in function calls and estimate the spill overhead more accurately.
The results were promising: execution time dropped by 9.92% in 544.nab_r, and code size improved by 1.73% in 508.namd_r. However, the patch also increased compile time in some cases (e.g., +6.9% in 502.gcc_r), making it unsuitable for upstream merging.
Following discussions with the community, Alexey Bataev (SLP Vectorizer code owner) proposed a refined solution in PR 129258. His patch achieved the same performance improvements without any measurable compile-time overhead and was subsequently merged.
The graph shows execution time improvements from Alexey’s patch, ranging from 1.49% in 500.perlbench_r to 11.87% in 544.nab_r. Code size also improved modestly, with a 2.20% reduction in 508.namd_r.
RVA22U64 results are not shown since this optimization is specifically aimed at preventing vector spills; scalar code was not affected by this change.
Finally, PR 130430 addressed the same issue in the DAG Combiner by preventing stores from being merged across call boundaries. While this change had minimal impact on performance in the current benchmarks, it improves code correctness and consistency and may benefit other workloads in the future.
IPRA (Inter-Procedural Register Allocation) Support
Inter-Procedural Register Allocation (IPRA) is a compiler optimization technique that aims to reduce the overhead of saving and restoring registers across function calls. By analyzing the entire program, IPRA determines which registers are used across function boundaries, allowing the compiler to avoid unnecessary save/restore operations.
In the context of the RISC-V backend in LLVM, enabling IPRA required implementing a hook that informs the compiler that callee-saved registers should always be saved in a function, ensuring that critical registers like the return address register (ra) are correctly preserved. Without it, enabling IPRA would lead to miscompilation issues; for example, 508.namd_r would never finish running (probably stuck in an infinite loop).
To understand how IPRA works, consider the following program before IPRA. Let’s assume function foo uses s0 but doesn't touch s1:
# Function bar calls foo and conservatively saves all callee-saved registers.
bar:
addi sp, sp, -32
sd ra, 16(sp) # Save return address (missing before our PR)
sd s0, 8(sp)
sd s1, 0(sp) # Unnecessary spill (foo won't clobber s1)
call foo
ld s1, 0(sp) # Wasted reload
ld s0, 8(sp)
ld ra, 16(sp)
addi sp, sp, 32
ret
After IPRA (optimized spills):
# bar now knows foo preserves s1: no s1 spill/reload.
bar:
addi sp, sp, -16
sd ra, 8(sp) # Save return address (missing before our PR)
sd s0, 0(sp)
call foo
ld s0, 0(sp)
ld ra, 8(sp)
addi sp, sp, 16
ret
By enabling IPRA for RISC-V, we eliminated redundant spills and reloads of callee-saved registers across function boundaries. In our example, IPRA reduced stack usage and cut unnecessary memory accesses. Crucially, the optimization maintains correctness: preserving the return address (ra) while pruning spills for registers like s1 when provably unused. Other architectures like x86 already support IPRA in LLVM, and we enabled it for RISC-V in PR 125586.
IPRA is not enabled by default due to a bug, described in issue 119556; however, that bug does not affect the SPEC benchmarks.
The graph shows the improvements achieved by this transformation alone, using the RVA22U64 profile. There were execution time improvements ranging from 1.57% in 505.mcf_r to 3.16% in 519.lbm_r.
The graph shows the improvements achieved by this transformation alone, using the RVA22U64+V profile. We see similar gains, with execution time improvements of 1.14% in 505.mcf_r and 3.42% in 531.deepsjeng_r.
While we initially looked at the code size impact, the improvements were marginal. Given that save/restore sequences tend to be a small fraction of total code size, this isn't surprising, and code size was not the main goal of this optimization.
Setting Up Reliable Performance Testing. A key part of this project was being able to measure the impact of our changes consistently and meaningfully. For that, we used LNT, LLVM’s performance testing tool, to automate test builds, runs, and result comparisons. Once set up, LNT allowed us to identify regressions early, track improvements over time, and visualize the impact of each patch through clear graphs.
Reducing Noise on the BPI-F3. Benchmarking is noisy by default, and it took considerable effort to reduce variability between runs. These steps helped:
Disabling ASLR: To ensure a more deterministic memory layout.
Running one benchmark at a time on the same core: This helped eliminate cross-run contention and improved result consistency.
Multiple samples per benchmark: We collected 3 samples to compute statistical confidence and reduce the impact of outliers.
These measures significantly reduced noise, allowing us to detect even small performance changes with confidence.
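For reference, a rough sketch of the first two measures (the sysfs value and the chosen core are just examples; the benchmark invocation is a placeholder):

# Disable ASLR system-wide (write 2 to restore the default behaviour)
$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# Pin the benchmark to a single core to avoid cross-run contention
$ taskset -c 4 ./benchmark-binary <benchmark arguments>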
Interpreting Results and Debugging Regressions. Another challenge was interpreting performance regressions or unexpected results. Often, regressions weren't caused by the patch under test, but by unrelated interactions with the backend. This required:
Cross-checking disassembly between runs.
Profiling with hardware counters (e.g., using perf).
Identifying missed optimization opportunities due to incorrect cost models or spill decisions.
Comparing scalar vs vector codegen and spotting unnecessary spills or register pressure.
My colleague Luke Lau also set up a centralized LNT instance that runs nightly tests. This made it easy to detect and track performance regressions (or gains) shortly after new commits landed. When regressions did appear, we could use the profiles and disassembly generated by LNT to narrow down which functions were affected, and why.
Using llvm-exegesis (sort of). At the start of the project, llvm-exegesis, the tool LLVM provides to measure instruction latencies and throughput, didn’t support RISC-V at all. Over time, support was added incrementally across three patches: first for basic arithmetic instructions, then load instructions, and eventually vector instructions. This made it a lot more viable as a tool for microarchitectural analysis on RISC-V. However, despite this progress, we ultimately didn’t use llvm-exegesis to collect the latency data for our scheduling model. The results were too noisy, and we needed more control over how measurements were gathered. Instead, we developed an internal tool to generate the latency data, something we plan to share in the future.
Notable Contributions Without Immediate Benchmark Impact. While some patches may not have led to significant performance improvements in benchmarks, they were crucial for enhancing the RISC-V backend's robustness and maintainability:
Improved Vector Handling in matchSplatAsGather (PR #117878): This patch updated the matchSplatAsGather function to handle vectors of different sizes, enhancing code generation for @llvm.experimental.vector.match on RISC-V.
Addition of FMA Cost Model (PRs #125683 and #126076): These patches extended the cost model to cover the FMA instruction, ensuring accurate cost estimations for fused multiply-add operations.
Generalization of vp_fneg Cost Model (PR #126915): This change moved the cost model for vp_fneg from the RISC-V-specific implementation to the generic Target Transform Info (TTI) layer, promoting consistent handling across different targets.
Late Conditional Branch Optimization for RISC-V (PR #133256): Introduced a late RISC-V-specific optimization pass that replaces conditional branches with unconditional ones when the condition can be statically evaluated. This creates opportunities for further branch folding and cleanup later in the pipeline. While performance impact was limited in current benchmarks, it lays the foundation for smarter late-stage CFG optimizations.
These contributions, while not directly impacting benchmark results, laid the groundwork for future improvements.
This project significantly improved the performance of the RISC-V backend in LLVM through a combination of targeted optimizations, infrastructure improvements, and upstream contributions. We tackled key issues in vectorization, register allocation, and scheduling, demonstrating that careful backend tuning can yield substantial real-world benefits, especially on in-order cores like the SpacemiT-X60.
Future Work:
Vector latency modeling: The current scheduling model lacks accurate latencies for vector instructions.
Further scheduling model fine-tuning: This would impact the largest number of users and would align RISC-V with other targets in LLVM.
Improve vectorization: The similar performance between scalar and vectorized code suggests we are not fully exploiting vectorization opportunities. Deeper analysis might uncover missed cases or necessary model tuning.
Improvements to DAGCombine: after PR 130430, Philip Reames created issue 132787 with ideas to improve the store merging code.
This work was made possible thanks to support from RISE, under Research Project RP009. I would like to thank my colleagues Luke Lau and Alex Bradbury for their ongoing technical collaboration and insight throughout the project. I’m also grateful to Philip Reames from Rivos for his guidance and feedback. Finally, a sincere thank you to all the reviewers in the LLVM community who took the time to review, discuss, and help shape the patches that made these improvements possible.
Recent presentations at BlinkOn strike some familiar notes. It seems a common theme: ideas come back.
Since I joined Igalia in 2019, I don't think I've missed a BlinkOn. This year, however, there was a conflict with the W3C AC meetings and we felt that it was more useful that I attend those, since Igalia already had a sizable contingent at BlinkOn itself and my Web History talk with Chris Lilley was pre-recorded.
When I returned, and videos of the event began landing, I was keen to see what people talked about. There were lots of interesting talks, but one jumped out at me right away: Bramus gave one called "CSS Parser Extensions", which I wasn't familiar with, so I was keen to see it. It turns out it was just the very beginnings of him exploring ideas to make CSS polyfillable.
This talk made me sit up and pay attention because, actually, it's really how I came to be involved in standards. It's the thing that started a lot of the conversations that eventually became the Extensible Web Community Group and the Extensible Web Manifesto, and ultimately Houdini, a joint Task Force of the W3C TAG and CSS Working Group (in fact, I am also the one who proposed the name ✨). In his talk, he hit on many of the same notes that led me there too.
Polyfills are really interesting when you step back and look at them. They can be used to make standards development, feedback and rollout so much better. But CSS has historically been almost hostile to that approach because it just throws away anything it doesn't understand. That means if you want to polyfill something you've got to re-implement lots of stuff that the browser already does: you've got to re-fetch the stylesheet (if you can!) as text, bring your own parser to parse it, and then... well, you still can't actually realistically implement many things.
But what if you could?
Houdini has stalled. In my mind, this is mainly due to when it happened and what it chose to focus on shipping first. One of the first things that we all agreed to in the first Houdini meeting was that we would expose the parser. That's true for all of the reasons Bramus discussed, and more. But that effort got hung up on the sense that we first needed a typed OM, and I'm not sure how true that really is. Other cool Houdini things were, I think, also hung up on lots of things that were being reworked at the time, and on resource competition. But I think the thing that really killed it was just what shipped first. It was not something that might be really useful for polyfilling, like custom functions or custom media queries or custom pseudo-classes, or, very ambitiously, something like custom layouts. It was custom paint. The CSS Working Group doesn't publish a lot of new "paints". There are approximately 0 named background images, for example; there's no background-image: checkerboard;. But the working group does publish lots of those other things, like functions or pseudo-classes. See what I mean? Those other things were part of the real vision: they can be used to pave cow paths. Or they can be used to show that, actually, nobody wants that cow path. Or, if neither of those, they can instead rapidly inspire better solutions.
Anyway, the real challenge with most polyfills is performance. Any time we're going to step out of "60 fps scrollers" into JS land, that's iffy... But it's not impossible, and if we're honest, we'd have to admit that our current attempts to polyfill are definitely worse than something closer to native. With effort, surely we can at least improve things by looking for the nice "joints" where we can cleave the problem.
This is why in recent years I've suggested that perhaps what would really benefit us is a few custom things (like functions) and then just enabling CSS-like languages, which can handle the fetch/parse problems and perhaps give us some of the most basic primitive ideas.
So, where will all of this go? Who knows - but I'm glad some others are interested and talking about some of it again.
This blog post might interest you if you want to try the bleeding-edge NVK driver, which allows decoding H.264/5 video with the power of the Vulkan extensions VK_KHR_video_decode_h26x.
This is a summary of the instructions provided in the MR. This work needs a recent kernel with new features, so this post describes the steps to add this feature and build the new kernel on an Ubuntu-based system.
To run the NVK driver, you need a custom patch applied on top of the Nouveau driver. This patch applies to kernel 6.12 at minimum, so you will need to build a new kernel, unless your distribution already ships a bleeding-edge kernel, which I doubt. Here is the method I used to build this kernel.
The next step is to configure the kernel. The best option I can recommend is to copy the kernel config your distribution ships with. On Ubuntu you can find it in /boot, with a name like config-6.8.0-52-generic.
$ cp /boot/config-6.8.0-52-generic .config
Then, to get the default values for the config your kernel will use, including the specific options coming with Ubuntu, you'll have to run:
$ make olddefconfig
This will set up the build and make it ready to compile with this version of the kernel, auto-configuring the new features.
Two options, CONFIG_SYSTEM_TRUSTED_KEYS and CONFIG_SYSTEM_REVOCATION_KEYS, must be disabled to avoid compilation errors due to missing certificates. You can do that within menuconfig, or edit .config and set these values to "".
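For example, one way to do that non-interactively is with the kernel's own scripts/config helper, run from the kernel source tree:

$ scripts/config --set-str SYSTEM_TRUSTED_KEYS "" \
                 --set-str SYSTEM_REVOCATION_KEYS ""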
Then you should be ready to go for a break ☕, short or long depending on your machine, while it cooks the brand new kernel, Debian-packaged and ready to use:
$ make clean
$ make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-custom
The process should end with a new package named linux-image-6.12.8-custom_6.12.8-3_amd64.deb in the parent folder, which can then be installed alongside your previous kernel.
The linux-image package will replace your current default menu entry in GRUB upon installation. This means that if you install it, the next time you reboot you'll boot into that kernel.
Mesa depends on various system packages, in addition to Python modules and the Rust toolchain. So first we'll have to install the required packages, which are all present in Ubuntu 24.04:
Now that the kernel and the Mesa driver have been built and are available on your machine, you should be able to decode your first H.264 stream with the NVK driver.
As you might have used the NVIDIA driver first, installed with your regular kernel, you might hit a weird error when invoking vulkaninfo, such as:
ERROR: [Loader Message] Code 0 : setup_loader_term_phys_devs: Failed to detect any valid GPUs in the current config
ERROR at ./vulkaninfo/./vulkaninfo.h:247:vkEnumeratePhysicalDevices failed with ERROR_INITIALIZATION_FAILED
Indeed, the nouveau driver cannot coexist with the NVIDIA driver, so you'll have to uninstall the NVIDIA driver first to be able to use nouveau and the Vulkan extensions properly.
Another solution is to boot your new custom kernel and modify the file /etc/modprobe.d/nvidia-installer-disable-nouveau.conf to get something like:
# generated by nvidia-installer
#blacklist nouveau
options nouveau modeset=1
In that case the modeset=1 option will enable the driver and allow you to use it.
Then you'll have to reboot with this new configuration.
As you may have noticed during the configure stage, we chose to install the artifacts of the build in a folder named mesa/builddir/install.
Here is a script, run-nvk.sh, which you can use before calling any binary; it uses this folder as a base to set the environment variables dedicated to the NVK Vulkan driver.
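The script itself isn't reproduced here, but a minimal sketch of such a wrapper could look like the following (the ICD file name and library subdirectory are assumptions that depend on your Mesa build and architecture):

#!/bin/sh
# run-nvk.sh (sketch): point the Vulkan loader and dynamic linker at the local Mesa build
MESA_INSTALL="$PWD/mesa/builddir/install"
export LD_LIBRARY_PATH="$MESA_INSTALL/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH"
export VK_DRIVER_FILES="$MESA_INSTALL/share/vulkan/icd.d/nouveau_icd.x86_64.json"
exec "$@"

It would then be used as, for example, ./run-nvk.sh vulkaninfo.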
Now it's time to run a real application exploiting the power of Vulkan to decode multimedia content. For that I recommend using GStreamer, which ships with Vulkan decoding elements in version 1.24.2, bundled in Ubuntu 24.04.
First of all, you'll have to install the Ubuntu packages for GStreamer:
If you can see this list of elements, you should be able to run a GStreamer pipeline with the Vulkan Video extensions. Here is a pipeline to decode some content:
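The exact pipeline isn't reproduced here; assuming an H.264 stream in an MP4 file and the Vulkan elements from gst-plugins-bad, something along these lines should work:

$ gst-launch-1.0 filesrc location=video.mp4 ! qtdemux ! h264parse \
    ! vulkanh264dec ! vulkandownload ! videoconvert ! autovideosink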
Today, some more words on memory management, on the practicalities of a
system with conservatively-traced references.
The context is that I have finally started banging
Whippet into
Guile, initially in a configuration that
continues to use the conservative Boehm-Demers-Weiser (BDW) collector
behind the scenes. In that way I can incrementally migrate over all of
the uses of the BDW API in Guile to use Whippet API instead, and then if
all goes well, I should be able to switch Whippet to use another GC
algorithm, probably the mostly-marking collector
(MMC).
MMC scales better than BDW for multithreaded mutators, and it can
eliminate fragmentation via Immix-inspired optimistic evacuation.
problem statement: how to manage ambiguous edges
A garbage-collected heap consists of memory, which is a set of
addressable locations. An object is a disjoint part of a heap, and is
the unit of allocation. A field is memory within an object that may
refer to another object by address. Objects are nodes in a directed graph in
which each edge is a field containing an object reference. A root is an
edge into the heap from outside. Garbage collection reclaims memory from objects that are not reachable from the graph
that starts from a set of roots. Reclaimed memory is available for new
allocations.
In the course of its work, a collector may want to relocate an object,
moving it to a different part of the heap. The collector can do so if
it can update all edges that refer to the object to instead refer to its
new location. Usually a collector arranges things so all edges have the
same representation, for example an aligned word in memory; updating an
edge means replacing the word’s value with the new address. Relocating
objects can improve locality and reduce fragmentation, so it is a good
technique to have available. (Sometimes we say evacuate, move, or compact
instead of relocate; it’s all the same.)
Some collectors allow ambiguous edges: words in memory whose value
may be the address of an object, or might just be scalar data.
Ambiguous edges usually come about if a compiler doesn’t precisely
record which stack locations or registers contain GC-managed objects.
Such ambiguous edges must be traced conservatively: the collector adds
the object to its idea of the set of live objects, as if the edge were a
real reference. This tracing mode isn’t supported by all collectors.
Any object that might be the target of an ambiguous edge cannot be
relocated by the collector; a collector that allows conservative edges
cannot rely on relocation as part of its reclamation strategy.
Still, if the collector can know that a given object will not be the referent
of an ambiguous edge, relocating it is possible.
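As a rough illustration (plain C with entirely made-up names; this is not
Whippet or BDW code), conservatively tracing a single ambiguous word might
look something like this:

#include <stdint.h>

/* Hypothetical types and helpers, for illustration only. */
struct heap;  struct object;
int looks_like_object_start(struct heap *h, uintptr_t addr);
void mark_and_pin(struct heap *h, struct object *obj);
uintptr_t heap_base(struct heap *h);
uintptr_t heap_limit(struct heap *h);

/* Treat a word found in a stack slot or register as a possible reference. */
static void trace_ambiguous_word(struct heap *heap, uintptr_t word) {
  if (word >= heap_base(heap) && word < heap_limit(heap)  /* inside the heap?       */
      && looks_like_object_start(heap, word)) {           /* plausible object addr? */
    /* Keep the object alive, and pin it: this edge cannot be updated,
       so the object must not be relocated. */
    mark_and_pin(heap, (struct object *) word);
  }
  /* Otherwise the word is just scalar data and is ignored. */
}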
How can one know that an object is not the target of an ambiguous edge?
We have to partition the heap somehow into
possibly-conservatively-referenced and
definitely-not-conservatively-referenced. The two ways that I know to
do this are spatially and temporally.
Spatial partitioning means that regardless of the set of root and
intra-heap edges, there are some objects that will never be
conservatively referenced. This might be the case for a type of object
that is “internal” to a language implementation; third-party users that
may lack the discipline to precisely track roots might not be exposed to
objects of a given kind. Still, link-time optimization tends to weather
these boundaries, so I don’t see it as being too reliable over time.
Temporal partitioning is more robust: if all ambiguous references come
from roots, then if one traces roots before intra-heap edges, then any
object not referenced after the roots-tracing phase is available for
relocation.
kinds of ambiguous edges in guile
So let’s talk about Guile! Guile uses BDW currently, which considers
edges to be ambiguous by default. However, given that objects carry
type tags, Guile can, with relatively little effort, switch to precisely
tracing most edges. “Most”, however, is not sufficient; to allow for
relocation, we need to eliminate intra-heap ambiguous edges, to
confine conservative tracing to the roots-tracing phase.
Conservatively tracing references from C stacks or even from static data
sections is not a problem: these are roots, so, fine.
Guile currently traces Scheme stacks almost-precisely: its compiler
emits stack maps for every call site, which uses liveness analysis to
only mark those slots that are Scheme values that will be used in the
continuation. However it’s possible that any given frame is marked
conservatively. The most common case is when using the BDW collector
and a thread is pre-empted by a signal; then its most recent stack frame
is likely not at a safepoint and indeed is likely undefined in terms of
Guile’s VM. It can also happen if there is a call site within a VM
operation, for example to a builtin procedure, if it throws an exception
and recurses, or causes GC itself. Also, when per-instruction
traps
are enabled, we can run Scheme between any two Guile VM operations.
So, Guile could change to trace Scheme stacks fully precisely, but this
is a lot of work; in the short term we will probably just trace Scheme
stacks as roots instead of during the main trace.
However, there is one more significant source of ambiguous roots, and
that is reified continuation objects. Unlike active stacks, these have
to be discovered during a trace and cannot be partitioned out to the
root phase. For delimited continuations, these consist of a slice of
the Scheme stack. Traversing a stack slice precisely is less
problematic than for active stacks, because it isn’t in motion, and it
is captured at a known point; but we will have to deal with stack frames
that are pre-empted in unexpected locations due to exceptions within
builtins. If a stack map is missing, probably the solution there is to
reconstruct one using local flow analysis over the bytecode of the stack
frame’s function; time-consuming, but it should be robust as we do it
elsewhere.
Undelimited continuations (those captured by call/cc) contain a slice
of the C stack also, for historical reasons, and there we can’t trace it
precisely at all. Therefore either we disable relocation if there are
any live undelimited continuation objects, or we eagerly pin any object
referred to by a freshly captured stack slice.
fin
If you want to follow along with the Whippet-in-Guile work, see the
wip-whippet
branch in Git. I’ve bumped its version to 4.0 because, well, why the
hell not; if it works, it will certainly be worth it. Until next time,
happy hacking!
Some short thoughts on recent antitrust and the future of the web platform...
Last week, in a 115-page US antitrust ruling, a federal judge in Virginia found that Google had two more monopolies, this time in relation to advertising technologies. Previously, you'll recall, we had rulings related to search. There are still more open cases related to Android. And it's not only in the US that similar actions are playing out.
All of these cases kind of mention one another because the problems themselves are all deeply intertwined - but this one is really at the heart of it: That sweet, sweet ad money. I think that you could argue, reasonably, that pretty much everything else was somehow in service of that.
Initially, they made a ton of money showing ads every time someone searches, and pretty quickly signed a default search deal with Mozilla to drive the number of searches way up.
Why make a browser of your own? To drive the searches that show the ads, but also keep more of the money.
Why make OSes of your own, and deals around things that need to be installed? To guarantee that all of those devices drive the searches to show the ads.
And so on...
For a long time now, I've been trying to discuss what, to me, is a rather worrying problem: That those default search dollars are, in the end, what funds the whole web ecosystem. Don't forget that it's not just about the web itself, it's about the platform which provides the underlying technology for just about everything else at this point too.
Between years of blog posts, a podcast series, several talks, and experiments like Open Prioritization, I have been thinking about this a lot. Untangling it all is going to be quite complex.
In the US, the government's proposed remedies touch just about every part of this. I've been trying to think about how I could sum up my feelings and concerns, but it's quite complex. Then, the other day, an article on Ars Technica contained an illustration which seemed pretty perfect.
A "game" board that looks like the game Operation, but instead of pieces of internal anatomy there are logos for chrome, gmail, ads, adsense, android and on the board it says "Monoperation: Skill game where you are the DOJ" and the person is removing chrome, and a buzzer is going ff
This image (credited to Aurich Lawson) kind of hit the nail on the head for me: I deeply hope they will be absolutely surgical about this intervention, because the patient I'm worried about isn't Google, it's the whole web platform.
If this is interesting to you, my colleague Eric Meyer and I posted an Igalia Chats podcast episode on the topic: Adpocalypse Now?
Notes on AI for Mathematics and Theoretical Computer Science
In April 2025 I had the pleasure to attend an intense week-long workshop at the Simons Institute for the Theory of Computing entitled AI for Mathematics and Theoretical Computer Science. The event was organized jointly with the Simons Laufer Mathematical Sciences Institute (SLMath, for short). It was an intense time (five fully-packed days!) for learning a lot about cutting-edge ideas in this intersection of formal mathematics (primarily in Lean), AI, and powerful techniques for solving mathematical problems, such as SAT solvers and decision procedures (e.g., the Walnut system). Videos of the talks (but not of the training sessions) have been made available.
Every day, several dozen people were in attendance. Judging from the array of unclaimed badges (easily another several dozen), quite a lot more had signed up for the event but didn't come for one reason or another. It was inspiring to be in the room with so many people involved in these ideas. The training sessions in the afternoon had a great vibe, since so many people were learning and working together simultaneously.
It was great to connect with a number of people, of all stripes. Most of the presenters and attendees were coming from academia, with a minority, such as myself, coming from industry.
The organization was fantastic. We had talks in the morning and training in the afternoon. The final talk in the morning, before lunch, was an introduction to the afternoon training. The training topics were:
The links above point to the tutorial git repos for following along at home.
In the open discussion on the final afternoon, I raised my hand and outed myself as someone coming to the workshop from an industry perspective. Although I had already met a few people from industry before Friday, raising my hand and inviting fellow practitioners to talk led to meeting a few more.
The talks were fascinating; the selection of speakers and topics was excellent. Go ahead and take a look at the list of videos, pick out one or two of interest, grab a beverage of your choice, and enjoy.
With the release of GStreamer 1.26, we now have playback support for Versatile Video Coding (VVC/H.266). In this post, I’ll describe the pieces of the puzzle that enable this, the contributions that led to it, and hopefully provide a useful guideline to adding a new video codec in GStreamer.
With GStreamer 1.26 and the relevant plugins enabled, one can play multimedia files containing VVC content, for example, by using gst-play-1.0:
gst-play-1.0 vvc.mp4
By using gst-play-1.0, a pipeline using playbin3 will be created and the appropriate elements will be auto-plugged to decode and present the VVC content. Here’s what such a pipeline looks like:
Although the pipeline is quite large, the specific bits we’ll focus on in this blog are inside parsebin and decodebin3:
qtdemux → ... → h266parse → ... → avdec_h266
I’ll explain what each of those elements is doing in the next sections.
To store multiple kinds of media (e.g. video, audio and captions) in a way that keeps them synchronized, we typically make use of container formats. This process is usually called muxing, and in order to play back the file we perform de-muxing, which separates the streams again. That is what the qtdemux element is doing in the pipeline above, by extracting the audio and video streams from the input MP4 file and exposing them as the audio_0 and video_0 pads.
Support for muxing and demuxing VVC streams in container formats was added to:
qtmux and qtdemux: for ISOBMFF/QuickTime/MP4 files (often saved with the .mp4 extension)
mpegtsmux and tsdemux: for MPEG transport stream (MPEG-TS) files (often saved with the .ts extension)
Besides the fact that the demuxers are used for playback, adding VVC support to the muxer elements also allows us to perform remuxing: changing the container format without transcoding the underlying streams.
Some examples of simplified re-muxing pipelines (only taking into account the VVC video stream):
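The original examples aren't reproduced here, but representative sketches (with placeholder file names) would be:

# MP4 -> MPEG-TS, without transcoding
$ gst-launch-1.0 filesrc location=in.mp4 ! qtdemux ! h266parse ! mpegtsmux ! filesink location=out.ts

# Matroska -> MPEG-TS, without transcoding
$ gst-launch-1.0 filesrc location=in.mkv ! matroskademux ! h266parse ! mpegtsmux ! filesink location=out.ts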
But why do we need h266parse when re-muxing from Matroska to MPEG-TS? That’s what I’ll explain in the next section.
Parsing and converting between VVC bitstream formats
Video codecs like H.264, H.265, H.266 and AV1 may have different stream formats, depending on which container format is used to transport them. For VVC specifically, there are two main variants, as shown in the caps for h266parse:
Pad Templates:
  SINK template: 'sink'
    Availability: Always
    Capabilities:
      video/x-h266
byte-stream or so-called Annex-B format (as in Annex B from the VVC specification): it separates the NAL units by start code prefixes (0x000001 or 0x00000001), and is the format used in MPEG-TS, or also when storing VVC bitstreams in files without containers (so-called “raw bitstream files”).
ℹ️ Note: It’s also possible to play raw VVC bitstream files with gst-play-1.0. That is achieved by the typefind element detecting the input file as VVC and playbin taking care of auto-plugging the elements.
vvc1 and vvi1: those formats use length field prefixes before each NAL unit. The difference between the two formats is the way that parameter sets (e.g. SPS, PPS, VPS NALs) are stored, and reflected in the codec_data field in GStreamer caps. For vvc1, the parameter sets are stored as container-level metadata, while vvi1 allows for the parameter sets to be stored also in the video bitstream.
The alignment field in the caps signals whether h266parse will collect multiple NALs into an Access Unit (AU) for a single GstBuffer, where an AU is the smallest unit for a decodable video frame, or whether each buffer will carry only one NAL.
That explains why we needed the h266parse when converting from MKV to MPEG-TS: it’s converting from vvc1/vvi1 to byte-stream! So the gst-launch-1.0 command with more explicit caps would be:
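A sketch of what that command could look like, with the caps spelled out (the exact caps values here are illustrative):

$ gst-launch-1.0 filesrc location=in.mkv ! matroskademux \
    ! h266parse ! video/x-h266,stream-format=byte-stream,alignment=au \
    ! mpegtsmux ! filesink location=out.ts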
FFmpeg 7.1 has a native VVC decoder which is considered stable. In GStreamer 1.26, we have allowlisted that decoder in gst-libav, and it is now exposed as the avdec_h266 element.
Intel has added the vah266dec element in GStreamer 1.26, which enables hardware-accelerated VVC decoding on Intel Lunar Lake CPUs. However, it still has a rank of 0 in GStreamer 1.26, so in order to test it out one would need to, for example, manually set GST_PLUGIN_FEATURE_RANK.
Similar to h266parse, vah266dec was initially added with support for only the byte-stream format. I implemented support for the vvc1 and vvi1 modes in the base h266decoder class, which fixes the support for them in vah266dec as well. However, it hasn't been merged yet and I don't expect it to be backported to 1.26, so it will likely only be available in GStreamer 1.28.
Here’s a quick demo of vah266dec in action on an ASUS ExpertBook P5. In this screencast, I perform the following actions:
Run vainfo and display the presence of VVC decoding profile
Run gst-inspect-1.0 vah266dec
export GST_PLUGIN_FEATURE_RANK='vah266dec:max'
Start playback of six simultaneous 4K@60 DASH VVC streams. The stream in question is the classic Tears of Steel, sourced from the DVB VVC test streams.
Run nvtop, showing GPU video decoding & CPU usage per process.
A handy tool for testing the new decoder elements is Fluster. It simplifies the process of testing decoder conformance and comparing decoders by using test suites that are adopted by the industry. It's worth checking out, and it's already common practice to test new decoders with this framework. I added the GStreamer VVC decoders to it: vvdec, avdec_h266 and vah266dec.
We’re still missing the ability to encode VVC video in GStreamer. I have a work-in-progress branch that adds the vvenc element, by using VVenC and safe Rust bindings (similarly to the vvdec element), but it still needs some work. I intend to work on it during the GStreamer Spring Hackfest 2025 to make it ready to submit upstream 🤞
2025 was my first year at FOSDEM, and I can say it was an incredible experience
where I met many colleagues from Igalia who live around
the world, and also many friends from the Linux display stack who are part of
my daily work and contributions to DRM/KMS. In addition, I met new faces and
recognized others with whom I had interacted on some online forums and we had
good and long conversations.
During FOSDEM 2025 I had the opportunity to present
about kworkflow in the kernel devroom. Kworkflow is a
set of tools that help kernel developers with their routine tasks and it is the
tool I use for my development tasks. In short, every contribution I make to the
Linux kernel is assisted by kworkflow.
The goal of my presentation was to spread the word about kworkflow. I aimed to
show how the suite consolidates good practices and recommendations of the
kernel workflow in short commands. These commands are easily configurable and
memorized for your current work setup, or for your multiple setups.
For me, Kworkflow is a tool that accommodates the needs of different agents in
the Linux kernel community. Active developers and maintainers are the main
target audience for kworkflow, but it is also inviting for users and user-space
developers who just want to report a problem and validate a solution without
needing to know every detail of the kernel development workflow.
Something I didn't emphasize during the presentation, but would like to
correct here, is that the main author and developer of kworkflow is my
colleague at Igalia, Rodrigo Siqueira. To be honest,
my contributions are mostly requesting and validating new features, fixing
bugs, and sharing scripts to increase feature coverage.
So, the video and slide deck of my FOSDEM presentation are available for
download
here.
And, as usual, you will find in this blog post the script of this presentation
and more detailed explanation of the demo presented there.
Kworkflow at FOSDEM 2025: Speaker Notes and Demo
Hi, I’m Melissa, a GPU kernel driver developer at Igalia and today I’ll be
giving a very inclusive talk to not let your motivation go by saving time with
kworkflow.
So, you’re a kernel developer, or you want to be a kernel developer, or you
don’t want to be a kernel developer. But you’re all united by a single need:
you need to validate a custom kernel with just one change, and you need to
verify that it fixes or improves something in the kernel.
And that’s a given change for a given distribution, or for a given device, or
for a given subsystem…
Look to this diagram and try to figure out the number of subsystems and related
work trees you can handle in the kernel.
So, whether you are a kernel developer or not, at some point you may come
across this type of situation:
There is a userspace developer who wants to report a kernel issue and says:
Oh, there is a problem in your driver that can only be reproduced by running this specific distribution.
And the kernel developer asks:
Oh, have you checked if this issue is still present in the latest kernel version of this branch?
But the userspace developer has never compiled and installed a custom kernel
before. So they have to read a lot of tutorials and kernel documentation to
create a kernel compilation and deployment script. Finally, the reporter
managed to compile and deploy a custom kernel and reports:
Sorry for the delay, this is the first time I have installed a custom kernel.
I am not sure if I did it right, but the issue is still present in the kernel
of the branch you pointed out.
And then, the kernel developer needs to reproduce this issue on their side, but
they have never worked with this distribution, so they just created a new
script, but the same script created by the reporter.
What's the problem with this situation? The problem is that you keep creating
new scripts!
Every time you change distribution, change architecture, change hardware, or
change project - even within the same company - the development setup may
change, and you end up creating yet another script for your new kernel
development workflow!
You know, you have a lot of babies, you have a collection of “my precious
scripts”, like Sméagol (Lord of the Rings) with the precious ring.
Instead of creating and accumulating scripts, save yourself time with
kworkflow. Here is a typical script that many of you may have. This is a
Raspberry Pi 4 script and contains everything you need to memorize to compile
and deploy a kernel on your Raspberry Pi 4.
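The script from the slide isn't reproduced here; a hypothetical example of
this kind of per-device script (host name, user and paths are made up) is:

#!/bin/bash
# build-and-deploy-rpi4.sh: the sort of one-off script kworkflow replaces
set -e
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j"$(nproc)" Image modules dtbs
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- INSTALL_MOD_PATH=./mods modules_install
scp arch/arm64/boot/Image pi@raspberrypi:/tmp/kernel-custom.img
rsync -a ./mods/lib/modules/ pi@raspberrypi:/tmp/modules/
ssh pi@raspberrypi 'sudo cp /tmp/kernel-custom.img /boot/ && \
  sudo cp -r /tmp/modules/* /lib/modules/ && \
  echo "kernel=kernel-custom.img" | sudo tee -a /boot/config.txt && \
  sudo reboot'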
With kworkflow, you only need to memorize two commands, and those commands are
not specific to the Raspberry Pi. They are the same commands for different
architectures, kernel configurations, and target devices.
What is kworkflow?
Kworkflow is a collection of tools and software combined to:
Optimize Linux kernel development workflow.
Reduce time spent on repetitive tasks, since we are spending our lives
compiling kernels.
Standardize best practices.
Ensure reliable data exchange across the kernel workflow. For example: two
people describe the same setup but are not seeing the same thing; kworkflow
can ensure both actually have the same kernel, modules and options enabled.
I don’t know if you will get this analogy, but kworkflow is for me a megazord
of scripts. You are combining all of your scripts to create a very powerful
tool.
What are the main features of kworkflow?
There are many, but these are the most important for me:
Build & deploy custom kernels across devices & distros.
Handle cross-compilation seamlessly.
Manage multiple architectures, settings and target devices in the same work tree.
Organize kernel configuration files.
Facilitate remote debugging & code inspection.
Standardize Linux kernel patch submission guidelines. You don't need to
double-check the documentation, nor does Greg need to tell you that you are
not following the Linux kernel guidelines.
Upcoming: Interface to bookmark, apply and “reviewed-by” patches from
mailing lists (lore.kernel.org).
This is the list of commands you can run with kworkflow.
The first subset is to configure your tool for various situations you may face
in your daily tasks.
We have some tools to manage and interact with target machines.
# Manage and interact with target machines
kw ssh (s) - SSH support
kw remote (r) - Manage machines available via ssh
kw vm - QEMU support
To inspect and debug a kernel.
# Inspect and debug
kw device - Show basic hardware information
kw explore (e) - Explore string patterns in the work tree and git logs
kw debug - Linux kernel debug utilities
kw drm - Set of commands to work with DRM drivers
To automate best practices for patch submission, like code style, maintainers,
and the correct list of recipients and mailing lists for a change, ensuring we
are sending the patch to those who are interested in it.
# Automate best practices for patch submission
kw codestyle (c) - Check code style
kw maintainers (m) - Get maintainers/mailing list
kw send-patch - Send patches via email
And the last one, the upcoming patch hub.
# Upcoming
kw patch-hub - Interact with patches (lore.kernel.org)
How can you save time with Kworkflow?
So how can you save time building and deploying a custom kernel?
First, you need a .config file.
Without kworkflow: You may be manually extracting and managing .config
files from different targets, saving them with different suffixes to link
each kernel config to its target device or distribution (or any descriptive
suffix to help identify which is which), or even copying and pasting from
somewhere.
With kworkflow: you can use the kernel-config-manager command, or simply
kw k, to store, describe and retrieve a specific .config file very easily,
according to your current needs.
Then you want to build the kernel:
Without kworkflow: You are probably now memorizing a combination of
commands and options.
With kworkflow: you just need kw b (kw build) to build the kernel with
the correct settings for cross-compilation, compilation warnings, cflags,
etc. It also shows some information about the kernel, like number of modules.
Finally, to deploy the kernel in a target machine.
Without kworkflow: You might be doing things like: SSH connecting to the
remote machine, copying and removing files according to distributions and
architecture, and manually updating the bootloader for the target distribution.
With kworkflow: you just need kw d which does a lot of things for you,
like: deploying the kernel, preparing the target machine for the new
installation, listing available kernels and uninstall them, creating a tarball,
rebooting the machine after deploying the kernel, etc.
You can also save time on debugging kernels locally or remotely.
Without kworkflow: you do ssh, manual setup and trace enablement, and
copy&paste of logs.
With kworkflow: more straightforward access to debug utilities: events,
trace, dmesg.
You can save time on managing multiple kernel images in the same work tree.
Without kworkflow: you may be cloning the same repository multiple times so
you don't lose compiled files when changing kernel configuration or
compilation options, and manually managing build and deployment scripts.
With kworkflow: you can use kw env to isolate multiple contexts in the
same worktree as environments, so you can keep different configurations in
the same worktree and switch between them easily without losing anything from
the last time you worked in a specific context.
Finally, you can save time when submitting kernel patches. In kworkflow, you
can find everything you need to wrap your changes in patch format and submit
them to the right list of recipients, those who can review, comment on, and
accept your changes.
This is a demo that the lead developer of the kw patch-hub feature sent me.
With this feature, you will be able to check out a series on a specific mailing
list, bookmark those patches in the kernel for validation, and when you are
satisfied with the proposed changes, you can automatically submit a reviewed-by
for that whole series to the mailing list.
Demo
Now a demo of how to use kw environment to deal with different devices,
architectures and distributions in the same work tree without losing compiled
files, build and deploy settings, .config file, remote access configuration and
other settings specific for those three devices that I have.
Setup
Three devices:
laptop (debian
x86
intel
local)
SteamDeck (steamos
x86
amd
remote)
RaspberryPi 4 (raspbian
arm64
broadcomm
remote)
Goal: To validate a change on DRM/VKMS using a single kernel tree.
Kworkflow commands:
kw env
kw d
kw bd
kw device
kw debug
kw drm
Demo script
In the same terminal and worktree.
First target device: Laptop (debian|x86|intel|local)
$ kw env --list # list environments available in this work tree
$ kw env --use LOCAL # select the environment of local machine (laptop) to use: loading pre-compiled files, kernel and kworkflow settings.
$ kw device # show device information
$ sudo modinfo vkms # show VKMS module information before applying kernel changes.
$ <open VKMS file and change module info>
$ kw bd # compile and install kernel with the given change
$ sudo modinfo vkms # show VKMS module information after kernel changes.
$ git checkout -- drivers
Second target device: RaspberryPi 4 (raspbian|arm64|broadcomm|remote)
$ kw env --use RPI_64 # move to the environment for a different target device.
$ kw device # show device information and kernel image name
$ kw drm --gui-off-after-reboot # set the system to not load graphical layer after reboot
$ kw b # build the kernel with the VKMS change
$ kw d --reboot # deploy the custom kernel in a Raspberry Pi 4 with Raspbian 64, and reboot
$ kw s # connect with the target machine via ssh and check the kernel image name
$ exit
Third target device: SteamDeck (steamos|x86|amd|remote)
$ kw env --use STEAMDECK # move to the environment for a different target device
$ kw device # show device information
$ kw debug --dmesg --follow --history --cmd="modprobe vkms" # run a command and show the related dmesg output
$ kw debug --dmesg --follow --history --cmd="modprobe -r vkms" # run a command and show the related dmesg output
$ <add a printk with a random msg to appear on dmesg log>
$ kw bd # deploy and install custom kernel to the target device
$ kw debug --dmesg --follow --history --cmd="modprobe vkms" # run a command and show the related dmesg output after build and deploy the kernel change
Q&A
Most of the questions raised at the end of the presentation were actually
suggestions and additions of new features to kworkflow.
The first participant, who is also a kernel maintainer, asked about two
features: (1) automating the retrieval of patches from patchwork (or lore) and
triggering the process of building, deploying and validating them using the
existing workflow, and (2) bisect support. They are both very interesting
features. The first one fits well in the patch-hub subproject, which is
under development, and I actually made a similar
request a couple of weeks
before the talk. The second is an already existing
request in the kworkflow GitHub
project.
Another request was to use kexec to avoid a full reboot when testing a new
kernel. Reviewing my presentation, I realized I wasn't very clear that
kworkflow doesn't support kexec. As I replied, what it does is install the
modules, and you can load/unload them for validation; but for built-in parts,
you need to reboot the kernel.
Two more questions: one about Android Debug Bridge (ADB) support instead of
SSH, and another about supporting alternative ways of booting when the custom
kernel ends up broken but you only have one kernel image on the device.
Kworkflow doesn't manage this yet, but I agree it would be a very useful
feature for embedded devices. On the Raspberry Pi 4, kworkflow mitigates the
issue by preserving the distro kernel image and using the config.txt file to
set a custom kernel for booting. There is no ADB support either, and as I
don't currently see kw users working with Android, I don't think we will have
it any time soon, unless we find new volunteers and increase the pool of
contributors.
The last two questions were about the status of the b4 integration, which is
under development, and about other debugging features that the tool doesn't
support yet.
Finally, when Andrea and I were swapping places on stage, he suggested adding
support for virtme-ng to kworkflow. So I
opened an issue to
track this feature request in the project's GitHub.
With all these questions and requests, I could see a general need for a tool
that integrates the variety of kernel developer workflows, as proposed by
kworkflow. There are also still many cases left for kworkflow to cover.
Despite the high demand, this is a completely voluntary project and it is
unlikely that we will be able to meet all these needs given the limited
resources. We will keep trying our best in the hope that we can increase the
pool of users and contributors too.
In my previous post, when I introduced the switch to Skia for 2D rendering, I explained that we replaced Cairo with Skia keeping mostly the same architecture. This alone was an important improvement in performance, but still the graphics implementation was designed for Cairo and CPU rendering. Once we considered the switch to Skia as stable, we started to work on changes to take more advantage of Skia and GPU rendering to improve the performance even more. In this post I’m going to present some of those improvements and other not directly related to Skia and GPU rendering.
Explicit fence support
This is related to the DMA-BUF renderer used by the GTK port and WPE when using the new API. The composited buffer is shared as a DMA-BUF between the web and UI processes. Once the web process finished the composition we created a fence and waited for it, to make sure that when the UI process was notified that the composition was done the buffer was actually ready. This approach was safe, but slow. In 281640@main we introduced support for explicit fencing to the WPE port. When possible, an exportable fence is created, so that instead of waiting for it immediately, we export it as a file descriptor that is sent to the UI process as part of the message that notifies that a new frame has been composited. This unblocks the web process as soon as composition is done. When supported by the platform, for example in WPE under Wayland when the zwp_linux_explicit_synchronization_v1 protocol is available, the fence file descriptor is passed to the platform implementation. Otherwise, the UI process asynchronously waits for the fence by polling the file descriptor before passing the buffer to the platform. This is what we always do in the GTK port since 281744@main. This change improved the score of all MotionMark tests, see for example multiply.
Enable MSAA when available
In 282223@main we enabled support for MSAA when possible, in the WPE port only, because this is more important for embedded devices, where we use 4 samples to provide good enough quality with better performance. This change improved the MotionMark tests that use 2D canvas, like canvas arcs, paths and canvas lines. You can see here the change in paths when run on a Raspberry Pi 4 with 64-bit WPE.
Avoid textures copies in accelerated 2D canvas
As I also explained in the previous post, when 2D canvas is accelerated we now use a dedicated layer that renders into a texture that is copied to be passed to the compositor. In 283460@main we changed the implementation to use a CoordinatedPlatformLayerBufferNativeImage to handle the canvas texture and avoid the copy, directly passing the texture to the compositor. This improved the MotionMark tests that use 2D canvas. See canvas arcs, for example.
Introduce threaded GPU painting mode
In the initial implementation of the GPU rendering mode, layers were painted in the main thread. In 287060@main we moved the rendering task to a dedicated thread when using the GPU, with the same threaded rendering architecture we have always used for CPU rendering, but limited to 1 worker thread. This improved the performance of several MotionMark tests like images, suits and multiply. See images.
Update default GPU thread settings
Parallelization is not as important for GPU rendering as it is for CPU rendering, but we still realized that we got better results by slightly increasing the number of worker threads when doing GPU rendering. In 290781@main we increased the limit of GPU worker threads to 2 for systems with at least 4 CPU cores. This improved mainly images and suits in MotionMark. See suits.
Hybrid threaded CPU+GPU rendering mode
We had either GPU or CPU worker threads for layer rendering. In systems with 4 CPU cores or more we now have 2 GPU worker threads. When those 2 threads are busy rendering, why not use the CPU to render other pending tiles? And the same applies to CPU rendering: when all workers are busy, could we use the GPU to render other pending tasks? We tried it, and it turned out to be a good idea, especially on embedded devices. In 291106@main we introduced the hybrid mode, giving priority to GPU or CPU workers depending on the default rendering mode, and also taking into account special cases like HiDPI, where we are always scaling and therefore always prefer the GPU. This improved multiply, images and suits. See images.
Use Skia API for display list implementation
When rendering with Cairo and threaded rendering enabled we use our own implementation of display lists specific to Cairo. When switching to Skia we thought it was a good idea to use the WebCore display list implementation instead, since it's a cross-platform implementation shared with other ports. But we realized this implementation is not yet ready to support multiple threads, because it holds references to WebCore objects that are not thread safe: the main thread might change those objects before they have been processed by the painting threads. So, we decided to try the Skia API (SkPicture), which supports recording in the main thread and replaying from worker threads. In 292639@main we replaced the WebCore display list usage with SkPicture. This was expected to be a neutral change in terms of performance, but it surprisingly improved several MotionMark tests like leaves, multiply and suits. See leaves.
Use Damage to track the dirty region of GraphicsLayer
Every time there's a change in a GraphicsLayer and it needs to be repainted, it's notified and the area that changed is included, so that we only render the parts of the layer that changed. That's what we call the layer dirty region. When there are many small updates in a layer we can end up with lots of dirty regions on every layer flush. We used to have a limit of 32 dirty regions per layer, so that when more were added we just united them all into the first dirty area. This limit was removed because we always unite the dirty areas for the same tiles when processing the updates to prepare the rendering tasks. However, we also tried to avoid handling the same dirty region twice, so every time a new dirty region was added we iterated the existing regions to check whether it was already present. Without the 32-region limit, that means we ended up iterating a potentially very long list on every dirty region addition. The damage propagation feature uses a Damage class to efficiently handle dirty regions, so we thought we could reuse it to track the layer dirty region, bringing back the limit but uniting in a more efficient way than always using the first dirty area of the list. It also allowed us to remove the check for duplicated areas in the list. This change was added in 292747@main and improved the performance of the MotionMark leaves and multiply tests. See leaves.
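The gist of a bounded, cheaper-to-update dirty-region container can be sketched like this. It is only an illustration of the idea (a fixed grid of buckets with O(1) insertion), not the actual Damage implementation used by WebKit.

```cpp
#include <algorithm>
#include <array>

struct Rect {
    int x { 0 }, y { 0 }, width { 0 }, height { 0 };
    bool isEmpty() const { return width <= 0 || height <= 0; }
};

static Rect unite(const Rect& a, const Rect& b)
{
    if (a.isEmpty())
        return b;
    int x1 = std::min(a.x, b.x), y1 = std::min(a.y, b.y);
    int x2 = std::max(a.x + a.width, b.x + b.width);
    int y2 = std::max(a.y + a.height, b.y + b.height);
    return { x1, y1, x2 - x1, y2 - y1 };
}

class BoundedDamage {
public:
    static constexpr int columns = 8;
    static constexpr int rows = 4; // at most 32 tracked rects per layer

    BoundedDamage(int layerWidth, int layerHeight)
        : m_cellWidth(std::max(1, layerWidth / columns))
        , m_cellHeight(std::max(1, layerHeight / rows))
    {
    }

    // Unite the new dirty rect into the grid cell its center falls in; this
    // keeps additions O(1) and the number of tracked rects bounded.
    void add(const Rect& rect)
    {
        int column = std::clamp((rect.x + rect.width / 2) / m_cellWidth, 0, columns - 1);
        int row = std::clamp((rect.y + rect.height / 2) / m_cellHeight, 0, rows - 1);
        Rect& bucket = m_buckets[row * columns + column];
        bucket = unite(bucket, rect);
    }

private:
    int m_cellWidth;
    int m_cellHeight;
    std::array<Rect, columns * rows> m_buckets {};
};
```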
Record all dirty tiles of a layer once
After the switch to SkPicture for the display list implementation, we realized that this API would also allow us to record the graphics layer once, using the bounding box of the dirty region, and then replay it multiple times on worker threads, once for every dirty tile. Recording can be a very heavy operation, especially when there are shadows or filters, and it was always done for every tile due to the limitations of the previous display list implementation. In 292929@main we introduced this change, with improvements in the MotionMark leaves and multiply tests. See multiply.
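A hedged sketch of the per-tile replay, building on the recording sketch above: one picture recorded for the layer's dirty bounding box, replayed into each dirty tile with a translation and a clip. Real Skia calls, illustrative structure.

```cpp
#include "include/core/SkCanvas.h"
#include "include/core/SkPicture.h"
#include "include/core/SkRect.h"

// tileCanvas: canvas of the tile's backing surface, origin at the tile's
// top-left corner; tileRect: the tile's position and size in layer coordinates.
void replayIntoTile(SkCanvas* tileCanvas, const SkRect& tileRect, const sk_sp<SkPicture>& picture)
{
    tileCanvas->save();
    tileCanvas->translate(-tileRect.left(), -tileRect.top()); // layer -> tile coordinates
    tileCanvas->clipRect(tileRect);                           // only rasterize what the tile covers
    tileCanvas->drawPicture(picture);                         // the heavy recording happened only once
    tileCanvas->restore();
}
```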
MotionMark results
I've shown here the improvements of these changes in some of the MotionMark tests. I have to say that some of those changes also introduced small regressions in other tests, but the global improvement is still noticeable. Here is a table with the scores of all tests before these improvements and on the current main branch, run with WPE MiniBrowser on a Raspberry Pi 4 (64-bit).
| Test | Score July 2024 | Score April 2025 |
|--------------|---------|---------|
| Multiply | 501.17 | 684.23 |
| Canvas arcs | 140.24 | 828.05 |
| Canvas lines | 1613.93 | 3086.60 |
| Paths | 375.52 | 4255.65 |
| Leaves | 319.31 | 470.78 |
| Images | 162.69 | 267.78 |
| Suits | 232.91 | 445.80 |
| Design | 33.79 | 64.06 |
What’s next?
There’s still quite a lot of room for improvement, so we are already working on other features and exploring ideas to continue improving the performance. Some of those are:
Damage tracking: this feature is already present, but disabled by default because it's still a work in progress. We currently use the damage information to only paint the areas of every layer that changed. But then we always compose a whole frame inside WebKit that is passed to the UI process to be presented on screen. It's possible to use the damage information to improve both the composition inside WebKit and the presentation of the composited frame on the screen. For more details about this feature, read Pawel's awesome blog post about it.
Use DMA-BUF for tile textures to improve pixel transfer operations: We currently use DMA-BUF buffers to share the composited frame between the web and UI processes. We are now exploring the idea of using DMA-BUF also for the textures used by the WebKit compositor to generate the frame. This would make it possible to improve the performance of pixel transfer operations: for example, when doing CPU rendering we need to upload the dirty regions from main memory to a compositor texture on every composition. With DMA-BUF backed textures we can map the buffer into main memory and paint with the CPU directly into the mapped buffer (see the sketch after this list).
Compositor synchronization: We plan to try to improve the synchronization of the WebKit compositor with the system vblank and the different sources of composition (painted layers, video layers, CSS animations, WebGL, etc.)
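Regarding the DMA-BUF idea above, a hedged sketch of what painting into a mapped buffer could look like is shown below. This is just the general pattern (mmap plus DMA_BUF_IOCTL_SYNC bracketing), not WebKit code, and it omits stride, format and error handling.

```cpp
#include <linux/dma-buf.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

void* mapDmaBufForCPUPainting(int dmabufFD, size_t size)
{
    void* pixels = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, dmabufFD, 0);
    if (pixels == MAP_FAILED)
        return nullptr;

    // Bracket CPU access so caches stay coherent with the GPU.
    struct dma_buf_sync sync = { DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW };
    ioctl(dmabufFD, DMA_BUF_IOCTL_SYNC, &sync);

    // ... paint the dirty regions directly into `pixels` with the CPU ...

    sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW;
    ioctl(dmabufFD, DMA_BUF_IOCTL_SYNC, &sync);
    return pixels; // munmap(pixels, size) when the texture is destroyed
}
```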
Damage propagation is an optional WPE/GTK WebKit feature that — when enabled — reduces the browser's GPU utilization at the expense of increased CPU and memory utilization. It's especially useful in the context of low- and mid-end
embedded devices, where GPUs are most often not too powerful and thus become a performance bottleneck in many applications.
In computer graphics, the damage term is usually used in the context of repeatable rendering and means essentially “the region of a rendered scene that changed and requires repainting”.
In the context of WebKit, the above definition may be specialized a bit as WebKit’s rendering engine is about rendering web content to frames (passed further to the platform) in response to changes within a web page.
Thus the definition of WebKit's damage refers, more specifically, to "the region of the web page view that changed since the previous frame and requires repainting".
On the implementation level, the damage is almost always a collection of rectangles that cover the changed region. This is exactly the case for WPE and GTK WebKit ports.
To better understand what the above means, it’s recommended to carefully examine the below screenshot of GTK MiniBrowser as it depicts the rendering of the poster circle demo
with the damage visualizer activated:
In the image above, one can see the following elements:
the web page view — marked with a rectangle stroked in magenta,
the damage — marked with red rectangles,
the browser elements — everything that lies above the rectangle stroked in magenta.
What the above image depicts in practice is that during that particular frame rendering, the area highlighted in red (the damage) has changed and needs to be repainted. Thus — as expected — only the moving parts of the demo require repainting.
It's also worth emphasizing how small a fraction of the web page view requires repainting in that case. Hence one can imagine the gains from the reduced amount of painting.
Normally, the job of the rendering engine is to paint the contents of a web page view to a frame (or buffer in more general terms) and provide such rendering result to the platform on every scene rendering iteration —
which usually is 60 times per second.
Without the damage propagation feature, the whole frame (the whole web page view) is always marked as changed. Therefore, the platform has to perform a full update of all its pixels 60 times per second.
While in most of the use cases, the above approach is good enough, in the case of embedded devices with less powerful GPUs, this can be optimized. The basic idea is to produce the frame along with the damage information i.e. a hint for
the platform on what changed within the produced frame. With the damage provided (usually as an array of rectangles), the platform can optimize a lot of its operations as — effectively — it can
perform just a partial update of its internal memory. In practice, this usually means that fewer pixels require updating on the screen.
For the above optimization to work, the damage has to be calculated by the rendering engine for each frame and then propagated along with the produced frame up to its final destination. Thus the damage propagation can be summarized
as continuous damage calculation and propagation throughout the web engine.
Once the general idea has been highlighted, it’s possible to examine the damage propagation in more detail. Before reading further, however, it’s highly recommended for the reader to go carefully through the
famous “WPE Graphics architecture” article that gives a good overview of the WebKit graphics pipeline in general and which introduces the basic terminology
used in that context.
The information on the visual changes within the web page view has to travel a very long way before it reaches its final destination. As it traverses thread and process boundaries in an orderly manner, it can be summarized
as forming a pipeline within the broader graphics pipeline. The image below presents an overview of such a damage propagation pipeline:
This pipeline starts with the changes to the web page view visual state (RenderTree) being triggered by one of many possible sources. Such sources may include:
User interactions — e.g. moving the mouse cursor around (and hence hovering over elements etc.), typing text using the keyboard, etc.
Web API usage — e.g. the web page changing DOM, CSS etc.
multimedia — e.g. the media player in a playing state,
and many others.
Once the changes are induced for certain RenderObjects, their visual impact is calculated and encoded as rectangles called dirty rectangles, as they
require re-painting within the GraphicsLayer the particular RenderObject
maps to. At this point, the visual changes may simply be called layer damage, as the dirty rectangles are stored in the layer coordinate space and describe what changed within that particular layer since the last frame was rendered.
The next step in the pipeline is passing the layer damage of each GraphicsLayer
(GraphicsLayerCoordinated) to the WebKit’s compositor. This is done along with any other layer
updates and is mostly covered by the CoordinatedPlatformLayer.
The "coordinated" prefix of that name is not without meaning. As threaded accelerated compositing is usually used nowadays, passing the layer damage to the WebKit compositor must be coordinated between the main thread and
the compositor thread.
When the layer damage of each layer is passed to the WebKit compositor, it's stored in the TextureMapperLayer that corresponds to the given
layer's CoordinatedPlatformLayer. With that — and with all other layer-level updates — the
WebKit compositor can start computing the frame damage, i.e. the final damage to be passed to the very end of the pipeline.
The first step to building frame damage is to process the layer updates. Layer updates describe changes of various layer properties such as size, position, transform, opacity, background color, etc. Many of those updates
have a visual impact on the final frame, therefore a portion of frame damage must be inferred from those changes. For example, a layer’s transform change that effectively changes the layer position means that the layer
visually disappears from one place and appears in the other. Thus the frame damage has to account for both the layer’s old and new position.
Once the layer updates are processed, the WebKit compositor has a full set of information to take the layer damage of each layer into account. Thus, in the second step, the WebKit compositor traverses the tree formed out of
TextureMapperLayer objects and collects their layer damage. Once the layer damage of a certain layer
is collected, it's transformed from the layer coordinate space into a global coordinate space so that it can be added to the frame damage directly.
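As a rough illustration of that second step, the traversal boils down to something like the following sketch. The types are hypothetical and a simplified transform stands in for the real layer-to-global mapping.

```cpp
#include <vector>

struct Rect { float x, y, width, height; };

struct Transform {
    float scaleX { 1 }, scaleY { 1 }, translateX { 0 }, translateY { 0 }; // simplified, no rotation
    Rect mapRect(const Rect& r) const
    {
        return { r.x * scaleX + translateX, r.y * scaleY + translateY,
                 r.width * scaleX, r.height * scaleY };
    }
};

struct Layer {
    Transform toGlobal;                 // stands in for the accumulated layer transform
    std::vector<Rect> layerDamage;      // dirty rects in layer coordinates
    std::vector<Layer*> children;
};

void collectFrameDamage(const Layer& layer, std::vector<Rect>& frameDamage)
{
    for (const Rect& rect : layer.layerDamage)
        frameDamage.push_back(layer.toGlobal.mapRect(rect)); // layer -> global space
    for (const Layer* child : layer.children)
        collectFrameDamage(*child, frameDamage);             // walk the layer tree
}
```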
After those two steps, the frame damage is ready. At this point, it can be used for a couple of extra use cases:
for WebKit’s compositor itself to perform some extra optimizations — as will be explained in the WebKit’s compositor optimizations section,
for layout tests.
Eventually — regardless of extra uses — the WebKit compositor composes the frame and sends it (a handle to it) to the UI process along with the frame damage, using the IPC mechanism.
In the UI process, there are basically two options determining the frame damage's destiny — it can be either consumed or ignored — depending on the platform-facing implementation.
At the moment of writing, the damage propagation feature is run-time-disabled by default (PropagateDamagingInformation feature flag) and compile-time enabled by default for GTK and WPE (with new platform API) ports.
Overall, the feature works pretty well in the majority of real-world scenarios. However, there are still some uncovered code paths that lead to visual glitches. Therefore it’s fair to say the feature is still a work in progress.
The work, however, is pretty advanced. Moreover, the feature is set to a testable state and thus it’s active throughout all the layout test runs on CI.
Not only is the feature tested by every layout test that tests any kind of rendering, it also has quite a lot of dedicated layout tests.
Not to mention the unit tests covering the Damage class.
In terms of functionalities, when the feature is enabled it:
activates the damage propagation pipeline and hence propagates the damage up to the platform,
When the feature is enabled, the main goal is to activate the damage propagation pipeline so that eventually the damage can be provided to the platform. However, in reality, a substantial part of the pipeline is always active
regardless of the feature being enabled or compiled in. This part of the pipeline ends before the damage reaches
CoordinatedPlatformLayer and is always active because it has been used for layer-level optimizations for a long time.
More specifically — this part of the pipeline existed long before the damage propagation feature and was using layer damage to optimize the layer painting to the intermediate surfaces.
Because of the above, when the feature is enabled, only the part of the pipeline that starts with CoordinatedPlatformLayer
is activated. It is, however, still a significant portion of the pipeline and therefore it implies additional CPU/memory costs.
When the feature is activated and the damage flows through the WebKit compositor, it creates a unique opportunity for the compositor to utilize that information and reduce the amount of painting/compositing it has to perform.
At the moment of writing, the GTK/WPE WebKit compositor uses the damage to optimize the following:
to apply a global glScissor to define the smallest possible clipping rect for all the painting it does — thus reducing the amount of painting (see the short sketch after this list),
to reduce the amount of painting when compositing the tiles of the layers using tiled backing stores.
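As an illustration of the first point, the scissor-based clipping amounts to something like this. These are plain GL calls; how the damage bounding box is computed is assumed to happen elsewhere.

```cpp
#include <GLES2/gl2.h>

void applyDamageScissor(int x, int y, int width, int height, int framebufferHeight)
{
    glEnable(GL_SCISSOR_TEST);
    // GL's scissor origin is the bottom-left corner of the framebuffer.
    glScissor(x, framebufferHeight - y - height, width, height);
    // ... composite the scene; pixels outside the damage are left untouched ...
}
```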
Detailed descriptions of the above optimizations are well beyond the scope of this article and thus will be provided in one of the next articles on the subject of damage propagation.
As mentioned in the above sections, the feature only works in the GTK and the new-platform-API-powered WPE ports. This means that:
In the case of GTK, one can use MiniBrowser or any up-to-date GTK-WebKit-derived browser to test the feature.
In the case of WPE with the new WPE platform API, the cog browser cannot be used as it uses the old API. Therefore, one has to use MiniBrowser
with the --use-wpe-platform-api argument to activate the new WPE platform API.
Moreover, as the feature is run-time-disabled by default, it’s necessary to activate it. In the case of MiniBrowser, the switch is --features=+PropagateDamagingInformation.
For quick testing, it's highly recommended to use the latest revision of WebKit@main with the wkdev SDK container and the GTK port.
Assuming one has set up the container, the commands to build and run GTK’s MiniBrowser are as follows:
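The exact commands depend on the setup, but inside the wkdev container something along these lines should work (the URL is an arbitrary example):

```
Tools/Scripts/build-webkit --gtk --release
Tools/Scripts/run-minibrowser --gtk --features=+PropagateDamagingInformation https://wpewebkit.org
```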
It's also worth mentioning that the WEBKIT_SHOW_DAMAGE=1 environment variable disables the damage-driven GTK/WPE WebKit compositor optimizations, and therefore some glitches that are seen without the envvar may not be seen
when it is set. The URL to this presentation is a great example to explore various glitches that are yet to be fixed. To trigger them, it's enough to navigate
around the presentation using the up/right/down/left arrows.
This article was meant to scratch the surface of the broad damage propagation topic. While it focused mostly on introducing basic terminology and describing the damage propagation pipeline in more detail,
it briefly mentioned or skipped completely the following aspects of the feature:
the problem of storing the damage information efficiently,
the damage-driven optimizations of the GTK/WPE WebKit’s compositor,
the most common use cases for the feature,
the benchmark results on desktop-class and embedded devices.
Therefore, in the next articles, the above topics will be examined to a larger extent.
The new WPE platform API is still not released and thus it’s not yet officially announced. Some information on it, however, is provided by
this presentation prepared for a WebKit contributors meeting.
The platform that WebKit renders to depends on the WebKit port:
in the case of the GTK port, the platform is GTK, so the rendering is done to a GtkWidget,
in the case of the WPE port with the new WPE platform API, the platform is one of the following:
Wayland — in that case rendering is done to the system's compositor,
DRM — in that case rendering is done directly to the screen,
headless — in that case rendering is usually done into a memory buffer.
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
On the WebRTC front, basic support for Rapid Synchronization was added, along with a couple of spec coverage improvements (https://commits.webkit.org/293567@main, https://commits.webkit.org/293569@main).
Dispatch a "canceled" error event for all queued utterances in case of SpeechSynthesis.
Support for the Camera desktop portal was added recently; it will mostly benefit Flatpak apps using WebKitGTK, such as GNOME Web, for access to capture devices, which is a requirement for WebRTC support.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
Work continued on porting the in-place wasm interpreter (IPInt) to 32-bits.
We have been working on bringing the Temporal implementation in JSC up to the current spec, and a step towards that goal was implemented in WebKit PR #43849. This PR changes how calendar annotation key parsing works; it doesn't change anything observable, but sets the groundwork for parsing calendar critical flags and unknown annotations.
Releases 📦️
The recent releases of WebKitGTK and WPE WebKit 2.48 introduced a number of improvements to performance, reduced resource usage, better support for web platform features and standards, multimedia, and more!
Read more about these updates in the freshly published articles for WebKitGTK, and WPE WebKit.
Pawel Lampe published a blog post on the damage propagation feature. This feature reduces browser's GPU utilization at the expense of increased CPU and memory utilization in the WPE and GTK WebKit ports.
Our efforts to bring GstWebRTC support to WebKitGTK and WPEWebKit also include direct contributions to GStreamer. We recently improved WebRTC spec compliance in webrtcbin, by making the SDP mid attribute optional in offers and answers.
Servo has shown that we can build a browser with a modern, parallel layout engine at a fraction of the cost of the big incumbents, thanks to our powerful tooling, our strong community, and our thorough documentation.
But we can, and should, build Servo without generative AI tools like GitHub Copilot.
This post is my personal opinion, not necessarily representative of Servo or my colleagues at Igalia.
I hope it makes a difference.
Recently the TSC voted in favour of two proposals that relax our ban on AI contributions.
This was a mistake, and it was also a mistake to wait until after we had made our decision to seek community feedback (see § On governance).
§ Your feedback made it clear that those proposals are the wrong way forward for Servo.
Correction (2025-04-12)
A previous version of this post highlighted a logic error in the AI-assisted patch we used as a basis for those two proposals.
This error was made in a non-AI-assisted part of the patch.
I call on the TSC to explicitly reaffirm that generative AI tools like Copilot are not welcome in Servo, and make it clear that we intend to keep it that way indefinitely, in both our policy and the community, so we can start rebuilding trust.
It’s not enough to say oops, sorry, we will not be moving forward with these proposals.
Like any logic written by humans, this policy does have some unintended consequences.
Our intent was to ban AI tools that generate bullshit [a] in inscrutable ways, including GitHub Copilot and ChatGPT.
But there are other tools that use similar underlying technology in more useful and less problematic ways (see § Potential exceptions).
Reviewing these tools for use in Servo should be a community-driven process.
We should not punish contributors for honest mistakes, but we should make our policy easier to follow.
Some ways to do this include documenting the tools that are known to be allowed and not allowed, documenting how to turn off features that are not allowed, and giving contributors a way to declare that they’ve read and followed the policy.
The declaration would be a good place to provide a dated link to the policy, giving contributors the best chance to understand the policy and knowingly follow it (or violate it).
This is not perfect, and it won’t always be easy to enforce, but it should give contributors and maintainers a foundation of trust.
Potential exceptions
Proposals for exceptions should start in the community, and should focus on a specific tool used for a specific purpose.
If the proposal is for a specific kind of tool, it must come with concrete examples of which tools are to be allowed.
Much of the harm being caused by generative AI in the world around us comes from people using open-ended tools that are not fit for any purpose, or even treating them like they are AGI.
The goal of these discussions would be to understand:
the underlying challenges faced by contributors
how effective the tool is for the purpose
how well the tool and purpose mitigate the issues in the policy
whether there are any existing or alternative solutions
whether those solutions have problems that need to be addressed
Sometimes the purpose may need to be constrained to mitigate the issues in the policy.
Let’s look at a couple of examples.
For some tasks like speech recognition[b] and machine translation[c][d], tools with large language models and transformers are the state of the art (other than humans).
This means those tools may be probabilistic tools, and strictly speaking, they may be generative AI tools, because the models they use are generative models.
Generative AI does not necessarily mean “AI that generates bullshit in inscrutable ways”.
Speech recognition can be used in a variety of ways.
If plumbed into ChatGPT, it will have all of the same problems as ChatGPT.
If used for automatic captions, it can make videos and calls accessible to people that can’t hear well (myself included), but it can also infantilise us by censoring profanities and make serious errors that cause real harm.
If deployed for that purpose by an online video platform, it can undermine the labour of human transcribers and lower the overall quality of captions.
If used as an input method, it would be a clear win for accessibility.
My understanding of speech input tools is that they have a clear (if configurable) mapping from the things you say to the text they generate or the edits they make, so they may be a good fit.
In that case, maintainer burden and correctness and security would not be an issue, because the author is in complete control of what they write.
Copyright issues seem less of a concern to me, since these tools operate on such a small scale (words and symbols) that they are unlikely to reproduce a copyrightable amount of text verbatim, but I am not a lawyer.
As for ethical issues, these tools are generally trained once then run on the author’s device.
When used as an input method, they are not being used to undermine labour or justify layoffs.
I’m not sure about the process of training their models.
Machine translation can also be used in a variety of ways.
If deployed by a language learning app, it can ruin the quality of your core product, but hey, then you can lay off those pesky human translators.
If used to localise your product, your users will finally be able to compress to postcode file.
If used to localise your docs, it can make your docs worse than useless unless you take very careful precautions.
What if we allowed contributors to use machine translation to communicate with each other, but not in code commits, documentation, or any other work products?
Deployed carelessly, they will waste the reader’s time, and undermine the labour of actual human translators who would otherwise be happy to contribute to Servo.
If constrained to collaboration, it would still be far from perfect, but it may be worthwhile.
Maintainer burden should be mitigated, because this won’t change the amount or kind of text that needs to be reviewed.
Correctness and security too, because this won’t change the text that can be committed to Servo.
I can’t comment on the copyright issues, because I am not a lawyer.
The ethical issues may be significantly reduced, because this use case wasn’t a market for human translators in the first place.
Your feedback
I appreciate the feedback you gave on the Fediverse, on Bluesky, and on Reddit.
I also appreciate the comments on GitHub from several people who were more on the favouring side of the proposal, even though we reached different conclusions in most cases.
One comment argued that it’s possible to use AI autocomplete safely by accepting the completions one word at a time.
That said, the overall consensus in our community was overwhelmingly clear, including among many of those who were in favour of the proposals.
None of the benefits of generative AI tools are worth the cost in community goodwill [e].
Much of the dissent on GitHub was already covered by our existing policy, but there were quite a few arguments worth highlighting.
Machine translation
is generally not useful or effective for technical writing [h][i][j].
It can be, if some precautions are taken [k].
It may be less ethically encumbered than generative AI tools [l].
Client-side machine translation is ok [m].
Machine translation for collaboration is ok [n][o].
The proposals.
Proposal 1 is ill-defined [p].
Proposal 2 has an ill-defined distinction between autocompletes and “full” code generation [q][r][s].
Documentation
is just as technical as code [u].
Wrong documentation is worse than no documentation [v][w][x].
Good documentation requires human context [y][z].
GitHub Copilot
is not a good tool for answering questions [ab].
It isn’t even that good of a programming tool [ac].
Using it may be incompatible with the DCO [ad].
Using it could make us depend on Microsoft to protect us against legal liability [ae].
Correctness.
Generative AI code is wrong at an alarming rate [af].
Generative AI tools will lie to us with complete confidence [ag].
Generative AI tools (and users of those tools) cannot explain their reasoning [ah][ai].
Humans as supervisors are ill-equipped to deal with the subtle errors that generative AI tools make [aj][ak][al][am].
Even experts can easily be misled by these tools [an].
Typing is not the hard part of programming [ao], as even some of those in favour have said:
If I could offload that part of the work to copilot, I would be left with more energy for the challenging part.
Project health.
Partially lifting the ban will create uncertainty that increases maintainer burden for all contributions [ap][aq].
Becoming dependent on tools with non-free models is risky [ar].
Generative AI tools may not be fair use [as] → [at].
Outside of Servo, people have spent so much time cleaning up after LLM-generated mess [au].
Material.
Servo contributor refuses to spend time cleaning up after LLM-generated mess [av].
Others will stop donating [aw][ax][ay][az][ba][bb][bc][bd][be][bf][bg], will stop contributing [bh], will not start donating [bi], will not start contributing [bj][bk], or will not start promoting [bl] the project.
Broader context.
Allowing AI contributions is a bad signal for the project’s relationship with the broader AI movement [bm][bn][bo].
The modern AI movement is backed by overwhelming capital interests, and must be opposed equally strongly [bp].
People often “need” GitHub or Firefox, but no one “needs” Servo, so we can and should be held to a higher standard [bq].
Rejection of AI is only credible if the project rejects AI contributions [br].
We can attract funding from AI-adjacent parties without getting into AI ourselves [bs], though that may be easier said than done [bt].
On governance
Several people have raised concerns about how Servo’s governance could have led to this decision, and some have even suspected foul play.
But like most discussions in the TSC, most of the discussion around AI contributions happened async on Zulip, and we didn’t save anything special for the synchronous monthly public calls.
As a result, whenever the discussion overflowed the sync meeting, we just continued it internally, so the public minutes were missing the vast majority of the discussion (and the decisions).
These decisions should probably have happened in public.
Our decisions followed the TSC’s usual process, with a strong preference for resolving disagreements by consensus rather than by voting, but we didn’t have any consistent structure for moving from one to the other.
This may have made the decision process prone to being blocked and dominated by the most persistent participants.
Contrast this with decision making within Igalia, where we also prefer consensus before voting, but the consensus process is always used to inform proposals that are drafted by more than one person and then always voted on.
Most polls are “yes” or “no” by majority, and only a few polls for the most critical matters allow vetoing.
This ensures that proposals have meaningful support before being considered, and if only one person is strongly against something, they are heard but they generally can’t single-handedly block the decision with debate.
The rules are actually more complex than just by majority.
There’s clear advice on what “yes”, “no”, and “abstain” actually mean, they take into account abstaining and undecided voters, there are set time limits and times to contact undecided voters, and they provide for a way to abort a poll if the wording of the proposal is ill-formed.
We had twenty years to figure out all those details, and one of the improvements above only landed a couple of months ago.
We also didn’t have any consistent structure for community consultation, so it wasn’t clear how or when we should seek feedback.
A public RFC process may have helped with this, and would also help us collaborate on and document other decisions.
More personally, I did not participate in the extensive discussion in January and February that helped move consensus in the TSC towards allowing the non-code and Copilot exceptions until fairly late.
Some of that was because I was on leave, including for the vote on the initial Copilot “experiments”, but most of it was that I didn’t have the bandwidth.
Doing politics is hard, exhausting work, and there’s only so much of it you can do, even when you’re not wearing three other hats.
I’m a little (okay, a lot) late to it, but meyerweb is now participating in CSS Naked Day — I’ve removed the site’s styles, except in cases where pages have embedded CSS, which I’m not going to do a find-and-replace to try to suppress. So if I embedded a one-off CSS Grid layout, like on the Toolbox page, that will still be in force. Also, cached files with CSS links could take a little time to clear out. Otherwise, you should get 1990-style HTML. Enjoy!
(The site’s design will return tomorrow, or whenever I remember [or am prodded] to restore it.)
At Igalia, we work on the development of Internet browsers, such as Chrome and Safari. In fact, we work on the technologies behind these browsers that allow websites to look good and function correctly, such as HTML, CSS and JavaScript—the building blocks of all Internet applications.
In everything we do, we try to place a strong emphasis on social responsibility. This means that our focus goes beyond profit, prioritizing actions that generate a positive impact on society. In addition, Igalia is built on values of equality and transparency, deeply rooted in our organizational structure. These commitments to values and social responsibility shape the fundamental principles that guide our work.
Update on what happened in WebKit in the week from March 31 to April 7.
Cross-Port 🐱
Graphics 🖼️
By default we divide layers that need to be painted into 512x512 tiles, and
only paint the tiles that have changed. We record each layer/tile combination
into a SkPicture and replay the painting commands in worker threads, either
on the CPU or the GPU. A change was
landed to improve the algorithm,
by recording the changed area of each layer into a single SkPicture, and replaying
the same picture for each tile, clipped to the tile dimensions and
position.
WPE WebKit 📟
WPE Platform API 🧩
New, modern platform API that supersedes usage of libwpe and WPE backends.
A WPE Platform-based implementation of Media Queries' Interaction Media
Features, supporting
pointer and hover-related queries, has
landed in WPE WebKit.
When using the Wayland backend, this change exposes the current state of
pointing devices (mouse and touchscreen), dynamically reacting to changes such
as plugging or unplugging. When the new WPEPlatform API is not used, the
previous behaviour, defined at build time, is still used.
Adaptation of WPE WebKit targeting the Android operating system.
A number of fixes have been merged to improve building WPE WebKit for Android. This is part of an ongoing effort to make it possible to build WPE-Android using upstream WebKit without needing additional patches.
The example MiniBrowser included with WPE-Android has been fixed to handle edge-to-edge layouts on Android 15.
At Igalia, we work on developing Internet browsers such as Chrome and Safari. In fact, we work with the technologies behind these browsers that allow websites to look good and function correctly, such as HTML, CSS, and JavaScript—the building blocks of all Internet applications.
In everything we do, we try to place a strong emphasis on social responsibility. This means that our focus goes beyond profit, prioritizing actions that generate a positive impact on society. Igalia is built on values of equality and transparency, which are deeply embedded in our organizational structure. These commitments to values and social responsibility shape the fundamental principles that guide our work.
Doing a 32-bit build on ARM64 hardware now works with GCC 11.x as well.
Graphics 🖼️
Landed a change that improves the painting of tile fragments in the compositor when damage propagation is enabled and the tile sizes are bigger than 256x256. In those cases, less GPU time is used when the damage allows it.
The GTK and WPE ports no longer use DisplayList to serialize the painting commands and replay them in worker threads, but SkPictureRecorder/SkPicture from Skia. Some parts of the WebCore painting system, especially font rendering, are not thread-safe yet, and our cross-thread use of DisplayList made it harder to improve the current architecture without breaking the GTK and WPE ports. This motivated the search for an alternative implementation.
Community & Events 🤝
Sysprof is now able to filter samples by marks. This allows for statistically relevant data on what's running when a specific mark is ongoing, and as a consequence, allows for better data analysis. You can read more here.
The other day was W3C Breakouts Day: Effectively, an opportunity for people to create sessions to talk about or work on anything related to the Web. We do this twice a year. Once entirely virtually (last week) and then again during hybrid meetings in association with W3C's biggest event of the year, called "TPAC".
This year there were 4 blocks of sessions with 4-5 sessions competing concurrently for the same time slot. There were 2 in the morning that I'd chosen, but then realized that I had calendar conflicts with other meetings and wound up missing them.
In the afternoon there were 3 sessions proposed by Igalians - unfortunately, two of them simultaneously:
There were actually several in that timeslot that I wanted to attend but the choice was pretty straightforward: I'd attend mine and Eric's. Unfortunately I did a bad job chairing and forgot to assign a scribe or record a presentation.
Eric's session told the story of SVG and how, despite getting a lot of love from developers, designers and other working group participants, it just didn't get a lot of love or investment from the browser vendors themselves. Amelia Bellamy-Royds (a former editor/chair) told us about how things ultimately stalled out around 2020 with so much left undone, burnout, disappointments, and a W3C working group that was almost closed up with a note last year before being sort of "rescued" at the last minute with a very targeted rechartering. However, that rechartering still hasn't resulted in much - no new meetings, etc. We talked about how Igalia has been investing, along with companies like Wix, in trying to help move things forward. Then the question was: How can we move it forward, together? You can read the minutes for more, but one idea was a collective around SVG.
I mention this because, while they might seem at first like pretty different topics, Eric's session and my own are really just two things raising the same overarching question: Why have we built the infrastructure of the whole web ecosystem to be fundamentally dependent on the funding and prioritization of a few private organizations? And how can we fix that?
My session presented some things we've done, or tried so far.
Things we're trying...
First, just moving from proprietary projects to open source has helped a bit. Instead of one organization being 100% responsible, the stewards themselves today contribute "only" ~80 to 95% of the commits.
My company Igalia has been among the top few committers to all of the web engine projects every year for at least half a decade, and we have many ways that we try to diversify work upstream. What's interesting about attempts to diversify funding is that there is a kind of spectrum here relating to both the scope of what we're tackling and how well understood the approach is...
one sponsor / single issue <------------------------> collective / whole project
On the left is a single sponsor with very specific/measurable tasks. This could be a list of bugs, or implementation of a pretty stable feature in a single engine. You can contract Igalia for stuff like this - The :has() implementation and proof-of-concept that moved the needle in Chromium, for example, was sponsored by EyeO. CSS Grid in Blink and WebKit was sponsored by Bloomberg Tech. We know how to do this. We've got lots of practice.
Moving slightly more 'right', we have things like our Open Prioritization experiment, which tried to decide how to prioritize and share a funding burden. We know less about how to do this; we learned things from the experiment, but we've not repeated it because it involved a lot of "extra" work.
Then there's our efforts with MathML, funded by grants, then partnerships, then a collective aimed at generally advancing and maintaining MathML support in browsers. That's now involving more decision making - there's not a tight list of tasks or even a specific browser. You need to take in money to find out how much budget you have, and then try to figure out how to use it. In this case, there's a steering committee made up of the biggest funders. Every 6 months or so there is a report of the work that was done submitted for payment. Is that a great model?
We also have Wolvic, which attempted to have partnerships and a collective, with a model where if you put in a kind of minimum amount every 6 months then you would get invited to a town hall to discuss how we'd prioritize the spending of available funds. Is that a great model? The thing I like about it is that it binds things to some kind of reality. It's great if someone gave $500, for example, but if 10 other orgs also did so and they all want different things, none of which we can do with the available funds... At least they can all help decide what to do. In that model, the input is still only informative - it's ultimately Igalia who decides.
Or, you might have seen a new thing hosted at the Linux Foundation for funding work in Chromium. That's more formal, but kind of similar in challenges. Unfortunately there was no one attending our sessions who could talk about it - I know a bit more, but I'm not going to break news on my blog :). We'll have a podcast about it when we can.
Or, a little further right there's Open Web Docs (OWD), which I'm pleased does seem to be succeeding at sponsoring some real work and doing real prioritization. Florian from OWD was there and was able to give us some real thoughts. They too are now drawing from grants like the Sovereign Tech Fund. As part of this they produce an Impact Report.
Or, maybe all the way to the right you have a general browser fund like Servo. There it is the Technical Steering Committee that decides what to do with the funds.
A lot of this thinking is currently still pretty esoteric because, for example, through the collective, Servo isn't taking in enough money for even a single full-time employee. I also mentioned our work with Interop, and the fact that even with that level of effort to prioritize, it is very difficult not to wind up taking resources away from actual work in order to figure out which work to potentially do! It's hard not to add admin costs that eat away at the overall progress. But that's exactly why we should be talking about these things now. How can we make this work?
The Actual Discussion
As I said, we didn't have a scribe, but there was some discussion I can attempt to summarize from the queue I can still see in IRC.
Q. Do you think the W3C has a role to play?
That's why I brought it here. My own desire would be yes. There are some venues where working groups and interest groups are encouraged to pool funds and so on - but the W3C isn't currently one of them and I'm unsure how others feel. It's a thing I have been trying to pitch for a few years now that we could figure out.
The Wolvic model that I described above was born from thoughts around this. What if W3C groups had the ability to put money into a common fund and then, at the end of the year, they could decide how to deploy it? Maybe sometimes that would be for tools. Maybe sometimes that would be for fixing bugs. Maybe sometimes that would be for new implementation, or last implementation, or supporting spec authorship. But it would always be grounded in something real about collaboration that puts some degree of power beyond the implementers. Maybe that would mean that print CSS would inch along at a snail's pace - but at least that's better than not at all, and it shows a way. At least it wouldn't seem like the group isn't even picking up the phone.
Q. Are engines like Servo and Ladybird filing implementation reports? WPT?
Basically... Not exactly. The main thrust is that both engines do make themselves available on Web Platform Tests and closely track their scores. That said, the models differ quite a bit in their approach and Ladybird, for example, generally is more focused from the outside in. It's possible they only start with 25% of the standard required to load a certain page. So, at what point would they submit an impact report? Both projects do currently open issues when they find implementation related questions or problems, and both have the ability to expand the test suites in response to those issues. So, effectively: Nobody has time for that level of formality, but they're following the spirit of it.
Q. Doesn't it help to have a thing you can measure if you approach someone to ask for money?
Yes, that is the whole point of my spectrum above. The further right you move, the more of a question this is. On our podcast, for example, we've asked people "Why don't you donate to Mozilla?" and the main answer given is always "I don't know what they'll spend it on, but probably not things I would agree with". The more people putting in, the more there is to balance or require trust. Currently a lot of these work for grants or more specific/finite problems which do reports of success - like the one I provided above for MathML or the implementation report in OWD.
But does it mean there shouldn't be more general funds? Currently the general funds come from default search deals and wind up with a single org to decide - is that what we want?
So, that's all that was discussed. All in all, it was pretty good - but I'm slightly disappointed there wasn't a little more discussion of how we could collectively govern or prioritize, or even whether we all agree that that's something we want. I'd love to hear any of your thoughts!
Linux 6.14 is the second release of 2025, and as usual Igalia took part in it. It's a very normal release, except that it was released on a Monday, instead of the usual Sunday release that has been going on for years now. The reason behind this? Well, quoting Linus himself:
I’d like to say that some important last-minute thing came up and
delayed things.
But no. It’s just pure incompetence.
But we did not forget about it, so here’s our Linux 6.14 blog post!
Part of the development cycle for this release happened during late December, when a lot of maintainers and developers were taking their well-deserved breaks. As a result, this release contains fewer changes than usual; LWN described it as having the "lowest level of merge-window activity seen in years". Nevertheless, some cool features made it into this release:
NT synchronization primitives: Elizabeth Figura, from CodeWeavers, is known for her work on improving Wine's sync functions, like mutexes and semaphores. She was one of the main collaborators behind the futex_waitv() work and has now developed a virtual driver that is more compliant with the precise semantics that the NT kernel exposes. This allows Wine to behave closer to Windows without the need to create new syscalls, since this driver uses ioctl() as the front-end uAPI.
RWF_UNCACHED: Linux has two ways of dealing with storage I/O: buffered I/O (usually the preferred one), which stores data in a temporary buffer and regularly syncs the cached data with the device; and direct I/O, which doesn't use a cache and always writes/reads synchronously with the storage device. Now a new mixed approach is available: uncached buffered I/O. This method aims to provide a fast way to write or read data that will not be needed again in the short term. For reading, the device writes data into the buffer and, as soon as the user has finished reading it, it's cleared from the cache. For writing, as soon as userspace fills the cache, the data is written out to the device and removed from the cache. This way we still have the advantage of using a fast cache while reducing cache pressure (see the sketch after this list).
amdgpu panic support: AMD developers added kernel panic support to the amdgpu driver, "which displays a pretty user friendly message on the screen when a Linux kernel panic occurs" instead of just a black screen or a partial dmesg log.
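Going back to the uncached buffered I/O item, from userspace it is exposed as a per-call flag to preadv2()/pwritev2(), roughly as sketched below. The flag name follows the article; check your system's kernel headers for the exact spelling, as it may have changed before the final merge.

```cpp
// Hedged sketch: uncached buffered I/O from userspace via preadv2(). The flag
// name follows the article above and may differ in the released headers.
#define _GNU_SOURCE
#include <sys/uio.h>

ssize_t readOnceAndDropFromCache(int fd, void* buffer, size_t length, off_t offset)
{
    struct iovec iov = { buffer, length };
    // Data goes through the page cache as with normal buffered I/O, but it is
    // dropped as soon as the read completes, keeping cache pressure low.
    return preadv2(fd, &iov, 1, offset, RWF_UNCACHED);
}
```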
As usual, Kernel Newbies provides a very good summary; you should check it for more details: Linux 6.14 changelog. Now let's see what contributions by Igalia were merged for this release!
DRM
For the DRM common infrastructure, we helped land a standardized way of reporting DRM client memory usage. Additionally, we contributed improvements and bug fixes to drivers from AMD, Intel, Broadcom, and Vivante.
AMDGPU
For the AMD driver, we fixed bugs experienced by users of the COSMIC desktop environment on several AMD hardware versions. One was uncovered with the introduction of overlay cursor mode, where a definition mismatch across the display driver caused a page fault when using multiple overlay planes. Another bug was related to a division by zero on plane scaling. We also fixed regressions on VRR and MST caused by the series of changes that migrated the AMD display driver from open-coded EDID handling to the drm_edid struct.
Intel
For the Intel drivers, we fixed a bug in the xe GPU driver which prevented certain types of workarounds from being applied, helped with the maintainership of the i915 driver, handled external code contributions, maintained the development branch and sent several pull requests.
Also in the V3D driver, the active performance monitor is now properly stopped before being destroyed, addressing a potential use-after-free issue. Additionally, support for a global performance monitor has been added via a new DRM_IOCTL_V3D_PERFMON_SET_GLOBAL ioctl. This allows all jobs to share a single, globally configured perfmon, enabling more consistent performance tracking and paving the way for integration with user-space tools such as perfetto.
A small video demo of perfetto integration with V3D
etnaviv
On the etnaviv side, fdinfo support has been implemented to expose memory usage statistics per file descriptor, enhancing observability and debugging capabilities for memory-related behavior.
sched_ext
Many BPF schedulers (e.g., scx_lavd) frequently call bpf_ktime_get_ns() for tracking tasks' runtime properties. bpf_ktime_get_ns() eventually reads a hardware timestamp counter (TSC). However, reading a hardware TSC is not performant on some hardware platforms, degrading instructions per cycle (IPC).
We addressed the performance problem of reading the hardware TSC by leveraging the rq clock in the scheduler core, introducing a scx_bpf_now() function for BPF schedulers. Whenever the rq clock is fresh and valid, scx_bpf_now() provides the rq clock, which is already updated by the scheduler core, so it can avoid reading the hardware TSC. Using scx_bpf_now() reduces the number of hardware TSC reads by 50-80% (e.g., 76% for scx_lavd).
Assorted kernel fixes
Continuing our efforts on cleaning up kernel bugs, we provided a few fixes that address issues reported by syzbot with the goal of increasing stability and security, leveraging the fuzzing capabilities of syzkaller to bring to the surface certain bugs that are hard to notice otherwise. We’re addressing bug reports from different kernel areas, including drivers and core subsystems such as the memory manager. As part of this effort, several fixes were done for the probe path of the rtlwifi driver.
Check the complete list of Igalia’s contributions for the 6.14 release
The February TC39 meeting in Seattle wrapped up with significant updates and advancements in ECMAScript, setting an exciting trajectory for the language's evolution. Here are the key highlights, proposal advancements, and lively discussions from the latest plenary.
The following proposals advanced to stage 4 early in the meeting, officially becoming a part of ECMAScript 2025. Congratulations to the people who shepherded them through the standardization process!
Float16Array: a typed array that uses 16-bit floating point values, mostly for interfacing with other systems that need 16-bit float arrays.
Champions: Leo Balter, Kevin Gibbons
RegExp.escape(): Sanitizes a string so that it can be used as a string literal pattern for the RegExp constructor.
Champions: Kevin Gibbons, Jordan Harband
Redeclarable global eval vars simplifies the mental model of global properties. It's no longer an error to redeclare a var or function global property with a let or const of the same name.
Now with full test262 coverage, the import defer proposal advanced to stage 3, without changes since its previous presentation. This is the signal for implementors to go ahead and implement it. This means that the proposal is likely to appear soon in browsers!
To clamp a number x to an interval [a, b] means to produce a value no smaller than a and no greater than b (returning x if x is in the interval). Oliver Medhurst presented a neat little proposal to add this feature to JS's Math standard library object. And Oliver was able to convince the committee to advance the discussion to stage 1.
Instances of Error and its subclasses have a stack property that returns a string representing the stack trace. However, this property is not specified, and previous attempts to define it in the spec did not get far because different JS engines have different string representations for the stack trace, and implementations can't converge on one behavior because there's code in the wild that does browser detection to know how to parse the format.
In December it was decided that specifying the presence of a stack property should be split off of the error stack proposal. This new error stack accessor proposal was first presented in this plenary, where it reached stage 2. The proposal achieves some amount of browser alignment on some details (e.g. is stack an own property? is it a getter/setter pair?), while also providing a specified base on which other proposals and web specs can build, but it leaves the stack trace string implementation-defined.
New TC39 contributor ZiJian Liu offered a suggestion for tackling a problem routinely faced by JS programmers who work closely with numbers:
"Am I sure that this numeric string S, if interpreted as a JS number, is going to be exactly preserved?"
The proposal is a new method on Numbers, isSafeNumeric, that would allow us to check this in advance. Essentially, ZiJian is trying to delimit a safe space for JS numbers. The discussion was heated, with many in the committee not sure what it actually means for a numeric string to be "preserved", and whether it can even be solved at all. Others thought that, although there may be no solution, it's worth advancing the proposal to stage 1 to begin to explore the space. Ultimately, the proposal did not advance to stage 1, but that doesn't mean it's the end—this topic may well come back in a sharper, more clearly defined form later on.
Temporal, the upcoming proposal for better date and time support in JS, has been seeing a surge of interest in the last few weeks because of a complete implementation being available in Firefox Nightly. Folks seem to be looking forward to using it in their codebases!
Our colleague Philip Chimento presented a status update. Firefox is at ~100% conformance with just a handful of open questions, and the Ladybird browser is the next closest to shipping a full implementation, at 97% conformance with the test suite.
The committee also reached consensus on making a minor change to the proposal which relaxed the requirements on JS engines when calculating lunar years far in the future or past.
Philip also presented a status update on the ShadowRealm proposal. ShadowRealm is a mechanism that lets you execute JavaScript code synchronously, in a fresh, isolated environment. This has a bunch of useful applications such as running user-supplied plugins without letting them destabilize your app, or prevention of supply chain attacks in your dependencies.
We think we have resolved all of the open questions on the TC39 side, but what remains is to gauge the interest in implementing the web integration parts. We had a lively discussion on what kinds of use cases we'd like to see in TC39 versus what kinds of use cases the web platform world would like to see.
Champions: Dave Herman, Caridy Patiño, Mark S. Miller, Leo Balter, Rick Waldron, Chengzhong Wu
Implementations of the decorators proposal are in progress: the Microsoft Edge team has an almost complete implementation on top of Chromium's V8 engine, and Firefox's implementation is in progress.
Although there are two implementations in progress, it has been stated that none of the three major browsers want to be the first one to ship among them, leaving the future of the proposal uncertain.
There was some progress on proposals related to ArrayBuffers. One topic was about the Immutable ArrayBuffer proposal, which allows creating ArrayBuffers in JS from read-only data, and in some cases allows zero-copy optimizations. The proposal advanced to stage 2.7.
Champions: Mark S. Miller, Peter Hoddie, Richard Gibson, Jack Works
In light of that, the committee considered whether or not it made sense to withdraw the Limited ArrayBuffer proposal (read-only views of mutable buffers). It was not withdrawn and remains at stage 1.
The well-known symbols Symbol.match, Symbol.matchAll, Symbol.replace, Symbol.search and Symbol.split allow an arbitrary object to be passed as the argument to the corresponding string methods and behave like a custom RegExp. However, these methods don't check that the argument is an object, so you could make "foo bar".split(" ") have arbitrary behavior by setting String.prototype[Symbol.split].
This is an issue especially for Node.js and Deno, since a lot of their internal code is written in JavaScript. They use primordials to guard their internal code from userland monkeypatching, but guarding against overriding the matching behavior of strings could lead to performance issues.
The proposed solution was to have the relevant string methods only look for these well-known symbols if the argument is an object, rather than doing so for all primitives other than null or undefined. This is technically a breaking change, but it's not expected to lead to web compatibility issues in the wild because of how niche these symbols are. Given that, the committee reached consensus on making this change.
The Records and Tuples proposal has been stuck at stage 2 for a long time, due to significant concerns around unrealistic performance expectations. The committee again discussed the proposal, and how to rework it to introduce some of its capabilities into the language without running into the same performance risks.
You can read more details on GitHub, but the summary is that Records and Tuples might become:
objects, rather than primitives
shallowly immutable, rather than enforcing deep immutability
use an equals() method rather than relying on === for recursive comparison
get special handling in Map/Set, easily allowing multi-value keys.
In discussions surrounding the topic of decimal, it has become increasingly clear that it overlaps with the measure proposal (possibly to be rechristened as amount -- watch this space) to such an extent that it might make sense to consider both proposals in a unified way. That may or may not mean that the proposals get literally merged together (although that could be a path forward). Ultimately, the committee wasn't in favor of merging the proposals, though there were concerns that, if they were kept separate, one proposal might advance without the other advancing. As usual with all discussions of decimal, the discussion overflowed into the next day, with Shane Carr of Google presenting his own sketch of how the unity of decimal and measure might happen.
Champions (Decimal): Jesse Alama, Jirka Maršík, Andrew Paprocki
Since at least 2015, V8 has exposed a non-standard API called Error.captureStackTrace() that installs an Error-like stack property on any arbitrary object. It also allows passing a function or constructor as its second argument, to skip any stack frames at and above the topmost call to that function, which can be used to hide implementation details that won't be useful to the user.
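As a rough sketch of how this V8-style API is typically used (the makeError helper here is just an illustration, not part of any spec):
function makeError(message) {
  const err = { message };
  // Gives `err` a stack property; frames at and above the call to makeError are omitted.
  Error.captureStackTrace(err, makeError);
  return err;
}
console.log(makeError("oops").stack);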
Although this API was V8-internal for so long, in 2023 JSC shipped an implementation of this API, and now SpiderMonkey is working on one. And since the V8 and JSC implementations have some differences, this is now being brought up as a TC39 proposal to settle on some exact behavior, which is now stage 1.
This proposal is only about Error.captureStackTrace(), and it does not attempt to specify V8's Error.prepareStackTrace() API, which allows customizing the stack trace string representation. This API is still V8-only, and there don't seem to be any plans to implement it elsewhere.
Champion: Matthew Gaudet
The "fixed" and "stable" object integrity traits
The Stabilize proposal is exploring adding a new integrity trait for objects, similar to Object.preventExtensions, Object.seal and Object.freeze.
A fixed object is an object whose properties can be safely introspected without triggering side effects, except when a getter is triggered through property access. In addition to that, it's also free from what we call the "override mistake": a non-writable property doesn't prevent the same property from being set on objects that inherit from it.
const inheritFixed = { __proto__: fixedObject };
inheritFixed.x = 3; // sets `inheritFixed.x` to `3`, while leaving `fixedObject.x` as `1`
An object that is both fixed and frozen is called stable.
The proposal was originally also exploring one more integrity trait, to prevent code from defining new private fields on an existing object through the return override trick. This has been removed from this proposal, and instead we are exploring changing the behavior of Object.preventExtensions to also cover this case.
Champions: Mark S. Miller, Chip Morningstar, Richard Gibson, Mathieu Hofman
Once upon a time, async functions did not exist, and neither did promises. For a long time, the only way to do asynchronous work in JS was with callbacks, resulting in "callback hell". Promises first appeared in userland libraries, which eventually converged into one single interoperable API shape called Promises/A+.
When JS added the Promise built-in in ES6, it followed Promises/A+, which included a way to interoperate with other promise libraries. You might know that resolving a promise p1 with a value which is a different promise p2 will not immediately fulfill p1, but it will wait until p2 is resolved and have the same fulfilled value or rejection reason as p2. For compatibility with promise libraries, this doesn't only work for built-in promises, but for any object that has a .then method (called "thenables").
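For example, any object with a then method is treated as a promise-like value; a minimal illustration:
const thenable = {
  then(resolve, reject) {
    resolve(42);
  },
};
// The built-in promise "adopts" the thenable instead of fulfilling with the object itself.
Promise.resolve(thenable).then((value) => console.log(value)); // 42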
In the time since Promise was added as a built-in, however, it has become clear that thenables are a problem, because it's easy for folks working on the JS engines to forget they exist, resulting in JS code execution happening in unexpected places inside the engine. In fact, even objects fully created within the engine can end up being thenables, since you can set Object.prototype.then to a function. This has led to a number of security vulnerabilities in JS engines, including one last year involving async generators that needed fixes in the JS specification.
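A tiny illustration of why this is scary (do not do this in real code):
// Suddenly every plain object is a thenable, including objects the engine resolves internally.
Object.prototype.then = function (resolve) {
  resolve("hijacked");
};
Promise.resolve({}).then(console.log); // "hijacked" instead of the empty object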
It is not feasible to completely get rid of thenables because pre-ES6 promise libraries are still being used to some extent. However, this proposal is about looking for ways to change their behavior so that these bugs can be avoided. It just reached stage 1, meaning the possible solutions are still being explored, but some proposed ideas were:
Make it impossible to set Object.prototype.then.
Ignore thenables for some internal operations.
Change the definition of thenable so that having a then method on Object.prototype and other fundamental built-in objects and prototypes doesn't count, while it would count for regular user-created objects.
During the discussion in plenary, it was mentioned that userland JS also runs into issues with thenables when the call to .then leads to reentrancy (that is, if it calls back into the code that called it). If all engine vulnerabilities caused by thenables are related to reentrancy, then both issues could be solved at once. But it does not seem like that is the case, and solving the reentrancy issue might be harder.
Eemeli Aro from Mozilla presented an update on stable formatting. This proposal aims to add a "stable" locale to address some of the weaknesses of our current model of localization. Most importantly, people currently need an entirely separate code path for unlocalized or machine-readable use cases, which adds complexity to the interface and distracts users from the patterns they ought to be using for their interfaces. This is relevant for use cases such as testing (especially snapshot testing).
Temporal already made some strides in this direction by keeping the API surface consistent while allowing users to specify the ISO8601 calendar or the UTC timezone instead of relying on localizable alternatives. This proposal would add a "null" locale either in the form of the literal null value in JavaScript or using the "zxx" pattern commonly used in the Internationalization world, in order to provide a stable formatting output so users could write their interface once and just use this specific locale to achieve their desired result.
In the meeting, Eemeli presented their proposal for various formats that should be a part of this stable locale and the committee expressed a preference for the "zxx" locale instead of null with some concerns regarding null being too similar to undefined.
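Nothing is implemented yet, but if "zxx" ends up being the chosen spelling, usage might look roughly like this sketch (the stable output described in the comment is the proposal's goal, not current engine behavior):
// Hypothetical: a "stable" locale that always produces the same machine-readable output.
const df = new Intl.DateTimeFormat("zxx", { dateStyle: "short", timeZone: "UTC" });
console.log(df.format(new Date(0))); // ideally an ISO 8601-like string, identical everywhere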
The Intl Locale Info API proposal, which is very close to done, was brought back to the committee perhaps for the last time before Stage 4. The notable change was to remove minimal days from the API due to the lack of strong use cases for it. Finally, there were discussions about the final remaining open questions, especially those that would block implementations. These are planned to be fixed shortly before the proposal goes to Stage 4.
On Thursday evening after the meeting adjourned, the committee members traveled 2 blocks down the road to a SeattleJS meetup kindly hosted by DocuSign at their HQ. A number of committee members gave presentations on TC39-related topics. Two of these were by our colleagues Nicolò Ribaudo, who gave an introduction to the deferred imports proposal, and Philip Chimento, who gave a tour of the Temporal API.
After the plenary ended on Thursday, the discussion continued on Friday with a session of the TG5 part of TC39, which is dedicated to research aspects of JavaScript.
Our colleague Jesse Alama presented on Formalizing JS decimal numbers with the Lean proof assistant. There were a number of other presentations: a report on user studies of MessageFormat 2.0, studies on TC39 proposals, a parser generator template literal tag generating template literal tags, and "uncanny valleys" in language design.
I’ve blogged in the past about how WebKit on Linux integrates with Sysprof, and provides a number of marks on various metrics. At the time that was a pretty big leap in WebKit development since it gave us a number of new insights, and enabled various performance optimizations to land.
But over time we started to notice some limitations in Sysprof. We now have tons of data being collected (yay!), but some types of data analysis were still pretty difficult. In particular, it was difficult to answer questions like “why did render times increase after 3 seconds?” or “what is the CPU doing during layout?”
In order to answer these questions, I’ve introduced a new feature in Sysprof: filtering by marks.
Select a mark to filter by in the Marks view
Samples will be filtered by that mark
Hopefully people can use this new feature to provide developers with more insightful profiling data! For example if you spot a slowdown in GNOME Shell, you open Sysprof, profile your whole system, and filter by the relevant Mutter marks to demonstrate what’s happening there.
Here’s a fancier video (with music) demonstrating the new feature:
The browser find-in-page feature is intended to allow users to identify where a search term appears in the page. Browsers mark the locations of the string using highlights, typically one color for the active match and another color for the other matches. Both Chrome and Firefox have this behavior, and problems arise when the default browser color offers poor contrast in the page, or can be easily confused with other highlighted content.
Safari works around this problem by applying an overlay and painting search
highlights on top. But what can be done in Chrome and Firefox?
The newly available ::search-text pseudo highlight provides styling for the
find-in-page highlight using CSS. Within a ::search-text rule you can use
properties for colors (text and background), text decorations such as underlines,
and shadows. A companion ::search-text:current selector matches the active
find-in-page result and can be styled separately.
As an example, let’s color find-in-page
results green and add a red underline for the active match:
<style>
  :root::search-text {
    color: green;
  }
  :root::search-text:current {
    color: green;
    text-decoration: red solid underline;
  }
</style>
<p>Find find in this example of find-in-page result styling.</p>
In general, the find-in-page markers should be consistent across the entire page, so we recommend that you define the ::search-text properties on the root. All other elements will inherit the pseudo-element styles through the highlight inheritance mechanism.
Note that if you do not specify ::search-text:current,
the active match will use the same styling as inactive matches. In practice, it is best
to always provide styles for ::search-text:current when defining ::search-text, and
vice versa, with sufficient difference in visual appearance to make it clear to users
which is the active match.
This feature is available in Chrome 136.0.7090.0 and later, when “Experimental Web Platform features”
is enabled with chrome://flags. It will likely be available to all users in Chrome 138.
Modifying the appearance of user agent features, such as find-in-page highlighting,
has significant potential to hurt users through poor contrast, small details, or other
accessibility problems. Always maintain good contrast with all the backgrounds present
on the page. Ensure that ::search-text styling is unambiguous. Include find-in-page
actions in your accessibility testing.
Find-in-page highlight styling is beneficial when the native markers pose accessibility
problems, such as poor contrast. It’s one of the motivations for providing this feature.
User find-in-page actions express personal information, so steps have been
taken to ensure that sites cannot observe the presence or absence of find-in-page
results. Computed style queries in JavaScript will report the presence of
::search-text styles regardless of whether or not they are currently rendered
for an element.
As part of my performance analysis work for LGE webOS, I often had to capture Chrome Traces from an embedded device. So, to make it convenient, I wrote a simple command line helper to obtain the traces remotely, named trace-chrome.
In this blog post I will explain why it is useful, and how to use it.
Chromium provides an infrastructure for capturing static tracing data, based on Perfetto. In this blog post I am not going to go through its architecture or implementation, but will focus on how we instruct a trace capture to start and stop, and how to then fetch the results.
Chrome/Chromium provides user interfaces for capturing and analyzing traces locally. This can be done by opening a tab and pointing it to the chrome://tracing URL.
The tracing capture UI is quite powerful, and completely implemented as a web application. This has a downside, though: running the capture UI introduces a significant overhead on several resources (CPU, memory, GPU, …).
This overhead may be even more significant when tracing Chromium or any other Chromium-based web runtime on an embedded device, where we have CPU, storage and memory constraints.
Chromium does a great job of minimizing the overhead, by postponing the trace processing as much as possible and providing a minimal UI while the capture is ongoing. But it may still be too much.
How to avoid this problem?
The capture UI should not run on the system we are tracing; we can run the UI on a different computer to capture the trace.
The same goes for storage: we want the trace to be written on a different computer.
The solution for both is tracing remotely: both the user interface for controlling the recording and the recording storage live on a different computer.
Target device: the one that runs the Chromium web runtime instance we are going to trace.
Host device: the device that will run the tracing UI, to configure, start and stop the recording, and to explore the tracing results.
OK, now we know we want to remotely trace the target device's Chromium instance. How can we do that? First, we need to connect our tracing tools running on the host to the Chromium instance on the target device.
This is done using the remote debugging port: a multi-purpose HTTP port provided by Chromium. This port is not only used for tracing; it also offers access to the Chrome Developer Tools.
The Chromium remote debugging port is disabled by default, but it can be enabled using the command line switch --remote-debugging-port=PORT in the target Chromium instance. This will open an HTTP port on the localhost interface that can be used to connect.
Why localhost? Because this interface does not provide any authentication or encryption, so it is unsafe. It is the user's responsibility to provide some security (e.g. by setting up an SSH tunnel between the host and the target device to connect to the remote debugging port).
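As a purely illustrative sketch (the user and host name are made up, and the target's debugging port is assumed to be 9999, as in the examples later on), such a tunnel could look like:
# Run on the host: forward local port 9999 to the target's remote debugging port,
# so the port never needs to be exposed on the network.
ssh -L 9999:localhost:9999 user@target-device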
The Chromium browser provides a solution for tracing remotely: just open the URL chrome://inspect on the host device. It provides this user interface:
First, the checkbox for Discover network targets needs to be set.
Then press the Configure… button to set the list of IP addresses and ports where we expect target remote debugging ports to be.
Do not forget to add to the list the endpoint that is accessible from the host device. For example, in the case of an SSH tunnel from the host device to the target device port, it needs to be the host side of the tunnel.
For the case where we set up the host side of the tunnel at port 10101, we will see this:
Then, just pressing the trace link will show the Chromium tracing UI, but connected to the target device's Chromium instance.
Over the last 8 years, I have been involved quite often in exploring the performance of Chromium on embedded devices, specifically for the LGE webOS web stack. In this problem space, Chromium's tracing capabilities are handy, providing a developer-oriented view of different metrics, including the time spent running known operations in specific threads.
At that time I did not know about chrome://inspect, so I really did not have an easy way to collect Chromium traces from a different machine. This is important, as one performance analysis principle is that collecting the information should be as lightweight as possible. Running the tracing UI in the same Chromium instance that is being analyzed goes against that principle.
The solution? I wrote a very simple NodeJS script that allows capturing a Chromium trace from the command line.
This is convenient for several reasons:
No need to launch the full tracing UI.
As we completely detach the UI from the capturing step, and there is no additional step to record the trace to a file, we are not affected by the instability of the tracing UI when handling the captured trace (not a problem usually, but it happens).
Easier to repeat tests for specific tracing categories, instead of manually enabling them in the tracing UI.
The script just provides an easy-to-use command line interface to the already existing chrome-remote-interface NodeJS module.
To connect to a running Chromium instance's remote debugging port, the --host and --port parameters need to be used. In the examples I am going to use the port 9999 and the host localhost.
Warning
Note that, in this case, the parameter --host refers to the network address of the remote debugging port access point. It is not referring to the host machine where we run the script.
Now, the most important step: recording a Chromium trace. To do this, we will provide a list of categories (parameter --categories), and a file path to record the trace (parameter --output):
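The exact invocation depends on how you installed the script, but it could look roughly like this (the category list here is only an example):
# Illustrative invocation: connect to localhost:9999 and record the listed categories.
trace-chrome --host localhost --port 9999 \
    --categories "toplevel,v8,blink" --output trace.json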
This will start recording. To stop recording, just press <Ctrl>+C, and the trace will be transferred and stored to the provided file path.
Tip
Which categories to use? Good presets for certain problem scopes can be obtained from Chrome. Just open chrome://tracing, press the Record button, and play with the predefined settings. At the bottom you will see the list of categories to pass for each of them.
Now that the trace file has been obtained, it can be opened from Chrome or Chromium running on the host: load the URL chrome://tracing in a tab and press the Load button.
Tip
The traces are completely standalone, so they can be loaded on any other computer without any additional artifacts. This is useful, as those traces can be shared among developers or uploaded to a ticket tracker.
But, if you want to do that, do not forget to compress them first with gzip to make the trace smaller. chrome://tracing can open compressed traces directly.
The script also supports periodic recording of the memory-infra system. This captures periodic dumps of the state of memory, with specific instrumentation in several categories.
To use it, add the category disabled-by-default-memory-infra, and pass the following parameters to configure the capture:
--memory_dump_mode <background|light|detailed>: level of detail. background is designed to have almost no impact on execution, running very fast. light mode shows a few entries, while detailed is unlimited and provides the most complete information.
--memory_dump_interval: the interval in milliseconds between snapshots.
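Putting it together, a detailed memory capture every two seconds might look like this (illustrative values only):
# Illustrative: detailed memory dumps every 2000 ms alongside the memory-infra category.
trace-chrome --host localhost --port 9999 \
    --categories disabled-by-default-memory-infra \
    --memory_dump_mode detailed --memory_dump_interval 2000 \
    --output memory-trace.json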
For convenience, it is also possible to use trace-chrome with npx. It will install the script and the dependencies in the NPM cache, and run from them:
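Assuming the NPM package name matches the script name, that would be something like:
npx trace-chrome --host localhost --port 9999 --categories toplevel --output trace.json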
trace-chrome is a very simple tool, just providing a convenient command line interface for interacting with remote Chromium instances. It is especially useful for tracing embedded devices.
It has been useful to me for years, on a number of platforms, from Windows to Linux, from desktop to low-end devices.
Update on what happened in WebKit in the week from March 17 to March 24.
Cross-Port 🐱
Limited the amount of data stored for certain elements of WebKitWebViewSessionState. This results in memory savings, and avoids oddly large objects which resulted in web view state being restored slowly.
Multimedia 🎥
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
JavaScriptCore 🐟
The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.
Fixed an integer overflow when using wasm/gc on 32-bit platforms.
Graphics 🖼️
Landed a change that fixes a few scenarios where the damage was not generated on layer property changes.
Releases 📦️
WebKitGTK 2.48.0 and WPE WebKit 2.48.0 have been released. While they may not look as exciting as the 2.46 series, which introduced the use of Skia for painting, they nevertheless include half a year of improvements. This development cycle focused on reworking internals, which brings modest performance improvements for all kinds of devices, but most importantly cleanups which will enable further improvements going forward.
For those who need longer to integrate newer releases, which we know can be a longer process for embedded device distributors, we have also published WPE WebKit 2.46.7 with a few stability and security fixes.
Accompanying these releases there is security advisory WSA-2025-0002 (GTK, WPE), which covers the solved security issues. Crucially, all three contain the fix for an issue known to be exploited in the wild, and therefore we strongly encourage updating.
The inspiration for doing Leaning In! came from the tutorial at BOBKonf 2024 by Joachim Breitner and David Christiansen. The tutorial room was full; in fact, it was overfull and not everyone who wanted to attend could attend. I’d kept my eye on Lean from its earliest days but lost the thread for a long time. The image I had of Lean came from its version 1 and 2 days, when the project was still closely aligned with the aims of homotopy type theory. I didn’t know about Lean version 3. So when I opened my eyes and woke up, I was in the current era of Lean (version 4), with a great language, humongous standard library, and pretty sensible tooling. I was on board right away. As an organizer of Racketfest, I had some experience putting together (small) conferences, so I thought I’d give it a go with Lean.
I announced the conference a few months ago, so there wasn’t all that much time to find speakers and plan. Still, we had 33 people in the room. When I first started planning the workshop, I thought there’d only be 10-12 people. This was my first time organizing a Lean workshop of any sort, so my initial expectations were very modest. I booked a fairly small room at Spielfeld for that. After some encouragement from Joachim, who politely suggested that 10-12 might be a bit too small, I requested a somewhat larger room, for up to 20 people. But as registrations kept coming in, I needed to renegotiate with Spielfeld. Ultimately, they put us in their largest room (a more appropriately sized room exists but had already been booked). The room we were in was somewhat too big, but I’m glad we had the space.
Lean is a delightful mix of program verification and mathematics formalization. That was reflected in the program. We had three talks that, I’d say, were definitely more in the computer science camp. With Lean, it’s not so clear at times. Lukas’s talk was motivated by some applications coming from computer science, but the topic makes sense on its own and could have been taken up by a mathematician. The opening talk, Recursive definitions, by Joachim Breitner, was about the internals of Lean itself, so I think it doesn’t count as a talk on formalizing mathematics. But it sort of was, in the sense that it was about the logic in the Lean kernel. It was computer science-y, but it wasn’t really about using Lean; it was more about better understanding how Lean works under the hood.
It is clear that mathematics formalization in Lean is very much ready for research-level mathematics. The mathematics library is very well developed, and Lean is fast enough, with good enough tooling, to enable mathematicians to do serious stuff. We are light years past noodling about the Peano axioms or How do I formalize a group?. I have a gut feeling that we may be approaching a point in the near future where Lean might become a common way of doing mathematics.
What didn’t go so well
The part of the event that probably didn’t go quite as I had planned was the Proof Clinic in the afternoon. The intention of the proof clinic was to take advantage of the fact that many of us had come to Berlin to meet face-to-face, and there were several experts in the room. Let’s work together! If there’s anything you’re stuck on, let’s talk about it and make some progress, today. Think of it as a sort of micro-unconference (just one hour long) within a workshop.
That sounds good, but I didn’t prepare the attendees well enough. I only started adding topics to the list of potential discussion items in the morning, and I was the only one adding them. Privately, I had a few discussion items in my back pocket, but they were intended just to get the conversation going. My idea was that once we prime the pump, we’ll have all sorts of things to talk about.
That’s not quite what happened. We did, ultimately, discuss a few interesting things but it took a while for us to warm up. Also, doing the proof clinic as a single large group might not have been the best idea. Perhaps we should have split up into groups and tried to work together that way.
I also learned that several attendees don’t use Zulip, so my assumption that Zulip is the one and only way for people to communicate about Lean wasn’t quite right. I could have communicated better with attendees in advance to make sure that we coordinated discussion in Zulip, instead of simply assuming that, of course, everyone is there.
The future
Will there be another edition of Leaning In!?
Yes, I think so. It's a lot of work to organize a conference (and there's always more to do, even when you know that there's a lot!). But the community benefits are clear. Stay tuned!
(This is my first NPM package. I made it in TypeScript; it’s my first go at the language.)
What?
Decimal128 is an IEEE standard for floating-point decimal numbers. These numbers aren’t the binary floating-point numbers that you know and love (?), but decimal numbers. You know, the kind we learn about before we’re even ten years old. In the binary world, things like 0.1 + 0.2 aren’t exactly equal to 0.3, and calculations like 0.7 * 1.05 don’t work out to exactly 0.735. These kinds of numbers are what we use when doing all sorts of everyday calculations, especially those having to do with money.
Decimal128 encodes decimal numbers into 128 bits. It is a fixed-width encoding, unlike arbitrary-precision numbers, which, of course, require an arbitrary amount of space. The encoding can represent numbers with up to 34 significant digits and an exponent of –6143 to 6144. That is a truly vast amount of space if one keeps the intended use cases involving human-readable and -writable numbers (read: money) in mind.
Why?
I’m working on extending the JavaScript language with decimal numbers (proposal-decimal). One of the design decisions that has to be made there is whether to implement arbitrary-precision decimal numbers or to implement some kind of approximation thereof, with Decimal128 being the main contender. As far as I could tell, there was no implementation of Decimal128 in JavaScript, so I made one.
The intention isn’t to support the full Decimal128 standard, nor should one expect to achieve the performance that, say, a C/C++ library would give you in userland JavaScript. (To say nothing of having machine-native decimal instructions, which is truly exotic.) The intention is to give JavaScript developers something that genuinely strives to approximate Decimal128 for JS programs.
In particular, the hope is that this library offers the JS community a chance to get a feel for what Decimal128 might be like.
How to use
Just do
$ npm install decimal128
and start using the provided Decimal128 class.
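For instance, something along these lines (a sketch; the exact method names are best checked against the package README):
import { Decimal128 } from "decimal128";

const a = new Decimal128("0.1");
const b = new Decimal128("0.2");
// Unlike binary floats, 0.1 + 0.2 is exactly 0.3 here.
console.log(a.add(b).toString()); // "0.3"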
Issues?
If you find any bugs or would like to request a feature, just open an issue and I’ll get on it.
The decimals around us: Cataloging support for decimal numbers
Decimals are a data type that aims to exactly represent decimal numbers. Some programmers may not know, or fully realize, that, in most programming languages, the numbers that you enter look like decimal numbers but internally are represented as binary—that is, base-2—floating-point numbers. Things that are totally simple for us, such as 0.1, simply cannot be represented exactly in binary. The decimal data type—whatever its stripe or flavor—aims to remedy this by giving us a way of representing and working with decimal numbers, not binary approximations thereof. (Wikipedia has more.)
To help with my work on adding decimals to JavaScript, I've gone through a list of popular programming languages, taken from the 2022 StackOverflow developer survey. What follows is a brief summary of where these languages stand regarding decimals. The intention is to keep things simple. The purpose is:
If a language does have decimals, say so;
If a language does not have decimals, but at least one third-party library exists, mention it and link to it. If a discussion is underway to add decimals to the language, link to that discussion.
There is no intention to single out any language in particular; I'm just working with a slice of languages found in the StackOverflow list linked to earlier. If a language does not have decimals, there may well be multiple third-party decimal libraries. I'm not aware of all libraries, so if I have linked to a minor library and neglected to link to a more high-profile one, please let me know. More importantly, if I have misrepresented the basic fact of whether decimals exist at all in a language, send mail.
C
C does not have decimals. But they're working on it! The C23 (as in, 2023) standard proposes to add new fixed bit-width data types (32, 64, and 128 bits) for these numbers.
C#
C# has decimals in its underlying .NET subsystem. (For the same reason, decimals also exist in Visual Basic.)
C++
C++ does not have decimals. But—like C—they're working on it!
Dart
Dart does not have decimals. But a third-party library exists.
Go
Go does not have decimals, but a third-party library exists.
JavaScript
JavaScript does not have decimals. We're working on it!
Kotlin
Kotlin does not have decimals. But, in a way, it does: since Kotlin is running on the JVM, one can get decimals by using Java's built-in support.
Ruby
Ruby has decimals. Despite that, there is some third-party work to improve the built-in support.
Rust
Rust does not have decimals, but a crate exists.
SQL
SQL has decimals (it is the DECIMAL data type). (Here is the documentation for, e.g., PostgreSQL, and here is the documentation for MySQL.)
TypeScript
TypeScript does not have decimals. However, if decimals get added to JavaScript (see above), TypeScript will probably inherit decimals, eventually.
Here’s how to unbreak floating-point math in JavaScript
Because computers are limited, they work in a finite range of numbers, namely, those that can be represented straightforwardly as fixed-length (usually 32 or 64) sequences of bits. If you’ve only got 32 or 64 bits, it’s clear that there are only so many numbers you can represent, whether we’re talking about integers or decimals. For integers, it’s clear that there’s a way to exactly represent mathematical integers (within the finite domain permitted by 32 or 64 bits). For decimals, we have to deal with the limits imposed by having only a fixed number of bits: most decimal numbers cannot be exactly represented. This leads to headaches in all sorts of contexts where decimals arise, such as finance, science, engineering, and machine learning.
It has to do with our use of base 10 and the computer’s use of base 2. Math strikes again! Exactness of decimal numbers isn’t an abstruse, edge-case-y problem that some mathematicians thought up to poke fun at programmers who aren’t blessed to work in an infinite domain. Consider a simple example. Fire up your favorite JavaScript engine and evaluate this:
1 + 2 === 3
You should get true. Duh. But take that example and work it with decimals:
0.1 + 0.2 === 0.3
You’ll get false.
How can that be? Is floating-point math broken in JavaScript? Short answer: yes, it is. But if it’s any consolation, it’s not just JavaScript that’s broken in this regard. You’ll get the same result in all sorts of other languages. This isn’t wat. This is the unavoidable burden we programmers bear when dealing with decimal numbers on machines with limited precision.
Maybe you’re thinking OK, but if that’s right, how in the world do decimal numbers get handled at all? Think of all the financial applications out there that must be doing the wrong thing countless times a day. You’re quite right! One way of getting around oddities like the one above is by always rounding. Another is by handling decimal numbers as strings (sequences of digits). You would then define operations such as addition, multiplication, and equality by doing elementary school math, digit by digit (or, rather, character by character).
So what to do?
Numbers in JavaScript are supposed to be IEEE 754 floating-point numbers. A consequence of this is, effectively, that 0.1 + 0.2 will never be 0.3 (in the sense of the === operator in JavaScript). So what can be done?
There’s an npm library out there, decimal.js, that provides support for arbitrary precision decimals. There are probably other libraries out there that have similar or equivalent functionality.
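A quick sketch of what that looks like with decimal.js (see its documentation for the exact API surface):
import Decimal from "decimal.js";

const sum = new Decimal("0.1").plus("0.2");
console.log(sum.toString());    // "0.3"
console.log(sum.equals("0.3")); // true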
As you might imagine, the issue under discussion is old. There are workarounds using a library.
But what about extending the language of JavaScript so that the equation does get validated? Can we make JavaScript work with decimals correctly, without using a library?
Yes, we can.
Aside: Huge integers
It’s worth thinking about a similar issue that also arises from the finiteness of our machines: arbitrarily large integers in JavaScript. Out of the box, JavaScript didn’t support extremely large integers. You’ve got 32-bit or (more likely) 64-bit signed integers. But even though that’s a big range, it’s still, of course, limited. BigInt, a proposal to extend JS with precisely this kind of thing, reached Stage 4 in 2019, so it should be available in pretty much every JavaScript engine you can find. Go ahead and fire up Node or open your browser’s inspector and plug in the number of nanoseconds since the Big Bang:
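Something like the following back-of-the-envelope sketch, assuming an age of roughly 13.8 billion years:
// ~13.8 billion years expressed in nanoseconds, as a BigInt (note the trailing n).
const nsSinceBigBang =
  13_800_000_000n * 365n * 24n * 60n * 60n * 1_000_000_000n;
console.log(nsSinceBigBang); // far beyond Number.MAX_SAFE_INTEGER, yet exact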
(Not a scientician. May not be true. Not intended to be a factual claim.)
Adding big decimals to the language
OK, enough about big integers. What about adding support for arbitrary-precision decimals in JavaScript? Or, at least, high-precision decimals? As we saw above, we don’t even need to wrack our brains trying to think of complicated scenarios where a ton of digits after the decimal point are needed. Just look at 0.1 + 0.2 = 0.3. That’s pretty low-precision, and it still doesn’t work. Is there anything analogous to BigInt for non-integer decimal numbers? Not as a library; we already discussed that. But can we add it to the language, so that, out of the box—with no third-party library—we can work with decimals?
The answer is yes. Work is proceeding on this matter, but things remain somewhat unsettled. The relevant proposal is BigDecimal. I’ll be working on this for a while. I want to get big decimals into JavaScript. There are all sorts of issues to resolve, but they’re definitely resolvable. We have experience with arbitrary-precision arithmetic in other languages. It can be done.
So yes, floating-point math is broken in JavaScript, but help is on the way. You’ll see more from me here as I tackle this interesting problem; stay tuned!
Binary floats can let us down! When close enough isn't enough
If you've played Monopoly, you'll know about the Bank Error in Your Favor card in the Community Chest. Remember this?
A bank error in your favor? Sweet! But what if the bank makes an error in its favor? Surely that's just as possible, right?
I'm here to tell you that if you're doing everyday financial calculations—nothing fancy, but involving money that you care about—and you're using binary floating-point numbers, then something might be going wrong. Let's see how binary floating-point numbers might yield bank errors in your favor—or the bank's.
In a wonderful paper on decimal floating-point numbers, Mike Cowlishaw gives an example.
Here's how you can reproduce that in JavaScript:
(1.05 * 0.7).toPrecision(2);
// 0.73
Some programmers might not be aware of this, but many are. By pointing this out I'm not trying to be a smartypants who knows something you don't. For me, this example illustrates just how common this sort of error might be.
For programmers who are aware of the issue, one typical approach to dealing with it is this: never work with sub-units of a currency. (Some currencies don't have this issue. If that's you and your problem domain, you can kick back and be glad that you don't need to engage in the following sorts of headaches.) For instance, when working with US dollars or euros, this approach mandates that one never works with dollars and cents (or euros and cents), but only with cents. In this setting, dollars exist only as an abstraction on top of cents. As far as possible, calculations never use floats. But if a floating-point number threatens to come up, some form of rounding is used.
Another approach for a programmer is to delegate financial calculations to an external system, such as a relational database, that natively supports proper decimal calculations. One difficulty is that even if one delegates these calculations to an external system, if one lets a floating-point value flow into your program, even a value that can be trusted, it may become tainted just by being imported into a language that doesn't properly support decimals. If, for instance, the result of a calculation done in, say, Postgres, is exactly 0.1, and that flows into your JavaScript program as a number, it's possible that you'll be dealing with a contaminated value. For instance:
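The exact snippet isn't reproduced here, but a minimal reconstruction of the kind of thing meant is:
const fromDatabase = 0.1; // imagine this arrived from Postgres as an exact decimal 0.1
console.log(fromDatabase.toFixed(19)); // "0.1000000000000000056"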
This example, admittedly, requires quite a lot of decimals (19!) before the ugly reality of the situation rears its head. The reality is that 0.1 does not, and cannot, have an exact representation in binary. The earlier example with the cost of a phone call is there to raise your awareness of the possibility that one doesn't need to go 19 decimal places before one starts to see some weirdness showing up.
There are all sorts of examples of this. It's exceedingly rare for a decimal number to have an exact representation in binary. Of the numbers 0.1, 0.2, …, 0.9, only 0.5 can be exactly represented in binary.
Next time you look at a bank statement, or a bill where some tax is calculated, I invite you to ask how that was calculated. Are they using decimals, or floats? Is it correct?
I'm working on the decimal proposal for TC39 to work out what it might be like to add proper decimal numbers to JavaScript. There are a few very interesting degrees of freedom in the design space (such as the precise datatype to be used to represent these kinds of numbers), but I'm optimistic that a reasonable path forward exists, that consensus between JS programmers and JS engine implementors can be found, and eventually implemented. If you're interested in these issues, check out the README in the proposal and get in touch!
I’m on macOS and use Homebrew extensively. My simple go-to approach to finding new software is to do brew search lean. This revealed lean and also surfaced elan. Running brew info lean showed me that that package (at the time I write this) installs Lean 3. But I know, out-of-band, that Lean 4 is what I want to work with. Running brew info elan looked better, but the output reminds me that (1) the information is for the elan-init package, not the elan cask, and (2) elan-init conflicts with both elan and the aforementioned lean. Yikes! This strikes me as a potential problem for the community, because I think Lean 3, though it still works, is presumably not where new Lean development should be taking place. Perhaps the Homebrew formula for Lean should be renamed to lean3, and a new lean4 package should be made available. I’m not sure. The situation seems less than ideal, but in short, I have been successful with the elan-init package.
After installing elan-init, you’ll have the elan tool available in your shell. elan is the tool used for maintaining different versions of Lean, similar to nvm in the Node.js world or pyenv.
Setting up a blank package
When I did the Lean 4 tutorial at BOB, I worked entirely within VS Code and created a new standalone package using some in-editor functionality. At the command line, I use lake init to manually create a new Lean package. At first, I made the mistake of running this command, assuming it would create a new directory for me and set up any configuration and boilerplate code there. I was surprised to find, instead, that lake init sets things up in the current directory, in addition to creating a subdirectory and populating it. Using lake --help, I read about the lake new command, which does what I had in mind. So I might suggest using lake new rather than lake init.
What’s in the new directory? Doing tree foobar reveals
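something like the following (a reconstruction based on the files discussed below; the exact listing may vary between Lean versions):
foobar
├── Foobar
│   └── Basic.lean
├── Foobar.lean
├── Main.lean
└── lakefile.lean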
Taking a look there, I see four .lean files. Here’s what they contain:
Main.lean
import «Foobar»
def main : IO Unit :=
IO.println s!"Hello, {hello}!"
Foobar.lean
-- This module serves as the root of the `Foobar` library.
-- Import modules here that should be built as part of the library.
import «Foobar».Basic
Foobar/Basic.lean
def hello := "world"
lakefile.lean
import Lake
open Lake DSL
package «foobar» where
-- add package configuration options here
lean_lib «Foobar» where
-- add library configuration options here
@[default_target]
lean_exe «foobar» where
root := `Main
It looks like there’s a little module structure here, and a reference to the identifier hello, defined in Foobar/Basic.lean and made available via Foobar.lean. I’m not going to touch lakefile.lean for now; as a newbie, it looks scary enough that I think I’ll just stick to things like Basic.lean.
There’s also an automatically created .git there, not shown in the directory output above.
Now what?
Now that you’ve got Lean 4 installed and set up a package, you’re ready to dive into one of the official tutorials. The one I’m working through is David’s Functional Programming in Lean. There are all sorts of additional things to learn, such as all the different lake commands. Enjoy!
Announcing a polyfill for the TC39 decimal proposal
I’m happy to announce that the decimal proposal—a proposed extension of JavaScript to support decimal numbers—is now available as an NPM package called proposal-decimal!
(Actually, it has been available for some time, made available not long after we decided to pursue IEEE 754 Decimal128 as a data model for the decimal proposal rather than some alternatives. The old package was—and still is—available under a different name—decimal128—but I’ll be sunsetting that package in favor of the new one announced here. If you’ve been using decimal128, you can continue to use it, but you’ll probably want to switch to proposal-decimal.)
To use proposal-decimal in your project, install the NPM package. If you’re looking to use this code in Node.js or other JS engines that support ESM, you'll want to import the code like this:
import { Decimal128 } from 'proposal-decimal';
const x = new Decimal128("0.1");
// etc.
For use in a browser, the file dist/Decimal128.mjs contains the Decimal128 class and all its internal dependencies in a single file. Use it like this:
<script type="module">
import { Decimal128 } from 'path/to/Decimal128.mjs';
const x = new Decimal128("0.1");
// keep rocking decimals!
</script>
The intention of this polyfill is to track the spec text for the decimal proposal. I cannot recommend this package for production use just yet, but it is usable and I’d love to hear any experience reports you may have. We’re aiming to be as faithful as possible to the spec, so we don’t aim to be blazingly fast. That said, please do report any wild deviations in performance compared to other decimal libraries for JS as an issue. Any crashes or incorrect results should likewise be reported as an issue.
Update on what happened in WebKit in the week from March 10 to March 17.
Cross-Port 🐱
Web Platform 🌐
Updated button activation behaviour and type property reflection with command and commandfor. Also aligned popovertarget behaviour with latest specification.
Very nearly none of the content that we encounter online is entirely hand-authored from opening doctype to closing HTML tag: it's assembled. We have layouts, and includes, and templates, and so on. And most of the actual content that we produce and consume is written in some more familiar or easier-to-write shorthand. For most of the people reading this, it's probably mostly markdown.
But did you know that lots of the places where you use markdown, like GitHub, GitLab, and Visual Studio, support embedding mathematical expressions written in LaTeX surrounded by $ (inline) or $$ (block)? Those are then transformed for you into rendered math with MathML.
It got me thinking that we should have a similarly standard, easy setup for 11ty. It would be a huge win to process it on the server: MathML will render natively, fast, without FOUC. It will be accessible, styleable, and will scale appropriately with text size and zoom and so on.
The super interesting thing to note about most of the tools where you can use this markup is that so many of them are built on common infrastructure: markdown-it. The architectural pattern of markdown-it allows people to write plugins, and if you're looking to match those above, you can do it pretty easily with the @mdit/plugin-katex plugin:
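Here is a minimal sketch of what that wiring could look like in an Eleventy configuration file (assuming Eleventy 3's ESM config; the names come from the markdown-it and @mdit/plugin-katex documentation, so double-check them against the current READMEs):
import markdownIt from "markdown-it";
import { katex } from "@mdit/plugin-katex";

export default function (eleventyConfig) {
  // Build a markdown-it instance with the KaTeX plugin enabled
  // and tell Eleventy to use it for markdown processing.
  const md = markdownIt({ html: true }).use(katex);
  eleventyConfig.setLibrary("md", md);
}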
And... That's it. Now you can embed LaTeX math in your markdown just as you can in those other places and it will do the work of generating fast, native, accessible, and styleable MathML...
Some math $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ whee.
Yields...
<p>Some math <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><msup><mi>x</mi><mn>2</mn></msup><msup><mi>a</mi><mn>2</mn></msup></mfrac><mo>+</mo><mfrac><msup><mi>y</mi><mn>2</mn></msup><msup><mi>b</mi><mn>2</mn></msup></mfrac><mo>=</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1</annotation></semantics></math></span> whee.</p>
Which your browser renders as...
Some math $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ whee, rendered inline as native MathML.
Surrounding the math part alone with $$ instead yields block math, which renders as..
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$$
Mathematical fonts are critical for rendering. We almost have universally good default math fonts, but some operating systems (eyes Android disapprovingly) still don't. However, like me you can include some CSS to help. On this page I've included
His mathup package seems pretty nice (an AsciiMath dialect more than just AsciiMath), and there is also a corresponding markdown-it-math which is similarly easy to use...
Then, you can embed AsciiMath in your markdown, fenced by the same $ or $$ and it will generate some nice MathML. For example...
$$
e = sum_(n=0)^oo 1/n!
$$
Will be transformed at build time and render in your browser as...
$e = \sum_{n=0}^{\infty} \frac{1}{n!}$
Make sure you get markdown-it-math 5.0.0-rc.0 or above or this won't work. You might also consider including their stylesheet.
markdown-it-math also supports a nice pattern for easily integrating other engines for embedded transformations like Ron Kok's Temml or Fred Wang's TeXZilla.
Unicode Math
There is also Unicode Math, which Murray Sargent III developed and had integrated into all of the Microsoft products. It's pretty nifty too if you ask me. This repo has a nice comparison of the three.
Unfortunately there is no npm module for it (yet), so for now, unfortunately that remains an open wish.
So, that's it. Enjoy. Mathify your static sites.
Before you go... The work of math rendering in browsers is severely underfunded. Almost none of the effort or funding over the last 30 years to make this possible has come from browser vendors, but rather from individual contributors and those willing to help fund them. If you appreciate the importance of this work, please consider helping to support it with a small monthly donation, and please help us publicly lobby implementers to invest in it.
Recently, the GStreamer development workflow integrated the usage of pre-commit. pre-commit is a Git hook script that chains different linters, checkers, validators, formatters, etc., which are executed at git commit time. This script is written in Python. And there's another GStreamer utility written in Python: hotdoc.
The challenge is that Debian doesn't allow installing Python packages through pip; they have to be installed as Debian packages or inside virtual environments, such as those created with venv.
So, instead of manually activating a virtual environment whenever I work on GStreamer, let's just use direnv to activate it automatically.
Here’s a screencast of what I did to set up a Python virtual environment, within direnv, and install pre-commit, hotdoc and gst-indent-1.0.
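The heart of that setup is a tiny .envrc in the GStreamer checkout; a minimal sketch (direnv's standard library provides the layout python3 helper, which creates and activates a per-directory virtual environment):
# .envrc -- run `direnv allow` once after creating it
layout python3
With that in place, anything installed with pip while inside the directory lands in the project's virtual environment automatically.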
UPDATE: Tim told me that with pipx we can do the same without the venv hassle.
A month ago I attended Vulkanised 2025 in Cambridge, UK, to present a talk about Device-Generated Commands in Vulkan.
The event was organized by Khronos and took place in the Arm Cambridge office.
The talk I presented was similar to the one from XDC 2024, but instead of being a 5-minute lightning talk, I had 25-30 minutes to present and I could expand the contents to include proper explanations of almost all major DGC concepts that appear in the spec.
I attended the event together with my Igalia colleagues Lucas Fryzek and Stéphane Cerveau, who presented about lavapipe and Vulkan Video, respectively.
We had a fun time in Cambridge and I can sincerely recommend attending the event to any Vulkan enthusiasts out there.
It allows you to meet Khronos members and people working on both the specification and drivers, as well as many other Vulkan users from a wide variety of backgrounds.
The recordings for all sessions are now publicly available, and the one for my talk can be found embedded below.
For those of you preferring slides and text, I’m also providing a transcription of my presentation together with slide screenshots further down.
In addition, at the end of the video there’s a small Q&A section but I’ve always found it challenging to answer questions properly on the fly and with limited time.
For this reason, instead of transcribing the Q&A section literally, I’ve taken the liberty of writing down the questions and providing better answers in written form, and I’ve also included an extra question that I got in the hallways as bonus content.
You can find the Q&A section right after the embedded video.
Vulkanised 2025 recording
[Embedded video player: recording of the Vulkanised 2025 talk.]
Questions and answers with longer explanations
Question: can you give an example of when it’s beneficial to use Device-Generated Commands?
There are two main use cases where DGC would improve performance: on the one hand, many times game engines use compute pre-passes to analyze the scene they want to draw and prepare some data for that scene.
This includes maybe deciding LOD levels, discarding content, etc.
After that compute pre-pass, results would need to be analyzed from the CPU in some way.
This implies a stall: the output from that compute pre-pass needs to be transferred to the CPU so the CPU can use it to record the right drawing commands, or maybe you do this compute pre-pass during the previous frame and it contains data that is slightly out of date.
With DGC, this compute dispatch (or set of compute dispatches) could generate the drawing commands directly, so you don’t stall or you can use more precise data.
You also save some memory bandwidth because you don’t need to copy the compute results to host-visible memory.
On the other hand, sometimes scenes contain so much detail and geometry that recording all the draw calls from the CPU takes a nontrivial amount of time, even if you distribute this draw call recording among different threads.
With DGC, the GPU itself can generate these draw calls, so potentially it saves you a lot of CPU time.
Question: as the extension makes heavy use of buffer device addresses, what are the challenges for tools like GFXReconstruct when used to record and replay traces that use DGC?
The extension makes use of buffer device addresses for two separate things.
First, it uses them to pass some buffer information to different API functions, instead of passing buffer handles, offsets and sizes.
This is not different from other APIs that existed before.
The VK_KHR_buffer_device_address extension contains APIs like vkGetBufferOpaqueCaptureAddressKHR and vkGetDeviceMemoryOpaqueCaptureAddressKHR that are designed to take care of those cases and make it possible to record and replay those traces.
Contrary to VK_KHR_ray_tracing_pipeline, which has a feature to indicate if you can capture and replay shader group handles (fundamental for capture and replay when using ray tracing), DGC does not have any specific feature for capture-replay.
DGC does not add any new problem from that point of view.
Second, the data for some commands that is stored in the DGC buffer sometimes includes device addresses.
This is the case for the index buffer bind command, the vertex buffer bind command, indirect draws with count (double indirection here) and the ray tracing command.
But, again, the addresses in those commands are buffer device addresses.
That does not add new challenges for capture and replay compared to what we already had.
Question: what is the deal with the last token being the one that dispatches work?
One minor detail from DGC, that’s important to remember, is that, by default, DGC respects the order in which sequences appear in the DGC buffer and the state used for those sequences.
If you have a DGC buffer that dispatches multiple draws, you know the state that is used precisely for each draw: it’s the state that was recorded before the execute-generated-commands call, plus the small changes that a particular sequence modifies like push constant values or vertex and index buffer binds, for example.
In addition, you know precisely the order of those draws: executing the DGC buffer is equivalent, by default, to recording those commands in a regular command buffer from the CPU, in the same order they appear in the DGC buffer.
However, when you create an indirect commands layout you can indicate that the sequences in the buffer may run in an undefined order (this is VK_INDIRECT_COMMANDS_LAYOUT_USAGE_UNORDERED_SEQUENCES_BIT_EXT).
If the sequences could dispatch work and then change state, we would have a logical problem: what do those state changes affect?
The sequence that is executed right after the current one?
Which one is that?
We would not know the state used for each draw.
Forcing the work-dispatching command to be the last one is much easier to reason about and is also logically tight.
Naturally, if you have a series of draws on the CPU where, for some of them, you change some small bits of state (e.g. disabling the depth or stencil tests), you cannot do that in a single DGC sequence.
For those cases, you need to batch your sequences in groups with the same state (and use multiple DGC buffers) or you could use regular draws for parts of the scene and DGC for the rest.
Question from the hallway: do you know what drivers do exactly at preprocessing time that is so important for performance?
Most GPU drivers these days have a kernel side and a userspace side.
The kernel driver does a lot of things like talking to the hardware, managing different types of memory and buffers, talking to the display controller, etc.
The kernel driver normally also has facilities to receive a command list from userspace and send it to the GPU.
These command lists are particular for each GPU vendor and model.
The packets that form it control different aspects of the GPU.
For example (this is completely made-up), maybe one GPU has a particular packet to modify depth buffer and test parameters, and another packet for the stencil test and its parameters, while another GPU from another vendor has a single packet that controls both.
There may be another packet that dispatches draw work of all kinds and is flexible enough to accommodate the different draw commands that are available in Vulkan.
The Vulkan userspace driver translates Vulkan command buffer contents to these GPU-specific command lists.
In many drivers, the preprocessing step in DGC takes the command buffer state, combines it with the DGC buffer contents and generates a final command list for the GPU, storing that final command list in the preprocess buffer.
Once the preprocess buffer is ready, executing the DGC commands is only a matter of sending that command list to the GPU.
Talk slides and transcription
Hello, everyone! I’m Ricardo from Igalia and I’m going to talk about device-generated commands in Vulkan.
First, some bits about me.
I have been part of the graphics team at Igalia since 2019.
For those that don’t know us, Igalia is a small consultancy company specialized in open source, and my colleagues in the graphics team work on things such as Mesa drivers, Linux kernel drivers, compositors… that kind of thing.
In my particular case the focus of my work is contributing to the Vulkan Conformance Test Suite and I do that as part of a collaboration between Igalia and Valve that has been going on for a number of years now.
Just to highlight a couple of things, I’m the main author of the tests for the mesh shading extension and device-generated commands that we are talking about today.
So what are device-generated commands?
So basically it’s a new extension, a new functionality, that allows a driver to read command sequences from a regular buffer: something like, for example, a storage buffer, instead of the usual regular command buffers that you use.
The contents of the DGC buffer could be filled from the GPU itself.
This is what saves you the round trip to the CPU and, that way, you can improve the GPU-driven rendering process in your application.
It’s like one step ahead of indirect draws and dispatches, and one step behind work graphs.
And it’s also interesting because device-generated commands provide a better foundation for translating DX12.
If you have a translation layer that implements DX12 on top of Vulkan like, for example, Proton, and you want to implement ExecuteIndirect, you can do that much more easily with device generated commands.
This is important for Proton, which Valve uses to run games on the Steam Deck, i.e. Windows games on top of Linux.
If we set aside Vulkan for a moment, and we stop thinking about GPUs and such, and you want to come up with a naive CPU-based way of running commands from a storage buffer, how do you do that?
Well, one immediate solution we can think of is: first of all, I’m going to assign a token, an identifier, to each of the commands I want to run, and I’m going to store that token in the buffer first.
Then, depending on what the command is, I want to store more information.
For example, if we have a sequence like we see here in the slide where we have a push constant command followed by dispatch, I’m going to store the token for the push constants command first, then I’m going to store some information that I need for the push constants command, like the pipeline layout, the stage flags, the offset and the size.
Then, after that, depending on the size that I said I need, I am going to store the data for the command, which is the push constant values themselves.
And then, after that, I’m done with it, and I store the token for the dispatch, and then the dispatch size, and that’s it.
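To make that idea concrete, here is a rough C sketch of such a naive, token-tagged encoding. Everything in it is made up for illustration (the enum, the info struct, the values); it is not part of any API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tokens for the two commands in the example. */
enum naive_token { TOKEN_PUSH_CONSTANTS, TOKEN_DISPATCH };

/* Naive CPU-style stream: token, command info, command data, next token... */
static void encode_naive_stream(unsigned char *buf)
{
    size_t off = 0;

    enum naive_token tok = TOKEN_PUSH_CONSTANTS;
    memcpy(buf + off, &tok, sizeof tok); off += sizeof tok;

    /* Command info: layout handle, stage flags, offset and size (made up). */
    struct { void *layout; unsigned stages, offset, size; } pc_info =
        { NULL, 0x20u, 0u, 16u };
    memcpy(buf + off, &pc_info, sizeof pc_info); off += sizeof pc_info;

    /* Command data: the push constant values themselves. */
    float pc_values[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    memcpy(buf + off, pc_values, sizeof pc_values); off += sizeof pc_values;

    /* Then the dispatch token followed by the dispatch size, and we are done. */
    tok = TOKEN_DISPATCH;
    memcpy(buf + off, &tok, sizeof tok); off += sizeof tok;
    unsigned group_counts[3] = { 64u, 1u, 1u };
    memcpy(buf + off, group_counts, sizeof group_counts);
}
```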
But this doesn’t really work: this is not how GPUs work.
A GPU would have a hard time running commands from a buffer if we store them this way.
And this is not how Vulkan works because in Vulkan you want to provide as much information as possible in advance and you want to make things run in parallel as much as possible, and take advantage of the GPU.
So what do we do in Vulkan?
In Vulkan, and in the Vulkan VK_EXT_device_generated_commands extension, we have this central concept, which is called the Indirect Commands Layout.
This is the main thing, and if you want to remember just one thing about device generated commands, you can remember this one.
The indirect commands layout is basically like a template for a short sequence of commands.
The way you build this template is using the tokens and the command information that we saw colored red and green in the previous slide, and you build that in advance and pass that in advance so that, in the end, in the command buffer itself, in the buffer that you’re filling with commands, you don’t need to store that information.
You just store the data for each command.
That’s how you make it work.
And the result of this is that with the commands layout, that I said is a template for a short sequence of commands (and by short I mean a handful of them like just three, four or five commands, maybe 10), the DGC buffer can be pretty large, but it does not contain a random sequence of commands where you don’t know what comes next.
You can think about it as divided into small chunks that the specification calls sequences, and you get a large number of sequences stored in the buffer but all of them follow this template, this commands layout.
In the example we had, push constant followed by dispatch, the contents of the buffer would be push constant values, dispatch size, push constant values, dispatch size, repeated many times.
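In other words, once the tokens and command information live in the commands layout, each sequence is just per-command data. A purely illustrative C view of one sequence for that push-constant-plus-dispatch example could be:

```c
#include <stdint.h>

/* Illustrative only: the field names and types are assumptions, but the
 * shape matches the example layout (push constant data, then dispatch size). */
struct example_sequence {
    float    push_constant_values[4]; /* data for the push constant token */
    uint32_t dispatch_size[3];        /* x, y and z workgroup counts      */
};

/* Conceptually, the DGC buffer is then an array of such sequences:
 * struct example_sequence dgc_buffer[MAX_SEQUENCES]; */
```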
The second thing that Vulkan does to be able to make this work is that we limit a lot what you can do with device-generated commands.
There are a lot of things you cannot do.
In fact, the only things you can do are the ones that are present in this slide.
You have some things like, for example, updating push constants, binding index buffers and vertex buffers, drawing in different ways (using mesh shading, maybe), dispatching compute work and dispatching ray tracing work, and that’s it.
You also need to check which features the driver supports, because maybe the driver only supports device-generated commands for compute or ray tracing or graphics.
But you notice you cannot do things like start render passes or insert barriers or bind descriptor sets or that kind of thing.
No, you cannot do that.
You can only do these things.
This indirect commands layout, which is the backbone of the extension, specifies, as I said, the layout for each sequence in the buffer and it has additional restrictions.
The first one is that it must specify exactly one token that dispatches some kind of work and it must be the last token in the sequence.
You cannot have a sequence that dispatches graphics work twice, or that dispatches compute work twice, or that dispatches compute first and then draws, or something like that.
No, you can only do one thing with each DGC buffer and each commands layout and it has to be the last one in the sequence.
And one interesting thing that also Vulkan allows you to do, that DX12 doesn’t let you do, is that it allows you (on some drivers, you need to check the properties for this) to choose which shaders you want to use for each sequence.
This is a restricted version of the bind pipeline command in Vulkan.
You cannot choose arbitrary pipelines and you cannot change arbitrary states but you can switch shaders.
For example, if you want to use a different fragment shader for each of the draws in the sequence, you can do that.
This is pretty powerful.
How do you create one of those indirect commands layout?
Well, with one of those typical Vulkan calls to create an object, where you pass one of those CreateInfo structures that are always present in Vulkan.
And, as you can see, you have to pass these shader stages that will be used, will be active, while you draw or you execute those indirect commands.
You have to pass the pipeline layout, and you have to pass in an indirect stride.
The stride is the amount of bytes for each sequence, from the start of a sequence to the next one.
And the most important information of course, is the list of tokens: an array of tokens that you pass as the token count and then the pointer to the first element.
Now, each of those tokens contains a bit of information and the most important one is the type, of course.
Then you can also pass an offset that tells the driver how many bytes into the sequence the data for that command starts.
Together with the stride, it tells us that you don’t need to pack the data for those commands together.
If you want to include some padding, because it’s convenient or something, you can do that.
And then there’s also the token data which allows you to pass the information that I was painting in green in other slides like information to be able to run the command with some extra parameters.
Only a few tokens, a few commands, need that.
Depending on which command it is, you have to fill one of the pointers in the union, but most commands don’t need this kind of information.
Knowing which command it is you just know you are going to find some fixed data in the buffer and you just read that and process that.
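As a hedged sketch of what that creation call can look like for the push-constant-plus-dispatch example (structure and enum names are from VK_EXT_device_generated_commands as I recall them, so double-check them against the spec; error handling and pipeline layout creation are omitted, and in practice you may need to load the extension entry points with vkGetDeviceProcAddr):

```c
#include <vulkan/vulkan.h>

static VkIndirectCommandsLayoutEXT
create_example_commands_layout(VkDevice device, VkPipelineLayout pipelineLayout)
{
    /* Info for the push constant token: which range gets updated. */
    const VkIndirectCommandsPushConstantTokenEXT pcToken = {
        .updateRange = { .stageFlags = VK_SHADER_STAGE_COMPUTE_BIT,
                         .offset = 0, .size = 16 },
    };

    const VkIndirectCommandsLayoutTokenEXT tokens[2] = {
        {
            .sType  = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
            .type   = VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT,
            .data.pPushConstant = &pcToken,
            .offset = 0,                  /* data at the start of the sequence   */
        },
        {
            .sType  = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
            .type   = VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT,
            .offset = 16,                 /* right after the push constant data  */
        },
    };

    const VkIndirectCommandsLayoutCreateInfoEXT createInfo = {
        .sType          = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_EXT,
        .shaderStages   = VK_SHADER_STAGE_COMPUTE_BIT,
        .indirectStride = 16 + sizeof(VkDispatchIndirectCommand),
        .pipelineLayout = pipelineLayout,
        .tokenCount     = 2,
        .pTokens        = tokens,
    };

    VkIndirectCommandsLayoutEXT layout = VK_NULL_HANDLE;
    vkCreateIndirectCommandsLayoutEXT(device, &createInfo, NULL, &layout);
    return layout;
}
```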
One thing that is interesting, like I said, is the ability to switch shaders and to choose which shaders are going to be used for each of those individual sequences.
Some form of pipeline switching, or restricted pipeline switching.
To do that you have to create something that is called Indirect Execution Sets.
Each of these execution sets is like a group or an array, if you want to think about it like that, of pipelines: similar pipelines or shader objects.
They have to share something in common, which is that all of the state in the pipeline has to be identical, basically.
Only the shaders can change.
When you create these execution sets and you start adding pipelines or shaders to them, you assign an index to each pipeline in the set.
Then, you pass this execution set beforehand, before executing the commands, so that the driver knows which set of pipelines you are going to use.
And then, in the DGC buffer, when you have this pipeline token, you only have to store the index of the pipeline that you want to use.
You create the execution set with 20 pipelines and you pass an index for the pipeline that you want to use for each draw, for each dispatch, or whatever.
The way to create the execution sets is the one you see here, where we have, again, one of those CreateInfo structures.
There, we have to indicate the type, which is pipelines or shader objects.
Depending on that, you have to fill one of the pointers from the union on the top right here.
If we focus on pipelines because it’s easier on the bottom left, you have to pass the maximum pipeline count that you’re going to store in the set and an initial pipeline.
The initial pipeline is what is going to set the template that all pipelines in the set are going to conform to.
They all have to share essentially the same state as the initial pipeline and then you can change the shaders.
With shader objects, it’s basically the same, but you have to pass more information for the shader objects, like the descriptor set layouts used by each stage, push-constant information… but it’s essentially the same.
Once you have that execution set created, you can use those two functions (vkUpdateIndirectExecutionSetPipelineEXT and vkUpdateIndirectExecutionSetShaderEXT) to update and add pipelines to that execution set.
You need to take into account that you have to pass a couple of special creation flags to the pipelines, or the shader objects, to tell the driver that you may use those inside an execution set because the driver may need to do something special for them.
And one additional restriction that we have is that if you use an execution set token in your sequences, it must appear only once and it must be the first one in the sequence.
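A minimal sketch of that flow, assuming two compute pipelines that were created with the special indirect-bindable creation flag mentioned above (again, verify the exact structure names against the spec, and fetch the entry points with vkGetDeviceProcAddr if needed):

```c
#include <vulkan/vulkan.h>

static VkIndirectExecutionSetEXT
create_example_execution_set(VkDevice device,
                             VkPipeline initialPipeline, VkPipeline otherPipeline)
{
    const VkIndirectExecutionSetPipelineInfoEXT pipelineInfo = {
        .sType            = VK_STRUCTURE_TYPE_INDIRECT_EXECUTION_SET_PIPELINE_INFO_EXT,
        .initialPipeline  = initialPipeline, /* sets the state "template" */
        .maxPipelineCount = 2,
    };

    const VkIndirectExecutionSetCreateInfoEXT createInfo = {
        .sType = VK_STRUCTURE_TYPE_INDIRECT_EXECUTION_SET_CREATE_INFO_EXT,
        .type  = VK_INDIRECT_EXECUTION_SET_INFO_TYPE_PIPELINES_EXT,
        .info.pPipelineInfo = &pipelineInfo,
    };

    VkIndirectExecutionSetEXT executionSet = VK_NULL_HANDLE;
    vkCreateIndirectExecutionSetEXT(device, &createInfo, NULL, &executionSet);

    /* Add a second pipeline at index 1 (the initial pipeline is assumed to
     * occupy index 0). The DGC buffer then stores these indices. */
    const VkWriteIndirectExecutionSetPipelineEXT write = {
        .sType    = VK_STRUCTURE_TYPE_WRITE_INDIRECT_EXECUTION_SET_PIPELINE_EXT,
        .index    = 1,
        .pipeline = otherPipeline,
    };
    vkUpdateIndirectExecutionSetPipelineEXT(device, executionSet, 1, &write);

    return executionSet;
}
```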
The recap, so far, is that the DGC buffer is divided into small chunks that we call sequences.
Each sequence follows a template that we call the Indirect Commands Layout.
Each sequence must dispatch work exactly once and you may be able to switch the set of shaders used with each sequence with an Indirect Execution Set.
How do we go about actually telling Vulkan to execute the contents of a specific buffer?
Well, before executing the contents of the DGC buffer the application needs to have bound all the needed states to run those commands.
That includes descriptor sets, initial push constant values, initial shader state, initial pipeline state.
Even if you are going to use an Execution Set to switch shaders later you have to specify some kind of initial shader state.
Once you have that, you can call this vkCmdExecuteGeneratedCommands.
You bind all the state into your regular command buffer and then you record this command to tell the driver: at this point, execute the contents of this buffer.
As you can see, you typically pass a regular command buffer as the first argument.
Then there’s some kind of boolean value called isPreprocessed, which is kind of confusing because it’s the first time it appears and you don’t know what it is about, but we will talk about it in a minute.
And then you pass a relatively larger structure containing information about what to execute.
In that GeneratedCommandsInfo structure, you need to pass again the shader stages that will be used.
You have to pass the handle for the Execution Set, if you’re going to use one (if not you can use the null handle).
Of course, the indirect commands layout, which is the central piece here.
And then you pass the information about the buffer that you want to execute, which is the indirect address and the indirect address size as the buffer size.
We are using buffer device address to pass information.
And then we have something again mentioning some kind of preprocessing thing, which is really weird: preprocess address and preprocess size which looks like a buffer of some kind (we will talk about it later).
You have to pass the maximum number of sequences that you are going to execute.
Optionally, you can also pass a buffer address for an actual counter of sequences.
And the last thing that you need is the max draw count, but you can forget about that if you are not dispatching work using draw-with-count tokens as it only applies there.
If not, you leave it as zero and it should work.
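Putting that structure together, a hedged sketch of recording the execution for the compute example could look like this (member names as I recall them from the extension; the state is assumed to be already bound on the command buffer):

```c
#include <vulkan/vulkan.h>

static void
record_dgc_execution(VkCommandBuffer cmdBuf,
                     VkIndirectExecutionSetEXT executionSet,  /* or VK_NULL_HANDLE */
                     VkIndirectCommandsLayoutEXT cmdsLayout,
                     VkDeviceAddress dgcAddress, VkDeviceSize dgcSize,
                     VkDeviceAddress preprocessAddress, VkDeviceSize preprocessSize,
                     uint32_t maxSequenceCount)
{
    const VkGeneratedCommandsInfoEXT info = {
        .sType                  = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_INFO_EXT,
        .shaderStages           = VK_SHADER_STAGE_COMPUTE_BIT,
        .indirectExecutionSet   = executionSet,
        .indirectCommandsLayout = cmdsLayout,
        .indirectAddress        = dgcAddress,
        .indirectAddressSize    = dgcSize,
        .preprocessAddress      = preprocessAddress,
        .preprocessSize         = preprocessSize,
        .maxSequenceCount       = maxSequenceCount,
        .sequenceCountAddress   = 0,  /* no GPU-side sequence counter      */
        .maxDrawCount           = 0,  /* not using draw-with-count tokens  */
    };

    /* isPreprocessed = VK_FALSE here: the driver preprocesses implicitly. */
    vkCmdExecuteGeneratedCommandsEXT(cmdBuf, VK_FALSE, &info);
}
```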
We have a couple of things here that we haven’t talked about yet, which are the preprocessing things.
Starting from the bottom, that preprocess address and size give us a hint that there may be a pre-processing step going on.
Some kind of thing that the driver may need to do before actually executing the commands, and we need to pass information about the buffer there.
The boolean value that we pass to the command ExecuteGeneratedCommands tells us that the pre-processing step may have happened before so it may be possible to explicitly do that pre-processing instead of letting the driver do that at execution time.
Let’s take a look at that in more detail.
First of all, what is the pre-process buffer?
The pre-process buffer is auxiliary space, a scratch buffer, because some drivers need to take a look at what the command sequence looks like before actually starting to execute things.
They need to go over the sequence first and they need to write a few things down just to be able to properly do the job later to execute those commands.
Once you have the commands layout and you have the maximum number of sequences that you are going to execute, you can call vkGetGeneratedCommandsMemoryRequirementsEXT and the driver is going to tell you how much space it needs.
Then, you can create a buffer, you can allocate the space for that, you need to pass a special new buffer usage flag (VK_BUFFER_USAGE_2_PREPROCESS_BUFFER_BIT_EXT) and, once you have that buffer, you pass the address and you pass a size in the previous structure.
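A short sketch of that query, under the same caveats about exact names (the buffer and memory allocation with the new usage flag are left out):

```c
#include <vulkan/vulkan.h>

static VkDeviceSize
query_preprocess_buffer_size(VkDevice device,
                             VkIndirectExecutionSetEXT executionSet,
                             VkIndirectCommandsLayoutEXT cmdsLayout,
                             uint32_t maxSequenceCount)
{
    const VkGeneratedCommandsMemoryRequirementsInfoEXT info = {
        .sType                  = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_MEMORY_REQUIREMENTS_INFO_EXT,
        .indirectExecutionSet   = executionSet,
        .indirectCommandsLayout = cmdsLayout,
        .maxSequenceCount       = maxSequenceCount,
        .maxDrawCount           = 0,
    };

    VkMemoryRequirements2 reqs = { .sType = VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2 };
    vkGetGeneratedCommandsMemoryRequirementsEXT(device, &info, &reqs);
    return reqs.memoryRequirements.size;
}
```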
Now the second thing is that we have the possibility of doing this preprocessing step explicitly.
Explicit pre-processing is something that’s optional, but you probably want to do that if you care about performance because it’s the key to performance with some drivers.
When you use explicit pre-processing you don’t want to simply (1) record the state, (2) call vkCmdPreprocessGeneratedCommandsEXT and (3) call vkCmdExecuteGeneratedCommandsEXT back to back in the same command buffer.
That is essentially what implicit pre-processing already does, so doing it this way doesn’t give you anything.
This is designed so that, if you want to do explicit pre-processing, you’re going to probably want to use a separate command buffer for pre-processing.
You want to batch pre-processing calls together and submit them all together to keep the GPU busy and to give you the performance that you want.
While you submit the pre-processing steps you may be still preparing the rest of the command buffers to enqueue the next batch of work.
That’s the key to doing pre-processing optimally.
You need to decide beforehand if you are going to use explicit pre-processing or not because, if you’re going to use explicit preprocessing, you need to pass a flag when you create the commands layout, and then you have to call the function to preprocess generated commands.
If you don’t pass that flag, you cannot call the preprocessing function, so it’s an all or nothing.
You have to decide, and you do what you want.
One thing that is important to note is that preprocessing has to see the same state and the same contents of the input buffers as execution, so it can run properly.
The video contains a cut here because the presentation laptop ran out of battery.
If the pre-processing step needs to have the same state as the execution, you need to have bound the same pipeline state, the same shaders, the same descriptor sets, the same contents.
I said that explicit pre-processing is normally used using a separate command buffer that we submit before actual execution.
You have a small problem to solve, which is that you would need to record state twice: once on the pre-process command buffer, so that the pre-process step knows everything, and once on the execution, the regular command buffer, when you call execute.
That would be annoying.
Instead of that, the pre-process generated commands function takes an argument that is a state command buffer and the specification tells you: this is a command buffer that needs to be in the recording state, and the pre-process step is going to read the state from it.
This is the first time, and I think the only time in the specification, that something like this is done.
You may be puzzled about what this is exactly: how do you use this and how do we pass this?
I just wanted to get this slide out to tell you: if you’re going to use explicit pre-processing, the ergonomic way of using it, the way we thought the pre-processing step would be used, is what you see in this slide.
You take your main command buffer and you record all the state first and, just before calling execute-generated-commands, the regular command buffer contains all the state that you want and that preprocess needs.
You stop there for a moment and then you prepare your separate preprocessing command buffer passing the main one as an argument to the preprocess call, and then you continue recording commands in your regular command buffer.
That’s the ergonomic way of using it.
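A sketch of that flow, as I understand it (the commands layout is assumed to have been created with the explicit-preprocess usage flag, and both command buffers are in the recording state):

```c
#include <vulkan/vulkan.h>

static void
record_with_explicit_preprocess(VkCommandBuffer mainCmdBuf,
                                VkCommandBuffer preprocessCmdBuf,
                                const VkGeneratedCommandsInfoEXT *info)
{
    /* 1. On mainCmdBuf: bind pipeline or shaders, descriptor sets, initial
     *    push constants, etc. (omitted): exactly the state execution uses. */

    /* 2. Pause and record the preprocess step in its own command buffer,
     *    reading that state from mainCmdBuf (the "state command buffer"). */
    vkCmdPreprocessGeneratedCommandsEXT(preprocessCmdBuf, info, mainCmdBuf);

    /* 3. Continue on mainCmdBuf and record the execution itself, with
     *    isPreprocessed set to VK_TRUE this time. */
    vkCmdExecuteGeneratedCommandsEXT(mainCmdBuf, VK_TRUE, info);

    /* preprocessCmdBuf is then submitted before mainCmdBuf, with the
     * synchronization described below between the two. */
}
```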
You do need some synchronization at some steps.
The main one is that, if you generate the contents of the DGC buffer from the GPU itself, you’re going to need some synchronization: writes to that buffer need to be synchronized with something else that comes later, which is executing or reading those commands from the buffer.
Depending on whether you use explicit preprocessing, you synchronize those writes either with the new command-preprocess pipeline stage and its preprocess-read access, or with the regular device-generated-commands execution, which is considered part of the existing draw-indirect stage using indirect-command-read access.
If you use explicit pre-processing you need to make sure that writes to the pre-process buffer happen before you start reading from that.
So you use these (VK_PIPELINE_STAGE_COMMAND_PREPROCESS_BIT_EXT, VK_ACCESS_COMMAND_PREPROCESS_WRITE_BIT_EXT) to synchronize pre-processing with execution (VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, VK_ACCESS_INDIRECT_COMMAND_READ_BIT).
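For reference, a minimal barrier sketch using exactly those flags (it orders the preprocess writes before the indirect command reads; adapt the scopes to your own submission scheme):

```c
#include <vulkan/vulkan.h>

static void
barrier_preprocess_to_execute(VkCommandBuffer cmdBuf)
{
    const VkMemoryBarrier barrier = {
        .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_COMMAND_PREPROCESS_WRITE_BIT_EXT,
        .dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT,
    };

    vkCmdPipelineBarrier(cmdBuf,
                         VK_PIPELINE_STAGE_COMMAND_PREPROCESS_BIT_EXT, /* src */
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,          /* dst */
                         0, 1, &barrier, 0, NULL, 0, NULL);
}
```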
The quick how-to: I just wanted to get this slide out for those wanting a reference that says exactly what you need to do.
All the steps that I mentioned here about creating the commands layout, the execution set, allocating the preprocess buffer, etc.
This is the basic how-to.
Long time no see, beautiful grey sky, roast beef on Sunday, and large but fully packed pubs when there is a football or a rugby game (the Rose team has been lucky this year, grrr).
It was a delightful journey in the UK, starting with my family visiting London, including a lot (yes, a lot…) of sightseeing in a very short amount of time. But we managed to fit everything in. We saw the changing of the guard, the Thames river tide from a boat, Harry Potter gift shops and the beautiful Arsenal stadium with its legendary pitch, one of the best in England.
It was our last attraction in London, and then it was time for my family to head back home from Stansted and for me to go to Cambridge and its legendary university.
To start the journey in Cambridge, I first got some rest on Monday at the hotel to face the hail of information I would get during the conference. This year, Vulkanised took place on Arm’s campus; they kindly hosted the event, providing everything we needed to feel at home and comfortable.
On the first day, we started with an introduction from Ralph Potter, the Vulkan Working Group Chair at Khronos, who presented the new 1.4 release and all the extensions coming along with it,
including “Vulkan Video”. Then we could start this conference with my favorite topic, decoding video content with Vulkan Video. And the game was on! There was a presentation
every 30 minutes, including a neat one from my colleague at Igalia, Ricardo Garcia, about Device-Generated Commands in Vulkan, and a break every 3 presentations. It took a lot of mental energy to keep up with all the topics, as each presentation was more interesting than the last.
During the break, we had time to relax with good coffee, delicious cookies, and nice conversations.
The first day ended with tooling demonstrations from LunarG, helping us all to understand and tame the Vulkan beast. The beast is ours now!
As I was not in the best shape due to a bug I caught on Sunday, I decided to play it safe and went back to the hotel just after a nice Indian meal. I had to prepare myself for the next day, when I would present “Vulkan Video is Open: Application Showcase”.
First Srinath Kumarapuram from Nvidia gave a presentation about the new extensions made available during 2024 by the Vulkan Video TSG.
It started with a brief timeline of the video extensions, from the initial H.26x decoding to the latest VP9 decode coming this year, including the 2024 extensions such as the AV1 codec.
Then he presented more specific extensions, such as VK_KHR_video_encode_quantization_map and VK_KHR_video_maintenance2, released during 2024, and VK_KHR_video_encode_intra_refresh, coming in 2025.
He mentioned that the Vulkan toolbox now completely supports Vulkan Video, including the Validation Layers, Vulkan Profiles, vulkaninfo or GFXReconstruct.
After some deserved applause for a neat presentation, it was my time to be on stage.
During this presentation I focused on the Open source ecosystem around Vulkan Video. Indeed Vulkan Video ships with a sample app which is totally open
along with the regular Conformance Test Suite. But that’s not all!
Two major frameworks now ship with Vulkan Video support: GStreamer and FFmpeg.
Before this, I started by talking about Mesa, the open graphics library. This fully open library provides drivers which support the Vulkan Video extensions and allow applications to run Vulkan Video decode or encode.
The 3 major chip vendors are now supported. It started in 2022 with RADV, a userspace driver that implements the Vulkan API on most modern AMD GPUs. This driver supports all the Vulkan Video extensions except the latest ones, such as VK_KHR_video_encode_quantization_map or VK_KHR_video_maintenance2, but these should be implemented sometime in 2025. Intel GPUs are now supported with the ANV driver, which also supports the common video extensions such as the H264/5 and AV1 codecs. The last driver to gain support arrived at the end of 2024,
when several of the Vulkan Video extensions were introduced to NVK, a Vulkan driver for NVIDIA GPUs. This driver is still experimental, but it’s possible to decode H264 and H265 content, just like with the proprietary driver. This completes the offering for the main GPUs on the market.
Then I moved to the applications including GStreamer, FFmpeg and Vulkan-Video-Samples. In addition to the extensions supported in 2025, we talked mainly about the decode conformance using Fluster. To compare all the implementations, including the driver, the version and the framework, a spreadsheet can be found here.
In this spreadsheet we summarize the 3 supported codecs (H264, H265 and AV1) with their associated test suites and compare their implementations using Vulkan Video (or not, see results
for VAAPI with GStreamer).
GStreamer, my favorite playground, has been able to decode H264 and H265 since 1.24 and recently got support for AV1, but the merge request is still under review. It supports more than 80% of the H264 test vectors in JVT-AVC_V1 and
85% of the H265 test vectors in JCT-VC-HEVC_V1.
FFmpeg offers better figures, passing 90% of the tests. It supports all the available codecs, including all of the encoders as well.
And finally, Vulkan-Video-Samples is the app you want to use to support all codecs for both encode and decode, but it’s currently missing support for Mesa drivers when it comes to running the Fluster decode tests…
During the 3rd day, we had interesting talks as well demonstrating the power of Vulkan, from Blender, a free and open-source 3D computer graphics software tool progressively switching to Vulkan, to the implementation of a 3D game engine using Rust, and compute shaders in astronomy. My other colleague at Igalia, Lucas Fryzek, also had a presentation on Mesa with Lavapipe: Mesa’s software renderer for Vulkan, which allows you to have a hardware-free implementation of Vulkan and to validate extensions in a simpler way. Finally, we finished this prolific and dense conference with Android and its close collaboration with Vulkan.
If you are interested in 3D graphics, I encourage you to attend future Vulkanised editions, which are full of passionate people. And if you cannot attend, you can still watch the presentations online.
If you are interested in the Vulkan Video presentation I gave, you can catch the video here:
Or follow our Igalia live blog post on Vulkan Video:
Update on what happened in WebKit in the week from March 3 to March 10.
Cross-Port 🐱
Web Platform 🌐
Forced styling to field-sizing: fixed when an input element is auto filled, and added
support for changing field-sizing
dynamically.
Fixed an issue where the imperative
popover APIs didn't take into account the source parameter for focus behavior.
Multimedia 🎥
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
Fixed YouTube breakage on videos with
advertisements. The fix prevents scrolling to the comments section when the
videos are fullscreened, but having working video playback was considered more
important for now.
Graphics 🖼️
Fixed re-layout issues for form
controls with the experimental field-sizing implementation.
Landed a change that improves
the quality of damage rectangles and reduces the amount of painting done in the
compositor in some simple scenarios.
Introduced a hybrid threaded rendering
mode, scheduling tasks to both the CPU and GPU worker pools. By default we use
CPU-affine rendering on WPE, and GPU-affine rendering on the GTK port,
saturating the CPU/GPU worker pool first, before switching to the GPU/CPU.
Infrastructure 🏗️
We have recently enabled automatic nightly runs of WPT tests with WPE for the
Web Platform Tests (WPT) dashboard. If you click on the “Edit” button at the
wpt.fyi dashboard now there is the option to select WPE.
These nightly runs now happen daily on the TaskCluster CI sponsored by Mozilla
(Thanks to James Graham!).
If you want to run WPT tests with WPE WebKit locally, there are instructions
at the WPT documentation.
Earlier this
week
I took an inventory of how Guile uses the
Boehm-Demers-Weiser (BDW) garbage collector, with the goal of making
sure that I had replacements for all uses lined up in
Whippet. I categorized the uses
into seven broad categories, and I was mostly satisfied that I have
replacements for all except the last: I didn’t know what to do with
untagged allocations: those that contain arbitrary data, possibly full
of pointers to other objects, and which don’t have a header that we can
use to inspect on their type.
But now I do! Today’s note is about how we can support untagged
allocations of a few different kinds in Whippet’s mostly-marking
collector.
inside and outside
Why bother supporting untagged allocations at all? Well, if I had my
way, I wouldn’t; I would just slog through Guile and fix all uses to be
tagged. There are only a finite number of use sites and I could get to
them all in a month or so.
The problem comes for uses of scm_gc_malloc from outside libguile
itself, in C extensions and embedding programs. These users are loathe
to adapt to any kind of change, and garbage-collection-related changes
are the worst. So, somehow, we need to support these users if we are
not to break the Guile community.
on intent
The problem with scm_gc_malloc, though, is that it is missing an expression of intent, notably as regards tagging. You can use it
to allocate an object that has a tag and thus can be traced precisely,
or you can use it to allocate, well, anything else. I think we will
have to add an API for the tagged case and assume that anything that
goes through scm_gc_malloc is requesting an untagged,
conservatively-scanned block of memory. Similarly for
scm_gc_malloc_pointerless: you could be allocating a tagged object
that happens to not contain pointers, or you could be allocating an
untagged array of whatever. A new API is needed there too for
pointerless untagged allocations.
on data
Recall that the mostly-marking collector can be built in a number of
different ways: it can support conservative and/or precise roots, it can
trace the heap precisely or conservatively, it can be generational or
not, and the collector can use multiple threads during pauses or not.
Consider a basic configuration with precise roots. You can make
tagged pointerless allocations just fine: the trace function for that
tag is just trivial. You would like to extend the collector with the ability
to make untagged pointerless allocations, for raw data. How to do
this?
Consider first that when the collector goes to trace an object, it can’t use bits inside
the object to discriminate between the tagged and untagged cases.
Fortunately though the main space of the mostly-marking collector has one metadata byte for each 16 bytes of
payload. Of those 8 bits, 3 are used for the mark (five different
states, allowing for future concurrent tracing), two for the precise
field-logging write
barrier,
one to indicate whether the object is pinned or not, and one to indicate
the end of the object, so that we can determine object bounds just by
scanning the metadata byte array. That leaves 1 bit, and we can use it
to indicate untagged pointerless allocations. Hooray!
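For the sake of illustration only, here is one way to picture how those eight bits are spent; these identifiers and bit positions are made up and are not Whippet’s actual ones.

```c
/* Purely illustrative layout of the per-16-byte metadata byte. */
enum metadata_bits {
  META_MARK_MASK            = 0x07, /* 3 bits: mark state (five states used)  */
  META_LOG_MASK             = 0x18, /* 2 bits: field-logging write barrier    */
  META_PINNED               = 0x20, /* 1 bit: object must not move            */
  META_END                  = 0x40, /* 1 bit: end of object, for bounds scans */
  META_UNTAGGED_POINTERLESS = 0x80  /* the remaining bit: raw, untagged data  */
};
```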
However there is a wrinkle: when Whippet decides that it should evacuate
an object, it tracks the evacuation state in the object itself; the
embedder has to provide an implementation of a little state machine,
allowing the collector to detect whether an object is forwarded or not,
to claim an object for forwarding, to commit a forwarding pointer, and
so on. We can’t do that for raw data, because all bit states belong to
the object, not the collector or the embedder. So, we have to set the
“pinned” bit on the object, indicating that these objects can’t move.
We could in theory manage the forwarding state in the metadata byte, but
we don’t have the bits to do that currently; maybe some day. For now,
untagged pointerless allocations are pinned.
on slop
You might also want to support untagged allocations that contain
pointers to other GC-managed objects. In this case you would want these
untagged allocations to be scanned conservatively. We can do this, but
if we do, it will pin all objects.
Thing is, conservative stack roots are a kind of sweet spot in
language run-time design. You get to avoid constraining your compiler,
you avoid a class of bugs related to rooting, but you can still support
compaction of the heap.
How is this, you ask? Well, consider that you can move any object for
which we can precisely enumerate the incoming references. This is
trivially the case for precise roots and precise tracing. For
conservative roots, we don’t know whether a given edge is really an
object reference or not, so we have to conservatively avoid moving those
objects. But once you are done tracing conservative edges, any live
object that hasn’t yet been traced is fair game for evacuation, because
none of its predecessors have yet been visited.
But once you add conservatively-traced objects back into the mix, you
don’t know when you are done tracing conservative edges; you could
always discover another conservatively-traced object later in the trace,
so you have to pin everything.
The good news, though, is that we have gained an easier migration path.
I can now shove Whippet into Guile and get it running even before I have
removed untagged allocations. Once I have done so, I will be able to
allow for compaction / evacuation; things only get better from here.
Also as a side benefit, the mostly-marking collector’s heap-conservative
configurations are now faster, because we have metadata attached to
objects which allows tracing to skip known-pointerless objects. This
regains an optimization that BDW has long had via its
GC_malloc_atomic, used in Guile since time out of mind.
fin
With support for untagged allocations, I think I am finally ready to
start getting Whippet into Guile itself. Happy hacking, and see you on
the other side!
It started with my need to debug Chromium’s implementation of OpenXR. I wanted to understand how Chromium interfaces with OpenXR APIs. However, I noticed that only the Android and Windows ports of Chromium currently support OpenXR bindings. Since I needed to debug a desktop implementation, Windows was the only viable option. Additionally, I did not have access to a physical XR device, so I explored whether a simulator or emulator environment could be used to test WebXR support for websites.
Understanding WebXR and OpenXR
Before diving into implementation details, it’s useful to understand what WebXR and OpenXR are and how they differ.
WebXR is a web standard that enables immersive experiences, such as Virtual Reality (VR) and Augmented Reality (AR),
in web browsers. It allows developers to create XR content using JavaScript and run it directly in a browser
without requiring platform-specific applications.
OpenXR is a cross-platform API standard developed by the Khronos Group, designed to unify access to different
XR hardware and software. It provides a common interface for VR and AR devices, ensuring interoperability across
different platforms and vendors.
The key difference is that WebXR is a high-level API used by web applications to access XR experiences, whereas
OpenXR is a low-level API used by platforms and engines to communicate with XR hardware. WebXR implementations,
such as the one in Chromium, use OpenXR as the backend to interact with different XR runtimes.
Chromium OpenXR Implementation
Chromium’s OpenXR implementation, which interacts with the platform-specific OpenXR runtime, is located in the device/vr/ directory. WebXR code interacts with this device/vr/ OpenXR implementation, which abstracts WebXR features across multiple platforms.
WebXR ---> device/vr/ ---> OpenXR API ---> OpenXR runtime
Installing OpenXR Runtime
To run OpenXR on Windows, you need to install an OpenXR runtime. You can download and install OpenXR Tools for Windows Mixed Reality from the Microsoft App Store:
If it is not available on your machine, you can enable it from the OpenXR Runtime tab in the application.
Installing Microsoft Mixed Reality Simulator
To set up a simulated environment for WebXR testing, follow these steps:
Install Mixed Reality Portal from the Microsoft App Store.
Chromium provides a flag to select the OpenXR implementation.
Open Chrome and navigate to:
chrome://flags/#webxr-runtime
Set the flag to OpenXR.
This enables Chromium to use the OpenXR runtime for WebXR applications.
Launch WebVR application
Launch Chromium and open: https://immersive-web.github.io/webxr-samples/immersive-vr-session.html
CallStack
When we call navigator.xr.requestSession("immersive-vr"); from JavaScript, the call stack below gets triggered.
Conclusions
With this setup, you can explore and debug WebXR applications on Windows even without a physical VR headset.
The combination of Chromium’s OpenXR implementation and Microsoft’s Mixed Reality Simulator provides
a practical way to test WebXR features and interactions.
If you’re interested in further experimenting, try developing a simple WebXR scene to validate your
setup! Additionally, we plan to post more about Chromium’s architecture on OpenXR and will link
those posts here once they are ready.
Salutations, populations. Today’s note is more of a work-in-progress
than usual; I have been finally starting to look at getting
Whippet into
Guile, and there are some open questions.
inventory
I started by taking a look at how Guile uses the Boehm-Demers-Weiser
collector‘s API, to make sure I had all
my bases covered for an eventual switch to something that was not BDW.
I think I have a good overview now, and have divided the parts of BDW-GC
used by Guile into seven categories.
implicit uses
Firstly there are the ways in which Guile’s run-time and compiler depend
on BDW-GC’s behavior, without actually using BDW-GC’s API. By this I
mean principally that we assume that any reference to a GC-managed
object from any thread’s stack will keep that object alive. The same
goes for references originating in global variables, or static data
segments more generally. Additionally, we rely on GC objects not to
move: references to GC-managed objects in registers or stacks are valid
across a GC boundary, even if those references are outside the GC-traced
graph: all objects are pinned.
Some of these “uses” are internal to Guile’s implementation itself, and
thus amenable to being changed, albeit with some effort. However some
escape into the wild via Guile’s API, or, as in this case, as implicit
behaviors; these are hard to change or evolve, which is why I am putting
my hopes on Whippet’s mostly-marking
collector,
which allows for conservative roots.
defensive uses
Then there are the uses of BDW-GC’s API, not to accomplish a task, but
to protect the mutator from the collector:
GC_call_with_alloc_lock,
explicitly enabling or disabling GC, calls to sigmask that take
BDW-GC’s use of POSIX signals into account, and so on. BDW-GC can stop
any thread at any time, between any two instructions; for most users this is
anodyne, but if ever you use weak references, things start to get really
gnarly.
Of course a new collector would have its own constraints, but switching
to cooperative instead of pre-emptive safepoints would be a welcome
relief from this mess. On the other hand, we will require client code
to explicitly mark their threads as inactive during calls in more cases,
to ensure that all threads can promptly reach safepoints at all times.
Swings and roundabouts?
precise tracing
Did you know that the Boehm collector allows for precise tracing? It
does! It’s slow and truly gnarly, but when you need precision, precise
tracing is nice to have. (This is the
GC_new_kind
interface.) Guile uses it to mark Scheme stacks, allowing it to avoid
treating unboxed locals as roots. When it loads compiled files, Guile
also adds some slices of the mapped files to the root set. These
interfaces will need to change a bit in a switch to Whippet but are
ultimately internal, so that’s fine.
What is not fine is that Guile allows C users to hook into precise
tracing, notably via
scm_smob_set_mark.
This is not only the wrong interface, not allowing for copying
collection, but these functions are just truly gnarly. I don’t know
what to do with them yet; are our external users ready to forgo
this interface entirely? We have been working on them over time, but I
am not sure.
reachability
Weak references, weak maps of various kinds: the implementation of these
in terms of BDW’s API is incredibly gnarly and ultimately unsatisfying.
We will be able to replace all of these with ephemerons and tables of
ephemerons, which are natively supported by Whippet. The same goes with
finalizers.
The same goes for constructs built on top of finalizers, such as
guardians;
we’ll get to reimplement these on top of nice Whippet-supplied
primitives. Whippet allows for resuscitation of finalized objects, so
all is good here.
misc
There is a long list of miscellanea: the interfaces to explicitly
trigger GC, to get statistics, to control the number of marker threads,
to initialize the GC; these will change, but all uses are internal, making it not a terribly big
deal.
I should mention one API concern, which is that BDW’s state is all
implicit. For example, when you go to allocate, you don’t pass the API
a handle which you have obtained for your thread, and which might hold
some thread-local freelists; BDW will instead load thread-local
variables in its API. That’s not as efficient as it could be and
Whippet goes the explicit route, so there is some additional plumbing to
do.
Finally I should mention the true miscellaneous BDW-GC function:
GC_free. Guile exposes it via an API, scm_gc_free. It was already
vestigial and we should just remove it, as it has no sensible semantics
or implementation.
allocation
That brings me to what I wanted to write about today, but am going to
have to finish tomorrow: the actual allocation routines. BDW-GC
provides two, essentially: GC_malloc and GC_malloc_atomic. The
difference is that “atomic” allocations don’t refer to other
GC-managed objects, and as such are well-suited to raw data. Otherwise you can think of atomic allocations as a pure optimization, given that BDW-GC mostly traces conservatively anyway.
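For readers who haven’t used BDW-GC, the distinction looks like this in practice (a small sketch; gc.h is BDW-GC’s public header):

```c
#include <gc.h>

void example (void)
{
  /* May contain pointers to other GC-managed objects, so the collector
     scans it (conservatively, in the usual configuration). */
  void **cell = GC_malloc (2 * sizeof (void *));

  /* "Atomic": promised to contain no pointers to GC-managed objects, so
     the collector never scans it; well suited to raw data. */
  char *bytes = GC_malloc_atomic (128);

  (void) cell;
  (void) bytes;
}
```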
From the perspective of a user of BDW-GC looking to switch away, there
are two broad categories of allocations, tagged and untagged.
Tagged objects have attached metadata bits allowing their type to be inspected by the user later on. This is the
happy path! We’ll be able to write a gc_trace_object function that
takes any object, does a switch on, say, some bits in the first word,
dispatching to type-specific tracing code. As long as the object is
sufficiently initialized by the time the next safepoint comes around,
we’re good, and given cooperative safepoints, the compiler should be able to
ensure this invariant.
Then there are untagged allocations. Generally speaking, these are of
two kinds: temporary and auxiliary. An example of a temporary
allocation would be growable storage used by a C run-time routine,
perhaps as an unbounded-sized alternative to alloca. Guile uses these a
fair amount, as they compose well with non-local control flow as
occurring for example in exception handling.
An auxiliary allocation on the other hand might be a data structure only
referred to by the internals of a tagged object, but which itself never
escapes to Scheme, so you never need to inquire about its type; it’s
convenient to have the lifetimes of these values managed by the GC, and
when desired to have the GC automatically trace their contents. Some of
these should just be folded into the allocations of the tagged objects
themselves, to avoid pointer-chasing. Others are harder to change,
notably for mutable objects. And the trouble is that for external users of scm_gc_malloc, I fear that we won’t be able to migrate them over, as we don’t know whether they are making tagged mallocs or not.
what is to be done?
One conventional way to handle untagged allocations is to manage
to fit your data into other tagged data structures; V8 does this in many
places with instances of FixedArray, for example, and Guile should do
more of this. Otherwise, you make new tagged data types. In either case, all auxiliary data
should be tagged.
I think there may be an alternative, which would be just to support the
equivalent of untagged GC_malloc and GC_malloc_atomic; but for that,
I am out of time today, so type at y’all tomorrow. Happy hacking!
After fixing
an issue with Trusted Types when doing attribute mutation within the default
callback, and implementing
performance improvements for Trusted Types enforcement, the
Trusted Types
implementation is now considered stable and has been
enabled by default.
Multimedia 🎥
GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.
Landed one fix which,
along with previous patches, solved the webKitMediaSrcStreamFlush() crash
reported in bug #260455.
Unfortunately, in some pages where the crash previously occurred, now a
different blank video bug has been revealed. The cause of this bug is known,
but fixing it would cause performance regressions in pages with many video
elements. Work is ongoing to find a better solution for both.
The initial support of MP4-muxed WebVTT in-band text
tracks is about to be merged,
which will bring this MSE feature to the ports using GStreamer. Text tracks
for the macOS port of WebKit only landed two weeks
ago and we expect there will be
issues to iron out in WebKit ports, multiplatform code and even potentially in
spec work—we are already aware of a few potential ones.
Note that out-of-band text tracks are well supported in MSE across browsers
and commonly used. On the other hand, no browser currently ships with in-band
text track support in MSE.
Support for MediaStreamTrack.configurationchange events was
added,
along with related
improvements
in the GStreamer PipeWire plugin. This will allow WebRTC applications to
seamlessly handle default audio/video capture changes.
Graphics 🖼️
Continued improving the support for handling graphics damage:
Added support
for validating damage rectangles in Layout Tests.
Landed a change that adds
layout tests covering the damage propagation feature.
Landed a change that fixes
damage rectangles on layer resize operations.
Landed a change that
improves damage rectangles produced by scrolling so that they are
clipped to the parent container.
The number of threads used for painting with the GPU has been slightly
tweaked, which brings a measurable
performance improvement in all kinds of devices with four or more processor
cores.
Releases 📦️
The stable branch
for the upcoming 2.48.x stable release series of the GTK and WPE ports has
been created. The first preview releases from this branch are WebKitGTK
2.47.90 and
WPE WebKit 2.47.90.
People willing to report issues and help with stabilization are encouraged to
test them and report issues in Bugzilla.
Community & Events 🤝
Published a blog
post
that presents an opinionated approach to the work with textual logs obtained
from WebKit and GStreamer.
Long story short, yes, it’s possible to use the Ed25519 and X25519 algorithms through the Web Cryptography API exposed by the major browser engines: Blink (Chrome, Edge, Brave, …), WebKit (Safari) and Gecko (Firefox).
However, despite the hard work during the last year we haven’t been able to ship Ed25519 in Chrome. It’s still available behind the Experimental Web Platform Features runtime flag. In this post, I will explain the current blockers and plans for the future regarding this feature.
Although disappointed about the current status of Ed25519 in Chrome, I’m very satisfied to see the results of our efforts, with the implementation now moving forward in other major web engines and shipping by default for both Ed25519 [1] and X25519 [2].
Finally, I want to remark on the work done to improve interoperability, which is a relevant debt this feature has carried for the last few years, to ensure applications can reliably use the Curve25519 features in any of the major browsers.
Context
Before analyzing the current status and blockers, it’s important to understand why browser support for this feature matters, why merging the WICG draft into the Web Cryptography API specification is key.
I’ve already written about this in my last post so I’m not going to elaborate too much, but it’s important to describe some of the advantages for the Web Platform users this API has over the current alternatives.
The Ed25519 algorithm for EdDSA signing and the X25519 function for key-agreement offer stronger security and better performance than other algorithms. For instance, the RSA keys are explicitly banned from new features like Web Transport. The smaller key size (32 bytes) and EdDSA signatures (64 bytes) provide advantages in terms of transmission rates, especially in distributed systems and peer-to-peer communications.
The lack of a browser API to use the Curve25519 algorithms has forced web authors to rely on external components, either JS libraries or WASM-compiled ones, which implies a security risk. This situation is especially sad considering that browsers already have support for these algorithms as part of their TLS 1.3 implementation; it’s just not exposed to web authors.
Web Platform Feature Development Timeline: Key Milestones
To get an idea of what the time-frame and effort required to develop a web feature like this looks like, let’s consider the following milestones:
Chrome 133 release shipped with X25519 enabled by default
This is a good example of a third-party actor not affiliated with a browser vendor investing time and money to change the priorities of the companies behind the main browsers and bring an important feature to the Web Platform. It’s been a large effort over 2 years, reaching agreement between 3 different browsers, the spec editor and contributors, and the W3C Web App Sec WG, which manages the Web Cryptography API specification.
It’s worth mentioning that a large part of the time has been invested in increasing the testing coverage in the Web Platform Tests’ WebCryptoAPI test suite and improving the interoperability between the three main web engines. This effort implies filing bugs in the corresponding browsers, discussing in the Secure Curves WICG about the best approach to address the interop issue and writing tests to ensure we don’t regress in the future.
Unfortunately, a few of these interop issues are the reason why the Ed25519 algorithm has not shipped when I expected, but I’ll elaborate more on this later in this post.
Current implementation status of the Curve25519 algorithms
The following table provides a high-level overview of the support of the Secure Curve25519 features in some of the main browsers:
If we want to take a broader look at the implementation status of these algorithms, there is a nice table in the issue #20 at the WICG repository:
Interoperability
As I commented before, most of the work done during this year was focused on improving the spec and increasing the test coverage by the WPT suite. The main goal of these efforts was improving the interoperability of the feature between the three main browsers. It’s interesting to compare the results with the data shown in my previous post.
Rejection of any invalid and small-order points (issue #27)
There are other minor disagreements regarding the Ed25519 specification, like the removal of the “alg” member in the JWK format in the import/export operations. There is a bug 40074061 in Chrome to implement this spec change, but apparently it has not been agreed on by Chrome, and now there is not enough support to proceed. Firefox already implements the specified behavior and there is a similar bug report in WebKit (bug 262613) where it seems there is support to implement the behavior change. However, I’d rather avoid introducing an interop issue and delay the implementation until there is more consensus regarding the spec.
The issue about the use of randomized EdDSA signatures comes from the fact that WebKit’s CryptoKit, the underlying cryptography component of the WebKit engine, follows that approach in its Ed25519 implementation. It has always been in the spirit of the Secure Curves spec to rely completely on the corresponding official RFCs. In the case of the Ed25519 algorithm that means RFC 8032, which states the deterministic nature of Ed25519 signatures. However, the CFRG is currently discussing the issue and there is a proposal to define an Ed25519-bis with randomized signatures.
The small-order issue is more complex, it seems. The spec states clearly that any invalid or small-order point should be rejected during the Verify operation. This behavior is based on the RFC 8032 mentioned before. Ideally, the underlying cryptography library should take care of performing these checks, and this has been the approach followed by the three main browsers in the implementation of the whole API; in the case of Chrome, this cryptography library is BoringSSL. The main problem here is that there are differences in how the cryptography libraries implement these checks, and BoringSSL is not an exception. The WPTs I have implemented to cover these cases also show interop issues between the three main engines. I’ve filed bug 697, but it was marked as low priority. The alternative would be to implement the additional checks in the WebCrypto implementation, but Chrome is not very positive about this approach.
The small-order checks have been a request from Mozilla since the initial standards-position request submitted a long time ago. This has been stated again in PR #362, and Apple expressed positive opinions about this as well, so I believe this is going to be the selected approach.
Conclusions
I believe that Secure Curves being added into the Web Cryptography specification is a great achievement for the Web Platform and brings a very powerful feature for web developers. This is especially true in the realm of a decentralized web where Content Addressing is a key concept, which is based on cryptographic hashes. Browsers exposing APIs to use Ed25519 and X25519 is going to offer a big advantage for decentralized web applications. I want to thank Daniel Huigens, editor of the Web Cryptography API specification, for his huge effort to address all the spec issues filed over the years, always driving discussions with the aim of reaching consensus-based resolutions, despite how frustrating the process of developing a feature like this until it ships in a browser can sometimes be.
The implementation in the three major browsers is a clear sign of the stability of these features and ensures they will be maintained properly in the future. This includes the effort to keep the three implementations interoperable, which is crucial for a healthy Web Platform ecosystem and, ultimately, for web authors. The fact that we couldn’t ship Ed25519 in Chrome is the only negative aspect of this year’s work, but I believe it will be resolved soon.
At Igalia we expect to continue working on the Web Cryptography API specification in 2025, at least until the blocker issues mentioned before are addressed and we can send the Intent-to-Ship request. We also hope to find opportunities to contribute new algorithms to the spec and to carry out maintenance work; ensuring an interoperable Web Platform has always been a priority for Igalia and an important factor when evaluating the projects we take on as a company.
If you're regularly rebuilding a large project like LLVM, you almost certainly
want to be using ccache. Incremental builds are
helpful, but it's quite common to be swapping between different commit IDs and
it's very handy to have the build complete relatively quickly without needing
any explicit thought on your side. Enabling ccache with LLVM's CMake build
system is trivial, but out of the box you will suffer from cache misses if
building llvm-project in different build directories, even for an identical
commit and identical build settings (and even for identical source directories,
i.e. without even considering separate checkouts or separate git work trees):
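(The terminal capture from the original post isn’t reproduced here. The following sketch, with directory names build-a and build-b and the compiler-launcher approach chosen by me, shows the kind of experiment that triggers the misses.)

cd llvm-project
for dir in build-a build-b; do
  cmake -G Ninja -S llvm -B "$dir" \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
done
ccache --zero-stats
ninja -C build-a
ninja -C build-b        # identical sources and settings, yet...
ccache --show-stats     # ...this reports cache misses for the second build

Alternatively, LLVM’s own -DLLVM_CCACHE_BUILD=ON CMake option wires up ccache for you.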
We can see that as LLVM generates header files, it has absolute directories
specified in -I within the build directory, which of course differs for
build a and build b above, causing a cache miss. Even if there was a
workaround for the generated headers, we'd still fail to get cache hits if
building from different llvm-project checkouts or worktrees.
Solution
Unsurprisingly, this is a common problem with ccache and it has good
documentation
on the solution. It advises:
Setting the base_dir ccache option to enable ccache's rewriting of
absolute to relative paths for any path with that prefix.
Setting the absolute_paths_in_stderr ccache option in order to rewrite
relative paths in stderr output to absolute (thus avoiding confusing error
messages).
I have to admit that when trialling this by forcing an error in a header
with # error "forced error", I'm not sure I saw a difference in Clang's
output when attempting to build LLVM.
If compiling with -g, use the -fdebug-prefix-map option.
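Put together, the relevant part of ccache.conf might look like this (the base_dir value is an assumption; pick a prefix that covers all your checkouts and build directories):

# ~/.config/ccache/ccache.conf
# Rewrite absolute paths under this prefix to relative paths in the hash input:
base_dir = /home/me
# Rewrite relative paths in compiler stderr back to absolute for readable diagnostics:
absolute_paths_in_stderr = true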
Rewriting paths to relative works in most cases, but you'll still experience
cache misses if the location of your build directory relative to the source
directory differs. This might happen if you compile directly in
build/ in one checkout, but in build/foo in another, or if compiling
outside of the llvm-project source tree altogether in one case, but within
it (e.g. in build/) in another.
This is normally pretty easy to avoid, but is worth being aware of. For
instance, I find it helpful on the LLVM buildbots I administer to be able to
rapidly reproduce a previous build using ccache, but the default source vs.
build directory layout used during CI is different from what I normally use in
day-to-day development.
Other helpful options
I was going to advertise inode_cache = true, but I see this has been enabled by
default since I last looked. Otherwise, file_clone = true
(docs) makes sense
for my case, where I'm on a filesystem with reflink support (XFS) and have
plenty of space.
WebKit has grown into a massive codebase throughout the years. To make developers’ lives easier, it offers various subsystems and integrations.
One such subsystem is the logging subsystem, which allows recording textual logs that describe the execution of the engine’s internal parts.
The logging subsystem in WebKit, as in any computer system, is usually used for both debugging and educational purposes. As WebKit is a widely-used piece of software that runs on
everything from desktop-class devices down to low-end embedded devices, it’s not uncommon for logging to be the only way to debug when various limiting
factors come into play. Such limiting factors don’t have to be technical - it may also be that the software runs on some restricted system where direct debugging is not allowed.
Requirements for efficient work with textual logs #
Regardless of the reasons why logging is used, once the set of logs is produced, one can work with it according to the particular need.
From my experience, efficient work with textual logs requires a tool with the following capabilities:
Ability to search for a particular substring or regular expression.
Ability to filter text lines according to a substring or regular expression.
Ability to highlight particular substrings.
Ability to mark certain lines for separate examination (with extra notes if possible).
Ability to save and restore the current state of work.
While all text editors should be able to provide requirement 1, requirements 2-5 are usually more tricky and text editors won’t support them out of the box.
Fortunately, any modern extensible text editor should be able to support requirements 2-5 after some extra configuration.
Throughout the following sections, I use Emacs, the classic “extensible, customizable, free/libre text editor”, to showcase how it can be set up and used to meet
the above criteria and to make work with logs a gentle experience.
Emacs, just like any other text editor, supports requirement 1 from the previous section out of the box.
To support requirement 2, it requires an extra mode. My recommendation for that is loccur - a minor mode
that acts just like the classic *nix grep utility, yet directly in the editor. The benefit of that mode (over e.g. occur)
is that it works in place. Therefore it’s very ergonomic and - as I’ll show later - it works well in conjunction with the bookmarking mode.
Installation of loccur is very simple and can be done from within the built-in package manager:
M-x package-install RET loccur RET
With loccur installed, one can immediately start using it by calling M-x loccur RET <regex> RET. The figure below depicts the example of filtering:
highlight-symbol - the package with utility functions for text highlighting #
To support requirement 3, Emacs also requires the installation of an extra module. In this case my recommendation is highlight-symbol,
a simple set of functions that enables basic highlighting of text fragments on the fly.
Installation of this module is also very simple and boils down to:
M-x package-install RET highlight-symbol RET
With the above, it’s very easy to get results like in the figure below:
just by moving the cursor around and using C-c h to toggle the highlight of the text at the current cursor position.
bm - the package with utility functions for buffer lines bookmarking #
Finally, to support requirements 4 and 5, Emacs requires one last extra package. This time my recommendation is bm,
quite a powerful set of utilities for text bookmarking.
In this case, installation is also very simple and is all about:
M-x package-install RET bm RET
In a nutshell, the bm package brings some visual capabilities like in the figure below:
as well as non-visual capabilities that will be discussed in further sections.
Once all the necessary modules are installed, it’s worth spending some time on configuration. With just a few simple tweaks it’s possible to make working with logs
simple and easily reproducible.
To avoid influencing other workflows, I recommend attaching as much configuration as possible to a single major mode and setting that mode as the default for
files with certain extensions. The configuration below uses the major mode called text-mode as the one for working with logs and associates all files with the
suffix .log with it. Moreover, the most critical commands of the modes installed in the previous sections are bound to key shortcuts. The last
thing is to enable truncating the lines ((set-default 'truncate-lines t)) and highlighting the line that the cursor is currently at ((hl-line-mode 1)).
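The post’s exact configuration isn’t reproduced here; a minimal sketch along these lines (the helper name my/log-mode-setup is mine, and the bindings follow the shortcuts referenced later in this post) achieves the described behaviour:

;; Treat *.log files as text-mode buffers used for log analysis.
(add-to-list 'auto-mode-alist '("\\.log\\'" . text-mode))

(defun my/log-mode-setup ()
  "Ad-hoc setup for working with log files."
  (set-default 'truncate-lines t)                  ;; don't wrap long log lines
  (hl-line-mode 1)                                 ;; highlight the current line
  ;; Key bindings for the packages installed above:
  (local-set-key (kbd "C-c h") #'highlight-symbol) ;; toggle highlight at point
  (local-set-key (kbd "C-c t") #'bm-toggle)        ;; mark/unmark the current line
  (local-set-key (kbd "C-c n") #'bm-previous)      ;; jump to the previous mark
  (local-set-key (kbd "C-c p") #'bm-next))         ;; jump to the next mark

(add-hook 'text-mode-hook #'my/log-mode-setup)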
To show what the Emacs workflow looks like with the above configuration and modules, some logs are required first. It’s very easy to
get logs out of WebKit, so I’ll additionally get some GStreamer logs as well. For that, I’ll build the WebKitGTK port from the latest revision of the WebKit repository.
To make the build process easier, I’ll use the WebKit container SDK.
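(The build command isn’t shown here; inside the container SDK, something along the following lines should do the job. The exact flags are my assumption, based on the description below.)

./Tools/Scripts/build-webkit --gtk --cmakeargs="-DENABLE_JOURNALD_LOG=OFF"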
The above command disables the ENABLE_JOURNALD_LOG build option so that logs are printed to stderr. This will result in the WebKit and GStreamer logs being bundled together as intended.
Once the build is ready, one can run any URL to get the logs. I’ve chosen the YouTube conformance tests suite from 2021 and selected test case “39. PlaybackRateChange”
to get some interesting entries from the multimedia-related subsystems:
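(The exact command isn’t reproduced here; it boils down to launching MiniBrowser with the relevant logging environment variables and redirecting stderr to a file, roughly as follows. The log channels, GStreamer debug level, and URL are placeholders of mine.)

WEBKIT_DEBUG=Media,Events \
GST_DEBUG=3 \
./Tools/Scripts/run-minibrowser --gtk "<URL of the test suite>" 2> log.log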
Once the logs are collected, one can open them using Emacs and start making sense out of them by gradually exploring the flow of execution. In the below exercise, I intend to understand
what happened from the multimedia perspective during the execution of the test case “39. PlaybackRateChange”.
The first step is usually to find the most critical lines that mark, more or less, the area in the file where the interesting things happen. In this case I propose using M-x loccur RET CONSOLE LOG RET to check what
console logs the application itself printed. Once some lines are filtered, one can use the bm-toggle command (C-c t) to mark some lines for later examination (highlighted in orange):
For practicing purposes, I propose exiting the filtered view with M-x loccur RET and trying again to see what events the browser was dispatching, e.g. using M-x loccur RET node 0x7535d70700b0 VIDEO RET:
In general, the combination of loccur and substring/regexp searches should be very convenient for quickly exploring various types of logs while marking lines for later. For very important log
lines, one can additionally use the bm-bookmark-annotate command to add extra notes.
Once some interesting log lines are marked, the most basic thing to do is to jump between them using bm-previous (C-c n) and bm-next (C-c p). However, the true power of bm mode comes with
the use of M-x bm-show RET to get the view containing only the lines marked with bm-toggle (originally highlighted orange):
This view is especially useful as it shows only the lines deliberately marked using bm-toggle and allows one to quickly jump to them in the original file. Moreover, the lines are displayed in
the order they appear in the original file, so it’s very easy to see the unified flow of the system and start making sense of the data presented. What’s even more interesting,
the view also contains the line numbers from the original file, as well as any manually added annotations. The line numbers are especially useful as they can be used for resuming work
after ending the Emacs session - which I’ll describe later in this section.
When the *bm-bookmarks* view is rendered, the only problem left is that the lines are hard to read as they are displayed using a single color. To overcome that problem one can use the macros from
the highlight-symbol package using the C-c h shortcut defined in the configuration. The result of highlighting some strings is depicted in the figure below:
With some colors added, it’s much easier to read the logs and focus on essential parts.
On some rare occasions it may be necessary to close the Emacs session even though the work with a certain log file is not done and needs to be resumed later. For that, a simple trick is to open the current
set of bookmarks with M-x bm-show RET and then save that buffer to a file. Personally, I just create a file with the same name as the log file but with a .bm suffix - so for log.log it’s log.log.bm.
Once the session is resumed, it is enough to open the log.log and log.log.bm files side by side and create a simple ad-hoc macro that uses the line numbers from log.log.bm to mark the corresponding lines again in the log.log
file:
As shown in the gif above, within a few seconds all the marks are applied in the buffer with the log.log file and the work can resume from that point, i.e. one can jump around using bm, add new marks, etc.
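If an ad-hoc keyboard macro feels too fragile, a small helper function can do the same job. This is purely my sketch (not from the post), and the regexp assumes each saved line from the bm-show buffer starts with its line number, so it may need adjusting:

(require 'bm)

(defun my/bm-restore-from-listing (listing-file)
  "Re-apply bm bookmarks in the current buffer from LISTING-FILE.
LISTING-FILE is a saved *bm-bookmarks* buffer; every leading line
number found in it gets bookmarked again here."
  (interactive "fSaved bm listing: ")
  (let ((target (current-buffer))
        (lines '()))
    ;; Collect the line numbers from the saved listing.
    (with-temp-buffer
      (insert-file-contents listing-file)
      (goto-char (point-min))
      (while (re-search-forward "^\\s-*\\([0-9]+\\)" nil t)
        (push (string-to-number (match-string 1)) lines)))
    ;; Toggle a bookmark on each of those lines in the log buffer.
    (dolist (line (nreverse lines))
      (with-current-buffer target
        (save-excursion
          (goto-char (point-min))
          (forward-line (1- line))
          (bm-toggle))))))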
Although the above approach may not be ideal for everybody, I find it fairly ergonomic and smooth, and it covers all the requirements I identified earlier.
I’m certain that editors other than Emacs can be set up to allow the same or very similar flow, yet any particular configurations are left for the reader to explore.
Fighting professional discrimination in the IT sector.
Every year, I supervise the “Web standards implementation” internship, which consists of modifying browsers (Chromium, Firefox, Safari…) in order to improve their support for Web technologies (HTML, CSS, DOM…). In particular, it involves studying the corresponding specifications and writing conformance tests. Note that this is not a Web development internship but a C++ development one.
I invite you to read My internship with Igalia1 by my colleague Delan Azabani for a concrete example. In recent years, in addition to communicating via instant messaging, I have been setting up weekly video calls, which have proven quite effective in helping interns make progress.
I started LSF (French Sign Language) classes a few months ago and have attended several IVT shows, notably « Parle plus fort ! », which humorously describes the difficulties Deaf people face at work. This year, I am considering taking on a Deaf intern in order to contribute to a better inclusion of Deaf people in the workplace. I think it will also be a positive experience for my company and for myself.
Desired profile:
Computer science student at bachelor’s/master’s level.
Living in the Paris region (to make supervision easier).
Able to read/write English (and to communicate in LSF).
Interested in Web technologies.
Knowledge of C/C++ development.
If you are interested, applications can be submitted here until 4 April 2025.
Igalia’s “Coding Experience” program does not necessarily correspond to an internship in the French sense of the term. If you would like it to be a “stage conventionné”, mention it when applying and we can find a solution.
I recently led a small training session at Igalia where I proposed to find mistakes in five small testharness.js tests I wrote.
These mistakes are based on actual issues I found in official web platform tests, or on mistakes I made myself in the past while writing tests, so I believe they would be useful to know.
The feedback from my teammates was quite positive, with very good participation and many ideas.
They suggested I write a blog post about it, so here it is.
Please read the tests carefully and try to find the mistakes before looking at the proposed fixes…
1. Multiple tests in one loop
We often need to perform identical assertions for a set of similar objects.
A good practice is to split such checks into multiple test() calls, so that it’s easier to figure out which of the objects are causing failures.
Below, I’m testing the reflected autoplay attribute on the <audio> and <video> elements.
What small mistake did I make?
<!DOCTYPE html>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script>
  ["audio", "video"].forEach(tagName => {
    test(function() {
      let element = document.createElement(tagName);
      assert_equals(element.autoplay, false, "initial value");
      element.setAttribute("autoplay", "autoplay");
      assert_equals(element.autoplay, true, "after setting attribute");
      element.removeAttribute("autoplay");
      assert_equals(element.autoplay, false, "after removing attribute");
    }, "Basic test for HTMLMediaElement.autoplay.");
  });
</script>
Proposed fix
Each loop iteration creates one test, but they all have the name "Basic test for HTMLMediaElement.autoplay.".
Because this name identifies the test in various places (e.g. failure expectations), it must be unique to be useful.
These tests will even cause a “Harness status: Error” with the message “duplicate test name”.
One way to solve that is to move the loop iteration into the test(), which will fix the error but won’t help you with fine-grained failure reports.
We can instead use a different description for each iteration:
assert_equals(element.autoplay, true, "after setting attribute");
element.removeAttribute("autoplay");
assert_equals(element.autoplay, false, "after removing attribute");
- }, "Basic test for HTMLMediaElement.autoplay.");
+ }, `Basic test for HTMLMediaElement.autoplay (${tagName} element).`);
});
</script>
2. Cleanup between tests
Sometimes, it is convenient to reuse objects (e.g. DOM elements) for several test() calls, and some cleanup may be necessary.
For instance, in the following test, I’m checking that setting the class attribute via setAttribute() or setAttributeNS() is properly reflected on the className property.
However, I must clear the className at the end of the first test(), so that we can really catch the failure in the second test() if, for example, setAttributeNS() does not modify the className because of an implementation bug.
What’s wrong with this approach?
<!DOCTYPE html>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<div id="element"></div>
<script>
  test(function() {
    element.setAttribute("class", "myClass");
    assert_equals(element.className, "myClass");
    element.className = "";
  }, "Setting the class attribute via setAttribute().");
  test(function() {
    element.setAttributeNS(null, "class", "myClass");
    assert_equals(element.className, "myClass");
    element.className = "";
  }, "Setting the class attribute via setAttributeNS().");
</script>
Proposed fix
In general, it is difficult to guarantee that a final cleanup is executed.
In this particular case, for example, if the assert_equals() fails because of a bad browser implementation, then an exception is thrown and the rest of the function is not executed.
- test(function() {
+ function resetClassName() { element.className = ""; }
+
+ test(function(t) {
+ t.add_cleanup(resetClassName);
element.setAttribute("class", "myClass");
assert_equals(element.className, "myClass");
- element.className = "";
}, "Setting the class attribute via setAttribute().");
- test(function() {
+ test(function(t) {
+ t.add_cleanup(resetClassName);
element.setAttributeNS(null, "class", "myClass");
assert_equals(element.className, "myClass");
- element.className = "";
}, "Setting the class attribute via setAttributeNS().");
3. Checking whether an exception is thrown
Another very frequent test pattern involves checking whether a Web API throws an exception.
Here, I’m trying to use DOMParser.parseFromString() to parse a small MathML document.
The HTML spec says that it should throw a TypeError if one specifies the MathML MIME type.
The second test() asserts that the rest of the try branch is not executed and that the correct exception type is found in the catch branch.
Is this approach correct?
Can the test be rewritten in a better way?
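The snippet itself isn’t reproduced here, but the pattern being described has roughly this shape (a reconstruction on my part, reduced to the TypeError check):

<!DOCTYPE html>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script>
  test(function() {
    try {
      new DOMParser().parseFromString("<math></math>", "application/mathml+xml");
      assert_unreached("parseFromString() should have thrown");
    } catch (e) {
      assert_true(e instanceof TypeError, "expected a TypeError");
    }
  }, "parseFromString() rejects the MathML MIME type.");
</script>

Proposed fix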
If the assert_unreached() is executed because of an implementation bug with parseFromString(), then the assertion will actually throw an exception.
That exception won’t be a TypeError, so the test will still fail because of the assert_true(), but the failure report will look a bit confusing.
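As for rewriting it in a better way: testharness.js provides assert_throws_js() for exactly this pattern, so (my suggestion; not necessarily the exact rewrite from the original post) the whole try/catch can collapse into:

  test(function() {
    assert_throws_js(TypeError, () => {
      new DOMParser().parseFromString("<math></math>", "application/mathml+xml");
    }, "the MathML MIME type is rejected");
  }, "parseFromString() rejects the MathML MIME type.");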
The following test verifies a very basic feature: clicking a button triggers the registered event listener.
We use the (asynchronous) testdriver API test_driver.click() to emulate that user click, and a promise_test() call to wait for the click event listener to be called.
The test may time out if there is something wrong in the browser implementation, but do you see a risk for flaky failures?
Note: the testdriver API only works when running tests automatically.
If you run the test manually, you need to click the button yourself.
<!DOCTYPE html>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script src="/resources/testdriver.js"></script>
<script src="/resources/testdriver-vendor.js"></script>
<button id="button">Click me to run manually</button>
<script>
  promise_test(function() {
    test_driver.click(button);
    return new Promise(resolve => {
      button.addEventListener("click", resolve);
    });
  }, "Clicking the button triggers registered click event handler.");
</script>
Proposed fix
The problem I wanted to show here is that we are sending the click event before actually registering the listener.
The test would likely still work, because test_driver.click() is asynchronous and communication with the test automation scripts is slow, whereas registering the event listener is synchronous.
But rather than making this kind of assumption, which poses a risk of flaky failures as well as making the test hard to read, I prefer to just move the statement that triggers the event into the Promise, after the listener registration:
<button id="button">Click me to run manually</button>
<script>
promise_test(function() {
- test_driver.click(button);
return new Promise(resolve => {
button.addEventListener("click", resolve);
+ test_driver.click(button);
});
}, "Clicking the button triggers registered click event handler.");
</script>
My colleagues also pointed out that if the promise returned by test_driver.click() fails, then a “Harness status: Error” could actually be reported with “Unhandled rejection”.
We can add a catch to handle this case:
<button id="button">Click me to run manually</button>
<script>
promise_test(function() {
- return new Promise(resolve => {
+ return new Promise((resolve, reject) => {
button.addEventListener("click", resolve);
- test_driver.click(button);
+ test_driver.click(button).catch(reject);
});
}, "Clicking the button triggers registered click event handler.");
</script>
5. Dealing with asynchronous resources
It’s very common to deal with asynchronous resources in web platform tests.
The following test case verifies the behavior of a frame with lazy loading: it is initially outside the viewport (so not loaded) and then scrolled into the viewport (which should trigger its load).
The actual loading of the frame is tested via the window name of /common/window-name-setter.html (should be “spices”).
Again, this test may time out if there is something wrong in the browser implementation, but can you see a way to make the test a bit more robust?
Side question: the <div id="log"></div> and add_cleanup() are not really necessary for this test to work, so what’s the point of using them?
Can you think of one?
<!DOCTYPE html>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<style>
  #lazyframe { margin-top: 10000px; }
</style>
<div id="log"></div>
<iframe id="lazyframe" loading="lazy" src="/common/window-name-setter.html"></iframe>
<script>
  promise_test(function() {
    return new Promise(resolve => {
      window.addEventListener("load", () => {
        assert_not_equals(lazyframe.contentWindow.name, "spices");
        resolve();
      });
    });
  }, "lazy frame not loaded after page load");
  promise_test(t => {
    t.add_cleanup(_ => window.scrollTo(0, 0));
    return new Promise(resolve => {
      lazyframe.addEventListener('load', () => {
        assert_equals(lazyframe.contentWindow.name, "spices");
        resolve();
      });
      lazyframe.scrollIntoView();
    });
  }, "lazy frame loaded after appearing in the viewport");
</script>
Proposed fix
This is similar to what we discussed in the previous tests.
If the assert_equals() in the listener fails, then an exception is thrown, but it won’t be caught by the testharness.js framework.
A “Harness status: Error” is reported, but the test will only complete after the timeout.
This can slow down test execution, especially if this pattern is repeated for several tests.
To make sure we report the failure immediately in that case, we can instead reject the promise if the equality does not hold, or even better, place the assert_equals() check after the promise resolution:
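(The corresponding diff isn’t reproduced here; the following is my sketch of that second variant, with the assert_equals() moved after the promise resolution.)

  promise_test(t => {
    t.add_cleanup(_ => window.scrollTo(0, 0));
    return new Promise(resolve => {
      lazyframe.addEventListener('load', resolve);
      lazyframe.scrollIntoView();
    }).then(() => {
      // Any failure here rejects the promise, so testharness.js reports it immediately.
      assert_equals(lazyframe.contentWindow.name, "spices");
    });
  }, "lazy frame loaded after appearing in the viewport");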
Regarding the side question: if you run the test by opening the page in the browser, then the report will be appended at the bottom of the page by default.
But lazyframe has a very large height, and the page may be scrolled to some other location. An explicit <div id="log"> ensures the report is inserted inside that div at the top of the page, while the add_cleanup() ensures that we scroll back to that location after test execution.
Recently, I have been working on an issue in Electron which required bisecting to find the exact version of Electron where the regression happened.
Quick research did not reveal any guides, but as my search progressed, I found one interesting commit - feat: add Bisect helper.
Electron Fiddle
Fiddle is an Electron playground that allows developers to experiment with Electron APIs. It has a quick startup template, which you can change as you wish.
You can save a fiddle locally or as a GitHub Gist, which can be shared with anyone by just entering the Gist URL in the address bar.
Moreover, you can choose which Electron version you wish to use - from stable to nightly releases.
Electron Releases
You can run a fiddle using any version of Electron you wish - stable, beta, or nightly. You can even run a fiddle with obsolete versions, which is super handy when comparing behaviour
between different versions.
The option to choose the Electron version can be found in the top-left corner of the Fiddle window.
Once pressed, you can use the filter to choose any Electron version you wish.
However, you may not find beta or nightly versions in the filter. For that, go to Settings (a gear icon on the left of the filter), then Electron, and select the desired channels.
Now, you can access all the available Electron versions and try any of them on the fly.
I hope this small guide helps you to triage your Electron problems :)))
Hey there! I’m glad to finally start paying my blogging debt :) as this
is something I’ve been planning to do for quite some time now. To get the
ball rolling, I’ve shared some bits about me in my very first blog post
Olá Mundo.
In this article, I’m going to walk through what we’ve been working on
since last year in the Chromium Ozone/Wayland project, in which I’ve
been involved (directly or indirectly) since I joined Igalia back in
2018.
Igalia is arranging the twelfth annual Web Engines Hackfest, which will be held on Monday 2nd June through Wednesday 4th June.
As usual, this is a hybrid event, at Palexco in A Coruña (Galicia, Spain) as well as remotely.
Registration is now open:
Submit your talks and breakout sessions. The deadline to submit proposals is Wednesday 30th April.
The Web Engines Hackfest is an event where folks working on various parts of the web platform gather for a few days to share knowledge and discuss a variety of topics.
These topics include web standards, browser engines, JavaScript engines, and all the related technology around them.
Last year, we had eight talks (watch them on YouTube) and 15 breakout sessions (read them on GitHub).
A wide range of experts with a diverse set of profiles and skills attend each year, so if you are working on the web platform, this event is a great opportunity to chat with people that are both developing the standards and working on the implementations.
We’re really grateful for all the people that regularly join us for this event; you are the ones that make this event something useful and interesting for everyone! 🙏
Really enjoying Web Engines Hackfest by @igalia once again. Recommended for everyone interested in web technology.
The breakout sessions are probably the most interesting part of the event.
Many different topics are discussed there, from high-level issues like how to build a new feature for the web platform, to lower-level efforts like the new WebKit SDK container.
Together with the hallway discussions and impromptu meetings, the Web Engines Hackfest is an invaluable experience.
Big shout-out to Igalia for organising the Web Engines Hackfest every year since 2014, as well as the original WebKitGTK+ Hackfest starting in 2009.
The event has grown and we’re now close to 100 onsite participants with representation from all major browser vendors.
If your organization is interested in helping make this event possible, please contact us regarding our sponsorship options.