Planet Igalia

March 23, 2023

Eric Meyer

Peerless Whisper

What happened was, I was hanging out in an online chatter channel when a little birdy named Bruce chirped about OpenAI’s Whisper and how he was using it to transcribe audio.  And I thought, Hey, I have audio that needs to be transcribed.  Brucie Bird also mentioned it would output text, SRT, and WebVTT formats, and I thought, Hey, I have videos I’ll need to upload with transcription to YouTube!  And then he said you could run it from the command line, and I thought, Hey, I have a command line!

So off I went to install it and try it out, and immediately ran smack into some hurdles I thought I’d document here in case someone else has similar problems.  All of this took place on my M2 MacBook Pro, though I believe most of the below should be relevant to anyone trying to do this at the command line.

The first thing I did was what the GitHub repository’s README recommended, which is:

$ pip install -U openai-whisper

That failed because I didn’t have pip installed.  Okay, fair enough.  I figured out how to install that, setting up an alias of python for python3 along the way, and then tried again.  This time, the install started and then bombed out:

Collecting openai-whisper
  Using cached openai-whisper-20230314.tar.gz (792 kB)
  Installing build dependencies ...  done
  Getting requirements to build wheel ...  done
  Preparing metadata (pyproject.toml) ...  done
Collecting numba
  Using cached numba-0.56.4.tar.gz (2.4 MB)
  Preparing metadata ( ...  error
  error: subprocess-exited-with-error

…followed by some stack trace stuff, none of which was really useful until ten or so lines down, where I found:

RuntimeError: Cannot install on Python version 3.11.2; only versions >=3.7,<3.11 are supported.

In other words, the version of Python I have installed is too modern to run AI.  What a world.

I DuckDucked around a bit and hit upon pyenv, which is I guess a way of installing and running older versions of Python without having to overwrite whatever version(s) you already have.  I’ll skip over the error part of my trial-and-error process and give you the commands that made it all work:

$ brew install pyenv

$ pyenv install 3.10

$ PATH="$HOME/.pyenv/shims:${PATH}"

$ pyenv local 3.10

$ pip install -U openai-whisper

That got Whisper to install.  It didn’t take very long.

At that point, I wondered what I’d have to configure to transcribe something, and the answer turned out to be precisely zilch.  Once the install was done, I dropped into the directory containing my MP4 video, and typed this:

$ whisper wpe-mse-eme-v2.mp4

Here’s what I got back.  I’ve marked the very few errors.

[00:00.000 --> 00:07.000]  In this video, we'll show you several demos showcasing multi-media capabilities in WPE WebKit,
[00:07.000 --> 00:11.000]  the official port of the WebKit engine for embedded devices.
[00:11.000 --> 00:18.000]  Each of these demos are running on the low-powered Raspberry Pi 3 seen in the lower right-hand side of the screen here.
[00:18.000 --> 00:25.000]  Infotainment systems and media players often need to consume digital rights-managed videos.
[00:25.000 --> 00:32.000]  They tell me, is Michael coming out?  Affirmative, Mike's coming out.
[00:32.000 --> 00:45.000]  Here you can see just that, smooth streaming playback using encrypted media extensions, or EME, with PlayReady 4.
[00:45.000 --> 00:52.000]  Media source extensions, or MSE, are used by many players for greater control over playback.
[00:52.000 --> 01:00.000]  YouTube TV has a whole conformance test suite for this, which WPE has been passing since 2021.
[01:00.000 --> 01:09.000]  The loan exceptions here are those tests requiring hardware support not available on the Raspberry Pi 4, but available for other platforms.
[01:09.000 --> 01:16.000]  YouTube TV has a conformance test for EME, which WPE WebKit passes with flying colors.
[01:22.000 --> 01:40.000]  Music
[01:40.000 --> 01:45.000]  Finally, perhaps most impressively, we can put all these things together.
[01:45.000 --> 01:56.000]  Here is the dash.js player using MSE, running in a page, and using Widevine DRM to decrypt and play rights-managed video with EME all fluidly.
[01:56.000 --> 02:04.000]  Music
[02:04.000 --> 02:09.000]  Remember, all of this is being played back on the same low-powered Raspberry Pi 3.
[02:27.000 --> 02:34.000]  For more about WPE WebKit, please visit WPE
[02:34.000 --> 02:42.000]  For more information about EGALIA, or to find out how we can help with your embedded device needs, please visit us at  

I am, frankly, astonished.  This has no business being as accurate as it is, for all kinds of reasons.  There’s a lot of jargon and very specific terminology in there, and Whisper nailed pretty much every last bit of it, first time in, no special configuration, nothing.  I didn’t even bump up the model size from the default of small.  I felt a little like that Frollo guy in the animated Hunchback of Notre Dame meme yelling about sorcery or whatever.

True, the output isn’t absolutely perfect.  Let’s review the glitches in reverse order.  The last two errors, turning “Igalia” into “EGALIA”, seem fair enough given that I didn’t specify there would be languages other than English involved.  I routinely have to spell it for my fellow Americans, so no reason to think a codebase could do any better.

The space inserted into “WPEWebKit” (which happens throughout) is similarly understandable.  I’m impressed it understood “WebKit” at all, never mind that it was properly capitalized and not-spaced.

The place where it says Music and I marked it as an error: This is essentially an echoing countdown and then a white-noise roar from rocket engines.  There’s a “music today is just noise” joke in here somewhere, but I’m too hip to find it.

Whisper turning “lone” into “loan” doesn’t particularly faze me, given the difficulty of handling soundalike words.  Hell, just yesterday, I was scribing a conference call and mistakenly recorded “gamut” as “gamma”, and those aren’t even technically homophones.  They just sound like they are.

Rounding out the glitch tour, “Hey” got turned into “They”, which (given the audio quality of that particular part of the video) is still pretty good.

There is one other error I couldn’t mark because there’s nothing to mark, but if you scrutinize the timeline, you’ll see a gap from 02:09.000 to 02:27.000.  In there, a short clip from a movie plays, and there’s a brief dialogue between two characters in not-very-Dutch-accented English there.  It’s definitely louder and clearer than the 00:25.000 –> 00:32.000 bit, so I’m not sure why Whisper just skipped over it.  Manually transcribing that part isn’t a big deal, but it’s odd to see it perform so flawlessly on every other piece of speech and then drop this completely on the floor.
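As an aside: if you ever want to massage plain-text output like the above into SRT cues yourself, rather than re-running Whisper with its SRT output format, the timestamp arithmetic is simple.  Here’s a quick Python sketch (the segment tuples and function names are mine, purely for illustration; Whisper’s own writers do all this for you):

```python
# Toy converter from (start_seconds, end_seconds, text) segments to SRT.
# This only demonstrates the format; it is not Whisper's actual code.

def srt_timestamp(seconds: float) -> str:
    """Format seconds in the HH:MM:SS,mmm style SRT expects."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Number each cue and join them with blank lines, as SRT requires."""
    cues = []
    for i, (start, end, text) in enumerate(segments, 1):
        cues.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)

print(srt_timestamp(7.0))  # 00:00:07,000
```

WebVTT is nearly the same, the main differences being a leading WEBVTT header line and a period instead of a comma before the milliseconds.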

Before posting, I decided to give Whisper another go, this time on a different video:

$ whisper wpe-gamepad-support-v3.mp4

This was the result, with the one actual error marked:

[00:00.000 --> 00:13.760]  In this video, we demonstrate WPE WebKit's support for the W3C's GamePad API.
[00:13.760 --> 00:20.080]  Here we're running WPE WebKit on a Raspberry Pi 4, but any device that will run WPE WebKit
[00:20.080 --> 00:22.960]  can benefit from this support.
[00:22.960 --> 00:28.560]  The GamePad API provides a JavaScript interface that makes it possible for developers to access
[00:28.560 --> 00:35.600]  and respond to signals from GamePads and other game controllers in a simple, consistent way.
[00:35.600 --> 00:40.320]  Having connected a standard Xbox controller, we boot up the Raspberry Pi with a customized
[00:40.320 --> 00:43.040]  build route image.
[00:43.040 --> 00:48.560]  Once the device is booted, we run cog, which is a small, single window launcher made specifically
[00:48.560 --> 00:51.080]  for WPE WebKit.
[00:51.080 --> 00:57.360]  The window cog creates can be full screen, which is what we're doing here.
[00:57.360 --> 01:01.800]  The game is loaded from a website that hosts a version of the classic video arcade game
[01:01.800 --> 01:05.480]  Asteroids.
[01:05.480 --> 01:11.240]  Once the game has loaded, the Xbox controller is used to start the game and control the spaceship.
[01:11.240 --> 01:17.040]  All the GamePad inputs are handled by the JavaScript GamePad API.
[01:17.040 --> 01:22.560]  This GamePad support is now possible thanks to work done by Igalia in 2022 and is available
[01:22.560 --> 01:27.160]  to anyone who uses WPE WebKit on their embedded device.
[01:27.160 --> 01:32.000]  For more about WPE WebKit, please visit
[01:32.000 --> 01:35.840]  For more information about Igalia, or to find out how we can help with your embedded device
[01:35.840 --> 01:39.000]  needs, please visit us at  

That should have been “buildroot”.  Again, an entirely reasonable error.  I’ve made at least an order of magnitude more typos writing this post than Whisper has in transcribing these videos.  And this time, it got the spelling of Igalia correct.  I didn’t make any changes between the two runs.  It just… figured it out.

I don’t have a lot to say about this other than, wow.  Just WOW.  This is some real Clarke’s Third Law stuff right here, and the technovertigo is Marianas deep.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at March 23, 2023 02:00 PM

March 20, 2023

Danylo Piliaiev

Command stream editing as an effective method to debug driver issues

  1. How the tool is used

In previous posts, “Graphics Flight Recorder - unknown but handy tool to debug GPU hangs” and “Debugging Unrecoverable GPU Hangs”, I demonstrated a few tricks for identifying the location of a GPU fault.

But what’s the next step once you’ve roughly pinpointed the issue? What if the problem is only sporadically reproducible and the only way to ensure consistent results is by replaying a trace of raw GPU commands? How can you precisely determine the cause and find a proper fix?

Sometimes, you may have an inkling of what’s causing the problem, and then you can simply modify the driver’s code to see if it resolves the issue. However, there are instances where the root cause remains elusive, or where you only want to change a specific value without affecting the same register before and after it.

The optimal approach in these situations is to directly modify the commands sent to the GPU. The ability to arbitrarily edit the command stream was always an obvious idea and has crossed my mind numerous times (and not only mine – proprietary driver developers seem to employ similar techniques). Finally, the stars aligned: my frustration with a recent bug, the kernel’s new support for user-space-defined GPU addresses for buffer objects, the tool I wrote to replay command stream traces not so long ago, and the realization that implementing a command stream editor was not as complicated as initially thought.

The end result is a tool for Adreno GPUs (with msm kernel driver) to decompile, edit, and compile back command streams: “freedreno,turnip: Add tooling to edit command streams and use them in ‘replay’”.

The primary advantage of this command stream editing tool lies in the ability to rapidly iterate over hypotheses. Another highly valuable feature (which I have plans for) would be the automatic bisection of the command stream, which would be particularly beneficial in instances where only the bug reporter has the necessary hardware to reproduce the issue at hand.

How the tool is used

# Decompile one command stream from the trace
./rddecompiler -s 0 gpu_trace.rd > generate_rd.c

# Compile the executable which would output the command stream
meson setup . build
ninja -C build

# Override the command stream with the commands from the generator
./replay gpu_trace.rd --override=0 --generator=./build/generate_rd
Reading dEQP-VK.renderpass.suballocation.formats.r5g6b5_unorm_pack16.clear.clear.rd...
gpuid: 660
Uploading iova 0x100000000 size = 0x82000
Uploading iova 0x100089000 size = 0x4000
cmdstream 0: 207 dwords
generating cmdstream './generate_rd --vastart=21441282048 --vasize=33554432 gpu_trace.rd'
Uploading iova 0x4fff00000 size = 0x1d4
override cmdstream: 117 dwords
skipped cmdstream 1: 248 dwords
skipped cmdstream 2: 223 dwords

The decompiled code isn’t pretty:

/* pkt4: GRAS_SC_SCREEN_SCISSOR[0].TL = { X = 0 | Y = 0 } */
pkt4(cs, REG_A6XX_GRAS_SC_SCREEN_SCISSOR_TL(0), (2), 0);
/* pkt4: GRAS_SC_SCREEN_SCISSOR[0].BR = { X = 32767 | Y = 32767 } */
pkt(cs, 2147450879);
/* pkt4: VFD_INDEX_OFFSET = 0 */
pkt4(cs, REG_A6XX_VFD_INDEX_OFFSET, (2), 0);
pkt(cs, 0);
/* pkt4: SP_FS_OUTPUT[0].REG = { REGID = r0.x } */
pkt4(cs, REG_A6XX_SP_FS_OUTPUT_REG(0), (1), 0);
pkt4(cs, REG_A6XX_SP_TP_RAS_MSAA_CNTL, (2), 2);
pkt(cs, 2);
pkt4(cs, REG_A6XX_GRAS_RAS_MSAA_CNTL, (2), 2);

Shader assembly is editable:

const char *source = R"(
  shps #l37
  getone #l37
  cov.u32f32 r1.w, c504.z
  cov.u32f32 r2.x, c504.w
  cov.u32f32 r1.y, c504.x
)";

upload_shader(&ctx, 0x100200d80, source);
emit_shader_iova(&ctx, cs, 0x100200d80);

However, not everything is currently editable, such as descriptors. Despite these limitations, the existing functionality is sufficient for the majority of cases.

by Danylo Piliaiev at March 20, 2023 11:00 PM

Andy Wingo

a world to win: webassembly for the rest of us

Good day, comrades!

Today I'd like to share the good news that WebAssembly is finally coming for the rest of us weirdos.

A world to win

WebAssembly for the rest of us

17 Mar 2023 – BOB 2023

Andy Wingo

Igalia, S.L.

This is a transcript-alike of a talk that I gave last week at BOB 2023, a gathering in Berlin of people that are using "technologies beyond the mainstream" to get things done: Haskell, Clojure, Elixir, and so on. PDF slides here, and I'll link the video too when it becomes available.

WebAssembly, the story

WebAssembly is an exciting new universal compute platform

WebAssembly: what even is it? Not a programming language that you would write software in, but rather a compilation target: a sort of assembly language, if you will.

WebAssembly, the pitch

Predictable portable performance

  • Low-level
  • Within 10% of native

Reliable composition via isolation

  • Modules share nothing by default
  • No nasal demons
  • Memory sandboxing

Compile your code to WebAssembly for easier distribution and composition

If you look at what the characteristics of WebAssembly are as an abstract machine, to me there are two main areas in which it is an advance over the alternatives.

Firstly it's "close to the metal" -- if you compile for example an image-processing library to WebAssembly and run it, you'll get similar performance when compared to compiling it to x86-64 or ARMv8 or what have you. (For image processing in particular, native still generally wins because the SIMD primitives in WebAssembly are more narrow and because getting the image into and out of WebAssembly may imply a copy, but the general point remains.) WebAssembly's instruction set covers a broad range of low-level operations that allows compilers to produce efficient code.

The novelty here is that WebAssembly is both portable while also being successful. We language weirdos know that it's not enough to do something technically better: you have to also succeed in getting traction for your alternative.

The second interesting characteristic is that WebAssembly is (generally speaking) a principle-of-least-authority architecture: a WebAssembly module starts with access to nothing but itself. Any capabilities that an instance of a module has must be explicitly shared with it by the host at instantiation-time. This is unlike DLLs which have access to all of main memory, or JavaScript libraries which can mutate global objects. This characteristic allows WebAssembly modules to be reliably composed into larger systems.

WebAssembly, the hype

It’s in all browsers! Serve your code to anyone in the world!

It’s on the edge! Run code from your web site close to your users!

Compose a library (eg: Expat) into your program (eg: Firefox), without risk!

It’s the new lightweight virtualization: Wasm is what containers were to VMs! Give me that Kubernetes cash!!!

Again, the remarkable thing about WebAssembly is that it is succeeding! It's on all of your phones, all your desktop web browsers, all of the content distribution networks, and in some cases it seems set to replace containers in the cloud. Launch the rocket emojis!

WebAssembly, the reality

WebAssembly is a weird backend for a C compiler

Only some source languages are having success on WebAssembly

What about Haskell, Ocaml, Scheme, F#, and so on – what about us?

Are we just lazy? (Well...)

So why aren't we there? Where is Clojure-on-WebAssembly? Where are the F#, the Elixir, the Haskell compilers? Some early efforts exist, but they aren't really succeeding. Why is that? Are we just not putting in the effort? Why is it that Rust gets to ride on the rocket ship but Scheme does not?

WebAssembly, the reality (2)

WebAssembly (1.0, 2.0) is not well-suited to garbage-collected languages

Let’s look into why

As it turns out, there is a reason that there is no good Scheme implementation on WebAssembly: the initial version of WebAssembly is a terrible target if your language relies on the presence of a garbage collector. There have been some advances but this observation still applies to the current standardized and deployed versions of WebAssembly. To better understand this issue, let's dig into the guts of the system to see what the limitations are.

GC and WebAssembly 1.0

Where do garbage-collected values live?

For WebAssembly 1.0, only possible answer: linear memory

  (global $hp (mut i32) (i32.const 0))
  (memory $mem 10) ;; 640 kB

The primitive that WebAssembly 1.0 gives you to represent your data is what is called linear memory: just a buffer of bytes that you can read and write. It's pretty much like what you get when compiling natively, except that the memory layout is simpler. You can obtain this memory in units of 64-kilobyte pages. In the example above we're going to request 10 pages, for 640 kB. Should be enough, right? We'll just use it all for the garbage collector, with a bump-pointer allocator. The heap pointer / allocation pointer is kept in the mutable global variable $hp.

(func $alloc (param $size i32) (result i32)
  (local $ret i32)
  (loop $retry
    (local.set $ret (global.get $hp))
    (global.set $hp
      (i32.add (local.get $size) (local.get $ret)))

    (br_if 1
      (i32.lt_u (i32.shr_u (global.get $hp) 16)
                (memory.size))
      (local.get $ret))

    (call $gc)
    (br $retry)))

Here's what an allocation function might look like. The allocation function $alloc is like malloc: it takes a number of bytes and returns a pointer. In WebAssembly, a pointer to memory is just an offset, which is a 32-bit integer (i32). (Having the option of a 64-bit address space is planned but not yet standard.)

If this is your first time seeing the text representation of a WebAssembly function, you're in for a treat, but that's not the point of the presentation :) What I'd like to focus on is the (call $gc) -- what happens when the allocation pointer reaches the end of the region?
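To make the control flow concrete, here is the same bump-pointer scheme as a Python toy (my own sketch, not anything a compiler emits; the gc method just bails instead of collecting, and the shift by 16 mirrors the 64-kilobyte page arithmetic):

```python
PAGE_SIZE = 64 * 1024  # WebAssembly memory is sized in 64 kB pages

class Heap:
    def __init__(self, pages: int):
        self.pages = pages  # like (memory $mem 10)
        self.hp = 0         # like (global $hp), the bump/allocation pointer

    def gc(self):
        # A real collector would trace roots and free or compact memory;
        # this stub only marks where the (call $gc) sits in the control flow.
        raise MemoryError("heap exhausted; a real $gc would collect here")

    def alloc(self, size: int) -> int:
        # Like $alloc: bump the pointer, check it against the page count,
        # and fall into gc() when the region is exhausted.
        while True:
            ret = self.hp
            self.hp = ret + size
            if self.hp >> 16 < self.pages:  # still within our pages?
                return ret
            self.gc()  # then retry the loop

heap = Heap(10)          # 10 pages = 640 kB, as in the slide
first = heap.alloc(16)   # offset 0
second = heap.alloc(16)  # offset 16: consecutive offsets into linear memory
```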

GC and WebAssembly 1.0 (2)

What hides behind (call $gc) ?

Ship a GC over linear memory

Stop-the-world, not parallel, not concurrent

But... roots.

The first thing to note is that you have to provide the $gc yourself. Of course, this is doable -- this is what we do when compiling to a native target.

Unfortunately though the multithreading support in WebAssembly is somewhat underpowered; it lets you share memory and use atomic operations but you have to create the threads outside WebAssembly. In practice probably the GC that you ship will not take advantage of threads and so it will be rather primitive, deferring all collection work to a stop-the-world phase.

GC and WebAssembly 1.0 (3)

Live objects are

  • the roots
  • any object referenced by a live object

Roots are globals and locals in active stack frames

No way to visit active stack frames

What's worse though is that you have no access to roots on the stack. A GC has to keep live objects, as defined circularly as any object referenced by a root, or any object referenced by a live object. It starts with the roots: global variables and any GC-managed object referenced by an active stack frame.

But there we run into problems, because in WebAssembly (any version, not just 1.0) you can't iterate over the stack, so you can't find active stack frames, so you can't find the stack roots. (Sometimes people want to support this as a low-level capability but generally speaking the consensus would appear to be that overall performance will be better if the engine is the one that is responsible for implementing the GC; but that is foreshadowing!)

GC and WebAssembly 1.0 (3)

Workarounds

  • handle stack for precise roots
  • spill all possibly-pointer values to linear memory and collect conservatively

Handle book-keeping a drag for compiled code

Given the noniterability of the stack, there are basically two work-arounds. One is to have the compiler and run-time maintain an explicit stack of object roots, which the garbage collector can know for sure are pointers. This is nice because it lets you move objects. But, maintaining the stack is overhead; the state of the art solution is rather to create a side table (a "stack map") associating each potential point at which GC can be called with instructions on how to find the roots.

The other workaround is to spill the whole stack to memory. Or, possibly just pointer-like values; anyway, you conservatively scan all words for things that might be roots. But instead of having access to the memory to which the WebAssembly implementation would spill your stack, you have to do it yourself. This can be OK but it's sub-optimal; see my recent post on the Whippet garbage collector for a deeper discussion of the implications of conservative root-finding.
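As a toy illustration of the first workaround, here is what an explicit handle stack looks like in Python (every name here is invented; a real toolchain bakes this bookkeeping into the generated code):

```python
# The compiled code pushes each live GC-managed reference onto an explicit
# stack; the collector then scans exactly these slots as precise roots.

handle_stack = []  # the "shadow stack" of roots

def push_root(ref):
    handle_stack.append(ref)
    return ref

def pop_roots(n: int):
    del handle_stack[len(handle_stack) - n:]

def gc_roots():
    # The collector never walks the real call stack; it only sees
    # the references the compiler explicitly registered.
    return list(handle_stack)

def make_pair(car, cdr):
    # A compiled function must root its pointer arguments across any
    # operation that might allocate (and therefore trigger GC).
    push_root(car)
    push_root(cdr)
    pair = ("pair", car, cdr)  # stand-in for an allocation
    pop_roots(2)
    return pair

pair = make_pair("a", "b")
```

The cost the slide calls “a drag” is visible here: every call that might allocate pays for the pushes and pops, which is why production systems prefer stack maps.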

GC and WebAssembly 1.0 (4)

Cycles with external objects (e.g. JavaScript) uncollectable

A pointer to a GC-managed object is an offset to linear memory, need capability over linear memory to read/write object from outside world

No way to give back memory to the OS

Gut check: gut says no

If that were all, it would already be not so great, but it gets worse! Another problem with linear-memory GC is that it limits the potential for composing a number of modules and the host together, because the garbage collector that manages JavaScript objects in a web browser knows nothing about your garbage collector over your linear memory. You can easily create memory leaks in a system like that.

Also, it's pretty gross that a reference to an object in linear memory requires arbitrary read-write access over all of linear memory in order to read or write object fields. How do you build a reliable system without invariants?

Finally, once you collect garbage, and maybe you manage to compact memory, you can't give anything back to the OS. There are proposals in the works but they are not there yet.

If the BOB audience had to choose between Worse is Better and The Right Thing, I think the BOB audience is much closer to the Right Thing. People like that feel instinctual revulsion to ugly systems and I think GC over linear memory describes an ugly system.

GC and WebAssembly 1.0 (5)

There is already a high-performance concurrent parallel compacting GC in the browser

Halftime: C++ N – Altlangs 0

The kicker is that WebAssembly 1.0 requires you to write and deliver a terrible GC when there is already probably a great GC just sitting there in the host, one that has hundreds of person-years of effort invested in it, one that will surely do a better job than you could ever do. WebAssembly as hosted in a web browser should have access to the browser's garbage collector!

I have the feeling that while those of us with a soft spot for languages with garbage collection have been standing on the sidelines, Rust and C++ people have been busy on the playing field scoring goals. Tripping over the ball, yes, but eventually they do manage to get within striking distance.

Change is coming!

Support for built-in GC set to ship in Q4 2023

With GC, the material conditions are now in place

Let’s compile our languages to WebAssembly

But to continue the sportsball metaphor, I think in the second half our players will finally be able to get out on the pitch and give it the proverbial 110%. Support for garbage collection is coming to WebAssembly users, and I think even by the end of the year it will be shipping in major browsers. This is going to be big! We have a chance and we need to seize it.

Scheme to Wasm

Spritely + Igalia working on Scheme to WebAssembly

Avoid truncating language to platform; bring whole self

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Even with GC, though, WebAssembly is still a weird machine. It would help to see the concrete approaches that some languages of interest manage to take when compiling to WebAssembly.

In that spirit, the rest of this article/presentation is a walkthrough of the approach that I am taking as I work on a WebAssembly compiler for Scheme. (Thanks to Spritely for supporting this work!)

Before diving in, a meta-note: when you go to compile a language to, say, JavaScript, you are mightily tempted to cut corners. For example you might implement numbers as JavaScript numbers, or you might omit implementing continuations. In this work I am trying to not cut corners, and instead to implement the language faithfully. Sometimes this means I have to work around weirdness in WebAssembly, and that's OK.

When thinking about Scheme, I'd like to highlight a few specific areas that have interesting translations. We'll start with value representation, which stays in the GC theme from the introduction.

Scheme to Wasm: Values

;;       any  extern  func
;;        |
;;        eq
;;     /  |   \
;; i31 struct  array

The unitype: (ref eq)

Immediate values in (ref i31)

  • fixnums with 30-bit range
  • chars, bools, etc

Explicit nullability: (ref null eq) vs (ref eq)

The GC extensions for WebAssembly are phrased in terms of a type system. Oddly, there are three top types; as far as I understand it, this is the result of a compromise about how WebAssembly engines might want to represent these different kinds of values. For example, an opaque JavaScript value flowing into a WebAssembly program would have type (ref extern). On a system with NaN boxing, you would need 64 bits to represent a JS value. On the other hand a native WebAssembly object would be a subtype of (ref any), and might be representable in 32 bits, either because it's a 32-bit system or because of pointer compression.

Anyway, three top types. The user can define subtypes of struct and array, instantiate values of those types, and access their fields. The life cycle of reference-typed objects is automatically managed by the run-time, which is just another way of saying they are garbage-collected.

For Scheme, we need a common supertype for all values: the unitype, in Bob Harper's memorable formulation. We can use (ref any), but actually we'll use (ref eq) -- this is the supertype of values that can be compared by (pointer) identity. So now we can code up eq?:

(func $eq? (param $a (ref eq)) (param $b (ref eq))
           (result i32)
  (ref.eq (local.get $a) (local.get $b)))

Generally speaking in a Scheme implementation there are immediates and heap objects. Immediates can be encoded in the bits of a value, whereas for heap objects the bits of a value encode a reference (pointer) to an object on the garbage-collected heap. We usually represent small integers as immediates, as well as booleans and other oddball values.

Happily, WebAssembly gives us an immediate value type, i31. We'll encode our immediates there, and otherwise represent heap objects as instances of struct subtypes.
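To see what “encoding immediates in the bits” means concretely, here is an illustrative Python model of a 31-bit immediate payload with one bit reserved for tagging, which is what leaves fixnums their 30-bit range (the bit layout is invented for the example; implementations pick their own):

```python
# Pack a small integer into a 31-bit payload, as one might with (ref i31).
# One bit is spent on tagging, so fixnums keep a 30-bit signed range.

FIXNUM_MIN, FIXNUM_MAX = -(1 << 29), (1 << 29) - 1

def encode_fixnum(n: int) -> int:
    assert FIXNUM_MIN <= n <= FIXNUM_MAX, "out of fixnum range"
    return (n << 1) & 0x7FFF_FFFF  # keep only the low 31 bits

def decode_fixnum(i: int) -> int:
    # Shift out the tag bit, re-extending the sign from bit 30.
    magnitude = i >> 1
    return magnitude - (1 << 30) if i & 0x4000_0000 else magnitude
```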

Scheme to Wasm: Values (2)

Heap objects subtypes of struct; concretely:

(type $heap-object
  (struct (field $tag-and-hash i32)))
(type $pair
  (sub $heap-object
    (struct i32 (ref eq) (ref eq))))

GC proposal allows subtyping on structs, functions, arrays

Structural type equivalence: explicit tag useful

We actually need to have a common struct supertype as well, for two reasons. One is that we need to be able to hash Scheme values by identity, but for this we need an embedded lazily-initialized hash code. It's a bit annoying to take the per-object memory hit but it's a reality, and the JVM does it this way, so it must not be so terrible.

The other reason is more subtle: WebAssembly's type system is built in such a way that types that are "structurally" equivalent are indistinguishable. So a pair has two fields, besides the hash, but there might be a number of other fundamental object types that have the same shape; you can't fully rely on WebAssembly's dynamic type checks (ref.test et al) to be able to query the type of a value. Instead we re-use the low bits of the hash word to include a type tag, which might be 1 for pairs, 2 for vectors, 3 for closures, and so on.
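A small Python model of that tag-and-hash word, with the tag in the low eight bits so that a 0xff mask recovers it (the tag values and field widths here are illustrative):

```python
PAIR_TAG, VECTOR_TAG, CLOSURE_TAG = 1, 2, 3  # as in the running example
TAG_MASK = 0xFF  # the low bits a type check masks off

def make_tag_and_hash(tag: int, hash_code: int) -> int:
    # Lazily-computed identity hash in the high bits, type tag in the low bits.
    return (hash_code << 8) | tag

def type_tag(word: int) -> int:
    return word & TAG_MASK

def identity_hash(word: int) -> int:
    return word >> 8

word = make_tag_and_hash(PAIR_TAG, 0xCAFE)
```

This is exactly the check the later $car example performs in wat: mask the first struct field with 0xff and compare against the pair tag.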

Scheme to Wasm: Values (3)

(func $cons (param (ref eq)
                   (ref eq))
            (result (ref $pair))
  (struct.new_canon $pair
    ;; Assume heap tag for pairs is 1.
    (i32.const 1)
    ;; Car and cdr.
    (local.get 0)
    (local.get 1)))

(func $%car (param (ref $pair))
            (result (ref eq))
  (struct.get $pair 1 (local.get 0)))

With this knowledge we can define cons, as a simple call to struct.new_canon pair.

I didn't have time for this in the talk, but there is a ghost haunting this code: the ghost of nominal typing. See, in a web browser at least, every heap object will have its first word point to its "hidden class" / "structure" / "map" word. If the engine ever needs to check that a value is of a specific shape, it can do a quick check on the map word's value; if it needs to do deeper introspection, it can dereference that word to get more details.

Under the hood, testing whether a (ref eq) is a pair or not should be a simple check that it's a (ref struct) (and not a fixnum), and then a comparison of its map word to the run-time type corresponding to $pair. If subtyping of $pair is allowed, we start to want inline caches to handle polymorphism, but checking the map word is still the basic mechanism.

However, as I mentioned, we only have structural equality of types; two (struct (ref eq)) type definitions will define the same type and have the same map word (run-time type / RTT). Hence the _canon in the name of struct.new_canon $pair: we create an instance of $pair, with the canonical run-time-type for objects having $pair-shape.

In earlier drafts of the WebAssembly GC extensions, users could define their own RTTs, which effectively amounts to nominal typing: not only does this object have the right structure, but was it created with respect to this particular RTT. But, this facility was cut from the first release, and it left ghosts in the form of these _canon suffixes on type constructor instructions.

For the Scheme-to-WebAssembly effort, we effectively add back in a degree of nominal typing via type tags. For better or for worse this results in a so-called "open-world" system: you can instantiate a separately-compiled WebAssembly module that happens to define the same types and use the same type tags and it will be able to happily access the contents of Scheme values from another module. If you were to use nominal types, you wouldn't be able to do so, unless there were some common base module that defined and exported the types of interest, and which any extension module would need to import.

(func $car (param (ref eq)) (result (ref eq))
  (local (ref $pair))
  (block $not-pair
    (br_if $not-pair
      (i32.eqz (ref.test $pair (local.get 0))))
    (local.set 1 (ref.cast $pair (local.get 0)))
    (br_if $not-pair
      (i32.ne
        (i32.const 1)
        (i32.and
          (i32.const 0xff)
          (struct.get $heap-object 0 (local.get 1)))))
    (return_call $%car (local.get 1)))
  (call $type-error)
  (unreachable))

In the previous example we had $%car, with a funny % in the name, taking a (ref $pair) as an argument. But in the general case (barring compiler heroics) car will take an instance of the unitype (ref eq). To know that it's actually a pair we have to make two checks: one, that it is a struct and has the $pair shape, and two, that it has the right tag. Oh well!
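
As a rough model (my own Python sketch; the tag value and field layout are invented), the two checks the prose describes look like:

```python
# Model of $car's two checks on a unityped argument: first the struct
# shape (the map word), then the type tag stored in the header field.
PAIR_TAG = 1          # hypothetical tag value for pairs
PAIR_RTT = object()   # canonical run-time type for the $pair shape

class HeapObject:
    def __init__(self, rtt, header):
        self.rtt = rtt        # map word
        self.header = header  # hash + tag word, tag in the low 8 bits

class Pair(HeapObject):
    def __init__(self, car, cdr):
        super().__init__(PAIR_RTT, PAIR_TAG)
        self.car = car
        self.cdr = cdr

def car(x):
    # Check 1: is it a struct with the $pair shape (like ref.test $pair)?
    if not (isinstance(x, HeapObject) and x.rtt is PAIR_RTT):
        raise TypeError("not a pair")
    # Check 2: does the header carry the pair tag?
    if (x.header & 0xff) != PAIR_TAG:
        raise TypeError("not a pair")
    return x.car

assert car(Pair("hey", "bob")) == "hey"
```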

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

But with all of that I think we have a solid story on how to represent values. I went through all of the basic value types in Guile and checked that they could all be represented using GC types, and it seems that all is good. Now on to the next point: varargs.

Scheme to Wasm: Varargs

(list 'hey)      ;; => (hey)
(list 'hey 'bob) ;; => (hey bob)

Problem: Wasm functions strongly typed

(func $list (param ???) (result (ref eq))

Solution: Virtualize calling convention

In WebAssembly, you define functions with a type, and it is impossible to call them in an unsound way. You must call $car with exactly one argument or it will not compile, and that argument has to be of a specific type, and so on. But Scheme doesn't enforce these restrictions on the language level, bless its little miscreant heart. You can call car with 5 arguments, and you'll get a run-time error. There are some functions that can take a variable number of arguments, doing different things depending on incoming argument count.

How do we square these two approaches to function types?

;; "Registers" for args 0 to 3
(global $arg0 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg1 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg2 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg3 (mut (ref eq)) (i31.new (i32.const 0)))

;; "Memory" for the rest
(type $argv (array (ref eq)))
(global $argN (ref $argv)
        (array.new_canon $argv (i32.const 42)
                         (i31.new (i32.const 0))))

Uniform function type: argument count as sole parameter

Callee moves args to locals, possibly clearing roots

The approach we are taking is to virtualize the calling convention. In the same way that when calling an x86-64 function, you pass the first argument in $rdi, then $rsi, and eventually if you run out of registers you put arguments in memory, in the same way we'll pass the first argument in the $arg0 global, then $arg1, and eventually in memory if needed. The function will receive the number of incoming arguments as its sole parameter; in fact, all functions will be of type (func (param i32)).

The expectation is that after checking the argument count, the callee will load its arguments from the globals / memory into locals, which the compiler can do a better job of optimizing than globals. We might not even emit code to null out the argument globals; that might leak a little memory, but it would probably be a win.
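
The scheme can be sketched in Python (an illustration of mine, not the compiler's actual output): four argument globals, an overflow array, and callees that receive only an argument count.

```python
# Virtualized calling convention: argument "registers" plus overflow
# "memory"; every function takes only nargs, like (func (param i32)).
arg0 = arg1 = arg2 = arg3 = None
argN = [None] * 42          # overflow space for arguments 4 and up

def set_args(*args):
    """Caller side: spill arguments to the globals before the call."""
    global arg0, arg1, arg2, arg3
    regs = (list(args) + [None] * 4)[:4]
    arg0, arg1, arg2, arg3 = regs
    for i, a in enumerate(args[4:]):
        argN[i] = a

def scheme_list(nargs):
    """Callee side: check nargs, then move arguments into locals."""
    locals_ = [arg0, arg1, arg2, arg3][:nargs]
    locals_ += argN[:max(0, nargs - 4)]
    return locals_           # (list ...) just returns its arguments

set_args('hey', 'bob')
assert scheme_list(2) == ['hey', 'bob']
```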

You can imagine a world in which $arg0 actually gets globally allocated to $rdi, because it is only live during the call sequence; but I don't think that world is this one :)

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Great, two points out of the way! Next up, tail calls.

Scheme to Wasm: Tail calls

;; Call known function
(return_call $f arg ...)

;; Call function by value
(return_call_ref $type callee arg ...)

Friends -- I almost cried making this slide. We Schemers are used to working around the lack of tail calls, and I could have done so here, but it's just such a relief that these functions are just going to be there and I don't have to think much more about them. Technically speaking the proposal isn't merged yet; checking the phases document it's at the last station before heading to the great depot in the sky. But, soon soon it will be present and enabled in all WebAssembly implementations, and we should build systems now that rely on it.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Next up, my favorite favorite topic: delimited continuations.

Scheme to Wasm: Prompts (1)

Problem: Lightweight threads/fibers, exceptions

Possible solutions

  • Eventually, built-in coroutines
  • binaryen’s asyncify (not yet ready for GC); see Julia
  • Delimited continuations

“Bring your whole self”

Before diving in though, one might wonder why bother. Delimited continuations are a building-block that one can use to build other, more useful things, notably exceptions and light-weight threading / fibers. Could there be another way of achieving these end goals without having to implement this relatively uncommon primitive?

For fibers, it is possible to implement them in terms of a built-in coroutine facility. The standards body seems willing to include a coroutine primitive, but it seems far off to me; not within the next 3-4 years I would say. So let's put that to one side.

There is a more near-term solution, to use asyncify to implement coroutines somehow; but my understanding is that asyncify is not ready for GC yet.

For the Guile flavor of Scheme at least, delimited continuations are table stakes in their own right, so given that we will have them on WebAssembly, we might as well use them to implement fibers and exceptions in the same way as we do on native targets. Why compromise if you don't have to?

Scheme to Wasm: Prompts (2)

Prompts delimit continuations

(define k
  (call-with-prompt 'foo
    ; body
    (lambda ()
      (+ 34 (abort-to-prompt 'foo)))
    ; handler
    (lambda (continuation) continuation)))

(k 10)       ;; ⇒ 44
(- (k 10) 2) ;; ⇒ 42

k is the _ in (lambda () (+ 34 _))

There are a few ways to implement delimited continuations, but my usual way of thinking about them is that a delimited continuation is a slice of the stack. One end of the slice is the prompt established by call-with-prompt, and the other by the continuation of the call to abort-to-prompt. Capturing a slice pops it off the stack, copying it out to the heap as a callable function. Calling that function splats the captured slice back on the stack and resumes it where it left off.
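
Here is a small Python sketch of that intuition (my own model: it uses an explicit list of frames for the "stack" and a Python exception to unwind, rather than real stack copying). It reproduces the slide's example, where k is (lambda (v) (+ 34 v)):

```python
# Delimited continuations over an explicit stack: capture pops a slice
# of pending frames into a callable; calling it replays the slice.
stack = []                       # pending frames, innermost last

class Abort(Exception):
    def __init__(self, tag):
        self.tag = tag

def abort_to_prompt(tag):
    raise Abort(tag)             # unwind to the matching prompt

def call_with_prompt(tag, body, handler):
    base = len(stack)            # the prompt delimits the stack here
    try:
        return body()
    except Abort as a:
        if a.tag != tag:
            raise                # not our prompt; keep unwinding
        slice_ = stack[base:]    # pop the delimited slice off the stack...
        del stack[base:]
        def k(value):            # ...reified as a callable continuation
            for frame in reversed(slice_):   # innermost frame first
                value = frame(value)
            return value
        return handler(k)

def body():
    stack.append(lambda v: 34 + v)     # the frame for (+ 34 _)
    return abort_to_prompt('foo')

k = call_with_prompt('foo', body, lambda k: k)
assert k(10) == 44
assert k(10) - 2 == 42               # the captured slice can be replayed
```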

Scheme to Wasm: Prompts (3)

Delimited continuations are stack slices

Make stack explicit via minimal continuation-passing-style conversion

  • Turn all calls into tail calls
  • Allocate return continuations on explicit stack
  • Breaks functions into pieces at non-tail calls

This low-level intuition of what a delimited continuation is leads naturally to an implementation; the only problem is that we can't slice the WebAssembly call stack. The workaround here is similar to the varargs case: we virtualize the stack.

The mechanism to do so is a continuation-passing-style (CPS) transformation of each function. Functions that make no calls, such as leaf functions, don't need to change at all. The same goes for functions that make only tail calls. For functions that make non-tail calls, we split them into pieces that preserve the only-tail-calls property.

Scheme to Wasm: Prompts (4)

Before a non-tail-call:

  • Push live-out vars on stacks (one stack per top type)
  • Push continuation as funcref
  • Tail-call callee

Return from call via pop and tail call:

(return_call_ref (call $pop-return)
                 (i32.const 0))

After return, continuation pops state from stacks

Consider a simple function:

(define (f x y)
  (+ x (g y)))

Before making a non-tail call, a "tailified" function will instead push all live data onto an explicitly-managed stack and tail-call the callee. It also pushes on the return continuation. Returning from the callee pops the return continuation and tail-calls it. The return continuation pops the previously-saved live data and continues.

In this concrete case, tailification would split f into two pieces:

(define (f x y)
  (push! x)
  (push-return! f-return-continuation-0)
  (g y))

(define (f-return-continuation-0 g-of-y)
  (define k (pop-return!))
  (define x (pop!))
  (k (+ x g-of-y)))

Now there are no non-tail calls, besides calls to run-time routines like push! and + and so on. This transformation is implemented by tailify.scm.
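
The same split can be run directly in Python (my rendering of the Scheme sketch above; note Python does not eliminate tail calls, so this only models the data flow):

```python
# Tailified f, with explicit data and return-continuation stacks.
data_stack = []      # push!/pop!: live variables across non-tail calls
return_stack = []    # push-return!/pop-return!: pending continuations

def g(y):
    # a leaf callee; "returning" means tail-calling the popped continuation
    return return_stack.pop()(y * 2)     # arbitrary body for the sketch

def f(x, y):
    data_stack.append(x)                 # push! live-out x
    return_stack.append(f_return_continuation_0)
    return g(y)                          # the (only) tail call

def f_return_continuation_0(g_of_y):
    k = return_stack.pop()               # pop-return!
    x = data_stack.pop()                 # pop!
    return k(x + g_of_y)

return_stack.append(lambda v: v)         # bottom frame: return to caller
assert f(5, 3) == 11                     # g(3) = 6, then 5 + 6
```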

Scheme to Wasm: Prompts (5)

Aborting to a prompt:

  • Pop stack slice to reified continuation object
  • Tail-call new top of stack: prompt handler

Calling a reified continuation:

  • Push stack slice
  • Tail-call new top of stack

No need to wait for effect handlers proposal; you can have it all now!

The salient point is that the stacks on which push! operates (in reality, probably four or five: one in linear memory or an array for types like i32 or f64, one for each of the managed top types any, extern, and func, and one for the stack of return continuations) are managed by us, so we can slice them.

Someone asked in the talk about whether the explicit memory traffic and avoiding the return-address-buffer branch prediction is a source of inefficiency in the transformation and I have to say, yes, but I don't know by how much. I guess we'll find out soon.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Okeydokes, last point!

Scheme to Wasm: Numbers

Numbers can be immediate: fixnums

Or on the heap: bignums, fractions, flonums, complex

Supertype is still ref eq

Consider imports to implement bignums

  • On web: BigInt
  • On edge: Wasm support module (mini-gmp?)

Dynamic dispatch for polymorphic ops, as usual

First, I would note that sometimes the compiler can unbox numeric operations. For example if it infers that a result will be an inexact real, it can use unboxed f64 values instead of library routines working on heap flonums (a (struct i32 f64); the initial i32 is for the hash and tag). But we still need a story for the general case that involves dynamic type checks.

The basic idea is that we get to have fixnums and heap numbers. Fixnums will handle most of the integer arithmetic that we need, and will avoid allocation. We'll inline most fixnum operations as a fast path and call out to library routines otherwise. Of course fixnum inputs may produce a bignum output as well, so the fast path sometimes includes another slow-path callout.
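
A Python sketch of that fast path / slow path split (mine; the tag width and helper names are invented):

```python
# Fixnum fast path with slow-path callouts. Plain ints stand in for
# fixnums; Bignum stands in for a heap number.
FIXNUM_BITS = 30
FIXNUM_MAX = (1 << (FIXNUM_BITS - 1)) - 1
FIXNUM_MIN = -(1 << (FIXNUM_BITS - 1))

class Bignum:
    def __init__(self, value):
        self.value = value

def add(a, b):
    # Fast path: both operands are fixnums...
    if isinstance(a, int) and isinstance(b, int):
        s = a + b
        if FIXNUM_MIN <= s <= FIXNUM_MAX:
            return s
        return Bignum(s)      # ...but the sum may still overflow to a bignum
    return slow_add(a, b)     # library routine for heap numbers

def slow_add(a, b):
    av = a.value if isinstance(a, Bignum) else a
    bv = b.value if isinstance(b, Bignum) else b
    s = av + bv
    return s if FIXNUM_MIN <= s <= FIXNUM_MAX else Bignum(s)

assert add(2, 3) == 5
assert isinstance(add(FIXNUM_MAX, 1), Bignum)
```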

We want to minimize binary module size. In an ideal compile-to-WebAssembly situation, a small program will have a small module size, down to a minimum of a kilobyte or so; larger programs can be megabytes, if the user experience allows for the download delay. Binary module size will be dominated by code, so that means we need to plan for aggressive dead-code elimination, minimize the size of fast paths, and also minimize the size of the standard library.

For numbers, we try to keep module size down by leaning on the platform. In the case of bignums, we can punt some of this work to the host; on a JavaScript host, we would use BigInt, and on a WASI host we'd compile an external bignum library. So that's the general story: inlined fixnum fast paths with dynamic checks, and otherwise library routine callouts, combined with aggressive whole-program dead-code elimination.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Hey I think we did it! Always before when I thought about compiling Scheme or Guile to the web, I got stuck on some point or another, was tempted down the corner-cutting alleys, and eventually gave up before starting. But finally it would seem that the stars are aligned: we get to have our Scheme and run it too.


Debugging: The wild west of DWARF; prompts

Strings: stringref host strings spark joy

JS interop: Export accessors; Wasm objects opaque to JS. externref.

JIT: A whole ’nother talk!

AOT: wasm2c

Of course, like I said, WebAssembly is still a weird machine: as a compilation target but also at run-time. Debugging is a right proper mess; perhaps some other article on that some time.

How to represent strings is a surprisingly gnarly question; there is tension within the WebAssembly standards community between those that think that it's possible for JavaScript and WebAssembly to share an underlying string representation, and those that think that it's a fool's errand and that copying is the only way to go. I don't know which side will prevail; perhaps more on that as well later on.

Similarly the whole interoperation with JavaScript question is very much in its early stages, with the current situation choosing to err on the side of nothing rather than the wrong thing. You can pass a WebAssembly (ref eq) to JavaScript, but JavaScript can't do anything with it: it has no prototype. The state of the art is to also ship a JS run-time that wraps each wasm object, proxying exported functions from the wasm module as object methods.

Finally, some language implementations really need JIT support, like PyPy. There, that's a whole 'nother talk!

WebAssembly for the rest of us

With GC, WebAssembly is now ready for us

Getting our languages on WebAssembly now a S.M.O.P.

Let’s score some goals in the second half!


WebAssembly has proven to have some great wins for C, C++, Rust, and so on -- but now it's our turn to get in the game. GC is coming and we as a community need to be getting our compilers and language run-times ready. Let's put on the coffee and bang some bytes together; it's still early days and there's a world to win out there for the language community with the best WebAssembly experience. The game is afoot: happy consing!

by Andy Wingo at March 20, 2023 09:06 AM

March 18, 2023

Alex Bradbury

What's new for RISC-V in LLVM 16

LLVM 16.0.0 was just released today, and as I did for LLVM 15, I wanted to highlight some of the RISC-V specific changes and improvements. This is very much a tour of a chosen subset of additions rather than an attempt to be exhaustive. If you're interested in RISC-V, you may also want to check out my recent attempt to enumerate the commercially available RISC-V SoCs and if you want to find out what's going on in LLVM as a whole on a week-by-week basis, then I've got the perfect newsletter for you.

In case you're not familiar with LLVM's release schedule, it's worth noting that there are two major LLVM releases a year (i.e. one roughly every 6 months) and these are timed releases as opposed to being cut when a pre-agreed set of feature targets have been met. We're very fortunate to benefit from an active and growing set of contributors working on RISC-V support in LLVM projects, who are responsible for the work I describe below - thank you! I coordinate biweekly sync-up calls for RISC-V LLVM contributors, so if you're working in this area please consider dropping in.


Documentation

LLVM 16 is the first release featuring a user guide for the RISC-V target (16.0.0 version, current HEAD). This fills a long-standing gap in our documentation, whereby it was difficult to tell at a glance the expected level of support for the various RISC-V instruction set extensions (standard, vendor-specific, and experimental extensions of either type) in a given LLVM release. We've tried to keep it concise but informative, and add a brief note to describe any known limitations that end users should know about. Thanks again to Philip Reames for kicking this off, and the reviewers and contributors for ensuring it's kept up to date.

Vectorisation

LLVM 16 was a big release for vectorisation. As well as a long-running strand of work making incremental improvements (e.g. better cost modelling) and fixes, scalable vectorization was enabled by default. This allows LLVM's loop vectorizer to use scalable vectors when profitable. Follow-on work enabled support for loop vectorization using fixed length vectors and disabled vectorization of epilogue loops. See the talk optimizing code for scalable vector architectures (slides) by Sander de Smalen for more information about scalable vectorization in LLVM and introduction to the RISC-V vector extension by Roger Ferrer Ibáñez for an overview of the vector extension and some of its codegen challenges.

The RISC-V vector intrinsics supported by Clang have changed (to match e.g. this and this) during the 16.x development process in a backwards incompatible way, as the RISC-V Vector Extension Intrinsics specification evolves towards a v1.0. In retrospect, it would have been better to keep the intrinsics behind an experimental flag when the vector codegen and MC layer (assembler/disassembler) support became stable, and this is something we'll be more careful of for future extensions. The good news is that thanks to Yueh-Ting Chen, headers are available that provide the old-style intrinsics mapped to the new version.

Support for new instruction set extensions

I refer to 'experimental' support many times below. See the documentation on experimental extensions within RISC-V LLVM for guidance on what that means. One point to highlight is that the extensions remain experimental until they are ratified, which is why some extensions on the list below are 'experimental' despite the fact the LLVM support needed is trivial. On to the list of newly added instruction set extensions:

  • Experimental support for the Zca, Zcf, and Zcd instruction set extensions. These are all 16-bit instructions and are being defined as part of the output of the RISC-V code size reduction working group.
    • Zca is just a subset of the standard 'C' compressed instruction set extension but without floating point loads/stores.
    • Zcf is also a subset of the standard 'C' compressed instruction set extension, including just the single precision floating point loads and stores (c.flw, c.flwsp, c.fsw, c.fswsp).
    • Zcd, as you might have guessed, just includes the double precision floating point loads and stores from the standard 'C' compressed instruction set extension (c.fld, c.fldsp, c.fsd, c.fsdsp).
  • Experimental assembler/disassembler support for the Zihintntl instruction set extension. This provides a small set of instructions that can be used to hint that the memory accesses of the following instruction exhibit poor temporal locality.
  • Experimental assembler/disassembler support for the Zawrs instruction set extension, providing a pair of instructions meant for use in a polling loop allowing a core to enter a low-power state and wait on a store to a memory location.
  • Experimental support for the Ztso extension, which for now just means setting the appropriate ELF header flag. If a core implements Ztso, it implements the Total Store Ordering memory consistency model. Future releases will provide alternate lowerings of atomics operations that take advantage of this.
  • Code generation support for the Zfhmin extension (load/store, conversion, and GPR/FPR move support for 16-bit floating point values).
  • Codegen and assembler/disassembler support for the XVentanaCondOps vendor extension, which provides conditional arithmetic and move/select operations.
  • Codegen and assembler/disassembler support for the XTHeadVdot vendor extension, which implements vector integer four 8-bit multiply and 32-bit add.

LLDB

LLDB has started to become usable for RISC-V in this period due to work by contributor 'Emmer'. As they summarise here, LLDB should be usable for debugging RV64 programs locally but support is lacking for remote debug (e.g. via the gdb server protocol). During the LLVM 16 development window, LLDB gained support for software single stepping on RISC-V, support in EmulateInstructionRISCV for RV{32,64}I, as well as extensions A and M, C, RV32F and RV64F, and D.

Short forward branch optimisation

Another improvement that's fun to look more closely at is support for "short forward branch optimisation" for the SiFive 7 series cores. What does this mean? Well, let's start by looking at the problem it's trying to solve. The base RISC-V ISA doesn't include conditional moves or predicated instructions, which can be a downside if your code features unpredictable short forward branches (with the ensuing cost in terms of branch mispredictions and bloating branch predictor state). The ISA spec includes commentary on this decision (page 23), noting some disadvantages of adding such instructions to the specification and noting microarchitectural techniques exist to convert short forward branches into predicated code internally. In the case of the SiFive 7 series, this is achieved using macro-op fusion where a branch over a single ALU instruction is fused and executed as a single conditional instruction.

In the LLVM 16 cycle, compiler optimisations targeting this microarchitectural feature were enabled for conditional move style sequences (i.e. branch over a register move) as well as for other ALU operations. The job of the compiler here is of course to emit a sequence compatible with the micro-architectural optimisation when possible and profitable. I'm not aware of other RISC-V designs implementing a similar optimisation - although there are developments in terms of instructions to support such operations directly in the ISA which would avoid the need for such microarchitectural tricks. See XVentanaCondOps, XTheadCondMov, the previously proposed but now abandoned Zbt extension (part of the earlier bitmanip spec) and more recently the proposed Zicond (integer conditional operations) standard extension.

Atomics

It's perhaps not surprising that code generation for atomics can be tricky to understand, and the LLVM documentation on atomics codegen and libcalls is actually one of the best references on the topic I've found. A particularly important note in that document is that if a backend supports any inline lock-free atomic operations at a given size, all operations of that size must be supported in a lock-free manner. If targeting a RISC-V CPU without the atomics extension, all atomics operations would usually be lowered to __atomic_* libcalls. But if we know a bit more about the target, it's possible to do better - for instance, a single-core microcontroller could implement an atomic operation in a lock-free manner by disabling interrupts (and conventionally, lock-free implementations of atomics are provided through __sync_* libcalls). This kind of setup is exactly what the +forced-atomics feature enables, where atomic load/store can be lowered to a load/store with appropriate fences (as is supported in the base ISA) while other atomic operations generate a __sync_* libcall.

There's also been a very minor improvement for targets with native atomics support (the 'A' instruction set extension) that I may as well mention while on the topic. As you might know, atomic operations such as compare-and-swap are lowered to an instruction sequence involving lr.{w,d} (load reserved) and sc.{w,d} (store conditional). There are very specific rules about these instruction sequences that must be met to align with the architectural forward progress guarantee (section 8.3, page 51), which is why we expand to a fixed instruction sequence at a very late stage in compilation (see the original RFC). This means the sequence of instructions implementing the atomic operation is opaque to LLVM's optimisation passes and is treated as a single unit. The obvious disadvantage of avoiding LLVM's optimisations is that sometimes there are optimisations that would be helpful and wouldn't break that forward-progress guarantee. One that came up in real-world code was the lack of branch folding, which would have simplified a branch in the expanded cmpxchg sequence that just targets another branch with the same condition (by just folding in the eventual target). With some relatively simple logic, this suboptimal codegen is resolved.

; Before                 => After
.loop:                   => .loop
  lr.w.aqrl a3, (a0)     => lr.w.aqrl a3, (a0)
  bne a3, a1, .afterloop => bne a3, a1, .loop
  sc.w.aqrl a4, a2, (a0) => sc.w.aqrl a4, a2, (a0)
  bnez a4, .loop         => bnez a4, .loop
.afterloop:              =>
  bne a3, a1, .loop      =>
  ret                    => ret

Assorted optimisations

As you can imagine, there's been a lot of incremental minor improvements over the past ~6 months. I unfortunately only have space (and patience) to highlight a few of them.

A new pre-regalloc pseudo instruction expansion pass was added in order to allow optimising the global address access instruction sequences such as those found in the medany code model (and was later broadened further). This results in improvements such as the following (note: this transformation was already supported for the medlow code model):

; Before                            => After
.Lpcrel_hi1:                        => .Lpcrel_hi1:
auipc a0, %pcrel_hi(ga)             => auipc a0, %pcrel_hi(ga+4)
addi a0, a0, %pcrel_lo(.Lpcrel_hi1) =>
lw a0, 4(a0)                        => lw a0, %pcrel_lo(.Lpcrel_hi1)(a0)

A missing target hook (isUsedByReturnOnly) had been preventing tail calling libcalls in some cases. This was fixed, and later support was added for generating an inlined sequence of instructions for some of the floating point libcalls.

The RISC-V compressed instruction set extension defines a number of 16-bit encodings that map to a longer 32-bit form (with restrictions on addressable registers in the compressed form, of course). The conversion of 32-bit instructions to their 16-bit forms, when possible, happens at a very late stage, after instruction selection. But of course over time, we've introduced more tuning to influence codegen decisions in cases where a choice can be made to produce an instruction that can be compressed, rather than one that can't. A recent addition to this was the RISCVStripWSuffix pass, which for RV64 targets will convert addw and slliw to add or slli respectively when it can be determined that all the users of its result only use the lower 32 bits. This is a minor code size saving, as slliw has no matching compressed instruction and c.addw can address a more restricted set of registers than c.add.


At the risk of repeating myself, this has been a selective tour of some additions I thought it would be fun to write about. Apologies if I've missed your favourite new feature or improvement - the LLVM release notes will include some things I haven't had space for here. Thanks again to everyone who has been contributing to make RISC-V support in LLVM even better.

If you have a RISC-V project you think my colleagues and I at Igalia may be able to help with, then do get in touch regarding our services.

Article changelog
  • 2023-03-19: Clarified that Zawrs and Zihintntl support just involves the MC layer (assembler/disassembler).
  • 2023-03-18: Initial publication date.

March 18, 2023 12:00 PM

March 14, 2023

José Dapena

Stack walk profiling NodeJS in Windows

Last year I wrote a series of blog posts (1, 2, 3) about stack walk profiling Chromium using Windows native tools around ETW.

A fast recap: ETW support for stack walking in V8 makes it possible to show V8 JIT-generated code in the Windows Performance Analyzer. This is a powerful tool for analyzing workloads where JavaScript execution time is significant.

In this blog post, I will cover the usage of this very same tool, but to analyze NodeJS execution.

Enabling stack walk JIT information in NodeJS

In an ideal situation, V8 engines would always generate stack walk information when Windows is profiling. This is something we will want to consider in the future, once we prove that enabling it has no cost when we are not in a tracing session.

Meanwhile, we need to set the V8 flag --enable-etw-stack-walking somehow. This will install hooks that, when a profiling session starts, will emit the JIT-generated code addresses and the information about the source code associated with them.

For a command line execution of NodeJS runtime, it is as simple as passing the command line flag:

node --enable-etw-stack-walking

This enables ETW stack walking for that specific NodeJS session… Good, but not very useful.

Enabling ETW stack walking for a session

What’s the problem here? Usually, NodeJS is invoked indirectly through other tools (based or not in NodeJS). Some examples are Yarn, NPM, or even some Windows scripts or link files.

We could tune all the existing launching scripts to pass --enable-etw-stack-walking to the NodeJS runtime when it is called. But that is not very convenient.

There is a better way, though: just use the NODE_OPTIONS environment variable. This way, stack walking support can be enabled for all NodeJS calls in a shell session, or even system wide.

Bad news and good news

Some bad news: NodeJS was refusing --enable-etw-stack-walking in NODE_OPTIONS. There is a filter for which V8 options it accepts (mostly for security purposes), and ETW support was not considered.

Good news? I implemented a fix adding the flag to the list accepted by NODE_OPTIONS. It has been landed already, and it is available from NodeJS 19.6.0. Unfortunately, if you are using an older version, then you may need to backport the patch.

Using it: linting TypeScript

To explain how this can be used, I will analyse ESLint on a known workload: TypeScript. For simplicity, we are using the lint task provided by TypeScript.

This example assumes the usage of Git Bash.

First, clone TypeScript from GitHub, and go to the cloned copy:

git clone
cd TypeScript

Then, install hereby and the dependencies of TypeScript:

npm install -g hereby
npm ci

Now, we are ready to profile the lint task. First, set NODE_OPTIONS:

export NODE_OPTIONS="--enable-etw-stack-walking"

Then, launch UIForETW. This tool simplifies capturing traces, and will provide good defaults for Javascript ETW analysis. It provides a very useful keyboard shortcut, <Ctrl>+<Win>+R, to start and then stop a recording.

Switch to Git Bash terminal and do this sequence:

  • Write (without pressing <Enter>): hereby lint
  • Press <Ctrl>+<Win>+R to start recording. Wait 3-4 seconds as recording does not start immediately.
  • Press <Enter>. ESLint will traverse all the TypeScript code.
  • Press again <Ctrl>+<Win>+R to stop recording.

After a few seconds, UIForETW will automatically open the trace in Windows Performance Analyzer. Thanks to setting NODE_OPTIONS, all the child processes of the parent node.exe execution also have stack walk information.

Randomascii inclusive (stack) analysis

Focusing on node.exe instances, in Randomascii inclusive (stack) view, we can see where time is spent for each of the node.exe processes. If I take the bigger one (that is the longest of the benchmarks I executed), I get some nice insights.

The worker threads take 40% of the CPU processing. What is happening there? I basically see JIT compilation and garbage collection concurrent marking. V8 offloads that work, so there is a benefit from a multicore machine.

Most of the work happens in the main thread, as expected. And most of the time is spent parsing and applying the lint rules (half for each).

If we go deeper in the rules processing, we can see which rules are more expensive.

Memory allocation

In total commit view, we can observe the memory usage pattern of the process running ESLint. For most of the seconds of the workload, allocation grows steadily (to over 2GB of RAM). Then there is a first garbage collection, and a bit later, the process finishes and all the memory is deallocated.

More findings

At first sight, I observe we are creating the rule objects for the whole execution of ESLint. What does that mean? Could we run faster by reusing them? I can also observe that a big part of the time in the main thread ends in leaf calls doing garbage collection.

This is a good start! You can see how ETW can give you insights of what is happening and how much time it takes. And even correlate that to memory usage, File I/O, etc.

Builtins fix

Using NodeJS as it is today will still show many missing lines in the stack. I could do these tests, and a useful analysis, because I applied a very recent patch I landed in V8.

Before the fix, we would have this sequence:

  • Enable ETW recording
  • Run several NodeJS tests.
  • Each of the tests creates one or more JS contexts.
  • Each context then sends to ETW information about any JIT-compiled code.

But there was a problem: any JS context already has a lot of pre-compiled code associated with it: builtins and V8 snapshot code. Those were missing from the captured ETW traces.

The fix, as mentioned, has already landed in V8, and will hopefully be available soon in future NodeJS releases.

Wrapping up

There is more work to do:

  • WASM is still not supported.
  • Ideally, we would want to have --enable-etw-stack-walking set by default, as the impact while not tracing is minimal.

In any case, after these new fixes, capturing ETW stack walks of code executed by NodeJS runtime is a bit easier. I hope this gives some joy to your performance research.

One last thing! My work for these fixes is possible thanks to the sponsorship from Igalia and Bloomberg.

by José Dapena Paz at March 14, 2023 08:00 PM

Víctor Jáquez

Review of Igalia Multimedia activities (2022)

We, Igalia’s multimedia team, would like to share with you our list of achievements throughout 2022.

WebKit Multimedia


Phil already wrote the first blog post of a series on this topic: WebRTC in WebKitGTK and WPE, status updates, part I. Be sure to give it a glance; it has nice videos.

Long story short, last year we started to support Media Capture and Streams in WebKitGTK and WPE using GStreamer, whether for input devices (camera and microphone), desktop sharing, WebAudio, or web canvas. But this is just the first step. We are currently working on RTCPeerConnection, also using GStreamer, to share all these captured streams with other web peers. Meanwhile, we’ll wait for the second episode of Phil’s series 🙂


We worked on an initial implementation of MediaRecorder with GStreamer (1.20 or newer). The specification is about allowing a web browser to record a selected stream: for example, a voice-memo or video application that could encode and upload a capture of your microphone / camera.


While WebKitGTK already has Gamepad support, WPE lacked it. We did the implementation last year, and there’s a blog post about it: Gamepad in WPEWebkit, with a video showing a demo of it.

Capture encoded video streams from webcams

Some webcams only provide high-resolution frames encoded in H.264 or similar. In order to support those resolutions with those webcams, we added support for negotiating these formats and decoding them internally to handle the streams, though we are just at the beginning of more efficient support.

Flatpak SDK maintenance

A lot of effort went into maintaining the Flatpak SDK for WebKit. It is a set of runtimes that allows a reproducible build of WebKit, independent of the Linux distribution used. Nowadays the Flatpak SDK is used in WebKit’s EWS and by many developers.

Among the features added during the year, we can highlight Rust support, a full integrity check before upgrading, and a way to override dependencies as local projects.

MSE/EME enhancements

As every year, massive work was done in WebKit ports using GStreamer for Media Source Extensions and Encrypted Media Extensions, improving user experience with different streaming services on the Web, such as Odysee, Amazon, DAZN, etc.

In the case of encrypted media, GStreamer-based WebKit ports provide the stubs to communicate with an external Content Decryption Module (CDM). If you’re willing to support this in your platform, you can reach us.

We also worked on a video demo showing how MSE/EME work on a Raspberry Pi 3 using WPE:

WebAudio demo

We also spent time recording video demos, such as this one, showing WebAudio using WPE on a desktop computer.


We managed to merge a lot of bug fixes in GStreamer, which in many cases can be harder to solve than implementing new features, though the latter are more interesting to talk about, such as the work related to making Rust a main development language for GStreamer alongside C.

Rust bindings and GStreamer elements for Vonage Video API / OpenTok

OpenTok is the legacy name of the Vonage Video API, a PaaS (Platform as a Service) that eases the development and deployment of WebRTC services and applications.

We published on GitHub our Rust bindings for both the Client SDK for Linux and the Server SDK (REST API), along with a GStreamer plugin to publish and subscribe to video and audio streams.


In the beginning there was webrtcbin, an element that implements the majority of the W3C RTCPeerConnection API. It’s so flexible and powerful that it’s rather hard to use for the most common cases. Then appeared webrtcsink, a wrapper of webrtcbin written in Rust, which receives GStreamer streams that will be offered and streamed to web peers. Later on, we developed webrtcsrc, the webrtcsink counterpart: an element whose source pads push streams from web peers, such as another browser, forwarding those Web streams as GStreamer ones in a pipeline. Both webrtcsink and webrtcsrc are written in Rust.

Behavior-Driven Development test framework for GStreamer

Behavior-Driven Development is gaining relevance with tools like Cucumber for Java and its domain-specific language, Gherkin, used to define software behaviors. Rustaceans have picked up these ideas and developed cucumber-rs. The logical consequence was obvious: why not GStreamer?

Last year we tinkered with GStreamer-Cucumber, a BDD framework to define behavior tests for GStreamer pipelines.

GstValidate Rust bindings

There has been some discussion about whether BDD is the best way to test GStreamer pipelines; there’s also GstValidate, and last year we added its Rust bindings.

GStreamer Editing Services

Not everything was Rust, though. We also worked hard on GStreamer’s nuts and bolts.

Last year, we gathered the team to hack on GStreamer Editing Services, particularly to explore adding OpenGL and DMABuf support, such as downloading or uploading a texture before processing, and selecting a proper filter to avoid those transfers.

GstVA and GStreamer-VAAPI

We helped with the maintenance of GStreamer-VAAPI and the development of its near replacement, GstVA, adding new elements such as the H.264 encoder, the compositor, and the JPEG decoder, along with participating in the discussion and code review of negotiating DMABuf streams in the pipeline.

Vulkan decoder and parser library for CTS

You might have heard that Vulkan has now integrated video decoding into its API, while encoding is currently a work in progress. We devoted time to helping Khronos with the Vulkan Video Conformance Tests (CTS), particularly with a parser based on GStreamer and with developing an H.264 decoder in GStreamer using the Vulkan Video API.

You can check out the presentation we gave at the last Vulkanised.

WPE Android Experiment

In a joint adventure with Igalia’s WebKit team, we did some experiments to port WPE to Android. This is just an internal proof of concept so far, but we are looking forward to seeing how this will evolve in the future, and what new possibilities it might open up.

If you have any questions about WebKit, GStreamer, Linux video stack, compilers, etc., please contact us.

by vjaquez at March 14, 2023 11:00 AM

March 12, 2023

Clayton Craft

Presence detection, without compromising privacy

I use Home Assistant (HA) to control my WiFi-enabled thermostat, which in turn is walled off from the Internet. So no matter how "smart" my thermostat wants to be, it's forced to live in isolation and follow orders from HA. For the past year, I've been trying out crude methods for detecting when humans are home, or not, so that the HVAC can be set accordingly. The last attempt required using the "ping" integration in HA. You basically provide a list of IPs, add users that are associated with the IP / device tracker for it, then you can conditionally run automation based on the status of those things.

This has worked pretty well; however, there are some pain points that have been really digging at me:

  1. Setting this up requires using static DHCP for devices on the network that I want to associate with someone being home, like a cell phone. This is tricky since some phones use random MACs, and it's cumbersome to add new people to the "who could be home?" list.

  2. If people come over, I don't want the HVAC to shut off if the regular inhabitants leave. E.g. if my partner and I have to leave the house, and my mother-in-law stays at home, I'll get in trouble if the heat shuts off because HA thinks everyone is gone. Don't ask me how I know.

  3. Setting this all up basically requires configuring HA to track inhabitants and guests in my home. Even though I've gone to great lengths to try to prevent HA from sending any data externally, I still don't like that this data is being stored anywhere. I actually don't care whether my partner or some particular guest is home when I'm not; it's easy enough to ask if I suddenly do. But seeing that someone is home is fine. I've heard that some folks use the "cloud" to manage HA and its data. Ouch.

So I present my latest attempt to alleviate those 3 things as much as possible:

  • Monitor IP ranges and detect if any devices in those ranges are "online" (responding to ping)
  • Use DHCP to "put" devices that humans carry or use when they are home into the appropriate monitored IP range
  • HA can use this to determine if any humans are connected (or not), and thus at home (or not)
  • Guests who are on WiFi are counted as "home" by HA without me having to do anything more than just give them the guest WiFi password (which I would have done previously anyways).
  • HA doesn't have any information about specific users and their home/not home status. And it doesn't care if devices use a consistent MAC or not.

All three pain points above are addressed... although maybe not 100% solved in every situation. But it still seems like an improvement.

At the heart of this is a script that runs ping in parallel on an arbitrary number of IP addresses given to it. I'm sure it could be improved... Like, I doubt this would turn out well if the list gets too large. And the run time of the script can be kinda long if there are no devices and it ends up retrying the max number of times. The retry is there because I've seen that some devices might not respond to ping immediately (they were sleeping, or had a WiFi reconnect event at the wrong time, etc...) So if you have any ideas, let me know!

#!/usr/bin/env bash
# Pings the given addresses and prints "up" if any of
# them respond to the ping, else it prints "down". A
# non-zero return value is an error.
# A simple backoff delay is used between attempts to
# ping the group when none of them responded to the
# last attempt.

function pings(){
    local PIDS=()
    local STATUS=()
    local IP=( "$@" )

    for ip in "${IP[@]}"; do
        ping -c1 -W1 "$ip" &>/dev/null &
        PIDS+=( "$!" )
    done

    for p in "${PIDS[@]}"; do
        wait "${p}"
        STATUS+=( "$?" )
    done

    for s in "${STATUS[@]}"; do
        if [[ $s -eq 0 ]]; then
            return 0
        fi
    done

    return 1
}

if [ -z "$1" ]; then
    echo "Parameter required!!"
    exit 1
fi

ADDRS=( "$@" )
tries=0
max_tries=3
try_interval=1  # starting interval, before backoff

while ! pings "${ADDRS[@]}"; do
    tries=$(( tries + 1 ))
    if [[ $tries -ge $max_tries ]]; then
        echo down
        exit 0
    fi
    sleep "$try_interval"
    try_interval=$(( try_interval * 2 ))  # simple backoff between retries
done

echo up
exit 0

This script should be saved in some directory that HA can access. Next, we need to actually run the script in a way that HA can consume the output directly and use it. A binary sensor seemed like the best choice:

  - platform: command_line
    name: 'network_users'
    device_class: presence
    scan_interval: 30 # seconds
    command_timeout: 30 # seconds
    command: 'bash -c "tools/ 192.168.20.{100..150}"'
    payload_on: 'up'
    payload_off: 'down'

The parameters to that script can be ranges that the shell can expand, so you don't have to list out every IP. Triggering every 30 seconds may be too often; I may back that off later to a minute or more, so I'll run this for a bit and see if I form any opinions about changing it.
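For example (a minimal sketch with a shorter range than the one above), bash expands the braces before the script ever runs, so the script just receives one argument per address:

```shell
# The shell turns the range into individual addresses;
# the script never sees the {100..102} expression itself.
echo 192.168.20.{100..102}
# → 192.168.20.100 192.168.20.101 192.168.20.102
```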

Next we need to create a "device tracker" so that this can be associated with some "user" in HA for presence detection. I used device_tracker.see in a scheduled automation to add and update a device tracker thing in HA specifically for network users:

- alias: Update presence tracking sensors
  description: This is done periodically so that device trackers for each sensor have
    updated values after HA startup / config reload
  trigger:
  - platform: time_pattern
    minutes: /1
  condition: []
  action:
  - parallel:
    - service: device_tracker.see
      data:
        dev_id: network_users
        location_name: '{{ "home" if is_state("binary_sensor.network_users", "on")
          else "not_home" }}'
      alias: Update Network Users Tracker Status
    alias: Update device trackers
    # .... other binary sensors used for determining presence can be listed here too!
  mode: single

I set this automation to fire every 1 minute, so that the current status is "known" to HA soon after it starts up, or config is reloaded. Maybe 1 minute is "too often", but it's just reading a value from the HA database, so, meh. Basically, I've learned that triggering automations based on state changes is flaky, and it's often more reliable to trigger some automations on a schedule and then check entity states in the condition or action.

Lastly, there needs to be a "person" in HA to associate this device tracker with. I created a generic "person" and associated the tracker with it. This can be done through Settings->People->Add Person. I called the fake person I made for this "Pamela", but the name doesn't matter.

The presence status of this person can now be used in automation and/or dashboard cards:

Home Assistant Card showing that someone is connected to the network, and HVAC is enabled by automation

Now, this obviously isn't perfect... One of the issues is that this can lead to the HVAC being left on if someone forgets a phone at home, for example. But that could have happened over the last year with the old implementation too, and I've found that 1) it's rare, and 2) the savings from having a "smarter" HVAC outweigh the handful of situations where it's being wasteful. Also, this could be circumvented if some smart guest assigns a static IP outside any expected range I pass to the script above. But if they do that, then the joke is on them, since they'll just slowly freeze as the HVAC shuts off because HA thinks no one is home. I do have some other presence detection things (like IR motion detectors) that help out a little too.

With all of that, HA can tell if someone is home if they're connected to a network, and have an IP from DHCP (or otherwise) that's being monitored. And HA is completely oblivious to any specific people. Again, that's mostly just a personal concern and probably not very existential for my setup, but maybe this will be helpful for folks who do use some external thing for managing HA data, and are concerned about how location status for household members is handled.

by Unknown at March 12, 2023 12:00 AM

March 10, 2023

Andy Wingo

pre-initialization of garbage-collected webassembly heaps

Hey comrades, I just had an idea that I won't be able to work on in the next couple months and wanted to release it into the wild. They say if you love your ideas, you should let them go and see if they come back to you, right? In that spirit I abandon this idea to the woods.

Basically the idea is Wizer-like pre-initialization of WebAssembly modules, but for modules that store their data on the GC-managed heap instead of just in linear memory.

Say you have a WebAssembly module with GC types. It might look like this:

  (type $t0 (struct (ref eq)))
  (type $t1 (struct (ref $t0) i32))
  (type $t2 (array (mut (ref $t1))))
  (global $g0 (ref null eq)
    (ref.null eq))
  (global $g1 (ref $t1)
    (array.new_canon $t0 (i31.new (i32.const 42))))
  (function $f0 ...)

You define some struct and array types, there are some global variables, and some functions to actually do the work. (There are probably also tables and other things but I am simplifying.)

If you consider the object graph of an instantiated module, you will have some set of roots R that point to GC-managed objects. The live objects in the heap are the roots and any object referenced by a live object.

Let us assume a standalone WebAssembly module. In that case the set of types T of all objects in the heap is closed: it can only be one of the types $t0, $t1, and so on that are defined in the module. These types have a partial order and can thus be sorted from most to least specific. Let's assume that this sort order is just the reverse of the definition order, for now. Therefore we can write a general type introspection function for any object in the graph:

(func $introspect (param $obj anyref)
  (block $L2 (ref $t2)
    (block $L1 (ref $t1)
      (block $L0 (ref $t0)
        (br_on_cast $L2 $t2 (local.get $obj))
        (br_on_cast $L1 $t1 (local.get $obj))
        (br_on_cast $L0 $t0 (local.get $obj))
      ;; Do $t0 things...
    ;; Do $t1 things...
  ;; Do $t2 things...

In particular, given a WebAssembly module, we can generate a function to trace edges in an object graph of its types. Using this, we can identify all live objects, and what's more, we can take a snapshot of those objects:

(func $snapshot (result (ref (array (mut anyref))))
  ;; Start from roots, use introspect to find concrete types
  ;; and trace edges, use a worklist, return an array of
  ;; all live objects in topological sort order

Having a heap snapshot is interesting for introspection purposes, but my interest is in having fast start-up. Many programs have a kind of "initialization" phase where they get the system up and running, and only then proceed to actually work on the problem at hand. For example, when you run python3, Python will first spend some time parsing and byte-compiling, importing the modules it uses and so on, and then will actually run the script's code. Wizer lets you snapshot the state of a module after initialization but before the real work begins, which can save on startup time.
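As an aside, you can observe this initialization phase directly with CPython's standard -X importtime option (a small sketch; json is just an example module):

```shell
# Print the per-module cost of the imports performed at startup to stderr;
# this is the kind of work a Wizer-style snapshot would let you skip.
python3 -X importtime -c 'import json' 2>&1 | grep json
```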

For a GC heap, we actually have similar possibilities, but the mechanism is different. Instead of generating an array of all live objects, we could generate a serialized state of the heap as bytecode, and another function to read the bytecode and reload the heap:

(func $pickle (result (ref (array (mut i8))))
  ;; Return an array of bytecode which, when interpreted,
  ;; can reconstruct the object graph and set the roots
(func $unpickle (param (ref (array (mut i8))))
  ;; Interpret the bytecode, building object graph in
  ;; topological order

The unpickler is module-dependent: it will need one case to construct each concrete type $tN in the module. Therefore the bytecode grammar would be module-dependent too.

What you would get with a bytecode-based $pickle/$unpickle pair would be the ability to serialize and reload heap state many times. But for the pre-initialization case, probably that's not precisely what you want: you want to residualize a new WebAssembly module that, when loaded, will rehydrate the heap. In that case you want a function like:

(func $make-init (result (ref (array (mut i8))))
  ;; Return an array of WebAssembly code which, when
  ;; added to the module as a function and invoked, 
  ;; can reconstruct the object graph and set the roots.

Then you would use binary tools to add that newly generated function to the module.

In short, there is a space open for a tool which takes a WebAssembly+GC module M and produces M', a module which contains a $make-init function. Then you use a WebAssembly+GC host to load the module and call the $make-init function, resulting in a WebAssembly function $init which you then patch in to the original M to make M'', which is M pre-initialized for a given task.


Some of the object graph is constant; for example, an instance of a struct type that has no mutable fields. These objects don't have to be created in the init function; they can be declared as new constant global variables, which an engine may be able to initialize more efficiently.

The pre-initialized module will still have an initialization phase in which it builds the heap. This is a constant function and it would be nice to avoid it. Some WebAssembly hosts will be able to run pre-initialization and then snapshot the GC heap using lower-level facilities (copy-on-write mappings, pointer compression and relocatable cages, pre-initialization on an internal level...). This would potentially decrease latency and may allow for cross-instance memory sharing.


There are five preconditions to be able to pickle and unpickle the GC heap:

  1. The set of concrete types in a module must be closed.

  2. The roots of the GC graph must be enumerable.

  3. The object-graph edges from each live object must be enumerable.

  4. To prevent cycles, we have to know when an object has been visited: objects must have identity.

  5. We must be able to create each type in a module.

I think there are three limitations to this pre-initialization idea in practice.

One is externref; these values come from the host and are by definition not introspectable by WebAssembly. Let's keep the closed-world assumption and consider the case where the set of external reference types is closed also. In that case, if a module allows for external references, we can perhaps make its pickling routines call out to the host to (2) provide any external roots (3) identify edges on externref values (4) compare externref values for identity and (5) indicate some imported functions which can be called to re-create external objects.

Another limitation is funcref. In practice, in the current state of WebAssembly and GC, you will only have a funcref which is created by ref.func, and which (3) therefore has no edges and (5) can be re-created by ref.func. However, neither WebAssembly nor the JS API has any way of knowing which function index corresponds to a given funcref. Including function references in the graph would therefore require some sort of host-specific API. Relatedly, function references are not comparable for equality (func is not a subtype of eq), which is a little annoying but not so bad considering that function references can't participate in a cycle. Perhaps a solution though would be to assume (!) that the host representation of a funcref is constant: the JavaScript (e.g.) representations of (ref.func 0) and (ref.func 0) are the same value (in terms of ===). Then you could compare a given function reference against a set of known values to determine its index. Note that when function references are expanded to include closures, we will have more problems in this area.

Finally, there is the question of roots. Given a module, we can generate a function to read the values of all reference-typed globals and of all entries in all tables. What we can't get at are any references from the stack, so our object graph may be incomplete. Perhaps this is not a problem though, because when we unpickle the graph we won't be able to re-create the stack anyway.

OK, that's my idea. Have at it, hackers!

by Andy Wingo at March 10, 2023 09:20 AM

March 09, 2023

Ziran Sun

Igalia on Reforestation

Igalia started the Reforestation project in 2019. As one of the company’s Corporate Social Responsibility (CSR) efforts, the Reforestation project focuses on conserving and expanding native, old-growth forests to capture and store carbon emissions long term. Partnering with Galnus, we have been working on the reforestation of a 10.993-hectare area with 11,000 trees, which absorbs 66 tons of carbon dioxide each year.

Phase I – Rois

The first land where the project started is located in the woods of the communal land of “San Miguel de Costa” in Rois. Rois is a municipality of northwestern Spain, in the province of A Coruña in the autonomous community of Galicia. The environment in this land was highly altered by human action, and the predominance of eucalyptus plantations left few examples of native forest.

After the agreement for the land transfer was signed on the 29th of June 2019, the project started with creating maps and technical plans for the areas affected. The purpose of these plans is not only to build a cartographic base for the management and accomplishment of the works, but also to accurately designate these zones as “Reserved Areas” in future management plans.

Work carried out

  • Clear and crush remaining eucalyptus and other exotic plants

Eucalyptus is an exotic and invasive species. It spread widely and had a negative impact on the environmental situation of this habitat. This work is to clear the existing growth and eliminate the regrowth of eucalyptus and other exotic plants before the start of planting work. In Q1 2020, 100% of exotic trees were cut as part of the Atlantic forest sponsorship. After that, the focus moved to eliminating the eucalyptus sprouts grown from seeds present in the soil from earlier years. Sprout and seed elimination continues throughout the duration of the project.

  • Acquire trees, shrubs and all other materials needed.

To improve forest structure and increase biodiversity for the affected area, the following native trees and shrubs were chosen and purchased in Q3 2019 –
– 1000 downy birch (Betula pubescens)
– 650 common oak (Quercus robur)
– 350 wild cherry (Prunus avium)
– 150 common holly (Ilex aquifolium)

A lot of other materials such as tree protectors, stakes, etc. were also acquired at the same time.

  • Planting

Planting started in Q4 2019; downy birches (Betula pubescens) and some common oaks (Quercus robur) were the first batch planted. The planting campaign continued in 2020 upon the arrival of the rest of the trees.

  • Care during the first year

The first year after planting is key to the future development and success of the project.

In Q1 2020, 100% of the planted trees were already sprouting and alive. At this stage, 100% of the restoration area had already been planted, with at least 65% of the trees in place.

The first summer after planting is vital for the trees and shrubs to settle. During the first summer, trees adapt and take root in their new location. If they survive, their chances of developing correctly increase exponentially, and in the following years they will be able to focus all their energy on their growth. The beginning of summer 2020 was difficult due to hot and dry weather. However, the tree mortality rate remained within the expected range.

In Q4 2020, most of the planted trees were well established and ready for spring sprouting. By Q1 2021, many of them had reached over 2 meters in height.

The trees and shrubs settled in happily after the first year of growth. This is a photo taken in Spring 2020 –

And see how they had grown in a year’s time –

  • Wildlife studies

With plants sprouting all around the forest, insects, reptiles and birds started thriving too. The development of the ecosystem will provide for better results in future inventories and wildlife studies. In summer 2020 the wildlife studies started and the first inventory was completed. In winter 2020 wildlife nest boxes were being manufactured, and some bat boxes were installed in Spring 2021.

  • Improving biodiversity

In addition to ensuring the introduction of a wide variety of tree species to the reforestation areas, the project has also put effort into enhancing biodiversity in existing forest areas. For example, one of the areas targeted by the project is one hectare of young secondary forest made up mainly of oaks (Quercus robur). In this forest we have been planting understory species such as holly (Ilex aquifolium), laurel (Laurus nobilis) or Atlantic pear (Pyrus cordata) to increase biodiversity and improve the structure and complexity of the forest.

Phase II – Eume and Rois

Phase II of the reforestation project set sail in winter 2020. While the Rois project had moved to a steady stage, with most trees and shrubs planted and settled through the first year’s growth, this phase expands the restoration to new forest areas in Rois and starts a new project in the “Fragas do Eume” natural park.

  • Rois

In order to explain the progress in the Rois expansion project, the maps below distinguish three areas in different stages of development.
– Phase 1 (Green) – The green area is completely planted with native trees and free of eucalyptus sprouts.
– Phase 2 (Yellow) – Acacias have been eliminated in this area. The work to control this species and also the eucalyptus trees will continue.
– Phase 3 (Red) – The entire area is covered with dense plantations of eucalyptus and acacia. Work is in the preparation stage.

The following maps represent the progress made between Spring and Fall 2021.

Spring 2021:

Fall 2021:

  • Eume

Unlike the Rois project, the habitats in the “Fragas do Eume” natural park are some of the largest and best-preserved examples in the world of coastal Atlantic rainforest, where much of its biodiversity and original structure is still preserved. Unfortunately, most of these forests are young secondary forests under numerous threats, such as the presence of eucalyptus and other invasive species.

Our project represents one of the largest actions in recent years for the elimination and control of exotic species and the restoration of native forest in the area, increasing native forest surface area, reconnecting fragmented forest patches and improving the landscape in the “Fragas do Eume Natural Park”.

In order to explain the progress in the Eume project, the maps below also distinguish three areas in different stages of development.

Phase 1 (Green) – All the environmental restoration works have been completed. Only maintenance tasks remain for the next few years.
Phase 2 (Yellow) – This area is in the process of controlling and eliminating eucalyptus, in order to start restoring the native vegetation by planting native trees.
Phase 3 (Red) – Waiting for the loggers to cut the big eucalyptus trees in this area, in order to start the work associated with the project.

The following maps represent the progress made between Spring and Fall 2021.

Spring 2021:

Fall 2021:


In 2021 Igalia started working on a 0.538-hectare plot owned by the Esmelle Valley Association. This land is bordered by a road and a forest track on each side, and the Esmelle stream flows across it. This once mighty stream and its tributary springs are currently used to provide drinking water to the houses in the surrounding area. Unfortunately, in summer the flow of the stream can drop drastically due to dry weather and exhausting uses. In addition, this area also suffers the invasion of numerous eucalyptus trees and other exotic species. Recovering and restoring a good example of the Atlantic Forest will bring great benefit to this enormously altered and humanized area.

After work preparation and planning, major work was carried out in 2022, including –
– Elimination and control of eucalyptus and its sprouts.
– Clearing and ground preparation
– Tree seedling, planting and protection (placement of tree stakes, protectors, etc.)
– Maintenance and replacement of dead trees

This work will continue over the next two years. The main focus will be on reinforcing and enriching the plantation, based on the availability of tree seedlings of certain species, and on finishing all the remaining maintenance work.

What’s next?

Igalia sees Reforestation as a long term effort. While maintaining and developing the current projects, Igalia doesn’t stop looking for new candidates. In Q4 2022, Igalia started preparing a new Reforestation project – Galnus: O Courel.

This is the first time that Igalia has considered developing an environmental compensation project in one of the most rural and least densely populated areas of Galicia. Igalia believes that the environmental, social and economic characteristics associated with this area offer an opportunity to carry out environmental restoration projects. If it goes as planned, major work will happen in the next two years. Something to look forward to!


The Reforestation project is making contributions to our environment, and its impact has gone further. Here we’d like to share a picture of a 7-year-old boy’s work. This young student is from the community that owns the Rois forest. He took advantage of a school newspaper project to write about our work in the Rois forest.

Isn’t it a joy to see that Igalia’s Reforestation project is making an impact on our children, and on our future? 🙂

by zsun at March 09, 2023 08:02 PM

March 03, 2023

André Almeida

Installing kernel modules faster with multithread XZ

As a kernel developer, every day I need to compile and install custom kernels, and any improvement in this workflow makes me more productive. While installing my freshly compiled modules, I noticed that the process would get stuck compressing amdgpu for some time:

XZ      /usr/lib/modules/6.2.0-tonyk/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.xz

XZ format

My target machine is the Steam Deck, which uses .xz for compressing the modules. Given that we want gamers to be able to install as many games as possible, the OS shouldn’t waste much disk space. amdgpu, when compiled with debug symbols, can take up a good chunk of space. Here’s a comparison of the module’s disk size uncompressed, and then with .zst and .xz compression:

360M amdgpu.ko
61M  amdgpu.ko.zst
38M  amdgpu.ko.xz

This more compact module comes with a cost: more CPU time for compression.

Multithread compression

When I opened htop, I saw that only a single lonely thread was doing the hard work of compressing amdgpu, even though compression is an easily parallelizable task. I then hacked scripts/Makefile.modinst so that XZ would use as many threads as possible, with the option -T0. On my main build machine, modules_install ran 4 times faster!

# before the patch
$ time make modules_install -j16
Executed in  100.08 secs

# after the patch
$ time make  modules_install -j16
Executed in   28.60 secs

Then, I submitted a patch to make this default for everyone: [PATCH] kbuild: modinst: Enable multithread xz compression

However, as Masahiro Yamada noted, we shouldn’t be spawning numerous threads in the build system without the user requesting it. To this day, we manually specify how many threads to run with make -jX.

Fortunately, Nathan Chancellor suggested that the same results can be achieved using XZ_OPT=-T0, so we can still benefit from this without the patch. I experimented with different -TX and -jY values, but on my notebook the most efficient values were X = Y = nproc. You can check some results below:

$ make modules_install
174.83 secs

$ make modules_install -j8
100.55 secs

$ make modules_install XZ_OPT=-T0
81.51 secs

$ make modules_install -j8 XZ_OPT=-T0
53.22 secs
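The same effect can be reproduced with a standalone xz invocation outside the kernel build. A quick sketch (the sample file path is made up for illustration):

```shell
# Create a compressible sample file (hypothetical path)
head -c 16777216 /dev/zero > /tmp/sample.bin

# Single-threaded compression
time xz -c -T1 /tmp/sample.bin > /tmp/sample-1t.xz

# Multithreaded: -T0 spawns one worker per available CPU core
time xz -c -T0 /tmp/sample.bin > /tmp/sample-mt.xz
```

Note that multithreaded mode splits the input into independently compressed blocks, so the output can be marginally larger than the single-threaded result.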

March 03, 2023 02:02 PM

Alex Bradbury

Commercially available RISC-V silicon

The RISC-V instruction set architecture has seen huge excitement and growth in recent years (10B cores estimated to have shipped as of Dec 2022) and I've been keeping very busy with RISC-V related work at Igalia. I thought it would be fun to look beyond the cores I've been working with and to enumerate the SoCs that are available for direct purchase or in development boards that feature RISC-V cores programmable by the end user. I'm certain to be missing some SoCs or have some mistakes or missing information - any corrections very gratefully received at or @asbradbury. I'm focusing almost exclusively on the RISC-V aspects of each SoC - i.e. don't expect a detailed listing of other on-chip peripherals or accelerators.

A few thoughts

  • It was absolutely astonishing how difficult it was to get basic information about the RISC-V specification implemented by many of these SoCs. In a number of cases there was just a description of an "RV32 core at xxMHz", with further detective work needed to find any information at all about even the standard instruction set extensions supported.
  • The way I've focused on the details of individual cores does a bit of a disservice to those SoCs with compute clusters or interesting cache hierarchies. If I were to do this again (or if I revisit this in the future), I'd look to rectify that. There are a whole bunch of other micro-architectural details it would be interesting to cover too.
  • I've picked up CoreMark numbers where available, but of course that's a very limited metric for comparing cores. It's also not always clear which compiler was used, which extensions were targeted and so on. Plus, when the figure is taken from an IP core product page, the numbers may refer to a newer version of the IP. Where there are multiple numbers floating about I've listed them all.
  • There's a lot of chips here - but although I've likely missed some, not so many that it's impossible to enumerate. RISC-V is growing rapidly, so perhaps this will change in the next year or two.
  • A large proportion of the SoC designs listed are based on proprietary core designs. The exceptions are the collection of SoCs based on the T-Head cores, the SiFive E31 and Kendryte K210 (Rocket-derived) and the GreenWaves GAP8/GAP9 (PULP-derived). As a long-term proponent of open source silicon I'd hope to see this change over time. Once a company has moved from a proprietary ISA to an open standard (RISC-V), there's a much easier transition path to switch from a proprietary IP core to one that's open source.

64-bit Linux-capable application processors

  • StarFive JH7110
    • Core design:
      • 4 x RV64GC_Zba_Zbb SiFive U74 application cores, 1 x RV64IMAC SiFive S7 (this?) monitor core, and 1 x RV32IMFC SiFive E24 (ref, ref).
      • The U74 is a dual-issue in-order pipeline with 8 stages.
    • Key stats:
      • 1.5 GHz, fabbed on TSMC 28nm (ref).
      • StarFive report a CoreMark/MHz of 5.09.
    • Development board:
  • T-Head C910 ICE
    • Core design:
      • 2 x RV64GC T-Head C910 application cores and an additional T-Head C910 RV64GCV core (i.e., with the vector extension).
      • The C910 is a 3-issue out-of-order pipeline with 12 stages.
    • Key stats:
      • 1.2 GHz, fabbed on a 28nm process (ref).
    • Development board:
  • Allwinner D1-H (datasheet, user manual)
    • Core design:
      • 1 x RV64GC T-Head C906 application core. Additionally supports the unratified, v0.7.1 RISC-V vector specification (ref).
      • Single-issue in-order pipeline with 5 stages.
      • Verilog for the core is on GitHub under the Apache License (see discussion on what is included).
      • At least early versions of the chip incorrectly trapped on fence.tso. It's unclear if this has been fixed in later revisions.
    • Key stats:
      • 1GHz, taped out on a 22nm process node.
      • Reportedly 3.8 CoreMark/MHz.
    • Development board:
  • StarFive JH7100
    • Core design:
      • 2 x RV64GC SiFive U74 application cores and 1 x RV32IMAFC SiFive E24.
      • The U74 is a dual-issue in-order pipeline with 8 stages.
    • Key stats:
      • 1.2GHz (as listed on StarFive's page but articles about the V1 board claimed 1.5GHz), presumably fabbed on TSMC 28nm (the JH7110 is).
      • The current U74 product page claims 5.75 CoreMark/MHz (previous reports suggested 4.9 CoreMark/MHz).
    • Development board:
  • Kendryte K210 (datasheet)
    • Core design:
      • 2 x RV64GC application cores (reportedly implementations of the open-source Rocket core design).
      • If it's correct the K210 uses Rocket, it's a single-issue in-order pipeline with 5 stages.
      • Has a non-standard (older version of the privileged spec?) MMU (ref), so the nommu Linux port is typically used.
    • Key stats:
      • 400MHz, fabbed on TSMC 28nm.
    • Development board:
  • MicroChip PolarFire SoC MPFSxxxT
    • Core design:
      • 4 x RV64GC SiFive U54 application cores, 1 x RV64IMAC SiFive E51 (now renamed to S51) monitor core.
      • The U54 is a single-issue, in-order pipeline with 5 stages.
    • Key stats:
      • 667 MHz (ref), fabbed on a 28nm node (ref).
      • Microchip report 3.125 CoreMark/MHz.
    • Development board:
      • Available in the 'Icicle' development board.
  • SiFive FU740
    • Core design:
      • 4 x RV64GC SiFive U74 application cores, 1 x RV64IMAC SiFive S71 monitor core.
      • It's hard to find details for the S71 core: the FU740 manual refers to it as an S7, while the HiFive Unmatched refers to it as the S71, but neither has a page on SiFive's site. I'm told that the S71 has the same pipeline as the S76, just with no support for the F and D extensions.
      • The U74 is a dual-issue in-order pipeline with 8 stages.
    • Key stats:
      • 1.2 GHz, fabbed on TSMC 28nm (ref).
      • The current U74 product page claims 5.75 CoreMark/MHz (previous reports suggested 4.9 CoreMark/MHz).
    • Development board:
  • SiFive FU540
    • Core design:
      • 4 x RV64GC SiFive U54 application cores, 1 x RV64IMAC SiFive E51 (now renamed to S51) monitor core.
      • The U54 is a single-issue, in-order pipeline with 5 stages.
    • Key stats:
      • 1.5GHz, fabbed on TSMC 28nm (ref).
      • The current U54 product page claims 3.16 CoreMark/MHz but it was 2.75 in 2017.
    • Development board:
  • Bouffalo Lab BL808
    • Core design:
    • Key stats:
      • The C906 runs at 480 MHz, the E907 at 320 MHz and the E902 at 150 MHz.
    • Development board:
      • Available in the Ox64 from Pine64.
  • Renesas RZ/Five
  • Kendryte K510
    • Core design:
      • 2 x RV64GC application cores and 1 x RV64GC core with DSP extensions. The Andes AX25MP appears to be used (ref).
    • Key stats:
      • 800 MHz.
    • Development board:
  • (Upcoming) Intel-SiFive Horse Creek SoC
    • Core design:
      • 4 x "RV64GBC" SiFive P550 application cores (docs not yet available, but an overview is here). As the 'B' extension was broken up into smaller sub-extensions, this is perhaps RV64GC_Zba_Zbb like the SiFive U74.
      • 13 stage, 3 issue, out-of-order pipeline.
      • As the bit manipulation extension was split into a range of sub-extensions it's unclear exactly which of the 'B' family extensions will be supported.
    • Key stats:
      • 2.2 GHz, fabbed on Intel's '4' process node (ref).
    • Development board:
  • (Upcoming) T-Head TH1520 (announcement)
    • Core design:
      • 4 x RV64GC T-Head C910 application cores, 1 x RV64GC T-Head C906, 1 x RV32IMC T-Head E902.
      • The C910 is a 3-issue out-of-order pipeline with 12 stages, the C906 is single-issue in-order with 5 stages, and the E902 is single-issue in-order with 2 stages.
      • Verilog for the cores is up on GitHub under the Apache license: C910, C906, E902. See discussion on what is included).
    • Key stats:
      • 2.4 GHz, fabbed on a 12nm process (ref).
    • Development board:

Embedded / specialised SoCs (mostly 32-bit)

  • SiFive FE310
  • GigaDevice GD32VF103 series
    • Core design:
      • 1 x RV32IMAC Nuclei Bumblebee N200 ("jointly developed by Nuclei System Technology and Andes Technology.")
      • No support for PMP (Physical Memory Protection), includes the 'ECLIC' interrupt controller derived from the CLIC design.
      • Single-issue, in-order pipeline with 2 stages.
    • Key stats:
      • 108 MHz. 360 CoreMark (ref), implying 3.33 CoreMark/MHz.
    • Development board:
  • GreenWaves GAP8
    • Core design:
    • Key stats:
      • 175 MHz Fabric Controller, 250 MHz cluster. Fabbed on TSMC's 55nm process (ref).
      • 22.65 GOPS at 4.24mW/GOP (ref).
      • Shipped 150,000 units, composed of roughly 80% open source and 20% proprietary IP (ref).
    • Development board:
  • GreenWaves GAP9
    • Core design:
      • Fabric Controller (FC) and compute cluster of 9 cores. Extends the RV32IMC (plus extensions) GAP8 core design with additional custom extensions (ref).
    • Key stats:
      • 400 MHz Fabric Controller and computer cluster. Fabbed on Global Foundries 22nm FDX process (ref).
      • 150.8 GOPS at 0.33mW/GOP (ref).
    • Development board:
      • A GAP9 evaluation kit is listed on the GreenWaves store, but you must email to receive access to order it.
  • Renesas RH850/U2B
    • Core design:
      • Features an NSITEXE DR1000C RISC-V parallel coprocessor, comprised of RV32I scalar processor units, a control core unit, and a vector processing unit based on the RISC-V vector extension.
    • Key stats:
      • 400 MHz. Fabbed on a 28nm process.
    • Development board:
      • None I can find.
  • Renesas R9A02G020
    • Core design:
      • 1 x RV32IMC AndesCore N22 (additionally supporting the Andes 'Performance' and 'CoDense' instruction set extensions).
      • Single-issue in-order with a 2-stage pipeline.
    • Key stats:
      • 32 MHz.
    • Development board:
  • Analog Devices MAX78000
    • Core design:
      • Features an RV32 RISC-V coprocessor of unknown (to me!) design and unknown ISA naming string.
    • Key stats:
      • 60 MHz (for the RISC-V co-processor), fabbed on a TSMC 40nm process (ref).
    • Development board:
  • Espressif ESP32-C3 / ESP8685
    • Core design:
      • 1 x RV32IMC core of unknown design, single issue in-order 4-stage pipeline (ref).
    • Key stats:
      • 160 MHz, fabbed on a TSMC 40nm process (ref).
      • 2.55 CoreMark/MHz.
    • Development board:
  • Espressif ESP32-C2 / ESP8684
    • Core design:
      • 1 x RV32IMC core of unknown design, single issue in-order 4-stage pipeline (ref).
    • Key stats:
      • 160 MHz, fabbed on a TSMC 40nm process.
      • 2.55 CoreMark/MHz.
      • Die photo available in this article.
    • Development board:
  • Espressif ESP32-C6
    • Core design:
      • 1 x RV32IMAC core of unknown design (four stage pipeline) and 1 x RV32IMAC core of unknown design (two stage pipeline) for low power operation (ref).
    • Key stats:
      • 160 MHz high performance (HP) core with 2.76 CoreMark/MHz, 20 MHz low power (LP) core.
    • Development board:
  • HiSilicon Hi3861
    • Core design:
      • 1 x RV32IM core of unknown design, supporting additional non-standard compressed instruction set extensions (ref).
    • Key stats:
      • 160 MHz.
    • Development board:
      • A low-cost board is available advertising support for Harmony OS.
  • Bouffalo Lab BL616/BL618
    • Core design:
      • 1 x RV32GC core of unknown design, also with support for the unratified 'P' packed SIMD extension (ref).
    • Key stats:
      • 320 MHz.
    • Development board:
  • Bouffalo Lab BL602/BL604
    • Core design:
    • Key stats:
      • 192 MHz, 3.1 CoreMark/MHz (ref).
    • Development board:
      • Available in the very low cost Pinecone evaluation board.
    • Other:
      • The BL702 appears to have the same core, so I haven't listed it separately.
  • Bluetrum AB5301A
  • WCH CH583/CH582/CH581
    • Core design:
      • 1 x RV32IMAC QingKe V4a core, which also supports a "hardware prologue/epilogue" extension.
      • Single issue in-order pipeline with 2 stages.
      • Unlike the other QingKe V4 series core designs, the V4a doesn't support the custom 'extended instruction' (XW) instruction set extension.
    • Key stats:
      • 20MHz.
    • Development board:
  • WCH CH32V307
    • Core design:
    • Key stats:
      • 144 MHz.
    • Development board:
  • WCH CH32V208
    • Core design:
      • 1 x RV32IMAC QingKe V4c core with custom instruction set extensions ('XW' for sign-extended byte and half word operations).
    • Key info:
      • 144 MHz.
    • Development board:
    • Other:
      • The CH32V203 is also available, but I haven't listed it separately as it's not clear how the QingKe V4b core in that chip differs from the V4c in this one.
  • WCH CH569 / WCH CH573 / WCH CH32V103
    • Core design:
    • Key stats:
      • 120 MHz (CH569), 20 MHz (CH573), 80 MHz (CH32V103).
    • Development board:
  • WCH CH32V003
    • Core design:
      • 1 x RV32EC QingKe V2A with custom instruction set extensions ('XW' for sign-extended byte and half word operations).
    • Key stats:
    • Development board:
  • PicoCom PC802
    • Core design:
    • Key stats:
      • Fabbed on TSMC 12nm process (ref).
    • Development board:
  • CSM32RV20
    • Core design:
      • 1 x RV32IMAC core of unknown design.
    • Key stats:
      • 2.6 CoreMark/MHz.
    • Development board:

Bonus: Other SoCs that don't match the above criteria or where there's insufficient info

  • The Espressif ESP32-P4 was announced, featuring a dual-core 400MHz RISC-V CPU with "an AI instructions extension". I look forward to incorporating it into the list above when more information is available.
  • I won't try to enumerate every use of RISC-V in chips that aren't programmable by end users or where development boards aren't available, but it's worth noting the use of RISC-V in Google's Titan M2.
  • In January 2022 Intel Mobileye announced the EyeQ Ultra featuring 12 RISC-V cores (of unknown design), but there hasn't been any news since.

Article changelog
  • 2023-03-19: Add note on silicon bug in the Renesas RZ/Five.
  • 2023-03-12: Clarified details of several SiFive cores and made the listing of SoCs using cores derived from open source RTL exhaustive.
  • 2023-03-04:
    • Added in the CSM32RV20 and some extra Bouffalo Lab and WCH chips (thanks to Reddit user 1r0n_m6n).
    • Confirmed likely core design in the K510 (thanks to Reddit user zephray_wenting).
    • Further Espressif information and new ESP32-C2 entry contributed by Ivan Grokhotkov.
    • Clarified cores in the JH7110 and JH7100 (thanks to Conor Dooley for the tip).
    • Added T-Head C910-ICE (thanks to a tip via email).
  • 2023-03-03:
    • Added note about CoreMark scores.
    • Added the Renesas R9A02G020 (thanks to Giancarlo Parodi for the tip!).
    • Various typo fixes.
    • Add link to PicoCom RISC-V Summit talk and clarify the ISA extensions supported by the PC802 Andes N25F clusters.
  • 2023-03-03: Initial publication date.

March 03, 2023 12:00 PM

February 28, 2023

Lucas Fryzek

Journey Through Freedreno

Android running Freedreno
Android running Freedreno

As part of my training at Igalia I’ve been attempting to write a new backend for Freedreno that targets the proprietary “KGSL” kernel mode driver. For those unaware, there are two “main” kernel mode drivers for the GPU on Qualcomm SoCs: “MSM” and “KGSL”. “MSM” is DRM compliant, and Freedreno is already able to run on it. “KGSL” is the proprietary KMD that Qualcomm’s proprietary userspace driver targets. Now why would you want to run Freedreno against KGSL when MSM exists? Well, there are a few reasons. First, MSM only really works on an upstream kernel, so if you have to run a downstream kernel you can continue using the version of KGSL that the manufacturer shipped with your device. Second, this allows you to run both the proprietary Adreno driver and the open source Freedreno driver on the same device just by swapping libraries, which can be very nice for quickly testing something against both drivers.

When “DRM” isn’t just “DRM”

When working on a new backend, one of the critical things to do is to make use of as much “common code” as possible. This has a number of benefits, not least reducing the amount of code you have to write. It also reduces the number of bugs that will likely exist, as you are relying on well-tested code, and it ensures that the backend will most likely continue to work with new driver updates.

When I started the work for a new backend I looked inside Mesa’s src/freedreno/drm folder. This has the current backend code for Freedreno, and it’s already modularized to support multiple backends. It currently has support for the above-mentioned MSM kernel mode driver as well as virtio (a backend that allows Freedreno to be used from within a virtualized environment). From the name of this path, you would think that the code in this module would only work with kernel mode drivers that implement DRM, but actually there are only a handful of places in this module where DRM support is assumed. This made it a good starting point to introduce the KGSL backend and piggyback off the common code.

For example, the drm module has a lot of code to deal with the management of synchronization primitives, buffer objects, and command submit lists, all managed at an abstraction level above “DRM”; re-implementing this code would be a bad idea.

How to get Android to behave

One of the big struggles with getting the KGSL backend working was figuring out how to get Android to load Mesa instead of the Qualcomm blob driver that is shipped with the device image. Thankfully, a good chunk of this work had already been figured out when the Turnip developers (Turnip is the open source Vulkan implementation for Adreno GPUs) got Turnip running on Android with KGSL. Fortunately, one of my coworkers, Danylo, is one of those Turnip developers, and he gave me a lot of guidance on getting Android set up. One thing to watch out for is the outdated instructions here. These instructions almost work, but require some modifications. First, if you’re using a more modern version of the Android NDK, the compiler has been replaced with LLVM/Clang, so you need to change which compiler is being used. Second, flags like system in the cross compiler script incorrectly set the system as linux instead of android. I had success using the below cross compiler script. Note that the compiler paths need to be updated to match where you extracted the Android NDK on your system.

ar = '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar'
c = ['ccache', '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang']
cpp = ['ccache', '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++', '-fno-exceptions', '-fno-unwind-tables', '-fno-asynchronous-unwind-tables', '-static-libstdc++']
c_ld = 'lld'
cpp_ld = 'lld'
strip = '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-strip'
# Android doesn't come with a pkg-config, but we need one for Meson to be happy not
# finding all the optional deps it looks for.  Use system pkg-config pointing at a
# directory we get to populate with any .pc files we want to add for Android
pkgconfig = ['env', 'PKG_CONFIG_LIBDIR=/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/pkgconfig:/home/lfryzek/Documents/projects/igalia/freedreno/install-android/lib/pkgconfig', '/usr/bin/pkg-config']

system = 'android'
cpu_family = 'arm'
cpu = 'armv8'
endian = 'little'

Another thing I had to figure out with Android, which was different from those instructions, was how to get Android to load the Mesa versions of the driver libraries. That’s when my colleague Mark pointed out to me that Android is open source and I could just check the source code myself. Sure enough, you can find the OpenGL driver loader in Android’s source code. From this code we can see that Android will try to load a few different files based on some settings; in my case it would try to load 3 different shared libraries in the /vendor/lib64/egl folder. I could just replace these libraries with the versions built from Mesa and voilà, you’re now loading a custom driver! The realization that I could just “read the code” was very powerful in debugging some more Android-specific issues I ran into, like dealing with gralloc.
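As a sketch of that library swap, the push commands would look something like the following. The Mesa install directory and the library names here are hypothetical (the real names the Android GL loader looks for depend on device properties), and the commands are printed rather than executed since they require a rooted device with a writable /vendor:

```shell
# Hypothetical Mesa build output directory and vendor driver names;
# adjust both to match what the GL loader expects on your device.
MESA_LIBS=install-android/lib
for lib in libEGL_mesa.so libGLESv1_CM_mesa.so libGLESv2_mesa.so; do
  # Print the adb command that would replace each vendor EGL library
  echo "adb push $MESA_LIBS/$lib /vendor/lib64/egl/$lib"
done
```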

Something cool that the open source Freedreno & Turnip driver developers figured out was getting Android to run test OpenGL applications from the adb shell without building Android APKs. If you check out the freedreno repo, it has a script that can build the tests in the tests-* folders. The nice benefit of this is that it provides an easy way to run simple test cases without worrying about the Android window system integration. Another nifty feature of this repo is the libwrap tool, which lets you trace the commands being submitted to the GPU.

What even is Gralloc?

Gralloc is the graphics memory allocator in Android, and the OS will use it to allocate the surfaces for “windows”. This means that the memory we want to render to is managed by gralloc and not our KGSL backend, so we have to get all the information about a surface from gralloc. If you look in src/egl/drivers/dri2/platform_android.c you will see existing code for handling gralloc. You would think “Hey, there is no work for me here then”, but you would be wrong. The handle gralloc provides is hardware specific, and the code in platform_android.c assumes a DRM gralloc implementation. Thankfully the Turnip developers had already gone through this struggle, and if you look in src/freedreno/vulkan/tu_android.c you can see they have implemented a separate path for when a Qualcomm msm implementation of gralloc is detected. I could copy this detection logic and add a separate path to platform_android.c.

Working with the Freedreno community

When working on any project (open source or otherwise), it’s nice to know that you aren’t working alone. Thankfully the #freedreno IRC channel is very active and full of helpful people happy to answer any questions you may have. While working on the backend, one area I wasn’t really sure how to address was the synchronization code for buffer objects. The backend exposes a function called cpu_prep, which was just there to call the DRM implementation of cpu_prep on the buffer object. I wasn’t exactly sure how to implement this functionality with KGSL, since it doesn’t use DRM buffer objects.

I ended up reaching out on the IRC channel, and Rob Clark explained to me that he was actually working on moving a lot of the code for cpu_prep into common code, so that a non-DRM driver (like the KGSL backend I was working on) would just need to implement that operation as a NOP (no operation).

Dealing with bugs & reverse engineering the blob

I encountered a few different bugs when implementing the KGSL backend, but most of them came down to me calling KGSL wrong or handling synchronization incorrectly. Thankfully, since Turnip already runs on KGSL, I could just compare my code more carefully to what Turnip does and figure out my logical mistake.

Some of the bugs I encountered required the backend interface in Freedreno to be modified to expose a new per-driver implementation of a backend function, instead of just using a common implementation. For example, the existing function to map a buffer object into userspace assumed that the same fd used for the device could be used for the buffer object in the mmap call. This worked fine for any buffer objects we created through KGSL, but would not work for buffer objects created from gralloc (remember the above section on surface memory for windows coming from gralloc). To resolve this issue I exposed a new per-backend implementation of “map”, where I could take a different path if the buffer object came from gralloc.

While testing the KGSL backend I did encounter a new bug that seems to affect both my new KGSL backend and the Turnip KGSL backend. The bug is an iommu fault that occurs when the surface allocated by gralloc does not have a height that is aligned to 4. The blitting engine on a6xx GPUs copies in 16x4 chunks, so if the height is not aligned to 4 the GPU will try to write to pixels that exist outside the allocated memory. This issue only happens with KGSL backends, since we import memory from gralloc, and gralloc allocates exactly enough memory for the surface, with no alignment on the height. On any other platform, the fdl (Freedreno Layout) code would be called to compute the minimum required size for a surface, which takes the alignment requirement for the height into account. The Qualcomm blob driver doesn’t seem to have this problem, even though it gets the exact same buffer from gralloc, so it must be doing something different to handle the non-aligned height.
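The round-up itself is simple arithmetic; here is one way to express it (the sample height is made up for illustration):

```shell
# Round a surface height up to the next multiple of 4, so the 16x4
# blitter chunks never step outside the allocation.
height=1081                       # hypothetical gralloc-provided height
aligned=$(( (height + 3) & ~3 ))  # round up to a multiple of 4
echo "$aligned"                   # prints 1084
```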

Because this issue relies on gralloc, the application needs to be running as an Android APK to get a surface from gralloc. The best way to fix this issue would be to figure out what the blob driver is doing and try to replicate that behavior in Freedreno (assuming it isn’t doing something silly like switching to sysmem rendering). Unfortunately, it didn’t look like the libwrap library worked for tracing an APK.

The libwrap library relies on a Linux feature known as LD_PRELOAD to load when the application starts and replace system functions like open and ioctl with its own implementations that trace what is being submitted to the KGSL kernel mode driver. Thankfully Android exposes this LD_PRELOAD mechanism through its “wrap” interface, where you create a property called wrap.<app-name> with a value of LD_PRELOAD=<path to>. Android will then load your library as would be done in a normal Linux shell. If you try this with libwrap, though, you’ll find very quickly that you get corrupted traces. When Android launches your APK, it doesn’t only launch your application; there are different threads for various Android system-related functions, and some of them can also use OpenGL. The libwrap library is not designed to handle multiple threads using KGSL at the same time. After discovering this issue I created an MR that stores the tracing file handles in TLS (thread-local storage), preventing the clobbering of the trace file and also allowing you to view the traces generated by different threads separately from each other.
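Setting such a wrap property might look like the following sketch. The package name and on-device library path are hypothetical, and the command is printed rather than run, since it requires an attached device:

```shell
# Hypothetical package name and on-device library location
APP=com.example.gltest
PRELOAD=/data/local/tmp/libwrap.so

# Android applies the value of the wrap.<package> property to the app's
# environment at launch, which is how LD_PRELOAD reaches the APK's process.
echo "adb shell setprop wrap.$APP LD_PRELOAD=$PRELOAD"
```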

With this in hand, one can begin investigating what the blob driver is doing to handle these unaligned surfaces.

What’s next?

Well, the next obvious thing to fix is the unaligned height issue, which is still open. I’ve also been working on upstreaming my changes with this WIP MR.

Freedreno running 3d-mark
Freedreno running 3d-mark

February 28, 2023 08:02 PM

Iago Toral

SuperTuxKart Vulkan vs OpenGL and Zink status on Raspberry Pi 4

SuperTuxKart Vulkan vs OpenGL

The latest SuperTuxKart release comes with an experimental Vulkan renderer, and I was eager to check it out on my Raspberry Pi 4 and see how well it worked.

The short story is that, while I have only tested a few tracks, it seems to perform really well overall. In my tests, even with a debug build of Mesa, I saw the FPS ranging from 60 to 110 depending on the track. The game might actually be able to produce more than 110 fps: since various tracks reached exactly 110 fps, I think the limiting factor here was the display.

I was then naturally interested in comparing this to the GL renderer and I was a bit surprised to see that, with the same settings, the GL renderer would be somewhere in the 8-20 fps range for the same tracks. The game was clearly hitting a very bad path in the GL driver so I had to fix that before I could make a fair comparison between both.

A perf session quickly pointed me to the issue: Mesa has code to transparently translate vertex attribute formats that are not natively supported to a supported format. While this is great for compatibility it is obviously going to be very slow. In particular, SuperTuxKart uses rgba16f and rg16f with some vertex buffers and Mesa was silently translating these to 32-bit counterparts because the GL driver was not advertising support for the 16-bit variants. The hardware does support 16-bit floating point vertex attributes though, so this was very easy to fix.

The Vulkan driver was exposing support for this already, which explains the dramatic difference in performance between both drivers. Indeed, with that change SuperTuxKart now plays smooth on OpenGL too, with framerates always above 30 fps and up to 110 fps depending on the track. We should probably have an option in Mesa to make this kind of under-the-hood compatibility translations more obvious to users so we can catch silly issues like this more easily.

With that said, even if GL is now a lot better, Vulkan is still ahead by quite a lot, producing 35-50% better framerate than OpenGL depending on the track, at least for the tracks that don’t hit the 110 fps mark, which as I said above, looks like it is a display maximum, at least with my setup.

Zink status on Raspberry Pi 4

During my presentation at XDC last year I mentioned Zink wasn’t supported on Raspberry Pi 4 any more due to feature requirements we could not fulfill.

In the past, Zink used to abort when it detected unsupported features, but it seems this policy has been changed and now it simply drops a warning and points to the possibility of incorrect rendering as a result.

Also, I have been talking to zmike about one of the features we could not support natively: scalarBlockLayout. The issue with this is that we can’t make it work with vectors in all cases, and the only alternative for us would be to scalarize everything through a lowering, which would probably have a performance impact. However, zmike confirmed that Zink is already doing this, so in practice we would not see vector load/stores from Zink, in which case it should work fine.

So with all that in mind, I did give Zink a go and indeed, I get the warning that we don’t support scalar block layouts (and some other feature I don’t remember now), but otherwise it mostly works. It is not as stable as the native driver, and some things that work with the native driver don’t work with Zink at present; examples I saw include the WebGL Aquarium demo in Chromium and SuperTuxKart.

As far as performance goes, it has been a huge leap from when I tested it maybe 2 years ago. With VkQuake3’s OpenGL renderer, performance with Zink used to be ~40% of the native OpenGL driver, but it is now on par with it, if not a tiny bit better, so kudos to zmike and all the other contributors to Zink for all the work they put into this over the last 2 years; it really shows.

With all that said, I didn’t do too much testing with Zink myself so if anyone here decides to give it a more thorough go, please let me know how it went in the comments.

by Iago Toral at February 28, 2023 11:00 AM

February 27, 2023

Alex Bradbury

2023Q1 week log

I tend to keep quite a lot of notes on the development-related work (sometimes at work, sometimes not) I do on a week-by-week basis, and thought it might be fun to write up the parts that were public. This may or may not be of wider interest, but it aims to be a useful aide-mémoire for my purposes at least. Weeks with few entries might be due to focusing on downstream work (or perhaps just a less productive week - I am only human!).

Week of 13th March 2023

  • Most importantly, added some more footer images for this site from the Quick Draw dataset. Thanks to my son (Archie, 5) for the assistance.
  • Reviewed submissions for EuroLLVM (I'm on the program committee).
  • Added note to the commercially available RISC-V silicon post about a hardware bug in the Renesas RZ/Five.
  • Finished writing and published what's new for RISC-V in LLVM 16 article and took part in some of the discussions in the HN and Reddit threads (it's on too, but that didn't generate any comments).
  • Investigated an issue where inline asm with the m constraint was generating worse code on LLVM vs GCC, finding that LLVM conservatively lowers this to a single register, while GCC treats m as reg+imm, relying on users indicating A when using a memory operand with an instruction that can't take an immediate offset. Worked with a colleague who posted D146245 to fix this.
  • Set agenda for and ran the biweekly RISC-V LLVM contributor sync call as usual.
  • Bisected reported LLVM bug #61412, which as it happens was fixed that evening by D145474 being committed. We hope to backport this to 16.0.1.
  • Did some digging on a regression (compiler crash) for -Oz, bisecting it to the commit that enabled machine copy propagation by default. I found the issue was due to machine copy propagation running after the machine outliner, and incorrectly determining that some register writes in outlined functions were not live-out. I posted and landed D146037 to fix this by running machine copy propagation earlier in the pipeline, though a more principled fix would be desirable.
  • Filed a PR against the riscv-isa-manual to disambiguate the use of the term "reserved" for HINT instructions. I've also been looking at the proposed bfloat16 extension recently and filed an issue to clarify if Zfbfinxmin will be defined (as all the other floating point extensions so far have an *inx twin).
  • Almost finished work to resolve issues related to overzealous error checking on RISC-V ISA naming strings (with llvm-objdump and related tools being the final piece).
    • Landed D145879 and D145882 to expand RISCVISAInfo test coverage and fix an issue that surfaced through that.
    • Posted a pair of patches that makes llvm-objdump and related tools tolerant of unrecognised versions of ISA extensions. D146070 resolves this for the base ISA in a minimally invasive way, while D146114 solves this for other extensions, moving the parsing logic to using the parseNormalizedArchString function I introduced to fix a similar issue in LLD. This built on some directly committed work to expand testing.
  • The usual assortment of upstream LLVM reviews.
  • LLVM Weekly #480.

Week of 6th March 2023

Week of 27th February 2023

  • Completed (to the point I was happy to publish at least) my attempt to enumerate the commercially available RISC-V SoCs. I'm very grateful to have received a whole range of suggested additions and clarifications over the weekend, which have all been incorporated.
  • Ran the usual biweekly RISC-V LLVM sync-up call. Topics included outstanding issues for LLVM 16.x (no major issues now that my backport request to fix an LLD regression was merged), an overview of _Float16 ABI lowering fixes, GP relaxation in LLD, my recent RISC-V buildbot, and some vectorisation related issues.
  • Investigated and largely resolved issues related to ABI lowering of _Float16 for RISC-V. Primarily, we weren't handling the cases where a GPR+FPR or a pair of FPRs is used to pass small structs including _Float16.
    • Part of this work involved rebasing my previous patches to refactor our RISC-V ABI lowering tests in Clang. Now that a version of my improvements to --function-signature (required for the refactor) landed as part of D144963, this can hopefully be completed.
    • Committed a number of simple test improvements related to half floats. e.g. 570995e, 81979c3, 34b412d.
    • Posted D145070 to add proper coverage for _Float16 ABI lowering, and D145074 to fix it. Also D145071 to set the HasLegalHalfType property, but the semantics of that are less clear.
    • Posted a strawman psABI patch for __bf16, needed for the RISC-V bfloat16 extension.
  • Attended the Cambridge RISC-V Meetup.
  • After seeing the Helix editor discussed on, retried my previously shared large Markdown file test case. Unfortunately it's still unusably slow to edit, seemingly due to a tree-sitter related issue.
  • Cleaned up the static site generator used for this site a bit. e.g. now that my fixes (#157, #158, #159) for the traverse() helper in mistletoe were merged upstream, I removed my downstream version.
  • The usual mix of upstream LLVM reviews.
  • Had a day off for my birthday.
  • Publicly shared this week log for the first time.
  • LLVM Weekly #478.

Week of 20th February 2023

Week of 13th February 2023

Week of 6th February 2023

Article changelog
  • 2023-03-20: Added notes for the week of 13th March 2023.
  • 2023-03-13: Added notes for the week of 6th March 2023.
  • 2023-03-06: Added notes for the week of 27th February 2023.
  • 2023-02-27: Added in a forgotten note about trivial buildbot doc improvements.
  • 2023-02-27: Initial publication date.

February 27, 2023 12:00 PM

February 23, 2023

Eric Meyer

A Leap of Decades

I’ve heard it said there are two kinds of tech power users: the ones who constantly update to stay on the bleeding edge, and the ones who update only when absolutely forced to do so.  I’m in the latter camp.  If a program, setup, or piece of hardware works for me, I stick by it like it’s the last raft off a sinking island.

And so it has been for my early 2013 MacBook Pro, which has served me incredibly well across all those years and many continents, but was sliding into the software update chasm: some applications, and for that matter its operating system, could no longer be run on its hardware.  Oh and also, the top row of letter keys was becoming unresponsive, in particular the E-R-T sequence.  Which I kind of need if I’m going to be writing English text, never mind reloading pages and opening new browser tabs.

Stepping Up

An early 2013 MacBook Pro sitting on a desk next to the box of an early 2023 MacBook Pro, the latter illuminated by shafts of sunlight.
The grizzled old veteran on the verge of retirement and the fresh new recruit that just transferred in to replace them.

So on Monday, I dropped by the Apple Store and picked up a custom-built early 2023 MacBook Pro: M2 Max with 38 GPU cores, 64GB RAM, and 2TB SSD.  (Thus quadrupling the active memory and nearly trebling the storage capacity of its predecessor.)  I went with that balance, or perhaps imbalance, because I intend to have this machine last me another ten years, and in that time, RAM is more likely to be in demand than SSD.  If I’m wrong about that, I can always plug in an external SSD.  Many thanks to the many people in my Mastodon herd who nudged me in that direction.

I chose the 14” model over the 16”, so it is a wee bit smaller than my old 15” workhorse.  The thing that surprises me is the new machine looks boxier, somehow.  Probably it’s that the corners of the case are not nearly as rounded as the 2013 model, and I think the thickness ratio of display to body is closer to 1:1 than before.  It isn’t a problem or anything, it’s just a thing that I notice.  I’ll probably forget about it soon enough.

Some things I find mildly-to-moderately annoying:

  • DragThing doesn’t work any more.  It had stopped being updated before the 64-bit revolution, never mind the shift to Apple silicon, so this was expected, but wow do I miss it.  Like a stick-shift driver uselessly stomping the floorboards and blindly grasping air while driving an automatic car, I still flip the mouse pointer toward the right edge of the screen, where I kept my DragThing dock, before remembering it’s gone.  I’ve looked at alternatives, but none of them seem like they’re meant as straight-up replacements, so I’ve yet to commit to one.  Maybe some day I’ll ask Daniel to teach me Swift so I can build my own. (Because I definitely need more demands on my time.)
  • The twisty arrows in the Finder to open and close folders don’t have enough visual weight.  Really, the overall UI feels like a movie’s toy representation of an operating system, not an actual operating system.  I mean, the visual presentation of the OS looks like something I would create, and brother, that is not a compliment.
  • The Finder’s menu bar has no visually distinct background.  What the hell.  No, seriously, what the hell?  The Notch I’m actually okay with, but removing the distinction between the active area of the menu bar and the inert rest of the desktop seems… ill-advised.  Do not like.  HARK, A FIX: Cory Birdsong pointed me to “System Settings… > Accessibility > Display > Reduce Transparency”, which fixes this, over on Mastodon.  Thanks, Cory!
  • I’m not used to the system default font(s) yet, which I imagine will come with time, but still catches me here and there.
  • The alert and other systems sounds are different, and I don’t like them.  Sosumi.

Oh, and it’s weird to me that the Apple logo on the back of the display doesn’t glow.  Not annoying, just weird.

Otherwise, I’m happy with it so far.  Great display, great battery life, and the keyboard works!

Getting Migratory

The 2013 MBP was backed up nightly to a 1TB Samsung SSD, so that was how I managed the migration: plugged the SSD into the new MBP and let Migration Assistant do its thing.  This got me 90% of the way there, really.  The remaining 10% is what I’ll talk about in a bit, in case anyone else finds themselves in a similar situation.

The only major hardware hurdle I hit was that my Dell U2713HM monitor, also of mid-2010s vintage, seems to limit HDMI signals to 1920×1080 despite supposedly supporting HDMI 1.4, which caught me by surprise.  When connected to a machine via DisplayPort, even my 2013 MBP, the Dell will go up to 2560×1440.  The new MBP only has one HDMI port and three USB-C ports.  Fortunately, the USB-C ports can be used as DisplayPorts, so I acquired a DisplayPort–to–USB-C cable and that fixed the situation right up.

Yes, I could upgrade to a monitor that supports USB-C directly, but the Dell is a good size for my work environment, it still looks pretty good, and did I mention I’m the cling-tightly-to-what-works kind of user?

Otherwise, in the hardware space, I’ll have to figure out how I want to manage connecting all the USB-A devices I have (podcasting microphone, wireless headset, desktop speaker, secondary HD camera, etc., etc.) to the USB-C ports.  I expected that to be the case, just as I expected some applications would no longer work.  I expect an adapter cable or two will be necessary, at least for a while.

Trouble Brewing

I said earlier that Migration Assistant got me 90% of the way to being switched over.  Were I someone who doesn’t install stuff via the Terminal, I suspect it would have been 100% successful, but I’m not, so it wasn’t.  As with the cables, I anticipated this would happen.  What I didn’t expect was that covering that last 10% would take me only an hour or so of actual work, most of it spent waiting on downloads and installs.

First, the serious and quite unexpected problem: my version of Homebrew used an old installation prefix, one that could break newer packages.  So, I needed to migrate Homebrew itself from /usr/local to /opt/homebrew.  Some searching around indicated that the best way to do this was uninstall Homebrew entirely, then install it fresh.

Okay, except that would also remove everything I’d installed with Homebrew.  Which was maybe not as much as some of y’all, but it was still a fair number of fairly essential packages.  When I ran brew list, I got over a hundred packages, of which most were dependencies.  What I found through further searching was that brew leaves returns a list of the packages I’d installed, without their dependencies.  Here’s what I got:


That felt a lot more manageable.  After a bit more research, boiled down to its essentials, the New Brew Shuffle I came up with was:

$ brew leaves > brewlist.txt

$ /bin/bash -c "$(curl -fsSL"

$ xcode-select --install

$ /bin/bash -c "$(curl -fsSL"

$ xargs brew install < brewlist.txt

The above does elide a few things.  In step two, the Homebrew uninstall script identified a bunch of directories that it couldn’t remove, and would have to be deleted manually.  I saved all that to a text file (thanks to Warp’s “Copy output” feature) for later study, and pressed onward.  I probably also had to sudo some of those steps; I no longer remember.

In addition to all the above, I elected to delete a few of the packages in brewlist.txt before I fed it back to brew install in the last step — things like ckan, left over from my Kerbal Space Program days  —  and to remove the version dependencies for PHP and Python.  Overall, the process was pretty smooth.  I just had to sit there and watch Homebrew chew through all the installs, including all the dependencies.


Once all the reinstalls from the last step had finished, I was left with a few things to clean up.  For example, Python didn’t seem to have installed.  Eventually I realized it had actually installed as python3 instead of just plain python, so that was mostly fine and I’m sure there’s a way to alias one to the other that I might get around to looking up one day.

Ruby also didn’t seem to reinstall cleanly: there was a library it was looking for that complained about the chip architecture, and attempts to overcome that spawned even more errors, none of which were decipherable to me or my searches.  Multiple attempts at uninstalling and then reinstalling Ruby through a variety of means, some with Homebrew, some other ways, either got me the same indecipherable errors or a whole new set of indecipherable errors.  In the end, I just uninstalled Ruby, as I don’t actually use it for anything I’m aware of, and the default Ruby that comes with macOS is still there.  If I run into some script I need for work that requires something more, I’ll revisit this, probably with many muttered imprecations.

Finally, httpd wasn’t working as intended.  I could launch it with brew services start httpd, but the resulting server was pointing to a page that just said “It works!”, and not bringing up any of my local hosts.  Eventually, I found where Homebrew had stuffed httpd and its various files, and then replaced its configuration files with my old configuration files.  Then I went through the cycle of typing sudo apachectl start, addressing the errors it threw over directories or PHP installs or whatever by editing httpd.conf, and then trying again.

After only three or four rounds of that, everything was up and running as intended  —  and as a bonus, I was able to mark httpd as a Login item in the Finder’s System Settings, so it will automatically come back up whenever I reboot!  Which my old machine wouldn’t do, for some reason I never got around to figuring out.

Now I just need to decide what to call this thing.  The old MBP was “CoCo”, as in the TRS-80 Color Computer, meant as a wry commentary on the feel of the keyboard and a callback to the first home computer I ever used.  That joke still works, but I’m thinking the new machine will be “C64” in honor of the first actually powerful home computer I ever used and its 64 kilobytes of RAM.  There’s a pleasing echo between that and the 64 gigabytes of RAM I now have at my literal fingertips, four decades later.

Now that I’m up to date on hardware and operating system, I’d be interested to hear what y’all recommend for good quality-of-life improvement applications or configuration changes.  Link me up!

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at February 23, 2023 05:02 PM

February 20, 2023

Emmanuele Bassi

Writing Bindable API, 2023 Edition

First of all, you should go on the gobject-introspection website and read the page on how to write bindable API. What I’m going to write here is going to build upon what’s already documented, or will update the best practices, so if you maintain a GObject/C library, or you’re writing one, you must be familiar with the basics of gobject-introspection. It’s 2023: it’s already too bad we’re still writing C libraries, we should at the very least be responsible about it.

A specific note for people maintaining an existing GObject/C library with an API designed before the mainstream establishment of gobject-introspection (basically, anything written prior to 2011): you should really consider writing all new types and entry points with gobject-introspection in mind, and you should also consider phasing out older API and replacing it piecemeal with a bindable one. You should have done this 10 years ago, and I can already hear the objections, but: too bad. Just because you made an effort 10 years ago it doesn’t mean things are frozen in time, and you don’t get to fix things. Maintenance means constantly tending to your code, and that doubly applies if you’re exposing an API to other people.

Let’s take the “how to write bindable API” recommendations, and elaborate them a bit.

Structures with custom memory management

The recommendation is to use GBoxed as a way to specify a copy and a free function, in order to clearly define the memory management semantics of a type.

The important caveat is that boxed types are necessary for:

  • opaque types that can only be heap allocated
  • using a type as a GObject property
  • using a type as an argument or return value for a GObject signal

You don’t need a boxed type for the following cases:

  • your type is an argument or return value for a method, function, or virtual function
  • your type can be placed on the stack, or can be allocated with malloc()/free()

Additionally, starting with gobject-introspection 1.76, you can specify the copy and free function of a type without necessarily registering a boxed type, which leaves boxed types for the things they were created for: signals and properties.

Addendum: object types

Boxed types should only ever be used for plain old data types; if you need inheritance, then the strong recommendation is to use GObject. You can use GTypeInstance, but only if you know what you’re doing; for more information on that, see my old blog post about typed instances.

Functionality only accessible through a C macro

This ought to be fairly uncontroversial. C pre-processor symbols don’t exist at the ABI level, and gobject-introspection is a mechanism to describe a C ABI. Never, ever expose API only through C macros; those are for C developers. C macros can be used to create convenience wrappers, but remember that anything they call must be public API, and that other people will need to re-implement the convenience wrappers themselves, so don’t overdo it. C developers deserve some convenience, but not at the expense of everyone else.

Addendum: inline functions

Static inline functions are also not part of the introspectable ABI of a library, because they cannot be used with dlsym(); you can provide inlined functions for performance reasons, but remember to always provide their non-inlined equivalent.

Direct C structure access for objects

Again, another fairly uncontroversial rule. You shouldn’t be putting anything into an instance structure, as it makes your API harder to future-proof, and direct access cannot do things like change notification, or memoization.

Always provide accessor functions.

Variadic arguments

Variadic argument functions are mainly a C convenience. Yes, some languages can support them, but it’s a bad idea to have this kind of API exposed as the only way to do things.

Any variadic argument function should have two additional variants:

  • a vector based version, using C arrays (zero terminated, or with an explicit length)
  • a va_list version, to be used when creating wrappers with variadic arguments themselves

The va_list variant is kind of optional, since not many people go around writing variadic argument C wrappers, these days, but at the end of the day you might be going to write an internal function that takes a va_list anyway, so it’s not particularly strange to expose it as part of your public API.

The vector-based variant, on the other hand, is fundamental.

Incidentally, if you’re using variadic arguments as a way to collect similarly typed values, e.g.:

void
some_object_method (SomeObject *self,
                    ...) G_GNUC_NULL_TERMINATED;

some_object_method (obj, "foo", "bar", "baz", NULL);

there’s very little difference to using a vector and C99’s compound literals:

void
some_object_method (SomeObject *self,
                    const char *args[]);

some_object_method (obj, (const char *[]) {
  "foo", "bar", "baz", NULL,
});

Except that now the compiler will be able to do some basic type checking and scream at you if you’re doing something egregiously bad.

Compound literals and designated initialisers also help when dealing with key/value pairs:

typedef struct {
  int column;
  union {
    const char *v_str;
    int v_int;
  } value;
} ColumnValue;

enum {
  COLUMN_NAME,
  COLUMN_AGE
};

void
some_object_method (SomeObject *self,
                    size_t n_columns,
                    const ColumnValue values[]);

some_object_method (obj, 2,
  (ColumnValue []) {
    { .column = COLUMN_NAME, .value = { .v_str = "Emmanuele" } },
    { .column = COLUMN_AGE, .value = { .v_int = 42 } },
  });

So you should seriously reconsider the amount of variadic arguments convenience functions you expose.

Multiple out parameters

Using a structured type with an out direction is a good recommendation, as a way to both limit the number of out arguments and provide some future-proofing for your API. It’s easy to expand an opaque pointer type with accessors, whereas adding more out arguments requires an ABI break.

Addendum: inout arguments

Don’t use in-out arguments. Just don’t.

Pass an in argument to the callable for its input, and take an out argument or a return value for the output.

Memory management and ownership of inout arguments is incredibly hard to capture with static annotations; it mainly works for scalar values, so:

some_object_update_matrix (SomeObject *self,
                           double *xx,
                           double *yy,
                           double *xy,
                           double *yx)

can work with xx, yy, xy, yx as inout arguments, because there’s no ownership transfer; but as soon as you start throwing in things like pointers to structures, or vectors of strings, you open yourself to questions like:

  • who allocates the argument when it goes in?
  • who is responsible for freeing the argument when it comes out?
  • what happens if the function frees the argument in the in direction and then re-allocates the out?
  • what happens if the function uses a different allocator than the one used by the caller?
  • what happens if the function has to allocate more memory?
  • what happens if the function modifies the argument and frees memory?

Even if gobject-introspection nailed down the rules, they could not be enforced, or validated, and could lead to leaks or, worse, crashes.

So, once again: don’t use inout arguments. If your API already exposes inout arguments, especially for non-scalar types, consider deprecations and adding new entry points.

Addendum: GValue

Sadly, GValue is one of the most notable cases of inout abuse. The oldest parts of the GNOME stack use GValue in a way that requires inout annotations because they expect the caller to:

  • initialise a GValue with the desired type
  • pass the address of the value
  • let the function fill the value

The caller is then left with calling g_value_unset() in order to free the resources associated with a GValue. This means that you’re passing an initialised value to a callable, the callable will do something to it (which may or may not even entail re-allocating the value) and then you’re going to get it back at the same address.

It would be a lot easier if the API left the job of initialising the GValue to the callee; then functions could annotate the GValue argument with out and caller-allocates=1. This would leave the ownership to the caller, and remove a whole lot of uncertainty.

Various new (comparatively speaking) API allow the caller to pass an uninitialised GValue, and will leave initialisation to the callee, which is how it should be, but this kind of change isn’t always possible in a backward compatible way.

Arrays

You can use three types of C arrays in your API:

  • zero-terminated arrays, which are the easiest to use, especially for pointers and strings
  • fixed-size arrays
  • arrays with length arguments

Addendum: strings and byte arrays

A const char* argument for C strings with a length argument is not an array:

/**
 * some_object_load_data:
 * @self: ...
 * @str: the data to load
 * @len: length of @str in bytes, or -1
 * ...
 */
void
some_object_load_data (SomeObject *self,
                       const char *str,
                       ssize_t len)

Never annotate the str argument with array length=len. Ideally, this kind of function should not exist in the first place. You should always use const char* for NUL-terminated strings, possibly UTF-8 encoded; if you allow embedded NUL characters then use a bytes array:

/**
 * some_object_load_data:
 * @self: ...
 * @data: (array length=len) (element-type uint8): the data to load
 * @len: the length of the data in bytes
 * ...
 */
void
some_object_load_data (SomeObject *self,
                       const unsigned char *data,
                       size_t len)

Instead of unsigned char you can also use uint8_t, just to drive the point home.

Yes, it’s slightly nicer to have a single entry point for strings and byte arrays, but that’s just a C convenience: decent languages will have a proper string type, which always comes with a length; and string types are not binary data.

Addendum: GArray, GPtrArray, GByteArray

Whatever you do, however low you feel on the day, whatever particular tragedy befell your family at some point, please: never use GLib array types in your API. Nothing good will ever come of it, and you’ll just spend your days regretting this choice.

Yes: gobject-introspection transparently converts between GLib array types and C types, to the point of allowing you to annotate the contents of the array. The problem is that that information is static, and only exists at the introspection level. There’s nothing that prevents you from putting other random data into a GPtrArray, as long as it’s pointer-sized. There’s nothing that prevents a version of a library from saying that you own the data inside a GArray, and have the next version assign a clear function to the array to avoid leaking it all over the place on error conditions, or when using g_autoptr.

Adding support for GLib array types in the introspection was a well-intentioned mistake that worked in very specific cases—for instance, in a library that is private to an application. Any well-behaved, well-designed general purpose library should not expose this kind of API to its consumers.

You should use GArray, GPtrArray, and GByteArray internally; they are good types, and remove a lot of the pain of dealing with C arrays. Those types should never be exposed at the API boundary: always convert them to C arrays, or wrap them into your own data types, with proper argument validation and ownership rules.

Addendum: GHashTable

What’s worse than a type that contains data with unclear ownership rules decided at run time? A type that contains twice the amount of data with unclear ownership rules decided at run time.

Just like the GLib array types, hash tables should be used but never directly exposed to consumers of an API.

Addendum: GList, GSList, GQueue

See above, re: pain and misery. On top of that, linked lists are a terrible data type that people should rarely, if ever, use in the first place.

Callbacks

Your callbacks should always be in the form of a simple callable with a data argument:

typedef void (* SomeCallback) (SomeObject *obj,
                               gpointer data);

Any function that takes a callback should also take a “user data” argument that will be passed as is to the callback:

// scope: call; the callback data is valid until the
// function returns
some_object_do_stuff_immediately (SomeObject *self,
                                  SomeCallback callback,
                                  gpointer data);

// scope: notify; the callback data is valid until the
// notify function gets called
some_object_do_stuff_with_a_delay (SomeObject *self,
                                   SomeCallback callback,
                                   gpointer data,
                                   GDestroyNotify notify);

// scope: async; the callback data is valid until the async
// callback is called
some_object_do_stuff_but_async (SomeObject *self,
                                GCancellable *cancellable,
                                GAsyncReadyCallback callback,
                                gpointer data);

// not pictured here: scope forever; the data is valid for
// the entirety of the process lifetime

If your function takes more than one callback argument, you should make sure that it also takes a different user data for each callback, and that the lifetime of the callbacks are well defined. The alternative is to use GClosure instead of a simple C function pointer—but that comes at a cost of GValue marshalling, so the recommendation is to stick with one callback per function.

Addendum: the closure annotation

It seems that many people are unclear about the closure annotation.

Whenever you’re describing a function that takes a callback, you should always annotate the callback argument with the argument that contains the user data using the (closure argument) annotation, e.g.

 * some_object_do_stuff_immediately:
 * @self: ...
 * @callback: (scope call) (closure data): the callback
 * @data: the data to be passed to the @callback
 * ...

You should not annotate the data argument with a unary (closure).

The unary (closure) is meant to be used when annotating the callback type:

 * SomeCallback:
 * @self: ...
 * @data: (closure): ...
 * ...
typedef void (* SomeCallback) (SomeObject *self,
                               gpointer data);

Yes, it’s confusing, I know.

Sadly, the introspection parser isn’t very clear about this, but in the future it will emit a warning if it finds a unary closure on anything that isn’t a callback type.

Ideally, you don’t really need to annotate anything when you call your argument user_data, but it does not hurt to be explicit.

A cleaned up version of this blog post will go up on the gobject-introspection website, and we should really have a proper set of best API design practices on the Developer Documentation website by now; nevertheless, I do hope people will actually follow these recommendations at some point, and that they will be prepared for new recommendations in the future. Only dead and unmaintained projects don’t change, after all, and I expect the GNOME stack to last a bit longer than the 25 years it already spans today.

by ebassi at February 20, 2023 12:35 AM

February 16, 2023

Philippe Normand

WebRTC in WebKitGTK and WPE, status updates, part I

Some time ago we at Igalia embarked on the journey to ship a GStreamer-powered WebRTC backend. This is a long journey; it is not over, but we have made some progress. This post is the first of a series providing some insights into the challenges we are facing and the plans for the next release cycle(s).

Most web engines nowadays bundle a version of LibWebRTC, which is indeed a pragmatic approach. WebRTC is a huge spec, spanning many protocols, RFCs and codecs. LibWebRTC is in fact a multimedia framework on its own, and it’s a very big code-base. I still remember the surprised face of Emilio at the 2022 WebEngines conference when I told him we had unusual plans regarding WebRTC support in the GStreamer WebKit ports. There are several reasons for this plan, explained in the WPE FAQ. We worked on a LibWebRTC backend for the WebKit GStreamer ports, which my colleague Thibault Saunier blogged about, but unfortunately this backend has remained disabled by default and not shipped in the tarballs, for the reasons explained in the WPE FAQ.

The GStreamer project nowadays provides a library and a plugin allowing applications to interact with third-party WebRTC actors. This is in my opinion a paradigm shift, because it enables new ways of interoperability between the so-called Web and traditional native applications. Since the GstWebRTC announcement back in late 2017 I’ve been experimenting with the idea of shipping an alternative to LibWebRTC in WebKitGTK and WPE. The initial GstWebRTC WebKit backend was merged upstream on March 18, 2022.

As you might already know, before any audio/video call, your browser might ask permission to access your webcam/microphone and even during the call you can now share your screen. From the WebKitGTK/WPE perspective the procedure is the same of course. Let’s dive in.

WebCam/Microphone capture

Back in 2018 for the LibWebRTC backend, Thibault added support for GStreamer-powered media capture to WebKit, meaning that capture devices such as microphones and webcams would be accessible from WebKit applications using the getUserMedia spec. Under the hood, a GStreamer source element is created, using the GstDevice API. This implementation is now re-used for the GstWebRTC backend, it works fine, still has room for improvements but that’s a topic for a follow-up post.

A MediaStream can be rendered in an <audio> or <video> element through a custom GStreamer source element that we also provide in WebKit. This is all internally wired up so that the following JS code will trigger the WebView into natively capturing and rendering a webcam device using GStreamer:

    navigator.mediaDevices.getUserMedia({ video: true, audio: false }).then((mediaStream) => {
        const video = document.querySelector('video');
        video.srcObject = mediaStream;
        video.onloadedmetadata = () => {
            video.play();
        };
    });
When this WebPage is rendered and after the user has granted access to capture devices, the GStreamer backends will create not one, but two pipelines.

[Diagram: capture pipeline feeding the playback pipeline: mediastreamsrc → playbin3 → decodebin3 → webkitglsink]

The first pipeline routes video frames from the capture device, using pipewiresrc, to an appsink. From the appsink, our capturer, leveraging the Observer design pattern, notifies its observers. In this case there is only one observer, a GStreamer source element internal to WebKit called mediastreamsrc. The playback pipeline shown above is heavily simplified; in reality more elements are involved, but what matters most is that, thanks to the flexibility of GStreamer, we can leverage the existing MediaPlayer backend that we at Igalia have been maintaining for more than 10 years to render MediaStreams. All we needed was a custom source element; the rest of our MediaPlayer didn’t need many changes to support this use-case.

One notable change we have made since the initial implementation, though, is that for us a MediaStream can be raw, encoded, or even encapsulated in an RTP payload. So, depending on which component is going to render the MediaStream, we have enough flexibility to allow zero-copy in most scenarios. In the example above, the stream will typically be raw from source to renderer. However, some webcams can provide encoded streams. WPE and WebKitGTK will be able to internally leverage these and, in some cases, allow for direct streaming from the hardware device to an outgoing PeerConnection without third-party encoding.

Desktop capture

There is another JS API that allows capturing your screen or a window, called getDisplayMedia, and yes, we support it too! Thanks to the ground-breaking progress the Linux desktop has made in recent years, with PipeWire and xdg-desktop-portal, we can now stream your favorite desktop environment over WebRTC. Under the hood, when the WebView is granted access to desktop capture through the portal, our backend creates a pipewiresrc GStreamer element configured to source from the file descriptor provided by the portal, and we have a healthy raw video stream.
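From the page's point of view, this is just the standard Screen Capture API. Here is a minimal sketch; getDisplayMedia() is the real entry point, but taking mediaDevices and the target <video> element as parameters is purely for illustration:

```javascript
// Sketch: request a screen/window stream and render it in a <video> element.
// navigator.mediaDevices.getDisplayMedia() is the standard API; the parameters
// here are illustrative, not part of any WebKit-specific interface.
async function shareScreen(mediaDevices, videoElement) {
  // Under WebKitGTK/WPE this pops the xdg-desktop-portal picker; once the user
  // chooses a screen or window, the stream is backed by pipewiresrc.
  const stream = await mediaDevices.getDisplayMedia({ video: true });
  videoElement.srcObject = stream; // render it like any other MediaStream
  return stream;
}
```

In a page you would call it as `shareScreen(navigator.mediaDevices, document.querySelector('video'))`.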

Here’s a demo:

WebAudio capture

What’s more, yes, you can also create a MediaStream from a WebAudio node. On the backend side, the GStreamerMediaStreamAudioSource fills GstBuffers from the audio bus channels and notifies the third parties internally observing the MediaStream, such as outgoing media sources, or simply an <audio> element that was configured to source from the given MediaStream. I have no demo for this, you’ll have to take my word.
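Well, here is at least what the page side of that path looks like. MediaStreamAudioDestinationNode is the standard bridge from a WebAudio graph to a MediaStream; the oscillator is just an example source, and parameterizing on the AudioContext is for illustration only:

```javascript
// Sketch: turn a WebAudio graph into a MediaStream. The destination node's
// .stream can be assigned to <audio>.srcObject or sent over a PeerConnection.
function streamFromWebAudio(audioCtx) {
  const osc = audioCtx.createOscillator();            // any audio graph works here
  const dest = audioCtx.createMediaStreamDestination(); // WebAudio → MediaStream bridge
  osc.connect(dest);
  osc.start();
  return dest.stream;
}
```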

Canvas capture

But wait, there is more. Did I hear canvas? Yes, we can feed your favorite <canvas> to a MediaStream. The JS API is called captureStream; its code is actually cross-platform but defers to the HTMLCanvasElement::toVideoFrame() method, which has a GStreamer implementation. The code is not the most optimal yet, though, due to shortcomings of our current graphics pipeline implementation. Here is a demo of Canvas to WebRTC running in the WebKitGTK MiniBrowser:

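The page-side code behind such a demo is tiny. A sketch, with the element wiring parameterized purely for illustration:

```javascript
// Sketch: captureStream() makes the canvas a live video source; each repaint
// becomes a frame of the returned MediaStream (via toVideoFrame() in the
// GStreamer ports). The 25 fps default is an arbitrary example value.
function previewCanvas(canvas, videoElement, fps = 25) {
  const stream = canvas.captureStream(fps);
  videoElement.srcObject = stream; // or hand the stream to a PeerConnection
  return stream;
}
```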

So we’ve got MediaStream support covered. This is only one part of the puzzle though. We are now facing challenges on the PeerConnection implementation. MediaStreams are cool, but it’s even better when you can share them with your friends on the fancy A/V conferencing websites; we’re not entirely ready for this yet in WebKitGTK and WPE. For this reason, WebRTC is not yet enabled by default in the upcoming WebKitGTK and WPE 2.40 releases. We’re just not there yet. In the next part of this series I’ll tackle the PeerConnection backend, which we’re working hard on these days, both in WebKit and in GStreamer.

Happy hacking and, as always, all my gratitude goes to my fellow Igalia comrades for allowing me to keep working on these domains, and to Metrological for funding some of this work. Is your organization or company interested in leveraging modern WebRTC APIs from WebKitGTK and/or WPE? If so, please get in touch with us in order to help us speed up the implementation work.

by Philippe Normand at February 16, 2023 08:30 PM

February 15, 2023

Clayton Craft

On meetings...

A developer's nightmare: a day dominated by meetings.

Now, this isn't at all what my typical day at Igalia is like... In fact, while it does still happen (out of necessity), it's rare here. And I find that quite amazing when compared to my time at a previous employer. Anyways, because of past experiences at a previous employer, I've always been on this mission to remove or reduce my need to attend meetings, as much as possible. Mainly because I often think I have better things I could be using my time for at the exact day/time of each meeting 😅. It's not really a secret that meetings are expensive. In case it's not clear yet: I don't like to have a lot of meetings! 1

Aaanyways, this happened to me recently2, my day was dominated by a series of meetings. And it got me thinking about how this could be improved. I'm pretty sure that none of the meetings could have been canceled, or that I could have skipped any of them. So what else could be done if the meeting must go on and I am required to be there?

Two possible things come to mind: 1) reduce the meeting time, 2) improve the "quality"3 of the topics/material covered. Those two things aren't always directly related: you can obviously have a 1 hour meeting that covers exactly 1 topic in excruciating detail, or you could have a 30 minute meeting that covers 10 topics poorly. But if you can effectively cover (for example) 5 topics on average in a 1 hour meeting and you only have 2 topics on the agenda, your meeting time is probably going to be shorter. Wow! Or if you can cover 5 topics in a full 1 hour meeting and it's "engaging" and all that (i.e. high "quality"), then I'm probably going to feel like my time was well spent there. Or if I look at the topics ahead of time then I might feel a little more comfortable skipping the meeting if I had to, or if my attendance is optional.

I think that sounds nice... So how could you do that consistently, or at least more consistently? I guess the meeting organizer could follow something like this before every meeting:

  1. Do some pre-screening of topics, and pick the ones that are most relevant / unresolved.

  2. Of those, pick the ones that have the most info available at meeting time, so they're ready (or as close to being so) to discuss "live" in a meeting.

  3. Then, allow any time-sensitive topics to preempt any/all topics on the agenda, if necessary.
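The triage steps above, sketched as code. The topic fields (relevant, ready, timeSensitive) are invented names for illustration; there is no real postmarketOS tool behind this:

```javascript
// Sketch of the agenda triage described above: time-sensitive topics preempt
// everything; otherwise only relevant, unresolved topics that are "ready" to
// discuss live make the cut. Field names are hypothetical.
function buildAgenda(topics) {
  const urgent = topics.filter((t) => t.timeSensitive);
  const regular = topics.filter((t) => !t.timeSensitive && t.relevant && t.ready);
  return [...urgent, ...regular]; // urgent items jump the queue
}
```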

If I'm the organizer, I don't want to do all of that by myself. I am probably not knowledgeable in every possible topic that might come up, and it might be hard for me to determine if a topic is "ready" enough to spend meeting time on. And it's even harder to gently, and regularly, push other people into helping along with some process like that.

I was looking over the list of topics in an upcoming postmarketOS core team meeting, and realized that we might have accidentally4 stumbled on a neat way to achieve something kinda like the above process, but more naturally:

The postmarketOS core team maintains a collection of topics for future team meetings in a shared notes thing, and any participant can propose topics, ask questions about a topic, etc for some number of days/weeks beforehand. That's it.

As a result, sometimes topics are:

  • added, because there's almost always something new to talk about on the team
  • dropped from the list, because (for example) an issue or decision was resolved "offline" and it's not necessary to discuss in a meeting
  • modified, with clarifications or new information being added to support the topic's discussion in a meeting

On the day of the meeting, we quickly run through all of the proposed topics to set the agenda, moving any that aren't "ready" to some future meeting. Then we start going through the agenda. It's probably obvious but if some time-sensitive topic came in then we'd usually agree to discuss it regardless of what we had already planned.

In a way, we're "crowdsourcing"5 this pre-meeting work to all the participants, who probably want to use their time during the meeting effectively, or not have the meeting if it's not necessary. So, in theory at least, they might be motivated to help make that happen. And it's neat that all of this can happen before the meeting, completely asynchronously between everyone.

There's definitely a lot more to actually running an effective meeting beyond having a "good agenda". Our meetings still go off-topic at times. But I feel that this is some improvement, when we all have an opportunity to help shape the topics/agenda this way. I'll try to spend a little more time ahead of postmarketOS team meetings thinking about the proposed topics and asking questions, or proposing my own topics. Because it can happen asynchronously, I can do this whenever I feel like it. And if there's a chance it helps reduce the overall meeting time or it results in a more effective meeting, why not?

1 Do you?

2 To be fair, I didn't feel like any of the meetings I attended were poorly run or were a waste of time, but sometimes it's fun to think about how things could be improved further. Ya, I have a weird definition of "fun".

3 By "quality", I mean that topics are well-researched and presented by someone knowledgeable on the subject, folks are at least a little aware of what the topic is, and time was spent trying to anticipate questions that others might have. And so on. This is hard to achieve.

4 Or someone much smarter than me did this intentionally, for the same reasons, and I'm just now realizing it. If they did then ssshhhhhhh let me have my moment.

5 Ugghhhh.....

by Unknown at February 15, 2023 12:00 AM

February 14, 2023

Guilherme Piccoli

Debugging early boot issues on ARM64 (without console!)

In the last blog post, about booting the upstream kernel on Inforce 6640, I mentioned that there was an interesting issue with ftrace that led to a form of debug-by-rebooting approach, and that is worth a blog post. Of course it only took me 1 year+ to write it (!) – so apologies for this … Continue reading "Debugging early boot issues on ARM64 (without console!)"

by gpiccoli at February 14, 2023 10:04 PM

February 13, 2023

Cathie Chen

How does ResizeObserver get garbage collected in WebKit?

ResizeObserver is an interface provided by browsers to detect size changes of a target. It calls a JS callback when the size changes. In the callback function you can do anything, including deleting the target. So how are the ResizeObserver-related objects managed?

ResizeObserver related objects

Let’s take a look at a simple example.
<div id="target"></div>
{
  var ro = new ResizeObserver( entries => {});
  ro.observe(document.getElementById("target"));
} // end of the scope
  • ro: a JSResizeObserver, which is the JS wrapper of ResizeObserver,
  • callback: a JSResizeObserverCallback,
  • entries: JSResizeObserverEntry objects,
  • observe() would create a ResizeObservation,
  • and then there are the document and the target.
So how are these objects organized? Let’s take a look at the code.
ResizeObserver and Document
Creating a ResizeObserver needs a Document, which is stored in WeakPtr<Document, WeakPtrImplWithEventTargetData> m_document;.
On the other hand, when observe() is called, m_document->addResizeObserver(*this) stores the ResizeObserver in Vector<WeakPtr<ResizeObserver>> m_resizeObservers;.
So ResizeObserver and Document each hold the other by WeakPtr.
ResizeObserver and Element
When observe() is called, a ResizeObservation is created and stored in the ResizeObserver in Vector<Ref<ResizeObservation>> m_observations;.
ResizeObservation holds the Element by WeakPtr: WeakPtr<Element, WeakPtrImplWithEventTargetData> m_target.
On the other hand, target.ensureResizeObserverData() makes the Element create a ResizeObserverData, which holds the ResizeObserver by WeakPtr: Vector<WeakPtr<ResizeObserver>> observers;.
So the connection between ResizeObserver and Element goes through WeakPtr in both directions.

Keep JSResizeObserver alive

Both Document and Element hold ResizeObserver by WeakPtr, so how do we keep ResizeObserver alive and have it released properly?
In the example, what happens outside the scope? Per [1], there are two mechanisms:
  • Visit Children – when JavaScriptCore’s garbage collection visits some JS wrapper during the marking phase, it can visit another JS wrapper or JS object that needs to be kept alive.
  • Reachable from Opaque Roots – tell JavaScriptCore’s garbage collection that a JS wrapper is reachable from an opaque root which was added to the set of opaque roots during the marking phase.
To keep JSResizeObserver itself alive, WebKit uses the second mechanism, “Reachable from Opaque Roots”, through a custom isReachableFromOpaqueRoots. It checks the targets of m_observations, m_activeObservationTargets, and m_targetsWaitingForFirstObservation: if any target containsWebCoreOpaqueRoot, the JSResizeObserver won’t be released. Note the use of GCReachableRef, which means those targets won’t be released either. The lifetime of m_activeObservationTargets spans from gatherObservations to deliverObservations, and the lifetime of m_targetsWaitingForFirstObservation spans from observe() to the first deliverObservations. So JSResizeObserver won’t be released if the observed targets are alive, if it has size-change observations not yet delivered, or if it has any target not delivered at all.
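From the page's point of view, this is the guarantee that makes code like the following safe: nothing in JS retains the observer, yet it must keep firing as long as its observed target is alive. The function and argument names here are made up for illustration:

```javascript
// No reference to the observer is kept; WebKit keeps the wrapper alive
// because it is reachable from the observed target (an opaque root).
function watchSize(target, onResize) {
  new ResizeObserver((entries) => onResize(entries)).observe(target);
}
```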


ResizeObservation is owned by ResizeObserver, so it will be released if ResizeObserver is released.

Keep `JSCallbackDataWeak* m_data` in `JSResizeObserverCallback` alive

Though ResizeObserver holds the ResizeObserverCallback by RefPtr, it is an IsWeakCallback.

The JSCallbackDataWeak* m_data in JSResizeObserverCallback is not kept alive together with JSResizeObserver.
Taking a closer look at JSCallbackDataWeak, there is JSC::Weak<JSC::JSObject> m_callback;.

To keep the callback alive, ResizeObserver uses the first mechanism, “Visit Children”.
In JSResizeObserver::visitAdditionalChildren, it adds m_callback to the visitor; see:
void JSCallbackDataWeak::visitJSFunction(Visitor& visitor)


Like JSResizeObserver and the callback, JSResizeObserverEntry makes sure the target and contentRect won’t be released while it is alive:
void JSResizeObserverEntry::visitAdditionalChildren(Visitor& visitor)
    addWebCoreOpaqueRoot(visitor, wrapped().target());
    addWebCoreOpaqueRoot(visitor, wrapped().contentRect());
ResizeObserverEntry is RefCounted:
class ResizeObserverEntry : public RefCounted<ResizeObserverEntry>
It is created in ResizeObserver::deliverObservations and passed to the JS callback; if the JS callback doesn’t keep a reference to it, it is released when the function finishes.

by cchen at February 13, 2023 02:33 PM

February 09, 2023

Samuel Iglesias

X.Org Developers Conference 2023 in A Coruña, Spain

Some weeks ago, Igalia announced publicly that we will host X.Org Developers Conference 2023 (XDC 2023) in A Coruña, Spain. If you remember, we also organized XDC 2018 in this beautiful city in the northwest of Spain (I hope you enjoyed it!)

A Coruña

Since the announcement, I can now confirm that the conference will be in the awesome facilities of Palexco conference center, at the city center of A Coruña, Spain, from 17th to 19th of October 2023.

Palexco conference center

We are going to set up the website soon, and prepare everything to open the Call For Papers in the coming weeks. Stay tuned!

XDC 2023 is a three-day conference full of talks and workshops related to the open-source graphics stack: from Wayland to X11, from DRM/KMS to Mesa drivers, toolkits, libraries… you name it! This is the go-to conference if you are involved in the development of any part of the open-source graphics stack. Don’t miss it!

Igalia logo

February 09, 2023 01:15 PM

Brian Kardell

A Few Good Lies

A Few Good Lies

On Wolvic’s new UA string in 1.3, things working, and frustrating intertwined problems.

Ok, let’s play a game… Imagine that you look at your server logs and see this UA string. Which browser sent it to you?

Mozilla/5.0 (Linux; Android 10; Quest 2) AppleWebKit/537.36 (KHTML, like Gecko) OculusBrowser/ SamsungBrowser/4.0 Chrome/106.0.5249.79 VR Safari/537.36

I won’t blame you if you don’t immediately recognize the browser that sends that string by default (it’s the OculusBrowser). It’s already a giant set of lies and there’s no intuitive way to decide which one to believe.

But... Maybe you should believe none of them! Wolvic, for example, will send that string sometimes. And sometimes the OculusBrowser will send some other string. Eek.

Okay, maybe that was just a particularly complicated example. How about this one:

Mozilla/5.0 (Android 10; Mobile VR; rv:81.0) Gecko/81.0 Firefox/81.0

Ah, Firefox for Android? Sort of! Firefox Reality used to send that. Yes, it was for AOS-based devices, but not really for your Pixel phone’s Android.

Alright, alright — this one should be easy given the last one, right? Look how similar it is…

Mozilla/5.0 (Android 10; Mobile VR; rv:105.0) Gecko/105.0 Firefox/105.0

That is… Wolvic.

At least by default. Anything where one of those numbers was over ‘81’ was probably Wolvic. At least before this week’s release.

Starting with this release, Wolvic will normally send:

Mozilla/5.0 (Android 10; Mobile VR; rv:105.0) Gecko/105.0 Firefox/105.0 Wolvic/1.3

Ah finally - the text “Wolvic” appears in the string! But also, still, some lies. And then, of course, sometimes we’ll have to tell a completely different lie.
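For anyone sniffing server logs, the only reliable signal in that new default string is the trailing "Wolvic/x.y" token; everything before it is compatibility fiction. A sketch of such a check (illustrative, not an official parser):

```javascript
// Return the Wolvic version from a UA string, or null if the token is absent.
// Note this only identifies Wolvic 1.3+; older builds are indistinguishable
// from Firefox Reality by UA alone, as described above.
function detectWolvic(ua) {
  const m = /\bWolvic\/(\d+(?:\.\d+)*)/.exec(ua);
  return m ? m[1] : null;
}
```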

But... I mean...WHO CARES, right!?

You might imagine that nobody cares about the UA string, but you would be very wrong. Everybody seems to care about it, somehow.

The lies we need to tell ourselves

It’s not just about an app that suddenly starts refusing to service HTTPS requests, except to tell them to use a different browser (maybe, for example, the one they just so happen to make).

It’s all kinds of things that people try to reason about that ultimately somehow involves the UA string — and it affects so much.

Maybe, for example, you use some CMS or tool that lets you think in simpler terms: Is it “mobile” or is it a “tablet” or is it a “desktop”? Cool, but how does it know, and by what definition? Does that definition hold forever?

Or maybe you use a library which tries to make it simple to give all different devices as good an experience as they can. Again, there is some kind of attempt at categorizing in order to determine the ‘best’ experience — and it often involves (just a little maybe!) the UA string. And again, does that hold up forever?

Even if this works perfectly with every device in the whole world (it probably doesn’t!), as soon as something changes, or doesn’t neatly fit those definitions anymore, this is probably going to give users a pretty bad result.

And guess what? We’re all users! You can imagine how it might be frustrating if you are, for example, a user of some new device or browser. Like, hypothetically, one who has bought an XR device, running an immersive operating system, and browsing the web in a browser built for that, and the cool immersive experience link someone sent you is giving you the non-immersive experience instead. The site thinks your totally capable browser just isn’t.

That kind of shit happens, a lot, and when it happens in popular libraries or CMSes or whatever, it can affect thousands of sites.

In fact, that’s a thing that happened with Wolvic. A lot.

You know who users get mad at when that happens? Not the site, nor the library. They get mad at the browser. When that happens, browsers will start lying to your site because that makes the user happy, and stops their hate. At the end of the day, making users happy is their job.

And this isn't a new thing. My colleague Eric Meyer wrote a great post about something similar from 2002, and we did a whole Igalia chats episode about all of this.

"Making users happy is their job" is tricky though, right? Because in our case, the site or CMS or library maker probably introduced something there because they cared about users. They wanted to make sure that some browser that falls through the logic cracks gets at least the basic mobile experience or something. Which is great, until it sucks.


Even if you aren’t looking at the UA string functionally somehow, you probably still care about it for functional reasons, if you think about it.

Let’s just take a for example: Should you care about how your site looks on these XR headsets? Is it worth doing something specially interesting on those? How do you decide what is worth your energy? Probably by asking “how many users of that are there?” in general, and more specifically “how many users are using that to visit my website?”

In both cases, you’re probably relying on the UA string. Either by looking through your own server logs and trying to see, or else by looking at market share numbers, which are literally a summary of other people’s server logs.

But, it’s possible you have 1000 people all visiting your site with Wolvic or the OculusBrowser right now, and you see none of that in your logs, because the browser makers have already figured out that we have to lie about what they are.

So, this is an especially frustrating thing for new browsers and new classes of devices.

Say, for example, you’re building a new browser. Maybe you have a logo that looks like a Wolf. Hypothetically. And you’re trying to fund a handful of developers to implement features which will serve the needs of your users.

The main way that a browser can pull in some income to pay said developers, historically, is by signing a default/integrated search deal with a search engine. However, this presents something of a chicken-and-egg problem. To be a worthwhile deal, the search engine wants to know what kind of market share you have — in other words, how many users? And, well, they’re certainly not going to trust your private data on this. Metrics based on the UA string are the most obvious and public ones, but as a new browser you have to lie about what you are so that your users can, you know, use websites.

Sooooo… Cool, cool, cool, cool.

Brooklyn Nine Nine Jake saying cool cool meme

I wish none of this was the way it was, but I also don’t know how to fix it except to say “we really need to talk about the funding model”.

So, please be careful. Try not to make it worse.

Use Wolvic if you can.

Test with it if you can.

Please definitely report bugs, but also please be patient with us if you’re a user, we’re fighting systemic problems sometimes.

Maybe toss some money at our collective efforts if you can and help instigate the kind of change we need… or, if not, maybe just tell someone about it.


February 09, 2023 05:00 AM

Clayton Craft

FOSDEM 2023!

This was my first time attending FOSDEM, and it was a really neat conference! I saw a few talks, though not as many as I had originally planned, because I decided to try and prioritize meeting people and hanging out with them over going to talks that I could watch online. So I ended up mostly attending talks with others I knew, as a workaround 😅

I don't have many friends, so getting to see some for the first time in person was amazing. And I hopefully made some new ones too!

It was a fairly long trip for a 2 day conference: ~20 hours of travel to get there, and ~24 hours of travel home. I added a couple of extra days afterwards to stick around Brussels and hang out with some friends, which was totally the right decision.

I'm looking forward to seeing folks again, and encourage others who are fortunate enough to have a chance to attend conferences/meet-ups to go, or consider doing it!

My talk

Here's a talk I did, my first one in a looong time. Link to description


most of the postmarketOS team in front of the Linux Mobile stand

group of mobile devices on the Linux Mobile stand (photo by: A-wai)

recording postmarketOS podcast in person for the first time

aftermath of reaching #avocadosnumber (photo by: Caleb)

Belgium waffles are something else... mmmmmmm

scene from a 1 day 'pmOS hackathon'

labne w/ poached egg, starting a long layover

view of St Helens as I'm coming home

by Unknown at February 09, 2023 12:00 AM

February 08, 2023

Eric Meyer

CSS Wish List 2023

Dave asked people to share their CSS wish lists for 2023, and even though it’s well into the second month of the year, I’m going to sprint wild-eyed out of the brush along the road, grab the hitch on the back of the departing bandwagon, and try to claw my way aboard. At first I thought I had maybe four or five things to put on my list, but as I worked on it, I kept thinking of one more thing and one more thing until eventually I had a list of (checks notes) sixt — no, SEVENTEEN?!?!?  What the hell.

There’s going to be some overlap with the things being worked on for Interop 2023, and I’m sure there will be overlap with other peoples’ lists.  Regardless, here we go.


Back in the day, I asserted Grid should wait for subgrid.  I was probably wrong about that, but I wasn’t wrong about the usefulness of and need for subgrid.  Nearly every time I implement a design, I trip over the lack of widespread support.

I have a blog post in my head about how I hacked around this problem by applying the same grid column template to nested containers, and how I could make it marginally more efficient with variables.  I keep not writing it, because it would show the approach and improvement and then mostly be about the limitations, flaws, and annoyances this approach embodies.  The whole idea just depresses me, and I would probably become impolitic.

So instead I’ll just say that I hope those browser engines that have yet to catch up with subgrid support will do so in 2023.

Masonry layout

Grid layout is great, and you can use it to create a masonry-style layout, but having a real masonry layout mechanism would be better, particularly if you can set it on a per-axis basis.  What I mean is, you could define a bunch of fixed (or flexible) columns and then say the rows use masonry layout.  It’s not something I’m likely to use myself, but I always think the more layout possibilities there are, the better.

Grid track styles

For someone who doesn’t do a ton of layout, I find myself wanting to style grid lines a surprising amount.  I just want to do something like:

display: grid;
gap: 1em;
grid-rules-columns: 1px dotted red;

…although it would be much better to be able to style separators for a specific grid track, or even an individual grid cell, as well as be able to apply it to an entire grid all at once.

No, I don’t know exactly how this should work.  I’m an idea guy!  But that’s what I want.  I want to be able to define what separator lines look like between grid tracks, centered on the grid lines.

Anchored positioning

I’ve wanted this in one form or another almost since CSS2 was published.  The general idea is, you can position an element in relation to the edges of another element that isn’t a containing block.  I wrote about this a bit in my post on connector lines, but another place it would have come in handy was with the footnotes on The Effects of Nuclear Weapons.

See, I wanted those to actually be sidenotes, Tufteee-styleee.  Right now, in order to make sidenotes, you have to stick the footnote into the text, right where its footnote reference appears  —  or, at a minimum, right after the element containing the footnote reference.  Neither was acceptable to me, because it would dork up the source text.

What I wanted to be able to do was collect all the footnotes as endnotes at the end of the markup (which we did) and then absolutely position each to sit next to the element that referenced them, or have it pop up there on click, tap, hover, whatever.  Anchored positioning would make that not just possible, but fairly easy to do.

Exclusions


Following on anchored positioning, I’d love to have CSS Exclusions finally come to browsers.  Exclusions are a way to mark an element to have other content avoid it.  You know how floats move out of the normal flow, but normal-flow text avoids overlapping them?  That’s an exclusion.  Now imagine being able to position an element by other means, whether grid layout or absolute positioning or whatever, and then say “have the content of other elements flow around it”.  Exclusions!  See this article by Rob Weychert for a more in-depth explanation of a common use case.

Element transitions

The web is cool and all, but you know how futuristic interfaces in movies have pieces of the interface sliding and zooming and popping out and all that stuff?  Element transitions.  You can already try them out in Chrome Canary, Batman, and I’d love to see them across the board.  Even more, I’d love some gentle, easy-to-follow tutorials on how to make them work, because even in their single-page form, I found the coding requirements basically impossible to work out.  Make them all-CSS, and explain them like I’m a newb, and I’m in.

Nested Selectors

A lot of people I know are still hanging on to preprocessors solely because they permit nested selectors, like:

main {
	padding: 1em;
	background: #F1F1F0;
	h2 {
		border-block-end: 1px solid gray;
	}
	p {
		text-indent: 2em;
	}
}
The CSS Working Group has been wrestling with this for quite some time now, because it turns out the CSS parsing rules make it hard to just add this, there are a lot of questions about how this should interact with pseudo-classes like :is(), there are serious concerns about doing this in a way that will be maximally future-compatible, and also there has been a whole lot of argument over whether it’s okay to clash with Sass syntax or not.

So it’s a difficult thing to make happen in native CSS, and the debates are both wide-ranging and slow, but it’s on my (and probably nearly everyone else’s) wish list.  You can try it out in Safari Technology Preview as I write this, so here’s hoping for accelerating adoption!

More and better :has()

Okay, since I’m talking about selectors already, I’ll throw in universal, full-featured, more optimized support for :has().  One browser doesn’t support compound selectors, for example.  I’ve also thought that maybe some combinators would be nice, like letting a:has(> b) be written as a < b.

But I also wish for people to basically go crazy with :has().  There’s SO MUCH THERE.  There are so many things we can do with it, and I don’t think we’ve touched even a tiny fraction of the possibility space.

More attr()

I’ve wanted attr() to be more widely accepted in CSS values since, well, I can’t remember.  A long time.  I want to be able to do something like:

p[data-size] {width: attr(data-size rem);}

<p data-size="27">…</p>

Okay, not a great example, but it conveys the idea.  I also talked about this in my post about aligning table columns.  I realize adding this would probably lead to someone creating a framework called Headgust where all the styling is jammed into a million data-* attributes and the whole of the framework’s CSS is nothing but property: attr() declarations for every single CSS property known to man, but we shouldn’t let that stop us.

Variables in media queries

Basically I want to be able to do this:

:root {--mobile: 35em;}

@media (min-width: var(--mobile)) {
	/* non-mobile styles go here */
}
That’s it.  This was made possible in container queries, I believe, so maybe it can spread to media (and feature?) queries.  I sure hope so!

Logical modifiers

You can do this:

p {margin-block: 1em; margin-inline: 1rem;}

But you can’t do this:

p {margin: logical 1em 1rem;}

I want to be able to do that.  We should all want to be able to do that, however it’s made possible.

Additive values

You know how you can set a bunch of values with a comma-separated list, but if you want to override just one of them, you have to do the whole thing over?  I want to be able to add another thing to the list without having to do the whole thing over.  So rather than adding a value like this:

background-clip: content, content, border, padding; /* needed to add padding */

…I want to be able to do something like:

background-clip: add(padding);

No, I don’t know how to figure out where in the list it should be added.  I don’t know a lot of things.  I just know I want to be able to do this.  And also to remove values from a list in a similar way, since I’m pony-wishing.

Color shading and blending

Preprocessors already allow you to say you want the color of an element to be 30% lighter than it would otherwise be.  Or darker.  Or blend two colors together.  Native CSS should have the same power.  It’s being worked on.  Let’s get it done, browsers.

Hanging punctuation

Safari has supported hanging-punctuation forever (where “forever”, in this case, means since 2016) and it’s long past time for other browsers to get with the program.  This should be universally supported.

Cross-boundary styles

I want to be able to apply styles from my external (or even embedded) CSS to a resource like an external SVG.  I realize this sets up all kinds of security and privacy concerns.  I still want to be able to do it.  Every time I have to embed an entire inline SVG into a template just so I can change the fill color of a logo based on its page context, I grit my teeth just that little bit harder.  It tasks me.

Scoped styling (including imports)

The Mirror Universe version of the previous wish is that I want to be able to say a bit of CSS, or an entire style sheet (embedded or external), only applies to a certain DOM node and all its descendants. “But you can do that with descendant selectors!” Not always.  For that matter, I’d love to be able to just say:

<div style="@import(styles.css);">

…and have that apply to that <div> and its descendants, as if it were an <iframe>, while not being an <iframe> so styles from the document could also apply to it.  Crazy?  Don’t care.  Still want it.

Linked flow regions(?)

SPECIAL BONUS TENTATIVE WISH: I didn’t particularly like how CSS Regions were structured, but I really liked the general idea.  It would be really great to be able to link elements together, and allow the content to flow between them in a “smooth” manner.  Even to allow the content from the second region to flow back into the first, if there’s room for it and nothing prevents it.  I admit, this is really a “try to recreate Aldus PageMaker in CSS” thing for me, but the idea still appeals to me, and I’d love to see it come to CSS some day.

So there you go.  I’d love to hear what you’d like to see added to CSS, either in the comments below or in posts of your own.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at February 08, 2023 06:16 PM

February 07, 2023

Andy Wingo

whippet: towards a new local maximum

Friends, you might have noted, but over the last year or so I really caught the GC bug. Today's post sums up that year, in the form of a talk I gave yesterday at FOSDEM. It's long! If you prefer video, you can have a look instead at the FOSDEM event page.

Whippet: A New GC for Guile

4 Feb 2023 – FOSDEM

Andy Wingo

Guile is...

Mostly written in Scheme

Also a 30 year old C library

// API
SCM scm_cons (SCM car, SCM cdr);

// Many third-party users
SCM x = scm_cons (a, b);

So the context for the whole effort is that Guile has this part of its implementation which is in C. It also exposes a lot of that implementation to users as an API.

Putting the C into GC

SCM x = scm_cons (a, b);

Live objects: the roots, plus anything a live object refers to

How to include x into roots?

  • Refcounting
  • Register (& later unregister) &x with gc
  • Conservative roots

So what constraints does this kind of API impose on the garbage collector?

Let's start by considering the simple cons call above. In a garbage-collected environment, the GC is responsible for reclaiming unused memory. How does the GC know that the result of a scm_cons call is in use?

Generally speaking there are two main strategies for automatic memory management. One is reference counting: you associate a count with an object, incremented once for each referrer; in this case, the stack would hold a reference to x. When removing the reference, you decrement the count, and if it goes to 0 the object is unused and can be freed.

We GC people used to laugh at reference-counting as a memory management solution because it over-approximates the live object set in the presence of cycles, but it would seem that refcounting is coming back. Anyway, this isn't what Guile does, not right now anyway.
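
For concreteness, the refcounting strategy described above looks something like this in miniature (the names and representation here are illustrative, not Guile's actual API):

```c
#include <stdlib.h>

/* Hypothetical refcounted object header; names are invented
   for illustration, not Guile's actual representation. */
typedef struct Obj {
    size_t refcount;
    /* ...payload... */
} Obj;

static Obj *obj_new(void) {
    Obj *o = malloc(sizeof(Obj));
    o->refcount = 1;            /* the creating reference */
    return o;
}

static void obj_ref(Obj *o) { o->refcount++; }

static void obj_unref(Obj *o) {
    if (--o->refcount == 0)
        free(o);                /* no referrers left: reclaim */
}
```

Note that nothing here handles cycles: two dead objects pointing at each other keep each other's counts above zero forever, which is exactly the over-approximation mentioned above.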

The other strategy we can use is tracing: the garbage collector periodically finds all of the live objects on the system and then recycles the memory for everything else. But how to actually find the first live objects to trace?

One way is to inform the garbage collector of the locations of all roots: references to objects originating from outside the heap. This can be done explicitly, as in V8's Handle<> API, or implicitly, in the form of a side table generated by the compiler associating code locations with root locations. This is called precise rooting: the GC is aware of all root locations at all code positions where GC might happen. Generally speaking you want the side table approach, in which the compiler writes out root locations to stack maps, because it doesn't impose any overhead at run-time to register and unregister locations. However for run-time routines implemented in C or C++, you won't be able to get the C compiler to do this for you, so you need the explicit approach if you want precise roots.
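
A minimal sketch of the explicit precise-rooting approach (in the spirit of V8's Handle<> API; these names are invented for illustration): the mutator registers the locations of its references, so the GC can both enumerate them and update them if it moves objects.

```c
#include <stddef.h>

#define MAX_ROOTS 64

/* A registry of root *locations* (not values), so the GC can
   rewrite *roots[i] if it relocates the referent. */
static void **roots[MAX_ROOTS];
static size_t nroots;

static void root_push(void **loc) { roots[nroots++] = loc; }
static void root_pop(void)        { nroots--; }
```

A C routine would then bracket each of its heap references with `root_push((void **)&x); ... root_pop();`, which is exactly the registration ceremony that conservative scanning lets you skip.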

Conservative roots

Treat every word in stack as potential root; over-approximate live object set

1993: Bespoke GC inherited from SCM

2006 (1.8): Added pthreads, bugs

2009 (2.0): Switch to BDW-GC

BDW-GC: Roots also from extern SCM foo;, etc

The other way to find roots is very much not The Right Thing. Call it cheeky, call it sloppy, call it yolo, call it what you like, but in the trade it's known as conservative root-finding. This strategy looks like this:

uintptr_t *limit = stack_base_for_platform();
uintptr_t *sp = __builtin_frame_address(0);
for (; sp < limit; sp++) {
  void *obj = object_at_address(*sp);
  if (obj) add_to_live_objects(obj);
}

You just look at every word on the stack and pretend it's a pointer. If it happens to point to an object in the heap, we add that object to the live set. Of course this algorithm can find a spicy integer whose value just happens to correspond to an object's address, even if that object wouldn't have been counted as live otherwise. This approach doesn't compute the minimal live set, but rather a conservative over-approximation. Oh well. In practice this doesn't seem to be a big deal?
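
The `object_at_address` filter in the loop above might look roughly like this (heap bounds and alignment here are illustrative; a real collector also consults its allocation metadata):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative heap bounds; a real GC gets these from its allocator. */
static uintptr_t heap_base  = 0x10000000;
static uintptr_t heap_limit = 0x20000000;
#define GRANULE 16   /* minimum object alignment */

/* Does this word plausibly point at an object? */
static void *object_at_address(uintptr_t word) {
    if (word < heap_base || word >= heap_limit)
        return NULL;             /* outside the heap: not a pointer */
    if (word & (GRANULE - 1))
        return NULL;             /* misaligned: not an object start */
    /* A real collector would now check a side table ("is an object
       allocated at this address?"); we conservatively say yes. */
    return (void *)word;
}
```

Any word that survives these checks is treated as a root, which is where the spicy integers sneak in.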

Guile has used conservative root-finding since its beginnings, 30 years ago and more. We had our own bespoke mark-sweep GC in the beginning, but it's now going on 15 years or so that we switched to the third-party Boehm-Demers-Weiser (BDW) collector. It's been good to us! It's better than what we had, it's mostly just worked, and it works correctly with threads.

Conservative roots

+: Ergonomic, eliminates class of bugs (handle registration), no compiler constraints

-: Potential leakage, no compaction / object motion; no bump-pointer allocation, calcifies GC choice

Conservative root-finding does have advantages. It's quite pleasant to program with, in environments in which the compiler is unable to produce stack maps for you, as it eliminates a set of potential bugs related to explicit handle registration and unregistration. Like stack maps, it also doesn't impose run-time overhead on the user program. And although the compiler isn't constrained to emit code to clear roots, it generally does, and sometimes does so more promptly than would be the case with explicit handle deregistration.

But, there are disadvantages too. The potential for leaks is one, though I have to say that in 20 years of using conservative-roots systems, I have not found this to be a problem. It's a source of anxiety whenever a program has memory consumption issues but I've never identified it as being the culprit.

The more serious disadvantage, though, is that conservative edges prevent objects from being moved by the GC. If you know that a location holds a pointer, you can update that location to point to a new location for an object. But if a location only might be a pointer, you can't do that.

In the end, the ergonomics of conservative collection lead to a kind of calcification in Guile, that we thought that BDW was as good as we could get given the constraints, and that changing to anything else would require precise roots, and thus an API and ABI change, losing users, and so on.

What if I told you

You can find roots conservatively and

  • move objects and compact the heap
  • do fast bump-pointer allocation
  • incrementally migrate to precise roots

BDW is not the local maximum

But it turns out, that's not true! There is a way to have conservative roots and also use more optimal GC algorithms, and one which preserves the ability to incrementally refactor the system to have more precision if that's what you want.


Fundamental GC algorithms

  • mark-compact
  • mark-sweep
  • evacuation
  • mark-region

Immix is a mark-region collector

Let's back up to a high level. Garbage collector implementations are assembled from instances of algorithms, and there are only so many kinds of algorithms out there.

There's mark-compact, in which the collector traverses the object graph once to find live objects, then once again to slide them down to one end of the space they are in.

There's mark-sweep, where the collector traverses the graph once to find live objects, then traverses the whole heap, sweeping dead objects into free lists to be used for future allocations.

There's evacuation, where the collector does a single pass over the object graph, copying the objects outside their space and leaving a forwarding pointer behind.

The BDW collector used by Guile is a mark-sweep collector, and its use of free lists means that allocation isn't as fast as it could be. We want bump-pointer allocation and all the other algorithms give it to us.

Then in 2008, Stephen Blackburn and Kathryn McKinley put out their Immix paper that identified a new kind of collection algorithm, mark-region. A mark-region collector will mark the object graph and then sweep the whole heap for unmarked regions, which can then be reused for allocating new objects.

Allocate: Bump-pointer into holes in thread-local block, objects can span lines but not blocks

Trace: Mark objects and lines

Sweep: Coarse eager scan over line mark bytes

Blackburn and McKinley's paper also describes a new mark-region GC algorithm, Immix, which is interesting because it gives us bump-pointer allocation without requiring that objects be moveable. The diagram above, from the paper, shows the organization of an Immix heap. Allocating threads (mutators) obtain 64-kilobyte blocks from the heap. Blocks contains 128-byte lines. When Immix traces the object graph, it marks both objects and the line the object is on. (Usually blocks are part of 2MB aligned slabs, with line mark bits/bytes are stored in a packed array at the start of the slab. When marking an object, it's easy to find the associated line mark just with address arithmetic.)

Immix reclaims memory in units of lines. A set of contiguous lines that were not marked in the previous collection form a hole (a region). Allocation proceeds into holes, in the usual bump-pointer fashion, giving us good locality for contemporaneously-allocated objects, unlike freelist allocation. The slow path, if the object doesn't fit in the hole, is to look for the next hole in the block, or if needed to acquire another block, or to stop for collection if there are no more blocks.
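 
The bump-pointer fast path just described can be sketched as follows (the names and the slow-path policy are illustrative, not Immix's or Whippet's actual code):

```c
#include <stdint.h>
#include <stddef.h>

/* Per-mutator allocation state: the current hole. */
struct allocator { uintptr_t hole; uintptr_t hole_end; };

/* Fast path: bump the pointer. Returns 0 on hole exhaustion so the
   caller can find the next hole, acquire a block, or trigger GC. */
static uintptr_t bump_alloc(struct allocator *a, size_t bytes) {
    uintptr_t obj = a->hole;
    if (obj + bytes > a->hole_end)
        return 0;               /* slow path: next hole / block / GC */
    a->hole = obj + bytes;
    return obj;
}
```

The fast path is a compare and an add, which is the whole appeal relative to freelist allocation.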

Immix: Opportunistic evacuation

Before trace, determine if compaction needed. If not, mark as usual

If so, select candidate blocks and evacuation target blocks. When tracing in that block, try to evacuate, fall back to mark

The neat thing that Immix adds is a way to compact the heap via opportunistic evacuation. As Immix allocates, it can end up skipping over holes and leaving them unpopulated, and as subsequent cycles of GC occur, it could be that a block ends up with many small holes. If that happens to many blocks it could be time to compact.

To fight fragmentation, Immix decides at the beginning of a GC cycle whether to try to compact or not. If things aren't fragmented, Immix marks in place; it's cheaper that way. But if compaction is needed, Immix selects a set of blocks needing evacuation and another set of empty blocks to evacuate into. (Immix has to keep around a couple percent of memory in empty blocks in reserve for this purpose.)

As Immix traverses the object graph, if it finds that an object is in a block that needs evacuation, it will try to evacuate instead of marking. It may or may not succeed, depending on how much space is available to evacuate into. Maybe it will succeed for all objects in that block, and you will be left with an empty block, which might even be given back to the OS.

Immix: Guile

Opportunistic evacuation compatible with conservative roots!

Bump-pointer allocation


1 year ago: start work on WIP GC implementation

Tying this back to Guile, this gives us all of our desiderata: we can evacuate, but we don't have to, allowing us to cause referents of conservative roots to be marked in place instead of moved; we can bump-pointer allocate; and we are back on the train of modern GC implementations. I could no longer restrain myself: I started hacking on a work-in-progress garbage collector workbench about a year ago, and ended up with something that seems to take us in the right direction.

Whippet vs Immix: Tiny lines

Immix: 128B lines + mark bit in object

Whippet: 16B “lines”; mark byte in side table

More size overhead: 1/16 vs 1/128

Less fragmentation (1 live obj = 2 lines retained)

More alloc overhead? More small holes

What I ended up building wasn't quite Immix. Guile's object representation is very thin and doesn't currently have space for a mark bit, for example, so I would have to have a side table of mark bits. (I could have changed Guile's object representation but I didn't want to require it.) I actually chose mark bytes instead of bits because both the Immix line marks and BDW's own side table of marks were bytes, to allow for parallel markers to race when setting marks.

Then, given that you have a contiguous table of mark bytes, why not remove the idea of lines altogether? Or, what amounts to the same thing, why not make the line size 16 bytes and do away with per-object mark bits? You can then bump-pointer into holes in the mark byte array. The only thing you need for that is to be able to cheaply find the end of an object, so you can skip to the next hole while sweeping; you don't want to have to chase pointers to do that. But consider, you've already paid the cost of having a mark byte associated with every possible start of an object, so if your basic object alignment is 16 bytes, that's a memory overhead of 1/16, or 6.25%; OK. Let's put that mark byte to work and include an "end" bit, indicating the end of the object. Allocating an object has to store into the mark byte array to initialize this "end" marker, but you need to write the mark byte anyway to allow for conservative roots ("does this address hold an object?"); writing the end at the same time isn't so bad, perhaps.
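
The end-bit trick can be sketched like this (the bit assignments are invented for illustration; Whippet's actual mark-byte layout may differ):

```c
#include <stdint.h>
#include <stddef.h>

/* One mark byte per 16-byte granule. */
#define MARK_BIT 0x01
#define END_BIT  0x02

/* Size (in granules) of the object starting at granule i: scan
   forward until a granule whose byte has the end bit set. No
   pointer chasing into the object itself is needed. */
static size_t object_granules(const uint8_t *marks, size_t i) {
    size_t n = 1;
    while (!(marks[i + n - 1] & END_BIT))
        n++;
    return n;
}
```

With this, sweeping can walk the mark byte array alone: skip over live objects by their recorded extents, and treat runs of clear bytes as holes.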

The expected outcome would be that relative to 128-byte lines, Whippet ends up with more, smaller holes. Such a block would be a prime target for evacuation, of course, but during allocation this is overhead. Or, it could be a source of memory efficiency; who knows. There is some science yet to do to properly compare this tactic to original Immix, but I don't think I will get around to it.

While I am here and I remember these things, I need to mention two more details. If you read the Immix paper, it describes "conservative line marking", which is related to how you find the end of an object; basically Immix always marks the line an object is on and the next one, in case the object spans the line boundary. Only objects larger than a line have to precisely mark the line mark array when they are traced. Whippet doesn't do this because we have the end bit.

The other detail is the overflow allocator; in the original Immix paper, if you allocate an object that's smallish but still larger than a line or two, but there's no hole big enough in the block, Immix keeps around a completely empty block per mutator in which to bump-pointer-allocate these medium-sized objects. Whippet doesn't do that either, instead relying on such failure to allocate in a block to cause fragmentation and thus hurry along the process of compaction.

Whippet vs Immix: Lazy sweeping

Immix: “cheap” eager coarse sweep

Whippet: just-in-time lazy fine-grained sweep

Corollary: Data computed by sweep available when sweep complete

Live data at previous GC only known before next GC

Empty blocks discovered by sweeping

Having a fine-grained line mark array means that it's no longer a win to do an eager sweep of all blocks after collecting. Instead Whippet applies the classic "lazy sweeping" optimization to make mutators sweep their blocks just before allocating into them. This introduces a delay in the collection algorithm: Whippet doesn't find out about e.g. fragmentation until the whole heap is swept, but by the time we fully sweep the heap, we've exhausted it via allocation. It introduces a different flavor to the GC, not entirely unlike original Immix, but foreign.

Whippet vs BDW

Compaction/defrag/pinning, heap shrinking, sticky-mark generational GC, threads/contention/allocation, ephemerons, precision, tools

Right! With that out of the way, let's talk about what Whippet gives to Guile, relative to BDW-GC.

Whippet vs BDW: Motion

Heap-conservative tracing: no object moveable

Stack-conservative tracing: stack referents pinned, others not

Whippet: If whole-heap fragmentation exceeds threshold, evacuate most-fragmented blocks

Stack roots scanned first; marked instead of evacuated, implicitly pinned

Explicit pinning: bit in mark byte

If all edges in the heap are conservative, then you can't move anything, because you don't know if an edge is a pointer that can be updated or just a spicy integer. But most systems aren't actually like this: you have conservative edges from the stack, but you can precisely enumerate intra-object edges on the heap. In that case, you have a known set of conservative edges, and you can simply visit those edges first, marking their referents in place instead of evacuating. (Marking an object instead of evacuating implicitly pins it for the duration of the current GC cycle.) Then you visit heap edges precisely, possibly evacuating objects.

I should note that Whippet has a bit in the mark byte for use in explicitly pinning an object. I'm not sure how to manage who is responsible for setting that bit, or what the policy will be; the current idea is to set it for any object whose identity-hash value is taken. We'll see.

Whippet vs BDW: Shrinking

Lazy sweeping finds empty blocks: potentially give back to OS

Need empty blocks? Do evacuating collection

Possibility to do

With the BDW collector, your heap can only grow; it will never shrink (unless you enable a non-default option and you happen to have verrry low fragmentation). But with Whippet and evacuation, we can rearrange objects so as to produce empty blocks, which can then be returned to the OS if so desired.

In one of my microbenchmarks I have the system allocating long-lived data, interspersed with garbage (objects that are dead after allocation) whose size is in a power-law distribution. This should produce quite some fragmentation, eventually, and it does. But then Whippet decides to defragment, and it works great! Since Whippet doesn't keep a whole 2x reserve like a semi-space collector, it usually takes more than one GC cycle to fully compact the heap; usually about 3 cycles, from what I can see. I should do some more measurements here.

Of course, this is just mechanism; choosing the right heap sizing policy is a different question.

Whippet vs BDW: Generations

Card marking barrier (256B); compare to BDW mprotect / SIGSEGV

The Boehm collector also has a non-default mode in which it uses mprotect and a SIGSEGV handler to enable sticky-mark-bit generational collection. I haven't done a serious investigation, but I see it actually increasing run-time by 20% on one of my microbenchmarks that is actually generation-friendly. I know that Azul's C4 collector used to use page protection tricks but I can only assume that BDW's algorithm just doesn't work very well. (BDW's page barriers have another purpose, to enable incremental collection, in which marking is interleaved with allocation, but this mode is off if parallel markers are supported, and I don't know how well it works.)

Anyway, it seems we can do better. The ideal would be a semi-space nursery, which is the usual solution, but because of conservative roots we are limited to the sticky mark-bit algorithm. Some benchmarks aren't very generation-friendly; the first pair of bars in the chart above shows the mt-gcbench microbenchmark running with and without generational collection, and there's no difference. But in the second, for the quads benchmark, we see a 2x speedup or so.

Of course, to get generational collection to work, we require mutators to use write barriers, which are little bits of code that run when an object is mutated that tell the GC where it might find links from old objects to new objects. Right now in Guile we don't do this, but this benchmark shows what can happen if we do.
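
The 256-byte card-marking barrier from the slide might be sketched like this (the table sizing and names are illustrative, not Whippet's actual code):

```c
#include <stdint.h>

#define CARD_SIZE 256   /* bytes per card, as on the slide */

/* Illustrative card table; a real GC sizes this to the heap. */
static uint8_t card_table[1 << 20];
static uintptr_t heap_start = 0;  /* assumed heap base address */

/* Write barrier: record that the card holding `field` was mutated,
   so a minor GC can rescan it for old-to-new edges. */
static void write_barrier(uintptr_t field) {
    card_table[(field - heap_start) / CARD_SIZE] = 1;
}
```

The compiler (or the C API's setter macros) would emit a call like this alongside every mutating store; the minor GC then only rescans dirty cards instead of the whole old generation.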

Whippet vs BDW: Scale

BDW: TLS segregated-size freelists, lock to refill freelists, SIGPWR for stop

Whippet: thread-local block, sweep without contention, wait-free acquisition of next block, safepoints to stop with ragged marking

Both: parallel markers

Another thing Whippet can do better than BDW is performance when there are multiple allocating threads. The Immix heap organization facilitates minimal coordination between mutators, and maximum locality for each mutator. Sweeping is naturally parallelized according to how many threads are allocating. For BDW, on the other hand, every time a mutator needs to refill its thread-local free lists, it grabs a global lock; sweeping is lazy but serial.

Here's a chart showing whippet versus BDW on one microbenchmark. On the X axis I add more mutator threads; each mutator does the same amount of allocation, so I'm increasing the heap size also by the same factor as the number of mutators. For simplicity I'm running both whippet and BDW with a single marker thread, so I expect to see a linear increase in elapsed time as the heap gets larger (as with 4 mutators there are roughly 4 times the number of live objects to trace). This test is run on a Xeon Silver 4114, taskset to free cores on a single socket.

What we see is that as I add workers, elapsed time increases linearly for both collectors, but more steeply for BDW. I think (but am not sure) that this is because whippet effectively parallelizes sweeping and allocation, whereas BDW has to contend over a global lock to sweep and refill free lists. Both have the linear factor of tracing the object graph, but BDW has the additional linear factor of sweeping, whereas whippet scales with mutator count.

Incidentally, you might notice that at 4 mutator threads, BDW randomly crashed, when constrained to a fixed heap size. I have noticed that if you fix the heap size, BDW sometimes (and somewhat randomly) fails. I suspect the crash is due to fragmentation and an inability to compact, but who knows; multiple threads allocating is a source of indeterminism. Usually when you run BDW you let it choose its own heap size, but for these experiments I needed to have a fixed heap size instead.

Another measure of scalability is, how does the collector do as you add marker threads? This chart shows that for both collectors, runtime decreases as you add threads. It also shows that whippet is significantly slower than BDW on this benchmark, which is Very Weird, and I didn't have access to the machine on which these benchmarks were run when preparing the slides in the train... so, let's call this chart a good reminder that Whippet is a WIP :)

While in the train to Brussels I re-ran this test on the 4-core laptop I had on hand, and got the results that I expected: that whippet performed similarly to BDW, and that adding markers improved things, albeit marginally. Perhaps I should try a different microbenchmark.

Incidentally, when you configure Whippet for parallel marking at build-time, it uses a different implementation of the mark stack than the serial marker does, even when only 1 marker is enabled. Certainly the parallel marker could use some tuning.

Whippet vs BDW: Ephemerons

BDW: No ephemerons

Whippet: Yes

Another deep irritation I have with BDW is that it doesn't support ephemerons. In Guile we have a number of facilities (finalizers, guardians, the symbol table, weak maps, et al) built on what BDW does have (finalizers, weak references), but the implementations of these facilities in Guile are hacky, slow, sometimes buggy, and don't compose (try putting an object in a guardian and giving it a finalizer to see what I mean). It would be much better if the collector API supported ephemerons natively, specifying their relationship to finalizers and other facilities, allowing us to build what we need in terms of those primitives. With our own GC, we can do that, and do it in such a way that it doesn't depend on the details of the specific collection algorithm. The exception of course is that as BDW doesn't support ephemerons per se, what we get is actually a weak-key association instead, whose value can keep the key alive. Oh well, it's no worse than the current situation.
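
For readers unfamiliar with ephemerons, their tracing semantics can be sketched like this (illustrative names, not Whippet's actual API): the value is kept live only if the key is reachable some other way.

```c
#include <stddef.h>

struct ephemeron { void *key; void *value; };

/* Run during tracing, after the ordinary mark phase has settled. */
static void trace_ephemeron(struct ephemeron *e,
                            int (*is_marked)(void *),
                            void (*mark)(void *)) {
    if (e->key && is_marked(e->key))
        mark(e->value);                /* key live: trace the value */
    else
        e->key = e->value = NULL;      /* key dead: drop both edges */
}
```

This is exactly what a plain weak-key association can't express: with a weak-key table, the value can keep the key alive through a cycle, which is the flaw Guile currently lives with on BDW.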

Whippet vs BDW: Precision

BDW: ~Always stack-conservative, often heap-conservative

Whippet: Fully configurable (at compile-time)

Guile in mid/near-term: C stack conservative, Scheme stack precise, heap precise

Possibly fully precise: unlock semi-space nursery

Conservative tracing is a fundamental design feature of the BDW collector, both of roots and of inter-heap edges. You can tell BDW how to trace specific kinds of heap values, but the default is to do a conservative scan, and the stack is always scanned conservatively. In contrast, these tradeoffs are all configurable in Whippet. You can scan the stack and heap precisely, or stack conservatively and heap precisely, or vice versa (though that doesn't make much sense), or both conservatively.

The long-term future in Guile is probably to continue to scan the C stack conservatively, to continue to scan the Scheme stack precisely (even with BDW-GC, the Scheme compiler emits stack maps and installs a custom mark routine), but to scan the heap as precisely as possible. It could be that a user uses some of our hoary ancient APIs to allocate an object that Whippet can't trace precisely; in that case we'd have to disable evacuation / object motion, but we could still trace other objects precisely.

If Guile ever moved to a fully precise world, that would be a boon for performance, in two ways: first that we would get the ability to use a semi-space nursery instead of the sticky-mark-bit algorithm, and relatedly that we wouldn't need to initialize mark bytes when allocating objects. Second, we'd gain the option to use must-move algorithms for the old space as well (mark-compact, semi-space) if we wanted to. But it's just an option, one that Whippet opens up for us.

Whippet vs BDW: Tools?

Can build heap tracers and profilers more easily

More hackable

(BDW-GC has as many preprocessor directives as whippet has source lines)

Finally, relative to BDW-GC, whippet has a more intangible advantage: I can actually hack on it. Just as an indication, 15% of BDW source lines are pre-processor directives, and there is one file that has like 150 #ifdef's, not counting #elseif's, many of them nested. I haven't done all that much to BDW itself, but I personally find it excruciating to work on.

Hackability opens up the possibility to build more tools to help us diagnose memory use problems. They aren't in Whippet yet, but they can be!

Engineering Whippet

Embed-only, abstractions, migration, modern; timeline

OK, that rounds out the comparison between BDW and Whippet, at least on a design level. Now I have a few words about how to actually get this new collector into Guile without breaking the bug budget. I try to arrange my work areas on Guile in such a way that I spend a minimum of time on bugs. Part of my strategy is negligence, I will admit, but part also is anticipating problems and avoiding them ahead of time, even if it takes more work up front.

Engineering Whippet: Embed-only

Semi: 6 kB; Whippet: 22 kB; BDW: 184 kB

Compile-time specialization:

  • for embedder (e.g. how to forward objects)
  • for selected GC algorithm (e.g. semi-space vs whippet)

Built apart, but with LTO to remove library overhead

So the BDW collector is typically shipped as a shared library that you dynamically link to. I should say that we've had an overall good experience with upgrading BDW-GC in the past; its maintainer (Ivan Maidanski) does a great and responsible job on a hard project. It's been many, many years since we had a bug in BDW-GC. But still, BDW is a dependency, and all things being equal we prefer to remove moving parts.

The approach that Whippet is taking is to be an embed-only library: it's designed to be compiled into your project. It's not an include-only library; it still has to be compiled, but with link-time-optimization and a judicious selection of fast-path interfaces, Whippet is mostly able to avoid abstractions being a performance barrier.

The result is that Whippet is small, both in source and in binary, which minimizes its maintenance overhead. Taking additional stripped optimized binary size as the metric, by my calculations a semi-space collector (with a large object space and ephemeron support) takes about 6 kB of object file size, whereas Whippet takes 22 and BDW takes 184. Part of how Whippet gets so small is that it is configured in major ways at compile-time (choice of main GC algorithm), and specialized against the program embedding it (e.g. how to patch in a forwarding pointer). Having the whole API internal and visible to LTO instead of going through ELF symbol resolution helps in a minor way as well.
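
As an illustration of the fast-path idea, a bump-pointer allocation routine can live in a header (or be visible to LTO) as an inline function, with only the slow path behind a real call. This is a hypothetical sketch, not Whippet's actual interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mutator with a thread-local allocation region. */
struct mutator {
  uintptr_t alloc;  /* next free byte */
  uintptr_t limit;  /* end of the region */
};

/* Slow path: in a real collector this would acquire a fresh region or
   trigger a collection. Here it is just a stub. */
static void* alloc_slow(struct mutator *mut, size_t size) {
  (void)mut; (void)size;
  return NULL;
}

/* Fast path: a few instructions that the compiler can inline into the
   caller, with no cross-library call in the common case. */
static inline void* alloc_fast(struct mutator *mut, size_t size) {
  size = (size + 7) & ~(size_t)7;  /* round up to 8-byte alignment */
  uintptr_t addr = mut->alloc;
  if (addr + size <= mut->limit) {
    mut->alloc = addr + size;
    return (void*)addr;
  }
  return alloc_slow(mut, size);
}
```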

Engineering Whippet: Abstract performance

User API abstracts over GC algorithm, e.g. semi-space or whippet

Expose enough info to allow JIT to open-code fast paths

Inspired by MMTk

Abstractions permit change: of algorithm, over time

From a composition standpoint, Whippet is actually a few things. Firstly there is an abstract API to make a heap, create per-thread mutators for a heap, and allocate objects for a mutator. There is the aforementioned embedder API, for having the embedding program indicate how to trace objects and install forwarding pointers. Then there is some common code (for example ephemeron support). There are implementations of the different spaces: semi-space, large object, whippet/immix; and finally collector implementations that tie together the spaces into a full implementation of the abstract API. (In practice the more iconic spaces are intertwingled with the collector implementations they define.)
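
In outline, the abstract API has roughly this shape. The sketch below is illustrative only (the real Whippet names and signatures may differ), with a trivial malloc-backed stand-in implementation so that it runs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Abstract API shape: make a heap, attach per-thread mutators,
   allocate through a mutator. Names are hypothetical. */
struct gc_heap { size_t size; };
struct gc_mutator { struct gc_heap *heap; };

static struct gc_heap* gc_heap_make(size_t size) {
  struct gc_heap *heap = malloc(sizeof *heap);
  heap->size = size;
  return heap;
}

static struct gc_mutator* gc_mutator_make(struct gc_heap *heap) {
  struct gc_mutator *mut = malloc(sizeof *mut);
  mut->heap = heap;
  return mut;
}

/* Stand-in allocator: a real implementation would dispatch to the
   configured space (semi-space, whippet/immix, large-object). */
static void* gc_allocate(struct gc_mutator *mut, size_t bytes) {
  (void)mut;
  return calloc(1, bytes);
}
```

The point of the exercise is that an embedder codes against these few entry points, and the collector behind them can change without the embedder noticing.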

I don't think I would have gone down this route without seeing some prior work, for example libpas, but it was really MMTk that convinced me that it was worth spending a little time thinking about the GC not as a structureless blob but as a system made of parts and exposing a minimal interface. In particular, I was inspired by seeing that MMTk is able to get good performance while also being abstract, exposing representation details such as how to tell a JIT compiler about allocation fast-paths, but in a principled way. So, thanks MMTk people, for this and so many things!

I'm particularly happy that the API is abstract enough that it frees up not only the garbage collector to change implementations, but also Guile and other embedders, in that they don't have to bake in a dependency on specific collectors. The semi-space collector has been particularly useful here in ensuring that the abstractions don't accidentally rely on support for object pinning.

Engineering Whippet: Migration

API implementable by BDW-GC (except ephemerons)

First step for Guile: BDW behind Whippet API

Then switch to whippet/immix (by default)

The collector API can actually be implemented by the BDW collector. Whippet includes a collector that is a thin wrapper around the BDW API, with support for fast-path allocation via thread-local freelists. In this way we can always check the performance of any given collector against an external fixed point (BDW) as well as a theoretically known point (the semi-space collector).

Indeed I think the first step for Guile is precisely this: refactor Guile to allocate through the Whippet API, but using the BDW collector as the implementation. This will ensure that the Whippet API is sufficient, and then allow an incremental switch to other collectors.

Incidentally, when it comes to integrating Whippet, there are some choices to be made. I mentioned that it's quite configurable, and this chart can give you some idea. On the left side is one microbenchmark (mt-gcbench) and on the right is another (quads). The first generates a lot of fragmentation and has a wide range of object sizes, including some very large objects. The second is very uniform and many allocations die young.

(I know these images are small; right-click to open in new tab or pinch to zoom to see more detail.)

Within each set of bars we have 10 different scenarios, corresponding to different Whippet configurations. (All of these tests are run on my old 4-core laptop with 4 markers if parallel marking is supported, and a 2x heap.)

The first bar in each side is serial whippet: one marker. Then we see parallel whippet: four markers. Great. Then there's generational whippet: one marker, but just scanning objects allocated in the current cycle, hoping that produces enough holes. Then generational parallel whippet: the same as before, but with 4 markers.

The next 4 bars are the same: serial, parallel, generational, parallel-generational, but with one difference: the stack is scanned conservatively instead of precisely. You might be surprised, but all of these configurations actually perform better than their precise counterparts. I think the reason is that the microbenchmark uses explicit handle registration and deregistration (it's a stack) instead of compiler-generated stack maps in a side table, but I'm not precisely (ahem) sure.

Finally the next 2 bars are serial and parallel collectors, but marking everything conservatively. I have generational measurements for this configuration but it really doesn't make much sense to assume that you can emit write barriers in this context. These runs are slower than the previous configuration, mostly because there are some non-pointer locations that get scanned conservatively that wouldn't get scanned precisely. I think conservative heap scanning is less efficient than precise scanning, but I'm honestly not sure; there are some instruction locality arguments in the other direction. For mt-gcbench though there's a big array of floating-point values that a precise scan will omit, which causes significant overhead there. Probably for this configuration to be viable Whippet would need the equivalent of BDW's API to allocate known-pointerless objects.

Engineering Whippet: Modern



pthreads (for parallel markers)

No void*; instead struct types: gc_ref, gc_edge, gc_conservative_ref, etc

Embed-only lib avoids any returns-struct-by-value ABI issue

Rust? MMTk; supply chain concerns

Platform abstraction for conservative root finding

I know it's a sin, but Whippet is implemented in C. I know. The thing is, in the Guile context I need to not introduce wild compile-time dependencies, because of bootstrapping. And I know that Rust is a fine language to use for GC implementation, so if that's what you want, please do go take a look at MMTk! It's a fantastic project, written in Rust, and it can just slot into your project, regardless of the language your project is written in.

But if what you're looking for is something in C, well then you have to pick and choose your C. In the case of Whippet I try to use the limited abilities of C to help prevent bugs; for example, I generally avoid void* and instead wrap pointers or addresses into single-field structs that can't be automatically cast, for example to prevent a struct gc_ref that denotes an object reference (or NULL; it's an option type) from being confused with a struct gc_conservative_ref, which might not point to an object at all.
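
The single-field-struct idiom looks roughly like this. This is a sketch of the idea only; the actual Whippet definitions and helper names may differ:

```c
#include <assert.h>
#include <stdint.h>

/* Wrapping addresses in distinct single-field structs means the
   compiler rejects mixing them up, unlike with void* or uintptr_t. */
struct gc_ref { uintptr_t value; };
struct gc_conservative_ref { uintptr_t value; };

static inline struct gc_ref gc_ref_make(uintptr_t value) {
  return (struct gc_ref){ value };
}
/* gc_ref is an option type: zero denotes "no object". */
static inline int gc_ref_is_null(struct gc_ref ref) {
  return ref.value == 0;
}
static inline struct gc_conservative_ref
gc_conservative_ref_make(uintptr_t value) {
  return (struct gc_conservative_ref){ value };
}

/* Passing a gc_conservative_ref here is a compile-time type error,
   even though both structs wrap the same representation. */
static inline uintptr_t gc_ref_address(struct gc_ref ref) {
  return ref.value;
}
```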

(Of course, by "C" I mean "C as compiled by gcc and clang with -fno-strict-aliasing". I don't know if it's possible to implement even a simple semi-space collector in C without aliasing violations. Can you access a Foo* object within a mmap'd heap through its new address after it has been moved via memcpy? Maybe not, right? Thoughts are welcome.)

As a project written in the 2020s instead of the 1990s, Whippet gets to assume a competent C compiler, for example relying on the compiler to inline and fold branches where appropriate. As in libpas, Whippet liberally passes functions as values to inline functions, and relies on the compiler to boil away function calls. Whippet only uses the C preprocessor when it absolutely has to.
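
The function-as-value pattern means a generic routine, say a field tracer, can take its visitor as a parameter, while each call site passes a statically known function that the compiler specializes and inlines away. A minimal sketch with illustrative names:

```c
#include <assert.h>
#include <stddef.h>

/* Generic traversal parameterized by a visitor function. When this
   inline function is called with a statically known visitor, gcc and
   clang specialize the body and boil the indirect call away. */
static inline long visit_all(const long *fields, size_t n,
                             long (*visit)(long field)) {
  long acc = 0;
  for (size_t i = 0; i < n; i++)
    acc += visit(fields[i]);
  return acc;
}

/* One concrete visitor a call site might pass. */
static long count_nonzero(long field) { return field != 0; }
```

The alternative, a C-preprocessor template per visitor, is what older C codebases reach for; relying on the optimizer instead keeps the source readable.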

Finally, there is a clean abstraction for anything that's platform-specific, for example finding the current stack bounds. I haven't compiled this code on Windows or MacOS yet, but I am not anticipating too many troubles.

Engineering Whippet: Timeline

As time permits

Whippet TODO: heap growth/shrinking, finalizers, safepoint API

Guile TODO: safepoints; heap-conservative first

Precise heap TODO: gc_trace_object, SMOBs, user structs with raw ptr fields, user gc_malloc usage; 3.2

6 months for 3.1.1; 12 for 3.2.0 ?

So where does this get us? Where are we now?

For Whippet itself, I think it's mostly done -- enough to start shifting focus to some different phase. It's missing some needed features, notably the ability to grow the heap at all, as I've been in fixed-heap-size-only mode during development. It's also missing finalizers. And, something needs to be done to unify Guile's handling of safepoints and processing of asynchronous signals with Whippet's need to stop all mutators. Some details remain.

But, I think we are close to ready to start integrating in Guile. At first this is just porting Guile to use the Whippet API to access BDW instead of using BDW directly. This whole thing is a side project for me that I work on when I can, so it doesn't exactly proceed at full pace. Perhaps this takes 6 months. Then we can cut a new unstable release, and hopefully release 3.2 with support for the Immix-flavored collector in another 6 or 9 months.

I thought that we would be forced to make ABI changes, if only because some legacy APIs assume conservative tracing of object contents. But after a discussion at FOSDEM with Carlo Piovesan I realized this isn't true: because the decision to evacuate or not is made on a collection-by-collection basis, I could simply disable evacuation if the user ever uses a facility that might prohibit object motion, for example if they ever define a SMOB type. If the user wants evacuation, they need to be more precise with their data types, but either way Guile is ready.

Whippet: A Better GC?

An Immix-derived GC

Guile 3.2 ?

Thanks to MMTk authors for inspiration!

And that's it! Thanks for reading all the way here. Comments are quite welcome.

As I mentioned in the very beginning, this talk was really about Whippet in the context of Guile. There is a different talk to be made about Guile+Whippet versus other language implementations, for example those with concurrent marking or semi-space nurseries or the like. Yet another talk is Whippet in the context of other GC algorithms. But this is a start. It's something I've been working on for a while now already and I'm pleased that it's gotten to a point where it seems to be at least OK, at least an improvement with respect to BDW-GC in some ways.

But before leaving you, another chart, to give a more global idea of the state of things. Here we compare a single mutator thread performing a specific microbenchmark that makes trees and also lots of fragmentation, across three different GC implementations and a range of heap sizes. The heap size multipliers in this and in all the other tests in this post are calculated analytically based on what the test thinks its maximum heap size should be, not by measuring minimum heap sizes that work. This size is surely lower than the actual maximum required heap size due to internal fragmentation, but the tests don't know about this.

The three collectors are BDW, a semi-space collector, and whippet. Semi-space manages to squeeze into less than a 2x heap multiplier because it (like whippet) has a separate large object space that isn't ever evacuated.

What we expect is that tighter heaps impose more GC time, and indeed we see that times are higher on the left side than the right.

Whippet is the only implementation that manages to run at a 1.3x heap, but it takes some time. It's slower than BDW at a 1.5x heap but better from there on out, until what appears to be a bug or pathology makes it take longer at 5x. Adding memory should always decrease run time.

The semi-space collector starts working at 1.75x and then surpasses all collectors from 2.5x onwards. We expect the semi-space collector to win for big heaps, because its overhead is proportional to live data only, whereas mark-sweep and mark-region collectors have to sweep, which is proportional to heap size, and indeed that's what we see.

I think this chart shows we have some tuning yet to do. The range between 2x and 3x is quite acceptable, but we need to see what's causing Whippet to be slower than BDW at 1.5x. I haven't done as much performance tuning as I would like to but am happy to finally be able to know where we stand.

And that's it! Happy hacking, friends, and may your heap sizes be ever righteous.

by Andy Wingo at February 07, 2023 01:14 PM

Brian Kardell

Superb Howl

Superb Howl

We are rapidly approaching Sunday, February 12th, or in America "Super Bowl Sunday". There's something interesting about the Super Bowl that I wish we could emulate for the Web.

Chances are pretty good that even if you don't care at all about American football or watch broadcast TV, you're still going to wind up hearing about the Super Bowl. Or, at least, the commercials.

In fact, there's a pretty good chance you'll even see them. They'll be written about by media and talked about on social media. The Super Bowl is the only event that I know of that is like that. The commercials aren't just an annoyance. In fact, it is the only one I know of where a segment of the audience is tuning in for the commercials. They want to see them live, as they air. The commercials are part of the event.

That is... unique

Maybe the web needs that.

Wait, don't run away! Please!! Let me explain!!

Check this out...

A screenshot of a tab in Firefox with something about Pocket, explained below.

Most people I show this to don't realize what they're seeing, but this is the "What's New" page. Whenever Firefox updates, it can launch a "What's new?" page, and it can put whatever it wants on there. This isn't really about what's new in this release of Firefox at all - it is a full page advertisement! This particular advertisement is for a product that Mozilla acquired called Pocket, but I believe it was also a place where they highlighted the partnership with the Going Red movie. That's pretty interesting and, like a lot of things Mozilla has done, pretty innovative. I guess now we'll see if, like a lot of other Mozilla things, they were just a little too early or got a few details wrong.

It requires Firefox to sell that space, and I'm pretty sure they're not selling it successfully right now. Maybe they never tried. Maybe they've stopped trying. But what's very interesting to me is that while their numbers have been declining for years, even right now "Firefox users" is more than twice the audience that the Super Bowl will draw next week.


That could be a valuable ad spot, right? And not just for Firefox — it’s valuable in just about every browser. For example, Brave’s user numbers are right now getting close to half the audience of the Super Bowl.


Maybe this approach could be compelling for them too? Imagine if they did a thing like Firefox.. Oh wait..

A screenshot of the 'new tab' interface - the background photo of which is a full page advertisement.

Yeah, look at that. Brave also recognizes this is valuable, though they are offering something similar but a little different and more repeatable: the whole start page backdrop.

That's a pretty interesting idea, which one could argue is just a better version of something that came before. Opera, I think, was the first to introduce a start page with "speed dial". That's sort of like "default bookmarks on the start page", and for both Vivaldi and Opera, default bookmarks can be among their funding sources.

A screenshot of Vivaldi's new tab start page "speed dial", full of icons to web properties.

Microsoft, I think, has had some of their own product stuff in their start page at various times. Google's browser, unsurprisingly, places Search almost alone on the new tab page, despite the fact that the wonderbar will also search with Google. Search is their main money-making business, after all. There is still, however, a drawer you can open which will link you to all of the other Google properties too.

Changing ideas

As I wrote in The NASCAR Model, we need better ways to fund browser development and to incentivize direct investment, and for better or worse, that is way more practical if we can find a way to involve marketing budgets... Or just charge users somehow (they're charged now, they just don't realize it).

So, all of those ideas above are kind of interesting attempts to do just that.

In Wolvic we also have an environment around us, and we've effectively put up ads for Wolvic in those so far...

A screenshot of Wolvic's "cyberpunk" environment, it's a city background at night and on the buildings are numerous digital signs or illuminated billboards.

But, the same thing holds there: there's no real reason we couldn't put some small amount of marketing in the environment. There's no reason there couldn't be a billboard for a new Star Wars show or Avengers movie...

(psst... Disney, call me).

Somehow, I feel like they're all worth exploring - or at least worth discussing as we figure out how to collectively come to grips with the fact that ultimately we need to find new funding models for browsers (and their underlying engines). You can also consider adding some sponsorship via the collective, right now, and read more about that.

I'd love to hear your thoughts on any of this. Could we really somehow convince marketing dollars to be spent on this? What would it take? Is there a way, like the Super Bowl, we can design "just the right thing", where it's very valuable and not something users loathe? How could we get ads worthy of the opportunity and not turn into another vector for attention stealing? That's difficult. There are, of course, way more questions than answers - but if you'd like to talk or share your thoughts, hit me up. Especially if you are in Disney's marketing department, for example.

February 07, 2023 05:00 AM

February 02, 2023

Lucas Fryzek

2022 Graphics Team Contributions at Igalia

This year I started a new job working with Igalia's Graphics Team. For those of you who don't know Igalia, they are a "worker-owned, employee-run cooperative model consultancy focused on open source software".

As a new member of the team, I thought it would be a great idea to summarize the incredible amount of work the team completed in 2022. If you’re interested keep reading!

Vulkan 1.2 Conformance on RPi 4

One of the big milestones for the team in 2022 was achieving Vulkan 1.2 conformance on the Raspberry Pi 4. The folks over at the Raspberry Pi company wrote a nice article about the achievement. Igalia has been partnering with the Raspberry Pi company to build and improve the graphics driver on all versions of the Raspberry Pi.

The Vulkan 1.2 spec ratification came with a few extensions that were promoted to Core. This means a conformant Vulkan 1.2 driver needs to implement those extensions. Alejandro Piñeiro wrote this interesting blog post that talks about some of those extensions.

Vulkan 1.2 also came with a number of optional extensions such as VK_KHR_pipeline_executable_properties. My colleague Iago Toral wrote an excellent blog post on how we implemented that extension on the Raspberry Pi 4 and what benefits it provides for debugging.

Vulkan 1.3 support on Turnip

Igalia has been heavily supporting the Open-Source Turnip Vulkan driver for Qualcomm Adreno GPUs, and in 2022 we helped it achieve Vulkan 1.3 conformance. Danylo Piliaiev, on the graphics team here at Igalia, wrote a great blog post on this achievement! One of the biggest challenges for the Turnip driver is that it is a completely reverse-engineered driver that has been built without access to any hardware documentation or reference driver code.

With Vulkan 1.3 conformance has also come the ability to run more commercial games on Adreno GPUs through the use of the DirectX translation layers. If you would like to see more of this, check out this post from Danylo where he talks about getting "The Witcher 3", "The Talos Principle", and "OMD2" running on the A660 GPU. Outside of Vulkan 1.3 support, he also talks about some of the extensions that were implemented to allow "Zink" (the OpenGL over Vulkan driver) to run on Turnip, and bring OpenGL 4.6 support to Adreno GPUs.

Vulkan Extensions

Several developers on the Graphics Team made key contributions to Vulkan extensions and the Vulkan conformance test suite (CTS). My colleague Ricardo Garcia made an excellent blog post about those contributions. Below I've listed what Igalia did for each of the extensions:

  • VK_EXT_image_2d_view_of_3d
    • We reviewed the spec and are listed as contributors to this extension
  • VK_EXT_shader_module_identifier
    • We reviewed the spec, contributed to it, and created tests for this extension
  • VK_EXT_attachment_feedback_loop_layout
    • We reviewed, created tests and contributed to this extension
  • VK_EXT_mesh_shader
    • We contributed to the spec and created tests for this extension
  • VK_EXT_mutable_descriptor_type
    • We reviewed the spec and created tests for this extension
  • VK_EXT_extended_dynamic_state3
    • We wrote tests and reviewed the spec for this extension

AMDGPU kernel driver contributions

Our resident “Not an AMD expert” Melissa Wen made several contributions to the AMDGPU driver. Those contributions include connecting parts of the pixel blending and post blending code in AMD’s DC module to DRM and fixing a bug related to how panel orientation is set when a display is connected. She also had a presentation at XDC 2022, where she talks about techniques you can use to understand and debug AMDGPU, even when there aren’t hardware docs available.

André Almeida also completed and submitted work on enabling logging features for the new GFXOFF hardware feature in AMD GPUs. He also created a userspace application (which you can find here) that lets you interact with this feature through the debugfs interface. Additionally, he submitted a patch for async page flips (which he also talked about in his XDC 2022 presentation) which is still yet to be merged.

Modesetting without Glamor on RPi

Christopher Michael joined the Graphics Team in 2022 and, along with Chema Casanova, made some key contributions to enabling hardware acceleration and mode setting on the Raspberry Pi without the use of Glamor, which allows making more video memory available to graphics applications running on a Raspberry Pi.

The older generation Raspberry Pis (1-3) only have a maximum of 256MB of memory available for video memory, and using Glamor will consume part of that video memory. Christopher wrote an excellent blog post on this work. He and Chema also had a joint presentation at XDC 2022 going into more detail on this work.

Linux Format Magazine Column

Our very own Samuel Iglesias had a column published in Linux Format Magazine. It’s a short column about reaching Vulkan 1.1 conformance for v3dv & Turnip Vulkan drivers, and how Open-Source GPU drivers can go from a “hobby project” to the defacto driver for the platform. Check it out on page 7 of issue #288!

XDC 2022

X.Org Developers Conference is one of the big conferences for us here at the Graphics Team. Last year at XDC 2022 our Team presented 5 talks in Minneapolis, Minnesota. XDC 2022 took place towards the end of the year in October, so it provides some good context on how the team closed out the year. If you didn't attend or missed their presentations, here's a breakdown:

“Replacing the geometry pipeline with mesh shaders” (Ricardo García)

Ricardo presents what exactly mesh shaders are in Vulkan. He made many contributions to this extension, including writing thousands of CTS tests, and wrote a blog post on his presentation that you should check out!

“Status of Vulkan on Raspberry Pi” (Iago Toral)

Iago goes into detail about the current status of the Raspberry Pi Vulkan driver. He talks about achieving Vulkan 1.2 conformance, as well as some of the challenges the team had to solve due to hardware limitations of the Broadcom GPU.

“Enable hardware acceleration for GL applications without Glamor on Xorg modesetting driver” (Jose María Casanova, Christopher Michael)

Chema and Christopher talk about the challenges they had to solve to enable hardware acceleration on the Raspberry Pi without Glamor.

“I’m not an AMD expert, but…” (Melissa Wen)

In this non-technical presentation, Melissa talks about techniques developers can use to understand and debug drivers without access to hardware documentation.

“Async page flip in atomic API” (André Almeida)

André talks about the work that has been done to enable asynchronous page flipping in DRM's atomic API. He introduces the topic by explaining what exactly an asynchronous page flip is, and why you would want it.

FOSDEM 2022

Another important conference for us is FOSDEM, and last year we presented 3 of the 5 talks in the graphics dev room. FOSDEM took place in early February 2022, these talks provide some good context of where the team started in 2022.

The status of Turnip driver development (Hyunjun Ko)

Hyunjun presented the current state of the Turnip driver, also talking about the difficulties of developing a driver for a platform without hardware documentation. He talks about how Turnip developers reverse engineer the behaviour of the hardware, and then implement that in an open-source driver. He also made a companion blog post to check out along with his presentation.

v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4 (Alejandro Piñeiro)

Igalia has been presenting the status of the v3dv driver since December 2019 and in this presentation, Alejandro talks about the status of the v3dv driver in early 2022. He talks about achieving conformance, the extensions that had to be implemented, and the future plans of the v3dv driver.

Fun with border colors in Vulkan (Ricardo Garcia)

Ricardo presents about the work he did on the VK_EXT_border_color_swizzle extension in Vulkan. He talks about the specific contributions he did and how the extension fits in with sampling color operations in Vulkan.

GSoC & Igalia CE

Last year Melissa & André co-mentored contributors working on introducing KUnit tests to the AMD display driver. This project was hosted as a "Google Summer of Code" (GSoC) project from the X.Org Foundation. If you're interested in seeing their work Tales da Aparecida, Maíra Canal, Magali Lemes, and Isabella Basso presented their work at the Linux Plumbers Conference 2022 and across two talks at XDC 2022. Here you can see their first presentation and here you can see their second presentation.

André & Melissa also mentored two “Igalia Coding Experience” (CE) projects, one related to IGT GPU test tools on the VKMS kernel driver, and the other for IGT GPU test tools on the V3D kernel driver. If you’re interested in reading up on some of that work, Maíra Canal wrote about her experience being part of the Igalia CE.

Ella Stanforth was also part of the Igalia Coding Experience, being mentored by Iago & Alejandro. They worked on the VK_KHR_sampler_ycbcr_conversion extension for the v3dv driver. Alejandro talks about their work in his blog post here.

What’s Next?

The graphics team is looking forward to having a jam-packed 2023 with just as many if not more contributions to the Open-Source graphics stack! I’m super excited to be part of the team, and hope to see my name in our 2023 recap post!

Also, you might have heard that Igalia will be hosting XDC 2023 in the beautiful city of A Coruña! We hope to see you there where there will be many presentations from all the great people working on the Open-Source graphics stack, and most importantly where you can dream in the Atlantic!

Photo of A Coruña

February 02, 2023 05:00 AM

January 30, 2023

Yeunjoo Choi

Tips for using Vim when developing Chromium

There are lots of powerful IDEs which are broadly used by developers. Vim is a text editor, but it can be turned into a good IDE with awesome plugins and a little bit of configuration.

Many people already prefer using Vim to develop software because of its lightness and availability. I am one of them, and always use Vim to develop Chromium. However, some might think it's hard to get Vim to match the performance of other IDEs, since Chromium is a very huge project.

For those potential users, this post introduces some tips for using Vim as an IDE when developing Chromium.

This document is aimed at Linux users who are already used to Vim.

Code Formatting

I strongly recommend using vim-codefmt for code formatting. It supports autoformatting and most file types in Chromium, including GN (see the GN docs). However, I don't like to use autoformatting, so I just bind :FormatLines to ctrl-I for formatting a selected visual block.

map <C-I> :FormatLines<CR> 


Another option is using clang-format from depot_tools via clang-format.vim in tools/vim. See the following section about tools/vim.


You can easily find a bunch of Vim files in tools/vim. I create a .vimrc file locally in a working directory and load only what I need.

let chtool_path=getcwd().'/tools'
filetype off

let &rtp.=','.chtool_path.'/vim/mojom'
exec 'source' chtool_path.'/vim/filetypes.vim'
exec 'source' chtool_path.'/vim/ninja-build.vim'

filetype plugin indent on

Chromium provides vimscript files for syntax highlighting and file-type detection of Mojom, the IDL for Mojo interfaces (IPC) among Chromium services. And ninja-build.vim allows you to compile a file with ctrl-O or build a specific target with the :CrBuild command. See each Vim file for details.

Running Tests

The script under tools/ is very useful when you run Chromium tests. As its description says, it finds the appropriate test suite and builds it, then runs it. ! is an option for running it inside Vim, but sometimes it's a hassle to type all the parameters. What about writing simple commands (or functions) with the scripts under tools/vim?

This is an example script for running a test for the current line. Some code is copied from ninja-build.vim, and GetNinjaOutputDirectory is imported from ninja_output.py under tools/vim.

pythonx << endpython
import os, sys, vim

def path_to_build_dir():
    # Code from tools/vim/ninja-build.vim.
    chrome_root = os.path.dirname(vim.current.buffer.name)
    fingerprints = ['chrome', 'net', 'v8', 'build', 'skia']
    while chrome_root and not all(
            [os.path.isdir(os.path.join(chrome_root, fp))
             for fp in fingerprints]):
        chrome_root = os.path.dirname(chrome_root)
    sys.path.append(os.path.join(chrome_root, 'tools', 'vim'))
    # Import GetNinjaOutputDirectory from tools/vim/ninja_output.py.
    from ninja_output import GetNinjaOutputDirectory
    return GetNinjaOutputDirectory(chrome_root)

def run_test_for_line(linenumb):
    run_cmd = ' '.join(['!tools/', '-C', path_to_build_dir(),
                        '%', '--line', linenumb])
    vim.command(run_cmd)
endpython

fun! RunTestForCurrentLine()
  let l:current_line = shellescape(line('.'))
  pythonx run_test_for_line(vim.eval('l:current_line'))
endfun

map <C-T> :call RunTestForCurrentLine()<CR>

Place the cursor on what we want to test, press ctrl-t … and here you go.


tools/vim has an example configuration for YouCompleteMe (a code completion engine for Vim). See tools/vim/

let g:ycm_extra_conf_globlist = ['../']
let g:ycm_goto_buffer_command = 'split-or-existing-window'

As you already know, YouCompleteMe requires clangd. Fortunately, Chromium already supports clangd and a remote index server that provides daily index snapshots.

Do not skip the documents about using clangd to build Chromium and about the Chromium remote index server.

Here are short demos of completion suggestions,

and of code jumping (:YcmCompleter GoTo, :YcmCompleter GoToReferences) during Chromium development.


A good developer writes good comments, so being a good commenter makes you a good developer (joke). In my humble opinion, NERD Commenter will make you a good developer (joke again). You can get all the details from the NERD Commenter docs; Chromium needs only one additional option, for mojom:

let g:NERDCustomDelimiters = { 'mojom': {'left': '//'} } 

File Navigation

Chromium is really huge, so we benefit from efficient file-navigation tools. There are lots of awesome fuzzy-finder plugins for this purpose, and I have been using command-t for a long time. But command-t requires a Vim executable with Ruby/Lua support, or Neovim. If you need a simpler yet still powerful navigator, fzf.vim is a great alternative.

Since Chromium is a git project, search commands based on git ls-files will return results very quickly. Use the option let g:CommandTFileScanner = 'git' for command-t, and the command :GFiles for fzf.vim.


In case you can't rely on git ls-files based commands, make sure your plugin uses rg or fd, which are among the fastest command-line search tools. For example, fzf.vim reads the FZF_DEFAULT_COMMAND environment variable, and its manual suggests: export FZF_DEFAULT_COMMAND='fd --type f'.


Thanks for reading, and I hope this blog post helps with your development environment for Chromium. Comments sharing other cool tips are always welcome, so that we can make our Vim more productive. (Right, this is the real purpose of this blog post.)

… I will be back someday with debugging Chromium in Vim.

January 30, 2023 02:14 AM

January 27, 2023

Andy Wingo

three approaches to heap sizing

How much memory should a program get? Tonight, a quick note on sizing for garbage-collected heaps. There are a few possible answers, depending on what your goals are for the system.

you: doctor science

Sometimes you build a system and you want to study it: to identify its principal components and see how they work together, or to isolate the effect of altering a single component. In that case, what you want is a fixed heap size. You run your program a few times and determine a heap size that is sufficient for your problem, and then in future run the program with that new fixed heap size. This allows you to concentrate on the other components of the system.

A good approach to choosing the fixed heap size for a program is to determine the minimum heap size a program can have by bisection, then multiplying that size by a constant factor. Garbage collection is a space/time tradeoff: the factor you choose represents a point on the space/time tradeoff curve. I would choose 1.5 in general, but this is arbitrary; I'd go more with 3 or even 5 if memory isn't scarce and I'm really optimizing for throughput.
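The bisection step is just a binary search over a monotonic predicate. As a sketch (runs_ok is a hypothetical stand-in for "run the program under this heap limit and report whether it completed"; nothing here is from a real collector):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Find the minimum heap size for which the program still runs, by
   bisection.  Assumes runs_ok(hi) is true and the predicate is
   monotonic: once a size works, every larger size works too. */
size_t min_heap_size(size_t lo, size_t hi, bool (*runs_ok)(size_t)) {
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (runs_ok(mid))
      hi = mid;      /* mid bytes suffice: search lower */
    else
      lo = mid + 1;  /* mid bytes too few: search higher */
  }
  return hi;
}

/* A toy predicate standing in for a real program that needs 64 MB. */
static bool toy_runs_ok(size_t mb) { return mb >= 64; }
```

The fixed heap size would then be the result multiplied by the chosen space/time factor, e.g. 1.5.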

Note that a fixed-size heap is not generally what you want. It's not good user experience for running ./foo at the command line, for example. The reason for this is that program memory use is usually a function of the program's input, and only in some cases do you know what the input might look like, and until you run the program you don't know what the exact effect of input on memory is. Still, if you have a team of operations people that knows what input patterns look like and has experience with a GC-using server-side process, fixed heap sizes could be a good solution there too.

you: average josé/fina

On the other end of the spectrum is the average user. You just want to run your program. The program should have the memory it needs! Not too much of course; that would be wasteful. Not too little either; I can tell you, my house is less than 100m², and I spend way too much time shuffling things from one surface to another. If I had more space I could avoid this wasted effort, and in a similar way, you don't want to be too stingy with a program's heap. Do the right thing!

Of course, you probably have multiple programs running on a system that are making similar heap sizing choices at the same time, and the relative needs and importances of these programs could change over time, for example as you switch tabs in a web browser, so the right thing really refers to overall system performance, whereas what you are controlling is just one process' heap size; what is the Right Thing, anyway?

My corner of the GC discourse agrees that something like the right solution was outlined by Kirisame, Shenoy, and Panchekha in a 2022 OOPSLA paper, in which the optimum heap size depends on the allocation rate and the gc cost for a process, which you measure on an ongoing basis. Interestingly, their formulation of heap size calculation can be made by each process without coordination, but results in a whole-system optimum.

There are some details but you can imagine some instinctive results: for example, when a program stops allocating because it's waiting for some external event like user input, it doesn't need so much memory, so it can start shrinking its heap. After all, it might be quite a while before the program has new input. If the program starts allocating again, perhaps because there is new input, it can grow its heap rapidly, and might then shrink again later. The mechanism by which this happens is pleasantly simple, and I salute (again!) the authors for identifying the practical benefits that an abstract model brings to the problem domain.

you: a damaged, suspicious individual

Hoo, friends-- I don't know. I've seen some things. Not to exaggerate, I like to think I'm a well-balanced sort of fellow, but there's some suspicion too, right? So when I imagine a background thread determining that my web server hasn't gotten so much action in the last 100ms and that really what it needs to be doing is shrinking its heap, kicking off additional work to mark-compact it or whatever, when the whole point of the virtual machine is to run that web server and not much else, only to have to probably give it more heap 50ms later, I-- well, again, I exaggerate. The MemBalancer paper has a heartbeat period of 1 Hz and a smoothing function for the heap size, but it just smells like danger. Do I need danger? I mean, maybe? Probably in most cases? But maybe it would be better to avoid danger if I can. Heap growth is usually both necessary and cheap when it happens, but shrinkage is never necessary and is sometimes expensive because you have to shuffle around data.

So, I think there is probably a case for a third mode: not fixed, not adaptive like the MemBalancer approach, but just growable: grow the heap when and if its size is less than a configurable multiplier (e.g. 1.5) of live data. Never shrink the heap. If you ever notice that a process is taking too much memory, manually kill it and start over, or whatever. Default to adaptive, of course, but when you start to troubleshoot a high GC overhead in a long-lived process, perhaps switch to growable to see its effect.
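A minimal sketch of this growable policy; the function name and the 1.5 multiplier are illustrative, not from any particular collector:

```c
#include <assert.h>
#include <stddef.h>

/* Growable heap sizing: grow the heap when it is smaller than
   multiplier * live data; never shrink, even if live data drops. */
size_t growable_heap_size(size_t current_size, size_t live_bytes) {
  double multiplier = 1.5;  /* a point on the space/time curve */
  size_t target = (size_t)(live_bytes * multiplier);
  return target > current_size ? target : current_size;
}
```

Called after each collection with the measured live-byte count, this only ever ratchets the heap upward.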

unavoidable badness

There is some heuristic badness that one cannot avoid: even with the adaptive MemBalancer approach, you have to choose a point on the space/time tradeoff curve. Regardless of what you do, your system will grow a hairy nest of knobs and dials, and if your system is successful there will be a lively aftermarket industry of tuning articles: "Are you experiencing poor object transit? One knob you must know"; "Four knobs to heaven"; "It's raining knobs"; "GC engineers DO NOT want you to grab this knob!!"; etc. (I hope that my British readers are enjoying this.)

These ad-hoc heuristics are just part of the domain. What I want to say though is that having a general framework for how you approach heap sizing can limit knob profusion, and can help you organize what you have into a structure of sorts.

At least, this is what I tell myself; inshallah. Now I have told you too. Until next time, happy hacking!

by Andy Wingo at January 27, 2023 09:45 PM

January 24, 2023

Andy Wingo

parallel ephemeron tracing

Hello all, and happy new year. Today's note continues the series on implementing ephemerons in a garbage collector.

In our last dispatch we looked at a serial algorithm to trace ephemerons. However, production garbage collectors are parallel: during collection, they trace the object graph using multiple worker threads. Our problem is to extend the ephemeron-tracing algorithm with support for multiple tracing threads, without introducing stalls or serial bottlenecks.

Recall that we ended up having to define a table of pending ephemerons:

struct gc_pending_ephemeron_table {
  struct gc_ephemeron *resolved;
  size_t nbuckets;
  struct gc_ephemeron *buckets[0];
};

This table holds pending ephemerons that have been visited by the graph tracer but whose keys haven't been found yet, as well as a singly-linked list of resolved ephemerons that are waiting to have their values traced. As a global data structure, the pending ephemeron table is a point of contention between tracing threads that we need to design around.

a confession

Allow me to confess my sins: things would be a bit simpler if I didn't allow tracing workers to race.

As background, if your GC supports marking in place instead of always evacuating, then there is a mark bit associated with each object. To reduce the overhead of contention, a common strategy is to actually use a whole byte for the mark bit, and to write to it using relaxed atomics (or even raw stores). This avoids the cost of a compare-and-swap, but at the cost that multiple marking threads might see that an object's mark was unset, go to mark the object, and think that they were the thread that marked the object. As far as the mark byte goes, that's OK because everybody is writing the same value. The object gets pushed on the to-be-traced grey object queues multiple times, but that's OK too because tracing should be idempotent.
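For illustration, here is a sketch of that mark-byte scheme with C11 relaxed atomics; the names are assumptions, not any collector's real API. Under contention, two threads can both take the "unset" branch and both return true, which is exactly the benign race described above:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Mark an object's byte-wide mark bit using relaxed atomics.  This is
   cheaper than a compare-and-swap, but two racing threads may both
   observe the byte unset and both return true.  Both write the same
   value and tracing is idempotent, so this is harmless for marking --
   but not for multi-state transitions like ephemerons. */
bool try_mark(atomic_uchar *mark_byte) {
  if (atomic_load_explicit(mark_byte, memory_order_relaxed))
    return false;  /* already marked */
  atomic_store_explicit(mark_byte, 1, memory_order_relaxed);
  return true;     /* we *think* we were the thread that marked it */
}
```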

This is a common optimization for parallel marking, and it doesn't have any significant impact on other parts of the GC--except ephemeron marking. For ephemerons, because the state transition isn't simply from unmarked to marked, we need more coordination.

high level

The parallel ephemeron marking algorithm modifies the serial algorithm in just a few ways:

  1. We have an atomically-updated state field in the ephemeron, used to know if e.g. an ephemeron is pending or resolved;

  2. We use separate fields for the pending and resolved links, to allow for concurrent readers across a state change;

  3. We introduce "traced" and "claimed" states to resolve races between parallel tracers on the same ephemeron, and track the "epoch" at which an ephemeron was last traced;

  4. We remove resolved ephemerons from the pending ephemeron hash table lazily, and use atomic swaps to pop from the resolved ephemerons list;

  5. We have to re-check key liveness after publishing an ephemeron to the pending ephemeron table.

Regarding the first point, there are four possible values for the ephemeron's state field:

enum {
  TRACED,
  CLAIMED,
  PENDING,
  RESOLVED,
};

The state transition diagram looks like this:

  ,----->TRACED<-----.
 ,         | ^        .
,          v |         .
|        CLAIMED        |
|  ,-----/     \---.    |
|  v               v    |
PENDING--------->RESOLVED

With this information, we can start to flesh out the ephemeron object itself:

struct gc_ephemeron {
  uint8_t state;
  uint8_t is_dead;
  unsigned epoch;
  struct gc_ephemeron *pending;
  struct gc_ephemeron *resolved;
  void *key;
  void *value;
};

The state field holds one of the four state values; is_dead indicates if a live ephemeron was ever proven to have a dead key, or if the user explicitly killed the ephemeron; and epoch is the GC count at which the ephemeron was last traced. Ephemerons are born TRACED in the current GC epoch, and the collector is responsible for incrementing the current epoch before each collection.

algorithm: tracing ephemerons

When the collector first finds an ephemeron, it does a compare-and-swap (CAS) on the state from TRACED to CLAIMED. If that succeeds, we check the epoch; if it's current, we revert to the TRACED state: there's nothing to do.

(Without marking races, you wouldn't need either TRACED or CLAIMED states, or the epoch; it would be implicit in the fact that the ephemeron was being traced at all that you had a TRACED ephemeron with an old epoch.)

So now we have a CLAIMED ephemeron with an out-of-date epoch. We update the epoch and clear the pending and resolved fields, setting them to NULL. If, then, the ephemeron is_dead, we are done, and we go back to TRACED.
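The claim step so far can be sketched with C11 atomics. The state names follow the text, but the struct layout and function are a hedged guess at the shape of an implementation, not the real code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { TRACED, CLAIMED, PENDING, RESOLVED };

struct ephemeron {       /* illustrative subset of gc_ephemeron */
  _Atomic int state;
  unsigned epoch;
};

/* Returns true if this thread claimed the ephemeron and must go on to
   trace it; false if another tracer won the race, or if the ephemeron
   was already traced in the current epoch. */
bool claim_ephemeron(struct ephemeron *e, unsigned current_epoch) {
  int expected = TRACED;
  if (!atomic_compare_exchange_strong(&e->state, &expected, CLAIMED))
    return false;                   /* lost the race to another tracer */
  if (e->epoch == current_epoch) {  /* already traced this cycle */
    atomic_store(&e->state, TRACED);
    return false;
  }
  e->epoch = current_epoch;         /* claimed, with a stale epoch */
  return true;
}
```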

Otherwise we check if the key has already been traced. If so we forward it (if evacuating) and then trace the value edge as well, and transition to TRACED.

Otherwise we have a live E but we don't know about K; this ephemeron is pending. We transition E's state to PENDING and add it to the front of K's hash bucket in the pending ephemerons table, using CAS to avoid locks.

We then have to re-check if K is live, after publishing E, to account for other threads racing to mark to K while we mark E; if indeed K is live, then we transition to RESOLVED and push E on the global resolved ephemeron list, using CAS, via the resolved link.

So far, so good: either the ephemeron is fully traced, or it's pending and published, or (rarely) published-then-resolved and waiting to be traced.

algorithm: tracing objects

The annoying thing about tracing ephemerons is that it potentially impacts tracing of all objects: any object could be the key that resolves a pending ephemeron.

When we trace an object, we look it up in the pending ephemeron hash table. But, as we traverse the chains in a bucket, we also load each node's state. If we find a node that's not in the PENDING state, we atomically forward its predecessor to point to its successor. This is correct for concurrent readers because the end of the chain is always reachable: we only skip nodes that are not PENDING, nodes never become PENDING after they transition away from being PENDING, and we only add PENDING nodes to the front of the chain. We even leave the pending field in place, so that any concurrent reader of the chain can still find the tail, even when the ephemeron has gone on to be RESOLVED or even TRACED.

(I had thought I would need Tim Harris' atomic list implementation, but it turns out that since I only ever insert items at the head, having annotated links is not necessary.)

If we find a PENDING ephemeron that has K as its key, then we CAS its state from PENDING to RESOLVED. If this works, we CAS it onto the front of the resolved list. (Note that we also have to forward the key at this point, for a moving GC; this was a bug in my original implementation.)

algorithm: resolved ephemerons

Periodically a thread tracing the graph will run out of objects to trace (its mark stack is empty). That's a good time to check if there are resolved ephemerons to trace. We atomically exchange the global resolved list with NULL, and then if there were resolved ephemerons, then we trace their values and transition them to TRACED.
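The push and the wholesale pop of the resolved list can be sketched as follows; the names are illustrative, and the real implementation's details may differ:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Resolved-list sketch: push onto the head with a CAS loop, and
   detach the entire list at once with an atomic exchange, after which
   a tracer can walk its private copy without any contention. */
struct eph { struct eph *resolved; };

static struct eph *_Atomic resolved_head = NULL;

void push_resolved(struct eph *e) {
  struct eph *head = atomic_load(&resolved_head);
  do {
    e->resolved = head;  /* link to the current head */
  } while (!atomic_compare_exchange_weak(&resolved_head, &head, e));
}

/* Swap the global list with NULL; the caller now owns the chain. */
struct eph *take_resolved(void) {
  return atomic_exchange(&resolved_head, NULL);
}
```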

At the very end of the GC cycle, we sweep the pending ephemeron table, marking any ephemeron that's still there as is_dead, transitioning them back to TRACED, clearing the buckets of the pending ephemeron table as we go.


So that's it. There are some drawbacks, for example that this solution takes at least three words per ephemeron. Oh well.

There is also an annoying point of serialization, which is related to the lazy ephemeron resolution optimization. Consider that checking the pending ephemeron table on every object visit is overhead; it would be nice to avoid this. So instead, we start in "lazy" mode, in which pending ephemerons are never resolved by marking; and then once the mark stack / grey object worklist fully empties, we sweep through the pending ephemeron table, checking each ephemeron's key to see if it was visited in the end, and resolving those ephemerons; we then switch to "eager" mode in which each object visit could potentially resolve ephemerons. In this way the cost of ephemeron tracing is avoided for that part of the graph that is strongly reachable. However, with parallel markers, would you switch to eager mode when any thread runs out of objects to mark, or when all threads run out of objects? You would get greatest parallelism with the former, but you run the risk of some workers prematurely running out of data while there is still a significant part of the strongly-reachable graph to traverse. If you wait for all threads to be done, you introduce a serialization point. There is a related question of when to pump the resolved ephemerons list. But these are engineering details.

Speaking of details, there are some gnarly pitfalls, particularly that you have to be very careful about pre-visit versus post-visit object addresses; for a semi-space collector, visiting an object will move it, so for example in the pending ephemeron table which by definition is keyed by pre-visit (fromspace) object addresses, you need to be sure to trace the ephemeron key for any transition to RESOLVED, and there are a few places this happens (the re-check after publish, sweeping the table after transitioning from lazy to eager, and when resolving eagerly).


If you've read this far, you may be interested in the implementation; it's only a few hundred lines long. It took me quite a while to whittle it down!

Ephemerons are challenging from a software engineering perspective, because they are logically a separate module, but they interact both with users of the GC and with the collector implementations. It's tricky to find the abstractions that work for all GC algorithms, whether they mark in place or move their objects, and whether they mark the heap precisely or if there are some conservative edges. But if this is the sort of thing that interests you, voilà the API for users and the API to and from collector implementations.

And, that's it! I am looking forward to climbing out of this GC hole, one blog at a time. There are just a few more features before I can seriously attack integrating this into Guile. Until the next time, happy hacking :)

by Andy Wingo at January 24, 2023 10:48 AM

January 23, 2023

Jacobo Aragunde

Accessible name computation in Chromium chapter III: name from hidden subtrees

Back in March of 2021, as part of my maintenance work on Chrome/Chromium accessibility at Igalia, I inadvertently started a long, deep dive into the accessible name calculation code that has kept me intermittently busy since then.

This is the third and last post in the series. We started with an introduction to the problem, and followed by describing how we fixed things in Chromium. In this post, we will describe some problems we detected in the accessible name computation spec, and the process to address them.

This post extends what was recently presented in a lightning talk at BlinkOn 17, in November 2022:

Hidden content and aria-labelledby

Authors may know they can use aria-labelledby to name a node after another, hidden node. The attribute aria-describedby works analogously for descriptions, but in this post we will refer to names for simplicity.

This is a simple example, where the produced name for the input is “foo”:

<input aria-labelledby="label">
<div id="label" class="hidden">foo</div>

This is because the spec says, in step 2B:

  • if computing a name, and the current node has an aria-labelledby attribute that contains at least one valid IDREF, and the current node is not already part of an aria-labelledby traversal, process its IDREFs in the order they occur:
  • or, if computing a description, and the current node has an aria-describedby attribute that contains at least one valid IDREF, and the current node is not already part of an aria-describedby traversal, process its IDREFs in the order they occur:
    i. Set the accumulated text to the empty string.
    ii. For each IDREF:
    a. Set the current node to the node referenced by the IDREF.
    b. Compute the text alternative of the current node beginning with step 2. Set the result to that text alternative.
    c. Append the result, with a space, to the accumulated text.
    iii. Return the accumulated text.

When processing each IDREF per 2B.ii.b, it will stumble upon this condition in 2A:

If the current node is hidden and is not directly referenced by aria-labelledby or aria-describedby, nor directly referenced by a native host language text alternative element (e.g. label in HTML) or attribute, return the empty string. 

So, a hidden node referenced by aria-labelledby or aria-describedby will not return the empty string. Instead, it will go on with the subsequent steps in the procedure.

Unfortunately, the spec did not provide clear answers to some corner cases: what happens if there are elements that are explicitly hidden inside the first hidden node? What about nodes that aren’t directly referenced? How deep are we expected to go inside a hidden tree to calculate its name? And what about aria-hidden nodes?

Consider another example:

  <input id="test" aria-labelledby="t1">
  <div id="t1" style="visibility:hidden">
    <span aria-hidden="true">a</span>
    <span style="visibility:hidden">b</span>

Given the markup above, WebKit generated the name “d”, Firefox said “abcd” and Chrome said “bcd”. In absence of clear answers, user agent behaviors differ.

There was an ongoing discussion in Github regarding this topic, where I shared my findings and concerns. Some conclusions that came out from it:

  • It appears that the original intention of the spec was allowing the naming of nodes from hidden subtrees, if done explicitly, and to include only those children that were not explicitly hidden.
  • It’s really hard to implement a check for explicitly hidden nodes inside a hidden subtree. For optimization purposes, that information is simplified away in early stages of stylesheet processing, and it wouldn’t be practical to recover or keep it.
  • Given the current state of affairs and how long the current behavior has been in place, changing it radically would cause confusion among authors, backwards compatibility problems, etc.

The agreement was to change the spec to match the actual behavior in browsers, and clarify the points where implementations differed. But, before that…

A dead end: the innerText proposal

An approach we discussed which looked promising was to use the innerText property for the specific case of producing a name from a hidden subtree. The spec would say something like: “if node is hidden and it’s the target of an aria-labelledby relation, return the innerText property”.

It’s easy to explain, and saves user agents the trouble of implementing the traversal. It went as far as to have a proposed wording and I coded a prototype for Chromium.

Unfortunately, innerText has a peculiar behavior with regard to hidden content: it will return empty if the node was hidden with the visibility:hidden rule.

The relevant spec says:

The innerText and outerText getter steps are:
1. If this is not being rendered or if the user agent is a non-CSS user agent, then return this’s descendant text content.


An element is being rendered if it has any associated CSS layout boxes, SVG layout boxes, or some equivalent in other styling languages.

An element with visibility:hidden has an associated layout box, hence it’s considered “rendered” and follows the steps 2 and beyond:

  1. Let results be a new empty list.
  2. For each child node node of this:
    i. Let current be the list resulting in running the rendered text collection steps with node. Each item in results will either be a string or a positive integer (a required line break count).

Where “rendered text collection steps” says:

  1. Let items be the result of running the rendered text collection steps with each child node of node in tree order, and then concatenating the results to a single list.
  2. If node’s computed value of ‘visibility’ is not ‘visible’, then return items.

The value of visibility is hidden, hence it returns items, which is empty because we had just started.

This behavior makes using innerText too big of a change to be acceptable for backwards compatibility reasons.

Clarifying the spec: behavior on naming after hidden subtrees

In the face of the challenges and preconditions described above, we agreed that the spec would document what most browsers are already doing in this case, which means exposing the entire, hidden subtree, down to the bottom. There was one detail that diverged: what about nodes with aria-hidden="true"? Firefox included them in the name, but Chrome and WebKit didn’t.

Letting authors use aria-hidden to really omit a node in a hidden subtree appeared to be a nice feature to have, so we first tried to codify that behavior into words. Unfortunately, the result became too convoluted: it added new steps, it created differences in what “hidden” means… We decided not to submit this proposal, but you can still take a look at it if you’re curious.

Instead, the proposal we submitted does not add or change many words in the spec text. The new text is:

“If the current node is hidden and is not part of an aria-labelledby or aria-describedby traversal, where the node directly referenced by that relation was hidden […], return the empty string.”

Instead, it adds a lot of clarifications and examples. Although it includes guidelines on what’s considered hidden, it does not make differences in the specific rules used to hide; as a result, aria-hidden nodes are expected to be included in the name when naming after a hidden subtree. 

Updating Chromium to match the spec: aria-hidden in hidden subtrees

We finally settled on including aria-hidden nodes when calculating a name or a description after a hidden subtree, via the aria-labelledby and aria-describedby relations. Firefox already behaved like this but Chromium did not. It appeared to be an easy change, like removing only one condition somewhere in the code… The reality is never like that, though.

The change landed successfully, but it involved much more than removing a condition: we found out the removal had side effects in other parts of the code. Additionally, we had some corrections in Chromium to prevent authors from using aria-hidden on focusable elements, but they weren't explicit in the code; our change codifies them and adds tests. All this was addressed in the patch and, in general, I think we left things in a better shape than they were.

A recap

In the process to fix a couple of bugs about accessible name computation in Chrome, we detected undocumented cases in the spec and TODOs in the implementation. To address all of it, we ended up rewriting the traversal code, discussing and updating the spec and adding even more tests to document corner cases.

Many people were involved in the process, not only Googlers but also people from other browser vendors, independent agents, and members of the ARIA working group. I’d like to thank them all for their hard work building and maintaining accessibility on the Web.

But there is still a lot to do! There are some known bugs, and most likely some unknown ones. Some discussions around the spec still remain open and could require changes in browser behavior. Contributions are welcome!

Thanks for reading, and happy hacking!

by Jacobo Aragunde Pérez at January 23, 2023 05:00 PM

January 20, 2023

Gyuyoung Kim

WebGPU Conformance Test Suite

WebGPU Introduction

WebGPU is a new Web API that exposes modern computer graphics capabilities, specifically Direct3D 12, Metal, and Vulkan, for performing rendering and computation operations on a GPU. This goal is similar to the WebGL family of APIs, but WebGPU enables access to more advanced features of GPUs. Whereas WebGL is mostly for drawing images but can be repurposed (with great effort) for other kinds of computations, WebGPU provides first-class support for performing general computations on the GPU. In other words, the WebGPU API is the successor to the WebGL and WebGL2 graphics APIs for the Web. It provides modern features such as "GPU compute" as well as lower-overhead access to GPU hardware and better, more predictable performance. The picture below shows a simplified diagram of the WebGPU architecture. [1]


There is currently a CTS (Conformance Test Suite) that tests the behaviors defined by the WebGPU specification. The CTS is written in TypeScript and developed on GitHub, and it can be run with a sort of 'adapter layer' to work under other test infrastructures like WPT and Telemetry. At this point, there are around 1,400 tests that check the validation and operation of WebGPU implementations. There is a site to run the WebGPU CTS: if you go to the URL below with a browser that enables the WebGPU feature, you can run the CTS in your browser.


The WebGPU CTS is developed on GitHub. If you want to contribute to the CTS project, you can start by taking a look at the issues that have been filed. Please follow the sequence below:
  1. Set up the CTS developing environment.
  2. Add to or edit the test plan in the CTS project site.
  3. Implement the test.
  4. Upload the test and request a review.


The CTS is largely composed of four categories – API, Shader, IDL, and Web Platform. First, the API category tests full coverage of the Javascript API surface of WebGPU. The Shader category tests the full coverage of the shaders that can be passed to WebGPU. The IDL category checks that the WebGPU IDL is correctly implemented. Lastly, the Web Platform category tests WebPlatform-specific interactions.


Consider a simple test for checking writeBuffer() validation. In general, a test consists of three blocks: a description, parameters, and the function body.
  • desc: Describes what the test covers and summarizes how it tests that functionality.
  • params: Defines the parameters, as a list of objects to be tested. Each such instance of the test is a “case”.
  • fn: The implementation body of the test.


Let’s look at parameterization in a bit more depth. The CTS provides helpers such as params() for creating large Cartesian products of test parameters. These generate “test cases”, further subdivided into “test subcases”. Thanks to this parameterization support, tests can cover a much wider range of input values.
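As a rough illustration of that idea (the real helpers are TypeScript; the parameter axes and values below are made up), the case expansion is just a Cartesian product:

```shell
# Hypothetical sketch of CTS-style parameterization: expanding two parameter
# axes into a Cartesian product of test cases. Axis names/values are made up.
: > cases.txt
for format in rgba8unorm rgba16float; do      # first parameter axis
  for dimension in 1d 2d 3d; do               # second parameter axis
    echo "case: format=$format dimension=$dimension" >> cases.txt
  done
done
cat cases.txt                                 # 2 x 3 = 6 generated cases
```

Each line corresponds to one “case”; the real CTS can further split a case into subcases using the same mechanism.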


The WebGPU CTS has been developed by several contributors. Google has written most of the test cases, with Intel and Igalia contributing after that.
This table shows the commit counts of the main contributors, classified by email account (which is why we assume the largest group is Google). As you can see in the table, Google and Intel are the main contributors to the WebGPU CTS. Igalia has participated in the CTS project since June 2022, writing test cases that check the validity of texture formats and copy functionality, as well as the operation of the index format and the depth/stencil state.

The next steps

The Chromium WebGPU team recently announced their plan in an intent-to-ship posted to blink-dev. The Origin Trial for WebGPU expires starting in M110, and they aim to ship the WebGPU feature in Chromium in M113.



by gyuyoung at January 20, 2023 03:05 AM

January 18, 2023

Alexander Dunaev

Deprecating old stuff

Chromium 109 (released January 10, 2023) has dropped support for xdg_shell_unstable_v6, one of Wayland shell protocols.

For years, Chromium’s Wayland backend was aware of two versions of the XDG shell interface. There were two parallel implementations of the client part, alike as twins: in places, the only difference was the names of the functions. Wayland has a naming convention that requires these names to start with the name of the interface they work with, so where one implementation would call xdg_toplevel_resize(), the other one would call zxdg_toplevel_v6_resize().

The shell is a now-standard Wayland protocol extension that describes the semantics of “windows”. It does so by assigning “roles” to Wayland surfaces, and by providing means to set the relative positions of these surfaces. Altogether, that makes it possible to build tree-like hierarchies of windows, which is absolutely necessary to render a window that has a menu with submenus.

The XDG shell protocol was introduced in Wayland protocols a long time ago. The definition file mentions 2008, but I was unable to track its history back that far. The Git repository where these files are hosted these days was created in October 2015, and the earlier history is not preserved. (Maybe the older repository still exists somewhere?)

Back in 2015, the XDG shell interface was termed “unstable”. Wayland waives backwards compatibility for unstable stuff, and requires versions to be mentioned explicitly in the names of protocols and their definition files. Following that rule, the XDG shell protocol was named xdg_shell_unstable_v5. Soon after the initial migration of the project, version 6 was introduced, named xdg_shell_unstable_v6. Since then, there have been two manifests in the unstable section, xdg-shell-unstable-v5.xml and xdg-shell-unstable-v6.xml. I assume versions prior to 5 were dropped at the migration.

This may be confusing, so let me clarify. Wayland defines several entities to describe stuff. A protocol is a grouping entity: it is the root element of the XML file where it is defined. A protocol may contain a number of interfaces that describe behaviour of objects. The real objects and calls are bound to interfaces, but the header files are generated for the protocol as a whole, so even if the client uses only one interface, it will import the whole protocol.

The XDG shell allows the application to, quoting its description, “create desktop-style surfaces”. A number of desktop environments implemented support for the unstable protocol, providing feedback and proposing improvements. In 2017, the shell protocol was declared stable, and got a new name: xdg_shell. The shell interface was renamed from zxdg_shell_v6 to xdg_wm_base. Since then, the clients had to choose between two protocols, or to support both.

For the stable stuff, the convention is simpler and easier to follow: the changes are assumed to be backwards compatible, so names of interfaces and protocols are fixed and no longer change. The version tracking is built into the protocol definition and uses the self-explanatory version and since attributes. The client can import one protocol and decide at run time which calls are possible and which are not depending on the version exposed by the compositor.

The stable version continued to evolve. Eventually, the unstable version became redundant, and in 2022 major Wayland compositors started to drop support for it. By November of last year we realised that the code implementing the unstable shell protocol in Chromium was no longer used by real clients. Of 30 thousand recorded uses of the Wayland backend, only 3 (yeah, just three) used the unstable interface. The time had come to get rid of the legacy, which amounted to more than one thousand lines of code.

Some bits of the support for alternative shell interfaces are preserved, though. A special factory class named ui::ShellObjectFactory is used to hide the implementation details of the shell objects. Before the unstable shell implementation was removed, the primary use of the factory was to select the XDG shell interface supported by the host, but it also made it easy to integrate with alternative shells. For that purpose, mainly for downstream projects, the factory will remain.

by Alex at January 18, 2023 01:21 PM

January 17, 2023

Manuel Rego

10 years ago

Once upon a time…

10 years ago I landed my first WebKit patch 🎂:

Bug 107275 - [GTK] Implement LayoutTestController::addUserScript

That was my first patch to a web-engine-related project, and also my first professional C++ patch. My previous work at Igalia was around web application development with PHP and Java. At that time, in 2013, Igalia had already been involved in web rendering engines for several years, and the work around web applications was fading out inside the company, so moving to the other side of the platform looked like a good idea. 10 years have passed and lots of great things have happened.

Since that first patch many more have come, mostly in Chromium/Blink and WebKit; but also in WPT, CSS, HTML and other specs; and just a few, sadly, in Gecko (I’d love to find the opportunity to contribute there more at some point). I’ve become a committer and reviewer in both the Chromium and WebKit projects. I’m also a member of the CSS Working Group (though not very active lately) and a Blink API owner.

During these years I have had the chance to attend several events, speak at a few conferences, and give some interviews. I’ve also been helping to organize the Web Engines Hackfest since 2014, and a CSSWG face-to-face meeting at the Igalia headquarters in A Coruña in January 2020 (I have such great memories of the last dinner).

These years have allowed me to meet lots of wonderful people, from whom I’ve learnt, and still learn, many things every day. I have the pleasure of working with amazing folks on a daily basis in the open. Every new feature or project I work on is a new learning experience, and that’s something I really like about this kind of work.

In this period I’ve seen Igalia grow a lot in the web platform community, to the point that these days we’re the top consultancy on the web platform worldwide, with an important position in several projects and standards bodies. We’re getting fairly well known in this ecosystem, and we’re very proud of having achieved that.

Looking back, I’m so grateful for the opportunity given to me by Igalia, and thankful to the amazing colleagues and the community who have helped me contribute to the different projects. Also thanks to the different customers that have allowed me to work upstream and develop my career around web browsers. 🙏 These days the source code I write (together with many others) is used daily by millions of people, which is totally unbelievable if you stop for a minute to think about it.

Looking forward to the next decade of involvement in the web platform. See you all on the web! 🚀

January 17, 2023 11:00 PM

Amanda Falke

Tutorial: Building WPE WebKit for Raspberry Pi 3

A lightning guide for building WPE WebKit with buildroot

This tutorial covers getting “up and running” with WPE WebKit on a Raspberry Pi 3, using a laptop/desktop with Fedora Linux installed. WPE WebKit has many benefits; you may read here about why WPE WebKit is a great choice for embedded devices. WPE WebKit has minimal dependencies, and it displays high-quality animations, WebGL and videos on embedded devices.

WebPlatformForEmbedded is our focus. For this tutorial we’ll be using buildroot to build our WPE WebKit image, so make sure to clone the buildroot repository.

You will need:

Raspberry pi 3 items:

  • A raspberry pi 3.
  • A microSD card for the pi. (I usually choose 32GB microSD cards).
  • An external monitor, extra mouse, and extra keyboard for our rpi3, separately from the laptop.
  • An HDMI cable to connect the rpi3 to its own external monitor.

Laptop items:

  • Linux laptop/desktop.
    • This tutorial will be based on Fedora, but you can use Debian or any distro of your choice.
  • You also need a way to interface the microSD card with your laptop. You can get an SD Adapter for laptops that have an SD Card adapter slot, or you can use an external SD Card adapter interface for your computer.
    • This tutorial will be based on having a laptop with an SD card slot, and hence an SD Card Adapter will work just fine.

Items for laptop to communicate with rpi3:

  • An ethernet cable to connect the rpi3 to your laptop.
  • You need some way to get ethernet into your laptop. This is either in the form of an ethernet port on your laptop (not likely), or an adapter of some sort (likely a USB adapter).

Steps: High level overview

This is a high level overview of the steps we will be taking.

  1. Partition the blank SD card.
  2. In the buildroot repository, make <desired_config>.
  3. Run the buildroot menuconfig with make menuconfig to set up .config file.
  4. Run make to build sdcard.img in buildroot/output dir; change .config settings as needed.
  5. Write sdcard.img file to the SD card.
  6. Connect the rpi3 to its own external monitor, and its own mouse and keyboard.
  7. Connect the rpi3 to the laptop using ethernet cable.
  8. Put the SD card into the rpi3.
  9. Setup a shared ethernet connection between the laptop and rpi to get the IP address of rpi.
  10. ssh into the rpi and start WPE WebKit.
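The steps above boil down to a short command sequence. Here is a sketch (printed rather than executed, since it needs a buildroot checkout, an SD card, and the rpi3; the GitHub mirror URL and the /dev/sdX device name are assumptions to replace with your own):

```shell
# Condensed command flow for steps 2-5 (printed, not executed: it needs a real
# buildroot checkout and SD card). URL and device name are placeholders.
cat > wpe-flow.txt <<'EOF'
git clone https://github.com/buildroot/buildroot.git && cd buildroot
make list-defconfigs                        # discover available configs
make raspberrypi3_wpe_2_28_cog_defconfig    # write the .config
make menuconfig                             # adjust filesystem options
make                                        # builds output/images/sdcard.img
dd if=output/images/sdcard.img of=/dev/sdX bs=4M conv=fsync   # or use Etcher
EOF
cat wpe-flow.txt
```

Each of these commands is covered in detail in the steps below.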

Steps: Detailed overview/sequence

1. Partition the blank SD card using fdisk. Create a boot partition and a root partition.

  • Note: this is only needed in case the output image format is root.tar. If it is sdcard.img, the image is dumped directly to the SD card and is already partitioned internally.
  • If you’re unfamiliar with fdisk, this is a good tutorial.

2. In the buildroot repository root directory, make the desired config.

  • Make sure to clone the buildroot repository.
  • Since things change a lot over time, it’s important to note the specific buildroot commit this tutorial was built on, on January 12th, 2023. It is recommended to build from that commit for consistency, to ensure the tutorial works for you.
  • We are building the WPE 2.28 Cog configuration with buildroot. See the build options in the buildroot repository.
  • Run make list-defconfigs to get a list of configurations.
  • Copy raspberrypi3_wpe_2_28_cog_defconfig and run it: make raspberrypi3_wpe_2_28_cog_defconfig.
  • You will quickly get output which indicates that a .config file has been written in the root directory of the buildroot repository.

3. Run the buildroot menuconfig with make menuconfig to set up .config file.

  • Run make menuconfig. You’ll see options here for configuration. Go slowly and be careful.
  • Change these settings. Help menus are available for menuconfig, you’ll see them displayed on the screen.
  Operation   Location                                                            Value
  ENABLE      Target packages -> Filesystem and flash utilities                   dosfstools
  ENABLE      Target packages -> Filesystem and flash utilities                   mtools
  ENABLE      Filesystem images                                                   ext2/3/4 root filesystem
  SET VALUE   Filesystem images -> ext2/3/4 root filesystem -> ext2/3/4 variant   ext4
  DISABLE     Filesystem images                                                   initial RAM filesystem linked into linux kernel
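For reference, those menuconfig operations correspond roughly to the following .config symbols (the symbol names are my reading of buildroot’s options; double-check them against the help screens in your tree):

```
# Target packages -> Filesystem and flash utilities
BR2_PACKAGE_DOSFSTOOLS=y
BR2_PACKAGE_MTOOLS=y
# Filesystem images -> ext2/3/4 root filesystem, variant ext4
BR2_TARGET_ROOTFS_EXT2=y
BR2_TARGET_ROOTFS_EXT2_4=y
# Filesystem images -> initial RAM filesystem linked into linux kernel: disabled
# BR2_TARGET_ROOTFS_INITRAMFS is not set
```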

4. Run make to build sdcard.img in buildroot/output dir; change .config settings as needed.

  • Run make. Then get a coffee, as the build and cross-compilation will take a while.

  • In reality, you may encounter some errors along the way, as cross-compilation can be an intricate matter. This tutorial will guide you through those potential errors.
  • When you encounter errors, you’ll follow a “loop” of sorts:

    Run make -> encounter errors -> manually edit .config file -> remove buildroot/output dir -> run make again, until sdcard.img is built successfully.

  • If you encounter CMake errors, such as fatal error: stdlib.h: No such file or directory, compilation terminated, and you have a relatively new version of CMake on your system, the reason for the error may be that buildroot is using your local CMake instead of the one specified in the buildroot configuration.
    • We will fix this error by setting in .config file: BR2_FORCE_HOST_BUILD=y. Then remove buildroot/output dir, and run make again.
  • If you encounter an error such as path/to/buildroot/output/host/lib/gcc/arm-buildroot-linux-gnueabihf/9.2.0/plugin/include/builtins.h:23:10: fatal error: mpc.h: No such file or directory, we can fix it by editing the Makefile at ./output/build/linux-rpi-5.10.y/scripts/gcc-plugins/Makefile, adding -I path/to/buildroot/output/host/include to the plugin_cxxflags stanza. Then, as usual, remove the buildroot/output dir and run make again.
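To make that edit concrete, here is a sketch that applies the change with sed, demonstrated on a stand-in file (the flags in the stand-in are invented; the real Makefile lives at ./output/build/linux-rpi-5.10.y/scripts/gcc-plugins/Makefile):

```shell
# Demonstrate the gcc-plugins Makefile fix on a stand-in file: append the
# buildroot host include path to the plugin_cxxflags stanza.
demo=gcc-plugins-Makefile.demo
printf 'plugin_cxxflags = -fno-rtti -fno-exceptions\n' > "$demo"
sed -i 's|^plugin_cxxflags =.*|& -I path/to/buildroot/output/host/include|' "$demo"
cat "$demo"    # the stanza now carries the extra -I flag
```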

5. Write sdcard.img file to the SD card.

  • At this point after the make process, we should have sdcard.img file in buildroot/output/images directory.
  • Write this file to the SD card.
  • Consider using Etcher to do so.

6. Connect the rpi3 to its own external monitor, and its own mouse and keyboard.

  • We’ll have separate monitor, mouse and keyboard all connected to the raspberry pi so that we can use it independently from the laptop.

7. Connect the rpi3 to the laptop using ethernet cable.

8. Put the SD card into the rpi3.

9. Setup a shared ethernet connection between the laptop and rpi to get the IP address of rpi.

In general, one of the main problems in connecting to a Raspberry Pi via ssh is knowing the IP address of the device. This is very simple with Raspbian OS: simply turn on the Raspberry Pi and edit configurations to enable ssh, often over wifi.

This is where the ethernet capabilities of the raspberry pi come in.

Goal: To find syslog message DHCPACK acknowledgement and assignment of the IP address after setting up shared connection between raspberry pi and the laptop.

Throughout this process, continually look at logs. Eventually we will see a message DHCPACK which will likely be preceded by several DHCP handshake related messages such as DHCP DISCOVER, REQUEST etc. The DHCPACK message will contain the IP address of the ethernet device, and we will then be able to ssh into it.

    1. Tail the syslogs of the laptop. On Debian distributions, this is often /var/log/syslog. Since we are using Fedora, we’ll be using systemd's journald with the journalctl command:
      • sudo journalctl -f
      • Keep this open in a terminal window.
      • You can also come up with a better solution, like grepping the logs or piping stdout elsewhere, if you like.
    2. In a second terminal window, open up NetworkManager.
      • Become familiar with the existing devices prior to powering on the Raspberry Pi, by running nmcli.
    3. Power on the Raspberry Pi. Watch your system logs.
      • The syslogs will detail the Raspberry Pi’s name.
    4. Look for that name in NetworkManager with nmcli device.
      • Using NetworkManager’s nmcli, set up a shared connection for the ethernet device.
      • Setting up a shared connection is as simple as nmcli connection add type ethernet ifname $ETHERNET_DEVICE_NAME ipv4.method shared con-name local
      • This is a good tutorial for setting up a shared connection with NetworkManager.
    5. Once the connection is shared, the syslogs will show a DHCPACK message acknowledging the ethernet device and its IP address. (You may need to power-cycle the rpi3 to see this message, but it will happen.)
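Pulled together, the laptop-side sequence looks like this (printed rather than executed, since it needs NetworkManager, journald, and a powered-on rpi3; $ETHERNET_DEVICE_NAME is whatever nmcli reports for your adapter):

```shell
# Laptop-side discovery sequence, printed rather than executed (it needs
# NetworkManager and a live rpi3 on the ethernet cable).
cat > shared-connection-steps.txt <<'EOF'
sudo journalctl -f                 # terminal 1: watch for DHCP messages
nmcli device                       # terminal 2: find the ethernet device name
nmcli connection add type ethernet ifname $ETHERNET_DEVICE_NAME \
    ipv4.method shared con-name local
# power-cycle the rpi3 if needed, then look for the DHCPACK line in the
# journal; it carries the IP address to ssh into
EOF
cat shared-connection-steps.txt
```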

10. ssh into the rpi and start WPE WebKit.

  • Now that we have the IP address of the raspberry pi, we can ssh into it from the laptop: ssh root@<RPI3_IP_ADDRESS>. (The default password is ‘root’. You can also add your user public key to /root/.ssh/authorized_keys on the pi. You can simplify this process by creating an overlay/root/.ssh/authorized_keys on your computer and by specifying the path to the overlay directory in the BR2_ROOTFS_OVERLAY config variable. That will copy everything in the overlay dir to the image.)
  • After that, export these env variables WPE_BCMRPI_TOUCH=1 and WPE_BCMRPI_CURSOR=1 to enable keyboard and mouse control.
    • Why: Recall that WPE WebKit generally targets embedded devices, such as kiosks or set-top boxes, controlled with a remote or similar device, or with touch interaction. We export these environment variables so that we can “test” WPE WebKit with the separate mouse and keyboard attached to our Raspberry Pi, without needing a touch screen, special hardware targets, or a Wayland compositor such as weston. If this piques your curiosity, please see the WPE WebKit FAQ on Wayland.
  • Start WPE WebKit with cog: cog ""
  • A browser will launch in the external monitor connected to the raspberry pi 3, and we can control the browser with the raspberry pi’s mouse and keyboard!
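As a recap of this final step (printed rather than executed; RPI3_IP_ADDRESS and the URL passed to cog are placeholders):

```shell
# Final-step recap, printed rather than executed (needs the rpi3 on the
# network). RPI3_IP_ADDRESS and the URL are placeholders.
cat > wpe-launch-steps.txt <<'EOF'
ssh root@$RPI3_IP_ADDRESS        # default password: root
export WPE_BCMRPI_TOUCH=1        # enable keyboard input
export WPE_BCMRPI_CURSOR=1       # enable the mouse cursor
cog "https://wpewebkit.org"      # launch WPE WebKit with the Cog shell
EOF
cat wpe-launch-steps.txt
```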

That’s all for now. Feel free to reach out in the support channels for WPE on Matrix.

WPE’s Frequently Asked Questions

by Amanda Falke at January 17, 2023 08:03 AM

January 16, 2023

Eric Meyer

Minimal Dark Mode Styling

Spurred by a toot in reply to a goofy dev joke I made, I’ve set up a dirty-rough Dark Mode handler for meyerweb:

@media (prefers-color-scheme: dark) {
	html body {filter: invert(1);}
	/* the following really should be managed by a cascade layer */
	html img, 
	html #archipelago a:hover img {filter: invert(1);}
	html #thoughts figure.standalone img {
	  box-shadow: 0.25em 0.25em 0.67em #FFF8;
	}
}

It’s the work of about five minutes’ thought and typing, so I suspect it’s teetering on the edge of Minimum Viable Product, but I’m not sure which side of that line it lands on.

Other than restructuring things so that I can use Cascade Layers to handle the Light and Dark Mode styling, what am I missing or overlooking?  <video> elements, I suppose, but anything else jump out at you as in need of correction?  Let me know!

(P.S. “Use only classes, never IDs” is not something that needs to be corrected.  Yes, I know why people think that, and this situation seems to be an argument in favor of that view, but I have a different view and will not be converting IDs to classes.  Thanks in advance for your forbearance.)

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at January 16, 2023 09:04 PM

Alejandro Piñeiro

v3dv status update 2023-01

We haven’t posted updates on the work done on the V3DV driver since we announced the driver becoming Vulkan 1.2 Conformant.

But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.

Moved more functionality to the GPU

Our implementations of Events and Occlusion Queries were both mostly CPU-based. We have refactored both features with a new GPU-side implementation based on the use of compute shaders.

In addition to being more “the Vulkan way”, this has further benefits. For example, in the case of events, we no longer need to stall on the CPU when handling GPU-side event commands, which allowed us to re-enable sync_fd import/export.


VK_KHR_sampler_ycbcr_conversion

We have just landed a real implementation of this extension, based on the work of Ella Stanforth as part of her Igalia Coding Experience with us. This was really complex work, as the feature adds support for multi-plane formats and needed modifications in various parts of the driver. A big kudos to Ella for getting this tricky feature going. Also thanks to Jason Ekstrand, who worked on a common Mesa framework for YCbCr support.

Support for new extensions

Since 1.2 was announced, the following extensions got exposed:

  • VK_EXT_texel_buffer_alignment
  • VK_KHR_maintenance4
  • VK_KHR_zero_initialize_workgroup_memory
  • VK_KHR_synchronization2
  • VK_KHR_workgroup_memory_explicit_layout
  • VK_EXT_tooling_info (0 tools exposed though)
  • VK_EXT_border_color_swizzle
  • VK_EXT_shader_module_identifier
  • VK_EXT_depth_clip_control
  • VK_EXT_attachment_feedback_loop_layout
  • VK_EXT_memory_budget
  • VK_EXT_primitive_topology_list_restart
  • VK_EXT_load_store_op_none
  • VK_EXT_image_robustness
  • VK_EXT_pipeline_robustness
  • VK_KHR_shader_integer_dot_product

Some miscellanea

In addition to those, we also worked on the following:

  • Implemented a heuristic to decide when to enable double-buffer mode, which could help improve performance in some cases. It still needs to be enabled through the V3D_DEBUG environment variable.

  • Got v3dv and v3d using the same shader optimization method, which allows reusing more code between the OpenGL and Vulkan drivers.

  • Getting the driver working with the fossilize-db tools

  • Bugfixing, mostly related to bugs identified through new Khronos CTS releases

by infapi00 at January 16, 2023 03:47 PM

January 12, 2023

Miyoung Shin

Removing Reference FrameTreeNode from RenderFrameHost

MPArch stands for Multiple Pages Architecture. The Chromium teams at Igalia and Google are working together on the MPArch project, which unifies the implementations of several features that support multiple pages in the same tab (back/forward cache, prerendering, guest views, and portals) but were built on different architecture models, into a single architecture. I recommend you first read the recently posted blog by Que if you are not familiar with MPArch; it will help you understand the detailed history and structure of MPArch. This post assumes that the reader is familiar with the //content API, and describes the ongoing task of removing references to FrameTreeNode from RenderFrameHost in the MPArch project.

Relationship between FrameTreeNode and RenderFrameHost in MPArch

The main frame and iframe are mirrored as FrameTreeNode and the document of each frame is mirrored as RenderFrameHost. Each FrameTreeNode has a current RenderFrameHost, which can change over time as the frame is navigated. Let’s look at the relationship between FrameTreeNode and RenderFrameHost of features using the MPArch structure.

BFCache (back/forward cache) is a feature that caches pages in memory (preserving the JavaScript and DOM state) when the user navigates away from them. When the document of a new page is navigated to, the existing document’s RenderFrameHost is saved in the BFCache, and when a back/forward action occurs, that RenderFrameHost is restored from the BFCache. The FrameTreeNode is kept when the RenderFrameHost is saved/restored.

Prerender2 is a feature to pre-render the next page for faster navigation. In prerender, a document will be invisible to the user and isn’t allowed to show any UI changes, but the page is allowed to load and run in the background.

Navigation changes to support multiple pages in WebContents

When the prerendered page is activated, the current RenderFrameHost that FrameTreeNode had is replaced with RenderFrameHost in Prerender.

Fenced Frame
The fenced frame enforces a boundary between the embedding page and the cross-site embedded document, such that user data visible to the two sites cannot be joined together. On the browser side, a fenced frame works like the main frame for security- and privacy-related access, and like a subframe in other cases. To this end, it is wrapped as a dummy FrameTreeNode in the FrameTree, and the fenced frame can be accessed through a delegate.

GuestView & Portals
Chromium implements a tag with GuestView, which is the templated base class for out-of-process frames in the chrome layer. For more detail, please refer to Julie’s blog.

Portals allow the embedding of a page inside another page, which can later be activated and replace the main page.

Navigation changes to support multiple pages in WebContents

The GuestView and Portals are implemented with a multiple-WebContents model, and refactoring them to MPArch is being considered.

Pending Deletion
When RenderFrameHost has started running unload handlers (this includes handlers for the unload, pagehide, and visibilitychange events), or when it has completed running them, the RenderFrameHost‘s lifecycle is in the pending deletion state.

Reference FrameTreeNode from RenderFrameHost

Accessing FrameTreeNode from a RenderFrameHost that has entered the BFCache, from an old RenderFrameHost that has already been swapped out by a prerendered page, or from a RenderFrameHost in the pending deletion state all have security issues. Currently, RenderFrameHost has direct access to FrameTreeNode internally and exposes methods that allow access to FrameTreeNode externally.

A meta bug 1179502 addresses this issue.

Introduce RenderFrameHostOwner
RenderFrameHostOwner is an interface through which RenderFrameHost communicates with the FrameTreeNode owning it, which can be null or can change during the lifetime of the RenderFrameHost, to prevent accidental violation of implicit “the associated FrameTreeNode stays the same” assumptions. RenderFrameHostOwner is:
  // - Owning FrameTreeNode for main RenderFrameHosts in kActive, kPrerender,
  //   kSpeculative or kPendingCommit lifecycle states.
  // - Null for main RenderFrameHosts in kPendingDeletion lifecycle state.
  // - Null for main RenderFrameHosts stored in BFCache.
  // - Owning FrameTreeNode for subframes (which stays the same for the entire
  //   lifetime of a subframe RenderFrameHostImpl).


At all places where FrameTreeNode is used, we are auditing whether we can replace FrameTreeNode with RenderFrameHostOwner, after checking which lifecycle states of RenderFrameHost are possible there. However, there are still some places where RenderFrameHostOwner cannot be used (e.g. the unload handler). We are still investigating how to deal with these cases. Refactoring RenderFrameHost::frame_tree_node(), which is widely used inside and outside RenderFrameHost, will be necessary, and that work is ongoing.

by mshin at January 12, 2023 10:52 AM

January 06, 2023

Enrique Ocaña

Cat’s Panic

It’s been 8 years since the last time I wrote a videogame just for personal fun. As it’s now become a tradition, I took advantage of the extra focused personal time I usually have on the Christmas season and gave a try to Processing to do my own “advent of code”. It’s a programming environment based on Java that offers a similar visual, canvas-based experience to the one I enjoyed as a child in 8 bit computers. I certainly found coding there to be a pleasant and fun experience.

So, what I coded is called Cat’s Panic, my own version of a known arcade game with a similar name. In this version, the player has to unveil the outline of a hidden cute cat on each stage.

The player uses the arrow keys to control a cursor that can freely move inside a border line. When pressing space, the cursor can start an excursion to try to cover a new area of the image to be unveiled. If any of the enemies touches the excursion path, the player loses a life. The excursion can be canceled at any time by releasing the space key. Enemies can be killed by trapping them in a released area. A stage is completed when 85% of the outline is unveiled.

Although this game is released under the GPLv2, I don’t recommend that anybody look at its source code. It breaks all principles of good software design; it’s messy and ugly, and its only purpose was to make the development process entertaining for me. You’ve been warned.

I’m open to contributions in the way of new cat pictures that add more stages to the already existing ones, though.

You can get the source code from the GitHub repository, and a binary release for Linux here (with all the Java dependencies, which weigh a lot).

Meow, enjoy!

by eocanha at January 06, 2023 03:32 AM

January 05, 2023

Ziran Sun

2022 Review – Igalia Web Platform Team

2022 was a good year for the Web Platform team in Igalia in a lot of ways. We worked on some really interesting Web technologies, and explored some really interesting new ways of funding them. We hired some great new people, and attended and spoke at some great events. And we did some of our meetings in person!


Open Prioritization Projects

In 2020 Igalia launched the Open Prioritization experiment. It is an experiment in crowd-funding prioritization of new feature implementations for web browsers.

  • :focus-visible in WebKit

Our first Open Prioritization experiment, to choose a project to crowdfund, launched with 6 possible candidates representing work in every browser engine, across many areas of the platform, and at various stages of completeness. Based on the pledges that the different projects received, :focus-visible in WebKit stood out and became the winner. Rego worked on this feature in 2021 and reported regularly on his progress. :focus-visible was finally enabled by default in WebKit and shipped in Safari 15.4.

Bear in mind that Apple Did Not Crowdfund :focus-visible in Safari. Eric made it very clear in his blog that, “the addition of :focus-visible to WebKit was lead by the community, done by Igalia, and contributed to WebKit without any involvement from Apple except in the sense of their reviewing patches and accepting the contributions”.

  • Enter: Wolvic

In February 2022, Igalia announced Wolvic, a new browser project with an initial focus on picking up where Firefox Reality leaves off. Brian published Enter: Wolvic, explaining what WebXR could bring and Igalia’s initial focus with Wolvic. As an open collective project, the Wolvic Collective is seeking more partners and collective funding to further explore the potential and capability of WebXR technology. To understand this project in detail, please see Brian’s talk on Wolvic’s vision.

The Igalians involved in the Wolvic project have put a lot of effort into updating and nurturing the Firefox Reality codebase. Follow the commits by Sergio (@svillar), Felipe (@felipeerias) and Edu (@elima) from the Web Platform team in the Wolvic GitHub codebase.

  • MathML

As part of the Open Prioritization effort, it’s exciting that MathML is shipping in Chrome 109. The MathML collective is also part of our open collective effort. The plan is to support MathML-Core in other browsers for browser interoperability. This project covers Level 1 of MathML-Core support; we have envisioned more work to follow and will move to Level 2 after that. Your support would make a lot of difference in moving this forward.

Other Collaborative Projects

Most projects happening in the team are collaborative efforts with our clients and other parties in the Open Source community.

  • :has()

The :has() pseudo-class is a selector that matches an element which has at least one element matching the relative selector passed as an argument. Byungwoo (@byungwoo_bw on Twitter) started the prototyping work in 2021, and it shipped in Chromium 105 this year.

The CSS :has() pseudo-class is a very exciting feature. Igalians have been talking about it since the initial stage of filing an intent to prototype. Byungwoo presented implementation and performance details at BlinkOn 16, and you can find a recent blog post by Eric with examples of using it. In addition, Byungwoo gave a talk on updates since :has() was enabled in Chrome.
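For readers who haven’t tried it yet, here is a small illustrative sketch (hypothetical markup and class names, not from any of the posts above) of what :has() enables: styling an ancestor based on what it contains, something CSS selectors couldn’t do before.

```css
/* Give a figure a frame only when it actually contains a caption. */
figure:has(figcaption) {
  border: 1px solid #ccc;
  padding: 1em;
}

/* Color a label whose adjacent input is currently invalid. */
label:has(+ input:invalid) {
  color: firebrick;
}
```

Before :has(), both of these required JavaScript to toggle a class on the parent.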

  • CSS highlight pseudos

Since 2021, Delan has been working on spelling and grammar features in Chromium. She has been writing a series on the development of this work (Chromium spelling and grammar features and Chromium spelling and grammar, part 2). As the project progressed, Delan produced the third part of the series, on CSS highlight pseudos.

  • Custom Handlers component for Chrome

Igalia has been working on improving the integration of the IPFS protocol in the main browsers. Javier has been blogging about this work, including New Custom Handlers component for Chrome, which introduces some architectural changes in Chrome to move the Custom Handlers logic into a new component, and Discovering Chrome’s pre-defined Custom Handlers, which discusses feasible solutions.

  • Automated accessibility testing

The Web Platform team has been engaged in various aspects of accessibility work in browsers, from web standards to solutions. Automated accessibility testing by Alex is a reflection of our knowledge on testing.

  • WPT Python 3 Migration

In 2020, Igalia was involved in the Python 3 migration work for the web-platform-tests (WPT) project. The WPT Python 3 Migration post walks through the challenges, approaches and some implementations of this migration work.

As always, there are many other Web technology topics Igalians are working on. We recommend following Brian’s blog and Eric’s blog for updates.

Conferences & Events

The Web Platform team was able to attend some conferences and events on-site this year after the COVID-19 pandemic. Here are some highlights.

  • Igalia Summit and Web Engines Hackfest

With employees distributed across over 20 countries, most of them working remotely, Igalia holds summits every year to give employees opportunities to meet face-to-face. A summit normally runs for a week, with hackfests, code camps, team building and recreational activities.

Since it is a great opportunity to bring together people working on the different browsers and related standards to discuss ideas and plans, during the summit week Igalia normally hosts the Web Engines Hackfest, inviting members from all parts of the Web Platform community. This year’s Hackfest was a success, with over 70 attendees across two days and lots of discussions and conversations around different features. Rego noted down his personal highlights from the event. Frédéric also published a status update on OpenType MATH fonts.

The summit is a great event for Igalians to gather together. For a new joiner, it’s a great first summit, and for an experienced Igalian there is a lot to take away too.

  • BlinkOn 17

BlinkOn 17 in November was a hybrid event, hosted in the San Francisco Bay Area as well as over the Internet. Igalia had a strong presence, with seven lightning talks and three breakout talks. From the Web Platform team, Delan presented her work on Faster Style & Paint for CSS Highlights, and Frédéric updated the status of MathML in his breakout talk Shipping MathML in Chrome 109. Andreu’s lightning talk on Specifying Line-clamp was very interesting, and you might find the live session on the History of the Web by Brian and Eric fascinating to follow.

  • TPAC

TPAC this year was a hybrid event, and Rego was happy that he managed to travel to TPAC 2022 after the pandemic and meet so many people in real life again.


Igalia has been hiring actively, and the Web Platform team is no exception. In his blog post Igalia Web Platform team is hiring, Rego explains how Igalia’s cooperative flat structure works and what you could expect from working at Igalia. The Web Platform team is a very friendly and hard-working team, and the team’s blogs probably show only a small part of the work happening in it. If you’d like to know more, contact us. And even better, if you are interested in joining the team, check Rego’s blog post on how to apply.

We were all super excited that Alice Boxhall (@sundress on Twitter) joined the team in November 2022. Alice represented Igalia as a candidate in the W3C TAG election. Brian explains in his blog post Alice 2023 why we thought Alice would be a good addition to the W3C TAG. Although the final result didn’t go our way, our support for Alice continues into the future :-).

Looking forward to 2023

With some exciting work planned for 2023, we are looking forward to another productive year ahead!

by zsun at January 05, 2023 02:41 PM

January 04, 2023

Danylo Piliaiev

Turnips in the wild (Part 3)

This is the third part of my “Turnips in the wild” blog post series where I describe how I found and fixed graphical issues in the Mesa Turnip Vulkan driver for Adreno GPUs. If you missed the first two parts, you can find them here:

  1. Psychonauts 2
    1. Step 1 - Toggling driver options
    2. Step 2 - Finding the Draw Call and staring at it intensively
    3. Step 3 - Bisecting the shader until nothing is left
    4. Step 4 - Going deeper
      1. Loading varyings
      2. Packing varyings
      3. Interpolating varyings
  2. Injustice 2
  3. Monster Hunter: World

Psychonauts 2

A few months ago it was reported that “Psychonauts 2” has rendering artifacts in the main menu, though I only recently got my hands on it.

Screenshot of the main menu with the rendering artifacts
Notice the mark in the top right; the game was running directly on a Qualcomm board via FEX-Emu

Step 1 - Toggling driver options

Forcing direct rendering, forcing tiled rendering, disabling UBWC compression, forcing synchronizations everywhere, and so on: nothing helped or changed the outcome.

Step 2 - Finding the Draw Call and staring at it intensively

Screenshot of the RenderDoc with problematic draw call being selected
The first draw call with visible corruption

Looking around the draw call, everything looks good except for a single input image:

Input image of the draw call with corruption already present
One of the input images

Ughhh, it seems like this image comes from the previous frame, and that’s bad. Inter-frame issues are hard to debug since there is no tooling to inspect two frames together…

* Looks around nervously *

Ok, let’s forget about it; maybe it doesn’t matter. The next step would be looking at the pixel values in the corrupted region:

color = (35.0625, 18.15625, 2.2382, 0.00)

Now, let’s see whether RenderDoc’s built-in shader debugger gives us the same value as the GPU or not.

color = (0.0335, 0.0459, 0.0226, 0.50)

(Or not)

After looking at the similar pixel on RADV, RenderDoc seems right. So the issue is somewhere in how the driver compiled the shader.

Step 3 - Bisecting the shader until nothing is left

A good start is to print the shader values with debugPrintfEXT, which is much nicer than looking at color values. Adding the debugPrintfEXT aaaand… the issue goes away. Great, not like I wanted to debug it or anything.

Adding a printf changes the shader, which affects the compilation process, so the changed result is not unexpected, though it’s much better when it works. So now we are stuck with observing pixel colors.

Bisecting a shader isn’t hard, especially if there is a reference GPU with the same capture opened to compare against. You delete pieces of the shader until the results match; when they do, you take one step back and start removing other expressions, repeating until nothing more can be removed.
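The loop I followed by hand can be sketched as a tiny greedy delta-debugging procedure (this is my own illustration, not real tooling; the still_reproduces predicate stands in for “recompile the edited shader in the capture and check whether the corruption is still there”):

```python
def bisect_shader(lines, still_reproduces):
    """Greedily drop shader lines while the misrendering still reproduces.

    `still_reproduces` is a stand-in for recompiling the edited shader
    and eyeballing the output in the capture.
    """
    i = 0
    while i < len(lines):
        candidate = lines[:i] + lines[i + 1:]
        if still_reproduces(candidate):
            lines = candidate  # the removal kept the bug: accept it
        else:
            i += 1             # the removal hid the bug: undo, try the next line
    return lines
```

Run to a fixed point, this leaves only the lines the bug actually depends on.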

First iteration of the elimination, most of the shader is gone (click me)
float _258 = 1.0 / gl_FragCoord.w;
vec4 _273 = vec4(_222, _222, _222, 1.0) * _258;
vec4 _280 = View.View_SVPositionToTranslatedWorld * vec4(, 1.0);
vec3 _284 = / vec3(_280.w);
vec3 _287 = _284 - View.View_PreViewTranslation;
vec3 _289 = normalize(-_284);
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y) * Material.Material_ScalarExpressions[0].x;
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec2 _312 = (_309.xy * vec2(2.0)) - vec2(1.0);
vec3 _331 = normalize(mat3(, cross(, * in_var_TEXCOORD11_centroid.w, * normalize((vec4(_312, sqrt(clamp(1.0 - dot(_312, _312), 0.0, 1.0)), 1.0).xyz * View.View_NormalOverrideParameter.w) + * ((View.View_CullingSign * View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID * 37u) + 4u].w) * float(gl_FrontFacing ? (-1) : 1));

vec2 _405 = in_var_TEXCOORD4.xy * vec2(1.0, 0.5);
vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), _405 + vec2(0.0, 0.5));
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-
float _447 = _331.y;

vec3 _531 = (((max(0.0, dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_447, _331.zx, 1.0))))) * View.View_IndirectLightingColorScale);

bool _1313 = TranslucentBasePass.TranslucentBasePass_Shared_Fog_ApplyVolumetricFog > 0.0;
vec4 _1364;

vec4 _1322 = View.View_WorldToClip * vec4(_287, 1.0);
float _1323 = _1322.w;
vec4 _1352;
if (_1313)
    _1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(((_1322.xy / vec2(_1323)).xy * vec2(0.5, -0.5)) + vec2(0.5), (log2((_1323 * View.View_VolumetricFogGridZParams.x) + View.View_VolumetricFogGridZParams.y) * View.View_VolumetricFogGridZParams.z) * View.View_VolumetricFogInvGridSize.z), 0.0);
else
    _1352 = vec4(0.0, 0.0, 0.0, 1.0);
_1364 = vec4( + ( * _1352.w), _1352.w * in_var_TEXCOORD7.w);

out_var_SV_Target0 = vec4(_531.x, _1364.w, 0, 1);

After that it became harder to reduce the code.

More elimination (click me)
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y);
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec3 _331 = normalize(mat3(, in_var_TEXCOORD11_centroid.www, * normalize((vec4(_309.xy, 1.0, 1.0).xyz))) * ((View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID)].w) );

vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), in_var_TEXCOORD4.xy);
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-

vec3 _531 = (((dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_331.x, 1,1, 1.0)))) * View.View_IndirectLightingColorScale);

vec4 _1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(vec2(0.5), View.View_VolumetricFogInvGridSize.z), 0.0);

out_var_SV_Target0 = vec4(_531.x, in_var_TEXCOORD7.w, 0, 1);

And finally, the end result:

vec3 a = +;
float b = a.x + a.y + a.z + in_var_TEXCOORD11_centroid.w + in_var_TEXCOORD0[0].x + in_var_TEXCOORD0[0].y + in_var_PRIMITIVE_ID.x;
float c = b + in_var_TEXCOORD4.x + in_var_TEXCOORD4.y + in_var_LIGHTMAP_ID;

out_var_SV_Target0 = vec4(c, in_var_TEXCOORD7.w, 0, 1);

Nothing is left but the loading of varyings and the simplest operations on them to prevent their elimination by the compiler.

The in_var_TEXCOORD7.w values are several orders of magnitude different from the expected ones, and if any varying is removed the issue goes away. Seems like an issue with the loading of varyings.

I created a simple standalone reproducer in vkrunner to isolate this case and make my life easier, but the same fragment shader passed without any trouble. This should have pointed me to undefined behavior somewhere.

Step 4 - Going deeper

Anyway, one major difference is the vertex shader: changing it does “fix” the issue. But changing it also changes the varyings layout, and without changing the layout the issue remains, so the vertex shader is an unlikely culprit here.

Let’s take a look at the fragment shader assembly:

bary.f r0.z, 0, r0.x
bary.f r0.w, 3, r0.x
bary.f r1.x, 1, r0.x
bary.f r1.z, 4, r0.x
bary.f r1.y, 2, r0.x
bary.f r1.w, 5, r0.x
bary.f r2.x, 6, r0.x
bary.f r2.y, 7, r0.x
flat.b r2.z, 11, 16
bary.f r2.w, 8, r0.x
bary.f r3.x, 9, r0.x
flat.b r3.y, 12, 17
bary.f r3.z, 10, r0.x
bary.f (ei)r1.x, 16, r0.x

Loading varyings

bary.f loads an interpolated varying; flat.b loads one without interpolation. bary.f (ei)r1.x, 16, r0.x is what loads the problematic varying, though it doesn’t look suspicious at all. Looking through the state which defines how varyings are passed between VS and FS also doesn’t yield anything useful.

Ok, but what does the second operand of flat.b r2.z, 11, 16 mean (the command format is flat.b dst, src1, src2)? The first one is the location from which the varying is loaded, and looking through Turnip’s code, the second one should be equal to the first, otherwise “some bad things may happen”. I forced the sources to be equal: nothing changed… What did I expect? The standalone reproducer with the same assembly works fine.

The same description which promised bad things also said that using two immediate sources for flat.b isn’t really expected. Let’s revert the change and try something like flat.b r2.z, 11, r0.x. Nothing changed, again.

Packing varyings

What else happens with these varyings? They are being packed, to remove their unused components. So let’s stop packing them. Aha! Now it works correctly!

Looking through the code several times, nothing seems wrong. Changing the order of varyings helps; aligning them helps; aligning only the flat varyings also helps. But the code is entirely correct.

One thing did change, though: while shuffling the varyings’ order I noticed that the resulting misrendering changed, so it’s likely not the order but the location which is cursed.

Interpolating varyings

What’s left? How varying interpolation is specified. The code emits interpolation state only for used varyings, but looking closer, the “used varyings” part isn’t that obviously defined. Emitting the whole interpolation state fixes the issue!

The culprit is found: stale varying-interpolation data was being read. The resulting fix can be found in tu: Fix varyings interpolation reading stale values + cosmetic changes

Correctly rendered draw call after the changes
Correctly rendered draw call after the changes

Injustice 2

Another corruption in the main menu.

Bad draw call on Turnip

How it should look:

The same draw call on RADV

The draw call inputs and state look good enough. So it’s time to bisect the shader.

Here is the output of the reduced shader on Turnip:

Enabling the display of NaNs and Infs shows that there are NaNs in the output on Turnip (NaNs have green color here):

While the correct rendering on RADV is:

Carefully reducing the shader further resulted in the following fragment which reproduces the issue:

r12 = uintBitsToFloat(uvec4(texelFetch(t34, _1195 + 0).x, texelFetch(t34, _1195 + 1).x, texelFetch(t34, _1195 + 2).x, texelFetch(t34, _1195 + 3).x));
vec4 _1268 = r12;
_1268.w = uintBitsToFloat(floatBitsToUint(r12.w) & 65535u);
_1275.w = unpackHalf2x16(floatBitsToUint(r12.w)).x;

On Turnip this _1275.w is NaN, while on RADV it is a proper number. Looking at the assembly, the calculation of _1275.w from the above is translated into:

isaml.base0 (u16)(x)hr2.z, r0.w, r0.y, s#0, t#12
(sy)cov.f16f32 r1.z, hr2.z

In GLSL, there is a read of a uint32, stripping it of its high 16 bits, then converting the lower 16 bits to a half float.

In the assembly, the “read and strip the high 16 bits” part is done in a single isaml command, where the stripping is done via the (u16) conversion.

At this point I wrote a simple reproducer to speed up iteration on the issue:

result = uint(unpackHalf2x16(texelFetch(t34, 0).x & 65535u).x);

After testing different values I confirmed that the (u16) conversion doesn’t strip the higher 16 bits, but clamps the value to a 16-bit unsigned integer. Running the reproducer on the proprietary driver showed that it doesn’t fold the u32 -> u16 conversion into isaml.
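To make the difference concrete, here is a small sketch (my own illustration, not driver code) of “strip the high 16 bits”, which the GLSL asked for, versus what the folded (u16) conversion apparently did, saturate to the 16-bit range:

```python
def strip_high_bits(v: int) -> int:
    """What the GLSL asked for: keep only the low 16 bits."""
    return v & 0xFFFF

def clamp_u16(v: int) -> int:
    """What the folded (u16) conversion apparently did: saturate to 16 bits."""
    return min(v, 0xFFFF)

# A raw value with bits set above bit 15 (hypothetical, for illustration):
v = 0x3F80ABCD
print(hex(strip_high_bits(v)))  # 0xabcd: the low half survives
print(hex(clamp_u16(v)))        # 0xffff: the value is saturated
```

Saturating to 0xFFFF matters here because 0xFFFF as a half-float bit pattern (all-ones exponent, non-zero mantissa) is a NaN, which matches the NaNs observed in the output.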

Knowing that, the fix is easy: ir3: Do 16b tex dst folding only for floats

Monster Hunter: World

Main menu, again =) Before we even got here, two other issues had to be fixed, including one which seems like a HW bug that the proprietary driver is not aware of.

In this case of misrendering the culprit is a compute shader.

How it should look:

Compute shaders are generally easier to deal with since much less state is involved.

None of the debug options helped, and shader printf didn’t work at that time for some reason. So I decided to look at the shader assembly, trying to spot something funny.

ldl.u32 r6.w, l[r6.z-4016], 1
ldl.u32 r7.x, l[r6.z-4012], 1
ldl.u32 r7.y, l[r6.z-4032], 1
ldl.u32 r7.z, l[r6.z-4028], 1
ldl.u32 r0.z, l[r6.z-4024], 1
ldl.u32 r2.z, l[r6.z-4020], 1

Negative offsets into shared memory are not suspicious at all. Were they always there? How does it look right before being passed to our backend compiler?

vec1 32 ssa_206 = intrinsic load_shared (ssa_138) (base=4176, align_mul=4, align_offset=0)
vec1 32 ssa_207 = intrinsic load_shared (ssa_138) (base=4180, align_mul=4, align_offset=0)
vec1 32 ssa_208 = intrinsic load_shared (ssa_138) (base=4160, align_mul=4, align_offset=0)
vec1 32 ssa_209 = intrinsic load_shared (ssa_138) (base=4164, align_mul=4, align_offset=0)
vec1 32 ssa_210 = intrinsic load_shared (ssa_138) (base=4168, align_mul=4, align_offset=0)
vec1 32 ssa_211 = intrinsic load_shared (ssa_138) (base=4172, align_mul=4, align_offset=0)
vec1 32 ssa_212 = intrinsic load_shared (ssa_138) (base=4192, align_mul=4, align_offset=0)

Nope, no negative offsets, just a number of offsets close to 4096. Looks like the offsets got wrapped around!

Looking at ldl definition it has 13 bits for the offset:

<pattern pos="0"           >1</pattern>
<field   low="1"  high="13" name="OFF" type="offset"/>    <--- This is the offset field
<field   low="14" high="21" name="SRC" type="#reg-gpr"/>
<pattern pos="22"          >x</pattern>
<pattern pos="23"          >1</pattern>

The offset type is a signed integer (so one bit is for the sign), which leaves us with 12 bits and an upper bound of 4095. Case closed!
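A quick sketch of the wraparound (again my own illustration, not driver code): push an offset through a 13-bit two’s-complement immediate field and decode it back, and the negative offsets from the assembly fall out exactly:

```python
def wrap13(offset: int) -> int:
    """Store an offset in a 13-bit signed immediate field and read it back."""
    field = offset & 0x1FFF  # only 13 bits are kept
    # Sign-extend: if bit 12 is set, the field decodes as negative.
    return field - 0x2000 if field & 0x1000 else field

# The NIR base offsets above, pushed through the 13-bit field:
print(wrap13(4176))  # -4016, matching "ldl.u32 r6.w, l[r6.z-4016], 1"
print(wrap13(4160))  # -4032, matching "ldl.u32 r7.y, l[r6.z-4032], 1"
print(wrap13(4095))  #  4095, the largest offset that survives intact
```

Every base offset in the NIR dump above maps to the corresponding negative offset in the ldl instructions this way.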

I know that there is an upper bound set on the offset during optimizations, but where and how is it set?

The upper bound is set via nir_opt_offsets_options::shared_max and is equal to (1 << 13) - 1, which, as we saw, is incorrect. Who set it?

Subject: [PATCH] ir3: Limit the maximum imm offset in nir_opt_offset for
 shared vars

STL/LDL have 13 bits to store imm offset.

Fixes crash in CS compilation in Monster Hunter World.

Fixes: b024102d7c2959451bfef323432beaa4dca4dd88
("freedreno/ir3: Use nir_opt_offset for removing constant adds for shared vars.")

Signed-off-by: Danylo Piliaiev
Part-of: <>
 src/freedreno/ir3/ir3_nir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

@@ -124,7 +124,7 @@ ir3_optimize_loop(struct ir3_compiler *compiler, nir_shader *s)
          .uniform_max = (1 << 9) - 1,

-         .shared_max = ~0,
+         .shared_max = (1 << 13) - 1,

Weeeell, totally unexpected: it was me! Fixing the same game, maybe even the same shader…

Let’s set shared_max to a correct value… Nothing changed, not even the assembly. The same incorrect offset is still there.

After a bit of wandering around the optimization pass, I found that in one case the upper bound was not enforced correctly. Fixing it fixed the rendering.

The final changes were:

by Danylo Piliaiev at January 04, 2023 11:00 PM

Brian Kardell

The NASCAR Model


Let's talk about interesting solutions to problems and how we can (today) begin to change things...

Over the past two years I’ve written a whole lot about the web ecosystem, and in particular the economics of it. I’ve made the case that the model we’ve built and grown for the last 20 years is ultimately fatally flawed: it is fundamentally dependent on a very few organizations, and it is already inadequate and showing signs of strain.

I've been making a case for a while now that we need to re-think how we look at and fund web engines.

It seems really easy at some level, right? Just diversify the investment that sponsors work on engines. Every tech company in the world benefits from these projects, and lots of them make money hand over fist, so giving back should be easy? But yeah, it’s (currently) not, and this should surprise no one. It isn’t the norm, and companies don’t like to spend money they don’t have to. Neither do we: we don’t (currently) want to buy a browser or pay to use an open source project if we can avoid it; it’s not the norm we’re used to.

I was in the middle of trying to write something up on all of this just before the holidays (which got strange for me). I wanted to explain one such opportunity that we're building now, attempting to bend the arc of things in a healthier way. Before I could get back to post something, I saw Zach Leatherman post a thing on Mastodon:

the nascar car advertising model but for t-shirts on speakers at tech conferences

In reply, Chris Coyier shared a screenshot of himself wearing an F1 jacket that’s full of logos like Red Bull, Tezos and Oracle.

To which I replied...

This, but for browser funding

While it sounds funny, and it obviously needs more words to explain it, I'm kind of serious.

It makes sense to look at history and try to learn from what worked (or didn’t) elsewhere. It’s interesting how many things seem to have followed a trend of starting with a single original backer and ultimately diversifying. Lots of early TV shows had individual sponsors; soaps like Lux and cigarette companies like Winston, for example, were lucrative single sources of funding, for decades even.

While it's not what Zach was saying, it's interesting that for over 30 years, the NASCAR Cup Series was called the Winston Cup, until that became a problem. In 2004 it became the Nextel Cup, until it became the Sprint Cup, and then the Monster Energy Cup. Finally, in 2020 it just became the NASCAR Cup Series with a tiered sponsoship model and diverse sponsors.

You know, just like a lot of tech events.

Have a look at this page for AWS summit in DC sponsors, for example. Or check out the video in this page for Salesforce’s Dreamforce sponsors. These pages are basically the NASCAR model. They are similarly tiered: you get different things for different levels of sponsorship. And those, I think, learn some things from what works for sports teams too. Here in Pittsburgh, for example, you’ll see and hear ads that say "Proud Partner of the Pittsburgh Steelers". It costs money to be able to say that, and one of the things you get for that money is the ability to use the team’s logo in your ad.

Spot the difference

A single event sponsorship can range from several thousand to literally millions of dollars. So why do hundreds of tech companies invest in sponsoring lots of events, or even putting their names on an actual NASCAR jacket (or car), while very few invest in web browsers?

I think the answer is pretty simple: Different departments and budgets.

Funding (from non-vendors) that goes into the browser engines today comes from the engineering departments of downstream products like Microsoft Edge, ad-blocking extensions or the PlayStation 5. That’s great, but engineering budgets are already constantly tight. Engineering is about efficiency, right? Do more with less.

Those sponsorships, on the other hand, are often marketing dollars (just like default search). Entirely different budgets/intents.

But if you think about it, shouldn’t those companies get some NASCAR-style credit for their browser support? Don’t you think it’s worth mentioning that Bloomberg Tech was really key to making CSS Grid happen? Or that eyeo was really key to making :has() happen? Isn’t that a kind of marketing? Don’t we kind of need both kinds of investment?

Hey, there it is!

I haven't had much of a chance to talk about this yet, but have a look at this...

A screenshot of the Wolvic browser's sponsors page...

In February 2022, in a joint announcement with Mozilla, Igalia announced we'd be taking over stewardship of the project "Firefox Reality" which would now be called "Wolvic" with the tagline "Join the Pack". That's not just a cute slogan, it's kind of the whole idea. Rather than a singularly badass wolf, we need a pack.

We're trying some really interesting ideas with Wolvic to reimagine how we can fund the development of the browser, upstream work, and manage priorities. I talked about our plans in a small 'launch' event recently which you can watch if you like, but a brief summary is:

  1. We're partnering with some device manufacturers directly at several different levels. On some (Lynx R1, Huawei Lens, HTC Vive) we will be the default browser and they will commit to some funding, and we'll work with them closely.
  2. We're exploring other ideas with a new Wolvic collective. There are a lot of details there explaining all this, but just like other sponsorships, you get different things. One of those is being on a Sponsors page which will be in the browser itself too. Another is that you can help us figure out priorities.

I would encourage you to have a look at the collective page. Share it with your engineering department. Share it with your marketing department.

Or just help be a part of sparking the change and contribute yourself. I think this isn’t the worst idea: many of us (myself included) owe our entire careers to the web. Your mileage and personal comfort levels may vary, of course, but it doesn’t feel wrong for me to throw a cup of coffee’s worth of funding back to it every month.

I hope you'll help us explore these ideas, either through some funding, or just sharing the idea and encouraging organizations to participate (or all of the above).

January 04, 2023 05:00 AM

January 03, 2023

Víctor Jáquez

GStreamer compilation with third party libraries

Suppose that you have to hack on a GStreamer element which requires a library that is not (yet) packaged by your distribution, nor wrapped as a Meson subproject. What do you do?

In our case, we needed the latest versions of Vulkan-Headers, Vulkan-Loader and Vulkan-Tools, which are interrelated CMake projects.

For these cases, GStreamer’s uninstalled development scripts can use a special directory: gstreamer/prefix. As the documentation says:

NOTE: In the development environment, a fully usable prefix is also configured in gstreamer/prefix where you can install any extra dependency/project.

This means that the script responsible for setting up the uninstalled development environment will add

  • gstreamer/prefix/bin in PATH for executable files.
  • gstreamer/prefix/lib and gstreamer/prefix/share/gstreamer-1.0 in GST_PLUGIN_PATH, for out-of-tree elements.
  • gstreamer/prefix/lib in GI_TYPELIB_PATH for GObject Introspection metadata.
  • gstreamer/prefix/lib/pkgconfig in PKG_CONFIG_PATH for third party dependencies (our case!)
  • gstreamer/prefix/etc/xdg for XDG_CONFIG_DIRS for XDG compliant configuration files.
  • gstreamer/prefix/lib and gstreamer/prefix/lib64 in LD_LIBRARY_PATH for third party libraries.

Therefore, the general idea is to compile those third party libraries with their installation prefix set to gstreamer/prefix.

In our case, the Vulkan repositories are interrelated, so they need to be compiled in a certain order. Also, for self-containment, we decided to clone them in gstreamer/subprojects.


$ cd ~/gst/gstreamer/subprojects
$ git clone
$ cd Vulkan-Headers
$ mkdir build
$ cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --install


$ cd ~/gst/gstreamer/subprojects
$ git clone
$ cd Vulkan-Loader
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug -DVULKAN_HEADERS_INSTALL_DIR=/home/vjaquez/gst/gstreamer/prefix -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --install


$ cd ~/gst/gstreamer/subprojects
$ git clone
$ cd Vulkan-Tools
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug -DVULKAN_HEADERS_INSTALL_DIR=/home/vjaquez/gst/gstreamer/prefix -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --install

Right now we have the Vulkan headers and the Vulkan loader pkg-config file in place. And we should be able to compile GStreamer. Right?

Not exactly, because the script only sets the environment variables for the development environment, not for the GStreamer compilation. But the solution is simple, since we have everything set up in the proper order: just set PKG_CONFIG_PATH when executing meson setup:

$ PKG_CONFIG_PATH=/home/vjaquez/gst/gstreamer/prefix/lib/pkgconfig meson setup --buildtype=debug build

by vjaquez at January 03, 2023 06:35 PM

December 30, 2022

Brian Kardell

Surreal Holiday


Hello, yes, I had a heart attack.

In the middle of the night last Tuesday (technically Wednesday morning), I had a heart attack. Ambulance, EKGs - you get the idea. In the end they placed two stents in my heart and I spent two days in the ICU, one more in the general hospital before being sent home Friday afternoon.

Mainly I just wanted to say...

  • Sorry if I haven't replied somewhere
  • That's why
  • Thanks for the concern if you were worried
  • I'm gonna be ok
  • I am mostly ok now. I just happen to also have planned holidays, so it's been a good spot to disconnect for a bit.
  • Yes, I quit smoking (8 days cold and counting).
  • ... The whole thing was just... very weird


In my experience - our brains are actually not very good at imagining the big "smack you in the face" things like this. When they happen I'm often like "Oh wow!... That was nothing like I imagined it would be."

Yes, I have imagined many of the bits here many times (I have an anxiety disorder) and this wasn't an exception. Literally no aspect of what unfolded was anything like I imagined it would be.

Short of literally sitting and imagining bad scenarios: I'm sure you think you have some idea of what a heart attack would feel like. Like, vaguely, at least. But... maybe you don't? Maybe it is very different from that? It certainly was for me.

I guess the natural question is "different how?" and the answer is "yes".

Different like zucchini and cucumber, maybe? Superficially similar, but much different. If you bit into something thinking it contained lots of cucumber and that turned out to be zucchini (or vice versa) you would say "Oh wow!... That was nothing like I imagined it would be."

Anyway... I've been sitting here for way too long trying to think of how to explain all of the ways in which none of this was what I expected or how surreal it all felt, but honestly, I think I can't.

I spent a lot of time thinking about my family. Would they be ok, even if I wasn't? Pretty shitty for them to have this happen at Christmas. It was the middle of the night; most of them wouldn't find out for a while still, probably. What would they be finding out?

I don't know, it was just many layers of strange, really.

One especially surreal moment I'd like to write down: as the doctors who were getting ready to put a stent in my heart were chatting, one of them said something like...

Man, can you imagine? So sad. Just before 50. He was looking great just yesterday, and then you just die. So sad.

Here's the thing: They weren't talking about me and I knew it. But, lying there on the table, not so short of 50 myself, at that moment... It totally could have... It still might, you know? They hadn't gone in yet. Did they just not think of that?

What they were talking about was the fact that Franco Harris had just died that same morning. It was just a few days shy of the 50th anniversary of The Immaculate Reception, with a rematch game and so many events planned around it and him.

You don't really have to know about any of this, I guess - except that Franco's impact in Pittsburgh was huge and began just before I was born. My grandparents were part of the large Italian community that loved the Steelers - "Franco's Italian Army". I have a lot of memories of shared moments with family around the Steelers - ultimately because of all of this. I always felt a kind of connection with him. It was very strange to think that we both seemed pretty great just yesterday, and now here we were.

Anyways.... It was weird.

Happily for my family, I made it through and I'm doing great.

I was home for the game too, which I was able to watch with my son. It was great - and happily for Pittsburgh, we won.

December 30, 2022 05:00 AM

December 29, 2022

Eric Meyer

Styling a ‘pre’ That Contains a ‘code’

I’ve just committed my first :has() selector to production CSS and want to share it, so I will!  But first, a little context that will feel familiar to nerds like me who post snippets of computer code to the Web, and have markup with this kind of tree structure (not raw source; we’d never allow this much formatting whitespace inside a <pre>):

<pre>
	<code>
		{{content goes here}}
	</code>
</pre>

It’s nicely semantic, indicating that the contents of the <pre> are in fact code of some kind, as opposed to just plain preformatted text like, say, output from a shell script or npm job, which isn’t code and thus should, perhaps, be styled in a distinct way.

Given cases like that, you’ve probably written rules that go a little something like this:

pre > code {
	display: block;
	background: #f1f1f1;
	padding: 1.33em;
	border-radius: 0.33em;
}

Which says: if a <code> element is the child of a <pre> element, then turn the <code> into a block box and give it some background color, padding, etc.

It works out fine, but it always feels a little fragile.  What if there are already <pre> styles that throw off the intended effect on code blocks?  Do we get into specificity wars between those rules and the code-block rules?  Find other ways to figure out what should be adjusted in which cases?  Those are probably manageable problems, but it would be better not to have them.

It’s also, when you step back for a moment, a little weird.  The <pre> is already a block box and the container for the code; why aren’t we styling that?  Because unless you wrote some scripting, whether server-side or client-side, to add a class to the <pre> in scenarios like this, there wasn’t a way to address it directly based on its structural contents.

There is now:

pre:has(> code) {
	background: #f1f1f1;
	padding: 1.33em;
	border-radius: 0.33em;
}

Now I’m styling any <pre> that has a <code> as a child, which is why I took out the display: block.  I don’t need it any more!

But suppose you have a framework or ancient script or something that inserts classed <span> elements between the <pre> and the <code>, like this:

<pre>
	<span class="pref">
		<code>
			{{content goes here}}
		</code>
	</span>
</pre>

First of all, ew, address the root problem here if at all possible.  But if that isn’t possible for whatever reason, you can still style the <pre> based on the presence of a <code> by removing the child combinator from the selector.  In other words:

pre:has(code) {
	background: #f1f1f1;
	padding: 1.33em;
	border-radius: 0.33em;
}

Now I’m styling any <pre> that has a <code> as a descendant  —  child, grandchild, great-great-great-great grandchild, whatever.

Which is not only more robust, it’s a lot more future-proof: even if some hot new front-end framework that sticks in <span> elements or something gets added to the site next year, this style will just keep chugging along, styling <pre> elements that contain <code> elements until long after that hot new framework has cooled to ash and been chucked into the bit-bucket.

There is one thing to keep in mind here, as pointed out by Emmanuel over on Mastodon: if you have a scenario where <pre> elements can contain child text nodes in addition to <code> blocks, the <pre> will still be styled in its entirety.  Consider:

<pre>
	{{some text is here}}
	<code>
		{{content goes here}}
	</code>
	{{or text is here}}
</pre>

pre:has(> code) and pre:has(code) will still match the <pre> element here, which means all of the text (both inside and outside the <code> elements)  will sit inside the light-gray box with the rounded corners.  If that’s fine for your use case, great!  If not, then don’t use :has() in this scenario, and stick with the pre > code {…} or pre code {…} approach of yore.  That will style just the <code> elements instead of the whole <pre>, as in the example at the beginning of this article.

As I write this, the code hasn’t gone into production yet, but I think it will within the next week or two, and will be my first wide-production use of :has().  I feel like it’s a great way to close out 2022 and kick off 2023, because I am that kind of nerd.  If you are too, I hope you enjoyed this quick dive into the world of :has().

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at December 29, 2022 03:36 PM

December 21, 2022

José Dapena

Native call stack profiling (3/3): 2022 work in V8

This is the last blog post of the series. In the first post I presented some concepts of call stack profiling, and why it is useful. In the second post I reviewed Event Tracing for Windows, the native tool for this purpose, and how it can be used to trace Chromium.

This last post will review the work done in 2022 to improve V8's support for call stack profiling on Windows.

I worked on several of the fixes this year. This work has been sponsored by Bloomberg and Igalia.

This work was presented as a lightning talk in BlinkOn 17.

Some bad news to start… and a fix

In March I started working on the report that Windows event traces were not properly resolving the Javascript symbols.

After some bisecting I found this was a regression introduced by this commit, which changed the --js-flags handling to a later stage. This happened to be after V8 initialization, so the code that would enable instrumentation did not consider the flag.

The fix I implemented moved flags processing to happen right before platform initialization, so instrumentation worked again.

Simplified method names

Another fix I worked on was improving method name generation. Windows tracing would show a quite redundant description for each level, which made analysis more difficult.

Before my work, the entries would look like this:

string-tagcloud.js!LazyCompile:~makeTagCloud- string-tagcloud.js:231-232:22 0x0

After my change, now it looks like this:

string-tagcloud.js!makeTagCloud-231:22 0x0

The fix adds a specific implementation for ETW. Instead of reusing the method name generation that is also used for Perf, it has a specific implementation that takes into account what the ETW backend already exports, to avoid redundancy. It also takes advantage of the existing method DebugNameCStr to retrieve inferred method names when no name is available.

Problem with Javascript code compiled before tracing

The way V8 ETW support worked was that, while tracing was ongoing, whenever a new function was JIT-compiled, V8 would emit its information to ETW.

This implied a big problem: if a function was compiled by V8 before tracing started, ETW would not properly resolve the function names, so when analyzing the traces it would not be possible to know which function was called at any given sample.

The solution is conceptually simple. When tracing starts, V8 traverses the living Javascript contexts and emits all their symbols. This adds noise to the tracing, as it is an expensive process. But, as it happens at the start of the tracing, it is very easy to isolate in the captured trace.

And a performance fix

I also fixed a huge performance penalty when tracing code from snapshots, caused by repeatedly recalculating the end line numbers of code instead of caching them.

Initialization improvements

Paolo Severini improved the initialization code, so that starting an ETW session is lighter and tracing is started and stopped correctly.

Benchmarking ETW overhead

After all these changes I did some benchmarking with and without ETW. The goal was to find out whether it would be good to enable ETW support in V8 by default, without requiring any JS flags.

With SunSpider in a 64-bit Windows build:

Image showing slight overhead with ETW and bigger one with interpreted frames.

Other benchmarks I tried gave similar numbers.

So far, on the 64-bit architecture I could not detect any overhead from enabling ETW support when recording is not happening, and the cost when it is enabled is very low.

Though, when combined with interpreted frames native stack, the overhead is close to 10%. This was expected as explained here.

So, good news so far. We still need to benchmark the 32-bit architecture to see if the impact is similar.

Try it!

The work described in this post is available in V8 10.9.0. I hope you enjoy the improvements, and I especially hope these tools help in the investigation of performance issues around Javascript, in Node.js, Google Chrome or Microsoft Edge.

What next?

There is still a lot of things to do, and I hope I can continue working on improvements for V8 ETW support next year:

  • First, finishing the benchmarks, and considering enabling ETW instrumentation by default in V8 and derivatives.
  • Add full support for WASM.
  • Bugfixing, as we still see segments missing in certain benchmarks.
  • Create specific events for when the JIT information of already compiled symbols is sent to ETW, to make it easier to differentiate it from the code compiled while recording a trace.

If you want to track the work, keep an eye on V8 issue 11043.

The end

This is the last post in the series.

Thanks to Bloomberg and Igalia for sponsoring my work in ETW Chromium integration improvements!

by José Dapena Paz at December 21, 2022 08:30 AM

December 18, 2022

Víctor Jáquez

Video decoding in GStreamer with Vulkan Video extension (part 2)

It has been a while since I reported my tinkering with the Vulkan Video provisional extension. Now the specification will have its final release soonish, and also there has been more engagement within the open source communities, such as the work-in-progress FFmpeg implementation by Lynne (please, please, read that post), and the also work-in-progress Mesa 3D drivers both for AMD and Intel by Dave Airlie! Along with the well known NVIDIA beta drivers for Vulkan.

From our side, we have been trying to provide an open source alternative to the video parser used by the Conformance Test Suite and the NVIDIA vk_video_samples, using GStreamer: GstVkVideoParser, which intends to be a drop-in replacement for the current proprietary parser library.

Along the way, we have sketched the Vulkan Video support in an API tracing tool, for getting traces of the API usage. Sadly, it's kind of bit-rotten right now, even more so because the specification has changed since then.

Regarding the H.264 decoder for GStreamer, we just restarted its hacking. The merge request was moved to the monorepo, but for the sake of the much-needed complete re-write, we changed the branch to this one (vkh264dec). We needed to re-write it because, besides the specification updates, we have learned many things along the journey, such as the out-of-band parameters update, Vulkan’s recommendation for memory pre-allocation as much as possible, the DPB/references handling, the debate about buffer vs. slice uploading, and other friction points that Lynne has spotted for future early adopters.

To compile it, grab the branch and build it the way GStreamer is usually built, with meson:

meson setup builddir -Dgst-plugins-bad:vulkan-video=enabled --buildtype=debug
ninja -C builddir

And run simple pipelines such as

gst-launch-1.0 filesrc location=INPUT ! parsebin ! vulkanh264dec ! fakesink -v

Our objective is to have a functional demo for the next Vulkanised. We are very ambitious: we want it to work on Linux and Windows, and on as many GPUs as possible. Wish us luck. And happy December festivities!

by vjaquez at December 18, 2022 01:14 PM

December 15, 2022

Andy Wingo

ephemeral success

Good evening, patient hackers :) Today finishes off my series on implementing ephemerons in a garbage collector.

Last time, we had a working solution for ephemerons, but it involved recursively visiting any pending ephemerons from within the copy routine—the bit of a semi-space collector that is called when traversing the object graph and we see an object that we hadn't seen yet. This recursive visit could itself recurse, and so we could overflow the control stack.

The solution, of course, is "don't do that": instead of visiting recursively, enqueue the ephemeron for visiting later. Iterate, don't recurse. But here we run into a funny problem: how do we add an ephemeron to a queue or worklist? It's such a pedestrian question ("just... enqueue it?") but I think it illustrates some of the particular concerns of garbage collection hacking.

speak, memory

The issue is that we are in the land of "can't use my tools because I broke my tools with my tools". You can't make a standard List<T> because you can't allocate list nodes inside the tracing routine: if you had memory in which you could allocate, you wouldn't be calling the garbage collector.

If the collector needs a data structure whose size doesn't depend on the connectivity of the object graph, you can pre-allocate it in a reserved part of the heap. This adds memory overhead, of course; for a 1000 MB heap, say, you used to be able to make graphs 500 MB in size (for a semi-space collector), but now you can only do 475 MB because you have to reserve 50 MB (say) for your data structures. Another way to look at it is, if you have a 400 MB live set and then you allocate 2GB of garbage, if your heap limit is 500 MB you will collect 20 times, but if it's 475 MB you'll collect 26 times, which is more expensive. This is part of why GC algorithms are so primitive; implementors have to be so stingy that we don't get to have nice things / data structures.

However in the case of ephemerons, we will potentially need one worklist entry per ephemeron in the object graph. There is no optimal fixed size for a worklist of ephemerons. Most object graphs will have no or few ephemerons. Some, though, will have practically the whole heap.

For data structure needs like this, the standard solution is to reserve the needed space for a GC-managed data structure in the object itself. For example, for concurrent copying collectors, the GC might reserve a word in the object for a forwarding pointer, instead of just clobbering the first word. If you needed a GC-managed binary tree for a specific kind of object, you'd reserve two words. Again there are strong pressures to minimize this overhead, but in the case of ephemerons it seems sensible to make them pay their way on a per-ephemeron basis.

so let's retake the thing

So sometimes we might need to put an ephemeron in a worklist. Let's add a member to the ephemeron structure:

struct gc_ephemeron {
  struct gc_obj header;
  int dead;
  struct gc_obj *key;
  struct gc_obj *value;
  struct gc_ephemeron *gc_link; // *
};

Incidentally this also solves the problem of how to represent the struct gc_pending_ephemeron_table; just reserve 0.5% of the heap or so as a bucket array for a buckets-and-chains hash table, and use the gc_link as the intrachain links.

struct gc_pending_ephemeron_table {
  struct gc_ephemeron *resolved;
  size_t nbuckets;
  struct gc_ephemeron *buckets[0];
};

An ephemeron can end up in three states, then:

  1. Outside a collection: gc_link can be whatever.

  2. In a collection, the ephemeron is in the pending ephemeron table: gc_link is part of a hash table.

  3. In a collection, the ephemeron's key has been visited, and the ephemeron is on the to-visit worklist; gc_link is part of the resolved singly-linked list.

Instead of phrasing the interface to ephemerons in terms of visiting edges in the graph, the verb is to resolve ephemerons. Resolving an ephemeron adds it to a worklist instead of immediately visiting any edge.

struct gc_ephemeron **
pending_ephemeron_bucket(struct gc_pending_ephemeron_table *table,
                         struct gc_obj *key) {
  return &table->buckets[hash_pointer(key) % table->nbuckets];
}

void add_pending_ephemeron(struct gc_pending_ephemeron_table *table,
                           struct gc_obj *key,
                           struct gc_ephemeron *ephemeron) {
  struct gc_ephemeron **bucket = pending_ephemeron_bucket(table, key);
  ephemeron->gc_link = *bucket;
  *bucket = ephemeron;
}

void resolve_pending_ephemerons(struct gc_pending_ephemeron_table *table,
                                struct gc_obj *obj) {
  struct gc_ephemeron **link = pending_ephemeron_bucket(table, obj);
  struct gc_ephemeron *ephemeron;
  while ((ephemeron = *link)) {
    if (ephemeron->key == obj) {
      *link = ephemeron->gc_link;
      add_resolved_ephemeron(table, ephemeron);
    } else {
      link = &ephemeron->gc_link;
    }
  }
}
Copying an object may add it to the set of pending ephemerons, if it is itself an ephemeron, and also may resolve other pending ephemerons.

void resolve_ephemerons(struct gc_heap *heap, struct gc_obj *obj) {
  resolve_pending_ephemerons(heap->pending_ephemerons, obj);

  struct gc_ephemeron *ephemeron;
  if ((ephemeron = as_ephemeron(forwarded(obj)))
      && !ephemeron->dead) {
    if (is_forwarded(ephemeron->key))
      visit_field(&ephemeron->value, heap);
    else
      add_pending_ephemeron(heap->pending_ephemerons,
                            ephemeron->key, ephemeron);
  }
}

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  // ... allocate new_obj in tospace and copy obj's fields, as before ...
  resolve_ephemerons(heap, obj); // *
  return new_obj;
}

Finally, we need to add something to the core collector to scan resolved ephemerons:

int trace_some_ephemerons(struct gc_heap *heap) {
  struct gc_ephemeron *resolved = heap->pending_ephemerons->resolved;
  if (!resolved) return 0;
  heap->pending_ephemerons->resolved = NULL;
  while (resolved) {
    resolved->key = forwarded(resolved->key);
    visit_field(&resolved->value, heap);
    resolved = resolved->gc_link;
  }
  return 1;
}

void kill_pending_ephemerons(struct gc_heap *heap) {
  struct gc_pending_ephemeron_table *table = heap->pending_ephemerons;
  for (size_t i = 0; i < table->nbuckets; i++) {
    for (struct gc_ephemeron *chain = table->buckets[i]; chain;
         chain = chain->gc_link)
      chain->dead = 1;
    table->buckets[i] = NULL;
  }
}

void collect(struct gc_heap *heap) {
  uintptr_t scan = heap->hp;
  trace_roots(heap, visit_field);
  do { // *
    while (scan < heap->hp) {
      struct gc_obj *obj = (struct gc_obj*)scan;
      scan += align_size(trace_heap_object(obj, heap, visit_field));
    }
  } while (trace_some_ephemerons(heap)); // *
  kill_pending_ephemerons(heap); // *
}

The result is... not so bad? It makes sense to make ephemerons pay their own way in terms of memory, having an internal field managed by the GC. In fact I must confess that in the implementation I have been woodshedding, I actually have three of these damn things; perhaps more on that in some other post. But the perturbation to the core algorithm is perhaps less than the original code. There are still some optimizations to make, notably postponing hash-table lookups until the whole strongly-reachable graph is discovered; but again, another day.

And with that, thanks for coming along with me for my journeys into ephemeron-space.

I would like to specifically thank Erik Corry and Steve Blackburn for their advice over the years, and patience with my ignorance; I can only imagine that it's quite amusing when you have experience in a domain to see someone new and eager come in and make many of the classic mistakes. They have both had a kind of generous parsimony in the sense of allowing me to make the necessary gaffes but also providing insight where it can be helpful.

I'm thinking of many occasions but I especially appreciate the advice to start with a semi-space collector when trying new things, be it benchmarks or test cases or API design or new functionality, as it's a simple algorithm, hard to get wrong on the implementation side, and perfect for bringing out any bugs in other parts of the system. In this case the difference between fromspace and tospace pointers has a material difference to how you structure the ephemeron implementation; it's not something you can do just in a trace_heap_object function, as you don't have the old pointers there, and the pending ephemeron table is indexed by old object addresses.

Well, until some other time, gentle hackfolk, do accept my sincerest waste disposal greetings. As always, yours in garbage, etc.,

by Andy Wingo at December 15, 2022 09:21 PM

December 14, 2022

Eric Meyer

Highlighting Image Accessibility on Mastodon

Some time ago, I published user styles to visually highlight images on Twitter that didn’t have alt text  —  this was in the time before the “ALT” badge and the much improved accessibility features Twitter deployed in the pre-Musk era.

With the mass Mastodon migration currently underway in the circles I frequent, I spend more time there, and I missed the quick visual indication of images having alt text, as well as my de-emphasis styles for those images that don’t have useful alt text.  So I put the two together and wrote a new user stylesheet, which I apply via the Stylus browser extension.  If you’d like to also use it, please do so!

/* ==UserStyle==
@name  - 12/14/2022, 9:37:56 AM
@version        1.0.0
@description    Styles images posted on Mastodon based on whether or not they have alt text.
==/UserStyle== */

@-moz-document regexp(".*\.social.*") {
	.media-gallery__item-thumbnail img:not([alt]) {
		filter: grayscale(1) contrast(0.5);
	}
	.media-gallery__item-thumbnail img:not([alt]):hover {
		filter: none;
	}
	.media-gallery__item-thumbnail:has(img[alt]) {
		position: relative;
	}
	.media-gallery__item-thumbnail:has(img[alt])::after {
		content: "ALT";
		position: absolute;
		bottom: 0;
		right: 0;
		border-radius: 5px 0 0 0;
		border: 1px solid #FFF3;
		color: #FFFD;
		background: #000A;
		border-width: 1px 0 0 1px;
		padding-block: 0.5em 0.4em;
		padding-inline: 0.7em 0.8em;
		font: bold 90% sans-serif;
	}
}

Because most of my (admittedly limited and sporadic) Mastodon time is spent on a single instance, the styles I wrote are attuned to that instance’s markup.  I set things up so these styles should be applied to any *.social site, but only those that use the same markup will get the benefits.  One other client I checked, for example, has different markup (I think they’re using Svelte).

You can always adapt the selectors to fit the markup of whatever Mastodon instance you use, if you’re so inclined.  Please feel free to share your changes in the comments, or in posts of your own.  And with any luck, this will be a temporary solution before Mastodon adds these sorts of things natively, just as Twitter eventually did.

Addendum: It was rightly pointed out to me that Firefox does not, as of this writing, support :has() by default.  If you want to use this in Firefox, as I do, set the layout.css.has-selector.enabled flag in about:config to true.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at December 14, 2022 07:48 PM

December 13, 2022

Jesse Alama

A comprehensive, authoritative FAQ on decimal arithmetic

Mike Cowlishaw’s FAQ on decimal arithmetic

If you’re interested in decimal arithmetic in computers, you’ve got to check out Mike Cowlishaw’s FAQ on the subject. There’s a ton of insight to be had there. If you like the kind of writing that makes you feel smarter as you read it, this one is worth your time.

For context: Cowlishaw is the editor of the 2008 edition of the IEEE 754 standard, updating the 1985 and 1987 standards. His words thus carry a lot of authority, and it would be quite unwise to ignore Mike in these matters.

If you prefer similar information in article form, take a look at Mike’s Decimal Floating-Point: Algorism for Computers. (Note the delightful use of algorism. Yes, it’s a word.)

The FAQ focuses mainly on floating-point decimal arithmetic, not arbitrary-precision decimal arithmetic (which is what one might immediately think of when one hears decimal arithmetic). Arbitrary-precision decimal arithmetic is a whole other ball of wax. In that setting, we’re talking about sequences of decimal digits whose length cannot be specified in advance. Proposals such as decimal128 are about a fixed bit width—128 bits—which allows for a lot of precision, but not arbitrary precision.

One crucial insight I take away from Mike’s FAQ—a real misunderstanding on my part which is a bit embarrassing to admit—is that decimal128 is not just a 128-bit version of the same old binary floating-point arithmetic we all know about (and might find broken). It’s not as though adding more bits meets the demands of those who want high-precision arithmetic. No! Although decimal128 is a fixed-width encoding (128 bits), the underlying encoding is decimal, not binary. That is, decimal128 isn’t just binary floating-point with extra juice. Just adding bits won’t unbreak busted floating-point arithmetic; some new ideas are needed. And decimal128 is a way forward. It is a new (well, relatively new) format that addresses all sorts of use cases that motivate decimal arithmetic, including needs in business, finance, accounting, and anything that uses human decimal numbers. What probably led to my confusion is thinking that the adjective floating-point, regardless of what it modifies, must be some kind of variation of binary floating-point arithmetic.

December 13, 2022 08:36 AM

December 12, 2022

Ricardo García

Vulkan extensions Igalia helped ship in 2022

Igalia Logo next to the Vulkan Logo

The end of 2022 is very close so I’m just in time for some self-promotion. As you may know, the ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions.

In addition to our CTS work, many times we review the Vulkan specification text from those extensions we develop tests for. We also do the same for other extensions and changes, and we also submit fixes and improvements of our own.

In 2022, our work was important to be able to ship a bunch of extensions you can probably see implemented in Mesa and used by VKD3D-Proton when playing your favorite games on Linux, be it on your PC or perhaps on the fantastic Steam Deck. Or maybe used by Zink when implementing OpenGL on top of your favorite Vulkan driver. Anyway, without further ado, let’s take a look.


VK_EXT_image_2d_view_of_3d

This extension was created by our beloved super good coder Mike Blumenkrantz to be able to create a 2D view of a single slice of a 3D image. It helps emulate functionality which was already possible with OpenGL, and is used by Zink. Siru developed tests for this one but we reviewed the spec and are listed as contributors.


VK_EXT_shader_module_identifier

One of my favorite extensions shipped in 2022. Created by Hans-Kristian Arntzen to be used by Proton, this extension lets applications query identifiers (hashes, if you want to think about them like that) for existing VkShaderModule objects and also to provide said identifiers in lieu of actual VkShaderModule objects when creating a pipeline. This apparently simple change has real-world impact when downloading and playing games on a Steam Deck, for example.

You see, DX12 games ship their shaders typically in an intermediate assembly-like representation called DXIL. This is the equivalent of the assembly-like SPIR-V language when used with Vulkan. But Proton has to, when implementing DX12 on top of Vulkan, translate this DXIL to SPIR-V before passing the shader down to Vulkan, and this translation takes some time that may result in stuttering that, if done correctly, would not be present in the game when it runs natively on Windows.

Ideally, we would bypass this cost by shipping a Proton translation cache with the game when you download it on Linux. This cache would allow us to hash the DXIL module and use the resulting hash as an index into a database to find a pre-translated SPIR-V module, which can be super-fast. Hooray, no more stuttering from that! You may still get stuttering when the Vulkan driver has to compile the SPIR-V module to native GPU instructions, just like the DX12 driver would when translating DXIL to native instructions, if the game does not, or cannot, pre-compile shaders somehow. Yet there’s a second workaround for that.

If you’re playing on a known platform with known hardware and drivers (think Steam Deck), you can also ship a shader cache for that particular driver and hardware. Mesa drivers already have shader caches, so shipping a RADV cache with the game makes total sense and we would avoid stuttering once more, because the driver can hash the SPIR-V module and use the resulting hash to find the native GPU module. Again, this can be super-fast so it’s fantastic! But now we have a problem, you see? We are shipping a cache that translates DXIL hashes to SPIR-V modules, and a driver cache that translates SPIR-V hashes to native modules. And both are big. Quite big for some games. And what do we want the SPIR-V modules for? For the driver to calculate their hashes and find the native module? Wouldn’t it be much more efficient if we could pass the SPIR-V hash directly to the driver instead of the actual module? That way, the database translating DXIL hashes to SPIR-V modules could be replaced with a database that translates DXIL hashes to SPIR-V hashes. This can save space in the order of gigabytes for some games, and this is precisely what this extension allows. Enjoy your extra disk space on the Deck and thank Hans-Kristian for it! We reviewed the spec, contributed to it, and created tests for this one.


VK_EXT_attachment_feedback_loop_layout

This one was written by Joshua Ashton and allows applications to put images in the special VK_IMAGE_LAYOUT_ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT layout, in which they can both be used to render to and to sample from at the same time. It’s used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets. We reviewed, created tests and contributed to this one.


VK_EXT_mesh_shader

I don’t need to tell you more about this one. You saw the Khronos blog post. You watched my XDC 2022 talk (and Timur’s). You read my slides. You attended the Vulkan Webinar. Important to actually have mesh shaders on Vulkan like you have in DX12, so emulation of DX12 on Vulkan was a top goal. We contributed to the spec and created tests.


VK_EXT_mutable_descriptor_type

“Hey, hey, hey!” I hear you protest. “This is just a rename of the VK_VALVE_mutable_descriptor_type extension which was released at the end of 2020.” Right, but I didn’t get to tell you about it then, so bear with me for a moment. This extension was created by Hans-Kristian Arntzen and Joshua Ashton and it helps efficiently emulate the raw D3D12 binding model on top of Vulkan. For that, it allows you to have descriptors with a type that is only known at runtime, and also to have descriptor pools and sets that reside only in host memory. We had reviewed the spec and created tests for the Valve version of the extension. Those same tests are the VK_EXT_mutable_descriptor_type tests today.


The final boss of dynamic state, which helps you reduce the number of pipeline objects in your application as much as possible. Combine some of the old and new dynamic states with graphics pipeline libraries and you may enjoy stutter-free gaming. Guaranteed!1 This one will be used by native apps, translation layers (including Zink) and you-name-it. We developed tests and reviewed the spec for it.

1Disclaimer: not actually guaranteed.

December 12, 2022 09:37 PM

Andy Wingo

i'm throwing ephemeron party & you're invited

Good day, hackfolk. Today's note tries to extend our semi-space collector with support for ephemerons. Spoiler alert: we fail in a subtle and interesting way. See if you can spot it before the end :)

Recall that, as we concluded in an earlier article, a memory manager needs to incorporate ephemerons as a core part of the tracing algorithm. Ephemerons are not macro-expressible in terms of object trace functions.

Instead, to support ephemerons we need to augment our core trace routine. When we see an ephemeron E, we need to check if the key K is already visited (and therefore live); if so, we trace the value V directly, and we're done. Otherwise, we add E to a global table of pending ephemerons T, indexed under K. Finally whenever we trace a new object O, ephemerons included, we look up O in T, to trace any pending ephemerons for O.

So, taking our semi-space collector as a workbench, let's start by defining what an ephemeron is.

struct gc_ephemeron {
  struct gc_obj header;
  int dead;
  struct gc_obj *key;
  struct gc_obj *value;
};

enum gc_obj_kind { ..., EPHEMERON, ... };

static struct gc_ephemeron* as_ephemeron(struct gc_obj *obj) {
  uintptr_t ephemeron_tag = NOT_FORWARDED_BIT | (EPHEMERON << 1);
  if (obj->tag == ephemeron_tag)
    return (struct gc_ephemeron*)obj;
  return NULL;
}

First we need to allow the GC to know when an object is an ephemeron or not. This is somewhat annoying, as you would like to make this concern entirely the responsibility of the user, and let the GC be indifferent to the kinds of objects it's dealing with, but it seems to be unavoidable.

The heap will need some kind of data structure to track pending ephemerons:

struct gc_pending_ephemeron_table;

struct gc_heap {
  // ...
  struct gc_pending_ephemeron_table *pending_ephemerons;
};

struct gc_ephemeron *
pop_pending_ephemeron(struct gc_pending_ephemeron_table*,
                      struct gc_obj*);
void
add_pending_ephemeron(struct gc_pending_ephemeron_table*,
                      struct gc_obj*, struct gc_ephemeron*);
struct gc_ephemeron *
pop_any_pending_ephemeron(struct gc_pending_ephemeron_table*);
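The pending-ephemeron table itself is left abstract here. As a standalone sketch (hypothetical, not the article's actual implementation), one possible shape is a small chained hash table keyed on the key object's address, with types simplified to `void*` so the sketch is self-contained:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of a pending-ephemeron table: a chained hash
 * table keyed on the key object's address.  Keys and ephemerons are
 * plain void* here to keep the sketch standalone. */
#define TABLE_SIZE 64

struct pending_entry {
  void *key;
  void *ephemeron;
  struct pending_entry *next;
};

struct pending_table {
  struct pending_entry *buckets[TABLE_SIZE];
};

static size_t bucket_for(void *key) {
  return ((uintptr_t)key >> 3) % TABLE_SIZE;  /* drop alignment bits */
}

static void add_pending(struct pending_table *t, void *key, void *e) {
  struct pending_entry *entry = malloc(sizeof(*entry));
  entry->key = key;
  entry->ephemeron = e;
  entry->next = t->buckets[bucket_for(key)];
  t->buckets[bucket_for(key)] = entry;
}

/* Pop one ephemeron pending on KEY, or NULL if there is none. */
static void* pop_pending(struct pending_table *t, void *key) {
  struct pending_entry **p = &t->buckets[bucket_for(key)];
  for (; *p; p = &(*p)->next) {
    if ((*p)->key == key) {
      struct pending_entry *hit = *p;
      void *e = hit->ephemeron;
      *p = hit->next;
      free(hit);
      return e;
    }
  }
  return NULL;
}
```

A real implementation would also need pop_any_pending_ephemeron (sweep all buckets) and would have to rehash keys as objects move, but the lookup structure is the essence.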

Now let's define a function to handle ephemeron shenanigans:

void visit_ephemerons(struct gc_heap *heap, struct gc_obj *obj) {
  // We are visiting OBJ for the first time.
  // OBJ is the old address, but it is already forwarded.

  // First, visit any pending ephemeron for OBJ.
  struct gc_ephemeron *ephemeron;
  while ((ephemeron =
            pop_pending_ephemeron(heap->pending_ephemerons, obj))) {
    ASSERT(obj == ephemeron->key);
    ephemeron->key = forwarded(obj);
    visit_field(&ephemeron->value, heap);
  }

  // Then if OBJ is itself an ephemeron, trace it.
  if ((ephemeron = as_ephemeron(forwarded(obj)))
      && !ephemeron->dead) {
    if (is_forwarded(ephemeron->key)) {
      ephemeron->key = forwarded(ephemeron->key);
      visit_field(&ephemeron->value, heap);
    } else {
      add_pending_ephemeron(heap->pending_ephemerons,
                            ephemeron->key, ephemeron);
    }
  }
}

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  visit_ephemerons(heap, obj); // *
  return new_obj;
}

We wire it into the copy routine, as that's the bit of the collector that is called only once per object and which has access to the old address. We actually can't process ephemerons during the Cheney field scan, as there we don't have old object addresses.

Then at the end of collection, we kill any ephemeron whose key hasn't been traced:

void kill_pending_ephemerons(struct gc_heap *heap) {
  struct gc_ephemeron *ephemeron;
  while ((ephemeron =
            pop_any_pending_ephemeron(heap->pending_ephemerons)))
    ephemeron->dead = 1;
}

void collect(struct gc_heap *heap) {
  // ...
  kill_pending_ephemerons(heap);
}

First observation: Gosh, this is quite a mess. It's more code than the core collector, and it's gnarly. There's a hash table, for goodness' sake. Goodbye, elegant algorithm!

Second observation: Well, at least it works.

Third observation: Oh. It works in the same way as tracing in the copy routine works: well enough for shallow graphs, but catastrophically for arbitrary graphs. Calling visit_field from within copy introduces unbounded recursion, as tracing one value can cause more ephemerons to resolve, ad infinitum.

Well. We seem to have reached a dead-end, for now. Will our hero wrest victory from the jaws of defeat? Tune in next time to find out: same garbage time (unpredictable), same garbage channel (my wordhoard). Happy hacking!

by Andy Wingo at December 12, 2022 02:02 PM

December 11, 2022

Andy Wingo

we iterate so that you can recurse

Sometimes when you see an elegant algorithm, you think "looks great, I just need it to also do X". Perhaps you are able to build X directly out of what the algorithm gives you; fantastic. Or, perhaps you can alter the algorithm a bit, and it works just as well while also doing X. Sometimes, though, you alter the algorithm and things go pear-shaped.

Tonight's little note builds on yesterday's semi-space collector article and discusses an worse alternative to the Cheney scanning algorithm.

To recall, we had this visit_field function that takes an edge in the object graph, as the address of a field in memory containing a struct gc_obj*. If the edge points to an object that was already copied, visit_field updates it to the forwarded address. Otherwise it copies the object, thus computing the new address, and then updates the field.

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  return new_obj;
}

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

Although a newly copied object is in tospace, all of its fields still point to fromspace. The Cheney scan algorithm later visits the fields in the newly copied object with visit_field, which both discovers new objects and updates the fields to point to tospace.

One disadvantage of this approach is that the order in which the objects are copied is a bit random. Given a hierarchical memory system, it's better if objects that are accessed together in time are close together in space. This is an impossible task without instrumenting the actual data access in a program and then assuming future accesses will be like the past. Instead, the generally-accepted solution is to ensure that objects that are allocated close together in time be adjacent in space. The bump-pointer allocator in a semi-space collector provides this property, but the evacuation algorithm above does not: it would need to preserve allocation order, but instead its order is driven by graph connectivity.

I say that the copying algorithm above is random but really it favors a breadth-first traversal; if you have a binary tree, first you will copy the left and the right nodes of the root, then the left and right children of the left, then the left and right children of the right, then grandchildren, and so on. Maybe it would be better to keep parent and child nodes together? After all they are probably allocated that way.
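To see that breadth-first order concretely, here is a standalone toy simulation (my own sketch, no real GC involved): objects are tree nodes identified by index, and a Cheney-style scan appends each visited node's children to the output queue as the scan sweeps forward.

```c
#include <assert.h>
#include <stddef.h>

/* Toy nodes: a binary tree stored by index; -1 means no child. */
struct node { int left, right; };

/* Cheney-style breadth-first "copy": enqueue the root, then sweep the
 * output queue (scan), appending each visited node's children (hp).
 * ORDER records the visit order; returns the number of nodes visited. */
static int cheney_order(const struct node *graph, int root,
                        int *order, int max) {
  int scan = 0, hp = 0;
  order[hp++] = root;
  while (scan < hp) {
    int n = order[scan++];
    if (graph[n].left >= 0 && hp < max) order[hp++] = graph[n].left;
    if (graph[n].right >= 0 && hp < max) order[hp++] = graph[n].right;
  }
  return hp;
}
```

For the three-level tree 0 → (1, 2), 1 → (3, 4), 2 → (5, 6), the visit order comes out 0, 1, 2, 3, 4, 5, 6: both children of the root land before any grandchild, so parents end up separated from their children in tospace.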

So, what if we change the algorithm:

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  trace_heap_object(new_obj, heap, visit_field); // *
  return new_obj;
}

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

void collect(struct gc_heap *heap) {
  flip(heap);
  trace_roots(heap, visit_field);
}

Here we favor a depth-first traversal: we eagerly call trace_heap_object within copy. No need for the Cheney scan algorithm; tracing does it all.

The thing is, this works! It might even have better performance for some workloads, depending on access patterns. And yet, nobody does this. Why?

Well, consider a linked list with a million nodes; you'll end up with a million recursive calls to copy, as visiting each link eagerly traverses the next. While I am all about unbounded recursion, an infinitely extensible stack is something that a language runtime has to provide to a user, and here we're deep into implementing-the-language-runtime territory. At some point a user's deep heap graph is going to cause a gnarly system failure via stack overflow.

Ultimately, stack space needed by a GC algorithm counts towards collector memory overhead. In the case of a semi-space collector you already need twice the amount of memory as your live object graph, and if you recursed instead of iterated this might balloon to 3x or more, depending on the heap graph shape.

Hey that's my note! All this has been context for some future article, so this will be on the final exam. Until then!

by Andy Wingo at December 11, 2022 09:19 PM

December 10, 2022

Andy Wingo

a simple semi-space collector

Good day, hackfolk. Today's article is about semi-space collectors. Many of you know what these are, but perhaps not so many have seen an annotated implementation, so let's do that.

Just to recap, the big picture here is that a semi-space collector divides a chunk of memory into two equal halves or spaces, called the fromspace and the tospace. Allocation proceeds linearly across tospace, from one end to the other. When the tospace is full, we flip the spaces: the tospace becomes the fromspace, and the fromspace becomes the tospace. The collector copies out all live data from the fromspace to the tospace (hence the names), starting from some set of root objects. Once the copy is done, allocation then proceeds in the new tospace.

In practice when you build a GC, it's parameterized in a few ways, one of them being how the user of the GC will represent objects. Let's take as an example a simple tag-in-the-first-word scheme:

struct gc_obj {
  union {
    uintptr_t tag;
    struct gc_obj *forwarded; // for GC
  };
  uintptr_t payload[0];
};

We'll divide all the code in the system into GC code and user code. Users of the GC define how objects are represented. When user code wants to know what the type of an object is, it looks at the first word to check the tag. But, you see that GC has a say in what the representation of user objects needs to be: there's a forwarded member too.

static const uintptr_t NOT_FORWARDED_BIT = 1;
int is_forwarded(struct gc_obj *obj) {
  return (obj->tag & NOT_FORWARDED_BIT) == 0;
}
void* forwarded(struct gc_obj *obj) {
  return obj->forwarded;
}
void forward(struct gc_obj *from, struct gc_obj *to) {
  from->forwarded = to;
}

forwarded is a forwarding pointer. When GC copies an object from fromspace to tospace, it clobbers the first word of the old copy in fromspace, writing the new address there. It's like when you move to a new flat and have your mail forwarded from your old to your new address.

There is a contract between the GC and the user in which the user agrees to always set the NOT_FORWARDED_BIT in the first word of its objects. That bit is a way for the GC to check if an object is forwarded or not: a forwarded pointer will never have its low bit set, because allocations are aligned on some power-of-two boundary, for example 8 bytes.
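As an illustration of that contract, here is a hypothetical user-side sketch (the kind names PAIR and STRING are made up, and NOT_FORWARDED_BIT is restated so the sketch is standalone): the user packs an object-kind enum into the tag word while always keeping the low bit set, mirroring the `(EPHEMERON << 1) | NOT_FORWARDED_BIT` pattern used elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Restated here so the sketch is self-contained. */
static const uintptr_t NOT_FORWARDED_BIT = 1;

/* Hypothetical user-side object kinds. */
enum user_obj_kind { PAIR, STRING };

/* Pack a kind into a tag word, always setting the low bit so the GC
 * never mistakes a live object's tag for a forwarding pointer. */
static uintptr_t make_tag(enum user_obj_kind kind) {
  return ((uintptr_t)kind << 1) | NOT_FORWARDED_BIT;
}

/* Recover the kind from a (non-forwarded) tag word. */
static enum user_obj_kind tag_kind(uintptr_t tag) {
  return (enum user_obj_kind)(tag >> 1);
}
```

Because allocations are aligned, a real forwarding pointer always has its low bit clear, so tags built this way can never collide with forwarded addresses.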

struct gc_heap;

// To implement by the user:
size_t heap_object_size(struct gc_obj *obj);
size_t trace_heap_object(struct gc_obj *obj, struct gc_heap *heap,
                         void (*visit)(struct gc_obj **field,
                                       struct gc_heap *heap));
size_t trace_roots(struct gc_heap *heap,
                   void (*visit)(struct gc_obj **field,
                                 struct gc_heap *heap));

The contract between GC and user is in practice one of the most important details of a memory management system. As a GC author, you want to expose the absolute minimum interface, to preserve your freedom to change implementations. The GC-user interface does need to have some minimum surface area, though, for example to enable inlining of the hot path for object allocation. Also, as we see here, there are some operations needed by the GC which are usually implemented by the user: computing the size of an object, tracing its references, and tracing the root references. If this aspect of GC design interests you, I would strongly recommend having a look at MMTk, which has been fruitfully exploring this space over the last two decades.

struct gc_heap {
  uintptr_t hp;
  uintptr_t limit;
  uintptr_t from_space;
  uintptr_t to_space;
  size_t size;
};

Now we get to the implementation of the GC. With the exception of how to inline the allocation hot-path, none of this needs to be exposed to the user. We start with a basic definition of what a semi-space heap is, above, and below we will implement collection and allocation.

static uintptr_t align(uintptr_t val, uintptr_t alignment) {
  return (val + alignment - 1) & ~(alignment - 1);
}
static uintptr_t align_size(uintptr_t size) {
  return align(size, sizeof(uintptr_t));
}

All allocators have some minimum alignment, which is usually a power of two at least as large as the target language's ABI alignment. Usually it's a word or two; here we just use one word (4 or 8 bytes).
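As a quick sanity check of the rounding formula, here is the same align helper restated as a self-contained snippet:

```c
#include <assert.h>
#include <stdint.h>

/* Round VAL up to the next multiple of ALIGNMENT (a power of two):
 * adding alignment-1 then masking off the low bits. */
static uintptr_t align(uintptr_t val, uintptr_t alignment) {
  return (val + alignment - 1) & ~(alignment - 1);
}
```

With 8-byte alignment, align(13, 8) rounds up to 16, while already-aligned values like 16 (and zero) pass through unchanged.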

struct gc_heap* make_heap(size_t size) {
  size = align(size, getpagesize());
  struct gc_heap *heap = malloc(sizeof(struct gc_heap));
  void *mem = mmap(NULL, size, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  heap->to_space = heap->hp = (uintptr_t) mem;
  heap->from_space = heap->limit = heap->hp + size / 2;
  heap->size = size;
  return heap;
}

Making a heap is just requesting a bunch of memory and dividing it in two. How you get that space differs depending on your platform; here we use mmap and also the platform malloc for the struct gc_heap metadata. Of course you will want to check that both the mmap and the malloc succeed :)

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  return new_obj;
}

void flip(struct gc_heap *heap) {
  heap->hp = heap->from_space;
  heap->from_space = heap->to_space;
  heap->to_space = heap->hp;
  heap->limit = heap->hp + heap->size / 2;
}

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

void collect(struct gc_heap *heap) {
  flip(heap);
  uintptr_t scan = heap->hp;
  trace_roots(heap, visit_field);
  while(scan < heap->hp) {
    struct gc_obj *obj = (struct gc_obj*)scan;
    scan += align_size(trace_heap_object(obj, heap, visit_field));
  }
}

Here we have the actual semi-space collection algorithm! It's a tiny bit of code about which people have written reams of prose, and to be fair there are many things to say—too many for here.

Personally I think the most interesting aspect of a semi-space collector is the so-called "Cheney scanning algorithm": when we see an object that's not yet traced, in visit_field, we copy() it to tospace, but don't actually look at its fields. Instead collect keeps track of the partition of tospace that contains copied objects which have not yet been traced, which are those in [scan, heap->hp). The Cheney scan sweeps through this space, advancing scan, possibly copying more objects and extending heap->hp, until such a time as the needs-tracing partition is empty. It's quite a neat solution that requires no additional memory.

inline struct gc_obj* allocate(struct gc_heap *heap, size_t size) {
retry:
  uintptr_t addr = heap->hp;
  uintptr_t new_hp = align_size(addr + size);
  if (heap->limit < new_hp) {
    collect(heap);
    if (heap->limit - heap->hp < size) {
      fprintf(stderr, "out of memory\n");
      abort();
    }
    goto retry;
  }
  heap->hp = new_hp;
  return (struct gc_obj*)addr;
}

Finally, we have the allocator: the reason we have the GC in the first place. The fast path just returns heap->hp, and arranges for the next allocation to return heap->hp + size. The slow path calls collect() and then retries.

Welp, that's a semi-space collector. Until next time for some notes on ephemerons again. Until then, have a garbage holiday season!

by Andy Wingo at December 10, 2022 08:50 PM

December 02, 2022

Emmanuele Bassi

On PyGObject

Okay, I can’t believe I have to do this again.

This time, let’s leave the Hellsite out of it, and let’s try to be nuanced from the start. I’d like to avoid getting grief online.

The current state of the Python bindings for GObject-based libraries is making it really hard to recommend using Python as a language for developing GTK and GNOME applications.

PyGObject is currently undermaintained, even after the heroic efforts of Christoph Reiter to keep the fires burning through the long night. The Python community needs more people to work on the bindings, if we want Python to be a first class citizen of the ecosystem.

There’s a lot to do, and not nearly enough people left to do it.

Case study: typed instances

Yes, thou shalt use GObject should be the law of the land; but there are legitimate reasons to use typed instances, and GTK 4 has a few of them: events (GdkEvent), render nodes (GskRenderNode), and expressions (GtkExpression).

At this very moment, it is impossible to use the types above from Python. PyGObject will literally error out if you try to do so. There are technical reasons why that was a reasonable choice 15+ years ago, but most language bindings written since then can handle typed instances just fine. In fact, PyGObject does handle them, since GParamSpec is a GTypeInstance; of course, that’s because PyGObject has some ad hoc code for them.

Dealing with events and render nodes is not so important in GTK 4; but not having access to the expressions API makes writing list widgets incredibly more complicated, requiring you to set up everything through UI definition files and never modify the objects programmatically.

Case study: constructing and disposing

While most of the API in PyGObject is built through introspection, the base wrapper for the GObject class is still very much written in CPython. This requires, among other things, wiring the class’s virtual functions manually, and figuring out the interactions between Python, GObject, and the type system wrappers. This means Python types that inherit from GObject don’t have automatic access to the GObjectClass.constructed and GObjectClass.dispose virtual functions. Normally, this would not be an issue, but modern GObject-based libraries have started to depend on being able to control construction and destruction sequences.

For instance, it is necessary for any type that inherits from GtkWidget to ensure that all its child widgets are disposed manually. While that was possible through the “destroy” signal in GTK3, in GTK4 the signal was removed and everything should go through the GObjectClass.dispose virtual function. Since PyGObject does not allow overriding or implementing that virtual function, your Python class cannot inherit from GtkWidget, but must inherit from ancillary classes like GtkBox or AdwBin. That’s even more relevant for the disposal of resources created through the composite template API. Theoretically, using Gtk.Template would take care of this, but since we cannot plug the Python code into the underlying CPython, we’re stuck.

Case study: documentation, examples, and tutorials

While I have been trying to write Python examples for the GNOME developers documentation, I cannot write everything by myself. Plus, I can’t convince Google to only link to what I write. The result is that searching for “Python” and “GNOME” or “GTK” will inevitably lead people to the GTK3 tutorials and references.

The fragmentation of the documentation is also an issue. The PyGObject website is off on its own, which is understandable for a Python project; then we have the GTK3 tutorial, which hasn’t been updated in a while, and for which there are no plans to have a GTK 4 version.

Additionally, the Python API reference is currently stuck on GTK3, with Python references for GTK4 and friends off on the side.

It would be great to unify the reference sites, and possibly have them under the GNOME infrastructure; but at this point, I’d settle for having a single place to find everything, like GJS did.

What can you do?

Pick up PyGObject. Learn how it works, and how the GObject and CPython API interact.

Write Python overrides to ensure that API like GTK and GIO are nice to use with idiomatic Python.

Write tests for PyGObject.

Write documentation.

Port existing tutorials, examples, and demos to GTK4.

Help Christoph out when it comes to triaging issues, and reviewing merge requests.

Join GNOME Python on Matrix, or the #gnome-python channel on Libera.

Ask questions and provide answers on Discourse.

What happens if nobody does anything?

Right now, we’re in maintenance mode. Things work because of inertia, and because nobody is really pushing the bindings outside of their existing functionality.

Let’s be positive, for a change, and assume that people will show up. They did for Vala when I ranted about it five years ago after a particularly frustrating week dealing with constant build failures in GNOME, so maybe the magic will happen again.

If people do not show up, though, what will likely happen is that Python will just fall by the wayside inside GNOME. Python developers won’t use GTK to write GUI applications; existing Python applications targeting GNOME will either wither and die, or get ported to other languages. The Python bindings themselves may stop working with newer versions of Python, which will inevitably lead downstream distributors to jettison the bindings themselves.

We have been through this dance with the C# bindings and Mono, and the GNOME and Mono communities are all the poorer for it, so I’d like to avoid losing another community of talented developers. History does not need to repeat itself.

by ebassi at December 02, 2022 06:03 PM

November 30, 2022

Eric Meyer

How to Verify Site Ownership on Mastodon Profiles

Like many of you, I’ve been checking out Mastodon and finding more and more things I like.  Including the use of XFN (XHTML Friends Network) semantics to verify ownership of sites you link from your profile’s metadata!  What that means is, you can add up to four links in your profile, and if you have an XFN-compliant link on that URL pointing to your Mastodon profile, it will show up as verified as actually being your site.

Okay, that probably also comes off a little confusing.  Let me walk through the process.

First, go to your home Mastodon server and edit your profile.  On many servers, there should be an “Edit profile” link under your user avatar.

Here’s what it looks like for me.  Yes, I prefer Light Mode.  No, I don’t want to have a debate about it.

I saw the same thing on another Mastodon server where I have an account, so it seems to be common to Mastodon in general.  I can’t know what every Mastodon server does, though, so you might have to root around to find how you edit your profile.  (Similarly, I can’t be sure that everything will be exactly as I depict it below, but hopefully it will be at least reasonably close.)

Under “Appearance” in the profile editing screen, which I believe is the default profile editing page, there should be a section called “Profile metadata”.  You’ll probably have to scroll a bit to reach it.  You can add up to four labels with content, and a very common label is “Web” or “Homepage” with the URL of your personal server.  They don’t all have to be links to sites; you could add your favorite color or relationship preference(s) or whatever.

I filled a couple of these in for demonstration purposes, but now they’re officially part of my profile.  No, I don’t want to debate indentation preferences, either.

But look, over there next to the table, there’s a “Verification” section, with a little bit of explanation and a field containing some markup you can copy, but it’s cut off before the end of the markup.  Here’s what mine looks like in full:

<a rel="me" href="">Mastodon</a>

If I take this markup and add it to any URL I list in my metadata, then that entry in my metadata table will get special treatment, because it will mean I’ve verified it.

The important part is the rel="me", which establishes a shared identity.  Here’s how it’s (partially) described by XFN 1.1:

A link to yourself at a different URL. Exclusive of all other XFN values. Required symmetric.

I admit, that’s written in terse spec-speak, so let’s see how this works out in practice.

First, let’s look at the markup in my Mastodon profile’s page.  Any link to another site in the table of profile metadata has a me value in the rel attribute, like so:

<a href="" rel="nofollow noopener noreferrer me">

That means I’ve claimed via Mastodon that is me at another URL.

But that’s not enough, because I could point at the home page of, say, Wikipedia as if it were mine.  That’s why XFN requires the relationship to be symmetric, which is to say, there needs to be a rel="me" annotated link on each end.  (On both ends.  However you want to say that.)

So on the page being pointed to, which in my case is, I need to include a link back to my Mastodon profile page, and that link also has to have rel="me".  That’s the markup Mastodon provided for me to copy, which we saw before and I’ll repeat here:

<a rel="me" href="">Mastodon</a>

Again, the important part is that the href points to my Mastodon profile page, and there’s a rel attribute containing me.  It can contain other things, like noreferrer, but needs to have me for the verification to work.  Note that the content of the link element doesn’t have to be the text “Mastodon”.  In my case, I’m using a Mastodon logo, with the markup looking like this:

<a rel="me" href="">
 	<img src="/pix/icons/mastodon.svg" alt="Mastodon">
</a>
With that in place, there’s a “me” link pointing to a page that contains a “me” link.  That’s a symmetric relationship, as XFN requires, and it verifies that the two pages have a shared owner.  Who is me!

Thus, if you go to my Mastodon profile page, in the table of my profile metadata, the entry for my homepage is specially styled to indicate it’s been verified as actually belonging to me.

The table as it appears on my profile page for me.  Your colors may vary.

And that’s how it works.

Next question: how can I verify my GitHub page?  At the moment, I’d have to put my Mastodon profile page’s URL into the one open field for URLs in GitHub profiles, because GitHub also does the rel="me" thing for its profile links.  But if I do that, I’d have to remove the link to my homepage, which I don’t want to do.

Until GitHub either provides a dedicated Mastodon profile field the way it provides a dedicated Twitter profile field, or else allows people to add multiple URLs to their profiles the way Mastodon does, I won’t be able to verify my GitHub page on Mastodon.  Not a huge deal for me, personally, but in general it would be nice to see GitHub become more flexible in this area.  Very smart people are also asking for this, so hopefully that will happen soon(ish).

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at November 30, 2022 06:45 PM

November 29, 2022

José Dapena

Native call stack profiling (2/3): Event Tracing for Windows and Chromium

In the last blog post, I introduced call stack profiling, why it is useful, and how system-wide support can help. This new blog post will talk about Windows native call stack tracing, and how it is integrated in Chromium.

Event Tracing for Windows (ETW)

Event Tracing for Windows, usually known by the acronym ETW, is a Windows kernel-based tool for logging kernel and application events to a file.

A good description of its architecture is available at Microsoft Learn: About Event Tracing.

Essentially, it is an efficient event-capturing tool, in some ways similar to LTTng. Its event-recording stage is as lightweight as possible, so that processing the collected data does not distort the results, reducing the observer effect.

The main participants are:
– Providers: kernel components (including device drivers) or applications that emit log events.
– Controllers: tools that control when a recording session starts and stops, which providers to record, and what each provider is expected to log. Controllers also decide where to dump the recorded data (typically a file).
– Consumers: tools that can read and analyze the recorded data, and combine it with system information (e.g. debugging information). Consumers will usually get the data from previously recorded files, but it is also possible to consume tracing information in real time.

What about call stack profiling? ETW supports call stack sampling: it can capture a call stack when certain events happen and associate that call stack with the event. Bruce Dawson has written a fantastic blog post about the topic.

Chromium call stack profiling in Windows

Chromium provides support for call stack profiling. This is done at different levels of the stack:
– It allows building with frame pointers, so CPU profile samplers can properly capture the full call stack.
– V8 can generate symbol information for JIT-compiled code. This is supported for ETW (and also for Linux Perf).


On any platform, compilation will usually benefit from the GN flag enable_profiling=true. This enables frame pointer support. On Windows, it also enables generation of function information for code generated by V8.

Also, at least symbol_level=1 should be set, so that function names from the compilation stage are available.

Chrome startup

To enable generation of V8 profiling information in Windows, these flags should be passed to chrome on launch:

chrome --js-flags="--enable-etw-stack-walking --interpreted-frames-native-stack"

--enable-etw-stack-walking will emit information of the functions compiled by V8 JIT engine, so they can be recorded while sampling the stack.

--interpreted-frames-native-stack will show the frames of interpreted code in the native stack, so external profilers such as ETW can properly show those in the profiling samples.


Then, a session of the workload to analyze can be captured with Windows Performance Recorder.

An alternative tool with specific Chromium support, UIForETW, can be used too. Its main advantage is that it allows selecting specific Chromium trace categories, which will be emitted in the same trace. Its author, Bruce Dawson, has explained very well how to use it.


For analysis, the tool Windows Performance Analyzer (WPA) can be used. Both UIForETW and Windows Performance Recorder will offer to open the obtained trace at the end of the capture for analysis.

Before starting analysis, in WPA, add the paths where the .PDB files with debugging information are available.

Then, select Computation/CPU Usage (Sampled):


From the available charts, we are interested in the ones providing stackwalk information:


In the last post of this series, I will present the work done in 2022 to improve V8 support for Windows ETW.

by José Dapena Paz at November 29, 2022 09:15 AM

November 28, 2022

Andy Wingo

are ephemerons primitive?

Good evening :) A quick note, tonight: I've long thought that ephemerons are primitive and can't be implemented with mark functions and/or finalizers, but today I think I have a counterexample.

For context, one of the goals of the GC implementation I have been working on is to replace Guile's current use of the Boehm-Demers-Weiser (BDW) conservative collector. Of course, changing a garbage collector for a production language runtime is risky, and for Guile one of the mitigation strategies for this work is that the new collector is behind an abstract API whose implementation can be chosen at compile-time, without requiring changes to user code. That way we can first switch to BDW-implementing-the-new-GC-API, then switch the implementation behind that API to something else.

Abstracting GC is a tricky problem to get right, and I thank the MMTk project for showing that this is possible -- you have user-facing APIs that need to be implemented by concrete collectors, but also extension points so that the user can provide some compile-time configuration too, for example to provide field-tracing visitors that take into account how a user wants to lay out objects.

Anyway. As we discussed last time, ephemerons usually have explicit support from the GC, so we need an ephemeron abstraction as part of the abstract GC API. The question is, can BDW-GC provide an implementation of this API?

I think the answer is "yes, but it's very gnarly and will kill performance so bad that you won't want to do it."

the contenders

Consider that the primitives that you get with BDW-GC are custom mark functions, run on objects when they are found to be live by the mark workers; disappearing links, a kind of weak reference; and finalizers, which receive the object being finalized, can allocate, and indeed can resurrect the object.

BDW-GC's finalizers are a powerful primitive, but not one that is useful for implementing the "conjunction" aspect of ephemerons, as they cannot constrain the marker's idea of graph connectivity: a finalizer can only prolong the life of an object subgraph, not cut it short. So let's put finalizers aside.

Weak references have a tantalizingly close kind of conjunction property: if the weak reference itself is alive, and the referent is also otherwise reachable, then the weak reference can be dereferenced. However this primitive only involves the two objects E and K; there's no way to then condition traceability of a third object V to E and K.

We are left with mark functions. These are an extraordinarily powerful interface in BDW-GC, but somewhat expensive also: not inlined, and going against the grain of what BDW-GC is really about (heaps in which the majority of all references are conservative). But, OK. The way they work is, your program allocates a number of GC "kinds", and associates mark functions with those kinds. Then when you allocate objects, you use those kinds. BDW-GC will call your mark functions when tracing an object of those kinds.

Let's assume firstly that you have a kind for ephemerons; then when you go to mark an ephemeron E, you mark the value V only if the key K has been marked. Problem solved, right? Only halfway: you also have to handle the case in which E is marked first, then K. So you publish E to a global hash table, and... well. You would mark V when you mark a K for which there is a published E. But, for that you need a hook into marking V, and V can be any object...

So now we assume additionally that all objects are allocated with user-provided custom mark functions, and that all mark functions check if the marked object is in the published table of pending ephemerons, and if so, mark the corresponding values. This is essentially what a proper ephemeron implementation would do, though there are some optimizations one can do to avoid checking the table for each object before the mark stack runs empty for the first time. In this case, yes, you can do it! Additionally, if you register disappearing links for the K field in each E, you can know if an ephemeron E was marked dead in a previous collection. Add a pre-mark hook (something BDW-GC provides) to clear the pending ephemeron table, and you are in business.

yes, but no

So, it is possible to implement ephemerons with just custom mark functions. I wouldn't want to do it, though: missing the mostly-avoid-pending-ephemeron-check optimization would be devastating, and really what you want is support in the GC implementation. I think that for the BDW-GC implementation in whippet I'll just implement weak-key associations, in which the value is always marked strongly unless the key was dead on a previous collection, using disappearing links on the key field. That way a (possibly indirect) reference from a value V to a key K can indeed keep K alive, but oh well: it's a conservative approximation of what should happen, and not worse than what Guile has currently.

Good night and happy hacking!

by Andy Wingo at November 28, 2022 09:11 PM