# Planet Igalia

## September 23, 2022

### André Almeida

#### futex2 at Linux Plumbers Conference 2022

In-person conferences are finally back! After two years of remote conferences, the kernel development community got together in Dublin, Ireland, to discuss current problems that need collaboration to be solved. As in past editions, I took the opportunity to discuss futex2, a project I’m deeply involved in: an effort to solve issues found in the current futex interface. This year’s session was about NUMA awareness of futex.

## September 21, 2022

#### What's new for RISC-V in LLVM 15

LLVM 15.0.0 was released about two weeks ago now, and I wanted to highlight some of the RISC-V specific changes and improvements that were introduced, going into a little more detail than I was able to in the release notes.

In case you're not familiar with LLVM's release schedule, it's worth noting that there are two major LLVM releases a year (i.e. one roughly every 6 months) and these are timed releases as opposed to being cut when a pre-agreed set of feature targets have been met. We're very fortunate to benefit from an active and growing set of contributors working on RISC-V support in LLVM projects, who are responsible for the work I describe below - thank you! I coordinate biweekly sync-up calls for RISC-V LLVM contributors, so if you're working in this area please consider dropping in.

Linker relaxation is a mechanism for allowing the linker to optimise code sequences at link time. A code sequence to jump to a symbol might conservatively take two instructions, but once the target address is known at link-time it might be small enough to fit in the immediate of a single instruction, meaning the other can be deleted. Because a linker performing relaxation may delete bytes (rather than just patching them), offsets including those for jumps within a function may be changed. To allow this to happen without breaking program semantics, even local branches that might typically be resolved by the assembler must be emitted as a relocation when linker relaxation is enabled. See the description in the RISC-V psABI or Palmer Dabbelt's blog post on linker relaxation for more background.
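As a simplified illustration (symbol name invented; see the RISC-V psABI for the full details), a call is conservatively emitted as a two-instruction sequence carrying R_RISCV_CALL_PLT and R_RISCV_RELAX relocations, which the linker can relax to a single instruction when the target turns out to be within ±1 MiB:

```asm
# Before linking: conservative two-instruction call sequence.
auipc ra, 0          # R_RISCV_CALL_PLT (symbol 'foo'), R_RISCV_RELAX
jalr  ra, 0(ra)      # upper/lower parts of the PC-relative offset

# After relaxation, if 'foo' is within +/-1 MiB of the call site:
jal   ra, foo        # 4 bytes deleted; later offsets shift, hence R_RISCV_ALIGN
```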

Although LLVM has supported codegen for linker relaxation for a long time, LLD (the LLVM linker) has until now lacked support for processing these relaxations. Relaxation is primarily an optimisation, but processing of R_RISCV_ALIGN (the alignment relocation) is necessary for correctness when linker relaxation is enabled, meaning it's not possible to link such object files correctly without at least some minimal support. Fangrui Song implemented support for R_RISCV_ALIGN/R_RISCV_CALL/R_RISCV_CALL_PLT/R_RISCV_TPREL_* relocations in LLVM 15 and wrote up a blog post with more implementation details, which is a major step in bringing us to parity with the GCC/binutils toolchain.

## Optimisations

As with any release, there's been a large number of codegen improvements, both target-independent and target-dependent. One addition to highlight in the RISC-V backend is the new RISCVCodeGenPrepare pass. This is the latest piece of a long-running campaign (largely led by Craig Topper) to improve code generation related to sign/zero extensions on RV64. CodeGenPrepare is a target-independent pass that performs some late-stage transformations to the input ahead of lowering to SelectionDAG. The RISC-V specific version looks for opportunities to replace a zero-extension to i64 with a sign-extension (which is cheaper on RV64).

Another new pass that may be of interest is RISCVMakeCompressible (contributed by Lewis Revill and Craig Blackmore). Rather than trying to improve generated code performance, this is solely focused on reducing code size, and may increase the static instruction count in order to do so (which is why it's currently only enabled at the -Oz optimisation level). It looks for cases where an instruction has been selected which can't be represented by one of the compressed (16-bit as opposed to 32-bit wide) instruction forms, for instance because the register isn't one of those addressable from the compressed instruction, or the offset is out of range. It will then look for opportunities to transform the input to make the instructions compressible. Grabbing two examples from the header comment of the pass:

; 'zero' register not addressable in compressed store.
                 =>   li a1, 0
sw zero, 0(a0)   =>   c.sw a1, 0(a0)
sw zero, 8(a0)   =>   c.sw a1, 8(a0)
sw zero, 4(a0)   =>   c.sw a1, 4(a0)
sw zero, 24(a0)  =>   c.sw a1, 24(a0)


and

; compressed stores support limited offsets
lui a2, 983065     =>   lui a2, 983065
                   =>   addi  a3, a2, -256
sw  a1, -236(a2)   =>   c.sw  a1, 20(a3)
sw  a1, -240(a2)   =>   c.sw  a1, 16(a3)
sw  a1, -244(a2)   =>   c.sw  a1, 12(a3)
sw  a1, -248(a2)   =>   c.sw  a1, 8(a3)
sw  a1, -252(a2)   =>   c.sw  a1, 4(a3)
sw  a0, -256(a2)   =>   c.sw  a0, 0(a3)


There's a whole range of other backend codegen improvements, including additions to existing RISC-V specific passes, but unfortunately it's not feasible to enumerate them all.

One improvement to note from the Clang frontend is that the C intrinsics for the RISC-V Vector extension are now lazily generated, avoiding the need to parse a huge pre-generated header file and improving compile times.

## Support for new instruction set extensions

A batch of new instruction set extensions were ratified at the end of last year (see also the recently ratified extension list). LLVM 14 already featured a number of these (with the vector and ratified bit manipulation extensions no longer being marked as experimental). In LLVM 15 we were able to fill in some of the gaps, adding support for additional ratified extensions as well as some new experimental ones.

In particular:

• Assembler and disassembler support for the Zdinx, Zfinx, Zhinx, and Zhinxmin extensions. Cores that implement these extensions store double/single/half precision floating point values in the integer register file (GPRs) as opposed to having a separate floating-point register file (FPRs).
  • The instructions defined in the conventional floating point extensions are defined to instead operate on the general purpose registers, and instructions that become redundant (namely those that involve moving values from FPRs to GPRs) are removed.
  • Cores might implement these extensions rather than the conventional floating-point ones in order to reduce the amount of architectural state that is needed, reducing area and context-switch cost. The downside is of course that register pressure for the GPRs will be increased.
  • Codegen for these extensions is not yet supported (i.e. the extensions are only supported for assembly input or inline assembly). A patch to provide this support is under review though.
• Assembler and disassembler support for the Zicbom, Zicbop, and Zicboz extensions. These cache management operation (CMO) extensions add new instructions for invalidating, cleaning, and flushing cache blocks (Zicbom), zeroing cache blocks (Zicboz), and prefetching cache blocks (Zicbop).
  • These operations aren't currently exposed via C intrinsics, but they will be added once the appropriate naming has been agreed.
  • One of the questions raised during implementation was about the preferred textual format for the operands: specifically, whether it should be e.g. cbo.clean (a0)/cbo.clean 0(a0) to match the format used for other memory operations, or cbo.clean a0 as was used in an early binutils patch. We were able to agree between the CMO working group, LLVM, and GCC developers on the former approach.
• Assembler, disassembler, and codegen support for the Zmmul extension. This extension is a subset of the 'M' extension, providing the multiplication instructions without the division instructions.
• Assembler and disassembler support for the additional CSRs (control and status registers) and instructions introduced by the hypervisor and Svinval additions to the privileged architecture specification. Svinval provides fine-grained address-translation cache invalidation and fencing, while the hypervisor extension provides support for efficiently virtualising the supervisor-level architecture (used to implement KVM for RISC-V).
• Assembler and disassembler support for the Zihintpause extension. This adds the pause instruction, intended for use as a hint within spin-wait loops.
  • Zihintpause was actually the first extension to go through RISC-V International's fast-track architecture extension process back in early 2021. We were clearly slow to add it to LLVM, but are trying to keep a closer eye on ratified extensions going forwards.
• Support was added for the not-yet-ratified Zvfh extension, providing support for half precision floating point values in RISC-V vectors.
  • Unlike the extensions listed above, support for Zvfh is experimental. This is a status we use within the RISC-V backend for extensions that are not yet ratified and may change from release to release with no guarantees on backwards compatibility. Enabling support for such extensions requires passing -menable-experimental-extensions to Clang and specifying the extension's version when listing it in the -march string.

It's not present in LLVM 15, but LLVM 16 onwards will feature a user guide for the RISC-V target summarising the level of support for each extension (huge thanks to Philip Reames for kicking off this effort).

## Other changes

In case I haven't said it enough times: there are far more interesting changes than I could reasonably cover. Apologies if I've missed your favourite new feature or improvement. In particular, I've said relatively little about RISC-V Vector support. There's been a long series of improvements and correctness fixes in the LLVM 15 development window, after RVV was made non-experimental in LLVM 14, and there's much more to come in LLVM 16 (e.g. scalable vectorisation becoming enabled by default).

## September 20, 2022

### Danylo Piliaiev

It is a major milestone for a driver created without any hardware documentation.

The last major roadblocks were VK_KHR_dynamic_rendering and, to a much lesser extent, VK_EXT_inline_uniform_block. Huge props to Connor Abbott for implementing them both!

VK_KHR_dynamic_rendering was an especially nasty extension to implement on tiling GPUs because dynamic rendering allows splitting a render pass between several command buffers.

Desktop GPUs have no issues with this: they can just record and execute commands in the same order they are submitted, without any additional post-processing. Desktop GPUs don’t have render passes internally; to them, a render pass is just a sequence of commands.

On the other hand, tiling GPUs have the internal concept of a render pass: they do binning of the whole render pass geometry first, load part of the framebuffer into the tile memory, execute all render pass commands, store framebuffer contents into the main memory, then repeat load_framebuffer -> execute_renderpass -> store_framebuffer for all tiles. In Turnip the required glue code is created at the end of a render pass, while the whole render pass contents (when the render pass is split across several command buffers) are known only at submit time. Therefore we have to stitch the final render pass together right there.
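The per-tile loop described above can be sketched as follows (illustrative pseudocode only, not Turnip's actual structure):

```python
# Sketch of how a tiling GPU executes a render pass: each tile of the
# framebuffer is loaded into tile memory, all render pass commands are
# replayed against it, and the result is stored back to main memory.
def run_render_pass(tiles, commands):
    order = []
    for tile in tiles:
        order.append(("load_framebuffer", tile))
        for cmd in commands:
            order.append(("execute", tile, cmd))
        order.append(("store_framebuffer", tile))
    return order
```

This is why splitting a render pass across command buffers is painful: the full command list per tile is only known once every piece has been submitted.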

## What’s next?

Implementing Vulkan 1.3 was necessary to support the latest DXVK (Direct3D 9-11 translation layer). VK_KHR_dynamic_rendering itself was also necessary for the latest VKD3D (Direct3D 12 translation layer).

For now my plan is:

• Continue implementing new extensions for DXVK, VKD3D, and Zink as they come out.
• Focus more on performance.
• Improvements to driver debug tooling so it works better with internal and external debugging utilities.

## Introduction: an immutable OS

The Steam Deck runs SteamOS, a single-user operating system based on Arch Linux. Although derived from a standard package-based distro, the OS in the Steam Deck is immutable and system updates replace the contents of the root filesystem atomically instead of using the package manager.

An immutable OS makes the system more stable and its updates less error-prone, but users cannot install additional packages to add more software. This is not a problem for most users since they are only going to run Steam and its games (which are stored in the home partition). Nevertheless, the OS also has a desktop mode which provides a standard Linux desktop experience, and here it makes sense to be able to install more software.

How to do that though? It is possible for the user to become root, make the root filesystem read-write and install additional software there, but any changes will be gone after the next OS update. Modifying the rootfs can also be dangerous if the user is not careful.

The simplest and safest way to install additional software is with Flatpak, and that’s the method recommended in the Steam Deck Desktop FAQ. Flatpak is already installed and integrated in the system via the Discover app so I won’t go into more details here.

However, while Flatpak works great for desktop applications not every piece of software is currently available, and Flatpak is also not designed for other types of programs like system services or command-line tools.

Fortunately there are several ways to add software to the Steam Deck without touching the root filesystem, each one with different pros and cons. I will probably talk about some of them in the future, but in this post I’m going to focus on one that is already available in the system: systemd-sysext.

This is a tool included in recent versions of systemd and it is designed to add additional files (in the form of system extensions) to an otherwise immutable root filesystem. Each one of these extensions contains a set of files. When extensions are enabled (aka “merged”) those files will appear on the root filesystem using overlayfs. From then on the user can open and run them normally as if they had been installed with a package manager. Merged extensions are seamlessly integrated with the rest of the OS.

Since extensions are just collections of files they can be used to add new applications but also other things like system services, development tools, language packs, etc.

## Creating an extension: yakuake

I’m using yakuake as an example for this tutorial since the extension is very easy to create, it is an application that some users have been asking for, and it is not easy to distribute with Flatpak.

So let’s create a yakuake extension. Here are the steps:

1) Create a directory and unpack the files there:

$ mkdir yakuake
$ wget https://steamdeck-packages.steamos.cloud/archlinux-mirror/extra/os/x86_64/yakuake-21.12.1-1-x86_64.pkg.tar.zst
$ tar -C yakuake -xaf yakuake-*.tar.zst usr

2) Create a file called extension-release.NAME under usr/lib/extension-release.d with the fields ID and VERSION_ID taken from the Steam Deck’s /etc/os-release file.

$ mkdir -p yakuake/usr/lib/extension-release.d/
$ echo ID=steamos > yakuake/usr/lib/extension-release.d/extension-release.yakuake
$ echo VERSION_ID=3.3.1 >> yakuake/usr/lib/extension-release.d/extension-release.yakuake

3) Create an image file with the contents of the extension:

$ mksquashfs yakuake yakuake.raw

That’s it! The extension is ready.

A couple of important things: image files must have the .raw suffix and, despite the name, they can contain any filesystem that the OS can mount. In this example I used SquashFS but other alternatives like EroFS or ext4 are equally valid.

NOTE: systemd-sysext can also use extensions from plain directories (i.e. skipping the mksquashfs part). Unfortunately we cannot use them in our case because overlayfs does not work with the casefold feature that is enabled on the Steam Deck.

## Using the extension

Once the extension is created you simply need to copy it to a place where systemd-sysext can find it. There are several places where they can be installed (see the manual for a list) but due to the Deck’s partition layout and the potentially large size of some extensions it probably makes more sense to store them in the home partition and create a link from one of the supported locations (/var/lib/extensions in this example):

(deck@steamdeck ~)$ mkdir extensions
(deck@steamdeck ~)$ scp user@host:/path/to/yakuake.raw extensions/
(deck@steamdeck ~)$ sudo ln -s $PWD/extensions /var/lib/extensions

Once the extension is installed in that directory you only need to enable and start systemd-sysext:

(deck@steamdeck ~)$ sudo systemctl enable systemd-sysext
(deck@steamdeck ~)$ sudo systemctl start systemd-sysext

After this, if everything went fine you should be able to see (and run) /usr/bin/yakuake. The files should remain there from now on, even if you reboot the device. You can see what extensions are enabled with this command:

$ systemd-sysext status
HIERARCHY EXTENSIONS SINCE
/opt      none       -
/usr      yakuake    Tue 2022-09-13 18:21:53 CEST


If you add or remove extensions from the directory then a simple “systemd-sysext refresh” is enough to apply the changes.

Unfortunately, and unlike distro packages, extensions don’t have any kind of post-installation hooks or triggers, so in the case of Yakuake you probably won’t see an entry in the KDE application menu immediately after enabling the extension. You can solve that by running kbuildsycoca5 once from the command line.

## Limitations and caveats

Using systemd extensions is generally very easy but there are some things that you need to take into account:

1. Using extensions is easy (you put them in the directory and voilà!). However, creating extensions is not necessarily always easy. To begin with, any libraries, files, etc., that your extensions may need should be either present in the root filesystem or provided by the extension itself. You may need to combine files from different sources or packages into a single extension, or compile them yourself.
2. In particular, if the extension contains binaries they should probably come from the Steam Deck repository or they should be built to work with those packages. If you need to build your own binaries then having a SteamOS virtual machine can be handy. There you can install all development files and also test that everything works as expected. One could also create a Steam Deck SDK extension with all the necessary files to develop directly on the Deck.
3. Extensions are not distribution packages, they don’t have dependency information and therefore they should be self-contained. They also lack triggers and other features available in packages. For desktop applications I still recommend using a system like Flatpak when possible.
4. Extensions are tied to a particular version of the OS and, as explained above, the ID and VERSION_ID of each extension must match the values from /etc/os-release. If the fields don’t match then the extension will be ignored. This is to be expected because there’s no guarantee that a particular extension is going to work with a different version of the OS. This can happen after a system update. In the best case one simply needs to update the extension’s VERSION_ID, but in some cases it might be necessary to create the extension again with different/updated files.
5. Extensions only install files in /usr and /opt. Any other file in the image will be ignored. This can be a problem if a particular piece of software needs files in other directories.
6. When extensions are enabled the /usr and /opt directories become read-only because they are now part of an overlayfs. They will remain read-only even if you run steamos-readonly disable. If you really want to make the rootfs read-write you need to disable the extensions (systemd-sysext unmerge) first.
7. Unlike Flatpak or Podman (including toolbox / distrobox), this is (by design) not meant to isolate the contents of the extension from the rest of the system, so you should be careful with what you’re installing. On the other hand, this lack of isolation makes systemd-sysext better suited to some use cases than those container-based systems.
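Point 4's ID/VERSION_ID matching can be simulated locally. A minimal, self-contained sketch of the check systemd-sysext performs (illustrative file names, not run on a real Deck):

```shell
# systemd-sysext compares the ID and VERSION_ID in an extension's
# extension-release file against /etc/os-release, and ignores the
# extension on mismatch. Simulate the comparison with two local files.
cd "$(mktemp -d)"
printf 'ID=steamos\nVERSION_ID=3.3.1\n' > os-release   # stand-in for /etc/os-release
printf 'ID=steamos\nVERSION_ID=3.3.1\n' > extension-release.yakuake
if [ "$(sort os-release)" = "$(sort extension-release.yakuake)" ]; then
  echo "extension matches OS"
else
  echo "extension would be ignored"
fi
```

After a system update, changing VERSION_ID in the stand-in os-release file reproduces the "extension ignored" case described above.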

## Conclusion

systemd extensions are an easy way to add software (or data files) to the immutable OS of the Steam Deck in a way that is seamlessly integrated with the rest of the system. Creating them can be more or less easy depending on the case, but using them is extremely simple. Extensions are not packages, and systemd-sysext is not a package manager or a general-purpose tool to solve all problems, but if you are aware of its limitations it can be a practical tool. It is also possible to share extensions with other users, but here the usual warning against installing binaries from untrusted sources applies. Use with caution, and enjoy!

## September 12, 2022

### Eric Meyer

#### Nuclear Targeted Footnotes

One of the more interesting design challenges of The Effects of Nuclear Weapons was the fact that, like many technical texts, it has footnotes.  Not a huge number, and in fact one chapter has none at all, but they couldn’t be ignored.  And I didn’t want them to be inline between paragraphs or stuck into the middle of the text.

This was actually a case where Chris and I decided to depart a bit from the print layout, because in print a chapter has many pages, but online it has a single page.  So we turned the footnotes into endnotes, and collected them all near the end of each chapter.

Originally I had thought about putting footnotes off to one side in desktop views, such as in the right-hand grid gutter.  After playing with some rough prototypes, I realized this wasn’t going to go the way I wanted it to, and would likely make life difficult in a variety of display sizes between the “big desktop monitor” and “mobile device” realms.  I don’t know, maybe I gave up too easily, but Chris and I had already decided that endnotes were an acceptable adaptation and I decided to roll with that.

So here’s how the footnotes work.  First off, in the main-body text, a footnote marker is wrapped in a <sup> element and is a link that points at a named anchor in the endnotes. (I may go back and replace all the superscript elements with styled <mark> elements, but for now, they’re superscript elements.)  Here’s an example from the beginning of Chapter I, which also has a cross-reference link in it, classed as such even though we don’t actually style them any differently than other links.

This is true for a conventional “high explosive,” such as TNT, as well as for a nuclear (or atomic) explosion,<sup><a href="#fnote01">1</a></sup> although the energy is produced in quite different ways (<a href="#§1.11" class="xref">§ 1.11</a>).

Then, down near the end of the document, there’s a section that contains an ordered list.  Inside that list are the endnotes, which are in part marked up like this:

<li id="fnote01"><sup>1</sup> The terms “nuclear” and “atomic” may be used interchangeably so far as weapons, explosions, and energy are concerned, but “nuclear” is preferred for the reason given in <a href="#§1.11" class="xref">§ 1.11</a>.

The list item markers are switched off with CSS, and superscripted numbers stand in their place.  I do it that way because the footnote numbers are important to the content, but also have specific presentation demands that are difficult  —  nay, impossible — to pull off with normal markers, like raising them superscript-style. (List markers are only affected by a very limited set of properties.)
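A minimal sketch of that marker suppression (my guess at the rule; the site's actual selectors may differ):

```css
/* Switch off the native list markers; the superscripted numbers inside
   each item act as the visible markers instead. */
.endnotes ol {
  list-style: none;
}
```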

In order to get the footnote text to align along the start (left) edge of their content and have the numbers hang off the side, I elected to use the old negative-text-indent-positive-padding trick:

.endnotes li {
padding-inline-start: 0.75em;
text-indent: -0.75em;
}

That works great as long as there are never any double-digit footnote numbers, which was indeed the case… until Chapter VIII.  Dang it.

So, for any footnote number above 9, I needed a different set of values for the indent-padding trick, and I didn’t feel like adding in a bunch of greater-than-nine classes. Following-sibling combinator to the rescue!

.endnotes li:nth-of-type(9) ~ li {
margin-inline-start: -0.33em;
text-indent: -1.1em;
}

The extra negative start margin is necessary solely to get the text in the list items to align horizontally, though unnecessary if you don’t care about that sort of thing.

Okay, so the endnotes looked right when seen in their list, but I needed a way to get back to the referring paragraph after reading a footnote.  Thus, some “backjump” links got added to each footnote, pointing back to the paragraph that referred to them.

<span class="backjump">[ref. <a href="#§1.01">§ 1.01</a>]</span>

With that, a reader can click/tap a footnote number to jump to the corresponding footnote, then click/tap the reference link to get back to where they started.  Which is fine, as far as it goes, but that idea of having footnotes appear in context hadn’t left me.  I decided I’d make them happen, one way or another.

(Throughout all this, I wished more than once the HTML 3.0 proposal for <fn> had gone somewhere other than the dustbin of history and the industry’s collective memory hole.  Ah, well.)

I was thinking I’d need some kind of JavaScript thing to swap element nodes around when it occurred to me that clicking a footnote number would make the corresponding footnote list item a target, and if an element is a target, it can be styled using the :target pseudo-class.  Making it appear in context could be a simple matter of positioning it in the viewport, rather than with relation to the document.  And so:

.endnotes li:target {
position: fixed;
bottom: 0;
margin-inline: -2em 0;
border-top: 1px solid;
background: #FFF;
box-shadow: 0 0 3em 3em #FFF;
max-width: 45em;
}

That is to say, when an endnote list item is targeted, it’s fixedly positioned against the bottom of the viewport and given some padding and background and a top border and a box shadow, so it has a bit of a halo above it that sets it apart from the content it’s overlaying.  It actually looks pretty sweet, if I do say so myself, and allows the reader to see footnotes without having to jump back and forth on the page.  Now all I needed was a way to make the footnote go away.

Again I thought about going the JavaScript route, but I’m trying to keep to the Web’s slower pace layers as much as possible in this project for maximum compatibility over time and technology.  Thus, every footnote gets a “close this” link right after the backjump link, marked up like this:

<a href="#fnclosed" class="close">X</a></li>

(I realize that probably looks a little weird, but hang in there and hopefully I can clear it up in the next few paragraphs.)

So every footnote ends with two links, one to jump to the paragraph (or heading) that referred to it, which is unnecessary when the footnote has popped up due to user interaction; and then, one to make the footnote go away, which is unnecessary when looking at the list of footnotes at the end of the chapter.  It was time to juggle display and visibility values to make each appear only when necessary.

.endnotes li .close {
display: none;
visibility: hidden;
}
.endnotes li:target .close {
display: block;
visibility: visible;
}
.endnotes li:target .backjump {
display: none;
visibility: hidden;
}

Thus, the “close this” links are hidden by default, and revealed when the list item is targeted and thus pops up.  By contrast, the backjump links are shown by default, and hidden when the list item is targeted.

As it now stands, this approach has some upsides and some downsides.  One upside is that, since a URL with an identifier fragment is distinct from the URL of the page itself, you can dismiss a popped-up footnote with the browser’s Back button.  On kind of the same hand, though, one downside is that since a URL with an identifier fragment is distinct from the URL of the page itself, if you consistently use the “close this” link to dismiss a popped-up footnote, the browser history gets cluttered with the opened and closed states of various footnotes.

This is bad because you can get partway through a chapter, look at a few footnotes, and then decide you want to go back one page by hitting the Back button, at which point you discover you have to go back through all those footnote states in the history before you actually go back one page.

I feel like this is a thing I can (probably should) address by layering progressively-enhancing JavaScript over top of all this, but I’m still not quite sure how best to go about it.  Should I add event handlers and such so the fragment-identifier stuff is suppressed and the URL never actually changes?  Should I add listeners that will silently rewrite the browser history as needed to avoid this?  Ya got me.  Suggestions or pointers to live examples of solutions to similar problems are welcomed in the comments below.

Less crucially, the way the footnote just appears and disappears bugs me a little, because it’s easy to miss if you aren’t looking in the right place.  My first thought was that it would be nice to have the footnote unfurl from the bottom of the page, but it’s basically impossible (so far as I can tell) to animate the height of an element from 0 to auto.  You also can’t animate something like bottom: calc(-1 * calculated-height) to 0 because there is no CSS keyword (so far as I know) that returns the calculated height of an element.  And you can’t really animate from top: 100vh to bottom: 0 because animations are of a property’s values, not across properties.

I’m currently considering a quick animation from something like bottom: -50em to 0, going on the assumption that no footnote will ever be more than 50 em tall, regardless of the display environment.  But that means short footnotes will slide in later than tall footnotes, and probably appear to move faster.  Maybe that’s okay?  Maybe I should do more of a fade-and-scale-in thing instead, which will be visually consistent regardless of footnote size.  Or I could have them 3D-pivot up from the bottom edge of the viewport!  Or maybe this is another place to layer a little JS on top.
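For what it's worth, the consistent-speed concern might be sidestepped with a transform-based slide, since translateY(100%) resolves against the element's own rendered height. A rough sketch, untested against the real page:

```css
/* Slide the targeted footnote up over its own height; percentage
   translations scale with each footnote, so the perceived speed is
   consistent regardless of footnote size. */
.endnotes li:target {
  animation: footnote-rise 0.2s ease-out;
}
@keyframes footnote-rise {
  from { transform: translateY(100%); }
  to { transform: none; }
}
```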

Or maybe I’ve overlooked something that will let me unfurl the way I first envisioned with just HTML and CSS, a clever new technique I’ve missed or an old solution I’ve forgotten.  As before, comments with suggestions are welcome.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## Summary

simple-reload provides a straightforward (~30 lines of JS and zero server-side requirements) way of reloading a web page as it is iteratively developed or modified. Once activated, a page will be reloaded whenever it regains focus.

If you encounter any problems, please file an issue on the simple-reload GitHub repository.

Cons:

• Reload won't take place until the page is focused by the user, which requires manual interaction.
  • This is significantly less burdensome if using focus follows pointer in your window manager.
• Reloads will occur even if there were no changes.
  • With e.g. LiveReload, a reload only happens when the server indicates there has been a change. This may be a big advantage for stateful pages or pages with lots of forms.

Pros:

• Tiny, easy to modify implementation.
• This can be helpful to keep a fixed revision of a page in one tab to compare against.
• No server-side requirements.
  • So it works even from file://.
• Makes it easier if using with a remote server (e.g. no need to worry about exposing a port as for LiveReload).

## Code

<script type="module">
const enableByDefault = false;
// Firefox triggers blur/focus events when resizing, so we ignore a focus
// following a blur within 200ms (assumed to be generated by resizing rather
// than human interaction).
let blurTimeStamp = null;
function focusListener(ev) {
  if (ev.timeStamp - blurTimeStamp >= 200) {
    window.location.reload();
  }
}
function blurListener(ev) {
  if (blurTimeStamp === null) {
    // Only start reloading on focus after the page has been blurred once.
    window.addEventListener("focus", focusListener);
  }
  blurTimeStamp = ev.timeStamp;
}
function deactivate() {
  sessionStorage.removeItem("simple-reload");
  window.removeEventListener("focus", focusListener);
  window.removeEventListener("blur", blurListener);
  blurTimeStamp = null;
  document.title = document.title.replace(/^\u27F3 /, "");
  window.addEventListener("dblclick", activate, { once: true });
}
function activate() {
  sessionStorage.setItem("simple-reload", "activated");
  window.addEventListener("blur", blurListener);
  document.title = "\u27F3 " + document.title;
  window.addEventListener("dblclick", deactivate, { once: true });
}
if (enableByDefault || sessionStorage.getItem("simple-reload") == "activated") {
  activate();
} else {
  window.addEventListener("dblclick", activate, { once: true });
}
</script>


## Usage

Paste the above code into the <head> or <body> of an HTML file. You can then enable the reload behaviour by double-clicking on a page (double-click again to disable). The title is prefixed with ⟳ while reload-on-focus is enabled. If you'd like reload-on-focus enabled by default, just flip the enableByDefault variable to true. You could either modify whatever you're using to emit HTML to include this code when in development mode, or configure your web server of choice to inject it for you.

The enabled/disabled state of the reload logic is scoped to the current tab, so will be maintained if navigating to different pages within the same domain in that tab.

In terms of licensing, the implementation is so straight-forward it hardly feels copyrightable. Please consider it public domain, or MIT if you're more comfortable with an explicit license.

## Implementation notes

• In case it's not clear, "blur" events mentioned above are events fired when an element is no longer in focus.
• As noted in the code comment, Firefox (under dwm in Linux at least) seems to trigger blur+focus events when resizing the window using a keybinding while Chrome doesn't. Being able to remove the logic to deal with this issue would be a good additional simplification.

Article changelog
• 2022-09-11: Removed incorrect statement about sessionStorage usage being redundant if enableByDefault = true (it does mean disabling reloading will persist).
• 2022-09-10: Initial publication date.

#### Muxup implementation notes

This article contains a few notes on various implementation decisions made when creating the Muxup website. They're intended primarily as a reference for myself, but some parts may be of wider interest. See about for more information about things like site structure.

## Site generation

I ended up writing my own script for generating the site pages from a tree of Markdown files. Zola seems like an excellent option, but as I had a very specific idea on how I wanted pages to be represented in source form and the format and structure of the output, writing my own was the easier option (see build.py). Plus, yak-shaving is fun.

I opted to use mistletoe for Markdown parsing. I found a few bugs in the traverse helper function when implementing some transformations on the generated AST, but upstream was very responsive about reviewing and merging my PRs. The main wart I've found is that parsing and rendering aren't fully separated, although this doesn't pose any real practical concern for my use case and will hopefully be fixed in the future. mistune also seemed promising, but has some conformance issues.

One goal was to keep everything possible in standard Markdown format. This means, for instance, avoiding custom frontmatter entries or link formats if the same information can be extracted from the file. Therefore:

• There is no title frontmatter entry - the title is extracted by grabbing the first level 1 heading from the Markdown AST (and erroring if it isn't present).
• All internal links are written as [foo](/relative/to/root/foo.md). The generator will error if the file can't be found, and will otherwise translate the target to refer to the appropriate permalink.
  • This has the additional advantage that links are still usable if viewing the Markdown on GitHub, which can be handy if reviewing a previous revision of an article.
• Article changelogs are just a standard Markdown list under the "Article changelog" heading, which is checked and transformed at the Markdown AST level in order to produce the desired output (emitting the list using a <details> and <summary>).
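The build script itself is Python, but the internal-link rule above can be sketched like so (the names here are hypothetical illustrations, not build.py's actual code):

```javascript
// Sketch of the internal-link rule: root-relative .md targets must resolve
// to a known page and are rewritten to its permalink; anything else
// (external URLs, anchors) is passed through unchanged.
const pages = new Map([
  ["/2022q3/muxup-implementation-notes.md",
   { permalink: "/2022q3/muxup-implementation-notes" }],
]);

function translateLink(target) {
  if (!(target.startsWith("/") && target.endsWith(".md"))) return target;
  const page = pages.get(target);
  if (page === undefined) throw new Error(`broken internal link: ${target}`);
  return page.permalink;
}
```

Erroring on an unknown target is what catches broken internal links at build time rather than leaving dead links in the output.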

All CSS was written through the usual mix of experimentation (see simple-reload for the page reloading solution I used to aid iterative development) and learning from the CSS used by other sites.

## Randomly generated title highlights

The main visual element throughout the site is the randomised, roughly drawn highlight used for the site name and article headings. This takes some inspiration from the RoughNotation library (see also the author's description of the algorithms used), but uses my own implementation that's much more tightly linked to my use case.

The core logic for drawing these highlights is based around drawing the SVG path:

• Determine the position and size of the text element to be highlighted (keeping in mind it might be described by multiple rectangles if split across several lines).
• For each rectangle, draw a bezier curve starting from the middle of the left hand side through to the middle of the right hand side, with its two control points at 20-40% and 40-80% of the width respectively.
• Apply randomised offsets in the x and y directions for every point.
• The range of the randomised offsets should depend on the length of the text. Generally speaking, smaller offsets should be used for shorter pieces of text.

This logic is implemented in preparePath and reproduced below (with the knowledge that offset(delta) is a helper that returns a random number between -delta and delta, hopefully it's clear how this relates to the logic described above):

function preparePath(hlInfo) {
  const parentRect = hlInfo.svg.getBoundingClientRect();
  const rects = hlInfo.hlEl.getClientRects();
  let pathD = "";
  for (const rect of rects) {
    const x = rect.x - parentRect.x, y = rect.y - parentRect.y,
          w = rect.width, h = rect.height;
    const mid_y = y + h / 2;
    let maxOff = w < 75 ? 3 : w < 300 ? 6 : 8;
    const divergePoint = .2 + .2 * Math.random();
    pathD = `${pathD}
      M${x+offset(maxOff)} ${mid_y+offset(maxOff)}
      C${x+w*divergePoint+offset(maxOff)} ${mid_y+offset(maxOff)},
       ${x+2*w*divergePoint+offset(maxOff)} ${mid_y+offset(maxOff)},
       ${x+w+offset(maxOff)} ${mid_y+offset(maxOff)}`;
  }
  hlInfo.nextPathD = pathD;
  hlInfo.strokeWidth = 0.85 * rects[0].height;
}

I took some care to avoid forced layout/reflow by batching together reads and writes of the DOM into separate phases when drawing the initial set of highlights for the page, which is why this function generates the path but doesn't modify the SVG directly. Separate logic adds handlers to highlighted links so that the highlight is continuously redrawn on mouseover (I liked the visual effect).

## Minification and optimisation

I primarily targeted the low-hanging fruit here, and avoided adding in too many dependencies (e.g. separate HTML and CSS minifiers) during the build. muxup.com is a very lightweight site - the main extravagance I've allowed myself is the use of a webfont (Nunito), but that only weighs in at ~36KiB (the variable width version converted to woff2 and subsetted using pyftsubset from fontTools). As the CSS and JS payload is so small, it's inlined into each page.

The required JS for the article pages and the home page is assembled and then minified using terser. This reduces the size of the JS required for the page you're reading from 5077 bytes to 2620 bytes uncompressed (1450 bytes to 991 bytes if compressing the result with Brotli, though in practice the impact will be a bit different when compressing the JS together with the rest of the page + CSS). When first published, the page you are reading (including inlined CSS and JS) was ~27.7KiB uncompressed (7.7KiB Brotli compressed), which compares to 14.4KiB for its source Markdown (4.9KiB Brotli compressed).

Each page contains an embedded stylesheet with a conservative approximation of the minimal CSS needed.
The total amount of CSS is small enough that it's easy to manually split between the CSS that is common across the site, the CSS only needed for the home page, the CSS common to all articles, and other CSS rules that may or may not be needed depending on the page content. For the latter case, CSS snippets to include are gated on matching a given string (e.g. <kbd for use of the <kbd> tag). For my particular use case, this is more straightforward and faster than e.g. relying on PurgeCSS as part of the build process.

The final trick on the frontend is prefetching. Upon hovering over an internal link, it will be fetched (see the logic at the end of common.js for the implementation approach), meaning that in the common case any link you click will already have been loaded and cached. More complex approaches could be used, e.g. hooking the mouse click event and directly updating the DOM using the retrieved data, but this would require work to provide UI feedback during the load, and the incremental benefit over prefetching to prime the cache seems small.

## Serving using Caddy

I had a few goals in setting up Caddy to serve this site:

• Enable new and shiny things like HTTP3 and serving Brotli compressed content.
  • Both are very widely supported on the browser side (and Brotli really isn't "new" any more), but require non-standard modules or custom builds on Nginx.
• Redirect www.muxup.com/* URLs to muxup.com/*.
• HTTP request methods other than GET or HEAD should return error 405.
• Set appropriate Cache-Control headers in order to avoid unnecessarily re-fetching content. Set shorter lifetimes for served .html and 308 redirects vs other assets. Leave 404 responses with no Cache-Control header.
• Avoid serving the same content at multiple URLs (unless explicitly asked for) and don't expose the internal filenames of content served via a different canonical URL. Also, prefer URLs without a trailing slash, but ensure not to issue a redirect if the target file doesn't exist.
This means (for example):

• muxup.com/about/ should redirect to muxup.com/about.
• muxup.com/2022q3////muxup-implementation-notes should 404 or redirect.
• muxup.com/about/./././ should 404 or redirect.
• muxup.com/index.html should 404.
• muxup.com/index.html.br (referring to the precompressed brotli file) should 404.
• muxup.com/non-existing-path/ should 404.
• If there is a directory foo and a foo.html at the same level, serve foo.html for GET /foo (and redirect to it for GET /foo/).
• Never try to serve */index.html or similar (except in the special case of GET /).

Perhaps because my requirements were so specific, this turned out to be a little more involved than I expected. If seeking to understand the Caddyfile format and Caddy configuration in general, I'd strongly recommend getting a good understanding of the key ideas by reading Caddyfile concepts, understanding the order in which directives are handled by default, and how you might control the execution order of directives using the route directive or use the handle directive to specify groups of directives in a mutually exclusive fashion based on different matchers. The Composing in the Caddyfile article provides a good discussion of these options.
Ultimately, I wrote a quick test script for the desired properties and came up with the following Caddyfile that meets almost all of the goals:

{
  servers {
    protocol {
      experimental_http3
    }
  }
  email asb@asbradbury.org
}
(muxup_file_server) {
  file_server {
    index ""
    precompressed br
    disable_canonical_uris
  }
}
www.muxup.com {
  redir https://muxup.com{uri} 308
  header Cache-Control "max-age=2592000, stale-while-revalidate=2592000"
}
muxup.com {
  root * /var/www/muxup.com/htdocs
  encode gzip
  log {
    output file /var/log/caddy/muxup.com.access.log
  }
  header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
  vars short_cache_control "max-age=3600"
  vars long_cache_control "max-age=2592000, stale-while-revalidate=2592000"
  @method_isnt_GET_or_HEAD not method GET HEAD
  @path_is_suffixed_with_html_or_br path *.html *.html/ *.br *.br/
  @path_or_html_suffixed_path_exists file {path}.html {path}
  @html_suffixed_path_exists file {path}.html
  @path_or_html_suffixed_path_doesnt_exist not file {path}.html {path}
  @path_is_root path /
  @path_has_trailing_slash path_regexp ^/(.*)/$
  handle @method_isnt_GET_or_HEAD {
    error 405
  }
  handle @path_is_suffixed_with_html_or_br {
    error 404
  }
  handle @path_has_trailing_slash {
    route {
      uri strip_suffix /
      redir @path_or_html_suffixed_path_exists {path} 308
      error @path_or_html_suffixed_path_doesnt_exist 404
    }
  }
  handle @path_is_root {
    rewrite index.html
    import muxup_file_server
  }
  handle @html_suffixed_path_exists {
    rewrite {path}.html
    import muxup_file_server
  }
  handle * {
    import muxup_file_server
  }
  handle_errors {
    respond "{err.status_code} {err.status_text}"
  }
}


A few notes on the above:

• It isn't currently possible to match URLs containing // due to the canonicalisation Caddy performs, but Caddy 2.6.0 (which includes PR #4948) hopefully provides a solution, and hopefully allows matching /./ too.
• HTTP3 should be enabled by default in Caddy 2.6.0.
• Surprisingly, you need to explicitly opt in to enabling gzip compression (see this discussion with the author of Caddy about that choice).
• The combination of try_files and file_server provides a large chunk of the basics, but something like the above handlers is needed to get the precise desired behaviour for things like redirects, *.html and *.br etc.
• route is needed within handle @path_has_trailing_slash because the default execution order of directives has uri occurring some time after redir and error.
• Caddy doesn't support dynamically brotli compressing responses, so the precompressed br option of file_server is used to serve pre-compressed files (as prepared by the deploy script).
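As an illustration of what the trailing-slash handling is doing, here is a toy model in JavaScript (this is not Caddy's actual logic, and the file set is a hypothetical stand-in for the document root):

```javascript
// Toy model of the @path_has_trailing_slash handler: strip the trailing
// slash(es), then 308-redirect if the stripped target (or target + ".html")
// exists, otherwise 404.
const files = new Set(["about.html", "index.html"]);

function handleTrailingSlash(path) {
  if (!path.endsWith("/") || path === "/") return null; // matcher doesn't apply
  const stripped = path.replace(/\/+$/, "");            // uri strip_suffix /
  const rel = stripped.slice(1);
  if (files.has(rel) || files.has(rel + ".html")) {
    return { status: 308, location: stripped };         // redir ... 308
  }
  return { status: 404 };                               // error ... 404
}
```

With this model, "/about/" produces a 308 redirect to "/about", while "/non-existing-path/" produces a 404 rather than a redirect to a dead URL, matching the goals listed earlier.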

## Analytics

The simplest possible solution - don't have any.

Last but by no means least is the randomly selected doodle at the bottom of each page. I select images from the Quick, Draw! dataset and export them to SVG to be randomly selected on each page load. A rather scrappy script contains logic to generate SVGs from the dataset's NDJSON format, as well as a simple Flask application that allows selecting desired images from randomly displayed batches from each dataset.

With examples such as O'Reilly's beautiful engravings of animals on their book covers there's a well established tradition of animal illustrations on technical content - and what better way to honour that tradition than with a hastily drawn doodle by a random person on the internet that spins when your mouse hovers over it?

Article changelog
• 2022-09-10: Initial publication date.

## September 02, 2022

### Ricardo García

Vulkan 1.3.226 was released yesterday and it finally includes the cross-vendor VK_EXT_mesh_shader extension. This has definitely been an important moment for me. As part of my job at Igalia and our collaboration with Valve, I had the chance to work reviewing this extension in depth and writing thousands of CTS tests for it. You’ll notice I’m listed as one of the extension contributors. Hopefully, the new tests will be released to the public soon as part of the open source VK-GL-CTS Khronos project.

During this multi-month journey I had the chance to work closely with several vendors working on adding support for this extension in multiple drivers, including NVIDIA (special shout-out to Christoph Kubisch, Patrick Mours, Pankaj Mistry and Piers Daniell among others), Intel (thanks Marcin Ślusarz for finding and reporting many test bugs) and, of course, Valve. Working for the latter, Timur Kristóf provided an implementation for RADV and reported to me dozens of bugs, test ideas, suggestions and improvements. Do not miss his blog post series about mesh shaders and how they’re implemented on RDNA2 hardware. Timur’s implementation will be used in your Linux system if you have a capable AMD GPU and, of course, the Steam Deck.

The extension has been developed with DX12 compatibility in mind. It’s possible to use mesh shading from Vulkan natively and it also allows future titles using DX12 mesh shading to be properly run on top of VKD3D-Proton and enjoyed on Linux, if possible, from day one. It’s hard to provide a summary of the added functionality and what mesh shaders are about in a short blog post like this one, so I’ll refer you to external documentation sources, starting by the Vulkan mesh shading post on the Khronos Blog. Both Timur and myself have submitted a couple of talks to XDC 2022 which have been accepted and will give you a primer on mesh shading as well as some more information on the RADV implementation. Do not miss the event at Minneapolis or enjoy it remotely while it’s being livestreamed in October.

## September 01, 2022

### Delan Azabani

#### Meet the CSS highlight pseudos

A year and a half ago, I was asked to help upstream a Chromium patch allowing authors to recolor spelling and grammar errors in CSS. At the time, I didn’t realise that this was part of a far more ambitious effort to reimagine spelling errors, grammar errors, text selections, and more as a coherent system that didn’t yet exist as such in any browser. That system is known as the highlight pseudos, and this post will focus on the design of said system and its consequences for authors.

This is the third part of a series (part one, part two) about Igalia’s work towards making the CSS highlight pseudos a reality.


## What are they?

CSS has four highlight pseudos and an open set of author-defined custom highlight pseudos. They have their roots in ::selection, which was a rudimentary and non-standard, but widely supported, way of styling text and images selected by the user.

The built-in highlights are ::selection for user-selected content, ::target-text for linking to text fragments, ::spelling-error for misspelled words, and ::grammar-error for text with grammar errors, while the custom highlights are known as ::highlight(x) where x is the author-defined highlight name.

## Can I use them?

::selection has long been supported by all of the major browsers, and ::target-text shipped in Chromium 89. But for most of that time, no browser had yet implemented the more robust highlight pseudo system in the CSS pseudo spec.

::highlight() and the custom highlight API shipped in Chromium 105, thanks to the work by members of the Microsoft Edge team. They are also available in Safari 14.1 (including iOS 14.5) as an experimental feature (Highlight API). You can enable that feature in the Develop menu, or for iOS, under Settings > Safari > Advanced.

Safari’s support currently has a couple of quirks, as of TP 152. Range is not supported for custom highlights yet, only StaticRange, and the Highlight constructor has a bug where it requires passing exactly one range, ignoring any additional arguments. To create a Highlight with no ranges, first create one with a dummy range, then call the clear or delete methods.

Chromium 105 also implements the vast majority of the new highlight pseudo system. This includes highlight overlay painting, which was enabled for all highlight pseudos, and highlight inheritance, which was enabled for ::highlight() only.

Chromium 108 includes ::spelling-error and ::grammar-error as an experimental feature, together with the new ‘text-decoration-line’ values ‘spelling-error’ and ‘grammar-error’. You can enable these features at

chrome://flags/#enable-experimental-web-platform-features

Chromium’s support also currently has some bugs, as of r1041796. Notably, highlights don’t yet work under ::first-line and ::first-letter, ‘text-shadow’ is not yet enabled for ::highlight(), computedStyleMap results are wrong for ‘currentColor’, and highlights that split ligatures (e.g. for complex scripts) only render accurately in ::selection.

Click the table below to see if your browser supports these features.



## How do I use them?

While you can write rules for highlight pseudos that target all elements, as was commonly done for pre-standard ::selection, selecting specific elements can be more powerful, allowing descendants to cleanly override highlight styles.

Previously the same code would yield…

…unless you also selected the descendants of :root and aside:

Note that a bare ::selection rule still means *::selection, and like any universal rule, it can interfere with inheritance when mixed with non-universal highlight rules.

::selection is primarily controlled by user input, though pages can both read and write the active ranges via the Selection API with getSelection().

::target-text is activated by navigating to a URL ending in a fragment directive, which has its own syntax embedded in the #fragment. For example:

• #foo:~:text=bar targets #foo and highlights the first occurrence of “bar”
• #:~:text=the,dog highlights the first range of text from “the” to “dog”
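A small helper for constructing such URLs might look like this (textFragmentURL is a hypothetical name of mine, not part of any spec, and it only handles the bare-fragment case):

```javascript
// Build a URL with a text fragment directive using the syntax described
// above: #:~:text=start or #:~:text=start,end.
function textFragmentURL(base, start, end) {
  const textDirective = end
    ? `${encodeURIComponent(start)},${encodeURIComponent(end)}`
    : encodeURIComponent(start);
  return `${base}#:~:text=${textDirective}`;
}
```

For example, textFragmentURL("https://example.com/page", "the", "dog") yields https://example.com/page#:~:text=the,dog.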

::spelling-error and ::grammar-error are controlled by the user’s spell checker, which is only used where the user can input text, such as with textarea or contenteditable, subject to the spellcheck attribute (which also affects grammar checking). For privacy reasons, pages can’t read the active ranges of these highlights, despite being visible to the user.

::highlight() is controlled via the Highlight API with CSS.highlights. CSS.highlights is a maplike object, which means the interface is the same as a Map of strings (highlight names) to Highlight objects. Highlight objects, in turn, are setlike objects, which you can use like a Set of Range or StaticRange objects.
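A minimal sketch of what this looks like in practice (the highlight name, element, and offsets here are hypothetical; this assumes a browser with the Custom Highlight API, e.g. Chromium 105+, so the call is guarded):

```javascript
// Register a custom highlight covering a range of a text node.
function registerHighlight(name, node, start, end) {
  const range = new StaticRange({
    startContainer: node, startOffset: start,
    endContainer: node, endOffset: end,
  });
  CSS.highlights.set(name, new Highlight(range));
}

// Only run in a browser that supports the API.
if (typeof CSS !== "undefined" && "highlights" in CSS) {
  const para = document.querySelector("p");
  registerHighlight("my-highlight", para.firstChild, 0, 5);
}
```

The registered highlight can then be styled with a ::highlight(my-highlight) rule, and its resolved styles queried via getComputedStyle(el, "::highlight(my-highlight)").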

You can use getComputedStyle() to query resolved highlight styles under a particular element. Regardless of which parts (if any) are highlighted, the styles returned are as if the given highlight is active and all other highlights are inactive.

## How do they work?

Highlight pseudos are defined as pseudo-elements, but they actually have very little in common with other pseudo-elements like ::before and ::first-line.

Unlike other pseudos, they generate highlight overlays, not boxes, and these overlays are like layers over the original content. Where text is highlighted, a highlight overlay can add backgrounds and text shadows, while the text proper and any other decorations are “lifted” to the very top.


You can think of highlight pseudos as innermost pseudo-elements that always exist at the bottom of any tree of elements and other pseudos, but unlike other pseudos, they don’t inherit their styles from that element tree.

Instead each highlight pseudo forms its own inheritance tree, parallel to the element tree. This means body::selection inherits from html::selection, not from ‘body’ itself.

At this point, you can probably see that the highlight pseudos are quite different from the rest of CSS, but there are also several special cases and rules needed to make them a coherent system.

For the typical appearance of spelling and grammar errors, highlight pseudos need to be able to add their own decorations, and they need to be able to leave the underlying foreground color unchanged. Highlight inheritance happens separately from the element tree, so we need some way to refer to the underlying foreground color.

That escape hatch is to set ‘color’ itself to ‘currentColor’, which is the default if nothing in the highlight tree sets ‘color’.

This is a bit of a special case within a special case.

You see, ‘currentColor’ is usually defined as “the computed value of ‘color’”, but the way I like to think of it is “don’t change the foreground color”, and most color-valued properties like ‘text-decoration-color’ default to this value.

For ‘color’ itself that wouldn’t make sense, so we instead define ‘color:currentColor’ as equivalent to ‘color:inherit’, which still fits that mental model. But for highlights, that definition would no longer fit, so we redefine it as being the ‘color’ of the next active highlight below.

To make highlight inheritance actually useful for ‘text-decoration’ and ‘background-color’, all properties are inherited in highlight styles, even those that are not usually inherited.

This would conflict with the usual rules[3] for decorating boxes, because descendants would get two decorations, one propagated and one inherited. We resolved this by making decorations added by highlights not propagate to any descendants.

Unstyled highlight pseudos generally don’t change the appearance of the original content, so the default ‘color’ and ‘background-color’ in highlights are ‘currentColor’ and ‘transparent’ respectively, the latter being the property’s initial value. But two highlight pseudos, ::selection and ::target-text, have UA default foreground and background colors.

For compatibility with ::selection in older browsers, the UA default ‘color’ and ‘background-color’ (e.g. white on blue) are only used if neither was set by the author. This rule is known as paired cascade, and for consistency it also applies to ::target-text.

It’s common for selected text to almost invert the original text colors, turning black on white into white on blue, for example. To guarantee that the original decorations remain as legible as the text when highlighted, which is especially important for decorations with semantic meaning (e.g. line-through), originating decorations are recolored to the highlight ‘color’. This doesn’t apply to decorations added by highlights though, because that would break the typical appearance of spelling and grammar errors.

The default style rules for highlight pseudos might look something like this. Notice the new ‘spelling-error’ and ‘grammar-error’ decorations, which authors can use to imitate native spelling and grammar errors.
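A sketch of what those defaults could look like, loosely following the CSS pseudo-elements draft (illustrative, not normative):

```css
:root::selection      { color: HighlightText; background-color: Highlight; }
:root::target-text    { color: MarkText; background-color: Mark; }
:root::spelling-error { text-decoration: spelling-error; }
:root::grammar-error  { text-decoration: grammar-error; }
```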

The way the highlight pseudos have been designed naturally leads to some limitations.

## Gotchas

Older browsers with ::selection tend to treat it purely as a way to change the original content’s styles, including text shadows and other decorations. Some tutorial content has even been written to that effect:

One of the most helpful uses for ::selection is turning off a text-shadow during selection. A text-shadow can clash with the selection’s background color and make the text difficult to read. Set text-shadow: none; to make text clear and easy to read during selection.

Under the spec, highlight pseudos can no longer remove or really change the original content’s decorations and shadows. Setting these properties in highlight pseudos to values other than ‘none’ adds decorations and shadows to the overlays when they are active.

While the new :has() selector might appear to offer a solution to this problem, pseudo-element selectors are not allowed in :has(), at least not yet.

Removing shadows that might clash with highlight backgrounds (as suggested in the tutorial above) will no longer be as necessary anyway, since highlight backgrounds now paint on top of the original text shadows.

If you still want to ensure those shadows don’t clash with highlights in older browsers, you can set ‘text-shadow’ to ‘none’, which is harmless in newer browsers.

As for line decorations, if you’re really determined, you can work around this limitation by using ‘-webkit-text-fill-color’, a standard property (believe it or not) that controls the foreground fill color of text[4].

Fun fact: because of ‘-webkit-text-fill-color’ and its stroke-related siblings, it isn’t always possible for highlight pseudos to avoid changing the foreground colors of text, at least not without out-of-band knowledge of what those colors are.

### Accessing global constants

Highlight pseudos also don’t automatically have access to custom properties set in the element tree, which can make things tricky if you have a design system that exposes a color palette via custom properties on :root.

You can work around this by adding selectors for the necessary highlight pseudos to the rule defining the constants, or if the necessary highlight pseudos are unknown, by rewriting each constant as a custom @property rule.
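A sketch of both workarounds (the palette name ‘--accent’ and its value are hypothetical):

```css
/* Option 1: also apply the constants to the highlight pseudos that need them */
:root, :root::selection {
  --accent: #facc15;
}
::selection {
  background-color: var(--accent);
}

/* Option 2: register the constant so it has an initial value everywhere,
   including in the highlight inheritance tree */
@property --accent {
  syntax: "<color>";
  inherits: false;
  initial-value: #facc15;
}
```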

### Spec issues

While the design of the highlight pseudos has mostly settled, there are still some unresolved issues to watch out for.

• how to use spelling and grammar decorations with the UA default colors (#7522)
• values of non-applicable properties, e.g. ‘text-shadow’ with em units (#7591)
• the meaning of underline- and emphasis-related properties in highlights (#7101)
• whether ‘-webkit-text-fill-color’ and friends are allowed in highlights (#7580)
• some browsers “tweak” the colors or alphas set in highlight styles (#6853)
• how the highlight pseudos are supposed to interact with SVG (svgwg#894)

## What now?

The highlight pseudos are a radical departure from older browsers with ::selection, and have some significant differences with CSS as we know it. Now that we have some experimental support, we want your help to play around with these features and help us make them as useful and ergonomic as possible before they’re set in stone.

Special thanks to Rego, Brian, Eric (Igalia), Florian, fantasai (CSSWG), Emilio (Mozilla), and Dan for their work in shaping the highlight pseudos (and this post). We would also like to thank Bloomberg for sponsoring this work.

1. Dan, Fernando, Sanket, Luis, Bo, and anyone else I missed.

2. See this demo for more details.

3. CSSWG discussion also found that decorating box semantics are undesirable for decorations added by highlights anyway.

4. This is actually the case everywhere the WHATWG compat spec applies, at all times. If you think about it, the only reason why setting ‘color’ to ‘red’ makes your text red is because ‘-webkit-text-fill-color’ defaults to ‘currentColor’.

### Andy Wingo

#### new month, new brainworm

Today, a brainworm! I had a thought a few days ago and can't get it out of my head, so I need to pass it on to another host.

So, imagine a world in which there is a drive to build a kind of Kubernetes on top of WebAssembly. Kubernetes nodes are generally containers, associated with additional metadata indicating their place in overall system topology (network connections and so on). (I am not a Kubernetes specialist, as you can see; corrections welcome.) Now in a WebAssembly cloud, the nodes would be components, probably also with additional topological metadata. VC-backed companies will duke it out for dominance of the WebAssembly cloud space, and in a couple years we will probably emerge with an open source project that has become a de-facto standard (though it might be dominated by one or two players).

In this world, Kubernetes and Spiffy-Wasm-Cloud will coexist. One of the success factors for Kubernetes was that you can just put your old database binary inside a container: it's the same ABI as when you run your database in a virtual machine, or on (so-called!) bare metal. The means of composition are TCP and UDP network connections between containers, possibly facilitated by some kind of network fabric. In contrast, in Spiffy-Wasm-Cloud we aren't starting from the kernel ABI, with processes and such: instead there's WASI, which is more of a kind of specialized and limited libc. You can't just drop in your database binary, you have to write code to get it to conform to the new interfaces.

One consequence of this situation is that I expect WASI and the component model to develop a rich network API, to allow WebAssembly components to interoperate not just with end-users but also other (micro-)services running in the same cloud. Likewise there is room here for a company to develop some complicated network fabrics for linking these things together.

However, WebAssembly-to-WebAssembly links are better expressed via typed functional interfaces; it's more expressive and can be faster. Not only can you end up having fine-grained composition that looks more like lightweight Erlang processes, you can also string together components in a pipeline with communications overhead approaching that of a simple function call. Relative to Kubernetes, there are potential 10x-100x improvements to be had, in throughput and in memory footprint, at least in some cases. It's the promise of this kind of improvement that can drive investment in this area, and eventually adoption.

But, you still have some legacy things running in containers. What to do? Well... Maybe recompile them to WebAssembly? That's my brain-worm.

A container is a file system image containing executable files and data. Starting with the executable files, they are in machine code, generally x64, and interoperate with system libraries and the run-time via an ABI. You could compile them to WebAssembly instead. You could interpret them as data, or JIT-compile them as webvm does, or directly compile them to WebAssembly. This is the sort of thing you hire Fabrice Bellard to do ;) Then you have the filesystem. Let's assume it is stateless: any change to the filesystem at runtime doesn't need to be preserved. (I understand this is a goal, though I could be wrong.) So you could put the filesystem in memory, as some kind of addressable data structure, and you make the libc interface access that data structure. It's something like the microkernel approach. And then you translate whatever topological connectivity metadata you had for Kubernetes to your Spiffy-Wasm-Cloud's format.

Anyway in the end you have a WebAssembly module and some metadata, and you can run it in your WebAssembly cloud. Or on the more basic level, you have a container and you can now run it on any machine with a WebAssembly implementation, even on other architectures (coucou RISC-V!).

Anyway, that's the tweet. Have fun, whoever gets to work on this :)

## August 31, 2022

### Qiuyi Zhang (Joyee)

#### Building V8 on an M1 MacBook

I’ve recently got an M1 MacBook and played around with it a bit. It seems many open source projects still haven’t added MacOS with ARM64

## August 29, 2022

### Lauro Moura

#### Using Breakpad to generate crash dumps with WPE WebKit

Breakpad is a tool from Google that helps generate crash reports. From its description:

Breakpad is a library and tool suite that allows you to distribute an application to users with compiler-provided debugging information removed, record crashes in compact "minidump" files, send them back to your server, and produce C and C++ stack traces from these minidumps. Breakpad can also write minidumps on request for programs that have not crashed.

It works by stripping the debug information from the executable and saving it into "symbol files." When a crash occurs or upon request, the Breakpad client library generates the crash information in these "minidumps." The Breakpad minidump processor combines these files with the symbol files and generates a human-readable stack trace. The following picture, also from Breakpad's documentation, describes this process:

In WPE, Breakpad support was added initially for the downstream 2.28 branch by Vivek Arumugam and backported upstream. It'll be available in the soon-to-be-released 2.38 version, and the WebKit Flatpak SDK bundles the Breakpad client library since late May 2022.

As a developer feature, Breakpad support is disabled by default but can be enabled by passing -DENABLE_BREAKPAD=1 to cmake when building WebKit. Optionally, you can also set -DBREAKPAD_MINIDUMP_DIR=<some path> to hardcode the path used by Breakpad to save the minidumps. If not set during build time, BREAKPAD_MINIDUMP_DIR must be set as an environment variable pointing to a valid directory when running the application. If defined, this variable also overrides the path defined during the build.

## Generating the symbols

To generate the symbol files, Breakpad provides the dump_syms tool. It takes a path to the executable/library and dumps to stdout the symbol information.

Once generated, the symbol files must be laid out in a specific tree structure so minidump_stackwalk can find them when merging with the crash information. The folder containing the symbol files must match a hash code generated for that specific binary. For example, in the case of the libWPEWebKit:

$ dump_syms WebKitBuild/WPE/Release/lib/libWPEWebKit-1.1.so > libWPEWebKit-1.1.so.0.sym
$ head -n 1 libWPEWebKit-1.1.so.0.sym
MODULE Linux x86_64 A2DA230C159B97DC00000000000000000 libWPEWebKit-1.1.so.0
$ mkdir -p ./symbols/libWPEWebKit-1.1.so.0/A2DA230C159B97DC00000000000000000
$ cp libWPEWebKit-1.1.so.0.sym ./symbols/libWPEWebKit-1.1.so.0/A2DA230C159B97DC00000000000000000/
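The layout step can be scripted. Here is a minimal sketch; the `place_sym` helper is hypothetical, not a Breakpad tool:

```shell
# Hypothetical helper: lay a Breakpad .sym file out in the directory
# structure minidump_stackwalk expects:
#   <root>/<module name>/<module id>/<file>.sym
place_sym() {
  sym="$1"; root="$2"
  # The first line of a symbol file looks like:
  #   MODULE Linux x86_64 <ID> <NAME>
  set -- $(head -n 1 "$sym")
  id="$4"; name="$5"
  mkdir -p "$root/$name/$id"
  cp "$sym" "$root/$name/$id/"
}
```

Running `place_sym libWPEWebKit-1.1.so.0.sym ./symbols` reproduces the layout shown above.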


## Generating the crash log

Besides the symbol files, we need a minidump file with the stack information, which can be generated in two ways. First, by asking Breakpad to create it. The other way is, well, when the application crashes :)

To generate a minidump manually, you can either call google_breakpad::WriteMiniDump(path, callback, context) or send one of the crashing signals Breakpad recognizes. The former is helpful to generate the dumps programmatically at specific points, while the signal approach might be helpful to inspect hanging processes. These are the signals Breakpad handles as crashing ones:

• SIGSEGV
• SIGABRT
• SIGFPE
• SIGILL (Note: this is for illegal instruction, not the ordinary SIGKILL)
• SIGBUS
• SIGTRAP

Now, first we must run Cog:

$ BREAKPAD_MINIDUMP_DIR=/home/lauro/minidumps ./Tools/Scripts/run-minibrowser --wpe --release https://www.wpewebkit.org

Crashing the WebProcess using SIGTRAP:

$ ps aux | grep WebProcess
<SOME-PID> ... /app/webkit/.../WebProcess
$ kill -TRAP <SOME-PID>
$ ls /home/lauro/minidumps
5c2d93f2-6e9f-48cf-6f3972ac-b3619fa9.dmp
$ file ~/minidumps/5c2d93f2-6e9f-48cf-6f3972ac-b3619fa9.dmp
/home/lauro/minidumps/5c2d93f2-6e9f-48cf-6f3972ac-b3619fa9.dmp: Mini DuMP crash report, 13 streams, Thu Aug 25 20:29:11 2022, 0 type

Note: In the current form, WebKit supports Breakpad dumps in the WebProcess and NetworkProcess, which WebKit spawns itself. The developer must manually add support for it in the UIProcess (the browser/application using WebKit). The exception handler should be installed as early as possible, and many programs might do some initialization before initializing WebKit itself.

## Translating the crash log

Once we have a .dmp crash log and the symbol files, we can use minidump_stackwalk to show the human-readable crash log:

$ minidump_stackwalk ~/minidumps/5c2d93f2-6e9f-48cf-6f3972ac-b3619fa9.dmp ./symbols/
Operating system: Linux
0.0.0 Linux 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64
CPU: amd64
family 23 model 113 stepping 0
1 CPU

GPU: UNKNOWN

Crash reason:  SIGTRAP
Process uptime: not available

0  libc.so.6 + 0xf71fd
rax = 0xfffffffffffffffc   rdx = 0x0000000000000090
rcx = 0x00007f6ae47d61fd   rbx = 0x00007f6ae4f0c2e0
rsi = 0x0000000000000001   rdi = 0x000056187adbf6d0
rbp = 0x00007fffa9268f10   rsp = 0x00007fffa9268ef0
r8 = 0x0000000000000000    r9 = 0x00007f6ae4fdc2c0
r10 = 0x00007fffa9333080   r11 = 0x0000000000000293
r12 = 0x0000000000000001   r13 = 0x00007fffa9268f34
r14 = 0x0000000000000090   r15 = 0x000056187ade7aa0
rip = 0x00007f6ae47d61fd
Found by: given as instruction pointer in context
1  libglib-2.0.so.0 + 0x585ce
rsp = 0x00007fffa9268f20   rip = 0x00007f6ae4efc5ce
Found by: stack scanning
2  libglib-2.0.so.0 + 0x58943
rsp = 0x00007fffa9268f80   rip = 0x00007f6ae4efc943
Found by: stack scanning
3  libWPEWebKit-1.1.so.0!WTF::RunLoop::run() + 0x120
rsp = 0x00007fffa9268fa0   rip = 0x00007f6ae8923180
Found by: stack scanning
4  libWPEWebKit-1.1.so.0!WebKit::WebProcessMain(int, char**) + 0x11e
rbx = 0x000056187a5428d0   rbp = 0x0000000000000003
rsp = 0x00007fffa9268fd0   r12 = 0x00007fffa9269148
rip = 0x00007f6ae74719fe
Found by: call frame info
<snip remaining trace>


## Final words

This article gave a brief overview of enabling and using Breakpad to generate crash dumps. In a future article, we'll cover using Breakpad to get crash dumps while running WPE on embedded boards like Raspberry Pis.

o/

## August 23, 2022

### Andy Wingo

#### accessing webassembly reference-typed arrays from c++

The WebAssembly garbage collection proposal is coming soonish (really!) and will extend WebAssembly with the capability to create and access arrays whose memory is automatically managed by the host. As long as some system component has a reference to an array, it will be kept alive; as soon as nobody references it any more, it becomes "garbage" and is thus eligible for collection.

(In a way it's funny to define the proposal this way, in terms of what happens to garbage objects that by definition aren't part of the program's future any more; really the interesting thing is the new things you can do with live data, defining new data types and representing them outside of linear memory and passing them between components without copying. But "extensible-arrays-structs-and-other-data-types" just isn't as catchy as "GC". Anyway, I digress!)

One potential use case for garbage-collected arrays is for passing large buffers between parts of a WebAssembly system. For example, a webcam driver could produce a stream of frames as reference-typed arrays of bytes, and then pass them by reference to a sandboxed WebAssembly instance to, I don't know, identify cats in the images or something. You get the idea. Reference-typed arrays let you avoid copying large video frames.

A lot of image-processing code is written in C++ or Rust. With WebAssembly 1.0, you just have linear memory and no reference-typed values, which works well for these languages that like to think of memory as having a single address space. But once you get reference-typed arrays in the mix, you effectively have multiple address spaces: you can't address the contents of the array using a normal pointer, as you might be able to do if you mmap'd the buffer into a program's address space. So what do you do?

## reference-typed values are special

The broader question of C++ and GC-managed arrays is, well, too broad for today. The set of array types is infinite, because it's not just arrays of i32, it's also arrays of arrays of i32, and arrays of those, and arrays of records, and so on.

So let's limit the question to just arrays of i8, to see if we can make some progress. Imagine a C function that takes an array of i8:

void process(array_of_i8 array) {
  // ?
}


If you know WebAssembly, there's a clear translation of the sort of code that we want:

(func (param $array (ref (array i8)))
  ; operate on local 0
  )

The WebAssembly function will have an array as a parameter. But here we start to run into more problems with the LLVM toolchain that we use to compile C and other languages to WebAssembly. When the C front-end of LLVM (clang) compiles a function to the LLVM middle-end's intermediate representation (IR), it models all local variables (including function parameters) as mutable memory locations created with alloca. Later optimizations might turn these memory locations back into SSA variables and thence to registers or stack slots. But a reference-typed value has no bit representation, and it can't be stored to linear memory: there is no alloca that can hold it.

Incidentally, this problem is not isolated to future extensions to WebAssembly; the externref and funcref data types that landed in WebAssembly 2.0 and in all browsers are also reference types that can't be written to main memory. Similarly, the table data type, which is also part of shipping WebAssembly, is not dissimilar to GC-managed arrays, except that tables are statically allocated at compile-time.

At Igalia, my colleagues Paulo Matos and Alex Bradbury have been hard at work to solve this gnarly problem and finally expose reference-typed values to C. The full details and final vision are probably a bit too much for this article, but some bits on the mechanism will help.

Firstly, note that LLVM has a fairly traditional breakdown between front-end (clang), middle-end ("the IR layer"), and back-end ("the MC layer"). The back-end can be quite target-specific, and though it can be annoying, we've managed to get fairly good support for reference types there.

In the IR layer, we are currently representing GC-managed values as opaque pointers into non-default, non-integral address spaces. LLVM attaches an address space (an integer less than 2^24 or so) to each pointer, mostly for OpenCL and GPU sorts of use-cases, and we abuse this to prevent LLVM from doing much reasoning about these values.

This is a bit of a theme, incidentally: get the IR layer to avoid assuming anything about reference-typed values. We're fighting the system, in a way. As another example, because LLVM is really oriented towards lowering high-level constructs to low-level machine operations, it doesn't necessarily preserve types attached to pointers in the IR layer. Whereas for WebAssembly we need exactly that: we reify types when we write out WebAssembly object files, and we need LLVM to pass some types through from front-end to back-end unmolested. We've had to change tack a number of times to get a good way to preserve data from front-end to back-end, and probably will have to do so again before we end up with a final design.

Finally, on the front-end we need to generate an alloca in a different address space depending on the type being allocated. And because reference-typed arrays can't be stored to main memory, there are semantic restrictions as to how they can be used, which need to be enforced by clang. Fortunately, this set of restrictions is similar enough to what is imposed by the ARM C Language Extensions (ACLE) for scalable vector (SVE) values, which also don't have a known bit representation at compile-time, so we can piggy-back on those. This patch hasn't landed yet, but who knows, it might land soon; in the mean-time we are going to run ahead of upstream a bit to show how you might define and use an array type definition. Further tacks here are also expected, as we try to thread the needle between exposing these features to users and not imposing too much of a burden on clang maintenance.

## accessing array contents

All this is a bit basic, though; it just gives you enough to have a local variable or a function parameter of a reference-valued type. Let's continue our example:

void process(array_of_i8 array) {
  uint32_t sum;
  for (size_t idx = 0; idx < __builtin_wasm_array_length(array); idx++)
    sum += (uint8_t)__builtin_wasm_array_ref_i8(array, idx);
  // ...
}

The most basic way to extend C to access these otherwise opaque values is to expose some builtins, say __builtin_wasm_array_length and so on. Probably you need different intrinsics for each scalar array element type (i8, i16, and so on), and one for arrays which return reference-typed values. We'll talk about arrays of references another day, but focusing on the i8 case, the C builtin then lowers to a dedicated LLVM intrinsic, which passes through the middle layer unscathed. In C++ I think we can provide some nicer syntax which preserves the syntactic illusion of array access.

I think this is going to be sufficient as an MVP, but there's one caveat: SIMD. You can indeed have an array of i128 values, but you can only access that array's elements as i128; worse, you can't load multiple data from an i8 array as i128 or even i32.

Compare this to the memory control proposal, which instead proposes to map buffers to non-default memories. In WebAssembly, you can in theory (and perhaps soon in practice) have multiple memories. The easiest way I can see on the toolchain side is to use the address space feature in clang:

void process(uint8_t *array __attribute__((address_space(42))),
             size_t len) {
  uint32_t sum;
  for (size_t idx = 0; idx < len; idx++)
    sum += array[idx];
  // ...
}

How exactly to plumb the mapping between address spaces (which can only be specified by number) from the front-end to the back-end is a little gnarly; really you'd like to declare the set of address spaces that a compilation unit uses symbolically, and then have the linker produce a final allocation of memory indices. But I digress; it's clear that with this solution we can use SIMD instructions to load multiple bytes from memory at a time, so it's a winner with respect to accessing GC arrays.

Or is it? Perhaps there could be SIMD extensions for packed GC arrays. I think it makes sense, but it's a fair amount of (admittedly somewhat mechanical) specification and implementation work.

## & future

In some future bloggies we'll talk about how we will declare new reference types: first some basics, then some more integrated visions for reference types and C++. Lots going on, and this is just a brain-dump of the current state of things; thoughts are very much welcome.

## August 18, 2022

### Andy Wingo

#### just-in-time code generation within webassembly

Just-in-time (JIT) code generation is an important tactic when implementing a programming language. Generating code at run-time allows a program to specialize itself against the specific data it is run against. For a program that implements a programming language, that specialization is with respect to the program being run, and possibly with respect to the data that program uses.

The way this typically works is that the program generates bytes for the instruction set of the machine it's running on, and then transfers control to those instructions.

Usually the program has to put its generated code in memory that is specially marked as executable. However, this capability is missing in WebAssembly. How, then, to do just-in-time compilation in WebAssembly?

## webassembly as a harvard architecture

In a von Neumann machine, like the ones that you are probably reading this on, code and data share an address space. There's only one kind of pointer, and it can point to anything: the bytes that implement the sin function, the number 42, the characters in "biscuits", or anything at all.

WebAssembly is different in that its code is not addressable at run-time. Functions in a WebAssembly module are numbered sequentially from 0, and the WebAssembly call instruction takes the callee as an immediate parameter. So, to add code to a WebAssembly program, somehow you'd have to augment the program with more functions.

Let's assume we will make that possible somehow -- that your WebAssembly module that had N functions will now have N+1 functions, with function N being the new one your program generated. How would we call it? Given that call instructions hard-code the callee, the existing functions 0 to N-1 won't call it.

Here the answer is call_indirect. A bit of a reminder: this instruction takes the callee as an operand, not an immediate parameter, allowing it to choose the callee function at run-time. The callee operand is an index into a table of functions. Conventionally, table 0 is called the indirect function table, as it contains an entry for each function which might ever be the target of an indirect call.

With this in mind, our problem has two parts, then: (1) how to augment a WebAssembly module with a new function, and (2) how to get the original module to call the new code.

## late linking of auxiliary webassembly modules

The key idea here is that to add code, the main program should generate a new WebAssembly module containing that code. Then we run a linking phase to actually bring that new code to life and make it available.

System linkers like ld typically require a complete set of symbols and relocations to resolve inter-archive references. However, when performing a late link of JIT-generated code, we can take a short-cut: the main program can embed memory addresses directly into the code it generates. Therefore the generated module would import memory from the main module. All references from the generated code to the main module can be directly embedded in this way.

The generated module would also import the indirect function table from the main module. (We would ensure that the main module exports its memory and indirect function table via the toolchain.)

When the main module makes the generated module, it also embeds a special patch function in the generated module. This function would add the new functions to the main module's indirect function table, and perform any relocations onto the main module's memory. All references from the main module to generated functions are installed via the patch function.

We plan on two implementations of late linking, but both share the fundamental mechanism of a generated WebAssembly module with a patch function.

## dynamic linking via the run-time

One implementation of a linker is for the main module to cause the run-time to dynamically instantiate a new WebAssembly module. The run-time would provide the memory and indirect function table from the main module as imports when instantiating the generated module.

The advantage of dynamic linking is that it can update a live WebAssembly module without any need for re-instantiation or special run-time checkpointing support.

In the context of the web, JIT compilation can be triggered by the WebAssembly module in question, by calling out to functionality from JavaScript, or we can use a "pull-based" model to allow the JavaScript host to poll the WebAssembly instance for any pending JIT code.

For WASI deployments, you need a capability from the host. Either you import a module that provides run-time JIT capability, or you rely on the host to poll you for data.

## static linking via wizer

Another idea is to build on Wizer's ability to take a snapshot of a WebAssembly module. You could extend Wizer to also be able to augment a module with new code. In this role, Wizer is effectively a late linker, linking in a new archive to an existing object.

Wizer already needs the ability to instantiate a WebAssembly module and to run its code. Causing Wizer to ask the module if it has any generated auxiliary module that should be instantiated, patched, and incorporated into the main module should not be a huge deal. Wizer can already run the patch function, to perform relocations to patch in access to the new functions. After having done that, Wizer (or some other tool) would need to snapshot the module, as usual, but also add in the extra code.

As a technical detail, in the simplest case, in which code is generated in units of functions which don't directly call each other, this is as simple as appending the functions to the code section and appending the generated element segments to the main module's element segment, updating the appended function references to their new values by adding the total number of functions in the module before the new module was concatenated to each function reference.

## late linking appears to be async codegen

From the perspective of a main program, WebAssembly JIT code generation via late linking appears the same as asynchronous code generation. For example, take the C program:

struct Value;
struct Func {
  struct Expr *body;
  void *jitCode;
};

void recordJitCandidate(struct Func *func);
uint8_t* flushJitCode(); // Call to actually generate JIT code.

struct Value* interpretCall(struct Expr *body,
                            struct Value *arg);

struct Value* call(struct Func *func,
                   struct Value* val) {
  if (func->jitCode) {
    struct Value* (*f)(struct Value*) =
      (struct Value* (*)(struct Value*))func->jitCode;
    return f(val);
  } else {
    recordJitCandidate(func);
    return interpretCall(func->body, val);
  }
}

Here the C program allows for the possibility of JIT code generation: there is a slot in a Func instance to fill in with a code pointer. If this program generates code for a given Func, it won't be able to fill in the pointer -- it can't add new code to the image. But it could tell Wizer to do so, and Wizer could snapshot the program, link in the new function, and patch &func->jitCode. From the program's perspective, it's as if the code becomes available asynchronously.

## demo!

So many words, right? Let's see some code! As a sketch for other JIT compiler work, I implemented a little Scheme interpreter and JIT compiler, targeting WebAssembly. See interp.cc for the source.
You compile it like this:

$ /opt/wasi-sdk/bin/clang++ -O2 -Wall \
-mexec-model=reactor \
-Wl,--growable-table \
-Wl,--export-table \
-DLIBRARY=1 \
-fno-exceptions \
interp.cc -o interplib.wasm


Here we are compiling with WASI SDK. I have version 14.

The -mexec-model=reactor argument means that this WASI module isn't just a run-once thing, after which its state is torn down; rather it's a multiple-entry component.

The two -Wl, options tell the linker to export the indirect function table, and to allow the indirect function table to be augmented by the JIT module.

The -DLIBRARY=1 is used by interp.cc; you can actually run and debug it natively but that's just for development. We're instead compiling to wasm and running with a WASI environment, giving us fprintf and other debugging niceties.

The -fno-exceptions is because WASI doesn't support exceptions currently. Also we don't need them.

WASI is mainly for non-browser use cases, but this module does so little that it doesn't need much from WASI and I can just polyfill it in browser JavaScript. So that's what we have here:

JavaScript disabled, no wasm-jit demo. See the wasm-jit web page for more information.

Each time you enter a Scheme expression, it will be parsed to an internal tree-like intermediate language. You can then run a recursive interpreter over that tree by pressing the "Evaluate" button. Press it a number of times; you should get the same result each time.
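A recursive tree interpreter of this shape can be sketched in a few lines of Python; the node encoding and the handful of operations below are hypothetical stand-ins, not the actual intermediate language of interp.cc:

```python
# Minimal sketch of a recursive interpreter over a tree-like IL.
# Nodes are tuples tagged by kind; the shapes here are illustrative.
def evaluate(expr, env):
    kind = expr[0]
    if kind == "lit":            # ("lit", value)
        return expr[1]
    if kind == "var":            # ("var", name)
        return env[expr[1]]
    if kind == "if":             # ("if", test, then, else)
        branch = expr[2] if evaluate(expr[1], env) else expr[3]
        return branch and evaluate(branch, env)
    if kind == "prim":           # ("prim", op, args...)
        args = [evaluate(e, env) for e in expr[2:]]
        ops = {"+": sum, "<": lambda xs: xs[0] < xs[1]}
        return ops[expr[1]](args)
    raise ValueError(f"unknown node kind: {kind}")
```

Running the same expression repeatedly through `evaluate` is what the "Evaluate" button does; the result is deterministic, which is why pressing it again gives the same answer.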

As the interpreter runs, it records any closures that it created. The Func instances attached to the closures have a slot for a C++ function pointer, which is initially NULL. Function pointers in WebAssembly are indexes into the indirect function table; the first slot is kept empty so that calling a NULL pointer (a pointer with value 0) causes an error. If the interpreter gets to a closure call and the closure's function's JIT code pointer is NULL, it will interpret the closure's body. Otherwise it will call the function pointer.
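The NULL-slot-zero table and the interpret-or-call dispatch can be modeled with a small Python sketch (the names and structure are illustrative; in the real module the "table" is the WebAssembly indirect function table, not a Python list):

```python
# Sketch of WebAssembly-style indirect calls: a function "pointer" is
# an index into a table, and slot 0 is kept empty so that calling a
# NULL pointer (index 0) is an error. Illustrative names throughout.
class Trap(Exception):
    pass

indirect_table = [None]  # slot 0 reserved: calling it traps

def table_add(func):
    """Append a function to the table; return its index (the pointer)."""
    indirect_table.append(func)
    return len(indirect_table) - 1

def call_indirect(index, arg):
    func = indirect_table[index] if index < len(indirect_table) else None
    if func is None:
        raise Trap("indirect call through null function pointer")
    return func(arg)

class Func:
    def __init__(self, body):
        self.body = body      # tree-like IL for the interpreter
        self.jit_code = 0     # NULL (index 0) until JIT code is patched in

def call(func, interpret, arg):
    """Dispatch: use JIT code if the pointer is non-NULL, else interpret."""
    if func.jit_code:
        return call_indirect(func.jit_code, arg)
    return interpret(func.body, arg)
```

Patching in the JIT module then amounts to appending the generated functions to the table and storing their indices into the `jit_code` slots.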

If you then press the "JIT" button above, the module will assemble a fresh WebAssembly module containing JIT code for the closures that it saw at run-time. Obviously that's just one heuristic: you could be more eager or more lazy; this is just a detail.

Although the particular JIT compiler isn't of much interest---the point being to see JIT code generation at all---it's nice to see that the fibonacci example sees a good speedup; try it yourself, and try it on different browsers if you can. Neat stuff!

not just the web

I was wondering how to get something like this working in a non-webby environment, and it turns out that the Python interface to wasmtime is just the thing. I wrote a little interp.py harness that can do the same thing we can do on the web; just run python3 interp.py, after having run pip3 install wasmtime:

```shell
$ python3 interp.py
...
Calling eval(0x11eb0) 5 times took 1.716s.
Calling jitModule()
jitModule result: <wasmtime._module.Module object at 0x7f2bef0821c0>
Instantiating and patching in JIT module
...
Calling eval(0x11eb0) 5 times took 1.161s.
```

Interestingly, it would appear that the performance of wasmtime's code (0.232s/invocation) is somewhat better than both SpiderMonkey (0.392s) and V8 (0.729s).

reflections

This work is just a proof of concept, but it's a step in a particular direction. As part of previous work with Fastly, we enabled the SpiderMonkey JavaScript engine to run on top of WebAssembly. When combined with pre-initialization via Wizer, you end up with a system that can start in microseconds: fast enough to instantiate a fresh, shared-nothing module on every HTTP request, for example.

The SpiderMonkey-on-WASI work left out JIT compilation, though, because, you know, WebAssembly doesn't support JIT compilation. JavaScript code actually ran via the C++ bytecode interpreter. But as we just found out, actually you can compile the bytecode: just-in-time, but at a different time-scale. What if you took a SpiderMonkey interpreter, pre-generated WebAssembly code for a user's JavaScript file, and then combined them into a single freeze-dried WebAssembly module via Wizer? You get the benefits of fast startup while also getting decent baseline performance. There are many engineering considerations here, but as part of work sponsored by Shopify, we have made good progress in this regard; details in another missive.

I think a kind of "offline JIT" has a lot of value for deployment environments like Shopify's and Fastly's, and you don't have to limit yourself to "total" optimizations: you can still collect and incorporate type feedback, and you get the benefit of taking advantage of adaptive optimization without having to actually run the JIT compiler at run-time.
But if we think of more traditional "online JIT" use cases, it's clear that relying on host JIT capabilities, while a good MVP, is not optimal. For one, you would like to be able to freely emit direct calls from generated code to existing code, instead of having to call indirectly or via imports. I think it still might make sense to have a language run-time express its generated code in the form of a WebAssembly module, though really you might want native support for compiling that code (asynchronously) from within WebAssembly itself, without calling out to a run-time. Most people I have talked to that work on WebAssembly implementations in JS engines believe that a JIT proposal will come some day, but it's good to know that we don't have to wait for it to start generating code and taking advantage of it.

& out

If you want to play around with the demo, do take a look at the wasm-jit Github project; it's fun stuff. Happy hacking, and until next time!

## August 15, 2022

### Eric Meyer

#### Table Column Alignment with Variable Transforms

One of the bigger challenges of recreating The Effects of Nuclear Weapons for the Web was its tables. It was easy enough to turn tab-separated text and numbers into table markup, but the column alignment almost broke me. To illustrate what I mean, here are just a few examples of columns that had to be aligned.

At first I naïvely thought, “No worries, I can right- or left-align most of these columns and figure out the rest later.” But then I looked at the centered column headings, and how the column contents were essentially centered on the headings while having their own internal horizontal alignment logic, and realized all my dreams of simple fixes were naught but ashes.

My next thought was to put blank spacer columns between the columns of visible content, since table layout doesn’t honor the gap property, and then set a fixed width for various columns.
I really didn’t like all the empty-cell spam that would require, even with liberal application of the rowspan attribute, and it felt overly fragile — any shifts in font face (say, on an older or niche system) might cause layout upset within the visible columns, such as wrapping content that shouldn’t be wrapped or content overlapping other content. I felt like there was a better answer.

I also thought about segregating every number and symbol (including decimal separators) into separate columns, like this:

```html
<tr>
	<th>Neutrinos from fission products</th>
	<td>10</td>
	<td></td>
	<td></td>
</tr>
<tr class="total">
	<th>Total energy per fission</th>
	<td>200</td>
	<td>±</td>
	<td>6</td>
</tr>
```

Then I contemplated what that would do to screen readers and the document structure in general, and after the nausea subsided, I decided to look elsewhere.

It was at that point I thought about using spacer <span>s. Like, anywhere I needed some space next to text in order to move it to one side or the other, I’d throw in something like one of these:

```html
<span class="spacer"></span>
<span style="display: inline-block; width: 2ch;"></span>
```

Again, the markup spam repulsed me, but there was the kernel of an idea in there… and when I combined it with the truism “CSS doesn’t care what you expect elements to look or act like”, I’d hit upon my solution.

Let’s return to Table 1.43, which I used as an illustration in the announcement post. It’s shown here in its not-aligned and aligned states, with borders added to the table-cell elements. This is exactly the same table, only with cells shifted to one side or another in the second case.
To make this happen, I first set up a series of CSS rules:

```css
figure.table .lp1 {transform: translateX(0.5ch);}
figure.table .lp2 {transform: translateX(1ch);}
figure.table .lp3 {transform: translateX(1.5ch);}
figure.table .lp4 {transform: translateX(2ch);}
figure.table .lp5 {transform: translateX(2.5ch);}

figure.table .rp1 {transform: translateX(-0.5ch);}
figure.table .rp2 {transform: translateX(-1ch);}
```

For a given class, the table cell is translated along the X axis by the declared number of ch units. Yes, that means the table cells sharing a column no longer actually sit in the column. No, I don’t care — and neither, as I said, does CSS.

I chose the labels lp and rp for “left pad” and “right pad”, in part as a callback to the left-pad debacle of yore even though it has basically nothing to do with what I’m doing here. (Many of my class names are private jokes to myself. We take our pleasures where we can.) The number in each class name represents the number of “characters” to pad, which here increment by half-ch measures. Since I was trying to move things by characters, using the unit that looks like it’s a character measure (even though it really isn’t) made sense to me.

With those rules set up, I could add simple classes to table cells that needed to be shifted, like so:

```html
<td class="lp3">5 ± 0.5</td>
<td class="rp2">10</td>
```

That was most of the solution, but it turned out to not be quite enough. See, things like decimal places and commas aren’t as wide as the numbers surrounding them, and sometimes that was enough to prevent a specific cell from being able to line up with the rest of its column. There were also situations where the data cells could all be aligned with each other, but were unacceptably offset from the column header, which was nearly always centered.

So I decided to calc() the crap out of this to add the flexibility a custom property can provide.
First, I set a sitewide variable:

```css
body {
	--offset: 0ch;
}
```

I then added that variable to the various transforms:

```css
figure.table .lp1 {transform: translateX(calc(0.5ch + var(--offset)));}
figure.table .lp2 {transform: translateX(calc(1ch + var(--offset)));}
figure.table .lp3 {transform: translateX(calc(1.5ch + var(--offset)));}
figure.table .lp4 {transform: translateX(calc(2ch + var(--offset)));}
figure.table .lp5 {transform: translateX(calc(2.5ch + var(--offset)));}

figure.table .rp1 {transform: translateX(calc(-0.5ch + var(--offset)));}
figure.table .rp2 {transform: translateX(calc(-1ch + var(--offset)));}
```

Why use a variable at all? Because it allows me to define offsets specific to a given table, or even specific to certain table cells within a table. Consider the styles embedded along with Table 3.66:

```css
#tbl3-66 tbody tr:first-child td:nth-child(1),
#tbl3-66 tbody td:nth-child(7) {
	--offset: 0.25ch;
}
#tbl3-66 tbody td:nth-child(4) {
	--offset: 0.1ch;
}
```

Yeah. The first cell of the first row and the seventh cell of every row in the table body needed to be shoved over an extra quarter-ch, and the fourth cell in every table-body row (under the heading “Sp”) got a tenth-ch nudge. You can judge the results for yourself.

So, in the end, I needed only sprinkle class names around table markup where needed, and add a little extra offset via a custom property that I could scope to exactly where needed. Sure, the whole setup is hackier than a panel of professional political pundits, but it works, and to my mind, it beats the alternatives.

I’d have been a lot happier if I could have aligned some of the columns on a specific character. I think I still would have needed the left- and right-pad approach, but there were a lot of columns where I could have reduced or eliminated all the classes.
A quarter-century ago, HTML 4 had this capability, in that you could write:

```html
<COLGROUP>
	<COL>
	<COL>
	<COL align="char" char="±">
</COLGROUP>
```

CSS2 was also given this power via text-align, where you could give it a string value in order to specify horizontal alignment. But browsers never really supported these features, even if some of them do still have bugs open on the issue. (I chuckle aridly every time I go there and see “Opened 24 years ago” a few lines above “Status: NEW”.)

I know it’s not top of anybody’s wish list, but I wouldn’t mind seeing that capability return, somehow. Maybe as something that could be used in Grid column tracks as well as table columns. I also found myself really pining for the ability to use attr() here, which would have allowed me to drop the classes and use data-* attributes on the table cells to say how far to shift them. I could even have dropped the offset variable. Instead, it could have looked something like this:

```html
<td data-pad="3.25">5 ± 0.5</td>
<td data-pad="-1.9">10</td>
```

```css
figure.table *[data-pad] {transform: translateX(attr(data-pad ch));}
```

Alas, attr() is confined to the content property, and the idea of letting it be used more widely remains unrealized.

Anyway, that was my journey into recreating mid-20th-Century table column alignment on the Web. It’s true that sufficiently old browsers won’t get the fancy alignment due to not supporting custom properties or calc(), but the data will all still be there. It just won’t have the very specific column alignment, that’s all. Hooray for progressive enhancement!

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## August 10, 2022

### Javier Fernández

#### New Custom Handlers component for Chrome

The HTML Standard section on Custom Handlers describes the procedure to register custom protocol handlers for URLs with specific schemes. When a URL is followed, the protocol dictates which part of the system handles it by consulting registries.
In some cases, handlers may fall through to the OS level and launch a desktop application (e.g. mailto: may launch a native mail application), or the browser may redirect the request to a different HTTP(S) URL (e.g. mailto: can go to Gmail).

There are multiple ways to invoke the registerProtocolHandler method from the HTML API. The most common is through browser extensions, or add-ons, where the user must install third-party software to add or modify a browser’s features. However, this approach puts on users the responsibility of ensuring the extension is secure and respects their privacy. Ideally, downstream browsers could ship built-in protocol handlers that take this load off of users, but this was previously difficult.

Over the last few years, Igalia and Protocol Labs have been working on improving the support for this HTML API in several of the main web engines, with the long-term goal of making the IPFS protocol a first-class citizen inside the most relevant browsers. In this post I’m going to explain the most recent improvements, especially in Chrome, and some side advantages for Chromium embedders that we got along the way.

## Componentization of the Custom Handler logic

The first challenge faced by Chromium-based browsers wishing to add support for a new protocol was the architecture. Most of the logic and relevant codebase lived in the //chrome layer. Thus, any intent to introduce a behavior change in this logic would require a direct fork and patching the //chrome layer with the new logic. The design wasn’t meant to allow Chrome embedders to extend it with new features.

The Chromium project provides a public Content API to allow embedders to reuse a great part of the browser’s common logic and to implement, if needed, specific behavior through the specialization of these APIs. This would seem to be the ideal way of extending the browser’s capabilities to handle new protocols. However, accessing //chrome from any other layer (e.g. //content or //components) is forbidden and constitutes a layering violation.

After some discussion with Chrome engineers (special thanks to Kuniko Yasuda and Colin Blundell) we decided to move the Protocol Handlers logic to the //components layer and create a new Custom Handlers component. The following diagram provides a high-level overview of the architectural change:

The new component makes the Protocol Handler logic Chrome-independent, giving us the possibility of using the Content Public API to introduce the mentioned behavior changes, as we’ll see later in this post. Only the registry’s Delegate and Factory classes remain in the //chrome layer.

The Delegate class was defined to provide OS integration, like setting the browser as default, which needs to access the underlying operating system’s interfaces. It also deals with other browser-specific logic, like Profiles. With the new componentized design, embedders may provide their own Delegate implementations with different strategies for interacting with the operating system’s desktop, or even support additional flavors. On the other hand, the Factory (in addition to creating the appropriate Delegate instance) provides a BrowserContext that allows the registry to operate under Incognito mode.

Let’s now take a deeper look into the component, trying to understand the responsibilities of each class. The following class diagram illustrates the multi-process architecture of the Custom Handler logic:

The ProtocolHandlerRegistry class has been modified to allow operating without user-preferences storage; this is particularly useful for moving most of the browser tests to the new component and avoiding the dependency on the prefs and user_prefs components (e.g. for testing). In this scenario the registered handlers are stored in memory only.
It’s worth mentioning that as part of this architectural change, I’ve applied some code cleanup and refactoring, following the policies dictated by the base::Value Code Health task force. Some examples:

• The ProtocolHandlerThrottle implements URL redirection with the handler provided by the registry. The ChromeContentBrowserClient and ShellBrowserContentClient classes use it to implement the CreateURLLoaderThrottles method of the ContentBrowserClient interface.
• The specialized request is responsible for handling the permission prompt dialog response, except for the content-shell implementations, which bypass the Permission APIs (more on this later).

I’ve also been working on a new design that relies on the PermissionContextBase::CreatePermissionRequest interface, so that we could create our own custom request and avoid the direct call to the PermissionRequestManager. Unfortunately this still needs some additional support to deal with the RegisterProtocolHandlerPermissionRequest’s specific parameters needed for creating new instances, but I won’t expand on this, as it deserves its own post.

Finally, I’d like to remark that the new component provides a simple registry factory, designed mainly for testing purposes. It’s basically the same as the one implemented in the //chrome layer, except for some limitations of the new component:

• No Incognito mode
• Use of a mocking Delegate (no actual OS interaction)

## Single-Point security checks

I also wanted to apply some refactoring to improve inter-process consistency and increase the amount of shared code between the renderer and browser processes. One of the most relevant parts of this refactoring was the one applied to implement the Custom Handlers’ normalization process, which goes from the URL’s syntax validation to the security and privacy checks.
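To give a flavor of what that normalization involves, here is a simplified Python sketch of the spec-level checks (the scheme safelist and the web+ rule come from the HTML Standard; everything else, including the names and the elided percent-encoding, is illustrative rather than the actual blink/common code):

```python
# Simplified sketch of registerProtocolHandler() normalization per the
# HTML Standard: the scheme must be on the safelist or start with
# "web+" followed by letters, and the URL template must contain "%s".
# Illustrative only; the real checks (percent-encoding, same-origin,
# secure context, etc.) are more involved.
SAFELISTED_SCHEMES = {
    "bitcoin", "geo", "im", "irc", "ircs", "magnet", "mailto", "matrix",
    "mms", "news", "nntp", "openpgp4fpr", "sip", "sms", "smsto", "ssh",
    "tel", "urn", "webcal", "wtai", "xmpp",
}

def normalize_handler(scheme, url_template):
    scheme = scheme.lower()
    is_custom = scheme.startswith("web+") and scheme[4:].isalpha()
    if not (scheme in SAFELISTED_SCHEMES or is_custom):
        raise ValueError(f"scheme not allowed: {scheme}")
    if "%s" not in url_template:
        raise ValueError("url template must contain %s")
    return scheme, url_template

def handle(registry, url):
    """Redirect a followed URL through a registered handler, if any."""
    scheme, _, _ = url.partition(":")
    template = registry.get(scheme.lower())
    return template.replace("%s", url) if template else None
```

The `handle` function mirrors what the throttle-style redirection does: substitute the followed URL into the registered template and navigate there instead.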
I’ve landed some patches (CL#3507332, CL#3692750) that implement functions shared between the renderer and browser processes to carry out the mentioned normalization procedure. The code has been moved to blink/common so that it can also be invoked by the new Custom Handlers component described before. Both the WebContentsImpl (run by the browser process) and the NavigatorContentUtils (by the renderer process) rely on this common code now.

Additionally, I have moved all the security and privacy checks from the Browser class, in Chrome, to the WebContentsImpl class. This implies that any embedder that implements the WebContentsDelegate interface shouldn’t need to bother about these checks, and can assume that any ProtocolHandler instance is valid.

## Conclusion

In the introduction I mentioned that the main goal of this refactoring was to decouple the Custom Handlers logic from Chrome’s specific codebase; this allows us to modify or even extend the implementation of the Custom Handlers APIs. The componentization of some parts of Chrome’s codebase has many advantages, as described in the Browser Components design document. Additionally, the new Custom Handlers component provides other interesting advantages on different fronts. One of the most interesting use cases is the definition of Web Platform Tests for the Custom Handlers APIs. This goal has two main challenges to address:

• Testing automation
• Content Shell support

Finally, we would like this new component to be useful for Chrome embedders, and even for browsers built directly on top of a full-featured Chrome, to implement new Custom Handlers related features. This line of work should eventually improve the support of distributed protocols like IPFS, as an intermediate step towards a native implementation of these protocols in the browser.
## August 09, 2022

### Eric Meyer

#### Recreating “The Effects of Nuclear Weapons” for the Web

In my previous post, I wrote about a way to center elements based on their content, without forcing the element to be a specific width, while preserving the interior text alignment. In this post, I’d like to talk about why I developed that technique.

Near the beginning of this year, fellow Web nerd and nuclear history buff Chris Griffith mentioned a project to put an entire book online: The Effects of Nuclear Weapons by Samuel Glasstone and Philip J. Dolan, specifically the third (1977) edition. Like Chris, I own a physical copy of this book, and in fact, the information and tools therein were critical to the creation of HYDEsim, way back in the Aughts. I acquired it while in pursuit of my degree in History, for which I studied the Cold War and the policy effects of the nuclear arms race, from the first bombers to the Strategic Defense Initiative.

I was immediately intrigued by the idea and volunteered my technical services, which Chris accepted. So we started taking the OCR output of a PDF scan of the book, cleaning up the myriad errors, re-typing the bits the OCR mangled too badly to just clean up, structuring it all with HTML, converting figures to PNGs and photos to JPGs, and styling the whole thing for publication, working after hours and in odd down times to bring this historical document to the Web in a widely accessible form. The result of all that work is now online.

That linked page is the best example of the technique I wrote about in the aforementioned previous post: as a Table of Contents, none of the lines actually get long enough to wrap. Rather than figuring out the exact length of the longest line and centering based on that, I just let CSS do the work for me.

There were a number of other things I invented (probably re-invented) as we progressed.
Footnotes appear at the bottom of pages when the footnote number is activated through the use of the :target pseudo-class and some fixed positioning. It’s not completely where I wanted it to be, but I think the rest will require JS to pull off, and my aim was to keep the scripting to an absolute minimum.

I couldn’t keep the scripting to zero, because we decided early on to use MathJax for the many formulas and other mathematical expressions found throughout the text. I’d never written LaTeX before, and was very quickly impressed by how compact and yet powerful the syntax is. Over time, I do hope to replace the MathJax-parsed LaTeX with raw MathML for both accessibility and project-weight reasons, but as of this writing, Chromium lacks even halfway-decent MathML support, so we went with the more widely-supported solution. (My colleague Frédéric Wang at Igalia is pushing hard to fix this sorry state of affairs in Chromium, so I do have hopes for a migration to MathML… some day.)

The figures (as distinct from the photos) throughout the text presented an interesting challenge. To look at them, you’d think SVG would be the ideal image format. Had they come as vector images, I’d agree, but they’re raster scans. I tried recreating one or two in hand-crafted SVG and quickly determined the effort to create each was significant, and really only worked for the figures that weren’t charts, graphs, or other presentations of data. For anything that was a chart or graph, the risk of introducing inaccuracies was too high, and again, each would have required an inordinate amount of effort to get even close to correct. That’s particularly true considering that without knowing what font face was being used for the text labels in the figures, they’d have to be recreated with paths or polygons or whatever, driving the cost-to-recreate astronomically higher. So I made the figures PNGs that are mostly transparent, except for the places where there was ink on the paper.
After any necessary straightening and some imperfection cleanup in Acorn, I then ran the PNGs through the color-index optimization process I wrote about back in 2020, which got them down to an average of 75 kilobytes each, ranging from 443KB down to 7KB. At the 11th hour, still secretly hoping for a magic win, I ran them all through svgco.de to see if we could get automated savings. Of the 161 figures, exactly eight of them were made smaller, which is not a huge surprise, given the source material. So, I saved those eight for possible future updates and plowed ahead with the optimized PNGs.

Will I return to this again in the future? Probably. It bugs me that the figures could be better, and yet aren’t. It also bugs me that we didn’t get all of the figures and photos fully described in alt text. I did write up alternative text for the figures in Chapter I, and a few of the photos have semi-decent captions, but this was something we didn’t see all the way through, and like I say, that bugs me. If it also bugs you, please feel free to fork the repository and submit a pull request with good alt text. Or, if you prefer, you could open an issue and include your suggested alt text that way. By the image, by the section, by the chapter: whatever you can contribute would be appreciated.

Those image captions, by the way? In the printed text, they’re laid out as a label (e.g., “Figure 1.02”) and then the caption text follows. But when the text wraps, it doesn’t wrap below the label. Instead, it wraps in its own self-contained block, with the text fully justified except for the last line, which is centered. Centered!
So I set up the markup and CSS like this:

```html
<figure>
	<img src="…" alt="…" loading="lazy">
	<figcaption>
		<span>Figure 1.02.</span> <span>Effects of a nuclear explosion.</span>
	</figcaption>
</figure>
```

```css
figure figcaption {
	display: grid;
	grid-template-columns: max-content auto;
	gap: 0.75em;
	justify-content: center;
	text-align: justify;
	text-align-last: center;
}
```

Oh CSS Grid, how I adore thee. And you too, CSS box alignment. You made this little bit of historical recreation so easy, it felt like cheating.

Some other things weren’t easy. The data tables, for example, have a tendency to align columns on the decimal place, even when most but not all of the numbers are integers. Long, long ago, it was proposed that text-align be allowed a string value, something like text-align: '.', which you could then apply to a table column and have everything line up on that character. For a variety of reasons, this was never implemented, a fact which frosts my windows to this day. In general, I mean, though particularly so for this project. The lack of it made keeping the presentation historically accurate a right pain, one I may get around to writing about, if I ever overcome my shame. [Editor’s note: he overcame that shame.]

There are two things about the book that we deliberately chose not to faithfully recreate. The first is the font face. My best guess is that the book was typeset using something from the Century family, possibly Century Schoolbook (the New version of which was a particular favorite of mine in college).
The very-widely-installed Cambria seems fairly similar, at least to my admittedly untrained eye, and furthermore was designed specifically for screen media, so I went with body text styling no more complicated than this:

```css
body {
	font: 1em/1.35 Cambria, Times, serif;
	hyphens: auto;
}
```

I suppose I could have tracked down a free version of Century and used it as a custom font, but I couldn’t justify the performance cost in both download and rendering speed to myself and any future readers. And the result really did seem close enough to the original to accept.

The second thing we didn’t recreate is the printed-page layout, which is two-column. That sort of layout can work very well on the book page; it almost always stinks on a Web page. Thus, the content of the book is rendered online in a single column. The exceptions are the chapter-ending Bibliography sections and the book’s Index, both of which contain content compact and granular enough that we could get away with the original layout.

There’s a lot more I could say about how this style or that pattern came about, and maybe someday I will, but for now let me leave you with this: all these decisions are subject to change, and open to input. If you come up with a superior markup scheme for any of the bits of the book, we’re happy to look at pull requests or issues, and to act on them. It is, as we say in our preface to the online edition, a living project. We also hope that, by laying bare the grim reality of these horrific weapons, we can contribute in some small way to making them a dead and buried technology.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## August 07, 2022

### Andy Wingo

#### coarse or lazy?

sweeping, coarse and lazy

One of the things that had perplexed me about the Immix collector was how to effectively defragment the heap via evacuation while keeping just 2-3% of space as free blocks for an evacuation reserve.
The original Immix paper states:

> To evacuate the object, the collector uses the same allocator as the mutator, continuing allocation right where the mutator left off. Once it exhausts any unused recyclable blocks, it uses any completely free blocks. By default, immix sets aside a small number of free blocks that it never returns to the global allocator and only ever uses for evacuating. This headroom eases defragmentation and is counted against immix's overall heap budget. By default immix reserves 2.5% of the heap as compaction headroom, but [...] is fairly insensitive to values ranging between 1 and 3%.

To Immix, a "recyclable" block is partially full: it contains surviving data from a previous collection, but also some holes in which to allocate. But when would you have recyclable blocks at evacuation-time? Evacuation occurs as part of collection. Collection usually occurs when there's no more memory in which to allocate. At that point any recyclable block would have been allocated into already, and won't become recyclable again until the next trace of the heap identifies the block's surviving data. Of course after the next trace they could become "empty", if no object survives, or "full", if all lines have survivor objects.

In general, after a full allocation cycle, you don't know much about the heap. If you could easily know where the live data and the holes were, a garbage collector's job would be much easier :) Any algorithm that starts from the assumption that you know where the holes are can't be used before a heap trace.

So, I was not sure what the Immix paper meant here about allocating into recyclable blocks. Thinking on it again, I realized that Immix might trigger collection early sometimes, before it has exhausted the previous cycle's set of blocks in which to allocate.
As we discussed earlier, there is a case in which you might want to trigger an early compaction: when a large object allocator runs out of blocks to decommission from the immix space. And if one evacuating collection didn't yield enough free blocks, you might trigger the next one early, reserving some recyclable and empty blocks as evacuation targets. when do you know what you know: lazy and eager Consider a basic question, such as "how many bytes in the heap are used by live objects". In general you don't know! Indeed you often never know precisely. For example, concurrent collectors often have some amount of "floating garbage" which is unreachable data but which survives across a collection. And of course you don't know the difference between floating garbage and precious data: if you did, you would have collected the garbage. Even the idea of "when" is tricky in systems that allow parallel mutator threads. Unless the program has a total ordering of mutations of the object graph, there's no one timeline with respect to which you can measure the heap. Still, Immix is a stop-the-world collector, and since such collectors synchronously trace the heap while mutators are stopped, these are times when you can exactly compute properties about the heap. Let's retake the question of measuring live bytes. For an evacuating semi-space, knowing the number of live bytes after a collection is trivial: all survivors are packed into to-space. But for a mark-sweep space, you would have to compute this information. You could compute it at mark-time, while tracing the graph, but doing so takes time, which means delaying the time at which mutators can start again. Alternately, for a mark-sweep collector, you can compute free bytes at sweep-time. This is the phase in which you go through the whole heap and return any space that wasn't marked in the last collection to the allocator, allowing it to be used for fresh allocations. 
This is the point in the garbage collection cycle in which you can answer questions such as "what is the set of recyclable blocks": you know what is garbage and you know what is not. Though you could sweep during the stop-the-world pause, you don't have to; sweeping only touches dead objects, so it is correct to allow mutators to continue and then sweep as the mutators run. There are two general strategies: spawn a thread that sweeps as fast as it can (concurrent sweeping), or make mutators sweep as needed, just before they allocate (lazy sweeping). But this introduces a lag between when you know and what you know—your count of total live heap bytes describes a time in the past, not the present, because mutators have moved on since then. For most collectors with a sweep phase, deciding between eager (during the stop-the-world phase) and deferred (concurrent or lazy) sweeping is very easy. You don't immediately need the information that sweeping allows you to compute; it's quite sufficient to wait until the next cycle. Moving work out of the stop-the-world phase is a win for mutator responsiveness (latency). Usually people implement lazy sweeping, as it is naturally incremental with the mutator, naturally parallel for parallel mutators, and any sweeping overhead due to cache misses can be mitigated by immediately using swept space for allocation. The case for concurrent sweeping is less clear to me, but if you have cores that would otherwise be idle, sure. eager coarse sweeping Immix is interesting in that it chooses to sweep eagerly, during the stop-the-world phase. Instead of sweeping irregularly-sized objects, however, it sweeps over its "line mark" array: one byte for each 128-byte "line" in the mark space. For 32 kB blocks, there will be 256 bytes per block, and line mark bytes in each 4 MB slab of the heap are packed contiguously. Therefore you get relatively good locality, but this just mitigates a cost that other collectors don't have to pay. 
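That eager coarse sweep can be sketched as a single pass over the packed line-mark array (this is my own simplified model, not the paper's code):

```python
# Sketch of the eager coarse sweep (my own simplified model): walk the
# packed line-mark array once, classifying every block as empty,
# recyclable, or full.
LINES_PER_BLOCK = 256  # 32 kB blocks, 128-byte lines

def sweep_blocks(line_marks):
    """line_marks: one flag per line, blocks laid out contiguously.
    Returns (empty, recyclable, full) lists of block indices."""
    empty, recyclable, full = [], [], []
    for b in range(len(line_marks) // LINES_PER_BLOCK):
        live = sum(line_marks[b * LINES_PER_BLOCK:(b + 1) * LINES_PER_BLOCK])
        if live == 0:
            empty.append(b)
        elif live == LINES_PER_BLOCK:
            full.append(b)
        else:
            recyclable.append(b)
    return empty, recyclable, full
```

Because the mark bytes are contiguous, this is one linear scan of 256 mark bytes per 32 kB block, well under 1% of the heap.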
So what does eager sweeping over these coarse 128-byte regions buy Immix? Firstly, eager sweeping buys you eager identification of empty blocks. If your large object space needs to steal blocks from the mark space, but the mark space doesn't have enough empties, it can just trigger collection and then it knows if enough blocks are available. If no blocks are available, you can grow the heap or signal out-of-memory. If the lospace (large object space) runs out of blocks before the mark space has used all recyclable blocks, that's no problem: evacuation can move the survivors of fragmented blocks into these recyclable blocks, which have also already been identified by the eager coarse sweep. Without eager empty block identification, if the lospace runs out of blocks, firstly you don't know how many empty blocks the mark space has. Sweeping is a kind of wavefront that moves through the whole heap; empty blocks behind the wavefront will be identified, but those ahead of the wavefront will not. Such a lospace allocation would then have to either wait for a concurrent sweeper to advance, or perform some lazy sweeping work. The expected latency of a lospace allocation would thus be higher, without eager identification of empty blocks. Secondly, eager sweeping might reduce allocation overhead for mutators. If allocation just has to identify holes and not compute information or decide on what to do with a block, maybe it go brr? Not sure. lines, lines, lines The original Immix paper also notes a relative insensitivity of the collector to line size: 64 or 256 bytes could have worked just as well. This was a somewhat surprising result to me but I think I didn't appreciate all the roles that lines play in Immix. Obviously line size affects the worst-case fragmentation, though this is mitigated by evacuation (which evacuates objects, not lines). This I got from the paper. In this case, smaller lines are better.
Line size affects allocation-time overhead for mutators, though which way I don't know: scanning for holes will be easier with fewer lines in a block, but smaller lines would contain more free space and thus result in fewer collections. I can only imagine though that with smaller line sizes, average hole size would decrease and thus medium-sized allocations would be harder to service. Something of a wash, perhaps. However if we ask ourselves the thought experiment, why not just have 16-byte lines? How crazy would that be? I think the impediment to having such a precise line size would mainly be Immix's eager sweep, as a fine-grained traversal of the heap would process much more data and incur possibly-unacceptable pause time overheads. But, in such a design you would do away with some other downsides of coarse-grained lines: a side table of mark bytes would make the line mark table redundant, and you eliminate much possible "dark matter" hidden by internal fragmentation in lines. You'd need to defer sweeping. But then you lose eager identification of empty blocks, and perhaps also the ability to evacuate into recyclable blocks. What would such a system look like? Readers that have gotten this far will be pleased to hear that I have made some investigations in this area. But, this post is already long, so let's revisit this in another dispatch. Until then, happy allocations in all regions. ## August 04, 2022 ### Igalia Compilers Team #### Igalia’s Compilers Team in 2022H1 As we enter the second half of 2022, we’d like to provide a summary (necessarily highly condensed and selective!) of what we’ve been up to recently, providing some insight into the breadth of technical challenges our team of over 20 compiler engineers has been tackling. ## Low-level JS / JSC on 32-bit systems We have continued to maintain support for 32-bit systems (mainly ARMv7, but also MIPS) in JavaScriptCore (JSC). 
The work involves continuous tracking of upstream development to prevent regressions as well as the development of new features: • A major milestone for this has been the completion of support for WebAssembly in the Low-Level Interpreter (LLInt) for ARMv7. The MIPS support is mostly complete. • Developed an initial prototype of concurrent compilation in the DFG tier for 32-bit systems, and the results are promising. The work continues, and we expect to upstream it in 2022H2. • Code reduction and optimizations: we upstreamed several code reductions and optimizations for 32-bit systems (mainly ARMv7): 25% size reduction in DFGOSRExit blocks, 24% in baseline JIT on JetStream2 and 25% code size reduction from porting EXTRA_CTI_THUNKS. • Improved our hardware testing infrastructure with more MIPS and faster ARMv7 hardware for the buildbots running in the EWS (Early Warning System), which allows for faster response times to regressions. • Deployed two fuzzing bots that test JSC 24/7. The bots already found a few issues upstream that we reported to Apple. The bugs that affect 64-bit systems were fixed by the team at Apple, while we are responsible for fixing the ones affecting 32-bit systems. We expect to work on them in 2022H2. • Added logic to transparently re-run failing JSC tests (on 32-bit platforms) and declare them a pass if they're simply flaky, as long as the flakiness does not rise above a threshold. This means fewer false alerts for developers submitting patches to the EWS and for the people doing QA work. Naturally, the flakiness information is stored in the WebKit resultsdb and visualized at results.webkit.org. ## JS and standards Another aspect of our work is our contribution to the JavaScript standards effort, through involvement in the TC39 standards body, direct contribution to standards proposals, and implementation of those proposals in the major JS engines.
• Further coverage of the Temporal spec in the Test262 conformance suite, as well as various specification updates. See this blog post for insight into some of the challenges tackled by Temporal. • Performance improvements for JS class features in V8, such as faster initialisations. • Work towards supporting snapshots in node.js, including activities such as fixing support for V8 startup snapshots in the presence of class field initializers. • Collaborating with others on the “types as comments” proposal for JS, successfully reaching stage 1 in the TC39 process. • Implementing ShadowRealm support in WebKit ## WebAssembly WebAssembly is a low-level compilation target for the web, which we have contributed to in terms of specification proposals, LLVM toolchain modifications, implementation work in the JS engines, and working with customers on use cases both on the server and in web browsers. Some highlights from the last 6 months include: • Creation of a proposal for reference-typed strings in WebAssembly to ensure efficient operability with languages like JavaScript. We also landed patches to implement this proposal in V8. • Prototyping Just-In-Time (JIT) compilation within WebAssembly. • Working to implement support for WebAssembly GC types in Clang and LLVM (with one important use case being efficient and leak-free sharing of object graphs between JS and C++ compiled to Wasm). • Implementation of support for GC types in WebKit’s implementation of WebAssembly. ## Events With in-person meetups becoming possible again, Igalians have been talking on a range of topics – Multi-core Javascript (BeJS), TC39 (JS Nation), RISC-V LLVM (Cambridge RISC-V meetup), and more. We’ve also had opportunities for much-needed face to face time within the team, with many of the compilers team meeting in Brussels in May, and for the company-wide summit held in A Coruña in June. 
These events provided a great opportunity to discuss current technical challenges, strategy, and ideas for the future, knowledge sharing, and of course socialising. ## Team growth Our team has grown further this year, being joined by: • Nicolò Ribaudo – a core maintainer of BabelJS, continuing work on that project after joining Igalia in June as well as contributing to work on JS modules. • Aditi Singh – previously worked with the team through the Coding Experience program, joining full time in March focusing on the Temporal project. • Alex Bradbury – a long-time LLVM developer who joined in March and is focusing on WebAssembly and RISC-V work in Clang/LLVM. We're keen to continue to grow the team and are actively hiring, so if you think you might be interested in working in any of the areas discussed above, please apply here. ## More about Igalia If you're keen to learn more about how we work at Igalia, a recent article at The New Stack provides a fantastic overview and includes comments from a number of customers who have supported the work described in this post. ### André Almeida #### Keeping a project bisectable People write code. Test coverage is never enough. Some angry contributor will disable the CI. And we all write bugs. But that's OK, it's part of the job. Programming is hard and sometimes we may miss a corner case, forget that numbers overflow and all the other strange things that computers can do. One easy thing that we can do to help the poor developer who needs to find what changed in the code that stopped their printer from working properly is to keep the project bisectable. ## July 24, 2022 ### Danylo Piliaiev #### July 2022 Turnip Status Update There is steady progress being made since :tada: Turnip is Vulkan 1.1 Conformant :tada:. We now support GL 4.6 via Zink, implemented a lot of extensions, and are close to Vulkan 1.3 conformance.
Support of real-world games is also looking good; here is a video of an Adreno 660 rendering “The Witcher 3”, “The Talos Principle”, and “OMD2”: All of them have reasonable frame rates. However, there was a bit of “cheating” involved. Only “The Talos Principle” was fully running on the development board (via box64); the other two games were only rendered in real time on the Adreno GPU, but were run on an x86-64 laptop with their VK commands being streamed to the dev board. You can read about this method in my post “Testing Vulkan drivers with games that cannot run on the target device”. The video was captured directly on the device via OBS with obs-vkcapture, which worked surprisingly well after fighting a bunch of issues due to the lack of a binary package for it and a somewhat dated Ubuntu installation. ## Zink (GL over Vulkan) A number of extensions were implemented that are required for Zink to support higher GL versions. As of now Turnip supports OpenGL 4.6 via Zink, and while not yet conformant - only a handful of GL CTS tests are failing. For perspective, Freedreno (our GL driver for Adreno) supports only OpenGL 3.3. For Zink adventures and profound post titles check out Mike Blumenkrantz's awesome blog supergoodcode.com If you are interested in the Zink over Turnip bring-up in particular, you should read: ## Low Resolution Z improvements A major improvement for the low resolution Z optimization (LRZ) was recently made in Turnip, read about it in my previous post: LRZ on Adreno GPUs ## Extensions Anyway, since the last update Turnip supports many more extensions (in no particular order): ## What about Vulkan conformance? For Vulkan 1.3 conformance there are only a few extensions left to implement. The only major ones are VK_KHR_dynamic_rendering and VK_EXT_inline_uniform_block, both required for Vulkan 1.3. VK_KHR_dynamic_rendering is currently being reviewed and the foundation for VK_EXT_inline_uniform_block was recently merged. That's all for today!
## July 20, 2022 ### Andy Wingo #### unintentional concurrency Good evening, gentle hackfolk. Last time we talked about heuristics for when you might want to compact a heap. Compacting garbage collection is nice and tidy and appeals to our orderly instincts, and it enables heap shrinking and reallocation of pages to large object spaces and it can reduce fragmentation: all very good things. But evacuation is more expensive than just marking objects in place, and so a production garbage collector will usually just mark objects in place, and only compact or evacuate when needed. Today's post is more details! dedication Just because it's been, oh, a couple decades, I would like to reintroduce a term I learned from Marnanel years ago on advogato, a nerdy group blog kind of a site. As I recall, there is a word that originates in the Oxbridge social environment, "narg", from "Not A Real Gentleman", and which therefore denotes things that not-real-gentlemen do: nerd out about anything that's not, like, fox-hunting or golf; or generally spending time on something not because it will advance you in conventional hierarchies but because you just can't help it, because you love it, because it is just your thing. Anyway, in the spirit of pursuits that are really not What One Does With One's Time, this post is dedicated to the word "nargery". side note, bis: immix-style evacuation versus mark-compact In my last post I described Immix-style evacuation, and noted that it might take a few cycles to fully compact the heap, and that it has a few pathologies: the heap might never reach full compaction, and that Immix might run out of free blocks in which to evacuate. With these disadvantages, why bother? Why not just do a single mark-compact pass and be done? I implicitly asked this question last time but didn't really answer it. For some people the answer will be: yep, yebo, mark-compact is the right answer.
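For contrast, the single-pass alternative is easy to sketch. A sliding mark-compact collector computes a forwarding address for every live object by sliding everything down toward address zero (my own illustrative model, not any particular collector's code):

```python
# My own illustrative model of the single-pass alternative: a sliding
# mark-compact collector computes a forwarding address for each live
# object by sliding everything down toward address zero.
def compute_forwarding(live_objects):
    """live_objects: (address, size) pairs. Returns {old_addr: new_addr},
    preserving address order and squeezing out the holes."""
    forwarding = {}
    next_free = 0
    for addr, size in sorted(live_objects):
        forwarding[addr] = next_free
        next_free += size
    return forwarding

fwd = compute_forwarding([(0, 16), (64, 32), (256, 16)])
# fwd maps 0 -> 0, 64 -> 16, 256 -> 48: allocation order is preserved.
```

Note that the table needs an entry for every live object before anything can move, which is exactly why such a pass assumes all objects are movable.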
And yet, there are a few reasons that one might choose to evacuate a fraction of the heap instead of compacting it all at once. The first reason is object pinning. Mark-compact systems assume that all objects can be moved; you can't usefully relax this assumption. Most algorithms "slide" objects down to lower addresses, squeezing out the holes, and therefore every live object's address needs to be available to use when sliding down other objects with higher addresses. And yet, it would be nice sometimes to prevent an object from being moved. This is the case, for example, when you grant a foreign interface (e.g. a C function) access to a buffer: if garbage collection happens while in that foreign interface, it would be nice to be able to prevent garbage collection from moving the object out from under the C function's feet. Another reason to want to pin an object is because of conservative root-finding. Guile currently uses the Boehm-Demers-Weiser collector, which conservatively scans the stack and data segments for anything that looks like a pointer to the heap. The garbage collector can't update such a global root in response to compaction, because you can't be sure that a given word is a pointer and not just an integer with an inconvenient value. In short, objects referenced by conservative roots need to be pinned. I would like to support precise roots at some point but part of my interest in Immix is to allow Guile to move to a better GC algorithm, without necessarily requiring precise enumeration of GC roots. Optimistic partial evacuation allows for the possibility that any given evacuation might fail, which makes it appropriate for conservative root-finding. Finally, as moving objects has a cost, it's reasonable to want to only incur that cost for the part of the heap that needs it. In any given heap, there will likely be some data that stays live across a series of collections, and which, once compacted, can't be profitably moved for many cycles. 
Focussing evacuation on only the part of the heap with the lowest survival rates avoids wasting time on copies that don't result in additional compaction. (I should admit one thing: sliding mark-compact compaction preserves allocation order, whereas evacuation does not. The memory layout of sliding compaction is more optimal than evacuation.) multi-cycle evacuation Say a mutator runs out of memory, and therefore invokes the collector. The collector decides for whatever reason that we should evacuate at least part of the heap instead of marking in place. How much of the heap can we evacuate? The answer depends primarily on how many free blocks you have reserved for evacuation. These are known-empty blocks that haven't been allocated into by the last cycle. If you don't have any, you can't evacuate! So probably you should keep some around, even when performing in-place collections. The Immix papers suggest 2% and that works for me too. Then you evacuate some blocks. Hopefully the result is that after this collection cycle, you have more free blocks. But you haven't compacted the heap, at least probably not on the first try: not into 2% of total space. Therefore you tell the mutator to put any empty blocks it finds as a result of lazy sweeping during the next cycle onto the evacuation target list, and then the next cycle you have more blocks to evacuate into, and more and more and so on until after some number of cycles you fall below some overall heap fragmentation low-watermark target, at which point you can switch back to marking in place. I don't know how this works in practice! In my test setup, which triggers compaction at 10% fragmentation and continues until it drops below 5%, it's rare that it takes more than 3 cycles of evacuation until the heap drops to effectively 0% fragmentation. Of course I had to introduce fragmented allocation patterns into the microbenchmarks to even cause evacuation to happen at all.
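The trigger logic described above amounts to a simple hysteresis; a sketch, using the 10%/5% watermarks mentioned above (the values are tunables, not magic):

```python
# Sketch of the compaction hysteresis described above, with the 10%/5%
# watermarks from the text (the values are tunables, not magic).
HIGH_WATERMARK = 0.10   # start evacuating above this fragmentation
LOW_WATERMARK = 0.05    # stop evacuating once below this

def next_mode(current_mode, fragmentation):
    """current_mode is 'in_place' or 'evacuating'."""
    if current_mode == 'in_place':
        return 'evacuating' if fragmentation > HIGH_WATERMARK else 'in_place'
    return 'in_place' if fragmentation < LOW_WATERMARK else 'evacuating'
```

Between the two watermarks the collector keeps doing whatever it was already doing, which is what lets multi-cycle evacuation converge instead of oscillating.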
I look forward to some day soon testing with real applications. concurrency Just as a terminological note, in the world of garbage collectors, "parallel" refers to multiple threads being used by a garbage collector. Parallelism within a collector is essentially an implementation detail; when the world is stopped for collection, the mutator (the user program) generally doesn't care if the collector uses 1 thread or 15. On the other hand, "concurrent" means the collector and the mutator running at the same time. Different parts of the collector can be concurrent with the mutator: for example, sweeping, marking, or evacuation. Concurrent sweeping is just a detail, because it just visits dead objects. Concurrent marking is interesting, because it can significantly reduce stop-the-world pauses by performing most of the computation while the mutator is running. It's tricky, as you might imagine; the collector traverses the object graph while the mutator is, you know, mutating it. But there are standard techniques to make this work. Concurrent evacuation is a nightmare. It's not that you can't implement it; you can. But it's very very hard to get an overall performance win from concurrent evacuation/copying. So if you are looking for a good bargain in the marketplace of garbage collector algorithms, it would seem that you need to avoid concurrent copying/evacuation. It's an expensive product that would seem to not buy you very much. All that is just a prelude to an observation that there is a funny source of concurrency even in some systems that don't see themselves as concurrent: mutator threads marking their own roots. To recall, when you stop the world for a garbage collection, all mutator threads have to somehow notice the request to stop, reach a safepoint, and then stop. Then the collector traces the roots from all mutators and everything they reference, transitively. Then you let the threads go again. 
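The handshake just described can be sketched with ordinary threading primitives (a simplified model of my own, ignoring the root-tracing work itself):

```python
import threading

# A simplified model of the stop-the-world handshake: mutators poll a stop
# flag at safepoints, acknowledge, and park; the collector proceeds only
# once every mutator has parked. (Illustrative only; a real collector also
# hands over roots here.)
class World:
    def __init__(self, nmutators):
        self.stop_requested = threading.Event()
        self.resume = threading.Event()
        self.parked = threading.Semaphore(0)
        self.nmutators = nmutators

    def safepoint(self):
        """Called by each mutator at regular intervals."""
        if self.stop_requested.is_set():
            self.parked.release()   # tell the collector: "I have stopped"
            self.resume.wait()      # park until collection is done

    def stop_the_world(self):
        self.resume.clear()
        self.stop_requested.set()
        for _ in range(self.nmutators):
            self.parked.acquire()   # wait until every mutator has parked

    def restart_the_world(self):
        self.stop_requested.clear()
        self.resume.set()           # let the threads go again
```

Between `stop_the_world()` and `restart_the_world()` the collector can trace roots knowing no mutator is running; the ragged variant would let each thread trace its own roots before parking.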
Thing is, once you get more than a thread or four, stopping threads can take time. You'd be tempted to just have threads notice that they need to stop, then traverse their own stacks at their own safepoint to find their roots, then stop. But, this introduces concurrency between root-tracing and other mutators that might not have seen the request to stop. For marking, this concurrency can be fine: you are just setting mark bits, not mutating the roots. You might need to add an additional mark pattern that can be distinguished from marked-last-time and marked-the-time-before-but-dead-now, but that's a detail. Fine. But if you instead start an evacuating collection, the gates of hell open wide and toothy maws and horns fill your vision. One thread could be stopping and evacuating the objects referenced by its roots, while another hasn't noticed the request to stop and is happily using the same objects: chaos! You are trying to make a minor optimization to move some work out of the stop-the-world phase but instead everything falls apart. Anyway, this whole article was really to get here and note that you can't do ragged-stops with evacuation without supporting full concurrent evacuation. Otherwise, you need to postpone root traversal until all threads are stopped. Perhaps this is another argument that evacuation is expensive, relative to marking in place. In practice I haven't seen the ragged-stop effect making so much of a difference, but perhaps that is because evacuation is infrequent in my test cases. Zokay? Zokay. Welp, this evening's nargery was indeed nargy. Happy hacking to all collectors out there, and until next time. ### Víctor Jáquez #### Gamepad in WPEWebkit This is the brief story of the Gamepad implementation in WPEWebKit. It started with an early development done by Eugene Mutavchi (kudos!). Later, by the end of 2021, I retook those patches and discussed them with my fellow igalian Adrián, and we decided to come up with a slightly different approach.
Before going into the details, let's quickly review the WPE architecture: 1. cog library — it's a shell library that simplifies the task of writing a WPE browser from scratch, by providing common functionality and helper APIs. 2. WebKit library — that's the web engine that, given a URI and other inputs, returns, among other outputs, graphic buffers with the page rendered. 3. WPE library — it's the API that bridges cog (1) (or whatever other browser application) and WebKit (2). 4. WPE backend — its main duty is to provide graphic buffers to WebKit, buffers supported by the hardware, the operating system, windowing system, etc. Eugene's implementation has code in WebKit (implementing the gamepad support for the WPE port); and code in the WPE library with an API to connect WebKit's gamepad support and the WPE backend, which provided a custom implementation of gamepad handling, reading events directly from the Linux device. Almost everything was there, but there were some issues: • The WPE backend is mainly designed as a set of protocols, similar to Wayland, to deal with graphic buffers or audio buffers, but not with input events. The cog library is the place where input events, such as keyboard events, are handled and injected into WebKit. • The gamepad handling in the WPE backend was ad-hoc and low level, reading events directly from Linux devices. This approach is problematic since there are plenty of gamepads on the market and each has its own axes and buttons, so remapping them to the standard map is required. To overcome this issue and many others, there's a GNOME library: libmanette, which is already used by the WebKitGTK port. Today's status of the gamepad support is that it works but it's not yet fully upstreamed. • merged libwpe pull request. • cog pull request — there are two implementations: none and libmanette.
None is just a dummy implementation which will ignore any request for a gamepad provider; it's provided if libmanette is not available or if the available libwpe doesn't have gamepad support. • WebKit pull request. To prove to you all that it works, my exhibit A is this video, where I play asteroids on a Raspberry Pi 4 (64 bits): The image was built with Buildroot, using its master branch (from a week ago) with a bunch of modifications, such as adding libmanette, a kernel patch for my gamepad device, kernel 5.15.55 and its corresponding firmware, etc. ## July 19, 2022 ### Manuel Rego #### Some highlights of the Web Engines Hackfest 2022 Last month Igalia arranged a new edition of the Web Engines Hackfest in A Coruña (Galicia, Spain), which brought together more than 70 people working on the web platform for 2 days, with lots of discussions and conversations around different features. This was my first onsite event since “before times”; it was amazing seeing people for real after such a long time, meeting again some old colleagues and also a bunch of people for the first time. Being an organizer of the event meant that these were very busy days for me, but it looks like people were happy with the result and enjoyed the event quite a lot. This is a brief post about my personal highlights during the event. ## Talks During the hackfest we had an afternoon with 5 talks, which were live streamed on YouTube so people could follow them remotely and also ask questions through the event matrix channel. Leo Balter's Talk • Leo Balter talked about how Salesforce participates in the web platform as a partner, working with browsers and web standards. I really liked this talk, because it explains how companies that use the web platform can collaborate and have a direct impact on the evolution of the web.
And there are many ways to do that: reporting bugs that affect your company, explaining use cases that are important for you and the things you miss from the platform, providing feedback about different features, looking for partners to fix outstanding issues or add support for new stuff, etc. Igalia has been showing during the last decade that there's a way to have an impact on the web platform outside of the big companies and browser vendors. Thanks to our position in the different communities, we can help companies to push features they're interested in and that would benefit the entire web platform in the future. • Dominik Röttsches gave a talk about COLRv1 fonts, giving details on the Chromium implementation and the different open-source software components involved. This new font format allows you to do really amazing things, and Dominik showed how to create a Galician emoji font with popular things like the Tower of Hercules or Polbo á feira. With some early demos on variable COLRv1 and the beginnings of the first Galician emoji font… • Daniel Minor explained the work done in Gecko and SpiderMonkey to refactor the internationalization system. A very interesting talk with lots of information and details about internationalization, going deep into text segmentation and how it works in different languages, and also introducing the ICU4X project. • Ada Rose Cannon did a great introduction to WebXR and Augmented Reality. Despite her not being onsite, this was an awesome talk and the video was actually a very immersive experience. Ada explained many concepts and features around WebXR and Augmented Reality with a bunch of cool examples and demos. • Thomas Steiner talked about Project Fugu APIs that have been implemented in Chromium. Using the Web Engines Hackfest logo as an example, he explained different new capabilities that Project Fugu is adding to the web through a real application called SVGcode. It was a great set of talks, and you can now watch them all on YouTube.
We hope you enjoy them if you haven't had the chance to watch them yet. ## CSS & Interop 2022 On the CSS breakout session we talked about all the new big features that are arriving in browsers these days. Container Queries and :has are probably the most notable examples, features that people have been requesting since the early days and that are shipping in browsers this year. Apart from that, we talked about the Interop 2022 effort: how the target areas for improving interoperability are defined, and how much work some of them imply. ## MathML & Fonts Frédéric Wang did a nice presentation about MathML and all the work that has been done in recent years. The feature is close to shipping in Chromium (modulo finding some solution regarding printing, or waiting for LayoutNG to be ready to print), which will be a huge step forward, making MathML a feature supported in the major browser engines. Related to the MathML work there was some discussion around fonts, particularly OpenType MATH fonts; you can read Fred's post for more details. There is some good news regarding this topic: the new macOS version includes STIX Two Math installed by default, and there are ongoing conversations to get some OpenType MATH font by default in Android too. MathML Breakout Session ## Accessibility & AOM Valerie Young, who has recently started acting as co-chair of the ARIA Working Group, was leading a session around accessibility where we talked about ARIA and related things like AOM. The Accessibility Object Model (AOM) is an effort that involves a lot of different things. In this session we talked about ARIA Attribute Reflection and the issues with making accessible custom elements that use Shadow DOM, which proposals like Cross-root ARIA Delegation are trying to solve.
Accessibility Breakout Session

## Acknowledgements

To close this post I’d like to say thank you to everyone who participated in the Web Engines Hackfest; without your presence this event wouldn’t make any sense. I’d also like to thank the speakers for the great talks and the time they devoted to preparing them for this event. As usual, big thanks to the sponsors Arm, Google and Igalia for making this event possible once more. And thanks again to Igalia for letting me be part of the event organization.

Web Engines Hackfest 2022 Sponsors

## July 18, 2022

### Víctor Jáquez

#### GstVA H.264 encoder, compositor and JPEG decoder

There are, right now, three new GstVA elements merged in main: vah264enc, vacompositor and vajpegdec.

Just to recap, GstVA is a GStreamer plugin in gst-plugins-bad (yes, we agree it’s not a great name anymore), to differentiate it from gstreamer-vaapi. Both plugins use libva to access stateless video processing operations; the main difference is, precisely, how the stream’s state is handled: while GstVA uses GStreamer libraries shared by other hardware accelerated plugins (such as d3d11 and v4l2codecs), gstreamer-vaapi uses an internal, tightly coupled and convoluted library. Also, note that right now (release 1.20) GstVA elements are ranked NONE, while gstreamer-vaapi ones are mostly PRIMARY+1.

Back to the three new elements in GstVA, the most complex one is vah264enc, written almost entirely by He Junyan, from Intel. For it, He had to write an H.264 bitwriter which is, to a certain extent, the opposite of the H.264 parser: it constructs the bitstream buffer from H.264 structures such as PPS, SPS, slice header, etc. This API is part of libgstcodecparsers, ready to be reused by other plugins or applications. Currently vah264enc is fairly complete and functional, dealing with profiles and rate controls, among other parameters. It still has rough spots, but we’re working on them.
But He Junyan is restless, and he already has in the pipeline a common encoder class along with HEVC and AV1 encoders.

The second element is vacompositor, written by Artie Eoff. It’s the replacement for vaapioverlay in gstreamer-vaapi. The compositor suffix is preferred to follow the name of the primary (software-based) video mixing element: compositor, successor of videomixer. See this discussion for further details. The purpose of this element is to compose a single video stream from multiple video streams. It works with Intel’s media-driver supporting the alpha channel, and also works with AMD Mesa Gallium, but without the alpha channel (in other words, without a custom degree of transparency).

The last, but not least, element is vajpegdec, which I worked on. The main issue was not the decoder itself, but jpegparse, which didn’t signal the image caps required by the hardware accelerated decoders. For instance, VA only decodes images with SOF marker 0 (Baseline DCT). It wasn’t needed before because the main and only consumer of the parser was jpegdec, which deals with any type of JPEG image. Long story short, we revamped jpegparse and now it signals the SOF marker, color space (YUV, RGB, etc.) and chroma subsampling (if it has a YUV color space), along with comments and EXIF-like metadata as pipeline tags. Thus vajpegdec exposes in its caps template the color spaces and chroma subsampling supported by the driver. For example, Intel supports (more or less) the RGB color space, while AMD Mesa Gallium doesn’t. And that’s all for now. Thanks.

## July 12, 2022

### Danylo Piliaiev

#### Low-resolution-Z on Adreno GPUs

## What is LRZ?

Citing the official Adreno documentation:

[A Low Resolution Z (LRZ)] pass is also referred to as draw order independent depth rejection. During the binning pass, a low resolution Z-buffer is constructed, and can reject LRZ-tile wide contributions to boost binning performance.
This LRZ is then used during the rendering pass to reject pixels efficiently before testing against the full resolution Z-buffer.

My colleague Samuel Iglesias did the initial reverse-engineering of this feature; for an in-depth overview you can read his great blog post Low Resolution Z Buffer support on Turnip. Here are a few excerpts from that post describing what LRZ is:

To understand better how LRZ works, we need to talk a bit about tile-based rendering. This is a way of rendering based on subdividing the framebuffer in tiles and rendering each tile separately. The binning pass processes the geometry of the scene and records in a table on which tiles a primitive will be rendered. By doing this, the HW only needs to render the primitives that affect a specific tile when it is processed.

The rendering pass gets the rasterized primitives and executes all the fragment related processes of the pipeline. Once it finishes, the resolve pass starts.

Where is LRZ used then? Well, in both binning and rendering passes. In the binning pass, it is possible to store the depth value of each vertex of the geometries of the scene in a buffer as the HW has that data available. That is the depth buffer used internally for LRZ. It has lower resolution as too much detail is not needed, which helps to save bandwidth while transferring its contents to system memory.

Thanks to LRZ, the rendering pass is only executed on the fragments that are going to be visible at the end. LRZ brings a couple of things to the table that make it interesting. One is that applications don’t need to reorder their primitives before submission to be more efficient; that is done by the HW with LRZ automatically.

Now, a year later, I returned to this feature to make some important improvements; for the nitty-gritty details you can dive into Mesa MR#16251 “tu: Overhaul LRZ, implement on-GPU dir tracking and LRZ fast-clear”.
There I implemented on-GPU LRZ direction tracking, LRZ reuse between renderpasses, and fast-clear of LRZ.

In this post I want to give some practical advice, based on things I learnt while reverse-engineering this feature, on how to help the driver enable LRZ. Some of it may be self-evident, some is already written in the official docs, and some cannot be found there. It should be applicable to Vulkan, GLES, and likely Direct3D.

## Do not change the direction of depth comparisons

Or rather, when writing depth - do not change the direction of depth comparisons. If the depth comparison direction is changed while writing into the depth buffer - LRZ has to be disabled. Why? Because if the depth comparison direction is GREATER - LRZ stores the lowest depth value of the block of pixels, and if the direction is LESS - it stores the highest value of the block. So if the direction is changed, the LRZ value becomes wrong for the new direction. A few examples:

• :thumbsup: Going from VK_COMPARE_OP_GREATER -> VK_COMPARE_OP_GREATER_OR_EQUAL is good;
• :x: Going from VK_COMPARE_OP_GREATER -> VK_COMPARE_OP_LESS is bad;
• :neutral_face: Going from VK_COMPARE_OP_GREATER with depth write -> VK_COMPARE_OP_LESS without depth write is ok; LRZ is just temporarily disabled for the VK_COMPARE_OP_LESS draw calls.

The rules can be summarized as:

• Changing the depth write direction disables LRZ;
• For calls with a different direction but without depth write, LRZ is temporarily disabled;
• VK_COMPARE_OP_GREATER and VK_COMPARE_OP_GREATER_OR_EQUAL have the same direction;
• VK_COMPARE_OP_LESS and VK_COMPARE_OP_LESS_OR_EQUAL have the same direction;
• VK_COMPARE_OP_EQUAL and VK_COMPARE_OP_NEVER don’t have a direction, so LRZ is temporarily disabled (surprise, your VK_COMPARE_OP_EQUAL compares don’t benefit from LRZ);
• VK_COMPARE_OP_ALWAYS and VK_COMPARE_OP_NOT_EQUAL either temporarily or completely disable LRZ, depending on whether depth is being written.
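Purely as an illustration (this is not Turnip code; the names are made up), the direction rules above can be sketched as a tiny classifier:

```python
# Illustrative sketch of the LRZ direction rules above; not actual driver code.
# Each compare op maps to a depth direction; changing direction while writing
# depth invalidates LRZ, while draws with a different direction but no depth
# write only disable LRZ temporarily.

DIRECTION = {
    "GREATER": "greater",
    "GREATER_OR_EQUAL": "greater",
    "LESS": "less",
    "LESS_OR_EQUAL": "less",
    # EQUAL, NEVER, ALWAYS and NOT_EQUAL have no direction.
}

def lrz_after_draw(tracked_dir, compare_op, depth_write):
    """Return (new_tracked_direction, lrz_status) for one draw call."""
    d = DIRECTION.get(compare_op)
    if d is None:
        # ALWAYS/NOT_EQUAL with a depth write make depth unpredictable.
        if compare_op in ("ALWAYS", "NOT_EQUAL") and depth_write:
            return None, "disabled"
        return tracked_dir, "temporarily disabled"
    if tracked_dir is not None and d != tracked_dir:
        if depth_write:
            return None, "disabled"  # direction changed while writing depth
        return tracked_dir, "temporarily disabled"
    return (d if depth_write else tracked_dir), "enabled"
```

For example, a GREATER -> GREATER_OR_EQUAL transition keeps LRZ enabled, while GREATER -> LESS with a depth write disables it.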
## Simple rules for fragment shaders

### Do not write depth

This obviously makes the resulting depth value unpredictable, so LRZ has to be completely disabled.

Note that the output value of manually written depth can be bounded by a conservative depth modifier; in GLSL this is achieved with the GL_ARB_conservative_depth extension, like this:

layout (depth_greater) out float gl_FragDepth;

However, Turnip at the moment does not consider this hint, and it is unknown whether Qualcomm’s proprietary driver does.

### Do not use Blending/Logic OPs/colorWriteMask

All of them make a new fragment value depend on the old fragment value. LRZ is temporarily disabled in this case.

### Do not have side effects in fragment shaders

Writing to an SSBO, images, … from a fragment shader forces late Z, thus it is incompatible with LRZ. At the moment Turnip completely disables LRZ when a shader has such side effects.

### Do not discard fragments

Discarding fragments moves the decision of whether a fragment contributes to the depth buffer to the time of fragment shader execution. LRZ is temporarily disabled in this case.

## LRZ in secondary command buffers and dynamic rendering

TLDR: Since Snapdragon 865 (Adreno 650), LRZ is supported in secondary command buffers.

TLDR: LRZ works with VK_KHR_dynamic_rendering, but you may want to avoid this extension because it isn’t nice to tilers.

The official docs state that LRZ is disabled with “Use of secondary command buffers (Vulkan)”, and on another page that “Snapdragon 865 and newer will not disable LRZ based on this criteria”. Why? Because up to Snapdragon 865, tracking of the direction is done on the CPU, meaning that the LRZ direction is kept in an internal renderpass object, updated and checked without any GPU involvement. But starting from Snapdragon 865 the direction can be tracked on the GPU, which allows the driver not to know the previous LRZ direction during command buffer construction. Therefore secondary command buffers can now use LRZ!
Recently Vulkan 1.3 came out and mandated support for VK_KHR_dynamic_rendering. It gets rid of the complicated VkRenderpass and VkFramebuffer setup, but much more exciting is a simpler way to construct renderpasses in parallel (with the VK_RENDERING_SUSPENDING_BIT / VK_RENDERING_RESUMING_BIT flags). VK_KHR_dynamic_rendering poses a similar challenge for LRZ as secondary command buffers and has the same solution.

## Reusing LRZ between renderpasses

TLDR: Since Snapdragon 865 (Adreno 650), LRZ works if you store depth in one renderpass and load it later, given the depth image isn’t changed in between.

Another major improvement brought by Snapdragon 865 is the possibility to reuse LRZ state between renderpasses. The on-GPU direction tracking is part of the equation here; the other part is tracking which depth view is being used. A depth image has a single LRZ buffer, which corresponds to a single array layer + single mip level of the image. So if a view with a different array layer or mip level is used - the LRZ state can’t be reused and is invalidated.

With the above knowledge, here are the conditions under which LRZ state can be reused:

• The depth attachment was stored (STORE_OP_STORE) at the end of some past renderpass;
• The same depth attachment with the same depth view settings is being loaded (not cleared) in the current renderpass;
• There were no changes to the underlying depth image, meaning there was no vkCmdBlitImage*, vkCmdCopyBufferToImage*, or vkCmdCopyImage*. Otherwise the LRZ state is invalidated.

Misc notes:

• LRZ state is saved per depth image, so you don’t lose the state if you have several renderpasses with different depth attachments;
• vkCmdClearAttachments + LOAD_OP_LOAD is just equal to LOAD_OP_CLEAR.

## Conclusion

While there are many rules listed above - it all boils down to keeping things simple in the main renderpass(es) and not being too clever.
## July 07, 2022

### Brian Kardell

#### Where Browsers Come From

In this post, I’ll talk about the evolution of the economics around browsers, and what we should think about as we move forward.

The very first web browsers were rather small, single-person affairs. The source for Tim Berners-Lee’s original WorldWideWeb (which included a WYSIWYG editor!) was roughly a third larger than the unminified build of jQuery. Tim even abstracted the HTTP bits so you could build your own browser more easily, or a server (or both/all in one). For the most part, we can say these were created in academia.

There was at least some assumption at the time that, like most other available software, someone could monetize a good one as a product. In fact, before building his own, Tim famously tried to give the idea away to a company doing just that. For a while we flirted with that idea in various forms, creating companies “around” browsers. These companies largely never “just sold the browser” but involved lots of experimenting with ways to keep browsers “free” for individuals by subsidizing them in other ways: corporate sales, licenses, support, server sales, and so on. With this concerted effort, and money, it was soon impractical for a one-person affair to keep up or compete. Instead, the competition pretty quickly became between Netscape and Microsoft.

Microsoft, on the other hand, subsidized the browser through products people were already buying. They were already spending money to add similar web features to many of those products, so why not just centralize it? They eventually just shipped it with the OS. When it became clear that Netscape’s model was losing, the people there who really cared about the web had a real problem: Now what? Unless someone could do something radical, Microsoft’s way would be the only way.
## And then, two interesting things happened…

In 1998, as a result, Mozilla was spun off from Netscape, and Open Source in the modern sense was born. The browser known as Firefox wouldn’t even launch its 1.0 version until 2004.

At literally the same time, Google (the search engine) was developing and getting users. It existed for 6 years before it launched its IPO – also in 2004. The web had indexes and search engines before Google, and had some clear winners. But it would be an understatement to say that Google was game-changing. It just absolutely nailed search with some totally new ideas. They made it simple and accurate.

Google was banking on what really good search meant to the Web: It made it practical and easy. The easier they made it, the more people turned to the web for answers, and so on. That meant people would just be searching, literally all the time - and each of those searches came with ad opportunities! You can see just how important this is to getting around on the web today. We’ve replaced the URL bar with the Wonder Bar (integrated search). If it doesn’t look like a URL, it goes to a search engine - automatically… But it wasn’t always like that.

Just before Firefox 1.0, Mozilla signed a landmark “default search” deal which would pay for 85% of their total budget. It wasn’t the intent for that to be the only means forever. In fact, they tried to expand into product ideas too. However, this dependency has generally only increased, recently accounting for up to 95% of the total Mozilla budget. Over time, this model would become the dominant approach. Also, by the release of Firefox 1.0, its codebase was up to 2.1 million lines of code.
## How we pay for the web

Step back and look at the growth in the size and complexity of the 3 remaining engines over time…

| browser | ~lines of code |
| --- | --- |
| OG WWW | 13,000 |
| Firefox 1.0 | 2,100,000 |
| Chromium 2022 | 34,900,821 |

Today, the cost to develop and maintain a browser is measured in the many hundreds of millions of dollars per year – each.

And here’s the thing… Most of the actual, practical revenue that has made the whole thing possible ever since is, in a large sense, driven by the model of monetizing browsers through default search. In some cases this is pretty direct and easy to point to. In other cases, it’s a little obscured. However, I ask you to consider that companies (no matter how profitable) don’t voluntarily commit to bleeding hundreds of millions of dollars per year (and rising), forever. Something has to make it more economically viable - and today that thing is lots and lots of default search revenue. Without it, the ledgers would suddenly go red, but default search deals ensure the books aren’t being drained of massive piles of money. In fact, they’re generating revenue.

It’s tempting to think “Well, so? It seems to be working pretty well? This is the way”. …But, is it? Is it the only way? Should it always be the only way? I kind of think not.

## Starting a conversation about our future

We’ve never really just sold the browser; we’ve always subsidized browser development in other ways. I’m definitely not suggesting that users should have to pay to download and use a web browser, but I do think it’s useful to realize that that doesn’t exactly make browsers free either. Instead, it spreads out the costs some other way that obscures those details from the conversation. From that angle, it might be a useful thought exercise to consider: What does the browser commons cost, per-user, per-year? Based on what we know of team sizes, scales and budgets - several people seem to have arrived at a similar ballpark figure: somewhere around 2 billion dollars per year.
Those are the current costs to maintain all the current engines and make/keep them competitive. I expect that for many this will induce a kind of “sticker shock” and it will be tempting to think “wow, that’s inefficient, maybe we don’t need 3 engines after all.” However, let’s dig in a little further. The web currently has about 5 billion users. So, a hypothetical per-user, per-year cost is less than 50 cents per year. As I said earlier, we’ve never required that, and we shouldn’t. We spread costs and subsidize them. But, let’s continue our thought exercise a bit further: If the wealthiest 1/5th of users instead paid $2 per year, that would still meet costs. If we keep going ‘up’, at $10 per year it would take a number of users comparable to the population of Brazil. Going further, if the wealthiest 1,000 companies equally subsidized the web, that would cost them $2 million each.
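As a quick sanity check of that thought exercise (a sketch; the figures are just the ballpark numbers above):

```python
# Ballpark figures from the thought exercise above.
annual_cost = 2_000_000_000   # ~$2 billion per year for the engine commons
users = 5_000_000_000         # ~5 billion web users

per_user = annual_cost / users           # under 50 cents per user, per year
wealthiest_fifth = (users // 5) * 2      # 1/5th of users paying $2/year
users_at_ten = annual_cost // 10         # users needed at $10/year (~Brazil)
per_company = annual_cost // 1_000       # 1,000 companies sharing the cost
```

Each scenario lands on the same $2 billion: 1 billion users at $2, 200 million users at $10, or 1,000 companies at $2 million each.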

All of this is just to illustrate a few pretty simple things: The more we spread it out, the less it costs anyone.

Today, the search-ads-based model spreads those costs to everyone who advertises - which, it turns out, is a pretty large population with money. Fractions of fractions of the revenue generated from this actually make it back down to cover those enormous browser costs today.

They do this in exchange for certain things: At a minimum, this is in exchange for showing you lots of ads. However, many advertisers also think it’s valuable for them to know as much as possible about you in the process. And, well, at some level - it’s worth keeping in mind that currently, they’re paying. If we want to change any aspect of that, we’ve got to reckon with that fact.

But there are lots of opportunities to spread out those costs beyond that, aren’t there?

### Downstream

Open Source has a lot of great qualities. Among them, it lets ideas smash together and new projects can get spun up which wouldn’t otherwise be economically possible.

It’s important to realize that the engines/open source browser projects are all cost. It’s the actual deployed browsers built with them that are currently monetizable in this way at all. But there are lots of other things built with these projects too, many of which have also become lucrative products.

The current open source model doesn’t require them to give anything back. That’s not to say none do - some do (substantially, even), but it’s entirely voluntary and many don’t at all. It just so happens that the flagship browsers that sponsor an engine/open source browser project have so many users that they can lose quite a lot to people only taking without upsetting this delicate balance.

You’ll note that while those flagship browsers not only pay for development but also generate revenue with default search, there are still only a tiny few organizations on the planet willing to maintain an engine/browser. A big part of that is that it’s only possible if you are a flagship browser.

In my view, we need to work on that on several fronts. First, we need to acknowledge it’s a problem and begin trying to change the conversation. Thankfully, we’re seeing lots of things developing here, like Open Collective, and we’re starting to see people increasingly pointing that out. A few weeks ago, Nicholas C. Zakas (@slicknet) wrote Sponsoring dependencies: The next step in open source sustainability, which is worth a read. However, no matter how we make money trickle through the system, we also need to find new ways of making it appealing to put money in in the first place. Making business investment into the commons tax-deductible would probably go a long way.

Another interesting part of that is that many of those who make those lucrative products but don’t give back do then advertise them. One is a very easy pitch (ads), the other is hard to justify even at a fraction of the cost. So, advertising can tap from completely different departments in the same company, for different reasons. I think there’s something important in that distinction that we should look at too.

For example, “plentiful ads” isn’t even the only ads model! Conversely, some kinds of ad opportunities are worth a lot more because they aren’t plentiful. They are exclusive. The Super Bowl, for example, lasts only a few hours and raises more money than Firefox’s current annual budget in that time. It’s worth noting that even with its currently reduced marketshare, more people use Firefox than watch the Super Bowl! (Google search makes about 2 orders of magnitude more in a quarter than that, btw). Maybe there’s something there worth thinking about.

### But… search?

You might have noticed a huge flaw in all of this discussion: it’s been focused entirely on how the browser commons is funded by search, but it’s left off the fact that even if we funded browsers a completely different way - they would still need search, and that is where all the money currently is. Attacking only one end of this would sort of make it worse.

Yeah, this is kind of the perfect mousetrap we’ve built for ourselves.

That original search deal was so killer because they realized just how fundamental integrated search would be to just using the web and … I hate to point it out again, but that is also not free. In fact, Google’s search engine costs a lot more than their web browser engine… And someone has to pay for that too, and it is again, the central money spigot.

I’m genuinely not sure how to address that. It’s harder to call search engines we have today “the commons”, though search integration is clearly fundamental to it. That seems much more challenging to tackle. It is something of an optimization problem and involves businesses and incentives and users making choices about things I think most people are not even especially well equipped to understand.

The one thing I can say for sure though is: If we don’t talk about problems, or even begin to agree they might be problems, it’s pretty hard to improve them. So, let’s start doing that.

## July 06, 2022

### Clayton Craft

#### Deleting stuff is fun and helpful!

Debugging problems during software development can be frustrating, even if it's immensely rewarding once you eventually figure it out. This is especially true when reproducing the problem seemingly requires some complex dance through tons of other parts of the project. Frustration is further exacerbated when the language used is one you're still trying to master.

There's one technique I've learned through the School of Hard Knocks to deal with problems that are overly complex in one form or another: simplify it

What do I mean? I mean start removing or commenting out code that is seemingly unrelated. Try to reduce the input to the function (or whatever you’re debugging) to the simplest form that causes the problem to happen. Pull out portions of code that you think are related to the problem and build a new program, in an attempt to reproduce the problem with some skeleton of the original code base/configuration. With source control, you're not really at risk of "losing" anything, so go wild with removing things. Your goal is to recreate the problem at hand with as simple a program (and input!) as possible.
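The "simplify the input" idea can even be sketched as a tiny greedy reducer. This is just an illustration; `reproduces` is a hypothetical stand-in for whatever check triggers your bug:

```python
# Greedily drop pieces of a failing input while the bug still reproduces.
# 'reproduces' is a hypothetical callback that returns True if the bug
# still happens with the given candidate input.

def simplify(parts, reproduces):
    """Return a smaller subset of `parts` that still triggers the bug."""
    changed = True
    while changed:
        changed = False
        for i in range(len(parts)):
            candidate = parts[:i] + parts[i + 1:]
            if reproduces(candidate):
                parts = candidate  # this piece wasn't needed to trigger it
                changed = True
                break
    return parts
```

Tools like C-Reduce and delta debugging automate exactly this loop at scale, but even doing it by hand pays off.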

### Easier to spot the root cause

Oftentimes in the process of simplifying it, I notice the cause of the problem. "Oops, I'm pretty sure I didn't actually mean to do that." - me, a lot.

### Easier for comparisons

If the problem magically goes away after simplifying it, then I have a valuable starting point for diffing the broken program against. If it's still not obvious, then I start re-adding components (with attempts to reproduce in between additions) until I can cause the problem to happen again.

### Easier to ask for help

If I simplify it and the problem still happens, it's much easier to share a simplified version when asking for help. Somewhat unsurprisingly, people are generally more willing to help if they don't have to spend half a day studying the entire codebase to get a handle on how things are supposed to work. Allowing them to also reproduce the problem more easily can only help.

### Easier to turn into a test

Simplifying the code to reproduce the problem can include removing unrelated functionality and trying to mock out unrelated dependencies. This is like... 90-something percent of the way towards writing a test. So once the issue is discovered and resolved, I already have a test ready to go that can confirm the fix and help guard against it happening again in the future! Win!

The next time you're baffled while debugging by some weird behavior, just start deleting stuff! It's fun! And it'll probably help!

# Introduction

The Steam Deck is a handheld gaming computer that runs a Linux-based operating system called SteamOS. The machine comes with SteamOS 3 (code name “holo”), which is in turn based on Arch Linux.

Although there is no SteamOS 3 installer for a generic PC (yet), it is very easy to install on a virtual machine using QEMU. This post explains how to do it.

The goal of this VM is not to play games (you can already install Steam on your computer after all) but to use SteamOS in desktop mode. The Gamescope mode (the console-like interface you normally see when you use the machine) requires additional development to make it work with QEMU and will not work with these instructions.

A SteamOS VM can be useful for debugging, development, and generally playing and tinkering with the OS without risking breaking the Steam Deck.

Running the SteamOS desktop in a virtual machine only requires QEMU and the OVMF UEFI firmware and should work in any relatively recent distribution. In this post I’m using QEMU directly, but you can also use virt-manager or some other tool if you prefer, we’re emulating a standard x86_64 machine here.

# General concepts

SteamOS is a single-user operating system and it uses an A/B partition scheme, which means that there are two sets of partitions and two copies of the operating system. The root filesystem is read-only and system updates happen on the partition set that is not active. This allows for safer updates, among other things.

There is one single /home partition, shared by both partition sets. It contains the games, user files, and anything that the user wants to install there.

Although the user can trivially become root, make the root filesystem read-write and install or change anything (the pacman package manager is available), this is not recommended because

• it increases the chances of breaking the OS, and
• any changes will disappear with the next OS update.

A simple way for the user to install additional software that survives OS updates and doesn’t touch the root filesystem is Flatpak. It comes preinstalled with the OS and is integrated with the KDE Discover app.

# Preparing all the necessary files

The first thing that we need is the installer. For that we have to download the Steam Deck recovery image from here: https://store.steampowered.com/steamos/download/?ver=steamdeck&snr=

Once the file has been downloaded, we can uncompress it and we’ll get a raw disk image called steamdeck-recovery-4.img (the number may vary).

Note that the recovery image is already SteamOS (just not the most up-to-date version). If you simply want to have a quick look you can play a bit with it and skip the installation step. In this case I recommend that you extend the image before using it, for example with ‘truncate -s 64G steamdeck-recovery-4.img‘ or, better, create a qcow2 overlay file and leave the original raw image unmodified: ‘qemu-img create -f qcow2 -F raw -b steamdeck-recovery-4.img steamdeck-recovery-extended.qcow2 64G‘

But here we want to perform the actual installation, so we need a destination image. Let’s create one:

$ qemu-img create -f qcow2 steamos.qcow2 64G

# Installing SteamOS

Now that we have all the files we can start the virtual machine:

$ qemu-system-x86_64 -enable-kvm -smp cores=4 -m 8G \
    -device usb-ehci -device usb-tablet \
    -device intel-hda -device hda-duplex \
    -device VGA,xres=1280,yres=800 \
    -drive if=virtio,file=steamdeck-recovery-4.img,format=raw \
    -drive if=none,id=drive0,file=steamos.qcow2 \
    -device nvme,drive=drive0,serial=badbeef


Note that we’re emulating an NVMe drive for steamos.qcow2 because that’s what the installer script expects. This is not strictly necessary but it makes things a bit easier. If you don’t want to do that you’ll have to edit ~/tools/repair_device.sh and change DISK and DISK_SUFFIX.

Once the system has booted we’ll see a KDE Plasma session with a few tools on the desktop. If we select “Reimage Steam Deck” and click “Proceed” on the confirmation dialog then SteamOS will be installed on the destination drive. This process should not take a long time.

Now, once the operation finishes a new confirmation dialog will ask if we want to reboot the Steam Deck, but here we have to choose “Cancel”. We cannot use the new image yet because it would try to boot into the Gamescope session, which won’t work, so we need to change the default desktop session.

SteamOS comes with a helper script that allows us to enter a chroot after automatically mounting all SteamOS partitions, so let’s open a Konsole and make the Plasma session the default one in both partition sets:

$ sudo steamos-chroot --disk /dev/nvme0n1 --partset A
# steamos-readonly disable
# echo '[Autologin]' > /etc/sddm.conf.d/zz-steamos-autologin.conf
# echo 'Session=plasma.desktop' >> /etc/sddm.conf.d/zz-steamos-autologin.conf
# steamos-readonly enable
# exit

$ sudo steamos-chroot --disk /dev/nvme0n1 --partset B
# steamos-readonly disable
# echo '[Autologin]' > /etc/sddm.conf.d/zz-steamos-autologin.conf
# echo 'Session=plasma.desktop' >> /etc/sddm.conf.d/zz-steamos-autologin.conf
# steamos-readonly enable
# exit


After this we can shut down the virtual machine. Our new SteamOS drive is ready to be used. We can discard the recovery image now if we want.

# Booting SteamOS and first steps

To boot SteamOS we can use a QEMU line similar to the one used during the installation. This time we’re not emulating an NVMe drive because it’s no longer necessary.

$ cp /usr/share/OVMF/OVMF_VARS.fd .
$ qemu-system-x86_64 -enable-kvm -smp cores=4 -m 8G \
-device usb-ehci -device usb-tablet \
-device intel-hda -device hda-duplex \
-device VGA,xres=1280,yres=800 \
-drive if=pflash,format=raw,file=OVMF_VARS.fd \
-drive if=virtio,file=steamos.qcow2 \
-device virtio-net-pci,netdev=net0 \
-netdev user,id=net0,hostfwd=tcp::2222-:22


(the last two lines redirect tcp port 2222 to port 22 of the guest to be able to SSH into the VM. If you don’t want to do that you can omit them)

If everything went fine, you should see KDE Plasma again, this time with a desktop icon to launch Steam and another one to “Return to Gaming Mode” (which we should not use because it won’t work). See the screenshot that opens this post.

Congratulations, you’re running SteamOS now. Here are some things that you probably want to do:

• (optional) Change the keyboard layout in the system settings (the default one is US English)
• Set the password for the deck user: run ‘passwd‘ on a terminal
• Enable / start the SSH server: ‘sudo systemctl enable sshd‘ and/or ‘sudo systemctl start sshd‘.
• SSH into the machine: ‘ssh -p 2222 deck@localhost‘

The Steam Deck recovery image doesn’t install the most recent version of SteamOS, so now we should probably do a software update.

• First of all, ensure that you’re giving enough RAM to the VM (in my examples I run QEMU with -m 8G). The OS update might fail if you use less.
• (optional) Change the OS branch if you want to try the beta release: ‘sudo steamos-select-branch beta‘ (or main, if you want the bleeding edge)
• Check the currently installed version in /etc/os-release (see the BUILD_ID variable)
• Check the available version: ‘steamos-update check‘
• Download and install the software update: ‘steamos-update‘

Note: if the last step fails after reaching 100% with a post-install handler error then go to Connections in the system settings, rename Wired Connection 1 to something else (anything, the name doesn’t matter), click Apply and run steamos-update again. This works around a bug in the update process. Recent images fix this and this workaround is not necessary with them.

As we did with the recovery image, before rebooting we should ensure that the new update boots into the Plasma session, otherwise it won’t work. This is done with the same steamos-chroot commands we used earlier, run against both partition sets.

## Personal notes

I have a lot of personal connections to this that make it meaningful. As I said in Harold Crick and the Web Platform, I might be partially responsible for its delays.

Little did he know, he’d get deeply involved in helping solve that problem.

When I came to Igalia, it was one of my first projects.

I helped fix some things. The first Web Platform Test I ever contributed was for MathML. I think the first WHATWG PR I ever sent was on MathML. The first BlinkOn Lightning Talk I ever did was about - you guessed it, MathML. The first W3C Charter I ever helped write? That’s right: About MathML. The first actual Working Group I’ve ever chaired (well, co-chaired) is about Math. The first explainer I ever wrote myself was about MathML. The first podcast I ever hosted was on… Guess what? MathML. And so on.

And here’s the thing: I am perhaps the least mathematically minded person you will ever meet. But we did the right thing. A good thing. And we did it a good way, and for good reasons.

A few episodes later on our podcast we had Rick Byers from Chrome and Rossen Atanassov from Microsoft on the show, and Rick brought up MathML. Both of them were hugely impressed and supportive of it even back then - Rick said:

I fully expect MathML to ship at some point and it’ll ship in Chrome… Even though from Google’s business perspective, it probably wouldn’t have been a good return on investment for us to do it… I’m thrilled that Igalia was able to do it.

By then, there was a global pandemic underway and Rossen pointed out…

I’m looking forward to it. Look, I… I’m a huge supporter of having a native math into into the engines, MathML… And having the ability to, at the end of the day, increase the edu market, which will benefit the most out of it. Especially, you know, having been through one full semester of having a middle school student at home, and having her do all of her work through online tools… Having better edu support will go a long way. So, thank you on behalf all of the students, future students that will benefit

And… yeah, that’s kind of it, right? There’s this huge thing that has no special seeming obvious ROI for browsers, that doesn’t have every developer in the world screaming for it, but it’s really great for society, and it’s kind of important to serve this niche because it underpins the ability of students, physicists, mathematicians, etc. to communicate with text.

# Mission Accomplished Progress

Anyway… Wow. This is huge, I think, in so many ways.

I’m gonna go raise a glass to everyone who helped achieve this astonishingly huge thing.

I think it says so much about our ability to do things together and promise for the ecosystem and how it could work.

I sort of wish I could just say “Mission Accomplished” but the truth is that this is a beginning, not an end. It means we have really good support for parsing and interoperably rendering (assuming some math-capable fonts) tons and tons of math, and a spec for how it needs to integrate with the whole rest of the platform, but only one implementation of that bit. Now we have to align the other implementations the same way so that we really have just One Platform and all of it can move forward together and no part gets left behind again.

Then, beyond that is MathML-Core Level 2. Level 1 draws a box around what was practically able to be aligned in the first pass - but it leaves a few hard problems on the table which really need solving. Many of these have some partial solutions in the other 2 browsers already but are hard to specify and integrate.

I have a lot of faith that we can reach both of those goals, but to do it takes investment. I really hope that reaching this milestone helps convince organizations and vendors to contribute toward reaching them. I’d encourage everyone to give the larger ecosystem some thought and consider how we can support good efforts - even maybe directly.

If you'd like to understand the argument for this better, my friend and colleague at Igalia, Eric Meyer, presented an excellent talk when we announced it...

## June 21, 2022

### Ziran Sun

#### My first time attending Igalia Summit

With employees distributed across more than 20 countries, most of them working remotely, Igalia holds summits twice a year to give everyone the opportunity to meet up face-to-face. The summit normally runs for a week with code camps, team building and recreational activities. This year’s summer summit was held between the 15th and 19th of June in A Coruña, Galicia.

I joined Igalia at the beginning of 2020. Because of the COVID-19 pandemic, this was my first time attending an Igalia Summit. Due to my personal schedule I only managed to stay in A Coruña for three nights. The overall experience was great and I thoroughly enjoyed it!

### Getting to know A Coruña

Igalia’s HQ is based in A Coruña in Galicia, Spain. A beautiful white-sanded beach is just a few yards away from the hotel where we stayed. I took a barefoot stroll along the beach one morning. The beach itself was reasonably quiet. There were a few people swimming in the shallow part of the sea. The weather that morning was perfect, with warm sunshine and an occasional cool gentle breeze.

The smell of the sea in the air and people wandering around at ease in the evenings somehow brought a familiar feeling to me and reminded me of my home town.

On Wednesday I joined a guided visit to A Coruña set in 1906. It was a very interesting walk around the historic part of the city. The tour guide, Suso Martín, presented an amazing one-man show, walking us through A Coruña’s history back to the 12th century and the historic buildings, religion and romances associated with the city.

On Thursday evening we went on the guided tour of MEGA, Mundo Estrella Galicia. En route we passed the office that Igalia used to occupy in its early days. According to some Igalians who worked there, it was a small office, with just over 10 Igalians in those days. Today Igalia has over 100 employees, and last year we celebrated our 20th anniversary in Open Source.

The highlights of the MEGA tour for me were watching the production line at work and tasting the beers. We were spoiled with beers and local cheeses.

### Meeting other Igalians

Since I joined Igalia, all meetings had happened online due to the pandemic. I was glad that I could finally meet other Igalians “physically”. During the summit I had chances to chat with other Igalians at the code camp, during meals and on guided tours. It was pleasant. Playing board games together after the evening meal (I’d call it “night meal”) was great fun. Some Igalians can be very funny and witty. I had quite a few “tearful laughter” moments. It was very enjoyable.

During the team code camp, I had the chance to spend more time with my teammates. The technical meetings on both days were very engaging and effective. I was very happy to see Javi Fernández in person. Last year I was involved in CSS Grid compatibility work (one of the 5 key areas for the Compat2021 effort that Igalia was involved in). I personally had never touched CSS Grid layout at the beginning of the assignment. The web platform team in Igalia, though, has very good knowledge and working experience in this area. My teammates Manuel Rego, Javi Fernández and Sergio Villar were among the pioneer developers in this field. To assure the success of the task, the team formed a safe environment by providing me constant guidance and support. Specifically, I had Javi’s help throughout the whole task. Javi has been an amazing technical mentor with great expertise and patience. We had numerous calls for technical discussions. The team also ran a couple of debugging sessions with me for some very tricky bugs.

The team meal on Wednesday evening was very nice. Delicious food, great companions and nice fruity beers – what could go wrong? Well, we did walk back to the hotel in the rain…

### Thanks

The event was very well organized and productive. I really appreciate the hard work and great effort those Igalians put in to make it happen. I just want to say – a big THANK YOU! I’m glad that I managed the trip. Traveling from the UK is not straightforward but I’d say it’s well worth it.

### Andy Wingo

#### an optimistic evacuation of my wordhoard

Good morning, mallocators. Last time we talked about how to split available memory between a block-structured main space and a large object space. Given a fixed heap size, making a new large object allocation will steal available pages from the block-structured space by finding empty blocks and temporarily returning them to the operating system.

Today I'd like to talk more about nothing, or rather, why might you want nothing rather than something. Given an Immix heap, why would you want it organized in such a way that live data is packed into some blocks, leaving other blocks completely free? How bad would it be if instead the live data were spread all over the heap? When might it be a good idea to try to compact the heap? Ideally we'd like to be able to translate the answers to these questions into heuristics that can inform the GC when compaction/evacuation would be a good idea.

lospace and the void

Let's start with one of the more obvious points: large object allocation. With a fixed-size heap, you can't allocate new large objects if you don't have empty blocks in your paged space (the Immix space, for example) that you can return to the OS. To obtain these free blocks, you have four options.

1. You can continue lazy sweeping of recycled blocks, to see if you find an empty block. This is a bit time-consuming, though.

2. Otherwise, you can trigger a regular non-moving GC, which might free up blocks in the Immix space but which is also likely to free up large objects, which would result in fresh empty blocks.

3. You can trigger a compacting or evacuating collection. Immix can't actually compact the heap all in one go, so you would preferentially select evacuation-candidate blocks by choosing the blocks with the least live data (as measured at the last GC), hoping that little data will need to be evacuated.

4. Finally, for environments in which the heap is growable, you could just grow the heap instead. In this case you would configure the system to target a heap size multiplier rather than a heap size, which would scale the heap to be e.g. twice the size of the live data, as measured at the last collection.

If you have a growable heap, I think you will rarely choose to compact rather than grow the heap: you will either collect or grow. Under constant allocation rate, the rate of empty blocks being reclaimed from freed lospace objects will be equal to the rate at which they are needed, so if collection doesn't produce any, then that means your live data set is increasing and so growing is a good option. Anyway let's put growable heaps aside, as heap-growth heuristics are a separate gnarly problem.

The question becomes, when should large object allocation force a compaction? Absent growable heaps, the answer is clear: when allocating a large object fails because there are no empty pages, but the statistics show that there is actually ample free memory. Good! We have one heuristic, and one with an optimum: you could compact in other situations but from the point of view of lospace, waiting until allocation failure is the most efficient.
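To make that trigger concrete, here is a toy sketch in Python. The names and the free-space threshold are illustrative assumptions, not from any real collector; the point is just the shape of the heuristic: evacuate only on allocation failure, and only when the free-space statistics say fragmentation (not exhaustion) is the problem.

```python
def should_evacuate_for_lospace(alloc_failed, free_bytes, heap_bytes,
                                min_free_fraction=0.03):
    """Decide whether a failed large-object allocation should force a
    compacting/evacuating collection.

    Evacuate only when allocation actually failed (there were no empty
    blocks to decommission) but a meaningful fraction of the heap is
    free -- i.e. the free space is merely scattered across non-empty
    blocks. The 3% default is an arbitrary illustrative threshold.
    """
    if not alloc_failed:
        return False  # cheapest policy: wait for allocation failure
    return free_bytes / heap_bytes >= min_free_fraction
```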

shrinkage

Moving on, another use of empty blocks is when shrinking the heap. The collector might decide that it's a good idea to return some memory to the operating system. For example, I enjoyed this recent paper on heuristics for optimum heap size, that advocates that you size the heap in proportion to the square root of the allocation rate, and that as a consequence, when/if the application reaches a dormant state, it should promptly return memory to the OS.

Here, we have a similar heuristic for when to evacuate: when we would like to release memory to the OS but we have no empty blocks, we should compact. We use the same evacuation candidate selection approach as before, also, aiming for maximum empty block yield.

fragmentation

What if you go to allocate a medium object, say 4kB, but there is no hole that's 4kB or larger? In that case, your heap is fragmented. The smaller your heap size, the more likely this is to happen. We should compact the heap to make the maximum hole size larger.
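That condition is easy to state precisely; a small illustrative sketch (hypothetical names, Python just to pin down the logic):

```python
def needs_compaction_for(request_size, hole_sizes):
    """A heap is fragmented with respect to a medium-object request
    when the total free space could accommodate it, but no single
    hole is large enough."""
    if not hole_sizes:
        return False  # no free space at all: collect or grow instead
    return max(hole_sizes) < request_size <= sum(hole_sizes)
```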

side note: compaction via partial evacuation

The evacuation strategy of Immix is... optimistic. A mark-compact collector will compact the whole heap, but Immix will only be able to evacuate a fraction of it.

It's worth dwelling on this a bit. As described in the paper, Immix reserves around 2-3% of overall space for evacuation overhead. Let's say you decide to evacuate: you start with 2-3% of blocks being empty (the target blocks), and choose a corresponding set of candidate blocks for evacuation (the source blocks). Since Immix is a one-pass collector, it doesn't know how much data is live when it starts collecting. It may not know that the blocks that it is evacuating will fit into the target space. As specified in the original paper, if the target space fills up, Immix will mark in place instead of evacuating; an evacuation candidate block with marked-in-place objects would then be non-empty at the end of collection.

In fact if you choose a set of evacuation candidates hoping to maximize your empty block yield, based on an estimate of live data instead of limiting to only the number of target blocks, I think it's possible to actually fill the targets before the source blocks empty, leaving you with no empty blocks at the end! (This can happen due to inaccurate live data estimations, or via internal fragmentation with the block size.) The only way to avoid this is to never select more evacuation candidate blocks than you have in target blocks. If you are lucky, you won't have to use all of the target blocks, and so at the end you will end up with more free blocks than not, so a subsequent evacuation will be more effective. The defragmentation result in that case would still be pretty good, but the yield in free blocks is not great.
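The conservative selection policy described above might be sketched like this (illustrative Python, not taken from any real Immix implementation):

```python
def pick_evacuation_candidates(blocks, target_blocks):
    """Select evacuation-candidate (source) blocks for partial
    evacuation.

    blocks: list of (block_id, live_bytes) pairs for non-empty
            blocks, with live_bytes as measured at the last GC.
    target_blocks: ids of currently-empty blocks (the targets).

    The conservative policy: prefer blocks with the least live data
    (cheapest to evacuate, best empty-block yield), but never select
    more candidates than there are targets, so evacuation can never
    run out of target space.
    """
    by_liveness = sorted(blocks, key=lambda b: b[1])
    return [bid for bid, _ in by_liveness[:len(target_blocks)]]
```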

In a production garbage collector I would still be tempted to be optimistic and select more evacuation candidate blocks than available empty target blocks, because it will require fewer rounds to compact the whole heap, if that's what you wanted to do. It would be a relatively rare occurrence to start an evacuation cycle. If you ran out of space while evacuating, in a production GC I would just temporarily commission some overhead blocks for evacuation and release them promptly after evacuation is complete. If you have a small heap multiplier in your Immix space, occasional partial evacuation in a long-running process would probably reach a steady state with blocks being either full or empty. Fragmented blocks would represent newer objects and evacuation would periodically sediment these into longer-lived dense blocks.

mutator throughput

Finally, the shape of the heap has its inverse in the shape of the holes into which the mutator can allocate. It's most efficient for the mutator if the heap has as few holes as possible: ideally just one large hole per block, which is the limit case of an empty block.

The opposite extreme would be having every other "line" (in Immix terms) be used, so that free space is spread across the heap in a vast spray of one-line holes. Even if fragmentation is not a problem, perhaps because the application only allocates objects that pack neatly into lines, having to stutter all the time to look for holes is overhead for the mutator. Also, the result is that contemporaneous allocations are more likely to be placed farther apart in memory, leading to more cache misses when accessing data. Together, allocator overhead and access overhead lead to lower mutator throughput.

When would this situation get so bad as to trigger compaction? Here I have no idea. There is no clear maximum. If compaction were free, we would compact all the time. But it's not; there's a tradeoff between the cost of compaction and mutator throughput.

I think here I would punt. If the heap is being actively resized based on allocation rate, we'll hit the other heuristics first, and so we won't need to trigger evacuation/compaction based on mutator overhead. You could measure this, though, in terms of average or median hole size, or average or maximum number of holes per block. Since evacuation is partial, all you need to do is to identify some "bad" blocks and then perhaps evacuation becomes attractive.
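If you did want to measure it, the per-block metrics mentioned above fall out directly from the line mark bits; a toy sketch:

```python
def hole_stats(line_marks):
    """Given one block's line mark bits (True = line live), return
    (number_of_holes, max_hole_lines). A 'hole' is a maximal run of
    free lines -- the unit into which an Immix mutator allocates."""
    holes, run, max_run = 0, 0, 0
    for marked in line_marks:
        if marked:
            if run:  # a run of free lines just ended
                holes += 1
                max_run = max(max_run, run)
            run = 0
        else:
            run += 1
    if run:  # block ends in a hole
        holes += 1
        max_run = max(max_run, run)
    return holes, max_run
```

Blocks whose hole count is high or whose maximum hole is small would be the "bad" blocks that make evacuation attractive.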

gc pause

Welp, that's some thoughts on when to trigger evacuation in Immix. Next time, we'll talk about some engineering aspects of evacuation. Until then, happy consing!

### Carlos García Campos

#### Thread safety support in libsoup3

In libsoup2 there’s some thread safety support that allows sending messages from a thread different from the one where the session was created. There are other APIs that can be used concurrently too, like accessing some of the session properties, and others that aren’t thread safe at all. It’s not clear what’s thread safe, and even sending a message is not fully thread safe either, depending on the session features involved. However, several applications rely on the thread safety support and have always worked surprisingly well.

In libsoup3 we decided to remove the (broken) thread safety support and only allow using the API from the same thread where the session was created. This simplified the code and made it easier to add the HTTP/2 implementation. Note that HTTP/2 supports multiple requests over the same TCP connection, which is a lot more efficient than starting multiple requests from several threads in parallel.

When apps started to be ported to libsoup3, those that relied on the thread safety support ended up being a pain to port. Major refactorings were required to either stop using the sync API from secondary threads, or move all the soup usage to the same secondary thread. We managed to make it work in several modules like gstreamer and gvfs, but others like evolution required a lot more work. The extra work was definitely worth it and resulted in much better and more efficient code. But we also understand that porting an application to a new version of a dependency is not a top priority task for maintainers.

So, in order to help with the migration to libsoup3, we decided to add thread safety support to libsoup3 again, but this time trying to cover all the APIs involved in sending a message and documenting what’s expected to be thread safe. Also, since we didn’t remove the sync APIs, it’s expected that we support sending messages synchronously from secondary threads. We still encourage using only the async APIs from a single thread, because that’s the most efficient way, especially for HTTP/2 requests, but apps currently using threads can be easily ported first and then refactored later.

The thread safety support in libsoup3 is expected to cover only one use case: sending messages. All other APIs, including accessing session properties, are not thread safe and can only be used from the thread where the session is created.

There are a few important things to consider when using multiple threads in libsoup3:

• In the case of HTTP/2, two messages for the same host sent from different threads will not use the same connection, so the advantage of HTTP/2 multiplexing is lost.
• Only the API to send messages can be called concurrently from multiple threads. So, in case of using multiple threads, you must configure the session (setting network properties, features, etc.) from the thread it was created and before any request is made.
• All signals associated with a message (SoupSession::request-queued, SoupSession::request-unqueued, and all SoupMessage signals) are emitted from the thread that started the request, and all the IO will happen there too.
• The session can be created in any thread, but all session APIs except the methods to send messages must be called from the thread where the session was created.
• To use the async API from a thread different than the one where the session was created, the thread must have a thread default main context where the async callbacks are dispatched.
• The sync API doesn’t need any main context at all.

## June 20, 2022

### Tiago Vignatti

#### Short blog post from Madrid's bus

This post was inspired by “Short blog post from Madrid’s hotel room” from my colleague Frédéric Wang. You should really check his post instead! To Fred: thanks for the feedback and review here. I’m in for football in the next Summit, alright? :-) This week, I finally went to A Coruña for the Web Engines Hackfest and internal company meetings. These were my first on-site events since the COVID-19 pandemic. After two years of non-super-exciting virtual conferences I was so glad to finally be able to meet with colleagues and other people from the Web.

### Andy Wingo

#### blocks and pages and large objects

Good day! In a recent dispatch we talked about the fundamental garbage collection algorithms, also introducing the Immix mark-region collector. Immix mostly leaves objects in place but can move objects if it thinks it would be profitable. But when would it decide that this is a good idea? Are there cases in which it is necessary?

I promised to answer those questions in a followup article, but I didn't say which followup :) Before I get there, I want to talk about paged spaces.

enter the multispace

We mentioned that Immix divides the heap into blocks (32kB or so), and that no object can span multiple blocks. "Large" objects -- defined by Immix to be more than 8kB -- go to a separate "large object space", or "lospace" for short.

Though the implementation of a large object space is relatively simple, I found that it has some points that are quite subtle. Probably the most important of these points relates to heap size. Consider that if you just had one space, implemented using mark-compact maybe, then the procedure to allocate a 16 kB object would go:

1. Try to bump the allocation pointer by 16kB. Is it still within range? If so we are done.

2. Otherwise, collect garbage and try again. If after GC there isn't enough space, the allocation fails.

In step (2), collecting garbage could decide to grow or shrink the heap. However when evaluating collector algorithms, you generally want to avoid dynamically-sized heaps.
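The two-step procedure above can be sketched as toy Python (the collect callback stands in for a real collector; this models only the allocation-pointer bookkeeping, not actual memory):

```python
class BumpSpace:
    """Toy bump-pointer space illustrating the two-step allocation
    procedure: try to bump, and on failure collect and retry once
    before declaring allocation failure."""
    def __init__(self, size):
        self.size = size
        self.top = 0  # the allocation pointer

    def try_bump(self, nbytes):
        if self.top + nbytes <= self.size:
            addr = self.top
            self.top += nbytes
            return addr
        return None  # out of range

    def allocate(self, nbytes, collect):
        addr = self.try_bump(nbytes)  # step 1: bump the pointer
        if addr is None:
            collect(self)             # step 2: GC, then try again
            addr = self.try_bump(nbytes)
        return addr                   # None means allocation failed
```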

cheatery

Here is where I need to make an embarrassing admission. In my role as co-maintainer of the Guile programming language implementation, I have long noodled around with benchmarks, comparing Guile to Chez, Chicken, and other implementations. It's good fun. However, I only realized recently that I had a magic knob that I could turn to win more benchmarks: simply make the heap bigger. Make it start bigger, make it grow faster, whatever it takes. For a program that does its work in some fixed amount of total allocation, a bigger heap will require fewer collections, and therefore generally take less time. (Some amount of collection may be good for performance as it improves locality, but this is a marginal factor.)

Of course I didn't really go wild with this knob but it now makes me doubt all benchmarks I have ever seen: are we really using benchmarks to select for fast implementations, or are we in fact selecting for implementations with cheeky heap size heuristics? Consider even any of the common allocation-heavy JavaScript benchmarks, DeltaBlue or Earley or the like; to win these benchmarks, web browsers are incentivised to have large heaps. In the real world, though, a more parsimonious policy might be more appreciated by users.

Java people have known this for quite some time, and are therefore used to fixing the heap size while running benchmarks. For example, people will measure the minimum amount of memory that can allow a benchmark to run, and then configure the heap to be a constant multiplier of this minimum size. The MMTK garbage collector toolkit can't even grow the heap at all currently: it's an important feature for production garbage collectors, but as they are just now migrating out of the research phase, heap growth (and shrinking) hasn't yet been a priority.

lospace

So now consider a garbage collector that has two spaces: an Immix space for allocations of 8kB and below, and a large object space for, well, larger objects. How do you divide the available memory between the two spaces? Could the balance between immix and lospace change at run-time? If you never had large objects, would you be wasting space at all? Conversely is there a strategy that can also work for only large objects?

Perhaps the answer is obvious to you, but it wasn't to me. After much reading of the MMTK source code and pondering, here is what I understand the state of the art to be.

1. Arrange for your main space -- Immix, mark-sweep, whatever -- to be block-structured, and able to dynamically decommission or recommission blocks, perhaps via MADV_DONTNEED. This works if the blocks are even multiples of the underlying OS page size.

2. Keep a counter of however many bytes the lospace currently has.

3. When you go to allocate a large object, increment the lospace byte counter, and then round up to the number of blocks to decommission from the main paged space. If this is more than are currently decommissioned, find some empty blocks and decommission them.

4. If no empty blocks were found, collect, and try again. If the second try doesn't work, then the allocation fails.

5. Now that the paged space has shrunk, lospace can allocate. You can use the system malloc, but probably better to use mmap, so that if these objects are collected, you can just MADV_DONTNEED them and keep them around for later re-use.

6. After GC runs, explicitly return the memory for any object in lospace that wasn't visited when the object graph was traversed. Decrement the lospace byte counter and possibly return some empty blocks to the paged space.
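Steps 2, 3 and 6 above amount to some simple accounting; here is an illustrative toy sketch (Python, with an assumed 32 kB block size; a real implementation would also track which concrete blocks are decommissioned):

```python
BLOCK_SIZE = 32 * 1024  # assumed Immix-style block size

class Multispace:
    """Toy model of the paged-space/lospace balance: a fixed budget
    of blocks, a lospace byte counter, and blocks notionally
    decommissioned from the paged space as large objects are
    allocated (steps 2-3), recommissioned as they are freed (step 6).
    """
    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.lospace_bytes = 0  # step 2: the lospace byte counter

    def blocks_needed(self):
        # step 3: round lospace usage up to whole blocks
        return -(-self.lospace_bytes // BLOCK_SIZE)  # ceiling division

    def paged_blocks(self):
        return self.total_blocks - self.blocks_needed()

    def alloc_large(self, nbytes):
        self.lospace_bytes += nbytes
        if self.blocks_needed() > self.total_blocks:
            self.lospace_bytes -= nbytes
            return False  # would exceed the heap: caller must GC
        return True

    def free_large(self, nbytes):
        # step 6: after GC, returned lospace memory lets the paged
        # space recommission blocks
        self.lospace_bytes -= nbytes
```

Note how the memory handed back to the paged space need not be contiguous: only the counter matters.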

There are some interesting aspects about this strategy. One is, the memory that you return to the OS doesn't need to be contiguous. When allocating a 50 MB object, you don't have to find 50 MB of contiguous free space, because any set of blocks that adds up to 50 MB will do.

Another aspect is that this adaptive strategy can work for any ratio of large to non-large objects. The user doesn't have to manually set the sizes of the various spaces.

This strategy does assume that address space is larger than heap size, but only by a factor of 2 (modulo fragmentation for the large object space). Therefore our risk of running afoul of user resource limits and kernel overcommit heuristics is low.

The one underspecified part of this algorithm is... did you see it? "Find some empty blocks". If the main paged space does lazy sweeping -- only scanning a block for holes right before the block will be used for allocation -- then after a collection we don't actually know very much about the heap, and notably, we don't know what blocks are empty. (We could know it, of course, but it would take time; you could traverse the line mark arrays for all blocks while the world is stopped, but this increases pause time. The original Immix collector does this, however.) In the system I've been working on, instead I have it so that if a mutator finds an empty block, it puts it on a separate list, and then takes another block, only allocating into empty blocks once all blocks are swept. If the lospace needs blocks, it sweeps eagerly until it finds enough empty blocks, throwing away any nonempty blocks. This causes the next collection to happen sooner, but that's not a terrible thing; this only occurs when rebalancing lospace versus paged-space size, because if you have a constant allocation rate on the lospace side, you will also have a complementary rate of production of empty blocks by GC, as they are recommissioned when lospace objects are reclaimed.

What if your main paged space has ample space for allocating a large object, but there are no empty blocks, because live objects are equally peppered around all blocks? In that case, often the application would be best served by growing the heap, but maybe not. In any case in a strict-heap-size environment, we need a solution.

But for that... let's pick up another day. Until then, happy hacking!

## June 17, 2022

### Frédéric Wang

#### Short blog post from Madrid's hotel room

This week, I finally went back to A Coruña for the Web Engines Hackfest and internal company meetings. These were my first on-site events since the COVID-19 pandemic. After two years of non-super-exciting virtual conferences I was so glad to finally be able to meet with colleagues and other people from the Web.

Igalia has grown considerably and I finally got to know many new hires in person. Obviously, some people were still not able to travel despite the effort we put into settling strong sanitary measures. Nevertheless, our infrastructure has also improved a lot and we were able to provide remote communication during these events, in order to give people a chance to attend and participate!

Work on the Madrid–Galicia high-speed rail line finally completed last December, meaning one can now travel by fast train between Paris - Barcelona - Madrid - A Coruña. This takes about a day and a half though and, because I’m voting in the legislative elections in France, I had to shorten my stay a bit and miss some nice social activities 😥… That’s a pity, but I’m looking forward to participating more next time!

Finally on the technical side, my main contribution was to present our upcoming plan to ship MathML in Chromium. The summary is that we are happy with this first implementation and will send the intent-to-ship next week. There are minor issues to address, but the consensus from the conversations we had with other attendees (including folks from Google and Mozilla) is that they should not be a blocker and can be refined depending on the feedback from API owners. So let’s do it and see what happens…

There is definitely a lot more to write and nice pictures to share, but it’s starting to be late here and I have a train back to Paris tomorrow. 😉

## June 16, 2022

### Iago Toral

#### V3DV Vulkan 1.2 status

A quick update on my latest activities around V3DV: I’ve been focusing on getting the driver ready for Vulkan 1.2 conformance, which mostly involved fixing a few CTS tests of the kind that would only fail occasionally; these are always fun :). I think we have fixed all the issues now and we are ready to submit conformance to Khronos; my colleague Alejandro Piñeiro is now working on that.

## June 15, 2022

### Andy Wingo

#### defragmentation

Good morning, hackers! Been a while. It used to be that I had long blocks of uninterrupted time to think and work on projects. Now I have two kids; the longest such time-blocks are on trains (too infrequent, but it happens) and in a less effective but more frequent fashion, after the kids are sleeping. As I start writing this, I'm in an airport waiting for a delayed flight -- my first since the pandemic -- so we can consider this to be the former case.

It is perhaps out of mechanical sympathy that I have been using my reclaimed time to noodle on a garbage collector. Managing space and managing time have similar concerns: how to do much with little, efficiently packing different-sized allocations into a finite resource.

I have been itching to write a GC for years, but the proximate event that pushed me over the edge was reading about the Immix collection algorithm a few months ago.

on fundamentals

Immix is a "mark-region" collection algorithm. I say "algorithm" rather than "collector" because it's more like a strategy or something that you have to put into practice by making a concrete collector, the other fundamental algorithms being copying/evacuation, mark-sweep, and mark-compact.

To build a collector, you might combine a number of spaces that use different strategies. A common choice would be to have a semi-space copying young generation, a mark-sweep old space, and maybe a treadmill large object space (a kind of copying collector, logically; more on that later). Then you have heuristics that determine what object goes where, when.

On the engineering side, there's quite a number of choices to make there too: probably you make some parts of your collector to be parallel, maybe the collector and the mutator (the user program) can run concurrently, and so on. Things get complicated, but the fundamental algorithms are relatively simple, and present interesting fundamental tradeoffs.

figure 1 from the immix paper

For example, mark-compact is most parsimonious regarding space usage -- for a given program, a garbage collector using a mark-compact algorithm will require less memory than one that uses mark-sweep. However, mark-compact algorithms all require at least two passes over the heap: one to identify live objects (mark), and at least one to relocate them (compact). This makes them less efficient in terms of overall program throughput and can also increase latency (GC pause times).

Copying or evacuating spaces can be more CPU-efficient than mark-compact spaces, as reclaiming memory avoids traversing the heap twice; a copying space copies objects as it traverses the live object graph instead of after the traversal (mark phase) is complete. However, a copying space's minimum heap size is quite high, and it only reaches competitive efficiencies at large heap sizes. For example, if your program needs 100 MB of space for its live data, a semi-space copying collector will need at least 200 MB of space in the heap (a 2x multiplier, we say), and will only run efficiently at something more like 4-5x. It's a reasonable tradeoff to make for small spaces such as nurseries, but as a mature space, it's so memory-hungry that users will be unhappy if you make it responsible for a large portion of your memory.

Finally, mark-sweep is quite efficient in terms of program throughput, because like copying it traverses the heap in just one pass, and because it leaves objects in place instead of moving them. But! Unlike the other two fundamental algorithms, mark-sweep leaves the heap in a fragmented state: instead of having all live objects packed into a contiguous block, memory is interspersed with live objects and free space. So the collector can run quickly but the allocator stops and stutters as it accesses disparate regions of memory.

allocators

Collectors are paired with allocators. For mark-compact and copying/evacuation, the allocator consists of a pointer to free space and a limit. Objects are allocated by bumping the allocation pointer, a fast operation that also preserves locality between contemporaneous allocations, improving overall program throughput. But for mark-sweep we run into a problem: say you go to allocate a 1 kilobyte byte array; do you actually have space for that?
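A minimal sketch of such a bump-pointer allocator in C; the 8-byte alignment and the zero-means-exhausted convention are assumptions for illustration, not details of any particular collector:

```c
#include <stddef.h>
#include <stdint.h>

/* A bump-pointer allocator: allocation is just "advance a pointer",
 * which is fast and keeps contemporaneous allocations adjacent. */
typedef struct {
    uintptr_t hp;    /* next free byte */
    uintptr_t limit; /* one past the end of the free region */
} bump_allocator;

/* Returns the allocated address, or 0 if the region is exhausted
 * (at which point a real collector would trigger a GC). */
uintptr_t bump_alloc(bump_allocator *a, size_t bytes) {
    size_t aligned = (bytes + 7) & ~(size_t)7; /* round up to 8 bytes */
    if (a->limit - a->hp < aligned)
        return 0;
    uintptr_t obj = a->hp;
    a->hp += aligned;
    return obj;
}
```

Note there is no per-allocation bookkeeping at all; the allocator's whole state is two words, which is why this scheme is so cheap.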

Generally speaking, mark-sweep allocators solve this problem via freelist allocation: the allocator has an array of lists of free objects, one for each "size class" (say 2 words, 3 words, and so on up to 16 words, then more sparsely up to the largest allocatable size), and services allocations from the appropriate size class's freelist. This prevents the 1 kB of free space that we need from being "used up" by a 16-byte allocation that could just as well have gone elsewhere. However, freelists prevent objects allocated around the same time from being deterministically placed in nearby memory locations. This increases variance and decreases overall throughput, both for the allocation operations themselves and for pointer-chasing in the course of the program's execution.
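A segregated freelist along these lines might be sketched in C as follows; the class layout here (one exact-fit class per size from 2 to 16 words, larger requests rejected) is a simplification of what real allocators do:

```c
#include <stddef.h>

/* Segregated freelists: class i serves allocations of exactly i words,
 * for i in [MIN_WORDS, MAX_WORDS]; larger requests would go to a
 * sparser set of classes or a large-object path (omitted here). */
enum { MIN_WORDS = 2, MAX_WORDS = 16 };

typedef struct cell { struct cell *next; } cell;
static cell *freelist[MAX_WORDS + 1];

/* Return a recycled cell of the right class, or NULL if that class is
 * empty (a real allocator would then sweep, split a larger cell, or
 * refill from the heap). */
void *freelist_alloc(size_t words) {
    if (words < MIN_WORDS) words = MIN_WORDS;
    if (words > MAX_WORDS) return NULL; /* large-object path not shown */
    cell *c = freelist[words];
    if (c) freelist[words] = c->next;
    return c;
}

/* Sweeping puts dead objects back on their class's list, threading the
 * list through the free memory itself. */
void freelist_free(void *p, size_t words) {
    cell *c = p;
    c->next = freelist[words];
    freelist[words] = c;
}
```

The list nodes live inside the free cells themselves, so the freelists cost no extra memory; the price is the loss of allocation locality described above.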

Also, in a mark-sweep collector, we can still reach a situation where there is enough space on the heap for an allocation, but that free space is broken up into too many pieces: the heap is fragmented. For this reason, many systems that perform mark-sweep collection can also choose to compact, if heuristics show it might be profitable. Because the usual strategy is mark-sweep, though, they still use freelist allocation.

on immix and mark-region

Mark-region collectors are like mark-sweep collectors, except that they do bump-pointer allocation into the holes between survivor objects.

Sounds simple, right? To my mind, though, the fundamental challenge in implementing a mark-region collector is how to handle fragmentation. Let's take a look at how Immix solves this problem.

part of figure 2 from the immix paper

Firstly, Immix partitions the heap into blocks, which might be 32 kB in size or so. No object can span a block. Block size should be chosen to be a nice power-of-two multiple of the system page size, and not so small that common object allocations wouldn't fit. "Large" objects -- greater than 8 kB, for Immix -- go to a separate space that is managed in a different way.

Within a block, Immix divides space into lines -- maybe 128 bytes long. Objects can span lines. Any line that does not contain (a part of) an object that survived the previous collection is part of a hole. A hole is a contiguous span of free lines in a block.

On the allocation side, Immix does bump-pointer allocation into holes. If a mutator doesn't have a hole currently, it scans the current block (obtaining one if needed) for the next hole, via a side table of per-line mark bits: one bit per line. Lines without the mark are in holes. Scanning for holes is fairly cheap, because the line size is not too small. Note that there are per-object mark bits as well; just because you've marked a line doesn't mean that you've traced all objects on that line.

Allocating into a hole has good expected performance as well, as it's bump-pointer, and the minimum size isn't tiny. In the worst case of a hole consisting of a single line, you have 128 bytes to work with. This size is large enough for the majority of objects, given that most objects are small.
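The hole-scanning step can be sketched in C as follows; the block and line sizes are the hypothetical ones from the text, and the side table of line mark bits is modeled as a plain byte array for simplicity:

```c
#include <stdint.h>

/* Immix-style hole scanning, with hypothetical sizes: a 32 kB block
 * divided into 256 lines of 128 bytes, one mark byte per line kept in
 * a side table (a real implementation might pack these into bits). */
enum { LINE_SIZE = 128, LINES_PER_BLOCK = 256 };

/* Find the next hole (a contiguous run of unmarked lines) at or after
 * line `start`. Writes the hole's extent to *begin/*end (end is
 * exclusive) and returns 1, or returns 0 if the rest of the block is
 * fully marked. The mutator then bump-allocates within [begin, end). */
int next_hole(const uint8_t line_marks[LINES_PER_BLOCK], int start,
              int *begin, int *end) {
    int i = start;
    while (i < LINES_PER_BLOCK && line_marks[i]) i++;   /* skip marked */
    if (i == LINES_PER_BLOCK) return 0;
    *begin = i;
    while (i < LINES_PER_BLOCK && !line_marks[i]) i++;  /* extend hole */
    *end = i;
    return 1;
}
```

The scan is a linear walk over at most 256 bytes of side table per block, which is why finding the next hole is cheap relative to the bump allocations it enables.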

mitigating fragmentation

Immix still has some challenges regarding fragmentation. There is some waste, in that a single (piece of an) object can keep a line marked, stranding any free space on that line. Also, when an object can't fit into a hole, any space left in that hole is lost, at least until the next collection. This loss could also occur for the next hole, and the next, and so on until Immix finds a hole that's big enough. In a mark-sweep collector with lazy sweeping, these free extents could instead be placed on freelists and used when needed, but in Immix there is no such facility (by design).

One mitigation for fragmentation risks is "overflow allocation": when allocating an object larger than a line (a medium object), and Immix can't find a hole before the end of the block, Immix allocates into a completely free block. So actually mutator threads allocate into two blocks at a time: one for small objects and medium objects if possible, and the other for medium objects when necessary.
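A sketch of overflow allocation in C, with hypothetical region sizes and the medium-object threshold set at one line, as in the text:

```c
#include <stddef.h>
#include <stdint.h>

/* Each mutator keeps two bump regions: one over the current hole, and
 * one carved from a completely free block. Medium objects that don't
 * fit the hole go to the overflow region instead of abandoning the
 * hole's remaining space. */
typedef struct { uintptr_t hp, limit; } region;

typedef struct {
    region hole;      /* small and medium objects, when they fit */
    region overflow;  /* medium objects that miss the current hole */
} mutator;

static uintptr_t bump(region *r, size_t n) {
    if (r->limit - r->hp < n) return 0;
    uintptr_t p = r->hp;
    r->hp += n;
    return p;
}

/* An object bigger than a line is "medium". */
enum { LINE_SIZE = 128 };

uintptr_t immix_alloc(mutator *m, size_t n) {
    uintptr_t p = bump(&m->hole, n);
    if (!p && n > LINE_SIZE)       /* medium object: try overflow block */
        p = bump(&m->overflow, n);
    return p; /* 0 means: scan for the next hole, or get fresh blocks */
}
```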

Another mitigation is that large objects are allocated into their own space, so an Immix space will never be asked to hold objects larger than, say, 8 kB.

The other mitigation is that Immix can choose to evacuate instead of mark. How does this work? Is it worth it?

stw

This question about the practical tradeoffs involving evacuation is the one I wanted to pose when I started this article; I have gotten to the point of implementing this part of Immix and I have some doubts. But, this article is long enough, and my plane is about to land, so let's revisit this on my return flight. Until then, see you later, allocators!

## June 10, 2022

### Intro

I think it’s fair to say that every application developer needs to take care of accessibility at some point. Indeed, even if you use an accessible toolkit to create an application, the app isn’t automatically accessible, simply because a combination of accessible parts is not always accessible itself. In reality, many things can go wrong, and without care the app may become completely inaccessible as complexity increases.

If you’re a web developer, then you’re (hopefully) already familiar with WCAG, ARIA and other cool stuff that helps address accessibility issues in web apps. If you are a platform developer, such as an Android or iOS developer, then you probably know a bazillion tricks to keep mobile apps accessible. You might also know how to run the app under screen readers and other assistive technology software to ensure everything goes smoothly and works as expected. And you might already have started thinking of how you can use automated testing to cover all the accessibility caveats, to avoid regressions and to reduce the overhead of manual testing.

So let’s talk about automated accessibility testing. There’s no universal solution that embraces every platform and every single case. However, there are many existing approaches, some better than others, and some fairly solid. Having said that, looking at the diversity of the existing solutions and realizing how much testing people still do manually (the AAM specs, such as the ARIA or HTML accessibility API mappings, are a great example of it), I think it’d be amazing to systemize the existing techniques and come up with a better, universal solution that covers the majority of cases. I hope this post can help to find the right way to do this.

### What to test

First things first: what is the scope of automated accessibility testing, i.e. what exactly do we want to test?

### Web

Without a doubt, the web is vast and complex and plays a significant role in our lives. It surely has to be accessible. But what is an accessible web, exactly?

The main conductors on the web are web browsers. They make up the web platform by providing all of the tiny building blocks, such as HTML elements to create web content or ARIA attributes to define semantics. Browsers deliver web content to the users by rendering it on a screen and expose its semantics to assistive technologies such as screen readers. All these blocks must be accessible. In particular, that means the browsers are responsible for exposing all the building blocks to the assistive technologies correctly. It’s very tempting to say that if a browser is doing the job well, then a web author cannot go wrong by using these blocks (as long as the web author is not doing anything strange on purpose). That sounds about right and works nicely in theory. But in practice, accessibility issues suddenly pop up as the complexity of a web app goes up.

### Browsers

Browsers are certainly the major use case in web accessibility testing, and I’d like to get them covered, simply because the web cannot be accessible without accessible browsers. Also, they already do a decent job of accessibility testing, and we can learn a lot from them. Indeed, they’re all stuffed with accessibility testing solutions, and each has its own test harness for automating the process. Their systems could be unified and adjusted to a broader range of uses.

### Web apps

The web applications are the second piece of the web puzzle. They also must be covered.

Web apps are made up of small, accessible building blocks (if the browser does a good job). However, as I previously noted, it is not sufficient to have individually accessible parts: it is necessary that all the combinations of parts are accessible too. It may sound quite obvious, since QA and end-to-end testing weren’t invented yesterday, but this is something overlooked quite often. So this is yet one more use case on the table: the overall, integrated web application’s accessibility.

### Platform

Although the web is vital, there’s a sizable market of desktop and mobile applications which also need to get accessibility tested. Some desktop/mobile apps are using embedded browsers under-the-hood as a rendering engine, which brings them to the web scope. But speaking generally, desktop/mobile apps are not the web.

Having said that, it’s worth noting that browsers and desktop/mobile applications coexist in the same environment, and they use the very same platform accessibility APIs to expose content to the assistive technologies. This gives web and desktop/mobile applications more in common than people usually tend to think. I’m from the browser world and I keep my focus on web accessibility, as many of you probably do, but let’s keep desktop and mobile applications in mind as one more use case. We will have them covered as well.

### How to test

The next question to address is how to test accessibility, or how to ensure that the application is accessible. You could test your entire application with a screen reader or a magnifier on your own, and that would be pretty trustworthy, but what exactly should you test when it comes to automated testing?

### Unit testing

Unit testing is the first level of automated testing, and accessibility is no exception. It allows you to perform very low-level testing, such as testing individual C++ classes or JS modules, or testing internal things that are never disclosed and can never be tested via any public API, as they are purely under-the-hood. As a result, unit testing is a critical component of automated testing.

However, it has nothing to do with accessibility specifically, and there’s nothing to generalize or systemize for the benefit of accessibility. It’s simply something that all systems must have.

### Accessibility APIs mappings

When it comes to automated accessibility testing, the first and probably the best thing to start with is testing the accessibility APIs. Accessibility APIs are the universal language that any accessible application can be expressed in. If the accessibility properties of the UI look about right, then you have a good chance the app is accessible. By the way, this is the most common strategy in accessibility testing in web browsers today.

To give an example of how accessibility API testing can be used in practice, you can think of the AAM web specifications. These specs define how ARIA attributes or HTML/SVG elements should be mapped to the accessibility APIs. This is a somewhat restrictive example of the capabilities of accessibility API testing, because the AAM specs utilize only a few of the things peculiar to accessibility APIs (for example, they miss key things like accessible relations, hierarchies or actions), but it’s a good example to get a sense of what kind of things can be tested.

Accessibility APIs have a complex nature. They differ from platform to platform, but they share many traits in common. That’s not surprising at all, because they are all designed to expose the same thing: the semantics of a user interface.

Here are the main categories you can find in many accessibility APIs.

• States and properties: examples are the focused or selected states on a menu item, or the accessible name/description properties.
• Methods or interfaces allow retrieving complex information about accessible elements, for example about list or grid controls.
• The relation concept is a key one. It defines how accessible elements relate to each other. In particular, relations are used to navigate content and/or to get extra information about elements, such as the labels for a control or the headers for a table cell.
• Accessible trees are a special case of accessible relations, although they are often defined as a separate entity. The accessible tree is a hierarchical representation of the content, and the ATs frequently rely on it.
• As a rule of thumb, there is also special support for text and selection.
• There are also accessible actions, for example clicking a button or expanding/collapsing a dropdown.
• Accessible events are used to notify the assistive technologies about app changes.

This is a typical, if not comprehensive, list of what can (or should) be covered when it comes to platform accessibility API testing.

Accessibility API testing is fairly low-level, which may or may not be a good thing depending on the use case. I would say it’s nearly perfect for testing browsers, for example how they expose HTML elements and such. It should also be the right choice for different kinds of toolkits that provide you with a set of building blocks, for example extra controls on top of HTML5 elements. I’m not quite confident that this type of testing is exactly what web developers look for when it comes to web app accessibility testing, because it’s fairly low-level and requires an understanding of under-the-hood things, but they can certainly benefit from checking accessibility properties on certain elements/structures, for example for web components.

### Assistive Technologies testing

Assistive technology (AT) testing makes up another layer of automated testing. Unlike accessibility API testing, this is a great way to ensure your app is spoken exactly as you want by different screen readers, or that it gets zoomed properly when it comes to screen magnifiers. Any application, including both browsers and web apps, can benefit from integration-level AT testing by running individual control elements or separate UI blocks through the ATs.

AT testing can also be used for end-to-end testing. However, like any end-to-end testing, it cannot be comprehensive because of the steep rise in testing scenarios as app complexity increases. Having said that, it can certainly be helpful to check the most crucial scenarios, and it makes a great addition to integration-level accessibility API testing.

### Testing flow

Testing a single HTML page by checking its accessible tree and/or the relevant accessibility properties makes a nice pattern of atomic testing, quite similar to unit testing. However, the reality is that not every use case can be reduced to a simple static page, which makes such testing quite restrictive. In the real world we need to test accessibility in its dynamics, over a full spectrum of testing scenarios, such as clicking a button and checking what happens next.

This makes another important piece of the puzzle, a testing flow, or in other words, how to control an application to test its dynamics, where you can query accessibility properties at the right time.

A typical test flow can be described in three simple steps:

• Query-n-check: test accessibility properties against expectations.
• Action: change a value or trigger an action: it represents the dynamics part and describes how a test interacts with an application.
• On-hold: wait for an event to hold the test execution until the app gets to the right state.

To summarize, there are two key pieces we need for a testing flow. First, triggering an action: this can be emulation of user actions, triggering accessible actions or changing accessible properties, anything that allows the user to operate and control the application. Secondly, we need the ability to hold test execution until certain criteria are met. That could be an accessible event, the presence of certain accessible properties on an element, or waiting until a certain text becomes visible, anything that puts the test execution on pause until it’s good to proceed.

### What we have now

Let’s take a look at what the market has to offer. I don’t know much about desktop or mobile applications; many of them are not open source, so there’s no chance to sneak a peek. But I think it’d be a good idea to start with the web, and the web browsers in particular, which have made some really good progress on accessibility testing.

### AOM

The first thing I would like to mention is AOM. It’s not a browser, but it’s something closely related. AOM stands for Accessible Object Model: an attempt to expose accessibility properties in browsers in a cross-platform way. Similarly to platform accessibility APIs, AOM defines roles, properties, relations, and an accessible tree. You can think of AOM as the web accessibility API. Having AOM with all the typical accessibility features could make a great platform for cross-browser accessibility testing.

Certain things, like platform-dependent mappings, are certainly not very suitable for AOM. For example, the AAM specs (the accessibility API mappings, which define how to expose the web to the platform accessibility APIs) do not make a great choice for AOM testing. Indeed, despite the platform APIs’ similarities on the conceptual level, they differ in the details. However, if we only care about the logical layer, AOM suits nicely. For example, the ARIA name computation algorithms make a great match for AOM testing, or AOM can be used to test the relations between a label and a control: we don’t worry what the platform name for the relation is, the only thing we want to test is that the right relations are exposed.

Sadly, this is not something that was ever implemented. AOM is a long-term vision, and the main focus until now has been on ARIA reflection, which has been prototyped by a number of browsers by now. But the ARIA reflection spec is just a DOM reflection of ARIA attributes, and thus has somewhat lower testing capabilities. For example, HTML elements that don’t have a proper ARIA mapping cannot be tested by ARIA reflection, nor can any accessibility events.

So AOM is something that has great testing potential but it’s not yet implemented or even specified.

### ATTA

ATTA (Accessible Technology Test Adapter API) was the answer to manual accessibility testing for ARIA and HTML AAM specifications. Roughly speaking ATTA defines a protocol used to describe accessibility properties for a given HTML code snippet to create the test expectations. The expectations are sent to the ATTA server, which queries an ATTA driver for an accessible tree for the given HTML snippet and then checks it against the given expectations.

The ATTA is integrated into the WPT test suite, and it could make a great solution for web accessibility API testing, if it worked. There are implementations of ATTA drivers for IAccessible2, UI Automation and ATK, but apparently none of them ever reached ready-to-use status.

So ATTA has a bunch of worthy ideas, like the built-in WPT integration and the modular system which allows you to connect various platform API drivers, but sadly it was never finished and there is no active work on it any longer.

### Browsers

Let’s take a look at the browsers. Browsers have quite a long history of accessibility support, and they’ve made great progress on accessibility testing.

### Firefox

In Firefox all testing is done in JavaScript. It’s worth noting that Gecko’s accessibility core is a massive system that includes practically every feature of any desktop accessibility API. Gecko has a fairly mature set of cross-platform accessibility interfaces, and because the platform implementations are thin wrappers around the accessibility core, if you test something on the cross-platform layer in Gecko, then you can be confident that it will work well on all platforms. Gecko also exposes native NSAccessibility objects to JavaScript, the same way it does for cross-platform testing. It works nicely and allows one to poke all NSAccessibility attributes and parameterized attributes, as well as listen for accessibility events. This approach is not portable as is, because it relies on a somewhat ancient Netscape-era technology. It could be adapted to work through WebIDL if you wanted to make it portable to other browsers, but it is fairly useless outside the browser world. Nevertheless, it is certainly good to get inspired by.

Here’s an example of a Gecko accessibility test. The test represents a typical scenario for accessibility testing: you get a property, call an action, wait for an event, and then make sure the property value was adjusted properly. You can imagine that cross-platform tests are quite similar to this one.

Gecko implements its own test suite, which provides a number of util functions such as ok() or is() that are responsible for handling and reporting successes or failures. This is the kind of testing system where test expectations are listed explicitly in the test body; in other words, the test says what to test and what the expectations are. As a direct consequence, if you need to change the expectations, you have to adjust the test manually. It’s quite a typical testing system though.
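To illustrate the explicit-expectation style, here is a hypothetical test in that spirit; `ok()` and `is()` are reimplemented from scratch below and merely modeled on what such a harness provides, and the `button` object is invented for illustration:

```javascript
// Minimal stand-ins for harness util functions: each check records a
// failure message instead of throwing, so a run reports all failures.
const failures = [];

function ok(cond, msg) {
  if (!cond) failures.push(msg);
}

function is(actual, expected, msg) {
  ok(actual === expected, `${msg}: got ${actual}, expected ${expected}`);
}

// Expectations live directly in the test body; if the implementation
// changes, these lines must be edited by hand.
const button = { role: "pushbutton", name: "Submit", focusable: true };
is(button.role, "pushbutton", "role");
is(button.name, "Submit", "accessible name");
ok(button.focusable, "button is focusable");
```

Contrast this with the expectation-file approach described below for Chromium, where the expected values live outside the test and can be regenerated.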

### WebKit/Safari

I think it’s fair to say that WebKit, being the engine behind the popular Safari browser, has decent NSAccessibility protocol support, but it also supports ATK and MSAA. The WebKit test suite is rather straightforward and implements its testing capabilities in a cross-platform style: it exposes a helper object to the DOM, and you can query platform-dependent accessible properties/methods from JavaScript. It’s quite similar to what Firefox does.

The test suite itself is quite different from Firefox’s though, which is not surprising. A WebKit test generates output which is compared to expectations stored in files. The test expectations are also listed in the test body, which makes the approach close to Gecko’s.

Similar to Gecko, WebKit also supports event testing; here’s an example of a typical test. It might look bulky for the simple thing it does, but it can all be wrapped in a nice promise-based wrapper which will make the test more readable. The most important thing here is that WebKit also supports testing all parts of a typical accessibility API.

### Chromium

Beyond low-level C++ unit testing, Chromium relies on platform accessibility API testing. It is quite similar to Firefox or WebKit, with one key difference.

Chromium can dump an accessibility tree with the relevant accessibility properties, or can record accessibility events, and it subsequently compares these to expectation files. The main gotcha here is that these tests can be rebaselined easily, unlike other kinds of tests. If something changes at the API level — for example, an accessible tree is spec’d out differently, new properties are added or old ones are removed — then all you need is to rerun the test and capture its output, which becomes the new expectations. It’s as if you take a snapshot that becomes the new standard, and all subsequent runs are matched against it.

Here’s a typical example of a tree test with mac and win expectation files.

Chromium also offers basic scripting capabilities. Those are mainly Mac-targeted though, and thus scoped to NSAccessibility protocol testing. However, they allow testing all bits of the API, including methods, attributes, parameterized attributes, actions and events.

### What would make a great test suite?

Let’s pull things together. Accessibility exists within the context of a platform, which glues an application and the assistive technology together via the accessibility APIs. We want a platform-wide testing solution for testing the platform accessibility APIs. It should be possible to test a variety of things, such as accessible trees, properties, methods and events.

The solution should not be limited to just web browsers. It should be capable of covering web applications running in web browsers. We’d also like to test any application running on the system.

Multiple platforms should be supported such as AT-SPI/ATK on Linux, MSAA/IA2 and UIA on Windows and NSAccessibility protocol on Mac. The solution has to be extensible in case we need to support other platforms in future.

Test flow control should be supported out of the box, such as interacting with an app and then waiting for an accessible event. Another example of such interactions would be flow control directives allowing communication with the assistive technologies. Getting those covered would allow writing end-to-end tests.

Last but not least, easy test rebaselining is a key feature to have. If you ever need to change a test’s expectations, you just rerun the test and record its output. This happens more often than you probably think, and such a test suite lets you adjust expectations with almost zero effort.

### Chromium accessibility tools

Chromium has decent accessibility tools to inspect a platform accessibility tree and to listen to accessibility events. They are available on all major platforms and can be easily ported to other platforms, essentially on any platform Chromium is running on. They are capable of inspecting any application on a system including web applications running in a web browser. All major desktop APIs are supported as well, namely AT-SPI on Linux, MSAA/IA2 and UIA on Windows and NSAccessibility protocol on Mac.

In Chromium these tools are integrated into a test harness to perform platform accessibility testing. The test suite supports test flow instructions and a rebaselining mechanism.

The tools can be beefed up with testing capabilities to become a new test harness, or be easily integrated into existing test suites or CI/CD systems.

The tools are not perfect and are not feature-complete, but they have great potential. Since they are capable of providing all the must-have testing features we discussed above, I think all they need is some love. The tools are open source and anyone can contribute. However, having them as an integral part of the Chromium project makes them seem inherently tied to that project and doesn’t make life easy for new contributors. It’s possible, however, that the tools could eventually get a new home. If new contributors bring fresh vision and new use cases to shape the tools’ features, then this could make a great start for a new open source project which would embrace all innovations, and hopefully solve the long-standing problem of automated accessibility testing.

Is it worth taking a shot?

Are you ready to join the efforts?

## June 06, 2022

### Alejandro Piñeiro

#### Playing with the rpi4 CPU/GPU frequencies

In recent days I have been testing how modifying the default CPU and GPU frequencies on the rpi4 increases the performance of our reference Vulkan applications. By default Raspbian uses 1500 MHz and 500 MHz respectively, but with good heat dissipation (a good fan, the rpi400 heat spreader, etc.) you can play a little with those values.
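For reference, on Raspberry Pi OS these clocks are usually changed in `/boot/config.txt`. The fragment below is only an illustrative sketch of the kind of overclock discussed later in this post; the `over_voltage` value needed varies per board, and good cooling is a must:

```
# Illustrative /boot/config.txt overclock for a Raspberry Pi 4.
# Requires good cooling; over_voltage needs may vary per board.
over_voltage=6
# CPU: 1500 MHz -> 1800 MHz
arm_freq=1800
# GPU-side clocks (including V3D): 500 MHz -> 750 MHz
gpu_freq=750
```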

One of the tools we usually use to check performance changes is gfxreconstruct. This tool allows you to record all the Vulkan calls during the execution of an application, and then replay the captured file. So we have traces of several applications, and we use them to test any hypothetical performance improvement, or to verify that some change doesn’t cause a performance drop.

So, let’s see what we get if we increase the CPU/GPU frequencies, focusing on the Unreal Engine 4 demos, which are the most shader-intensive:

So, as expected, with higher clock speeds we see a good performance boost of ~10 FPS for several of these demos.

Some may wonder why the increase in CPU frequency has so little impact. As I mentioned, we didn’t get those values from the real applications, but from gfxreconstruct traces, which capture only the Vulkan calls. So those replays don’t include tasks like collision detection or user input that are usually handled on the CPU. Also, as mentioned, the Unreal Engine 4 demos all use really complex shaders, so the “bottleneck” there is the GPU.

Let’s move on now from the cold numbers and test the real applications. Let’s start with the Unreal Engine 4 SunTemple demo, using the default CPU/GPU frequencies (1500/500):

Even if it runs fairly smoothly most of the time at ~24 FPS, there are some places where it dips below 18 FPS. Let’s see what happens when we increase the CPU/GPU frequencies to 1800/750:

Now the demo runs at ~34 FPS most of the time, and the worst dip is ~24 FPS. It is a lot smoother than before.

Here is another example with the Unreal Engine 4 Shooter demo, already increasing the CPU/GPU frequencies:

Here the FPS never dips below 34 FPS, staying at ~40 FPS most of the time.

It has been about a year and a half since we announced a Vulkan 1.0 driver for the Raspberry Pi 4, and since then we have made significant performance improvements, mostly in our compiler stack, that have notably improved some of these demos. In some cases (like the Unreal Engine 4 Shooter demo) we got a 50%-60% improvement (if you want more details about the compiler work, you can read about it here).

In this post we can see how, on top of that work, taking advantage of increased CPU and GPU frequencies lets us really start to get reasonable framerates in more demanding demos. Even if this is still at a low resolution (all the demos in this post were running at 640×480), it is still great to see this on a Raspberry Pi.

# Spicy Progress

For a while, we had lots of updates and excitement on our "spicy-sections" proposal - but you've probably noticed it's gotten a little quiet. In this post, I'll explain why and lay out where I think we are and how I hope we can move forward.

A month or so ago, the Tabvengers were working hard to advance some open issues, work out details (including the name 'oui-panelset'), and create a stable, high-parity version of what our actual proposal would look like, so that we could officially hand ownership of it to OpenUI and work from a more official proposal. Jon Neal, in particular, worked very hard to make something pretty great.

However, as is often the case, with more brains and eyeballs comes new scrutiny and feedback. In particular, during accessibility reviews to make sure we hadn't messed something up along the way (we had, because it was a total rewrite that broke all of the tests), we hit a new perspective.

Based on some questions asked during that review about examples of where it could be used, we quickly assembled a series of about 50 links to content in public home pages where we believed our "spicy-sections" (aka "oui-panelset") proposal could be used to great effect, as some supporting history in order to discuss.

In this discussion, one of the co-chairs of the ARIA Working Group suggested that perhaps most of those shouldn't even be using ARIA tabs. Worse, there was some concern that if that was true, then making it exceptionally easy for them to do just that is potentially actively harmful.

Suspend any opinion on what you might think that means, or your own thoughts, for a moment... It's definitely worth slowing down to try to understand, and to take some care. It's already led to a number of interesting conversations. We need to understand "why not?" and try to articulate some specifics. At the moment, we're still working to sort out a lot here, but you can expect more on that soon.

In further discussion we began to recognize that there are distinct "kinds" of interfaces that many people would refer to as tabs. This is somewhat unsurprising, as our research points out that many/most modern UI kits actually have more than one control for tab-like things. My own posts and notes also make sure to note that examples like browser tabs are a different control than the kind that spicy-sections/panelset is talking about - those kinds of tabs are sort of window managers with additional features that even users understand differently. No one would ever expect Ctrl+F, for example, to search all the open windows. We expect close buttons, perhaps status indicators, and the ability to reorder them.

So, the first thing was to introduce some separate terminology for clarity. The things we've been doing with "spicy sections" (and perhaps the sorts of things you'll be able to build with CSS Toggles) could be classified as "content tabs" (I believe it was Sarah Higley who offered the term). Again, this makes sense of past discussions - Hixie said many years ago that "tabs are a matter of overflow". That makes a lot of sense for "content tabs" and absolutely none for something like browser tabs.

## If not that...?

Ok, but... Assume, for the moment, that ARIA tabs are not always appropriate. Assume that we can articulate the right specifics about when they are and are not. It kind of begs the question: "if not that, what?". Because it's not like those pages will cease to have interfaces that generally look and act like that. It's very clear that "content tabs" exist in the wild and have plenty of benefits, and in order to judge whether some other, non-ARIA solution really is better, we need something concrete to compare it with, and something we can do user testing with. If something else is indeed better, we have to be very careful in describing how to do the better thing so we prevent authors from merely falling into a different series of pitfalls - maybe even still give them that element.

But what is that other thing specifically?

### Further thought and experiments

So... I've been thinking about this a lot. I found several examples of "not ARIA tabs" which have very different characteristics, including from an accessibility standpoint. There are a lot of ways we can get these wrong. I've been looking for something without glaring issues.

To this end, and through some discussion with Adam Argyle and Jon Neal, I have created this very rough functional sketch of what I am thinking makes a good springboard for discussing specifics and trying to work through problems. Just like with the original, if your window is wide enough, this will display as 'content tabs' otherwise it will just show plain content.

See the Pen spicy-alternative-sketch by вкαя∂εℓℓ (@briankardell) on CodePen.

As of this posting, please don't use the code in this pen. It needs additional testing and discussion, but perhaps more importantly it is affected by one (or two, depending on your system) bugs in Firefox, being worked on now.

#### What it is...

It is a somewhat reduced version of the current oui-panelset which simply lets you specify only a 'content-tabs' affordance (or not). The markup, parts, etc. are identical. However, instead of realizing this as ARIA tabs, it turns the headings into a TOC of links and uses a scroll-port and scroll-snapping to create the tab-like presentation, based on one of Adam's experiments.

#### What I like about it

Well, it is pretty simple, and it doubles down on all of the original points and observations. It's now not just "similar to scroll" - it is literally scroll. With that come some other interesting good qualities right out of the box: Find-in-page Just Works™. Headings aren't "turned into" tabs; they continue to exist, so screen reader users experience this as normal content and it is navigable by headings. We could probably apply it even to something like a carousel or paging if we're not fighting about the strict semantics of "what it is". You can leave the headings visible or easily hide them (assuming something isn't busted, see below) with a simple rule like

```css
/* you can use a negative margin to hide the headings */
oui-panelset h2 {
  margin-top: -4rem;
}
```


If it turned out to be useful, then this is the proof for CSS Toggles and it makes the job of what they need to accomplish much clearer and easier.
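Concretely, the scroll-based technique can be sketched with a few rules like the following (the class names here are hypothetical, not the actual oui-panelset internals):

```css
/* Minimal sketch of "content tabs" built on scroll snapping.
   The TOC links simply target ids on each section, so activating
   a "tab" is an ordinary fragment navigation that scrolls the
   corresponding panel into view. */
.panels {
  display: flex;
  overflow-x: auto;             /* the scroll-port */
  scroll-snap-type: x mandatory; /* always settle on a panel */
  scroll-behavior: smooth;
}
.panels > section {
  flex: 0 0 100%;               /* one full-width panel per "page" */
  scroll-snap-align: start;
}
```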

### What I am not so sure of...

As I say often: the devil is in the details. This is a crude pen prototype; we have to do a lot of additional questioning, make sure this is resilient, and test and improve it - a lot, probably - to know if it is really viable. For example, currently it 'jumps' the content when you click a link - I'm not sure how much we can avoid this and remain resilient - but maybe.

It raises other new questions too, though: What happens if you do 'select all'? Currently it will literally select everything. Is that right, or should it be limited to the scrollport? And so on. It doesn't 'project' the headings into slots, it 'mirrors' them - as far as I know, that isn't something anything else does. Is it a deal breaker? I don't know!

These 'tabs' are really links and as such they have all of the qualities of regular links. For example, one navigates them with the tab key and they have to manually activate - there is no roving focus or automatic activation. Some people see this as a pro, some as a con. We'll need to do A/B testing.

Finally - most importantly - do I think it is actually 'better'?

To be honest, I'm somewhat unconvinced that there is a "right" answer here. It would not surprise me at all if the result was that we learn that some users prefer tabs to use automatic activation and a roving tab index, and others don't. It would not surprise me at all if some users prefer that all of these present as ARIA tabs, and others are confused by it. I expect that many of these questions have something to do with classes of users and individual preferences.

That said, we've been working to find ways to answer all the problems, and we think we have some ideas.

### What I am hoping

What I am hoping is that having something concrete to discuss and debate, even if it is somewhat sketchy, actually lets us do that.

I would like to see us focus on ::parts and styling specification that is largely shared for both "kinds" of tabs (I think we're actually very close) such that even if they wind up being two different things, we can reuse much learning and code.

I would like to see us figure out how we sort this out via a custom element (or elements) while we don't know the final answer. Perhaps one option would be to add this affordance as an option to our <oui-panelset> proposal, and perhaps even discuss whether it should be the only tabs-like affordance it supports initially. However, I would like to keep the door open to having the current use of ARIA as an option too, since the truth is we don't know yet.

What I think _could_ be ideal is to expose a user preference in browsers and let users pick what they prefer. That way, if we get it wrong by default, remediation is as simple as changing a single attribute value or CSS property.

## May 30, 2022

### Christopher Michael

#### Modesetting: A Glamor-less RPi adventure

The goal of this adventure is to have hardware acceleration for applications when we have Glamor disabled in the X server.

### What is Glamor ?

Glamor is a GL-based rendering acceleration library for the X server that can use OpenGL, EGL, or GBM. It uses GL functions & shaders to complete 2D graphics operations, and uses normal textures to represent drawable pixmaps where possible. Glamor calls GL functions to render to a texture directly and is mostly hardware independent. If the GL rendering cannot complete due to failure (or not being supported), then Glamor will fall back to software rendering (via llvmpipe), which uses framebuffer functions.

### Why disable Glamor ?

On current RPi images like bullseye, Glamor is disabled by default for RPi 1-3 devices. This means that there is no hardware acceleration out of the box. The main reason for not using Glamor on RPi 1-3 hardware is that it uses GPU memory (CMA memory), which is limited to 256MB. If you run out of CMA memory, then the X server cannot allocate memory for pixmaps and your system will crash. RPi 1-3 devices currently use V3D as the render GPU. V3D can only sample from tiled buffers, but it can render to tiled or linear buffers. If V3D needs to sample from a linear buffer, we allocate a shadow buffer, transform the linear image to a tiled layout in the shadow buffer, and sample from that. Any update of the linear texture implies updating the shadow image… and that is SLOW. With Glamor enabled in this scenario, you will quickly run out of CMA memory and crash. This issue is especially apparent if you try launching Chromium in full screen with many tabs open.

### Where has my hardware acceleration gone ?

On RPi 1-3 devices, we default to the modesetting driver from the X server. For those that are not aware, ‘modesetting’ is an Xorg driver for Kernel Modesetting (KMS) devices. The driver supports TrueColor visuals at various framebuffer depths and also supports RandR 1.2 for multi-head configurations. This driver supports all hardware where a KMS device is available and uses the Linux DRM ioctls or dumb buffer objects to create & map memory for applications to use. This driver can be used with Glamor to provide hardware acceleration, however that can lead to the X server crashing as mentioned above. Without Glamor enabled, the modesetting driver cannot do hardware acceleration, and applications will render using software (dumb buffer objects). So how can we get hardware acceleration without Glamor ? Let’s take an adventure into the land of Direct Rendering…

### What is Direct Rendering ?

Direct rendering allows for X client applications to perform 3D rendering using direct access to the graphics hardware. User-space programs can use the DRM API to command the GPU to do hardware-accelerated 3D rendering and video decoding. You may be thinking “Wow, this could solve the problem” and you would be correct. If this could be enabled in the modesetting driver without using Glamor, then we could have hardware acceleration without having to worry about the X server crashing. It cannot be that difficult, right ? Well, as it turns out, things are not so simple. The biggest problem with this approach is that the DRI2 implementation inside the modesetting driver depends on Glamor. DRI2 is a version of the Direct Rendering Infrastructure (DRI). It is a framework comprising the modern Linux graphics stack that allows unprivileged user-space programs to use graphics hardware. The main use of DRI is to provide hardware acceleration for the Mesa implementation of OpenGL. So what approach should be taken ? Do we modify the modesetting driver code to support DRI2 without Glamor ? Is there a better way to get direct rendering without DRI2 ? As it turns out, there is a better way…enter DRI3.

### DRI3 to the rescue ?

The main purpose of the DRI3 extension is to implement a mechanism to share direct-rendered buffers between DRI clients and the X server. With DRI3, clients can allocate the render buffers themselves instead of relying on the X server to do the allocation. DRI3 clients allocate and use GEM buffer objects as rendering targets, while the X server represents these render buffers using a pixmap. After initialization the client doesn’t make any extra calls to the X server, except perhaps in the case of window resizing. Utilizing this method, we should be able to avoid crashing the X server if we run out of memory, right ? Well once again, things are not as simple as they appear to be…

### So using DRI3 & GEM can save the day ?

With GEM, a user-space program can create, handle and destroy memory objects living in GPU memory. When a user-space program needs video memory (for a framebuffer, texture or any other data), it requests the allocation from the DRM driver using the GEM API. The DRM driver keeps track of the used video memory and is able to comply with the request if there is free memory available. You may recall from earlier that the main reason for not using Glamor on RPi 1-3 hardware is that it uses GPU memory (CMA memory), which is limited to 256MB, so how can using DRI3 with GEM help us ? The short answer is “it does not”…at least, not if we utilize GEM.

### Where do we go next ?

Surely there must be a way to have hardware acceleration without using all of our GPU memory ? I am glad you asked because there is a solution that we will explore in my next blog post.