# Planet Igalia

## August 10, 2022

### Javier Fernández

#### New Custom Handlers component for Chrome

The HTML Standard section on Custom Handlers describes the procedure to register custom protocol handlers for URLs with specific schemes. When a URL is followed, the protocol dictates which part of the system handles it by consulting registries. In some cases, handlers may fall to the OS level and and launch a desktop application (eg. mailto:// may launch a native application) or, the browser may redirect the request to a different HTTP(S) URL (eg. mailto:// can go to gmail).

There are multiple ways to invoke the registerProtocolHandler method from the HTML API. The most common is through Browser Extensions, or add-ons, where the user must install third-party software to add or modify a browser’s feature. However, this approach puts on the users the responsibility of ensuring the extension is secure and that it respects their privacy. Ideally, downstream browsers could ship built-in protocol handlers that take this load off of users, but this was previously difficult.

During the last years Igalia and Protocol Labs have been working on improving the support of this HTML API in several of the main web engines, with the long term goal of getting the IPFS protocol a first-class member inside the most relevant browsers.

In this post I’m going to explain the most recent improvements, especially in Chrome, and some side advantages for Chromium embedders that we got on the way.

## Componentization of the Custom Handler logic

The first challenge faced by Chromium-based browsers wishing to add support for a new protocol was the architecture. Most of the logic and relevant codebase lived in the //chrome layer. Thus, any intent to introduce a behavior change on this logic would require a direct fork and patch the //chrome layers with the new logic. The design wasn’t defined to allow Chrome embedders to extend it with new features

The Chromium project provides a Public Content API to allow embedders reusing a great part of the browser’s common logic and to implement, if needed, specific behavior though the specialization of these APIs. These would seem to be the ideal way of extending the browser capabilities to handle new protocols. However, accessing //chrome from any layer (eg. //content or //components) is forbidden and constitutes a layering violation.

After some discussion with Chrome engineers (special thanks to Kuniko Yasuda and Colin Blundell) we decided to move the Protocol Handlers logic to the //components layer and create a new Custom Handlers component.

The following diagram provides a high-level overview of the architectural change:

The new component makes the Protocol Handler logic Chrome independent, bringing us the possibility of using the Content Public API to introduce the mentioned behavior changes, as we’ll see later in this post.

Only the registry’s Delegate and the Factory classes will remain in the //chrome layer.

The Delegate class was defined to provide OS integration, like setting as default browser, which needs to access the underlying operating systems interfaces. It also deals with other browser-specific logic, like Profiles. With the new componentized design, embedders may provide their own Delegate implementations with different strategies to interact with the operating system’s desktop or even support additional flavors.

On the other hand, the Factory (in addition to the creation of the appropriated Delegate’s instance) provides a BrowserContext that allows the registry to operate under Incognito mode.

Lets now get a deeper look into the component, trying to understand the responsibilities of each class. The following class diagram illustrates the multi-process architecture of the Custom Handler logic:

The ProtocolHandlerRegistry class has been modified to allow operating without user preferences storage; this is particularly useful to move most of the browser tests to the new component and avoid the dependency with the prefs and user_prefs components (eg. testing). In this scenario the registered handlers will be stored in memory only.

It’s worth mentioning that as part of this architectural change, I’ve applied some code cleanup and refactoring, following the policies dictated in the base::Value Code Health task-force. Some examples:

The ProtocolHandlerThrottle implements URL redirection with the handler provided by the registry. The ChromeContentBrowserClient and ShellBrowserContentClient classes use it to implement the CreateURLLoaderThrottles method of the ContentBrowserClient interface.

The specialized request is the responsible for handling the permission prompt dialog response, except for the the content-shell implementations, which bypass the Permission APIs (more on this later). I’ve also been working on a new design that relies on the PermissionContextBase::CreatePermissionRequest interface, so that we could create our own custom request and avoid the direct call to the PermissionRequestManager. Unfortunately this still needs some additional support to deal with the RegisterProtocolHandlerPermissionRequest‘s specific parameters needed for creating new instances, but I won’t extend on this, as it deserves its own post.

Finally, I’d like to remark that the new component provides a simple registry factory, designed mainly for testing purposes. It’s basically the same as the one implemented in the //chrome layer except some limitations of the new component:

• No incognito mode
• Use of a mocking Delegate (no actual OS interaction)

## Single-Point security checks

I also wanted to apply some refactoring as well to improve the inter-process consistency and increase the amount of shared code between the renderer and browser process.

One of the most relevant parts of this refactoring was the one applied to the implement the Custom Handlers’ normalization process, which goes from some URL’s syntax validation to the security and privacy checks. I’ve landed some patches (CL#3507332, CL#3692750) to implement functions that are shared among the renderer and browser processes to implement the mentioned normalization procedure.

The code has been moved to blink/common so that it could be invoked also by the new Custom Handlers component described before. Both the WebContentsImpl (run by the Browser process) and the NavigatorContentUtils (by the Renderer process) rely on this common code now.

Additionally, I have moved all the security and privacy checks from the Browser class, in Chrome, to the WebContentsImpl class. This implies that any embedder that implements the WebContentsDelegate interface shouldn’t need to bother about these checks, assuming that any ProtocolHandler instance is valid.

# Conclusion

In the introduction I mentioned that the main goal of this refactoring was to decouple the Custom Handlers logic from the Chrome’s specific codebase; this would allow us to modify or even extend the implementation of the Custom Handlers APIs.

The componentization of some parts of the Chrome’s codebase has many advantages, as it’s described in the Browser Components design document. Additionally, a new Custom Handlers component would provide other interesting advantages on different fronts.

One of the most interesting use cases is the definition of Web Platform Tests for the Custom Handlers APIs. This goal has 2 main challenges to address:

• Testing Automation
• Content Shell support

Finally, we would like this new component to be useful for Chrome embedders, and even browsers built directly on top of a full featured Chrome, to implement new Custom Handlers related features. This line of work would eventually be useful to improve the support of distributed protocols like IPFS, as an intermediate step towards a native implementation of the protocols in the browser.

## August 09, 2022

### Eric Meyer

#### Recreating “The Effects of Nuclear Weapons” for the Web

In my previous post, I wrote about a way to center elements based on their content, without forcing the element to be a specific width, while preserving the interior text alignment. In this post, I’d like to talk about why I developed that technique.

Near the beginning of this year, fellow Web nerd and nuclear history buff Chris Griffith mentioned a project to put an entire book online: The Effects of Nuclear Weapons by Samuel Glasstone and Philip J. Dolan, specifically the third (1977) edition. Like Chris, I own a physical copy of this book, and in fact, the information and tools therein were critical to the creation of HYDEsim, way back in the Aughts. I acquired it in pursuit of my degree in History, for which I studied the Cold War and the policy effects of the nuclear arms race, from the first bombers to the Strategic Defense Initiative.

I was immediately intrigued by the idea and volunteered my technical services, which Chris accepted. So we started taking the OCR output of a PDF scan of the book, cleaning up the myriad errors, re-typing the bits the OCR mangled too badly to just clean up, structuring it all with HTML, converting figures to PNGs and photos to JPGs, and styling the whole thing for publication, working after hours and in odd down times to bring this historical document to the Web in a widely accessible form. The result of all that work is now online.

That linked page is the best example of the technique I wrote about in the aforementioned previous post: as a Table of Contents, none of the lines actually get long enough to wrap. Rather than figuring out the exact length of the longest line and centering based on that, I just let CSS do the work for me.

There were a number of other things I invented (probably re-invented) as we progressed. Footnotes appear at the bottom of pages when the footnote number is activated through the use of the :target pseudo-class and some fixed positioning. It’s not completely where I wanted it to be, but I think the rest will require JS to pull off, and my aim was to keep the scripting to an absolute minimum.

I couldn’t keep the scripting to zero, because we decided early on to use MathJax for the many formulas and other mathematical expressions found throughout the text. I’d never written LaTeX before, and was very quickly impressed by how compact and yet powerful the syntax is.

Over time, I do hope to replace the MathJax-parsed LaTeX with raw MathML for both accessibility and project-weight reasons, but as of this writing, Chromium lacks even halfway-decent MathML support, so we went with the more widely-supported solution. (My colleague Frédéric Wang at Igalia is pushing hard to fix this sorry state of affairs in Chromium, so I do have hopes for a migration to MathML… some day.)

The figures (as distinct from the photos) throughout the text presented an interesting challenge. To look at them, you’d think SVG would be the ideal image format. Had they come as vector images, I’d agree, but they’re raster scans. I tried recreating one or two in hand-crafted SVG and quickly determined the effort to create each was significant,  and really only worked for the figures that weren’t charts, graphs, or other presentations of data. For anything that was a chart or graph, the risk of introducing inaccuracies was too high, and again, each would have required an inordinate amount of effort to get even close to correct. That’s particularly true considering that without knowing what font face was being used for the text labels in the figures, they’d have to be recreated with paths or polygons or whatever, driving the cost-to-recreate astronomically higher.

So I made the figures PNGs that are mostly transparent, except for the places where there was ink on the paper. After any necessary straightening and some imperfection cleanup in Acorn, I then ran the PNGs through the color-index optimization process I wrote about back in 2020, which got them down to an average of 75 kilobytes each, ranging from 443KB down to 7KB.

At the 11th hour, still secretly hoping for a magic win, I ran them all through svgco.de to see if we could get automated savings. Of the 161 figures, exactly eight of them were made smaller, which is not a huge surprise, given the source material. So, I saved those eight for possible future updates and plowed ahead with the optimized PNGs. Will I return to this again in the future? Probably. It bugs me that the figures could be better, and yet aren’t.

It also bugs me that we didn’t get all of the figures and photos fully described in alt text. I did write up alternative text for the figures in Chapter I, and a few of the photos have semi-decent captions, but this was something we didn’t see all the way through, and like I say, that bugs me. If it also bugs you, please feel free to fork the repository and submit a pull request with good alt text. Or, if you prefer, you could open an issue and include your suggested alt text that way. By the image, by the section, by the chapter: whatever you can contribute would be appreciated.

Those image captions, by the way? In the printed text, they’re laid out as a label (e.g., “Figure 1.02”) and then the caption text follows. But when the text wraps, it doesn’t wrap below the label.  Instead, it wraps in its own self-contained block instead, with the text fully justified except for the last line, which is centered. Centered! So I set up the markup and CSS like this:

<figure>
<figcaption>
<span>Figure 1.02.</span> <span>Effects of a nuclear explosion.</span>
</figcaption>
</figure>
figure figcaption {
display: grid;
grid-template-columns: max-content auto;
gap: 0.75em;
justify-content: center;
text-align: justify;
text-align-last: center;
}

Oh CSS Grid, how I adore thee. And you too, CSS box alignment. You made this little bit of historical recreation so easy, it felt like cheating.

Some other things weren’t easy. The data tables, for example, have a tendency to align columns on the decimal place, even when most but not all of the numbers are integers. Long, long ago, it was proposed that text-align be allowed a string value, something like text-align: '.', which you could then apply to a table column and have everything line up on that character. For a variety of reasons, this was never implemented, a fact which frosts my windows to this day. In general, I mean, though particularly so for this project. The lack of it made keeping the presentation historically accurate a right pain, one I may get around to writing about, if I ever overcome my shame.

There are two things about the book that we deliberately chose not to faithfully recreate. The first is the font face. My best guess is that the book was typeset using something from the Century family, possibly Century Schoolbook (the New version of which was a particular favorite of mine in college). The very-widely-installed Cambria seems fairly similar, at least to my admittedly untrained eye, and furthermore was designed specifically for screen media, so I went with body text styling no more complicated than this:

body {
font: 1em/1.35 Cambria, Times, serif;
hyphens: auto;
}

I suppose I could have tracked down a free version of Century and used it as a custom font, but I couldn’t justify the performance cost in both download and rendering speed to myself and any future readers. And the result really did seem close enough to the original to accept.

The second thing we didn’t recreate is the printed-page layout, which is two-column. That sort of layout can work very well on the book page; it almost always stinks on a Web page. Thus, the content of the book is rendered online in a single column. The exceptions are the chapter-ending Bibliography sections and the book’s Index, both of which contain content compact and granular enough that we could get away with the original layout.

There’s a lot more I could say about how this style or that pattern came about, and maybe someday I will, but for now let me leave you with this: all these decisions are subject to change, and open to input. If you come up with a superior markup scheme for any of the bits of the book, we’re happy to look at pull requests or issues, and to act on them. It is, as we say in our preface to the online edition, a living project.

We also hope that, by laying bare the grim reality of these horrific weapons, we can contribute in some small way to making them a dead and buried technology.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## August 07, 2022

### Andy Wingo

#### coarse or lazy?

sweeping, coarse and lazy

One of the things that had perplexed me about the Immix collector was how to effectively defragment the heap via evacuation while keeping just 2-3% of space as free blocks for an evacuation reserve. The original Immix paper states:

To evacuate the object, the collector uses the same allocator as the mutator, continuing allocation right where the mutator left off. Once it exhausts any unused recyclable blocks, it uses any completely free blocks. By default, immix sets aside a small number of free blocks that it never returns to the global allocator and only ever uses for evacuating. This headroom eases defragmentation and is counted against immix's overall heap budget. By default immix reserves 2.5% of the heap as compaction headroom, but [...] is fairly insensitive to values ranging between 1 and 3%.

To Immix, a "recyclable" block is partially full: it contains surviving data from a previous collection, but also some holes in which to allocate. But when would you have recyclable blocks at evacuation-time? Evacuation occurs as part of collection. Collection usually occurs when there's no more memory in which to allocate. At that point any recyclable block would have been allocated into already, and won't become recyclable again until the next trace of the heap identifies the block's surviving data. Of course after the next trace they could become "empty", if no object survives, or "full", if all lines have survivor objects.

In general, after a full allocation cycle, you don't know much about the heap. If you could easily know where the live data and the holes were, a garbage collector's job would be much easier :) Any algorithm that starts from the assumption that you know where the holes are can't be used before a heap trace. So, I was not sure what the Immix paper is meaning here about allocating into recyclable blocks.

Thinking on it again, I realized that Immix might trigger collection early sometimes, before it has exhausted the previous cycle's set of blocks in which to allocate. As we discussed earlier, there is a case in which you might want to trigger an early compaction: when a large object allocator runs out of blocks to decommission from the immix space. And if one evacuating collection didn't yield enough free blocks, you might trigger the next one early, reserving some recyclable and empty blocks as evacuation targets.

when do you know what you know: lazy and eager

Consider a basic question, such as "how many bytes in the heap are used by live objects". In general you don't know! Indeed you often never know precisely. For example, concurrent collectors often have some amount of "floating garbage" which is unreachable data but which survives across a collection. And of course you don't know the difference between floating garbage and precious data: if you did, you would have collected the garbage.

Even the idea of "when" is tricky in systems that allow parallel mutator threads. Unless the program has a total ordering of mutations of the object graph, there's no one timeline with respect to which you can measure the heap. Still, Immix is a stop-the-world collector, and since such collectors synchronously trace the heap while mutators are stopped, these are times when you can exactly compute properties about the heap.

Let's retake the question of measuring live bytes. For an evacuating semi-space, knowing the number of live bytes after a collection is trivial: all survivors are packed into to-space. But for a mark-sweep space, you would have to compute this information. You could compute it at mark-time, while tracing the graph, but doing so takes time, which means delaying the time at which mutators can start again.

Alternately, for a mark-sweep collector, you can compute free bytes at sweep-time. This is the phase in which you go through the whole heap and return any space that wasn't marked in the last collection to the allocator, allowing it to be used for fresh allocations. This is the point in the garbage collection cycle in which you can answer questions such as "what is the set of recyclable blocks": you know what is garbage and you know what is not.

Though you could sweep during the stop-the-world pause, you don't have to; sweeping only touches dead objects, so it is correct to allow mutators to continue and then sweep as the mutators run. There are two general strategies: spawn a thread that sweeps as fast as it can (concurrent sweeping), or make mutators sweep as needed, just before they allocate (lazy sweeping). But this introduces a lag between when you know and what you know—your count of total live heap bytes describes a time in the past, not the present, because mutators have moved on since then.

For most collectors with a sweep phase, deciding between eager (during the stop-the-world phase) and deferred (concurrent or lazy) sweeping is very easy. You don't immediately need the information that sweeping allows you to compute; it's quite sufficient to wait until the next cycle. Moving work out of the stop-the-world phase is a win for mutator responsiveness (latency). Usually people implement lazy sweeping, as it is naturally incremental with the mutator, naturally parallel for parallel mutators, and any sweeping overhead due to cache misses can be mitigated by immediately using swept space for allocation. The case for concurrent sweeping is less clear to me, but if you have cores that would otherwise be idle, sure.

eager coarse sweeping

Immix is interesting in that it chooses to sweep eagerly, during the stop-the-world phase. Instead of sweeping irregularly-sized objects, however, it sweeps over its "line mark" array: one byte for each 128-byte "line" in the mark space. For 32 kB blocks, there will be 256 bytes per block, and line mark bytes in each 4 MB slab of the heap are packed contiguously. Therefore you get relatively good locality, but this just mitigates a cost that other collectors don't have to pay. So what does eager marking over these coarse 128-byte regions buy Immix?

Firstly, eager sweeping buys you eager identification of empty blocks. If your large object space needs to steal blocks from the mark space, but the mark space doesn't have enough empties, it can just trigger collection and then it knows if enough blocks are available. If no blocks are available, you can grow the heap or signal out-of-memory. If the lospace (large object space) runs out of blocks before the mark space has used all recyclable blocks, that's no problem: evacuation can move the survivors of fragmented blocks into these recyclable blocks, which have also already been identified by the eager coarse sweep.

Without eager empty block identification, if the lospace runs out of blocks, firstly you don't know how many empty blocks the mark space has. Sweeping is a kind of wavefront that moves through the whole heap; empty blocks behind the wavefront will be identified, but those ahead of the wavefront will not. Such a lospace allocation would then have to either wait for a concurrent sweeper to advance, or perform some lazy sweeping work. The expected latency of a lospace allocation would thus be higher, without eager identification of empty blocks.

Secondly, eager sweeping might reduce allocation overhead for mutators. If allocation just has to identify holes and not compute information or decide on what to do with a block, maybe it go brr? Not sure.

lines, lines, lines

The original Immix paper also notes a relative insensitivity of the collector to line size: 64 or 256 bytes could have worked just as well. This was a somewhat surprising result to me but I think I didn't appreciate all the roles that lines play in Immix.

Obviously line size affect the worst-case fragmentation, though this is mitigated by evacuation (which evacuates objects, not lines). This I got from the paper. In this case, smaller lines are better.

Line size affects allocation-time overhead for mutators, though which way I don't know: scanning for holes will be easier with fewer lines in a block, but smaller lines would contain more free space and thus result in fewer collections. I can only imagine though that with smaller line sizes, average hole size would decrease and thus medium-sized allocations would be harder to service. Something of a wash, perhaps.

However if we ask ourselves the thought experiment, why not just have 16-byte lines? How crazy would that be? I think the impediment to having such a precise line size would mainly be Immix's eager sweep, as a fine-grained traversal of the heap would process much more data and incur possibly-unacceptable pause time overheads. But, in such a design you would do away with some other downsides of coarse-grained lines: a side table of mark bytes would make the line mark table redundant, and you eliminate much possible "dark matter" hidden by internal fragmentation in lines. You'd need to defer sweeping. But then you lose eager identification of empty blocks, and perhaps also the ability to evacuate into recyclable blocks. What would such a system look like?

Readers that have gotten this far will be pleased to hear that I have made some investigations in this area. But, this post is already long, so let's revisit this in another dispatch. Until then, happy allocations in all regions.

## August 04, 2022

### Igalia Compilers Team

#### Igalia’s Compilers Team in 2022H1

As we enter the second half of 2022, we’d like to provide a summary (necessarily highly condensed and selective!) of what we’ve been up to recently, providing some insight into the breadth of technical challenges our team of over 20 compiler engineers has been tackling.

## Low-level JS / JSC on 32-bit systems

We have continued to maintain support for 32-bit systems (mainly ARMv7, but also MIPS) in JavaScriptCore (JSC). The work involves continuous tracking of upstream development to prevent regressions as well as the development of new features:

• A major milestone for this has been the completion of support for WebAssembly in the Low-Level Interpreter (LLInt) for ARMv7. The MIPS support is mostly complete.
• Developed an initial prototype of the concurrency compilation in the DFG tier for 32-bit systems, and the results are promising. The work continues, and we expect to upstream it in 2022H2.
• Code reduction and optimizations: we upstreamed several code reductions and optimizations for 32-bit systems (mainly ARMv7): 25% size reduction in DFGOSRExit blocks, 24% in baseline JIT on JetStream2 and 25% code size reduction from porting EXTRA_CTI_THUNKS.
• Improved our hardware testing infrastructure with more MIPS and faster ARMv7 hardware for the buildbots running in the EWS (Early Warning System), which allows for a smaller response time for regressions.
• Deployed two fuzzing bots that run test JSC 24/7. The bots already found a few issues upstream that we reported to Apple. The bugs that affect 64-bit systems were fixed by the team at Apple, while we are responsible for fixing the ones affecting 32-bit systems. We expect to work on them in 2022H2.
• Added logic to transparently re-run failing JSC tests (on 32-bit platforms) and declare them a pass if they’re simply flaky, as long as the flakiness does not rise above a threshold. This means fewer false alerts for developers submitting patches to the EWS and for the people doing QA work. Naturally, the flakiness information is stored in the WebKit resultsdb and visualized at results.webkit.org.

## JS and standards

Another aspect of our work is our contribution to the JavaScript standards effort, through involvement in the TC39 standards body, direct contribution to standards proposals, and implementation of those proposals in the major JS engines.

• Further coverage of the Temporal spec in the Test262 conformance suite, as well as various specification updates. See this blog post for insight into some of the challenges tackled by Temporal.
• Performance improvements for JS class features in V8, such as faster initialisations.
• Work towards supporting snapshots in node.js, including activities such as fixing support for V8 startup snapshots in the presence of class field initializers.
• Collaborating with others on the “types as comments” proposal for JS, successfully reaching stage 1 in the TC39 process.
• Implementing ShadowRealm support in WebKit

## WebAssembly

WebAssembly is a low-level compilation target for the web, which we have contributed to in terms of specification proposals, LLVM toolchain modifications, implementation work in the JS engines, and working with customers on use cases both on the server and in web browsers.

Some highlights from the last 6 months include:

• Creation of a proposal for reference-typed strings in WebAssembly to ensure efficient operability with languages like JavaScript. We also landed patches to implement this proposal in V8.
• Prototyping Just-In-Time (JIT) compilation within WebAssembly.
• Working to implement support for WebAssembly GC types in Clang and LLVM (with one important use case being efficient and leak-free sharing of object graphs between JS and C++ compiled to Wasm).
• Implementation of support for GC types in WebKit’s implementation of WebAssembly.

## Events

With in-person meetups becoming possible again, Igalians have been talking on a range of topics – Multi-core Javascript (BeJS), TC39 (JS Nation), RISC-V LLVM (Cambridge RISC-V meetup), and more.

We’ve also had opportunities for much-needed face to face time within the team, with many of the compilers team meeting in Brussels in May, and for the company-wide summit held in A Coruña in June. These events provided a great opportunity to discuss current technical challenges, strategy, and ideas for the future, knowledge sharing, and of course socialising.

## Team growth

Our team has grown further this year, being joined by:

• Nicolò Ribaudo – a core maintainer of BabelJS, continuing work on that project after joining Igalia in June as well as contributing to work on JS modules.
• Aditi Singh – previously worked with the team through the Coding Experience program, joining full time in March focusing on the Temporal project.
• Alex Bradbury – a long-time LLVM developer who joined in March and is focusing on WebAssembly and RISC-V work in Clang/LLVM.

We’re keen to continue to grow the team and actively hiring, so if you think you might be interested in working in any of the areas discussed above, please apply here.

If you’re keen to learn more about how we work at Igalia, a recent article at The New Stack provides a fantastic overview and includes comments from a number of customers who have supported the work described in this post.

### André Almeida

#### Keeping a project bisectable

People write code. Test coverage is never enough. Some angry contributor will disable the CI. And we all write bugs. But that’s OK, it part of the job. Programming is hard and sometimes we may miss a corner case, forget that numbers overflow and all other strange things that computers can do. One easy thing that we can do to help the poor developer that needs to find what changed in the code that stopped their printer to work properly, is to keep the project bisectable.

## July 24, 2022

### Danylo Piliaiev

#### July 2022 Turnip Status Update

There is a steady progress being made since :tada: Turnip is Vulkan 1.1 Conformant :tada:. We now support GL 4.6 via Zink, implemented a lot of extensions, and are close to Vulkan 1.3 conformance.

Support of real-world games is also looking good, here is a video of Adreno 660 rendering “The Witcher 3”, “The Talos Principle”, and “OMD2”:

All of them have reasonable frame rate. However, there was a bit of “cheating” involved. Only “The Talos Principle” game was fully running on the development board (via box64), other two games were only rendered in real time on Adreno GPU, but were ran on x86-64 laptop with their VK commands being streamed to the dev board. You could read about this method in my post “Testing Vulkan drivers with games that cannot run on the target device”.

The video was captured directly on the device via OBS with obs-vkcapture, which worked surprisingly well after fighting a bunch of issues due to the lack of binary package for it and a bit dated Ubuntu installation.

## Zink (GL over Vulkan)

A number of extensions were implemented that are required for Zink to support higher GL versions. As of now Turnip supports OpenGL 4.6 via Zink, and while yet not conformant - only a handful of GL CTS tests are failing. For the perspective, Freedreno (our GL driver for Adreno) supports only OpenGL 3.3.

For Zink adventures and profound post titles check out Mike Blumenkrantz’s awesome blog supergoodcode.com

If you are interested in Zink over Turnip bring up in particular, you should read:

## Low Resolution Z improvements

A major improvement for low resolution Z optimization (LRZ) was recently made in Turnip, read about it in the previous post of mine: LRZ on Adreno GPUs

## Extensions

Anyway, since the last update Turnip supports many more extensions (in no particular order):

For Vulkan 1.3 conformance there are only a few extension left to implement. The only major ones are VK_KHR_dynamic_rendering and VK_EXT_inline_uniform_block required for Vulkan 1.3. VK_KHR_dynamic_rendering is currently being reviewed and foundation for VK_EXT_inline_uniform_block was recently merged.

That’s all for today!

## July 20, 2022

### Andy Wingo

#### unintentional concurrency

Good evening, gentle hackfolk. Last time we talked about heuristics for when you might want to compact a heap. Compacting garbage collection is nice and tidy and appeals to our orderly instincts, and it enables heap shrinking and reallocation of pages to large object spaces and it can reduce fragmentation: all very good things. But evacuation is more expensive than just marking objects in place, and so a production garbage collector will usually just mark objects in place, and only compact or evacuate when needed.

Today's post is more details!

dedication

Just because it's been, oh, a couple decades, I would like to reintroduce a term I learned from Marnanel years ago on advogato, a nerdy group blog kind of a site. As I recall, there is a word that originates in the Oxbridge social environment, "narg", from "Not A Real Gentleman", and which therefore denotes things that not-real-gentlemen do: nerd out about anything that's not, like, fox-hunting or golf; or generally spending time on something not because it will advance you in conventional hierarchies but because you just can't help it, because you love it, because it is just your thing. Anyway, in the spirit of pursuits that are really not What One Does With One's Time, this post is dedicated to the word "nargery".

side note, bis: immix-style evacuation versus mark-compact

In my last post I described Immix-style evacuation, and noted that it might take a few cycles to fully compact the heap, and that it has a few pathologies: the heap might never reach full compaction, and that Immix might run out of free blocks in which to evacuate.

With these disadvantages, why bother? Why not just do a single mark-compact pass and be done? I implicitly asked this question last time but didn't really answer it.

For some people will be, yep, yebo, mark-compact is the right answer. And yet, there are a few reasons that one might choose to evacuate a fraction of the heap instead of compacting it all at once.

The first reason is object pinning. Mark-compact systems assume that all objects can be moved; you can't usefully relax this assumption. Most algorithms "slide" objects down to lower addresses, squeezing out the holes, and therefore every live object's address needs to be available to use when sliding down other objects with higher addresses. And yet, it would be nice sometimes to prevent an object from being moved. This is the case, for example, when you grant a foreign interface (e.g. a C function) access to a buffer: if garbage collection happens while in that foreign interface, it would be nice to be able to prevent garbage collection from moving the object out from under the C function's feet.

Another reason to want to pin an object is because of conservative root-finding. Guile currently uses the Boehm-Demers-Weiser collector, which conservatively scans the stack and data segments for anything that looks like a pointer to the heap. The garbage collector can't update such a global root in response to compaction, because you can't be sure that a given word is a pointer and not just an integer with an inconvenient value. In short, objects referenced by conservative roots need to be pinned. I would like to support precise roots at some point but part of my interest in Immix is to allow Guile to move to a better GC algorithm, without necessarily requiring precise enumeration of GC roots. Optimistic partial evacuation allows for the possibility that any given evacuation might fail, which makes it appropriate for conservative root-finding.

Finally, as moving objects has a cost, it's reasonable to want to only incur that cost for the part of the heap that needs it. In any given heap, there will likely be some data that stays live across a series of collections, and which, once compacted, can't be profitably moved for many cycles. Focussing evacuation on only the part of the heap with the lowest survival rates avoids wasting time on copies that don't result in additional compaction.

(I should admit one thing: sliding mark-compact compaction preserves allocation order, whereas evacuation does not. The memory layout of sliding compaction is more optimal than evacuation.)

multi-cycle evacuation

Say a mutator runs out of memory, and therefore invokes the collector. The collector decides for whatever reason that we should evacuate at least part of the heap instead of marking in place. How much of the heap can we evacuate? The answer depends primarily on how many free blocks you have reserved for evacuation. These are known-empty blocks that haven't been allocated into by the last cycle. If you don't have any, you can't evacuate! So probably you should keep some around, even when performing in-place collections. The Immix papers suggest 2% and that works for me too.

Then you evacuate some blocks. Hopefully the result is that after this collection cycle, you have more free blocks. But you haven't compacted the heap, at least probably not on the first try: not into 2% of total space. Therefore you tell the mutator to put any empty blocks it finds as a result of lazy sweeping during the next cycle onto the evacuation target list, and then the next cycle you have more blocks to evacuate into, and more and more and so on until after some number of cycles you fall below some overall heap fragmentation low-watermark target, at which point you can switch back to marking in place.

I don't know how this works in practice! In my test setups which triggers compaction at 10% fragmentation and continues until it drops below 5%, it's rare that it takes more than 3 cycles of evacuation until the heap drops to effectively 0% fragmentation. Of course I had to introduce fragmented allocation patterns into the microbenchmarks to even cause evacuation to happen at all. I look forward to some day soon testing with real applications.

concurrency

Just as a terminological note, in the world of garbage collectors, "parallel" refers to multiple threads being used by a garbage collector. Parallelism within a collector is essentially an implementation detail; when the world is stopped for collection, the mutator (the user program) generally doesn't care if the collector uses 1 thread or 15. On the other hand, "concurrent" means the collector and the mutator running at the same time.

Different parts of the collector can be concurrent with the mutator: for example, sweeping, marking, or evacuation. Concurrent sweeping is just a detail, because it just visits dead objects. Concurrent marking is interesting, because it can significantly reduce stop-the-world pauses by performing most of the computation while the mutator is running. It's tricky, as you might imagine; the collector traverses the object graph while the mutator is, you know, mutating it. But there are standard techniques to make this work. Concurrent evacuation is a nightmare. It's not that you can't implement it; you can. But it's very very hard to get an overall performance win from concurrent evacuation/copying.

So if you are looking for a good bargain in the marketplace of garbage collector algorithms, it would seem that you need to avoid concurrent copying/evacuation. It's an expensive product that would seem to not buy you very much.

All that is just a prelude to an observation that there is a funny source of concurrency even in some systems that don't see themselves as concurrent: mutator threads marking their own roots. To recall, when you stop the world for a garbage collection, all mutator threads have to somehow notice the request to stop, reach a safepoint, and then stop. Then the collector traces the roots from all mutators and everything they reference, transitively. Then you let the threads go again. Thing is, once you get more than a thread or four, stopping threads can take time. You'd be tempted to just have threads notice that they need to stop, then traverse their own stacks at their own safepoint to find their roots, then stop. But, this introduces concurrency between root-tracing and other mutators that might not have seen the request to stop. For marking, this concurrency can be fine: you are just setting mark bits, not mutating the roots. You might need to add an additional mark pattern that can be distinguished from marked-last-time and marked-the-time-before-but-dead-now, but that's a detail. Fine.

But if you instead start an evacuating collection, the gates of hell open wide and toothy maws and horns fill your vision. One thread could be stopping and evacuating the objects referenced by its roots, while another hasn't noticed the request to stop and is happily using the same objects: chaos! You are trying to make a minor optimization to move some work out of the stop-the-world phase but instead everything falls apart.

Anyway, this whole article was really to get here and note that you can't do ragged-stops with evacuation without supporting full concurrent evacuation. Otherwise, you need to postpone root traversal until all threads are stopped. Perhaps this is another argument that evacuation is expensive, relative to marking in place. In practice I haven't seen the ragged-stop effect making so much of a difference, but perhaps that is because evacuation is infrequent in my test cases.

Zokay? Zokay. Welp, this evening's nargery was indeed nargy. Happy hacking to all collectors out there, and until next time.

### Víctor Jáquez

This is the brief story of the Gamepad implementation in WPEWebKit.

It started with an early development done by Eugene Mutavchi (kudos!). Later, by the end of 2021, I retook those patches and dicussed them with my fellow igalian Adrián, and we decided to come with a slightly different approach.

Before going into the details, let’s quickly review the WPE architecture:

1. cog library — it’s a shell library that simplifies the task of writing a WPE browser from the scratch, by providing common functionality and helper APIs.
2. WebKit library — that’s the web engine that, given an URI and other following inputs, returns, among other ouputs, graphic buffers with the page rendered.
3. WPE library — it’s the API that bridges cog (1) (or whatever other browser application) and WebKit (2).
4. WPE backend — it’s main duty is to provide graphic buffers to WebKit, buffers supported by the hardware, the operating system, windowing system, etc.

Eugene’s implementation has code in WebKit (implementing the gamepad support for WPE port); code in WPE library with an API to communicate WebKit’s gamepad and WPE backend, which provided a custom implementation of gamepad, reading directly the event in the Linux device. Almost everything was there, but there were some issues:

• WPE backend is mainly designed as a set of protocols, similar to Wayland, to deal with graphic buffers or audio buffers, but not for input events. Cog library is the place where input events are handled and injected to WebKit, such as keyboard.
• The gamepad handling in a WPE backend was ad-hoc and low level, reading directly the events from Linux devices. This approach is problematic since there are plenty gamepads in the market and each has its own axis and buttons, so remapping them to the standard map is required. To overcome this issue and many others, there’s a GNOME library: libmanette, which is already used by WebKitGTK port.

Today’s status of the gamepad support is that it works but it’s not yet fully upstreamed.

• merged ">">libwpe pull request.
• cog pull request — there are two implementations: none and libmanette. None is just a dummy implementation which will ignore any request for a gamepad provider; it’s provided if libmanette is not available or if available libwpe hasn’t gamepad support.
• WebKit pull request.

To prove you all that it works my exhibit A is this video, where I play asteroids in a RasberryPi 4 64 bits:

The image was done with buildroot, using its master branch (from a week ago) with a bunch of modifications, such as adding libmanette, a kernel patch for my gamepad device, kernel 5.15.55 and its corresponding firmware, etc.

## July 19, 2022

### Manuel Rego

#### Some highlights of the Web Engines Hackfest 2022

Last month Igalia arranged a new edition of the Web Engines Hackfest in A Coruña (Galicia, Spain), where brought together more than 70 people working on the web platform during 2 days, with lots of discussions and conversations around different features.

This was my first onsite event since “before times”, it was amazing seeing people for real after such a long time, meeting again some old colleagues and also a bunch of people for the first time. Being an organizer of the event meant that they were very busy days for me, but it looks like people were happy with the result and enjoyed the event quite a lot.

This is a brief post about my personal highlights during the event.

## Talks

During the hackfest we had an afternoon with 5 talks, the talks were live streamed on YouTube and people could follow them remotely and also ask questions through the event matrix channel.

Leo Balter’s Talk

• Leo Balter talked about how Salesforce participates on the web platform as partner, working with browsers and web standards.
I really liked this talk, because it explains how companies that use the web platform, can collaborate and have a direct impact on the evolution of the web. And there are many ways to do that: reporting bugs that affect your company, explaining use cases that are important for you and the things you miss from the platform, providing feedback about different features, looking for partners to fix outstanding issues or add support for new stuff, etc.
Igalia has been showing during the last decade that there’s a way to have an impact on the web platform outside of the big companies and browser vendors. Thanks to our position on the different communities, we can help companies to push features they’re interested in and that would benefit the entire web platform in the future.
• Dominik Röttsches gave a talk about COLRv1 fonts giving details on Chromium implementation and the different open-source software components involved.
This new font format allows to do really amazing things and Dominik showed how to create a Galician emoji font with popular things like Tower of Hercules or Polbo á feira. With some early demos on variable COLRv1 and the beginnings of the first Galician emoji font…
• Daniel Minor explained the work done in Gecko and SpiderMonkey to refactor the internationalization system.
Very interesting talk with lots of information and details about internationalization, going deep on text segmentation and how it works on different languages, and also introducing the ICU4X project.
• Ada Rose Cannon did a great introduction to WebXR and Augmented Reality.
Despite not being onsite, this was an awesome talk and the video was actually a very immersive experience. Ada explained many concepts and features around WebXR and Augmented Reality with a bunch of cool examples and demos.
• Thomas Steiner talked about Project Fugu APIs that have been implemented in Chromium.
Using the Web Engines Hackfest logo as example, he explained different new capabilities that Project Fugu is adding to the web through a real application called SVGcode.

It was a great set of talks, and you can now watch them all on YouTube. We hope you enjoy them if you haven’t the chance to watch them yet.

## CSS & Interop 2022

On the CSS breakout session we talked about all the new big features that are arriving to browsers these days. Container Queries and :has being probably the most notable examples, features that people have been requesting since the early days and that are shipping into browsers this year.

Apart from that, we talked about the Interop 2022 effort. How the target areas to improve interoperability are defined, and how much it implies the work in some of them.

## MathML & Fonts

Frédéric Wang did a nice presentation about MathML and all the work that has been done in the recent years. The feature is close to shipping in Chromium (modulo finding some solution regarding printing or waiting for LayoutNG to be ready to print), that will be a huge step forward for MathML becoming it a feature supported in the major browser engines.

Related to the MathML work there were some discussion around fonts, particularly OpenType MATH fonts, you can read Fred’s post for more details. There are some good news regarding this topic, new macOS version includes STIX Two Math installed by default, and there are ongoing conversations to get some OpenType MATH font by default in Android too.

MathML Breakout Session

## Accessibility & AOM

Valerie Young, who has recently started acting as co-chair of the ARIA Working Group, was leading a session around accessibility where we talked about ARIA and related things like AOM.

The Accessibility Object Model (AOM) is an effort that involves a lot of different things. In this session we talked about ARIA Attribute Reflection and the issues making accessible custom elements that use Shadow DOM, and that proposals like Cross-root ARIA Delegation are trying to solve.

Accessibility Breakout Session

## Acknowledgements

To close this post I’d like to say thank you to everyone that participated in the Web Engines Hackfest, without your presence this event wouldn’t make any sense. Also I’d like to thank the speakers for the great talks and the time devoted to work on them for this event. As usual big thanks to the sponsors Arm, Google and Igalia to make this event possible once more. And thanks again to Igalia for letting me be part of the event organization.

## July 18, 2022

### Víctor Jáquez

#### GstVA H.264 encoder, compositor and JPEG decoder

There are, right now, three new GstVA elements merged in main: vah264enc, vacompositor and vajpegdec.

Just to recap, GstVA is a GStreamer plugin in gst-plugins-bad (yes, we agree it’s not a great name anymore), to differentiate it from gstreamer-vaapi. Both plugins use libva to access stateless video processing operations; the main difference is, precisely, how stream’s state is handled: while GstVA uses GStreamer libraries shared by other hardware accelerated plugins (such as d3d11 and v4l2codecs), gstreamer-vaapi uses an internal, tightly coupled and convoluted library.

Also, note that right now (release 1.20) GstVA elements are ranked NONE, while gstreamer-vaapi ones are mostly PRIMARY+1.

Back to the three new elements in GstVA, the most complex one is vah264enc wrote almost completely by He Junyan, from Intel. For it, He had to write a H.264 bitwriter which is, until certain extend, the opposite for H.264 parser: construct the bitstream buffer from H.264 structures such as PPS, SPS, slice header, etc. This API is part of libgstcodecparsers, ready to be reused by other plugins or applications. Currently vah264enc is fairly complete and functional, dealing with profiles and rate controls, among other parameters. It still have rough spots, but we’re working on them. But He Junyan is restless and he already has in the pipeline an encoder common class along with a HEVC and AV1 encoders.

The second element is vacompositor, wrote by Artie Eoff. It’s the replacement of vaapioverlay in gstreamer-vaapi. The suffix compositor is preferred to follow the name of primary video mixing (software-based) element: compositor, successor of videomixer. See this discussion for further details. The purpose of this element is to compose a single video stream from multiple video streams. It works with Intel’s media-driver supporting alpha channel, and also works with AMD Mesa Gallium, but without alpha channel (in other words, a custom degree of transparency).

The last, but not the least, element is vajpegdec, which I worked on. The main issue was not the decoder itself, but jpegparse, which didn’t signal the image caps required for the hardware accelerated decoders. For instance, VA only decodes images with SOF marker 0 (Baseline DCT). It wasn’t needed before because the main and only consumer of the parser is jpegdec which deals with any type of JPEG image. Long story short, we revamped jpegparse and now it signals sof marker, color space (YUV, RGB, etc.) and chroma subsampling (if it has YUV color space), along with comments and EXIF-like metadata as pipeline’s tags. Thus vajpegdec will expose in caps template the supported color space and chroma subsampling supported by the driver. For example, Intel supports (more or less) RGB color space, while AMD Mesa Gallium don’t.

And that’s all for now. Thanks.

## What is LRZ?

[A Low Resolution Z (LRZ)] pass is also referred to as draw order independent depth rejection. During the binning pass, a low resolution Z-buffer is constructed, and can reject LRZ-tile wide contributions to boost binning performance. This LRZ is then used during the rendering pass to reject pixels efficiently before testing against the full resolution Z-buffer.

My colleague Samuel Iglesias did the initial reverse-engineering of this feature; for its in-depth overview you could read his great blog post Low Resolution Z Buffer support on Turnip.

Here are a few excerpts from this post describing what is LRZ?

To understand better how LRZ works, we need to talk a bit about tiled-based rendering. This is a way of rendering based on subdividing the framebuffer in tiles and rendering each tile separately.

The binning pass processes the geometry of the scene and records in a table on which tiles a primitive will be rendered. By doing this, the HW only needs to render the primitives that affect a specific tile when is processed.

The rendering pass gets the rasterized primitives and executes all the fragment related processes of the pipeline. Once it finishes, the resolve pass starts.

Where is LRZ used then? Well, in both binning and rendering passes. In the binning pass, it is possible to store the depth value of each vertex of the geometries of the scene in a buffer as the HW has that data available. That is the depth buffer used internally for LRZ. It has lower resolution as too much detail is not needed, which helps to save bandwidth while transferring its contents to system memory.

Thanks to LRZ, the rendering pass is only executed on the fragments that are going to be visible at the end.

LRZ brings a couple of things on the table that makes it interesting. One is that applications don’t need to reorder their primitives before submission to be more efficient, that is done by the HW with LRZ automatically.

Now, a year later, I returned to this feature to make some important improvements, for nitty-gritty details you could dive into Mesa MR#16251 “tu: Overhaul LRZ, implement on-GPU dir tracking and LRZ fast-clear”. There I implemented on-GPU LRZ direction tracking, LRZ reuse between renderpasses, and fast-clear of LRZ.

In this post I want to give a practical advice, based on things I learnt while reverse-engineering this feature, on how to help driver to enable LRZ. Some of it could be self-evident, some is already written in the official docs, and some cannot be found there. It should be applicable for Vulkan, GLES, and likely for Direct3D.

## Do not change the direction of depth comparisons

Or rather, when writing depth - do not change the direction of depth comparisons. If depth comparison direction is changed while writing into depth buffer - LRZ would have to be disabled.

Why? Because if depth comparison direction is GREATER - LRZ stores the lowest depth value of the block of pixels, if direction is LESS - it stores the highest value of the block. So if direction is changed the LRZ value becomes wrong for the new direction.

A few examples:

• :thumbsup: Going from VK_COMPARE_OP_GREATER -> VK_COMPARE_OP_GREATER_OR_EQUAL is good;
• :x: Going from VK_COMPARE_OP_GREATER -> VK_COMPARE_OP_LESS is bad;
• :neutral_face: From VK_COMPARE_OP_GREATER with depth write -> VK_COMPARE_OP_LESS without depth write is ok;
• LRZ would be just temporally disabled for VK_COMPARE_OP_LESS draw calls.

The rules could be summarized as:

• Changing depth write direction disables LRZ;
• For calls with different direction but without depth write LRZ is temporally disabled;
• VK_COMPARE_OP_GREATER and VK_COMPARE_OP_GREATER_OR_EQUAL have same direction;
• VK_COMPARE_OP_LESS and VK_COMPARE_OP_LESS_OR_EQUAL have same direction;
• VK_COMPARE_OP_EQUAL and VK_COMPARE_OP_NEVER don’t have a direction, LRZ is temporally disabled;
• Surprise, your VK_COMPARE_OP_EQUAL compares don’t benefit from LRZ;
• VK_COMPARE_OP_ALWAYS and VK_COMPARE_OP_NOT_EQUAL either temporally or completely disable LRZ, depending on depth being written.

## Simple rules for fragment shader

### Do not write depth

This obviously makes resulting depth value unpredictable, so LRZ has to be completely disabled.

Note, the output values of manually written depth could be bound by conservative depth modifier, for GLSL this is achieved by GL_ARB_conservative_depth extension, like this:

layout (depth_greater) out float gl_FragDepth;


However, Turnip at the moment does not consider this hint, and it is unknown if Qualcomm’s proprietary driver does.

### Do not use Blending/Logic OPs/colorWriteMask

All of them make a new fragment value depend on the old fragment value. LRZ is temporary disabled in this case.

### Do not have side-effects in fragment shaders

Writing to SSBO, images, … from fragment shader forces late Z, thus it is incompatible with LRZ. At the moment Turnip completely disables LRZ when shader has such side-effects.

Discarding fragments moves the decision whether fragment contributes to the depth buffer to the time of fragment shader execution. LRZ is temporary disabled in this case.

## LRZ in secondary command buffers and dynamic rendering

TLDR: Since Snapdragon 865 (Adreno 650) LRZ supported in secondary command buffers.

TLDR: LRZ would work with VK_KHR_dynamic_rendering, but you’d like to avoid using this extension because it isn’t nice to tilers.

Official docs state that LRZ is disabled with “Use of secondary command buffers (Vulkan)”, and on another page that “Snapdragon 865 and newer will not disable LRZ based on this criteria”.

Why?

Because up to Snapdragon 865 tracking of the direction is done on the CPU, meaning that LRZ direction is kept in internal renderpass object, updated and checked without any GPU involvement.

But starting from Snapdragon 865 the direction could be tracked on GPU which allows driver not to know previous LRZ direction during a command buffer construction. Therefor secondary command buffers could now use LRZ!

Recently Vulkan 1.3 came out and mandated the support of VK_KHR_dynamic_rendering. It gets rid of complicated VkRenderpass and VkFramebuffer setup, but much more exciting is a simpler way for parallel renderpasses construction (with VK_RENDERING_SUSPENDING_BIT / VK_RENDERING_RESUMING_BIT flags).

VK_KHR_dynamic_rendering poses a similar challenge for LRZ as secondary command buffers and has the same solution.

## Reusing LRZ between renderpasses

TLDR: Since Snapdragon 865 (Adreno 650) LRZ would work if you store depth in one renderpass and load it later, giving depth image isn’t changed in-between.

Another major improvement brought by Snapdragon 865 is the possibility to reuse LRZ state between renderpasses.

The on-GPU direction tracking is part of the equation here, another part is the tracking of a depth view being used. Depth image has a single LRZ buffer which corresponds to a single array layer + single mip level of the image. So if view with different array layer or mip layer is used - LRZ state couldn’t be reused and will be invalidated.

With the above knowledge here are the conditions when LRZ state could be reused:

• Depth attachment was stored (STORE_OP_STORE) at the end of some past renderpass;
• The same depth attachment with the same depth view settings is being loaded (not cleared) in the current renderpass;
• There were no changes in the underlying depth image, meaning there was no vkCmdBlitImage*, vkCmdCopyBufferToImage*, or vkCmdCopyImage*. Otherwise LRZ state would be invalidated;

Misc notes:

• LRZ state is saved per depth image so you don’t lose the state if you you have several renderpasses with different depth attachments;
• vkCmdClearAttachments + LOAD_OP_LOAD is just equal to LOAD_OP_CLEAR.

## Conclusion

While there are many rules listed above - it all boils down to keeping things simple in the main renderpass(es) and not being too clever.

# Where Browsers Come From

In this post, I’ll talk about the evolution of the economics around browsers, and what we should think about as we move forward.

The very first web browsers were rather small, single-person affairs. The source for Tim Berners-Lee’s original WorldWideWeb (which included a WYSIWYG editor!) was rougly a third larger than the unminified build of jQuery. Tim even abstracted the HTTP bits so you could build your own browser more easily, or a server (or both/all in one). For the most part, we can say these were created in academia.

There was at least some assumption at the time that like most other available software, someone could monetize a good one as a product. In fact, before building his own, Tim famously tried to give the idea away to a company doing just that.

For a while we flirted with that idea in various forms, creating companies “around” browsers. These companies largely never “just sold the browser” but involved lots of experimenting with ways to keep browsers “free” for individuals by subsidizing them in other ways: coporate sales, licences, support, server sales, and so on. With these concerted effort, and money, it was soon impractical for a 1 person affair to keep up or compete. Instead, the competition pretty quickly became between Netscape and Microsoft.

Microsoft, on the other hand, subsidized the browser through products people were already buying. They were already spending money to add similar web features to many those products, so why not just centralize it? They eventually just shipped it with the OS.

When it became clear that Netscape’s model was losing, the people there who really cared about the web had a real problem: Now what? Unless someone could do something radical, Microsoft’s way would be the only way.

## And then, two interesting things happened…

In 1998, as a result, Mozilla was spun off from Netscape and Open Source in the modern sense was born. The browser known as Firefox wouldn’t even launch its 1.0 version until 2004.

At literally the same time, Google (the search engine) was developing and getting users. It existed for 6 years before it launched its IPO – also in 2004. The web had indexes and search engines before Google, and had some clear winners. But, it would be an understatement to say that Google was game-changing. It just absolutely nailed search with some totally new ideas. They made it simple and accurate.

Google was banking on what really good search meant to the Web: It made it practical and easy. The easier they made it, the more people turned to the web for answers, and so on. That meant people would just be searching, literally all the all the time - and each of those searches came with ad opportunities!

You can see just how important this is to getting around on the web today. We’ve replaced the URL bar with the Wonder Bar (integrated search). If it doesn’t look like a URL, it goes to a search engine - automatically… But it wasn’t always like that.

Just before Firefox 1.0, Mozilla signed a landmark “default search” deal which would pay for 85% of their total budget. It wasn’t the intent for that to be the only means forever. In fact, they tried to expand into product ideas too. However, this dependency has generally only increased, recently accounting for up to 95% of the total Mozilla budget. Over time, this model would become the dominant approach.

Also, by the release of Firefox 1.0 its codebase was up to 2.1 million lines of code.

## How we pay for the web

Step back and look at the growth in the size and complexity of the 3 remaining engines over time…

browser ~lines of code
OG WWW 13,000
Firefox 1.0 2,100,000
Chromium 2022 34,900,821

Today, the cost to develop and maintain a browser is measured in the many hundreds of millions of dollars per year – each. And here’s the thing…

Most of the actual, practical revenue that has made the whole thing possible ever since is in a large sense driven by the model of monetizing browsers through default search.

In some cases this is pretty direct and easy to point to. In other cases, it’s a little obscured. However, I ask you to consider that companies (no matter how profitable) don’t voluntarily commit to bleeding hundreds of millions of dollars per year (and rising), forever. Something has to make it more economically viable - and today that thing is lots and lots of default search revenue.

Without it, the ledgers would suddenly go red, but default search deals ensure the books aren’t being drained of massive piles of money. In fact, they’re generating revenue.

It’s tempting to think “Well, so? It seems to be working pretty well? This is the way”.

…But, is it? Is it the only way? Should it always be the only way?

I kind of think not.

## Starting a conversation about our future

We’ve never really just sold the browser, we’ve always subsidized browser development in other ways. I’m definitely not suggesting that users should have to pay to download and use a web browser, but I do think it’s useful to realize that that doesn’t exactly make browsers free either. Instead, it spreads out costs some other way that obscures those details from the conversation. From that angle, it might be a useful thought exercise to consider: What does the browser commons cost, per-user, per-year?

Based on what we know of team sizes, scales and budgets - several people seem to have arrived on a similar ballpark figure: Somewhere around 2 billion dollars per year. Those are current costs to maintain all the current engines and make/keep them competive. I expect that for many this will induce a kind of “sticker shock” and it will be tempting to think “wow, that’s inefficient, maybe we don’t need 3 engines after all.” However, let’s dig in a little further.

# Installing SteamOS

Now that we have all files we can start the virtual machine:

$qemu-system-x86_64 -enable-kvm -smp cores=4 -m 8G \ -device usb-ehci -device usb-tablet \ -device intel-hda -device hda-duplex \ -device VGA,xres=1280,yres=800 \ -drive if=pflash,format=raw,readonly=on,file=/usr/share/ovmf/OVMF.fd \ -drive if=virtio,file=steamdeck-recovery-4.img,driver=raw \ -device nvme,drive=drive0,serial=badbeef \ -drive if=none,id=drive0,file=steamos.qcow2  Note that we’re emulating an NVMe drive for steamos.qcow2 because that’s what the installer script expects. This is not strictly necessary but it makes things a bit easier. If you don’t want to do that you’ll have to edit ~/tools/repair_device.sh and change DISK and DISK_SUFFIX. Once the system has booted we’ll see a KDE Plasma session with a few tools on the desktop. If we select “Reimage Steam Deck” and click “Proceed” on the confirmation dialog then SteamOS will be installed on the destination drive. This process should not take a long time. Now, once the operation finishes a new confirmation dialog will ask if we want to reboot the Steam Deck, but here we have to choose “Cancel”. We cannot use the new image yet because it would try to boot into the Gamescope session, which won’t work, so we need to change the default desktop session. SteamOS comes with a helper script that allows us to enter a chroot after automatically mounting all SteamOS partitions, so let’s open a Konsole and make the Plasma session the default one in both partition sets: $ sudo steamos-chroot --disk /dev/nvme0n1 --partset A
# exit

$qemu-system-x86_64 -enable-kvm -smp cores=4 -m 8G \ -device usb-ehci -device usb-tablet \ -device intel-hda -device hda-duplex \ -device VGA,xres=1280,yres=800 \ -drive if=pflash,format=raw,readonly=on,file=/usr/share/ovmf/OVMF.fd \ -drive if=pflash,format=raw,file=OVMF_VARS.fd \ -drive if=virtio,file=steamos.qcow2 \ -device virtio-net-pci,netdev=net0 \ -netdev user,id=net0,hostfwd=tcp::2222-:22  (the last two lines redirect tcp port 2222 to port 22 of the guest to be able to SSH into the VM. If you don’t want to do that you can omit them) If everything went fine, you should see KDE Plasma again, this time with a desktop icon to launch Steam and another one to “Return to Gaming Mode” (which we should not use because it won’t work). See the screenshot that opens this post. Congratulations, you’re running SteamOS now. Here are some things that you probably want to do: • (optional) Change the keyboard layout in the system settings (the default one is US English) • Set the password for the deck user: run ‘passwd‘ on a terminal • Enable / start the SSH server: ‘sudo systemctl enable sshd‘ and/or ‘sudo systemctl start sshd‘. • SSH into the machine: ‘ssh -p 2222 deck@localhost # Updating the OS to the latest version The Steam Deck recovery image doesn’t install the most recent version of SteamOS, so now we should probably do a software update. • First of all ensure that you’re giving enought RAM to the VM (in my examples I run QEMU with -m 8G). The OS update might fail if you use less. • (optional) Change the OS branch if you want to try the beta release: ‘sudo steamos-select-branch beta‘ (or main, if you want the bleeding edge) • Check the currently installed version in /etc/os-release (see the BUILD_ID variable) • Check the available version: ‘steamos-update check • Download and install the software update: ‘steamos-update Note: if the last step fails after reaching 100% with a post-install handler error then go to Connections in the system settings, rename Wired Connection 1 to something else (anything, the name doesn’t matter), click Apply and run steamos-update again. This works around a bug in the update process. Recent images fix this and this workaround is not necessary with them. As we did with the recovery image, before rebooting we should ensure that the new update boots into the Plasma session, otherwise it won’t work: $ sudo steamos-chroot --partset other
# exit


After this we can restart the system.

If everything went fine we should be running the latest SteamOS release. Enjoy!

# Reporting bugs

SteamOS is under active development. If you find problems or want to request improvements please go to the SteamOS community tracker.

Edit 06 Jul 2022: Small fixes, mention how to install the OS without using NVMe.

## July 01, 2022

### Claudio Saavedra

#### Fri 2022/Jul/01

I wrote a technical overview of the WebKit WPE project for the WPE WebKit blog, for those interested in WPE as a potential solution to the problem of browsers in embedded devices.

This article begins a series of technical writeups on the architecture of WPE, and we hope to publish during the rest of the year further articles breaking down different components of WebKit, including graphics and other subsystems, that will surely be of great help for those interested in getting more familiar with WebKit and its internals.

## June 29, 2022

### Patrick Griffis

#### WebExtension Support in Epiphany

I’m excited to help bring WebExtensions to Epiphany (GNOME Web) thanks to investment from my employer Igalia. In this post, I’ll go over a summary of how extensions work and give details on what Epiphany supports.

Web browsers have supported extensions in some form for decades. They allow the creation of features that would otherwise be part of a browser but can be authored and experimented with more easily. They’ve helped develop and popularize ideas like ad blocking, password management, and reader modes. Sometimes, as in very popular cases like these, browsers themselves then begin trying to apply lessons upstream.

## Toward universal support

For most of this history, web extensions have used incompatible browser-specific APIs. This began to change in 2015 with Firefox adopting an API similar to Chrome’s. In 2020, Safari also followed suit. We now have the foundations of an ecosystem-wide solution.

“The foundations of” is an important thing to understand: There are still plenty of existing extensions built with browser-specific APIs and this doesn’t magically make them all portable. It does, however, provide a way towards making portable extensions. In some cases, existing extensions might just need some porting. In other cases, they may utilize features that aren’t entirely universal yet (or, may never be).

## Bringing Extensions to Epiphany

With version 43.alpha Epiphany users can begin to take advantage of some of the same powerful and portable extensions described above. Note that there are quite a few APIs that power this and with this release we’ve covered a meaningful segment of them but not all (details below). Over time our API coverage and interoperability will continue to grow.

## What WebExtensions can do: Technical Details

At a high level, WebExtensions allow a private privileged web page to run in the browser. This is an invisible Background Page that has access to a browser JavaScript API. This API, given permission, can interact with browser tabs, cookies, downloads, bookmarks, and more.

Along with the invisible background page, it gives a few options to show a UI to the user. One such method is a Browser Action which is shown as a button in the browser’s toolbar that can popup an HTML view for the user to interact with. Another is an Options Page dedicated to configuring the extension.

Lastly, an extension can inject JavaScript directly into any website it has permissions to via Content Scripts. These scripts are given full access to the DOM of any web page they run in. However content scripts don’t have access to the majority of the browser API but, along with the above pages, it has the ability to send and receive custom JSON messages to all pages within an extension.

### Example usage

For a real-world example, I use Bitwarden as my password manager which I’ll simplify how it roughly functions. Firstly there is a Background Page that does account management for your user. It has a Popup that the user can trigger to interface with your account, passwords, and options. Finally, it also injects Content Scripts into every website you open.

The Content Script can detect all input fields and then wait for a message to autofill information into them. The Popup can request the details of the active tab and, upon you selecting an account, send a message to the Content Script to fill this information. This flow does function in Epiphany now but there are still some issues to iron out for Bitwarden.

## Epiphany’s current support

Epiphany 43.alpha supports the basic structure described above. We are currently modeling our behavior after Firefox’s ManifestV2 API which includes compatibility with Chrome extensions where possible. Supporting ManifestV3 is planned alongside V2 in the future.

As of today, we support the majority of:

• alarms - Scheduling of events to trigger at specific dates or times.
• commands - Keyboard shortcuts.
• cookies - Management and querying of browser cookies.
• downloads - Ability to start and manage downloads.
• menus - Creation of context menu items.
• notifications - Ability to show desktop notifications.
• storage - Storage of extension private settings.
• tabs - Control and monitoring of browser tabs, including creating, closing, etc.
• windows - Control and monitoring of browser windows.

A notable missing API is webRequest which is commonly used by blocking extensions such as uBlock Origin or Privacy Badger. I would like to implement this API at some point however it requires WebKitGTK improvements.

For specific API details please see Epiphany’s documentation.

What this means today is that users of Epiphany can write powerful extensions using a well-documented and commonly used format and API. What this does not mean is that most extensions for other browsers will just work out of the box, at least not yet. Cross-browser extensions are possible but they will have to only require the subset of APIs and behaviors Epiphany currently supports.

## How to install extensions

This support is still considered experimental so do understand this may lead to crashes or other unwanted behavior. Also please report issues you find to Epiphany rather than to extensions.

You can install the development release and test it like so:

flatpak remote-add --if-not-exists gnome-nightly https://nightly.gnome.org/gnome-nightly.flatpakrepo
flatpak install gnome-nightly org.gnome.Epiphany.Devel
flatpak run --command=gsettings org.gnome.Epiphany.Devel set org.gnome.Epiphany.web:/org/gnome/epiphany/web/ enable-webextensions true


You will now see Extensions in Epiphany’s menu and if you run it from the terminal it will print out any message logged by extensions for debugging. You can download extensions most easily from Mozilla’s website.

## June 24, 2022

### Tim Chevalier

#### on impostor syndrome, or: worry dies last

According to Sedgwick, it was just this kind of interchange that fueled her emotional re-education. She came to see that the quickness of her mind was actually holding back her progress, because she expected emotional change to be as easy to master as a new theory: “It’s hard to recognize that your whole being, your soul doesn’t move at the speed of your cognition,” she told me. “That it could take you a year to really know something that you intellectually believe in a second.” She learned “how not to feel ashamed of the amount of time things take, or the recalcitrance of emotional or personal change.”

My colleague Ioanna Dimitriou told me “worry dies last”, and it made me remember this passage from an interview with Eve Kosofsky Sedgwick.

It’s especially common in fields where people’s work is constantly under review by talented peers, such as academia or Open Source Software, or taking on a new job.

At the end of 2012/beginning of 2013 I wrote a four-part blog post about my experiences with impostor syndrome. That led to me getting invited to speak on episode 113 of the “Ruby Rogues” podcast, which was dedicated to impostor syndrome. (Unfortunately, from what I can tell, their web site is gone.)

Since then, my thinking about impostor syndrome has changed.

“Impostor syndrome” is an entirely rational behavior for folks who do get called impostors (ie. many underrepresented people). It’s part coping mechanism, part just listening to the feedback you’re getting….

We call it “impostor syndrome”, but we’re not sick. The real sickness is an industry that calls itself a meritocracy but over and over and over fails to actually reward merit.

This is fixable. It will take doing the work of rooting out bias in all its forms, at all levels – and critically, in who gets chosen to level up. So let’s get to work.

I agree with everything Leigh wrote here. Impostor syndrome, like any response to past or ongoing trauma, is not a pathology. It’s a reasonable adaptation to an environment that places stresses on your mind and body that exhaust your resources for coping with those demands. I wrote a broader post about this point in 2016, called “Stop Moralizing About Personality Traits”.

Acceptance is the first step towards change. By now, I’ve spent over a decade consciously reckoning with the repercussions of growing up and into young adulthood without emotional support, both on the micro-level (family and intimate relationships) and the macro-level (being a perennial outsider with no home on either side of a variety of social borders: for example, those of gender, sexuality, disability, culture, and nationality). When i started my current job last year, I wasn’t over it. That made it unnecessarily hard to get started and put up a wall between me and any number of people who might have offered help if they’d only known what I was going through. I’m still not over it.

To recognize, and name as a problem, the extent to which my personality has been shaped by unfair social circumstances: that was one step. Contrary to my acculturation as an engineer, the next step is not “fix the problem”. In fact, there is no patch you can simply apply to your own inner operating system, because all of your conscious thoughts run in user space. Maybe you can attach a debugger to your own kernel, but some changes can’t be made to a running program without a cold reboot. I don’t recommend trying that at home.

Learning to identify impostor syndrome (or, as you might call it, “dysfunctional environment syndrome”; or, generalizing, “complex trauma” or “structural violence”) is one step, but a bug report isn’t the same thing as a passing regression test. As with free software, improvement has to come a little bit at a time, from many different contributors; there are few successful projects with a single maintainer.

I am ashamed of the amount of time things take, of looking like a senior professional on the outside as long as my peers don’t know (or aren’t thinking about) how I’ve never had a single job in tech for more than two years, about what it was like for me to move from job to job never picking enough momentum to accomplish anything that felt real to me. I wonder whether they think I’ve got it all figured out, which I don’t, but it often feels easier to just let people think that and suffer in silence. Learning to live with trauma requires trusting relationships; you can’t do it on your own. But the trauma itself makes it difficult to impossible to trust and to enter into genuine relationships.

I am not exaggerating when I say that my career has been traumatic for me; it has both echoed much older traumas and created entirely new ones. That’s a big part of why I had to share how I felt about finally meeting more of my co-workers in person. I’m 41 years old and I feel like I should be better at this by now. I’m not. But I’ll keep trying, if it takes a lifetime.

## June 23, 2022

### Tim Chevalier

#### we belong

I am about 70% robot and 30% extremely sentimental and emotional person, generally in series rather than in parallel. But last week’s Igalia summit was a tidal wave of feelings, unexpected but completely welcome. Some of those feelings are ones I’ve already shared with the exact people who need to know, but there are some that I need to share with the Internet. I am hoping I’m not the only one who feels this way, though I don’t think I am.

A lot of us are new and this was our first summit. Meeting 60 or 70 people who were previously 2D faces on a screen for half an hour a week, at best, was intense. I was told by others that reuniting with long-time friends/colleagues/comrades/whatever words you want to use (and it’s hard to find the exact right one for a workplace like this) who they hadn’t seen since pre-pandemic was intense as well.

For me, there was more to it. I doubt I’m alone in this either, but it might explain why I’m feeling so strongly.

I tried to quit tech in 2015. I couldn’t afford to in the end, and I went to Google. They fired me for (allegedly) discriminating against white men, in late 2017. I decided it was time to quit again. I became an EMT and then a patient care coordinator, and applied to nursing schools. I got rejected. I decided I didn’t want to try again because I had learned that unless I became a physician, working in health care would never give me the respect I need. Unfortunately, I have an ego. I like to think that I balance it out with empathy more than some people in tech do, but it’s still there.

I got a DM in 2018 from some guy I knew from Twitter asking if I wanted to apply to Igalia, and I waited three years to take him up on it. Now I’m here.

Getting started wasn’t easy. The two weeks working from the office before the summit wasn’t easy either. But it all fell away sometime between Wednesday and Friday of last week, and quite unexpectedly, I decided I’m moving to Europe as soon as I can, probably to A Coruña (where Igalia’s headquarters is) at first but who knows where life will take me next? Listing all the reasons would take too long. But: I found a safe space, one where I felt welcome, accepted, like I belonged. It’s not a perfect one; I stood up during one of the meetings and expressed my pain at the dissonance between the comfort I feel here and the knowledge that most of the faces in the room were white and most belonged to men. I want to say we’re working on it, but our responsibility is to finish the work, not to feel good that we’ve started it. That’s true for writing code to deliver to a customer, and it’s true for achieving fairness.

I am old enough now to accommodate multiple conflicting truths. My desire to improve the unfairness, and to get other people to open their hearts enough to risk all-consuming rage at just how unfair things can be, those things coexist with my joy about finding a group of such consistently caring, thoughtful, and justice-minded people — ones who don’t seem to mind having me around.

I’m normally severely allergic to words like “love” and “family” in a corporate context. As an early childhood trauma survivor, these words are fraught, and I would rather keep things at work a bit more chill. At the same time, when I heard Igalians use these words during the summit to talk about our collective, it didn’t feel as menacing as it usually does. Maybe the right word to use here — the thing that we really mean when we generalize the words “love” and “family” because we’ve been taught (incorrectly) that it can only come from our lovers or parents — is “safety”. Safety is one of the most underrated concepts there is. Feeling safe means believing that you can rely on the people around you, that they know where you’re coming from or else if they don’t, that they’re willing to try to find out, that they’re willing to be changed by what happens if they do find out. I came in apprehensive. But in little ways and in big ways, I found safe people, not just one or two but a lot.

I could say more, but if I did, I might never stop. To channel the teenaged energy that I’m feeling right now (partly due to reconnecting with that version of myself who loved computers and longed to find other people who did too), I’ll include some songs that convey how I feel about this week. I don’t know if this will ring true for anyone else, but I have to try.

Allette Brooks, “Silicon Valley Rebel”

We lean her bike along the office floor
They never know what to expect shaved into the back of her head when she walks in the door
And she says ‘I don’t believe in working like that for a company
It’s not like they care about you
It’s not like they care about me’

Please don’t leave us here alone in this silicon hell, oh
Life would be so unbearable without your rebel yell...


Vienna Teng, “Level Up”

Call it any name you need
So long as you can feel it all
So long as all your doors are flung wide
Call it your day number 1 in the rest of forever

If you are afraid, give more
If you are alive, give more now
Everybody here has seams and scars


Namoli Brennet “We Belong”

Here's to all the tough girls
And here's to all the sensitive boys
We Belong
Here's to all the rejects
And here's to all the misfits
We Belong

Here's to all the brains and the geeks
And here's to all the made up freaks, yeah
We Belong

And when the same old voices say
That we'd be better off running away
We belong, We belong anyway

The Nields, “Easy People”

You let me be who I want to be

Bob Franke, “Thanksgiving Eve”

What can you do with your days
But work and hope
Let your dreams bind your work to your play

And most of all, the Mountain Goats, “Color in Your Cheeks”

They came in by the dozens, walking or crawling
Some were bright-eyed, some were dead on their feet
And they came from Zimbabwe, or from Soviet Georgia
East Saint Louis, or from Paris, or they lived across the street
But they came, and when they finally made it here
It was the least that we could do to make our welcome clear

Come on in
We haven't slept for weeks
Drink some of this
This'll put color in your cheeks

This is a different kind of post from the ones I was originally planning to do on this blog. And I never thought I’d be talking about my job this way. Life comes at you fast. To return to the Allette Brooks lyric: it’s because at a co-op, there’s no “they” that’s separate from “you” and “me”. It’s just “you” and “me”, and we care about each other. It turns out that safe spaces and cooperative structure aren’t just political ideas that happen to correlate — in a company, neither can exist without the other. It’s not a safe space if you can get fired over one person’s petty grievance, like being reminded that white men don’t understand everything. Inversely, cooperative structure can’t work without deep trust. Trust is hard to scale, and as Igalia grows I worry about what will happen (doubtless, the people who were here when it was 1/10ths of the size have a different view.) There is no guarantee of success, but I want to be one of the ones to try.

And we’re hiring. I know how hard it is to try again when you’ve been humiliated, betrayed, and disappointed at work before, something that’s more common than not in tech when you don’t look like, sound like, or feel like everybody else. I’m here because somebody believed in me. I’m glad both that they did and that I was able to return that leap of faith and believe that I truly was believed in. And I would like to pass along the favor to you. Of course, to do that, I have to get to know you a little bit first. As long as I continue to have some time, I want to talk to people in groups that are systematically underrepresented in tech who may be intrigued by what I wrote here but aren’t sure if they’re good enough. My email is tjc at (obvious domain name) and the jobs page is at https://www.igalia.com/jobs. Even if you don’t see a technical role there that exactly fits you, please don’t let that stop you from reaching out; what matters most is willingness to learn and to tolerate the sometimes-painful, always-rewarding process of creating something together with mutual consent rather than coercion.

# Achievement Unlocked: Intent to Mathify

We are about to reach a very unique and honestly pretty epic moment in standards history, and most people probably won’t even notice. It’s also personally meaningful to me, so I’d like to tell you about it and why it’s worth celebrating it…

Igalia just filed an Intent to Ship support for MathML-Core in Chromium. If you’re not familliar with what an intent to ship means - the blink release process involves several stages before something gets to ship. An intent to ship marks the beginning of the final step which allows the feature to just work by default in stable releases.

I know, I know. Many website developers will think “that doesn’t help me much and I have all these other problems which I would love to solve instead”. But… that’s kind of the problem and why this is momentus in several ways.

MathML has a wildly interesting history. At some level, the need for being able to display mathematical text seem like it would have been obvious from the web’s start at CERN, right? It was! In fact, support for rendering some math existed in CERNs experimental browser in 1993. Graphics too. It’s unsurprising then that SVG and MathML were among the first active working groups at the W3C when it was established. MathML was, in fact, the first XML oriented standard that the W3C ever published with it’s first recommendation coming in April 1998. For reference that’s over a year before HTML 4.0.1 reached REC… During the “HTML5” split it, along with SVG was specially integrated into the new, very well defined parser.

So why is it suddenly “news?”.

Well… It is complicated, and I wrote one of my personally favorite pieces explaining it all at the beginning of 2019, before I came to work at Igalia. Really, I think it is enjoyable and you should read it, but the TL;DR version is that standards implementations, and their prioritization is voluntary. In the same way that many developers are focused on shopping and selling and animations and layout and better modularization and… lots of other problems - there were just more appealing things on their plates which would appease a wider audience. There are millions of equations in Wikipedia alone, but we don’t think about it so much because we’ve developed plenty of clever (and often not-so-clever) workarounds to deal with “until the last one lands”.

And so progress was slower than normal. Volunteers did an amazing amount of the actual implementation work in many cases. And then, just as it looked like we were about to cross a finish line, Chrome forked WebKit and decided that - for now - they were going to rip out the newly landed MathML which had some problems to make it easier for them to refactor major parts of the engine. And then the way we do standards changed. We got more rigorous over the years. Basically - the story just kept getting worse. It was almost like we were going backwards for math on the web.

By 2018 or so it was looking pretty unable to be righted. Igalia was provided many arguments about how hard the problem was, and the scale of actually righting the ship. It was more than just 1 more implementation, which would already have been a huge effort - it was about re-establishing a working group, specifying all of the previously unspecified things, in a way that fit the platform (coordinating with many other working groups and WHATWG on various details), going through a review with the W3C Technical Architecture Group, and so on.

But, here we are.

Setting this right isn’t just historically unique in that way either: I’m pretty sure it’s safe to say that no non-browser organization has ever landed something of this kind of scale before - and they’ve never done it with some aid from various funding sources either. While Igalia wound up footing the lion’s share of the bill ourselves, we also had financial support at stages from NISO and the Alfred P Sloan Foundation, APS Physics, and Pearson and a small collection of donors that included 75k from two people. Really, in many ways, that effort was the pre-cursor to our whole Open Prioritization idea. ## Personal notes For me, I have a lot of personal connections to this that make it meaningful. As I said in Harold Crick and the Web Platform, I might be partially responsible for its delays. Little did he know, he’d get deeply involved in helping solve that problem. When I came to Igalia, it was one of my first projects. I helped fix some things. The first Web Platform Test I ever contributed was for MathML. I think the first WHATWG PR I ever sent was on MathML. The first BlinkOn Lightning Talk I ever did was about - you guessed it, MathML. The first W3C Charter I ever helped write? That’s right: About MathML. The first actual Working Group I’ve ever chaired (well, co-chaired) is about Math. The first explainer I ever wrote myself was about MathML. The first podcast I ever hosted was on… Guess what? MathML. And so on. And here’s the thing: I am perhaps the least mathematically minded person you will ever meet. But we did the right thing. A good thing. And we did it a good way, and for good reasons. A few episodes later on our podcast we had Rick Byers from Chrome and Rossen Atanassov from Microsoft on the show, and Rick brought up MathML. Both of them were hugely impressed and supportive of it even back then - Rick said I fully expect MathML to ship at some point and it’ll ship in Chrome… Even though from Google’s business perspective, it probably wouldn’t have been a good return on investment for us to do it… I’m thrilled that Igalia was able to do it. By then, there was a global pandemic underway and Rossen pointed out… I’m looking forward to it. Look, I… I’m a huge supporter of having a native math into into the engines, MathML… And having the ability to, at the end of the day, increase the edu market, which will benefit the most out of it. Especially, you know, having been through one full semester of having a middle school student at home, and having her do all of her work through online tools… Having better edu support will go a long way. So, thank you on behalf all of the students, future students that will benefit And… yeah, that’s kind of it right? There’s this huge thing that has no special seeming obvious ROI for browsers, that doesn’t have every developer in the world screaming for it but it’s really great for society and kind of important to serve this niche because it underpins the ability of student, physicists, mathematicians etc to communicate with text. # Mission Accomplished Progress Anyway… Wow. This is huge, I think, in so many ways. I’m gonna go raise a glass to everyone who helped achieve this astonishingly huge thing. I think it says so much about our ability to do things together and promise for the ecosystem and how it could work. I sort of wish I could just say “Mission Accomplished” but the truth is that this is a beginning, not an end. It means we have really good support of parsing and rendering (assuming some math capable fonts) tons and tons of math interoperably, and a spec for how it needs to integrate with the whole rest of the platform, but only 1 implementation of that bit. Now we have to align the others implementations the same way so that we really have just One Platform and all of it can move forward together and no part gets left behind again. Then, beyond that is MathML-Core Level 2. Level 1 draws a box around what was practically able to be aligned in the first pass - but it leaves a few hard problems on the table which really need solving. Many of these have some partial solutions in the other 2 browsers already but are hard to specify and integrate. I have a lot of faith that we can reach both of those goals, but to do it takes investment. I really hope that reaching this milestone helps convince organizations and vendors to contribute toward reaching them. I’d encourage everyone to give the larger ecosystem some thought and consider how we can support good efforts - even maybe directly. If you'd like to understand the argument for this better, my friend and colleague at Igalia Eric Meyer presented an excellent talk when we announced it... ## June 21, 2022 ### Ziran Sun #### My first time attending Igalia Summit With employee distributed in over 20 countries globally and most are working remotely, Igalia holds summits twice a year to give the employees opportunities to meet up face-to-face. The summit normally runs in a period of a week with code camps, team building and recreational activities etc.. This year’s summer summit was held between the 15th-19th of June in A Coruña, Galicia. I joined Igalia in the beginning of 2020. Because of COVID-19 pandemic, this was my first time attending an Igalia Summit. Due to personal time schedule I only managed to stay at A Coruña for three nights. The overall experience was great and I thoroughly enjoyed it! ### Getting to know A Coruña Igalia’s HQ is based in A Coruña in Galicia, Spain. A beautiful white-sanded beach is just a few yards away from the hotel we stayed. I did a barefoot stroll along the beach in one morning. The beach itself was reasonably quiet. There were a few people swimming in the shallow part of the sea. Weather that morning was perfect with warm sunshine and occasional cool gentle breezes. The smell of sea in the air and people wandering around relaxingly in the evenings, somehow brought a familiar feeling to me and reminded of my home town. On Wednesday I joined guided visit to A Coruña set in 1906. It was a very interesting walk around the historic part of the city. The tour guide Suso Martín presented an amazing one man show by walking us through the A Coruña‘s history back to the 12th century, historic buildings, religion and romances associated with the city. On Thursday evening we went for the guided tour of MEGA, Mundo Estrella Galicia. En route we passed the office that Igalia used to locate at in early days. According to some Igalians who worked there before, it was a small office and there were about just over 10 Igalians in those days. Today Igalia has over 100 employees and last year we celebrated our 20th year anniversary in Open Source. The highlights of the MEGA tour for me were watching the product line working and tasting the beers. We were spoiled by beers and local cheeses. ### Meeting other igalians Since I joined Igalian, all the meetings happened online due to pandemic. I was glad that I could finally meet other Igalians “physically” . During the summit I had chances to chat other Igalians at code camp, during meals and guided tours. It was pleasant. Playing board games together after an evening meal (I’d call it “night meal”) was a great fun. Some Igalians can be very funny and witty. I had quite a few “tearful laughter” moments. It was very enjoyable. During the team code camp, I had chances to spending more time with my teammates. The technical meetings in both days were very engaging and effective. I was very happy to see Javi Fernández in person. Last year I was involved in CSS grid compatibility work (one of the 5 key areas for the Compat2021 effort that Igalia was involved in). I personally had never touched CSS Grid layout at the beginning of the assignment. The web platform team in Igalia though has very good knowledge and working experiences in this area. My teammates Manuel Rego, Javi Fernández and Sergio Villar were among the pioneer developers in this field. To assure the success of the task, the team formed a safe environment by providing me constant guidance and supports. Specifically, I had Javi’s helps throughout the whole task. Javi has been an amazing technical mentor with great expertise and patience. We had numerous calls for technical discussions. The team also had a couple of debugging sessions with me for some very tricky bugs. The team meal on Wednesday evening was very nice. Delicious food, great companions and nice fruity beers – what could go wrong? Well, we did walk back to the hotel in the rain… ### Thanks The event was very well organized and productive. I really appreciate the hard work and great effort those igalians put in to make it happen. Just want to say – a big THANK YOU! I’m glad that I managed the trip. Traveling from the UK is not straightforward but I’d say it’s well worthy it. ### Andy Wingo #### an optimistic evacuation of my wordhoard Good morning, mallocators. Last time we talked about how to split available memory between a block-structured main space and a large object space. Given a fixed heap size, making a new large object allocation will steal available pages from the block-structured space by finding empty blocks and temporarily returning them to the operating system. Today I'd like to talk more about nothing, or rather, why might you want nothing rather than something. Given an Immix heap, why would you want it organized in such a way that live data is packed into some blocks, leaving other blocks completely free? How bad would it be if instead the live data were spread all over the heap? When might it be a good idea to try to compact the heap? Ideally we'd like to be able to translate the answers to these questions into heuristics that can inform the GC when compaction/evacuation would be a good idea. lospace and the void Let's start with one of the more obvious points: large object allocation. With a fixed-size heap, you can't allocate new large objects if you don't have empty blocks in your paged space (the Immix space, for example) that you can return to the OS. To obtain these free blocks, you have four options. 1. You can continue lazy sweeping of recycled blocks, to see if you find an empty block. This is a bit time-consuming, though. 2. Otherwise, you can trigger a regular non-moving GC, which might free up blocks in the Immix space but which is also likely to free up large objects, which would result in fresh empty blocks. 3. You can trigger a compacting or evacuating collection. Immix can't actually compact the heap all in one go, so you would preferentially select evacuation-candidate blocks by choosing the blocks with the least live data (as measured at the last GC), hoping that little data will need to be evacuated. 4. Finally, for environments in which the heap is growable, you could just grow the heap instead. In this case you would configure the system to target a heap size multiplier rather than a heap size, which would scale the heap to be e.g. twice the size of the live data, as measured at the last collection. If you have a growable heap, I think you will rarely choose to compact rather than grow the heap: you will either collect or grow. Under constant allocation rate, the rate of empty blocks being reclaimed from freed lospace objects will be equal to the rate at which they are needed, so if collection doesn't produce any, then that means your live data set is increasing and so growing is a good option. Anyway let's put growable heaps aside, as heap-growth heuristics are a separate gnarly problem. The question becomes, when should large object allocation force a compaction? Absent growable heaps, the answer is clear: when allocating a large object fails because there are no empty pages, but the statistics show that there is actually ample free memory. Good! We have one heuristic, and one with an optimum: you could compact in other situations but from the point of view of lospace, waiting until allocation failure is the most efficient. shrinkage Moving on, another use of empty blocks is when shrinking the heap. The collector might decide that it's a good idea to return some memory to the operating system. For example, I enjoyed this recent paper on heuristics for optimum heap size, that advocates that you size the heap in proportion to the square root of the allocation rate, and that as a consequence, when/if the application reaches a dormant state, it should promptly return memory to the OS. Here, we have a similar heuristic for when to evacuate: when we would like to release memory to the OS but we have no empty blocks, we should compact. We use the same evacuation candidate selection approach as before, also, aiming for maximum empty block yield. fragmentation What if you go to allocate a medium object, say 4kB, but there is no hole that's 4kB or larger? In that case, your heap is fragmented. The smaller your heap size, the more likely this is to happen. We should compact the heap to make the maximum hole size larger. side note: compaction via partial evacuation The evacuation strategy of Immix is... optimistic. A mark-compact collector will compact the whole heap, but Immix will only be able to evacuate a fraction of it. It's worth dwelling on this a bit. As described in the paper, Immix reserves around 2-3% of overall space for evacuation overhead. Let's say you decide to evacuate: you start with 2-3% of blocks being empty (the target blocks), and choose a corresponding set of candidate blocks for evacuation (the source blocks). Since Immix is a one-pass collector, it doesn't know how much data is live when it starts collecting. It may not know that the blocks that it is evacuating will fit into the target space. As specified in the original paper, if the target space fills up, Immix will mark in place instead of evacuating; an evacuation candidate block with marked-in-place objects would then be non-empty at the end of collection. In fact if you choose a set of evacuation candidates hoping to maximize your empty block yield, based on an estimate of live data instead of limiting to only the number of target blocks, I think it's possible to actually fill the targets before the source blocks empty, leaving you with no empty blocks at the end! (This can happen due to inaccurate live data estimations, or via internal fragmentation with the block size.) The only way to avoid this is to never select more evacuation candidate blocks than you have in target blocks. If you are lucky, you won't have to use all of the target blocks, and so at the end you will end up with more free blocks than not, so a subsequent evacuation will be more effective. The defragmentation result in that case would still be pretty good, but the yield in free blocks is not great. In a production garbage collector I would still be tempted to be optimistic and select more evacuation candidate blocks than available empty target blocks, because it will require fewer rounds to compact the whole heap, if that's what you wanted to do. It would be a relatively rare occurrence to start an evacuation cycle. If you ran out of space while evacuating, in a production GC I would just temporarily commission some overhead blocks for evacuation and release them promptly after evacuation is complete. If you have a small heap multiplier in your Immix space, occasional partial evacuation in a long-running process would probably reach a steady state with blocks being either full or empty. Fragmented blocks would represent newer objects and evacuation would periodically sediment these into longer-lived dense blocks. mutator throughput Finally, the shape of the heap has its inverse in the shape of the holes into which the mutator can allocate. It's most efficient for the mutator if the heap has as few holes as possible: ideally just one large hole per block, which is the limit case of an empty block. The opposite extreme would be having every other "line" (in Immix terms) be used, so that free space is spread across the heap in a vast spray of one-line holes. Even if fragmentation is not a problem, perhaps because the application only allocates objects that pack neatly into lines, having to stutter all the time to look for holes is overhead for the mutator. Also, the result is that contemporaneous allocations are more likely to be placed farther apart in memory, leading to more cache misses when accessing data. Together, allocator overhead and access overhead lead to lower mutator throughput. When would this situation get so bad as to trigger compaction? Here I have no idea. There is no clear maximum. If compaction were free, we would compact all the time. But it's not; there's a tradeoff between the cost of compaction and mutator throughput. I think here I would punt. If the heap is being actively resized based on allocation rate, we'll hit the other heuristics first, and so we won't need to trigger evacuation/compaction based on mutator overhead. You could measure this, though, in terms of average or median hole size, or average or maximum number of holes per block. Since evacuation is partial, all you need to do is to identify some "bad" blocks and then perhaps evacuation becomes attractive. gc pause Welp, that's some thoughts on when to trigger evacuation in Immix. Next time, we'll talk about some engineering aspects of evacuation. Until then, happy consing! ### Carlos García Campos #### Thread safety support in libsoup3 In libsoup2 there’s some thread safety support that allows to send messages from a thread different than the one where the session is created. There are other APIs that can be used concurrently too, like accessing some of the session properties, and others that aren’t thread safe at all. It’s not clear what’s thread safe and even sending a message is not fully thread safe either, depending on the session features involved. However, several applications relay on the thread safety support and have always worked surprisingly well. In libsoup3 we decided to remove the (broken) thread safety support and only allowed to use the API from the same thread where the session was created. This simplified the code and made easier to add the HTTP/2 implementation. Note that HTTP/2 supports multiple request over the same TCP connection, which is a lot more efficient than starting multiple requests from several threads in parallel. When apps started to be ported to libsoup3, those that relied on the thread safety support ended up being a pain to be ported. Major refactorings where required to either stop using the sync API from secondary threads, or moving all the soup usage to the same secondary thread. We managed to make it work in several modules like gstreamer and gvfs, but others like evolution required a lot more work. The extra work was definitely worth it and resulted in much better and more efficient code. But we also understand that porting an application to a new version of a dependency is not a top priority task for maintainers. So, in order to help with the migration to libsoup3, we decided to add thread safety support to libsoup3 again, but this time trying to cover all the APIs involved in sending a message and documenting what’s expected to be thread safe. Also, since we didn’t remove the sync APIs, it’s expected that we support sending messages synchronously from secondary threads. We still encourage to use only the async APIS from a single thread, because that’s the most efficient way, especially for HTTP/2 requests, but apps currently using threads can be easily ported first and then refactored later. The thread safety support in libsoup3 is expected to cover only one use case: sending messages. All other APIs, including accessing session properties, are not thread safe and can only be used from the thread where the session is created. There are a few important things to consider when using multiple threads in libsoup3: • In the case of HTTP/2, two messages for the same host sent from different threads will not use the same connection, so the advantage of HTTP/2 multiplexing is lost. • Only the API to send messages can be called concurrently from multiple threads. So, in case of using multiple threads, you must configure the session (setting network properties, features, etc.) from the thread it was created and before any request is made. • All signals associated to a message (SoupSession::request-queued, SoupSession::request-unqueued, and all SoupMessage signals) are emitted from the thread that started the request, and all the IO will happen there too. • The session can be created in any thread, but all session APIs except the methods to send messages must be called from the thread where the session was created. • To use the async API from a thread different than the one where the session was created, the thread must have a thread default main context where the async callbacks are dispatched. • The sync API doesn’t need any main context at all. ## June 20, 2022 ### Tiago Vignatti #### Short blog post from Madrid's bus This post was inspired by “Short blog post from Madrid’s hotel room” from my colleague Frédéric Wang. You should really check his post instead! To Fred: thanks for the feedback and review here. I’m in for football in the next Summit, alright? :-) This week, I finally went to A Coruña for the Web Engines Hackfest and internal company meetings. These were my first on-site events since the COVID-19 pandemic. After two years of non-super-exciting virtual conferences I was so glad to finally be able to meet with colleagues and other people from the Web. ### Andy Wingo #### blocks and pages and large objects Good day! In a recent dispatch we talked about the fundamental garbage collection algorithms, also introducing the Immix mark-region collector. Immix mostly leaves objects in place but can move objects if it thinks it would be profitable. But when would it decide that this is a good idea? Are there cases in which it is necessary? I promised to answer those questions in a followup article, but I didn't say which followup :) Before I get there, I want to talk about paged spaces. enter the multispace We mentioned that Immix divides the heap into blocks (32kB or so), and that no object can span multiple blocks. "Large" objects -- defined by Immix to be more than 8kB -- go to a separate "large object space", or "lospace" for short. Though the implementation of a large object space is relatively simple, I found that it has some points that are quite subtle. Probably the most important of these points relates to heap size. Consider that if you just had one space, implemented using mark-compact maybe, then the procedure to allocate a 16 kB object would go: 1. Try to bump the allocation pointer by 16kB. Is it still within range? If so we are done. 2. Otherwise, collect garbage and try again. If after GC there isn't enough space, the allocation fails. In step (2), collecting garbage could decide to grow or shrink the heap. However when evaluating collector algorithms, you generally want to avoid dynamically-sized heaps. cheatery Here is where I need to make an embarrassing admission. In my role as co-maintainer of the Guile programming language implementation, I have long noodled around with benchmarks, comparing Guile to Chez, Chicken, and other implementations. It's good fun. However, I only realized recently that I had a magic knob that I could turn to win more benchmarks: simply make the heap bigger. Make it start bigger, make it grow faster, whatever it takes. For a program that does its work in some fixed amount of total allocation, a bigger heap will require fewer collections, and therefore generally take less time. (Some amount of collection may be good for performance as it improves locality, but this is a marginal factor.) Of course I didn't really go wild with this knob but it now makes me doubt all benchmarks I have ever seen: are we really using benchmarks to select for fast implementations, or are we in fact selecting for implementations with cheeky heap size heuristics? Consider even any of the common allocation-heavy JavaScript benchmarks, DeltaBlue or Earley or the like; to win these benchmarks, web browsers are incentivised to have large heaps. In the real world, though, a more parsimonious policy might be more appreciated by users. Java people have known this for quite some time, and are therefore used to fixing the heap size while running benchmarks. For example, people will measure the minimum amount of memory that can allow a benchmark to run, and then configure the heap to be a constant multiplier of this minimum size. The MMTK garbage collector toolkit can't even grow the heap at all currently: it's an important feature for production garbage collectors, but as they are just now migrating out of the research phase, heap growth (and shrinking) hasn't yet been a priority. lospace So now consider a garbage collector that has two spaces: an Immix space for allocations of 8kB and below, and a large object space for, well, larger objects. How do you divide the available memory between the two spaces? Could the balance between immix and lospace change at run-time? If you never had large objects, would you be wasting space at all? Conversely is there a strategy that can also work for only large objects? Perhaps the answer is obvious to you, but it wasn't to me. After much reading of the MMTK source code and pondering, here is what I understand the state of the art to be. 1. Arrange for your main space -- Immix, mark-sweep, whatever -- to be block-structured, and able to dynamically decomission or recommission blocks, perhaps via MADV_DONTNEED. This works if the blocks are even multiples of the underlying OS page size. 2. Keep a counter of however many bytes the lospace currently has. 3. When you go to allocate a large object, increment the lospace byte counter, and then round up to number of blocks to decommission from the main paged space. If this is more than are currently decommissioned, find some empty blocks and decommission them. 4. If no empty blocks were found, collect, and try again. If the second try doesn't work, then the allocation fails. 5. Now that the paged space has shrunk, lospace can allocate. You can use the system malloc, but probably better to use mmap, so that if these objects are collected, you can just MADV_DONTNEED them and keep them around for later re-use. 6. After GC runs, explicitly return the memory for any object in lospace that wasn't visited when the object graph was traversed. Decrement the lospace byte counter and possibly return some empty blocks to the paged space. There are some interesting aspects about this strategy. One is, the memory that you return to the OS doesn't need to be contiguous. When allocating a 50 MB object, you don't have to find 50 MB of contiguous free space, because any set of blocks that adds up to 50 MB will do. Another aspect is that this adaptive strategy can work for any ratio of large to non-large objects. The user doesn't have to manually set the sizes of the various spaces. This strategy does assume that address space is larger than heap size, but only by a factor of 2 (modulo fragmentation for the large object space). Therefore our risk of running afoul of user resource limits and kernel overcommit heuristics is low. The one underspecified part of this algorithm is... did you see it? "Find some empty blocks". If the main paged space does lazy sweeping -- only scanning a block for holes right before the block will be used for allocation -- then after a collection we don't actually know very much about the heap, and notably, we don't know what blocks are empty. (We could know it, of course, but it would take time; you could traverse the line mark arrays for all blocks while the world is stopped, but this increases pause time. The original Immix collector does this, however.) In the system I've been working on, instead I have it so that if a mutator finds an empty block, it puts it on a separate list, and then takes another block, only allocating into empty blocks once all blocks are swept. If the lospace needs blocks, it sweeps eagerly until it finds enough empty blocks, throwing away any nonempty blocks. This causes the next collection to happen sooner, but that's not a terrible thing; this only occurs when rebalancing lospace versus paged-space size, because if you have a constant allocation rate on the lospace side, you will also have a complementary rate of production of empty blocks by GC, as they are recommissioned when lospace objects are reclaimed. What if your main paged space has ample space for allocating a large object, but there are no empty blocks, because live objects are equally peppered around all blocks? In that case, often the application would be best served by growing the heap, but maybe not. In any case in a strict-heap-size environment, we need a solution. But for that... let's pick up another day. Until then, happy hacking! ## June 17, 2022 ### Frédéric Wang #### Short blog post from Madrid's hotel room This week, I finally went back to A Coruña for the Web Engines Hackfest and internal company meetings. These were my first on-site events since the COVID-19 pandemic. After two years of non-super-exciting virtual conferences I was so glad to finally be able to meet with colleagues and other people from the Web. Igalia has grown considerably and I finally get to know many new hires in person. Obviously, some people were still not able to travel despite the effort we put to settle strong sanitary measures. Nevertheless, our infrastructure has also improved a lot and we were able to provide remote communication during these events, in order to give people a chance to attend and participate ! Work on the Madrid–Galicia high-speed rail line finally completed last December, meaning one can now travel with fast trains between Paris - Barcelona - Madrid - A Coruña. This takes about one day and a half though and, because I’m voting for the Legislative elections in France, I had to shorten a bit my stay and miss nice social activities 😥… That’s a pity, but I’m looking forward to participating more next time! Finally on the technical side, my main contribution was to present our upcoming plan to ship MathML in Chromium. The summary is that we are happy with this first implementation and will send the intent-to-ship next week. There are minor issues to address, but the consensus from the conversations we had with other attendees (including folks from Google and Mozilla) is that they should not be a blocker and can be refined depending on the feedback from API owners. So let’s do it and see what happens… There is definitely a lot more to write and nice pictures to share, but it’s starting to be late here and I have a train back to Paris tomorrow. 😉 ## June 16, 2022 ### Iago Toral #### V3DV Vulkan 1.2 status A quick update on my latest activities around V3DV: I’ve been focusing on getting the driver ready for Vulkan 1.2 conformance, which mostly involved fixing a few CTS tests of the kind that would only fail occasionally, these are always fun :). I think we have fixed all the issues now and we are ready to submit conformance to Khronos, my colleague Alejandro Piñeiro is now working on that. ## June 15, 2022 ### Andy Wingo #### defragmentation Good morning, hackers! Been a while. It used to be that I had long blocks of uninterrupted time to think and work on projects. Now I have two kids; the longest such time-blocks are on trains (too infrequent, but it happens) and in a less effective but more frequent fashion, after the kids are sleeping. As I start writing this, I'm in an airport waiting for a delayed flight -- my first since the pandemic -- so we can consider this to be the former case. It is perhaps out of mechanical sympathy that I have been using my reclaimed time to noodle on a garbage collector. Managing space and managing time have similar concerns: how to do much with little, efficiently packing different-sized allocations into a finite resource. I have been itching to write a GC for years, but the proximate event that pushed me over the edge was reading about the Immix collection algorithm a few months ago. on fundamentals Immix is a "mark-region" collection algorithm. I say "algorithm" rather than "collector" because it's more like a strategy or something that you have to put into practice by making a concrete collector, the other fundamental algorithms being copying/evacuation, mark-sweep, and mark-compact. To build a collector, you might combine a number of spaces that use different strategies. A common choice would be to have a semi-space copying young generation, a mark-sweep old space, and maybe a treadmill large object space (a kind of copying collector, logically; more on that later). Then you have heuristics that determine what object goes where, when. On the engineering side, there's quite a number of choices to make there too: probably you make some parts of your collector to be parallel, maybe the collector and the mutator (the user program) can run concurrently, and so on. Things get complicated, but the fundamental algorithms are relatively simple, and present interesting fundamental tradeoffs. figure 1 from the immix paper For example, mark-compact is most parsimonious regarding space usage -- for a given program, a garbage collector using a mark-compact algorithm will require less memory than one that uses mark-sweep. However, mark-compact algorithms all require at least two passes over the heap: one to identify live objects (mark), and at least one to relocate them (compact). This makes them less efficient in terms of overall program throughput and can also increase latency (GC pause times). Copying or evacuating spaces can be more CPU-efficient than mark-compact spaces, as reclaiming memory avoids traversing the heap twice; a copying space copies objects as it traverses the live object graph instead of after the traversal (mark phase) is complete. However, a copying space's minimum heap size is quite high, and it only reaches competitive efficiencies at large heap sizes. For example, if your program needs 100 MB of space for its live data, a semi-space copying collector will need at least 200 MB of space in the heap (a 2x multiplier, we say), and will only run efficiently at something more like 4-5x. It's a reasonable tradeoff to make for small spaces such as nurseries, but as a mature space, it's so memory-hungry that users will be unhappy if you make it responsible for a large portion of your memory. Finally, mark-sweep is quite efficient in terms of program throughput, because like copying it traverses the heap in just one pass, and because it leaves objects in place instead of moving them. But! Unlike the other two fundamental algorithms, mark-sweep leaves the heap in a fragmented state: instead of having all live objects packed into a contiguous block, memory is interspersed with live objects and free space. So the collector can run quickly but the allocator stops and stutters as it accesses disparate regions of memory. allocators Collectors are paired with allocators. For mark-compact and copying/evacuation, the allocator consists of a pointer to free space and a limit. Objects are allocated by bumping the allocation pointer, a fast operation that also preserves locality between contemporaneous allocations, improving overall program throughput. But for mark-sweep, we run into a problem: say you go to allocate a 1 kilobyte byte array, do you actually have space for that? Generally speaking, mark-sweep allocators solve this problem via freelist allocation: the allocator has an array of lists of free objects, one for each "size class" (say 2 words, 3 words, and so on up to 16 words, then more sparsely up to the largest allocatable size maybe), and services allocations from their appropriate size class's freelist. This prevents the 1 kB free space that we need from being "used up" by a 16-byte allocation that could just have well gone elsewhere. However, freelists prevent objects allocated around the same time from being deterministically placed in nearby memory locations. This increases variance and decreases overall throughput for both the allocation operations but also for pointer-chasing in the course of the program's execution. Also, in a mark-sweep collector, we can still reach a situation where there is enough space on the heap for an allocation, but that free space broken up into too many pieces: the heap is fragmented. For this reason, many systems that perform mark-sweep collection can choose to compact, if heuristics show it might be profitable. Because the usual strategy is mark-sweep, though, they still use freelist allocation. on immix and mark-region Mark-region collectors are like mark-sweep collectors, except that they do bump-pointer allocation into the holes between survivor objects. Sounds simple, right? To my mind, though the fundamental challenge in implementing a mark-region collector is how to handle fragmentation. Let's take a look at how Immix solves this problem. part of figure 2 from the immix paper Firstly, Immix partitions the heap into blocks, which might be 32 kB in size or so. No object can span a block. Block size should be chosen to be a nice power-of-two multiple of the system page size, not so small that common object allocations wouldn't fit. Allocating "large" objects -- greater than 8 kB, for Immix -- go to a separate space that is managed in a different way. Within a block, Immix divides space into lines -- maybe 128 bytes long. Objects can span lines. Any line that does not contain (a part of) an object that survived the previous collection is part of a hole. A hole is a contiguous span of free lines in a block. On the allocation side, Immix does bump-pointer allocation into holes. If a mutator doesn't have a hole currently, it scans the current block (obtaining one if needed) for the next hole, via a side-table of per-line mark bits: one bit per line. Lines without the mark are in holes. Scanning for holes is fairly cheap, because the line size is not too small. Note, there are also per-object mark bits as well; just because you've marked a line doesn't mean that you've traced all objects on that line. Allocating into a hole has good expected performance as well, as it's bump-pointer, and the minimum size isn't tiny. In the worst case of a hole consisting of a single line, you have 128 bytes to work with. This size is large enough for the majority of objects, given that most objects are small. mitigating fragmentation Immix still has some challenges regarding fragmentation. There is some loss in which a single (piece of an) object can keep a line marked, wasting any free space on that line. Also, when an object can't fit into a hole, any space left in that hole is lost, at least until the next collection. This loss could also occur for the next hole, and the next and the next and so on until Immix finds a hole that's big enough. In a mark-sweep collector with lazy sweeping, these free extents could instead be placed on freelists and used when needed, but in Immix there is no such facility (by design). One mitigation for fragmentation risks is "overflow allocation": when allocating an object larger than a line (a medium object), and Immix can't find a hole before the end of the block, Immix allocates into a completely free block. So actually mutator threads allocate into two blocks at a time: one for small objects and medium objects if possible, and the other for medium objects when necessary. Another mitigation is that large objects are allocated into their own space, so an Immix space will never be used for blocks larger than, say, 8kB. The other mitigation is that Immix can choose to evacuate instead of mark. How does this work? Is it worth it? stw This question about the practical tradeoffs involving evacuation is the one I wanted to pose when I started this article; I have gotten to the point of implementing this part of Immix and I have some doubts. But, this article is long enough, and my plane is about to land, so let's revisit this on my return flight. Until then, see you later, allocators! ## June 10, 2022 ### Alex Surkov #### Automated accessibility testing ### Intro I think it’s fair to say that every application developer needs to take care of accessibility at some point. Indeed, even if you use an accessible toolkit to create an application, the app isn’t always accessible simply because the combination of accessible parts is not always accessible. In reality, many things can go wrong. Without care the app may become completely inaccessible as complexity increases If you’re a web developer, then you’re (hopefully) already familiar with WCAG, ARIA and other cool stuff helping to address accessibility issues in web apps. If you are a platform developer such as an Android or iOS, then you probably know a bazillion of tricks to keep mobile apps accessible. You might also know how to run the app over screen readers and other assistive technology softwares to ensure everything goes smoothly and works as expected. And you might already have started thinking of how you can use automated testing to cover all the accessibility caveats to avoid regressions and to reduce overhead of the manual testing. So let’s talk about accessibility automated testing. There’s no universal solution that would embrace each and every platform and every single case. However there are many existing approaches, some are better than others. Some of them are fairly solid. Having said that looking at the diversity of the existing solutions and realizing how much people still do the manual testing (AAM specs, such as ARIA or HTML accessibility APIs mappings, can be a great example of it), I think it’d be amazing to systemize the existing techniques and come with a better, universal solution that would cover the majority of cases. I hope this post can help to find the right way to do this. ### What to test First things come first: what is the scope of automated accessibility testing, i.e. what exactly do we want to test? ### Web Without a doubt, the web is vast and complex and plays a significant role in our lives. It surely has to be accessible. But what is an accessible web, exactly? The main conductors on the web are web browsers. They make up the web platform by providing all of the tiny building blocks such as HTML elements to create web content or ARIA attributes to define semantics. Browsers deliver web content to the users by rendering it on a screen and expose its semantics to the assistive technologies such as screen readers. All these blocks must be accessible. In particular that means the browsers are responsible to expose all building blocks to the assistive technologies correctly. It’s very tempting to say if a browser is doing the job well, then a web author cannot go wrong by using these blocks (for sure, if the web author is not doing anything strange on purpose). It sounds about right and it works nicely in theory. But, in practice accessibility issues suddenly pop up as the complexity of a web app goes up. ### Browsers The browsers are certainly the major use case in the web accessibility testing and I’d like to get them covered, simply because the web cannot be accessible without accessible browsers. Also, they already do a decent job on accessibility testing, and we can learn a lot from them. Indeed, they’re all stuffed with accessibility testing solutions and each has its own test harness for automating the process. Their system could be unified and adjusted to a broader range of uses. ### Web apps The web applications are the second piece of the web puzzle. They also must be covered. The webapps are made up of small and accessible building blocks (if the browser does a good job). However, as I previously noted, it is not sufficient to have individually accessible parts — it is necessary that all of the combinations of parts are accessible. It may sound quite obvious since QA and end to end testing wasn’t invented yesterday, but this is something overlooked quite often. So this is one more yet use case on the table: the overall integrated web application’s accessibility. ### Platform Although the web is vital, there’s a sizable market of desktop and mobile applications which also need to get accessibility tested. Some desktop/mobile apps are using embedded browsers under-the-hood as a rendering engine, which brings them to the web scope. But speaking generally, desktop/mobile apps are not the web. Having said that, it’s worth noting that browsers and desktop/mobile applications coexist in the same environment, and they use the very same platform accessibility APIs to expose the content to the assistive technologies. It makes web and desktop/mobile applications have more in common than people usually tend to think. I’m from the browser world and I keep focus on the web accessibility as many of you probably are, but let’s keep in mind desktop and mobile applications as one more use case. We will have them covered as well. ### How to test The next question to address is how to test accessibility or how to ensure that the application is accessible. You could test your entire application with a screen reader or a magnifier on your own and that would be pretty trustworthy, but what exactly to test when it comes to automated testing? ### Unit testing Unit testing is the first level of automated testing, and accessibility is no exception. It allows you to perform very low-level testing such as testing individual c++ classes or JS modules, or testing all internal things that are never disclosed or can never be tested via any public API as the pure under-the-hood thing. As a result, unit testing is a critical component in automated testing. However, it has nothing to do with accessibility specifically, and there’s nothing to generalize or systemize for the benefits of accessibility. It’s simply something that all systems must have. ### Accessibility APIs mappings When it comes to automated accessibility testing, the first and probably the best thing to start from is to test accessibility APIs. Accessibility APIs is the universal language that any accessible application can be expressed in. If accessibility properties of UI looks about right, then you have great chances the app is accessible. By the way, this is the most common strategy in accessibility testing today in web browsers. To give an example how accessibility APIs testing can be used practically, you can think of AAM web specifications. These specs define how ARIA attributes or HTML/SVG elements should be mapped to the accessibility APIs. This is a somewhat restrictive example to demonstrate the capabilities of accessibility APIs testing, because AAM utilize only a few things peculiar to accessibility APIs (for example, it misses key things like accessible relations, hierarchies or actions) but it’s a good example to get a sense of what kind of things can be tested when it comes to accessibility APIs. Accessibility APIs have complex nature. They differ from platform to platform, but they share many traits in common. It’s not surprising at all, because they all are designed to expose the same thing: the semantics of user interface. Here are the main categories you can find in many accessibility APIs. • States and properties: an example for these could be focused or selected states on a menu item or accessible name/description for properties. • Methods or interfaces allow to retrieve complex information about accessible elements, for example, to retrieve information about list or grid controls. • The relation concept is a key one. It defines how accessible elements relate to each other. In particular they are used to navigate content and/or to get extra information about elements such as labels for a control or headers for a table cell. • Accessible trees are a special case of accessible relations. However they are often defined as a separate entity. The accessible tree is a hierarchical representation of the content, and the ATs frequently rely on it. • As a rule of thumb there is also special support for text and selection. • There are also accessible actions for example a button click or expand/collapse a dropdown • Accessible events are used to notify the assistive technologies about app changes. All of this is a very typical if not comprehensive list of what can be (or should be) tested when it comes to accessibility platform APIs testing. Accessibility APIs are fairly low level testing, which is good or might be not depending on a use case. I would say it’s nearly perfect to test browsers, for example, how the browsers expose HTML elements and such. It also should be the right choice for different kinds of toolkits, which provides you with a set of building blocks, for example, extra controls on top of HTML5 elements. I’m not quite confident that this type of testing is exactly what web developers look for when it comes to webapps accessibility testing, because it’s fairly low level and requires understanding of under-the-hood things, but certainly they can benefit from checking accessibility properties on certain elements/structures, for example, for web components. ### Assistive Technologies testing Assistive Technologies (AT) testing makes another layer of automated testing. Unlinke accessibility APIs testing, this is a great way to ensure your app is spoken exactly as you want over different screen readers or it gets zoomed properly when it comes to screen magnifiers. Any application including both browsers and web apps can benefit from integrational AT testing by running individual control elements or separate UI blocks though the ATs. AT testing can be also used for end-to-end testing. However, as any end-to-end testing it cannot be comprehensive because of a steep rise in testing scenarios as the app complexity increases. Having said that, it certainly can be helpful to check the most crucial scenarios and can make a real great addition to the integrational accessibility APIs testing. ### Testing flow Testing a single HTML page to check an accessible tree and/or its relevant accessibility properties makes a nice pattern of atomic testing. It is quite similar to unit testing. However, the reality is that not every use case can be reduced to a simple static page. It makes such testing quite restrictive. The real world affirms we need to test accessibility in dynamics for a full spectrum of testing scenarios, such as clicking a button and checking what happens next. This makes another important piece of the puzzle, a testing flow, or in other words, how to control an application to test its dynamics, where you can query accessibility properties at the right time. A typical test flow can be described in three simple steps: • Query-n-check: test accessibility properties against expectations. • Action: change a value or trigger an action: it represents the dynamics part and describes how a test interacts with an application. • On-hold: wait for an event to hold the test execution until the app gets to the right state. To summarize there are two key pieces we’d need for a testing flow. First, to trigger an action. This can be emulation of user actions, triggering accessible actions or changing accessible properties, anything that allows the user to operate and control the application. Secondly, we need the ability to hold on to a test execution until certain criterias are met. That could be an accessible event or presence of certain accessible properties on an element, or waiting until a certain text is visible, whatever, just to put the test execution on pause until it’s good to proceed. ### What we have now Let’s take a look at what the market has to propose. I don’t know much about desktop or mobile applications, many of them are not open source, so no chance to sneak a peek. But I think it’d be a good idea to start from the web and the web browsers in particular, who have made some real good progress on accessibility testing. ### AOM The first thing I would like to mention is AOM. It’s not the browsers, but it’s something closely related. AOM stands for Accessible Object Model. This is an attempt to expose accessibility properties in a cross-platform way in browsers. Similarly to platform accessibility APIs, AOM defines roles, properties, relations, and an accessible tree. You can think of AOM as the web accessibility API. So having AOM with all typical accessibility features can make a great platform for cross-browser accessibility testing. Certain elements like platform dependent mappings are not very suitable for AOM for sure. For example, AAM specs, the accessibility APIs mapping, which define how to expose the web to the platform accessibility APIs, do not make a great choice for AOM testing. Indeed, despite platform APIs similarities on the conceptual level, they differ in details. However, if we care about the logical compound only, AOM will suit nicely. For example, ARIA name computation algorithms make a great match for AOM testing, or AOM can be used to test relations between a label and a control. In this case we don’t worry what is the platform name for the relation, the only thing we want to test is the right relations are exposed. Sadly, it’s not something that was implemented. AOM is a long term vision, and the main focus till now has been on ARIA reflection, which is prototyped by a number of browsers by now. But ARIA reflection spec is just a DOM reflection of ARIA attributes and thus has somewhat lower testing capabilities. For example, HTML elements that don’t have a proper ARIA mapping cannot be tested by ARIA reflection or any of accessibility events. So AOM is something that has great testing potential but it’s not yet implemented or even specified. ### ATTA ATTA (Accessible Technology Test Adapter API) was the answer to manual accessibility testing for ARIA and HTML AAM specifications. Roughly speaking ATTA defines a protocol used to describe accessibility properties for a given HTML code snippet to create the test expectations. The expectations are sent to the ATTA server, which queries an ATTA driver for an accessible tree for the given HTML snippet and then checks it against the given expectations. The ATTA is integrated into WPT test suite and it could make a great solution for the web accessibility APIs testing, if it worked. There are implementations of ATTA drivers for IAccessible2, UI Automation and ATK but apparently none of them ever reached the ready-to-use status. So ATTA has a bunch of worthy ideas like WPT built-in integration and the modular system which allows you connect various platform APIs drivers, but sadly it is not finished yet and no active work any longer. ### Browsers Let’s take a look at the browsers. Browsers have quite a long history of accessibility support and they’ve made a great progress on accessibility testing. ### Firefox In Firefox all testing is done in Javascript. It’s worth noting Gecko’s accessibility core is a massive system that includes practically every feature of any desktop accessibility API. Gecko has a fairly mature set of cross-platform accessibility interfaces. Because the platform implementations are thin wrappers around the accessibility core, if you test something on a cross-platform layer in Gecko, then you can be confident that it will work well on all platforms. Gecko does, however, offer native NSAccessibility objects to JavaScript the same way that they do in cross-platform testing. It works nicely and allows one to poke all NSAccessibility attributes and parameterized attributes, as well as listening to the accessibility events. This approach is not portable as is, because it relies on a somewhat ancient Netscape-era technology. It could be adapted though to work through webidl if you want to to make it portable to other browsers, but fairly useless outside browsers world. Nevertheless it is certainly good to get inspired by. Here’s an example of the Gecko accessibility test. The test represents a typical scenario for the accessibility testing: you get a property, call an action, wait for an event, and then make sure the property value was adjusted properly. You can imagine that cross platform tests are quite similar to this one. Gecko implements its own test suite which provides a number of util functions such as ok() or is(), which are responsible to handle and report success or failures. This is the kind of testing system, where test expectations are listed explicitly in a test body, or in other words, the test says what to test and what are expectations. As a direct consequence if you need to change expectations, you have to adjust the test manually. It’s quite a typical testing system though. ### WebKit/Safari I think it’s fair to say that WebKit, being the engine behind the popular Safari browser, has decent NSAccessibility protocol support, but it also supports ATK and MSAA. The WebKit testsuite is rather straightforward and implements its testing capabilities in a cross platform style. It exposes a helper object to DOM, and you can query platform dependent accessible properties/methods from JavaScript. It’s quite similar to what Firefox does. The testsuite itself is quite different from Firefox though, which is not surprising. The WebKit test generates output which is compared to expectations stored in files. The test expectations are also listed in a test body, which makes the approach close to Gecko. Similar to Gecko, WebKit also supports event testing, here’s an example of a typical test. It might look bulky for a simple thing it does, but all stuff can be wrapped by a nice promise-based wrapper which will make the test more readable. The most important thing here is that WebKit also supports testing for all parts of a typical accessibility API. ### Chromium Beyond low-level c++ unit testing, Chromium relies on platform accessibility APIs testing. It is quite similar to Firefox or WebKit with one key difference. Chromium can dump an accessibility tree with relevant accessibility properties or can record accessibility events, which it subsequently compares to the expectation files. The main gotcha here is these tests can be rebaselined easily unlike other kinds of tests. If something changes on the API level, for example, if an accessible tree is spec’d out differently, new properties are added or old ones are removed — any change, then all you need is to rerun a test and capture output, which becomes the new expectations. It’s as if you take a snapshot that becomes a new standard, and then all following runs match it. Here’s a typical example of a tree test with mac and win expectation files. Chromium also reveals basic scripting capabilities. Those are mainly mac targeted though, and thus scoped by NSAccessibility protocol testing. However it allows testing all bits of the API, including methods, attributes, parameterized attributes, actions and events. ### What would make a great test suite? Let’s pull things together. Accessibility exists within the context of a platform, which glues together an application and the assistive technology by accessibility API. We want a platform-wide testing solution to test platform accessibility APIs. It should be possible to test a variety of things such as accessible trees, properties, methods and events. The solution should not be limited to just web browsers. It should be capable of covering web applications running in web browsers. We’d also like to test any application running on the system. Multiple platforms should be supported such as AT-SPI/ATK on Linux, MSAA/IA2 and UIA on Windows and NSAccessibility protocol on Mac. The solution has to be extensible in case we need to support other platforms in future. Test flow control should be supported out-of-the-box, such as interacting with an app and then waiting for an accessible event. Another example of interactions will be the flow control directives allowing communication with the assistive technologies. Getting those covered will allow writing end-to-end tests. Last but not least. Easy test rebaselining is a key feature to have. If you ever need to change a test’s expectations, then you just rerun the test and record its output. It happens more often than you probably think. This testsuite allows you to adjust expectations with about zero effort. ### Chromium accessibility tools Chromium has decent accessibility tools to inspect a platform accessibility tree and to listen to accessibility events. They are available on all major platforms and can be easily ported to other platforms, essentially on any platform Chromium is running on. They are capable of inspecting any application on a system including web applications running in a web browser. All major desktop APIs are supported as well, namely AT-SPI on Linux, MSAA/IA2 and UIA on Windows and NSAccessibility protocol on Mac. In Chromium these tools are integrated into a test harness to perform platform accessibility testing. The testsuite supports test flow instructions and rebaselining mechanism. The tools can be beefed up with testing capabilities to become a new test harness, or be easily integrated into existing test suites or CDCI systems. The tools are not perfect and are not feature-complete. They have great potential though. As long as they are capable of providing all the must-have testing features we discussed above, all they need is some love I think. The tools are open source and anyone can contribute. However, having them as an integral part of the Chromium project makes it seem like they are inherently tied to that project and doesn’t make life easy for new contributors. It’s possible, however, that eventually the tools could get a new home. If new contributors bring the fresh vision and new use cases to shape the tool’s features, then it can make a great start of a new open source project which will embrace all innovations, and hopefully can solve a long standing problem of accessibility automated testing. Is it worth taking a shot? Are you ready to join the efforts? ## June 06, 2022 ### Alejandro Piñeiro #### Playing with the rpi4 CPU/GPU frequencies In recent days I have been testing how modifying the default CPU and GPU frequencies on the rpi4 increases the performance of our reference Vulkan applications. By default Raspbian uses 1500MHz and 500MHz respectively. But with a good heat dissipation (a good fan, rpi400 heat spreader, etc) you can play a little with those values. One of the tools we usually use to check performance changes are gfxreconstruct. This tools allows you to record all the Vulkan calls during a execution of an aplication, and then you can replay the captured file. So we have traces of several applications, and we use them to test any hypothetical performance improvement, or to verify that some change doesn’t cause a performance drop. So, let’s see what we got if we increase the CPU/GPU frequency, focused on the Unreal Engine 4 demos, that are the more shader intensive: So as expected, with higher clock speed we see a good boost in performance of ~10FPS for several of these demos. Some could wonder why the increase on the CPU frequency got so little impact. As I mentioned, we didn’t get those values from the real application, but from gfxreconstruct traces. Those are only capturing the Vulkan calls. So on those replays there are not tasks like collision detection, user input, etc that are usually handled on the CPU. Also as mentioned, all the Unreal Engine 4 demos uses really complex shaders, so the “bottleneck” there is the GPU. Let’s move now from the cold numbers, and test the real applications. Let’s start with the Unreal Engine 4 SunTemple demo, using the default CPU/GPU frequencies (1500/500): Even if it runs fairly smooth most of the time at ~24 FPS, there are some places where it dips below 18 FPS. Let’s see now increasing the CPU/GPU frequencies to 1800/750: Now the demo runs at ~34 FPS most of the time. The worse dip is ~24 FPS. It is a lot smoother than before. Here is another example with the Unreal Engine 4 Shooter demo, already increasing the CPU/GPU frequencies: Here the FPS never dips below 34FPS, staying at ~40FPS most of time. It has been around 1 year and a half since we announced a Vulkan 1.0 driver for Raspberry Pi 4, and since then we have made significant performance improvements, mostly around our compiler stack, that have notably improved some of these demos. In some cases (like the Unreal Engine 4 Shooter demo) we got a 50%-60% improvement (if you want more details about the compiler work, you can read the details here). In this post we can see how after this and taking advantage of increasing the CPU and GPU frequencies, we can really start to get reasonable framerates in more demanding demos. Even if this is still at low resolutions (for this post all the demos were running at 640×480), it is still great to see this on a Raspberry Pi. ## June 02, 2022 ### Brian Kardell #### Spicy Progress # Spicy Progress For a while, we had lots of updates and excitement on our "spicy-sections" proposal - but you've probably noticed it's gotten a little quiet. In this post, I'll explain why and lay out where I think we are and how I hope we can move foward. A month or so ago, the Tabvengers were working hard to advance some open issues, work out details (including the name 'oui-panelset') and create a stable, high-parity version of what our actual proposal would look like so that we could hand ownership of it to OpenUI officially and work with a more official proposal. Jon Neal, in particular, worked very hard to make something pretty great. However, as is often the case, with more brains and eyeballs comes new scrutiny and feedback. In particular, during accesibility reviews to make sure we hadn't messed something up along the way (we did, because it was a total rewrite that broke all of the tests), we hit a new perspective. Based on some questions asked during that review about examples of where it could be used, we quickly assembled a series of about 50 links to content in public home pages where we believed our "spicy-sections" (aka "oui-panelset") proposal could be used to great effect, as some supporting history in order to discuss. In this discussion, one of the co-chairs of the ARIA Working Group suggested that perhaps most of those shouldn't even be using ARIA tabs. Worse, there was some concern that if that was true, then making it exceptionally easy for them to do just that is potentially actively harmful. Suspend any opinion on what you might think that means, or your own thoughts for a moment... It's definitely worth slowing down and trying to understand and take some care. It's already led to a number of interesting conversations. We need to understand "why not?" and try to articulate some specifics. At the moment, we're still working to sort out a lot here, but you can expect more on that soon. In some further discussion we began discussing that there are sort of distinct "kinds" of intefaces that many people would refer to as tabs. This is somewhat unsuprising as our research points out that many/most modern UI kits actually have more than one control for tab-like things. My own posts and notes also make sure to note that examples like browser tabs are a different control than the kind that spicy-sections/panelset are talking about - those kinds of tabs are sort of window-managers with additional features that even users understand differently. No one would ever expect Ctrl+F, for example to search all the open windows. We expect close buttons, perhaps status indicators, the ability to reorder them. So, first thing was to introduce some separate terminolgy for clarity. The things we've been doing with "spicy sections" (and perhaps the sorts of things you'll be able to build with CSS Toggles) could be classified as "content tabs" (I believe it was Sarah Higley that offered the term). Again, this makes sense of past dicssions - Hixie said many years ago "tabs are a matter of overflow". That makes a lot of sense for "content tabs" and absolutely none for something like browser tabs. ## If not that...? Ok, but... Assume, for the moment, that ARIA tabs are not always appropriate. Assume that we can articulate the right specifics about when they are and are not. It kind of begs the question "if not that, what?". Because, it's not like those pages will cease to have interfaces that generally look and act like that. It's very clear that "content tabs" exist in the wild and have plenty of benefits, and in order to judge whether it is true that some other, non-ARIA using solution is better, we need something concrete to compare it with, and that we can do user testing with. If it is true that something else is better, we have to be very careful in describing how to do the better thing to we prevent authors from merely falling into a different series of pitfalls, maybe even still give them that element. But what is that other thing specifically? ### Further thought and experiments So... I've been thinking about this a lot. I found several examples of "not ARIA tabs" which have very different characteristics, including from an accessibility standpoint. There are a lot of ways we can get these wrong. I've been looking for something without glaring issues. To this end, and through some discussion with Adam Argyle and Jon Neal, I have created this very rough functional sketch of what I am thinking makes a good springboard for discussing specifics and trying to work through problems. Just like with the original, if your window is wide enough, this will display as 'content tabs' otherwise it will just show plain content. See the Pen spicy-alternative-sketch by вкαя∂εℓℓ (@briankardell) on CodePen. As of this posting, please don't use the code in this pen. It needs additional testing and discussion, but perhaps more importantly it is affected by one (or two, depending on your system) bugs in Firefox, being worked on now. #### What it is.. Is it a somewhat reduced version of the current oui-panelset which simply lets you specify only a 'content-tabs' affordance (or not). The markup, parts, etc are identical. However, instead of realizing this as ARIA tabs, it turns these into a TOC of links and uses a scroll-port and scroll-snapping to create the tab-like presentation, based on one of Adam's experiments. #### What I like about it Well, it is pretty simple, and it doubles down on all of the original points and observations. It's now not just a case that it is "similar to scroll" - it is literally scroll. But with that comes some other interesting good qualities right out of the box: Find-in-page Just Works™. Headings aren't "turned into" tabs, they continue to exist - so screen reader users experience this as normal content and it is navigable by headings. We could probably apply it even to something like a carosel or paging if we're not fighting about the strict semantics of "what it is". You can leave the headings visible or easily hide them (assuming something isn't busted, see below) with a simple rule like /* you can use a negative margin to hide */ oui-panelset h2 { margin-top: -4rem; }  If it turned out to be useful, then this is the proof for CSS Toggles and it makes the job of what they need to accomplish much clearer and easier. ### What I am not so sure of... As I say often: the devil is in the details. This is a crude pen prototype, we have to do a lot of additional questioning and make sure this is resilient, and we have to test and improve it - a lot, probably, to know if it is really viable. For example, currently it 'jumps' the content when you click a link - I'm not sure how much we can accurately avoid this and remain resilient - but maybe. It raises other new questions though too: What happens if you do 'select all' -- currently it will literally select everything. Is that right? Or should that limit to the scrollport - and so on. It doesn't 'project' the headings into slots, it 'mirrors' them - as far as I know, that isn't something anything else does. Is it a deal breaker? I don't know! These 'tabs' are really links and as such they have all of the qualities of regular links. For example, one navigates them with the tab key and they have to manually activate - there is no roving focus or automatic activation. Some people see this as a pro, some as a con. We'll need to do A/B testing. Finally - most importantly - do I think it is actually 'better?'. To be honest, I'm somewhat unconvinced, that there is a "right" answer here. It would not surprise me at all if the results was that we learn that some users prefer for tabs to use automatic activation and roving tab index, and others don't. It would not suprise me at all if some users prefer that all of these present as ARIA tabs, and others are confused by it. I expect that many of these questions have something to do with classes of users and individual preferences. That said, we've been working to find ways to answer all the problems and we think we have some ideas ### What I am hoping What I am hoping is that having something concrete to discuss and debate, even if it is somewhat sketchy, actually lets us do that. I would like to see us focus on ::parts and styling specification that is largely shared for both "kinds" of tabs (I think we're actually very close) such that even if they wind up being two different things, we can reuse much learning and code. I would like to see us figure out how we sort this out via a custom element (or elements) when we don't know the final answer. Perhaps one option would be to add this affordance as an option to our <oui-panelset> proposal, perhaps even discuss whether it should be the only tabs-like affordance it supports initially. However, I would like to keep the door open to having the current use of ARIA as option too, since the truth is we don't know yet.. What I think _could_ be idea is that we could expose a user preference in browsers and let users pick what they prefer. This means that if we get it wrong by default, remediation is as simple as changing a single attribute value or CSS property. ## May 30, 2022 ### Christopher Michael #### Modesetting: A Glamor-less RPi adventure The goal of this adventure is to have hardware acceleration for applications when we have Glamor disabled in the X server. ### What is Glamor ? Glamor is a GL-based rendering acceleration library for the X server that can use OpenGL, EGL, or GBM. It uses GL functions & shaders to complete 2D graphics operations, and uses normal textures to represent drawable pixmaps where possible. Glamor calls GL functions to render to a texture directly and is somehow hardware independent. If the GL rendering cannot complete due to failure (or not being supported), then Glamor will fallback to software rendering (via llvmpipe) which uses framebuffer functions. ### Why disable Glamor ? On current RPi images like bullseye, Glamor is disabled by default for RPi 1-3 devices. This means that there is no hardware acceleration out of the box. The main reason for not using Glamor on RPi 1-3 hardware is because it uses GPU memory (CMA memory) which is limited to 256Mb. If you run out of CMA memory, then the X server cannot allocate memory for pixmaps and your system will crash. RPi 1-3 devices currently use V3D as the render GPU. V3D can only sample from tiled buffers, but it can render to tiled or linear buffers. If V3D needs to sample from a linear buffer, then we allocate a shadow buffer and transform the linear image to a tiled layout in the shadow buffer and sample from the shadow buffer. Any update of the linear texture implies updating the shadow image… and that is SLOW. With Glamor enabled in this scenario, you will quickly run out of CMA memory and crash. This issue is especially apparent if you try launching Chromium in full screen with many tabs opened. ### Where has my hardware acceleration gone ? On RPi 1-3 devices, we default to the modesetting driver from the X server. For those that are not aware, ‘modesetting’ is an Xorg driver for Kernel Modesetting (KMS) devices. The driver supports TrueColor visuals at various framebuffer depths and also supports RandR 1.2 for multi-head configurations. This driver supports all hardware where a KMS device is available and uses the Linux DRM ioctls or dumb buffer objects to create & map memory for applications to use. This driver can be used with Glamor to provide hardware acceleration, however that can lead to the X server crashing as mentioned above. Without enabling Glamor, then the modesetting driver cannot do hardware acceleration and applications will render using software (dumb buffer objects). So how can we get hardware acceleration without Glamor ? Let’s take an adventure into the land of Direct Rendering… ### What is Direct Rendering ? Direct rendering allows for X client applications to perform 3D rendering using direct access to the graphics hardware. User-space programs can use the DRM API to command the GPU to do hardware-accelerated 3D rendering and video decoding. You may be thinking “Wow, this could solve the problem” and you would be correct. If this could be enabled in the modesetting driver without using Glamor, then we could have hardware acceleration without having to worry about the X server crashing. It cannot be that difficult, right ? Well, as it turns out, things are not so simple. The biggest problem with this approach is that the DRI2 implementation inside the modesetting driver depends on Glamor. DRI2 is a version of the Direct Rendering Infrastructure (DRI). It is a framework comprising the modern Linux graphics stack that allows unprivileged user-space programs to use graphics hardware. The main use of DRI is to provide hardware acceleration for the Mesa implementation of OpenGL. So what approach should be taken ? Do we modify the modesetting driver code to support DRI2 without Glamor ? Is there a better way to get direct rendering without DRI2 ? As it turns out, there is a better way…enter DRI3. ### DRI3 to the rescue ? The main purpose of the DRI3 extension is to implement the mechanism to share direct rendered buffers between DRI clients and the X Server. With DRI3, clients can allocate the render buffers themselves instead of relying on the X server for doing the allocation. DRI3 clients allocate and use GEM buffers objects as rendering targets, while the X Server represents these render buffers using a pixmap. After initialization the client doesn’t make any extra calls to the X server, except perhaps in the case of window resizing. Utilizing this method, we should be able to avoid crashing the X server if we run out of memory, right ? Well once again, things are not as simple as they appear to be… ### So using DRI3 & GEM can save the day ? With GEM, a user-space program can create, handle and destroy memory objects living in the GPU memory. When a user-space program needs video memory (for a framebuffer, texture or any other data), it requests the allocation from the DRM driver using the GEM API. The DRM driver keeps track of the used video memory and is able to comply with the request if there is free memory available. You may recall from earlier that the main reason for not using Glamor on RPi 1-3 hardware is because it uses GPU memory (CMA memory) which is limited to 256Mb, so how can using DRI3 with GEM help us ? The short answer is “it does not”…at least, not if we utilize GEM. ### Where do we go next ? Surely there must be a way to have hardware acceleration without using all of our GPU memory ? I am glad you asked because there is a solution that we will explore in my next blog post. ## May 23, 2022 ### Iago Toral #### Vulkan 1.2 getting closer Lately I have been exposing a bit more functionality in V3DV and was wondering how far we are from Vulkan 1.2. Turns out that a lot of the new Vulkan 1.2 features are actually optional and what we have right now (missing a few trivial patches to expose a few things) seems to be sufficient for a minimal implementation. We actually did a test run with CTS enabling Vulkan 1.2 to verify this and it went surprisingly well, with just a few test failures that I am currently looking into, so I think we should be able to submit conformance soon. For those who may be interested, here is a list of what we are not supporting (all of these are optional features in Vulkan 1.2): VK_EXT_descriptor_indexing I think we should be able to support this in the future. VK_KHR_shader_float16_int8 This we can support in theory, since the hardware has support for half-float, however, the way this is designed in hardware comes with significant caveats that I think would make it really difficult to take advantage of it in practice. It would also require significant work, so it is not something we are planning at present. VK_KHR_buffer_device_address We can’t implement this without hacks because the Vulkan spec explicitly defined these addresses to be 64-bit values and the V3D GPU only deals with 32-bit addresses and is not capable of doing any kind of native 64-bit operation. At first I thought we could just lower these to 32-bit (since we know they will be 32-bit), but because the spec makes these explicit 64-bit values, it allows shaders to cast a device address from/to uvec2, which generates 64-bit bitcast instructions and those require both the destination and source to be 64-bit values. VK_EXT_sampler_filter_minmax VK_KHR_draw_indirect_count VK_EXT_scalar_block_layout VK_EXT_shader_viewport_index_layer VK_KHR_shader_atomic_int64 These lack required hardware support, so we don’t expect to implement them. ## May 16, 2022 ### Alejandro Piñeiro #### v3dv status update 2022-05-16 We haven’t posted updates to the work done on the V3DV driver since we announced the driver becoming Vulkan 1.1 Conformant. But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then. # Multisync support As mentioned on past posts, for the Vulkan driver we tried to focus as much as possible on the userspace part. So we tried to re-use the already existing kernel interface that we had for V3D, used by the OpenGL driver, without modifying/extending it. This worked fine in general, except for synchronization. The V3D kernel interface only supported one synchronization object per submission. This didn’t properly map with Vulkan synchronization, which is more detailed and complex, and allowed defining several semaphores/fences. We initially handled the situation with workarounds, and left some optional features as unsupported. After our 1.1 conformance work, our colleage Melissa Wen started to work on adding support for multiple semaphores on the V3D kernel side. Then she also implemented the changes on V3DV to use this new feature. If you want more technical info, she wrote a very detailed explanation on her blog (part1 and part2). For now the driver has two codepaths that are used depending on if the kernel supports this new feature or not. That also means that, depending on the kernel, the V3DV driver could expose a slightly different set of supported features. # More common code – Migration to the common synchronization framework For a while, Mesa developers have been doing a great effort to refactor and move common functionality to a single place, so it can be used by all drivers, reducing the amount of code each driver needs to maintain. During these months we have been porting V3DV to some of that infrastructure, from small bits (common VkShaderModule to NIR code), to a really big one: common synchronization framework. As mentioned, the Vulkan synchronization model is really detailed and powerful. But that also means it is complex. V3DV support for Vulkan synchronization included heavy use of threads. For example, V3DV needed to rely on a CPU wait (polling with threads) to implement vkCmdWaitEvents, as the GPU lacked a mechanism for this. This was common to several drivers. So at some point there were multiple versions of complex synchronization code, one per driver. But, some months ago, Jason Ekstrand refactored Anvil support and collaborated with other driver developers to create a common framework. Obviously each driver would have their own needs, but the framework provides enough hooks for that. After some gitlab and IRC chats, Jason provided a Merge Request with the port of V3DV to this new common framework, that we iterated and tested through the review process. Also, with this port we got timelime semaphore support for free. Thanks to this change, we got ~1.2k less total lines of code (and have more features!). Again, we want to thank Jason Ekstrand for all his help. # Support for more extensions: Since 1.1 got announced the following extension got implemented and exposed: • VK_EXT_debug_utils • VK_KHR_timeline_semaphore • VK_KHR_create_renderpass2 • VK_EXT_4444_formats • VK_KHR_driver_properties • VK_KHR_16_bit_storage and VK_KHR_8bit_storage • VK_KHR_imageless_framebuffer • VK_KHR_depth_stencil_resolve • VK_EXT_image_drm_format_modifier • VK_EXT_line_rasterization • VK_EXT_inline_uniform_block • VK_EXT_separate_stencil_usage • VK_KHR_separate_depth_stencil_layouts • VK_KHR_pipeline_executable_properties • VK_KHR_shader_float_controls • VK_KHR_spirv_1_4 If you want more details about VK_KHR_pipeline_executable_properties, Iago wrote recently a blog post about it (here) # Android support Android support for V3DV was added thanks to the work of Roman Stratiienko, who implemented this and submitted Mesa patches. We also want to thank the Android RPi team, and the Lineage RPi maintainer (Konsta) who also created and tested an initial version of that support, which was used as the baseline for the code that Roman submitted. I didn’t test it myself (it’s in my personal TO-DO list), but LineageOS images for the RPi4 are already available. # Performance In addition to new functionality, we also have been working on improving performance. Most of the focus was done on the V3D shader compiler, as improvements to it would be shared among the OpenGL and Vulkan drivers. But one of the features specific to the Vulkan driver (pending to be ported to OpenGL), is that we have implemented double buffer mode, only available if MSAA is not enabled. This mode would split the tile buffer size in half, so the driver could start processing the next tile while the current one is being stored in memory. In theory this could improve performance by reducing tile store overhead, so it would be more benefitial when vertex/geometry shaders aren’t too expensive. However, it comes at the cost of reducing tile size, which also causes some overhead on its own. Testing shows that this helps in some cases (i.e the Vulkan Quake ports) but hurts in others (i.e. Unreal Engine 4), so for the time being we don’t enable this by default. It can be enabled selectively by adding V3D_DEBUG=db to the environment variables. The idea for the future would be to implement a heuristic that would decide when to activate this mode. # FOSDEM 2022 If you are interested in watching an overview of the improvements and changes to the driver during the last year, we made a presention in FOSDEM 2022: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4” ## May 10, 2022 ### Melissa Wen #### Multiple syncobjs support for V3D(V) (Part 2) In the previous post, I described how we enable multiple syncobjs capabilities in the V3D kernel driver. Now I will tell you what was changed on the userspace side, where we reworked the V3DV sync mechanisms to use Vulkan multiple wait and signal semaphores directly. This change represents greater adherence to the Vulkan submission framework. I was not used to Vulkan concepts and the V3DV driver. Fortunately, I counted on the guidance of the Igalia’s Graphics team, mainly Iago Toral (thanks!), to understand the Vulkan Graphics Pipeline, sync scopes, and submission order. Therefore, we changed the original V3DV implementation for vkQueueSubmit and all related functions to allow direct mapping of multiple semaphores from V3DV to the V3D-kernel interface. Disclaimer: Here’s a brief and probably inaccurate background, which we’ll go into more detail later on. In Vulkan, GPU work submissions are described as command buffers. These command buffers, with GPU jobs, are grouped in a command buffer submission batch, specified by vkSubmitInfo, and submitted to a queue for execution. vkQueueSubmit is the command called to submit command buffers to a queue. Besides command buffers, vkSubmitInfo also specifies semaphores to wait before starting the batch execution and semaphores to signal when all command buffers in the batch are complete. Moreover, a fence in vkQueueSubmit can be signaled when all command buffer batches have completed execution. From this sequence, we can see some implicit ordering guarantees. Submission order defines the start order of execution between command buffers, in other words, it is determined by the order in which pSubmits appear in VkQueueSubmit and pCommandBuffers appear in VkSubmitInfo. However, we don’t have any completion guarantees for jobs submitted to different GPU queue, which means they may overlap and complete out of order. Of course, jobs submitted to the same GPU engine follow start and finish order. A fence is ordered after all semaphores signal operations for signal operation order. In addition to implicit sync, we also have some explicit sync resources, such as semaphores, fences, and events. Considering these implicit and explicit sync mechanisms, we rework the V3DV implementation of queue submissions to better use multiple syncobjs capabilities from the kernel. In this merge request, you can find this work: v3dv: add support to multiple wait and signal semaphores. In this blog post, we run through each scope of change of this merge request for a V3D driver-guided description of the multisync support implementation. #### Groundwork and basic code clean-up: As the original V3D-kernel interface allowed only one semaphore, V3DV resorted to booleans to “translate” multiple semaphores into one. Consequently, if a command buffer batch had at least one semaphore, it needed to wait on all jobs submitted complete before starting its execution. So, instead of just boolean, we created and changed structs that store semaphores information to accept the actual list of wait semaphores. #### Expose multisync kernel interface to the driver: In the two commits below, we basically updated the DRM V3D interface from that one defined in the kernel and verified if the multisync capability is available for use. #### Handle multiple semaphores for all GPU job types: At this point, we were only changing the submission design to consider multiple wait semaphores. Before supporting multisync, V3DV was waiting for the last job submitted to be signaled when at least one wait semaphore was defined, even when serialization wasn’t required. V3DV handle GPU jobs according to the GPU queue in which they are submitted: • Control List (CL) for binning and rendering • Texture Formatting Unit (TFU) • Compute Shader Dispatch (CSD) Therefore, we changed their submission setup to do jobs submitted to any GPU queues able to handle more than one wait semaphores. These commits created all mechanisms to set arrays of wait and signal semaphores for GPU job submissions: • Checking the conditions to define the wait_stage. • Wrapping them in a multisync extension. • According to the kernel interface (described in the previous blog post), configure the generic extension as a multisync extension. Finally, we extended the ability of GPU jobs to handle multiple signal semaphores, but at this point, no GPU job is actually in charge of signaling them. With this in place, we could rework part of the code that tracks CPU and GPU job completions by verifying the GPU status and threads spawned by Event jobs. #### Rework the QueueWaitIdle mechanism to track the syncobj of the last job submitted in each queue: As we had only single in/out syncobj interfaces for semaphores, we used a single last_job_sync to synchronize job dependencies of the previous submission. Although the DRM scheduler guarantees the order of starting to execute a job in the same queue in the kernel space, the order of completion isn’t predictable. On the other hand, we still needed to use syncobjs to follow job completion since we have event threads on the CPU side. Therefore, a more accurate implementation requires last_job syncobjs to track when each engine (CL, TFU, and CSD) is idle. We also needed to keep the driver working on previous versions of v3d kernel-driver with single semaphores, then we kept tracking ANY last_job_sync to preserve the previous implementation. #### Rework synchronization and submission design to let the jobs handle wait and signal semaphores: With multiple semaphores support, the conditions for waiting and signaling semaphores changed accordingly to the particularities of each GPU job (CL, CSD, TFU) and CPU job restrictions (Events, CSD indirect, etc.). In this sense, we redesigned V3DV semaphores handling and job submissions for command buffer batches in vkQueueSubmit. We scrutinized possible scenarios for submitting command buffer batches to change the original implementation carefully. It resulted in three commits more: We keep track of whether we have submitted a job to each GPU queue (CSD, TFU, CL) and a CPU job for each command buffer. We use syncobjs to track the last job submitted to each GPU queue and a flag that indicates if this represents the beginning of a command buffer. The first GPU job submitted to a GPU queue in a command buffer should wait on wait semaphores. The first CPU job submitted in a command buffer should call v3dv_QueueWaitIdle() to do the waiting and ignore semaphores (because it is waiting for everything). If the job is not the first but has the serialize flag set, it should wait on the completion of all last job submitted to any GPU queue before running. In practice, it means using syncobjs to track the last job submitted by queue and add these syncobjs as job dependencies of this serialized job. If this job is the last job of a command buffer batch, it may be used to signal semaphores if this command buffer batch has only one type of GPU job (because we have guarantees of execution ordering). Otherwise, we emit a no-op job just to signal semaphores. It waits on the completion of all last jobs submitted to any GPU queue and then signal semaphores. Note: We changed this approach to correctly deal with ordering changes caused by event threads at some point. Whenever we have an event job in the command buffer, we cannot use the last job in the last command buffer assumption. We have to wait all event threads complete to signal After submitting all command buffers, we emit a no-op job to wait on all last jobs by queue completion and signal fence. Note: at some point, we changed this approach to correct deal with ordering changes caused by event threads, as mentioned before. ### Final considerations With many changes and many rounds of reviews, the patchset was merged. After more validations and code review, we polished and fixed the implementation together with external contributions: Also, multisync capabilities enabled us to add new features to V3DV and switch the driver to the common synchronization and submission framework: • v3dv: expose support for semaphore imports This was waiting for multisync support in the v3d kernel, which is already available. Exposing this feature however enabled a few more CTS tests that exposed pre-existing bugs in the user-space driver so we fix those here before exposing the feature. • v3dv: Switch to the common submit framework This should give you emulated timeline semaphores for free and kernel-assisted sharable timeline semaphores for cheap once you have the kernel interface wired in. We used a set of games to ensure no performance regression in the new implementation. For this, we used GFXReconstruct to capture Vulkan API calls when playing those games. Then, we compared results with and without multisync caps in the kernelspace and also enabling multisync on v3dv. We didn’t observe any compromise in performance, but improvements when replaying scenes of vkQuake game. #### Multiple syncobjs support for V3D(V) (Part 1) As you may already know, we at Igalia have been working on several improvements to the 3D rendering drivers of Broadcom Videocore GPU, found in Raspberry Pi 4 devices. One of our recent works focused on improving V3D(V) drivers adherence to Vulkan submission and synchronization framework. We had to cross various layers from the Linux Graphics stack to add support for multiple syncobjs to V3D(V), from the Linux/DRM kernel to the Vulkan driver. We have delivered bug fixes, a generic gate to extend job submission interfaces, and a more direct sync mapping of the Vulkan framework. These changes did not impact the performance of the tested games and brought greater precision to the synchronization mechanisms. Ultimately, support for multiple syncobjs opened the door to new features and other improvements to the V3DV submission framework. ## DRM Syncobjs But, first, what are DRM sync objs? * DRM synchronization objects (syncobj, see struct &drm_syncobj) provide a * container for a synchronization primitive which can be used by userspace * to explicitly synchronize GPU commands, can be shared between userspace * processes, and can be shared between different DRM drivers. * Their primary use-case is to implement Vulkan fences and semaphores. [...] * At it's core, a syncobj is simply a wrapper around a pointer to a struct * &dma_fence which may be NULL.  And Jason Ekstrand well-summarized dma_fence features in a talk at the Linux Plumbers Conference 2021: A struct that represents a (potentially future) event: • Has a boolean “signaled” state • Has a bunch of useful utility helpers/concepts, such as refcount, callback wait mechanisms, etc. Provides two guarantees: • One-shot: once signaled, it will be signaled forever • Finite-time: once exposed, is guaranteed signal in a reasonable amount of time ## What does multiple semaphores support mean for Raspberry Pi 4 GPU drivers? For our main purpose, the multiple syncobjs support means that V3DV can submit jobs with more than one wait and signal semaphore. In the kernel space, wait semaphores become explicit job dependencies to wait on before executing the job. Signal semaphores (or post dependencies), in turn, work as fences to be signaled when the job completes its execution, unlocking following jobs that depend on its completion. The multisync support development comprised of many decision-making points and steps summarized as follow: • added to the v3d kernel-driver capabilities to handle multiple syncobj; • exposed multisync capabilities to the userspace through a generic extension; and • reworked synchronization mechanisms of the V3DV driver to benefit from this feature • enabled simulator to work with multiple semaphores • tested on Vulkan games to verify the correctness and possible performance enhancements. We decided to refactor parts of the V3D(V) submission design in kernel-space and userspace during this development. We improved job scheduling on V3D-kernel and the V3DV job submission design. We also delivered more accurate synchronizing mechanisms and further updates in the Broadcom Vulkan driver running on Raspberry Pi 4. Therefore, we summarize here changes in the kernel space, describing the previous state of the driver, taking decisions, side improvements, and fixes. ### From single to multiple binary in/out syncobjs: Initially, V3D was very limited in the numbers of syncobjs per job submission. V3D job interfaces (CL, CSD, and TFU) only supported one syncobj (in_sync) to be added as an execution dependency and one syncobj (out_sync) to be signaled when a submission completes. Except for CL submission, which accepts two in_syncs: one for binner and another for render job, it didn’t change the limited options. Meanwhile in the userspace, the V3DV driver followed alternative paths to meet Vulkan’s synchronization and submission framework. It needed to handle multiple wait and signal semaphores, but the V3D kernel-driver interface only accepts one in_sync and one out_sync. In short, V3DV had to fit multiple semaphores into one when submitting every GPU job. ### Generic ioctl extension The first decision was how to extend the V3D interface to accept multiple in and out syncobjs. We could extend each ioctl with two entries of syncobj arrays and two entries for their counters. We could create new ioctls with multiple in/out syncobj. But after examining other drivers solutions to extend their submission’s interface, we decided to extend V3D ioctls (v3d_cl_submit_ioctl, v3d_csd_submit_ioctl, v3d_tfu_submit_ioctl) by a generic ioctl extension. I found a curious commit message when I was examining how other developers handled the issue in the past: Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Mar 22 09:23:22 2019 +0000 drm/i915: Introduce the i915_user_extension_method An idea for extending uABI inspired by Vulkan's extension chains. Instead of expanding the data struct for each ioctl every time we need to add a new feature, define an extension chain instead. As we add optional interfaces to control the ioctl, we define a new extension struct that can be linked into the ioctl data only when required by the user. The key advantage being able to ignore large control structs for optional interfaces/extensions, while being able to process them in a consistent manner. In comparison to other extensible ioctls, the key difference is the use of a linked chain of extension structs vs an array of tagged pointers. For example, struct drm_amdgpu_cs_chunk { __u32 chunk_id; __u32 length_dw; __u64 chunk_data; }; [...]  So, inspired by amdgpu_cs_chunk and i915_user_extension, we opted to extend the V3D interface through a generic interface. After applying some suggestions from Iago Toral (Igalia) and Daniel Vetter, we reached the following struct: struct drm_v3d_extension { __u64 next; __u32 id; #define DRM_V3D_EXT_ID_MULTI_SYNC 0x01 __u32 flags; /* mbz */ };  This generic extension has an id to identify the feature/extension we are adding to an ioctl (that maps the related struct type), a pointer to the next extension, and flags (if needed). Whenever we need to extend the V3D interface again for another specific feature, we subclass this generic extension into the specific one instead of extending ioctls indefinitely. ### Multisync extension For the multiple syncobjs extension, we define a multi_sync extension struct that subclasses the generic extension struct. It has arrays of in and out syncobjs, the respective number of elements in each of them, and a wait_stage value used in CL submissions to determine which job needs to wait for syncobjs before running. struct drm_v3d_multi_sync { struct drm_v3d_extension base; /* Array of wait and signal semaphores */ __u64 in_syncs; __u64 out_syncs; /* Number of entries */ __u32 in_sync_count; __u32 out_sync_count; /* set the stage (v3d_queue) to sync */ __u32 wait_stage; __u32 pad; /* mbz */ };  And if a multisync extension is defined, the V3D driver ignores the previous interface of single in/out syncobjs. Once we had the interface to support multiple in/out syncobjs, v3d kernel-driver needed to handle it. As V3D uses the DRM scheduler for job executions, changing from single syncobj to multiples is quite straightforward. V3D copies from userspace the in syncobjs and uses drm_syncobj_find_fence()+ drm_sched_job_add_dependency() to add all in_syncs (wait semaphores) as job dependencies, i.e. syncobjs to be checked by the scheduler before running the job. On CL submissions, we have the bin and render jobs, so V3D follows the value of wait_stage to determine which job depends on those in_syncs to start its execution. When V3D defines the last job in a submission, it replaces dma_fence of out_syncs with the done_fence from this last job. It uses drm_syncobj_find() + drm_syncobj_replace_fence() to do that. Therefore, when a job completes its execution and signals done_fence, all out_syncs are signaled too. ### Other improvements to v3d kernel driver This work also made possible some improvements in the original implementation. Following Iago’s suggestions, we refactored the job’s initialization code to allocate memory and initialize a job in one go. With this, we started to clean up resources more cohesively, clearly distinguishing cleanups in case of failure from job completion. We also fixed the resource cleanup when a job is aborted before the DRM scheduler arms it - at that point, drm_sched_job_arm() had recently been introduced to job initialization. Finally, we prepared the semaphore interface to implement timeline syncobjs in the future. ### Going Up The patchset that adds multiple syncobjs support and improvements to V3D is available here and comprises four patches: • drm/v3d: decouple adding job dependencies steps from job init • drm/v3d: alloc and init job in one shot • drm/v3d: add generic ioctl extension • drm/v3d: add multiple syncobjs support After extending the V3D kernel interface to accept multiple syncobjs, we worked on V3DV to benefit from V3D multisync capabilities. In the next post, I will describe a little of this work. ## May 09, 2022 ### Iago Toral #### VK_KHR_pipeline_executable_properties Sometimes you want to go and inspect details of the shaders that are used with specific draw calls in a frame. With RenderDoc this is really easy if the driver implements VK_KHR_pipeline_executable_properties. This extension allows applications to query the driver about various aspects of the executable code generated for a Vulkan pipeline. I implemented this extension for V3DV, the Vulkan driver for Raspberry Pi 4, last week (it is currently in review process) because I was tired of jumping through loops to get the info I needed when looking at traces. For V3DV we expose the NIR and QPU assembly code as well as various others stats, some of which are quite relevant to performance, such as spill or thread counts. ## May 02, 2022 ### Víctor Jáquez #### From gst-build to local-projects Two years ago I wrote a blog post about using gst-build inside of WebKit SDK flatpak. Well, all that has changed. That’s the true upstream spirit. There were two main reason for the change: 1. Since the switch to GStreamer mono repository, gst-build has been deprecated. The mechanism in WebKit were added, basically, to allow GStreamer upstream, so keeping gst-build directory just polluted the conceptual framework. 2. By using gst-build one could override almost any other package in WebKit SDK. For example, for developing gamepad handling in WPE I added libmanette as a GStreamer subproject, to link a modified version of the library rather than the one in flatpak. But that approach added an unneeded conceptual depth in tree. In order to simplify these operations, by taking advantage of Meson’s subproject support directly, gst-build handling were removed and new mechanism was set in place: Local Dependencies. With local dependencies, you can add or override almost any dependency, while flatting the tree layout, by placing at the same level GStreamer and any other library. Of course, in order add dependencies, they must be built with meson. For example, to override libsoup and GStreamer, just clone both repositories below of Tools/flatpak/local-projects/subprojects, and declare them in WEBKIT_LOCAL_DEPS environment variable:  export WEBKIT_SDK_LOCAL_DEPS=libsoup,gstreamer-full
$export WEBKIT_SDK_LOCAL_DEPS_OPTIONS="-Dgstreamer-full:introspection=disabled -Dgst-plugins-good:soup=disabled"$ build-webkit --wpe


# Slightly Random Interface Thoughts

Not the normal fare for my blog, but my mind makes all kinds of weird connections after work, I suppose attempting to synthesize things. Occasionally it leads me down a road I think is interesting. Today it had me thinking about interfaces.

When I was very young, everyththing (TVs, radios, stereos, etc) had analog controls - mainly 2 kinds, switches and knobs. Almost everything was controlled and tuned by big physical knobs. Some of my grandparents things were probably 25 years old by then, and that's how they worked. While "new stuff" at the time certainly had differences, there was a definite "sameness" to it. They were mainly switches and dial knobs. We were tweaking, more or less, the same old things, for a pretty long time.

That makes sense on a whole lot of levels - it has benefits. You know it works. Your users understand how to use it. It's proven.

Of course, there were improvements or experiements at the edges: Minor evolutions of varying degrees. Maybe they moved the size or shape of knobs, or gave you additional knobs for different kinds of fine-tuning.

But then, sometimes something was interestingly new. The first takes are almost always not great, but they inspire other ideas. Ideas from all over start to mix and smash together and, ultimately, we get a kind of whole new "species".

My 4k smart TV is about as different from my grandmother's TV as you can imagine. It's a few speciations removed.

## Human Machine Interfaces

Really, that's the case with pretty much everything we use to interface with machines, whether it is a physical interface, or a digital one. Nothing stays entirely the same. Comparing Windows 3.1 programs to their counterparts today, for example, would show you lots of variation and evolution in controls. Last year we wrote up some documentation surrounding research while working on "tabs" and if you breeze through it a bit, you can see a bit of discussion showing a fair bit of evolution and cross influencing over the years for just that one control.

But, what connected in my mind was something else entirely: "Game controllers". I say "game controllers" because historically, again, there is a lot of cross pollenation of ideas - these things often wind up being used for far more than just games. Indeed, in many of these devices they are your primary interface to a whole immersive operating system. That's basically your entire means of interacting with Igalia's new Wolvic Browser, for example, which is geared toward XR environments like this. It's interesting how we've smashed ideas from lots of different things together here and it's got me to thinking about the changes I've seen along the way.

## Brief recollections

Way back, before my time, people started making games on computers (PDPs) with SpaceWar! That's just a neat thing you can learn a bunch about here if you want and see in action..

You can see though that our ideas were very rough - maybe you could repurpose keys on a keyboard or, as they did in the video - wire on/off buttons for everything: a button to turn left, a button to turn right, a button to fire, a button to thrust, etc.

By the time I was a kid we'd popularly moved on to Pong and pong-like games which had a physical knob, initially on the device itself. Later we'd separate those into physical corded paddles and add a button. Ohh, neat.

Very quickly came many takes on a joystick with a button (or two). Again, some tried more radical takes, like CalecoVision, which had a short stick with a fat head, two side triggers at the top of a whole number pad!

But, really, "a joystick that you 'grab' with one hand and a button or two" became the dominant paradigm. They had a definite "sameness" to the Atari 2600 model.

And for a the next several years most things just tweaked this a little bit. It was applied in arcades and on home computers and "game systems", but those ideas also wound up being applied to things like controlling heavy machinery.

Then, suddenly in the mid 1980s the original Nintendo was introduced here in the US. It said "Joystick? Hell no. They don't bring us joy." Instead, it introduced the d-pad, and two primary buttons, and two 'special' non-gameplay buttons arranged in the center. This was kind of the first big change in a while.

And then came some tweaking on the edges... A little bit later and we got the super Nintendo which gave us 4 gameplay buttons instead of two.

Then came something more radical. The 64 which came with a Frankenstein dpad + tiny thumb operated joystick and nine game play buttons, as well as radically changing the physical shape into some kind of monstrosity introducing "handles". Holy crap, what a monstrosity.

Then in 1997 we got the Playstation 1 controller: A d-pad, two thumbsticks with better shape and placement and basically 8 buttons - and... better handles.

## What struck me was...

Wow - 1997 was a quarter of a century ago! A bit like my grandmother's radio, while there are certainly differences, I feel like there is a definite 'sameness' to controls for game systems today. They have many other kinds of advances - haptics, pitch and roll stuff, a touch pad inspired by other innovations, and now in the ps5 some resistance stuff... But really, they are very similar at their core in terms of how you interact with them, and these have all been bolted on at the edges. Anyone who played PS1 games could pretty easily jump into a PS5 controller. And there's a lot of good to that. It clearly works well.

But then, suddenly, all of these XR devices did something new and different... Just like all of the other examples, they're clearly very inspired by the controllers that came before them but because they really focus on the sensors as a primary mechanism rather secondary, they've basically split the controller in half.

A few of my favorite games blend them rather nicely, allowing you to use the left thumbstick to walk around as you would in any game, but control your immediate body with movement. I have to say, this feels kind of more natural to me in a lot of ways - a nicer blend.

I have to wonder what kind of other uses we'll put these kinds of advancements to - how the ideas will mix together. Very possibly XR will continue trying to free you up even more, maybe these controls won't even last long there. But, they're clearly on to something, I think - and I can easily imagine these being great ways to do all sorts of things that we're still using the current "normal" gamepads for - and a lot more.

I'm excited to see what happens next.

## April 26, 2022

### Eric Meyer

#### Flexibly Centering an Element with Side-Aligned Content

In a recent side project that I hope will become public fairly soon, I needed to center a left-aligned list of links inside the sides of the viewport, but also line-wrap in cases where the lines got too long (as in mobile). There are a few ways to do this, but I came up with one that was new to me. Here’s how it works.

First, let’s have a list.  Pretend each list item contains a link so that I don’t have to add in all the extra markup.

<ol>
<li>Foreword</li>
<li>Chapter 1: The Day I Was Born</li>
<li>Chapter 2: Childhood</li>
<li>Chapter 4: Teenage Dreaming</li>
<li>Chapter 5: Look Out World</li>
<li>Chapter 6: The World Strikes Back</li>
<li>Chapter 7: Righting My Ship</li>
<li>Chapter 8: In Hindsight</li>
<li>Afterword</li>
</ol>

Great. Now I want it to be centered in the viewport, without centering the text. In other words, the text should all be left-aligned, but the element containing them should be as centered as possible.

One way to do this is to wrap the <ol> element in another element like a <div> and then use flexbox:

div.toc {
display: flex;
justify-content: center;
}

That makes sense if you want to also vertically center the list (with align-items: center) and if you’re already going to be wrapping the list with something that should be flexed, but neither really applied in this case, and I didn’t want to add a wrapper element that had no other purpose except centering. It’s 2022, there ought to be another way, right? Right. And this is it:

ol {
max-inline-size: max-content;
margin-inline: auto;
}

I also could have used width there in place of max-inline-size since this is in English, so the inline axis is horizontal, but as Jeremy pointed out, it’s a weird clash to have a physical property (width) and a logical property (margin-inline) working together. So here, I’m going all-logical, which is probably better for the ongoing work of retraining myself to instinctively think in logical directions anyway.

Thanks to max-inline-size: max-content, the list can’t get any wider (more correctly: any longer along the inline axis) than the longest list item. If the container is wider than that, then margin-inline: auto means the ol element’s box will be centered in the container, as happens with any block box where the width is set to a specific amount, there’s leftover space in the container, and the side margins of the box are set to auto. This is as if I’d pre-calculated the maximum content size to be (say) 434 pixels wide and then declared max-inline-size: 434px.

The great thing here is that I don’t have to do that pre-calculation, which would be very fragile in any case. I can just use max-content instead. And then, if the container ever gets too small to fit the longest bit of content, because the ol was set to max-inline-size instead of just straight inline-size, it can fill out the container as block boxes usually do, and the content inside it can wrap to multiple lines.

Perhaps it’s not the most common of layout needs, but if you find yourself wanting a lightweight way to center the box of an element with side-aligned content, maybe this will work for you.

What’s nice about this is that it’s one of those simple things that was difficult-to-impossible for so long, with hacks and workarounds needed to make it work at all, and now it… just works.  No extra markup, not even any calc()-ing, just a couple of lines that say exactly what they do, and are what you want them to do.  It’s a nice little example of the quiet revolution that’s been happening in CSS of late.  Hard things are becoming easy, and more than easy, simple.  Simple in the sense of “direct and not complex”, not in the sense of “obvious and basic”.  There’s a sense of growing maturity in the language, and I’m really happy to see it.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## April 21, 2022

### Qiuyi Zhang (Joyee)

#### Fixing snapshot support of class fields in V8

Up until V8 10.0, the class field initializers had been

## April 19, 2022

### Manuel Rego

#### Web Engines Hackfest 2022

Once again Igalia is organizing the Web Engines Hackfest. This year the event is going to be hybrid. Though most things will happen on-site, online participation in some part of the event is going to be possible too.

Regarding dates, the hackfest will take place on June 13 & 14 in A Coruña. If you’re interested in participating, you can find more the information and the registration form at the event website: https://webengineshackfest.org/2022/.

## What’s the Web Engines Hackfest?

This event started a long way back. The first edition happened in 2009 when 12 folks visited the Igalia offices in A Coruña and spent there a whole week working on WebKitGTK port. At that time, it was kind of early stages on the project and lots of work was needed, so those joint weeks were very productive to move things forward, discuss plans and implement features.

As the event grew and more people got interested, in 2014 it was renamed to Web Engines Hackfest and started to welcome people working on different web engines. This brought the opportunity for engineers of the different browsers to come together for a few days and discuss different features.

The hackfest has continued to grow and these days we welcome anyone that is somehow involved on the web platform. In this year’s event there will be people from different parts of the web platform community, from implementors and spec editors, to people interested in some particular feature.

This event has an unconference format. People attending are the ones defining the topics, and work together in breakout sessions to discuss them. They could be issues on a particular browser, generic purpose features, new ideas, even sometimes tooling demos. In addition, we always arrange a few talks as part of the hackfest. But the most important part of the event is being together with very different folks and having the chance to discuss a variety of topics with them. There are not lots of places where people from different companies and browsers join together to discuss topics. The idea of the hackfest is to provide a venue for that to happen.

## 2022 edition

This year we’re hosting the event in a new place, as Igalia’s office is no longer big enough to host all the people that will be attending the event. The venue is called Palexco and it’s close to the city center and just by the seaside (with views of the port). It’s a great place with lots of spaces and big rooms, so we’ll be very comfortable there. Note that we’ll have childcare service for the ones that might need it.

New venue: Palexco (picture by Jose Luis Cernadas Iglesias)

The event is going to be 2 days this time, 13th and 14 June. Hopefully the weather will be great at that time of the year, and the folks visiting A Coruña should be able to really enjoy the trip. There are going to be lots of light hours too, sunrise is going to be around 7am and sunset past 10pm.

The registration form is still open. So far we’ve got a good amount of people registered from different companies like: Arm, Deno Land, Fission, Google, Igalia, KaiOS, Mozilla, Protocol Labs, Red Hat and Salesforce.