# Planet Igalia

## November 28, 2022

### Andy Wingo

#### are ephemerons primitive?

Good evening :) A quick note, tonight: I've long thought that ephemerons are primitive and can't be implemented with mark functions and/or finalizers, but today I think I have a counterexample.

For context, one of the goals of the GC implementation I have been working on on is to replace Guile's current use of the Boehm-Demers-Weiser (BDW) conservative collector. Of course, changing a garbage collector for a production language runtime is risky, and for Guile one of the mitigation strategies for this work is that the new collector is behind an abstract API whose implementation can be chosen at compile-time, without requiring changes to user code. That way we can first switch to BDW-implementing-the-new-GC-API, then switch the implementation behind that API to something else.

Abstracting GC is a tricky problem to get right, and I thank the MMTk project for showing that this is possible -- you have user-facing APIs that need to be implemented by concrete collectors, but also extension points so that the user can provide some compile-time configuration too, for example to provide field-tracing visitors that take into account how a user wants to lay out objects.

Anyway. As we discussed last time, ephemerons are usually have explicit support from the GC, so we need an ephemeron abstraction as part of the abstract GC API. The question is, can BDW-GC provide an implementation of this API?

I think the answer is "yes, but it's very gnarly and will kill performance so bad that you won't want to do it."

the contenders

Consider that the primitives that you get with BDW-GC are custom mark functions, run on objects when they are found to be live by the mark workers; disappearing links, a kind of weak reference; and finalizers, which receive the object being finalized, can allocate, and indeed can resurrect the object.

BDW-GC's finalizers are a powerful primitive, but not one that is useful for implementing the "conjunction" aspect of ephemerons, as they cannot constrain the marker's idea of graph connectivity: a finalizer can only prolong the life of an object subgraph, not cut it short. So let's put finalizers aside.

Weak references have a tantalizingly close kind of conjunction property: if the weak reference itself is alive, and the referent is also otherwise reachable, then the weak reference can be dereferenced. However this primitive only involves the two objects E and K; there's no way to then condition traceability of a third object V to E and K.

We are left with mark functions. These are an extraordinarily powerful interface in BDW-GC, but somewhat expensive also: not inlined, and going against the grain of what BDW-GC is really about (heaps in which the majority of all references are conservative). But, OK. They way they work is, your program allocates a number of GC "kinds", and associates mark functions with those kinds. Then when you allocate objects, you use those kinds. BDW-GC will call your mark functions when tracing an object of those kinds.

Let's assume firstly that you have a kind for ephemerons; then when you go to mark an ephemeron E, you mark the value V only if the key K has been marked. Problem solved, right? Only halfway: you also have to handle the case in which E is marked first, then K. So you publish E to a global hash table, and... well. You would mark V when you mark a K for which there is a published E. But, for that you need a hook into marking V, and V can be any object...

So now we assume additionally that all objects are allocated with user-provided custom mark functions, and that all mark functions check if the marked object is in the published table of pending ephemerons, and if so marks values. This is essentially what a proper ephemeron implementation would do, though there are some optimizations one can do to avoid checking the table for each object before the mark stack runs empty for the first time. In this case, yes you can do it! Additionally if you register disappearing links for the K field in each E, you can know if an ephemeron E was marked dead in a previous collection. Add a pre-mark hook (something BDW-GC provides) to clear the pending ephemeron table, and you are in business.

yes, but no

So, it is possible to implement ephemerons with just custom mark functions. I wouldn't want to do it, though: missing the mostly-avoid-pending-ephemeron-check optimization would be devastating, and really what you want is support in the GC implementation. I think that for the BDW-GC implementation in whippet I'll just implement weak-key associations, in which the value is always marked strongly unless the key was dead on a previous collection, using disappearing links on the key field. That way a (possibly indirect) reference from a value V to a key K can indeed keep K alive, but oh well: it's a conservative approximation of what should happen, and not worse than what Guile has currently.

Good night and happy hacking!

# Alice 2023

There's an upcomming election in the W3C...

If you're not familliar with W3C elections already, I wrote a brief (~400 word) "Primer for Busy People" on the general "parts" of the W3C and introducing the two main elected bodies, including the Technical Architecture Group (TAG), a kind of steering committee for the W3C/Web at large.

The W3C TAG is charged with overseeing and helping direct and hold together the broad view and direction of the web over time. While that sounds perhaps a little abstract and esoteric, it is mostly grounded in very concrete things. Proposals go through TAG review and the TAG discusses and asks questions at many levels to enure that things are safe, ethical, well designed, forward looking, and consistently reasoned across APIs and working groups. As part of this they also try to gather and publish articulations of principles and advice which can be reapplied and referred back to. These principles help shape the web that we'll all be developing on in significant ways, for the rest of our lives. So, I think it matters that we have a good TAG.

The TAG has changed a lot since 2012 when I first started advocating that it needed reform and that regular developers should publicly support candidates and lobby member orgs to vote on our behalf. I think that a lot of that change is because it worked. A concerted effort was made to ensure that we had as good a slate of candidates as we could muster. Candidates threw their support to other candidates they believed were great choices too, etc. For the next few years, those candidates were elected something like 25:2 in terms of seats. I'm happy to say that I think that most of those changes have been really positive. The TAG is relevant again, it is productive and useful. They do outreach in the community. It is much more diverse on many levels. The work is better connected to practice. However, another outcome of this is that the W3C also changed how elections work. I'm not going to get into why I think this is complicated and a net loss (I do that in several other posts like Optimizing The W3C Sandwiches if you're really curious), but it means that each member has only one vote counted and candidates are left to convince you that they are the best candidate. The net result is that I've pretty much stepped away from advocating for candidates like I used to and focused instead on just helping to make sure there are good nominees, at least.

A notable exception to this was in 2019 when Alice Boxhall (@sundress on Twitter) ran. I wrote about why I was excited and supported her. She is truly a unique candidate in my mind. After her stint on the TAG, she stepped away from Google and took a sabbatical only to return this year to work with me at Igalia! In this election, she's running again and I couldn't support it more. All of the rationale I had in 2019 holds true today, and additionally now you can say she'll have experience and bring perspective from a very unique organization that represents a lot that is good and positive change in the web ecosystem.

I really wish I could say more about this election in a productive way, but all I can offer is these two points:

First... I hope you're similarly excited for Alice to run and join me in supporting her.

Second... Encourage people to actually vote, and vote together

We have a surplus of good candidates. There are many permutations of choices that would be arguably good. However, the truth is that with this voting system, if turnout is low (it often is, amazingly few orgs vote), it is very likely that someone at the very bottom of the preferences of the vast majority gets a one of those seats. Instead of optimizing one of the many good permutations, it can ensure that we lose one of those seats. Just having people turn up at all makes that kind of failure harder, so please - send a DM, tweet something, remind people - make sure they vote.

To be a little stronger, while you're at it - discuss with some other members in a community that you belong to. Get their thoughts and insights, and see if you can build some concensus on an order of preference - in doing so you'll greatly help the odds of being happy with the results.

But, also, again...

Polls close at 04:59 UTC on 2022-12-14 (23:59, Boston time on 2022-12-13).

# The Cool Thing You Didn’t Know Existed

Recently, in discussions with my colleague Eric Meyer, we struck on an idea that we thought would be really helpful - even built a thing for our own use to try it out. Turned out it kind of already exists and I’d like to tell you about it, and how we hope to make it better.

Information about web related things are just one among many, many kinds of information flying at us through various channels a mile a minute (or ~ 1.61km per minute, if you prefer). Some of those things are about “standards”. I say “standards” in quotes because much of what we hear about are things we hope will someday be universally implemented.

Many ulimately fail, take wrong turns and ultimately are re-imagined. Those that ultimately do get universally implmented and reach Definitely A Standard™ status often take years to do so. This presents numerous challenges, so it’s really helpful to be able to really quickly understand what it is you’re investing your time in, and make sure you don’t miss the important stuff.

Luckily there are some kind of more newsworthy ‘events’ along the way - mainly, implementations landing. The more implementations there are, the more confidence you can have that this valuable - whether that is just taking a few minutes to find out about it, or actually taking some time to try to learn it and put it to use. But… the last one is special. It is the finish line. There aren’t more questions or doubts. Ironically though, the timeframes and noise before this point make it sometimes easy to miss.

That is, in fact, a pretty rare occurence. Wouldn’t it be nice if there was something that would only tell you that? That would be a pretty quiet channel, and if you saw something there you could say “oh, that’s important”. Interestingly, it kind of already exists

The MDN browser-comapt-data repo is a community maintained source that all of the browser implmenters and standards folks keep up. I think most people know this already, but what you might not know is that they create a release update to it regularly (currently twice a week) with, actually very good release notes.

GitHub, it seems, also has a built in pattern for providing an RSS Feed for project releases. The practical upshot of this is that you can Subscribe to the RSS Feed of the browser compat data releases.

Wow, cool.

## We can do better...

This feed is great, but it still contains some noise you probably don’t care about, and because of that it can be a little confusing…

All of these updates are additions to BCD data features and renamings and removals along the way, for example. Those are just metadata that ultimately implementations might or might not be connected to. There is no new implementation news in this release, but it's not especially obvious. There are folks (like me) who like to follow that, but really it’s the implementations that are the high order bit here (even for me).

My colleague Eric Meyer and I were talking about all this, and ways to make the information coming out of BCD versions more digestable for ourselves and the public at large… and decided what the heck, let’s build something rough and see where it lands. So, in the thing we built, it's more focused.

In this view, additions/removals sort of data are summarized as a count on a single line. Further details collapsed by default. You can open them if you want, of course, but they don’t get in the way when you are trying to quickly visually scan it.

The main content that is expanded by default is about the addition (or removals) of implementations - and those are organized, linked to relevant MDN reference, and you’re told who added it and visually, and easily able to see if it's reached fully complete implementation status.

Our experiments led to us reaching out to our peers at MDN and we’re currently working with them to see if we can make improvements some improvements like these. In the meantime - subscribe to that RSS feed and check it every week, I think you’ll be happy you did.

## November 16, 2022

### José Dapena

#### Native call stack profiling (1/3): introduction

This week I presented a lightning talk in BlinkOn 17. There I talked about the work for improving native stack profiling support
in Windows.

This post starts a series where I will provide more context and details
to the presentation.

## Why callstack profiling

First, a definition:

Callstack profiling: a performance analysis tool, that samples periodically the call stacks of all threads, for a specific workload.

Why is it useful? It provides a better undestanding of performance problems, specially if they are caused by CPU-bound bottle necks.

As we sample the full stack for each thread, we are capturing a handful of information:
– Which functions are using more CPU directly.
– As we capture the full stacktrace, we know also which functions involve more CPU usage, even if it is indirectly through the calls they do.

But it is not only useful for CPU waits. It will also capture when a method is waiting for something (i.e. because of networking, or a semaphore).

The provided information is useful for initial analysis of the problem, as it will give a high level view of where time could be spent by the application. But it will also be useful in further stages of the analysis, and even for comparing different implementations and consider possible changes.

## How does it work?

For call stack sampling, we need some infrastructure to be able to capture and traverse properly the callstack for each thread.

In compilation stage, information is added for function names and the frame pointers. This allows, for a specific stack, to resolve later the actual names, and even lines of code that are captured.

In runtime stage, function information will be required for generated code. I.e. in a web browser, the Javascript code that is compiled in runtime.

Then, every sample will extract the callstack of all the threads of all the analysed processes. This will happen periodically, at the rate established by the profiling tool.

## System wide native callstack profiling

When possible, sampling the call stacks of the full system can be benefitial for the analysis.

First, we may want to include system libraries and other dependencies of our component in the analysis. But also, system analyzers can provide other metrics that can give a better context to the analysed workload (network or CPU load, memory usage, swappiness, …).

In the end, many problems are not bound to a single component, so capturing the interaction with other components can be useful.

## Next

In next blog posts in this series, I will present native stack profiling for Windows, and how it is integrated with Chromium.

### Jacobo Aragunde

#### Accessible name computation in Chromium Chapter II: The traversal problem

Back in March last year, as part of my maintenance work on Chrome/Chromium accessibility at Igalia, I inadvertently started a long, deep dive into the accessible name calculation code, that had me intermittently busy since then. Part of this work was already presented in a lightning talk at BlinkOn 15, in November 2021:

This is the second post in a series bringing a conclusion to the work presented in that talk. You may read the previous one, with an introduction to accessible name computation and a description of the problems we wanted to address, here.

### Why traverse the DOM?

In our previous post, we found out we used a DOM tree traversal in AXNodeObject::TextFromDescendants to calculate the name for a node based on its descendants. There are some reasons this is not a good idea:

• There was duplicate code related to selection of relevant nodes (cause of the whitespace bug). For example, the accessibility tree creation already selects relevant nodes, and simplifies whitespace nodes to one single unit whenever relevant.
• The DOM tree does not contain certain relevant information we need for accessible names (cause of the ::before/after bug). Relevant pseudo-elements are already included in the accessibility tree.
• In general, it would be more efficient to reuse the cached accessibility tree.

There must have been a good reason to traverse the DOM again instead of the cached accessibility tree, and it was easy enough to switch our code to do the latter, and find out what was breaking. Our tests were kind enough to point out multiple failures related with accessible names from hidden subtrees: it is possible, under certain conditions, to name elements after other hidden elements. Because hidden elements are generally excluded from the accessibility tree, to be able to compute a name after them we had resorted to the DOM tree.

### Hidden content and aria-labelledby

Authors may know they can use aria-labelledby to name a node after another, hidden node. This is a simple example, where the produced name for the input is “foo”:

<input aria-labelledby="label">
<div id="label" class="hidden">foo</div>


Unfortunately, the spec did not provide clear answers to some corner cases: what happens if there are elements that are explicitly hidden inside the first hidden node? What with nodes that aren’t directly referenced? How deep are we expected to go inside a hidden tree to calculate its name? And what about aria-hidden nodes?

The discussion and changes to the spec that resulted from it are worth their own blog post! For this one, let me jump directly to the conclusion: we agreed to keep browsers current and past behavior for the most part, and change the spec to reflect it.

### Addressing the traversal problem in Chromium

The goal now is to change name computation code to work exclusively with the accessibility tree. It would address the root cause of the problems we had identified, and simplify our code, potentially fixing other issues. To do so, we need to include in the accessibility tree all the information required to get a name from a hidden subtree like explained previously.

We have the concept of “ignored” nodes in the Blink accessibility tree: these nodes exist in the tree, are kept up-to-date, etc. But they are not exposed to the platform accessibility. We will use ignored nodes to keep name-related items in the tree.

At first, I attempted to include in the accessibility tree those nodes whose parent was a target of an aria-labelledby or aria-describedby relation, even if they were hidden. For that purpose, I had to track the DOM for changes in these relations, and maintain the tree up-to-date. This proved to be complicated, as we had to add more code, more conditions… Exactly the opposite to our goal of simplifying things. We ended up abandoning this line of work.

We considered the best solution would be the simplest one: keep all hidden nodes in the tree, just in case we need them. The patch had an impact in memory consumption that we decided to assume, taking into account that the advantages of the change (less and cleaner code, bug fixes) were more than the disadvantages, and its neutral-to-slightly-positive impact in other performance metrics.

There were some non-hidden nodes that were also excluded from the accessibility tree: that was the case of <label> elements, to prevent ATs from double-speaking the text label and the labeled element next to it. We also had to include these labels in the Blink accessibility tree, although marked “ignored”.

Finally, with all the necessary nodes in the tree, we could land a patch to simplify traversal for accessible name and description calculation.

Along the way, we made a number of small fixes and improvements in surrounding and related code, which hopefully made things more stable and performant.

### Next in line: spec work

In this post, we have explained the big rework in accessible naming code to be able to use the accessibility tree as the only source of names. It was a long road, with a lot unexpected problems and dead-ends, but worth for the better code, improved test coverage and bugs fixed.

In the next and final post, I’ll explain the work we did in the spec, and the adjustments required in Chromium, to address imprecision when naming from hidden subtrees.

Thanks for reading, and happy hacking!

# Quick­JS al­ready sup­ports ar­bi­trary-pre­ci­sion dec­i­mals

Quick­JS is a neat JavaScript en­gine by Fab­rice Bel­lard. It’s fast and small (ver­sion 2021-03-27 clocks in at 759 Kb). It in­cludes sup­port for ar­bi­trary-pre­ci­sion dec­i­mals, even though the TC39 dec­i­mal pro­pos­al is (at the time of writ­ing) still at Stage 2. You can in­stall it us­ing the tool of your choice; I was able to in­stall it us­ing Home­brew for ma­cOS (the for­mu­la is called quickjs) and FreeB­SD and OpenB­SD. It can also be in­stalled us­ing esvu. (It doesn’t seem to be avail­able as a pack­age on Ubun­tu.)

To get start­ed with ar­bi­trary pre­ci­sion dec­i­mals, you need to fire up Quick­JS with the bignum flag:

$qjs --bignumQuick­JS - Type "\h" for helpqjs > 0.1 + 0.2 === 0.3falseqjs > 0.1m + 0.2m === 0.3mtrueqjs > 0.12345678910111213141516171819m * 0.19181716151413121110987654321m0.0236811308550240594910066199325923006903586430678413779899m (The m is the pro­posed suf­fix for lit­er­al high-pre­ci­sion dec­i­mal num­bers that pro­pos­al-dec­i­mal is sup­posed to give us.) No­tice how we nice­ly un­break JavaScript dec­i­mal arith­metic, with­out hav­ing to load a li­brary. The fi­nal API in the of­fi­cial TC39 Dec­i­mal pro­pos­al still has not been worked out. In­deed, a core ques­tion there re­mains out­stand­ing at the time of writ­ing: what kind of nu­mer­ic pre­ci­sion should be sup­port­ed? (The two main con­tenders are ar­bi­trary pre­ci­sion and the oth­er be­ing 128-bit IEEE 754 (high, but not *ar­bi­trary*, pre­ci­sion). Quick­JS does ar­bi­trary pre­ci­sion.) Nonethe­less, Quick­JS pro­vides a BigDecimal func­tion: qjs > BigDec­i­mal("123456789.0123456789")123456789.0123456789m More­over, you can do ba­sic arith­metic with dec­i­mals: ad­di­tion, sub­trac­tion, mul­ti­pli­ca­tion, di­vi­sion, mod­u­lo, square roots, and round­ing. Every­thing is done with in­fi­nite pre­ci­sion (no loss of in­for­ma­tion). If you know, in ad­vance, what pre­ci­sion is need­ed, you can tweak the arith­meti­cal op­er­a­tions by pass­ing in an op­tions ar­gu­ment. Here’s an ex­am­ple of adding two big dec­i­mal num­bers: qjs > var a = 0.12345678910111213141516171819m;un­de­finedqjs > var b = 0.19181716151413121110987654321m;un­de­finedqjs > BigDec­i­mal.add(a,b)0.3152739506152433425250382614mqjs > BigDec­i­mal.add(a,b, { "round­ing­Mode": "up", "max­i­mum­Frac­tion­Dig­its": 10 })0.3152739507m ## November 14, 2022 ### Jesse Alama #### Here’s how to unbreak floating-point math in JavaScript # Un­break­ing float­ing-point math in JavaScript Be­cause com­put­ers are lim­it­ed, they work in a fi­nite range of num­bers, name­ly, those that can be rep­re­sent­ed straight­for­ward­ly as fixed-length (usu­al­ly 32 or 64) se­quences of bits. If you’ve only got 32 or 64 bits, it’s clear that there are only so many num­bers you can rep­re­sent, whether we’re talk­ing about in­te­gers or dec­i­mals. For in­te­gers, it’s clear that there’s a way to ex­act­ly rep­re­sent math­e­mat­i­cal in­te­gers (with­in the fi­nite do­main per­mit­ted by 32 or 64 bits). For dec­i­mals, we have to deal with the lim­its im­posed by hav­ing only a fixed num­ber of bits: most dec­i­mal num­bers can­not be ex­act­ly rep­re­sent­ed. This leads to headaches in all sorts of con­texts where dec­i­mals arise, such as fi­nance, sci­ence, en­gi­neer­ing, and ma­chine learn­ing. It has to do with our use of base 10 and the com­put­er’s use of base 2. Math strikes again! Ex­act­ness of dec­i­mal num­bers isn’t an ab­struse, edge case-y prob­lem that some math­e­mati­cians thought up to poke fun at pro­gram­mers en­gi­neers who aren’t blessed to work in an in­fi­nite do­main. Con­sid­er a sim­ple ex­am­ple. Fire up your fa­vorite JavaScript en­gine and eval­u­ate this: 1 + 2 === 3 You should get true. Duh. But take that ex­am­ple and work it with dec­i­mals: 0.1 + 0.2 === 0.3 You’ll get false. How can that be? Is float­ing-point math bro­ken in JavaScript? Short an­swer: yes, it is. But if it’s any con­so­la­tion, it’s not just JavaScript that’s bro­ken in this re­gard. You’ll get the same re­sult in all sorts of oth­er lan­guages. This isn’t wat. This is the un­avoid­able bur­den we pro­gram­mers bear when deal­ing with dec­i­mal num­bers on ma­chines with lim­it­ed pre­ci­sion. Maybe you’re think­ing OK, but if that’s right, how in the world do dec­i­mal num­bers get han­dled at all? Think of all the fi­nan­cial ap­pli­ca­tions out there that must be do­ing the wrong thing count­less times a day. You’re quite right! One way of get­ting around odd­i­ties like the one above is by al­ways round­ing. So in­stead of work­ing with, say, this is by han­dling dec­i­mal num­bers as strings (se­quences of dig­its). You would then de­fine op­er­a­tions such as ad­di­tion, mul­ti­pli­ca­tion, and equal­i­ty by do­ing el­e­men­tary school math, dig­it by dig­it (or, rather, char­ac­ter by char­ac­ter). ## So what to do? Num­bers in JavaScript are sup­posed to be IEEE 754 float­ing-point num­bers. A con­se­quence of this is, ef­fec­tive­ly, that 0.1 + 0.2 will nev­er be 0.3 (in the sense of the === op­er­a­tor in JavaScript). So what can be done? There’s an npm li­brary out there, dec­i­mal.js, that pro­vides sup­port for ar­bi­trary pre­ci­sion dec­i­mals. There are prob­a­bly oth­er li­braries out there that have sim­i­lar or equiv­a­lent func­tion­al­i­ty. As you might imag­ine, the is­sue un­der dis­cus­sion is old. There are workarounds us­ing a li­brary. But what about ex­tend­ing the lan­guage of JavaScript so that the equa­tion does get val­i­dat­ed? Can we make JavaScript work with dec­i­mals cor­rect­ly, with­out us­ing a li­brary? Yes, we can. ## Aside: Huge in­te­gers It’s worth think­ing about a sim­i­lar is­sue that also aris­es from the finite­ness of our ma­chines: ar­bi­trar­i­ly large in­te­gers in JavaScript. Out of the box, JavaScript didn’t sup­port ex­treme­ly large in­te­gers. You’ve got 32-bit or (more like­ly) 64-bit signed in­te­gers. But even though that’s a big range, it’s still, of course, lim­it­ed. Big­Int, a pro­pos­al to ex­tend JS with pre­cise­ly this kind of thing, reached Stage 4 in 2019, so it should be avail­able in pret­ty much every JavaScript en­gine you can find. Go ahead and fire up Node or open your brows­er’s in­spec­tor and plug in the num­ber of nanosec­onds since the Big Bang: 13_787_000_000_000n // years * 365n // days * 24n // hours * 60n // minutes * 60n // seconds * 1000n // milliseconds * 1000n // microseconds * 1000n // nanoseconds (Not a sci­en­ti­cian. May not be true. Not in­tend­ed to be a fac­tu­al claim.) ## Adding big dec­i­mals to the lan­guage OK, enough about big in­te­gers. What about adding sup­port for ar­bi­trary pre­ci­sion dec­i­mals in JavaScript? Or, at least, high-pre­ci­sion dec­i­mals? As we see above, we don’t even need to wrack our brains try­ing to think of com­pli­cat­ed sce­nar­ios where a ton of dig­its af­ter the dec­i­mal point are need­ed. Just look at 0.1 + 0.2 = 0.3. That’s pret­ty low-pre­ci­sion, and it still doesn’t work. Is there any­thing anal­o­gous to Big­Int for non-in­te­ger dec­i­mal num­bers? No, not as a li­brary; we al­ready dis­cussed that. Can we add it to the lan­guage, so that, out of the box—with no third-par­ty li­brary—we can work with dec­i­mals? The an­swer is yes. Work is pro­ceed­ing on this mat­ter, but things re­main to un­set­tled. The rel­e­vant pro­pos­al is BigDec­i­mal. I’ll be work­ing on this for a while. I want to get big dec­i­mals into JavaScript. There are all sorts of is­sues to re­solve, but they’re def­i­nite­ly re­solv­able. We have ex­pe­ri­ence with ar­bi­trary pre­ci­sion arith­metic in oth­er lan­guages. It can be done. So yes, float­ing-point math is bro­ken in JavaScript, but help is on the way. You’ll see more from me here as I tack­le this in­ter­est­ing prob­lem; stay tuned! ### Javier Fernández #### Discovering Chrome’s pre-defined Custom Handlers In a previous post I described some architectural changes I’ve applied in Chrome to move the Custom Handlers logic into a new component. One of the advantages of this new architecture is an easier extensibility for embedders and less friction with the //chrome layer. Igalia and Protocol Labs has been collaborating for some years already to improve the multi-protocol capabilities in the main browsers; most of our work has been done in Chrome, but we have interesting contributions in Firefox and of course we have plans to increase or activity in Safari. As I said in the mentioned post, one of the long-term goals that Protocol Labs has with this work is to improve the integration of the IPFS protocol in the main browsers. On this regard, its undeniable that the Brave browser is the one leading this effort, providing a native implementation of the IPFS protocol that can work with both, a local IPFS node and the HTTPS gateway. For the rest of the browsers, Protocol Labs would like to at least implement the HTTPS gateway support, as an intermediate step and to increase the adoption and use of the protocol. This is where the Chrome’s pre-defined custom handlers feature could provide us a possible approach. This feature allows Chrome to register handlers for specific schemes at compile-time. The ChromeOS embedders use this mechanism to register handlers for the mailto and webcal schemes, to redirect the HTTP requests to mail.google.com and google.com/calendar sites respectively. The idea would be that Chrome could use the same approach to register pre-defined handlers for the IPFS scheme, redirecting the IPFS request to the HTTPS gateways. In order to understand better how this feature would work, I\’ll explain first how Chrome deals with the needs of the different embedders regarding the schemes. ## The SchemesRegistry The SchemeRegistry data structure is used to declare the schemes registered in the browser and assign some specific properties to different subsets of such group of schemes. The data structure defines different lists to provide these specific features to the schemes; the main and more basic list is perhaps the standard schemes, which is defined as follows: A standard-format scheme adheres to what RFC 3986 calls “generic URI syntax” (https://tools.ietf.org/html/rfc3986#section-3). There are other lists to lend specific properties to a well-defined set of schemes: // Schemes that are allowed for referrers. std::vector referrer_schemes = { {kHttpsScheme, SCHEME_WITH_HOST_PORT_AND_USER_INFORMATION}, {kHttpScheme, SCHEME_WITH_HOST_PORT_AND_USER_INFORMATION}, }; // Schemes that do not trigger mixed content warning. std::vector secure_schemes = { kHttpsScheme, kAboutScheme, kDataScheme, kQuicTransportScheme, kWssScheme, };  The following class diagram shows how the SchemeRegistry data structure is defined, with some duplication to avoid layering violations, and its relationship with the Content Public API, used by the Chrome embedders to define its own behavior for some schemes It’s worth mentioning that this relationship between the //url, //content and //blink layers shows some margin to be improved, as it was noted down by Dimitry Gozman (one of the Content Public API owners) in one of the reviews of the patches I’ve been working on. I hope we could propose some improvements on this code, as part of the goals that Igalia and Protocol Labs will define for 2023. ## Additional schemes for Embedders As it was commented on the preliminary discussions I had as part of the analysis of the initial refactoring work described in the first section, Chrome already provides a way for embedders to define new schemes and even modifying the behavior of the ones already registered. This is done via the Content Public API, as usual. The Chrome’s Content Layer offers a Public API that embedders can implement to define its own behavior for certain features. One of these abstract features is the addition of new schemes to be handled internally. The ContentClient interface has a method called AddAdditionalSchemes precisely for this purpose. On the other hand, the ContentBrowserClient interface provides the HasCustomSchemeHandler and IsHandledURL methods to allow embedders implement their specific behavior to handler such schemes. The following class diagram shows how embedders implements the Content Client interfaces to provide specific behavior to some schemes: I’ve applied a refactoring to allow defining pre-defined handlers via the SchemeRegistry. This is some excerpt of such refactoring: // Schemes with a predefined default custom handler. std::map predefined_handler_schemes; void AddPredefinedHandlerScheme(const char* new_scheme, const char* handler) { DoAddSchemeWithHandler( new_scheme, handler, GetSchemeRegistryWithoutLocking().predefined_handler_schemes); } const std::map GetPredefinedHandlerSchemes() { return GetSchemeRegistry().predefined_handler_schemes; }  Finally, embedders would just need to implement the ChromeContentClient::AddAdditionalSchemes interface to register a scheme and its associated handler. For instance, in order to install in Android a predefined handler for the IPFS protocol, we would just need to add in the AwContentClient::AddAdditionalSchemes function the following logic:  schemes.predefined_handler_schemes.emplace_back( "ipfs", "https://dweb.link/ipfs/?uri=%s"); schemes.predefined_handler_schemes.emplace_back( "ipns", "https://dweb.link/ipns/?uri=%s");  ## Conclusion The pre-defined handlers allow us to introduce in the browser a custom behavior for specific schemes. Unlike the registerProtocolHandler method, we could bypass some if the restrictions imposed by the HTML API, given that in the context of an embedder browser its possible to have more control on the navigation use cases. The custom handlers, either using the pre-defined handlers approach or the registerProtocolHandler method, is not the ideal way of introducing multi-protocol support in the browser. I personally consider it a promising intermediate step, but a more solid solution must be designed for these use cases of the browser. ## November 10, 2022 ### Danylo Piliaiev #### Debugging Unrecoverable GPU Hangs I already talked about debugging hangs in “Graphics Flight Recorder - unknown but handy tool to debug GPU hangs”, now I want to talk about the most nasty kind of GPU hangs - the ones which cannot be recovered from, where your computer becomes completely unresponsive and you cannot even ssh into it. How would one debug this? There is no data to get after the hang and it’s incredibly frustrating to even try different debug options and hypothesis, if you are wrong - you get to reboot the machine! If you are a hardware manufacturer creating a driver for your own GPU, you could just run the workload in your fancy hardware simulator, wait for a few hours for the result and call it a day. But what if you don’t have access to a simulator, or to some debug side channel? There are a few things one could try: • Try all debug options you have, try to disable compute dispatches, some draw calls, blits, and so on. The downside is that you have to reboot every time you hit the issue. • Eyeball the GPU packets until they start eyeballing you. • Breadcrumbs! Today we will talk about the breadcrumbs. Unfortunately, Graphics Flight Recorder, which I already wrote about, isn’t of much help here. The hang is unrecoverable so GFR isn’t able to gather the results and write them to the disk. But the idea of breadcrumbs is still useful! What if instead of gathering result post factum, we stream the results to some other machine. This would allow us to get the results even if our target becomes unresponsive. Though, the requirement to get the results ASAP considerably changes the workflow. What if we write breadcrumbs on GPU after each command and spin a thread on CPU reading it in a busy loop? In practice the the amount of breadcrumbs between the one sent over network and the one currently executed is just too big to be practical. So we have to make GPU and CPU running in a lockstep. • GPU writes a breadcrumb and immediately waits for this value to be acknowledged; • CPU in a busy loop checks the last written breadcrumb value, sends it over socket, and writes it back to the fixed address; • GPU sees a new value and continues execution. This way the most recent breadcrumb gets immediately sent over the network. In practice, some breadcrumbs are still lost between the last sent over the network and the one where GPU hangs. But the difference is only a few of them. With the lockstep execution we could narrow the hanging command even further. For this we have to wait for a certain time after each breadcrumb before proceeding to the next one. I chose to just promt user for explicit keyboard input for each breadcrumb. I ended up with the following workflow: • Run with breadcrumbs enabled but without requiring explicit user input; • On another machine receive the stream of breadcrumbs via network; • Note the last received breadcrumb, the hanging command would be nearby; • Reboot the target; • Enable breadcrumbs starting from a few breadcrumbs before the last one received, require explicit ack from the user on each breadcrumb; • In a few steps GPU should hang; • Now that we know the closest breadcrumb to the real hang location - we can get the command stream and see what happens right after the breadcrumb; • Knowing which command(s) is causing our hang - it’s now possible to test various changes in the driver. Could Graphics Flight Recorder do this? In theory yes, but it would require additional VK extension to be able to wait for value in memory. However, it still would have a crucial limitation. Even with Vulkan being really close to the hardware there are still many cases where one Vulkan command is translated into many GPU commands under the hood. Things like image copies, blits, renderpass boundaries. For unrecoverable hangs we want to narrow down the hanging GPU command as much as possible, so it makes more sense to implement such functionality in a driver. ## Turnip’s take on the breadcrumbs I recently implemented it in Turnip (open-source Vulkan driver for Qualcomm’s GPUs) and used a few times with good results. Current implementation in Turnip is rather spartan and gives the minimal amount of instruments to achieve the workflow described above. While looks like this: 1) Launch workload with TU_BREADCRUMBS envvar (TU_BREADCRUMBS=$IP:$PORT,break=$BREAKPOINT:$BREAKPOINT_HITS): TU_BREADCRUMBS=$IP:$PORT,break=-1:0 ./some_workload  2) Receive breadcrumbs on another machine via this bash spaghetti: > nc -lvup$PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

Received packet from 10.42.0.19:49116 -> 10.42.0.1:11113 (local)
1:0
7:0
8:0
9:0
10:0
11:0
12:0
13:0
14:0
[...]
10:3
11:3
12:3
13:3
14:3
15:3
16:3
17:3
18:3


Each line is a breadcrumb № and how many times it was repeated (either if command buffer is reusable or if breadcrumb is in a command stream which repeats for each tile).

3) Increase hang timeout:

echo -n 120000 > /sys/kernel/debug/dri/0/hangcheck_period_ms


TU_BREADCRUMBS=$IP:$PORT,break=15:3 ./some_workload

GPU is on breadcrumb 18, continue?y
GPU is on breadcrumb 19, continue?y
GPU is on breadcrumb 20, continue?y
GPU is on breadcrumb 21, continue?y
...


5) Continue until GPU hangs.

:tada: Now you know two breadcrumbs between which there is a command which causes the hang.

## Limitations

• If the hang was caused by a lack of synchronization, or a lack of cache flushing - it may not be reproducible with breadcrumbs enabled.
• While breadcrumbs could help to narrow down the hang to a single command - a single command may access many different states and have many steps under the hood (e.g. draw or blit commands).
• The method for breadcrumbs described here do affect performance, especially if tiling rendering is used, since breadcrumbs are repeated for each tile.

### Melissa Wen

#### V3D enablement in mainline kernel

Hey,

If you enjoy using upstream Linux kernel in your Raspberry Pi system or just want to give a try in the freshest kernel graphics drivers there, the good news is that now you can compile and boot the V3D driver from the mainline in your Raspberry Pi 4. Thanks to the work of Stefan, Peter and Nicolas [1] [2], the V3D enablement reached the Linux kernel mainline. That means hacking and using new features available in the upstream V3D driver directly from the source.

However, even for those used to compiling and installing a custom kernel in the Raspberry Pi, there are some quirks to getting the mainline v3d module available in 32-bit and 64-bit systems. I’ve quickly summarized how to compile and install upstream kernel versions (>=6.0) in this short blog post.

Note: V3D driver is not present in Raspberry Pi models 0-3.

First, I’m taking into account that you already know how to cross-compile a custom kernel to your system. If it is not your case, a good tutorial is already available in the Raspberry Pi documentation, but it targets the kernel in the rpi-linux repository (downstream kernel).

From this documentation, the main differences in the upstream kernel are presented below:

## Raspberry Pi 4 64-bit (arm64)

### Diff short summary:

1. instead of getting the .config file from bcm2711_defconfig, get it by running make ARCH=arm64 defconfig
2. compile and install the kernel image and modules as usual, but just copy the dtb file arch/arm64/boot/dts/broadcom/bcm2711-rpi-4-b.dtb to the /boot of your target machine (no /overlays directory to copy)
3. change /boot/config.txt:
• comment/remove the dt_overlay=vc4-kms-v3d entry
• add a device_tree=bcm2711-rpi-4-b.dtb entry

## Raspberry Pi 4 32-bits (arm)

### Diff short summary:

1. get the .config file by running make ARCH=arm multi_v7_defconfig
2. using make ARCH=arm menuconfig or a similar tool, enable CONFIG_ARM_LPAE=y
3. compile and install the kernel image and modules as usual, but just copy the dtb file arch/arm/boot/dts/bcm2711-rpi-4-b.dtb to the /boot of your target machine (no /overlays directory to copy)
4. change /boot/config.txt:
• comment/remove the dt_overlay=vc4-kms-v3d entry
• add a device_tree=bcm2711-rpi-4-b.dtb entry

## Step-by-step for remote deployment:

### Set variables

Raspberry Pi 4 64-bit (arm64)

cd <path-to-upstream-linux-directory>
KERNEL=make kernelrelease
ARCH="arm64"
CROSS_COMPILE="aarch64-linux-gnu-"
DEFCONFIG=defconfig
IMAGE=Image
RPI4=<ip>
TMP=mktemp -d


Raspberry Pi 4 32-bits (arm)

cd <path-to-upstream-linux-directory>
KERNEL=make kernelrelease
ARCH="arm"
CROSS_COMPILE="arm-linux-gnueabihf-"
DEFCONFIG=multi_v7_defconfig
IMAGE=zImage
DTB_PATH="bcm2711-rpi-4-b.dtb"
RPI4=<ip>
TMP=mktemp -d


make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE $DEFCONFIG  Raspberry Pi 4 32-bit (arm) Additional step for 32-bit system. Enable CONFIG_ARM_LPAE=y using make ARCH=arm menuconfig ### Cross-compile the mainline kernel make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE$IMAGE modules dtbs


make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE INSTALL_MOD_PATH=$TMP modules_install  ### Copy kernel image, modules and the dtb to your remote system ssh$RPI4 mkdir -p /tmp/new_modules /tmp/new_kernel
rsync -av $TMP/$RPI4:/tmp/new_modules/
scp arch/$ARCH/boot/$IMAGE $RPI4:/tmp/new_kernel/$KERNEL
scp arch/$ARCH/boot/dts/$DTB_PATH $RPI4:/tmp/new_kernel ssh$RPI4 sudo rsync -av /tmp/new_modules/lib/modules/ /lib/modules/
ssh $RPI4 sudo rsync -av /tmp/new_kernel/ /boot/ rm -rf$TMP


### Set config.txt of you RPi 4

In your Raspberry Pi 4, open the config file /boot/config.txt

• comment/remove the dt_overlay=vc4-kms-v3d entry
• add a device_tree=bcm2711-rpi-4-b.dtb entry
• add a kernel=<image-name> entry

## Why not Kworkflow?

You can safely use the steps above, but if you are hacking the kernel and need to repeat this compiling and installing steps repeatedly, why don’t try the Kworkflow?

Kworkflow is a set of scripts to synthesize all steps to have a custom kernel compiled and installed in local and remote machines and it supports kernel building and deployment to Raspberry Pi machines for Raspbian 32 bit and 64 bit.

After learning the kernel compilation and installation step by step, you can simply use kw bd command and have a custom kernel installed in your Raspberry Pi 4.

Note: the userspace 3D acceleration (glx/mesa) is working as expected on arm64, but the driver is not loaded yet for arm. Besides that, a bunch of pte invalid errors may appear when using 3D acceleration, it’s a known issue that are still under investigation.

## November 01, 2022

### Abhijeet Kandalkar

When I started working on analyzing the startup time of a browser, I didn’t have an idea where to start. For me, Performance was a number game and trial-n-error approach. The challenge I was looking into is described in bug 1270977, in which the chrome binary hosted on two different partitions behaved differently.

BROWSER LaCrOS Mount point Partition fs type comment
B2 deployedLacros /usr/local/lacros-chrome stateful ext4 pushed by developer

You can learn more about LaCrOS in my previous blog. From the above table, the difference in filesystem was the first suspect. Squashfs is a compressed read-only file system for Linux.

To emulate the first launch, all caches are cleared using drop_caches as below. echo 3 > /proc/sys/vm/drop_caches

#### How to measure a file system read time?

For investigation and testing/benchmarking, sequential read requests are emulated using a dd command and random read requests are emulated using fio(Flexible IO Tester)

For sequential read, B1 took around 92MB/s while B2 took 242MB/s.(Higher read speed means faster). Its a huge difference. ext4 is faster than squashfs.

localhost ~ # echo 3 > /proc/sys/vm/drop_caches
localhost ~ # time dd
43910+1 records in
43910+1 records out
179859368 bytes (180 MB, 172 MiB) copied, 1.93517 s, 92.9 MB/s

real    0m1.951s
user    0m0.020s
sys 0m0.637s

localhost ~ # echo 3 > /proc/sys/vm/drop_caches
localhost ~ # time dd if=/usr/local/lacros-chrome/chrome of=/dev/null bs=4096
43910+1 records in
43910+1 records out
179859368 bytes (180 MB, 172 MiB) copied, 0.744337 s, 242 MB/s

real    0m0.767s
user    0m0.056s
sys 0m0.285s



As Squashfs is a compressed file system, it first uncompresses the data and then performs a read. So definitely it’s going to be slower than ext4 but the difference(242 -92) was huge and unacceptable.

#### Can we do better?

The next step was to find some tools that could provide more information.

• strace - Linux utility to find out system calls that take more time or have blocking IO. But couldn’t get much information out of this analysis.
• iostat - This provides more insight information for input/output

$echo 3 > /proc/sys/vm/drop_caches && iostat -d -tc -xm 1 20 After running this command for both B1 and B2, it appears that B1 is not merging read requests. In other words, rrqm/s and %rrqm values are ZERO. Refer man-iostat %rrqm The percentage of read requests merged together before being sent to the device. rrqm/s The number of read requests merged per second that were queued to the device. The data to be read is scattered in different sectors of a different block. As a result of scattering, the read requests were not merged, so there could be a lot of read requests down to the block device. Next step was to find out who is responsible for merging read requests. #### IO scheduler IO scheduler optimizes disk access requests by merging I/O requests to similar locations on disk. To get the current IO scheduler for a disk-device $/sys/block/<disk device>/queue/scheduler


for B1,

localhost ~ # cat /sys/block/dm-1/queue/scheduler
none


For B2, it uses the bfq scheduler

localhost ~ # cat /sys/block/mmcblk0/queue/scheduler


As you can notice, B1 does not have scheduler. For B2, scheduler is bfq(Budget Fair Queueing).

SUSE and IBM has good documentation about IO scheduling.

It has been found so far that there are differences between file systems, read request merge patterns, and IO schedulers. But there is one more difference which we havent noticed yet.

If you look at how B1 is mounted, you will see that we are using a virtual device.

localhost ~ # df -TH
Filesystem  Type      Size  Used  Avail  Use%  Mounted on
/dev/dm-1   squashfs  143M  143M      0  100%  /run/imageloader/lacros-dogfood-dev/99.0.4824.0


#### A twist in my story

For B1, we are not directly communicating with the block device instead the requests are going through a virtual device using a device mapper. As you might have noticed above, dm-1 device mapper virtual device is used by B1.

We are using a dm-verity as a virtual device.

Device-Mapper’s “verity” target provides transparent integrity checking of block devices using a cryptographic digest provided by the kernel crypto API. This target is read-only.

From a security point of view, it makes sense to host your executable binary to a read-only files system that provides integrity checking.

There is a twist in that dm-verity does not have a scheduler, and scheduling is delegated to the scheduler of the underlying block device. The I/O will not be merged at the device-mapper layer, but at an underlying layer.

Several layers are different between B1 and B2, including partition encryption(dm-verity) and compression(squashfs).

In addition, we found that B1 defaults to using --direct-io

### What is DIRECT IO ?

Direct I/O is a feature of the file system whereby file reads and writes go directly from the applications to the storage device, bypassing the operating system read and write caches.

Although direct I/O can reduce CPU usage, using it typically results in the process taking longer to complete, especially for relatively small requests. This penalty is caused by the fundamental differences between normal cached I/O and direct I/O.

Every direct I/O read causes a synchronous read from the disk; unlike the normal cached I/O policy where the read may be satisfied from the cache. This can result in very poor performance if the data was likely to be in memory under the normal caching policy.

### Is it possible to control how many read requests are made?

We decided to go deeper and analyze the block layer. In the block layer the request queue is handled, and I/O scheduling is done. To see how the flow of IO goes through the block layer, there is a suit of tools consisting of

• blktrace (tracing the IO requests through the block layer),
• blkparse (parsing the output of blktrace to make it human readable),
• btrace (script to combine blktrace and blkparse, and
• btt (a blktrace output post-processing tool)), among others.

Below is the result of block-level parsing.

For B1,

ALL MIN AVG MAX N
Q2Q 0.000000001 0.000069861 0.004097547 94883
Q2G 0.000000242 0.000000543 0.000132620 94884
G2I 0.000000613 0.000001455 0.003585980 94884
I2D 0.000001399 0.000002456 0.003777911 94884
D2C 0.000133042 0.000176883 0.001771526 23721
Q2C 0.000136308 0.000181337 0.003968613 23721

For B2,

ALL MIN AVG MAX N
Q2Q 0.000000001 0.000016582 0.003283451 76171
Q2G 0.000000178 0.000000228 0.000016893 76172
G2I 0.000000626 0.000005904 0.000587304 76172
I2D 0.000002011 0.000007582 0.000861783 76172
D2C 0.000004959 0.000067955 0.001275667 19043
Q2C 0.000009132 0.000081670 0.001315828 19043
• Number of read requests(N) are more for B1 than B2.
• For B1, N = 94883 (partition with squashfs + dm-verity)
• For B2, N = 76171 (ext4 partition)
• Average D2C time for B1 is more than twice that of B2 (0.000176883 > 2*0.000067955).

D2C time: the average time from when the actual IO was issued to the driver until is completed (completion trace) back to the block IO layer.

--d2c-latencies option: This option instructs btt to generate the D2C latency ﬁle. This file has the ﬁrst column (X values) representing runtime (seconds), while the second column (Y values) shows the actual latency for command at that time (either Q2D, D2C or Q2C).

### FIO(Flexible IO Tester)

Up until now, all analysis has been performed using a dd which emulate sequential read requests.

Read access pattern also affects a IO performace so to emulated a random IO workloads tools like FIO is useful.

When you specify a --size, FIO creates a file with filename and when you don’ specify it FIO reads form a filename.

We are testing an existing chrome binary without providing its size.

localhost~# fio --name=4k_read --ioengine=libaio --iodepth=1 --rw=randread --bs=4k --norandommap --filename=<dir>/chrome

### Hotspots

A “hotspot” is a place where the app is spending a lot of time. After performing above analysis, find those areas and speed them up. This is easily done using a profiling tool like It has been found so far that there are differences between file systems, read request merge patterns, and IO schedulers. But there one more difference which we havent noticed yet.

For optimal performance,

• reduce the number of requests (N)
• by tuning the blocksize of the filesystem(squashfs) of the partition
• reduce the time taken to process each request(D2C)
• by tuning a compression algorithm used by squashfs filesystem
• by tuning a hashing function used by dm-verity

#### Tuning a blocksize

Block is a continuous location on the hard drive where data is stored. In general, FileSystem stores data as a collection of blocks. When the block size is small, the data gets divided into blocks and distributed in more blocks. As more blocks are created, there will be more number of read requests.

Squashfs compresses files, inodes, and directories, and supports block sizes from 4 KiB up to 1 MiB for greater compression.

We experimented with different block sizes of 4K, 64K, 128K, 256K. As the block size is increasing, the read speed for DD is increasing. We have chosen 128K.

mksquashfs /tmp/squashfs-root /home/chronos/cros-components/lacros-dogfood-dev/100.0.4881.0/image.squash.test.<block-size> -b <block-size>K

#### Tuning a compression algorithm

Squashfs supports different compression algorithms. mksquashfs command has the option -comp <comp> to configure the compression algorithm. There are different compressors available:

• gzip (default)
• lzma
• lzo
• lz4
• xz
• zstd

mksquashfs /tmp/squashfs-root /home/chronos/cros-components/lacros-dogfood-dev/100.0.4881.0/image.squash.test.<block-size> -b <block-size>K -comp gzip

#### Tuning a hashing algorithm

This is related to dm-verity. dm-verity decrypt the read request. It uses different hashing algorithms.

• sha256 (default)
• sha224
• sha1
• md5 We can use hardware implementation of the above algorithms to speed up things.

### Conclusion

The purpose of this blog post is to demonstrate how to identify the hotspot and how to explore different parameters affecting the system’s performance. It was an interesting experience and I learned a lot. By sharing all the details about how the problem was broken down and improved, I hope others can learn something as well. As a result of playing around with the various combinations of parameters such as blocksize, compression, and direct-io, we have decided to go with the following configuration.

1. For Filesystem,
• Blocksize : Changed from 4K to 128K
• Compression algorithm : Changed from gzip to xz
• direct-io : Changed from on to off
2. dm-verity :
• Hashing algorithm : Currently it used sha256. Either use hardware implementation of sha256 or switch to other algorithm.

Your first step should be to identify your current stack and then analyze each layer of your file system stack using the available tools and optimize them. The information provided in this blog may be helpful to you in deciding what to do about your own performance issues. You can get it touch with me if you need any help.

## October 31, 2022

### Andy Wingo

#### ephemerons and finalizers

Good day, hackfolk. Today we continue the series on garbage collection with some notes on ephemerons and finalizers.

conjunctions and disjunctions

First described in a 1997 paper by Barry Hayes, which attributes the invention to George Bosworth, ephemerons are a kind of weak key-value association.

Thinking about the problem abstractly, consider that the garbage collector's job is to keep live objects and recycle memory for dead objects, making that memory available for future allocations. Formally speaking, we can say:

• An object is live if it is in the root set

• An object is live it is referenced by any live object.

This circular definition uses the word any, indicating a disjunction: a single incoming reference from a live object is sufficient to mark a referent object as live.

Ephemerons augment this definition with a conjunction:

• An object V is live if, for an ephemeron E containing an association betweeen objects K and V, both E and K are live.

This is a more annoying property for a garbage collector to track. If you happen to mark K as live and then you mark E as live, then you can just continue to trace V. But if you see E first and then you mark K, you don't really have a direct edge to V. (Indeed this is one of the main purposes for ephemerons: associating data with an object, here K, without actually modifying that object.)

During a trace of the object graph, you can know if an object is definitely alive by checking if it was visited already, but if it wasn't visited yet that doesn't mean it's not live: we might just have not gotten to it yet. Therefore one common implementation strategy is to wait until tracing the object graph is done before tracing ephemerons. But then we have another annoying problem, which is that tracing ephemerons can result in finding more live ephemerons, requiring another tracing cycle, and so on. Mozilla's Steve Fink wrote a nice article on this issue earlier this year, with some mitigations.

finalizers aren't quite ephemerons

All that is by way of introduction. If you just have an object graph with strong references and ephemerons, our definitions are clear and consistent. However, if we add some more features, we muddy the waters.

Consider finalizers. The basic idea is that you can attach one or a number of finalizers to an object, and that when the object becomes unreachable (not live), the system will invoke a function. One way to imagine this is a global association from finalizable object O to finalizer F.

As it is, this definition is underspecified in a few ways. One, what happens if F references O? It could be a GC-managed closure, after all. Would that prevent O from being collected?

Ephemerons solve this problem, in a way; we could trace the table of finalizers like a table of ephemerons. In that way F would only be traced if O is live already, so that by itself it wouldn't keep O alive. But then if O becomes dead, you'd want to invoke F, so you'd need it to be live, so reachability of finalizers is not quite the same as ephemeron-reachability: indeed logically all F values in the finalizer table are live, because they all will be invoked at some point.

In the end, if F references O, then F actually keeps O alive. Whether this prevents O from being finalized depends on our definition for finalizability. We could say that an object is finalizable if it is found to be unreachable after a full trace, and the finalizers F are in the root set. Or we could say that an object is finalizable if it is unreachable after a partial trace, in which finalizers are not themselves in the initial root set, and instead we trace them after determining the finalizable set.

Having finalizers in the initial root set is unfortunate: there's no quick check you can make when adding a finalizer to signal this problem to the user, and it's very hard to convey to a user exactly how it is that an object is referenced. You'd have to add lots of gnarly documentation on top of the already unavoidable gnarliness that you already had to write. But, perhaps it is a local maximum.

Incidentally, you might think that you can get around these issues by saying "don't reference objects from their finalizers", and that's true in a way. However it's not uncommon for finalizers to receive the object being finalized as an argument; after all, it's that object which probably encapsulates the information necessary for its finalization. Of course this can lead to the finalizer prolonging the longevity of an object, perhaps by storing it to a shared data structure. This is a risk for correct program construction (the finalized object might reference live-but-already-finalized objects), but not really a burden for the garbage collector, except in that it's a serialization point in the collection algorithm: you trace, you compute the finalizable set, then you have to trace the finalizables again.

ephemerons vs finalizers

The gnarliness continues! Imagine that O is associated with a finalizer F, and also, via ephemeron E, some auxiliary data V. Imagine that at the end of the trace, O is unreachable and so will be dead. Imagine that F receives O as an argument, and that F looks up the association for O in E. Is the association to V still there?

Guile's documentation on guardians, a finalization-like facility, specifies that weak associations (i.e. ephemerons) remain in place when an object becomes collectable, though I think in practice this has been broken since Guile switched to the BDW-GC collector some 20 years ago or so and I would like to fix it.

One nice solution falls out if you prohibit resuscitation by not including finalizer closures in the root set and not passing the finalizable object to the finalizer function. In that way you will never be able to look up E×OV, because you don't have O. This is the path that JavaScript has taken, for example, with WeakMap and FinalizationRegistry.

However if you allow for resuscitation, for example by passing finalizable objects as an argument to finalizers, I am not sure that there is an optimal answer. Recall that with resuscitation, the trace proceeds in three phases: first trace the graph, then compute and enqueue the finalizables, then trace the finalizables. When do you perform the conjunction for the ephemeron trace? You could do so after the initial trace, which might augment the live set, protecting some objects from finalization, but possibly missing ephemeron associations added in the later trace of finalizable objects. Or you could trace ephemerons at the very end, preserving all associations for finalizable objects (and their referents), which would allow more objects to be finalized at the same time.

Probably if you trace ephemerons early you will also want to trace them later, as you would do so because you think ephemeron associations are important, as you want them to prevent objects from being finalized, and it would be weird if they were not present for finalizable objects. This adds more serialization to the trace algorithm, though:

1. (Add finalizers to the root set?)

2. Trace from the roots

3. Trace ephemerons?

4. Compute finalizables

5. Trace finalizables (and finalizer closures if not done in 1)

6. Trace ephemerons again?

These last few paragraphs are the reason for today's post. It's not clear to me that there is an optimal way to compose ephemerons and finalizers in the presence of resuscitation. If you add finalizers to the root set, you might prevent objects from being collected. If you defer them until later, you lose the optimization that you can skip steps 5 and 6 if there are no finalizables. If you trace (not-yet-visited) ephemerons twice, that's overhead; if you trace them only once, the user could get what they perceive as premature finalization of otherwise reachable objects.

In Guile I think I am going to try to add finalizers to the root set, pass the finalizable to the finalizer as an argument, and trace ephemerons twice if there are finalizable objects. I think this wil minimize incoming bug reports. I am bummed though that I can't eliminate them by construction.

Until next time, happy hacking!

## October 28, 2022

### Jacobo Aragunde

#### Accessible name computation in Chromium Chapter I: Introduction

Back in March last year, as part of my maintenance work on Chrome/Chromium accessibility at Igalia, I inadvertently started a long, deep dive into the accessible name calculation code, that had me intermittently busy since then.

As a matter of fact, I started drafting this post over a year ago, and I already presented part of my work in a lightning talk at BlinkOn 15, in November 2021:

This series of posts is the promised closure from that talk, that extends and updates what was presented there.

### Introduction

Accessible name and description computation is an important matter in web accessibility. It conveys how user agents (web browsers) are expected to expose the information provided by web contents in text from, in a way it’s meaningful for users of accessibility technologies (ATs). This text is meant to be spoken, or routed to a braille printer.

We know a lot of the Web is text, but it’s not always easy. Style rules can affect what is visible and hence, what is part of a name or description and what is not. There are specific properties to make content hidden for accessibility purposes (aria-hidden), and also to establish relations between elements for naming and description purposes (aria-labelledby, aria-describedby) beyond the markup hierarchy, potentially introducing loops. A variety of properties are used to provide naming, specially for non-text content, but authors may apply them anywhere (aria-label, alt, title) and we need to decide their priorities. There are a lot of factors in play.

There is a document maintained by the ARIA Working Group at W3C, specifying how to do this: https://w3c.github.io/accname/. The spec text is complex, as it has to lay out the different ways to name content, settle their priorities when they interact and define how other properties affect them.

### Accessible name/description in Chromium

You may have heard the browser keeps multiple tree structures in memory to be able to display your web sites. There is the DOM tree, which corresponds with the nodes as specified in the HTML document after it’s been parsed, any wrong markup corrected, etc. It’s a live tree, meaning that it’s not static; JavaScript code and user interaction can alter it.

There is a layout tree, which I won’t dare explaining because it’s out of my experience field, but we can say it’s related with how the DOM is displayed: what is hidden and what is visible, and where it’s placed.

And then we have the accessibility tree: it’s a structure that keeps the hierarchy and information needed for ATs. It’s based on the DOM tree but they don’t have the same information: nodes in those trees have different attributes but, more importantly, there is not a 1:1 relation between DOM and accessibility nodes. Not every DOM node needs to be exposed in the accessibility tree and, because of aria-owns relations, tree hierarchies can be different too. It’s also a live tree, updated by JS code and user interaction.

In Chromium implementation, we must also talk about several accessibility trees. There is one tree per site, managed by Blink code in the renderer process, and represented in an internal and platform-agnostic way for the most part. But the browser process needs to compose the tree for the active tab contents with the browser UI and expose it through the means provided by the operating system. More importantly, these accessibility objects are different from the ones used in Blink, because they are defined by platform-specific abstractions: NSAccessibility, ATK/AT-SPI, MSAA/UIA/IAccessible2…

When working on accessible tree computation, most of the code I deal with is related with the Blink accessibility tree, save for bugs in the translation to platform abstractions. It is, still, a complex piece of code, more than I initially imagined. The logic is spread across twelve functions belonging to different classes, and they call each other creating cycles. It traversed the accessibility tree, but also the DOM tree in certain occasions.

### The problems

Last year, my colleague Joanmarie Diggs landed a set of tests in Chromium, adapted from the test cases used by the Working Group as exit criteria for Candidate Recommendations. There were almost 300 tests, most of them passing just fine, but a few exposed bugs that I thought could be a good introduction to this part of the accessibility code.

The first issue I tried to fix was related to whitespace in names, as it looked simple enough to be a good first contact with the code. The issue consisted on missing whitespace in the accessible names: while empty or whitespace nodes translated to spaces in rendered nodes, the same didn’t happen in the accessible names and descriptions. Consider this example:

<label for="test">
<span>1</span><span>2</span><span>3</span>
<span>4</span><span>5</span>
</label>
<input id="test"/>


The label is rendered as “123 45” but Chromium produced the accessible name “12345”. It must be noticed that the spec is somewhat indefinite with regard of the whitespace, which is already reported, but we considered Chromium’s behavior was wrong for two reasons: it didn’t match the rendered text, and it was different from the other implementations. I reported the issue at the Chromium bug tracker.

Another, bigger problem I set to fix was that the CSS pseudo-elements ::before and ::after weren’t added to accessible names. Consider the markup:

<style>
label:after { content:" fruit"; }
</style>
<label for="test">fancy</label>
<input id="test"/>


The rendered text is “fancy fruit”, but Chromium was producing the accessible name “fancy”. In addition to not matching the rendered text, the spec explicitly states those pseudo-elements must contribute to the accessible name. I reported this issue, too.

I located the source of both problems at AXNodeObject::TextFromDescendants: this function triggers the calculation of the name for a node based on its descendants, and it was traversing the DOM tree to obtain them. The DOM tree does not contain the aforementioned pseudo-elements and, for that reason, they were not being added to the name. Additionally, we were explicitly excluding whitespace nodes in this traversal, with the intention of preventing big chunks of whitespace being added to names, like new lines and tabs to indent the markup.

A question hit me at this point: why were we using the DOM for this in the first place? That meant doing an extra traversal for every name calculated, and duplicating code that selected which nodes are important. It seems the reasons for this traversal were lost in time, according to code comments:

// TODO: Why isn't this just using cached children?


While I could land (and I did) temporary fixes for these individual bugs, it looked like we could fix “the traversal problem” instead, addressing the root cause of the problems while streamlining our code and, potentially, fixing other issues. I will address this problem, the reasons to traverse the DOM in first place and the interesting detours this work lead me, in a future post. Stay tuned, and happy hacking!

EDIT: next chapter available here!

## October 27, 2022

### Eric Meyer

I talked in my last post about how I used linear gradients to recreate dashed lines for the navlinks and navbar of wpewebkit.org, but that wasn’t the last instance of dashing gradients in the design.  I had occasion to use that technique again, except this time as a way to create a gradient dash.  I mean, a dashed line that has a visible gradient from one color to another, as well as a dashed line that’s constructed using a linear gradient.  Only this time, the dashed gradient is used to create the gaps, not the dashes.

To set the stage, here’s the bit of the design I had to create:

The vertical dashed line down the left side of the design is a dashed linear gradient, but that’s not actually relevant.  It could have been a dashed border style, or an SVG, or really anything.  And that image on the right is a PNG, but that’s also not really relevant.  What’s relevant is I had to make sure the image was centered in the content column, and yet its dashed line connected to the left-side dashed line, regardless of where the image landed in the actual content.

Furthermore, the colors of the two dashed lines were different at the point I was doing the implementation: the left-side line was flat black, and the line in the image was more of a dark gray.  I could have just connected a dark gray line to the black, but it didn’t look right.  A color transition was needed, while still being a dashed line.

Ideally, I was looking for a solution that would allow a smooth color fade over the connecting line’s length while also allowing the page background color to show through.  Because the page background might sometimes be white and sometimes be a light blue and might in the future be lime green wavy pattern or something, who knows.

So I used a dashed linear gradient as a mask.  The CSS goes like this:

.banner::before {
content: '';
position: absolute;
top: 50%;
left: -5rem;
width: 5rem;
height: 1px;
270deg, transparent, #999 1px 3px, transparent 4px 7px);
}

Please allow me to break it down a step at a time.

First, there’s the positioning of the pseudo-element, reaching leftward 5rem from the left edge of the content column, which here I’ve annotated with a red outline. (I prefer outlines to borders because outlines don’t participate in layout, so they can’t shift things around.)

.banner::before {
content: '';
position: absolute;
top: 50%;
left: -5rem;
width: 5rem;
height: 1px;
}

To that pseudo-element, I added a 90-degree-pointing linear gradient from black to gray to its background.

	…
}

The pseudo-element does happen to touch the end of one of the vertical dashes, but that’s purely by coincidence.  It could have landed anywhere, including between two dashes.

So now it was time for a mask.  CSS Masks are images used to hide parts of an element based on the contents of the masking image, usually its alpha channel. (Using luminosity to define transparency levels via the mask-mode property is also an option, but Chrome doesn’t support that as yet.)  For example, you can use small images to clip off the corners of an element.

In this case, I defined a repeating linear gradient because I knew what size the dashes should be, and I didn’t want to mess with mask-size and mask-repeat (ironically enough, as you’ll see in a bit).  This way, the mask is 100% the size of the element’s box, and I just need to repeat the gradient pattern however many times are necessary to cross the entire width of the element’s background.

Given the design constraints, I wanted the dash pattern to start from the right side, the one next to the image, and repeat leftward, to ensure that the dash pattern would look unbroken.  Thus I set its direction to be 270 degrees.  Here I’ll have it alternate between red and transparent, because the actual color used for the opaque parts of the mask doesn’t matter, only its opacity:

	…
270deg, transparent, red 1px 3px, transparent 4px 7px);
}

The way the mask works, the transparent parts of the masking image cause the corresponding parts of the element to be transparent  —  to be clipped away, in effect.  The opaque parts, whatever color they actually have, cause the corresponding parts of the element to be drawn.  Thus, the parts of the background’s black-to-gray gradient that line up with the opaque parts of the mask are rendered, and the rest of the gradient is not.  Thus, a color-fading dashed line.

This is actually why I separated the ends of the color stops by a pixel: by defining a one-pixel distance for the transition from transparent to opaque (and vice versa), the browser fills those pixels in with something partway between those two states.  It’s probably 50% opaque, but that’s really up to the browser.

The result is a softening effect, which matches well with the dashed line in the image itself and doesn’t look out of place when it meets up with the left-hand vertical dashed line.

At least, everything I just showed you is what happens in Firefox, which proudly supports all the masking properties.  In Chromium and WebKit browsers, you need vendor prefixes for masking to work.  So here’s how the CSS turned out:

.banner::before {
content: '';
position: absolute;
top: 50%;
left: -5rem;
width: 5rem;
height: 1px;
270deg, transparent, red 1px 3px, transparent 4px 7px);
270deg, transparent, red 1px 3px, transparent 4px 7px);
}

And that’s where we were at launch.

It still bugged me a little, though, because the dashes I created were one pixel tall, but the dashes in the image weren’t.  They were more like a pixel and a half tall once you take the aliasing into account, so they probably started out 2 (or more)  pixels tall and then got scaled down.  The transition from those visually-slightly-larger dashes to the crisply-one-pixel-tall pseudo-element didn’t look quite right.

I’d hoped that just increasing the height of the pseudo-element to 2px would work, but that made the line look too thick.  That meant I’d have to basically recreate the aliasing myself.

At first I considered leaving the mask as it was and setting up two linear gradients, the existing one on the bottom and a lighter one on top.  Instead, I chose to keep the single background gradient and set up two masks, the original on the bottom and a more transparent one on top.  Which meant I’d have to size and place them, and keep them from tiling.  So, despite my earlier avoidance, I had to mess with mask sizing and repeating anyway.

mask-image:
270deg, transparent, #89A4 1px 3px, transparent 4px 7px),
270deg, transparent, #89A 1px 3px, transparent 4px 7px);
mask-position: 100% 0%, 100% 100%;

And that’s how it stands as of today.

In the end, this general technique is pretty much infinitely adaptable, in that you could define any dash pattern you like and have it repeat over a gradient, background image, or even just a plain background color.  It could be used to break up an element containing text, so it looks like the text has been projected onto a thick dashed line, or put some diagonal slashes through text, or an image, or a combination.

Or use a tiled radial gradient to make your own dotted line, one that’s fully responsive without ever clipping a dot partway through.  Wacky line patterns with tiled, repeated conic gradients?  Sure, why not?

The point being, masks let you break up the rectangularity of elements, which can go a long way toward making designs feel more alive.  Give ’em a try and see what you come up with!

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## October 22, 2022

### Ricardo García

#### My mesh shaders talk at XDC 2022

* { padding: 0; margin: 0; overflow: hidden; } html, body { height: 100%; } img, span { /* All elements take the whole iframe width and are vertically centered. */ position: absolute; width: 100%; top: 0; bottom: 0; margin: auto; } span { /* This mostly applies to the play button. */ height: 1.5em; text-align: center; font-family: sans-serif; font-size: 500%; color: white; } "> * { padding: 0; margin: 0; overflow: hidden; } html, body { height: 100%; } img, span { /* All elements take the whole iframe width and are vertically centered. */ position: absolute; width: 100%; top: 0; bottom: 0; margin: auto; } span { /* This mostly applies to the play button. */ height: 1.5em; text-align: center; font-family: sans-serif; font-size: 500%; color: white; } " >

Just after me, Timur Kristóf also presented an excellent talk with details about the Mesa mesh shader implementation for RADV, available as part of the same playlist.

As an additional resource, I’m going to participate together with Timur, Steven Winston and Christoph Kubisch in a Khronos Vulkanised Webinar to talk a bit more about mesh shaders on October 27. You must register to attend, but attendance is free.

Back to XDC, crossing the Atlantic Ocean to participate in the event was definitely tiring, but I had a lot of fun at the conference. It was my first in-person XDC and a special one too, this year hosted together with WineConf and FOSS XR. Seeing everyone there and shaking some hands, even with our masks on most of the time, made me realize how much I missed traveling to events. Special thanks to Codeweavers for organizing the conference, and in particular to Jeremy White and specially to Arek Hiler for taking care of most technical details and acting as a host and manager in the XDC room.

Apart from my mesh shaders talk, do not miss other talks by Igalians at the conference:

And, of course, take a look at the whole event playlist for more super-interesting content, like the one by Alyssa Rosenzweig and Lina Asahi about reverse-engineered GPU drivers, the one about Zink by Mike Blumenkrantz (thanks for the many shout-outs!) or the guide to write Vulkan drivers by Jason Ekstrand which includes some info about the new open source Vulkan driver for NVIDIA cards.

That said, if you’re really interested in my talk but don’t want to watch a video (or the closed captions are giving you trouble), you can find the slides and the script to my talk below. Each thumbnail can be clicked to view a full-HD image.

#### Talk slides and script

Hi everyone, I’m Ricardo Garcia from Igalia. Most of my work revolves around CTS tests and the Vulkan specification, and today I’ll be talking about the new mesh shader extension for Vulkan that was published a month ago. I participated in the release process of this extension by writing thousands of tests and reviewing and discussing the specification text for Vulkan, GLSL and SPIR-V.

Mesh shaders are a new way of processing geometry in the graphics pipeline. Essentially, they introduce an alternative way of creating graphics pipelines in Vulkan, but they don’t introduce a completely new type of pipeline.

The new extension is multi-vendor and heavily based on the NVIDIA-only extension that existed before, but some details have been fine-tuned to make it closer to the DX12 version of mesh shaders and to make it easier to implement for other vendors.

I want to cover what mesh shaders are, how they compare to the classic pipelines and how they solve some problems.

Then we’ll take a look at what a mesh shader looks like and how it works, and we’ll also talk about drawbacks mesh shaders have.

Mesh shading introduces a new type of graphics pipeline with a much smaller number of stages compared to the classic one. One of the new stages is called the mesh shading stage.

These new pipelines try to address some issues and shortcomings with classic pipelines on modern GPUs. The new pipeline stages have many things in common with compute shaders, as we’ll see.

This is a simplified version of the classic pipeline.

Basically the pipeline can be divided in two parts. The first stages are in charge of generating primitives for the rasterizer.

Then the rasterizer does a lot of magic including primitive clipping, barycentric interpolation and preparing fragment data for fragment shader invocations.

It’s technically possible to replace the whole pipeline with a compute shader (there’s a talk on Thursday about this), but mesh shaders do not touch the rasterizer and everything that comes after it.

Mesh shaders try to apply a compute model to replace some of this with a shader that’s similar to compute, but the changes are restricted to the first part of the pipeline.

If I have to cut the presentation short, this is perhaps one of the slides you should focus on.

Mesh shading employs a shader that’s similar to compute to generate geometry for the rasterizer.

There’s no input assembly, no vertex shader, no tessellation, etc. Everything that you did with those stages is now done in the new mesh shading stage, which is a bit more flexible and powerful.

In reality, the mesh shading extension actually introduces two new stages. There’s an optional task shader that runs before the mesh shader, but we’ll forget about it for now.

These are the problems mesh shading tries to solve.

Vertex inputs are a bit annoying to implement in drivers and in some hardware they use specific fixed function units that may be a bottleneck in some cases, at least in theory.

The main pain point is that vertex shaders work at the per-vertex level, so you don’t generally have control of how geometry is arranged in primitives. You may run several vertex shader invocations that end up forming a primitive that faces back and is not visible and there’s no easy way to filter those out, so you waste computing power and memory bandwidth reading data for those vertices. Some implementations do some clever stuff here trying to avoid these issues.

Finally, tessellation and geometry shaders should perhaps be simpler and more powerful, and should work like compute shaders so we process vertices in parallel more efficiently.

So far we know mesh shaders look a bit like compute shaders and they need to generate geometry somehow because their output goes to the rasterizer, so lets take a look.

The example will be in GLSL to make it easier to read. As you can see, it needs a new extension which, when translated to SPIR-V will be converted into a SPIR-V extension that gives access to some new opcodes and functionality.

The first similarity to compute is that mesh shaders are dispatched in 3d work groups like compute shaders, and each of them has a number of invocations in 3d controlled by the shader itself. Same deal. There’s a limit to the size of each work group, but the minimum mandatory limit by the spec is 128 invocations. If the hardware does not support work groups of that size, they will be emulated. We also have a properties structure in Vulkan where you can check the recommended maximum size for work groups according to the driver.

Inside the body of your shader you get access to the typical built-ins for compute shaders, like the number of work groups, work group id, local invocation indices, etc.

If subgroups are supported, you can also use subgroup operations in them.

But mesh shaders also have to generate geometry. The type can not be chosen at runtime. When writing the shader you have to decide if you shader will output triangles, lines or points.

You must also indicate an upper limit in the number of vertices and primitives that each work group will generate.

Generally speaking, this will be a small-ish number. Several implementations will limit you to 256 vertices and primitives at most, which is the minimum required limit.

To handle big meshes with this, you’ll need several work groups and each work group will handle a piece of the whole mesh.

In each work group, the local invocations will cooperate to generate arrays of vertex and primitive data.

And here you can see how. After, perhaps, some initial processing not seen here, you have to indicate how many actual vertices and primitives the work group will emit, using the SetMeshOutputsEXT call.

That call goes first before filling any output array and you can reason about it as letting the implementation allocate the appropriate amount of memory for those output arrays.

Mesh shaders output indexed geometry, like when you use vertex and index buffers together.

You need to write data for each vertex to an output array, and primitive indices to another output array. Typically, each local invocation handles one position or a chunk of those arrays so they cooperate together to fill the whole thing. In the slide here you see a couple of those arrays, the most typical ones.

The built-in mesh vertices ext array contains per-vertex built-ins, like the vertex position. Indices used with this array go from 0 to ACTUAL_V-1

Then, the primitive triangle indices ext array contains, for each triangle, 3 uint indices into the previous vertices array. The primitive indices array itself is accessed using indices from 0 to ACTUAL_P-1. If there’s a second slide that I want you to remember, it’s this one. What we have here is an initial template to start writing any mesh shader.

There are a few more details we can add. For example, mesh shaders can also generate custom output attributes that will be interpolated and used as inputs to the fragment shader, just like vertex shaders can.

The difference is that in mesh shaders they form arrays. If we say nothing, like in the first output here, they’re considered per-vertex and have the same index range as the mesh vertices array.

A nice addition for mesh shaders is that you can use the perprimitiveEXT keyword to indicate output attributes are per-primitive and do not need to be interpolated, like the second output here. If you use these, you need to declare them with the same keyword in the fragment shader so the interfaces match. Indices to these arrays have the same range as the built-in primitive indices array.

And, of course, if there’s no input assembly we need to read data from somewhere. Typically from descriptors like storage buffers containing vertex and maybe index information, but we could also generate geometry procedurally.

Just to show you a few more details, these are the built-in arrays used for geometry. There are arrays of indices for triangles, lines or points depending on what the shader is supposed to generate.

The mesh vertices ext array that we saw before can contain a bit more data apart from the position (point size, clip and cull distances).

The third array was not used before. It’s the first time I mention it and as you can see it’s per-primitive instead of per-vertex. You can indicate a few things like the primitive id, the layer or viewport index, etc.

As I mentioned before, each work group can only emit a relatively small number of primitives and vertices, so for big models, several work groups are dispatched and each of them is in charge of generating and processing a meshlet, which are the colored patches you see here on the bunny.

It’s worth mentioning the subdivision of big meshes into meshlets is typically done when preparing assets for the application, meaning there shouldn’t be any runtime delay.

Mesh shading work groups are dispatched with specific commands inside a render pass, and they look similar to compute dispatches as you can see here, with a 3d size.

Let’s talk a bit about task shaders, which are optional. If present, they go before mesh shaders, and the dispatch commands do not control the number of mesh shader work groups, but the number of task shader work groups that are dispatched and each task shader work group will dispatch a number of mesh shader work groups.

Each task shader work group also follows the compute model, with a number of local invocations that cooperate together.

Each work group typically pre-processes geometry in some way and amplifies or reduces the amount of work that needs to be done. That’s why it’s called the amplification shader in DX12.

Once that pre-processing is done, each task work group decides, at runtime, how many mesh work groups to launch as children, forming a tree with two levels.

But we probably want them all to process different things, so the way to tell them apart from inside the mesh shader code is to use a payload: a piece of data that is generated in each task work group and passed down to its children as read-only data.

Combining the payload with existing built-ins allows you to process different things in each mesh shader work group.

This is done like this. On the left you have a task shader.

You can see it also works like a compute shader and invocations cooperate to pre-process stuff and generate the payload. The payload is a variable declared with the task payload shared ext qualifier. These payloads work like shared memory. That’s why they have “shared” in the qualifier.

Avoiding input assembly bottlenecks if they exist.

Pre-compute data and discard geometry in advance, saving processing power and memory bandwidth.

Geometry and tessellation can be applied freely, in more flexible ways.

The use of a model similar to compute shaders allows us to take advantage of the GPU processing power more effectively.

Many games use a compute pre-pass to process some data and calculate things that will be needed at draw time. With mesh shaders it may be possible to streamline this and integrate this processing into the mesh or task shaders.

You can also abuse mesh shading pipelines as compute pipelines with two levels if needed.

Mesh shading is problematic for tiling GPUs as you can imagine, for the same reasons tessellation and geometry shaders suck on those platforms.

Giving users freedom in this part of the pipeline may allow them to shoot themselves in the foot and end-up with sub-optimal performance. If not used properly, mesh shaders may be slower than classic pipelines.

The structure of vertex and index buffers needs to be declared explicitly in shaders, increasing coupling between CPU and GPU code, which is not nice.

Most importantly, right now it’s hard or even impossible to write a single mesh shader that performs great on all implementations.

Some vendor preferences are exposed as properties by the extension.

NVIDIA loves smaller work groups and using loops in code to generate geometry with each invocation (several vertices and triangles per invocation).

Threads on AMD can only generate at most one vertex and one primitive, so they’d love you to use bigger work groups and use the local invocation index to access per-vertex and per-primitive arrays.

As you can imagine, this probably results in different mesh shaders for each vendor.

### Andy Wingo

#### the sticky mark-bit algorithm

Good day, hackfolk!

# The Sticky Mark-Bit Algorithm

Also an intro to mark-sweep GC

7 Oct 2022 – Igalia

Andy Wingo

A funny post today; I gave an internal presentation at work recently describing the so-called "sticky mark bit" algorithm. I figured I might as well post it here, as a gift to you from your local garbage human.

## Automatic Memory Management

“Don’t free, the system will do it for you”

Eliminate a class of bugs: use-after-free

Relative to bare malloc/free, qualitative performance improvements

• cheap bump-pointer allocation
• cheap reclamation/recycling
• better locality

Continuum: bmalloc / tcmalloc grow towards GC

Before diving in though, we start with some broad context about automatic memory management. The term mostly means "garbage collection" these days, but really it describes a component of a system that provides fresh memory for new objects and automatically reclaims memory for objects that won't be needed in the program's future. This stands in contrast to manual memory management, which relies on the programmer to free their objects.

Of course, automatic memory management ensures some valuable system-wide properties, like lack of use-after-free vulnerabilities. But also by enlarging the scope of the memory management system to include full object lifetimes, we gain some potential speed benefits, for example eliminating any cost for free, in the case of e.g. a semi-space collector.

## Automatic Memory Management

Two strategies to determine live object graph

• Reference counting
• Tracing

What to do if you trace

• Mark, and then sweep or compact
• Evacuate

Tracing O(n) in live object count

I should mention that reference counting is a form of automatic memory management. It's not enough on its own; unreachable cycles in the object reference graph have to be detected either by a heap tracer or broken by weak references.

It used to be that we GC nerds made fun of reference counting as being an expensive, half-assed solution that didn't work very well, but there have been some fundamental advances in the state of the art in the last 10 years or so.

But this talk is more about the other kind of memory management, which involves periodically tracing the graph of objects in the heap. Generally speaking, as you trace you can do one of two things: mark the object, simply setting a bit indicating that an object is live, or evacuate the object to some other location. If you mark, you may choose to then compact by sliding all objects down to lower addresses, squeezing out any holes, or you might sweep all holes into a free list for use by further allocations.

## Mark-sweep GC (1/3)

freelist := []

allocate():
if freelist is empty: collect()
return freelist.pop()

collect():
mark()
sweep()
if freelist is empty: abort

Concretely, let's look closer at mark-sweep. Let's assume for the moment that all objects are the same size. Allocation pops fresh objects off a freelist, and collects if there is none. Collection does a mark and then a sweep, aborting if sweeping yielded no free objects.

## Mark-sweep GC (2/3)

mark():
worklist := []
for ref in get_roots():
if mark_one(ref):
while worklist is not empty:
for ref in trace(worklist.pop()):
if mark_one(ref):

sweep():
for ref in heap:
if marked(ref):
unmark_one(ref)
else
freelist.add(ref)

Going a bit deeper, here we have some basic implementations of mark and sweep. Marking starts with the roots: edges from outside the automatically-managed heap indicating a set of initial live objects. You might get these by maintaining a stack of objects that are currently in use. Then it traces references from these roots to other objects, until there are no more references to trace. It will visit each live object exactly once, and so is O(n) in the number of live objects.

Sweeping requires the ability to iterate the heap. With the precondition here that collect is only ever called with an empty freelist, it will clear the mark bit from each live object it sees, and otherwise add newly-freed objects to the global freelist. Sweep is O(n) in total heap size, but some optimizations can amortize this cost.

## Mark-sweep GC (3/3)

marked := 1

get_tag(ref):
return *(uintptr_t*)ref
set_tag(ref, tag):
*(uintptr_t*)ref = tag

marked(ref):
return (get_tag(ref) & 1) == marked
mark_one(ref):
if marked(ref): return false;
set_tag(ref, (get_tag(ref) & ~1) | marked)
return true
unmark_one(ref):
set_tag(ref, (get_tag(ref) ^ 1))

Finally, some details on how you might represent a mark bit. If a ref is a pointer, we could store the mark bit in the first word of the objects, as we do here. You can choose instead to store them in a side table, but it doesn't matter for today's example.

## Observations

Freelist implementation crucial to allocation speed

Non-contiguous allocation suboptimal for locality

World is stopped during collect(): “GC pause”

mark O(n) in live data, sweep O(n) in total heap size

Touches a lot of memory

The salient point is that these O(n) operations happen when the world is stopped. This can be noticeable, even taking seconds for the largest heap sizes. It sure would be nice to have the benefits of GC, but with lower pause times.

## Optimization: rotate mark bit

flip():
marked ^= 1

collect():
flip()
mark()
sweep()
if freelist is empty: abort

unmark_one(ref):
pass

Avoid touching mark bits for live data

Incidentally, before moving on, I should mention an optimization to mark bit representation: instead of clearing the mark bit for live objects during the sweep phase, we could just choose to flip our interpretation of what the mark bit means. This allows unmark_one to become a no-op.

## Reducing pause time

Parallel tracing: parallelize mark. Clear improvement, but speedup depends on object graph shape (e.g. linked lists).

Concurrent tracing: mark while your program is running. Tricky, and not always a win (“Retrofitting Parallelism onto OCaml”, ICFP 2020).

Partial tracing: mark only a subgraph. Divide space into regions, record inter-region links, collect one region only. Overhead to keep track of inter-region edges.

Now, let's revisit the pause time question. What can we do about it? In general there are three strategies.

## Generational GC

Partial tracing

Two spaces: nursery and oldgen

Allocations in nursery (usually)

Objects can be promoted/tenured from nursery to oldgen

Minor GC: just trace the nursery

Major GC: trace nursery and oldgen

“Objects tend to die young”

Overhead of old-to-new edges offset by less amortized time spent tracing

Today's talk is about partial tracing. The basic idea is that instead of tracing the whole graph, just trace a part of it, ideally a small part.

A simple and effective strategy for partitioning a heap into subgraphs is generational garbage collection. The idea is that objects tend to die young, and that therefore it can be profitable to focus attention on collecting objects that were allocated more recently. You therefore partition the heap graph into two parts, young and old, and you generally try to trace just the young generation.

The difficulty with partitioning the heap graph is that you need to maintain a set of inter-partition edges, and you do so by imposing overhead on the user program. But a generational partition minimizes this cost because you never do an only-old-generation collection, so you don't need to remember new-to-old edges, and mutations of old objects are less common than new.

## Generational GC

Usual implementation: semispace nursery and mark-compact oldgen

Tenuring via evacuation from nursery to oldgen

Excellent locality in nursery

Very cheap allocation (bump-pointer)

But... evacuation requires all incoming edges to an object to be updated to new location

Requires precise enumeration of all edges

Usually the generational partition is reflected in the address space: there is a nursery and it is in these pages and an oldgen in these other pages, and never the twain shall meet. To tenure an object is to actually move it from the nursery to the old generation. But moving objects requires that the collector be able to enumerate all incoming edges to that object, and then to have the collector update them, which can be a bit of a hassle.

## JavaScriptCore

No precise stack roots, neither in generated nor C++ code

Compare to V8’s Handle<> in C++, stack maps in generated code

Stack roots conservative: integers that happen to hold addresses of objects treated as object graph edges

(Cheaper implementation strategy, can eliminate some bugs)

Specifically in JavaScriptCore, the JavaScript engine of WebKit and the Safari browser, we have a problem. JavaScriptCore uses a technique known as "conservative root-finding": it just iterates over the words in a thread's stack to see if any of those words might reference an object on the heap. If they do, JSC conservatively assumes that it is indeed a reference, and keeps that object live.

Of course a given word on the stack could just be an integer which happens to be an object's address. In that case we would hold on to too much data, but that's not so terrible.

Conservative root-finding is again one of those things that GC nerds like to make fun of, but the pendulum seems to be swinging back its way; perhaps another article on that some other day.

## JavaScriptCore

Automatic memory management eliminates use-after-free...

...except when combined with manual memory management

Prevent type confusion due to reuse of memory for object of different shape

Type-segregated heaps

No evacuation: no generational GC?

The other thing about JSC is that it is constantly under attack by malicious web sites, and that any bug in it is a step towards hackers taking over your phone. Besides bugs inside JSC, there are bugs also in the objects exposed to JavaScript from the web UI. Although use-after-free bugs are impossible with a fully traceable object graph, references to and from DOM objects might not be traceable by the collector, instead referencing GC-managed objects by reference counting or weak references or even manual memory management. Bugs in these interfaces are a source of exploitable vulnerabilities.

In brief, there seems to be a decent case for trying to mitigate use-after-free bugs. Beyond the nuclear option of not freeing, one step we could take would be to avoid re-using memory between objects of different shapes. So you have a heap for objects with 3 fields, another objects with 4 fields, and so on.

But it would seem that this mitigation is at least somewhat incompatible with the usual strategy of generational collection, where we use a semi-space nursery. The nursery memory gets re-used all the time for all kinds of objects. So does that rule out generational collection?

## Sticky mark bit algorithm

collect(is_major=false):
if is_major: flip()
mark(is_major)
sweep()
if freelist is empty:
if is_major: abort
collect(true)

mark(is_major):
worklist := []
if not is_major:
worklist += remembered_set
remembered_set := []
...

Turns out, you can generationally partition a mark-sweep heap.

Recall that to visit each live object, you trace the heap, setting mark bits. To visit them all again, you have to clear the mark bit between traces. Our first collect implementation did so in sweep, via unmark_one; then with the optimization we switched to clear them all before the next trace in flip().

Here, then, the trick is that you just don't clear the mark bit between traces for a minor collection (tracing just the nursery). In that way all objects that were live at the previous collection are considered the old generation. Marking an object is tenuring, in-place.

There are just two tiny modifications to mark-sweep to implement sticky mark bit collection: one, flip the mark bit only on major collections; and two, include a remembered set in the roots for minor collections.

## Sticky mark bit algorithm

Mark bit from previous trace “sticky”: avoid flip for minor collections

Consequence: old objects not traced, as they are already marked

Old-to-young edges: the “remembered set”

Write barrier

write_field(object, offset, value):
remember(object)
object[offset] = value

The remembered set is maintained by instrumenting each write that the program makes with a little call out to code from the garbage collector. This code is the write barrier, and here we use it to add to the set of objects that might reference new objects. There are many ways to implement this write barrier but that's a topic for another day.

## JavaScriptCore

Concurrent GC: mark runs while JS program running; “riptide”; interaction with write barriers

Generational GC: in-place, non-moving GC generational via sticky mark bit algorithm

Alan Demers, “Combining generational and conservative garbage collection: framework and implementations”, POPL ’90

So returning to JavaScriptCore and the general techniques for reducing pause times, I can summarize to note that it does them all. It traces both in parallel and concurrently, and it tries to trace just newly-allocated objects using the sticky mark bit algorithm.

## Conclusions

A little-used algorithm

Motivation for JSC: conservative roots

Original motivation: conservative roots; write barrier enforced by OS-level page protections

Revived in “Sticky Immix”

Better than nothing, not quite as good as semi-space nursery

I find that people that are interested in generational GC go straight for the semispace nursery. There are some advantages to that approach: allocation is generally cheaper in a semispace than in a mark space, locality among new objects is better, locality after tenuring is better, and you have better access locality during a nursery collection.

But if for some reason you find yourself unable to enumerate all roots, you can still take advantage of generational collection via the sticky mark-bit algorithm. It's a simple change that improves performance, as long as you are able to insert write barriers on all heap object mutations.

The challenge with a sticky-mark-bit approach to generations is avoiding the O(n) sweep phase. There are a few strategies, but more on that another day perhaps.

And with that, presentation done. Until next time, happy hacking!

## October 19, 2022

### Eric Meyer

#### A Dashing Navbar Solution

#thoughts figure.standalone img { box-shadow: 0.25em 0.25em 0.67em #0008; margin-block-end: 0.5em; }

One of the many things Igalia does is maintain an official port of WebKit for embedded devices called WPE WebKit, and as you might expect, it has a web site.  The design had gotten a little stale since its launch a few years ago, so we asked Denis Radenković at 38one to come up with a new design, which we launched yesterday.  And I got to turn it into HTML and CSS!  Which was mostly normal stuff, margins and font sizing and all that, but also had some bits that called for some creativity.

There was one aspect of the design that I honestly thought wasn’t going to be possible, which was the way the “current page” link in the site navbar connected up to the rest of the page’s design via a dashed line. You can see it on the site, or in this animation about how each navlink was designed to appear.

I thought about using bog-standard dashed element borders: one on a filler element or pseudo-element that spanned the mostly-empty left side of the navbar, and one on each navlink before the current one, and then… I don’t know. I didn’t get that far, because I realized the dashed borders would almost certainly stutter, visually. What I mean is, the dashes wouldn’t follow a regular on-off pattern, which would look fairly broken.

My next thought was to figure out how to size a filler element or pseudo-element so that it was the exact width needed to reach the middle of the active navlink. Maybe there’s a clever way to do that with just HTML and CSS, but I couldn’t think of a way to do it that wasn’t either structurally nauseating or dependent on me writing clever JavaScript. So I tossed that idea, though I’ll return to it at the end of the post.

In the end, it was the interim development styles that eventually got me there. See, while building out other parts of the design, I just threw a dashed border on the bottom of the navbar, which (as is the way of borders) spanned its entire bottom edge. At some point I glanced up at it and thought, If only I could figure out a mask that would clip off the part of the border I don’t need. And slowly, I realized that with a little coding sleight-of-hand, I could almost exactly do that.

First, I removed the bottom border from the navbar and replaced it with a dashed linear gradient background, like this:

nav.global {
90deg,
currentColor 25%,
transparent 25% 75%,
currentColor 75%
);
background-position: 0 100%;
background-size: 8px 1px;
background-repeat: repeat-x;
}

(Okay, I actually used the background shorthand form of the above — linear-gradient(…) 0 100% / 8px 1px repeat-x — but the result is the same.)

I thought about using repeating-linear-gradient, which would have let me skip having to declare a background-repeat, but would have required me to size the gradient’s color stops using length units instead of percentages. I liked how, with the above, I could set the color stops with percentages, and then experiment with the size of the image using background-size. (7px 1px? 9px 1px? 8px 1.33px? Tried ’em all, and then some.) But that’s mostly a personal preference, not based in any obvious performance or clarity win, so if you want to try this with a repeating gradient, go for it.

But wait. What was the point of recreating the border’s built-in dash effect with a gradient background? Well, as I said, it made it easier to experiment with different sizes. It also allowed me to create a dash pattern that would be very consistent across browsers, which border-style dashes very much are not.

But primarily, I wanted the dashes to be in the background of the navbar because I could then add small solid-color gradients to the backgrounds of the navlinks that come after the active one.

nav.global li.currentPage ~ li {
background: linear-gradient(0deg, #FFF 2px, transparent 2px);
}

That selects all the following-sibling list items of the currentPage-classed list item, which here is the “Learn & Discover” link.

<ul class="about off">
<li class="currentPage"><a class="nav-link" href="…">Learn &amp; Discover</a></li>
<li><a class="btn cta" href="…">Get Started</a></li>
</ul>

Here’s the result, with the gradient set to be visible with a pinkish fill instead of #FFF so we can see it:

You can see how the pink hides the dashes. In the actual styles, because the #FFF is the same as the design’s page background, those list items’ background gradients are placed over top of (and thus hide) the navbar’s background dash.

I should point out that the color stop on the white solid gradient could have been at 1px rather than 2px and still worked, but I decided to give myself a little bit of extra coverage, just for peace of mind. I should also point out that I didn’t just fill the backgrounds of the list items with background-color: #FFF because the navbar has a semitransparent white background fill and a blurring background-filter, so the page content can be hazily visible through the navbar, as shown here.

The white line is slightly suboptimal in this situation, but it doesn’t really stand out and does add a tiny bit of visual accent to the navbar’s edge, so I was willing to go with it.

The next step was a little trickier: there needs to be a vertical dashed line “connecting” to the navbar’s dashed line at the horizontal center of the link, and also the navbar’s dashed line needs to be hidden, but only to the right of the vertical dashed line. I thought about doing two background gradients on the list item, one for the vertical dashed line and one for the white “mask”, but I realized that constraining the vertical dashed line to be half the height of the list item while also getting the repeating pattern correct was too much for my brain to figure out.

Instead, I positioned and sized a generated pseudo-element like this:

nav.global ul li.currentPage {
position: relative;
}
nav.global ul li.currentPage::before {
content: '';
position: absolute;
z-index: 1;
top: 50%;
bottom: 0;
left: 50%;
right: 0;
background:
currentColor 25%,
transparent 25% 75%,
currentColor 75%) 0 0 / 1px 8px repeat-y,
background-size: 1px 0.5em, auto;
}

That has the generated pseudo-element fill the bottom right quadrant of the active link’s list item, with a vertical dashed linear gradient running along its left edge and a solid white two-pixel gradient along its bottom edge, with the solid white below the vertical dash. Here it is, with the background pinkishly filled in to be visible behind the two gradients.

With that handled, the last step was to add the dash across the bottom of the current-page link and then mask the vertical line with a background color, like so:

nav.global ul li.currentPage a {
position: relative;
z-index: 2;
background: var(--dashH); /* converted the dashes to variables */
background-size: 0.5em 1px;
background-position: 50% 100%;
background-color: #FFF;
}

And that was it. Now, any of the navbar links can be flagged as the current page (with a class of currentPage on the enclosing list item) and will automatically link up with the dashes across the bottom of the navbar, with the remainder of that navbar dash hidden by the various solid white gradient background images.

So, it’s kind of a hack, and I wish there were a cleaner way to do this. And maybe there is! I pondered setting up a fine-grained grid and adding an SVG with a dashed path, or maybe a filler <span>, to join the current link to the line. That also feels like a hack, but maybe less of one. Or maybe not!

What I believe I want are the capabilities promised by the Anchored Positioning proposal. I think I could have done something like:

nav.global {position: relative;}
nav.global::before {
position: absolute;
left: 0;
bottom: 0;
right: var(--center);
}

…and then used that to run the dashed line over from the left side of the page to underneath the midpoint of the current link, and then up to the bottom edge of that link. Which would have done away with the need for the li ~ li overlaid-background hack, and nearly all the other hackery. I mean, I enjoy hackery as much as the next codemonaut, but I’m happier when the hacks are more elegant and minimal.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## October 18, 2022

### Julie Kim

#### A <webview> tag in Chromium

I often talk about a webview as a common term. For instance, when I worked on AGL project, we integrated Web Runtime in order to support a webview to native applications from AGL like Android WebView. However, I wasn’t aware of <webview> tag implementation for a web page in Chromium.

## <webview>

<webview> is a tag that Chromium Apps and chrome://… pages could use.

Chromium has some samples using <webview> for testing. I launched one of these examples with some modification to include <div> at the same level with a <webview>.

The green box is a <webview> and the blue box is <div>. A web page includes <webview> that has another web page inside it. Note that a <webview> tag is not a standard and it’s used for Chrome Apps and chrome:// pages.

## WebContents is

A reusable component that is the main class of the Content module. It’s easily embeddable to allow multiprocess rendering of HTML into a view.
This image shows the conceptual application layers in Chromium. When we open a web page in Chromium, you can imagine that Chromium creates a WebContents and it shows us the rendered web page.

## Current implementation

When <webview> exists in a web page, how does it work and where is it located on these layers from the image above?

Based on the current implementation, WebContents can have inner WebContents. Chromium implements a <webview> tag with a GuestVIew which is the templated base class for out-of-process frames in the chrome layer. A GuestView maintains an association between a guest WebContents and an owner WebContents.

For a <webview> interface at Javascript, Chromium has custom bindings. When a <webview> tag is attached, the API connected by the custom bindings is called from the renderer. The request for attaching it is sent to the browser through Mojo interface.

interface GuestViewHost {
// We have a RenderFrame with routing id of |embedder_local_frame_routing_id|.
// We want this local frame to be replaced with a remote frame that points
// to a GuestView. This message will attach the local frame to the guest.
// The GuestView is identified by its ID: |guest_instance_id|.
AttachToEmbedderFrame(
int32 embedder_local_frame_routing_id,
int32 element_instance_id,
int32 guest_instance_id,
mojo_base.mojom.DictionaryValue params) => ();
It triggers WebContentsImpl::AttachInnerWebContents() to attach the inner WebContents to the outer WebContents.

Chromium has also another type of GuestView internally. That is MimeHandlerViewGuest to handle mime types from tags such as <object> or <embed>. For instance, if <embed> tag is linked to pdf file, it’s also managed by a GuestView.

## MPArch

These days I’m working on MPArch, that
• enables one WebContents to host multiple (active / inactive, visible / invisible) pages. from MPArch document
For the detail for MPArch, you could refer to this blog post, ‘MPArch(Multiple Page Architecture) project in Chromium‘.

As a GuestView is implemented with a Multiple-WebContents model, it’s also considered for MPArch.
After working on Prerendering and fenced frames intensively, our team started to migrate GuestView tests in order to avoid direct access to WebContents and manage it based on RenderFrameHost.
Once we have some progress regarding GuestView, I hope we could share the status, as well.

We’re hiring. https://www.igalia.com/jobs/

## Motivation

According to the Google Chromium team, some years ago it turned out that the BFCache(Back/Forward cache), Portals, Guest Views, and Prerender faced the same fundamental problem of having to support multiple pages in the same tab, but were implemented with different architecture models. BFCache and pending deletion documents were implemented with a Single-WebContents model. On the other hand, Portals and Guest Views were implemented with a Multiple-WebContents model. After extensive discussions between the Chrome Security Architecture team, the BFCache team, the Blink Architecture team, the Portals team, and others, they concluded that BFCache and Portals should converge to a single shared architecture. This is crucial to minimize the complexity of writing browser features on top of and within and minimize the maintenance cost across the codebase. In addition to BFCache and Portals, Prerendering, FencedFrames, GuestViews, and Pending deletion frames were facing the same problem: how to support multiple pages in a given tab. The below table shows each feature’s requirements. If the Chrome team didn’t invest in the architecture, as you can see in the table, the six use cases need to solve the same problems individually, which will lead to browser features having to deal with multiple incoherent cases, then they were supposed that it will greatly increase the complexity across the codebase.

## Goal

The goal of multiple-page architecture is to enable projects including BFCache, Portals, Prerendering, and FencedFrames to minimize the complexity exposed to the browser features. Specifically, Multiple Page Architecture is to
• Enable one WebContents to host multiple(active/inactive, visible/invisible) pages
• Allow navigating WebContents to an existing page
For instance, when a page moves to BFCache, the ownership of the page is transferred from the main frame tree to BFCache. When a portal is activated, the ownership of the page is transferred to the main frame tree. This requires clearly splitting the tab and page / main-document concepts. This is closely related to and greatly benefits from Site Isolation’s effort to split page and document concepts.

## Design Overview

Let’s take a look at a high-level design briefly. To apply the MPArch to Chromium, we have to change the core navigation and existing features. First, we need to make the core navigation provide the ability to host multiple pages in the same WebContents, allow one page to embed another, and allow WebContents to navigate to existing pages.

In particular, this includes,
• Allow multiple FrameTrees to be created in the given WebContents in addition to the existing main one
• Add a new navigation type, which navigates a given FrameTree to an existing page. (For example, BFCache will use it for history navigation restoring the page from the cache. Prerendering will use it to instantly navigate to a prerendered page.)
• Modify RenderFrameHost to avoid knowing its current FrameTreeNode/FrameTree to allow the main RenderFrameHost to be transferred between different FrameTreeNodes.
And also, we need to fix existing features relying on the now-invalid assumptions like one tab only having one page at any given moment. The MPArch team evolves //content/public APIs to simplify these fixes and provide guidance and documentation to browser feature developers.

## Igalia Participation

As we’ve seen briefly so far, the MPArch project is a significant Chromium architecture change that requires a lot of modification of Chromium overall. The Igalia team began to participate in the prerendering task first in the MPArch project in April 2021. We had mainly worked on the prerendering contents restriction as well as the long tail of fixes for MPArch-related features. After finishing the prerendering and long tail fixes of MPArch-related features, we started to work on the FencedFrame which is a target of the MPArch projects since September 2021. The Igalia team has merged over 550 patches for MPArch since April 2021 so far.

## References

[1] Multiple Page Architecture
[2] Multi Page Architecture (BlinkOn 13)
[3] Prerendering
[4] Prerendering contents restriction
[5] Fixing long tail of features for MPArch

## October 03, 2022

### Claudio Saavedra

#### Mon 2022/Oct/03

The series on the WPE port by the WebKit team at Igalia grows, with several new articles that go deep into different areas of the engine:

These articles are an interesting read not only if you're working on WebKit, but also if you are curious on how a modern browser engine works and some of the moving parts beneath the surface. So go check them out!

On a related note, the WebKit team is always on the lookout for talent to join us. Experience with WebKit or browsers is not necessarily a must, as we know from experience that anyone with a strong C/C++ background and enough curiosity will be able to ramp up and start contributing soon enough. If these articles spark your curiosity, feel free to reach out to me to find out more or to apply directly!

### Andy Wingo

#### on "correct and efficient work-stealing for weak memory models"

Hello all, a quick post today. Inspired by Rust as a Language for High Performance GC Implementation by Yi Lin et al, a few months ago I had a look to see how the basic Rust concurrency facilities that they used were implemented.

One of the key components that Lin et al used was a Chase-Lev work-stealing double-ended queue (deque). The 2005 article Dynamic Circular Work-Stealing Deque by David Chase and Yossi Lev is a nice read defining this data structure. It's used when you have a single producer of values, but multiple threads competing to claim those values. This is useful when implementing per-CPU schedulers or work queues; each CPU pushes on any items that it has to its own deque, and pops them also, but when it runs out of work, it goes to see if it can steal work from other CPUs.

The 2013 paper Correct and Efficient Work-Stealing for Weak Memory Models by Nhat Min Lê et al updates the Chase-Lev paper by relaxing the concurrency primitives from the original big-hammer sequential-consistency operations used in the Chase-Lev paper to an appropriate mix of C11 relaxed, acquire/release, and sequentially-consistent operations. The paper therefore has a C11 translation of the original algorithm, and a proof of correctness. It's quite pleasant. Here's the a version in Rust's crossbeam crate, and here's the same thing in C.

I had been using this updated C11 Chase-Lev deque implementation for a while with no complaints in a parallel garbage collector. Each worker thread would keep a local unsynchronized work queue, which when it grew too large would donate half of its work to a per-worker Chase-Lev deque. Then if it ran out of work, it would go through all the workers, seeing if it could steal some work.

My use of the deque was thus limited to only the push and steal primitives, but not take (using the language of the Lê et al paper). take is like steal, except that it takes values from the producer end of the deque, and it can't run concurrently with push. In practice take only used by the the thread that also calls push. Cool.

Well I thought, you know, before a worker thread goes to steal from some other thread, it might as well see if it can do a cheap take on its own deque to see if it could take back some work that it had previously offloaded there. But here I ran into a bug. A brief internet search didn't turn up anything, so here we are to mention it.

Specifically, there is a bug in the Lê et al paper that is not in the Chase-Lev paper. The original paper is in Java, and the C11 version is in, well, C11. The issue is.... integer overflow! In brief, push will increment bottom, and steal increments top. take, on the other hand, can decrement bottom. It uses size_t to represent bottom. I think you see where this is going; if you take on an empty deque in the initial state, you create a situation that looks just like a deque with (size_t)-1 elements, causing garbage reads and all kinds of delightful behavior.

The funny thing is that I looked at the proof and I looked at the industrial applications of the deque and I thought well, I just have to transcribe the algorithm exactly and I'll be golden. But it just goes to show that proving one property of an algorithm doesn't necessarily imply that the algorithm is correct.

## October 02, 2022

### Manuel Rego

#### Igalia Web Platform team is hiring

Igalia’s Web Platform team is looking for new members. In this blog post, I’ll explore the kinds of work our team does, plus a little about what working at Igalia is like.

If you’re interested in the position, please apply using the following form: https://www.igalia.com/jobs/web_platform_engineer

## Working at Igalia

Igalia has more than two decades of experience working on free software, actively participating in open source communities, and contributing code to many upstream projects. We are now in a position that allows us to commit and own new features in many of those communities, while becoming a trusted partner that people know they can rely on.

That’s cool and all, but the best part is how we organize ourselves in a cooperative with a flat structure, where we all take part in the decisions that affect us, and (eventually) we all become co-owners of the company. Our model is based on strong values, and focused on building a new way to understand the IT industry and run a company. At Igalia, everyone gets paid the same, we all share responsibility for the company’s success and the rewards for our work.

At Igalia, we provide a remote-friendly, collaborative, and supportive environment, driven by principles of diversity and inclusion. There are currently over 120 Igalians from all over the world, and we have been a remote-friendly company for more than a decade.

Picture of a group of Igalians attending Igalia Summit in June 2022

## About the Web Platform team

If I had to describe the work we do in the Web Platform team in one sentence, I’d say something like: implementing web platform features in the different open source browser engines. That’s quite a high level definition, so let’s go into some more specific examples of what we’ve done this year (though this is not an exhaustive list):

• Shipped some features: :has() and inert in Chromium, CSS Containment and :focus-visible in WebKit.
• Got MathML ready to ship in Chromium (hopefully shipping by the end of the year), and improved its interoperability in other rendering engines.
• Worked on Interop 2022 in WebKit and Gecko: Cascade Layers, CSS Containment, Forms, etc.
• Landed a variety of accessibility features and improvements: accessible name calculation, AOM (accessible actions and attribute reflection), Orca screen reader maintenance, etc.
• Helped ship the Custom Highlight API in Chromium and implement the ::spelling-error and ::grammar-error pseudos.
• Added support for Custom Protocol Handlers in Chromium.
• Took over Firefox Reality development and released Wolvic, the open source WebXR browser.

“Implementing” the features is not the whole story though, as they usually involve a bunch of other work like proposals, specs, and tests. It’s really important to be able to navigate those situations, and seek consensus and agreement from the parties involved, so our team has a great deal of experience working on specs and participating in the relevant standards bodies.

Picture of some members of the Web Platform team during Igalia Summit in June 2022. From left to right: Brenna Brown, Rob Buis, Cathie Chen, Frédéric Wang, Valerie Young, Manuel Rego, Javier Fernández, Andreu Botella, Sergio Villar, Ziran Sun

We meet twice a year at Igalia Summit. Not everyone can always make it to A Coruña (Galicia, Spain) in person, but we look forward to everyone visiting HQ eventually.

We have around 20 team members with a variety of backgrounds and locations, from Australia to Brazil to San Francisco and beyond. We collaborate with many others in the company, including some well known folks like Brian Kardell and Eric Meyer, who are always spreading the word about the great work we do.

Just in case you’re wondering about the interview process, you won’t be asked to do a live coding exercise or anything like that. The interviews are intended to allow both you and Igalia to get to know each other better.

If the Web Platform team doesn’t sound right for you, it might be worth checking out some of our other open positions.

Feel free to contact me if you have any questions about the position or Igalia as a whole.

## September 29, 2022

### Clayton Craft

#### 8 cubic feet

This is the average amount of natural gas burned by my water heater to heat about 1 hot shower's worth of water. Yes, I am now able to monitor resource consumption at home. Watch out!

Sorry to the rest of the world who use the superior metric system, but units of gas are sold here in "therms", which is much easier to convert to feet³ than to meters³...

Anyways, I recently added a few improvements to my home to help monitor energy and water consumption. Thank you to the companies / municipalities that service my water and natural gas service for 1) installing wireless meters and 2) leaving them unencrypted (more on that in a bit!) For measuring electricity, I sprung for an IoTaWatt, with split-core current transformers installed in/under the breaker panel. The whole setup is "managed" by Home Assistant.

Monitoring the utility wireless meters for water and natural gas consumption was surprisingly easy with an SDR (I used one from rtl-sdr). I've heard that in other countries, the messages these wireless meters broadcast in every direction is encrypted. At my location, for whatever reasons (or lack thereof), they are not. So not only can I easily receive usage info about the meters measuring my service, but I'm able to receive messages from quite a few other meters located around my home... at neighbors' houses, across the street, whatever. This particular water meter includes some bits to indicate whether it detects a "leak condition", where water is being drained away without interruption for some large number of hours. While trying to identify my meter in the swarm of meters all shouting around me, I noticed that some neighbor has a water leak. I had no way to identify who they were, since (luckily) the meters aren't broadcasting any obvious personally identifiable info. The "leak" went away after another day, before I was able to call the water provider. If that was a legitimate water leak that lasted many days/weeks, it would be really useful info to know if you were the one with the leak.

I haven't had this level of sensing set up for very long, however I can't help but notice the real cost of doing certain things we do regularly. Boiling eggs on the stove takes about 0.5kWh of electricity, for example. 0.5kWh is about the amount of energy that 5 adult humans produce over 1 hour, assuming the Wikipedia estimate of 100 watts per adult. That may or may not sound like much, but it adds up. What I can't see (yet?) is what it took my electricity provider to generate 0.5kWh of electricity.

It's a little nuts to think about, and I'm not sure if this is going to heavily influence my/my family's behavior moving forward. I think it may... Because even after a small number of days it's really hard to ignore some usage patterns that are emerging. I also can't help but think if others would act differently if they could see a more accurate real cost of doing things.

## September 27, 2022

### Manuel Rego

#### TPAC 2022

A few weeks ago the W3C arranged a new edition of TPAC, this time it was an hybrid event. I had the chance to attend onsite in Vancouver (Canada).

This is the second conference for me this year after the Web Engines Hackfest, and the first one travelling away. It was awesome to meet so many people in real life again, some of them for the first time face to face. Also it was great spending some days with the other igalians attending (Andreu Botella, Brian Kardell and Valerie Young).

This is my third TPAC and I’ve experienced an evolution since my first one back in 2016 in Lisbon. At that time Igalia started to be known in some specific groups, mostly ARIA and CSS Working Groups, but many people didn’t know us at all. However nowadays lots of people know us, Igalia is mentioned every now and then in conversations, in presentations, etc. When someone asks where you work and you reply Igalia, they only have good words about the work we do. “You should hire Igalia to implement that” is a sentence that we heard several times during the whole week. It’s clear how we’ve grown into the ecosystem and how we can have an important impact on the web platform evolution. These days many companies understand the benefits of contributing to the web platform bringing new priorities to the table. And Igalia is ready to help pushing some features forward on the different standard bodies and browser engines.

Igalia sponsored TPAC in bronze level and also the inclusion fund.

The rest of this blog post is about some brief highlights related to my participation on the event.

## Web Developers Meetup

On Tuesday evening there was the developer meetup with 4 talks and some demos. The talks were very nice, recordings will be published soon but meanwhile you can check slides at the meetup website. It’s worth to highlight that Igalia is somehow involved in topics related to the four presentations, some of them are probably better known than others. But some examples so you can connect the dots:

Wolvic logo

On top of the talks, Igalia was showing a demo of Wolvic, the open source browser for WebXR devices. A bunch of people asked questions and got interested on it, and everyone liked the stickers with the Wolvic logo.

## CSS

Lots of amazing things have been happening to CSS recently like :has() and Container Queries. But these folks are always thinking in the next steps and during the CSSWG meeting there were discussions about some new proposals:

• CSS Anchoring: This is about positioning things like pop-ups relative to other elements. The goal is to simplify current situation so it can be done with some CSS lines instead of having to deal manually with some JavaScript code.
• CSS Toggles: This is a proposal to properly add the toggle concept in CSS, similar to what people have been doing with the “Checkbox Hack”.

Picture of the CSSWG meeting by Rossen Atanassov

On the breaks there were lots of informal discussions about the line-clamp property and how to unprefix it. There different proposals about how to address this issue, let’s hope there’s some sort of agreement and this can be implemented soon.

Shadow DOM is cool but also quite complex, and right now the accessibility story with Shadow DOM is mostly broken. The Accessibility Object Model (AOM) proposals aim to solve some of these issues.

At TPAC a bunch of folks interested on AOM arranged an informal meeting followed by a number of conversations around some of these topics. On one side, there’s the ARIA reflection proposal that is being worked on in different browsers, which will allow setting ARIA attributes via JavaScript. In addition, there were lots of discussions around Cross-root ARIA Delegation proposal, and its counterpart Cross-root ARIA Reflection, thinking about the best way to solve these kind of issues. We now have some kind of agreement on an initial proposal design that would need to be discussed with the different groups involved. Let’s hope we’re on the right path to find a proper solution to help making Shadow DOM more accessible.

## Maps on the Web

This is one of those problems that don’t have a proper solution on the web platform. Igalia has been in conversations around this for a while, for example you can check this talk by Brian Kardell at a W3C workshop. This year Bocoup together with the Natural Resources Canada did a research on the topic, defining a long term roadmap to add support for maps on the Web.

There were a few sessions at TPAC about maps where Igalia participated. Things are still in a kind of early stage, but some of the features described on the roadmap are very interesting and have a broader scope than only maps (for example pan and zoom would be very useful for other more generic use cases too). Let’s see how all this evolves in the future.

## Wrap-up

And that’s all from my side after a really great week in the nice city of Vancouver. As usual in this kind of event, the most valuable part was meeting people and having lots of informal conversations all over the place. The hybrid setup was really nice, but still those face-to-face conversations are something different to what you can do attending remotely.

See you all in the next editions!

## September 23, 2022

### André Almeida

#### futex2 at Linux Plumbers Conference 2022

In-person conferences are finally back! After two years of remote conferences, the kernel development community got together in Dublin, Ireland, to discuss current problems that need collaboration to be solved. As in the past editions, I took the opportunity to discuss about futex2, a project I’m deeply involved in. futex2 is a project to solve issues found in the current futex interface. This years’ session was about NUMA awaress of futex.

## September 21, 2022

#### What's new for RISC-V in LLVM 15

LLVM 15.0.0 was released around about two weeks ago now, and I wanted to highlight some of RISC-V specific changes or improvements that were introduced while going into a little more detail than I was able to in the release notes.

In case you're not familiar with LLVM's release schedule, it's worth noting that there are two major LLVM releases a year (i.e. one roughly every 6 months) and these are timed releases as opposed to being cut when a pre-agreed set of feature targets have been met. We're very fortunate to benefit from an active and growing set of contributors working on RISC-V support in LLVM projects, who are responsible for the work I describe below - thank you! I coordinate biweekly sync-up calls for RISC-V LLVM contributors, so if you're working in this area please consider dropping in.

Linker relaxation is a mechanism for allowing the linker to optimise code sequences at link time. A code sequence to jump to a symbol might conservatively take two instructions, but once the target address is known at link-time it might be small enough to fit in the immediate of a single instruction, meaning the other can be deleted. Because a linker performing relaxation may delete bytes (rather than just patching them), offsets including those for jumps within a function may be changed. To allow this to happen without breaking program semantics, even local branches that might typically be resolved by the assembler must be emitted as a relocation when linker relaxation is enabled. See the description in the RISC-V psABI or Palmer Dabbelt's blog post on linker relaxation for more background.

Although LLVM has supported codegen for linker relaxation for a long time, LLD (the LLVM linker) has until now lacked support for processing these relaxations. Relaxation is primarily an optimisation, but processing of R_RISCV_ALIGN (the alignment relocation) is necessary for correctness when linker relaxation is enabled, meaning it's not possible to link such object files correctly without at least some minimal support. Fangrui Song implemented support for R_RISCV_ALIGN/R_RISCV_CALL/R_RISCV_CALL_PLT/R_RISCV_TPREL_* relocations in LLVM 15 and wrote up a blog post with more implementation details, which is a major step in bringing us to parity with the GCC/binutils toolchain.

## Optimisations

As with any release, there's been a large number of codegen improvements, both target-independent and target-dependent. One addition to highlight in the RISC-V backend is the new RISCVCodeGenPrepare pass. This is the latest piece of a long-running campaign (largely led by Craig Topper) to improve code generation related to sign/zero extensions on RV64. CodeGenPrepare is a target-independent pass that performs some late-stage transformations to the input ahead of lowering to SelectionDAG. The RISC-V specific version looks for opportunities to convert zero-extension to i64 with a sign-extension (which is cheaper).

Another new pass that may be of interest is RISCVMakeCompressible (contributed by Lewis Revill and Craig Blackmore). Rather than trying to improve generated code performance, this is solely focused on reducing code size, and may increase the static instruction count in order to do so (which is why it's currently only enabled at the -Oz optimisation level). It looks for cases where an instruction has been selected which can't be represented by one of the compressed (16-bit as opposed to 32-bit wide) instruction forms. For instance due to the register not being one of the registers addressable from the compressed instruction, or the offset being out of range). It will then look for opportunities to transform the input to make the instructions compressible. Grabbing two examples from the header comment of the pass:

; 'zero' register not addressable in compressed store.
=>   li a1, 0
sw zero, 0(a0)   =>   c.sw a1, 0(a0)
sw zero, 8(a0)   =>   c.sw a1, 8(a0)
sw zero, 4(a0)   =>   c.sw a1, 4(a0)
sw zero, 24(a0)  =>   c.sw a1, 24(a0)


and

; compressed stores support limited offsets
lui a2, 983065     =>   lui a2, 983065
sw  a1, -236(a2)   =>   c.sw  a1, 20(a3)
sw  a1, -240(a2)   =>   c.sw  a1, 16(a3)
sw  a1, -244(a2)   =>   c.sw  a1, 12(a3)
sw  a1, -248(a2)   =>   c.sw  a1, 8(a3)
sw  a1, -252(a2)   =>   c.sw  a1, 4(a3)
sw  a0, -256(a2)   =>   c.sw  a0, 0(a3)


There's a whole range of other backend codegen improvements, including additions to existing RISC-V specific passes but unfortunately it's not feasible to enumerate them all.

One improvement to note from the Clang frontend is that the C intrinsics for the RISC-V Vector extension are now lazily generated, avoiding the need to parse a huge pre-generated header file and improving compile times.

## Support for new instruction set extensions

A batch of new instruction set extensions were ratified at the end of last year (see also the recently ratified extension list. LLVM 14 already featured a number of these (with the vector and ratified bit manipulation extensions no longer being marked as experimental). In LLVM 15 we were able to fill in some of the gaps, adding support for additional ratified extensions as well as some new experimental extensions.

In particular:

• Assembler and disassembler support for the Zdinx, Zfinx, Zhinx, and Zhinxmin extensions. Cores that implement these extensions store double/single/half precision floating point values in the integer register file (GPRs) as opposed to having a separate floating-point register file (FPRs).
• The instructions defined in the conventional floating point extensions are defined to instead operate on the general purpose registers, and instructions that become redundant (namely those that involve moving values from FPRs to GPRs) are removed.
• Cores might implement these extensions rather than the conventional floating-point in order to reduce the amount of architectural state that is needed, reducing area and context-switch cost. The downside is of course that register pressure for the GPRs will be increased.
• Codegen for these extensions is not yet supported (i.e. the extensions are only supported for assembly input or inline assembly). A patch to provide this support is under review though.
• Assembler and disassembler support for the Zicbom, Zicbop, and Zicboz extensions. These cache management operation (CMO) extensions add new instructions for invalidating, cleaning, and flushing cache blocks (Zicbom), zeroing cache blocks (Zicboz), and prefetching cache blocks (Zicbop).
• These operations aren't currently exposed via C intrinsics, but these will be added once the appropriate naming has been agreed.
• One of the questions raised during implementation was about the preferred textual format for the operands. Specifically, whether it should be e.g. cbo.clean (a0)/cbo.clean 0(a0) to match the format used for other memory operations, or cbo.clean a0 as was used in an early binutils patch. We were able to agree between the CMO working group, LLVM, and GCC developers on the former approach.
• Assembler, disassembler, and codegen support for the Zmmul extension. This extension is just a subset of the 'M' extension providing just the multiplication instructions without the division instructions.
• Assembler and disassembler support for the additional CSRs (control and status registers) and instructions introduced by the hypervisor and Svinval additions to the privileged architecture specification. Svinval provides fine-grained address-translation cache invalidation and fencing, while the hypervisor extension provides support for efficiently virtualising the supervisor-level architecture (used to implement KVM for RISC-V).
• Assembler and disassembler support for the Zihintpause extension. This adds the pause instruction intended for use as a hint within spin-wait loops.
• Zihintpause was actually the first extension to go through RISC-V International's fast-track architecture extension process back in early 2021. We were clearly slow to add it to LLVM, but are trying to keep a closer eye on ratified extensions going forwards.
• Support was added for the not yet ratified Zvfh extension, providing support for half precision floating point values in RISC-V vectors.
• Unlike the extensions listed above, support for Zvfh is experimental. This is a status we use within the RISC-V backend for extensions that are not yet ratified and may change from release to release with no guarantees on backwards compatibility. Enabling support for such extensions requires passing -menable-experimental-extensions to Clang and specifying the extension's version when listing it in the -march string.

It's not present in LLVM 15, but LLVM 16 onwards will feature a user guide for the RISC-V target summarising the level of support for each extension (huge thanks to Philip Reames for kicking off this effort).

## Other changes

In case I haven't said it enough times, there's far more interesting changes than I could reasonably cover. Apologies if I've missed your favourite new feature or improvement. In particular, I've said relatively little about RISC-V Vector support. There's been a long series of improvements and correctness fixes in the LLVM 15 development window, after RVV was made non-experimental in LLVM 14 and there's much more to come in LLVM 16 (e.g. scalable vectorisation becoming enabled by default).

## September 20, 2022

### Danylo Piliaiev

It is a major milestone for a driver that is created without any hardware documentation.

The last major roadblocks were VK_KHR_dynamic_rendering and, to a much lesser extent, VK_EXT_inline_uniform_block. Huge props to Connor Abbott for implementing them both!

VK_KHR_dynamic_rendering was an especially nasty extension to implement on tiling GPUs because dynamic rendering allows splitting a render pass between several command buffers.

For desktop GPUs there are no issues with this. They could just record and execute commands in the same order they are submitted without any additional post-processing. Desktop GPUs don’t have render passes internally, they are just a sequence of commands for them.

On the other hand, tiling GPUs have the internal concept of a render pass: they do binning of the whole render pass geometry first, load part of the framebuffer into the tile memory, execute all render pass commands, store framebuffer contents into the main memory, then repeat load_framebufer -> execute_renderpass -> store_framebuffer for all tiles. In Turnip the required glue code is created at the end of a render pass, while the whole render pass contents (when the render pass is split across several command buffers) are known only at the submit time. Therefore we have to stitch the final render pass right there.

## What’s next?

Implementing Vulkan 1.3 was necessary to support the latest DXVK (Direct3D 9-11 translation layer). VK_KHR_dynamic_rendering itself was also necessary for the latest VKD3D (Direct3D 12 translation layer).

For now my plan is:

• Continue implementing new extensions for DXVK, VKD3D, and Zink as they come out.
• Focus more on performance.
• Improvements to driver debug tooling so it works better with internal and external debugging utilities.

## Introduction: an immutable OS

The Steam Deck runs SteamOS, a single-user operating system based on Arch Linux. Although derived from a standard package-based distro, the OS in the Steam Deck is immutable and system updates replace the contents of the root filesystem atomically instead of using the package manager.

An immutable OS makes the system more stable and its updates less error-prone, but users cannot install additional packages to add more software. This is not a problem for most users since they are only going to run Steam and its games (which are stored in the home partition). Nevertheless, the OS also has a desktop mode which provides a standard Linux desktop experience, and here it makes sense to be able to install more software.

How to do that though? It is possible for the user to become root, make the root filesytem read-write and install additional software there, but any changes will be gone after the next OS update. Modifying the rootfs can also be dangerous if the user is not careful.

The simplest and safest way to install additional software is with Flatpak, and that’s the method recommended in the Steam Deck Desktop FAQ. Flatpak is already installed and integrated in the system via the Discover app so I won’t go into more details here.

However, while Flatpak works great for desktop applications not every piece of software is currently available, and Flatpak is also not designed for other types of programs like system services or command-line tools.

Fortunately there are several ways to add software to the Steam Deck without touching the root filesystem, each one with different pros and cons. I will probably talk about some of them in the future, but in this post I’m going to focus on one that is already available in the system: systemd-sysext.

This is a tool included in recent versions of systemd and it is designed to add additional files (in the form of system extensions) to an otherwise immutable root filesystem. Each one of these extensions contains a set of files. When extensions are enabled (aka “merged”) those files will appear on the root filesystem using overlayfs. From then on the user can open and run them normally as if they had been installed with a package manager. Merged extensions are seamlessly integrated with the rest of the OS.

Since extensions are just collections of files they can be used to add new applications but also other things like system services, development tools, language packs, etc.

## Creating an extension: yakuake

I’m using yakuake as an example for this tutorial since the extension is very easy to create, it is an application that some users are demanding and is not easy to distribute with Flatpak.

So let’s create a yakuake extension. Here are the steps:

1) Create a directory and unpack the files there:

$mkdir yakuake$ wget https://steamdeck-packages.steamos.cloud/archlinux-mirror/extra/os/x86_64/yakuake-21.12.1-1-x86_64.pkg.tar.zst
$tar -C yakuake -xaf yakuake-*.tar.zst usr  2) Create a file called extension-release.NAME under usr/lib/extension-release.d with the fields ID and VERSION_ID taken from the Steam Deck’s /etc/os-release file. $ mkdir -p yakuake/usr/lib/extension-release.d/
$echo ID=steamos > yakuake/usr/lib/extension-release.d/extension-release.yakuake$ echo VERSION_ID=3.3.1 >> yakuake/usr/lib/extension-release.d/extension-release.yakuake


3) Create an image file with the contents of the extension:

$mksquashfs yakuake yakuake.raw  That’s it! The extension is ready. A couple of important things: image files must have the .raw suffix and, despite the name, they can contain any filesystem that the OS can mount. In this example I used SquashFS but other alternatives like EroFS or ext4 are equally valid. NOTE: systemd-sysext can also use extensions from plain directories (i.e skipping the mksquashfs part). Unfortunately we cannot use them in our case because overlayfs does not work with the casefold feature that is enabled on the Steam Deck. ## Using the extension Once the extension is created you simply need to copy it to a place where systemd-systext can find it. There are several places where they can be installed (see the manual for a list) but due to the Deck’s partition layout and the potentially large size of some extensions it probably makes more sense to store them in the home partition and create a link from one of the supported locations (/var/lib/extensions in this example): (deck@steamdeck ~)$ mkdir extensions
(deck@steamdeck ~)$scp user@host:/path/to/yakuake.raw extensions/ (deck@steamdeck ~)$ sudo ln -s $PWD/extensions /var/lib/extensions  Once the extension is installed in that directory you only need to enable and start systemd-sysext: (deck@steamdeck ~)$ sudo systemctl enable systemd-sysext
(deck@steamdeck ~)$sudo systemctl start systemd-sysext  After this, if everything went fine you should be able to see (and run) /usr/bin/yakuake. The files should remain there from now on, also if you reboot the device. You can see what extensions are enabled with this command: $ systemd-sysext status
HIERARCHY EXTENSIONS SINCE
/opt      none       -
/usr      yakuake    Tue 2022-09-13 18:21:53 CEST


If you add or remove extensions from the directory then a simple “systemd-sysext refresh” is enough to apply the changes.

Unfortunately, and unlike distro packages, extensions don’t have any kind of post-installation hooks or triggers, so in the case of Yakuake you probably won’t see an entry in the KDE application menu immediately after enabling the extension. You can solve that by running kbuildsycoca5 once from the command line.

## Limitations and caveats

Using systemd extensions is generally very easy but there are some things that you need to take into account:

1. Using extensions is easy (you put them in the directory and voilà!). However, creating extensions is not necessarily always easy. To begin with, any libraries, files, etc., that your extensions may need should be either present in the root filesystem or provided by the extension itself. You may need to combine files from different sources or packages into a single extension, or compile them yourself.
2. In particular, if the extension contains binaries they should probably come from the Steam Deck repository or they should be built to work with those packages. If you need to build your own binaries then having a SteamOS virtual machine can be handy. There you can install all development files and also test that everything works as expected. One could also create a Steam Deck SDK extension with all the necessary files to develop directly on the Deck
3. Extensions are not distribution packages, they don’t have dependency information and therefore they should be self-contained. They also lack triggers and other features available in packages. For desktop applications I still recommend using a system like Flatpak when possible.
4. Extensions are tied to a particular version of the OS and, as explained above, the ID and VERSION_ID of each extension must match the values from /etc/os-release. If the fields don’t match then the extension will be ignored. This is to be expected because there’s no guarantee that a particular extension is going to work with a different version of the OS. This can happen after a system update. In the best case one simply needs to update the extension’s VERSION_ID, but in some cases it might be necessary to create the extension again with different/updated files.
5. Extensions only install files in /usr and /opt. Any other file in the image will be ignored. This can be a problem if a particular piece of software needs files in other directories.
6. When extensions are enabled the /usr and /opt directories become read-only because they are now part of an overlayfs. They will remain read-only even if you run steamos-readonly disable !!. If you really want to make the rootfs read-write you need to disable the extensions (systemd-sysext unmerge) first.
7. Unlike Flatpak or Podman (including toolbox / distrobox), this is (by design) not meant to isolate the contents of the extension from the rest of the system, so you should be careful with what you’re installing. On the other hand, this lack of isolation makes systemd-sysext better suited to some use cases than those container-based systems.

## Conclusion

systemd extensions are an easy way to add software (or data files) to the immutable OS of the Steam Deck in a way that is seamlessly integrated with the rest of the system. Creating them can be more or less easy depending on the case, but using them is extremely simple. Extensions are not packages, and systemd-sysext is not a package manager or a general-purpose tool to solve all problems, but if you are aware of its limitations it can be a practical tool. It is also possible to share extensions with other users, but here the usual warning against installing binaries from untrusted sources applies. Use with caution, and enjoy!

## September 12, 2022

### Eric Meyer

#### Nuclear Targeted Footnotes

One of the more interesting design challenges of The Effects of Nuclear Weapons was the fact that, like many technical texts, it has footnotes.  Not a huge number, and in fact one chapter has none at all, but they couldn’t be ignored.  And I didn’t want them to be inline between paragraphs or stuck into the middle of the text.

This was actually a case where Chris and I decided to depart a bit from the print layout, because in print a chapter has many pages, but online it has a single page.  So we turned the footnotes into endnotes, and collected them all near the end of each chapter.

Originally I had thought about putting footnotes off to one side in desktop views, such as in the right-hand grid gutter.  After playing with some rough prototypes, I realized this wasn’t going to go the way I wanted it to, and would likely make life difficult in a variety of display sizes between the “big desktop monitor” and “mobile device” realms.  I don’t know, maybe I gave up too easily, but Chris and I had already decided that endnotes were an acceptable adaptation and I decided to roll with that.

So here’s how the footnotes work.  First off, in the main-body text, a footnote marker is wrapped in a <sup> element and is a link that points at a named anchor in the endnotes. (I may go back and replace all the superscript elements with styled <mark> elements, but for now, they’re superscript elements.)  Here’s an example from the beginning of Chapter I, which also has a cross-reference link in it, classed as such even though we don’t actually style them any differently than other links.

This is true for a conventional “high explosive,” such as TNT, as well as for a nuclear (or atomic) explosion,<sup><a href="#fnote01">1</a></sup> although the energy is produced in quite different ways (<a href="#§1.11" class="xref">§ 1.11</a>).

Then, down near the end of the document, there’s a section that contains an ordered list.  Inside that list are the endnotes, which are in part marked up like this:

<li id="fnote01"><sup>1</sup> The terms “nuclear” and atomic” may be used interchangeably so far as weapons, explosions, and energy are concerned, but “nuclear” is preferred for the reason given in <a href="#§1.11" class="xref">§ 1.11</a>.

The list item markers are switched off with CSS, and superscripted numbers stand in their place.  I do it that way because the footnote numbers are important to the content, but also have specific presentation demands that are difficult  —  nay, impossible — to pull off with normal markers, like raising them superscript-style. (List markers are only affected by a very limited set of properties.)

In order to get the footnote text to align along the start (left) edge of their content and have the numbers hang off the side, I elected to use the old negative-text-indent-positive-padding trick:

.endnotes li {
text-indent: -0.75em;
}

That works great as long as there are never any double-digit footnote numbers, which was indeed the case… until Chapter VIII.  Dang it.

So, for any footnote number above 9, I needed a different set of values for the indent-padding trick, and I didn’t feel like adding in a bunch of greater-than-nine classes. Following-sibling combinator to the rescue!

.endnotes li:nth-of-type(9) ~ li {
margin-inline-start: -0.33em;
text-indent: -1.1em;
}

The extra negative start margin is necessary solely to get the text in the list items to align horizontally, though unnecessary if you don’t care about that sort of thing.

Okay, so the endnotes looked right when seen in their list, but I needed a way to get back to the referring paragraph after reading a footnote.  Thus, some “backjump” links got added to each footnote, pointing back to the paragraph that referred to them.

<span class="backjump">[ref. <a href="#§1.01">§ 1.01</a>]</span>

With that, a reader can click/tap a footnote number to jump to the corresponding footnote, then click/tap the reference link to get back to where they started.  Which is fine, as far as it goes, but that idea of having footnotes appear in context hadn’t left me.  I decided I’d make them happen, one way or another.

(Throughout all this, I wished more than once the HTML 3.0 proposal for <fn> had gone somewhere other than the dustbin of history and the industry’s collective memory hole.  Ah, well.)

I was thinking I’d need some kind of JavaScript thing to swap element nodes around when it occurred to me that clicking a footnote number would make the corresponding footnote list item a target, and if an element is a target, it can be styled using the :target pseudo-class.  Making it appear in context could be a simple matter of positioning it in the viewport, rather than with relation to the document.  And so:

.endnotes li:target {
position: fixed;
bottom: 0;
margin-inline: -2em 0;
border-top: 1px solid;
background: #FFF;
box-shadow: 0 0 3em 3em #FFF;
max-width: 45em;
}

That is to say, when an endnote list item is targeted, it’s fixedly positioned against the bottom of the viewport and given some padding and background and a top border and a box shadow, so it has a bit of a halo above it that sets it apart from the content it’s overlaying.  It actually looks pretty sweet, if I do say so myself, and allows the reader to see footnotes without having to jump back and forth on the page.  Now all I needed was a way to make the footnote go away.

Again I thought about going the JavaScript route, but I’m trying to keep to the Web’s slower pace layers as much as possible in this project for maximum compatibility over time and technology.  Thus, every footnote gets a “close this” link right after the backjump link, marked up like this:

<a href="#fnclosed" class="close">X</a></li>

(I realize that probably looks a little weird, but hang in there and hopefully I can clear it up in the next few paragraphs.)

So every footnote ends with two links, one to jump to the paragraph (or heading) that referred to it, which is unnecessary when the footnote has popped up due to user interaction; and then, one to make the footnote go away, which is unnecessary when looking at the list of footnotes at the end of the chapter.  It was time to juggle display and visibility values to make each appear only when necessary.

.endnotes li .close {
display: none;
visibility: hidden;
}
.endnotes li:target .close {
display: block;
visibility: visible;
}
.endnotes li:target .backjump {
display: none;
visibility: hidden;
}

Thus, the “close this” links are hidden by default, and revealed when the list item is targeted and thus pops up.  By contrast, the backjump links are shown by default, and hidden when the list item is targeted.

As it now stands, this approach has some upsides and some downsides.  One upside is that, since a URL with an identifier fragment is distinct from the URL of the page itself, you can dismiss a popped-up footnote with the browser’s Back button.  On kind of the same hand, though, one downside is that since a URL with an identifier fragment is distinct from the URL of the page itself, if you consistently use the “close this” link to dismiss a popped-up footnote, the browser history gets cluttered with the opened and closed states of various footnotes.

This is bad because you can get partway through a chapter, look at a few footnotes, and then decide you want to go back one page by hitting the Back button, at which point you discover have to go back through all those footnote states in the history before you actually go back one page.

I feel like this is a thing I can (probably should) address by layering progressively-enhancing JavaScript over top of all this, but I’m still not quite sure how best to go about it.  Should I add event handlers and such so the fragment-identifier stuff is suppressed and the URL never actually changes?  Should I add listeners that will silently rewrite the browser history as needed to avoid this?  Ya got me.  Suggestions or pointers to live examples of solutions to similar problems are welcomed in the comments below.

Less crucially, the way the footnote just appears and disappears bugs me a little, because it’s easy to miss if you aren’t looking in the right place.  My first thought was that it would be nice to have the footnote unfurl from the bottom of the page, but it’s basically impossible (so far as I can tell) to animate the height of an element from 0 to auto.  You also can’t animate something like bottom: calc(-1 * calculated-height) to 0 because there is no CSS keyword (so far as I know) that returns the calculated height of an element.  And you can’t really animate from top: 100vh to bottom: 0 because animations are of a property’s values, not across properties.

I’m currently considering a quick animation from something like bottom: -50em to 0, going on the assumption that no footnote will ever be more than 50 em tall, regardless of the display environment.  But that means short footnotes will slide in later than tall footnotes, and probably appear to move faster.  Maybe that’s okay?  Maybe I should do more of a fade-and-scale-in thing instead, which will be visually consistent regardless of footnote size.  Or I could have them 3D-pivot up from the bottom edge of the viewport!  Or maybe this is another place to layer a little JS on top.

Or maybe I’ve overlooked something that will let me unfurl the way I first envisioned with just HTML and CSS, a clever new technique I’ve missed or an old solution I’ve forgotten.  As before, comments with suggestions are welcome.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## Summary

simple-reload provides a straight-forward (~30 lines of JS and zero server-side requirements) way of reloading a web page as it is iteratively developed or modified. Once activated, a page will be reloaded whenever it regains focus.

If you encounter any problems, please file an issue on the simple-reload GitHub repository.

Cons:

• Reload won't take place until the page is focused by the user, which requires manual interaction.
• This is significantly less burdensome if using focus follows pointer in your window manager.
• Reloads will occur even if there were no changes.
• With e.g. LiveReload, a reload only happens when the server indicates there has been a change. This may be a big advantage for stateful pages or pages with lots of forms.

Pros:

• Tiny, easy to modify implementation.
• This can be helpful to keep a fixed revision of a page in one tab to compare against.
• No server-side requirements.
• So works even from file://.
• Makes it easier if using with a remote server (e.g. no need to worry about exposing a port as for LiveReload).

## Code

<script type="module">
const enableByDefault = false;
// Firefox triggers blur/focus events when resizing, so we ignore a focus
// following a blur within 200ms (assumed to be generated by resizing rather
// than human interaction).
let blurTimeStamp = null;
function focusListener(ev) {
if (ev.timeStamp - blurTimeStamp >= 200) {
}
}
function blurListener(ev) {
if (blurTimeStamp === null) {
}
blurTimeStamp = ev.timeStamp;
}
function deactivate() {
window.removeEventListener("focus", focusListener);
window.removeEventListener("blur", blurListener);
document.title = document.title.replace(/^\u27F3 /, "");
window.addEventListener("dblclick", activate, { once: true });
}
function activate() {
}
if (enableByDefault || sessionStorage.getItem("simple-reload") == "activated") {
document.title = "\u27F3 " + document.title;
window.addEventListener("dblclick", deactivate, { once: true });
} else {
}
</script>


## Usage

Paste the above code into the <head> or <body> of a HTML file. You can then enable the reload behaviour by double-clicking on a page (double-click again to disable again). The title is prefixed with ⟳ while reload-on-focus is enabled. If you'd like reload-on-focus enabled by default, just flip the enableByDefault variable to true. You could either modify whatever you're using to emit HTML to include this code when in development mode, or configure your web server of choice to inject it for you.

The enabled/disabled state of the reload logic is scoped to the current tab, so will be maintained if navigating to different pages within the same domain in that tab.

In terms of licensing, the implementation is so straight-forward it hardly feels copyrightable. Please consider it public domain, or MIT if you're more comfortable with an explicit license.

## Implementation notes

• In case it's not clear, "blur" events mentioned above are events fired when an element is no longer in focus.
• As noted in the code comment, Firefox (under dwm in Linux at least) seems to trigger blur+focus events when resizing the window using a keybinding while Chrome doesn't. Being able to remove the logic to deal with this issue would be a good additional simplification.

Article changelog
• 2022-09-11: Removed incorrect statement about sessionStorage usage being redundant if enableByDefault = true (it does mean disabling reloading will persist).
• 2022-09-10: Initial publication date.

#### Muxup implementation notes

This article contains a few notes on various implementation decisions made when creating the Muxup website. They're intended primarily as a reference for myself, but some parts may be of wider interest. See about for more information about things like site structure.

## Site generation

I ended up writing my own script for generating the site pages from a tree of Markdown files. Zola seems like an excellent option, but as I had a very specific idea on how I wanted pages to be represented in source form and the format and structure of the output, writing my own was the easier option (see build.py). Plus, yak-shaving is fun.

I opted to use mistletoe for Markdown parsing. I found a few bugs in the traverse helper function when implementing some transformations on the generated AST, but upstream was very responsive about reviewing and merging my PRs. The main wart I've found is that parsing and rendering aren't fully separated, although this doesn't pose any real practical concern for my use case and will hopefully be fixed in the future. mistune also seemed promising, but has some conformance issues.

One goal was to keep everything possible in standard Markdown format. This means, for instance, avoiding custom frontmatter entries or link formats if the same information could be extracted from the file). Therefore:

• There is no title frontmatter entry - the title is extracted by grabbing the first level 1 heading from the Markdown AST (and erroring if it isn't present).
• All internal links are written as [foo](/relative/to/root/foo.md). The generator will error if the file can't be found, and will otherwise translate the target to refer to the appropriate permalink.
• This has the additional advantage that links are still usable if viewing the Markdown on GitHub, which can be handy if reviewing a previous revision of an article.
• Article changelogs are just a standard Markdown list under the "Article changelog" heading, which is checked and transformed at the Markdown AST level in order to produce the desired output (emitting the list using a <details> and <summary>).

All CSS was written through the usual mix of experimentation (see simple-reload for the page reloading solution I used to aid iterative development) and learning from the CSS used by other sites.

## Randomly generated title highlights

The main visual element throughout the site is the randomised, roughly drawn highlight used for the site name and article headings. This takes some inspiration from the RoughNotation library (see also the author's description of the algorithms used), but uses my own implementation that's much more tightly linked to my use case.

The core logic for drawing these highlights is based around drawing the SVG path:

• Determine the position and size of the text element to be highlight (keeping in mind it might be described by multiple rectangles if split across several lines).
• For each rectangle, draw a bezier curve starting from the middle of the left hand side through to the right hand side, with its two control points at between 20-40% and 40-80% of the width.
• Apply randomised offsets in the x and y directions for every point.
• The range of the randomised offsets should depend on length of the text. Generally speaking, a smaller offsets should be used for a shorter piece of text.

This logic is implemented in preparePath and reproduced below (with the knowledge the offset(delta) is a helper to return a random number between delta and -delta, hopefully it's clear how this relates to the logic described above):

function preparePath(hlInfo) {
const parentRect = hlInfo.svg.getBoundingClientRect();
const rects = hlInfo.hlEl.getClientRects();
let pathD = "";
for (const rect of rects) {
const x = rect.x - parentRect.x, y = rect.y - parentRect.y,
w = rect.width, h = rect.height;
const mid_y = y + h / 2;
let maxOff = w < 75 ? 3 : w < 300 ? 6 : 8;
const divergePoint = .2 + .2 * Math.random();
pathD = ${pathD} M${x+offset(maxOff)} ${mid_y+offset(maxOff)} C${x+w*divergePoint+offset(maxOff)} ${mid_y+offset(maxOff)},${x+2*w*divergePoint+offset(maxOff)} ${mid_y+offset(maxOff)}${x+w+offset(maxOff)} ${mid_y+offset(maxOff)}; } hlInfo.nextPathD = pathD; hlInfo.strokeWidth = 0.85*rects[0].height; }  I took some care to avoid forced layout/reflow by batching together reads and writes of the DOM into separate phases when drawing the initial set of highlights for the page, which is why this function is generates the path but doesn't modify the SVG directly. Separate logic adds handlers to links that are highlighted continuously redraw the highlight (I liked the visual effect). ## Minification and optimisation I primarily targeted the low-hanging fruit here, and avoided adding in too many dependencies (e.g. separate HTML and CSS minifiers) during the build. muxup.com is a very lightweight site - the main extravagance I've allowed myself is the use of a webfont (Nunito) but that only weighs in at ~36KiB (the variable width version converted to woff2 and subsetted using pyftsubset from fontTools). As the CSS and JS payload is so small, it's inlined into each page. The required JS for the article pages and the home page is assembled and then minified using terser. This reduces the size of the JS required for the page you're reading from 5077 bytes to 2620 bytes uncompressed (1450 bytes to 991 bytes if compressing the result with Brotli, though in practice the impact will be a bit different when compressing the JS together with the rest of the page + CSS). When first published, the page you are reading (including inlined CSS and JS) was ~27.7KiB uncompressed (7.7KiB Brotli compressed), which compares to 14.4KiB for its source Markdown (4.9KiB Brotli compressed). Each page contains an embedded stylesheet with a conservative approximation of the minimal CSS needed. The total amount of CSS is small enough that it's easy to manually split between the CSS that is common across the site, the CSS only needed for the home page, the CSS common to all articles, and then other CSS rules that may or may not be needed depending on the page content. For the latter case, CSS snippets to include are gated on matching a given string (e.g. <kbd for use of the <kbd> tag). For my particular use case, this is more straight forward and faster than e.g. relying on PurgeCSS as part of the build process. The final trick on the frontend is prefetching. Upon hovering on an internal link, it will be fetched (see the logic at the end of common.js for the implementation approach), meaning that in the common case any link you click will already have been loaded and cached. More complex approaches could be used to e.g. hook the mouse click event and directly update the DOM using the retrieved data. But this would require work to provide UI feedback during the load and the incremental benefit over prefetching to prime the cache seems small. ## Serving using Caddy I had a few goals in setting up Caddy to serve this site: • Enable new and shiny things like HTTP3 and serving Brotli compressed content. • Both are very widely supported on the browser side (and Brotli really isn't "new" any more), but require non-standard modules or custom builds on Nginx. • Redirect wwww.muxup.com/* URLs to muxup.com/*. • HTTP request methods other than GET or HEAD should return error 405. • Set appropriate Cache-Control headers in order to avoid unnecessary re-fetching content. Set shorter lifetimes for served .html and 308 redirects vs other assets. Leave 404 responses with no Cache-Control header. • Avoid serving the same content at multiple URLs (unless explicitly asked for) and don't expose the internal filenames of content served via a different canonical URL. Also, prefer URLs without a trailing slash, but ensure not to issue a redirect if the target file doesn't exist. This means (for example): • muxup.com/about/ should redirect to muxup.com/about • muxup.com/2022q3////muxup-implementation-notes should 404 or redirect. • muxup.com/about/./././ should 404 or redirect • muxup.com/index.html should 404. • muxup.com/index.html.br (referring to the precompressed brotli file) should 404. • muxup.com/non-existing-path/ should 404. • If there is a directory foo and a foo.html at the same level, serve foo.html for GET /foo (and redirect to it for GET /foo/). • Never try to serve */index.html or similar (except in the special case of GET /). Perhaps because my requirements were so specific, this turned out to be a little more involved than I expected. If seeking to understand the Caddyfile format and Caddy configuration in general, I'd strongly recommend getting a good understanding of the key ideas by reading Caddyfile concepts, understanding the order in which directives are handled by default and how you might control the execution order of directives using the route directive or use the handle directive to specify groups of directives in a mutually exclusive fashion based on different matchers. The Composing in the Caddyfile article provides a good discussion of these options). Ultimately, I wrote a quick test script for the desired properties and came up with the following Caddyfile that meets almost all goals: { servers { protocol { experimental_http3 } } email asb@asbradbury.org } (muxup_file_server) { file_server { index "" precompressed br disable_canonical_uris } } www.muxup.com { redir https://muxup.com{uri} 308 header Cache-Control "max-age=2592000, stale-while-revalidate=2592000" } muxup.com { root * /var/www/muxup.com/htdocs encode gzip log { output file /var/log/caddy/muxup.com.access.log } header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" vars short_cache_control "max-age=3600" vars long_cache_control "max-age=2592000, stale-while-revalidate=2592000" @method_isnt_GET_or_HEAD not method GET HEAD @path_is_suffixed_with_html_or_br path *.html *.html/ *.br *.br/ @path_or_html_suffixed_path_exists file {path}.html {path} @html_suffixed_path_exists file {path}.html @path_or_html_suffixed_path_doesnt_exist not file {path}.html {path} @path_is_root path / @path_has_trailing_slash path_regexp ^/(.*)/$

error 405
}
handle @path_is_suffixed_with_html_or_br {
error 404
}
handle @path_has_trailing_slash {
route {
uri strip_suffix /
redir @path_or_html_suffixed_path_exists {path} 308
error @path_or_html_suffixed_path_doesnt_exist 404
}
}
handle @path_is_root {
rewrite index.html
import muxup_file_server
}
handle @html_suffixed_path_exists {
rewrite {path}.html
import muxup_file_server
}
handle * {
import muxup_file_server
}
handle_errors {
respond "{err.status_code} {err.status_text}"
}
}


A few notes on the above:

• It isn't currently possible to match URLs with // due to the canonicalisation Caddy performs, but 2.6.0 including PR #4948 hopefully provides a solution. Hopefully to allow matching /./ too.
• HTTP3 should enabled by default in Caddy 2.6.0.
• Surprisingly, you need to explicitly opt in to enabling gzip compression (see this discussion with the author of Caddy about that choice).
• The combination of try_files and file_server provides a large chunk of the basics, but something like the above handlers is needed to get the precise desired behaviour for things like redirects, *.html and *.br etc.
• route is needed within handle @path_has_trailing_slash because the default execution order of directives has uri occurring some time after redir and error.
• Caddy doesn't support dynamically brotli compressing responses, so the precompressed br option of file_server is used to serve pre-compressed files (as prepared by the deploy script).

## Analytics

The simplest possible solution - don't have any.

Last but by no means least, is the randomly selected doodle at the bottom of each page. I select images from the Quick, Draw! dataset and export them to SVG to be randomly selected on each page load. A rather scrappy script contains logic to generate SVGs from the dataset's NDJSON format and contains a simple a Flask application that allows selecting desired images from randomly displayed batches from each dataset.

With examples such as O'Reilly's beautiful engravings of animals on their book covers there's a well established tradition of animal illustrations on technical content - and what better way to honour that tradition than with a hastily drawn doodle by a random person on the internet that spins when your mouse hovers over it?

Article changelog
• 2022-09-10: Initial publication date.

## September 02, 2022

### Ricardo García

Vulkan 1.3.226 was released yesterday and it finally includes the cross-vendor VK_EXT_mesh_shader extension. This has definitely been an important moment for me. As part of my job at Igalia and our collaboration with Valve, I had the chance to work reviewing this extension in depth and writing thousands of CTS tests for it. You’ll notice I’m listed as one of the extension contributors. Hopefully, the new tests will be released to the public soon as part of the open source VK-GL-CTS Khronos project.

During this multi-month journey I had the chance to work closely with several vendors working on adding support for this extension in multiple drivers, including NVIDIA (special shout-out to Christoph Kubisch, Patrick Mours, Pankaj Mistry and Piers Daniell among others), Intel (thanks Marcin Ślusarz for finding and reporting many test bugs) and, of course, Valve. Working for the latter, Timur Kristóf provided an implementation for RADV and reported to me dozens of bugs, test ideas, suggestions and improvements. Do not miss his blog post series about mesh shaders and how they’re implemented on RDNA2 hardware. Timur’s implementation will be used in your Linux system if you have a capable AMD GPU and, of course, the Steam Deck.

The extension has been developed with DX12 compatibility in mind. It’s possible to use mesh shading from Vulkan natively and it also allows future titles using DX12 mesh shading to be properly run on top of VKD3D-Proton and enjoyed on Linux, if possible, from day one. It’s hard to provide a summary of the added functionality and what mesh shaders are about in a short blog post like this one, so I’ll refer you to external documentation sources, starting by the Vulkan mesh shading post on the Khronos Blog. Both Timur and myself have submitted a couple of talks to XDC 2022 which have been accepted and will give you a primer on mesh shading as well as some more information on the RADV implementation. Do not miss the event at Minneapolis or enjoy it remotely while it’s being livestreamed in October.

## September 01, 2022

### Delan Azabani

#### Meet the CSS highlight pseudos

A year and a half ago, I was asked to help upstream a Chromium patch allowing authors to recolor spelling and grammar errors in CSS. At the time, I didn’t realise that this was part of a far more ambitious effort to reimagine spelling errors, grammar errors, text selections, and more as a coherent system that didn’t yet exist as such in any browser. That system is known as the highlight pseudos, and this post will focus on the design of said system and its consequences for authors.

This is the third part of a series (part one, part two) about Igalia’s work towards making the CSS highlight pseudos a reality.

article { --cr-highlight: #3584E4; --cr-highlight-aC0h: #3584E4C0; } article figure > img { max-width: 100%; } article figure > figcaption { max-width: 30rem; margin-left: auto; margin-right: auto; } article pre, article code { font-family: Inconsolata, monospace, monospace; } article aside, article blockquote { font-size: 0.75em; max-width: 30rem; } article aside { margin-left: 0; padding-left: 1rem; border-left: 3px double rebeccapurple; } article blockquote { margin-left: 3rem; } article blockquote:before { margin-left: -2rem; } ._spelling, ._grammar { text-decoration-thickness: /* iOS takes 0 literally */ 1px; text-decoration-skip-ink: none; } ._spelling { text-decoration: /* not a shorthand on iOS */ underline; text-decoration-style: wavy; text-decoration-color: red; } ._grammar { text-decoration: /* not a shorthand on iOS */ underline; text-decoration-style: wavy; text-decoration-color: green; } ._example { border: 2px dotted rebeccapurple; } ._example * + *, ._hpdemo * + * { margin-top: 0; } ._checker { position: relative; margin-left: auto; margin-right: auto; } ._checker:focus { outline: none; } ._checker::before { display: flex; align-items: center; justify-content: center; position: absolute; top: 0; bottom: 0; left: 0; right: 0; width: 100%; font-size: 7em; color: transparent; background: transparent; content: "▶"; } ._checker:not(:focus)::before { color: rebeccapurple; background: #66339940; } ._checker tbody th { text-align: left; } ._checker ._live::selection, ._checker ._live *::selection { color: currentColor; background: transparent; } ._checker:not(:focus) ._live > div { visibility: hidden; } ._checker:not([data-phase=done]):not(#specificity) ._live > div, ._checker:not([data-phase=done]):not(#specificity) ._live > div * { color: transparent; } ._checker:not([data-phase=done]):not(#specificity) ._live > div::selection, ._checker:not([data-phase=done]):not(#specificity) ._live > div *::selection { color: transparent; } ._checker:not([data-phase=done]):not(#specificity) ._live > div::highlight(checker), ._checker:not([data-phase=done]):not(#specificity) ._live > div *::highlight(checker), ._checker:not([data-phase=done]):not(#specificity) ._live > div::highlight(lower), ._checker:not([data-phase=done]):not(#specificity) ._live > div *::highlight(lower) { color: transparent; } ._checker ._live > div { width: 5em; } ._checker ._live > div { position: relative; line-height: 1; } ._checker ._live > div > span { position: absolute; margin: 0; padding-top: calc((1.5em - 1em) / 2); width: 5em; } /* ::highlight() [end-to-end test] = no, if the pseudo selector is broken and/or no active highlight = yes, if the pseudo selector works and highlight is active */ ._checker ._custom :nth-child(2) { color: transparent; background: transparent; } ._checker ._custom :nth-child(1)::highlight(checker) { color: transparent; } ._checker ._custom :nth-child(2)::highlight(checker) { color: CanvasText; background: Canvas; } /* ::highlight() [selector] = no, if the pseudo selector is unsupported = yes, if the pseudo selector is supported • highlight not active, only for selector list validity */ ._checker ._chps :nth-child(2) { color: transparent; } ._checker ._chps :nth-child(1), :not(*)::highlight(checker) { color: transparent; } ._checker ._chps :nth-child(2), :not(*)::highlight(checker) { color: CanvasText; } /* ::highlight() [API] = no, if the API is missing or broken = yes, if the API is present and working */ ._checker ._cha {} /* ::spelling-error [end-to-end test] = no, if the pseudo selector is broken and/or no active highlight = yes, if the pseudo selector works and highlight is active */ ._checker [spellcheck] :nth-child(2) { color: transparent; background: transparent; } ._checker [spellcheck] :nth-child(1)::spelling-error { color: transparent; } ._checker [spellcheck] :nth-child(2)::spelling-error { color: CanvasText; background: Canvas; } ._checker [spellcheck] *::spelling-error { text-decoration: none; } /* Highlight inheritance (::highlight) = no, if var() inherits from originating element = yes, if var() ignores originating element and uses fallback */ ._checker ._hih { color: transparent; background: transparent; } ._checker ._hih::highlight(checker) { --t: transparent; --x: CanvasText; --y: Canvas; } ._checker ._hih :nth-child(1)::highlight(checker) { color: var(--t, CanvasText); background: var(--t, Canvas); } ._checker ._hih :nth-child(2)::highlight(checker) { color: var(--x, transparent); background: var(--y, transparent); } /* Highlight inheritance (::selection) = no, if var() inherits from originating element = yes, if var() ignores originating element and uses fallback */ ._checker ._his { color: transparent; background: transparent; } ._checker ._his::selection { --t: transparent; --x: CanvasText; --y: Canvas; } ._checker ._his :nth-child(1)::selection { color: var(--t, CanvasText); background: var(--t, Canvas); } ._checker ._his :nth-child(2)::selection { color: var(--x, transparent); background: var(--y, transparent); } /* Highlight overlay painting = no, if currentColor takes color from originating element only = yes, if currentColor takes color from next active highlight • lower highlight “yes” is hidden by ‘-webkit-text-fill-color’ */ ._checker ._hop { color: transparent; background: transparent; } ._checker ._hop :nth-child(1) { color: CanvasText; } ._checker ._hop :nth-child(1)::highlight(lower) { color: transparent; } ._checker ._hop :nth-child(1)::highlight(checker) { color: currentColor; } ._checker ._hop :nth-child(2) { color: transparent; } ._checker ._hop :nth-child(2)::highlight(lower) { color: CanvasText; -webkit-text-fill-color: transparent; } ._checker ._hop :nth-child(2)::highlight(checker) { color: currentColor; -webkit-text-fill-color: currentColor; } ._table { font-size: 0.75em; } ._table td, ._table th { vertical-align: top; border: 1px solid black; } ._table td:not(._tight), ._table th:not(._tight) { padding: 0.5em; } ._tight picture, ._tight img { vertical-align: top; } ._compare * + *, ._tight * + *, ._gifs * + * { margin-top: 0; } ._compare { max-width: 100%; border: 1px solid rebeccapurple; } ._compare > div { max-width: 100%; position: relative; touch-action: pinch-zoom; --cut: 50%; } ._compare > div > * { vertical-align: top; max-width: 100%; } ._compare > div > :nth-child(1) { position: absolute; clip: rect(auto, auto, auto, var(--cut)); } ._compare > div > :nth-child(2) { position: absolute; width: var(--cut); height: 100%; border-right: 1px solid rebeccapurple; } ._compare > div > :nth-child(2):before { content: var(--left-label); color: rebeccapurple; font-size: 0.75em; position: absolute; right: 0.5em; } ._compare > div > :nth-child(2):after { content: var(--right-label); color: rebeccapurple; font-size: 0.75em; position: absolute; left: calc(100% + 0.5em); } ._sum td:first-of-type { padding-right: 1em; } ._gifs { position: relative; display: flex; flex-flow: column nowrap; } ._gifs > video { transition: opacity 0.125s linear; } ._gifs > button { transition: 0.125s linear; transition-property: color, background-color; } ._gifs._paused > video { opacity: 0.5; } ._gifs._paused > button { color: rebeccapurple; background: #66339940; } ._gifs > button { position: absolute; top: 0; bottom: 0; left: 0; right: 0; width: 100%; font-size: 7em; color: transparent; background: transparent; content: "▶"; } ._gifs > button:focus-visible { outline: 0.25rem solid #663399C0; outline-offset: -0.25rem; } ._commits { position: relative; } ._commits > :first-child { position: absolute; right: -0.1em; height: 100%; border-right: 0.2em solid rgba(102,51,153,0.5); } ._commits > :last-child { position: relative; padding-right: 0.5em; } * + ._commit, ._commit * + * { margin-top: 0; } ._commit { line-height: 2; margin-right: -1.5em; text-align: right; } ._commit > img { width: 2em; vertical-align: middle; } ._commit > a { padding-right: 0.5em; text-decoration: none; color: rebeccapurple; } ._commit > a > code { font-size: 1em; } ._commit-none > a { color: rgba(102,51,153,0.5); }

## What are they?

CSS has four highlight pseudos and an open set of author-defined custom highlight pseudos. They have their roots in ::selection, which was a rudimentary and non-standard, but widely supported, way of styling text and images selected by the user.

The built-in highlights are ::selection for user-selected content, ::target-text for linking to text fragments, ::spelling-error for misspelled words, and ::grammar-error for text with grammar errors, while the custom highlights are known as ::highlight(x) where x is the author-defined highlight name.

## Can I use them?

::selection has long been supported by all of the major browsers, and ::target-text shipped in Chromium 89. But for most of that time, no browser had yet implemented the more robust highlight pseudo system in the CSS pseudo spec.

::highlight() and the custom highlight API shipped in Chromium 105, thanks to the work by members1 of the Microsoft Edge team. They are also available in Safari 14.1 (including iOS 14.5) as an experimental feature (Highlight API). You can enable that feature in the Develop menu, or for iOS, under Settings > Safari > Advanced.

Safari’s support currently has a couple of quirks, as of TP 152. Range is not supported for custom highlights yet, only StaticRange, and the Highlight constructor has a bug where it requires passing exactly one range, ignoring any additional arguments. To create a Highlight with no ranges, first create one with a dummy range, then call the clear or delete methods.

Chromium 105 also implements the vast majority of the new highlight pseudo system. This includes highlight overlay painting, which was enabled for all highlight pseudos, and highlight inheritance, which was enabled for ::highlight() only.

Chromium 108 includes ::spelling-error and ::grammar-error as an experimental feature, together with the new ‘text-decoration-line’ values ‘spelling-error’ and ‘grammar-error’. You can enable these features at

chrome://flags/#enable-experimental-web-platform-features

Chromium’s support also currently has some bugs, as of r1041796. Notably, highlights don’t yet work under ::first-line and ::first-letter2, ‘text-shadow’ is not yet enabled for ::highlight(), computedStyleMap results are wrong for ‘currentColor’, and highlights that split ligatures (e.g. for complex scripts) only render accurately in ::selection2.

Click the table below to see if your browser supports these features.

act:
sel:
cha:


## How do I use them?

While you can write rules for highlight pseudos that target all elements, as was commonly done for pre-standard ::selection, selecting specific elements can be more powerful, allowing descendants to cleanly override highlight styles.

Previously the same code would yield…

…unless you also selected the descendants of :root and aside:

Note that a bare ::selection rule still means *::selection, and like any universal rule, it can interfere with inheritance when mixed with non-universal highlight rules.

::selection is primarily controlled by user input, though pages can both read and write the active ranges via the Selection API with getSelection().

::target-text is activated by navigating to a URL ending in a fragment directive, which has its own syntax embedded in the #fragment. For example:

• #foo:~:text=bar targets #foo and highlights the first occurrence of “bar”
• #:~:text=the,dog highlights the first range of text from “the” to “dog”

::spelling-error and ::grammar-error are controlled by the user’s spell checker, which is only used where the user can input text, such as with textarea or contenteditable, subject to the spellcheck attribute (which also affects grammar checking). For privacy reasons, pages can’t read the active ranges of these highlights, despite being visible to the user.

::highlight() is controlled via the Highlight API with CSS.highlights. CSS.highlights is a maplike object, which means the interface is the same as a Map of strings (highlight names) to Highlight objects. Highlight objects, in turn, are setlike objects, which you can use like a Set of Range or StaticRange objects.

You can use getComputedStyle() to query resolved highlight styles under a particular element. Regardless of which parts (if any) are highlighted, the styles returned are as if the given highlight is active and all other highlights are inactive.

## How do they work?

Highlight pseudos are defined as pseudo-elements, but they actually have very little in common with other pseudo-elements like ::before and ::first-line.

Unlike other pseudos, they generate highlight overlays, not boxes, and these overlays are like layers over the original content. Where text is highlighted, a highlight overlay can add backgrounds and text shadows, while the text proper and any other decorations are “lifted” to the very top.

@import url(/images/hpdemo.css);

You can think of highlight pseudos as innermost pseudo-elements that always exist at the bottom of any tree of elements and other pseudos, but unlike other pseudos, they don’t inherit their styles from that element tree.

Instead each highlight pseudo forms its own inheritance tree, parallel to the element tree. This means body::selection inherits from html::selection, not from ‘body’ itself.

At this point, you can probably see that the highlight pseudos are quite different from the rest of CSS, but there are also several special cases and rules needed to make them a coherent system.

For the typical appearance of spelling and grammar errors, highlight pseudos need to be able to add their own decorations, and they need to be able to leave the underlying foreground color unchanged. Highlight inheritance happens separately from the element tree, so we need some way to refer to the underlying foreground color.

That escape hatch is to set ‘color’ itself to ‘currentColor’, which is the default if nothing in the highlight tree sets ‘color’.

This is a bit of a special case within a special case.

You see, ‘currentColor’ is usually defined as “the computed value of ‘color’”, but the way I like to think of it is “don’t change the foreground color”, and most color-valued properties like ‘text-decoration-color’ default to this value.

For ‘color’ itself that wouldn’t make sense, so we instead define ‘color:currentColor’ as equivalent to ‘color:inherit’, which still fits that mental model. But for highlights, that definition would no longer fit, so we redefine it as being the ‘color’ of the next active highlight below.

To make highlight inheritance actually useful for ‘text-decoration’ and ‘background-color’, all properties are inherited in highlight styles, even those that are not usually inherited.

This would conflict with the usual rules3 for decorating boxes, because descendants would get two decorations, one propagated and one inherited. We resolved this by making decorations added by highlights not propagate to any descendants.

Unstyled highlight pseudos generally don’t change the appearance of the original content, so the default ‘color’ and ‘background-color’ in highlights are ‘currentColor’ and ‘transparent’ respectively, the latter being the property’s initial value. But two highlight pseudos, ::selection and ::target-text, have UA default foreground and background colors.

For compatibility with ::selection in older browsers, the UA default ‘color’ and ‘background-color’ (e.g. white on blue) is only used if neither were set by the author. This rule is known as paired cascade, and for consistency it also applies to ::target-text.

It’s common for selected text to almost invert the original text colors, turning black on white into white on blue, for example. To guarantee that the original decorations remain as legible as the text when highlighted, which is especially important for decorations with semantic meaning (e.g. line-through), originating decorations are recolored to the highlight ‘color’. This doesn’t apply to decorations added by highlights though, because that would break the typical appearance of spelling and grammar errors.

The default style rules for highlight pseudos might look something like this. Notice the new ‘spelling-error’ and ‘grammar-error’ decorations, which authors can use to imitate native spelling and grammar errors.

The way the highlight pseudos have been designed naturally leads to some limitations.

## Gotchas

Older browsers with ::selection tend to treat it purely as a way to change the original content’s styles, including text shadows and other decorations. Some tutorial content has even been written to that effect:

One of the most helpful uses for ::selection is turning off a text-shadow during selection. A text-shadow can clash with the selection’s background color and make the text difficult to read. Set text-shadow: none; to make text clear and easy to read during selection.

Under the spec, highlight pseudos can no longer remove or really change the original content’s decorations and shadows. Setting these properties in highlight pseudos to values other than ‘none’ adds decorations and shadows to the overlays when they are active.

While the new :has() selector might appear to offer a solution to this problem, pseudo-element selectors are not allowed in :has(), at least not yet.

Removing shadows that might clash with highlight backgrounds (as suggested in the tutorial above) will no longer be as necessary anyway, since highlight backgrounds now paint on top of the original text shadows.

If you still want to ensure those shadows don’t clash with highlights in older browsers, you can set ‘text-shadow’ to ‘none’, which is harmless in newer browsers.

As for line decorations, if you’re really determined, you can work around this limitation by using ‘-webkit-text-fill-color’, a standard property (believe it or not) that controls the foreground fill color of text4.

Fun fact: because of ‘-webkit-text-fill-color’ and its stroke-related siblings, it isn’t always possible for highlight pseudos to avoid changing the foreground colors of text, at least not without out-of-band knowledge of what those colors are.

### Accessing global constants

Highlight pseudos also don’t automatically have access to custom properties set in the element tree, which can make things tricky if you have a design system that exposes a color palette via custom properties on :root.

You can work around this by adding selectors for the necessary highlight pseudos to the rule defining the constants, or if the necessary highlight pseudos are unknown, by rewriting each constant as a custom @property rule.

### Spec issues

While the design of the highlight pseudos has mostly settled, there are still some unresolved issues to watch out for.

• how to use spelling and grammar decorations with the UA default colors (#7522)
• values of non-applicable properties, e.g. ‘text-shadow’ with em units (#7591)
• the meaning of underline- and emphasis-related properties in highlights (#7101)
• whether ‘-webkit-text-fill-color’ and friends are allowed in highlights (#7580)
• some browsers “tweak” the colors or alphas set in highlight styles (#6853)
• how the highlight pseudos are supposed to interact with SVG (svgwg#894)

## What now?

The highlight pseudos are a radical departure from older browsers with ::selection, and have some significant differences with CSS as we know it. Now that we have some experimental support, we want your help to play around with these features and help us make them as useful and ergonomic as possible before they’re set in stone.

Special thanks to Rego, Brian, Eric (Igalia), Florian, fantasai (CSSWG), Emilio (Mozilla), and Dan for their work in shaping the highlight pseudos (and this post). We would also like to thank Bloomberg for sponsoring this work.

1. Dan, Fernando, Sanket, Luis, Bo, and anyone else I missed.

2. See this demo for more details.  2

3. CSSWG discussion also found that decorating box semantics are undesirable for decorations added by highlights anyway.

4. This is actually the case everywhere the WHATWG compat spec applies, at all times. If you think about it, the only reason why setting ‘color’ to ‘red’ makes your text red is because ‘-webkit-text-fill-color’ defaults to ‘currentColor’.

### Andy Wingo

#### new month, new brainworm

Today, a brainworm! I had a thought a few days ago and can't get it out of my head, so I need to pass it on to another host.

So, imagine a world in which there is a a drive to build a kind of Kubernetes on top of WebAssembly. Kubernetes nodes are generally containers, associated with additional metadata indicating their place in overall system topology (network connections and so on). (I am not a Kubernetes specialist, as you can see; corrections welcome.) Now in a WebAssembly cloud, the nodes would be components, probably also with additional topological metadata. VC-backed companies will duke it out for dominance of the WebAssembly cloud space, and in a couple years we will probably emerge with an open source project that has become a de-facto standard (though it might be dominated by one or two players).

In this world, Kubernetes and Spiffy-Wasm-Cloud will coexist. One of the success factors for Kubernetes was that you can just put your old database binary inside a container: it's the same ABI as when you run your database in a virtual machine, or on (so-called!) bare metal. The means of composition are TCP and UDP network connections between containers, possibly facilitated by some kind of network fabric. In contrast, in Spiffy-Wasm-Cloud we aren't starting from the kernel ABI, with processes and such: instead there's WASI, which is more of a kind of specialized and limited libc. You can't just drop in your database binary, you have to write code to get it to conform to the new interfaces.

One consequence of this situation is that I expect WASI and the component model to develop a rich network API, to allow WebAssembly components to interoperate not just with end-users but also other (micro-)services running in the same cloud. Likewise there is room here for a company to develop some complicated network fabrics for linking these things together.

However, WebAssembly-to-WebAssembly links are better expressed via typed functional interfaces; it's more expressive and can be faster. Not only can you end up having fine-grained composition that looks more like lightweight Erlang processes, you can also string together components in a pipeline with communications overhead approaching that of a simple function call. Relative to Kubernetes, there are potential 10x-100x improvements to be had, in throughput and in memory footprint, at least in some cases. It's the promise of this kind of improvement that can drive investment in this area, and eventually adoption.

But, you still have some legacy things running in containers. What to do? Well... Maybe recompile them to WebAssembly? That's my brain-worm.

A container is a file system image containing executable files and data. Starting with the executable files, they are in machine code, generally x64, and interoperate with system libraries and the run-time via an ABI. You could compile them to WebAssembly instead. You could interpret them as data, or JIT-compile them as webvm does, or directly compile them to WebAssembly. This is the sort of thing you hire Fabrice Bellard to do ;) Then you have the filesystem. Let's assume it is stateless: any change to the filesystem at runtime doesn't need to be preserved. (I understand this is a goal, though I could be wrong.) So you could put the filesystem in memory, as some kind of addressable data structure, and you make the libc interface access that data structure. It's something like the microkernel approach. And then you translate whatever topological connectivity metadata you had for Kubernetes to your Spiffy-Wasm-Cloud's format.

Anyway in the end you have a WebAssembly module and some metadata, and you can run it in your WebAssembly cloud. Or on the more basic level, you have a container and you can now run it on any machine with a WebAssembly implementation, even on other architectures (coucou RISC-V!).

Anyway, that's the tweet. Have fun, whoever gets to work on this :)

## August 31, 2022

### Qiuyi Zhang (Joyee)

#### Building V8 on an M1 MacBook

I’ve recently got an M1 MacBook and played around with it a bit. It seems many open source projects still haven’t added MacOS with ARM64

## August 29, 2022

### Lauro Moura

#### Using Breakpad to generate crash dumps with WPE WebKit

Breakpad is a tool from Google that helps generate crash reports. From its description:

Breakpad is a library and tool suite that allows you to distribute an application to users with compiler-provided debugging information removed, record crashes in compact "minidump" files, send them back to your server, and produce C and C++ stack traces from these minidumps. Breakpad can also write minidumps on request for programs that have not crashed.

It works by stripping the debug information from the executable and saving it into "symbol files." When a crash occurs or upon request, the Breakpad client library generates the crash information in these "minidumps." The Breakpad minidump processor combines these files with the symbol files and generates a human-readable stack trace. The following picture, also from Breakpad's documentation, describes this process:

In WPE, Breakpad support was added initially for the downstream 2.28 branch by Vivek Arumugam and backported upstream. It'll be available in the soon-to-be-released 2.38 version, and the WebKit Flatpak SDK bundles the Breakpad client library since late May 2022.

As a developer feature, Breakpad support is disabled by default but can be enabled by passing -DENABLE_BREAKPAD=1 to cmake when building WebKit. Optionally, you can also set -DBREAKPAD_MINIDUMP_DIR=<some path> to hardcode the path used by Breakpad to save the minidumps. If not set during build time, BREAKPAD_MINIDUMP_DIR must be set as an environment variable pointing to a valid directory when running the application. If defined, this variable also overrides the path defined during the build.

## Generating the symbols

To generate the symbol files, Breakpad provides the dump_syms tool. It takes a path to the executable/library and dumps to stdout the symbol information.

Once generated, the symbol files must be laid out in a specific tree structure so minidump_stackwalk can find them when merging with the crash information. The folder containing the symbol files must match a hash code generated for that specific binary. For example, in the case of the libWPEWebKit:

$dump_syms WebKitBuild/WPE/Release/lib/libWPEWebKit-1.1.so > libWPEWebKit-1.1.so.0.sym$ head -n 1 libWPEWebKit-1.1.so.0.sym
MODULE Linux x86_64 A2DA230C159B97DC00000000000000000 libWPEWebKit-1.1.so.0
$mkdir -p ./symbols/libWPEWebKit-1.1.so.0/A2DA230C159B97DC00000000000000000$ cp libWPEWebKit-1.1.so.0.sym ./symbols/libWPEWebKit-1.1.so.0/A2DA230C159B97DC00000000000000000/


## Generating the crash log

Besides the symbol files, we need a minidump file with the stack information, which can be generated in two ways. First, by asking Breakpad to create it. The other way is, well, when the application crashes :)

To generate a minidump manually, you can either call google_breakpad::WriteMiniDump(path, callback, context) or send one of the crashing signals Breakpad recognizes. The former is helpful to generate the dumps programmatically at specific points, while the signal approach might be helpful to inspect hanging processes. These are the signals Breakpad handles as crashing ones:

• SIGSEGV
• SIGABRT
• SIGFPE
• SIGILL (Note: this is for illegal instruction, not the ordinary SIGKILL)
• SIGBUS
• SIGTRAP

Now, first we must run Cog:

$BREAKPAD_MINIDUMP_DIR=/home/lauro/minidumps ./Tools/Scripts/run-minibrowser --wpe --release https://www.wpewebkit.org  Crashing the WebProcess using SIGTRAP: $ ps aux | grep WebProcess
<SOME-PID> ... /app/webkit/.../WebProcess
$kill -TRAP <SOME-PID>$ ls /home/lauro/minidumps
5c2d93f2-6e9f-48cf-6f3972ac-b3619fa9.dmp
$file ~/minidumps/5c2d93f2-6e9f-48cf-6f3972ac-b3619fa9.dmp /home/lauro/minidumps/5c2d93f2-6e9f-48cf-6f3972ac-b3619fa9.dmp: Mini DuMP crash report, 13 streams, Thu Aug 25 20:29:11 2022, 0 type  Note: In the current form, WebKit supports Breakpad dumps in the WebProcess and NetworkProcess, which WebKit spawns itself. The developer must manually add support for it in the UIProcess (the browser/application using WebKit). The exception handler should be installed as early as possible, and many programs might do some initialization before initializing WebKit itself. ## Translating the crash log Once we have a .dmp crash log and the symbol files, we can use minidump_stackwalk to show the human-readable crash log: $ minidump_stackwalk ~/minidumps/5c2d93f2-6e9f-48cf-6f3972ac-b3619fa9.dmp ./symbols/
Operating system: Linux
0.0.0 Linux 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64
CPU: amd64
family 23 model 113 stepping 0
1 CPU

GPU: UNKNOWN

Crash reason:  SIGTRAP
Process uptime: not available

0  libc.so.6 + 0xf71fd
rax = 0xfffffffffffffffc   rdx = 0x0000000000000090
rcx = 0x00007f6ae47d61fd   rbx = 0x00007f6ae4f0c2e0
rsi = 0x0000000000000001   rdi = 0x000056187adbf6d0
rbp = 0x00007fffa9268f10   rsp = 0x00007fffa9268ef0
r8 = 0x0000000000000000    r9 = 0x00007f6ae4fdc2c0
r10 = 0x00007fffa9333080   r11 = 0x0000000000000293
r12 = 0x0000000000000001   r13 = 0x00007fffa9268f34
r14 = 0x0000000000000090   r15 = 0x000056187ade7aa0
rip = 0x00007f6ae47d61fd
Found by: given as instruction pointer in context
1  libglib-2.0.so.0 + 0x585ce
rsp = 0x00007fffa9268f20   rip = 0x00007f6ae4efc5ce
Found by: stack scanning
2  libglib-2.0.so.0 + 0x58943
rsp = 0x00007fffa9268f80   rip = 0x00007f6ae4efc943
Found by: stack scanning
3  libWPEWebKit-1.1.so.0!WTF::RunLoop::run() + 0x120
rsp = 0x00007fffa9268fa0   rip = 0x00007f6ae8923180
Found by: stack scanning
4  libWPEWebKit-1.1.so.0!WebKit::WebProcessMain(int, char**) + 0x11e
rbx = 0x000056187a5428d0   rbp = 0x0000000000000003
rsp = 0x00007fffa9268fd0   r12 = 0x00007fffa9269148
rip = 0x00007f6ae74719fe
Found by: call frame info
<snip remaining trace>


## Final words

This article briefly overviews enabling and using Breakpad to generate crash dumps. In a future article, we'll cover using Breakpad to get crashdumps while running WPE on embedded boards like RaspberryPis.

o/

## August 23, 2022

### Andy Wingo

#### accessing webassembly reference-typed arrays from c++

The WebAssembly garbage collection proposal is coming soonish (really!) and will extend WebAssembly with the the capability to create and access arrays whose memory is automatically managed by the host. As long as some system component has a reference to an array, it will be kept alive, and as soon as nobody references it any more, it becomes "garbage" and is thus eligible for collection.

(In a way it's funny to define the proposal this way, in terms of what happens to garbage objects that by definition aren't part of the program's future any more; really the interesting thing is the new things you can do with live data, defining new data types and representing them outside of linear memory and passing them between components without copying. But "extensible-arrays-structs-and-other-data-types" just isn't as catchy as "GC". Anyway, I digress!)

One potential use case for garbage-collected arrays is for passing large buffers between parts of a WebAssembly system. For example, a webcam driver could produce a stream of frames as reference-typed arrays of bytes, and then pass them by reference to a sandboxed WebAssembly instance to, I don't know, identify cats in the images or something. You get the idea. Reference-typed arrays let you avoid copying large video frames.

A lot of image-processing code is written in C++ or Rust. With WebAssembly 1.0, you just have linear memory and no reference-typed values, which works well for these languages that like to think of memory as having a single address space. But once you get reference-typed arrays in the mix, you effectively have multiple address spaces: you can't address the contents of the array using a normal pointer, as you might be able to do if you mmap'd the buffer into a program's address space. So what do you do?

reference-typed values are special

The broader question of C++ and GC-managed arrays is, well, too broad for today. The set of array types is infinite, because it's not just arrays of i32, it's also arrays of arrays of i32, and arrays of those, and arrays of records, and so on.

So let's limit the question to just arrays of i8, to see if we can make some progress. So imagine a C function that takes an array of i8:

void process(array_of_i8 array) {
// ?
}


If you know WebAssembly, there's a clear translation of the sort of code that we want:

(func (param $array (ref (array i8))) ; operate on local 0 )  The WebAssembly function will have an array as a parameter. But, here we start to run into more problems with the LLVM toolchain that we use to compile C and other languages to WebAssembly. When the C front-end of LLVM (clang) compiles a function to the LLVM middle-end's intermediate representation (IR), it models all local variables (including function parameters) as mutable memory locations created with alloca. Later optimizations might turn these memory locations back to SSA variables and thence to registers or stack slots. But, a reference-typed value has no bit representation, and it can't be stored to linear memory: there is no alloca that can hold it. Incidentally this problem is not isolated to future extensions to WebAssembly; the externref and funcref data types that landed in WebAssembly 2.0 and in all browsers are also reference types that can't be written to main memory. Similarly, the table data type which is also part of shipping WebAssembly is not dissimilar to GC-managed arrays, except that they are statically allocated at compile-time. At Igalia, my colleagues Paulo Matos and Alex Bradbury have been hard at work to solve this gnarly problem and finally expose reference-typed values to C. The full details and final vision are probably a bit too much for this article, but some bits on the mechanism will help. Firstly, note that LLVM has a fairly traditional breakdown between front-end (clang), middle-end ("the IR layer"), and back-end ("the MC layer"). The back-end can be quite target-specific, and though it can be annoying, we've managed to get fairly good support for reference types there. In the IR layer, we are currently representing GC-managed values as opaque pointers into non-default, non-integral address spaces. LLVM attaches an address space (an integer less than 224 or so) to each pointer, mostly for OpenCL and GPU sorts of use-cases, and we abuse this to prevent LLVM from doing much reasoning about these values. This is a bit of a theme, incidentally: get the IR layer to avoid assuming anything about reference-typed values. We're fighting the system, in a way. As another example, because LLVM is really oriented towards lowering high-level constructs to low-level machine operations, it doesn't necessarily preserve types attached to pointers on the IR layer. Whereas for WebAssembly, we need exactly that: we reify types when we write out WebAssembly object files, and we need LLVM to pass some types through from front-end to back-end unmolested. We've had to change tack a number of times to get a good way to preserve data from front-end to back-end, and probably will have to do so again before we end up with a final design. Finally on the front-end we need to generate an alloca in different address spaces depending on the type being allocated. And because reference-typed arrays can't be stored to main memory, there are semantic restrictions as to how they can be used, which need to be enforced by clang. Fortunately, this set of restrictions is similar enough to what is imposed by the ARM C Language Extensions (ACLE) for scalable vector (SVE) values, which also don't have a known bit representation at compile-time, so we can piggy-back on those. This patch hasn't landed yet, but who knows, it might land soon; in the mean-time we are going to run ahead of upstream a bit to show how you might define and use an array type definition. Further tacks here are also expected, as we try to thread the needle between exposing these features to users and not imposing too much of a burden on clang maintenance. accessing array contents All this is a bit basic, though; it just gives you enough to have a local variable or a function parameter of a reference-valued type. Let's continue our example: void process(array_of_i8 array) { uint32_t sum; for (size_t idx = 0; i < __builtin_wasm_array_length(array); i++) sum += (uint8_t)__builtin_wasm_array_ref_i8(array, idx); // ... }  The most basic way to extend C to access these otherwise opaque values is to expose some builtins, say __builtin_wasm_array_length and so on. Probably you need different intrinsics for each scalar array element type (i8, i16, and so on), and one for arrays which return reference-typed values. We'll talk about arrays of references another day, but focusing on the i8 case, the C builtin then lowers to a dedicated LLVM intrinsic, which passes through the middle layer unscathed. In C++ I think we can provide some nicer syntax which preserves the syntactic illusion of array access. I think this is going to be sufficient as an MVP, but there's one caveat: SIMD. You can indeed have an array of i128 values, but you can only access that array's elements as i128; worse, you can't load multiple data from an i8 array as i128 or even i32. Compare this to to the memory control proposal, which instead proposes to map buffers to non-default memories. In WebAssembly, you can in theory (and perhaps soon in practice) have multiple memories. The easiest way I can see on the toolchain side is to use the address space feature in clang: void process(uint8_t *array __attribute__((address_space(42))), size_t len) { uint32_t sum; for (size_t idx = 0; i < len; i++) sum += array[idx]; // ... }  How exactly to plumb the mapping between address spaces which can only be specified by number from the front-end to the back-end is a little gnarly; really you'd like to declare the set of address spaces that a compilation unit uses symbolically, and then have the linker produce a final allocation of memory indices. But I digress, it's clear that with this solution we can use SIMD instructions to load multiple bytes from memory at a time, so it's a winner with respect to accessing GC arrays. Or is it? Perhaps there could be SIMD extensions for packed GC arrays. I think it makes sense, but it's a fair amount of (admittedly somewhat mechanical) specification and implementation work. & future In some future bloggies we'll talk about how we will declare new reference types: first some basics, then some more integrated visions for reference types and C++. Lots going on, and this is just a brain-dump of the current state of things; thoughts are very much welcome. ## August 18, 2022 ### Andy Wingo #### just-in-time code generation within webassembly Just-in-time (JIT) code generation is an important tactic when implementing a programming language. Generating code at run-time allows a program to specialize itself against the specific data it is run against. For a program that implements a programming language, that specialization is with respect to the program being run, and possibly with respect to the data that program uses. The way this typically works is that the program generates bytes for the instruction set of the machine it's running on, and then transfers control to those instructions. Usually the program has to put its generated code in memory that is specially marked as executable. However, this capability is missing in WebAssembly. How, then, to do just-in-time compilation in WebAssembly? webassembly as a harvard architecture In a von Neumman machine, like the ones that you are probably reading this on, code and data share an address space. There's only one kind of pointer, and it can point to anything: the bytes that implement the sin function, the number 42, the characters in "biscuits", or anything at all. WebAssembly is different in that its code is not addressable at run-time. Functions in a WebAssembly module are numbered sequentially from 0, and the WebAssembly call instruction takes the callee as an immediate parameter. So, to add code to a WebAssembly program, somehow you'd have to augment the program with more functions. Let's assume we will make that possible somehow -- that your WebAssembly module that had N functions will now have N+1 functions, and with function N being the new one your program generated. How would we call it? Given that the call instructions hard-code the callee, the existing functions 0 to N-1 won't call it. Here the answer is call_indirect. A bit of a reminder, this instruction take the callee as an operand, not an immediate parameter, allowing it to choose the callee function at run-time. The callee operand is an index into a table of functions. Conventionally, table 0 is called the indirect function table as it contains an entry for each function which might ever be the target of an indirect call. With this in mind, our problem has two parts, then: (1) how to augment a WebAssembly module with a new function, and (2) how to get the original module to call the new code. late linking of auxiliary webassembly modules The key idea here is that to add code, the main program should generate a new WebAssembly module containing that code. Then we run a linking phase to actually bring that new code to life and make it available. System linkers like ld typically require a complete set of symbols and relocations to resolve inter-archive references. However when performing a late link of JIT-generated code, we can take a short-cut: the main program can embed memory addresses directly into the code it generates. Therefore the generated module would import memory from the main module. All references from the generated code to the main module can be directly embedded in this way. The generated module would also import the indirect function table from the main module. (We would ensure that the main module exports its memory and indirect function table via the toolchain.) When the main module makes the generated module, it also embeds a special patch function in the generated module. This function would add the new functions to the main module's indirect function table, and perform any relocations onto the main module's memory. All references from the main module to generated functions are installed via the patch function. We plan on two implementations of late linking, but both share the fundamental mechanism of a generated WebAssembly module with a patch function. dynamic linking via the run-time One implementation of a linker is for the main module to cause the run-time to dynamically instantiate a new WebAssembly module. The run-time would provide the memory and indirect function table from the main module as imports when instantiating the generated module. The advantage of dynamic linking is that it can update a live WebAssembly module without any need for re-instantiation or special run-time checkpointing support. In the context of the web, JIT compilation can be triggered by the WebAssembly module in question, by calling out to functionality from JavaScript, or we can use a "pull-based" model to allow the JavaScript host to poll the WebAssembly instance for any pending JIT code. For WASI deployments, you need a capability from the host. Either you import a module that provides run-time JIT capability, or you rely on the host to poll you for data. static linking via wizer Another idea is to build on Wizer's ability to take a snapshot of a WebAssembly module. You could extend Wizer to also be able to augment a module with new code. In this role, Wizer is effectively a late linker, linking in a new archive to an existing object. Wizer already needs the ability to instantiate a WebAssembly module and to run its code. Causing Wizer to ask the module if it has any generated auxiliary module that should be instantiated, patched, and incorporated into the main module should not be a huge deal. Wizer can already run the patch function, to perform relocations to patch in access to the new functions. After having done that, Wizer (or some other tool) would need to snapshot the module, as usual, but also adding in the extra code. As a technical detail, in the simplest case in which code is generated in units of functions which don't directly call each other, this is as simple as just appending the functions to the code section and then and appending the generated element segments to the main module's element segment, updating the appended function references to their new values by adding the total number of functions in the module before the new module was concatenated to each function reference. late linking appears to be async codegen From the perspective of a main program, WebAssembly JIT code generation via late linking appears the same as aynchronous code generation. For example, take the C program: struct Value; struct Func { struct Expr *body; void *jitCode; }; void recordJitCandidate(struct Func *func); uint8_t* flushJitCode(); // Call to actually generate JIT code. struct Value* interpretCall(struct Expr *body, struct Value *arg); struct Value* call(struct Func *func, struct Value* val) { if (func->jitCode) { struct Value* (*f)(struct Value*) = jitCode; return f(val); } else { recordJitCandidate(func); return interpretCall(func->body, val); } }  Here the C program allows for the possibility of JIT code generation: there is a slot in a Func instance to fill in with a code pointer. If this program generates code for a given Func, it won't be able to fill in the pointer -- it can't add new code to the image. But, it could tell Wizer to do so, and Wizer could snapshot the program, link in the new function, and patch &func->jitCode. From the program's perspective, it's as if the code becomes available asynchronously. demo! So many words, right? Let's see some code! As a sketch for other JIT compiler work, I implemented a little Scheme interpreter and JIT compiler, targetting WebAssembly. See interp.cc for the source. You compile it like this: $ /opt/wasi-sdk/bin/clang++ -O2 -Wall \
-mexec-model=reactor \
-Wl,--growable-table \
-Wl,--export-table \
-DLIBRARY=1 \
-fno-exceptions \
interp.cc -o interplib.wasm


Here we are compiling with WASI SDK. I have version 14.

The -mexec-model=reactor argument means that this WASI module isn't just a run-once thing, after which its state is torn down; rather it's a multiple-entry component.

The two -Wl, options tell the linker to export the indirect function table, and to allow the indirect function table to be augmented by the JIT module.

The -DLIBRARY=1 is used by interp.cc; you can actually run and debug it natively but that's just for development. We're instead compiling to wasm and running with a WASI environment, giving us fprintf and other debugging niceties.

The -fno-exceptions is because WASI doesn't support exceptions currently. Also we don't need them.

WASI is mainly for non-browser use cases, but this module does so little that it doesn't need much from WASI and I can just polyfill it in browser JavaScript. So that's what we have here:

JavaScript disabled, no wasm-jit demo. See the wasm-jit web page for more information. &&<<<<><>>&&<&>>>><><><><><><>>><>>

Each time you enter a Scheme expression, it will be parsed to an internal tree-like intermediate language. You can then run a recursive interpreter over that tree by pressing the "Evaluate" button. Press it a number of times, you should get the same result.

As the interpreter runs, it records any closures that it created. The Func instances attached to the closures have a slot for a C++ function pointer, which is initially NULL. Function pointers in WebAssembly are indexes into the indirect function table; the first slot is kept empty so that calling a NULL pointer (a pointer with value 0) causes an error. If the interpreter gets to a closure call and the closure's function's JIT code pointer is NULL, it will interpret the closure's body. Otherwise it will call the function pointer.

If you then press the "JIT" button above, the module will assemble a fresh WebAssembly module containing JIT code for the closures that it saw at run-time. Obviously that's just one heuristic: you could be more eager or more lazy; this is just a detail.

Although the particular JIT compiler isn't much of interest---the point being to see JIT code generation at all---it's nice to see that the fibonacci example sees a good speedup; try it yourself, and try it on different browsers if you can. Neat stuff!

not just the web

I was wondering how to get something like this working in a non-webby environment and it turns out that the Python interface to wasmtime is just the thing. I wrote a little interp.py harness that can do the same thing that we can do on the web; just run as python3 interp.py, after having pip3 install wasmtime:

\$ python3 interp.py
...
Calling eval(0x11eb0) 5 times took 1.716s.
Calling jitModule()
jitModule result: <wasmtime._module.Module object at 0x7f2bef0821c0>
Instantiating and patching in JIT module
...
Calling eval(0x11eb0) 5 times took 1.161s.


Interestingly it would appear that the performance of wasmtime's code (0.232s/invocation) is somewhat better than both SpiderMonkey (0.392s) and V8 (0.729s).

reflections

This work is just a proof of concept, but it's a step in a particular direction. As part of previous work with Fastly, we enabled the SpiderMonkey JavaScript engine to run on top of WebAssembly. When combined with pre-initialization via Wizer, you end up with a system that can start in microseconds: fast enough to instantiate a fresh, shared-nothing module on every HTTP request, for example.

The SpiderMonkey-on-WASI work left out JIT compilation, though, because, you know, WebAssembly doesn't support JIT compilation. JavaScript code actually ran via the C++ bytecode interpreter. But as we just found out, actually you can compile the bytecode: just-in-time, but at a different time-scale. What if you took a SpiderMonkey interpreter, pre-generated WebAssembly code for a user's JavaScript file, and then combined them into a single freeze-dried WebAssembly module via Wizer? You get the benefits of fast startup while also getting decent baseline performance. There are many engineering considerations here, but as part of work sponsored by Shopify, we have made good progress in this regard; details in another missive.

I think a kind of "offline JIT" has a lot of value for deployment environments like Shopify's and Fastly's, and you don't have to limit yourself to "total" optimizations: you can still collect and incorporate type feedback, and you get the benefit of taking advantage of adaptive optimization without having to actually run the JIT compiler at run-time.

But if we think of more traditional "online JIT" use cases, it's clear that relying on host JIT capabilities, while a good MVP, is not optimal. For one, you would like to be able to freely emit direct calls from generated code to existing code, instead of having to call indirectly or via imports. I think it still might make sense to have a language run-time express its generated code in the form of a WebAssembly module, though really you might want native support for compiling that code (asynchronously) from within WebAssembly itself, without calling out to a run-time. Most people I have talked to that work on WebAssembly implementations in JS engines believe that a JIT proposal will come some day, but it's good to know that we don't have to wait for it to start generating code and taking advantage of it.

& out

If you want to play around with the demo, do take a look at the wasm-jit Github project; it's fun stuff. Happy hacking, and until next time!

## August 15, 2022

### Eric Meyer

#### Table Column Alignment with Variable Transforms

One of the bigger challenges of recreating The Effects of Nuclear Weapons for the Web was its tables.  It was easy enough to turn tab-separated text and numbers into table markup, but the column alignment almost broke me.

To illustrate what I mean, here are just a few examples of columns that had to be aligned.

At first I naïvely thought, “No worries, I can right- or left-align most of these columns and figure out the rest later.”  But then I looked at the centered column headings, and how the column contents were essentially centered on the headings while having their own internal horizontal alignment logic, and realized all my dreams of simple fixes were naught but ashes.

My next thought was to put blank spacer columns between the columns of visible content, since table layout doesn’t honor the gap property, and then set a fixed width for various columns.  I really didn’t like all the empty-cell spam that would require, even with liberal application of the rowspan attribute, and it felt overly fragile  —  any shifts in font face (say, on an older or niche system) might cause layout upset within the visible columns, such as wrapping content that shouldn’t be wrapped or content overlapping other content.  I felt like there was a better answer.

I also thought about segregating every number and symbol (including decimal separators) into separate columns, like this:

<tr>
<th>Neutrinos from fission products</th>
<td>10</td>
<td></td>
<td></td>
</tr>
<tr class="total">
<th>Total energy per fission</th>
<td>200</td>
<td>±</td>
<td>6</td>
</tr>

Then I contemplated what that would do to screen readers and the document structure in general, and after the nausea subsided, I decided to look elsewhere.

It was at that point I thought about using spacer <span>s.  Like, anywhere I needed some space next to text in order to move it to one side or the other, I’d throw in something like one of these:

<span class="spacer"></span>
<span style="display: inline; width: 2ch;"></span>

Again, the markup spam repulsed me, but there was the kernel of an idea in there… and when I combined it with the truism “CSS doesn’t care what you expect elements to look or act like”, I’d hit upon my solution.

Let’s return to Table 1.43, which I used as an illustration in the announcement post.  It’s shown here in its not-aligned and aligned states, with borders added to the table-cell elements.

This is exactly the same table, only with cells shifted to one side or another in the second case.  To make this happen, I first set up a series of CSS rules:

figure.table .lp1 {transform: translateX(0.5ch);}
figure.table .lp2 {transform: translateX(1ch);}
figure.table .lp3 {transform: translateX(1.5ch);}
figure.table .lp4 {transform: translateX(2ch);}
figure.table .lp5 {transform: translateX(2.5ch);}

figure.table .rp1 {transform: translateX(-0.5ch);}
figure.table .rp2 {transform: translateX(-1ch);}

For a given class, the table cell is translated along the X axis by the declared number of ch units.  Yes, that means the table cells sharing a column no longer actually sit in the column.  No, I don’t care — and neither, as I said, does CSS.

I chose the labels lp and rp for “left pad” and “right pad”, in part as a callback to the left-pad debacle of yore even though it has basically nothing to do with what I’m doing here.  (Many of my class names are private jokes to myself.  We take our pleasures where we can.)  The number in each class name represents the number of “characters” to pad, which here increment by half-ch measures.  Since I was trying to move things by characters, using the unit that looks like it’s a character measure (even though it really isn’t) made sense to me.

With those rules set up, I could add simple classes to table cells that needed to be shifted, like so:

<td class="lp3">5 ± 0.5</td>

<td class="rp2">10</td>

That was most of the solution, but it turned out to not be quite enough.  See, things like decimal places and commas aren’t as wide as the numbers surrounding them, and sometimes that was enough to prevent a specific cell from being able to line up with the rest of its column.  There were also situations where the data cells could all be aligned with each other, but were unacceptably offset from the column header, which was nearly always centered.

So I decided to calc() the crap out of this to add the flexibility a custom property can provide.  First, I set a sitewide variable:

body {
--offset: 0ch;
}

I then added that variable to the various transforms:

figure.table .lp1 {transform: translateX(calc(0.5ch + var(--offset)));}
figure.table .lp2 {transform: translateX(calc(1ch   + var(--offset)));}
figure.table .lp3 {transform: translateX(calc(1.5ch + var(--offset)));}
figure.table .lp4 {transform: translateX(calc(2ch   + var(--offset)));}
figure.table .lp5 {transform: translateX(calc(2.5ch + var(--offset)));}

figure.table .rp1 {transform: translateX(calc(-0.5ch + var(--offset)));}
figure.table .rp2 {transform: translateX(calc(-1ch   + var(--offset)));}

Why use a variable at all?  Because it allows me to define offsets specific to a given table, or even specific to certain table cells within a table.  Consider the styles embedded along with Table 3.66:

#tbl3-66 tbody tr:first-child td:nth-child(1),
#tbl3-66 tbody td:nth-child(7) {
--offset: 0.25ch;
}
#tbl3-66 tbody td:nth-child(4) {
--offset: 0.1ch;
}

Yeah. The first cell of the first row and the seventh cell of every row in the table body needed to be shoved over an extra quarter-ch, and the fourth cell in every table-body row (under the heading “Sp”) got a tenth-ch nudge.  You can judge the results for yourself.

So, in the end, I needed only sprinkle class names around table markup where needed, and add a little extra offset via a custom property that I could scope to exactly where needed.  Sure, the whole setup is hackier than a panel of professional political pundits, but it works, and to my mind, it beats the alternatives.

I’d have been a lot happier if I could have aligned some of the columns on a specific character.  I think I still would have needed the left- and right-pad approach, but there were a lot of columns where I could have reduced or eliminated all the classes.  A quarter-century ago, HTML 4 had this capability, in that you could write:

<COLGROUP>
<COL>
<COL>
<COL align="±">
</COLGROUP>

CSS2 was also given this power via text-align, where you could give it a string value in order to specify horizontal alignment.

But browsers never really supported these features, even if some of them do still have bugs open on the issue.  (I chuckle aridly every time I go there and see “Opened 24 years ago” a few lines above “Status: NEW”.)  I know it’s not top of anybody’s wish list, but I wouldn’t mind seeing that capability return, somehow. Maybe as something that could be used in Grid column tracks as well as table columns.

I also found myself really pining for the ability to use attr() here, which would have allowed me to drop the classes and use data-* attributes on the table cells to say how far to shift them.  I could even have dropped the offset variable.  Instead, it could have looked something like this:

<td data-pad="3.25">5 ± 0.5</td>

<td data-pad="-1.9">10</td>



Alas, attr() is confined to the content property, and the idea of letting it be used more widely remains unrealized.

Anyway, that was my journey into recreating mid-20th-Century table column alignment on the Web.  It’s true that sufficiently old browsers won’t get the fancy alignment due to not supporting custom properties or calc(), but the data will all still be there.  It just won’t have the very specific column alignment, that’s all.  Hooray for progressive enhancement!

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## August 10, 2022

### Javier Fernández

#### New Custom Handlers component for Chrome

The HTML Standard section on Custom Handlers describes the procedure to register custom protocol handlers for URLs with specific schemes. When a URL is followed, the protocol dictates which part of the system handles it by consulting registries. In some cases, handlers may fall to the OS level and and launch a desktop application (eg. mailto:// may launch a native application) or, the browser may redirect the request to a different HTTP(S) URL (eg. mailto:// can go to gmail).

There are multiple ways to invoke the registerProtocolHandler method from the HTML API. The most common is through Browser Extensions, or add-ons, where the user must install third-party software to add or modify a browser’s feature. However, this approach puts on the users the responsibility of ensuring the extension is secure and that it respects their privacy. Ideally, downstream browsers could ship built-in protocol handlers that take this load off of users, but this was previously difficult.

During the last years Igalia and Protocol Labs have been working on improving the support of this HTML API in several of the main web engines, with the long term goal of getting the IPFS protocol a first-class member inside the most relevant browsers.

In this post I’m going to explain the most recent improvements, especially in Chrome, and some side advantages for Chromium embedders that we got on the way.

## Componentization of the Custom Handler logic

The first challenge faced by Chromium-based browsers wishing to add support for a new protocol was the architecture. Most of the logic and relevant codebase lived in the //chrome layer. Thus, any intent to introduce a behavior change on this logic would require a direct fork and patch the //chrome layers with the new logic. The design wasn’t defined to allow Chrome embedders to extend it with new features

The Chromium project provides a Public Content API to allow embedders reusing a great part of the browser’s common logic and to implement, if needed, specific behavior though the specialization of these APIs. These would seem to be the ideal way of extending the browser capabilities to handle new protocols. However, accessing //chrome from any layer (eg. //content or //components) is forbidden and constitutes a layering violation.

After some discussion with Chrome engineers (special thanks to Kuniko Yasuda and Colin Blundell) we decided to move the Protocol Handlers logic to the //components layer and create a new Custom Handlers component.

The following diagram provides a high-level overview of the architectural change:

The new component makes the Protocol Handler logic Chrome independent, bringing us the possibility of using the Content Public API to introduce the mentioned behavior changes, as we’ll see later in this post.

Only the registry’s Delegate and the Factory classes will remain in the //chrome layer.

The Delegate class was defined to provide OS integration, like setting as default browser, which needs to access the underlying operating systems interfaces. It also deals with other browser-specific logic, like Profiles. With the new componentized design, embedders may provide their own Delegate implementations with different strategies to interact with the operating system’s desktop or even support additional flavors.

On the other hand, the Factory (in addition to the creation of the appropriated Delegate’s instance) provides a BrowserContext that allows the registry to operate under Incognito mode.

Lets now get a deeper look into the component, trying to understand the responsibilities of each class. The following class diagram illustrates the multi-process architecture of the Custom Handler logic:

The ProtocolHandlerRegistry class has been modified to allow operating without user preferences storage; this is particularly useful to move most of the browser tests to the new component and avoid the dependency with the prefs and user_prefs components (eg. testing). In this scenario the registered handlers will be stored in memory only.

It’s worth mentioning that as part of this architectural change, I’ve applied some code cleanup and refactoring, following the policies dictated in the base::Value Code Health task-force. Some examples:

The ProtocolHandlerThrottle implements URL redirection with the handler provided by the registry. The ChromeContentBrowserClient and ShellBrowserContentClient classes use it to implement the CreateURLLoaderThrottles method of the ContentBrowserClient interface.

The specialized request is the responsible for handling the permission prompt dialog response, except for the the content-shell implementations, which bypass the Permission APIs (more on this later). I’ve also been working on a new design that relies on the PermissionContextBase::CreatePermissionRequest interface, so that we could create our own custom request and avoid the direct call to the PermissionRequestManager. Unfortunately this still needs some additional support to deal with the RegisterProtocolHandlerPermissionRequest‘s specific parameters needed for creating new instances, but I won’t extend on this, as it deserves its own post.

Finally, I’d like to remark that the new component provides a simple registry factory, designed mainly for testing purposes. It’s basically the same as the one implemented in the //chrome layer except some limitations of the new component:

• No incognito mode
• Use of a mocking Delegate (no actual OS interaction)

## Single-Point security checks

I also wanted to apply some refactoring as well to improve the inter-process consistency and increase the amount of shared code between the renderer and browser process.

One of the most relevant parts of this refactoring was the one applied to the implement the Custom Handlers’ normalization process, which goes from some URL’s syntax validation to the security and privacy checks. I’ve landed some patches (CL#3507332, CL#3692750) to implement functions that are shared among the renderer and browser processes to implement the mentioned normalization procedure.

The code has been moved to blink/common so that it could be invoked also by the new Custom Handlers component described before. Both the WebContentsImpl (run by the Browser process) and the NavigatorContentUtils (by the Renderer process) rely on this common code now.

Additionally, I have moved all the security and privacy checks from the Browser class, in Chrome, to the WebContentsImpl class. This implies that any embedder that implements the WebContentsDelegate interface shouldn’t need to bother about these checks, assuming that any ProtocolHandler instance is valid.

# Conclusion

In the introduction I mentioned that the main goal of this refactoring was to decouple the Custom Handlers logic from the Chrome’s specific codebase; this would allow us to modify or even extend the implementation of the Custom Handlers APIs.

The componentization of some parts of the Chrome’s codebase has many advantages, as it’s described in the Browser Components design document. Additionally, a new Custom Handlers component would provide other interesting advantages on different fronts.

One of the most interesting use cases is the definition of Web Platform Tests for the Custom Handlers APIs. This goal has 2 main challenges to address:

• Testing Automation
• Content Shell support

Finally, we would like this new component to be useful for Chrome embedders, and even browsers built directly on top of a full featured Chrome, to implement new Custom Handlers related features. This line of work would eventually be useful to improve the support of distributed protocols like IPFS, as an intermediate step towards a native implementation of the protocols in the browser.

## August 09, 2022

### Eric Meyer

#### Recreating “The Effects of Nuclear Weapons” for the Web

In my previous post, I wrote about a way to center elements based on their content, without forcing the element to be a specific width, while preserving the interior text alignment.  In this post, I’d like to talk about why I developed that technique.

Near the beginning of this year, fellow Web nerd and nuclear history buff Chris Griffith mentioned a project to put an entire book online: The Effects of Nuclear Weapons by Samuel Glasstone and Philip J. Dolan, specifically the third (1977) edition.  Like Chris, I own a physical copy of this book, and in fact, the information and tools therein were critical to the creation of HYDEsim, way back in the Aughts.  I acquired it while in pursuit of my degree in History, for which I studied the Cold War and the policy effects of the nuclear arms race, from the first bombers to the Strategic Defense Initiative.

I was immediately intrigued by the idea and volunteered my technical services, which Chris accepted.  So we started taking the OCR output of a PDF scan of the book, cleaning up the myriad errors, re-typing the bits the OCR mangled too badly to just clean up, structuring it all with HTML, converting figures to PNGs and photos to JPGs, and styling the whole thing for publication, working after hours and in odd down times to bring this historical document to the Web in a widely accessible form.  The result of all that work is now online.

That linked page is the best example of the technique I wrote about in the aforementioned previous post: as a Table of Contents, none of the lines actually get long enough to wrap.  Rather than figuring out the exact length of the longest line and centering based on that, I just let CSS do the work for me.

There were a number of other things I invented (probably re-invented) as we progressed.  Footnotes appear at the bottom of pages when the footnote number is activated through the use of the :target pseudo-class and some fixed positioning.  It’s not completely where I wanted it to be, but I think the rest will require JS to pull off, and my aim was to keep the scripting to an absolute minimum.

I couldn’t keep the scripting to zero, because we decided early on to use MathJax for the many formulas and other mathematical expressions found throughout the text.  I’d never written LaTeX before, and was very quickly impressed by how compact and yet powerful the syntax is.

Over time, I do hope to replace the MathJax-parsed LaTeX with raw MathML for both accessibility and project-weight reasons, but as of this writing, Chromium lacks even halfway-decent MathML support, so we went with the more widely-supported solution.  (My colleague Frédéric Wang at Igalia is pushing hard to fix this sorry state of affairs in Chromium, so I do have hopes for a migration to MathML… some day.)

The figures (as distinct from the photos) throughout the text presented an interesting challenge.  To look at them, you’d think SVG would be the ideal image format. Had they come as vector images, I’d agree, but they’re raster scans.  I tried recreating one or two in hand-crafted SVG and quickly determined the effort to create each was significant, and really only worked for the figures that weren’t charts, graphs, or other presentations of data.  For anything that was a chart or graph, the risk of introducing inaccuracies was too high, and again, each would have required an inordinate amount of effort to get even close to correct.  That’s particularly true considering that without knowing what font face was being used for the text labels in the figures, they’d have to be recreated with paths or polygons or whatever, driving the cost-to-recreate astronomically higher.

So I made the figures PNGs that are mostly transparent, except for the places where there was ink on the paper.  After any necessary straightening and some imperfection cleanup in Acorn, I then ran the PNGs through the color-index optimization process I wrote about back in 2020, which got them down to an average of 75 kilobytes each, ranging from 443KB down to 7KB.

At the 11th hour, still secretly hoping for a magic win, I ran them all through svgco.de to see if we could get automated savings.  Of the 161 figures, exactly eight of them were made smaller, which is not a huge surprise, given the source material.  So, I saved those eight for possible future updates and plowed ahead with the optimized PNGs.  Will I return to this again in the future?  Probably.  It bugs me that the figures could be better, and yet aren’t.

It also bugs me that we didn’t get all of the figures and photos fully described in alt text.  I did write up alternative text for the figures in Chapter I, and a few of the photos have semi-decent captions, but this was something we didn’t see all the way through, and like I say, that bugs me.  If it also bugs you, please feel free to fork the repository and submit a pull request with good alt text.  Or, if you prefer, you could open an issue and include your suggested alt text that way.  By the image, by the section, by the chapter: whatever you can contribute would be appreciated.

Those image captions, by the way?  In the printed text, they’re laid out as a label (e.g., “Figure 1.02”) and then the caption text follows.  But when the text wraps, it doesn’t wrap below the label.  Instead, it wraps in its own self-contained block instead, with the text fully justified except for the last line, which is centered.  Centered!  So I set up the markup and CSS like this:

<figure>
<figcaption>
<span>Figure 1.02.</span> <span>Effects of a nuclear explosion.</span>
</figcaption>
</figure>
figure figcaption {
display: grid;
grid-template-columns: max-content auto;
gap: 0.75em;
justify-content: center;
text-align: justify;
text-align-last: center;
}

Oh CSS Grid, how I adore thee.  And you too, CSS box alignment.  You made this little bit of historical recreation so easy, it felt like cheating.

Some other things weren’t easy.  The data tables, for example, have a tendency to align columns on the decimal place, even when most but not all of the numbers are integers.  Long, long ago, it was proposed that text-align be allowed a string value, something like text-align: '.', which you could then apply to a table column and have everything line up on that character.  For a variety of reasons, this was never implemented, a fact which frosts my windows to this day.  In general, I mean, though particularly so for this project.  The lack of it made keeping the presentation historically accurate a right pain, one I may get around to writing about, if I ever overcome my shame.  [Editor’s note: he overcame that shame.]

There are two things about the book that we deliberately chose not to faithfully recreate.  The first is the font face.  My best guess is that the book was typeset using something from the Century family, possibly Century Schoolbook (the New version of which was a particular favorite of mine in college).  The very-widely-installed Cambria seems fairly similar, at least to my admittedly untrained eye, and furthermore was designed specifically for screen media, so I went with body text styling no more complicated than this:

body {
font: 1em/1.35 Cambria, Times, serif;
hyphens: auto;
}

I suppose I could have tracked down a free version of Century and used it as a custom font, but I couldn’t justify the performance cost in both download and rendering speed to myself and any future readers.  And the result really did seem close enough to the original to accept.

The second thing we didn’t recreate is the printed-page layout, which is two-column.  That sort of layout can work very well on the book page; it almost always stinks on a Web page.  Thus, the content of the book is rendered online in a single column.  The exceptions are the chapter-ending Bibliography sections and the book’s Index, both of which contain content compact and granular enough that we could get away with the original layout.

There’s a lot more I could say about how this style or that pattern came about, and maybe someday I will, but for now let me leave you with this: all these decisions are subject to change, and open to input.  If you come up with a superior markup scheme for any of the bits of the book, we’re happy to look at pull requests or issues, and to act on them.  It is, as we say in our preface to the online edition, a living project.

We also hope that, by laying bare the grim reality of these horrific weapons, we can contribute in some small way to making them a dead and buried technology.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

## August 07, 2022

### Andy Wingo

#### coarse or lazy?

sweeping, coarse and lazy

One of the things that had perplexed me about the Immix collector was how to effectively defragment the heap via evacuation while keeping just 2-3% of space as free blocks for an evacuation reserve. The original Immix paper states:

To evacuate the object, the collector uses the same allocator as the mutator, continuing allocation right where the mutator left off. Once it exhausts any unused recyclable blocks, it uses any completely free blocks. By default, immix sets aside a small number of free blocks that it never returns to the global allocator and only ever uses for evacuating. This headroom eases defragmentation and is counted against immix's overall heap budget. By default immix reserves 2.5% of the heap as compaction headroom, but [...] is fairly insensitive to values ranging between 1 and 3%.

To Immix, a "recyclable" block is partially full: it contains surviving data from a previous collection, but also some holes in which to allocate. But when would you have recyclable blocks at evacuation-time? Evacuation occurs as part of collection. Collection usually occurs when there's no more memory in which to allocate. At that point any recyclable block would have been allocated into already, and won't become recyclable again until the next trace of the heap identifies the block's surviving data. Of course after the next trace they could become "empty", if no object survives, or "full", if all lines have survivor objects.

In general, after a full allocation cycle, you don't know much about the heap. If you could easily know where the live data and the holes were, a garbage collector's job would be much easier :) Any algorithm that starts from the assumption that you know where the holes are can't be used before a heap trace. So, I was not sure what the Immix paper is meaning here about allocating into recyclable blocks.

Thinking on it again, I realized that Immix might trigger collection early sometimes, before it has exhausted the previous cycle's set of blocks in which to allocate. As we discussed earlier, there is a case in which you might want to trigger an early compaction: when a large object allocator runs out of blocks to decommission from the immix space. And if one evacuating collection didn't yield enough free blocks, you might trigger the next one early, reserving some recyclable and empty blocks as evacuation targets.

when do you know what you know: lazy and eager

Consider a basic question, such as "how many bytes in the heap are used by live objects". In general you don't know! Indeed you often never know precisely. For example, concurrent collectors often have some amount of "floating garbage" which is unreachable data but which survives across a collection. And of course you don't know the difference between floating garbage and precious data: if you did, you would have collected the garbage.

Even the idea of "when" is tricky in systems that allow parallel mutator threads. Unless the program has a total ordering of mutations of the object graph, there's no one timeline with respect to which you can measure the heap. Still, Immix is a stop-the-world collector, and since such collectors synchronously trace the heap while mutators are stopped, these are times when you can exactly compute properties about the heap.

Let's retake the question of measuring live bytes. For an evacuating semi-space, knowing the number of live bytes after a collection is trivial: all survivors are packed into to-space. But for a mark-sweep space, you would have to compute this information. You could compute it at mark-time, while tracing the graph, but doing so takes time, which means delaying the time at which mutators can start again.

Alternately, for a mark-sweep collector, you can compute free bytes at sweep-time. This is the phase in which you go through the whole heap and return any space that wasn't marked in the last collection to the allocator, allowing it to be used for fresh allocations. This is the point in the garbage collection cycle in which you can answer questions such as "what is the set of recyclable blocks": you know what is garbage and you know what is not.

Though you could sweep during the stop-the-world pause, you don't have to; sweeping only touches dead objects, so it is correct to allow mutators to continue and then sweep as the mutators run. There are two general strategies: spawn a thread that sweeps as fast as it can (concurrent sweeping), or make mutators sweep as needed, just before they allocate (lazy sweeping). But this introduces a lag between when you know and what you know—your count of total live heap bytes describes a time in the past, not the present, because mutators have moved on since then.

For most collectors with a sweep phase, deciding between eager (during the stop-the-world phase) and deferred (concurrent or lazy) sweeping is very easy. You don't immediately need the information that sweeping allows you to compute; it's quite sufficient to wait until the next cycle. Moving work out of the stop-the-world phase is a win for mutator responsiveness (latency). Usually people implement lazy sweeping, as it is naturally incremental with the mutator, naturally parallel for parallel mutators, and any sweeping overhead due to cache misses can be mitigated by immediately using swept space for allocation. The case for concurrent sweeping is less clear to me, but if you have cores that would otherwise be idle, sure.

eager coarse sweeping

Immix is interesting in that it chooses to sweep eagerly, during the stop-the-world phase. Instead of sweeping irregularly-sized objects, however, it sweeps over its "line mark" array: one byte for each 128-byte "line" in the mark space. For 32 kB blocks, there will be 256 bytes per block, and line mark bytes in each 4 MB slab of the heap are packed contiguously. Therefore you get relatively good locality, but this just mitigates a cost that other collectors don't have to pay. So what does eager marking over these coarse 128-byte regions buy Immix?

Firstly, eager sweeping buys you eager identification of empty blocks. If your large object space needs to steal blocks from the mark space, but the mark space doesn't have enough empties, it can just trigger collection and then it knows if enough blocks are available. If no blocks are available, you can grow the heap or signal out-of-memory. If the lospace (large object space) runs out of blocks before the mark space has used all recyclable blocks, that's no problem: evacuation can move the survivors of fragmented blocks into these recyclable blocks, which have also already been identified by the eager coarse sweep.

Without eager empty block identification, if the lospace runs out of blocks, firstly you don't know how many empty blocks the mark space has. Sweeping is a kind of wavefront that moves through the whole heap; empty blocks behind the wavefront will be identified, but those ahead of the wavefront will not. Such a lospace allocation would then have to either wait for a concurrent sweeper to advance, or perform some lazy sweeping work. The expected latency of a lospace allocation would thus be higher, without eager identification of empty blocks.

Secondly, eager sweeping might reduce allocation overhead for mutators. If allocation just has to identify holes and not compute information or decide on what to do with a block, maybe it go brr? Not sure.

lines, lines, lines

The original Immix paper also notes a relative insensitivity of the collector to line size: 64 or 256 bytes could have worked just as well. This was a somewhat surprising result to me but I think I didn't appreciate all the roles that lines play in Immix.

Obviously line size affect the worst-case fragmentation, though this is mitigated by evacuation (which evacuates objects, not lines). This I got from the paper. In this case, smaller lines are better.

Line size affects allocation-time overhead for mutators, though which way I don't know: scanning for holes will be easier with fewer lines in a block, but smaller lines would contain more free space and thus result in fewer collections. I can only imagine though that with smaller line sizes, average hole size would decrease and thus medium-sized allocations would be harder to service. Something of a wash, perhaps.

However if we ask ourselves the thought experiment, why not just have 16-byte lines? How crazy would that be? I think the impediment to having such a precise line size would mainly be Immix's eager sweep, as a fine-grained traversal of the heap would process much more data and incur possibly-unacceptable pause time overheads. But, in such a design you would do away with some other downsides of coarse-grained lines: a side table of mark bytes would make the line mark table redundant, and you eliminate much possible "dark matter" hidden by internal fragmentation in lines. You'd need to defer sweeping. But then you lose eager identification of empty blocks, and perhaps also the ability to evacuate into recyclable blocks. What would such a system look like?

Readers that have gotten this far will be pleased to hear that I have made some investigations in this area. But, this post is already long, so let's revisit this in another dispatch. Until then, happy allocations in all regions.

## August 04, 2022

### Igalia Compilers Team

#### Igalia’s Compilers Team in 2022H1

As we enter the second half of 2022, we’d like to provide a summary (necessarily highly condensed and selective!) of what we’ve been up to recently, providing some insight into the breadth of technical challenges our team of over 20 compiler engineers has been tackling.

## Low-level JS / JSC on 32-bit systems

We have continued to maintain support for 32-bit systems (mainly ARMv7, but also MIPS) in JavaScriptCore (JSC). The work involves continuous tracking of upstream development to prevent regressions as well as the development of new features:

• A major milestone for this has been the completion of support for WebAssembly in the Low-Level Interpreter (LLInt) for ARMv7. The MIPS support is mostly complete.
• Developed an initial prototype of the concurrency compilation in the DFG tier for 32-bit systems, and the results are promising. The work continues, and we expect to upstream it in 2022H2.
• Code reduction and optimizations: we upstreamed several code reductions and optimizations for 32-bit systems (mainly ARMv7): 25% size reduction in DFGOSRExit blocks, 24% in baseline JIT on JetStream2 and 25% code size reduction from porting EXTRA_CTI_THUNKS.
• Improved our hardware testing infrastructure with more MIPS and faster ARMv7 hardware for the buildbots running in the EWS (Early Warning System), which allows for a smaller response time for regressions.
• Deployed two fuzzing bots that run test JSC 24/7. The bots already found a few issues upstream that we reported to Apple. The bugs that affect 64-bit systems were fixed by the team at Apple, while we are responsible for fixing the ones affecting 32-bit systems. We expect to work on them in 2022H2.
• Added logic to transparently re-run failing JSC tests (on 32-bit platforms) and declare them a pass if they’re simply flaky, as long as the flakiness does not rise above a threshold. This means fewer false alerts for developers submitting patches to the EWS and for the people doing QA work. Naturally, the flakiness information is stored in the WebKit resultsdb and visualized at results.webkit.org.

## JS and standards

Another aspect of our work is our contribution to the JavaScript standards effort, through involvement in the TC39 standards body, direct contribution to standards proposals, and implementation of those proposals in the major JS engines.

• Further coverage of the Temporal spec in the Test262 conformance suite, as well as various specification updates. See this blog post for insight into some of the challenges tackled by Temporal.
• Performance improvements for JS class features in V8, such as faster initialisations.
• Work towards supporting snapshots in node.js, including activities such as fixing support for V8 startup snapshots in the presence of class field initializers.
• Collaborating with others on the “types as comments” proposal for JS, successfully reaching stage 1 in the TC39 process.
• Implementing ShadowRealm support in WebKit

## WebAssembly

WebAssembly is a low-level compilation target for the web, which we have contributed to in terms of specification proposals, LLVM toolchain modifications, implementation work in the JS engines, and working with customers on use cases both on the server and in web browsers.

Some highlights from the last 6 months include:

• Creation of a proposal for reference-typed strings in WebAssembly to ensure efficient operability with languages like JavaScript. We also landed patches to implement this proposal in V8.
• Prototyping Just-In-Time (JIT) compilation within WebAssembly.
• Working to implement support for WebAssembly GC types in Clang and LLVM (with one important use case being efficient and leak-free sharing of object graphs between JS and C++ compiled to Wasm).
• Implementation of support for GC types in WebKit’s implementation of WebAssembly.

## Events

With in-person meetups becoming possible again, Igalians have been talking on a range of topics – Multi-core Javascript (BeJS), TC39 (JS Nation), RISC-V LLVM (Cambridge RISC-V meetup), and more.

We’ve also had opportunities for much-needed face to face time within the team, with many of the compilers team meeting in Brussels in May, and for the company-wide summit held in A Coruña in June. These events provided a great opportunity to discuss current technical challenges, strategy, and ideas for the future, knowledge sharing, and of course socialising.

## Team growth

Our team has grown further this year, being joined by:

• Nicolò Ribaudo – a core maintainer of BabelJS, continuing work on that project after joining Igalia in June as well as contributing to work on JS modules.
• Aditi Singh – previously worked with the team through the Coding Experience program, joining full time in March focusing on the Temporal project.
• Alex Bradbury – a long-time LLVM developer who joined in March and is focusing on WebAssembly and RISC-V work in Clang/LLVM.

We’re keen to continue to grow the team and actively hiring, so if you think you might be interested in working in any of the areas discussed above, please apply here.