Planet Igalia

December 01, 2023

Brian Kardell

The Effects of Nuclear Maths


Carrying on Eric's work on “The Effects of Nuclear Weapons” - with native MathML.

As I have explained a few times since first writing about it in Harold Crick and the Web Platform in Jan 2019, I'm interested in MathML (and SVG). I'm interested in them because they are historically special: integrated into the HTML specification/parser even. Additionally, their lineage makes them unnecessarily weird and "other". Worse, they're dramatically under-invested in/de-prioritized. I believe all that really needs to change. Note that none of that says that I need MathML, personally, for the stuff that I write - but rather that I understand why it is that way and the importance for the web at large (and society). This is also the first time I've helped advocate for a thing I didn't have a bunch of first-hand experience with.

Since then, I've worked with the Math Community Group to create MathML-Core to resolve those historical issues, get a W3C TAG review, an implementation in Chromium (also done by Igalians) and recharter the MathML Working Group (which I now co-chair).

For authors, the status of things is way better. Practically speaking, you can use MathML and render some pretty darn nice looking maths for a whole lot of stuff. Some of the integrations with CSS specifically aren't aligned, but they didn't exist at all before, so maybe that’s ok.

Despite all of this, it seems lots of people and systems still use MathJax to render as SVG, sometimes also rendering MathML for accessibility reasons. But, why? It's not great for performance if you don't have to do it.

I really wanted to better understand. So I decided to tackle some test projects myself and to take you all along with me by documenting what I do, as I do it.

Nuking all the MathJax (references)

Last year, my colleague Eric Meyer wrote a piece about some work he'd been doing on a passion project Recreating “The Effects of Nuclear Weapons” for the Web, in which they bring a famous book to the web as faithfully as they can.

This piece includes a lot of math - hundreds of equations.

You can see the result of their work on atomicarchive.com. The first two chapters aren't so math heavy, but Chapter 3 alone contains 179 equations!

It looks good!

Luckily, it's on github.

Oh hey! These are just static HTML files!

Inside, I see that each file contains a block like this:

<script id="MathJax-script" async src="mathjax/tex-chtml.js"></script>
<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
MathJax.Hub.Config({
  CommonHTML: {
    linebreaks: {automatic: true}
  }
});
</script>

These lines configure how MathJax works, including how math is identified in the page. If you scan through the source of, say, chapter 3, you will see lots of things like this:

<p>where $c_0$ is the ambient speed of sound (ahead of the shock front), $p$ is the peak overpressure (behind the shock front), $P_0$ is the ambient pressure (ahead of the shock), and $\gamma$ is the ratio of the specific heats of the medium, i.e., air. If $\gamma$ is taken as 1.4, which is the value at moderate temperatures, the equation for the shock velocity becomes</p>
    
\[
U = c_0 \left( 1 + \frac{6p}{7P_0} \right)^{1/2},
\]
Code sample from chapter 3

What you can see here is that the text is peppered with those delimiters from the configuration, and inside is TeX.

People don't like writing *ML

I just want to take a minute and address the elephant in the room: Most people don't want to write MathML, and that's not as damning as it sounds. Case in point, as I am writing this post, I am not writing <p> and <h1> and <section>s and so on… I'm writing markdown, as the gods intended.

TimBL’s original idea for HTML wasn't that people would write it - it was that machines would write it. His first browser was an editor too - but a relatively basic rich text editor, a lot like the ones you use every day on github issues or a million other places. The very first project Tim did with HTML was the CERN phone book which did all kinds of complicated stuff to dynamically write HTML - the same way we do today. In fact, it feels like 90% of engineering seems to be transforming strings into other strings: And it's nothing new. TeX has been around almost as long as I've been alive and it's a fine thing to process.

This format, with TeX tokens potentially appearing anywhere in the document, is far better suited to processing at build time or on a server than on the client. On the client we'd more ideally work with the tree and manage re-flow and so on. Here the tree is irrelevant - in the way, even - because there are some conflicting escapes or encodings… But it's just a string... And there are lots of things that can process that string and turn it into MathML.

My colleague Fred Wang has an open source project called TeXZilla, so I'm going to use that.

Nuking the scripts

First things first, let's get rid of those script tags. Since they contain the same tokens we'll be searching for, and we'll be processing the whole document, they'd just cause problems anyway.

Ok. Done.

Next, I check out the TeXZilla project parallel to my book project and try to use the ‘streamfilter’, which lets me just cat and | (pipe) a file to process it…

 >cat chapter1.html | node "../../texzilla/node_modules/texzilla/TeXZilla.js" streamfilter

Hmm… It fails with an error that looks something like this (not exactly if you're following along, this is from a different file):

>cat chapter1.html | node "../../texzilla/node_modules/texzilla/TeXZilla.js" streamfilter

Error: Parse error on line 145:
...}\text{Neutron + }&\!\left\{\enspace
---------------------^
Expecting 'ENDMATH1', got 'COLSEP'
.....

I don’t know TeX, so this is kind of a mystery to me. I search for the quoted line and have a look. I find some code surrounded by \begin{align} and \end{align}. I turn to the interwebs and find some examples where it says \begin{aligned} and \end{aligned}, so I try just grep/changing it across the files and, sure enough, it processes further. Which is right? Who knows - let's move on...

I find a similar error like...

Error: Parse error on line 759:
...xt{ rads. } \; \textit{Answer}\]</asi
-----------------------^
Expecting '{', got 'TEXTARG'

Once again on the interwebs I find some code like Textit{...} (uppercase T) - so, I try that. Sure enough, it works fine. Or wait... Does it? No, it just leaves that code untransformed now.

But… Now we have MathML! At least in some chapters!

With a little more looking, I find that this is more a quirk of TeXZilla than a problem with the TeX itself, so I turn those textit{...} into mathit{...} and now we're good. Finally. Burned some time on that, but not too bad.

I encounter two other problems along the way: First, escaping. The source contains ampersand escapes for a few Unicode characters, which aren't a problem when you're accessing it from body.innerText or something, but are a problem here. Still, it takes only a couple of minutes more to replace them with their actual Unicode counterparts (&phi; ⇒ φ, &lambda; ⇒ λ, &times; ⇒ ×).

When I’m done I can send that to a file, something like this…

>cat chapter3.html | node "../../texzilla/node_modules/texzilla/TeXZilla.js" streamfilter > output.html

And then just open it in a browser.

There are some differences. MathJax adds a margin to math that's on its own line (block) that isn't there natively. The math formulas that are ‘tabular’ aren't aligning their text (the =), and the font is maybe a little small. After a little fooling around, I add this snip to the stylesheet:

math[display="block" i] {
  margin: 1rem 0;
}

math {
    font-size: 1rem;
}

mfrac {
    padding-inline-start: 0.275rem;
    padding-inline-end: 0.275rem;
}

mtd {
  text-align: left;
}

And the last problem I hit? An error in the actual inline TeX in the source, I think, where it was missing a $. MathJax simply left that un-transformed, just like my previous example because it lacked the closing $, but TeXZilla was all errory about it. I sent a pull request, and I guess we'll see.

And like… that's it.

All in all, the processing time here including learning the tools and figuring out the quirks in a language I'm unfamiliar with is maybe half a day - but that's scattered across short bits of free time here and there.

Is it done? Is it good enough? It definitely needs some proofing to see what I've missed (I'm sure there's some!), and good scrutiny generally, but... It looks pretty good at first glance, right (below)?

Native MathML on the left, MathJax on the right

You can browse it at:

https://bkardell.com/effects-of-nuclear-math/html/.

And, send me an issue if you find anything that should be improved before I eventually send a pull request, or you see something that looks bad enough that you wouldn't switch to native MathML for this.

But, all in all: I feel like you can make pretty good use of native MathML at this point.

December 01, 2023 05:00 AM

November 26, 2023

Alex Bradbury

Storing data in pointers

Introduction

On mainstream 64-bit systems, the maximum bit-width of a virtual address is somewhat lower than 64 bits (commonly 48 bits). This gives an opportunity to repurpose those unused bits for data storage, if you're willing to mask them out before using your pointer (or have a hardware feature that does that for you - more on this later). I wondered what happens to userspace programs relying on such tricks as processors gain support for wider virtual addresses, hence this little blog post. TL;DR is that there's no real change unless certain hint values to enable use of wider addresses are passed to mmap, but read on for more details as well as other notes about the general topic of storing data in pointers.

Storage in upper bits assuming 48-bit virtual addresses

Assuming your platform has 48-bit wide virtual addresses, this is pretty straightforward. You can stash whatever you want in those 16 bits, but you'll need to ensure you mask them out for every load and store (which is cheap, but has at least some cost) and would want to be confident that there are no other attempted users of these bits. The masking would be slightly different in kernel space due to rules on how the upper bits are set.
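For the userspace case, a minimal sketch of the tag-and-mask helpers might look something like this (the names and the 16-bit tag layout are mine, purely illustrative):

#include <stdint.h>

/* Assumes 48-bit virtual addresses: bits 63:48 are unused in userspace
   pointers and can carry a 16-bit tag. */
#define TAG_SHIFT 48
#define ADDR_MASK ((1ULL << TAG_SHIFT) - 1)

static inline void *tag_ptr(void *p, uint16_t tag) {
    return (void *)(((uintptr_t)p & ADDR_MASK) | ((uint64_t)tag << TAG_SHIFT));
}

static inline uint16_t get_tag(void *p) {
    return (uint16_t)((uintptr_t)p >> TAG_SHIFT);
}

static inline void *strip_tag(void *p) {
    /* Must happen before every dereference; userspace addresses have the
       upper bits clear, so masking them out is enough. */
    return (void *)((uintptr_t)p & ADDR_MASK);
}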

What if virtual addresses are wider than 48 bits?

So we've covered the easy case, where you can freely (ab)use the upper 16 bits for your own purposes. But what if you're running on a system that has wider than 48 bit virtual addresses? How do you know that's the case? And is it possible to limit virtual addresses to the 48-bit range if you're sure you don't need the extra bits?

You can query the virtual address width from the command-line by catting /proc/cpuinfo, which might include a line like address sizes : 39 bits physical, 48 bits virtual. I'd hope there's a way to get the same information without parsing /proc/cpuinfo, but I haven't been able to find it.

As for how to keep using those upper bits on a system with wider virtual addresses, helpfully the behaviour of mmap is defined with this compatibility in mind. It's explicitly documented for x86-64, for AArch64 and for RISC-V that addresses beyond 48-bits won't be returned unless a hint parameter beyond a certain width is used (the details are slightly different for each target). This means if you're confident that nothing within your process is going to be passing such hints to mmap (including e.g. your malloc implementation), or at least that you'll never need to try to reuse upper bits of addresses produced in this way, then you're free to presume the system uses no more than 48 bits of virtual address.
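As a rough illustration of the hint behaviour (a sketch assuming x86-64 Linux; the exact thresholds and behaviour differ per architecture and kernel configuration):

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* No hint: the kernel keeps the mapping within the default 47-bit
       userspace range, even if the CPU supports wider virtual addresses. */
    void *lo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* A hint above the 48-bit boundary opts this mapping in to the wider
       address space on kernels/CPUs that support it; otherwise the kernel
       simply falls back to a lower address. */
    void *hi = mmap((void *)(1ULL << 48), 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("no hint:   %p\nhigh hint: %p\n", lo, hi);
    return 0;
}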

Top byte ignore and similar features

Up to this point I've completely skipped over the various architectural features that allow some of the upper bits to be ignored upon dereference, essentially providing hardware support for this type of storage of additional data within pointers by making additional masking unnecessary.

  • x86-64 keeps things interesting by having slightly different variants of this for Intel and AMD.
    • Intel introduced Linear Address Masking (LAM), documented in chapter 6 of their document on instruction set extensions and future features. If enabled, this modifies the canonicality check so that, for instance, on a system with 48-bit virtual addresses bit 47 must be equal to bit 63. This would allow bits 62:48 (15 bits) to be freely used with no masking needed. "LAM57" allows 62:57 to be used (6 bits). It seems as if Linux is currently opting to only support LAM57 and not LAM48. Support for LAM can be configured separately for user and supervisor mode, but I'll refer you to the Intel docs for details.
    • AMD instead describes Upper Address Ignore (see section 5.10) which allows bits 63:57 (7 bits) to be used, and unlike LAM doesn't require bit 63 to match the upper bit of the virtual address. As documented in LWN, this caused some concern from the Linux kernel community. Unless I'm missing it, there doesn't seem to be any level of support merged in the Linux kernel at the time of writing.
  • RISC-V has the proposed pointer masking extension which defines new supervisor-level extensions Ssnpm, Smnpm, and Smmpm to control it. These allow PMLEN to potentially be set to 7 (masking the upper 7 bits) or 16 (masking the upper 16 bits). In usual RISC-V style, it's not mandated which of these are supported, but the draft RVA23 profile mandates that PMLEN=7 must be supported at a minimum. Eagle-eyed readers will note that the proposed approach has the same issue that caused concern with AMD's Upper Address Ignore, namely that the most significant bit is no longer required to be the same as the top bit of the virtual address. This is noted in the spec, with the suggestion that this is solvable at the ABI level and some operating systems may choose to mandate that the MSB not be used for tagging.
  • AArch64 has the Top Byte Ignore (TBI) feature, which as the name suggests just means that the top 8 bits of a virtual address are ignored when used for memory accesses and can be used to store data. Any other bits between the virtual address width and top byte must be set to all 0s or all 1s, as before. TBI is also used by Arm's Memory Tagging Extension (MTE), which uses 4 of those bits as the "key" to be compared against the "lock" tag bits associated with a memory location being accessed. Armv8.3 defines another potential consumer of otherwise unused address bits, pointer authentication which uses 11 to 31 bits depending on the virtual address width if TBI isn't being used, or 3 to 23 bits if it is.

A relevant historical note that multiple people pointed out: the original Motorola 68000 had a 24-bit address bus and so the top byte was simply ignored which caused well documented porting issues when trying to expand the address space.

Storing data in least significant bits

Another commonly used trick I'd be remiss not to mention is repurposing a small number of the least significant bits in a pointer. If you know a certain set of pointers will only ever be used to point to memory with a given minimal alignment, you can exploit the fact that the lower bits corresponding to that alignment will always be zero and store your own data there. As before, you'll need to account for the bits you repurpose when accessing the pointer - in this case either by masking, or by adjusting the offset used to access the address (if those least significant bits are known).
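A minimal sketch of this low-bits variant, assuming the pointee is at least 8-byte aligned (the helper names are illustrative):

#include <assert.h>
#include <stdint.h>

#define LOW_TAG_MASK 0x7u   /* 8-byte alignment frees the low 3 bits */

static inline void *set_low_tag(void *p, unsigned tag) {
    assert(((uintptr_t)p & LOW_TAG_MASK) == 0 && tag <= LOW_TAG_MASK);
    return (void *)((uintptr_t)p | tag);
}

static inline unsigned get_low_tag(void *p) {
    return (unsigned)((uintptr_t)p & LOW_TAG_MASK);
}

static inline void *clear_low_tag(void *p) {
    return (void *)((uintptr_t)p & ~(uintptr_t)LOW_TAG_MASK);
}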

As suggested by Per Vognsen, after this article was first published, you can exploit x86's scaled index addressing mode to use up to 3 bits that are unused due to alignment, while storing your data in the upper bits. The scaled index addressing mode means there's no need for separate pointer manipulation upon access. e.g. for an 8-byte aligned address, store it right-shifted by 3 and use the top 3 bits for metadata, then scale by 8 using SIB when accessing (which effectively ignores the top 3 bits). This has some trade-offs, but is such a neat trick I felt I had to include it!
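Here is my rough reading of that trick as C (a sketch only; the helper names are hypothetical, and whether the multiply really folds into a scaled-index addressing mode is up to the compiler):

#include <stdint.h>

/* Store an 8-byte-aligned pointer right-shifted by 3, which frees the top
   3 bits of the word for metadata. */
static inline uint64_t pack(const uint64_t *p, unsigned meta) {
    return ((uint64_t)(uintptr_t)p >> 3) | ((uint64_t)meta << 61);
}

static inline unsigned packed_meta(uint64_t packed) {
    return (unsigned)(packed >> 61);
}

static inline uint64_t deref_packed(uint64_t packed) {
    /* packed << 3 (i.e. packed * 8) shifts the metadata bits off the top and
       recovers the address; on x86-64 a compiler can encode that multiply in
       the addressing mode (e.g. mov rax, [rdi*8]), so no separate masking
       instruction is needed. */
    return *(const uint64_t *)(uintptr_t)(packed << 3);
}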

Some real-world examples

To state what I hope is obvious, this is far from an exhaustive list. The point of this quick blog post was really to discuss cases where additional data is stored alongside a pointer, but of course unused bits can also be exploited to allow a more efficient tagged union representation (and this is arguably more common), so I've included some examples of that below:

  • The fixie trie is a variant of the trie that uses 16 bits in each pointer to store a bitmap used as part of the lookup logic. It also exploits the minimum alignment of pointers to repurpose the least significant bit to indicate if a value is a branch or a leaf.
  • On the topic of storing data in the least significant bits, we have a handy PointerIntPair class in LLVM to allow the easy implementation of this optimisation. There's also an 'alignment niches' proposal for Rust which would allow this kind of optimisation to be done automatically for enums (tagged unions). Another example of repurposing the LSB found in the wild would be the Linux kernel using it for a spin lock (thanks Vegard Nossum for the tip, who notes this is used in the kernel's directory entry cache hashtable). There are surely many many more examples.
  • Go repurposes both upper and lower bits in its taggedPointer, used internally in its runtime implementation.
  • If you have complete control over your heap then there's more you can do to make use of embedded metadata, including using additional bits by avoiding allocation outside of a certain range and using redundant mappings to avoid or reduce the need for masking. OpenJDK's ZGC is a good example of this, utilising a 42-bit address space for objects and upon allocation mapping pages to different aliases to allow pointers using their metadata bits to be dereferenced without masking.
  • A fairly common trick in language runtimes is to exploit the fact that values can be stored inside the payload of double floating point NaN (not a number) values and overlap it with pointers (knowing that the full 64 bits aren't needed) and even small integers. There's a nice description of this in JavaScriptCore, but it was famously used in LuaJIT. Andy Wingo also has a helpful write-up. Along similar lines, OCaml steals just the least significant bit in order to efficiently support unboxed integers (meaning integers are 63-bit on 64-bit platforms and 31-bit on 32-bit platforms). A rough sketch of the NaN-boxing idea is shown after this list.
  • Apple's Objective-C implementation makes heavy use of unused pointer bits, with some examples documented in detail on Mike Ash's excellent blog. Inlining the reference count (falling back to a hash lookup upon overflow) is a fun one. Another example of using the LSB to store small strings in-line is squoze.
  • V8 opts to limit the heap used for V8 objects to 4GiB using pointer compression, where an offset is used alongside the 32-bit value (which itself might be a pointer or a 31-bit integer, depending on the least significant bit) to refer to the memory location.
  • As this list is becoming more of a collection of things slightly outside the scope of this article I might as well round it off with the XOR linked list, which reduces the storage requirements for doubly linked lists by exploiting the reversibility of the XOR operation.
  • I've focused on storing data in conventional pointers on current commodity architectures but there is of course a huge wealth of work involving tagged memory (also an area where I've dabbled - something for a future blog post perhaps) and/or alternative pointer representations. I've touched on this with MTE (mentioned due to its interaction with TBI), but another prominent example is of course CHERI which moves to using 128-bit capabilities in order to fit in additional inline metadata. David Chisnall provided some observations based on porting code to CHERI that relies on the kind of tricks described in this post.
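Since the NaN-boxing trick mentioned above is easier to see in code, here is a very rough sketch. The tag layout is my own and purely illustrative (it is not JavaScriptCore's or LuaJIT's), and a real implementation would also canonicalize NaNs produced by arithmetic so they can't collide with boxed values:

#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t value;

/* Bit patterns in the NaN space that ordinary (canonicalized) doubles never
   use; the low 48 bits are left free for a pointer payload. */
#define BOX_MASK 0xffff000000000000ULL
#define PTR_TAG  0xfffa000000000000ULL

static inline value box_double(double d) {
    value v;
    memcpy(&v, &d, sizeof v);   /* assumes NaNs were canonicalized upstream */
    return v;
}

static inline double unbox_double(value v) {
    double d;
    memcpy(&d, &v, sizeof d);
    return d;
}

static inline value box_pointer(void *p) {
    assert(((uintptr_t)p & BOX_MASK) == 0);  /* pointer fits in 48 bits */
    return PTR_TAG | (uintptr_t)p;
}

static inline int is_pointer(value v) {
    return (v & BOX_MASK) == PTR_TAG;
}

static inline void *unbox_pointer(value v) {
    return (void *)(uintptr_t)(v & ~BOX_MASK);
}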

Fin

What did I miss? What did I get wrong? Let me know on Mastodon or email (asb@asbradbury.org).

You might be interested in the discussion of this article on lobste.rs, on HN, on various subreddits, or on Mastodon.


Article changelog
  • 2023-11-27:
    • Mention Squoze (thanks to Job van der Zwan).
    • Reworded the intro so as not to claim "it's quite well known" that the maximum virtual address width is typically less than 64 bits. This might be interpreted as shaming readers for not being aware of that, which wasn't my intent. Thanks to HN reader jonasmerlin for pointing this out.
    • Mention CHERI is the list of "real world examples" which is becoming dominated by instances of things somewhat different to what I was describing! Thanks to Paul Butcher for the suggestion.
    • Link to relevant posts on Mike Ash's blog (suggested by Jens Alfke).
    • Link to the various places this article is being discussed.
    • Add link to M68k article suggested by Christer Ericson (with multiple others suggesting something similar - thanks!).
  • 2023-11-26:
    • Minor typo fixes and rewordings.
    • Note the Linux kernel repurposing the LSB as a spin lock (thanks to Vegard Nossum for the suggestion).
    • Add SIB addressing idea shared by Per Vognsen.
    • Integrate note suggested by Andy Wingo that explicit masking often isn't needed when the least significant pointer bits are repurposed.
    • Add a reference to Arm Pointer Authentication (thanks to the suggestion from Tzvetan Mikov).
  • 2023-11-26: Initial publication date.

November 26, 2023 12:00 PM

November 24, 2023

Andy Wingo

tree-shaking, the horticulturally misguided algorithm

Let's talk about tree-shaking!

But first, I need to talk about WebAssembly's dirty secret: despite the hype, WebAssembly has had limited success on the web.

There is Photoshop, which does appear to be a real success. 5 years ago there was Figma, though they don't talk much about Wasm these days. There are quite a number of little NPM libraries that use Wasm under the hood, usually compiled from C++ or Rust. I think Blazor probably gets used for a few in-house corporate apps, though I could be fooled by their marketing.

But then you might recall the hyped demos of 3D first-person-shooter games with Unreal engine, but that was 5 years and one major release of Unreal ago: the current Unreal 5 does not support targetting WebAssembly.

Don't get me wrong, I think WebAssembly is great. It is having fine success in off-the-web environments, and I think it is going to be a key and growing part of the Web platform. I suspect, though, that we are only just now getting past the trough of disillusionment.

It's worth reflecting a bit on the nature of web Wasm's successes and failures. Taking Photoshop as an example, I think we can say that Wasm does very well at bringing large C++ programs to the web. I know that it took quite some work, but I understand the end result to be essentially the same source code, just compiled for a different target.

Similarly for the JavaScript module case, Wasm finds success in getting legacy C++ code to the web, and as a way to write new web-targetting Rust code. These are often tasks that JavaScript doesn't do very well at, or which need a shared implementation between client and server deployments.

On the other hand, WebAssembly has not been a Web success for DOM-heavy apps. Nobody is talking about rewriting wordpress.com in Wasm, for example. Why is that? It may sound like a silly question to you: Wasm just isn't good at that stuff. But why? If you dig down a bit, I think it's that the programming models are just too different: the Web's primary programming model is JavaScript, a language with dynamic typing and managed memory, whereas WebAssembly 1.0 was about static typing and linear memory. Getting to the DOM from Wasm was a hassle that was overcome only by the most ardent of the true Wasm faithful.

Relatedly, Wasm has also not really been a success for languages that aren't, like, C or Rust. I am guessing that wordpress.com isn't written mostly in C++. One of the sticking points for this class of language is that C#, for example, will want to ship with a garbage collector, and that it is annoying to have to do this. Check my article from March this year for more details.

Happily, this restriction is going away, as all browsers are going to ship support for reference types and garbage collection within the next months; Chrome and Firefox already ship Wasm GC, and Safari shouldn't be far behind thanks to the efforts from my colleague Asumu Takikawa. This is an extraordinarily exciting development that I think will kick off a whole 'nother Gartner hype cycle, as more languages start to update their toolchains to support WebAssembly.

if you don't like my peaches

Which brings us to the meat of today's note: web Wasm will win where compilers create compact code. If your language's compiler toolchain can manage to produce useful Wasm in a file that is less than a handful of over-the-wire kilobytes, you can win. If your compiler can't do that yet, you will have to instead rely on hype and captured audiences for adoption, which at best results in an unstable equilibrium while you figure out what's next.

In the JavaScript world, managing bloat and deliverable size is a huge industry. Bundlers like esbuild are a ubiquitous part of the toolchain, compiling down a set of JS modules to a single file that should include only those functions and data types that are used in a program, and additionally applying domain-specific size-squishing strategies such as minification (making monikers more minuscule).

Let's focus on tree-shaking. The visual metaphor is that you write a bunch of code, and you only need some of it for any given page. So you imagine a tree whose, um, branches are the modules that you use, and whose leaves are the individual definitions in the modules, and you then violently shake the tree, probably killing it and also annoying any nesting birds. The only thing that's left still attached is what is actually needed.

This isn't how trees work: holding the trunk doesn't give you information as to which branches are somehow necessary for the tree's mission. It also primes your mind to look for the wrong fixed point, removing unneeded code instead of keeping only the necessary code.

But, tree-shaking is an evocative name, and so despite its horticultural and algorithmic inaccuracies, we will stick to it.

The thing is that maximal tree-shaking for languages with a thicker run-time has not been a huge priority. Consider Go: according to the golang wiki, the most trivial program compiled to WebAssembly from Go is 2 megabytes, and adding imports can make this go to 10 megabytes or more. Or look at Pyodide, the Python WebAssembly port: the REPL example downloads about 20 megabytes of data. These are fine sizes for technology demos or, in the limit, very rich applications, but they aren't winners for web development.

shake a different tree

To be fair, the built-in Wasm support for Go and the Pyodide port of Python both derive from the upstream toolchains, where producing small binaries is nice but not necessary: on a server, who cares how big the app is? And indeed when targetting smaller devices, we tend to see alternate implementations of the toolchain, for example MicroPython or TinyGo. TinyGo has a Wasm back-end that can apparently go down to less than a kilobyte, even!

These alternate toolchains often come with some restrictions or peculiarities, and although we can consider this to be an evil of sorts, it is to be expected that the target platform exhibits some co-design feedback on the language. In particular, running in the sea of the DOM is sufficiently weird that a Wasm-targetting Python program will necessarily be different than a "native" Python program. Still, I think as toolchain authors we aim to provide the same language, albeit possibly with a different implementation of the standard library. I am sure that the ClojureScript developers would prefer to remove their page documenting the differences with Clojure if they could, and perhaps if Wasm becomes a viable target for Clojurescript, they will.

on the algorithm

To recap: now that it supports GC, Wasm could be a winner for web development in Python and other languages. You would need a different toolchain and an effective tree-shaking algorithm, so that user experience does not degrade. So let's talk about tree shaking!

I work on the Hoot Scheme compiler, which targets Wasm with GC. We manage to get down to 70 kB or so right now, in the minimal "main" compilation unit, and are aiming for lower; auxiliary compilation units that import run-time facilities (the current exception handler and so on) from the main module can be sub-kilobyte. Getting here has been tricky though, and I think it would be even trickier for Python.

Some background: like Whiffle, the Hoot compiler prepends a prelude onto user code. Tree-shaking happens in a number of places:

Generally speaking, procedure definitions (functions / closures) are the easy part: you just include only those functions that are referenced by the code. In a language like Scheme, this gets you a long way.
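To make that easy part concrete, a minimal reachability sketch might look like this (the data layout here is hypothetical and has nothing to do with Hoot's actual representation):

#include <stdbool.h>

#define MAX_FUNCS   1024
#define MAX_CALLEES 16

/* Hypothetical per-function record: which other functions it references. */
typedef struct {
    int  callees[MAX_CALLEES];
    int  n_callees;
    bool keep;
} func_info;

/* Mark everything reachable from the entry points; the rest can be dropped. */
static void shake(func_info *funcs, const int *roots, int n_roots) {
    int worklist[MAX_FUNCS];
    int top = 0;

    for (int i = 0; i < n_roots; i++) {
        if (!funcs[roots[i]].keep) {
            funcs[roots[i]].keep = true;
            worklist[top++] = roots[i];
        }
    }
    while (top > 0) {
        const func_info *f = &funcs[worklist[--top]];
        for (int j = 0; j < f->n_callees; j++) {
            int callee = f->callees[j];
            if (!funcs[callee].keep) {      /* first time we reach it */
                funcs[callee].keep = true;
                worklist[top++] = callee;
            }
        }
    }
}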

However there are three immediate challenges. One is that the evaluation model for the definitions in the prelude is letrec*: the scope is recursive but ordered. Binding values can call or refer to previously defined values, or capture values defined later. If evaluating the value of a binding requires referring to a value only defined later, then that's an error. Again, for procedures this is trivially OK, but as soon as you have non-procedure definitions, sometimes the compiler won't be able to prove this nice "only refers to earlier bindings" property. In that case the fixing letrec (reloaded) algorithm will end up residualizing bindings that are set!, which, of all the tree-shaking passes above, require the delicate DCE pass to remove them.

Worse, some of those non-procedure definitions are record types, which have vtables that define how to print a record, how to check if a value is an instance of this record, and so on. These vtable callbacks can end up keeping a lot more code alive even if they are never used. We'll get back to this later.

Similarly, say you print a string via display. Well now not only are you bringing in the whole buffered I/O facility, but you are also calling a highly polymorphic function: display can print anything. There's a case for bitvectors, so you pull in code for bitvectors. There's a case for pairs, so you pull in that code too. And so on.

One solution is to instead call write-string, which only writes strings and not general data. You'll still get the generic buffered I/O facility (ports), though, even if your program only uses one kind of port.

This brings me to my next point, which is that optimal tree-shaking is a flow analysis problem. Consider display: if we know that a program will never have bitvectors, then any code in display that works on bitvectors is dead and we can fold the branches that guard it. But to know this, we have to know what kind of arguments display is called with, and for that we need higher-level flow analysis.

The problem is exacerbated for Python in a few ways. One, because object-oriented dispatch is higher-order programming. How do you know what foo.bar actually means? Depends on foo, which means you have to thread around representations of what foo might be everywhere and to everywhere's caller and everywhere's caller's caller and so on.

Secondly, lookup in Python is generally more dynamic than in Scheme: you have __getattr__ methods (is that it?; been a while since I've done Python) everywhere and users might indeed use them. Maybe this is not so bad in practice and flow analysis can exclude this kind of dynamic lookup.

Finally, and perhaps relatedly, the object of tree-shaking in Python is a mess of modules, rather than a big term with lexical bindings. This is like JavaScript, but without the established ecosystem of tree-shaking bundlers; Python has its work cut out for some years to go.

in short

With GC, Wasm makes it thinkable to do DOM programming in languages other than JavaScript. It will only be feasible for mass use, though, if the resulting Wasm modules are small, and that means significant investment on each language's toolchain. Often this will take the form of alternate toolchains that incorporate experimental tree-shaking algorithms, and whose alternate standard libraries facilitate the tree-shaker.

Welp, I'm off to lunch. Happy wassembling, comrades!

by Andy Wingo at November 24, 2023 11:41 AM

Adrián Pérez

Standalone Binaries With Zig CC and Meson

Have you ever wanted to run a program in some device without needing to follow a complicated, possibly slow build process? In many cases it can be simpler to use zig cc to compile a static binary and call it a day.

Zig What?

The Zig programming language is a new(ish) programming language that may not get as much rep as others like Rust, but the authors have had one of the greatest ideas ever: reuse the chops that Clang has as a cross-compiler, throw in a copy of the sources for the Musl C library, and provide an extremely convenient compiler driver that builds the parts of the C library needed by your program on-demand.

Try the following, it just works, it’s magic:

cat > hi.c <<EOF
#include <stdio.h>

int main() {
    puts("That's all, folks!");
    return 0;
}
EOF
zig cc --target=aarch64-linux-musl -o hi-arm hi.c

Go on, try it. I’ll wait.


All good? Here is what it looks like for me:

% uname -m
x86_64
% file hi-arm
hi-arm: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, with debug_info, not stripped
% ls -lh hi-arm
-rwxrwxr-x+ 1 aperez aperez 82K nov 24 16:08 hi-arm*
% llvm-strip -x --strip-unneeded hi-arm
% ls -lh hi-arm
-rwxrwxr-x+ 1 aperez aperez 8,2K nov 24 16:11 hi-arm*
% 

A compiler running on an x86_64 machine produced a 64-bit ARM program out of thin air (no cross-sysroot, no additional tooling), which does not have any runtime dependency (it is linked statically), weighing 82 KiB, and can be stripped further down to only 8,2 KiB.

While I cannot say much about Zig the programming language itself because I have not had the chance to spend much time writing code in it, I have the compiler installed just because of zig cc.

A Better Example

One of the tools I often want to run in a development board is drm_info, so let’s cross-compile that. This little program prints all the information it can gather from a DRM/KMS device, optionally formatted as JSON (with the -j command line flag) and it can be tremendously useful to understand the capabilities of GPUs and their drivers. My typical usage of this tool is knowing whether WPE WebKit would run in a given embedded device, and as guidance for working on Cog’s DRM platform plug-in.

It is also a great example because it uses Meson as its build system and has a few dependencies—so it is not a trivial example like the one above—but not so many that one would run into much trouble. Crucially, both its main dependencies, json-c and libdrm, are available as WrapDB packages that Meson itself can fetch and configure automatically. Combined with zig cc, this means the only other thing we need is installing Meson.

Well, that is not completely true. We also need to tell Meson how to cross-compile using our fancy toolchain. For this, we can write the following cross file, which I have named zig-aarch64.conf:

[binaries]
c = ['/opt/zigcc-wrapper/cc', 'aarch64-linux-musl']
objcopy = 'llvm-objcopy'
ranlib = 'llvm-ranlib'
strip = 'llvm-strip'
ar = 'llvm-ar'

[target_machine]
system = 'linux'
cpu_family = 'aarch64'
cpu = 'cortex-a53'
endian = 'little'

While ideally it should be possible to set the value for c in the cross file to zig with the needed command line flags, in practice there is a bug that prevents the linker detection in Meson from working correctly. At least until the patch which fixes the issue becomes part of a release, the wrapper script at /opt/zigcc-wrapper/cc will defer to using lld when it notices that the linker version is requested, or otherwise call zig cc:

#! /bin/sh
set -e

# The first argument is the Zig target triple; the rest are the compiler flags.
target=$1
shift

# Meson probes the linker by passing '-Wl,--version'; answer that with
# clang + lld so linker detection works, otherwise defer to zig cc.
if [ "$1" = '-Wl,--version' ] ; then
    exec clang -fuse-ld=lld "$@"
fi
exec zig cc --target="$target" "$@"

Now, this is how we would go about building drm_info; note the use of --default-library=static to ensure that the subprojects for the dependencies are built accordingly and that they can be linked into the resulting executable:

git clone https://gitlab.freedesktop.org/emersion/drm_info.git
meson setup drm_info.aarch64 drm_info \
  --cross-file=zig-aarch64.conf --default-library=static
meson compile -Cdrm_info.aarch64

A short while later we will be kicked back into our shell prompt, where we can check the resulting binary:

% file drm_info.aarch64/drm_info
drm_info.aarch64/drm_info: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, with debug_info, not stripped
% ls -lh drm_info.aarch64/drm_info
-rwxr-xr-x 1 aperez aperez 1,8M nov 24 23:07 drm_info.aarch64/drm_info*
% llvm-strip -x --strip-unneeded drm_info.aarch64/drm_info
% ls -lh drm_info.aarch64/drm_info
-rwxr-xr-x 1 aperez aperez 199K nov 24 23:09 drm_info.aarch64/drm_info*
%

That’s a ~200 KiB drm_info binary that can be copied over to any 64-bit ARM-based device running Linux (the kernel). Magic!

[Image: the General Magic logo, but with its text replaced]

by aperez (adrian@perezdecastro.org) at November 24, 2023 11:32 AM

November 16, 2023

Andy Wingo

a whiff of whiffle

A couple nights ago I wrote about a superfluous Scheme implementation and promised to move on from sheepishly justifying my egregious behavior in my next note, and finally mention some results from this experiment. Well, no: I am back on my bullshit. Tonight I write about a couple of implementation details that discerning readers may find of interest: value representation, the tail call issue, and the standard library.

what is a value?

As a Lisp, Scheme is one of the early "dynamically typed" languages. These days when you say "type", people immediately think propositions as types, mechanized proof of program properties, and so on. But "type" has another denotation which is all about values and almost not at all about terms: one might say that vector-ref has a type, but it's not part of a proof; it's just that if you try to vector-ref a pair instead of a vector, you get a run-time error. You can imagine values as being associated with type tags: annotations that can be inspected at run-time for, for example, the sort of error that vector-ref will throw if you call it on a pair.

Scheme systems usually have a finite set of type tags: there are fixnums, booleans, strings, pairs, symbols, and such, and they all have their own tag. Even a Scheme system that provides facilities for defining new disjoint types (define-record-type et al) will implement these via a secondary type tag layer: for example, all record instances have the same primary tag, and you have to retrieve their record type descriptor to discriminate instances of different record types.

Anyway. In Whiffle there are immediate types and heap types. All values have a low-bit tag which is zero for heap objects and nonzero for immediates. For heap objects, the first word of the heap object has tagging in the low byte as well. The 3-bit heap tag for pairs is chosen so that pairs can just be two words, with no header word. There is another 3-bit heap tag for forwarded objects, which is used by the GC when evacuating a value. Other objects put their heap tags in the low 8 bits of the first word. Additionally there is a "busy" tag word value, used to prevent races when evacuating from multiple threads.
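If it helps to see the shape of that in code, here is an illustrative sketch (the concrete tag values are mine and not necessarily Whiffle's):

#include <stdint.h>

typedef uintptr_t value;

/* A nonzero low bit marks an immediate (e.g. a fixnum); a zero low bit marks
   a pointer to a heap object. The tag values here are illustrative only. */
#define IMMEDIATE_MASK 0x1
#define FIXNUM_TAG     0x1

static inline int is_immediate(value v)   { return (v & IMMEDIATE_MASK) != 0; }
static inline int is_heap_object(value v) { return (v & IMMEDIATE_MASK) == 0; }

static inline value    make_fixnum(intptr_t n) { return ((value)n << 1) | FIXNUM_TAG; }
static inline intptr_t fixnum_value(value v)   { return (intptr_t)v >> 1; }

/* Heap objects (other than pairs, which have no header) keep a tag in the
   low byte of their first word. */
static inline uint8_t heap_tag(value v) {
    return (uint8_t)(*(uintptr_t *)v & 0xff);
}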

Finally, for generational collection of objects that can be "large" -- the definition of large depends on the collector implementation, and is not nicely documented, but is more than, like, 256 bytes -- anyway these objects might need to have space for a "remembered" bit in the objects themselves. This is not the case for pairs but is the case for, say, vectors: even though they are prolly smol, they might not be, and they need space for a remembered bit in the header.

tail calls

When I started Whiffle, I thought, let's just compile each Scheme function to a C function. Since all functions have the same type, clang and gcc will have no problem turning any tail call into a proper tail call.

This intuition was right and wrong: at optimization level -O2, this works great. We don't even do any kind of loop recognition / contification: loop iterations are tail calls and all is fine. (Not the most optimal implementation technique, but the assumption is that for our test cases, GC costs will dominate.)

However, when something goes wrong, I will need to debug the program to see what's up, and so you might think to compile at -O0 or -Og. In that case, somehow gcc does not compile to tail calls. One time while debugging a program I was flummoxed at a segfault during the call instruction; turns out it was just stack overflow, and the call was trying to write the return address into an unmapped page. For clang, I could use the musttail attribute; perhaps I should, to allow myself to debug properly.

Not being able to debug at -O0 with gcc is annoying. I feel like if GNU were an actual thing, we would have had the equivalent of a musttail attribute 20 years ago already. But it's not, and we still don't.

stdlib

So Whiffle makes C, and that C uses some primitives defined as inline functions. Whiffle actually lexically embeds user Scheme code with a prelude, having exposed a set of primitives to that prelude and to user code. The assumption is that the compiler will open-code all primitives, so that the conceit of providing a primitive from the Guile compilation host to the Whiffle guest magically works out, and that any reference to a free variable is an error. This works well enough, and it's similar to what we currently do in Hoot as well.

This is a quick and dirty strategy but it does let us grow the language to something worth using. I think I'll come back to this local maximum later if I manage to write about what Hoot does with modules.

coda

So, that's Whiffle: the Guile compiler front-end for Scheme, applied to an expression that prepends a user's program with a prelude, in a lexical context of a limited set of primitives, compiling to very simple C, in which tail calls are just return f(...), relying on the C compiler to inline and optimize and all that.

Perhaps next up: some results on using Whiffle to test Whippet. Until then, good night!

by Andy Wingo at November 16, 2023 09:11 PM

José Dapena

V8 profiling instrumentation overhead

In the last 2 years, I have been contributing to improve the performance of V8 profiling instrumentation in different scenarios, both for Linux (using Perf) and for Windows (using Windows Performance Toolkit).

My work has been mostly improving the stack walk instrumentation that allows Javascript generated code to show in the stack traces, referring to the original code, in the same way native compiled code in C or C++ can provide that information.

In this blog post I run different benchmarks to determine how much overhead this instrumentation introduces.

How can you enable profiling instrumentation?

V8 implements system-native profiling instrumentation for Javascript code, both for Linux Perf and for Windows Performance Toolkit. However, it is disabled by default.

For Windows, the command line switch --enable-etw-stack-walking instruments the generated Javascript code to emit the source code information for the compiled functions using ETW (Event Tracing for Windows). This gives better insight into the time spent in different functions, especially when stack profiling is enabled in Windows Performance Toolkit.

For Linux Perf, there are several command line switches:
--perf-basic-prof: this emits, for any Javascript generated code, its memory location and a descriptive name (that includes the source location of the function).
--perf-basic-prof-only-functions: same as the previous one, but only emitting functions, excluding other stubs such as regular expressions.
--perf-prof: this is a different way to provide the information for Linux Perf. It generates a more complex format specified here. On top of the addresses of the functions, it also includes source code information, even with details about the exact line of code inside the function that is executed in a sample.
--perf-prof-annotate-wasm: this one extends --perf-prof, to add debugging information to WASM code.
--perf-prof-unwinding-info: This last one also extends --perf-prof, but providing experimental support for unwinding info.

And then, we have interpreted code support. V8 does not generate JIT code for everything, and in many cases it runs interpreted code that calls common builtin code. So, in a stack trace, instead of seeing the Javascript method that eventually runs those methods through the interpreter, we see those common methods. The solution? Using --interpreted-frames-native-stack, which adds additional information identifying which Javascript method in the stack is actually being interpreted. This is essential to understanding the attribution of running code, especially if it is not considered hot enough to be compiled.

They are disabled by default. Is that a problem?

We would ideally want to have all of this profiling support always enabled when we are profiling code. For that, we would want to avoid having to pass any command line switch, so we can profile Javascript workloads without any additional configuration.

But all these command line switches are disabled by default. Why? There are several reasons.

What happens on Linux?

In Linux, Perf profiling instrumentation will unconditionally generate additional files, as it cannot know if Perf is running or not. --perf-basic-prof backend writes the generated information to .map files, and --perf-prof backend writes additional information to jit-*.dump files that will be used later with perf inject -j.

Additionally, the instrumentation requires code compaction to be disabled. This is because code compaction will change the memory location of the compiled code. V8 generates CODE_MOVE internal events for profiler instrumentation. But ETW and Linux Perf backends do not handle those events for a variety of reasons. It has an impact on memory, as there is no way to compact the space allocated for code without code moves.

So, when we enable any of the Linux Perf command line switches we both generate extra files and take more memory.

And what about Windows?

In Windows we have the same problem with code compaction. We do not support generating CODE_MOVE events so, while profiling, we cannot enable code compaction.

And the emitted ETW events with JIT code position information do still take memory space and make the profile recordings bigger.

But there is an important difference: Windows ETW API allows applications to know when they are being profiled, and even filter that. V8 only emits the code location information for ETW if the application is being profiled, and code compaction is only disabled when profiling is ongoing.

That means the overhead for enabling --enable-etw-stack-walking is expected to be minimal.

What about --interpreted-frames-native-stack? It has also been optimized to generate the extra stack information only when a profile is being recorded.

Can we enable them by default?

For Linux, the answer is a clear no. Any V8 workload would generate profiling information in files, and disabling code compaction makes memory usage higher. These problems happen whether you are recording a profile or not.

But, Windows ETW support adds overhead only when profiling! It looks like it could make sense to enable both --interpreted-frames-native-stack and --enable-etw-stack-walking for Windows.

Are we there yet? Overhead analysis

So first, we need to verify the actual overhead because we do not want to introduce a regression by enabling any of these options. I also want to confirm the actual overhead in Linux matches the expectation. To do that I am showing the result of running CSuite benchmarks, included as part of V8. I am capturing both the CPU and memory usage.

Linux benchmarks

The results obtained in Linux are…

Legend:
– REC: recording a profile (Yes, No).
– No flag: running the test without any extra command line flag.
– BP: passing --perf-basic-prof.
– IFNS: passing --interpreted-frames-native-stack.
– BP+IFNS: passing both --perf-basic-prof and --interpreted-frames-native-stack.

Linux results – score:

Test       REC  Better  No flag   BP        IFNS      BP+IFNS
Octane     No   Higher  9101.2    8953.3    9077.1    9097.3
Octane     Yes  Higher  9112.8    9041.7    9004.8    9093.9
Kraken     No   Lower   4060.3    4108.9    4076.6    4119.9
Kraken     Yes  Lower   4078.1    4141.7    4083.2    4131.3
Sunspider  No   Lower   595.7     622.8     595.4     627.8
Sunspider  Yes  Lower   598.5     626.6     599.3     633.3

Linux results – memory (in Kilobytes)

Test       REC  No flag     BP          IFNS        BP+IFNS
Octane     No   244040.0    249152.0    243442.7    253533.8
Octane     Yes  242234.7    245010.7    245632.9    252108.0
Kraken     No   46039.1     46169.5     46009.1     46497.1
Kraken     Yes  46002.1     46187.0     46024.4     46520.5
Sunspider  No   90267.0     90857.2     90214.6     91100.4
Sunspider  Yes  90210.3     90948.4     90195.9     91110.8

What is seen in the results?
– There is apparently no CPU or memory overhead of --interpreted-frames-native-stack alone. Though there is an outlier while recording Octane memory usage, that could be an error in the sampling process. This is expected as the additional information is only generated while any profiling instrumentation is enabled.
– There is no clear extra cost in memory for recording vs not recording. This is expected as the extra information only depends on having the switches enabled. It does not detect when recording is ongoing.
– There is, as expected, a cost in CPU when Perf is ongoing, in most of the cases.
– --perf-basic-prof has CPU and memory impact. And combined with --interpreted-frames-native-stack, the impact is even higher as expected.

This is on top of the fact that files are generated. The benchmarks support the decision not to enable --perf-basic-prof by default. But as --interpreted-frames-native-stack only has overhead in combination with the profiling switches, it could make sense to consider enabling it on Linux.

Windows benchmarks

What about Windows results?

Legend:
– REC: recording a profile (Yes, No).
– No flag: running the test without any extra command line flag.
– ETW: passing --enable-etw-stack-walking.
– IFNS: passing --interpreted-frames-native-stack.
– ETW+IFNS: passing both --enable-etw-stack-walking and --interpreted-frames-native-stack.

Windows results – score:

Test       REC  Better  No flag   ETW       IFNS      ETW+IFNS
Octane     No   Higher  8323.8    8336.2    8308.7    8310.4
Octane     Yes  Higher  8050.9    7991.3    8068.8    7900.6
Kraken     No   Lower   4273.0    4294.4    4303.7    4279.7
Kraken     Yes  Lower   4380.2    4416.4    4433.0    4413.9
Sunspider  No   Lower   671.1     670.8     672.0     671.5
Sunspider  Yes  Lower   693.2     703.9     690.9     716.2

Windows results – memory:

Test       REC  No flag     ETW         IFNS        ETW+IFNS
Octane     No   202535.1    204944.9    200700.4    203557.8
Octane     Yes  205125.8    204801.3    206856.9    209260.4
Kraken     No   76188.2     76095.2     76102.1     76188.5
Kraken     Yes  76031.6     76265.0     76215.6     76662.0
Sunspider  No   31784.9     31888.4     31806.9     31882.7
Sunspider  Yes  31848.9     31882.1     31802.7     32339.0

What is observed?
– No memory or CPU overhead is observed when recording is not ongoing, with --enable-etw-stack-walking and/or --interpreted-frames-native-stack.
– Even recording, there is no clearly visible overhead of --interpreted-frames-native-stack.
– When recording, --enable-etw-stack-walking has an impact on CPU. But not on memory.
– When --enable-etw-stack-walking and --interpreted-frames-native-stack are combined, while recording, both memory and CPU overheads are observed.

So, again, it should be possible to enable --interpreted-frames-native-stack on Windows by default as it only has impact while recording a profile with --enable-etw-stack-walking. And it should be possible to enable --enable-etw-stack-walking by default too as, again, the impact happens only when a profile is recorded.

Windows: there are still problems

Is it that simple? Is it just OK only adding overhead when recording a profile?

One problem with this is that there is an impact on system-wide recordings, even if the focus is not on Javascript execution.

Also, the CPU impact is not linear. Most of it happens when a profile recording starts, as V8 emits the code position of all the methods in all the already existing Javascript functions and builtins. Now, imagine a regular desktop with several V8-based browsers such as Chrome and Edge, with all their tabs, or a server with many NodeJS instances. All of them generate the information for all their methods at the same time.

So the impact of the overhead is significant, even if it only happens while recording.

Next steps

For --interpreted-frames-native-stack it looks like it would be better to propose enabling it by default in all studied platforms (Windows and Linux). It significantly improves the profiling instrumentation, and it has impact only when actual instrumentation is used. And then it still allows disabling it for the specific cases where the original builtins recording is preferred.

For --enable-etw-stack-walking, it could make sense to also propose enabling it by default, even with the known overhead while recording profiles. And just make sure there are no surprise regressions.

But, likely, the best option is trying to reduce that overhead. First, by allowing better filtering, from Windows Performance Recorder, of which processes generate the additional information. And then, also, by reducing the initial overhead as much as possible: a big part of it is emitting the builtins and V8 snapshot compiled functions for each of the contexts again and again. Ideally we want to avoid that duplication if possible.

Wrapping up

While on Linux enabling profiling instrumentation by default is not an option, on Windows the ability to enable instrumentation dynamically makes it reasonable to consider enabling ETW support by default. But there are still optimization opportunities that could be addressed first.

Thanks!

This report has been done as part of the collaboration between Bloomberg and Igalia. Thanks!


by José Dapena Paz at November 16, 2023 09:40 AM

Brian Kardell

Let's Play a Game


It's Interop 24 planning time! Let's play a game: Tell me what you prioritize, or don't and why?

I've written before about how prioritization is hard. Just in the most tangible of terms, there are differing finite resources, distributed differently across organizations. That is, not all teams have the same size or composition, or current availability. They're all also generally managed and planned independently with several other factors in mind, and in totally different architectures. That's a big part of what makes standards often take so long to get sorted out to where everyone can just use them without significant concerns.

Conversely, occasionally we see things ship at what seems by comparison amazing speed. Sometimes that's a bit of an illusion: They've been brewing and aligning hard parts for years and we're only counting the end bits. But other times it's just because we've aligned. A nice way to think about this, if you're a developer, is to compare it to web page performance. If all of the resources just load as discovered and there's no thought put into optimization, things can get really out of control. When you do some optimization and coordination, you can get things many times faster pretty quickly. The big difference here is mainly that instead of milliseconds, with standards you're substituting weeks or something.

I think that's one of the best things about Interop. While it isn't a promise by anyone or anything, it does seem to serve as a function to get us to focus together on some things and align - doing that optimization work.

It's a damned tricky exercise though, I'm not going to lie. There are so many things to consider and align. I kind of like sharing insight into this because I spent a really long time as a developer myself with a significantly different perspective. Seeing all of this from the inside the last few years has been really eye-opening. So, I thought, let's play a game... It might be useful to both of us.

Below is a list of links to about 90 2024 proposals. It's not carefully complete and I have excluded a few that I personally feel perhaps don't fit the necessary criteria for inclusion. It would, of course, be a huge time suck for you to spend your unpaid time carefully researching all of them - so that's not the game. Note that this represents a tiny fraction of the open issues in the platform - so you can start to see how this is challenging.

So, the game is: Bubble sort as many of these as you can (at least 5-10 would be great) and, optionally, share something back about why you prioritized or de-prioritized something. That can be in a reply blog post, a few messages on social media of your choice, share a gist... Whatever you like.

Here's the list (note there is a bit more prose afterward)

  1. MediaCapture device enumeration
  2. Standardize SVG properties in CSS to support unitless geometry
  3. Fetch Web API
  4. Web Share API
  5. Full support of background properties and remove of prefixes
  6. Canvas text rendering and metrics (2024 edition)
  7. Scroll-driven Animations
  8. Scrollend Events
  9. <search>
  10. text-box-trim
  11. Web Audio API
  12. CSS :dir() selectors and dir=auto interoperability
  13. CSS Nesting
  14. WebRTC peer connections and codecs
  15. scrollbar-width CSS property
  16. JavaScript Promise Integration
  17. CSS style container queries (custom properties)
  18. Allowing <hr> inside of <select>
  19. CSS background-clip
  20. text-wrap: pretty
  21. font-size-adjust
  22. requestIdleCallback
  23. <details> and <summary> elements
  24. CSS text-indent
  25. Relative Color Syntax
  26. scrollbar-color CSS property
  27. input[type="range"] styling
  28. Top Layer Animations
  29. Gamut mapping
  30. Canvas2D filter and reset
  31. WebXR
  32. User activation (2024 edition)
  33. text-transform: full-size-kana & text-transform: full-width
  34. document.caretPositionFromPoint
  35. Unit division and multiplication for mixed units of the same type within calc()
  36. EXPAND :has() to include support for more pseudo-classes
  37. WebM AV1 video codec
  38. CSS image() function
  39. WebDriver BiDi
  40. CSS box-decoration-break
  41. text-wrap: balance
  42. inverted-colors Media Query
  43. Accessibility (computed role + accname)
  44. display: contents
  45. Accessibility issues with display properties (not including display: contents)
  46. HTMLVideoElement.requestVideoFrameCallback()
  47. Text Fragments
  48. blocking="render" attribute on scripts and style sheets
  49. Indexed DB v3
  50. Declarative Shadow DOM
  51. attr() support extended capabilities
  52. Notifications API
  53. HTTP(S) URLs for WebSocket
  54. WasmGC
  55. size-adjust
  56. video-dynamic-range Media Query
  57. CSS scrollbar-gutter
  58. CSS Typed OM Level 1 (houdini)
  59. import attributes / JSON modules / CSS modules
  60. CSS element() function
  61. View Transitions Level 1
  62. Local Network Access and Mixed Content specification
  63. WebM Opus audio codec
  64. Custom Media Queries
  65. Trusted Types
  66. WebTransport API
  67. CSS object-view-box
  68. Intersection Observer v2
  69. Custom Highlight API
  70. backdrop-filter
  71. WebRTC “end-to-end-encryption”
  72. Streams
  73. Media pseudo classes: :paused/:playing/:seeking/:buffering/:stalled/:muted/:volume-locked
  74. CSS box sizing properties with MathML Core
  75. Detect UA Transitions on same-document Navigations
  76. css fill/stroke
  77. JPEG XL image format
  78. font-family keywords
  79. CSS Multi-Column Layout block element breaking
  80. Navigation API
  81. Ready-made counter styles
  82. hidden=until-found and auto-expanding details
  83. Popover
  84. P3 All The Things
  85. URLPattern
  86. margin-trim
  87. overscroll-behavior on the root scroller
  88. CSS Painting API Level 1 (houdini)
  89. CSS Logical Properties and Values
  90. CSS caret, caret-color and caret-shape properties

The tricky thing is that there is no 'right' answer here, and one can easily make the case several different ways. For example, perhaps we should prioritize things that cannot be polyfilled or done with pre-processors over things that can. That sounds reasonable enough, but given that the queue never empties: Do you just do that forever? Does it eventually lead to larger areas that are ignored? Similarly, do you focus on the things that are sure to be used by everyone, or do you assume those already get the most priority and focus on the things that don't? Do you pick certain compat pain points/bugs, or focus on launching new features quickly and very interoperably?

It's a hard game! Looking forward to your answers.

November 16, 2023 05:00 AM

November 14, 2023

Andy Wingo

whiffle, a purpose-built scheme

Yesterday I promised an apology but didn't actually get past the admission of guilt. Today the defendant takes the stand, in the hope that an awkward cross-examination will persuade the jury to take pity on a poor misguided soul.

Which is to say, let's talk about Whiffle: what it actually is, what it is doing for me, and why on earth it is that [I tell myself that] writing a new programming language implementation is somehow preferable to re-using an existing one.

garbage collection is my passion

Whiffle is purpose-built to test the Whippet garbage collection library.

Whiffle lets me create Whippet test cases in C, without actually writing C. C is fine and all, but the problem with it and garbage collection is that you have to track all stack roots manually, and this is an error-prone process. Generating C means that I can more easily ensure that each stack root is visitable by the GC, which lets me make test cases with more confidence; if there is a bug, it is probably not because of an untraced root.

Also, Whippet is mostly meant for programming language runtimes, not for direct use by application authors. In this use case, you can probably use less "active" mechanisms for ensuring root traceability: instead of eagerly recording live values in some kind of handle scope, you can keep a side table that is only consulted as needed during garbage collection pauses. In particular since Scheme uses the stack as a data structure, I was worried that using handle scopes would somehow distort the performance characteristics of the benchmarks.

Whiffle is not, however, a high-performance Scheme compiler. It is not for number-crunching, for example: garbage collectors don't care about that, so let's not. Also, Whiffle should not go to any effort to remove allocations (sroa / gvn / cse); creating nodes in the heap is the purpose of the test case, and eliding them via compiler heroics doesn't help us test the GC.

I settled on a baseline-style compiler, in which I re-use the Scheme front-end from Guile to expand macros and create an abstract syntax tree. I do run some optimizations on that AST; in the spirit of the macro writer's bill of rights, it does make sense to provide some basic reductions. (These reductions can be surprising, but I am used to Guile's flavor of cp0 (peval), and this project is mostly for me, so I thought it was risk-free; I was almost right!)

Anyway the result is that Whiffle uses an explicit stack. A safepoint for a thread simply records its stack pointer: everything between the stack base and the stack pointer is live. I do have a lingering doubt about the representativity of this compilation strategy; would a conclusion drawn from Whippet apply to Guile, which uses a different stack allocation strategy? I think probably so but it's an unknown.
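To make that concrete, here is a minimal sketch of the idea (my own illustration, not Whiffle's actual code): a safepoint records the current stack pointer, and the collector then treats every slot between that pointer and the stack base as a potential root.

#include <stdint.h>

struct Thread {
  uintptr_t* stack_base; // highest address of the explicit stack
  uintptr_t* stack_ptr;  // moves down as values are pushed
  uintptr_t* safepoint;  // snapshot taken when entering a safepoint
};

void enter_safepoint(struct Thread* t) { t->safepoint = t->stack_ptr; }

void trace_roots(struct Thread* t, void (*visit)(uintptr_t* slot)) {
  // Everything between the recorded pointer and the stack base is live.
  for (uintptr_t* slot = t->safepoint; slot < t->stack_base; slot++)
    visit(slot);
}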

what's not to love

Whiffle also has a number of design goals that are better formulated in the negative. I mentioned compiler heroics as one thing to avoid, and in general the desire for a well-understood correspondence between source code and run-time behavior has a number of other corollaries: Whiffle is a pure ahead-of-time (AOT) compiler, as just-in-time (JIT) compilation adds noise. Worse, speculative JIT would add unpredictability, which while good on the whole would be anathema to understanding an isolated piece of a system like the GC.

Whiffle also should produce stand-alone C files, without a thick run-time. I need to be able to understand and reason about the residual C programs, and depending on third-party libraries would hinder this goal.

Oddly enough, users are also an anti-goal: as a compiler that only exists to test a specific GC library, there is no sense in spending too much time making Whiffle nicer for other humans, humans whose goal is surely not just to test Whippet. Whiffle is an interesting object, but is not meant for actual use or users.

corners: cut

Another anti-goal is completeness with regards to any specific language standard: the point is to test a GC, not to make a useful Scheme. Therefore Whippet gets by just fine without flonums, fractions, continuations (delimited or otherwise), multiple return values, ports, or indeed any library support at all. All of that just doesn't matter for testing a GC.

That said, it has been useful to be able to import standard Scheme garbage collection benchmarks, such as earley or nboyer. These have required very few modifications to run in Whippet, mostly related to Whippet's test harness that wants to spawn multiple threads.

and so?

I think this evening we have elaborated a bit more about the "how", complementing yesterday's note about the "what". Tomorrow (?) I'll see if I can dig in more to the "why": what questions does Whiffle let me ask of Whippet, and how good of a job does it do at getting those answers? Until then, may all your roots be traced, and happy hacking.

by Andy Wingo at November 14, 2023 10:10 PM

November 13, 2023

Andy Wingo

i accidentally a scheme

Good evening, dear hackfriends. Tonight's missive is an apology: not quite in the sense of expiation, though not quite not that, either; rather, apology in the sense of explanation, of exegesis: apologia. See, I accidentally made a Scheme. I know I have enough Scheme implementations already, but I went and made another one. It's for a maybe good reason, though!

one does not simply a scheme

I feel like we should make this the decade of leaning into your problems, and I have a Scheme problem, so here we are. See, I co-maintain Guile, and have been noodling on a new garbage collector (GC) for Guile, Whippet. Whippet is designed to be embedded in the project that uses it, so one day I hope it will be just copied into Guile's source tree, replacing the venerable BDW-GC that we currently use.

The thing is, though, that GC implementations are complicated. A bug in a GC usually manifests itself far away in time and space from the code that caused the bug. Language implementations are also complicated, for similar reasons. Swapping one GC for another is something to be done very carefully. This is even more the case when the switching cost is high, which is the case with BDW-GC: as a collector written as a library to link into "uncooperative" programs, there is more cost to moving to a conventional collector than in the case where the embedding program is already aware that (for example) garbage collection may relocate objects.

So, you need to start small. First, we need to prove that the new GC implementation is promising in some way, that it might improve on BDW-GC. Then... embed it directly into Guile? That sounds like a bug farm. Is there not any intermediate step that one might take?

But also, how do you actually test that a GC algorithm or implementation is interesting? You need a workload, and you need the ability to compare the new collector to the old, for that workload. In Whippet I had been writing some benchmarks in C (example), but this approach wasn't scaling: besides not sparking joy, I was starting to wonder if what I was testing would actually reflect usage in Guile.

I had an early approach to rewrite a simple language implementation like the other Scheme implementation I made to demonstrate JIT code generation in WebAssembly, but that soon foundered against what seemed to me an unlikely rock: the compiler itself. In my wasm-jit work, the "compiler" itself was in C++, using the C++ allocator for compile-time allocations, and the result was a tree of AST nodes that were interpreted at run-time. But to embed the benchmarks in Whippet itself I needed something C, which is less amenable to abstraction of any kind... Here I think I could have made a different choice: to somehow allow C++ or something as a dependency to write tests, or to do more mallocation in the "compiler"...

But that wasn't fun. A lesson I learned long ago is that if something isn't fun, I need to turn it into a compiler. So I started writing a compiler to a little bytecode VM, initially in C, then in Scheme because C is a drag and why not? Why not just generate the bytecode C file from Scheme? Same dependency set, once the C file is generated. And then, as long as you're generating C, why go through bytecode at all? Why not just, you know, generate C?

after all, why not? why shouldn't i keep it?

And that's how I accidentally made a Scheme, Whiffle. Tomorrow I'll write a little more on what Whiffle is and isn't, and what it's doing for me. Until then, happy hacking!

by Andy Wingo at November 13, 2023 09:36 PM

November 08, 2023

Brian Kardell

Wolvic Store Experiment

Wolvic Store Experiment

A post about cool stuff for a cause.

As of today, you can purchase some cool Wolvic swag, and proceeds will go to supporting development of the browser:

There’s mugs, and water bottles and shirts and sweatshirts and … You know, stuff.

Why?

For the last few years I’ve been writing about the health of the whole web ecosystem, and talking about it on our podcast. We just assume that the “search revenue pays for browsers” model that got us here is fine and will last forever. I’m pretty sure it won’t.

When you look at how engines and browsers are funded, it's way too centralized on search funding. Of course, this is not a way to fund a new browser - rather, it rewards those that have already made a really significant entry. The whole thing is actually kind of fragile when you step back and look at it.

It would be nice to change that, and we only achieve that by trying.

So, in general at Igalia we’re trying several ways to diversify funding. Specifically for Wolvic, we’re looking into how we might explore other kinds of advertising, but we’ve got this far with investment from Igalia, a partnership model and an Open Collective.

It’s kind of easy to get applause when I talk about this, but the truth is very few people put money into our collectives. It reminded me of… Well… Every charity, ever.

Lots of charities (or even NPR) give you some swag with a donation. For some reason, that's helpful in expanding the base of people making any kind of donation at all: offering swag makes people consider putting some money toward the cause.

Of course, this isn’t especially efficient: If you buy a product from our store, maybe only a few dollars of that becomes an the actual donation. It would be far more efficient to just donate a couple of dollars to the collective, but very few do. So we’re trying something new.

Will offering swag help here too? Is this a model that could help us pay for things? I don’t know, but I’m always looking for new ways to talk about this problem and new experiments we could try to keep the discussion going. So, what do you think?

November 08, 2023 05:00 AM

November 07, 2023

Melissa Wen

AMD Driver-specific Properties for Color Management on Linux (Part 2)

TL;DR:

This blog post explores the color capabilities of AMD hardware and how they are exposed to userspace through driver-specific properties. It discusses the different color blocks in the AMD Display Core Next (DCN) pipeline and their capabilities, such as predefined transfer functions, 1D and 3D lookup tables (LUTs), and color transformation matrices (CTMs). It also highlights the differences in AMD HW blocks for pre and post-blending adjustments, and how these differences are reflected in the available driver-specific properties.

Overall, this blog post provides a comprehensive overview of the color capabilities of AMD hardware and how they can be controlled by userspace applications through driver-specific properties. This information is valuable for anyone who wants to develop applications that can take advantage of the AMD color management pipeline.

Get a closer look at each hardware block’s capabilities, unlock a wealth of knowledge about AMD display hardware, and enhance your understanding of graphics and visual computing. Stay tuned for future developments as we embark on a quest for GPU color capabilities in the ever-evolving realm of rainbow treasures.


Operating Systems can use the power of GPUs to ensure consistent color reproduction across graphics devices. We can use GPU-accelerated color management to handle the diversity of color profiles, perform color transformations to convert between High-Dynamic-Range (HDR) and Standard-Dynamic-Range (SDR) content, and apply color enhancements for wide color gamut (WCG). However, to make use of GPU display capabilities, we need an interface between userspace and the kernel display drivers that is currently absent in the Linux/DRM KMS API.

In the previous blog post I presented how we are expanding the Linux/DRM color management API to expose specific properties of AMD hardware. Now, I’ll guide you to the color features for the Linux/AMD display driver. We embark on a journey through DRM/KMS, AMD Display Manager, and AMD Display Core and delve into the color blocks to uncover the secrets of color manipulation within AMD hardware. Here we’ll talk less about the color tools and more about where to find them in the hardware.

We resort to driver-specific properties to reach AMD hardware blocks with color capabilities. These blocks display features like predefined transfer functions, color transformation matrices, and 1-dimensional (1D LUT) and 3-dimensional lookup tables (3D LUT). Here, we will understand how these color features are strategically placed into color blocks both before and after blending in Display Pipe and Plane (DPP) and Multiple Pipe/Plane Combined (MPC) blocks.

That said, welcome back to the second part of our thrilling journey through AMD’s color management realm!

AMD Display Driver in the Linux/DRM Subsystem: The Journey

In my 2022 XDC talk “I’m not an AMD expert, but…”, I briefly explained the organizational structure of the Linux/AMD display driver where the driver code is bifurcated into a Linux-specific section and a shared-code portion. To reveal AMD’s color secrets through the Linux kernel DRM API, our journey led us through these layers of the Linux/AMD display driver’s software stack. It includes traversing the DRM/KMS framework, the AMD Display Manager (DM), and the AMD Display Core (DC) [1].

The DRM/KMS framework provides the atomic API for color management through KMS properties represented by struct drm_property. We extended the color management interface exposed to userspace by leveraging existing resources and connecting them with driver-specific functions for managing modeset properties.

On the AMD DC layer, the interface with hardware color blocks is established. The AMD DC layer contains OS-agnostic components that are shared across different platforms, making it an invaluable resource. This layer already implements hardware programming and resource management, simplifying the external developer’s task. While examining the DC code, we gain insights into the color pipeline and capabilities, even without direct access to specifications. Additionally, AMD developers provide essential support by answering queries and reviewing our work upstream.

The primary challenge involved identifying and understanding relevant AMD DC code to configure each color block in the color pipeline. However, the ultimate goal was to bridge the DC color capabilities with the DRM API. For this, we changed the AMD DM, the OS-dependent layer connecting the DC interface to the DRM/KMS framework. We defined and managed driver-specific color properties, facilitated the transport of user space data to the DC, and translated DRM features and settings to the DC interface. Considerations were also made for differences in the color pipeline based on hardware capabilities.

Exploring Color Capabilities of the AMD display hardware

Now, let’s dive into the exciting realm of AMD color capabilities, where a abundance of techniques and tools await to make your colors look extraordinary across diverse devices.

First, we need to know a little about the color transformation and calibration tools and techniques that you can find in different blocks of the AMD hardware. I borrowed some images from [2] [3] [4] to help you understand the information.

Predefined Transfer Functions (Named Fixed Curves):

Transfer functions serve as the bridge between the digital and visual worlds, defining the mathematical relationship between digital color values and linear scene/display values and ensuring consistent color reproduction across different devices and media. You can learn more about curves in the chapter GPU Gems 3 - The Importance of Being Linear by Larry Gritz and Eugene d’Eon.

ITU-R 2100 introduces three main types of transfer functions:

  • OETF: the opto-electronic transfer function, which converts linear scene light into the video signal, typically within a camera.
  • EOTF: electro-optical transfer function, which converts the video signal into the linear light output of the display.
  • OOTF: opto-optical transfer function, which has the role of applying the “rendering intent”.

AMD’s display driver supports the following pre-defined transfer functions (aka named fixed curves):

  • Linear/Unity: linear/identity relationship between pixel value and luminance value;
  • Gamma 2.2, Gamma 2.4, Gamma 2.6: pure power functions;
  • sRGB: the piece-wise transfer function from IEC 61966-2-1:1999, whose upper segment uses a 2.4 exponent;
  • BT.709: has a linear segment in the bottom part and then a power function with a 0.45 (~1/2.22) gamma for the rest of the range; standardized by ITU-R BT.709-6;
  • PQ (Perceptual Quantizer): used for HDR display, allows luminance range capability of 0 to 10,000 nits; standardized by SMPTE ST 2084.

These capabilities vary depending on the hardware block, with some utilizing hardcoded curves and others relying on AMD’s color module to construct curves from standardized coefficients. It also supports user/custom curves built from a lookup table.
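To make a couple of those names concrete, here is what two of the curves look like written out as code (standard formulas over values normalized to the [0, 1] range, not driver code): the pure power Gamma 2.2 EOTF and the piece-wise sRGB EOTF from IEC 61966-2-1.

#include <cmath>

// Gamma 2.2: a pure power function from encoded value to linear light.
double gamma22_eotf(double encoded) { return std::pow(encoded, 2.2); }

// sRGB EOTF (IEC 61966-2-1): a linear segment near black, then a power segment.
double srgb_eotf(double encoded) {
  return encoded <= 0.04045 ? encoded / 12.92
                            : std::pow((encoded + 0.055) / 1.055, 2.4);
}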

1D LUTs (1-dimensional Lookup Table):

A 1D LUT is a versatile tool, defining a one-dimensional color transformation based on a single parameter. It’s very well explained by Jeremy Selan at GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations

It enables adjustments to color, brightness, and contrast, making it ideal for fine-tuning. In the Linux AMD display driver, the atomic API offers a 1D LUT with 4096 entries and 8-bit depth, while legacy gamma uses a size of 256.
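To give an idea of how userspace feeds such a 1D LUT to the driver, here is a hedged sketch (assuming the libdrm headers; the helper name and the gamma 2.2 curve are mine, and attaching the blob to the right degamma/gamma property is driver- and property-specific, so it is left out): it fills a 4096-entry array of struct drm_color_lut and wraps it in a DRM property blob.

#include <cmath>
#include <cstdint>
#include <vector>
#include <xf86drmMode.h> // libdrm; build with `pkg-config --cflags --libs libdrm`

// Build a 4096-entry gamma 2.2 curve. drm_color_lut channels are 16-bit in the
// UAPI, even if the hardware later reduces the precision internally.
uint32_t create_gamma22_lut_blob(int drm_fd) {
  constexpr size_t kSize = 4096;
  std::vector<drm_color_lut> lut(kSize);
  for (size_t i = 0; i < kSize; ++i) {
    double x = static_cast<double>(i) / (kSize - 1);
    auto v = static_cast<uint16_t>(std::lround(std::pow(x, 2.2) * 0xFFFF));
    lut[i].red = lut[i].green = lut[i].blue = v;
  }
  uint32_t blob_id = 0;
  drmModeCreatePropertyBlob(drm_fd, lut.data(), lut.size() * sizeof(lut[0]), &blob_id);
  return blob_id; // then attach it to the LUT property of the plane or CRTC
}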

3D LUTs (3-dimensional Lookup Table):

These tables work in three dimensions – red, green, and blue. They're perfect for complex color transformations and adjustments between color channels. They're also more complex to manage and require more computational resources. Jeremy also explains 3D LUTs in GPU Gems 2 - Chapter 24, Using Lookup Tables to Accelerate Color Transformations

CTM (Color Transformation Matrices):

Color transformation matrices facilitate the transition between different color spaces, playing a crucial role in color space conversion.

HDR Multiplier:

HDR multiplier is a factor applied to the color values of an image to increase their overall brightness.

AMD Color Capabilities in the Hardware Pipeline

First, let’s take a closer look at the AMD Display Core Next hardware pipeline in the Linux kernel documentation for AMDGPU driver - Display Core Next

[Figure: the AMD Display Core Next (DCN) pipeline, from the AMDGPU driver documentation > Display Core Next (DCN)]

In the AMD Display Core Next hardware pipeline, we encounter two hardware blocks with color capabilities: the Display Pipe and Plane (DPP) and the Multiple Pipe/Plane Combined (MPC). The DPP handles color adjustments per plane before blending, while the MPC engages in post-blending color adjustments. In short, we expect DPP color capabilities to match up with DRM plane properties, and MPC color capabilities to play nice with DRM CRTC properties.

Note: here’s the catch – there are some DRM CRTC color transformations that don’t have a corresponding AMD MPC color block, and vice versa. It’s like a puzzle, and we’re here to solve it!

AMD Color Blocks and Capabilities

We can finally talk about the color capabilities of each AMD color block. As they vary based on the hardware generation, let's take the DCN3+ family as a reference. What's possible to do before and after blending depends on the hardware capabilities described in the kernel driver by struct dpp_color_caps and struct mpc_color_caps.

The AMD Steam Deck hardware provides a tangible example of these capabilities. Therefore, we take the SteamDeck/DCN301 driver as an example and look at the “Color pipeline capabilities” described in the file drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resources.c:

/* Color pipeline capabilities */

dc->caps.color.dpp.dcn_arch = 1; // If it is a Display Core Next (DCN): yes. Zero means DCE.
dc->caps.color.dpp.input_lut_shared = 0;
dc->caps.color.dpp.icsc = 1; // Input Color Space Conversion (CSC) matrix.
dc->caps.color.dpp.dgam_ram = 0; // The old degamma block for degamma curve (hardcoded and LUT). `Gamma correction` is the new one.
dc->caps.color.dpp.dgam_rom_caps.srgb = 1; // sRGB hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1; // BT2020 hardcoded curve support (seems not actually in use)
dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1; // Gamma 2.2 hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.pq = 1; // PQ hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.hlg = 1; // HLG hardcoded curve support
dc->caps.color.dpp.post_csc = 1; // CSC matrix
dc->caps.color.dpp.gamma_corr = 1; // New `Gamma Correction` block for degamma user LUT;
dc->caps.color.dpp.dgam_rom_for_yuv = 0;

dc->caps.color.dpp.hw_3d_lut = 1; // 3D LUT support. If so, it's always preceded by a shaper curve. 
dc->caps.color.dpp.ogam_ram = 1; // `Blend Gamma` block for custom curve just before blending
// no OGAM ROM on DCN301
dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.dpp.ogam_rom_caps.pq = 0;
dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
dc->caps.color.dpp.ocsc = 0;

dc->caps.color.mpc.gamut_remap = 1; // Post-blending CTM (pre-blending CTM is always supported)
dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; // Post-blending 3D LUT (preceded by shaper curve)
dc->caps.color.mpc.ogam_ram = 1; // Post-blending regamma.
// No pre-defined TF supported for regamma.
dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.mpc.ogam_rom_caps.pq = 0;
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1; // Output CSC matrix.

I included some inline comments in each element of the color caps to quickly describe them, but you can find the same information in the Linux kernel documentation. See more in struct dpp_color_caps, struct mpc_color_caps and struct rom_curve_caps.

Now, using this guideline, we go through color capabilities of DPP and MPC blocks and talk more about mapping driver-specific properties to corresponding color blocks.

DPP Color Pipeline: Before Blending (Per Plane)

Let’s explore the capabilities of DPP blocks and what you can achieve with a color block. The very first thing to pay attention is the display architecture of the display hardware: previously AMD uses a display architecture called DCE

  • Display and Compositing Engine, but newer hardware follows DCN - Display Core Next.

The architectute is described by: dc->caps.color.dpp.dcn_arch

AMD Plane Degamma: TF and 1D LUT

Described by: dc->caps.color.dpp.dgam_ram, dc->caps.color.dpp.dgam_rom_caps, dc->caps.color.dpp.gamma_corr

AMD Plane Degamma data is mapped to the initial stage of the DPP pipeline. It is utilized to transition from scanout/encoded values to linear values for arithmetic operations. Plane Degamma supports both pre-defined transfer functions and 1D LUTs, depending on the hardware generation. DCN2 and older families handle both types of curve in the Degamma RAM block (dc->caps.color.dpp.dgam_ram); DCN3+ families separate hardcoded curves and the 1D LUT into two blocks: the Degamma ROM (dc->caps.color.dpp.dgam_rom_caps) and the Gamma correction block (dc->caps.color.dpp.gamma_corr), respectively.

Pre-defined transfer functions:

  • they are hardcoded curves (read-only memory - ROM);
  • supported curves: sRGB EOTF, BT.709 inverse OETF, PQ EOTF and HLG OETF, Gamma 2.2, Gamma 2.4 and Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. Setting TF = Identity/Default and LUT as NULL means bypass.


AMD Plane 3x4 CTM (Color Transformation Matrix)

AMD Plane CTM data goes to the DPP Gamut Remap block, supporting a 3x4 fixed point (s31.32) matrix for color space conversions. The data is interpreted as a struct drm_color_ctm_3x4. Setting NULL means bypass.


AMD Plane Shaper: TF + 1D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The Shaper block fine-tunes color adjustments before applying the 3D LUT, optimizing the use of the limited entries in each dimension of the 3D LUT. On AMD hardware, a 3D LUT always means a preceding shaper 1D LUT used for delinearizing and/or normalizing the color space before applying a 3D LUT, so this entry on DPP color caps dc->caps.color.dpp.hw_3d_lut means support for both shaper 1D LUT and 3D LUT.

A pre-defined transfer function enables delinearizing content with or without a shaper LUT, where the AMD color module calculates the resulting shaper curve. Shaper curves go from linear values to encoded values. If we are already in a non-linear space and/or don't need to normalize values, we can set an Identity TF for the shaper, which works similarly to bypass and is also the default TF value.

Pre-defined transfer functions:

  • there is no DPP Shaper ROM. Curves are calculated by AMD color modules. Check calculate_curve() function in the file amd/display/modules/color/color_gamma.c.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting Plane Shaper TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT as NULL works as bypass.


AMD Plane 3D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The 3D LUT in the DPP block facilitates complex color transformations and adjustments. A 3D LUT is a three-dimensional array where each element is an RGB triplet. As mentioned before, dc->caps.color.dpp.hw_3d_lut describes whether the DPP 3D LUT is supported.

The AMD driver advertises the size of a single dimension via the LUT3D_SIZE property. Plane 3D LUT is a blob property where the data is interpreted as an array of struct drm_color_lut elements and the number of entries is LUT3D_SIZE cubed. The array contains samples of the approximated function; values between samples are estimated by tetrahedral interpolation. The array is accessed with three indices, one for each input dimension (color channel), blue being the outermost dimension and red the innermost. This distribution is better visualized by examining the code in [RFC PATCH 5/5] drm/amd/display: Fill 3D LUT from userspace by Alex Hung:

+	for (nib = 0; nib < 17; nib++) {
+		for (nig = 0; nig < 17; nig++) {
+			for (nir = 0; nir < 17; nir++) {
+				ind_lut = 3 * (nib + 17*nig + 289*nir);
+
+				rgb_area[ind].red = rgb_lib[ind_lut + 0];
+				rgb_area[ind].green = rgb_lib[ind_lut + 1];
+				rgb_area[ind].blue = rgb_lib[ind_lut + 2];
+				ind++;
+			}
+		}
+	}

In our driver-specific approach we opted to advertise its behavior to userspace instead of implicitly dealing with it in the kernel driver. AMD's hardware supports 3D LUTs with a dimension size of 17 or 9 (4913 and 729 entries respectively), and you can choose between 10-bit and 12-bit. In the current driver-specific work we focus on enabling only the 17-size, 12-bit 3D LUT, as in [PATCH v3 25/32] drm/amd/display: add plane 3D LUT support:

+		/* Stride and bit depth are not programmable by API yet.
+		 * Therefore, only supports 17x17x17 3D LUT (12-bit).
+		 */
+		lut->lut_3d.use_tetrahedral_9 = false;
+		lut->lut_3d.use_12bits = true;
+		lut->state.bits.initialized = 1;
+		__drm_3dlut_to_dc_3dlut(drm_lut, drm_lut3d_size, &lut->lut_3d,
+					lut->lut_3d.use_tetrahedral_9,
+					MAX_COLOR_3DLUT_BITDEPTH);

A refined control of 3D LUT parameters should go through a follow-up version or generic API.

Setting 3D LUT to NULL means bypass.
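As a small hedged sketch of the layout described above (my illustration, not code from the patch series): with blue as the outermost (slowest-varying) dimension and red as the innermost one, an entry of the blob can be addressed like this.

#include <xf86drmMode.h>

// Index a LUT3D_SIZE^3 array of drm_color_lut entries laid out with blue as
// the slowest-varying dimension and red as the fastest-varying one.
drm_color_lut* lut3d_entry(drm_color_lut* lut, int size, int r, int g, int b) {
  return &lut[(b * size + g) * size + r];
}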


AMD Plane Blend/Out Gamma: TF + 1D LUT

Described by: dc->caps.color.dpp.ogam_ram

The Blend/Out Gamma block applies the final touch-up before blending, allowing users to linearize content after the 3D LUT and just before blending. It supports both a 1D LUT and pre-defined TFs. We can see the Shaper and Blend LUTs as 1D LUTs that sandwich the 3D LUT. So, if we don't need 3D LUT transformations, we may want to only use the Degamma block to linearize and skip Shaper, 3D LUT and Blend.

Pre-defined transfer function:

  • there is no DPP Blend ROM. Curves are calculated by AMD color modules;
  • supported curves: Identity, sRGB EOTF, BT.709 inverse OETF, PQ EOTF, HLG inverse OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. If plane_blend_tf_property != Identity TF, AMD color module will combine the user LUT values with pre-defined TF into the LUT parameters to be programmed. Setting TF = Identity/Default and LUT to NULL means bypass.


MPC Color Pipeline: After Blending (Per CRTC)

DRM CRTC Degamma 1D LUT

The degamma lookup table (LUT) converts framebuffer pixel data before applying the color conversion matrix. The data is interpreted as an array of struct drm_color_lut elements. Setting NULL means bypass.

Not really supported. The driver is currently reusing the DPP degamma LUT block (dc->caps.color.dpp.dgam_ram and dc->caps.color.dpp.gamma_corr) for supporting the DRM CRTC Degamma LUT, as explained in [PATCH v3 20/32] drm/amd/display: reject atomic commit if setting both plane and CRTC degamma.

DRM CRTC 3x3 CTM

Described by: dc->caps.color.mpc.gamut_remap

It sets the color transformation matrix (CTM) applied to pixel data after the lookup through the degamma LUT and before the lookup through the gamma LUT. The data is interpreted as a struct drm_color_ctm. Setting NULL means bypass.
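For reference, struct drm_color_ctm carries nine 64-bit coefficients in S31.32 sign-magnitude fixed point. Here is a hedged sketch (the helper names are mine) of encoding that format and building an identity-matrix blob with libdrm.

#include <cmath>
#include <cstdint>
#include <xf86drmMode.h>

// Encode a double as S31.32 sign-magnitude fixed point: sign in the MSB,
// magnitude in the lower 63 bits, 32 of them fractional.
uint64_t to_s31_32(double v) {
  auto magnitude = static_cast<uint64_t>(std::llround(std::fabs(v) * 4294967296.0));
  return (v < 0 ? (1ull << 63) : 0) | magnitude;
}

uint32_t create_identity_ctm_blob(int drm_fd) {
  drm_color_ctm ctm = {};
  for (int i = 0; i < 3; ++i)
    ctm.matrix[i * 3 + i] = to_s31_32(1.0); // 1.0 on the diagonal
  uint32_t blob_id = 0;
  drmModeCreatePropertyBlob(drm_fd, &ctm, sizeof(ctm), &blob_id);
  return blob_id; // then set it as the CRTC CTM property value
}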

DRM CRTC Gamma 1D LUT + AMD CRTC Gamma TF

Described by: dc->caps.color.mpc.ogam_ram

After all that, you might still want to convert the content to wire encoding. No worries, in addition to the DRM CRTC 1D LUT, we've got an AMD CRTC gamma transfer function (TF) to make it happen. Possible TF values are defined by enum amdgpu_transfer_function.

Pre-defined transfer functions:

  • there is no MPC Gamma ROM. Curves are calculated by AMD color modules.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting CRTC Gamma TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT to NULL means bypass.


Others

AMD CRTC Shaper and 3D LUT

We previously worked on exposing the CRTC shaper and CRTC 3D LUT, but they were removed from the AMD driver-specific color series because they lack a userspace use case. The CRTC shaper and 3D LUT work similarly to the plane shaper and 3D LUT, but after blending (MPC block). The difference here is that setting (not bypassing) the Shaper and Gamma blocks together is not expected, since both blocks are used to delinearize the input space. In summary, we either set Shaper + 3D LUT or Gamma.

Input and Output Color Space Conversion

There are two other color capabilities of AMD display hardware that were integrated into DRM by previous work and are worth a brief explanation here. The DC Input CSC sets pre-defined coefficients from the values of the DRM plane color_range and color_encoding properties. It is used for color space conversion of the input content. On the other hand, the DC Output CSC (OCSC) sets pre-defined coefficients from the DRM connector Colorspace property. It is used for color space conversion of the composed image to the one supported by the sink.
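Userspace typically reaches these (and the driver-specific) properties by name. A hedged sketch of the usual lookup follows: "COLOR_ENCODING" and "COLOR_RANGE" are the standard DRM plane property names, while the helper itself is just an illustration.

#include <cstdint>
#include <cstring>
#include <xf86drmMode.h>

// Find a plane property by name, e.g. "COLOR_ENCODING" or "COLOR_RANGE".
uint32_t find_plane_property(int fd, uint32_t plane_id, const char* name) {
  drmModeObjectProperties* props =
      drmModeObjectGetProperties(fd, plane_id, DRM_MODE_OBJECT_PLANE);
  uint32_t prop_id = 0;
  for (uint32_t i = 0; props && i < props->count_props; ++i) {
    drmModePropertyRes* prop = drmModeGetProperty(fd, props->props[i]);
    if (prop && std::strcmp(prop->name, name) == 0)
      prop_id = prop->prop_id;
    drmModeFreeProperty(prop);
  }
  drmModeFreeObjectProperties(props);
  return prop_id; // set it afterwards with drmModeObjectSetProperty() or an atomic commit
}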


The search for rainbow treasures is not over yet

If you want to understand a little more about this work, be sure to watch the two talks that Joshua and I presented at XDC 2023 about AMD/Steam Deck colors on Gamescope.

In the time between the first and second part of this blog post, Uma Shankar and Chaitanya Kumar Borah published the plane color pipeline for Intel, and Harry Wentland implemented a generic API for DRM based on VKMS support. We discussed these two proposals and the next steps for Color on Linux during the Color Management workshop at XDC 2023, and I briefly shared the workshop results in the 2023 XDC lightning talk session.

The search for rainbow treasures is not over yet! We plan to meet again next year at the 2024 Display Hackfest in A Coruña, Spain (Igalia's HQ) to keep up the pace and continue advancing today's display needs on Linux.

Finally, a HUGE thank you to everyone who worked with me on exploring AMD’s color capabilities and making them available in userspace.

November 07, 2023 08:12 AM

Loïc Le Page

Use EGLStreams in a WPE WebKit backend

Build a WPE WebKit backend based on the EGLStream extension.

In the previous post we saw how to build a basic WPE Backend from scratch. Now we are going to transfer the frames produced by the WPEWebProcess to the application process without copying the graphical data out of the GPU memory.

There are different mechanisms to transfer hardware buffers from one process to another. We are going to use here the EGLStream extension that is available on compatible hardware from the EGL API.

The reference project for this WPE Backend is WPEBackend-offscreen-nvidia. It is using the EGLStream extension to produce the hardware buffers on the WPEWebProcess side, and the EGL_NV_stream_consumer_eglimage extension (that is specific to NVidia hardware) to transform those buffers into EGLImages on the application process side.

N.B. the WPEBackend-offscreen-nvidia project and the content of this post are compatible with the WPE WebKit 2.38.x or 2.40.x versions using the 1.14.x version of libwpe.

With the growing adoption of DMA Buffers on all modern Linux platforms, the WPE WebKit architecture is evolving and, in the future, the need for a WPE Backend should disappear.

Future designs of the libwpe API should make it possible to receive the video buffers directly on the application process side, without needing to implement a different WPE Backend for each hardware platform. From the application developer's point of view, it will simplify the usage of WPE WebKit by hiding all multiprocess considerations.

The EGLStream extension #

EGLStream is an extension from the EGL API. It defines some mechanisms to transfer hardware video buffers from one process to another without getting out of the GPU memory. In this aspect, it fulfils the same purpose as DMA Buffers, but without the complexity of negotiating the different parameters and of transparently handling multi-plane images. On the other hand, the EGLStream extension is less flexible than DMA Buffers and is unlikely to be available on non-NVidia hardware.

In other words: the EGLStream extension is quite easy to use but you will need an NVidia GPU.

The EGLStream goes in one exclusive direction from a producer in charge of creating the video buffer, to a consumer that will receive and present the video buffer. So you need two endpoints, one in the consumer process and one in the producer process.

Indeed, apart from the EGLStream extension itself, you will need other extensions to create the content of the video buffer on the producer side, and to consume this content on the consumer side.

The producer extension used in the WPEBackend-offscreen-nvidia project is EGL_KHR_stream_producer_eglsurface. It is straightforward to use because you create a specific surface the same way you would create a PBuffer surface; then, when calling eglSwapBuffers, the EGL implementation takes care of finishing the drawing and sending the content to the consumer.

On the consumer side, you can use different extensions. The most interesting in our case are:

  • EGL_KHR_stream_consumer_gltexture: binds the stream frames to a GL_TEXTURE_EXTERNAL_OES texture;
  • EGL_NV_stream_consumer_eglimage: exposes the stream frames as EGLImages (specific to NVidia hardware).

In WPEBackend-offscreen-nvidia we are using the latter approach, so we can directly transfer the obtained EGLImage to the user application for presentation.

How to initialize the EGLStream between the producer and the consumer #

The order of creation of the EGLStream endpoints, the steps for their connection and configuration of the producer and consumer extensions are quite strict:

1. You start with creating the consumer endpoint of the EGLStream: #

EGLStream createConsumerStream(EGLDisplay eglDisplay) noexcept
{
    static constexpr const EGLint s_streamAttribs[] = {EGL_STREAM_FIFO_LENGTH_KHR, 1,
                                                       EGL_CONSUMER_ACQUIRE_TIMEOUT_USEC_KHR, 1000 * 1000,
                                                       EGL_NONE};
    return eglCreateStreamKHR(eglDisplay, s_streamAttribs);
}

The EGL_STREAM_FIFO_LENGTH_KHR parameter defines the length of the EGLStream queue. If set to 0, the EGLStream works in mailbox mode, which means that each time the producer has a new frame it will empty the stream content and replace the current frame with the new one. If greater than 0, the EGLStream works in fifo mode, which means that the stream queue can contain up to EGL_STREAM_FIFO_LENGTH_KHR frames.

Here we are configuring a queue of 1 frame because the specification of EGL_KHR_stream_producer_eglsurface guarantees that, in this case, calling eglSwapBuffers(...) on the producer side will block until the consumer reads the previous frame in the EGLStream queue. This ensures proper rendering synchronization between the application process side and the WPEWebProcess side without needing to rely on IPC, which would add an extra delay between frames.

The EGL_CONSUMER_ACQUIRE_TIMEOUT_USEC_KHR parameter defines the maximum timeout in microseconds to wait on the consumer side to acquire a frame when calling eglStreamConsumerAcquireKHR(...). It is only used with the EGL_KHR_stream_consumer_gltexture extension, as EGL_NV_stream_consumer_eglimage allows setting a different timeout for each call of its eglQueryStreamConsumerEventNV(...) acquire function.

2. Still on the consumer side, you get the EGLStream file descriptor by calling eglGetStreamFileDescriptorKHR(...). #

Then you need to initialize the consumer mechanism.

If you are using the EGL_KHR_stream_consumer_gltexture extension #

You initialize your OpenGL or GLES context and create the external texture associated with the EGLStream:

glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_EXTERNAL_OES, texture);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

eglStreamConsumerGLTextureExternalKHR(eglDisplay, eglStream);

By calling eglStreamConsumerGLTextureExternalKHR(...) the currently bound GL_TEXTURE_EXTERNAL_OES texture is associated with the provided EGLStream.

If you are using the EGL_NV_stream_consumer_eglimage extension #

In this case the initial configuration is easier as you only need to call the eglStreamImageConsumerConnectNV(...) function to initialize the consumer.

Sending the consumer EGLStream file descriptor to the producer #

Once the consumer is initialized, you need to send the EGLStream file descriptor to the producer process through IPC. If the producer process is the parent process of the consumer, you can just send the integer value of the file descriptor and then, on the producer process, get an open handle to the same resource by calling:

long procFD = syscall(SYS_pidfd_open, cpid, 0);           // pidfd referring to the consumer process
streamFD = syscall(SYS_pidfd_getfd, procFD, streamFD, 0); // duplicate its EGLStream fd locally (pidfd_getfd needs Linux 5.6+)
close(procFD);

with cpid the PID of the consumer child process and streamFD the integer value of the EGLStream file descriptor on the consumer process side.

If the producer and consumer processes are not related, you cannot use this technique because of process access rights. In this case, the only way of transferring the file descriptor resource from the consumer to the producer process is to use a Unix socket with the sendmsg(...) and recvmsg(...) functions and an SCM_RIGHTS message type (see the IPC communication integration).

3. During all this time, on the producer side, you are waiting for the EGLStream file descriptor. #

When you finally receive the file descriptor, you can create the producer endpoint of the EGLStream and initialize the producer mechanism with the EGL_KHR_stream_producer_eglsurface extension.

EGLStream eglStream = eglCreateStreamFromFileDescriptorKHR(eglDisplay, consumerFD);

const EGLint surfaceAttribs[] = {EGL_WIDTH, width, EGL_HEIGHT, height, EGL_NONE};
EGLSurface eglSurface = eglCreateStreamProducerSurfaceKHR(eglDisplay, config, eglStream, surfaceAttribs);

Just like with a classical PBuffer surface, you specify the dimensions of the EGLStream producer surface in the surface attributes. The provided EGL config is the same configuration as the one used to create the EGLContext. When you choose the EGLConfig with eglChooseConfig(...), you must specify the value EGL_STREAM_BIT_KHR for the EGL_SURFACE_TYPE parameter.

At the moment of selecting the EGLContext before drawing, you will bind this newly created eglSurface as the target surface. Then eglSwapBuffers(...) will automatically take care of finishing the rendering and sending the video buffer content to the consumer endpoint of the EGLStream.

sequenceDiagram
    participant A as Producer
    participant B as Consumer
    B ->> B: Create the EGLStream
    B ->> B: Get the EGLStream file descriptor
    B ->> B: Initialize the consumer
    B ->> A: Send the file descriptor
    A ->> A: Create the EGLStream
    A ->> A: Initialize the producer
    A --> B: Connected
    activate A
    activate B
    loop Rendering
        A ->> A: Draw frame
        A ->> B: Send frame
        B ->> B: Consume frame
        B -->> A: EGLStream ready for next frame
    end
    deactivate A
    deactivate B

How to consume the EGLStream frames #

As explained above, on the producer side there is nothing special to do. You can do the rendering just like with a conventional surface as the whole EGLStream mechanism is transparently handled during the eglSwapBuffers(...) call.

On the consumer side, you need to manually acquire and release the frames. The principle is the same with any of the consumer extensions:

  • you acquire the next frame from the EGLStream queue
  • you do the presentation
  • you release the frame when not needed anymore

but the actual function calls are slightly different.

With the EGL_KHR_stream_consumer_gltexture extension #

In this case, the rendering loop will look like this:

while (drawing)
{
    EGLint status = 0;
    if (!eglQueryStreamKHR(eglDisplay, eglStream, EGL_STREAM_STATE_KHR, &status))
        break;

    switch (status)
    {
    case EGL_STREAM_STATE_CREATED_KHR:
    case EGL_STREAM_STATE_CONNECTING_KHR:
        continue;

    case EGL_STREAM_STATE_DISCONNECTED_KHR:
        drawing = false;
        break;

    case EGL_STREAM_STATE_EMPTY_KHR:
    case EGL_STREAM_STATE_OLD_FRAME_AVAILABLE_KHR:
    case EGL_STREAM_STATE_NEW_FRAME_AVAILABLE_KHR:
        break;
    }
    if (!drawing)
        break;

    if (!eglStreamConsumerAcquireKHR(eglDisplay, eglStream))
        continue;

    // Drawing code...
    // The currently bound GL_TEXTURE_EXTERNAL_OES holds the content
    // of the new EGLStream frame

    eglSwapBuffers(eglDisplay, eglSurface);
    eglStreamConsumerReleaseKHR(eglDisplay, eglStream);
}

The eglStreamConsumerAcquireKHR(...) call will block during maximum EGL_CONSUMER_ACQUIRE_TIMEOUT_USEC_KHR microseconds or until receiving a frame (or until the EGLStream is disconnected). The value of EGL_CONSUMER_ACQUIRE_TIMEOUT_USEC_KHR is set once for all when creating the consumer EGLStream endpoint.

With the EGL_NV_stream_consumer_eglimage extension #

In this case, the rendering loop will look like this:

static constexpr EGLTime ACQUIRE_MAX_TIMEOUT_USEC = 1000 * 1000;
EGLImage eglImage = EGL_NO_IMAGE;

while (drawing)
{
    EGLenum event = 0;
    EGLAttrib data = 0;
    // WARNING: specifications state that the timeout is in nanoseconds
    // (see: https://registry.khronos.org/EGL/extensions/NV/EGL_NV_stream_consumer_eglimage.txt)
    // but in reality it is in microseconds (at least with the version 535.113.01 of the NVidia drivers)
    if (!eglQueryStreamConsumerEventNV(eglDisplay, eglStream, ACQUIRE_MAX_TIMEOUT_USEC, &event, &data))
    {
        drawing = false;
        continue;
    }

    switch (event)
    {
    case EGL_STREAM_IMAGE_ADD_NV:
        if (eglImage)
            eglDestroyImage(eglDisplay, eglImage);

        eglImage = eglCreateImage(eglDisplay, EGL_NO_CONTEXT, EGL_STREAM_CONSUMER_IMAGE_NV,
                                  static_cast<EGLClientBuffer>(eglStream), nullptr);
        continue;

    case EGL_STREAM_IMAGE_REMOVE_NV:
        if (data)
        {
            EGLImage image = reinterpret_cast<EGLImage>(data);
            eglDestroyImage(eglDisplay, image);
            if (image == eglImage)
                eglImage = EGL_NO_IMAGE;
        }
        continue;

    case EGL_STREAM_IMAGE_AVAILABLE_NV:
        if (eglStreamAcquireImageNV(eglDisplay, eglStream, &eglImage, EGL_NO_SYNC))
            break;
        else
            continue;

    default:
        continue;
    }

    // Present the EGLImage...

    eglStreamReleaseImageNV(eglDisplay, eglStream, eglImage, EGL_NO_SYNC);
}

if (eglImage)
    eglDestroyImage(eglDisplay, eglImage);

Integration of EGLStreams in WPEBackend-offscreen-nvidia #

The WPEBackend-offscreen-nvidia project is based on the WPEBackend-direct architecture presented in the previous post, with the following differences:

On the application process side #

The ViewBackend initializes an EGL surfaceless display using the EGL_MESA_platform_surfaceless extension in order to be able to create a consumer EGLStream.
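For illustration, creating such a surfaceless display usually boils down to something like the following hedged sketch (error handling and extension-presence checks are omitted, and the function name is mine):

#include <EGL/egl.h>
#include <EGL/eglext.h>

// Requires EGL_EXT_platform_base and EGL_MESA_platform_surfaceless.
EGLDisplay createSurfacelessDisplay()
{
    auto getPlatformDisplay = reinterpret_cast<PFNEGLGETPLATFORMDISPLAYEXTPROC>(
        eglGetProcAddress("eglGetPlatformDisplayEXT"));

    EGLDisplay display =
        getPlatformDisplay(EGL_PLATFORM_SURFACELESS_MESA, nullptr, nullptr);
    eglInitialize(display, nullptr, nullptr);
    return display;
}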

Then the WPE Backend consumes the EGLStream frames from a separate thread while the end-user application callback is called from the application main thread. This separation is needed to avoid side effects between the application's rendering EGL display and the consumer surfaceless display.

After acquiring a new frame, the consumer thread waits until the end-user application main thread calls the wpe_offscreen_nvidia_view_backend_dispatch_frame_complete(...) function. This notifies the WPE Backend that the frame can be released. Then the consumer thread releases the frame and tries to acquire the next one. Unlike in WPEBackend-direct, there is no need for further IPC synchronization as the EGLStream producer (on the WPEWebProcess side) is blocked until the frame is released on the consumer side. The rendering synchronization is implicit through the EGLStream mechanisms.

sequenceDiagram
    participant A as WPE Backend - Consumer thread
    participant B as WPE Backend - Application main thread
    participant C as End-user application
    A ->> A: Acquire frame from producer
    activate A
    A ->> A: wait
    B ->> C: Call presentation callback with actual EGLImage
    C ->> C: Use the EGLImage
    C ->> B: wpe_offscreen_nvidia_view_backend_dispatch_frame_complete(...)
    B ->> A: Unlock consumer thread
    deactivate A
    A ->> A: Release frame

The IPC communication #

As far as IPC communication is concerned, WPEBackend-offscreen-nvidia adds the possibility to transfer file descriptors between processes using a Unix socket channel and the SCM_RIGHTS message type.

bool Channel::writeFileDescriptor(int fd) noexcept
{
    assert(m_localFd != -1);
    if (fd == -1)
        return false;

    union {
        cmsghdr header;
        char buffer[CMSG_SPACE(sizeof(int))];
    } control = {};

    msghdr msg = {};
    msg.msg_control = control.buffer;
    msg.msg_controllen = sizeof(control.buffer);

    cmsghdr* header = CMSG_FIRSTHDR(&msg);
    header->cmsg_len = CMSG_LEN(sizeof(int));
    header->cmsg_level = SOL_SOCKET;
    header->cmsg_type = SCM_RIGHTS;
    *reinterpret_cast<int*>(CMSG_DATA(header)) = fd;

    if (sendmsg(m_localFd, &msg, MSG_EOR | MSG_NOSIGNAL) == -1)
    {
        closeChannel();
        m_handler.handleError(*this, errno);
        return false;
    }

    return true;
}

int Channel::readFileDescriptor() noexcept
{
    assert(m_localFd != -1);

    union {
        cmsghdr header;
        char buffer[CMSG_SPACE(sizeof(int))];
    } control = {};

    msghdr msg = {};
    msg.msg_control = control.buffer;
    msg.msg_controllen = sizeof(control.buffer);

    if (recvmsg(m_localFd, &msg, MSG_WAITALL) == -1)
    {
        closeChannel();
        m_handler.handleError(*this, errno);
        return -1;
    }

    cmsghdr* header = CMSG_FIRSTHDR(&msg);
    if (header && (header->cmsg_len == CMSG_LEN(sizeof(int))) && (header->cmsg_level == SOL_SOCKET) &&
        (header->cmsg_type == SCM_RIGHTS))
    {
        int fd = *reinterpret_cast<int*>(CMSG_DATA(header));
        return fd;
    }

    return -1;
}

On the WPEWebProcess side #

The RendererBackendEGL provides a surfaceless platform with the EGL_PLATFORM_SURFACELESS_MESA value. In order to work correctly with WPE WebKit it requires a small patch in the WebKit source code (see here).

By default, the WPE WebKit code sets the EGL_WINDOW_BIT value for the EGL_SURFACE_TYPE parameter when using surfaceless EGL contexts in the GLContextEGL.cpp file. We need to set the EGL_STREAM_BIT_KHR value to select EGLConfigs compatible with an EGLStream producer surface.

Then in RendererBackendEGLTarget, the only differences are:

  • in the RendererBackendEGLTarget::frameWillRender(...) method it lazily initializes the producer EGLStream if needed (as soon as it has received the EGLStream file descriptor from the consumer side). And then it attaches the EGLStream producer surface to the current ThreadedCompositor EGLContext. This way all further drawing will be re-targeted to the producer surface.
  • in the RendererBackendEGLTarget::frameRendered(...) method it swaps the producer surface buffers (which will block until the consumer acquires the current new frame) and then directly calls the wpe_renderer_backend_egl_target_dispatch_frame_complete(...) function, as the rendering synchronization has already implicitly occurred during the buffer swapping phase.

The rest of the code is the same as in WPEBackend-direct.

Using WPEBackend-offscreen-nvidia in a docker container without windowing system #

As the producer (WPEWebProcess side) and the consumer (application process side) are both using EGL surfaceless contexts, WPE WebKit running with the WPEBackend-offscreen-nvidia backend can work on a Linux system with no windowing system at all.

This is very useful when doing offscreen rendering on the server side using docker containers, for example. As soon as you have an NVidia GPU correctly configured on the host, you can create lightweight containers with a minimal set of packages and do web rendering into EGLImages with full hardware acceleration and zero-copy of the video buffers from the GPU to the CPU (you may or may not still have copies inside the GPU memory, depending on the driver implementation and the usage of your final EGLImages).

To use the host NVidia GPU from within a docker container, you first need to install the NVIDIA Container Toolkit.

Then the WPEBackend-offscreen-nvidia project provides a Dockerfile example to build the whole backend with the WPE WebKit patched library, and to run it with the provided webview-sample. It also provides a docker-compose.yml example to run the previous webview-sample with the NVidia GPU.

These examples still install some X11 dependencies as the webview-sample creates an X11 window to present the content of the produced EGLImages. If you are using WPE WebKit and the WPEBackend-offscreen-nvidia backend in an environment with no windowing system, you will only need to install the following dependencies:

# Expose the host NVIDIA GPU and its graphics capabilities to the container
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=graphics

# Install the WPE WebKit runtime dependencies
RUN apt-get update -qq && \
    apt-get upgrade -qq && \
    apt-get install -qq --no-install-recommends \
        gstreamer1.0-plugins-good gstreamer1.0-plugins-bad \
        libgstreamer-gl1.0-0 libepoxy0 libharfbuzz-icu0 libjpeg8 \
        libxslt1.1 liblcms2-2 libopenjp2-7 libwebpdemux2 libwoff1 libcairo2 \
        libglib2.0-0 libsoup2.4-1 libegl1-mesa

WORKDIR /root
# Register the NVIDIA EGL vendor library with GLVND
RUN mkdir -p /usr/share/glvnd/egl_vendor.d && \
    cat <<EOF > /usr/share/glvnd/egl_vendor.d/10_nvidia.json
{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}
EOF

The GStreamer plugins are only needed if you want to use multimedia features; otherwise you can remove them.

November 07, 2023 12:00 AM

November 06, 2023

José Dapena

Keep GCC running (2023 update)

Last week I attended BlinkOn 18 at the Google offices in Sunnyvale. For the 5th time I presented the status of building Chromium with the GNU toolchain components: the GCC compiler and the libstdc++ standard library implementation.

This blog post recaps the current status in 2023, as it was presented in the conference.

Current status

First things first: GCC support is still maintained, and working, on the official Chromium development releases! Though, as the official build bots do not check it, fixes usually take a few extra days to land.

This is the result of the work of contributors from several companies (Igalia, LGE, Vewd and others) but, most importantly, of individual contributors (in the last 2 years the main contributor, with more than 50% of the commits, has been Stephan Hartmann from Gentoo).

GCC support

The work to support GCC is coordinated from the GCC Chromium meta bug.

Since I started tracking GCC support we have been getting a fairly stable number of contributions, 70-100 per year. Though in 2023 (even if we are only counting 8 months on this chart), we are well below 40.

What happened? I am not really sure, though I have some ideas. First, simply as we move to newer versions of GCC, the implementation of recent C++ standards has improved. But this is also the result of the great work that has been done recently in both GCC and LLVM, and in the standardization process, to get more interoperable implementations.

Main GCC problems

Now, I will focus on the most frequent causes for build breakage affecting GCC since January 2022.

Variable clashes

The rules for name visibility in C++ are applied slightly differently by GCC. An example: if a class declares a getter with the same name as a type declared in the current namespace, Clang will mostly be OK with that, but GCC will fail with an error.

Bad:

using Foo = int;

class A {
    ...
    Foo Foo();  // GCC rejects this: the member name Foo conflicts with the type Foo
    ...
};

A possible fix is renaming the accessor method name:

Foo GetFoo();

Or using an explicit namespace qualification for the type:

::Foo GetFoo();

Ambiguous constructors

GCC may sometimes fail to resolve which constructor to use when there is an implicit type conversion. To avoid that, we can make the conversion explicit or use braced initialization.
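
As a purely illustrative sketch (hypothetical code, not taken from Chromium), the pattern and the two possible fixes look roughly like this:

struct Value
{
    Value(int) {}
    Value(double) {}
};

int main()
{
    long n = 42;
    // Value ambiguous(n);  // both constructors are viable: may be reported as ambiguous
    Value v1(static_cast<int>(n));     // fix 1: make the conversion explicit
    Value v2{static_cast<double>(n)};  // fix 2: braced initialization with the intended type
    (void)v1;
    (void)v2;
    return 0;
}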

constexpr is more strict in GCC

In GCC, a constexpr method (even one declared as default) demands that all its parts are also declared constexpr:

Bad

int MethodA() { ... }

constexpr int MethodB() { ... MethodA() ... }  // GCC: MethodA() is not constexpr

Two possible fixes: either drop constexpr from the calling method, or add constexpr to all the methods it uses.
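
For illustration only (hypothetical code, not from Chromium), the second fix applied to the sketch above could look like this:

constexpr int MethodA() { return 1; }

constexpr int MethodB() { return MethodA() + 1; }  // fine: everything it uses is constexpr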

noexcept strictness

noexcept strictness in GCC and Clang is the same when exceptions are enabled, requiring that all functions invoked from a noexcept function are also noexcept. Though, when exceptions are disabled using -fno-exceptions, Clang will ignore the errors, but GCC will still check the rules.

This works in Clang but not in GCC if -fno-exceptions is set:

class A {
    ...
    virtual int Method();
};

class B : public A {
    ...
    int Method() noexcept override;
};

CPU intrinsics casts

Implicit casts of CPU intrinsic types will fail in GCC, requiring explicit conversion or cast.

As an example:

int8_t input __attribute__((vector_size(16)));
uint16_t output = _mm_movemask_epi8(input);

If we see the declaration of the intrinsic call:

int _mm_movemask_epi8 (__m128i a);

GCC will not allow an implicit cast from int8_t __attribute__((vector_size(16))), even though the storage is the same. With GCC we need a reinterpret_cast:

int8_t input __attribute__((vector_size(16)));
uint16_t output = _mm_movemask_epi8(reinterpret_cast<__m128i>(input));

Template specializations need to be in a namespace scope

Template specializations are not allowed in GCC if they are not at namespace scope. Usually this error shows up when a developer adds the template specialization inside the templated class scope.

A failing example:

namespace a {
    class A {
        template<typename T>
        T Foo(const T& t) {
            ...
        }

        template<>
        size_t Foo(const size_t& t) {
            ...
        }
    };
}

The fix is moving the template specialization to the namespace scope:

namespace a {
    class A {
        template<typename T>
        T Foo(const T& t) {
            ...
        }
    };

    template<>
    size_t A::Foo(const size_t& t) {
        ...
    }
}

libstdc++ support

Regarding the C++ standard library implementation from GNU project, the work is coordinated in the libstdc++ Chromium meta bug.

We have definitely observed a big increase in the number of required fixes in 2022, and the projection for 2023 is going to be similar.

In this case there are two possible reasons. One is that we are more exhaustively tracking the work using the meta bug.

Main libstdc++ problems

In the case of libstdc++, these are the most frequent causes for build breakage since January 2022.

Missing includes

The libc++ implementation is different from libstdc++, and that implies that some library headers that are indirectly included when building with libc++ are not indirectly included with libstdc++, so they need to be included explicitly.
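
As a hypothetical illustration of the pattern (the exact headers involved vary from case to case):

// Code that relied on <string> being pulled in indirectly by another standard
// header may build with libc++ but fail with libstdc++.
#include <vector>
#include <string>  // the usual fix: directly include what you use

std::vector<std::string> names;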

STL containers not allowing const members

Let’s see this code:

std::vector<const int> v;

With libc++ this is allowed. Though, in libstdc++ there is an explicit assert forbidding it: std::vector must have a non-const, non-volatile value_type.

This is not specific to std::vector: it also applies to std::unordered_*, std::list, std::set, std::deque, std::forward_list and std::multiset.

The solution? Just do not use const for the contained value type:

std::vector<int> v;

Destructor of unique_ptr requires declaration of contained type

Assigning nullptr to a std::unique_ptr requires the destructor declaration of the contained type:

class A;

std::unique_ptr<A> GetValue() {
    return nullptr;
}

Usually the fix is just to include the full declaration of the class, as the default deleter needs the complete type to be able to destroy it.
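
For illustration, assuming a hypothetical header a.h that contains the complete definition of class A, the usual fix looks like this:

#include <memory>

#include "a.h"  // hypothetical header providing the complete definition of A

std::unique_ptr<A> GetValue() {
    return nullptr;  // fine now: the default deleter can see the complete type
}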

Yocto meta-chromium layer

In my case, to verify GCC and libstdc++ support, on top of building Chromium in Ubuntu using both, I am also maintaining a Yocto layer that builds Chromium development releases. I usually try to verify the build in less than a week after each release.

The layer is available at github.com/Igalia/meta-chromium

I am regularly testing the build using core-image-weston on both Raspberry Pi 4 (64 bits) and Intel x86-64 (using QEMU). There I try both the X11 and Wayland Ozone backends.

Wrapping up

I would still recommend people move to the officially supported Chromium toolchain: Clang and libc++. With it, downstreams get a far more tested implementation, more security features, and better integration with sanitizers. But, meanwhile, I expect things will be kept working, as several downstreams and distributions will still ship Chromium on top of the GNU toolchain.

Even with the GNU toolchain not being officially supported in Chromium, the community has been successful in providing support for both GCC and libstdc++. Thanks to all the contributors!

by José Dapena Paz at November 06, 2023 06:53 AM

November 01, 2023

Eric Meyer

Blinded By the Light DOM

For a while now, Web Components (which I’m not going to capitalize again, you’re welcome) have been one of those things that pop up in the general web conversation, seem intriguing, and then fade into the background again.

I freely admit a lot of this experience is due to me, who is not all that thrilled with the Shadow DOM in general and all the shenanigans required to cross from the Light Side to the Dark Side in particular.  I like the Light DOM.  It’s designed to work together pretty well.  This whole high-fantasy-flavored Shadowlands of the DOM thing just doesn’t sit right with me.

If they do for you, that’s great!  Rock on with your bad self.  I say all this mostly to set the stage for why I only recently had a breakthrough using web components, and now I quite like them.  But not the shadow kind.  I’m talking about Fully Light-DOM Components here.

It started with a one-two punch: first, I read Jim Nielsen’s “Using Web Components on My Icon Galleries Websites”, which I didn’t really get the first few times I read it, but I could tell there was something new (to me) there.  Very shortly thereafter, I saw Dave Rupert’s <fit-vids> CodePen, and that’s when the Light DOM Bulb went off in my head.  You just take some normal HTML markup, wrap it with a custom element, and then write some JS to add capabilities which you can then style with regular CSS!  Everything’s of the Light Side of the Web.  No need to pierce the Vale of Shadows or whatever.

Kindly permit me to illustrate at great length and in some depth, using a thing I created while developing a tool for internal use at Igalia as the basis.  Suppose you have some range inputs, just some happy little slider controls on your page, ready to change some values, like this:

<label for="title-size">Title font size</label>
<input id="title-size" type="range" min="0.5" max="4" step="0.1" value="2" />

The idea here is that you use the slider to change the font size of an element of some kind.  Using HTML’s built-in attributes for range inputs, I set a minimum, maximum, and initial value, the step size permitted for value changes, and an ID so a <label> can be associated with it.  Dirt-standard HTML stuff, in other words.  Given that this markup exists in the page, then, it needs to be hooked up to the thing it’s supposed to change.

In Ye Olden Days, you’d need to write a function to go through the entire DOM looking for these controls (maybe you’d add a specific class to the ones you need to find), figure out how to associate them with the element they’re supposed to affect (a title, in this case), add listeners, and so on.  It might go something like:

let sliders = document.querySelectorAll('input[id]');
for (let i = 0; i < sliders.length; i++) {
	let slider = sliders[i];
	// …add event listeners
	// …target element to control
	// …set behaviors, maybe call external functions
	// …etc., etc., etc.
}

Then you’d have to stuff all that into a window.onload observer or otherwise defer the script until the document is finished loading.

To be clear, you can absolutely still do it that way.  Sometimes, it’s even the most sensible choice!  But fully-light-DOM components can make a lot of this easier, more reusable, and robust.  We can add some custom elements to the page and use those as a foundation for scripting advanced behavior.

Now, if you’re like me (and I know I am), you might think of converting everything into a completely bespoke element and then forcing all the things you want to do with it into its attributes, like this:

<super-slider type="range" min="0.5" max="4" step="0.1" value="2"
	          unit="em" target=".preview h1">
Title font size
</super-slider>

Don’t do this.  If you do, then you end up having to reconstruct the HTML you want to exist out of the data you stuck on the custom element.  As in, you have to read off the type, min, max, step, and value attributes of the <super-slider> element, then create an <input> element and add the attributes and their values you just read off <super-slider>, create a <label> and insert the <super-slider>’s text content into the label’s text content, and why?  Why did I do this to myse —  uh, I mean, why do this to yourself?

Do this instead:

<super-slider unit="em" target=".preview h1">
	<label for="title-size">Title font size</label>
	<input id="title-size" type="range" min="0.5" max="4" step="0.1" value="2" />
</super-slider>

This is the pattern I got from <fit-vids>, and the moment that really broke down the barrier I’d had to understanding what makes web components so valuable.  By taking this approach, you get everything HTML gives you with the <label> and <input> elements for free, and you can add things on top of it.  It’s pure progressive enhancement.

Side note: keep in mind that the name <super-slider> was chosen, just as with <fit-vids>, because custom elements must (in the RFC 2119 sense of that word) have a hyphen within their name.  You can’t call your element <betterH1> or <fun_textarea> and have it work; you’d have to call them <better-H1> and <fun-textarea> instead.

Also, don’t @ me about the H1 being uppercase.  Element names can be capitalized however you want; we lowercase them out of historical inertia and unexamined habit.

To figure out how all this goes together, I found MDN’s page “Using custom elements” really quite valuable.  That’s where I internalized the reality that instead of having to scrape the DOM for custom elements and then run through a loop, I could extend HTML itself:

class superSlider extends HTMLElement {
	connectedCallback() {
		//
		// the magic happens here!
		//
	}
}

customElements.define("super-slider",superSlider);

What that last line does is tell the browser, “any <super-slider> element is of the superSlider JavaScript class”.  Which means, any time the browser sees <super-slider>, it does the stuff that’s defined by class superSlider in the script.  Which is the thing in the previous code block!  So let’s talk about how it works, with concrete examples.

It’s the class structure that holds the real power.  Inside there, connectedCallback() is invoked whenever a <super-slider> is connected; that is, whenever one is encountered in the page by the browser as it parses the markup, or when one is added to the page later on.  It’s an auto-startup callback.  (What’s a callback? I’ve never truly understood that, but it turns out I don’t have to!)  So in there, I write something like:

connectedCallback() {
	let targetEl = document.querySelector(this.getAttribute('target'));
	let unit = this.getAttribute('unit');
	let slider = this.querySelector('input[type="range"]');
}

So far, all I’ve done here is:

  • Used the value of the target attribute on <super-slider> to find the element that the range slider should affect, using a CSS-esque query.
  • Used the unit attribute’s value to know what CSS unit I’ll be using later in the code.
  • Grabbed the range input itself by running a querySelector() within the <super-slider> element.

With all those things defined, I can add an event listener to the range input:

slider.addEventListener("input",(e) => {
	let value = slider.value + unit;
	targetEl.style.setProperty('font-size',value);
});

…and really, that’s it.  Put all together:

class superSlider extends HTMLElement {
	connectedCallback() {
		let targetEl = document.querySelector(this.getAttribute('target'));
		let unit = this.getAttribute('unit');
		let slider = this.querySelector('input[type="range"]');
		slider.addEventListener("input",(e) => {
			targetEl.style.setProperty('font-size',slider.value + unit);
		});
	}
}

customElements.define("super-slider",superSlider);

You can see it in action with this CodePen.

See the Pen WebCOLD 01 by Eric A. Meyer (@meyerweb) on CodePen: https://codepen.io/meyerweb/pen/oNmXJRX

As I said earlier, you can get to essentially the same result by running document.querySelectorAll('super-slider') and then looping through the collection to find all the bits and bobs and add the event listeners and so on.  In a sense, that’s what I’ve done above, except I didn’t have to do the scraping and looping and waiting until the document has loaded  —  using web components abstracts all of that away.  I’m also registering all the components with the browser via customElements.define(), so there’s that too.  Overall, somehow, it just feels cleaner.

One thing that sets customElements.define() apart from the collect-and-loop-after-page-load approach is that custom elements fire all that connection callback code on themselves whenever they’re added to the document, all nice and encapsulated.  Imagine for a moment an application where custom elements are added well after page load, perhaps as the result of user input.  No problem!  There isn’t the need to repeat the collect-and-loop code, which would likely have to have special handling to figure out which are the new elements and which already existed.  It’s incredibly handy and much easier to work with.

But that’s not all!  Suppose we want to add a “reset” button  —  a control that lets you set the slider back to its starting value.  Adding some code to the connectedCallback() can make that happen.  There’s probably a bunch of different ways to do this, so what follows likely isn’t the most clever or re-usable way.  It is, instead, the way that made sense to me at the time.

let reset = slider.getAttribute('value');
let resetter = document.createElement('button');
resetter.textContent = '↺';
resetter.setAttribute('title', reset + unit);
resetter.addEventListener("click",(e) => {
	slider.value = reset;
	slider.dispatchEvent(
	    new MouseEvent('input', {view: window, bubbles: false})
	);
});
slider.after(resetter);

With that code added into the connection callback, a button gets added right after the slider, and it shows a little circle-arrow to convey the concept of resetting.  You could just as easily make its text “Reset”.  When said button is clicked or keyboard-activated ("click" handles both, it seems), the slider is reset to the stored initial value, and then an input event is fired at the slider so the target element’s style will also be updated.  This is probably an ugly, ugly way to do this!  I did it anyway.

See the Pen WebCOLD 02 by Eric A. Meyer (@meyerweb) on CodePen: https://codepen.io/meyerweb/pen/jOdPdyQ

Okay, so now that I can reset the value, maybe I’d also like to see what the value is, at any given moment in time?  Say, by inserting a classed <span> right after the label and making its text content show the current combination of value and unit?

let label = this.querySelector('label');
let readout = document.createElement('span');
readout.classList.add('readout');
readout.textContent = slider.value + unit;
label.after(readout);

Plus, I’ll need to add the same text content update thing to the slider’s handling of input events:

slider.addEventListener("input", (e) => {
	targetEl.style.setProperty("font-size", slider.value + unit);
	readout.textContent = slider.value + unit;
});

I imagine I could have made this readout-updating thing a little more generic (less DRY, if you like) by creating some kind of getter/setter things on the JS class, which is totally possible to do, but that felt like a little much for this particular situation.  Or I could have broken the readout update into its own function, either within the class or external to it, and passed in the readout and slider and reset value and unit to cause the update.  That seems awfully clumsy, though.  Maybe figuring out how to make the span a thing that observes slider changes and updates automatically?  I dunno, just writing the same thing in two places seemed a lot easier, so that’s how I did it.

So, at this point, here’s the entirety of the script, with a CodePen example of the same thing immediately after.

class superSlider extends HTMLElement {
	connectedCallback() {
		let targetEl = document.querySelector(this.getAttribute("target"));
		let unit = this.getAttribute("unit");

		let slider = this.querySelector('input[type="range"]');
		slider.addEventListener("input", (e) => {
			targetEl.style.setProperty("font-size", slider.value + unit);
			readout.textContent = slider.value + unit;
		});

		let reset = slider.getAttribute("value");
		let resetter = document.createElement("button");
		resetter.textContent = "↺";
		resetter.setAttribute("title", reset + unit);
		resetter.addEventListener("click", (e) => {
			slider.value = reset;
			slider.dispatchEvent(
				new MouseEvent("input", { view: window, bubbles: false })
			);
		});
		slider.after(resetter);

		let label = this.querySelector("label");
		let readout = document.createElement("span");
		readout.classList.add("readout");
		readout.textContent = slider.value + unit;
		label.after(readout);
	}
}

customElements.define("super-slider", superSlider);

See the Pen WebCOLD 03 by Eric A. Meyer (@meyerweb) on CodePen: https://codepen.io/meyerweb/pen/NWoGbWX

Anything you can imagine JS would let you do to the HTML and CSS, you can do in here.  Add a class to the slider when it has a value other than its default value so you can style the reset button to fade in or be given a red outline, for example.

Or maybe do what I did, and add some structural-fix-up code.  For example, suppose I were to write:

<super-slider unit="em" target=".preview h2">
	<label>Subtitle font size</label>
	<input type="range" min="0.5" max="2.5" step="0.1" value="1.5" />
</super-slider>

In that bit of markup, I left off the id on the <input> and the for on the <label>, which means they have no structural association with each other.  (You should never do this, but sometimes it happens.)  To handle this sort of failing, I threw some code into the connection callback to detect and fix those kinds of authoring errors, because why not?  It goes a little something like this:

if (!label.getAttribute('for') && slider.getAttribute('id')) {
	label.setAttribute('for',slider.getAttribute('id'));
}
if (label.getAttribute('for') && !slider.getAttribute('id')) {
	slider.setAttribute('id',label.getAttribute('for'));
}
if (!label.getAttribute('for') && !slider.getAttribute('id')) {
	let connector = label.textContent.replace(' ','_');
	label.setAttribute('for',connector);
	slider.setAttribute('id',connector);
}

Once more, this is probably the ugliest way to do this in JS, but also again, it works.  Now I’m making sure labels and inputs have association even when the author forgot to explicitly define it, which I count as a win.  If I were feeling particularly spicy, I’d have the code pop an alert chastising me for screwing up, so that I’d fix it instead of being a lazy author.

It also occurs to me, as I review this for publication, that I didn’t try to do anything in situations where both the for and id attributes are present, but their values don’t match.  That feels like something I should auto-fix, since I can’t imagine a scenario where they would need to intentionally be different.  It’s possible my imagination is lacking, of course.

So now, here’s all just-over-40 lines of the script that makes all this work, followed by a CodePen demonstrating it.

class superSlider extends HTMLElement {
	connectedCallback() {
		let targetEl = document.querySelector(this.getAttribute("target"));
		let unit = this.getAttribute("unit");

		let slider = this.querySelector('input[type="range"]');
		slider.addEventListener("input", (e) => {
			targetEl.style.setProperty("font-size", slider.value + unit);
			readout.textContent = slider.value + unit;
		});

		let reset = slider.getAttribute("value");
		let resetter = document.createElement("button");
		resetter.textContent = "↺";
		resetter.setAttribute("title", reset + unit);
		resetter.addEventListener("click", (e) => {
			slider.value = reset;
			slider.dispatchEvent(
				new MouseEvent("input", { view: window, bubbles: false })
			);
		});
		slider.after(resetter);

		let label = this.querySelector("label");
		let readout = document.createElement("span");
		readout.classList.add("readout");
		readout.textContent = slider.value + unit;
		label.after(readout);

		if (!label.getAttribute("for") && slider.getAttribute("id")) {
			label.setAttribute("for", slider.getAttribute("id"));
		}
		if (label.getAttribute("for") && !slider.getAttribute("id")) {
			slider.setAttribute("id", label.getAttribute("for"));
		}
		if (!label.getAttribute("for") && !slider.getAttribute("id")) {
			let connector = label.textContent.replace(" ", "_");
			label.setAttribute("for", connector);
			slider.setAttribute("id", connector);
		}
	}
}

customElements.define("super-slider", superSlider);

See the Pen WebCOLD 04 by Eric A. Meyer (@meyerweb) on CodePen: https://codepen.io/meyerweb/pen/PoVPbzK

There are doubtless cleaner/more elegant/more clever ways to do pretty much everything I did above, considering I’m not much better than an experienced amateur when it comes to JavaScript.  Don’t focus so much on the specifics of what I wrote, and more on the overall concepts at play.

I will say that I ended up using this custom element to affect more than just font sizes.  In some places I wanted to alter margins; in others, the hue angle of colors.  There are a couple of ways to do this.  The first is what I did, which is to use a bunch of CSS variables and change their values.  So the markup and relevant bits of the JS looked more like this:

<super-slider unit="em" variable="titleSize">
	<label for="title-size">Title font size</label>
	<input id="title-size" type="range" min="0.5" max="4" step="0.1" value="2" />
</super-slider>

let cssvar = this.getAttribute("variable");
let section = this.closest('section');

slider.addEventListener("input", (e) => {
	section.style.setProperty(`--${cssvar}`, slider.value + unit);
	readout.textContent = slider.value + unit;
});

The other way (that I can think of) would be to declare the target element’s selector and the property you want to alter, like this:

<super-slider unit="em" target=".preview h1" property="font-size">
	<label for="title-size">Title font size</label>
	<input id="title-size" type="range" min="0.5" max="4" step="0.1" value="2" />
</super-slider>

I’ll leave the associated JS as an exercise for the reader.  I can think of reasons to do either of those approaches.

But wait!  There’s more! Not more in-depth JS coding (even though we could absolutely keep going, and in the tool I built, I absolutely did), but there are some things to talk about before wrapping up.

First, if you need to invoke the class’s constructor for whatever reason — I’m sure there are reasons, whatever they may be — you have to do it with a super() up top.  Why?  I don’t know.  Why would you need to?  I don’t know.  If I read the intro to the super page correctly, I think it has something to do with class prototypes, but the rest went so far over my head the FAA issued a NOTAM.  Apparently I didn’t do anything that depends on the constructor in this article, so I didn’t bother including it.

Second, basically all the JS I wrote in this article went into the connectedCallback() structure.  This is only one of four built-in callbacks!  The others are:

  • disconnectedCallback(), which is fired whenever a custom element of this type is removed from the page.  This seems useful if you have things that can be added or subtracted dynamically, and you want to update other parts of the DOM when they’re subtracted.
  • adoptedCallback(), which is (to quote MDN) “called each time the element is moved to a new document.” I have no idea what that means.  I understand all the words; it’s just that particular combination of them that confuses me.
  • attributeChangedCallback(), which is fired when attributes of the custom element change.  I thought about trying to use this for my super-sliders, but in the end, nothing I was doing made sense (to me) to bubble up to the custom element just to monitor and act upon.  A use case that does suggest itself: if I allowed users to change the sizing unit, say from em to vh, I’d want to change other things, like the min, max, step, and default value attributes of the sliders.  So, since I’d have to change the value of the unit attribute anyway, it might make sense to use attributeChangedCallback() to watch for that sort of thing and then take action.  Maybe!

Third, I didn’t really talk about styling any of this.  Well, because all of this stuff is in the Light DOM, I don’t have to worry about Shadow Walls or whatever, I can style everything the normal way.  Here’s a part of the CSS I use in the CodePens, just to make things look a little nicer:

super-slider {
	display: flex;
	align-items: center;
	margin-block: 1em;
}
super-slider input[type="range"] {
	margin-inline: 0.25em 1px;
}
super-slider .readout {
	width: 3em;
	margin-inline: 0.25em;
	padding-inline: 0.5em;
	border: 1px solid #0003;
	background: #EEE;
	font: 1em monospace;
	text-align: center;
}

Hopefully that all makes sense, but if not, let me know in the comments and I’ll clarify.

A thing I didn’t do was use the :defined pseudo-class to style custom elements that are defined, or rather, to style those that are not defined.  Remember the last line of the script, where customElements.define() is called to define the custom elements?  Because they are defined that way, I could add some CSS like this:

super-slider:not(:defined) {
	display: none;
}

In other words, if a <super-slider> for some reason isn’t defined, make it and everything inside it just… go away.  Once it becomes defined, the selector will no longer match, and the display: none will be peeled away.  You could use visibility or opacity instead of display; really, it’s up to you.  Heck, you could tile red warning icons in the whole background of the custom element if it hasn’t been defined yet, just to drive the point home.

The beauty of all this is, you don’t have to mess with Shadow DOM selectors like ::part() or ::slotted().  You can just style elements the way you always style them, whether they’re built into HTML or special hyphenated elements you made up for your situation and then, like the Boiling Isles’ most powerful witch, called into being.

That said, there’s a “fourth” here, which is that Shadow DOM does offer one very powerful capability that fully Light DOM custom elements lack: the ability to create a structural template with <slot> elements, and then drop your Light-DOM elements into those slots.  This slotting ability does make Shadowy web components a lot more robust and easier to share around, because as long as the slot names stay the same, the template can be changed without breaking anything.  This is a level of robustness that the approach I explored above lacks, and it’s built in.  It’s the one thing I actually do like about Shadow DOM.

It’s true that in a case like I’ve written about here, that’s not a huge issue: I was quickly building a web component for a single tool that I could re-use within the context of that tool.  It works fine in that context.  It isn’t portable, in the sense of being a thing I could turn into an npm package for others to use, or probably even share around my organization for other teams to use.  But then, I only put 40-50 lines worth of coding into it, and was able to rapidly iterate to create something that met my needs perfectly.  I’m a lot more inclined to take this approach in the future, when the need arises, which will be a very powerful addition to my web development toolbox.

I’d love to see the templating/slotting capabilities of Shadow DOM brought into the fully Light-DOM component world.  Maybe that’s what Declarative Shadow DOM is?  Or maybe not!  My eyes still go cross-glazed whenever I try to read articles about Shadow DOM, almost like a trickster demon lurking in the shadows casts a Spell of Confusion at me.

So there you have it: a few thousand words on my journey through coming to understand and work with these fully-Light-DOM web components, otherwise known as custom elements.  Now all they need is a catchy name, so we can draw more people to the Light Side of the Web.  If you have any ideas, please drop ’em in the comments!


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at November 01, 2023 06:09 PM

October 31, 2023

Víctor Jáquez

GStreamer Conference 2023

This year the GStreamer Conference happened in A Coruña, basically at home, along with the hackfest.

The conference was the first after a long hiatus of four years due to the pandemic. The community craved it and had long expected it. Some Igalians helped the GStreamer Foundation and our warm community with the organization and logistics. I’m very thankful to my peers and the sponsors of the event. Personally, I’m happy with the outcome. Though, I ought to say, organizing a conference like this is quite a challenge and very demanding.

Palexco Marina

The conference was recorded and streamed by Ubicast, and you can watch any presentation from the conference in their GStreamer Archive.

Tim sharing the State of the Union

Lunch time in Palexco

This is the list of talks where fellow Igalians participated:

Vulkan Video in GStreamer

Video Editing with GStreamer: an update

MSE and EME on GStreamer based WebKit ports

High-level WebRTC APIs in GStreamer

GstPipelineStudio version 0.3.0 is out !

WebCodecs in WebKit, with GStreamer!

Updates in GStreamer VA-API (and DMABuf negotiation)

GstWebRTC in WebKit, current status and challenges

There were two days of conference. The following two were for the hackfest, at Igalia’s headquarters.

Day one of the Hackfest

by vjaquez at October 31, 2023 01:10 PM

Jesse Alama

The decimals around us: Cataloging support for decimal numbers

A catalog of support for decimal numbers in various programming languages

Decimals are a data type that aims to exactly represent decimal numbers. Some programmers may not know, or fully realize, that, in most programming languages, the numbers that you enter look like decimal numbers but internally are represented as binary—that is, base-2—floating-point numbers. Things that are totally simple for us, such as 0.1, simply cannot be represented exactly in binary. The decimal data type—whatever its stripe or flavor—aims to remedy this by giving us a way of representing and working with decimal numbers, not binary approximations thereof. (Wikipedia has more.)

To help with my work on adding decimals to JavaScript, I've gone through a list of popular programming languages, taken from the 2022 StackOverflow developer survey. What follows is a brief summary of where these languages stand regarding decimals. The intention is to keep things simple. The purpose is:

  1. If a language does have decimals, say so;
  2. If a language does not have decimals, but at least one third-party library exists, mention it and link to it. If a discussion is underway to add decimals to the language, link to that discussion.

There is no intention to filter out any language in particular; I'm just working with a slice of languages found in the StackOverflow list linked to earlier. If a language does not have decimals, there may well be multiple third-party decimal libraries. I'm not aware of all libraries, so if I have linked to a minor library and neglected to link to a more high-profile one, please let me know. More importantly, if I have misrepresented the basic fact of whether decimals exist at all in a language, send mail.

C

C does not have decimals. But they're working on it! The C23 (as in, 2023) standard proposes to add new fixed bit-width data types (32, 64, and 128) for these numbers.

C#

C# has decimals in its underlying .NET subsystem. (For the same reason, decimals also exist in Visual Basic.)

C++

C++ does not have decimals. But—like C—they're working on it!

Dart

Dart does not have decimals. But a third-party library exists.

Go

Go does not have decimals, but a third-party library exists.

Java

Java has decimals.

JavaScript

JavaScript does not have decimals. We're working on it!

Kotlin

Kotlin does not have decimals. But, in a way, it does: since Kotlin is running on the JVM, one can get decimal functionality by using Java's built-in support.

PHP

PHP does not have decimals. An extension exists and at least one third-party library exists.

Python

Python has decimals.

Ruby

Ruby has decimals. Despite that, there is some third-party work to improve the built-in support.

Rust

Rust does not have decimals, but a crate exists.

SQL

SQL has decimals (it is the DECIMAL data type). (Here is the documentation for, e.g., PostgreSQL, and here is the documentation for MySQL.)

Swift

Swift has decimals.

TypeScript

TypeScript does not have decimals. However, if decimals get added to JavaScript (see above), TypeScript will probably inherit decimals, eventually.

October 31, 2023 09:09 AM

October 30, 2023

Iago Toral

XDC 2023

I was at XDC 2023 in A Coruña a few days ago, where I had the opportunity to talk about some of the work we have been doing on the Raspberry Pi driver stack together with my colleagues Juan Suárez and Maíra Canal. We talked about Raspberry Pi 5, CPU job handling in the Vulkan driver, OpenGL 3.1 support and how we are exposing GPU stats to user space. If you missed it, here is the link to YouTube.

Big thanks to Igalia for organizing it, and to all the sponsors, and especially to Samuel and Chema for all the work they put into making this happen.

by Iago Toral at October 30, 2023 11:15 AM

Jesse Alama

Native support for decimal numbers in the Python programming language

Native support for decimal numbers in the Python programming language

As part of the project of exploring how decimal numbers could be added to JavaScript, I'd like to take a step back and look at how other languages support decimals (or not). Many languages do support decimal numbers. It may be useful to understand the range of options out there for supporting them. For instance, what kind of data model do they use? What are the limits (if there are any)? Does the language include any special syntax for decimals?

Here, I'd like to briefly summarize what Python has done.

Does Python support decimals?

Python supports decimal arithmetic. The functionality is part of the standard library. Decimals aren't available out-of-the-box, in the sense that a Python program cannot start working with decimals without importing them. There is no decimal literal syntax in the language. That said, all one needs to do is from decimal import * and you're ready to rock.

Decimals have been part of the Python standard library for a long time: they were added in version 2.4, in November 2004. Python does have a process for proposing extensions to the language, called PEP (Python Enhancement Proposal). Extensive discussions on the official mailing lists took place. Python decimals were formalized in PEP 327.

The decimal library provides access to some of the internals of decimal arithmetic, called the context. In the context, one can specify, for instance, the number of decimal digits that should be available when operations are carried out. One can also forbid mixing of decimal values with primitive built-in types, such as integers and (binary) floating-point numbers.

In general, the Python implementation aims to be an implementation of the General Decimal Arithmetic Specification. In particular, using this data model, it is possible to distinguish the digit strings 1.2 and 1.20, considered as decimal values, as mathematically equal but nonetheless distinct values.

Aside: How does this compare with Decimal128, one of the contender data models for decimals in JavaScript? Since Python's decimal feature is an implementation of the General Decimal Arithmetic Specification, it works with a sort of generalized IEEE 754 decimal. No bit width is specified, so Python decimals are not literally the same as Decimal128. However, one can suitably parameterize Python's decimal to get something essentially equivalent to Decimal128:

  1. specify the minimum and maximum exponent to -6144 and 6143, respectively (the defaults are -999999 and 999999, respectively)
  2. specify the precision to 34 (default is 28)

API for Python decimals

Here are the supported mathematical functions:

  • basic arithmetic: addition, subtraction, multiplication, division
  • natural exponentiation and log (e^x, ln(x))
  • log base 10
  • a^b (two-argument exponentiation, though the exponent needs to be an integer)
  • step up/down (1.451 → 1.450, 1.452)
  • square root
  • fused multiply-and-add (a*b + c)

As mentioned above, the data model for Python decimals allows for subnormal decimals, but one can always normalize a value (remove the trailing zeros). (This isn't exactly a mathematical function, since distinct members of a cohort are mathematically equal.)

In Python, when importing decimals, some of the basic arithmetic operators get overloaded. Thus, +, *, and **, etc., produce correct decimal results when given decimal arguments. (There is some possibility for something roughly similar in JavaScript, but that discussion has been paused.)

Trigonometric functions are not provided. (These functions belong to the optional part of the IEEE 754 specification.)

October 30, 2023 03:14 AM

October 27, 2023

Loïc Le Page

The process of creating a new WPE backend

The basics to understand and build a WPE WebKit backend from scratch.

What is a WPE Backend? #

WPE is the official port of WebKit for embedded platforms. Depending on the platform hardware, it may need to use different techniques and technologies to ensure correct graphical rendering.

In order to be independent from any user-interface toolkit and/or windowing system, WPE WebKit delegates the rendering to a third-party API defined in libwpe. The implementation of this API is called a WPE Backend. You can find more explanations on the WPE Architecture page.

WPE WebKit is a multiprocess application: the end-user starts and controls the web widgets in the application process, while the web engine itself runs in different subprocesses, like the WPENetworkProcess, in charge of managing the underlying network connections, and the WPEWebProcess, in charge of the HTML and JavaScript parsing, execution and rendering.

The WPEWebProcess is, in particular, in charge of drawing the final composition of the web page through the ThreadedCompositor component.

The WPE Backend is at a crossroads between the WPEWebProcess and the user application process.

graph LR;
    subgraph WPEWebProcess
        A(WPE ThreadedCompositor)
    end
    subgraph Application Process
        C(User Application)
    end
    A -->|draw| B([WPE Backend]) --> C

The WPE Backend is a shared library that is loaded at runtime by the WPEWebProcess and by the user application process. It is used to render the visual aspect of a web page and transfer the resulting video buffer from the WPE WebKit engine process to the application process.

N.B. with the growing integration of DMA Buffers on all modern Linux platforms, the WPE WebKit architecture is evolving and, in the future, the need for a WPE Backend should disappear.

Future designs of the libwpe API should allow the application to directly receive the video buffers without needing to implement a different WPE Backend for each hardware platform. From the application developer's point of view, it will simplify the usage of WPE WebKit by hiding all multiprocess considerations.

The WPE Backend interfaces #

The WPE Backend shared library must export at least one symbol called _wpe_loader_interface of type wpe_loader_interface as defined here in the libwpe API.

This instance holds a simple loader function that must return concrete implementations of the libwpe interfaces listed below.

When the WPE Backend is loaded by the WPEWebProcess (or the application process), the process will look for the _wpe_loader_interface symbol and then call _wpe_loader_interface.load_object("...") with predefined names whenever it needs access to a specific interface:

  • “_wpe_renderer_host_interface” for the wpe_renderer_host_interface
  • “_wpe_renderer_backend_egl_interface” for the wpe_renderer_backend_egl_interface
  • “_wpe_renderer_backend_egl_target_interface” for the wpe_renderer_backend_egl_target_interface
  • “_wpe_renderer_backend_egl_offscreen_target_interface” for the wpe_renderer_backend_egl_offscreen_target_interface

The WPE Backend also needs to implement a fifth interface of type wpe_view_backend_interface that is used on the application process side when creating the specific backend instance.

All those interfaces follow the same structure:

  • a create(...) function that acts like an object constructor
  • a destroy(...) function that acts like an object destructor
  • some functions that act like the methods of an object (each function receives the previously created object instance as first parameter)
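
As a minimal sketch of this create/destroy/methods pattern (a hypothetical example_interface, not one of the actual libwpe definitions), a backend typically maps each interface onto a regular C++ class like this:

// Hypothetical interface following the pattern described above
// (NOT an actual libwpe definition).
struct example_interface
{
    void* (*create)(int width, int height);
    void (*destroy)(void* instance);
    void (*resize)(void* instance, int width, int height);
};

// The backend wraps the interface around a regular C++ class.
class ExampleTarget
{
public:
    ExampleTarget(int width, int height) : m_width(width), m_height(height) {}
    void resize(int width, int height) { m_width = width; m_height = height; }

private:
    int m_width;
    int m_height;
};

static example_interface s_exampleInterface = {
    // create(...) acts like a constructor and returns the opaque instance
    +[](int width, int height) -> void* { return new ExampleTarget(width, height); },
    // destroy(...) acts like a destructor
    +[](void* instance) { delete static_cast<ExampleTarget*>(instance); },
    // the other functions receive the instance as their first parameter
    +[](void* instance, int width, int height) {
        static_cast<ExampleTarget*>(instance)->resize(width, height);
    }};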

Taking the names used in the WPEBackend-direct project, on the application process side WPE will create:

  • a rendererHost instance using the wpe_renderer_host_interface::create(...) function
  • multiple rendererHostClient instances using the wpe_renderer_host_interface::create_client(...) function (those instances are mainly used for IPC communication, one instance is created for each WPEWebProcess launched by WPE WebKit)
  • multiple viewBackend instances created by the wpe_view_backend_interface::create(...) function (one instance is created for each rendering target on the WPEWebProcess side)

On the WPEWebProcess side (there can be more than one WPEWebProcess depending on the URIs loaded by the WebKitWebView instances on the application process side), the web process will create:

  • a rendererBackendEGL instance (one per WPEWebProcess) using the wpe_renderer_backend_egl_interface::create(...) function
  • multiple rendererBackendEGLTarget instances using the wpe_renderer_backend_egl_target_interface::create(...) function (one instance is created for each new rendering target requested by the application)

N.B.: the rendererBackendEGLTarget instances may be created by the wpe_renderer_backend_egl_target_interface or the wpe_renderer_backend_egl_offscreen_target_interface depending on the interfaces implemented in the WPE Backend.

Here we are only focusing on the wpe_renderer_backend_egl_target_interface, which relies on a classical EGL display (defined in the rendererBackendEGL instance). The wpe_renderer_backend_egl_offscreen_target_interface may be used in very specific use-cases that are out of the scope of this post. You can check its usage in the WPE WebKit source code for more information.

The former instances communicate with each other using classical IPC techniques (Unix sockets). The IPC layer must be implemented in the WPE Backend; the libwpe interfaces only share the endpoint file descriptors between the different processes when creating the different instances.

From a topological point of view, all those instances are organized as follows:

graph LR;
    subgraph S1 [WPEWebProcess]
        A1(rendererBackendEGL)
        subgraph S1TARGETS [ ]
            B1(rendererBackendEGLTarget)
            C1(rendererBackendEGLTarget)
            D1(rendererBackendEGLTarget)
        end
    end
    style S1 fill:#fb3,stroke:#333
    style S1TARGETS fill:#fb3,stroke:#333
    style A1 fill:#ff5
    style B1 fill:#9f5
    style C1 fill:#9f5
    style D1 fill:#9f5

    subgraph S2 [WPEWebProcess]
        A2(rendererBackendEGL)
        subgraph S2TARGETS [ ]
            B2(rendererBackendEGLTarget)
            C2(rendererBackendEGLTarget)
        end
    end
    style S2 fill:#fb3,stroke:#333
    style S2TARGETS fill:#fb3,stroke:#333
    style A2 fill:#ff5
    style B2 fill:#9f5
    style C2 fill:#9f5

    subgraph S3 [Application Process]
        E(rendererHost)
        subgraph S4 [Clients]
            F1(rendererHostClient)
            F2(rendererHostClient)
        end
        F1 ~~~ F2
        E -.- S4

        subgraph S5 [ ]
            G1(viewBackend)
            G2(viewBackend)
            G3(viewBackend)
            G4(viewBackend)
            G5(viewBackend)
        end
        S4 ~~~ S5
        G1 ~~~ G2
        G2 ~~~ G3
        G3 ~~~ G4
        G4 ~~~ G5
    end
    style S3 fill:#3cf,stroke:#333
    style S4 fill:#9ef,stroke:#333
    style S5 fill:#3cf,stroke:#333
    style E fill:#ff5
    style F1 fill:#ff5
    style F2 fill:#ff5
    style G1 fill:#9f5
    style G2 fill:#9f5
    style G3 fill:#9f5
    style G4 fill:#9f5
    style G5 fill:#9f5

    A1 <--IPC--> F1
    A2 <--IPC--> F2

    B1 <--IPC--> G1
    C1 <--IPC--> G2
    D1 <--IPC--> G3
    B2 <--IPC--> G4
    C2 <--IPC--> G5

From a usage point of view:

  • the rendererHost and rendererHostClient instances are only used to manage IPC endpoints on the application process side that are connected to each running WPEWebProcess. They are not used by the graphical rendering system.
  • the rendererBackendEGL instance (one per WPEWebProcess) is only used to connect to the native display for a specific platform. For example, on a desktop Linux, the platform may be X11 and the native display can be obtained by calling XOpenDisplay(...); or the platform may be Wayland and the native display can be obtained by calling wl_display_connect(...); etc…
  • the rendererBackendEGLTarget (on the WPEWebProcess side) and viewBackend (on the application process side) instances are the ones truly managing the web page graphical rendering.

The graphical rendering mechanism #

As seen above, the interfaces in charge of the rendering are wpe_renderer_backend_egl_target_interface and wpe_view_backend_interface. During instance creation, WPE WebKit exchanges the file descriptors used to establish a direct IPC connection between a rendererBackendEGLTarget on the WPEWebProcess side and a viewBackend on the application process side.

During the EGL initialization phase, run when a new WPEWebProcess is launched, WPE WebKit will use the native display and platform provided by wpe_renderer_backend_egl_interface::get_native_display(...) and get_platform(...) functions to create a suitable GLES context.

When the WPE WebKit ThreadedCompositor is ready to render a new frame (on the WPEWebProcess side), it will call the wpe_renderer_backend_egl_target_interface::frame_will_render(...) function to notify the WPE Backend that rendering is about to start. At this point, the previously created GLES context is current on the calling thread, and all GLES drawing commands will be issued right after this call.

Once the ThreadedCompositor has finished drawing, it will swap the EGL buffers and call the wpe_renderer_backend_egl_target_interface::frame_rendered(...) function to notify the WPE Backend that the frame is ready. Then the ThreadedCompositor will wait until the wpe_renderer_backend_egl_target_dispatch_frame_complete(...) function is called by the WPE Backend.

What happens during the frame_will_render(...) and frame_rendered(...) calls is up to the WPE Backend. It can, for example, select a framebuffer object to draw into a texture and then pass the texture content to the application process, or use EGL extensions like EGLStream or DMA Buffers to transfer the frame to the application process without doing a copy through the main RAM.
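
As a minimal sketch of the first option (illustrative only, not the actual backend code; the framebuffer is assumed to have been created beforehand with a texture as its color attachment, and the IPC send is left out), hypothetical helpers called from the frame_will_render(...) and frame_rendered(...) implementations could look like this:

#include <GLES2/gl2.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Called from frame_will_render(...): redirect the compositor GLES drawing
// commands to our own framebuffer object instead of the default one.
void bindOffscreenFramebuffer(GLuint framebuffer)
{
    glBindFramebuffer(GL_FRAMEBUFFER, framebuffer);
}

// Called from frame_rendered(...): naive variant that copies the rendered
// pixels back through the CPU so they can be sent to the application process
// over the backend IPC channel.
std::vector<uint8_t> readBackFrame(GLsizei width, GLsizei height)
{
    std::vector<uint8_t> pixels(static_cast<std::size_t>(width) * height * 4);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    return pixels;
}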

In all cases, in the frame_rendered(...) function the WPE Backend generally sends the new frame to the corresponding viewBackend instance on the application side. The application process then uses or presents this frame and sends back an IPC message to the rendererBackendEGLTarget instance to notify the WPEWebProcess side that the frame is not used anymore and can be recycled. When the rendererBackendEGLTarget instance receives this IPC message, this is generally the moment it calls the wpe_renderer_backend_egl_target_dispatch_frame_complete(...) function to trigger the production of a new frame.

With this mechanism, the application has full control over the synchronization of the rendering on the WPEWebProcess side.

sequenceDiagram
    participant A as ThreadedCompositor
    participant B as WPE Backend
    participant C as Application
    loop Rendering
      A->>B: frame_will_render(...)
      activate A
      A->>A: GLES rendering
      A->>B: frame_rendered(...)
      deactivate A
      B->>C: frame transfer
      activate C
      C->>C: frame presentation
      C->>A: dispatch_frame_complete(...)
      deactivate C
    end

A simple example: WPEBackend-direct #

The WPEBackend-direct project aims to implement a very simple WPE Backend that does not transfer any frame to the application process.

The frame presentation is directly done by the WPEWebProcess using a native X11 or Wayland window. The objective is to provide an easy means of showing and debugging the output of the ThreadedCompositor without the inherent complexity of transferring frames between different processes.

WPEBackend-direct implements the minimum set of libwpe interfaces needed by a functional backend.

The interface loader is implemented in wpebackend-direct.cpp. The wpe_renderer_backend_egl_offscreen_target_interface is disabled, as the backend only provides the default wpe_renderer_backend_egl_target_interface for the target rendererBackendEGLTarget instances.

extern "C"
{
    __attribute__((visibility("default"))) wpe_loader_interface _wpe_loader_interface = {
        +[](const char* name) -> void* {
            if (std::strcmp(name, "_wpe_renderer_host_interface") == 0)
                return RendererHost::getWPEInterface();

            if (std::strcmp(name, "_wpe_renderer_backend_egl_interface") == 0)
                return RendererBackendEGL::getWPEInterface();

            if (std::strcmp(name, "_wpe_renderer_backend_egl_target_interface") == 0)
                return RendererBackendEGLTarget::getWPEInterface();

            if (std::strcmp(name, "_wpe_renderer_backend_egl_offscreen_target_interface") == 0)
            {
                static wpe_renderer_backend_egl_offscreen_target_interface s_interface = {
                    +[]() -> void* { return nullptr; },
                    +[](void*) {},
                    +[](void*, void*) {},
                    +[](void*) -> EGLNativeWindowType { return nullptr; },
                    nullptr,
                    nullptr,
                    nullptr,
                    nullptr};
                return &s_interface;
            }

            return nullptr;
        },
        nullptr, nullptr, nullptr, nullptr};
}

On the application process side, the RendererHost and RendererHostClient classes do nothing in particular; they are just straightforward implementations of the libwpe interfaces used to maintain the IPC channel connection. Because rendering and frame presentation are both handled on the WPEWebProcess side, the ViewBackend class only manages the render loop synchronization. It receives a message from the IPC channel each time a frame has been rendered and then calls a callback from the sample application.

When the sample application is ready, it calls a specific function from the WPEBackend-direct API that will trigger a call to ViewBackend::frameComplete() that, in turn, will send an IPC message to the RendererBackendEGLTarget instance on the WPEWebProcess side to call the wpe_renderer_backend_egl_target_dispatch_frame_complete(...) function.

WebKitWebViewBackend* createWebViewBackend()
{
    auto* directBackend = wpe_direct_view_backend_create(
        +[](wpe_direct_view_backend* backend, void* /*userData*/) {

            // This callback function is called by the WPE Backend each time a
            // frame has been rendered and presented on the WPEWebProcess side.

            // Calling the next function will trigger an IPC message sent to the
            // corresponding RendererBackendEGLTarget instance on the WPEWebProcess
            // side that will trigger the rendering of the next frame.
            wpe_direct_view_backend_dispatch_frame_complete(backend);
        },
        nullptr, 800, 600);

    return webkit_web_view_backend_new(wpe_direct_view_backend_get_wpe_backend(directBackend), nullptr, nullptr);
}

On the WPEWebProcess side, the RendererBackendEGL class is a wrapper around an X11 or Wayland native display; it also maintains an IPC connection with a corresponding RendererHostClient instance on the application process side.

The RendererBackendEGLTarget class is the one in charge of the presentation. It maintains an IPC connection with a corresponding ViewBackend instance on the application process side. During its initialization the RendererBackendEGLTarget instance uses the native display held by the RendererBackendEGL instance to create a native X11 or Wayland window. This native window is communicated to the WPE WebKit engine through the wpe_renderer_backend_egl_target_interface implementation.

The WPE WebKit engine will use the provided native display and window to create a compatible EGL display and surface (see the source code here).

When the ThreadedCompositor calls frame_will_render(...), the compatible GLES context is already current on the EGL surface associated with the native window previously created by the RendererBackendEGLTarget instance, so there is nothing more to do. When the ThreadedCompositor calls m_context->swapBuffers(), the rendered frame is automatically presented in the native window.
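
To make that step more concrete, here is a rough sketch of the standard EGL sequence involved (simplified, with no error handling; this is not the actual WebKit code, and the helper name is made up for illustration):

#include <EGL/egl.h>

EGLSurface createTargetSurface(EGLNativeDisplayType nativeDisplay,
                               EGLNativeWindowType nativeWindow,
                               EGLDisplay* outDisplay, EGLContext* outContext)
{
    // Wrap the native X11/Wayland display held by the RendererBackendEGL instance.
    EGLDisplay display = eglGetDisplay(nativeDisplay);
    eglInitialize(display, nullptr, nullptr);

    const EGLint configAttributes[] = {
        EGL_SURFACE_TYPE, EGL_WINDOW_BIT,
        EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT,
        EGL_NONE};
    EGLConfig config;
    EGLint configCount = 0;
    eglChooseConfig(display, configAttributes, &config, 1, &configCount);

    // Wrap the native window created by the RendererBackendEGLTarget instance.
    EGLSurface surface = eglCreateWindowSurface(display, config, nativeWindow, nullptr);

    const EGLint contextAttributes[] = {EGL_CONTEXT_CLIENT_VERSION, 2, EGL_NONE};
    EGLContext context = eglCreateContext(display, config, EGL_NO_CONTEXT, contextAttributes);

    // Once this context/surface pair is current on the compositing thread,
    // GLES rendering between frame_will_render(...) and frame_rendered(...)
    // targets the native window directly, and eglSwapBuffers() presents it.
    eglMakeCurrent(display, surface, surface, context);

    *outDisplay = display;
    *outContext = context;
    return surface;
}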

Then, when the ThreadedCompositor calls frame_rendered(...), the only thing left for the RendererBackendEGLTarget instance to do is to send an IPC message to the corresponding ViewBackend instance, on the application process side, to advertise that a frame has been rendered and that the ThreadedCompositor is waiting before rendering the next frame.

When dealing with WPE Backends, the WPEBackend-direct example is an interesting starting point as it mainly focusses on the backend implementation. The extra code that is not directly needed by the libwpe interfaces is minimal and clearly separated from the backend implementation code.

The next step would be to transfer the frame content from the WPEWebProcess to the application process and do the presentation on the application process side. One obvious way of doing it would be to render to a texture on the WPEWebProcess side, transfer the texture content through a shared memory map, and then create a new texture on the application process side from the shared graphical data. That would work, but it would not be very efficient: for each frame, data would need to be copied from the GPU to the CPU in the WPEWebProcess, and then back from the CPU to the GPU in the application process.
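
To make the inefficiency concrete, here is a hedged sketch of what such a copy-based path would require for each frame (hypothetical helpers; a real implementation would also need a framebuffer object wrapping the rendered texture and some synchronization around the shared buffer):

#include <GLES2/gl2.h>
#include <cstdint>

// WPEWebProcess side: read the rendered pixels back into the shared memory
// map. This is the GPU -> CPU copy, and it typically stalls until the GPU
// has finished rendering the frame.
void copyFrameToSharedMemory(uint32_t width, uint32_t height, uint8_t* sharedBuffer)
{
    // Assumes the FBO wrapping the rendered texture is currently bound.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, sharedBuffer);
}

// Application process side: upload the very same pixels into a new texture.
// This is the CPU -> GPU copy, done again for every frame.
GLuint createTextureFromSharedMemory(uint32_t width, uint32_t height, const uint8_t* sharedBuffer)
{
    GLuint texture = 0;
    glGenTextures(1, &texture);
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, sharedBuffer);
    return texture;
}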

What would be really interesting is to keep the frame data on the GPU side and just transfer a handle to this memory between the WPEWebProcess and the application process. This will be the topic of a future post, so stay tuned!

October 27, 2023 12:00 AM

October 25, 2023

Emmanuele Bassi

Introspection’s edge

tl;dr The introspection data for GLib (and its sub-libraries) is now generated by GLib itself instead of gobject-introspection; and in the near future, libgirepository is going to be part of GLib as well. Yes, there’s a circular dependency, but there are plans for dealing with that. Developers of language bindings should reach out to the GLib maintainers for future planning.


GLib-the-project is made of the following components:

  • glib, the base data types and C portability library
  • gmodule, a portable loadable module API
  • gobject, the object oriented run time type system
  • gio, a grab bag of I/O, IPC, file, network, settings, etc

Just like GLib, the gobject-introspection project is actually three components in a trench coat:

  • g-ir-scanner, the Python and flex based parser that generates the XML description of a GObject-based ABI
  • g-ir-compiler, the “compiler” that turns XML into an mmap()-able binary format
  • libgirepository, the C API for loading the binary format, resolving types and symbols, and calling into the underlying C library

There are, of course, additional bits inside gobject-introspection, like the regression and marshalling test suites, as well as the documentation generators that very few people use.
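
As a rough illustration of what consuming that binary format through libgirepository looks like today (a simplified sketch using the current g_irepository_* C API; the namespace and lookup used here are just examples):

#include <girepository.h>

// Load the compiled GLib typelib and look up an entry by name: roughly what
// a language binding does at import time before exposing a namespace.
static void describe_glib_entry(const char* name)
{
    GIRepository* repo = g_irepository_get_default();

    GError* error = NULL;
    g_irepository_require(repo, "GLib", "2.0", (GIRepositoryLoadFlags) 0, &error);
    if (error != NULL) {
        g_printerr("Failed to load the GLib typelib: %s\n", error->message);
        g_clear_error(&error);
        return;
    }

    GIBaseInfo* info = g_irepository_find_by_name(repo, "GLib", name);
    if (info != NULL) {
        g_print("Found %s (%s)\n", g_base_info_get_name(info),
                g_info_type_to_string(g_base_info_get_type(info)));
        g_base_info_unref(info);
    }
}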

Since gobject-introspection depends on GLib for its implementation, the introspection data for GLib has been, until now, generated during the gobject-introspection build. This has proven to be extremely messy:

  • as the introspection data is generated from the source code, gobject-introspection needs to have access to the GLib headers and source files
  • this means building against an installed copy of GLib for the declarations, and the GLib sources for the docblocks
  • to avoid having access to the sources at all times, there is a small script that extracts all the docblocks from a GLib checkout; this script has to be run manually, and its results committed to the gobject-introspection repository

This means that the gobject-introspection maintainers have to manually keep the GLib data in sync; that any changes done to GLib’s annotations and documentation are typically only picked up at the end of the development cycle, which also means that any issue with the annotations is almost always discovered too late; and that any and all updates to the GLib introspection data require a gobject-introspection release to end up inside downstream distributions.

The gobject-introspection build can use GLib as a subproject, at the cost of some awful, awful build system messiness, but that does not help anybody working on GLib; and, of course, since gobject-introspection depends on GLib, we cannot build the former as a subproject of the latter.

As part of the push towards moving GLib’s API references towards the use of gi-docgen, GLib now checks for the availability of g-ir-scanner and g-ir-compiler on the installed system, and if they are present, they are used to generate the GLib, GObject, GModule, and GIO introspection data.

A good side benefit of having the actual project providing the ABI be in charge of generating its machine-readable description is that we can now decouple the release cycles of GLib and gobject-introspection; updates to the introspection data of GLib will be released alongside GLib, while the version of gobject-introspection can be updated whenever there’s new API inside libgirepository, instead of requiring a bump at every cycle to match GLib.

On the other hand, though, this change introduces a circular dependency between GLib and gobject-introspection; this is less than ideal, and though circular dependencies are a fact of life for whoever builds and packages complex software, it’d be better to avoid them; this means two options:

  • merging gobject-introspection into GLib
  • making gobject-introspection not depend on GLib

This was the original plan for gobject-introspection, back in 2007-2009; after iterating over a few cycles on the features required by language bindings, the introspection format and API would be part of GLib, alongside the rest of the type system. Best laid plans, and all that…

Sadly, the first option isn’t really workable, as it would mean making GLib depend on flex, bison, and CPython in order to build g-ir-scanner, or (worse) rewriting g-ir-scanner in C. There’s also the option of rewriting the C parser in pure Python, but that, too, isn’t exactly strife-free.

The second option is more convoluted, but it has the advantage of being realistically implemented:

  1. move libgirepository into GLib as a new sub-library
  2. move g-ir-compiler into GLib’s tools
  3. copy-paste the basic GLib data types used by the introspection scanner into gobject-introspection

Aside from breaking the circular dependency, this allows language bindings that depend on libgirepository’s API (Python, JavaScript, Perl, etc) to limit their dependency to GLib, since they don’t need to deal with the introspection parser bits; it also affords us the ability to clean up the libgirepository API, provide better documentation, and formalise the file formats.

Speaking of file formats, and to manage expectations: I don’t plan to change the binary data layout; yes, we’re close to saturation when it comes to padding bits and fields, but there hasn’t been a real case in which this has been problematic. The API, on the other hand, really needs to be cleaned up, because it got badly namespaced when we originally thought it could be moved into GLib and then the effort was abandoned. This means that libgirepository will be bumped to 2.0 (matching the rest of GLib’s sub-libraries), and will require some work in the language bindings that consume it. For that, bindings developers should get in touch with GLib maintainers, so we can plan the migration appropriately.

by ebassi at October 25, 2023 08:43 AM

October 24, 2023

Eric Meyer

Mistakes Were Made

Late last week, I posted a tiny hack related to :has() and Firefox.  This was, in some ways, a mistake.  Let me explain how.

Primarily, I should have filed a bug about it.  Someone else did so, and it’s already been fixed.  This is all great in the wider view, but I shouldn’t be offloading the work of reporting browser bugs when I know perfectly well how to do that.  I got too caught up in the fun of documenting a tiny hack (my favorite kind!) to remember that, which is no excuse.

Not far behind that, I should have remembered that Firefox only supports :has() at the moment if you’ve enabled the layout.css.has-selector.enabled flag in about:config, although this may be the default now in Nightly builds, given that my copy of Firefox Nightly (121.0a1) shows the flag as true without the Boldfacing of Change.  At any rate, I should have been clear about the support status.

Thus, I offer my apologies to the person who did the reporting work I should have done, who also has my gratitude, and to anyone who I misled about the state of support in Firefox by not being clear about it.  Neither was my intent, but impact outweighs intent.  I’ll add a note to the top of the previous article that points here, and resolve to do better.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at October 24, 2023 08:50 PM

October 20, 2023

Stéphane Cerveau

Introducing GstPipelineStudio 0.3.4

GstPipelineStudio #

As it's not always convenient to use the powerful command-line based gst-launch tool, nor to manage all the debug possibilities on all the platforms supported by GStreamer, I started this personal project in 2021 to facilitate the adoption of the GStreamer framework and to help newbies as well as confirmed engineers enjoy the power of it.

Indeed, a few other projects, such as Pipeviz (greatly inspired from...) or gst-debugger, already tried to offer this GUI capability. My idea with GPS was to provide a cross-platform tool written in Rust with the powerful gtk-rs framework.

The aim of this project is to provide a GUI for GStreamer, but also to be able to remotely debug existing pipelines, while offering a very simple and accessible interface, much like back in the days when I discovered DirectShow with the help of graphedit or GraphStudioNext.

Project details #

GstPipelineStudio interface

The interface includes 5 important zones:

  • (1) The registry area gives you access to the GStreamer registry, including all the plugins/elements available on your system. It provides you with details on each element. You can also access a favorites list.
  • (2) The main drawing area is where you can add elements from the registry area and connect them together. This area allows you to have multiple independent pipelines, each drawing area with its own player.
  • (3) The playback control area: each pipeline can be controlled from this area, including basic play/stop/pause but also a seekbar.
  • (4) The debug zone, where you'll receive the messages from the application.
  • (5) The render zone, where you can have a video preview if a video sink has been added to the pipeline. Future work includes having tracers or audio analysis in this zone.

The project has been written in Rust to offer more stability and, thanks to the wonderful gtk-rs work around the GTK framework, it was a perfect fit for this project, as it gives an easy way to target the 3 supported platforms: GNU/Linux, MacOS and Windows. On this last platform, which is quite well established in the desktop eco-system, the use of GStreamer can lead to difficulties; that's why the GstPipelineStudio Windows MSI will be a perfect match to test the power of the GStreamer framework.

This project has been written under the GPL v3 License.

How it works under the hood #

The trick is quite simple: it uses the power of the gst-parse-launch API by transforming the visual pipeline into a command line and building the pipeline from it.

So it's clearly a sibling of gst-launch.
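
For illustration, the mechanism maps to GStreamer's C API roughly like this (GPS itself is written in Rust on top of gstreamer-rs, so this is only a sketch of the idea, not its actual code):

#include <gst/gst.h>

int main(int argc, char** argv)
{
    gst_init(&argc, &argv);

    // The textual description below is what a drawn pipeline gets flattened
    // into before being handed to the parser.
    GError* error = nullptr;
    GstElement* pipeline = gst_parse_launch(
        "videotestsrc ! videoconvert ! autovideosink", &error);
    if (!pipeline) {
        g_printerr("Failed to build the pipeline: %s\n", error->message);
        g_clear_error(&error);
        return 1;
    }

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    // ... run a main loop, watch the bus for messages, drive the seekbar ...
    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(pipeline);
    return 0;
}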

Right now it's directly linked to the GStreamer installed on your system, but future work could be to connect it to daemons such as GstPrinceOfParser or gstd.

What's new in 0.3.4 #

The main feature of this release is the cross-platform ready state. These are beta versions, but the CI is now ready to build and deploy the Flathub (Linux), MacOS and Windows versions of GstPipelineStudio.

You can download the installers from the project page or with:

  • Linux: flatpak install org.freedesktop.dabrain34.GstPipelineStudio
  • MacOS: DMG file
  • Windows: MSI file

Here is a list of main features added to the app:

  • Open a pipeline from the command line and it will be drawn automatically on the screen. This feature allows you to take any command-line pipeline, draw it on the screen, and then play new tricks with it, such as changing elements, properties etc.

  • Multiple graph views allow you to draw multiple independent pipelines in the same instance of GstPipelineStudio. The playback state is totally independent for each of the views.

  • Capsfilter support has been added to the links, allowing you to use this crucial feature of GStreamer pipelines.

  • gstreamer-1.0 wrap support has been added to the build system, so you can build your own version of GPS using a dedicated GStreamer version.

What's next in the pipeline #

Among multiple use cases and key debug features, the most imminent ones are:

  • Support for zoom on the graph view. As a pipeline can be quite big, zoom is a key/must-have feature for GPS. See MR
  • Debug sections such as receiving events/tags/messages or tracers, and terminal support, see MR
  • Element compatibility checks, to verify whether an element can connect to the previous/next one.
  • Remote debugging: a wsserver tracer is currently under development, allowing pipeline events such as connections, properties or element additions to be sent over websocket. A MR is under development to connect this tracer and render the corresponding pipeline.
  • Auto plugging according to the rank of each compatible element for a given pad caps.
  • Display the audio signal in a dedicated render tab.
  • Translations
  • Documentation
  • Unit tests

Here is a lightning talk I gave about this release (0.3.3) during the 2023 GStreamer conference.

Hope you'll enjoy this tool, and please feel free to propose new features with an RFC here or contribute merge requests here.

As usual, if you would like to learn more about GstPipelineStudio, GStreamer or any other open multimedia framework, please contact us!

October 20, 2023 12:00 AM

October 19, 2023

Eric Meyer

Prodding Firefox to Update :has() Selection

I’ve posted a followup to this post which you should read before you read this post, because you might decide there’s no need to read this one.  If not, please note that what’s documented below was a hack to overcome a bug that was quickly fixed, in a part of CSS that wasn’t enabled in stable Firefox at the time I wrote the post.  Thus, what follows isn’t really useful, and leaves more than one wrong impression.  I apologize for this.  For a more detailed breakdown of my errors, please see the followup post.


I’ve been doing some development recently on a tool that lets me quickly produce social-media banners for my work at Igalia.  It started out using a vanilla JS script to snarfle up collections of HTML elements like all the range inputs, stick listeners and stuff on them, and then alter CSS variables when the inputs change.  Then I had a conceptual breakthrough and refactored the entire thing to use fully light-DOM web components (FLDWCs), which let me rapidly and radically increase the tool’s capabilities, and I kind of love the FLDWCs even as I struggle to figure out the best practices.

With luck, I’ll write about all that soon, but for today, I wanted to share a little hack I developed to make Firefox a tiny bit more capable.

One of the things I do in the tool’s CSS is check to see if an element (represented here by a <div> for simplicity’s sake) has an image whose src attribute is a base64 string instead of a URI, and when it is, add some generated content. (It makes sense in context.  Or at least it makes sense to me.) The CSS rule looks very much like this:

div:has(img[src*=";base64,"])::before {
	[…generated content styles go here…]
}

This works fine in WebKit and Chromium.  Firefox, at least as of the day I’m writing this, often fails to notice the change, which means the selector doesn’t match, even in the Nightly builds, and so the generated content isn’t generated.  It has problems correlating DOM updates and :has(), is what it comes down to.

There is a way to prod it into awareness, though!  What I found during my development was that if I clicked or tabbed into a contenteditable element, the :has() would suddenly match and the generated content would appear.  The editable element didn’t even have to be a child of the div bearing the :has(), which seemed weird to me for no distinct reason, but it made me think that maybe any content editing would work.

I tried adding contenteditable to a nearby element and then immediately removing it via JS, and that didn’t work.  But then I added a tiny delay to removing the contenteditable, and that worked!  I feel like I might have seen a similar tactic proposed by someone on social media or a blog or something, but if so, I can’t find it now, so my apologies if I ganked your idea without attribution.

My one concern was that if I wasn’t careful, I might accidentally pick an element that was supposed to be editable, and then remove the editing state it’s supposed to have.  Instead of doing detection of the attribute during selection, I asked myself, “Self, what’s an element that is assured to be present but almost certainly not ever set to be editable?”

Well, there will always be a root element.  Usually that will be <html> but you never know, maybe it will be something else, what with web components and all that.  Or you could be styling your RSS feed, which is in fact a thing one can do.  At any rate, where I landed was to add the following right after the part of my script where I set an image’s src to use a base64 URI:

let ffHack = document.querySelector(':root');
ffHack.setAttribute('contenteditable','true');
setTimeout(function(){
	ffHack.removeAttribute('contenteditable');
},7);

Literally all this does is grab the page’s root element, set it to be contenteditable, and then seven milliseconds later, remove the contenteditable.  That’s about a millisecond less than the lifetime of a rendering frame at 120fps, so ideally, the browser won’t draw a frame where the root element is actually editable… or, if there is such a frame, it will be replaced by the next frame so quickly that the odds of accidentally editing the root are very, very, very small.

At the moment, I’m not doing any browser sniffing to figure out if the hack needs to be applied, so every browser gets to do this shuffle on Firefox’s behalf.  Lazy, I suppose, but I’m going to wave my hands and intone “browsers are very fast now” while studiously ignoring all the inner voices complaining about inefficiency and inelegance.  I feel like using this hack means it’s too late for all those concerns anyway.

I don’t know how many people out there will need to prod Firefox like this, but for however many there are, I hope this helps.  And if you have an even better approach, please let us know in the comments!


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at October 19, 2023 04:29 PM

Andy Wingo

requiem for a stringref

Good day, comrades. Today's missive is about strings!

a problem for java

Imagine you want to compile a program to WebAssembly, with the new GC support for WebAssembly. Your WebAssembly program will run on web browsers and render its contents using the DOM API: Document.createElement, Document.createTextNode, and so on. It will also use DOM interfaces to read parts of the page and read input from the user.

How do you go about representing your program in WebAssembly? The GC support gives you the ability to define a number of different kinds of aggregate data types: structs (records), arrays, and functions-as-values. Earlier versions of WebAssembly gave you 32- and 64-bit integers, floating-point numbers, and opaque references to host values (externref). This is what you have in your toolbox. But what about strings?

WebAssembly's historical answer has been to throw its hands in the air and punt the problem to its user. This isn't so bad: the direct user of WebAssembly is a compiler developer and can fend for themself. Using the primitives above, it's clear we should represent strings as some kind of array.

The source language may impose specific requirements regarding string representations: for example, in Java, you will want to use an (array i16), because Java's strings are specified as sequences of UTF-16¹ code units, and Java programs are written assuming that random access to a code unit is constant-time.

Let's roll with the Java example for a while. It so happens that JavaScript, the main language of the web, also specifies strings in terms of 16-bit code units. The DOM interfaces are optimized for JavaScript strings, so at some point, our WebAssembly program is going to need to convert its (array i16) buffer to a JavaScript string. You can imagine that a high-throughput interface between WebAssembly and the DOM is going to involve a significant amount of copying; could there be a way to avoid this?

Similarly, Java is going to need to perform a number of gnarly operations on its strings, for example, locale-specific collation. This is a hard problem whose solution basically amounts to shipping a copy of libICU in their WebAssembly module; that's a lot of binary size, and it's not even clear how to compile libICU in such a way that works on GC-managed arrays rather than linear memory.

Thinking about it more, there's also the problem of regular expressions. A high-performance regular expression engine is a lot of investment, and not really portable from the native world to WebAssembly, as the main techniques require just-in-time code generation, which is unavailable on Wasm.

This is starting to sound like a terrible system: big binaries, lots of copying, suboptimal algorithms, and a likely ongoing functionality gap. What to do?

a solution for java

One observation is that in the specific case of Java, we could just use JavaScript strings in a web browser, instead of implementing our own string library. We may need to make some shims here and there, but the basic functionality from JavaScript gets us what we need: constant-time UTF-16¹ code unit access from within WebAssembly, and efficient access to browser regular expression, internationalization, and DOM capabilities that doesn't require copying.

A sort of minimum viable product for improving the performance of Java compiled to Wasm/GC would be to represent strings as externref, which is WebAssembly's way of making an opaque reference to a host value. You would operate on those values by importing the equivalent of String.prototype.charCodeAt and friends; to get the receivers right you'd need to run them through Function.call.bind. It's a somewhat convoluted system, but a WebAssembly engine could be taught to recognize such a function and compile it specially, using the same code that JavaScript compiles to.

(Does this sound too complicated or too distasteful to implement? Disabuse yourself of the notion: it's happening already. V8 does this and other JS/Wasm engines will be forced to follow, as users file bug reports that such-and-such an app is slow on e.g. Firefox but fast on Chrome, and so on and so on. It's the same dynamic that led to asm.js adoption.)

Getting properly good performance will require a bit more, though. String literals, for example, would have to be loaded from e.g. UTF-8 in a WebAssembly data section, then transcoded to a JavaScript string. You need a function that can convert UTF-8 to JS string in the first place; let's call it fromUtf8Array. An engine can now optimize the array.new_data + fromUtf8Array sequence to avoid the intermediate array creation. It would also be nice to tighten up the typing on the WebAssembly side: having everything be externref imposes a dynamic type-check on each operation, which is something that can't always be elided.

beyond the web?

"JavaScript strings for Java" has two main limitations: JavaScript and Java. On the first side, this MVP doesn't give you anything if your WebAssembly host doesn't do JavaScript. Although it's a bit of a failure for a universal virtual machine, to an extent, the WebAssembly ecosystem is OK with this distinction: there are different compiler and toolchain options when targetting the web versus, say, Fastly's edge compute platform.

But does that mean you can't run Java on Fastly's cloud? Does the Java compiler have to actually implement all of those things that we were trying to avoid? Will Java actually implement those things? I think the answers to all of those questions is "no", but also that I expect a pretty crappy outcome.

First of all, it's not technically required that Java implement its own strings in terms of (array i16). A Java-to-Wasm/GC compiler can keep the strings-as-opaque-host-values paradigm, and instead have these string routines provided by an auxiliary WebAssembly module that itself probably uses (array i16), effectively polyfilling what the browser would give you. The effort of creating this module can be shared between e.g. Java and C#, and the run-time costs for instantiating the module can be amortized over a number of Java users within a process.

However, I don't expect such a module to be of good quality. It doesn't seem possible to implement a good regular expression engine that way, for example. And, absent a very good run-time system with an adaptive compiler, I don't expect the low-level per-codepoint operations to be as efficient with a polyfill as they are on the browser.

Instead, I could see non-web WebAssembly hosts being pressured into implementing their own built-in UTF-16¹ module which has accelerated compilation, a native regular expression engine, and so on. It's nice to have a portable fallback but in the long run, first-class UTF-16¹ will be everywhere.

beyond java?

The other drawback is Java, by which I mean, Java (and JavaScript) is outdated: if you were designing them today, their strings would not be UTF-16¹.

I keep this little "¹" sigil when I mention UTF-16 because Java (and JavaScript) don't actually use UTF-16 to represent their strings. UTF-16 is a standard Unicode encoding form. A Unicode encoding form encodes a sequence of Unicode scalar values (USVs), using one or two 16-bit code units to encode each USV. A USV is a codepoint: an integer in the range [0,0x10FFFF], but excluding surrogate codepoints: codepoints in the range [0xD800,0xDFFF].

Surrogate codepoints are an accident of history, and occur either when accidentally slicing a two-code-unit UTF-16-encoded-USV in the middle, or when treating an arbitrary i16 array as if it were valid UTF-16. They are annoying to detect, but in practice are here to stay: no amount of wishing will make them go away from Java, JavaScript, C#, or other similar languages from those heady days of the mid-90s. Believe me, I have engaged in some serious wishing, but if you, the virtual machine implementor, want to support Java as a source language, your strings have to be accessible as 16-bit code units, which opens the door (eventually) to surrogate codepoints.

So when I say UTF-16¹, I really mean WTF-16: sequences of any 16-bit code units, without the UTF-16 requirement that surrogate code units be properly paired. In this way, WTF-16 encodes a larger language than UTF-16: not just USV codepoints, but also surrogate codepoints.

The existence of WTF-16 is a consequence of a kind of original sin, originating in the choice to expose 16-bit code unit access to the Java programmer, and which everyone agrees should be somehow firewalled off from the rest of the world. The usual way to do this is to prohibit WTF-16 from being transferred over the network or stored to disk: a message sent via an HTTP POST, for example, will never include a surrogate codepoint, and will either replace it with the U+FFFD replacement codepoint or throw an error.

But within a Java program, and indeed within a JavaScript program, there is no attempt to maintain the UTF-16 requirements regarding surrogates, because any change from the current behavior would break programs. (How many? Probably very, very few. But productively deprecating web behavior is hard to do.)

If it were just Java and JavaScript, that would be one thing, but WTF-16 poses challenges for using JS strings from non-Java languages. Consider that any JavaScript string can be invalid UTF-16: if your language defines strings as sequences of USVs, which excludes surrogates, what do you do when you get a fresh string from JS? Passing your string to JS is fine, because WTF-16 encodes a superset of USVs, but when you receive a string, you need to have a plan.

You only have a few options. You can eagerly check that a string is valid UTF-16; this might be a potentially expensive O(n) check, but perhaps this is acceptable. (This check may be faster in the future.) Or, you can replace surrogate codepoints with U+FFFD, when accessing string contents; lossy, but preserves your language's semantic domain. Or, you can extend your language's semantics to somehow deal with surrogate codepoints.
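
For a concrete idea of what that eager check involves, here is a minimal sketch over raw 16-bit code units (a hypothetical helper, not part of any proposal):

#include <cstdint>
#include <cstddef>

// Valid UTF-16 means every high surrogate (0xD800-0xDBFF) is immediately
// followed by a low surrogate (0xDC00-0xDFFF), and no low surrogate appears
// on its own; anything else is WTF-16-only content.
bool is_valid_utf16(const uint16_t* units, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
            if (i + 1 == len) return false;          // lone high surrogate at the end
            uint16_t next = units[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            i++;                                     // skip the paired low surrogate
        } else if (u >= 0xDC00 && u <= 0xDFFF) {     // unpaired low surrogate
            return false;
        }
    }
    return true;
}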

My point is that if you want to use JS strings in a non-Java-like language, your language will need to define what to do with invalid UTF-16. Ideally the browser will give you a way to put your policy into practice: replace with U+FFFD, error, or pass through.

beyond java? (reprise) (feat: snakes)

With that detail out of the way, say you are compiling Python to Wasm/GC. Python's language reference says: "A string is a sequence of values that represent Unicode code points. All the code points in the range U+0000 - U+10FFFF can be represented in a string." This corresponds to the domain of JavaScript's strings; great!

On second thought, how do you actually access the contents of the string? Surely not via the equivalent of JavaScript's String.prototype.charCodeAt; Python strings are sequences of codepoints, not 16-bit code units.

Here we arrive to the second, thornier problem, which is less about domain and more about idiom: in Python, we expect to be able to access strings by codepoint index. This is the case not only to access string contents, but also to refer to positions in strings, for example when extracting a substring. These operations need to be fast (or fast enough anyway; CPython doesn't have a very high performance baseline to meet).

However, the web platform doesn't give us O(1) access to string codepoints. Usually a codepoint just takes up one 16-bit code unit, so the (zero-indexed) 5th codepoint of JS string s may indeed be at s.codePointAt(5), but it may also be at offset 6, 7, 8, 9, or 10. You get the point: finding the nth codepoint in a JS string requires a linear scan from the beginning.
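
To make that cost concrete, here is a sketch of the linear scan over 16-bit code units needed to find the nth codepoint (a hypothetical helper, not taken from any engine):

#include <cstdint>
#include <cstddef>

// Returns the code-unit offset of the nth codepoint in a WTF-16 buffer.
// Astral codepoints occupy two code units (a surrogate pair), so there is no
// way to jump straight to codepoint n: we have to walk from the beginning.
size_t codepoint_offset(const uint16_t* units, size_t len, size_t n)
{
    size_t offset = 0;
    while (n > 0 && offset < len) {
        bool isPair = units[offset] >= 0xD800 && units[offset] <= 0xDBFF
            && offset + 1 < len
            && units[offset + 1] >= 0xDC00 && units[offset + 1] <= 0xDFFF;
        offset += isPair ? 2 : 1;
        n--;
    }
    return offset;
}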

More generally, all languages will want to expose O(1) access to some primitive subdivision of strings. For Rust, this is bytes; 8-bit bytes are the code units of UTF-8. For others like Java or C#, it's 16-bit code units. For Python, it's codepoints. When targetting JavaScript strings, there may be a performance impedance mismatch between what the platform offers and what the language requires.

Languages also generally offer some kind of string iteration facility, which doesn't need to correspond to how a JavaScript host sees strings. In the case of Python, one can implement for char in s: print(char) just fine on top of JavaScript strings, by decoding WTF-16 on the fly. Iterators can also map between, say, UTF-8 offsets and WTF-16 offsets, allowing e.g. Rust to preserve its preferred "strings are composed of bytes that are UTF-8 code units" abstraction.

Our O(1) random access problem remains, though. Are we stuck?

what does the good world look like

How should a language represent its strings, anyway? Here we depart from a precise gathering of requirements for WebAssembly strings, but in a useful way, I think: we should build abstractions not only for what is, but also for what should be. We should favor a better future; imagining the ideal helps us design the real.

I keep returning to Henri Sivonen's authoritative article, It’s Not Wrong that "🤦🏼‍♂️".length == 7, But It’s Better that "🤦🏼‍♂️".len() == 17 and Rather Useless that len("🤦🏼‍♂️") == 5. It is so good and if you have reached this point, pop it open in a tab and go through it when you can. In it, Sivonen argues (among other things) that random access to codepoints in a string is not actually important; he thinks that if you were designing Python today, you wouldn't include this interface in its standard library. Users would prefer extended grapheme clusters, which is variable-length anyway and a bit gnarly to compute; storage wants bytes; array-of-codepoints is just a bad place in the middle. Given that UTF-8 is more space-efficient than either UTF-16 or array-of-codepoints, and that it embraces the variable-length nature of encoding, programming languages should just use that.

As a model for how strings are represented, array-of-codepoints is outdated, as indeed is UTF-16. Outdated doesn't mean irrelevant, of course; there is lots of Python code out there and we have to support it somehow. But, if we are designing for the future, we should nudge our users towards other interfaces.

There is even a case that a JavaScript engine should represent its strings as UTF-8 internally, despite the fact that JS exposes a UTF-16 view on strings in its API. The pitch is that UTF-8 takes less memory, is probably what we get over the network anyway, and is probably what many of the low-level APIs that a browser uses will want; it would be faster and lighter-weight to pass UTF-8 to text shaping libraries, for example, compared to passing UTF-16 or having to copy when going to JS and when going back. JavaScript engines already have a dozen internal string representations or so (narrow or wide, cons or slice or flat, inline or external, interned or not, and the product of many of those); adding another is just a Small Matter Of Programming that could show benefits, even if some strings have to be later transcoded to UTF-16 because JS accesses them in that way. I have talked with JS engine people in all the browsers and everyone thinks that UTF-8 has a chance at being a win; the drawback is that actually implementing it would take a lot of effort for uncertain payoff.

I have two final data-points to indicate that UTF-8 is the way. One is that Swift used to use UTF-16 to represent its strings, but was able to switch to UTF-8. To adapt to the newer performance model of UTF-8, Swift maintainers designed new APIs to allow users to request a view on a string: treat this string as UTF-8, or UTF-16, or a sequence of codepoints, or even a sequence of extended grapheme clusters. Their users appear to be happy, and I expect that many languages will follow Swift's lead.

Secondly, as a maintainer of the Guile Scheme implementation, I also want to switch to UTF-8. Guile has long used Python's representation strategy: array of codepoints, with an optimization if all codepoints are "narrow" (less than 256). The Scheme language exposes codepoint-at-offset (string-ref) as one of its fundamental string access primitives, and array-of-codepoints maps well to this idiom. However, we do plan to move to UTF-8, with a Swift-like breadcrumbs strategy for accelerating per-codepoint access. We hope to lower memory consumption, simplify the implementation, and have general (but not uniform) speedups; some things will be slower but most should be faster. Over time, users will learn the performance model and adapt to prefer string builders / iterators ("string ports") instead of string-ref.

a solution for webassembly in the browser?

Let's try to summarize: it definitely makes sense for Java to use JavaScript strings when compiled to WebAssembly/GC, when running on the browser. There is an OK-ish compilation strategy for this use case involving externref, String.prototype.charCodeAt imports, and so on, along with some engine heroics to specially recognize these operations. There is an early proposal to sand off some of the rough edges, to make this use-case a bit more predictable. However, there are two limitations:

  1. Focussing on providing JS strings to Wasm/GC is only really good for Java and friends; the cost of mapping charCodeAt semantics to, say, Python's strings is likely too high.

  2. JS strings are only present on browsers (and Node and such).

I see the outcome being that Java will have to keep its implementation that uses (array i16) when targetting the edge, and use JS strings on the browser. I think that polyfills will not have acceptable performance. On the edge there will be a binary size penalty and a performance and functionality gap, relative to the browser. Some edge Wasm implementations will be pushed to implement fast JS strings by their users, even though they don't have JS on the host.

If the JS string builtins proposal were a local maximum, I could see putting some energy into it; it does make the Java case a bit better. However I think it's likely to be an unstable saddle point; if you are going to infect the edge with WTF-16 anyway, you might as well step back and try to solve a problem that is a bit more general than Java on JS.

stringref: a solution for webassembly?

I think WebAssembly should just bite the bullet and try to define a string data type, for languages that use GC. It should support UTF-8 and UTF-16 views, like Swift's strings, and support some kind of iterator API that decodes codepoints.

It should be abstract as regards the concrete representation of strings, to allow JavaScript strings to stand in for WebAssembly strings, in the context of the browser. JS hosts will use UTF-16 as their internal representation. Non-JS hosts will likely prefer UTF-8, and indeed an abstract API favors migration of JS engines away from UTF-16 over the longer term. And, such an abstraction should give the user control over what to do for surrogates: allow them, throw an error, or replace with U+FFFD.

What I describe is what the stringref proposal gives you. We don't yet have consensus on this proposal in the Wasm standardization group, and we may never reach there, although I think it's still possible. As I understand them, the objections are two-fold:

  1. WebAssembly is an instruction set, like AArch64 or x86. Strings are too high-level, and should be built on top, for example with (array i8).

  2. The requirement to support fast WTF-16 code unit access will mean that we are effectively standardizing JavaScript strings.

I think the first objection is a bit easier to overcome. Firstly, WebAssembly now defines quite a number of components that don't map to machine ISAs: typed and extensible locals, memory.copy, and so on. You could have defined memory.copy in terms of primitive operations, or required that all local variables be represented on an explicit stack or in a fixed set of registers, but WebAssembly defines higher-level interfaces that instead allow for more efficient lowering to machine primitives, in this case SIMD-accelerated copies or machine-specific sets of registers.

Similarly with garbage collection, there was a very interesting "continuation marks" proposal by Ross Tate that would give a low-level primitive on top of which users could implement root-finding of stack values. However when choosing what to include in the standard, the group preferred a more high-level facility in which a Wasm module declares managed data types and allows the WebAssembly implementation to do as it sees fit. This will likely result in more efficient systems, as a Wasm implementation can more easily use concurrency and parallelism in the GC implementation than a guest WebAssembly module could do.

So, the criterion for what to include in the Wasm standard is not "what is the most minimal primitive that can express this abstraction", or even "what looks like an ARMv8 instruction", but rather "what makes Wasm a good compilation target". Wasm is designed for its compiler-users, not for the machines that it runs on, and if we manage to find an abstract definition of strings that works for Wasm-targetting toolchains, we should think about adding it.

The second objection is trickier. When you compile to Wasm, you need a good model of what the performance of the Wasm code that you emit will be. Different Wasm implementations may use different stringref representations; requesting a UTF-16 view on a string that is already UTF-16 will be cheaper than doing so on a string that is UTF-8. In the worst case, requesting a UTF-16 view on a UTF-8 string is a linear operation on one system but constant-time on another, which in a loop over string contents makes the former system quadratic: a real performance failure that we need to design around.

The stringref proposal tries to reify as much of the cost model as possible with its "view" abstraction; the compiler can reason that any cost may incur then rather than when accessing a view. But, this abstraction can leak, from a performance perspective. What to do?

If we look back on what the expected outcome of the JS-strings-for-Java proposal is, I believe that if Wasm succeeds as a target for Java, we will probably already end up with WTF-16 everywhere. We might as well admit this, I think, and if we do, then this objection goes away. Likewise on the Web I see UTF-8 as being potentially advantageous in the medium-long term for JavaScript, and certainly better for other languages, and so I expect JS implementations to also grow support for fast UTF-8.

i'm on a horse

I may be off in some of my predictions about where things will go, so who knows. In the meantime, in the time that it takes other people to reach the same conclusions, stringref is in a kind of hiatus.

The Scheme-to-Wasm compiler that I work on does still emit stringref, but it is purely a toolchain concept now: we have a post-pass that lowers stringref to WTF-8 via (array i8), and which emits calls to host-supplied conversion routines when passing these strings to and from the host. When compiling to Hoot's built-in Wasm virtual machine, we can leave stringref in, instead of lowering it down, resulting in more efficient interoperation with the host Guile than if we had to bounce through byte arrays.

So, we wait for now. Not such a bad situation, at least we have GC coming soon to all the browsers. Happy hacking to all my stringfolk, and until next time!

by Andy Wingo at October 19, 2023 10:33 AM

October 16, 2023

Andy Wingo

on safepoints

Hello all, a brief note today. For context, my big project over the last year and a half or so is Whippet, a new garbage collector implementation. If everything goes right, Whippet will find a better point on the space/time tradeoff curve than Guile's current garbage collector, BDW-GC, and will make it into Guile, well, sometime.

But, testing a garbage collector is... hard. Methodologically it's very tricky, though there has been some recent progress in assessing collectors in a more systematic way. Ideally of course you test against a real application, and barring that, against a macrobenchmark extracted from a real application. But garbage collectors are deeply intertwined with language run-times; to maximize the insight into GC performance, you need to minimize everything else, and to minimize non-collector cost, that means relying on a high-performance run-time (e.g. the JVM), which... is hard! It's hard enough to get toy programs to work, but testing against a beast like the JVM is not easy.

In the case of Guile, it's even more complicated, because the BDW-GC is a conservative collector. BDW-GC doesn't require precise annotation of stack roots, and scans intraheap edges conservatively (unless you go out of your way to do otherwise). How can you test this against a semi-space collector if Guile can't precisely enumerate all object references? The Immix-derived collector in Whippet can be configured to run with conservative stack roots (and even heap links), but how then to test the performance tradeoff of conservative vs precise tracing?

In my first iterations on Whippet, I hand-coded some benchmarks in C, starting with the classic gcbench. I used stack-allocated linked lists for precise roots, when run in precise-rooting mode. But, this is excruciating and error-prone; trying to write some more tests, I ran up against a wall of gnarly nasty C code. Not fun, and I wasn't sure that it was representative in terms of workload; did the handle discipline have some kind of particular overhead? The usual way to do things in a high-performance system is to instead have stack maps, where the collector is able to precisely find roots on the stack and registers using a side table.

Of course the usual solution if something is not fun is to make it into a tools problem, so I wrote a little Scheme-to-C compiler, Whiffle, purpose-built for testing Whippet. It's a baseline-style compiler that uses the C stack for control (function call and return), and a packed side stack of temporary values. The goal was to be able to generate some C that I could include in the collector's benchmarks; I've gotten close but haven't done that final step yet.

Anyway, because its temporaries are all in a packed stack, Whippet can always traverse the roots precisely. This means I can write bigger benchmark programs without worry, which will allow me to finish testing the collector. As a nominally separate project, Whiffle also tests that Whippet's API is good for embedding. (Yes, the names are similar; if I thought that in the future I would need to say much more about Whiffle, I would rename it!)

I was able to translate over the sparse benchmarks that Whippet already had into Scheme, and set about checking that they worked as expected. Which they did... mostly.

the bug

The Whippet interface abstracts over different garbage collector implementations; the choice of which collector to use is made at compile-time. There's the basic semi-space copying collector, with a large object space and ephemerons; there's the Immix derived "whippet" collector (yes, it shares the same name); and there's a shim that provides the Whippet API via the BDW-GC.

The benchmarks are multithreaded, except when using the semi-space collector which only supports one mutator thread. (I should fix that at some point.) I tested the benchmarks with one mutator thread, and they worked with all collectors. Yay!

Then I tested with multiple mutator threads. The Immix-derived collectors worked fine, in both precise and conservative modes. The BDW-GC collector... did not. With just one mutator thread it was fine, but with multiple threads I would get segfaults deep inside libgc.

the side quest

Over on the fediverse, Daphne Preston-Kendal asks, how does Whiffle deal with tail calls? The answer is, "sloppily". All of the generated function code has the same prototype, so return-calls from one function to another should just optimize to jumps, and indeed they do: at -O2. For -O0, this doesn't work, and sometimes you do want to compile in that mode (no optimizations) when investigating. I wish that C had a musttail attribute, and I use GCC too much to rely on the one that only Clang supports.

I found the -foptimize-sibling-calls flag in GCC and somehow convinced myself that it was a sufficient replacement, even when otherwise at -O0. However most of my calls are indirect, and I think this must have fooled GCC. Really not sure. But I was sure, at one point, until a week later I realized that the reason that the call instruction was segfaulting was because it couldn't store the return address in *$rsp, because I had blown the stack. That was embarrassing, but realizing that and trying again at -O2 didn't fix my bug.

the second side quest

Finally I recompiled bdw-gc, which I should have done from the beginning. (I use Guix on my test machine, and I wish I could have entered into an environment that had a specific set of dependencies, but with debugging symbols and source code; there are many things about Guix that I would think should help me develop software but which don't. Probably there is a way to do this that I am unaware of.)

After recompiling, it became clear that BDW-GC was trying to scan a really weird range of addresses for roots, and accessing unmapped memory. You would think that this would be a common failure mode for BDW-GC, but really, this never happens: it's an astonishingly reliable piece of software for what it does. I have used it for almost 20 years and its problems are elsewhere. Anyway, I determined that the thread that it was scanning was... well it was stuck somewhere in some rr support library, which... why was that anyway?

You see, the problem was that since my strange spooky segfaults on stack overflow, I had been living in rr for a week or so, because I needed to do hardware watchpoints and reverse-continue. But, somehow, strangely, oddly, rr was corrupting my program: it worked fine when not run in rr. Somehow rr caused the GC to grab the wrong $rsp on a remote thread.

the actual bug

I had found the actual bug in the shower, some days before, and fixed it later that evening. Consider that both precise Immix and conservative Immix are working fine. Consider that conservative Immix is, well, stack-conservative just like BDW-GC. Clearly Immix finds the roots correctly, why didn't BDW-GC? What is the real difference?

Well, friend, you have read the article title, so perhaps you won't be surprised when I say "safepoints". A safepoint is a potential stopping place in a program. At a safepoint, the overall state of the program can be inspected by some independent part of the program. In the case of garbage collection, safepoints are places where heap object references in a thread's stack can be enumerated.

For my Immix-derived collectors, safepoints are cooperative: the only place a thread will stop for collection is in an allocation call, or if the thread has explicitly signalled to the collector that it is doing a blocking operation such as a thread join. Whiffle usually keeps the stack pointer for the side value stack in a register, but for allocating operations that need to go through the slow path, it makes sure to write that stack pointer into a shared data structure, so that other threads can read it if they need to walk its stack.

The BDW collector doesn't work this way. It can't rely on cooperation from the host. Instead, it installs a signal handler for the process that can suspend and resume any thread at any time. When BDW-GC needs to trace the heap, it stops all threads, finds the roots for those threads, and then traces the graph.

I had installed a custom BDW-GC mark function for Whiffle stacks so that even though they were in anonymous mmap'd memory that BDW-GC doesn't usually trace for roots, Whiffle would be sure to let BDW-GC know about them. That mark function used the safepoints that I was already tracking to determine which part of the side stack to mark.

But here's the thing: cooperative safepoints can be lazy, but preemptive safepoints must be eager. For BDW-GC, it's not enough to ensure that the thread's stack is traversable at allocations: it must be traversable at all times, because a signal can stop a thread at any time.
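
A minimal sketch of that distinction, using hypothetical names (MutatorState, collector_allocate) rather than Whiffle's actual code:

#include <cstddef>
#include <cstdint>

using Value = uintptr_t;                  // stand-in for a tagged value

// Hypothetical per-mutator state shared with the collector.
struct MutatorState {
    Value* stack_base;
    Value* published_sp;                  // everything in [stack_base, published_sp)
                                          // must hold valid roots when others look
};

Value* collector_allocate(size_t bytes);  // may stop the world and trace roots

// Cooperative safepoint: the side-stack pointer normally lives in a register
// and is only published lazily, right before entering a path where the
// collector is allowed to run. That is enough when threads only stop here.
Value* allocate_slow(MutatorState* state, Value* sp, size_t bytes)
{
    state->published_sp = sp;             // make this thread's stack walkable now
    return collector_allocate(bytes);
}

// Preemptive safepoints (the BDW-GC model) have no such luxury: a signal can
// suspend the thread between any two instructions, so published_sp has to be
// kept conservatively correct at all times, not just around allocations.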

Safepoints also differ on the axis of precision: precise safepoints must include no garbage roots, whereas conservative safepoints can be sloppy. For precise safepoints, if you bump the stack pointer to allocate a new frame, you can't include the slots in that frame in the root set until they have values. Precision has a cost to the compiler and to the run-time in side table size or shared-data-structure update overhead. As far as I am aware, no production collector uses fully preemptive, precise roots. (Happy to be corrected, if people have counterexamples; I know that this used to be the case, but my understanding is that since multi-threaded mutators have been common, people have mostly backed away from this design.)

Concretely, as I was using a stack discipline to allocate temporaries, I had to shrink the precise safepoint at allocation sites, but I was neglecting to expand the conservative safepoint whenever the stack pointer would be restored.

coda

When I was a kid, baseball was not my thing. I failed out at the earlier phase of tee-ball, where the ball isn't thrown to you by the pitcher but is instead on a kind of rubber stand. As often as not I would fail to hit the ball and instead buckle the tee under it, making an unsatisfactory "flubt" sound, causing the ball to just fall to the ground. I probably would have persevered if I hadn't also caught a ground ball to the face one day while playing right field, knocking me out and giving me a nosebleed so bad that apparently they called off the game because so many other players were vomiting. I woke up in the entryway of the YMCA and was given a lukewarm melted push-pop.

Anyway, "flubt" is the sound I heard in my mind when I realized that rr had been perturbing my use of BDW-GC, and instead of joy and relief when I finally realized that it worked already, it tasted like melted push-pop. It be that way sometimes. Better luck next try!

by Andy Wingo at October 16, 2023 01:46 PM

October 11, 2023

Manuel Rego

Servo at Linux Foundation events in Bilbao

A few weeks ago I was at Bilbao for a couple of Linux Foundation events. My presence there was related to a new role I’ve been playing this year, as I’m currently the chair of the Servo Technical Steering Committee (TSC).

About Servo #

So first things first, Servo is a web rendering engine written in Rust. Mozilla started the project more than 10 years ago, but in 2020 they stopped working on the project and Servo joined Linux Foundation. After a couple of quiet years, in January 2023 Igalia took over the project maintenance and we have been working on it since then. As part of Igalia’s team working on Servo, I’ve been helping here and there trying to coordinate efforts and make things happen; that’s why I was elected chair of the TSC. You can find more information about the Servo project at https://servo.org or in the talks linked in this post.

Linux Foundation Events #

On Monday 18th September, Igalia participated in the Linux Foundation Europe Member Summit with my colleague Samuel Iglesias, as part of the Advisory Board, and myself, as project representative for Servo. We had a Servo booth there where people could see Servo running on the Raspberry Pi 400 and browse the examples at https://demo.servo.org. As part of this event I gave a brief update on the status of the project, which you can watch on YouTube.

Servo booth at Linux Foundation Europe 2023

The next three days (from Tuesday 19th to Thursday 21st September) we participated in the Open Source Summit Europe, where Igalia had three talks:

This time my talk was longer and included a detailed update on the Servo project, what has happened during this year, and our plans for the future.

As Servo is a Linux Foundation Europe project, we also had the chance to showcase Servo at their booth during the three days of the event. There we let people know that the project is still alive, shared our plans for the future, and let them test Servo live.

It was a nice experience overall, and visiting Bilbao is always a pleasure. If you want to follow updates about Servo don’t miss the project blog or the Mastodon and Twitter accounts.

October 11, 2023 12:00 AM

October 10, 2023

Alex Bradbury

What's new for RISC-V in LLVM 17

LLVM 17 was released in the past few weeks, and I'm continuing the tradition of writing up some selective highlights of what's new as far as RISC-V is concerned in this release. If you want more general, regular updates on what's going on in LLVM you should of course subscribe to my newsletter.

In case you're not familiar with LLVM's release schedule, it's worth noting that there are two major LLVM releases a year (i.e. one roughly every 6 months) and these are timed releases as opposed to being cut when a pre-agreed set of feature targets have been met. We're very fortunate to benefit from an active and growing set of contributors working on RISC-V support in LLVM projects, who are responsible for the work I describe below - thank you! I coordinate biweekly sync-up calls for RISC-V LLVM contributors, so if you're working in this area please consider dropping in.

Code size reduction extensions

A family of extensions referred to as the RISC-V code size reduction extensions was ratified earlier this year. One aspect of this is providing ways of referring to subsets of the standard compressed 'C' (16-bit instructions) extension that don't include floating point loads/stores, as well as other variants. But the more meaningful additions are the Zcmp and Zcmt extensions, in both cases targeted at embedded rather than application cores, reusing encodings from the double-precision floating-point store instructions.

Zcmp provides instructions that implement common stack frame manipulation operations that would typically require a sequence of instructions, as well as instructions for moving pairs of registers. The RISCVMoveMerger pass performs the necessary peephole optimisation to produce cm.mva01s or cm.mvsa01 instructions for moving to/from registers a0-a1 and s0-s7 when possible. It iterates over generated machine instructions, looking for pairs of c.mv instructions that can be replaced. cm.push and cm.pop instructions are generated by appropriate modifications to the RISC-V function frame lowering code, while the RISCVPushPopOptimizer pass looks for opportunities to convert a cm.pop into a cm.popretz (pop registers, deallocate stack frame, and return zero) or cm.popret (pop registers, deallocate stack frame, and return).

Zcmt provides the cm.jt and cm.jalt instructions to reduce the code size needed for implementing a jump table. Although support is present in the assembler, the patch to modify the linker to select these instructions is still under review, so we can hope to see full support in LLVM 18.

The RISC-V code size reduction working group have estimates of the code size impact of these extensions produced using this analysis script. I'm not aware of whether a comparison has been made to the real-world results of implementing support for the extensions in LLVM, but that would certainly be interesting.

Vectorization

LLVM has two forms of auto-vectorization, the loop vectorizer and the SLP (superword-level parallelism) vectorizer. The loop vectorizer was enabled during the LLVM 16 development cycle, while the SLP vectorizer was enabled for this release. Beyond that, there's been a huge number of incremental improvements for vector codegen, such that it isn't always easy to pick out particular highlights. But to pick a small set of changes:

  • Version 0.12 of the RISC-V vector C intrinsics specification is now supported by Clang. As noted in the release notes, the hope is there will not be new incompatibilities introduced prior to v1.0.
  • There were lots of minor codegen improvements; one example would be improvements to the RISCVInsertVSETVLI pass to avoid additional unnecessary insertions of the vsetivli instruction that is used to modify the vtype control register.
  • It's not particularly user visible, but there was a lot of refactoring of the vector pseudoinstructions used internally during instruction selection (following this thread). The added documentation will likely be helpful if you're hoping to better understand this.
  • You might be aware that LMUL in the RISC-V vector extension controls grouping of vector registers, for instance rather than 32 vector registers, you might want to set LMUL=4 to treat them as 8 registers that are 4 times as large. The "best" LMUL is going to vary depending on both the target microarchitecture and factors such as register pressure, but a change was made so LMUL=2 is the new default.
  • llvm-mca (the LLVM Machine Code Analyzer) is a performance analysis tool that uses information such as LLVM scheduling models to statically estimate the performance of machine code on a specific CPU. There were at least two changes relevant to llvm-mca and RISC-V vector support: scheduling information for RVV on SiFive7 cores (which of course is used outside of llvm-mca as well), and support for vsetivli/vsetvli 'instruments'. llvm-mca has the concept of an 'instrument region', a section of assembly with an LLVM-MCA comment that can (for instance) indicate the value of a control register that would affect scheduling. This can be used to set LMUL (register grouping) for RISC-V; however, in the case of the immediate forms of vsetvli occurring in the input, LMUL can be statically determined.

If you want to find out more about RISC-V vector support in LLVM, be sure to check out my Igalia colleague Luke Lau's talk at the LLVM Dev Meeting this week (I'll update this article when slides+recording are available).

Other ISA extensions

It wouldn't be a RISC-V article without a list of hard to interpret strings that claim to be ISA extension names (Zvfbfwma is a real extension, I promise!). In addition to the code size reduction extension listed above there's been lots of newly added or updated extensions in this release cycle. Do refer to the RISCVUsage documentation for something that aims to be a complete list of what is supported (occasionally there are omissions) as well as clarity on what we mean by an extension being marked as "experimental".

Here's a partial list:

  • Code generation support for the Zfinx, Zdinx, Zhinx, and Zhinxmin extensions. These extensions provide support for single, double, and half precision floating point instructions respectively, but define them to operate on the general purpose register file rather than requiring an additional floating point register file. This reduces implementation cost on simple core designs.
  • Support for a whole range of vendor-defined extensions, e.g. XTHeadBa (address generation), XTHeadBb (basic bit manipulation), Xsfvcp (SiFive VCIX), XCVbitmanip (CORE-V bit manipulation custom instructions) and many more (see the release notes).
  • Experimental vector crypto extension support was updated to version 0.5.1 of the specification.
  • Experimental support was added for version 0.2 of the Zfa extension (providing additional floating-point instructions).
  • Assembler/disassembler support for an experimental family of extensions supporting operations on the bfloat16 floating-point format: Zfbfmin, Zvfbfmin, and Zvfbfwma.
  • Assembler/disassembler support for the experimental Zacas extension (atomic compare-and-swap).

It landed after the 17.x branch so isn't in this release, but in the future you'll be able to use --print-supported-extensions with Clang to have it print a table of supported ISA extensions (the same flag has now been implemented for Arm and AArch64 too).

Other additions and improvements

As always, it's not possible to go into detail on every change. A selection of other changes that I'm not able to delve into more detail on:

  • Initial RISC-V support was added to LLVM's BOLT post-link optimizer and various fixes / feature additions made to JITLink, thanks to the work of my Igalia colleague Job Noorman. There's actually a lot to say about this work, but I don't need to because Job has written up an excellent blog post on it that I highly encourage you to go and read.
  • LLD gained support for some of the relaxations involving the global pointer.
  • I expect there'll be more to say about this in future releases, but there's been incremental progress on RISC-V GlobalISel in the LLVM 17 development cycle (which has continued after). You might be interested in the slides from my GlobalISel by example talk at EuroLLVM this year. Ivan Baev at SiFive is also set to speak about some of this work at the RISC-V Summit in November.
  • Clang supports a form of control-flow integrity called KCFI. This is used by low-level software like the Linux kernel (see CONFIG_CFI_CLANG in the Linux tree) but the target-specific parts were previously unimplemented for RISC-V. This gap was filled for the LLVM 17 release.
  • LLVM has its own work-in-progress libc implementation, and the RISC-V implementations of memcmp, bcmp, memset, and memcpy all gained optimised RISC-V specific versions. There will of course be further updates for LLVM 18, including the work from my colleague Mikhail R Gadelha on 32-bit RISC-V support.

Apologies if I've missed your favourite new feature or improvement - the LLVM release notes will include some things I haven't had space for here. Thanks again to everyone who has been contributing to make RISC-V support in LLVM even better.

If you have a RISC-V project you think my colleagues and I at Igalia may be able to help with, then do get in touch regarding our services.

October 10, 2023 12:00 PM

October 05, 2023

Eric Meyer

An Anchored Navbar Solution

Not quite a year ago, I published an exploration of how I used layered backgrounds to create the appearance of a single bent line that connected one edge of the design to whichever navbar link corresponded to the current page.  It was fairly creative, if I do say so myself, but even then I knew  —  and said explicitly!  —  that it was a hack, and that I really wanted to use anchor positioning to do it cleanly.

Now that anchor positioning is supported behind a developer flag in Chrome, we can experiment with it, as I did in the recent post “Nuclear Anchored Sidenotes”.  Well, today, I’m back on my anchor BS with a return to that dashed navbar connector as seen on wpewebkit.org, and how it can be done more cleanly and simply, just as I’d hoped last year.

First, let’s look at the thing we’re trying to recreate.

The connecting line, as done with a bunch of forcibly-sized and creatively overlapped background gradient images.

To understand the ground on which we stand, let’s make a quick perusal of the simple HTML structure at play here.  At least, the relevant parts of it, with some bits elided by ellipses for clarity.

<nav class="global">
	<div>
		<a href="…"><img src="…" alt="WPE"></a>
		<ul>…</ul>
	</div>
</nav>

Inside that (unclassed! on purpose!) <ul>, there are a number of list items, each of which holds a hyperlink.  Whichever list item contains the hyperlink that corresponds to the current page gets a class of currentPage, because class naming is a deep and mysterious art.

To that HTML structure, the following bits of CSS trickery were applied in the work I did last year, brought together in this code block for the sake of brevity (note this is the old thing, not the new anchoring hotness):

nav.global div {
	display: flex;
	justify-content: space-between;
	gap: 1em;
	max-width: var(--mainColMax);
	margin: 0 auto;
	height: 100%;
	background: var(--dashH);
}

@media (min-width: 720px) {
	nav.global ul li.currentPage::before {
		content: '';
		position: absolute;
		z-index: 1;
		top: 50%;
		bottom: 0;
		left: 50%;
		right: 0;
		background:
			var(--dashV),
			linear-gradient(0deg, #FFFF 2px, transparent 2px)
			;
		background-size: 1px 0.5em, auto;
	}
	nav.global ul li.currentPage {
		position: relative;
	}
	nav.global ul li.currentPage a {
		position: relative;
		z-index: 2;
		padding: 0;
		padding-block: 0.25em;
		margin: 1em;
		background: var(--dashH);
		background-size: 0.5em 1px;
		background-position: 50% 100%;
		background-color: #FFF;
		color: inherit;
	}
}

If you’re wondering what the heck is going on there, please feel free to read the post from last year.  You can even go read it now, if you want, even though I’m about to flip most of that apple cart and stomp on the apples to make ground cider.  Your life is your own; steer it as best suits you.

Anyway, here are the bits I’m tearing out to make way for an anchor-positioning solution.  The positioning-edge properties (top, etc.) removed from the second rule will return shortly in a more logical form.

nav.global div {
	display: flex;
	justify-content: space-between;
	gap: 1em;
	max-width: var(--mainColMax);
	margin: 0 auto;
	height: 100%;
   background: var(--dashH);
}
@media (min-width: 720px) {
	nav.global ul li.currentPage::before {
		content: '';
		position: absolute;
	   z-index: 1;
		top: 50%;
		bottom: 0;
		left: 50%;
		right: 0;
		background:
			var(--dashV),
			linear-gradient(0deg, #FFFF 2px, transparent 2px)
			;
		background-size: 1px 0.5em, auto;
	}
   nav.global ul li.currentPage {
		position: relative;
	}
	nav.global ul li.currentPage a {
	   position: relative;
		z-index: 2;
		padding: 0;
		padding-block: 0.25em;
		margin: 1em;
	   background: var(--dashH);
		background-size: 0.5em 1px;
		background-position: 50% 100%;
		background-color: #FFF;
		color: inherit;
	}
}

That pulls out not only the positioning edge properties, but also the background dash variables and related properties.  And a whole rule to relatively position the currentPage list item, gone.  The resulting lack of any connecting line being drawn is perhaps predictable, but here it is anyway.

The connecting line disappears as all its support structures and party tricks are swept away.

With the field cleared of last year’s detritus, let’s get ready to anchor!

Step one is to add in positioning edges, for which I’ll use logical positioning properties instead of the old physical properties.  Along with those, a negative Z index to drop the generated decorator (that is, a decorative component based on generated content, which is what this ::before rule is creating) behind the entire set of links, dashed borders along the block and inline ends of the generated decorator, and a light-red background color so we can see the decorator’s placement more clearly.

 nav.global ul li.currentPage::before {
		content: '';
		position: absolute;
	   inset-block-start: 0;
		inset-block-end: 0;
		inset-inline-start: 0;
		inset-inline-end: 0;
		z-index: -1;
		border: 1px dashed;
		border-block-width: 0 1px;
		border-inline-width: 0 1px;
		background-color: #FCC;
	}

I’ll also give the <a> element inside the currentPage list item a dashed border along its block-end edge, since the design calls for one.

 nav.global ul li.currentPage a {
		padding: 0;
		padding-block: 0.25em;
		margin: 1em;
		color: inherit;
	   border-block-end: 1px dashed;
	}

And those changes give us the result shown here.

The generated decorator, decorating the entirety of its containing block.

Well, I did set all the positioning edge values to be 0, so it makes sense that the generated decorator fills out the relatively-positioned <div> acting as its containing block.  Time to fix that.

What we need to do is give the top and right  —  excuse me, the block-start and inline-end  —  edges of the decorator a positioning anchor.  Since the thing we want to connect the decorator’s visible edges to is the <a> inside the currentPage list item, I’ll make it the positioning anchor:

 nav.global ul li.currentPage a {
		padding: 0;
		padding-block: 0.25em;
		margin: 1em;
		color: inherit;
		border-block-end: 1px dashed;
	   anchor-name: --currentPageLink;
	}

Yes, you’re reading that correctly: I made an anchor be an anchor.

(That’s an HTML anchor element being designated as a CSS positioning anchor, to be clear.  Sorry to pedantically explain the joke and thus ruin it, but I fear confusion more than banality.)

Now that we have a positioning anchor, the first thing to do, because it’s more clear to do it in this order, is to pin the inline-end edge of the generated decorator to its anchor.  Specifically, to pin it to the center of the anchor, since that’s what the design calls for.

 nav.global ul li.currentPage::before {
		content: '';
		position: absolute;
		inset-block-start: 0;
		inset-block-end: 0;
		inset-inline-start: 0;
		inset-inline-end: anchor(--currentPageLink center);
		z-index: -1;
		border: 1px dashed;
		border-block-width: 0 1px;
		border-inline-width: 0 1px;
		background-color: #FCC;
	}

Because this anchor() function is being used with an inline inset property, the center here refers to the inline center of the referenced anchor (in both the HTML and CSS senses of that word) --currentPageLink, which in this particular case is its horizontal center.  That gives us the following.

The generated decorator with its inline-end edge aligned with the inline center of the anchoring anchor.

The next step is to pin the top block edge of the generated decorator with respect to its positioning anchor.  Since we want the line to come up and touch the block-end edge of the anchor, the end keyword is used to pin to the block end of the anchor (in this situation, its bottom edge).

 nav.global ul li.currentPage::before {
		content: '';
		position: absolute;
		inset-block-start: anchor(--currentPageLink end);
		inset-block-end: 0;
		inset-inline-start: 0;
		inset-inline-end: anchor(--currentPageLink center);
		z-index: -1;
		border: 1px dashed;
		border-block-width: 0 1px;
		border-inline-width: 0 1px;
		background-color: #FCC;
	}

Since the inset property in this case is block-related, the end keyword here means the block end of the anchor (again, in both senses).  And thus, the job is done, except for removing the light-red diagnostic background.

The generated decorator with its block-start edge aligned with the block-end edge of the anchoring anchor.

Once that red background is taken out, we end up with the following rules inside the media query:

 nav.global ul li.currentPage::before {
		content: '';
		position: absolute;
		inset-block-start: anchor(--currentPageLink bottom);
		inset-block-end: 0;
		inset-inline-start: 0;
		inset-inline-end: anchor(--currentPageLink center);
		z-index: -1;
		border: 1px dashed;
		border-block-width: 0 1px;
		border-inline-width: 0 1px;
	}
	nav.global ul li.currentPage a {
		padding: 0;
		padding-block: 0.25em;
		margin: 1em;
		color: inherit;
		border-block-end: 1px dashed;
		anchor-name: --currentPageLink;
	}

The inline-start and block-end edges of the generated decorator still have position values of 0, so they stick to the edges of the containing block (the <div>).  The block-start and inline-end edges have values that are set with respect to their anchor.  That’s it, done and dusted.

The connecting line is restored, but is now a lot easier to manage from the CSS side.

…okay, okay, there are a couple more things to talk about before we go.

First, the dashed borders I used here don’t look fully consistent with the other dashed “borders” in the design.  I used actual borders for the CSS in this article because they’re fairly simple, as CSS goes, allowing me to focus on the topic at hand.  To make these borders fully consistent with the rest of the design, I have two choices:

  1. Remove the borders from the generated decorator and put the background-trick “borders” back into it.  This would be relatively straightforward to do, at the cost of inflating the rules a little bit with background sizing and positioning and all that.
  2. Convert all the other background-trick “borders” to be actual dashed borders.  This would also be pretty straightforward, and would reduce the overall complexity of the CSS.

On balance, I’d probably go with the first option, because dashed borders still aren’t fully visually consistent from browser to browser, and people get cranky about those kinds of inconsistencies.  Background gradient tricks give you more control in exchange for you writing more declarations.  Still, either choice is completely defensible.

Second, you might be wondering if that <div> was even necessary.  Not technically, no.  At first, I kept using it because it was already there, and removing it seemed like it would require refactoring a bunch of other code not directly related to this post.  So I didn’t.

But it tasked me.  It tasked me.  So I decided to take it out after all, and see what I’d have to do to make it work.  Once I realized doing this illuminated an important restriction on what you can do with anchor positioning, I decided to explore it here.

As a reminder, here’s the HTML as it stood before I started removing bits:

<nav class="global">
	<div>
		<a href="…"><img src="…" alt="WPE"></a>
		<ul>…</ul>
	</div>
</nav>

Originally, the <div> was put there to provide a layout container for the logo and navbar links, so they’d be laid out to line up with the right and left sides of the page content.  The <nav> was allowed to span the entire page, and the <div> was set to the same width as the content, with auto side margins to center it.

So, after pulling out the <div>, I needed an anchor for the navbar to size itself against.  I couldn’t use the <main> element that follows the <nav> and contains the page content, because it’s a page-spanning Grid container.  Just inside it, though, are <section> elements, and some (not all!) of them are the requisite width.  So I added:

main > section:not(.full-width) {
	anchor-name: --mainCol;
}

The full-width class makes some sections page-spanning, so I needed to avoid those; thus the negative selection there.  Now I could reference the <nav>’s edges against the named anchor I just defined.  (Which is probably actually multiple anchors, but they all have the same width, so it comes to the same thing.)  So I dropped those anchor references into the CSS:

nav.global {
	display: flex;
	justify-content: space-between;
	height: 5rem;
	gap: 1em;
	position: fixed;
	top: 0;
	inset-inline-start: anchor(--mainCol left);
	inset-inline-end: anchor(--mainCol right);
	z-index: 12345;
	backdrop-filter: blur(10px);
	background: hsla(0deg,0%,100%,0.9);
}

And that worked!  The inline start and end edges, which in this case are the left and right edges, lined up with the edges of the content column.

Positioning the <nav> with respect to the anchoring section(s).

…except it didn’t work on any page that had any content that overflowed the main column, which is most of them.

See, this is why I embedded a <div> inside the <nav> in the first place.

But wait.  Why couldn’t I just position the logo and list of navigation links against the --mainCol anchor?  Because in anchored positioning, just like nearly every other form of positioning, containing blocks are barriers.  Recall that the <nav> is a fixed-position box, so it can stick to the top of the viewport.  That means any elements inside it can only be positioned with respect to anchors that also have the <nav> as their containing block.

That’s fine for the generated decorator, since it and the currentPageLink anchor both have the <nav> as their containing block.  To try to align the logo and navlinks, though, I can’t look outside the <nav> at anything else, and that includes the sections inside the <main> element, because the <nav> is not their containing block.  The <nav> element itself, on the other hand, shares a containing block with those sections: the initial containing block.  So I can anchor the <nav> itself to --mainCol.

I fiddled with various hacks to extend the background of the <nav> without shifting its content edges, padding and negative margins and stuff like that, but in the end, I fell back on a border-image hack, which required I remove the background.

nav.global {
	display: flex;
	justify-content: space-between;
	height: 5rem;
	gap: 1em;
	position: fixed;
	top: 0;
	inset-inline-start: anchor(--mainCol left);
	inset-inline-end: anchor(--mainCol right);
	z-index: 12345;
	backdrop-filter: blur(10px);
	background: hsla(0deg,0%,100%,0.9);
	border-image-outset: 0 100vw;
	border-image-slice: 0 fill;
	border-image-width: 0;
	border-image-repeat: stretch;
	border-image-source: linear-gradient(0deg,hsla(0deg,0%,100%,0.9),hsla(0deg,0%,100%,0.9));
}

And that solved the visual problem.

The appearance of a full-width navbar, although it’s mostly border image fakery.

Was it worth it?  I have mixed feelings about that.  On the one hand, putting all of the layout hackery into the CSS and removing it all from the HTML feels like the proper approach.  On the other hand, it’s one measly <div>, and taking that approach means better support for older browsers.  On the gripping hand, if I’m going to use anchor positioning, older browsers are already being left out of the fun.  So I probably wouldn’t have even gone down this road, except it was a useful example of how anchor positioning can be stifled.

At any rate, there you have it, another way to use anchor positioning to create previously difficult design effects with relative ease.  Just remember that all this is still in the realm of experiments, and production use will be limited to progressive enhancements until this comes out from behind the developer flags and more browsers add support.  That makes now a good time to play around, get familiar with the technology, that sort of thing.  Have fun with it!


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at October 05, 2023 01:23 PM

Brian Kardell

tel:me.about.it.

tel:me.about.it.

Let’s talk about protocols. I’ve been repeating most of what is in this post in conversations for a while now, so I thought I should write it down. They are thoughts around what it means to add new protocols to the platform, and how we go about it in practice. The main extensibility mechanism that we have comes from the HTML5 era: registerProtocolHandler(). In this post I’ll explain where I think that’s kind of OK, where it falls (sometimes very) short, and some things I think we learned along the way. The purpose of this post isn’t to make suggestions, it’s simply to have a look at the space and the things I think we need to keep in mind.

Most of the URLs we type these days begin with https: - or before that, http:. We changed the main protocol of the web to be more secure. It’s not inconceivable we could want to have that level of change again, but there are plenty of things short of that where protocols develop and somehow find their way into the platform.

A few popular examples from the HTTPArchive data: Over 30% of websites include at least one mailto: link. Almost as many sites include a tel: link. There’s plenty of webcal: and fax:. geo: is used on over 20,300 sites. sms: is used on 42,600+ websites. These are just a few examples of protocols that are trying to do something useful.

Things That Go “Click”

When you click a link that has one of those protocol identifiers, the browser says “Do I know how to handle that protocol?” If the answer is no, it passes responsibility along to the OS. So, if you have Skype installed, and click on a skype: link, the OS might launch Skype for you.

Protocol handlers really are just built on the idea that over time, we build web-based services and it should be possible to say that the way to handle them can be by rerouting handling to a domain (say, gmail.com for mailto: links).
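
For reference, this is roughly what that registration looks like from script. This is a hedged sketch: mail.example.com is a made-up domain, the call only works when the page doing the registering is itself served from that origin, and browsers will ask the user before honoring it.

if ("registerProtocolHandler" in navigator) {
  // "%s" is replaced with the full URL of whatever link the user clicked.
  navigator.registerProtocolHandler(
    "mailto",
    "https://mail.example.com/compose?to=%s"
  );
}

Custom schemes have to start with web+ to be registrable at all; everything else has to come from the browsers' safelist of known protocols (mailto:, tel:, sms:, geo:, and so on).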

So, for those things, it works… Kind of.

The mistakes we made along the way…

On the surface, all of those protocols that turn up in the HTTP Archive data seem like success stories. However, I’d like to tell you why they haven’t been so great, and why we already have better solutions.

Consider the second most widely used of those, the tel: protocol. Not because this one is especially bad, but because it is very easy to illustrate many problems in a concise example.

Before tel:, authors would put some text in their web page, like:

Phone: (555)123-4567

With tel:, authors could now put something like this into their HTML:

<a href="tel:123.456.7890">Call now</a>
	

Yay! Except, no… wait… That is kind of terrible.

Sure, for some users, where all of the stars align, that will be a nice convenience. But there are just way too many cases where this is more of a barrier than a convenience. Let’s look at some of the failure modes of that line of code, starting with the simplest:

Hidden information

Because we’ve made it a link, it’s far too common that authors don’t actually display the phone number. They assume that the device I’m looking at is the device I want to dial from. There are many cases where that is not true!

Perhaps I’m on my office desktop and I want to dial with the handy office phone that is next to me, or the conference system, or even just my cell phone. Maybe I want to read the number out to someone else, or write it down for later. Just show me the !@$!@ number!

This isn’t a problem inherent to the protocol, but it’s still a problem.

Theoretically capable, practically incapable

Sometimes a device can be physically capable of dialing out, but currently not have that service available. This happened to me while traveling abroad in 2018, where I was able to use my cell with WiFi but couldn’t dial out (I didn’t have WiFi calling at the time).

This problem still comes up: I use that same 2018 cell phone to control my media in my house, and occasionally to look something up, but since it isn’t on my cell plan, I can’t use any tel: links.

No registered handler at all

On some machines, there is no registered tel: protocol handler. What happens in this case? Unspecified. Each OS will do something a little different. For years tel: would often do nothing at all, which is very confusing to users. Some operating systems will, to various degrees, offer to help you find something capable of handling the protocol, but even that can be confusing (see next point).

Registered a Complex Program

Some devices will launch a program that’s registered to handle the tel: protocol, even if it can’t currently dial. There are many examples of this. Apps like Google Hangouts or Skype were chat programs with some VoIP support. They could dial out if users paid a special recurring fee – but most users didn’t pay that fee. When those users clicked a link and it launched their chat program, or told them they have to pay to dial, that’s confusing and frustrating.

In fact, lots of users of those apps just use them for text chat. Not so long ago, many users on office desktops didn’t have cameras, mics or speakers. If users clicked a tel: link and it launched their chat program and implied that they could pay money to dial out, but actually couldn’t do that: This is very confusing to users.

The Un-Protocol…

Contrast that with this: Today, if you use your smartphone and you select some text - regular text - there are some built in ways to assist in order to do smart selection. That is, work has gone into making it easier for it to recognize that you’re trying to select a phone number or an email, as two examples. And if you select it, the context menu will offer “Call” or “Email”, respectively.

You know what’s wild? No author had to do anything special at all, and it works even on content written in the 90s. That’s… Pretty great. That’s not to say markup couldn’t help here, but it gets at whether it needs to fundamentally be a link.

The Web Share API splits things along a similar joint. It just shares an intent to share some content. It doesn’t need to know anything about which ways you have available to share, or reason about what’s available at the moment. Splitting things this way lets all of the ways to share develop entirely independently. If new social media comes along, or goes away, or we develop new physical ways of sharing involving USB or Bluetooth or HandWaveyMagicWand - it will all JustWork™.
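
To make that concrete, here's a minimal sketch of the Web Share API (the function name is mine and the data fields are just illustrative). The page only expresses the intent to share; the browser and OS decide which share targets are actually offered.

async function shareCurrentPage() {
  const data = { title: document.title, url: location.href };
  // canShare lets us bail out gracefully where the API or this data isn't supported.
  if (navigator.canShare && navigator.canShare(data)) {
    await navigator.share(data); // must be triggered by a user gesture
  }
}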

Both of these seem to me generally better than many of the protocol based solutions. I’m not suggesting protocol based solutions are useless, I’m just suggesting that it seems like there is benefit in continuing to question why and how we’re approaching things.

October 05, 2023 04:00 AM

September 30, 2023

Brian Kardell

Florence Discovery

Florence Discovery

Last week, on my way back from TPAC I took a holiday in Florence, Italy. It was an awesome experience all around, but one thing was really personal and I wanted to write it down.

So... I am Italian.

I know, "Kardell" doesn't sound Italian - and that's because it isn't. Kardell is my biological father's name and it has, as best I know, a mix of German and Irish roots. But my from even my very youngest memories, I was raised with/by my maternal side. My parents divorced when I was all of 5.

My mother's maiden name is Frediani, and both of her parents were the first generation born in America, shortly after their parents arrived. My maternal grandfather's family (Muzio Frediani) was from Tuscany. His father was a printer (and all of his kids later were, too). Their print shop was a historic landmark in Pittsburgh's strip district. He also founded the Italian Sons and Daughters of America. Among a lot of other things, they did a lot with Italian Americans, unions and political campaigns. My maternal grandmother's family name was Casale and they were from Sicily. Her father (Giuseppe Casale) was a contractor/mason and he built many of the houses in the neighborhood my grandparents lived in, and my mom now lives in.

This side of the family had a very strong culture. There was a lot of community. Some people in the family even still spoke Italian. They were constantly instilling this in me: You are Italian. And so, these are the roots that have always resonated in me.

When I learned about Michelangelo, Galileo, Dante, Leonardo, Raphael or Donatello (and so many more) - I felt... Actually connected to it somehow. I don't know, maybe that's silly. But the result was that Italy was the one place I've always wanted to go since I was a child.

I never really managed to get very far in making a family tree (lots of the services that help with that felt outrageously expensive to me). Florence, where I went, is part of Tuscany. I went to Florence because my sister had been there kind of actively looking for connections to our family. Frediani isn't a super common name - I've never met another myself.

Anyway... I went to Florence.

A series of fortunate events.

I wasted no time; I spent pretty much all of the waking hours touring something: some church, garden, museum, street, bridge, etc. In a lucky coincidence, a friend from TPAC was also in Florence and we met up for dinner. They asked me if I had been to the church where Michelangelo was buried. No! They said "I think maybe Dante is also buried there?".

Wow I wanted to go so much - I'm so fortunate that I wound up in a situation where they were there and we had dinner and they could mention it! They had me at Michelangelo, I was very keen to go.

I went the next afternoon, but I was in shorts and that's not allowed, so I'd have to shift my schedule and come back the following day.

So, on my very last day I went to this church at Santa Croce. Unlike every other thing I'd been to, there was no line. Because it was the last day, I had nowhere else I really wanted to go afterward. This was my last "big" stop. So, I decided I would really take my time, soak it all in, and pay my respects. Given this, I also decided to pay extra and do the audio guided tour.

Looking down the nave of the main church.

Inside I was surprised to see there were tombs in the floor everywhere - people were walking on them. A few were roped off - I think they were damaged, but mainly there were just occasionally these tombs in the floor. Around the outer wall of the church were bigger memorials to a number of important Italians (mainly Florentine, but not exclusively). Yeah, wow. Michelangelo, Fermi, Galileo, Dante, Machiavelli - so many.

I really took my time. I learned a ton of interesting stuff and took a lot of pictures.

Before leaving the main part of the church I sat down and listened to more about how all of the people buried here were somehow "important".

I had this fleeting thought that maybe I should go through carefully and see if I could find my family name, but I dismissed this very quickly. Florence is a huge place inside the huge place that is Tuscany - there can't be more than 70 people buried here. The odds would be astronomical. I decided to plod on.

And just then...

As the tour turns from the main church, it points you to a marble corridor leading to the Medici chapel (and more). On the floor down the hall are numerous engraved marble slabs, many are worn to unreadable condition. As I looked down the hallway my audio suddenly said

The corridor is paved with white marble tomb slabs with gray frames, which bear names and coats-of-arms erased by time. One inscription is more clearly legible than the others, almost as if time itself had obeyed the wish it expresses; see if you can find it, it's the tomb of Cosimo Frediani, and on it is written: "Don't tread on me".

Wait what? Surely that is just my mind playing tricks on me.

I rewound it and listened again. It sure sounded like my family name.

I found out how to switch and look at the transcript. It was!

It kind of gave me a chill.

I searched the stones, even enlisting the help of the museum personnel. There it was: It was also the only one with a relief carved on it.

The marble tomb of Cosimo Frediani.

Wow.

Very cool.

But... Who was he? What did he do? How far are we related? I spent some time searching and I'm not totally sure yet! We've found a bit, but it will surely be a new family quest to find out.

But, I mean... How amazing is it that maybe someone in my family is buried perhaps 50-100 meters from all of those famous Italians that I feel so connected with 🖤.

What a great experience :)

September 30, 2023 04:00 AM

September 25, 2023

Lucas Fryzek

Converting from 3D to 2D

Recently I’ve been working on a project where I needed to convert an application written in OpenGL to a software renderer. The matrix transformation code in OpenGL made use of the GLM library for matrix math, and I needed to convert the 4x4 matrices to be 3x3 matrices to work with the software renderer. There was some existing code to do this that was broken, and looked something like this:

glm::mat3 mat3x3 = glm::mat3(mat4x4);

Don’t worry if you don’t see the problem already, I’m going to illustrate in more detail with the example of a translation matrix. In 3D a standard translation matrix to translate by a vector (x, y, z) looks something like this:

[1 0 0 x]
[0 1 0 y]
[0 0 1 z]
[0 0 0 1]

Then when we multiply this matrix by a vector like (a, b, c, 1) the result is (a + x, b + y, c + z, 1). If you don’t understand why the matrix is 4x4 or why we have that extra 1 at the end don’t worry, I’ll explain that in more detail later.

Now using the existing conversion code to get a 3x3 matrix will simply take the first 3 columns and first 3 rows of the matrix and produce a 3x3 matrix from those. Converting the translation matrix above using this code produces the following matrix:

[1 0 0]
[0 1 0]
[0 0 1]

See the problem now? The (x, y, z) values disappeared! In the conversion process we lost these critical values from the translation matrix, and now if we multiply by this matrix nothing will happen since we are just left with the identity matrix. So if we can’t use this simple “cast” function in GLM, what can we use?

Well one thing we can do is preserve the last column and last row of the matrix. So assume we have a 4x4 matrix like this:

[a b c d]
[e f g h]
[i j k l]
[m n o p]

Then preserving the last row and column we should get a matrix like this:

[a b d]
[e f h]
[m n p]

And if we use this conversion process for the same translation matrix we will get:

[1 0 x]
[0 1 y]
[0 0 1]

Now we see that the (x, y) part of the translation is preserved, and if we try to multiply this matrix by the vector (a, b, 1) the result will be (a + x, b + y, 1). The translation is preserved in the conversion!
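
For the curious, here's the same idea as a small sketch in plain JavaScript rather than GLM/C++ (the function names are mine, and matrices are stored column-major, i.e. as an array of columns, the way GLM lays them out):

// Keep rows/columns 0, 1 and 3 of the 4x4 matrix, dropping the
// z-related row and column.
function mat4ToMat3Affine(m4) {
  const keep = [0, 1, 3];
  return keep.map((col) => keep.map((row) => m4[col][row]));
}

// 3D translation by (x, y, z), column-major: the translation sits in the last column.
const translate3D = (x, y, z) => [
  [1, 0, 0, 0],
  [0, 1, 0, 0],
  [0, 0, 1, 0],
  [x, y, z, 1],
];

// mat4ToMat3Affine(translate3D(5, 7, 9)) gives [[1,0,0],[0,1,0],[5,7,1]],
// i.e. a 2D translation by (5, 7) with the z part dropped, matching the
// [1 0 x; 0 1 y; 0 0 1] result shown above.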

Why do we have to use this conversion?

The reason the conversion is more complicated is hidden in how we defined the translation matrix and the vector we wanted to translate. The vector was actually a 4D vector with the final component set to 1. The reason we do this is that we actually want to represent an affine space instead of just a vector space. An affine space is a type of space where you can have both points and vectors. A point is exactly what you would expect it to be: just a point in space relative to some origin. A vector is a direction with magnitude but no origin. This is important because, strictly speaking, translation isn't actually defined for vectors in a normal vector space. Additionally, if you try to construct a matrix to represent translation for a vector space, you'll find that it's impossible to derive a matrix to do this, because that operation is not a linear function. On the other hand, operations like translation are well defined in an affine space and do what you would expect.

To get around the problem of vector spaces, mathematicians more clever than I figured out you can implement an affine space in a normal vector space by increasing the dimension of the vector space by one, and by adding an extra row and column to the transformation matrices used. They called this a homogeneous coordinate system. This lets you say that a vector is actually just a point if the 4th component is 1, but if it's 0 it's just a vector. Using this abstraction one can implement all the well defined operations for an affine space (like translation!).

So using the "homogeneous coordinate system" abstraction, translation is an operation that is defined by taking a point and moving it by a vector. Let's look at how that works with the translation matrix I used as an example above. If you multiply that matrix by a 4D vector where the 4th component is 0, it will just return the same vector. Now if we multiply by a 4D vector where the 4th component is 1, it will return the point translated by the vector we used to construct that translation matrix. This implements the translation operation as it's defined in an affine space!
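
A tiny sketch (again in JavaScript, column-major as above, with made-up helper names) showing exactly that: multiplying a translation matrix by something with a 4th component of 0 leaves it unchanged, while a 4th component of 1 gets translated.

// result[row] = sum over columns of m[col][row] * v[col]
function mulMat4Vec4(m, v) {
  return [0, 1, 2, 3].map((row) =>
    m.reduce((sum, column, col) => sum + column[row] * v[col], 0)
  );
}

const T = [
  [1, 0, 0, 0],
  [0, 1, 0, 0],
  [0, 0, 1, 0],
  [5, 7, 9, 1], // translate by (5, 7, 9)
];

mulMat4Vec4(T, [2, 3, 4, 0]); // -> [2, 3, 4, 0]: a vector, unchanged
mulMat4Vec4(T, [2, 3, 4, 1]); // -> [7, 10, 13, 1]: a point, translated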

If you’re interested in understanding more about homogeneous coordinate spaces, (like how the translation matrix is derived in the first place) I would encourage you to look at resources like “Mathematics for Computer Graphics Applications”. They provide a much more detailed explanation than I am providing here. (The homogeneous coordinate system also has some benefits for representing projections which I won’t get into here, but are explained in that text book.)

Now to finally answer the question about why we needed to preserve those final rows and columns. Based on what we now know, we weren't actually just converting from a "3D space" to a "2D space"; we were converting from a "3D homogeneous space" to a "2D homogeneous space". The process of converting from a higher dimensional matrix to a lower dimensional matrix is lossy, and some transformation details are going to be lost in the process (like, for example, the translation along the z-axis). There is no way to tell what kind of space a given matrix is supposed to transform just by looking at the matrix itself. The matrix does not carry any information about what space it's operating in, and any conversion function would need to know that information to properly convert that matrix. Therefore we need to develop our own conversion function that preserves the transformations that are important to our application when moving from a "3D homogeneous space" to a "2D homogeneous space".

Hopefully this explanation helps if you are ever working on converting 3D transformation code to 2D.

September 25, 2023 04:00 AM

September 19, 2023

Jesse Alama

Binary floats can let us down! When close enough isn't enough

How binary floating points can let us down

If you've played Monopoly, you'll know about the Bank Error in Your Favor card in the Community Chest. Remember this?

A bank error in your favor? Sweet! But what if the bank makes an error in its favor? Surely that's just as possible, right?

I'm here to tell you that if you're doing everyday financial calculations—nothing fancy, but involving money that you care about—then you might need to know that, if you're using binary floating point numbers, something might be going wrong. Let's see how binary floating-point numbers might yield bank errors in your favor—or the bank's.

In a wonderful paper on decimal floating-point numbers, Mike Cowlishaw gives an example.

Here's how you can reproduce that in JavaScript:

(1.05 * 0.7).toPrecision(2); // 0.73

Some programmers might not be aware of this, but many are. By pointing this out I'm not trying to be a smartypants who knows something you don't. For me, this example illustrates just how common this sort of error might be.

For programmers who are aware of the issue, one typical approach to dealing with it is this: Never work with sub-units of a currency. (Some currencies don't have this issue. If that's you and your problem domain, you can kick back and be glad that you don't need to engage in the following sorts of headaches.) For instance, when working with US dollars or euros, this approach mandates that one never works with dollars and cents, but only with cents. In this setting, dollars exist only as an abstraction on top of cents. As far as possible, calculations never use floats. But if a floating-point number threatens to come up, some form of rounding is used.
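
A minimal sketch of that cents-only approach in JavaScript (the prices and the 5% tax rate are made up for illustration):

const priceCents = 105;                          // $1.05, stored as whole cents
const taxCents = Math.round(priceCents * 0.05);  // 5: snap straight back to whole cents
const totalCents = priceCents + taxCents;        // 110: integer arithmetic from here on
const display = (totalCents / 100).toFixed(2);   // "1.10": dollars appear only when formatting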

Another approach for a programmer is to delegate financial calculations to an external system, such as a relational database, that natively supports proper decimal calculations. One difficulty is that even if one delegates these calculations to an external system, if one lets a floating-point value flow into your program, even a value that can be trusted, it may become tainted just by being imported into a language that doesn't properly support decimals. If, for instance, the result of a calculation done in, say, Postgres, is exactly 0.1, and that flows into your JavaScript program as a number, it's possible that you'll be dealing with a contaminated value. For instance:

(0.1).toPrecision(25) // 0.1000000000000000055511151

This example, admittedly, requires quite a lot of decimals (19!) before the ugly reality of the situation rears its head. The reality is that 0.1 does not, and cannot, have an exact representation in binary. The earlier example with the cost of a phone call is there to raise your awareness of the possibility that one doesn't need to go 19 decimal places before one starts to see some weirdness showing up.

There are all sorts of examples of this. It's exceedingly rare for a decimal number to have an exact representation in binary. Of the numbers 0.1, 0.2, …, 0.9, only 0.5 can be exactly represented in binary.
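
A quick way to see this for yourself (0.5 is one over a power of two, so it is exact; 0.1, 0.2, and 0.3 are not):

0.5 === 0.25 + 0.25;  // true  (all three values are exactly representable in binary)
0.3 === 0.1 + 0.2;    // false (none of these are, and the rounding errors don't cancel)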

Next time you look at a bank statement, or a bill where some tax is calculated, I invite you to ask how that was calculated. Are they using decimals, or floats? Is it correct?

I'm working on the decimal proposal for TC39 to try to work out what it might be like to add proper decimal numbers to JavaScript. There are a few very interesting degrees of freedom in the design space (such as the precise datatype to be used to represent these kinds of numbers), but I'm optimistic that a reasonable path forward exists, that consensus between JS programmers and JS engine implementors can be found, and eventually implemented. If you're interested in these issues, check out the README in the proposal and get in touch!

September 19, 2023 08:37 AM

September 12, 2023

Eric Meyer

Nuclear Anchored Sidenotes

Exactly one year ago today, which I swear is a coincidence I only noticed as I prepared to publish this, I posted an article on how I coded the footnotes for The Effects of Nuclear Weapons.  In that piece, I mentioned that the footnotes I ended up using weren’t what I had hoped to create when the project first started.  As I said in the original post:

Originally I had thought about putting footnotes off to one side in desktop views, such as in the right-hand grid gutter.  After playing with some rough prototypes, I realized this wasn’t going to go the way I wanted it to…

I came back to this in my post “CSS Wish List 2023”, when I talked about anchor(ed) positioning.  The ideal, which wasn’t really possible a year ago without a bunch of scripting, was to have the footnotes arranged structurally as endnotes, which we did, but in a way that I could place the notes as sidenotes, next to the footnote reference, when there was enough space to show them.

As it happens, that’s still not really possible without a lot of scripting today, unless you have:

  1. A recent (as of late 2023) version of Chrome
  2. With the “Experimental web features” flag enabled

With those things in place, you get experimental support for CSS anchor positioning, which lets you absolutely position an element in relation to any other element, anywhere in the DOM, essentially regardless of their markup relationship to each other, as long as they conform to a short set of constraints related to their containing blocks.  You could reveal an embedded stylesheet and then position it next to the bit of markup it styles!

Anchoring Sidenotes

More relevantly to The Effects of Nuclear Weapons, I can enhance the desktop browsing experience by turning the popup footnotes into Tufte-style static sidenotes.  So, for example, I can style the list items that contain the footnotes like this:

.endnotes li {
	position: absolute;
	top: anchor(top);
	bottom: auto;
	left: calc(anchor(--main right) + 0.5em);
	max-width: 23em;
}
A sidenote next to the main text column, with its number aligned with the referencing number found in the main text column.

Let me break that down.  The position is absolute, and bottom is set to auto to override a previous bit of styling that’s needed in cases where a footnote isn’t being anchored.  I also decided to restrain the maximum width of a sidenote to 23em, for no other reason than it looked right to me.

(A brief side note, pun absolutely intended: I’m using the physical-direction property top because the logical-direction equivalent in this context, inset-block-start, only gained full desktop cross-browser support a couple of years ago, and that’s only true if you ignore IE11’s existence, plus it arrived in several mobile browsers only this year, and I still fret about those kinds of things.  Since this is desktop-centric styling, I should probably set a calendar reminder to fix these at some point in the future.  Anyway, see MDN’s entry for more.)

Now for the new and unfamiliar parts.

 top: anchor(top);

This sets the position of the top edge of the list item to be aligned with the top edge of its anchor’s box.  What is a footnote’s anchor?  It’s the corresponding superscripted footnote mark embedded in the text.  How does the CSS know that?  Well, the way I set things up  —  and this is not the only option for defining an anchor, but it’s the option that worked in this use case  —  the anchor is defined in the markup itself.  Here’s what a footnote mark and its associated footnote look like, markup-wise.

explosion,<sup><a href="#fnote01" id="fn01">1</a></sup> although
<li id="fnote01" anchor="fn01"><sup>1</sup> … </li>

The important bits for anchor positioning are the id="fn01" on the superscripted link, and the anchor="fn01" on the list item: the latter establishes the element with an id of fn01 as the anchor for the list item.  Any element can have an anchor attribute, thus creating what the CSS Anchor Positioning specification calls an implicit anchor.  It’s explicit in the HTML, yes, but that makes it implicit to CSS, I guess.  There’s even an implicit keyword, so I could have written this in my CSS instead:

 top: anchor(implicit top);

(There are ways to mark an element as an anchor and associate other elements with that anchor, without the need for any HTML.  You don’t even need to have IDs in the HTML.  I’ll get to that in a bit.)

Note that the superscripted link and the list item are just barely related, structurally speaking.  Their closest ancestor element is the page’s single <main> element, which is the link’s fourth-great-grandparent, and the list item’s third-great-grandparent.  That’s okay!  Much as a <label> can be associated with an input element across DOM structures via its for attribute, any element can be associated with an anchoring element via its anchor attribute.  In both cases, the value is an ID.

So anyway, that means the top edge of the endnote will be absolutely positioned to line up with the top edge of its anchor.  Had I wanted the top of the endnote to line up with the bottom edge of the anchor, I would have said:

 top: anchor(bottom);

But I didn’t.  With the top edges aligned, I now needed to drop the endnote into the space outside the main content column, off to its right.  At first, I did it like this:

 left: anchor(--main right);

Wait.  Before you think you can just automatically use HTML element names as anchor references, well, you can’t.  That --main is what CSS calls a dashed-ident, as in a dashed identifier, and I declared it elsewhere in my CSS.  To wit:

main {
	anchor-name: --main;
}

That assigns the anchor name --main to the <main> element in the CSS, no HTML attributes required.  Using the name --main to identify the <main> element was me following the common practice of naming things for what they are.  I could have called it --mainElement or --elMain or --main-column or --content or --josephine or --📕😉 or whatever I wanted.  It made the most sense to me to call it --main, so that’s what I picked.

Having done that, I can use the edges of the <main> element as positioning referents for any absolutely (or fixed) positioned element.  Since I wanted the left side of sidenotes to be placed with respect to the right edge of the <main>, I set their left to be anchor(--main right).

Thus, taking these two declarations together, the top edge of a sidenote is positioned with respect to the top edge of its implicit anchor, and its left edge is positioned with respect to the right edge of the anchor named --main.

	top: anchor(top);
	left: anchor(--main right);

Yes, I’m anchoring the sidenotes with respect to two completely different anchors, one of which is a descendant of the other.  That’s okay!  You can do that!  Literally, you could position each edge of an anchored element to a separate anchor, regardless of how they relate to each other structurally.

Once I previewed the result of those declarations, I saw the sidenotes were too close to the main content, which makes sense: I had made the edges adjacent to each other.

Red borders showing the edges of the sidenote and the main column touching.

I thought about using a left margin on the sidenotes to push them over, and that would work fine, but I figured what the heck, CSS has calculation functions and anchor functions can go inside them, and any engine supporting anchor positioning will also support calc(), so why not?  Thus:

 left: calc(anchor(--main right) + 0.5em);

I wrapped those in a media query that only turned the footnotes into sidenotes at or above a certain viewport width, and wrapped that in a feature query so as to keep the styles away from non-anchor-position-understanding browsers, and I had the solution I’d envisioned at the beginning of the project!

Except I didn’t.

Fixing Proximate Overlap

What I’d done was fine as long as the footnotes were well separated.  Remember, these are absolutely positioned elements, so they’re out of the document flow.  Since we still don’t have CSS Exclusions, there needs to be a way to deal with situations where there are two footnotes close to each other.  Without it, you get this sort of thing.

Two sidenotes completely overlapping with each other.  This will not do.

I couldn’t figure out how to fix this problem, so I did what you do these days, which is I posted my problem to social media.  Pretty quickly, I got a reply from the brilliant Roman Komarov, pointing me at a Codepen that showed how to do what I needed, plus some very cool highlighting techniques.  I forked it so I could strip it down to the essentials, which is all I really needed for my use case, and also have some hope of understanding it.

Once I’d worked through it all and applied the results to TEoNW, I got exactly what I was after.

The same two sidenotes, except now there is no overlap.

But how?  It goes like this:

.endnotes li {
	position: absolute;
	anchor-name: --sidenote;
	top: max(anchor(top) , calc(anchor(--sidenote bottom) + 0.67em));
	bottom: auto;
	left: calc(anchor(--main right) + 0.5em);
	max-width: 23em;
}

Whoa.  That’s a lot of functions working together there in the top value.  (CSS is becoming more and more functional, which I feel some kind of way about.)  It can all be verbalized as, “the position of the top edge of the list item is either the same as the top edge of its anchor, or two-thirds of an em below the bottom edge of the previous sidenote, whichever is further down”.

The browser knows how to do this because the list items have all been given an anchor-name of --sidenote (again, that could be anything, I just picked what made sense to me).  That means every one of the endnote list items will have that anchor name, and other things can be positioned against them.

Those styles mean that I have multiple elements bearing the same anchor name, though.  When any sidenote is positioned with respect to that anchor name, it has to pick just one of the anchors.  The specification says the named anchor that occurs most recently before the thing you’re positioning is what wins.  Given my setup, this means an anchored sidenote will use the previous sidenote as the anchor for its top edge.

At least, it will use the previous sidenote as its anchor if the bottom of the previous sidenote (plus two-thirds of an em) is lower than the top edge of its implicit anchor.  In a sense, every sidenote’s top edge has two anchors, and the max() function picks which one is actually used in every case.

CSS, man.

Remember that all this is experimental, and the specification (and thus how anchor positioning works) could change.  The best practices for accessibility are also not clear yet, from what I’ve been able to find.  As such, this may not be something you want to deploy in production, even as a progressive enhancement.  I’m holding off myself for the time being, which means none of the above is currently used in the published version of The Effects of Nuclear Weapons.  If people are interested, I can create a Codepen to illustrate.

I do know this is something the CSS Working Group is working on pretty hard right now, so I have hopes that things will finalize soon and support will spread.

My thanks to Roman Komarov for his review of and feedback on this article.  For more use cases of anchor positioning, see his lengthy (and quite lovely) article “Future CSS: Anchor Positioning”.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at September 12, 2023 03:16 PM

September 06, 2023

Eric Meyer

Memories of Molly

The Web is a little bit darker today, a fair bit poorer: Molly Holzschlag is dead.  She lived hard, but I hope she died easy.  I am more sparing than most with my use of the word “friend”, and she was absolutely one.  To everyone.

If you don’t know her name, I’m sorry.  Too many didn’t.  She was one of the first web gurus, a title she adamantly rejected  —  “We’re all just people, people!”  —  but it fit nevertheless.  She was a groundbreaker, expanding and explaining the Web at its infancy.  So many people, on hearing the mournful news, have described her as a force of nature, and that’s a title she would have accepted with pride.  She was raucous, rambunctious, open-hearted, never ever close-mouthed, blazing with fire, and laughed (as she did everything) with her entire chest, constantly.  She was giving and took and she hurt and she wanted to heal everyone, all the time.  She was messily imperfect, would tell you so loudly and repeatedly, and gonzo in all the senses of that word.  Hunter S. Thompson should have written her obituary.

I could tell so many stories.  The time we were waiting to check into a hotel, talking about who knows what, and realized Little Richard was a few spots ahead of us in line.  Once he’d finished checking in, Molly walked right over to introduce herself and spend a few minutes talking with him.  An evening a group of us had dinner on the top floor of a building in Chiba City and I got the unexpectedly fresh shrimp hibachi.  The time she and I were chatting online about a talk or training gig, somehow got onto the subject of Nick Drake, and coordinated a playing of “Three Hours” just to savor it together.  A night in San Francisco where the two of us went out for dinner before some conference or other, stopped at a bar just off Union Square so she could have a couple of drinks, and she got propositioned by the impressively drunk couple seated next to her after they’d failed to talk the two of us into hooking up.  The bartender couldn’t stop laughing.

Or the time a bunch of us were gathered in New Orleans (again, some conference or other) and went to dinner at a jazz club, where we ended up seated next to the live jazz trio and she sang along with some of the songs.  She had a voice like a blues singer in a cabaret, brassy and smoky and full of hard-won joys, and she used it to great effect standing in front of Bill Gates to harangue him about Internet Explorer.  She raised it to fight like hell for the Web and its users, for the foundational principles of universal access and accessible development.  She put her voice on paper in some three dozen books, and was working on yet another when she died.  In one book, she managed to sneak past the editors an example that used a stick-figure Kama Sutra custom font face.  She could never resist a prank, particularly a bawdy one, as long as it didn’t hurt anyone.

She made the trek to Cleveland at least once to attend and be part of the crew for one of our Bread and Soup parties.  We put her to work rolling tiny matzoh balls and she immediately made ribald jokes about it, laughing harder at our one-up jokes than she had at her own.  She stopped by the house a couple of other times over the years, when she was in town for consulting work, “Auntie Molly” to our eldest and one of my few colleagues to have spent any time with Rebecca.  Those pictures were lost, and I still keenly regret that.

There were so many things about what the Web became that she hated, that she’d spent so much time and energy fighting to avert, but she still loved it for what it could be and what it had been originally designed to be.  She took more than one fledgling web designer under her wing, boosted their skills and careers, and beamed with pride at their accomplishments.  She told a great story about one, I think it was Dunstan Orchard but I could be wrong, and his afternoon walk through a dry Arizona arroyo.

I could go on for pages, but I won’t; if this were a toast and she were here, she would have long ago heckled me (affectionately) into shutting up.  But if you have treasured memories of Molly, I’d love to hear them in the comments below, or on your own blog or social media or podcasts or anywhere.  She loved stories.  Tell hers.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at September 06, 2023 03:44 PM

September 04, 2023

Clayton Craft

Upgrading Steam Deck storage the lazy way

When I bought my Steam Deck over a year ago, I purchased the basic model with a very unassuming 64GB eMMC module for primary storage. I knew this wouldn't be enough on its own for holding the games I like to play, so I've gotten by with having my Steam library installed on an SD card. This has worked surprisingly well, despite the lower performance of the SD card when compared to something like an NVMe. Load times, etc actually weren't all that bad! Alas, all good things must come to an end... I started having issues with the eMMC drive filling up and I was constantly interrupted with "low storage space" notifications in Steam. I guess Steam stores saved games, Mesa shader cache, and other things like that on the primary drive despite having gobs of space on SD storage. Oops!

Since I can't be bothered to enter my Steam login, WiFi credentials, etc a second time on my Deck, I really wanted to preserve/transfer as much data from the eMMC to a new NVMe drive as possible. That's what this post is about. Spoiler alert: I was successful, and it was relatively easy/painless.

by Unknown at September 04, 2023 12:00 AM

August 27, 2023

Brian Kardell

Completely Random Car and Design Stuff

Completely Random Car and Design Stuff

This is a kind of "off brand" post for me. It's very, very random and just for fun. It's just a whole bunch of mental wandering literally all over the place about cars, design, AI and only a very little bit about the web. Many of these observations are probably extremely specific to America, but perhaps still interesting. There is no overarching "point" here, it's just thoughts.

There are certain trends that I notice in the design of everyday things. Sometimes I think I notice them developing early and have lots of conversations pointing them out to other people (especially my son) and wondering aloud about how all of this will look in retrospect: Is this one of those things that will come to represent this "era"?

If it's not clear what I mean, basically every single thing in the photo below screams 1970s:

A Frigidaire ad from the 1970s. A woman standing in a kitchen with avocado Frigidaire appliances and countertop, a green and yellow flowered wallpaper; she's wearing 70's attire, the cabinet design is somehow also unique to the era, and the quality of the film is also reminiscent of the 1970s.

It seems like there are some things that come to become associated with an era... Stuff that got really popular for a period of time and then left behind.

Several years ago I noticed two changes I thought I saw happening in new car colors, and I started pointing them out to my son. If you haven't noticed them, they're real. For a lot of years, while mainstream automobile colors varied in real ways (you'd realize how much if you ever tried to match paint on your car), they still fell within a fairly common range and palette that would mainly be described with words like "shiny" or "glossy". I think this has been generally true since the 70's or 80's, when I seem to remember some more distinctive reds and oranges and earthy colors (my dad's old truck was maybe interesting in both ways). In any case, now we have all of these new "ceramic colors" which look more "baked in" to the materials and matte. More often than not they're grays, but there are some beautiful turquoises and reds. The other new common thing is "blacked out" emblems. I think both of these trends began with trendsetting aftermarket people.

Another thing this made me think about, and try to explain to my son, is that the same is kind of true of automobile body styles and some other characteristics too. I found it surprisingly hard to describe because it's not as if there is a single kind of car in any era - it's just that there is a smaller array of characteristics that you can recognize as belonging to that era.

Don't you think a lot of cars today have similar characteristics? I do. But... They also feel like they are increasingly resembling a car developed at the end of the 1970's, by a car company that doesn't even exist anymore. The car company is AMC (American Motors Corporation) and the car was the Eagle SX/4. Below is an advertisement.

Advertisement for the Eagle SX/4 around 1979

Basically, it was sort of the first crossover vehicle that combined a car and 4 wheel (later all wheel) drive. But also, just visually: Do you see a resemblance to a lot of popular cars today? I do. I think this car looks much more similar to many cars I see on the road today than most of what I saw for the 40 years or so between them. Also interesting, it was available in one of those interesting earthy yellow/tan (and kind of matte!) colors with dark gray/black accents that kind of mostly disappeared, but I could imagine being reborn today.

Photo of the Eagle SX/4 in an earthy, somewhat matte tan/yellow with interesting dark trims.

So, I was thinking about this and the fact it was kind of ahead of its time. In fact, I searched for exactly that and found this article with a title that is pretty much that: "The AMC Eagle SX/4 – An American 4×4 Sports Car That Was Ahead Of Its Time".

That's cool, but I was also wondering how much of that came from the fact that it was not (yet) part of "The Big 3". There used to be a lot more American car companies, but we just kept consolidating into those 3. Lots of those became "brands" of the Big 3 for a while, but eventually homogenized. A huge number of them no longer exist. My first car was a Pontiac. The last Pontiac produced rolled off the line in January 2010. Saturn, same. My mom drove a Plymouth Voyager. Plymouth was discontinued in 2001. In fact, the car I drive now sometimes (not even American: an Isuzu Amigo) hasn't been sold in the US since 2000. And so on.

I was thinking that most of the diversity that drives anything about automobiles today pretty much doesn't come from those Big 3, and wondering if maybe that meant something for web engines too. Does it? I don't know. For sure we can see outside (but still standards-compliant) innovation in browsers like Arc or Vivaldi or Polypane. I suppose the open source nature makes that more easily possible, but it's worth noting too that this is kept alive mainly through ~80% of that bill being footed by the Web Engines Big 3.

Anyway... This is already all over the place (sorry, that's how my mind works!), but in sitting down this morning to write this I had an idea that perhaps AI could help me "illustrate" if not explain the "look" of body styles associated with different decades. Here's what I found out: Midjourney is pretty bad at it!

The photo below was generated with the prompt "photo of a typical 2010's automobile in America with the most common body style of cars built between 2000 and 2009".

A picture that is, for all intents and purposes, almost literally an early 1950's Buick but with a different emblem.

I tried several variations and wound up with something like this as one of the 4 options no matter how explicit the prompt was. I wonder why? I'd kind of think that identifying and sort of recycling the "trends" like that would be very much what AI models like this would be really good at - but apparently not as good as I'd imagine!

I decided to check the other way round, just for giggles. The inverse worked better: given the first image in this article, it mentioned the 1970's in its explanation! Interesting!

Anyway, like I said at the get-go: There's no overall point here :)

August 27, 2023 04:00 AM

August 23, 2023

Emmanuele Bassi

The Mirror

The GObject type system has been serving the GNOME community for more than 20 years. We have based an entire application development platform on top of the features it provides, and the rules that it enforces; we have integrated multiple programming languages on top of that, and in doing so, we expanded the scope of the GNOME platform in a myriad of directions. Unlike GTK, the GObject API hasn’t seen any major change since its introduction: aside from deprecations and little new functionality, the API is exactly the same today as it was when GLib 2.0 was released in March 2002. If you transported a GNOME developer from 2003 to 2023, they would have no problem understanding a newly written GObject class; though, they would likely appreciate the levels of boilerplate reduction, and the performance improvements that have been introduced over the years.

While having a stable API last this long is definitely a positive, it also imposes a burden on maintainers and users, because any change has to be weighted against the possibility of introducing unintended regressions in code that uses undefined, or undocumented, behaviour. There’s a lot of leeway when it comes to playing games with C, and GObject has dark corners everywhere.

The other burden is that any major change to a foundational library like GObject cascades across the entire platform. Releasing GLib 3.0 today would necessitate breaking API in the entirety of the GNOME stack and further beyond; it would require either a hard to execute “flag day”, or an impossibly long transition, reverberating across downstreams for years to come. Both solutions imply amounts of work that are simply not compatible with a volunteer-based project and ecosystem, especially the current one where volunteers of core components are now stretched thin across too many projects.

And yet, we are now at a crossroads: our foundational code base has reached the point where recruiting new resources capable of effecting change on the project has become increasingly difficult; where any attempt at performance improvement is heavily counterbalanced by the high possibility of introducing world-breaking regressions; and where fixing the safety and ergonomics of idiomatic code requires unspooling twenty years of limitations inherent to the current design.

Something must be done if we want to improve the coding practices, the performance, and the safety of the platform without a complete rewrite.

The Mirror

‘Many things I can command the Mirror to reveal,’ she answered, ‘and to some I can show what they desire to see. But the Mirror will also show things unbidden, and those are often stranger and more profitable than things we wish to behold. What you will see, if you leave the Mirror free to work, I cannot tell. For it shows things that were, and things that are, and things that yet may be. But which it is that he sees, even the wisest cannot always tell. Do you wish to look?’ — Lady Galadriel, “The Lord of the Rings”, Volume 1: The Fellowship of the Ring, Book 2: The Ring Goes South

In order to properly understand what we want to achieve, we need to understand the problem space that the type system is meant to solve, and the constraints upon which the type system was implemented. We do that by holding GObject up to Galadriel’s Mirror, and gazing into its surface.

Things that were

History became legend. Legend became myth. — Lady Galadriel, “The Lord of the Rings: The Fellowship of the Ring”

Before GObject there was GtkObject. It was a simpler time, it was a simpler stack. You added types only for the widgets and objects that related to the UI toolkit, and everything else was C89, with a touch of undefined behaviour, like calling function pointers with any number of arguments. Properties were “arguments”, likes were florps, and the timeline went sideways.

We had class initialisation and instance initialisation functions; properties were stored in a global hash table, but the property multiplexer pair of functions was stored on the type data instead of using the class structure. Types did not have private data: you only had keyed fields. No interfaces, only single inheritance. GtkObject was reference counted, and had an initially “floating” reference, to allow transparent ownership transfer from child to parent container when writing C code, and make the life of every language binding maintainer miserable in the process. There were weak references attached to an instance that worked by invoking a callback when the instance’s reference count reached zero. Signals operated exactly as they do today: large hash table of signal information, indexed by an integer.

None of this was thread safe. After all, GTK was not thread safe either, because X11 was not thread safe; and we’re talking about 1997: who even had hardware capable of multi-threading at the time? NPTL wasn’t even a thing, yet.

The introduction of GObject in 2001 changed some of the rules—mainly, around the idea of having dynamic types that could be loaded and unloaded in order to implement plugins. The basic design of the type system, after all, came from Beast, a plugin-heavy audio application, and it was extended to subsume the (mostly) static use cases of GTK. In order to support unloading, the class aspect of the type system was allowed to be cleaned up, but the type data had to be registered and never unloaded; in other words, once a type was registered, it was there forever.

“Arguments” were renamed to properties, and were extended to include more than basic types, provide validations, and notify of changes; the overall design was still using a global hash table to store all the properties across all types. Properties were tied to the GObject type, but the property definition existed as a separate type hierarchy that was designed to validate values, but not manage fields inside a class. Signals were ported wholesale, with minimal changes mainly around the marshalling of values and abstracting closures.

The entire plan was to have GObject as one of the base classes at the root of a specific hierarchy, with all the required functionality for GTK to inherit from for its own GtkObject, while leaving open the possibility of creating other hierarchies, or even other roots with different functionality, for more lightweight objects.

These constraints were entirely intentional; the idea was to be able to port GTK to the new type system, and to an out of tree GLib, during the 1.3 development phase, and minimise the amount of changes necessary to make the transition work not just inside GTK, but inside of GNOME too.

Little by little, the entire GObject layer was ported towards thread safety in the only way that worked without breaking the type system: add global locks around everything; use read-write locks for the type data; lock the access and traversal of the property hash table and of the signals table. The only real world code bases that actively exercised multi-threading support were GStreamer and the GNOME VFS API that was mainly used by Nautilus.

With the 3.0 API, GTK dropped the GtkObject base type: the whole floating reference mechanism was moved to GObject, and a new type was introduced to provide the “initial” floating reference to derived types. Around the same time, a thread-safe version of weak references for GObject appeared as a separate API, which confused the matter even more.

Things that are

Darkness crept back into the forests of the world. Rumour grew of a Shadow in the East, whispers of a Nameless Fear. — Lady Galadriel, “The Lord of the Rings: The Fellowship of the Ring”

Let’s address the elephant in the room: it’s completely immaterial how many lines of code you have to deal with when creating a new type. It’s a one-off cost, and for most cases, it’s a matter of using the existing macros. The declaration and definition macros have the advantages of enforcing a series of best practices, and keep the code consistent across multiple projects. If you don’t want to deal with boilerplate when using C, you chose the wrong language to begin with. The existence of excessive API is mainly a requirement to allow other languages to integrate their type system with GObject’s own.

The dynamic part of the type system has gotten progressively less relevant. Yes: you can still create plugins, and those can register types; but classes are never unloaded, just like their type data. There is some attempt at enforcing an order of operations: you cannot just add an interface after a type has been instantiated any more; and while you can add properties and signals after class initialisation, it’s mainly a functionality reserved for specific language bindings to maintain backward compatibility.

Yes, defining properties is boring, and could probably be simplified, but the real cost is not in defining and installing a GParamSpec: it’s in writing the set and get property multiplexers, validating the values, boxing and unboxing the data, and dealing with the different property flags; none of those things can be wrapped in some fancy C preprocessor macros—unless you go into the weeds with X macros. The other, big cost of properties is their storage inside a separate, global, lock-happy hash table. The main use case of this functionality—adding entirely separate classes of properties with the same semantics as GObject properties, like style properties and child properties in GTK—has completely fallen out of favour, and for good reasons: it cannot be managed by generic code; it cannot be handled by documentation generators without prior knowledge; and, for the same reason, it cannot be introspected. Even calling these “properties” is kind of a misnomer: they are value validation objects that operate only when using the generic (and less performant) GObject accessor API, something that is constrained to things like UI definition files in GTK, or language bindings. If you use the C accessors for your own GObject type, you’ll have to implement validation yourself; and since idiomatic code will have the generic GObject accessors call the public C API of your type, you get twice the amount of validation for no reason whatsoever.

Signals have mostly been left alone, outside of performance improvements that were hard to achieve within the constraints of the existing implementation; the generic FFI-based closure turned out to be a net performance loss, and we’re trying to walk it back even for D-Bus, which was the main driver for it to land in the first place. Marshallers are now generated with a variadic arguments variant, to reduce the amount of boxing and unboxing of GValue containers. Still, there’s not much left to squeeze out of the old GSignal API.

The atomic nature of the reference counting can be a costly feature, especially for code bases that are by necessity single-threaded; the fact that the reference count field is part of the (somewhat) public API prevents fundamental refactorings, like switching to biased reference counting for faster operations on the same thread that created an instance. The lack of room on GObject also prevents storing the thread ID that owns the instance, which in turn prevents calling the GObjectClass.dispose() and GObjectClass.finalize() virtual functions on the right thread, and requires scheduling the destruction of an object on a separate main context, or locking the contents of an object at a further cost.

Things that yet may be

The quest stands upon the edge of a knife: stray but a little, and it will fail to the ruin of all. Yet hope remains, while the company is true. — Lady Galadriel, “The Lord of the Rings: The Fellowship of the Ring”

Over the years, we have been strictly focusing on GObject: speeding up its internals, figuring out ways to improve property registration and performance, adding new API and features to ensure it behaved more reliably. The type system has also been improved, mainly to streamline its use in idiomatic GObject code bases. Not everything worked: properties are still a problem; weak references and pointers are a mess, with two different API that interact badly with GObject; signals still exists on a completely separate plane; GObject is still wildly inefficient when it comes to locking.

The thesis of this strawman is that we have reached the limits of backwards compatibility of GObject, and any attempt at improving it will inevitably lead to more brittle code, rife with potential regressions. The typical answer, in this case, would be to bump the API/ABI of GObject, remove the mistakes of the past, and provide a new idiomatic approach. Sadly, doing so not only would require a level of resources we, as the GLib project stewards, cannot provide, but it would also completely break the entire ecosystem in a way that is not recoverable. Either nobody would port to the new GObject-3.0 API; or the various projects that depend on GObject would inevitably fracture, following whichever API version they can commit to; in the meantime, downstream distributors would suffer the worst effects of the shared platform we call “Linux”.

Between inaction and slow death, and action with catastrophic consequences, there’s the possibility of a third option: what if we stopped trying to emulate Java, and have a single “god” type?

Our type system is flexible enough to support partitioning various responsibilities, and we can defer complexity where it belongs: into faster moving dependencies, that have the benefit of being able to iterate and change at a much higher rate than the foundational library of the platform. What’s the point of shoving every possible feature into the base class, in order to cover ever increasingly complex use cases across multiple languages, when we can let consumers decide to opt into their own well-defined behaviours? What GObject ought to provide is a set of reliable types that can be combined in expressive ways, and that can be inspected by generic API.

A new, old base type

We already have a derivable type, called GTypeInstance. Typed instances don’t have any memory management: once instantiated, they can only be moved, or freed. All our objects already are typed instances, since GObject inherits from it. Contrary to current common practice, we should move towards using GTypeInstance for our types.

There’s a distinct lack of convenience API for defining typed instances, mostly derived from the fact that GTypeInstance is seen as a form of “escape hatch” for projects to use in order to avoid GObject. In practice, there’s nothing that prevents us from improving the convenience of creating new instantiatable/derivable types, especially if we start using them more often. The verbose API must still exist, to allow language bindings and introspection to handle this kind of types, but just like we made convenience macros for declaring and defining GObject types, we can provide macros for new typed instances, and for setting up a GValue table.

Optional functionality

Typed instances require a wrapper API to free their contents before calling g_type_free_instance(). Nothing prevents us from adding a GFinalizable interface that can be implemented by a GTypeInstance, though: interfaces exist at the type system level, and do not require GObject to work.

typedef struct {
  /* every interface structure starts with the base GTypeInterface */
  GTypeInterface g_iface;

  void (* finalize) (GFinalizable *self);
} GFinalizableInterface;

If a typed instance provides an implementation of GFinalizable, then g_type_free_instance() can free the contents of the instance by calling g_finalizable_finalize().

This interface is optional, in case your typed instance just contains simple values, like:

typedef struct {
  GTypeInstance parent;

  bool is_valid;
  double x1, y1;
  double x2, y2;
} Box;

and does not require deallocations outside of the instance block.
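
By contrast, here is a rough sketch of a typed instance that does own heap data, and would therefore implement the hypothetical GFinalizable interface above; the Person type and its functions are purely illustrative:

typedef struct {
  GTypeInstance parent;

  char *name;   /* heap allocated, needs to be freed */
} Person;

static void
person_finalize (GFinalizable *self)
{
  Person *person = (Person *) self;

  g_free (person->name);
}

static void
person_finalizable_init (GFinalizableInterface *iface)
{
  iface->finalize = person_finalize;
}

When the last user calls g_type_free_instance() on a Person, the type system would invoke the finalize implementation before releasing the instance memory.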

A similar interface can be introduced for cloning instances, allowing a copy operation alongside a move:

typedef struct {
  GTypeInterface g_iface;

  GClonable * (* clone) (GClonable *self);
} GClonableInterface;

We could then introduce g_type_instance_clone() as a generic entry point that either used GClonable, or simply allocated a new instance and called memcpy() on it, using the size of the instance (and eventual private data) known to the type system.

The prior art for this kind of functionality exists in GIO, in the form of the GInitable and GAsyncInitable interfaces; unfortunately, those interfaces require GObject, and they depend on GCancellable and GAsyncResult objects, which prevent us from moving them into the lower level API.

Typed containers and life time management

The main functionality provided by GObject is garbage collection through reference counting: you acquire a (strong) reference when you need to access an instance, and release it when you don’t need the instance any more. If the reference you released was the last one, the instance gets finalized.

Of course, once you introduce strong references you open the door to a veritable bestiary of other type of references:

  • weak references, used to keep a “pointer” to the instance, and get a notification when the last reference drops
  • floating references, used as a C convenience to allow ownership transfer of newly constructed “child” objects to their “parent”
  • toggle references, used by language bindings that acquire a strong reference on an instance they wrap with a native object; when the toggle reference gets triggered it means that the last reference being held is the one on the native wrapper, and the wrapper can be dropped causing the instance to be finalized

All of these types of reference exist inside GObject, but since they were introduced over the years, they are bolted on top of the base class using the keyed data storage, which comes with its own costly locking and ordering; they are also managed through the finalisation code, which means there are re-entrancy issues and undefined ordering behaviours that have routinely cropped up over the years, especially when trying to optimise construction and destruction phases.

None of this complexity is, strictly speaking, necessary; we don’t care about an instance being reference counted: a “parent” object can move the memory of a “child” typed instance directly into its own code. What we care about is that, whenever other code interacts with ours, we can hand out a reference to that memory, so that ownership is maintained.

Other languages and standard libraries have the same concept:

  • C++ has std::shared_ptr and std::weak_ptr
  • Rust has Rc and Arc for strong references, and Weak for non-owning ones

These constructs are not part of a base class: they are wrappers around instances. This means you’re not handing out a reference to an instance: you are handing out a reference to a container, which holds the instance for you. The behaviour of the value is made explicit by the type system, not implicit to the type.

A simple implementation of a typed “reference counted” container would provide us with both strong and weak references:

typedef struct _GRc GRc;
typedef struct _GWeak GWeak;

GRc *g_rc_new (GType data_type, gpointer data);

GRc *g_rc_acquire (GRc *rc);
void g_rc_release (GRc *rc);

gpointer g_rc_get_data (GRc *rc);

GWeak *g_rc_downgrade (GRc *rc);
GRc *g_weak_upgrade (GWeak *weak);

bool g_weak_is_empty (GWeak *weak);
gpointer g_weak_get_data (GWeak *weak);
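
As a purely illustrative sketch of how this strawman API might be used—assuming box is an instance of the Box type shown earlier, and BOX_TYPE its registered GType:

GRc *rc = g_rc_new (BOX_TYPE, box);

/* other code acquires its own strong reference... */
GRc *other = g_rc_acquire (rc);

/* ...or a weak one, which does not keep the data alive */
GWeak *weak = g_rc_downgrade (rc);

g_rc_release (other);
g_rc_release (rc);

/* once the last strong reference is gone, the weak reference is empty */
g_assert_true (g_weak_is_empty (weak));
g_assert_null (g_weak_upgrade (weak));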

Alongside this type of container, we could also have a specialisation for atomic reference counted containers; or pinned containers, which guarantee that an object is kept in the same memory location; or re-implement referenced containers inside each language binding, to ensure that the behaviour is tailored to the memory management of those languages.

Specialised types

Container types introduce the requirement of having the type system understand that an object can be the product of two types: the type of the container, and the type of the data. In order to allow properties, signals, and values to effectively provide introspection of these kinds of container types, we are going to need to introduce "specialised" types:

  • GRc exists as a “generic”, abstract type in the type system
  • any instance of GRc that contains an instance of type A gets a new type in the type system

A basic implementation would look like:

GRc *
g_rc_new (GType data_type, gpointer data)
{
  // Returns an existing GType if something else already
  // has registered the same GRc<T>
  GType rc_type =
    g_generic_type_register_static (G_TYPE_RC, data_type);

  // Instantiates GRc, but gives it the type of
  // GRc<T>; there is only the base GRc class
  // and instance initialization functions, as
  // GRc<T> is not a pure derived type
  GRc *res = (GRc *) g_type_create_instance (rc_type);
  res->data = data;

  return res;
}

Any instance of type GRc<A> satisfies the “is-a” relationship with GRc, but it is not a purely derived type:

GType rc_type =
  ((GTypeInstance *) rc)->g_class->g_type;
g_assert_true (g_type_is_a (rc_type, G_TYPE_RC));

The GRc<A> type does not have a different instance or class size, or its own class and instance initialisation functions; it’s still an instance of the GRc type, with a different GType. The GRc<A> type only exists at run time, as it is the result of the type instantiation; you cannot instantiate a plain GRc, or derive your type from GRc in order to create your own reference counted type, either:

// WRONG
GRc *rc = g_type_create_instance (G_TYPE_RC);

// WRONG
typedef GRc GtkWidget;

You can only use a GRc inside your own instance:

typedef struct {
  // GRc<GtkWidget>
  GRc *parent;
  // GRc<GtkWidget>
  GRc *first_child;
  // GRc<GtkWidget>
  GRc *next_sibling;

  // ...
} GtkWidgetPrivate;

Tuple types

Tuples are generic containers of N values, but right now we don’t have any way of formally declaring them into the type system. A hack is to use arrays of similarly typed values, but with the deprecation of GValueArray—which is a bad type that does not allow reference counting, and does not give you guarantees anyway—we only have C arrays and pointer types.

Registering a new tuple type would work like a generic type: a base GTuple abstract type as the "parent", and a number of specialised types registered at run time:

typedef struct _GTuple GTuple;

GTuple *
g_tuple_new_int (size_t n_elements,
                 int elements[])
{
  GType tuple_type =
    g_tuple_type_register_static (G_TYPE_TUPLE, n_elements, G_TYPE_INT);

  GTuple *res = (GTuple *) g_type_create_instance (tuple_type);
  for (size_t i = 0; i < n_elements; i++)
    g_tuple_add (res, elements[i]);

  return res;
}

We can also create specialised tuple types, like pairs:

typedef struct _GPair GPair;

GPair *
g_pair_new (GType this_type,
            GType that_type,
            ...);

This would give us the ability to standardise our API around fundamental types, and reduce the amount of ad hoc container types that libraries have to define and bindings have to wrap with native constructs.

Sum types

Of course, once we start with specialised types, we end up with sum types:

typedef enum {
  SQUARE,
  RECT,
  CIRCLE,
} ShapeKind;

typedef struct {
  GTypeInstance parent;

  ShapeKind kind;

  union {
    struct { Point origin; float side; };
    struct { Point origin; Size size; };
    struct { Point center; float radius; };
  } shape;
} Shape;

As of right now, discriminated unions don’t have any special handling in the type system: they are generally boxed types, or typed instances, but they require type-specific API to deal with the discriminator field and type. Since we have types for enumerations and instances, we can register them at the same time, and provide offsets for direct access:

GType
g_sum_type_register_static (const char *name,
                            size_t class_size,
                            size_t instance_size,
                            GType tag_enum_type,
                            offset_t tag_field);

This way it’s possible to ask the type system for:

  • the offset of the tag in an instance, for direct access
  • all the possible values of the tag, by inspecting its GEnum type

From then on, we can easily build types like Option and Result:

typedef enum {
  G_RESULT_OK,
  G_RESULT_ERR
} GResultKind;

typedef struct {
  GTypeInstance parent;

  GResultKind type;
  union {
    GValue value;
    GError *error;
  } result;
} GResult;

// ...
g_sum_type_register_static ("GResult",
                            sizeof (GResultClass),
                            sizeof (GResult),
                            G_TYPE_RESULT_KIND,
                            offsetof (GResult, type));

// ...
GResult *
g_result_new_boolean (gboolean value)
{
  GType res_type =
    g_generic_type_register_static (G_TYPE_RESULT,
                                    G_TYPE_BOOLEAN);
  GResult *res =
    (GResult *) g_type_create_instance (res_type);

  g_value_init (&res->result.value, G_TYPE_BOOLEAN);
  g_value_set_boolean (&res->result.value, value);

  return res;
}

// ...
g_autoptr (GResult) result = obj_finish (task);
switch (g_result_get_kind (result)) {
  case G_RESULT_OK:
    g_print ("Result: %s\n",
      g_result_get_boolean (result)
        ? "true"
        : "false");
    break;

  case G_RESULT_ERR:
    g_printerr ("Error: %s\n",
      g_result_get_error_message (result));
    break;
}

// ...
g_autoptr (GResult) result =
  g_input_stream_read_bytes (stream);
if (g_result_is_error (result)) {
  // ...
} else {
  g_autoptr (GBytes) data = g_result_get_boxed (result);
  // ...
}

Consolidating GLib and GType

Having the type system in a separate shared library did make sense back when GLib was spun off from GTK; after all, GLib was mainly a set of convenient data types for a language that lacked a decent standard library. Additionally, not many C projects were interested in the type system, as it was perceived as a big chunk of functionality in an era where space was at a premium. These days, the smallest environment capable of running GLib code is plenty capable of running the GObject type system as well. The separation between GLib data types and the GObject type system has created data types that are not type safe, and work by copying data, by having run time defined destructor functions, or by storing pointers and assuming everything will be fine. This leads to code duplication between shared libraries, and prevents the use of GLib data types in the public API, lest the introspection information gets lost.

Moving the type system inside GLib would allow us to have properly typed generic container types, like a GVector replacing GArray, GPtrArray, GByteArray, as well as the deprecated GValueArray; or a GMap and a GSet, replacing GHashTable, GSequence, and GtkRBTree. Even the various list models could be assembled on top of these new types, and moved out of GTK.
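
None of these types exist today, but a typed GVector could plausibly look something like the following illustrative sketch, reusing the generic-type idea from earlier; every symbol here is hypothetical:

typedef struct _GVector GVector;

/* registers (or reuses) the GVector<T> type for the given element type */
GVector *g_vector_new (GType element_type);

void     g_vector_append (GVector *vec, gconstpointer element);
gpointer g_vector_index (GVector *vec, size_t idx);
size_t   g_vector_get_length (GVector *vec);

/* a GVector<GtkWidget> would be a distinct, introspectable type,
 * just like GRc<GtkWidget> in the earlier example */
GVector *children = g_vector_new (GTK_TYPE_WIDGET);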

Current consumers of GLib-only API would still have their basic C types, but if they don’t want to link against a slightly bigger shared library that includes GTypeInstance, GTypeInterface, and the newly added generic, tuple, and sum types, then they would probably be better served by projects like c-util instead.

Properties

Instead of bolting properties on top of GParamSpec, we can move their definition into the type system; after all, properties are a fundamental part of a type, so it does not make sense to bind them to the class instantiation. This would also remove the long-standing issue of properties being available for registration long after a class has been initialised; it would give us the chance to ship a utility for inspecting the type system to get all the meta-information on the hierarchy and generating introspection XML without having to compile a small binary.

If we move property registration to the type registration we can also finally move away from multiplexed accessors, and use direct instance field access where applicable:

GPropertyBuilder builder;

g_property_builder_init (&builder,
  G_TYPE_STRING, "name");
// Stop using flags, and use proper setters; since
// there's no use case for unsetting the readability
// flag, we don't even need a boolean argument
g_property_builder_set_readwrite (&builder);
// The offset is used for read and write access...
g_property_builder_set_private_offset (&builder,
  offsetof (GtkWidgetPrivate, name));
// ... unless an accessor function is provided; in
// this case we want setting a property to go through
// a function
g_property_builder_set_setter_func (&builder,
  gtk_widget_set_name);

// Register the property into the type; we return the
// offset of the property into the type node, so we can
// access the property definition with a fast look up
properties[NAME] =
  g_type_add_instance_property (type,
    g_property_builder_end (&builder));

Accessing the property information would then be a case of looking into the type system under a single reader lock, instead of traversing all properties in a glorified globally locked hash table.

Once we have a property registered in the type system, accessing it is a matter of calling API on the GProperty object:

void
gtk_widget_set_name (GtkWidget *widget,
                     const char *name)
{
  GProperty *prop =
    g_type_get_instance_property (GTK_TYPE_WIDGET,
                                  properties[NAME]);

  g_property_set (prop, widget, name);
}

Signals

Moving signal registration into the type system would allow us to subsume the global locking into the type locks; it would also give us the chance to simplify some of the complexity for re-emission and hooks:

GSignalBuilder builder;

g_signal_builder_init (&builder, "insert-text");
g_signal_builder_set_args (&builder, 3,
  (GSignalArg[]) {
    { .name = "text", .gtype = G_TYPE_STRING },
    { .name = "length", .gtype = G_TYPE_SIZE },
    { .name = "position", .gtype = G_TYPE_OFFSET },
  });
g_signal_builder_set_retval (&builder,
  G_TYPE_OFFSET);
g_signal_builder_set_class_offset (&builder,
  offsetof (EditableClass, insert_text));

signals[INSERT_TEXT] =
  g_type_add_class_signal (type,
    g_signal_builder_end (&builder));

By taking the chance of moving signals out of their own namespace we can also move to a model where each class is responsible for providing the API necessary to connect and emit signals, as well as providing callback types for each signal. This would allow us to increase type safety, and reduce the reliance on generic API:

typedef offset_t (* EditableInsertText) (Editable *self,
                                         const char *text,
                                         size_t length,
                                         offset_t position);

unsigned long
editable_connect_insert_text (Editable *self,
                              EditableInsertText callback,
                              gpointer user_data,
                              GSignalFlags flags);

offset_t
editable_emit_insert_text (Editable *self,
                           const char *text,
                           size_t length,
                           offset_t position);
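
A hypothetical call site, using only the per-class API sketched above; the handler name is illustrative, and editable and cursor stand in for an existing instance and position:

static offset_t
on_insert_text (Editable *self,
                const char *text,
                size_t length,
                offset_t position)
{
  /* veto or adjust the insertion here */
  return position + length;
}

// ...
editable_connect_insert_text (editable, on_insert_text, NULL, 0);

offset_t new_pos =
  editable_emit_insert_text (editable, "hello", 5, cursor);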

Extending the type system

Some of the metadata necessary to provide properly typed properties and signals is missing from the type system. For instance, by design, there is no type representing a uint16_t; we are supposed to create a GParamSpec to validate the value of a G_TYPE_INT in order to fit in the 16bit range. Of course, this leads to excessive run time validation, and relies on C’s own promotion rules for variadic arguments; it also does not work for signals, as those do not use GParamSpec. More importantly, though, the missing connection between C types and GTypes prevents gathering proper introspection information for properties and signal arguments: if we only have the GType we cannot generate the full metadata that can be used by documentation and language bindings, unless we’re willing to lose specificity.

Not only should the type system be sufficient to contain all the standard C types that are now available, we also need the type system to provide us with enough information to be able to serialise those types into the introspection data, if we want to be able to generate code like signal API, type safe bindings, or accurate documentation for properties and signal handlers.

Introspection

Introspection exists outside of GObject mainly because of dependencies; the parser, abstract syntax tree, and transformers are written in Python and interface with a low level C tokeniser. Adding a CPython dependency to GObject is too much of a stretch, especially when it comes to bootstrapping a system. While we could keep the dependency optional, and allow building GObject without support for introspection, keeping the code separate is a simpler solution.

Nevertheless, GObject should not ignore introspection. The current reflection API inside GObject should generate data that is compatible with the libgirepository API and with its GIR parser. Currently, gobject-introspection is tasked with generating a small C executable, compiling it, running it to extract metadata from the type system, as well as the properties and signals of a GObject type, and generate XML that can be parsed and included into the larger GIR metadata for the rest of the ABI being introspected. GObject should ship a pre-built binary, instead; it should dlopen the given library or executable, extract all the type information, and emit the introspection data. This would not make gobject-introspection more cross-compilable, but it would simplify its internals and its distributability. We would not need to know how to compile and run C code from a Python script, for one; a simple executable wrapper around a native copy of the GObject-provided binary would be enough.

Ideally, we could move the girepository API into GObject itself, and allow it to load the binary data compiled out of the XML; language bindings loading the data at run time would then need to depend on GObject instead of an additional library, and we could ship the GIR → typelib compiler directly with GLib, leaving gobject-introspection to deal only with the parsing of C headers, docblocks, and annotations, to generate the XML representation of the C/GObject ABI.

There and back again

And the ship went out into the High Sea and passed on into the West, until at last on a night of rain Frodo smelled a sweet fragrance on the air and heard the sound of singing that came over the water. And then it seemed to him that as in his dream in the house of Bombadil, the grey rain-curtain turned all to silver glass and was rolled back, and he beheld white shores and beyond them a far green country under a swift sunrise. — “The Lord of the Rings”, Volume 3: The Return of the King, Book 6: The End of the Third Age

The hard part of changing a project in a backward compatible way is resisting the temptation of fixing the existing design. Some times it’s necessary to backtrack the chain of decisions, and consider the extant code base a dead branch; not because the code is wrong, or bug free, but because any attempt at doubling down on the same design will inevitably lead to breakage. In this sense, it’s easy to just declare “maintenance bankruptcy”, and start from a new major API version: breaks allow us to fix the implementation, at the cost of adapting to new API. For instance, widgets are still the core of GTK, even after 4 major revisions; we did not rename them to “elements” or “actors”, and we did not change how the windows are structured. You are still supposed to build a tree of widgets, connect callbacks to signals, and let the main event loop run. Porting has been painful because of underlying changes in the graphics stack, or because of portability concerns, but even with the direction change of favouring composition over inheritance, the knowledge on how to use GTK has been transferred from GTK 1 to 4.

We cannot do the same for GObject. Changing how it is implemented implies changing everything that depends on it; it means introducing behavioural changes in subtle, and hard to predict ways. Luckily for us, the underlying type system is still flexible and nimble enough that it can give us the ability to change direction, and implement an entirely different approach to object orientation—one that is more in line with languages like modern C++ and Rust. By following new approaches we can slowly migrate our platform to other languages over time, with a smaller impedance mismatch caused by the current design of our object model. Additionally, by keeping the root of the type system, we maintain the ability to provide a stable C ABI that can be consumed by multiple languages, which is the strong selling point of the GNOME ecosystem.

Why do all of this work, though? Compared to a full API break, this proposal has the advantage of being tractable and realistic; I cannot emphasise enough how little appetite there is for a “GObject 3.0” in the ecosystem. The recent API bump from libsoup2 to libsoup3 has clearly identified that changes deep into the stack end up being too costly an effort: some projects have found it easier to switch to another HTTP library altogether, rather than support two versions of libsoup for a while; other projects have decided to drop compatibility with libsoup2, forcing the hand of every reverse dependency both upstream and downstream. Breaking GObject would end up breaking the ecosystem, with the hope of a “perfect” implementation way down the line and with very few users on one side, and a dead branch used by everybody else on the other.

Of course, the complexity of the change is not going to be trivial, and it will impact things like the introspection metadata and the various language bindings that exist today; some bindings may even require a complete redesign. Nevertheless, by implementing this new object model and leaving GObject alone, we buy ourselves enough time and space to port our software development platform towards a different future.

Maybe this way we will get to save the Shire; and even if we give up some things, or even lose them, we still get to keep what matters.

by ebassi at August 23, 2023 08:23 PM

August 21, 2023

Melissa Wen

AMD Driver-specific Properties for Color Management on Linux (Part 1)

TL;DR:

Color is a visual perception. Human eyes can detect a broader range of colors than any devices in the graphics chain. Since each device can generate, capture or reproduce a specific subset of colors and tones, color management controls color conversion and calibration across devices to ensure a more consistent color reproduction. We can expose a GPU-accelerated display color management pipeline to support this process and enhance results, and this is what we are doing on Linux to improve color management on Gamescope/SteamDeck. Even with the challenges of being external developers, we have been working on mapping AMD GPU color capabilities to the Linux kernel color management interface, which is a combination of DRM and AMD driver-specific color properties. This more extensive color management pipeline includes pre-defined Transfer Functions, 1-Dimensional LookUp Tables (1D LUTs), and 3D LUTs before and after the plane composition/blending.


The study of color is well-established and has been explored for many years. Color science and research findings have also guided technology innovations. As a result, color in Computer Graphics is a very complex topic that I’m putting a lot of effort into becoming familiar with. I always find myself rereading all the materials I have collected about color space and operations since I started this journey (about one year ago). I also understand how hard it is to find consensus on some color subjects, as exemplified by all explanations around the 2015 online viral phenomenon of The Black and Blue Dress. Have you heard about it? What is the color of the dress for you?

So, taking into account my skills with colors and building consensus, this blog post only focuses on GPU hardware capabilities to support color management :-D If you want to learn more about color concepts and color on Linux, you can find useful links at the end of this blog post.

Linux Kernel, show me the colors ;D

The DRM color management interface only exposes a small set of post-blending color properties. Proposals to enhance the DRM color API from different vendors have landed on the subsystem mailing list over the last few years. On one hand, we got some suggestions to extend the DRM post-blending/CRTC color API: DRM CRTC 3D LUT for R-Car (2020 version); DRM CRTC 3D LUT for Intel (draft - 2020); DRM CRTC 3D LUT for AMD by Igalia (v2 - 2023); DRM CRTC 3D LUT for R-Car (v2 - 2023). On the other hand, some proposals to extend the DRM pre-blending/plane API: DRM plane colors for Intel (v2 - 2021); DRM plane API for AMD (v3 - 2021); DRM plane 3D LUT for AMD - 2021. Finally, Simon Ser sent the latest proposal in May 2023: Plane color pipeline KMS uAPI, from discussions in the 2023 Display/HDR Hackfest, and it is still under evaluation by the Linux Graphics community.

All previous proposals seek a generic solution for expanding the API, but many seem to have stalled due to the uncertainty of matching well the hardware capabilities of all vendors. Meanwhile, the use of AMD color capabilities on Linux remained limited by the DRM interface, as the DCN 3.0 family color caps and mapping diagram below shows the Linux/DRM color interface without driver-specific color properties [*]:

Bearing in mind that we need to know the variety of color pipelines in the subsystem to be clear about a generic solution, we decided to approach the issue from a different perspective and worked on enabling a set of Driver-Specific Color Properties for AMD Display Drivers. As a result, I recently sent another round of the AMD driver-specific color mgmt API.

For those who have been following the AMD driver-specific proposal since the beginning (see [RFC][V1]), the main new features of the latest version [v2] are the addition of pre-blending Color Transformation Matrix (plane CTM) and the differentiation of Pre-defined Transfer Functions (TF) supported by color blocks. For those who just got here, I will recap this work in two blog posts. This one describes the current status of the AMD display driver in the Linux kernel/DRM subsystem and what changes with the driver-specific properties. In the next post, we go deeper to describe the features of each color block and provide a better picture of what is available in terms of color management for Linux.

The Linux kernel color management API and AMD hardware color capabilities

Before discussing colors in the Linux kernel with AMD hardware, consider accessing the Linux kernel documentation (version 6.5.0-rc5). In the AMD Display documentation, you will find my previous work documenting AMD hardware color capabilities and the Color Management Properties. It describes how the AMD Display Manager (DM) intermediates requests between the AMD Display Core component (DC) and the Linux/DRM kernel interface for color management features. It also describes the relevant functions that call into the AMD color module to build curves for content space transformations.

A subsection also describes hardware color capabilities and how they evolve between versions. This subsection, DC Color Capabilities between DCN generations, is a good starting point to understand what we have been doing on the kernel side to provide a broader color management API with AMD driver-specific properties.

Why do we need more kernel color properties on Linux?

Blending is the process of combining multiple planes (framebuffers abstraction) according to their mode settings. Before blending, we can manage the colors of various planes separately; after blending, we have combined those planes into only one output per CRTC. Color conversions after blending would be enough in a single-plane scenario or when dealing with planes in the same color space on the kernel side. Still, they cannot handle the blending of multiple planes with different color spaces and luminance levels. With plane color management properties, userspace can get a better representation of colors to deal with the diversity of color profiles of devices in the graphics chain, support a wide color gamut (WCG), and convert High Dynamic Range (HDR) content to Standard Dynamic Range (SDR) content (and vice versa). With a GPU-accelerated display color management pipeline, we can use hardware blocks for color conversions and color mapping and support advanced color management.
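To make "linearize" concrete: a de-gamma step undoes the non-linear encoding of the content so that blending and matrix math operate on linear light. Below is a minimal C sketch of the standard sRGB EOTF; this is only an illustration of the math, not driver code, since the hardware approximates it with 1D LUTs or pre-defined transfer functions.

#include <math.h>

/* Conceptual sRGB de-gamma (EOTF): maps a non-linear, display-encoded
 * channel value in [0.0, 1.0] to linear light. In the display pipeline,
 * a de-gamma 1D LUT or a pre-defined transfer function approximates
 * this per channel in hardware. */
static double srgb_to_linear(double c)
{
    return (c <= 0.04045) ? c / 12.92 : pow((c + 0.055) / 1.055, 2.4);
}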

The current DRM color management API enables us to perform some color conversions after blending, but there is no interface to calibrate the input space per plane. Note that here I'm not considering some workarounds in the AMD display manager that map the DRM CRTC de-gamma and DRM CRTC CTM properties to the pre-blending DC de-gamma and gamut remap blocks, respectively. So, in more detail, it only exposes three post-blending features (a small userspace sketch follows the list):

  • DRM CRTC de-gamma: used to convert the framebuffer’s colors to linear gamma;
  • DRM CRTC CTM: used for color space conversion;
  • DRM CRTC gamma: used to convert colors to the gamma space of the connected screen.
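For concreteness, here is a minimal userspace sketch of how a client could program one of these, the DRM CRTC CTM, through libdrm. This is not how Gamescope does it (a real compositor would set the property as part of an atomic commit); it only shows the blob-plus-property mechanism and the S31.32 sign-magnitude fixed-point encoding that struct drm_color_ctm expects.

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Encode a double as the S31.32 sign-magnitude fixed point used by
 * struct drm_color_ctm (sign in bit 63, magnitude with 32 fractional
 * bits below it). */
static uint64_t to_s31_32(double v)
{
    uint64_t sign = 0;
    if (v < 0) {
        sign = 1ULL << 63;
        v = -v;
    }
    return sign | (uint64_t)(v * (double)(1ULL << 32));
}

/* Look up a named property on a DRM object (e.g. "CTM" on a CRTC). */
static uint32_t find_prop_id(int fd, uint32_t obj_id, uint32_t obj_type,
                             const char *name)
{
    uint32_t id = 0;
    drmModeObjectProperties *props =
        drmModeObjectGetProperties(fd, obj_id, obj_type);

    for (uint32_t i = 0; props && i < props->count_props; i++) {
        drmModePropertyRes *p = drmModeGetProperty(fd, props->props[i]);
        if (p && strcmp(p->name, name) == 0)
            id = p->prop_id;
        drmModeFreeProperty(p);
    }
    drmModeFreeObjectProperties(props);
    return id;
}

/* Set the post-blending CRTC CTM to a 3x3 matrix (row-major). */
static int set_crtc_ctm(int fd, uint32_t crtc_id, const double m[9])
{
    struct drm_color_ctm ctm;
    uint32_t blob_id, prop_id;

    for (int i = 0; i < 9; i++)
        ctm.matrix[i] = to_s31_32(m[i]);

    if (drmModeCreatePropertyBlob(fd, &ctm, sizeof(ctm), &blob_id))
        return -1;

    prop_id = find_prop_id(fd, crtc_id, DRM_MODE_OBJECT_CRTC, "CTM");
    if (!prop_id)
        return -1; /* CTM not exposed by this driver/CRTC */

    return drmModeObjectSetProperty(fd, crtc_id, DRM_MODE_OBJECT_CRTC,
                                    prop_id, blob_id);
}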

AMD driver-specific color management interface

We can compare the Linux color management API with and without the driver-specific color properties. From now on, we denote driver-specific properties with the AMD prefix and generic properties with the DRM prefix. For visual comparison, I bring the DCN 3.0 family color caps and mapping diagram closer and present it here again:

Mixing AMD driver-specific color properties with DRM generic color properties, we have a broader Linux color management system with the following features exposed by properties in the plane and CRTC interface, as summarized by this updated diagram:

The blocks highlighted by red lines are the new properties in the driver-specific interface developed by me (Igalia) and Joshua (Valve). The red dashed lines are new links between the API and AMD driver components, implemented by us to connect the Linux/DRM interface to AMD hardware blocks, mapping components accordingly. In short, we have the following color management properties exposed by the DRM/AMD display driver (a userspace sketch using one of the new plane properties follows the list):

  • Pre-blending - AMD Display Pipe and Plane (DPP):
    • AMD plane de-gamma: 1D LUT and pre-defined transfer functions; used to linearize the input space of a plane;
    • AMD plane CTM: 3x4 matrix; used to convert plane color space;
    • AMD plane shaper: 1D LUT and pre-defined transfer functions; used to delinearize and/or normalize colors before applying 3D LUT;
    • AMD plane 3D LUT: 17x17x17 size with 12 bit-depth; three dimensional lookup table used for advanced color mapping;
    • AMD plane blend/out gamma: 1D LUT and pre-defined transfer functions; used to linearize back the color space after 3D LUT for blending.
  • Post-blending - AMD Multiple Pipe/Plane Combined (MPC):
    • DRM CRTC de-gamma: 1D LUT (can’t be set together with plane de-gamma);
    • DRM CRTC CTM: 3x3 matrix (remapped to post-blending matrix);
    • DRM CRTC gamma: 1D LUT + AMD CRTC gamma TF; added to take advantage of driver pre-defined transfer functions;
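Here is a similar sketch for one of the new pre-blending properties, reusing find_prop_id() and srgb_to_linear() from the earlier snippets. The property name "AMD_PLANE_DEGAMMA_LUT" is illustrative only; check the names your kernel actually exposes on the plane, and keep in mind that the driver-specific properties are meant to be programmed through atomic commits by a real client such as Gamescope.

#include <stdlib.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Build a de-gamma 1D LUT blob and attach it to a plane property.
 * Each struct drm_color_lut entry holds 16-bit unsigned values per
 * channel, where 0xFFFF represents 1.0. */
static int set_plane_degamma_lut(int fd, uint32_t plane_id,
                                 unsigned int lut_size)
{
    struct drm_color_lut *lut = calloc(lut_size, sizeof(*lut));
    uint32_t blob_id, prop_id;
    int ret;

    if (!lut)
        return -1;

    for (unsigned int i = 0; i < lut_size; i++) {
        double in = (double)i / (double)(lut_size - 1);
        uint16_t out = (uint16_t)(srgb_to_linear(in) * 0xFFFF);
        lut[i].red = lut[i].green = lut[i].blue = out;
    }

    ret = drmModeCreatePropertyBlob(fd, lut, lut_size * sizeof(*lut),
                                    &blob_id);
    free(lut);
    if (ret)
        return ret;

    /* Illustrative property name; query the plane for the real strings. */
    prop_id = find_prop_id(fd, plane_id, DRM_MODE_OBJECT_PLANE,
                           "AMD_PLANE_DEGAMMA_LUT");
    if (!prop_id)
        return -1; /* property not exposed by this kernel/hardware */

    /* A real compositor would carry this in an atomic commit along with
     * the rest of the plane state. */
    return drmModeObjectSetProperty(fd, plane_id, DRM_MODE_OBJECT_PLANE,
                                    prop_id, blob_id);
}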

Note: You can find more about AMD display blocks in the Display Core Next (DCN) - Linux kernel documentation, provided by Rodrigo Siqueira (Linux/AMD display developer) in a 2021-documentation series. In the next post, I’ll revisit this topic, explaining display and color blocks in detail.

How did we get a large set of color features from AMD display hardware?

So, looking at AMD hardware color capabilities in the first diagram, we can see no post-blending (MPC) de-gamma block in any hardware family. We can also see that the AMD display driver maps CRTC/post-blending CTM to the pre-blending (DPP) gamut_remap block, but there is a post-blending (MPC) gamut_remap (DRM CTM) block in newer hardware versions, including the SteamDeck hardware. You can find more details about hardware versions in the Linux kernel documentation/AMDGPU Product Information.

I needed to rework these two mappings to provide pre-blending/plane de-gamma and CTM for the SteamDeck. I changed the DC mapping to detach stream gamut remap matrices from the DPP gamut remap block. That means mapping AMD plane CTM directly to the DPP/pre-blending gamut remap block and DRM CRTC CTM to the MPC/post-blending gamut remap block. In this sense, I also limited plane CTM properties to those hardware versions with MPC/post-blending gamut_remap capabilities, since older versions cannot support this feature without clashing with DRM CRTC CTM.

Unfortunately, I couldn't prevent a conflict between AMD plane de-gamma and DRM CRTC de-gamma, since post-blending de-gamma isn't available in any AMD hardware version so far. The fact is that a post-blending de-gamma makes little sense in the AMD color pipeline, where plane blending works better in a linear space and there are enough color blocks to linearize content before blending. To deal with this conflict, the driver now rejects atomic commits if users try to set both AMD plane de-gamma and DRM CRTC de-gamma simultaneously.
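Conceptually, the validation looks like the sketch below. This is heavily simplified and not the actual amdgpu code, and has_amd_plane_degamma() is a made-up helper standing in for the driver's real state checks.

/* Simplified idea of the conflict check described above (not the real
 * amdgpu code): reject the commit when both the driver-specific plane
 * de-gamma and the DRM CRTC de-gamma LUT are requested at once. */
if (has_amd_plane_degamma(plane_state) && crtc_state->degamma_lut)
        return -EINVAL;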

Finally, we had no other clashes when enabling other AMD driver-specific color properties for our use case, Gamescope/SteamDeck. Our main work for the remaining properties was understanding the data flow of each property, the hardware capabilities and limitations, and how to shape the data for programming the registers - AMD color block capabilities (and limitations) are the topics of the next blog post. Besides that, we fixed some driver bugs along the way since it was the first Linux use case for most of the new color properties, and some behaviors are only exposed when exercising the engine.

Take a look at the Gamescope/Steam Deck Color Pipeline[**], and see how Gamescope uses the new API to manage color space conversions and calibration (please click on the image for a better view):

In the next blog post, I’ll describe the implementation and technical details of each pre- and post-blending color block/property on the AMD display driver.

* Thanks to Harry Wentland for helping with diagrams, color concepts and AMD capabilities.

** Thanks to Joshua Ashton for providing and explaining the Gamescope/Steam Deck color pipeline.

*** Thanks to the Linux Graphics community - especially Harry, Joshua, Pekka, Simon, Sebastian, Siqueira, Alex H. and Ville - for all the learning during this Linux DRM/AMD color journey. Also, thanks to Carlos and Tomas for organizing the 2023 Display/HDR Hackfest, where we had a great and immersive opportunity to discuss Color & HDR on Linux.

  1. Cinematic Color - 2012 SIGGRAPH course notes by Jeremy Selan: an introduction to color science, concepts and pipelines.
  2. Color management and HDR documentation for FOSS graphics by Pekka Paalanen: documentation and useful links on applying color concepts to the Linux graphics stack.
  3. HDR in Linux by Jeremy Cline: a blog post exploring color concepts for HDR support on Linux.
  4. Methods for conversion of high dynamic range content to standard dynamic range content and vice-versa by ITU-R: guideline for conversions between HDR and SDR contents.
  5. Using Lookup Tables to Accelerate Color Transformations by Jeremy Selan: Nvidia blog post about Lookup Tables on color management.
  6. The Importance of Being Linear by Larry Gritz and Eugene d’Eon: Nvidia blog post about gamma and color conversions.

August 21, 2023 11:13 AM

August 10, 2023

Brian Kardell

Igalia: Mid-season Power Rankings

Igalia: Mid-season Power Rankings

Let’s take a look at how the year is stacking up in terms of Open Source contributions. If this were an episode of Friends its title would be "The One With the Charts".

I’ve written before about how I have personally come to really appreciate the act of making a simple list of “things recently accomplished”. It’s always eye opening, and for me at least, usually therapeutic.

For me personally, it's been a super weird first half of the year and… I feel like I could use a nice list.

It's been a weird year, right?

I mean, not just for me personally, for all of us I guess, right?

Mass layoffs everywhere for a while, new shuffling of people we know from Google to Shopify, Mozilla to Google, Google to Igalia, Mozilla to Igalia, Mozilla to Apple, Google to Meta… Who’s on first? Third base!

LLMs are suddenly everywhere. All of the “big” CSS features people have been clamouring for forever are suddenly right here. HTML finally got <dialog> and now is getting a popover (via attributes). Apple’s got some funky XR glasses coming. There is suddenly significant renewed interest in two novel web engines. And that’s just some of the tech stuff.

So yeah… Let's see what impacts, if any, all of this is having on the state of projects Igalia works on by looking at our commits so far this year… Note that Igalia's slice of the pie is separated in each of the charts for quick identification...

Quick disclaimers

All of these stats are based on commits publicly available through their respective Git repositories. This is of course an imperfect measure for many reasons - some commits are huge, some are small. Some small commits are really hard while some large commits are easy, but verbose. Finally, the biggest challenge, even if we accept these metrics, is mapping commits to organizations. We use a fairly elaborate system and many checkpoints - we collaborate annually with several of the projects to cross-check these mappings. Still, you'll see lots of entries in these charts with just an individual's name. Often these are individual contractors or contributors, but sometimes it's just that we cannot currently map them any other way. If you see one that should be counted differently, please let me know!

The Big Web Projects

Igalia is still the one (and still having fun) with the most commits in Chromium and WebKit after Google and Apple, respectively, as we'll see... But we can add some more #1’s this year - even some where we’re not excluding the steward…

Chromium

Igalia claims a whopping 41.9% of the (non-Google) commits so far in 2023!! That’s more than Microsoft, Intel, Opera, Samsung and ARM combined!!! Yowza!

data

Top 10 contributors

Contributor              Contributions
Igalia                   41.56%
Microsoft                16.42%
Intel                    11.74%
Opera                    4.88%
Ho Cheung                3.92%
Samsung                  2.08%
Stephan Hartmann         1.62%
ARM                      1.58%
Naver Corporation        1.48%
Bytedance                1.12%
127 other committers     14%

As you read the others, keep in mind that the chromium repository actually has more than chrome inside it, so comparisons of these aren't Apples-to-Apples (or, Googles or foxes).

WebKit

52.6% of the non-Apple commits in WebKit so far this year are from Igalians. It's interesting to note that a huge 4.9% of all of these are from accounts with less than 10 commits this year - pretty close to what it was in Firefox!

data

Top 10 contributors

Contributor              Contributions
Igalia                   54.94%
Sony                     18.29%
Ahmad Saleem             8.81%
Red Hat                  5.50%
Rose                     2.69%
open-tec.co.jp           1.96%
warlow.dev               1.91%
umich.edu                0.62%
Alexey Knyazev           0.51%
Google                   0.45%
40 other committers      4%

Firefox

We're sitting at the #5 spot (excluding Mozilla) with 8.87% of commits. Firefox is, in a lot of ways, the trickiest to describe, but just look at it: It's very diverse! As the inventors of modern open source, I guess it makes sense. The mozilla-central repository has the most individual significant contributors as well as a really long line of tiny contributors. The tiny contributors (less than 10 commits) contributed 5.2% (compared to 4.85% in WebKit, for example). However, there are also a few external contributors who are just astoundingly prolific (some of these bigger slices represent hundreds of commits), and with such a number of significant individual contributors, it amounts to a lot.

data

Top 10 contributors

Contributor              Contributions
André Bargull            14.38%
Red Hat                  12.31%
Birchill                 10.32%
Gregory Pappas           9.04%
Igalia                   8.87%
Robert Longson           5.12%
CanadaHonk               4.28%
Masatoshi Kimura         2.65%
ganna                    2.21%
Itiel                    1.68%
174 other committers     29%

Pause for a Note

When you look at these charts, it's really heartening to see how many people and organizations care and contribute. Especially when you look at the Mozilla/Firefox example, it really gives the impression that that project is just a million volunteers. But, it's important to keep it all in perspective too. WebKit has about 50 contributing orgs and individuals, Chrome about 140 and Firefox about 185. A significantly larger percentage of contributions come from individuals in Mozilla's case. Importantly: In all of these projects, the steward org's contributions absolutely dwarf the rest of the world's contributions combined:

A version of the pies showing the steward's contributions, for scale (Mozilla contributed 87.2% of all commits, Apple 78.1% and Google 95.5% to their respective projects).

If you think this is astounding, please check out my post Webrise and our Web Ecosystem Health series on the Igalia Chats Podcast

Wolvic

A new #1 in the reports. I guess it should come as no surprise at all that we're #1 in terms of commits to our Wolvic XR browser. It looks a lot like other projects in terms of the steward's balance. What's more interesting, I think, is that its funding model is based on partnerships with several organizations and a collective, rather than Igalia as a "go it alone" source.

data

Top 10 contributors

Contributor              Contributions
Igalia                   94.90%
opensuse.org             0.77%
ratcliffefamily.org      0.77%
gallegonovato            0.51%
net-c.ca                 0.51%
Ayaskant Panigrahi       0.51%
Anushka Chakraborty      0.26%
Ana2k                    0.26%
Luna Jernberg            0.26%
zohocorp.com             0.26%

Servo

This year, thanks to some new funding and internal investment we can add Servo to a very special #1 list! Igalia is second to no one in terms of commits there either with 52.7% of all commits! An amazing 22.24% of those commits in servo are from unmappable committers with less than 10 commits so far this year!

data

Top 10 contributors

Contributor              Contributions
Igalia                   52.68%
Mozilla                  25.92%
sagudev                  5.11%
Pu Xingyu                2.73%
michaelgrigoryan25       2.02%
Alex Touchet             1.66%
2shiori17                1.55%
cybai (Haku)             1.19%
Yutaro Ohno              0.71%
switchpiggy              0.59%
30 other committers      6%

Test-262

Test-262 is the conformance test suite for ECMA work (JavaScript). I guess you could say we're doing a lot of work there as well, because guess who's got the most commits there? If you guessed Igalia, you'd be right, with 53.4% of all commits!

data

Top 10 contributors

Contributor              Contributions
Igalia                   53.64%
Google                   11.26%
Justin Grant             10.60%
Richard Gibson           3.97%
André Bargull            3.31%
Jordan Harband           3.31%
Bocoup                   3.31%
Veera                    2.65%
Huáng Jùnliàng           1.99%
José Julián Espina       1.32%

Note that the total number of commits in Test262 is rather small compared to many of the other projects here.

Babel

Igalians are now the #1 contributors to Babel, contributing 46.6% of all commits so far this year!

data

Top 10 contributors

Contributor                      Contributions
Igalia                           46.35%
Huáng Jùnliàng                   22.32%
liuxingbaoyu                     19.31%
Jonathan Browne                  0.86%
Avery                            0.86%
fisker Cheung                    0.86%
Dimitri Papadopoulos Orfanos     0.43%
FabianWarnecke                   0.43%
Abdulaziz Ghuloum                0.43%
magic-akari                      0.43%

V8

Igalia is the #7 contributor to V8 (excluding Google)! This is a pretty busy repo and it's interesting that 6.36% of these commits are from the unmapped/individual contributors with less than ten commits so far this year.

data

Top 10 contributors

Contributor              Contributions
Google                   85.66%
Red Hat                  2.98%
Intel                    2.57%
loongson.cn              2.09%
iscas.ac.cn              1.66%
Microsoft                1.13%
ARM                      0.86%
Igalia                   0.79%
Bytedance                0.41%
eswincomputing.com       0.34%

Google's contributions account for a giant 87.5% of all commits here as well.

But that's not all!

All of the above is just looking specifically at the big web projects because, you know, the web is sort of my thing. If you're reading my blog, there's a pretty good chance it's your thing too. But Igalia does way more than that, if you can believe it. I probably don't talk about it enough, but it's pretty amazing. I suppose I can't give a million more charts, but here are just a few more highlights of other notable projects and specifications where Igalia has been playing a big role... (Keep in mind that specifications move a lot more slowly and so generally have far fewer commits.)

  • HTML: Igalians were #3 among contributor commits to HTML with 8.94% so far this year (behind Google and Apple).
  • Web Assembly: Igalia is the #3 contributor to Web Assembly with 8.75% of the commits so far this year!
  • ARIA: So far this year, Igalia is the #1 contributor to the ARIA repo with 19.4% of commits!
  • NativeScript: Igalia is currently the #1 contributor so far this year to the NativeScript repository with 58.3% of all commits!
  • GStreamer: GStreamer is a widely used and powerful open-source multimedia framework. Igalia is the #2 contributor there!
  • VK-GL-CTS: The official Khronos OpenGL and Vulkan conformance test suite (graphics). It would be a massive understatement to say that Igalia has been a major contributor: We're the #1 contributor there with 31.1% of all commits.
  • Mesa: The Mesa 3D Graphics Library is huge and contains open source implementations of pretty much every graphical standard (Vulkan, as mentioned above, for example). Igalia is the #5 contributor there so far this year, contributing 6.62% of all commits.
  • Piglit: Piglit is an open-source test suite for OpenGL implementations. Igalia is the #5 contributor there with 6.86% of commits.

Wrapping up...

It's always amazing to me to look at the data. I hope it's interesting to others too. There are, of course, lots of reasons that all of the committers do what they do, but ultimately, open source development and maintenance benefits us all. The reason that Igalia is able to do all of this is that we are funded by a diverse array of clients building things downstream, with real needs.

You know where to find us...

August 10, 2023 04:00 AM

August 08, 2023

Víctor Jáquez

DMABuf modifier negotiation in GStreamer

It took almost a year of design and implementation but finally the DMABuf modifier negotiation in GStreamer is merged. Big kudos to all the people involved but mostly to He Junyan, who did the vast majority of the code.

What’s a DMAbuf modifier?

DMABufs are the Linux kernel mechanism to share buffers among different drivers or subsystems. A particular case of DMABuf are the DRM PRIME buffers, which are buffers shared by the Direct Rendering Manager (DRM) subsystem. They allow sharing video frames between devices with zero copies.

When we initially added support for DMABuf in GStreamer, we assumed that only color format and size mattered, just as with old video frames stored in system memory. But we were wrong. Besides color format and size, the memory layout also has to be considered when sharing DMABufs. By not considering it, the produced output had horrible tiled artifacts on screen. This memory layout is known as the modifier, and it's uniquely described by a uint64 number.

How was it designed and implemented?

First, we wrote a design document for caps negotiation with dmabuf modifiers, where we added a new color format (DMA_DRM) and a new caps field (drm-format). This new caps field holds a string, or a list of strings, composed of the tuple DRM_color_format:DRM_modifier.

Second, we extended the video info object to support DMABuf with helper functions that parse and construct the drm-format field.
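To give a feel for the new field, here is a tiny sketch that builds and reads back drm-format caps using only core GstCaps/GstStructure API; the dedicated helpers mentioned above live in the gst-plugins-base video library and are documented there.

#include <gst/gst.h>

int main(int argc, char **argv)
{
    gst_init(&argc, &argv);

    /* drm-format carries "DRM_fourcc:modifier"; the value below matches
     * the one negotiated in the vah264dec example further down. */
    GstCaps *caps = gst_caps_from_string(
        "video/x-raw(memory:DMABuf), format=DMA_DRM, "
        "drm-format=NV12:0x0100000000000002, "
        "width=1920, height=1080");

    const GstStructure *s = gst_caps_get_structure(caps, 0);
    g_print("drm-format: %s\n", gst_structure_get_string(s, "drm-format"));

    gst_caps_unref(caps);
    return 0;
}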

Third, we added the dmabuf caps negotiation in glupload. This part was the most difficult one, since the capability of importing DMABufs into OpenGL (which is only available through EGL/GLES) is defined at run time, by querying the hardware. Also, there are two code paths to import frames: direct or RGB-emulated. Direct import is the most efficient, but it depends on the presence of the GLES2 API in the driver, while RGB-emulated import treats each component as a separate RGB image. In the end, more than a thousand lines of code were added to the glupload element, besides the code added to the EGL context object.
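For reference, the kind of run-time query glupload has to perform looks conceptually like this sketch, which uses the EGL_EXT_image_dma_buf_import_modifiers extension directly. It is not the actual GStreamer code.

#include <string.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <drm_fourcc.h>

/* Ask EGL which modifiers it can import for a given DRM fourcc (NV12
 * here). Sketch only: real code also inspects the external_only flags
 * and falls back to RGB-emulated import when direct import fails. */
static EGLint query_nv12_modifiers(EGLDisplay dpy, EGLuint64KHR *mods,
                                   EGLint max_mods)
{
    EGLint num = 0;

    if (!strstr(eglQueryString(dpy, EGL_EXTENSIONS),
                "EGL_EXT_image_dma_buf_import_modifiers"))
        return 0;

    PFNEGLQUERYDMABUFMODIFIERSEXTPROC query_modifiers =
        (PFNEGLQUERYDMABUFMODIFIERSEXTPROC)
            eglGetProcAddress("eglQueryDmaBufModifiersEXT");

    if (!query_modifiers ||
        !query_modifiers(dpy, DRM_FORMAT_NV12, max_mods, mods, NULL, &num))
        return 0;

    return num;
}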

Fourth, and unexpectedly, waylandsink also got DMABuf caps negotiation support.

And lastly, the decoders in the va plugin merged their DMABuf caps negotiation support.

How can I test it?

You need, of course, to use the current main branch of GStreamer, since the work is fresh and there's no release with it yet. Then you need a box with VA support. And if you inspect, for example, vah264dec, you might see this output if your box is Intel (AMD through Mesa is also supported, though the negotiated memory is linear so far):

Pad Templates:
SINK template: 'sink'
Availability: Always
Capabilities:
video/x-h264
profile: { (string)main, (string)baseline, (string)high, (string)progressive-high, (string)constrained-high, (string)constrained-baseline }
width: [ 1, 4096 ]
height: [ 1, 4096 ]
alignment: au
stream-format: { (string)avc, (string)avc3, (string)byte-stream }

SRC template: 'src'
Availability: Always
Capabilities:
video/x-raw(memory:VAMemory)
width: [ 1, 4096 ]
height: [ 1, 4096 ]
format: NV12
video/x-raw(memory:DMABuf)
width: [ 1, 4096 ]
height: [ 1, 4096 ]
format: DMA_DRM
drm-format: NV12:0x0100000000000002
video/x-raw
width: [ 1, 4096 ]
height: [ 1, 4096 ]
format: NV12

What it's saying is that, for the memory:DMABuf caps feature, the drm-format to negotiate is NV12:0x0100000000000002.

Now some tests:

NOTE: These commands assume that va decoders are primary ranked (see merge request 2312), and that you’re in a Wayland session.

$ gst-play-1.0 --flags=0x47 video.file --videosink=waylandsink
$ GST_GL_API=gles2 gst-play-1.0 --flags=0x47 video.file --videosink=glimagesink
$ gst-play-1.0 --flags=0x47 video.file --videosink='glupload ! gtkglsink'

Right now it's required to add --flags=0x47 to playbin because, by default, playbin adds video filters that still don't negotiate the new DMABuf caps.

GST_GL_API=gles2 instructs GStreamer OpenGL to use the GLES2 API, which allows direct import of YUV images.

Thanks to all the people involved in this effort!

As usual, if you would like to learn more about DMABuf, VA-API, GStreamer or any other open multimedia framework, contact us!

by vjaquez at August 08, 2023 01:00 PM