Planet Igalia

February 23, 2024

Manuel Rego

Servo at FOSDEM 2024

Following the trend started in my last blog post, this is a post about Servo’s presence at a new event, this time FOSDEM 2024. This was my first time at FOSDEM, which is a special and different kind of conference, with lots of people and talks; it was quite an experience!

Picture of Rakhi during her Servo talk at FOSDEM with Tauri experiment slide in the background.
Servo talk by Rakhi Sharma at FOSDEM 2024

Embedding Servo in Rust projects #

My colleague Rakhi Sharma gave a great talk about Servo in the Rust devroom. The talk went deep into how Servo can be embedded in other Rust projects, showing the work on Servo’s Minibrowser and the latest experiments on integrating with Tauri through wry, their cross-platform WebView library. It’s worth highlighting that the latter has been sponsored by the NLnet Foundation; big thanks for your support! Attending FOSDEM was also nice because we met Daniel Thompson-Yvetot from Tauri, with whom we have been working as part of this collaboration, and discussed improvements and plans for the future.

Slides and video of the talk are already available online. Don’t miss them if you’re interested in the most recent news around Servo development and how to embed it in other projects.

Servo at Linux Foundation Europe stand #

Servo is a Linux Foundation Europe project, and they had a stand at FOSDEM. This allowed us to show Servo to people passing by and explain the ongoing development efforts on the project. There was also some nice Servo swag that people really liked. The stand gave us the chance to talk to different people about Servo; most of them were very interested in the project’s revival and hopeful about a promising future.

Picture of Linux Foundation Europe stand at FOSDEM 2024 with different Servo swag (stickers, flyers, buttons, cups, caps, teddy bears). Also in the picture there is a screen showing a video of Servo rendering WebGPU content.
Servo swag at Linux Foundation Europe stand at FOSDEM 2024

As part of our attendance at FOSDEM we had the chance to meet some people we have been working with recently. We met Taym Haddadi, an active Servo contributor; it was nice to meet people contributing to the project in real life. In addition, we talked briefly with the NLnet folks, who have one of the most popular stands, with stickers for all the projects they fund. Furthermore, we also had the chance to talk to the Outreachy folks, as we’re working on getting Servo back into the Outreachy internship program this year.

Other Igalia talks at FOSDEM 2024 #

Finally, I’d like to highlight that Igalia’s presence at FOSDEM was much bigger than just Servo-related work. There was a large group of Igalians participating in the event, and we had a bunch of talks there:

February 23, 2024 12:00 AM

February 20, 2024

Carlos García Campos

A Clarification About WebKit Switching to Skia

In the previous post I talked about the plans of the WebKit ports currently using Cairo to switch to Skia for 2D rendering. Apple ports don’t use Cairo, so they won’t be switching to Skia. I understand the post title was confusing, I’m sorry about that. The original post has been updated for clarity.

by carlos garcia campos at February 20, 2024 06:11 PM

Alex Bradbury

Clarifying instruction semantics with P-Code

I've recently had a need to step through quite a bit of disassembly for different architectures, and although some architectures have well-written ISA manuals, it can be a bit jarring switching between very different assembly syntaxes (like "source, destination" for AT&T vs "destination, source" for just about everything else), and tedious to look up different ISA manuals to clarify precise semantics. I've been using a very simple script to help convert an encoded instruction to a target-independent description of its semantics, and thought I may as well share it, along with some thoughts on its limitations.

instruction_to_pcode

The script is simplicity itself, thanks to the pypcode bindings to Ghidra's SLEIGH library, which provide an interface for converting an input to the P-Code representation. Articles like this one provide an introduction, and there's the reference manual in the Ghidra repo, but it's probably easiest to just look at a few examples. P-Code is used as the basis of Ghidra's decompiler and provides a consistent, human-readable description of the semantics of instructions for supported targets.

Here's an example aarch64 instruction:

$ ./instruction_to_pcode aarch64 b874c925
-- 0x0: ldr w5, [x9, w20, SXTW #0x0]
0) unique[0x5f80:8] = sext(w20)
1) unique[0x7200:8] = unique[0x5f80:8]
2) unique[0x7200:8] = unique[0x7200:8] << 0x0
3) unique[0x7580:8] = x9 + unique[0x7200:8]
4) unique[0x28b80:4] = *[ram]unique[0x7580:8]
5) x5 = zext(unique[0x28b80:4])

In the above you can see that the disassembly for the instruction is dumped, and then 5 P-Code instructions are printed showing the semantics. These P-Code instructions directly use the register names for architectural registers (as a reminder, AArch64 has 64-bit GPRs X0-X30 with the bottom halves accessible as W0-W30). Intermediate state is stored in unique[addr:width] locations. So the above instruction sign-extends w20, adds it to x9, reads a 32-bit value from the resulting address, then zero-extends it to 64 bits when storing into x5.
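As a plain-Python paraphrase (illustrative only, not part of the tool), the five P-Code steps above amount to:

```python
MASK64 = (1 << 64) - 1

def sext32(v):
    # sign-extend a 32-bit value to 64 bits (the P-Code sext)
    return (v | ~0xFFFFFFFF) & MASK64 if v & 0x80000000 else v

def ldr_w_sxtw(x9, w20, load32):
    # x5 = zext(*(u32 *)(x9 + (sext(w20) << 0)))
    addr = (x9 + sext32(w20)) & MASK64  # steps 0-3: sext, shift by 0, add
    return load32(addr) & 0xFFFFFFFF    # steps 4-5: 32-bit load, zext

# toy memory that returns a value with the high bit set
x5 = ldr_w_sxtw(0x1000, 0xFFFFFFFF, lambda addr: 0x80000000)
print(hex(x5))  # loads from 0xFFF (x9 - 1); x5 is zero-, not sign-extended
```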

The output is somewhat more verbose for architectures with flag registers, e.g. cmpb $0x2f,-0x1(%r11) produces:

$ ./instruction_to_pcode x86-64 --no-reverse-input "41 80 7b ff 2f"
-- 0x0: CMP byte ptr [R11 + -0x1],0x2f
0) unique[0x3100:8] = R11 + 0xffffffffffffffff
1) unique[0xbd80:1] = *[ram]unique[0x3100:8]
2) CF = unique[0xbd80:1] < 0x2f
3) unique[0xbd80:1] = *[ram]unique[0x3100:8]
4) OF = sborrow(unique[0xbd80:1], 0x2f)
5) unique[0xbd80:1] = *[ram]unique[0x3100:8]
6) unique[0x28e00:1] = unique[0xbd80:1] - 0x2f
7) SF = unique[0x28e00:1] s< 0x0
8) ZF = unique[0x28e00:1] == 0x0
9) unique[0x13180:1] = unique[0x28e00:1] & 0xff
10) unique[0x13200:1] = popcount(unique[0x13180:1])
11) unique[0x13280:1] = unique[0x13200:1] & 0x1
12) PF = unique[0x13280:1] == 0x0
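The flag computations are mechanical; as an illustrative plain-Python paraphrase (not generated by the tool), the CMP flag semantics above work out to:

```python
def cmp8_flags(a, b=0x2f):
    # mimic the x86 flag updates for an 8-bit CMP a, b
    res = (a - b) & 0xFF
    signed = lambda v: v - 0x100 if v & 0x80 else v
    return {
        "CF": a < b,                                     # unsigned borrow
        "OF": not -128 <= signed(a) - signed(b) <= 127,  # sborrow()
        "SF": bool(res & 0x80),                          # sign of result
        "ZF": res == 0,
        "PF": bin(res).count("1") % 2 == 0,              # even parity of low byte
    }

print(cmp8_flags(0x2f))  # equal operands: ZF set, CF clear
```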

But simple instructions that don't set flags do produce concise P-Code:

$ ./instruction_to_pcode riscv64 "9d2d"
-- 0x0: c.addw a0,a1
0) unique[0x15880:4] = a0 + a1
1) a0 = sext(unique[0x15880:4])

Other approaches

P-Code was an intermediate language I'd encountered before and of course benefits from having an easy to use Python wrapper and fairly good support for a range of ISAs in Ghidra. But there are lots of other options - angr (which uses Vex, taken from Valgrind) compares some options and there's more in this paper. Radare2 has ESIL, but while I'm sure you'd get used to it, it doesn't pass the readability test for me. The rev.ng project uses QEMU's TCG. This is an attractive approach because you benefit from more testing and ISA extension support for some targets vs P-Code (Ghidra support is lacking for RVV, bitmanip, and crypto extensions).

Another route would be to pull out the semantic definitions from a formal spec (like Sail) or even an easy to read simulator (e.g. Spike for RISC-V). But in both cases, definitions are written to minimise repetition to some degree, while when expanding the semantics we prefer explicitness, so would want to expand to a form that differs a bit from the Sail/Spike code as written.

February 20, 2024 12:00 PM

February 19, 2024

Carlos García Campos

WebKitGTK and WPEWebKit Switching to Skia for 2D Graphics Rendering

In recent years we have had an ongoing effort to improve the graphics performance of the WebKitGTK and WPE ports. As a result we shipped features like threaded rendering, the DMA-BUF renderer, and proper vertical retrace synchronization (VSync). While these improvements have helped keep WebKit competitive, and even perform better than other engines in some scenarios, it has been clear for a while that we were reaching the limits of what can be achieved with a CPU-based 2D renderer.

There was an attempt at making Cairo support GPU rendering, which did not work particularly well due to the library being designed around stateful operation based upon the PostScript model—resulting in a convenient and familiar API, great output quality, but hard to retarget and with some particularly slow corner cases. Meanwhile, other web engines have moved more work to the GPU, including 2D rendering, where many operations are considerably faster.

We checked all the available 2D rendering libraries we could find, but none of them met all our requirements, so we decided to try writing our own library. At the beginning it worked really well, with impressive results in performance even compared to other GPU based alternatives. However, it proved challenging to find the right balance between performance and rendering quality, so we decided to try other alternatives before continuing with its development. Our next option had always been Skia. The main reason why we didn’t choose Skia from the beginning was that it didn’t provide a public library with API stability that distros can package and we can use like most of our dependencies. It still wasn’t what we wanted, but now we have more experience in WebKit maintaining third party dependencies inside the source tree like ANGLE and libwebrtc, so it was no longer a blocker either.

In December 2023 we decided to give Skia a try internally and see whether it would be worth the effort of maintaining it as a third-party module inside WebKit. In just one month we had implemented enough features to run all MotionMark tests. The results on desktop were quite impressive, doubling the global MotionMark score. We still had to run more tests on embedded devices, which are the actual target of WPE, but it was clear that, at least on desktop, even this very initial implementation, which was not yet optimized (we kept our current architecture, which is optimized for CPU rendering), gave much better results. We decided that Skia was the option, so we continued working on it and running more tests on embedded devices. On the boards we tried we also got better results than CPU rendering, but the difference was not as big, which means that with less powerful GPUs, and with our current architecture designed for CPU rendering, GPU rendering was not that far ahead. That’s the reason we managed to keep WPE competitive on embedded devices; but Skia will not only bring performance improvements, it will also simplify the code and allow us to implement new features. So, we had enough data to make the final decision to go with Skia.

In February 2024 we reached a point where our internal Skia branch was in an “upstreamable” state, so there was no reason to continue working privately. We met with several teams from Google, Sony, Apple and Red Hat to discuss our intention to switch from Cairo to Skia, upstreaming what we had as soon as possible. We got really positive feedback from all of them, so we sent an email to the WebKit developers mailing list to make it public. Again we only got positive feedback, so we started preparing the patches to import Skia into WebKit, add the CMake integration, and land the initial Skia implementation for the WPE port, which is already in main.

We will continue working on the Skia implementation in upstream WebKit, and we also have plans to change our architecture to support the GPU rendering case more efficiently. We don’t have a deadline; it will be ready when we have implemented everything currently supported by Cairo, as we don’t plan to switch with regressions. We are focused on the WPE port for now, but at some point we will start working on GTK too, and other ports using Cairo will eventually start getting Skia support as well.

by carlos garcia campos at February 19, 2024 01:27 PM

February 16, 2024

Melissa Wen

Keep an eye out: We are preparing the 2024 Linux Display Next Hackfest!

Igalia is preparing the 2024 Linux Display Next Hackfest and we are thrilled to announce that this year’s hackfest will take place from May 14th to 16th at our HQ in A Coruña, Spain.

This unconference-style event aims to bring together the most relevant players in the Linux display community to tackle current challenges and chart the future of the display stack.

Key goals for the hackfest include:

  • Releasing the power of collaboration: We’ll work to remove bottlenecks and pave the way for smoother, more performant displays.
  • Problem-solving powerhouse: Brainstorming sessions and collaborative coding will target issues like HDR, color management, variable refresh rates, and more.
  • Building on past commitments: Let’s solidify the progress made in recent years and push the boundaries even further.

The hackfest fosters an intimate and focused environment to brainstorm, hack, and design solutions alongside fellow display experts. Participants will dive into discussions, tinker with code, and contribute to shaping the future of the Linux display stack.

More details are available on the official website.

Stay tuned! Keep an eye out for more information, mark your calendars and start prepping your hacking gear.

February 16, 2024 05:25 PM

February 14, 2024

Lucas Fryzek

A Dive into Vulkanised 2024

Vulkanised sign at Google’s office

Last week I had an exciting opportunity to attend the Vulkanised 2024 conference. For those of you not familiar with the event, it is “The Premier Vulkan Developer Conference”, hosted by the Vulkan working group from Khronos. With the excitement out of the way, I decided to write about some of the interesting information that came out of the conference.

A Few Presentations

My colleagues Iago, Stéphane, and Hyunjun each had the opportunity to present on some of their work into the wider Vulkan ecosystem.

Stéphane and Hyunjun presenting

Stéphane & Hyunjun presented “Implementing a Vulkan Video Encoder From Mesa to GStreamer”. They jointly talked about the work they performed to implement the Vulkan Video extensions in Intel’s ANV Mesa driver as well as in GStreamer. This was an interesting presentation because you got to see how the new Vulkan Video extensions affected both driver developers implementing the extensions and application developers making use of them for real-time video decoding and encoding. Their presentation is available on vulkan.org.

Iago presenting

Later my colleague Iago presented jointly with Faith Ekstrand (a well-known Linux graphics stack contributor from Collabora) on “8 Years of Open Drivers, including the State of Vulkan in Mesa”. They talked about the current state of Vulkan in the open source driver ecosystem and some of the benefits open source drivers have been able to take advantage of, like the common Vulkan runtime code and a shared compiler stack. You can check out their presentation for all the details.

Besides Igalia’s presentations, there were several more that I found interesting, with topics such as Vulkan developer tools, experiences of using Vulkan in real-world applications, and even how to teach Vulkan to new developers. Here are some highlights from a few of them.

Using Vulkan Synchronization Validation Effectively

John Zulauf gave a presentation on the Vulkan synchronization validation layers that he has been working on. If you are not familiar with these, you should really check them out. They work by tracking how resources are used inside Vulkan and providing error messages, with some hints, if you use a resource in a way that is not properly synchronized. They can’t catch every error, but they are a great tool in the toolbelt of Vulkan developers to make their lives easier when debugging synchronization issues. As John said in the presentation, synchronization in Vulkan is hard, and nearly every application he tested the layers on revealed a synchronization issue, no matter how simple it was. He can proudly say he is a vkQuake contributor now because of these layers.

6 Years of Teaching Vulkan with Example for Video Extensions

This was an interesting presentation from a professor at the University of Vienna about his experience teaching graphics as well as game development to students who may have little real programming experience. He covered the techniques he uses to make learning easier, as well as the resources he relies on. This would be a great presentation to check out if you’re trying to teach Vulkan to others.

Vulkan Synchronization Made Easy

Another presentation focused on Vulkan sync, but instead of debugging it, Grigory showed how his graphics library abstracts sync away from the user without implementing a render graph. He presented an interesting technique, similar to how the sync validation layers work, for ensuring that resources are always synchronized before use. If you’re building your own engine in Vulkan, this is definitely something worth checking out.

Vulkan Video Encode API: A Deep Dive

Tony at Nvidia did a deep dive into the new Vulkan Video extensions, explaining a bit about how video codecs work and including a roadmap for future codec support in the video extensions. Especially interesting for us was the nice call-out he made to Igalia and our work on Vulkan Video CTS and open source driver support on slide 6 :)

Thoughts on Vulkanised

Vulkanised is an interesting conference that gives you the intersection of people working on Vulkan drivers, game developers using Vulkan for their graphics backend, visual FX tool developers using Vulkan-based tools in their pipeline, industrial application developers using Vulkan for some embedded commercial systems, and general hobbyists who are just interested in Vulkan. As an example of some of these interesting audience members, I got to talk with a member of the Blender foundation about his work on the Vulkan backend to Blender.

Lastly, the event was held at Google’s offices in Sunnyvale, which I’m always happy to travel to, not just for the better weather (coming from Canada), but also for the amazing restaurants and food in the Bay Area!

Great Bay Area food

February 14, 2024 05:00 AM

February 13, 2024

Víctor Jáquez

GstVA library in GStreamer 1.22 and some new features in 1.24

I know, it’s old news, but still, I had been meaning to write about the GstVA library to clarify its purpose and scope. I didn’t want to have a library for GstVA, because maintaining a library is a tough duty. I learnt that the hard way with GStreamer-VAAPI. Therefore, I wanted GstVA to be as simple as possible, direct in its libva usage, and self-contained.

As far as I know, the main use of the GStreamer-VAAPI library was to get access to the VASurface from the GstBuffer, for example with appsink. Since the beginning of GstVA, a mechanism to access the surface ID has been provided, just as GStreamer OpenGL offers access to the texture ID: through a flag when mapping a GstGL memory backed buffer. You can see how this mechanism is used in one of the GstVA sample apps.

Nevertheless, later another use case appeared which demanded a library for GstVA: other GStreamer elements needed to produce or consume VA surface backed buffers. Or, to say it more concretely, they needed to use the GstVA allocator and buffer pool, and the mechanism to share the GstVA context along the pipeline too. The plugins with those VA-related elements are msdk and qsv, when they operate on Linux. Both use Intel oneVPL, so they are basically competitors. The main difference is that the former is maintained by Intel and has more features, especially on Linux, whilst the latter is maintained by Seungha Yang and appears to be better tested on Windows.

These are the objects exposed by the GstVA library API:

GstVaDisplay #

GstVaDisplay represents a VADisplay, the interface between the application and the hardware accelerator. This class is abstract and is not supposed to be instantiated directly; instantiation has to go through its derived classes.

This class is shared among all the elements in the pipeline via GstContext, so all the plugged elements share the same connection to the hardware accelerator, unless a plugged element in the pipeline has a specific device name, as in the case of multi-GPU systems.

Let’s talk a bit about multi-GPU systems. This is the gst-inspect-1.0 output of a system with an Intel and an AMD GPU, both with VA support:

$ gst-inspect-1.0 va
Plugin Details:
Name va
Description VA-API codecs plugin
Filename /home/igalia/vjaquez/gstreamer/build/subprojects/gst-plugins-bad/sys/va/libgstva.so
Version 1.23.1.1
License LGPL
Source module gst-plugins-bad
Documentation https://gstreamer.freedesktop.org/documentation/va/
Binary package GStreamer Bad Plug-ins git
Origin URL Unknown package origin

vaav1dec: VA-API AV1 Decoder in Intel(R) Gen Graphics
vacompositor: VA-API Video Compositor in Intel(R) Gen Graphics
vadeinterlace: VA-API Deinterlacer in Intel(R) Gen Graphics
vah264dec: VA-API H.264 Decoder in Intel(R) Gen Graphics
vah264lpenc: VA-API H.264 Low Power Encoder in Intel(R) Gen Graphics
vah265dec: VA-API H.265 Decoder in Intel(R) Gen Graphics
vah265lpenc: VA-API H.265 Low Power Encoder in Intel(R) Gen Graphics
vajpegdec: VA-API JPEG Decoder in Intel(R) Gen Graphics
vampeg2dec: VA-API Mpeg2 Decoder in Intel(R) Gen Graphics
vapostproc: VA-API Video Postprocessor in Intel(R) Gen Graphics
varenderD129av1dec: VA-API AV1 Decoder in AMD Radeon Graphics in renderD129
varenderD129av1enc: VA-API AV1 Encoder in AMD Radeon Graphics in renderD129
varenderD129compositor: VA-API Video Compositor in AMD Radeon Graphics in renderD129
varenderD129deinterlace: VA-API Deinterlacer in AMD Radeon Graphics in renderD129
varenderD129h264dec: VA-API H.264 Decoder in AMD Radeon Graphics in renderD129
varenderD129h264enc: VA-API H.264 Encoder in AMD Radeon Graphics in renderD129
varenderD129h265dec: VA-API H.265 Decoder in AMD Radeon Graphics in renderD129
varenderD129h265enc: VA-API H.265 Encoder in AMD Radeon Graphics in renderD129
varenderD129jpegdec: VA-API JPEG Decoder in AMD Radeon Graphics in renderD129
varenderD129postproc: VA-API Video Postprocessor in AMD Radeon Graphics in renderD129
varenderD129vp9dec: VA-API VP9 Decoder in AMD Radeon Graphics in renderD129
vavp9dec: VA-API VP9 Decoder in Intel(R) Gen Graphics

22 features:
+-- 22 elements

As you can observe, each card registers its supported elements. The first GPU, the Intel one, doesn’t insert renderD129 after the va prefix, while the AMD Radeon does. As you might imagine, renderD129 is the device name in /dev/dri:

$ ls /dev/dri
by-path card0 card1 renderD128 renderD129

The appended device name identifies the DRM device the elements use, and it is only appended from the second GPU onwards. This allows the untagged elements to be used either with the first GPU or with a wrapped display injected by the user application, such as a VA X11 or Wayland connection.

Notice that sharing DMABuf-based buffers between different GPUs is theoretically possible, but not assured.

Keep in mind that, currently, nvidia-vaapi-driver is not supported by GstVA, and the driver will be ignored by the plugin register since 1.24.

VA allocators #

There are two types of memory allocators in GstVA:

  • VA surface allocator. It allocates a GstMemory that wraps a VASurfaceID, which represents a complete frame directly consumable by VA-based elements. Thus, a GstBuffer will hold one and only one GstMemory of this type. These buffers are the very same ones used as system memory buffers, since the surfaces can generally map their content to CPU memory (unless they are encrypted, but GstVA hasn’t been tested for that use case).
  • VA-DMABuf allocator. It descends from GstDmaBufAllocator, though it’s not an allocator in the strict sense, since it only imports DMABufs as VA surfaces and exports VA surfaces as DMABufs. Notice that a single VA surface can be exported as multiple DMABuf-backed GstMemories and, conversely, a VA surface can be imported from multiple DMABufs.

Also, VA allocators offer a couple of methods that can be useful for applications that use appsink, for example gst_va_buffer_get_surface and gst_va_buffer_peek_display.

One particularity of both allocators is that they keep an internal pool of the allocated VA surfaces, since the mapping of buffers to memories is not always 1:1, as in the DMABuf case. This lets them reuse surfaces even if the buffer pool ditches them, and it avoids surface leaks by keeping track of all of them all the time.
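The idea is the same as any keep-alive resource pool; a toy Python sketch of the concept (illustrative only, not the actual GstVA code):

```python
class SurfacePool:
    # Toy model of an allocator-side surface cache: released surfaces
    # are recycled instead of destroyed, and every surface is tracked.
    def __init__(self, create, destroy):
        self._create, self._destroy = create, destroy
        self._free, self._in_use = [], set()

    def acquire(self):
        surface = self._free.pop() if self._free else self._create()
        self._in_use.add(surface)
        return surface

    def release(self, surface):
        # the buffer pool may ditch its buffer, but the surface survives
        self._in_use.discard(surface)
        self._free.append(surface)

    def drain(self):
        # destroying everything here is what prevents surface leaks
        for surface in self._free:
            self._destroy(surface)
        self._free.clear()
```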

GstVaPool #

This class is only useful for other GStreamer elements capable of consuming VA surfaces, so that they can instantiate, configure and propose a VA buffer pool; it’s not meant for applications.

And that’s all for now. Thank you for bearing with me up to here. But remember, the API of the library is unstable and can change at any time. So, no promises are made :)

February 13, 2024 12:00 AM

February 12, 2024

Brian Kardell

StyleSheet Parfait


In this post I'll talk about some interesting things (some people might pronounce this 'footguns') around adoptedStyleSheets, conversations and thoughts around open styling problems and @layer.

If you're not familiar with adopted stylesheets, they're a way for your shadow roots to share literal stylesheet instances by reference. That's a cool idea, right? If you have 10 fancy-inputs on the same page, it makes no sense for each of them to have their own copy of the whole stylesheet.

It's fairly early days for this still (in standards and support terms, at least) and we'll put improvements on top of this, but for now it is a pretty basic JavaScript API: Every shadow root now has an .adoptedStyleSheets property, which is an array. You can push stylesheets onto it or just assign an array of stylesheets. Currently those stylesheets have to be instances created via the recently introduced constructor new CSSStyleSheet().

Cool.

Steve Orvell opened an issue suggesting that I make half-light use adopted stylesheets. Sure, why not. In practice what this really saves is mainly the parse time, since browsers are pretty good at optimizing this otherwise, but that's still important and, in fact, it made the code more concise as well.

Whoopsie

However, there is an important bit about adopted stylesheets that I forgot about when I implemented this initially (which is strange because I am on the record discussing it in CSSWG): adopted stylesheets are treated as if they come after any stylesheets in the (shadow) root.

Previously, half-light (and earlier experiments) took great care to put stylesheets from the outer page before any that the component itself provides. That seems right to me, and what was happening now with adopted stylesheets seemed wrong...

Enter: Layers

A fairly easy solution to this newly created problem is to wrap the adopted rules with @layer. Then, if your component has a style element, the rules in there will win by default. And that's true even if the rules the page author pushed in had higher specificity! That's a pretty nice improvement. If some code helps you understand, here's a pen that illustrates it all:

See the Pen adoptedstylesheets and layers by вкαя∂εℓℓ (@briankardell) on CodePen.

Layers... Like an ogre.

Eric and I recently did a podcast on the Open Styleable shadow roots topic with Mia. Some of the thoughts she shared, and later conversations that followed with Westbrook and Nolan, convinced me to try to explore how we could use layers in half-light.

It had me thinking that the way that adopted stylesheets and layers both work seems to allow that we could develop some kind of 'shadow styling protocol' here. Maybe the simplest way to do this is to just give the layer a well-known name: I called it --crossroot

Now, when a page author uses half-light to set a style like:

@media --crossroot { 
  h1 { ... }
}

This is adopted into shadow roots as:

@layer --crossroot { 
  h1 { ... }
}

That is a lower layer than anything in the default layer of a component's shadow root (both stylesheets or adopted stylesheets).

The maybe-interesting part this adds is that the component itself can consciously manage its layers if it chooses to do so! For example..

this.shadowRoot.innerHTML = `
  <style>
    @layer base, --crossroot, main;
    ... 
    /* add rules to those layers  */
    ... 
  </style>
  ${some_html}
`

If that sounds a little confusing, it's really not too bad - what it means is that, going from least to most specific, rules would evaluate roughly like:

  1. User Agent styles.
  2. Page authored rules that inherit into the Shadow DOM.
  3. @layers (including the --crossroot layer half-light provides) in the shadow.
  4. Rules in style elements in the shadow that aren't in a layer.

Combining ideas...

There was also some unrelated feedback from Nolan that this approach was a non-starter for some use cases - like pushing Bootstrap down to all of the shadow roots. The previous shadow-boxing library would have supported that better. Luckily, though, we've built this on Media Queries, so CSS has that pretty well figured out - all we have to do is add support in half-light for Media Queries in link and style tags (in the head) as well. Easy enough, and I agree that's a good improvement. So, now you can also write markup like this:

<link rel="stylesheet" href="../prism.css" media="screen, --crossroot"></link>

<!-- or to target shadows of specific elements, add a selector... -->
<link rel="stylesheet" href="../prism.css" media="screen, (--crossroot x-foo)"></link>

So... That's it, this is all in half-light now... And guess what? It's still only 95 lines of code. Thanks for all the feedback so far! So, wdyt? Don't forget to leave me an indication of how you're feeling about it with an emoji (and/or comment) on the Emoji sentiment or short comment issue.

Very special thanks to everyone who has commented and shared thoughts constructively along the way, even when they might not agree. If you actually voted in the emoji sentiment poll: ❤. Thanks especially to Mia who has been great to discuss/review and improve ideas with.

February 12, 2024 05:00 AM

February 05, 2024

Eric Meyer

Bookmarklet: Load All GitHub Comments

What happened was, Brian and I were chatting about W3C GitHub issues and Brian mentioned how really long issues are annoying to search and read, because GitHub has this thing where if there are too many comments on an issue, it snips out the middle with a “Load more…” button that’s very tastefully designed and pretty easy to miss if you’re quick-scrolling to try to catch up.  The squiggle-line would be a good marker, if it weren’t so tasteful as to blend into the background in a way that makes the Baby WCAG cry.

And what’s worse, from this perspective, is that if the issue has been discussed to a very particular kind of death, the “Load more…” button can have more “Load more…” buttons hiding within.  So even if you know there was an interesting comment, and you remember a word or two of it, page-searching in your browser will do no good if the comment in question is buried one or more XMLHTTPRequest calls deep.

“I really wish GitHub had an ‘expand all comments’ button at the top or something,” Brian said (or words to that effect).

Well, it was a Friday afternoon and I was feeling code-hacky, so I wrote a bookmarklet.  Here it is in easy-to-save hyperlink form:

GitHub issue loader

It waits half a second after you activate it to find all the buttons on the page (in my test runs, usually six hundred of them).  Then it looks through all the buttons to find the ones that have a textContent of “Load more…” and dispatches a click event to each one.  With that done, it waits five seconds and does it all again, waits five seconds to do it again, and so on.  Once it finds there are zero buttons with the “Load more…” textContent, it exits.  And, if five seconds is too quick due to slow loading times, you can always invoke the bookmarklet again should you come across a “Load more…” button.
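Based on that description, the heart of such a bookmarklet might look something like this (a hypothetical reconstruction, not Eric’s exact code):

```javascript
// Hypothetical reconstruction of the behavior described above, not the
// original bookmarklet: find the "Load more…" buttons, click them, and
// retry every five seconds until none remain.
function findLoadMoreButtons(buttons) {
  return Array.from(buttons).filter(
    (b) => b.textContent.trim() === "Load more…"
  );
}

function start() {
  const targets = findLoadMoreButtons(document.querySelectorAll("button"));
  for (const b of targets) {
    b.dispatchEvent(new MouseEvent("click", { bubbles: true }));
  }
  if (targets.length > 0) setTimeout(start, 5000); // comments may still be loading
}

if (typeof document !== "undefined") setTimeout(start, 500); // let the page settle
```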

If you want this ability for yourself, just drag the link above into your bookmark toolbar or bookmarks menu, and whenever you load up a mega-thread GitHub issue, fire the bookmarklet to load all the comments.  I imagine there may be cleaner ways to do this, but I was able to codeslam this in about 15 minutes using ViolentMonkey on live GitHub pages, and it does the thing.

I did consider complexifying the ViolentMonkey script so that any GitHub page is scanned for the “Load more…” button, and if one is present, then a “Load all comments” button is plopped into the top of the page, but I knew that would take at least another 15 minutes and my codeslam window was closing.  Also, it would require anyone using it to run ViolentMonkey (or equivalent) all the time, whereas the bookmarklet has zero impact unless the user invokes it.  If you want to extend this into something more than it is and share your solution with the world, by all means feel free.

The point of all this being, if you too wish GitHub had an easy way to load all the comments without you having to search for the “Load more…” button yourself, now there’s a bookmarklet made just for you.  Enjoy!



by Eric Meyer at February 05, 2024 02:49 PM

January 29, 2024

Stephen Chenney

The CSS Highlight Inheritance Model

The CSS highlight inheritance model describes the process for inheriting the CSS properties of the various highlight pseudo elements:

  • ::selection controlling the appearance of selected content
  • ::spelling-error controlling the appearance of misspelled word markers
  • ::grammar-error controlling how grammar errors are marked
  • ::target-text controlling the appearance of the string matching a target-text URL
  • ::highlight defining the appearance of a named highlight, accessed via the CSS highlight API

The inheritance model described here was proposed and agreed upon in 2022, and is part of the CSS Pseudo-Elements Module Level 4 Working Draft Specification, but ::selection was implemented and widely used long before this, with browsers differing in how they implemented inheritance for selection styles. When this model is implemented and released to users, the inheritance of CSS properties for the existing pseudo elements, particularly ::selection, will change in a way that may break sites.

The model is implemented behind a flag in Chrome 118 and higher, and is expected to be enabled by default as early as Chrome 123 in February 2024. To see the effects right now, enable “Experimental Web Platform features” via chrome://flags.

::selection, the Original highlight Pseudo #

Historically, ::selection, or even -moz-selection, was the only pseudo element available to control the highlighting of text, in this case the appearance of selected content. Sites use ::selection when they want to modify the default browser selection painting, most often adjusting the color to improve contrast or convey additional meaning to users. Sites also use the feature to work around selection painting problems in browsers, with the style .broken::selection { background-color: transparent; } and a .broken class on elements with broken selection painting. We’ll use this last situation as the example as we demonstrate the changes to property inheritance.

The ::selection pseudo element as implemented in most browsers uses originating element inheritance, meaning that the selection inherits properties from the style of the nearest element to which the selection is being applied. The inheritance behavior of ::selection has never been fully compatible among browsers, and has failed to comply with the CSS specification even as the spec has been updated over time. As a result, workarounds are required when a page-wide selection style is inadequate, as demonstrated with this example (works when highlight inheritance is not enabled):

<style>
::selection /* = *::selection (universal) */ {
  background-color: lightgreen;
}
.broken::selection {
  background-color: transparent;
}
</style>
<p>Some <em>not broken</em> text</p>
<p class="broken">Some <em>broken</em> text</p>
<script>
range = new Range();
range.setStart(document.body, 0);
range.setEnd(document.body, 3);
document.getSelection().addRange(range);
</script>

The intent of the page author is probably to have everything inside .broken be invisible for selection, but that does not happen. The problem here is that ::selection applies to all elements, including the <em> element. While the .broken element uses its own ::selection, that only applies to the element itself when selected, not its descendants.

You could always add more specific ::selection selectors, such as .broken > em::selection, but that is brittle to DOM changes and leads to style bloat.
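For illustration, that brittle approach would look something like this (a hypothetical sketch matching the markup above; each new descendant element type needs its own rule):

```css
/* Hypothetical sketch of the brittle approach: every descendant element
   type inside .broken needs its own ::selection rule. */
::selection {
  background-color: lightgreen;
}
.broken::selection,
.broken > em::selection {
  background-color: transparent;
}
```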

You could also use a CSS custom property for the page-wide selection color, and just make it transparent for the broken elements, like this (works when highlight inheritance is not enabled):

<style>
:root {
  --selection-color: lightgrey;
}
::selection {
  background-color: var(--selection-color);
}
.broken {
  --selection-color: transparent;
}
</style>
<p>Some <em>not broken</em> text</p>
<p class="broken">Some <em>broken</em> text</p>
<script>
range = new Range();
range.setStart(document.body, 0);
range.setEnd(document.body, 3);
document.getSelection().addRange(range);
</script>

In this case, we define a page-wide selection color as a custom property referenced in a ::selection rule that matches all elements. The more specific .broken rule re-defines the custom property to make the selection transparent. With originating inheritance, the ::selection for the word “broken” inherits the custom property value from the <em> element (its originating element), which in turn inherits the custom property value from the .broken element. This custom property workaround is how sites such as GitHub have historically worked around the problems with originating inheritance for ::selection.

New highlighting Features Change the Trade-Offs #

The question facing the CSS Working Group, the people who set the standards for CSS, was whether to make the specification match the implementations using originating inheritance, or require browsers to change to use the specified highlight inheritance model. The question sat on the back-burner for a while because web developers were not complaining very loudly (there was a workaround once custom properties were implemented) and there was uncertainty around whether the change would break sites. Some new features in CSS changed the calculus.

First came a set of additional CSS highlight pseudo elements that pushed for a resolution to the question of how they should inherit. Given that the new pseudos were to be implemented according to the spec, while ::selection was not, there was the possibility for ::selection inheritance behavior to be inconsistent with that of the new highlight pseudos:

  • ::target-text for controlling how URL text-fragments targets are rendered
  • ::spelling-error for modifying the appearance of browser detected spelling errors
  • ::grammar-error for modifying browser detected grammar errors
  • ::highlight for defining persistent highlights on a page

In addition, the set of properties allowed in ::selection and other highlight pseudos was expanded to allow for text decorations and control of fill and stroke colors on text, and maybe text-shadow. More properties make the CSS custom property workaround for originating inheritance unwieldy in practice because it expands the set of variables needed.

The decision was made to change browsers to match the spec, thus changing the way that ::selection inheritance works in browsers. ::selection was the only widely used highlight pseudo and authors expected it to use originating inheritance, so any change to the inheritance behavior was likely to cause problems for some sites.

What is highlight inheritance? #

The CSS highlight inheritance model starts with defining a highlight inheritance tree for each type of highlight pseudo. These trees are parallel to the DOM element tree with the same branching structure, except the nodes in the tree now represent the highlight styles for each element. The highlight pseudos inherit their styles through the ancestor chain in their highlight inheritance tree, with body::selection inheriting from html::selection and so on.

Under this model, the original example behaves the same because the ::selection rule still matches all elements:

<style>
::selection {
  background-color: lightgreen;
}
.broken::selection {
  background-color: transparent;
}
</style>
<p>Some <em>not broken</em> text</p>
<p class="broken">Some <em>broken</em> text</p>

However, with highlight inheritance it is recommended that page-wide selection styles be set in :root::selection or body::selection.

<style>
:root::selection {
  background-color: lightgreen;
}
.broken::selection {
  background-color: transparent;
}
</style>
<p>Some <em>not broken</em> text</p>
<p class="broken">Some <em>broken</em> text</p>
<script>
range = new Range();
range.setStart(document.body, 0);
range.setEnd(document.body, 3);
document.getSelection().addRange(range);
</script>

The :root::selection is at the root of the highlight inheritance tree, because it is the selection pseudo for the document root element. All selection pseudos in the document now inherit from their parent in the highlight tree, and do not need to match a universal ::selection rule. Furthermore, a change in ::selection matching a particular element will also be inherited by all its descendants.

In this example, there are no rules matching the <em> elements, so they look at their parent selection pseudo. The “not broken” text will be highlighted in lightgreen inherited from the root via its parent chain. The “broken” text receives a transparent background inherited from its parent <p> that matches its own rule.

The Custom Property Workaround Now Fails #

The most significant breakage caused by the change to highlight inheritance is due to sites employing the custom property workaround for originating inheritance. Consider our former example with that workaround:

<style>
:root {
  --selection-color: lightgrey;
}
:root::selection {
  background-color: var(--selection-color);
}
.broken {
  --selection-color: transparent;
}
</style>
<p>Some <em>not broken</em> text</p>
<p class="broken">Some <em>broken</em> text</p>

According to the original specification, and the initial implementation in chromium, the default selection color is used everywhere in this example because the custom --selection-color: lightgrey; property is defined on :root which is the root of the DOM tree but not the highlight tree. The latter is rooted at :root::selection which does not define the property.

Many sites use :root as the location for CSS custom properties. Upvoted answers on Stack Overflow explicitly recommend doing so for selection colors. Unsurprisingly, many sites broke when chromium initially enabled highlight inheritance, including GitHub. To fix this, the specification was changed to require that :root::selection inherit custom properties from :root.

But all is not well. The <em> element’s ::selection has its background set to the custom property value, which is now inherited from :root via :root::selection. But because the inheritance chain does not involve .broken at all, the custom property is evaluated against the :root value. To make this work, the custom property must be overridden in a .broken::selection rule:

:root {
  --selection-color: lightgrey;
}
:root::selection {
  background-color: var(--selection-color);
}
.broken::selection {
  --selection-color: transparent;
  background-color: var(--selection-color);
}

Note that there’s really no point using custom property overrides for highlights, unless you desire different ::selection styles in different parts of the document and want them to use a common set of property values that you would define on the root.

The Current Situation (at the time of writing) #

As of this writing, highlight inheritance is used for all highlight pseudos except ::selection and ::target-text in chromium-based browsers (other browsers are still implementing the additional pseudo elements). Highlight inheritance is also enabled for ::selection and ::target-text for users that have “Experimental Web Platform features” enabled via chrome://flags in Chrome.

Chrome is trying to enable highlight inheritance for ::selection. One attempt to enable for all users was rolled back, and another attempt will be made in an upcoming release. The current plan is to iterate until the change sticks. That is, until site authors notice that things have changed and make suitable site changes.

If inertia is too strong, additional changes to the spec may be needed to support, in particular, the continued use of the existing custom property workaround.

Thanks #

The implementation of highlight inheritance in chromium was undertaken by Igalia S.L. funded by Bloomberg L.P. Delan Azabani did most of the work, with contributions from myself.

January 29, 2024 12:00 AM

January 26, 2024

Stephen Chenney

CSS Spelling and Grammar Styling

CSS Spelling and Grammar features enable sites to customize the appearance of the markers that browsers render when a spelling or grammar error is detected, and to control the display of spelling and grammar markers that the site itself defines.

  • The text decoration spelling-error and grammar-error line types render native spelling or grammar markers controlled by CSS. The primary use case is for site-specific spell-check or grammar systems, allowing sites to render native looking error markers for the errors the site itself detects, rather than errors the browser detects. They are part of the CSS Text Decoration Module Level 4 Working Draft Specification.
  • The ::spelling-error and ::grammar-error pseudo elements allow sites to apply custom styling to the browser rendered spelling and grammar error markers. The target use case enables sites to override the native markers should custom markers offer a better user experience on the site. They are part of the CSS Pseudo-Elements Module Level 4 Working Draft Specification.

These features are available for all users in Chrome 121 and upward, and corresponding Edge versions. You can also enable “Experimental Web Platform features” with chrome://flags in earlier versions. The features should be coming soon to Safari and Firefox.

CSS Text Decorations for Spelling and Grammar Errors #

Imagine a site that uses a specialized dictionary for detecting spelling errors in user content. Maybe users are composing highly domain specific content, like botany plant names, or working in a language not well represented by existing spelling checkers. To provide an experience familiar to users, the site might wish to use the native spelling and grammar error markers with errors detected by the site itself. Before the error line types for CSS Text Decoration were available, only the browser could decide what received native markers.

Here’s an example of how you might control the rendering of spelling markers.

<style>
.spelling-error {
  text-decoration-line: spelling-error;
}
</style>
<div id="content">Some content that has a misspeled word in it</div>
<script>
const range = document.createRange();
range.setStart(content.firstChild, 24);
range.setEnd(content.firstChild, 33);
const newSpan = document.createElement("span");
newSpan.classList.add("spelling-error");
range.surroundContents(newSpan);
</script>

How it works:

  • Create a style for the spelling errors. In this case a spelling error will have a text decoration added with the line type for native spelling errors.
  • As the user edits contents, JS can track the text, process it for spelling errors, and then trigger a method to mark the errors.
  • The script above is one way to apply the style to the error, assuming you have the start and end offsets of the error within its Text node. A range is created with the given offsets, and then the contents of the range are pulled out of the Text and put within a span, which is then re-inserted where the range once was. The span gets the spelling-error style.

The result will vary across browser and platform, depending on the native conventions for marking errors.

Using other text-decoration properties or, rather, not #

All other text decoration properties are ignored when text-decoration-line: spelling-error or grammar-error are used. The native markers rendered for these decorations are not, in general, styleable, so there would be no clear way to apply additional properties (such as text-decoration-color or text-decoration-thickness).

Spelling and Grammar Highlight Pseudos #

The ::spelling-error and ::grammar-error CSS pseudo elements allow sites to customize the markers rendered by the browser for browser-detected errors. The native markers are not rendered; they are replaced by whatever styling the pseudo element defines, applied to the text content of the error.

For the first example, let’s define a red background tint to replace the native spelling markers:

<style>
::spelling-error {
  background-color: rgba(255, 0, 0, 0.5);
  text-decoration-line: none;
}
</style>
<textarea id="content">Some content that has a misspeled word in it</textarea>
<script>
content.focus();
</script>

Two things to note:

  • The example defines the root spelling error style, applying to all elements. In general, sites should use a single style for spelling errors to avoid user confusion.
  • The text-decoration-line property is set to none to suppress the native spelling error marker. Without this, the style would inherit the text-decoration-line property from the user agent, which defines text-decoration-line: spelling-error.

A limited range of properties may be used in ::spelling-error and ::grammar-error; see the spec for details. The limits address two potential concerns:

  • The spelling markers must not affect layout of text in any way, otherwise the page contents would shift when a spelling error was detected.
  • The spelling markers must not be too expensive to render; see Privacy Concerns below.

The supported properties include text decorations. Here’s another example:

<style>
::spelling-error {
  text-decoration: 2px wavy blue underline;
}
textarea {
  width: 200px;
  height: 200px;
}
</style>
<textarea id="content">Some content that has a misspeled word in it</textarea>
<script>
content.focus();
</script>

One consequence of user agent styling is that you must provide a line type for any custom text decoration. Without one, the highlight will inherit the native spelling error line type. As discussed above, all other properties are ignored with that line type. For example, this renders the native spelling error, not the defined one:

<style>
::spelling-error {
  text-decoration-style: solid;
  text-decoration-color: green;
}
</style>
<textarea id="content">Some content that has a misspeled word in it</textarea>
<script>
content.focus();
</script>

Accessibility Concerns #

Modifying the appearance of user agent features, such as spelling and grammar errors, has significant potential to hurt users through poor contrast, small details, or other accessibility problems. Always maintain good contrast with all the backgrounds present on the page. Ensure that error markers are unambiguous. Include spelling and grammar errors in your accessibility testing.

CSS Spelling and Grammar styling is your friend when the native error markers pose accessibility problems, such as poor contrast. It’s arguably the strongest motivation for supplying the features in the first place.

Privacy Concerns #

User dictionaries contain names and other personal information, so steps have been taken to ensure that sites cannot observe the presence or absence of spelling or grammar errors. Computed style queries in Javascript will report the presence of custom spelling and grammar styles regardless of whether or not they are currently rendered for an element.

There is some potential for timing attacks that render potential spelling errors and measure paint timing differences. This is mitigated in two ways: browsers do not report accurate timing data, and the time to render error markers is negligible in the overall workload of rendering the page. The set of allowed properties for spelling and grammar pseudos is also limited to avoid the possibility of very expensive rendering operations.

Thanks #

The implementation of CSS Spelling and Grammar Features was undertaken by Igalia S.L. funded by Bloomberg L.P.

January 26, 2024 12:00 AM

January 25, 2024

Brian Kardell

half-light

half-light

Evolving ideas on "open stylable" issues, a new proposal.

Recently I wrote a piece called Lovely Trees in which I described how the Shadow DOM puts a lot of people off because the use cases it's designed around thus far aren't the ones most authors feel like they have. That's a real shame because there is a lot of usefulness there that is just lost. People seem to want something just a little less shadowy.

However, there are several slightly different visions for what it is we want, and how to achieve it. Each offers significantly different details and implications.

In that piece I also noted that we can make any of these possible through script in order to explore the space, gain practical experience with and iterate on until we at least have a pretty good idea what will work really well.

To this end I offered a library that I called "shadow-boxing" with several possible "modes" authors could explore.

"Feels like the wrong place"

Many people seemed to think this would be way better expressed in CSS itself somehow, rather than metadata in your HTML.

There is a long history here. Originally there were combinators for crossing the shadow boundary, but these were problematic and removed.

However, the fact that this kept coming up in different ways made me continue to discuss and bounce possible ideas around. I made a few rough takes, shared them with some people, and thanks to Mia and Dave Rupert for some good comments and discussion, today I'm adding a separate library which I think will make people much happier.

Take 2: half-light.js

half-light.js is a very tiny (~100 LoC) library that lets you express styles for shadow DOMs in your page's CSS (it should be CSS inline or linked in the head). You can specify whether those rules apply to both your page and shadow roots, just shadow roots, or just certain shadow roots, and so on. All of these make use of CSS @media rules containing a custom --crossroot condition, which can be functional. The easiest way to understand it is with code, so let's have a look...

Rules for shadows, not the page...

This applies to <h1>'s in all shadow roots, but not in the light DOM of the page itself.

@media --crossroot { 
  h1 { ... }
}

Authors can also provide a selector filter to specify which elements should have their shadow roots affected. This can be, for example, a tag list. The example below will style the <h2>'s in the shadows of <x-foo> or <x-bar> elements, and not those in the light DOM of your page itself...

@media --crossroot(x-foo, x-bar) { 
  h2 { ... }
}

Most selectors should work there, so you could also exclude if you prefer. The example below will style the <h3>'s in the shadows of all elements except those of <x-bat> elements ...

@media --crossroot(:not(x-bat)) {
  h3 { ... } 
}

Rules for shadows, and the page...

It's really just a trick of @media that we're tapping into: begin any of the examples above with screen, and put the whole --crossroot in parentheses. The example below styles all the <h1>'s in both your light DOM and all shadows...

@media screen, (--crossroot) { 
  h1 { ... }
}

Or, to use the exclusion route from above, but to apply to all <h3>'s in the page, or of shadows of all elements except those of <x-bat> elements ...

@media screen, (--crossroot(:not(x-bat))) {
  h3 { ... } 
}

Play with it... There's a pen below. Once you've formed an impression, share it with me, even as a simple emoji sentiment or short comment here.

See the Pen halflight by вкαя∂εℓℓ (@briankardell) on CodePen.

Is this a proposal?

No, not yet, it's just a library that should allow us to experiment with something "close enough" to what "features" a real proposal might need to support, and very vaguely what it might look like.

A real proposal, if it came from this, would certainly not use this syntax, which is simply trying to strike a balance between being totally valid and easy to process, and "close enough" for us to get an idea if it's got the right moving parts.

January 25, 2024 05:00 AM

January 22, 2024

Samuel Iglesias

XDC 2023: Behind the curtains

Time flies! Back in October, Igalia organized X.Org Developers Conference 2023 in A Coruña, Spain.

In case you don’t know it, X.Org Developers Conference, despite the X.Org in the name, is a conference for all developers working in the open-source graphics stack: anything related to DRM/KMS, Mesa, X11 and Wayland compositors, etc.

A Coruña's Orzán beach

This year, I participated in the organization of XDC in A Coruña, Spain (again!) by taking care of different aspects: from logistics in the venue (Palexco) to running it in person. It was a very tiring but fulfilling experience.

Sponsors

First of all, I would like to thank all the sponsors for their support, as without them, this conference wouldn’t happen:

XDC 2023 sponsors

They didn’t only give financial support to the conference: Igalia sponsored the welcome event and lunches; X.Org Foundation sponsored coffee breaks; Tourism Office of A Coruña sponsored the guided tour in the city center; and Raspberry Pi sent Raspberry Pi 5 boards to all speakers!

XDC 2023 Stats

XDC 2023 was a success on attendance and talks submissions. Here you have some stats:

  • 📈 160 registered attendees.
  • 👬 120 attendees picked their badge in person.
  • 💻 25 attendees registered as virtual.
  • 📺 More than 6,000 views on live stream.
  • 📝 55 talks/workshops/demos distributed across three days of conference.
  • 🧗‍♀️ There were 3 social events: welcome event, city center guide tour, and one unofficial climbing activity!

XDC 2023 welcome event

Was XDC 2023 perfect organization-wise? Of course… no! Like in any event, we had some issues here and there: one with the Wi-Fi network that was quickly detected and fixed; some issues with the meals and coffee breaks (food allergies, mainly); we lost some seconds of audio from one talk in the live stream; and other minor things. Not bad for a community-run event!

Nevertheless, I would like to thank all the staff at Palexco for their quick response and their understanding.

Talk recordings & slides

XDC 2023 talk by André Almeida

Want to watch some talks again? All conference recordings were uploaded to the X.Org Foundation YouTube channel.

Slides are available to download in each talk description.

Enjoy!

XDC 2024

XDC 2024 will be in North America

We cannot tell yet where XDC 2024 is going to happen, other than that it will be in North America… but I can tell you that this will be announced soon. Stay tuned!

Want to organize XDC 2025 or XDC 2026?

If we continue with the current cadence: 2025 would be again in Europe, and the 2026 event would be in North America.

There is a list of requirements here. Nevertheless, feel free to contact me or the X.Org Board of Directors in order to get first-hand experience and knowledge about what organizing XDC entails.

XDC 2023 audience

Thanks

Thanks to all volunteers, collaborators, Palexco staff, GPUL, X.Org Foundation and many other people for their hard work. Special thanks to my Igalia colleague Chema, who did an outstanding job organizing the event together with me.

Thanks to the sponsors for their extraordinary support of this conference.

Thanks to Igalia not only for sponsoring the event, but also for all the support I got during the past year. I am glad to be part of this company, and I am always surprised by how great my colleagues are.

And last, but not least, thanks to all speakers and attendees. Without you, the conference wouldn’t exist.

See you at XDC 2024!

January 22, 2024 09:06 AM

January 11, 2024

Andy Wingo

micro macro story time

Today, a tiny tale: about 15 years ago I was working on Guile’s macro expander. Guile inherited this code from an early version of Kent Dybvig’s portable syntax expander. It was... not easy to work with.

Some difficulties were essential. Scope is tricky, after all.

Some difficulties were incidental, but deep. The expander is ultimately a function that translates Scheme-with-macros to Scheme-without-macros. However, it is itself written in Scheme-with-macros, so to load it on a substrate without macros requires a pre-expanded copy of itself, whose data representations need to be compatible with any incremental change, so that you will be able to use the new expander to produce a fresh pre-expansion. This difficulty could have been avoided by incrementally bootstrapping the library. It works once you are used to it, but it’s gnarly.

But then, some difficulties were just superfluously egregious. Dybvig is a totemic developer and researcher, but a generation or two removed from me, and when I was younger, it never occurred to me to just email him to ask why things were this way. (A tip to the reader: if someone is doing work you are interested in, you can just email them. Probably they write you back! If they don’t respond, it’s not you, they’re probably just busy and their inbox leaks.) Anyway in my totally speculatory reconstruction of events, when Dybvig goes to submit his algorithm for publication, he gets annoyed that “expand” doesn’t sound fancy enough. In a way it’s similar to the original SSA developers thinking that “phony functions” wouldn’t get published.

So Dybvig calls the expansion function “χ”, because the Greek chi looks like the X in “expand”. Fine for the paper, whatever paper that might be, but then in psyntax, there are all these functions named chi and chi-lambda and all sorts of nonsense.

In early years I was often confused by these names; I wasn’t in on the pun, and I didn’t feel like I had enough responsibility for this code to think what the name should be. I finally broke down and changed all instances of “chi” to “expand” back in 2011, and never looked back.

Anyway, this is a story with a very specific moral: don’t name your functions chi.

by Andy Wingo at January 11, 2024 02:10 PM

Maíra Canal

Introducing CPU jobs to the Raspberry Pi

Igalia is always working hard to improve 3D rendering drivers of the Broadcom VideoCore GPU, found in Raspberry Pi devices. One of our most recent efforts in this sense was the implementation of CPU jobs from the Vulkan driver to the V3D kernel driver.

What are CPU jobs and why do we need them?

In the V3DV driver, there are some Vulkan commands that cannot be performed by the GPU alone, so we implement those as CPU jobs on Mesa. A CPU job is a job that requires CPU intervention to be performed. For example, in the Broadcom VideoCore GPUs, we don’t have a way to calculate the timestamp. But we need the timestamp for Vulkan timestamp queries. Therefore, we need to calculate the timestamp on the CPU.

A CPU job in userspace also implies CPU stalling. Sometimes, we need to hold part of the command submission flow in order to correctly synchronize job execution. This waiting period causes the CPU to stall, preventing the continuous submission of jobs to the GPU. To mitigate this issue, we decided to move the CPU job mechanism from the V3DV driver to the V3D kernel driver.

In the V3D kernel driver, we have different kinds of jobs: RENDER jobs, BIN jobs, CSD jobs, TFU jobs, and CLEAN CACHE jobs. For each of those jobs, we have a DRM scheduler instance that helps us to synchronize the jobs.

If you want to know more about the different kinds of V3D jobs, check out this November Update: Exploring V3D blogpost, where I explain more about all the V3D IOCTLs and jobs.

Jobs of the same kind are dispatched and executed in the same order they are submitted, using a standard first-in-first-out (FIFO) queue system. We can synchronize jobs across different queues using DRM syncobjs. More about the V3D synchronization framework and its user extensions can be learned in this two-part blog post from Melissa Wen.

From the kernel documentation, a DRM syncobj (synchronisation object) is a container for stuff that helps sync up GPU commands. They’re super handy because you can use them in your own programs, share them with other programs, and even use them across different DRM drivers. Mostly, they’re used for making Vulkan fences and semaphores work.

By moving the CPU job from userspace to the kernel, we can make use of the DRM scheduler queues and all the advantages they bring. For this, we created a new type of job in the V3D kernel driver, the CPU job, which also meant creating a new DRM scheduler instance and a CPU job queue. Now, instead of stalling the submission thread waiting for the GPU to go idle, we can use DRM syncobjs to synchronize both CPU and GPU jobs in a submission, providing more efficient usage of the GPU.

How did we implement the CPU jobs in the kernel driver?

After we decided to have a CPU job implementation in the kernel space, we could think about two possible implementations for this job: creating an IOCTL for each type of CPU job or using a user extension to provide a polymorphic behavior to a single CPU job IOCTL.

We have different types of CPU jobs (indirect CSD jobs, timestamp query jobs, copy query results jobs…) and each of them has a common infrastructure of allocation and synchronization but performs different operations. Therefore, we decided to go with the option to use user extensions.

On Melissa’s blog post, she digs deep into the implementation of generic IOCTL extensions in the V3D kernel driver. But, to put it simply, instead of expanding the data struct for each IOCTL every time we need to add a new feature, we define a user extension chain. As we add new optional interfaces to control the IOCTL, we define a new extension struct that can be linked to the IOCTL data only when required by the user.

Therefore, we created a new IOCTL, drm_v3d_submit_cpu, which is used to submit any type of CPU job. This single IOCTL can be extended by a user extension, which allows us to reuse the common infrastructure - avoiding code repetition - while using the user extension ID to identify the type of job and perform the corresponding operation.

struct drm_v3d_submit_cpu {
        /* Pointer to a u32 array of the BOs that are referenced by the job.
         *
         * For DRM_V3D_EXT_ID_CPU_INDIRECT_CSD, it must contain only one BO,
         * that contains the workgroup counts.
         *
         * For DRM_V3D_EXT_ID_TIMESTAMP_QUERY, it must contain only one BO,
         * that will contain the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY, it must contain only
         * one BO, that contains the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, it must contain two
         * BOs. The first is the BO where the timestamp queries will be written
         * to. The second is the BO that contains the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY, it must contain no
         * BOs.
         *
         * For DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY, it must contain one
         * BO, where the performance queries will be written.
         */
        __u64 bo_handles;

        /* Number of BO handles passed in (size is that times 4). */
        __u32 bo_handle_count;

        __u32 flags;

        /* Pointer to an array of ioctl extensions */
        __u64 extensions;
};

Now, we can create a CPU job and submit it with a CPU job user extension.

And which extensions are available?

  1. DRM_V3D_EXT_ID_CPU_INDIRECT_CSD: this CPU job allows us to submit an indirect CSD job. An indirect CSD job is a job that, when executed in the queue, will map an indirect buffer, read the dispatch parameters, and submit a regular dispatch. This CPU job is used in Vulkan calls like vkCmdDispatchIndirect().
  2. DRM_V3D_EXT_ID_CPU_TIMESTAMP_QUERY: this CPU job calculates the query timestamp and updates the query availability by signaling a syncobj. This CPU job is used in Vulkan calls like vkCmdWriteTimestamp().
  3. DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY: this CPU job resets the timestamp queries based on the value offset of the first query. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for timestamp queries.
  4. DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY: this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for timestamp queries.
  5. DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY: this CPU job resets the performance queries by resetting the values of the perfmons. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for performance queries.
  6. DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY: similar to DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for performance queries.

The CPU job IOCTL flow is similar to that of any other V3D job: we allocate the job struct, parse all the extensions, init the job, look up the BOs and lock their reservations, add the proper dependencies, and push the job to the DRM scheduler entity.

When running a CPU job, we execute the following code:

static const v3d_cpu_job_fn cpu_job_function[] = {
        [V3D_CPU_JOB_TYPE_INDIRECT_CSD] = v3d_rewrite_csd_job_wg_counts_from_indirect,
        [V3D_CPU_JOB_TYPE_TIMESTAMP_QUERY] = v3d_timestamp_query,
        [V3D_CPU_JOB_TYPE_RESET_TIMESTAMP_QUERY] = v3d_reset_timestamp_queries,
        [V3D_CPU_JOB_TYPE_COPY_TIMESTAMP_QUERY] = v3d_copy_query_results,
        [V3D_CPU_JOB_TYPE_RESET_PERFORMANCE_QUERY] = v3d_reset_performance_queries,
        [V3D_CPU_JOB_TYPE_COPY_PERFORMANCE_QUERY] = v3d_copy_performance_query,
};

static struct dma_fence *
v3d_cpu_job_run(struct drm_sched_job *sched_job)
{
        struct v3d_cpu_job *job = to_cpu_job(sched_job);
        struct v3d_dev *v3d = job->base.v3d;

        v3d->cpu_job = job;

        if (job->job_type >= ARRAY_SIZE(cpu_job_function)) {
                DRM_DEBUG_DRIVER("Unknown CPU job: %d\n", job->job_type);
                return NULL;
        }

        trace_v3d_cpu_job_begin(&v3d->drm, job->job_type);

        cpu_job_function[job->job_type](job);

        trace_v3d_cpu_job_end(&v3d->drm, job->job_type);

        return NULL;
}

The interesting thing is that each CPU job type executes a completely different operation.

The complete kernel implementation has already landed in drm-misc-next and can be seen right here.

What did we change in Mesa-V3DV to use the new kernel-V3D CPU job?

After landing the kernel implementation, I needed to accommodate the new CPU job approach in userspace.

A fundamental rule is not to cause regressions, i.e., to keep backwards userspace compatibility with old versions of the Linux kernel. This means we cannot break new versions of Mesa running on old kernels. Therefore, we needed to create two paths: one preserving the old way to perform CPU jobs and the other using the kernel to perform them.

So, for example, the indirect CSD job used to add two different jobs to the queue: a CPU job and a CSD job. Now, if we have the CPU job capability in the kernel, we only add a CPU job and the CSD job is dispatched from within the kernel.

-   list_addtail(&csd_job->list_link, &cmd_buffer->jobs);
+
+   /* If we have a CPU queue we submit the CPU job directly to the
+    * queue and the CSD job will be dispatched from within the kernel
+    * queue, otherwise we will have to dispatch the CSD job manually
+    * right after the CPU job by adding it to the list of jobs in the
+    * command buffer.
+    */
+   if (!cmd_buffer->device->pdevice->caps.cpu_queue)
+      list_addtail(&csd_job->list_link, &cmd_buffer->jobs);

Furthermore, now we can use syncobjs to sync the CPU jobs. For example, in the timestamp query CPU job, we used to stall the submission thread and wait for completion of all work queued before the timestamp query. Now, we can just add a barrier to the CPU job and it will be properly synchronized by the syncobjs without stalling the submission thread.

   /* The CPU job should be serialized so it only executes after all previously
    * submitted work has completed
    */
   job->serialize = V3DV_BARRIER_ALL;

We were able to test the implementation using multiple CTS tests, such as dEQP-VK.compute.pipeline.indirect_dispatch.*, dEQP-VK.pipeline.monolithic.timestamp.*, dEQP-VK.synchronization.*, dEQP-VK.query_pool.* and dEQP-VK.multiview.*.

The userspace implementation has already landed in Mesa and the full implementation can be checked in this MR.


More about the ongoing challenges in the Raspberry Pi driver stack can be seen in this XDC 2023 talk presented by Iago Toral, Juan Suárez and myself. During this talk, Iago mentioned the CPU job work that we have been doing.

Also, I cannot finish this post without thanking Melissa Wen and Iago Toral for all their help while developing the CPU jobs for the V3D kernel driver.

January 11, 2024 01:30 PM

January 08, 2024

Andy Wingo

missing the point of webassembly

I find most descriptions of WebAssembly to be uninspiring: if you start with a phrase like “assembly-like language” or a “virtual machine”, we have already lost the plot. That’s not to say that these descriptions are incorrect, but it’s like explaining what a dog is by starting with its circulatory system. You’re not wrong, but you should probably lead with the bark.

I have a different preferred starting point which is less descriptive but more operational: WebAssembly is a new fundamental abstraction boundary. WebAssembly is a new way of dividing computing systems into pieces and of composing systems from parts.

This all may sound high-falutin’, but it’s for real: this is the actually interesting thing about Wasm.

fundamental & abstract

It’s probably easiest to explain what I mean by example. Consider the Linux ABI: Linux doesn’t care what code it’s running; Linux just handles system calls and schedules process time. Programs that run against the x86-64 Linux ABI don’t care whether they are in a container or a virtual machine or “bare metal” or whether the processor is AMD or Intel or even a Mac M3 with Docker and Rosetta 2. The Linux ABI interface is fundamental in the sense that either side can implement any logic, subject to the restrictions of the interface, and abstract in the sense that the universe of possible behaviors has been simplified to a limited language, in this case that of system calls.

Or take HTTP: when you visit wingolog.org, you don’t have to know (but surely would be delighted to learn) that it’s Scheme code that handles the request. I don’t have to care if the other side of the line is curl or Firefox or Wolvic. HTTP is such a successful fundamental abstraction boundary that at this point it is the default for network endpoints; whether you are a database or a golang microservice, if you don’t know that you need a custom protocol, you use HTTP.

Or, to rotate our metaphorical compound microscope to high-power magnification, consider the SYS-V amd64 C ABI: almost every programming language supports some form of extern C {} to access external libraries, and the best language implementations can produce artifacts that implement the C ABI as well. The standard C ABI splits programs into parts, and allows works from separate teams to be composed into a whole. Indeed, one litmus test of a fundamental abstraction boundary is, could I reasonably define an interface and have an implementation of it be in Scheme or OCaml or what-not: if the answer is yes, we are in business.

It is in this sense that WebAssembly is a new fundamental abstraction boundary.

WebAssembly shares many of the concrete characteristics of other abstractions. Like the Linux syscall interface, WebAssembly defines an interface language in which programs rely on host capabilities to access system features. Like the C ABI, calling into WebAssembly code has a predictable low cost. Like HTTP, you can arrange for WebAssembly code to have no shared state with its host, by construction.

But WebAssembly is a new point in this space. Unlike the Linux ABI, there is no fixed set of syscalls: WebAssembly imports are named, typed, and without pre-defined meaning, more like the C ABI. Unlike the C ABI, WebAssembly modules have only the shared state that they are given; neither side has a license to access all of the memory in the “process”. And unlike HTTP, WebAssembly modules are “in the room” with their hosts: close enough that hosts can allow themselves the luxury of synchronous function calls, and to allow WebAssembly modules to synchronously call back into their hosts.

applied teleology

At this point, you are probably nodding along, but also asking yourself, what is it for? If you arrive at this question from the “WebAssembly is a virtual machine” perspective, I don’t think you’re well-equipped to answer. But starting as we did by the interface, I think we are better positioned to appreciate how WebAssembly fits into the computing landscape: the narrative is generative, in that you can explore potential niches by identifying existing abstraction boundaries.

Again, let’s take a few examples. Say you ship some “smart cities” IoT device, consisting of a microcontroller that runs some non-Linux operating system. The system doesn’t have an MMU, so you don’t have hardware memory protections, but you would like to be able to enforce some invariants on the software that this device runs; and you would also like to be able to update that software over the air. WebAssembly is getting used in these environments; I wish I had a list of deployments at hand, but perhaps we can at least take this article last year from a WebAssembly IoT vendor as proof of commercial interest.

Or, say you run a function-as-a-service cloud, meaning that you run customer code in response to individual API requests. You need to limit the allowable set of behaviors from the guest code, so you choose some abstraction boundary. You could use virtual machines, but that would be quite expensive in terms of memory. You could use containers, but you would like more control over the guest code. You could have these functions written in JavaScript, but that means that your abstraction is no longer fundamental; you limit your applicability. WebAssembly fills an interesting niche here, and there are a number of products in this space, for example Fastly Compute or Fermyon Spin.

Or to go smaller, consider extensible software, like the GIMP image editor or VS Code: in the past you would use loadable plug-in modules via the C ABI, which can be quite gnarly, or lean into a particular scripting language, which can be slow, inexpressive, and limit the set of developers that can write extensions. It’s not a silver bullet, but WebAssembly can have a role here. For example, the Harfbuzz text shaping library supports fonts with an embedded (em-behdad?) WebAssembly extension to control how strings of characters are mapped to positioned glyphs.

aside: what boundaries do

They say that good fences make good neighbors, and though I am not quite sure it is true—since my neighbor put up a fence a few months ago, our kids don’t play together any more—boundaries certainly facilitate separation of functionality. Conway’s law is sometimes applied as a descriptive observation—ha-ha, isn’t that funny, they just shipped their org chart—but this again misses the point, in that boundaries facilitate separation, but also composition: if I know that I can fearlessly allow a font to run code because I have an appropriate abstraction boundary between host application and extension, I have gained in power. I no longer need to be responsible for every part of the product, and my software can scale up to solve harder problems by composing work from multiple teams.

There is little point in using WebAssembly if you control both sides of a boundary, just as (unless you have chickens) there is little point in putting up a fence that runs through the middle of your garden. But where you want to compose work from separate teams, the boundaries imposed by WebAssembly can be a useful tool.

narrative generation

WebAssembly is enjoying a tail-wind of hype, so I think it’s fair to say that wherever you find a fundamental abstraction boundary, someone is going to try to implement it with WebAssembly.

Again, some examples: back in 2022 I speculated that someone would “compile” Docker containers to WebAssembly modules, and now that is a thing.

I think at some point someone will attempt to replace eBPF with Wasm in the Linux kernel; eBPF is just not as good a language as Wasm, and the toolchains that produce it are worse. eBPF has clunky calling-conventions about what registers are saved and spilled at call sites, a decision that can be made more efficiently for the program and architecture at hand when register-allocating WebAssembly locals. (Sometimes people lean on the provably-terminating aspect of eBPF as its virtue, but that could apply just as well to Wasm if you prohibit the loop opcode (and the tail-call instructions) at verification-time.) And why don’t people write whole device drivers in eBPF? Or rather, targeting eBPF from C or what-have-you. It’s because eBPF is just not good enough. WebAssembly is, though! Anyway I think Linux people are too chauvinistic to pick this idea up but I bet Microsoft could do it.

I was thinking today, you know, it actually makes sense to run a WebAssembly operating system, one which runs WebAssembly binaries. If the operating system includes the Wasm run-time, it can interpose itself at syscall boundaries, sometimes allowing it to avoid context switches. You could start with something like the Linux ABI, perhaps via WALI, but for a subset of guest processes that conform to particular conventions, you could build purpose-built composition that can allocate multiple WebAssembly modules to a single process, eliding inter-process context switches and data copies for streaming computations. Or, focussing on more restricted use-cases, you could make a microkernel; googling around I found this article from a couple days ago where someone is giving this a go.

wwwhat about the wwweb

But let’s go back to the web, where you are reading this. In one sense, WebAssembly is a massive web success, being deployed to literally billions of user agents. In another, it is marginal: people do not write web front-ends in WebAssembly. Partly this is because the kind of abstraction supported by linear-memory WebAssembly 1.0 isn’t a good match for the garbage-collected DOM API exposed by web browsers. As a corollary, languages that are most happy targeting this linear-memory model (C, Rust, and so on) aren’t good for writing DOM applications either. WebAssembly is used in auxiliary modules where you want to run legacy C++ code on user devices, or to speed up a hot leaf function, but isn’t a huge success.

This will change with the recent addition of managed data types to WebAssembly, but not necessarily in the way that you might think. Like, now that it will be cheaper and more natural to pass data back and forth with JavaScript, are we likely to see Wasm/GC progressively occupying more space in web applications? For me, I doubt that progressive is the word. In the same way that you wouldn’t run a fence through the middle of your front lawn, you wouldn’t want to divide your front-end team into JavaScript and WebAssembly sub-teams. Instead I think that we will see more phase transitions, in which whole web applications switch from JavaScript to Wasm/GC, compiled from Dart or Elm or what have you. The natural fundamental abstraction boundary in a web browser is between the user agent and the site’s code, not within the site’s code itself.

conclusion

So, friends, if you are talking to a compiler engineer, by all means: keep describing WebAssembly as a virtual machine. It will keep them interested. But for everyone else, the value of WebAssembly is what it does, which is to be a different way of breaking a system into pieces. Armed with this observation, we can look at current WebAssembly uses to understand the nature of those boundaries, and to look at new boundaries to see if WebAssembly can have a niche there. Happy hacking, and may your components always compose!

by Andy Wingo at January 08, 2024 11:45 AM

January 05, 2024

Andy Wingo

scheme modules vs whole-program compilation: fight

In a recent dispatch, I explained the whole-program compilation strategy used in Whiffle and Hoot. Today’s note explores what a correct solution might look like.

being explicit

Consider a module that exports an increment-this-integer procedure. We’ll use syntax from the R6RS standard:

(library (inc)
  (export inc)
  (import (rnrs))
  (define (inc n) (+ n 1)))

If we then have a program:

(import (rnrs) (inc))
(inc 42)

Then the meaning of this program is clear: it reduces to (+ 42 1), then to 43. Fine enough. But how do we get there? How does the compiler compose the program with the modules that it uses (transitively), to produce a single output?

In Whiffle (and Hoot), the answer is, sloppily. There is a standard prelude that initially has a number of bindings from the host compiler, Guile. One of these is +, exposed under the name %+, where the % in this case is just a warning to the reader that this is a weird primitive binding. Using this primitive, the prelude defines a wrapper:

...
(define (+ x y) (%+ x y))
...

At compilation-time, Guile’s compiler recognizes %+ as special, and therefore compiles the body of + as consisting of a primitive call (primcall), in this case to the addition primitive. The Whiffle (and Hoot, and native Guile) back-ends then avoid referencing an imported binding when compiling %+, and instead produce backend-specific code: %+ disappears. Most uses of the + wrapper get inlined so %+ ends up generating code all over the program.

The prelude is lexically splatted into the compilation unit via a pre-expansion phase, so you end up with something like:

(let () ; establish lexical binding contour
  ...
  (define (+ x y) (%+ x y))
  ...
  (let () ; new nested contour
    (define (inc n) (+ n 1))
    (inc 42)))

This program will probably optimize (via partial evaluation) to just 43. (What about let and define? Well. Perhaps we’ll get to that.)

But, again here I have taken a short-cut, which is about modules. Hoot and Whiffle don’t really do modules, yet anyway. I keep telling Spritely colleagues that it’s complicated, and rightfully they keep asking why, so this article gets into it.

is it really a big letrec?

Firstly you have to ask, what is the compilation unit anyway? I mean, given a set of modules A, B, C and so on, you could choose to compile them separately, relying on the dynamic linker to compose them at run-time, or all together, letting the compiler gnaw on them all at once. Or, just A and B, and so on. One good-enough answer to this problem is library-group form, which explicitly defines a set of topologically-sorted modules that should be compiled together. In our case, to treat the (inc) module together with our example program as one compilation unit, we would have:

(library-group
  ;; start with sequence of libraries
  ;; to include in compilation unit...
  (library (inc) ...)

  ;; then the tail is the program that
  ;; might use the libraries
  (import (rnrs) (inc))
  (inc 42))

In this example, the (rnrs) base library is not part of the compilation unit. Presumably it will be linked in, either as a build step or dynamically at run-time. For Hoot we would want the whole prelude to be included, because we don’t want any run-time dependencies. Anyway hopefully this would expand out to something like the set of nested define forms inside nested let lexical contours.

And that was my instinct: somehow we are going to smash all these modules together into a big nested letrec, and the compiler will go to town. And this would work, for a “normal” programming language.

But with Scheme, there is a problem: macros. Scheme is a “programmable programming language” that allows users to extend its syntax as well as its semantics. R6RS defines a procedural syntax transformer (“macro”) facility, in which the user can define functions that run on code at compile-time (specifically, during syntax expansion). Scheme macros manage to compose lexical scope from the macro definition with the scope at the macro instantiation site, by annotating these expressions with source location and scope information, and making syntax transformers mostly preserve those annotations.

“Macros are great!”, you say: well yes, of course. But they are a problem too. Consider this incomplete library:

(library (ctinc)
  (import (rnrs) (inc))
  (export ctinc)
  (define-syntax ctinc
    (lambda (stx)
      ...)) ;; ***

The idea is to define a version of inc, but at compile-time: a (ctinc 42) form should expand directly to 43, not a call to inc (or even +, or %+). We define syntax transformers with define-syntax instead of define. The right-hand-side of the definition ((lambda (stx) ...)) should be a procedure of one argument, which returns one value: so far so good. Or is it? How do we actually evaluate what (lambda (stx) ...) means? What should we fill in for ...? When evaluating the transformer value, what definitions are in scope? What does lambda even mean in this context?

Well... here we butt up against the phasing wars of the mid-2000s. R6RS defines a whole system to explicitly declare what bindings are available when, then carves out a huge exception to allow for so-called implicit phasing, in which the compiler figures it out on its own. In this example we imported (rnrs) for the default phase, and this is the module that defines lambda (and indeed define and define-syntax). The standard defines that (rnrs) makes its bindings available both at run-time and expansion-time (compilation-time), so lambda means what we expect that it does. Whew! Let’s just assume implicit phasing, going forward.

The operand to the syntax transformer is a syntax object: an expression annotated with source and scope information. To pick it apart, R6RS defines a pattern-matching helper, syntax-case. In our case ctinc is unary, so we can begin to flesh out the syntax transformer:

(library (ctinc)
  (import (rnrs) (inc))
  (export ctinc)
  (define-syntax ctinc
    (lambda (stx)
      (syntax-case stx ()
        ((ctinc n)
         (inc n)))))) ;; ***

But here there’s a detail, which is that when syntax-case destructures stx to its parts, those parts themselves are syntax objects which carry the scope and source location annotations. To strip those annotations, we call the syntax->datum procedure, exported by (rnrs).

(library (ctinc)
  (import (rnrs) (inc))
  (export ctinc)
  (define-syntax ctinc
    (lambda (stx)
      (syntax-case stx ()
        ((ctinc n)
         (inc (syntax->datum #'n)))))))

And with this, voilà our program:

(library-group
  (library (inc) ...)
  (library (ctinc) ...)
  (import (rnrs) (ctinc))
  (ctinc 42))

This program should pre-expand to something like:

(let ()
  (define (inc n) (+ n 1))
  (let ()
    (define-syntax ctinc
      (lambda (stx)
        (syntax-case stx ()
          ((ctinc n)
           (inc (syntax->datum #'n))))))
    (ctinc 42)))

And then expansion should transform (ctinc 42) to 43. However, our naïve pre-expansion is not good enough for this to be possible. If you ran this in Guile you would get an error:

Syntax error:
unknown file:8:12: reference to identifier outside its scope in form inc

Which is to say, inc is not available as a value within the definition of ctinc. ctinc could residualize an expression that refers to inc, but it can’t use it to produce the output.

modules are not expressible with local lexical binding

This brings us to the heart of the issue: with procedural macros, modules impose a phasing discipline on the expansion process. Definitions from any given module must be available both at expand-time and at run-time. In our example, ctinc needs inc at expand-time, which is an early part of the compiler that is unrelated to any later partial evaluation by the optimizer. We can’t make inc available at expand-time just using let / letrec bindings.

This is an annoying result! What do other languages do? Well, mostly they aren’t programmable, in the sense that they don’t have macros. There are some ways to get programmability using e.g. eval in JavaScript, but these systems are not very amenable to “offline” analysis of the kind needed by an ahead-of-time compiler.

For those declarative languages with macros, Scheme included, I understand the state of the art is to expand module-by-module and then stitch together the results of expansion later, using a kind of link-time optimization. You visit a module’s definitions twice: once to evaluate them while expanding, resulting in live definitions that can be used by further syntax expanders, and once to residualize an abstract syntax tree, which will eventually be spliced into the compilation unit.

Note that in general the expansion-time and the residual definitions don’t need to be the same, and indeed during cross-compilation they are often different. If you are compiling with Guile as host and Hoot as target, you might implement cons one way in Guile and another way in Hoot, choosing between them with cond-expand.

lexical scope regained?

What is to be done? Glad you asked, Vladimir. But, I don’t really know. The compiler wants a big blob of letrec, but the expander wants a pearl-string of modules. Perhaps we try to satisfy them both? The library-group paper suggests that modules should be expanded one by one, then stitched into a letrec by AST transformations. It’s not that lexical scope is incompatible with modules and whole-program compilation; the problems arise when you add in macros. So by expanding first, in units of modules, we reduce high-level Scheme to a lower-level language without syntax transformers, but still on the level of letrec.

I was unreasonably pleased by the effectiveness of the “just splat in a prelude” approach, and I will miss it. I even pled for a kind of stop-gap fat-fingered solution to sloppily parse module forms and keep on splatting things together, but colleagues helpfully talked me away from the edge. So good-bye, sloppy: I repent my ways and will make amends, with 40 hail-maries and an alpha renaming thrice daily and more often if in moral distress. Further bulletins as events warrant. Until then, happy scheming!

by Andy Wingo at January 05, 2024 08:43 PM

v8's precise field-logging remembered set

A remembered set is used by a garbage collector to identify graph edges between partitioned sub-spaces of a heap. The canonical example is in generational collection, where you allocate new objects in newspace, and eventually promote survivor objects to oldspace. If most objects die young, we can focus GC effort on newspace, to avoid traversing all of oldspace all the time.

Collecting a subspace instead of the whole heap is sound if and only if we can identify all live objects in the subspace. We start with some set of roots that point into the subspace from outside, and then traverse all links in those objects, but only to other objects within the subspace.

The roots are, like, global variables, and the stack, and registers; and in the case of a partial collection in which we identify live objects only within newspace, also any link into newspace from other spaces (oldspace, in our case). This set of inbound links is a remembered set.

There are a few strategies for maintaining a remembered set. Generally speaking, you start by implementing a write barrier that intercepts all stores in a program. Instead of:

obj[slot] := val;

You might abstract this away:

write_slot(obj, sizeof obj, &obj[slot], val);

As you can see, it’s quite an annoying transformation to do by hand; typically you will want some sort of language-level abstraction that lets you keep the more natural syntax. C++ can do this pretty well, or if you are implementing a compiler, you just add this logic to the code generator.

Then the actual write barrier... well, its implementation is tangled up with the implementation of the remembered set. The simplest variant is a card-marking scheme, whereby the heap is divided into equal, power-of-two-sized cards, and each card has a bit. If the heap is also divided into blocks (say, 2 MB in size), then you might divide those blocks into 256-byte cards, yielding 8192 cards per block. A barrier might look like this:

void write_slot(ObjRef obj, size_t size,
                SlotAddr slot, ObjRef val) {
  *slot = val; // Start with the store.

  uintptr_t block_size = 1<<21;
  uintptr_t card_size = 1<<8;
  uintptr_t cards_per_block = block_size / card_size;

  uintptr_t obj_addr = (uintptr_t) obj;
  uintptr_t card_idx = (obj_addr / card_size) % cards_per_block;

  // Assume the remset is allocated at the block start.
  uintptr_t block_start = obj_addr & ~(block_size-1);
  uint32_t *cards = (uint32_t *) block_start;

  // Set the bit.
  cards[card_idx / 32] |= UINT32_C(1) << (card_idx % 32);
}

Then when marking the new generation, you visit all cards, and for all marked cards, trace all outbound links in all live objects that begin on the card.

Card-marking is simple to implement and simple to statically allocate as part of the heap. Finding marked cards takes time proportional to the size of the heap, but you hope that the constant factors and SIMD minimize this cost. However, iterating over objects within a card can be costly. You hope that there are few old-to-new links, but what do you know?

In Whippet I have been struggling a bit with sticky-mark-bit generational marking, in which new and old objects are not spatially partitioned. Sometimes generational collection is a win, but in benchmarking I find that often it isn’t, and I think Whippet’s card-marking barrier is at fault: it is simply too imprecise. Consider firstly that our write barrier applies to stores to slots in all objects, not just those in oldspace; a store to a new object will mark a card, but that card may contain old objects which would then be re-scanned. Or consider a store to an old object in a more dense part of oldspace; scanning the card may incur more work than needed. It could also be that Whippet is being too aggressive at re-using blocks for new allocations, where it should be limiting itself to blocks that are very sparsely populated with old objects.

what v8 does

There is a tradeoff in write barriers between the overhead imposed on stores, the size of the remembered set, and the precision of the remembered set. Card-marking is relatively low-overhead and usually small as a fraction of the heap, but not very precise. It would be better if a remembered set recorded objects, not cards. And it would be even better if it recorded slots in objects, not just objects.

V8 takes this latter strategy: it has per-block remembered sets which record slots containing “interesting” links. All of the above words were to get here, to take a brief look at its remembered set.

The main operation is RememberedSet::Insert. It takes the MemoryChunk (a block, in our language from above) and the address of a slot in the block. Each block has a remembered set; in fact, six remembered sets for some reason. The remembered set itself is a SlotSet, whose interesting operations come from BasicSlotSet.

The structure of a slot set is a bitvector partitioned into equal-sized, possibly-empty buckets. There is one bit per slot in the block, so in the limit the size overhead for the remembered set may be 3% (1/32, assuming compressed pointers). Currently each bucket is 1024 bits (128 bytes), plus the 4 bytes for the bucket pointer itself.

Inserting into the slot set will first allocate a bucket (using C++ new) if needed, then load the “cell” (32-bit integer) containing the slot. There is a template parameter declaring whether this is an atomic or normal load. Finally, if the slot bit in the cell is not yet set, V8 will set the bit, possibly using atomic compare-and-swap.

In the language of Blackburn’s Design and analysis of field-logging write barriers, I believe this is a field-logging barrier, rather than the bit-stealing slot barrier described by Yang et al. in the 2012 Barriers Reconsidered, Friendlier Still!. Unlike Blackburn’s field-logging barrier, however, this remembered set is implemented completely on the side: there is no in-object remembered bit, nor remembered bits for the fields.

On the one hand, V8’s remembered sets are precise. There are some tradeoffs, though: they require off-managed-heap dynamic allocation for the buckets, and traversing the remembered sets takes time proportional to the whole heap size. And, should V8 ever switch its minor mark-sweep generational collector to use sticky mark bits, the lack of a spatial partition could lead to similar problems as I am seeing in Whippet. I will be interested to see what they come up with in this regard.

Well, that’s all for today. Happy hacking in the new year!

by Andy Wingo at January 05, 2024 09:44 AM

January 03, 2024

Brian Kardell

What's Good?

What's Good?

I think this is a question worth asking, let me explain why…

A few times a year, every developer advocate will ask developers which upcoming features they're interested in or what pain points they experience most. That is a good thing. We should keep doing that.

While we don't often present it in this light, one thing this does is inform prioritization. The simple truth is that resources are way too finite, so we have to look for good signals about what should be prioritized.

I think it's similarly important to have feedback on the other end too: What's your satisfaction like on the web features you've gotten in the last few years? Can you name any that you use all the time? Can you name some that you over-estimated your need for? Something that you thought you needed/wanted but then ultimately didn't end up using so much? Something that you had high hopes for, but failed you?

But why ask that?

Well, it seems quite probable that there are things we can learn from that.

Maybe we can look at where we should have listened more, or pushed more. Maybe we can learn things about the processes that successful things took that unsuccessful things didn't (did they go through WICG? Were there polyfills? Origin trials? Did they stay in experimental builds behind a flag for a long time? Were they done at roughly the same time in all browsers?). Can we compare the amount of resources and time required between them? Maybe those things could also inform prioritization somehow?

Normally, around this time, I'd have wrapped up working on the latest Web Almanac and there would be lots of data flying at me which scratches a little bit of the kind of itch I've got - but this year we didn't do one, so I find myself wondering: How are people getting along with all of those things we've been delivering since 2019 or so?

So, let me know!

January 03, 2024 05:00 AM

January 02, 2024

Qiuyi Zhang (Joyee)

Fixing snapshot support of class fields in V8

Up until V8 10.0, the class field initializers had been

January 02, 2024 09:20 PM

Building V8 on an M1 MacBook

I’ve recently got an M1 MacBook and played around with it a bit. It seems many open source projects still haven’t added MacOS with ARM64

January 02, 2024 09:20 PM

On deps/v8 in Node.js

I recently ran into a V8 test failure that only showed up in the V8 fork of Node.js but not in the upstream. Here I’ll write down my

January 02, 2024 09:20 PM

Tips and Tricks for Node.js Core Development and Debugging

I thought about writing some guides on this topic in the nodejs/node repo, but it’s easier to throw whatever tricks I personally use on

January 02, 2024 09:20 PM

Fixing Node.js vm APIs, part 1 - memory leaks and segmentation faults

This year I spent some time fixing a few long-standing issues in the Node.js vm APIs that had been blocking users from upgrading away

January 02, 2024 09:20 PM

Uncaught exceptions in Node.js

In this post, I’ll jot down some notes that I took when refactoring the uncaught exception handling routines in Node.js. Hopefully it

January 02, 2024 07:17 PM

New Blog

I’ve been thinking about starting a new blog for a while now. So here it is.

Not sure if I am going to write about tech here.

January 02, 2024 07:17 PM

Eric Meyer

Once Upon a Browser

Once upon a time, there was a movie called Once Upon a Forest.  I’ve never seen it.  In fact, the only reason I know it exists is because a few years after it was released, Joshua Davis created a site called Once Upon a Forest, which I was doing searches to find again.  The movie came up in my search results; the site, long dead, did not.  Instead, I found its original URL on Joshua’s Wikipedia page, and the Wayback Machine coughed up snapshots of it, such as this one.  You can also find static shots of it on Joshua’s personal web site, if you scroll far enough.

That site has long stayed with me, not so much for its artistic expression (which is pleasant enough) as for how the pieces were produced.  Joshua explained in a talk that he wrote code to create generative art, where it took visual elements and arranged them randomly, then waited for him to either save the result or hit a key to try again.  He created the elements that were used, and put constraints on how they might be arranged, but allowed randomness to determine the outcome.

That appealed to me deeply.  I eventually came to realize that the appeal was rooted in my love of the web, where we create content elements and visual styles and scripted behavior, and then we send our work into a medium that definitely has constraints, but something very much like the random component of generative art: viewport size, device capabilities, browser, and personal preference settings can combine in essentially infinite ways.  The user is the seed in the RNG of our work’s output.

Normally, we try very hard to minimize the variation our work can express.  Even when crossing from one experiential stratum to another  —  that is to say, when changing media breakpoints  —  we try to keep things visually consistent, orderly, and understandable.  That drive to be boring for the sake of user comprehension and convenience is often at war with our desire to be visually striking for the sake of expression and enticement.

There is a lot, and I mean a lot, of room for variability in web technologies.  We work very hard to tame it, to deny it, to shun it.  Too much, if you ask me.

About twelve and a half years ago, I took a first stab at pushing back on that denial with a series posted to Flickr called “Spinning the Web”, where I used CSS rotation transforms to take consistent, orderly, understandable web sites and shake them up hard.  I enjoyed the process, and a number of people enjoyed the results.

google.com, late November 2023

In the past few months, I’ve come back to the concept for no truly clear reason and have been exploring new approaches and visual styles.  The first collection launched a few days ago: Spinning the Web 2023, a collection of 26 web sites remixed with a combination of CSS and JS.

I’m announcing them now in part because this month has been dubbed “Genuary”, a month for experimenting with generative art, with daily prompts to get people generating.  I don’t know if I’ll be following any of the prompts, but we’ll see.  And now I have a place to do it.

You see, back in 2011, I mentioned that my working title for the “Spinning the Web” series was “Once Upon a Browser”.  That title has never left me, so I’ve decided to claim it and created an umbrella site with that name.  At launch, it’s sporting a design that owes quite a bit to Once Upon a Forest  —  albeit with its own SVG-based generative background, one I plan to mess around with whenever the mood strikes.  New works will go up there from time to time, and I plan to migrate the 2011 efforts there as well.  For now, there are pointers to the Flickr albums for the old works.

I said this back in 2011, and I mean it just as much in 2023: I hope you enjoy these works even half as much as I enjoyed creating them.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at January 02, 2024 04:22 PM

January 01, 2024

Alex Bradbury

Reflections on ten years of LLVM Weekly

Today, with Issue #522 I'm marking ten years of authoring LLVM Weekly, a newsletter summarising developments on projects under the LLVM umbrella (LLVM, Clang, MLIR, Flang, libcxx, compiler-rt, lld, LLDB, ...). Somehow I've managed to keep up an unbroken streak, publishing every single Monday since the first issue back on Jan 6th 2014 (the first Monday of 2014 - you can also see the format hasn't changed much!). With a milestone like that, now is the perfect moment to jot down some reflections on the newsletter and thoughts for the future.

Motivation and purpose

Way back when I started LLVM Weekly, I'd been working with LLVM for a few years as part of developing and supporting a downstream compiler for a novel research architecture. This was a very educational yet somewhat lonely experience, and I sought to more closely follow upstream LLVM development to keep better abreast of changes that might impact or help my work, to learn more about parts of the compiler I wasn't actively using, and also to feel more of a connection to the wider LLVM community given my compiler work was a solo effort. The calculus for kicking off an LLVM development newsletter was dead simple: I found value in tracking development anyway, the incremental effort to write up and share with others wasn't too great, and I felt quite sure others would benefit as well.

Looking back at my notes (I have a huge Markdown file with daily notes going back to 2011 - a file of this rough size and format is also a good stress test for text editors!) it seems I thought seriously about the idea of starting something up at the beginning of December 2013. I brainstormed the format, looked at other newsletters I might want to emulate, and went ahead and just did it starting in the new year. It really was as simple as that. I figured better to give it a try and stop it if it gets no traction rather than waste lots of time putting out feelers on the level of interest and format. As a sidenote, I was delighted to see many of the newsletters I studied at the time are still going: This Week in Rust, Perl Weekly (I'll admit this surprised me!), Ubuntu Weekly News, OCaml Weekly News, and Haskell Weekly News.

Readership and content

The basic format of LLVM Weekly is incredibly simple - highlight relevant news articles and blog posts, pick out some forum/mailing discussions (sometimes trying to summarise complex debates - but this is very challenging and time intensive), and highlight some noteworthy commits from across the project. More recently I've taken to advertising the scheduled online sync-ups and office hours for the week. Notably absent are any kind of ads or paid content. I respect that others have made successful businesses in this kind of space, but although I've always written LLVM Weekly on my own personal time I've never felt comfortable trying to monetise other people's attention or my relationship with the community in this way.

The target audience is really anyone with an interest in keeping track of LLVM development, though I don't tend to expand every acronym or give a from-basics explanation for every term, so some familiarity with the project is assumed if you want to understand every line. The newsletter is posted to LLVM's Discourse, to llvmweekly.org, and delivered direct to people's inboxes. I additionally post on Twitter and on Mastodon linking to each issue. I don't attempt to track open rates or have functioning analytics, so only have a rough idea of readership. There are ~3.5k active subscribers directly to the mailing list, ~7.5k Twitter followers, ~180 followers on Mastodon (introduced much more recently), and an unknown number of people reading via llvmweekly.org or RSS. I'm pretty confident that I'm not just shouting in the void at least.

There are some gaps or blind spots, of course. I make no attempt to link to patches that are under review, even though many might have interesting review discussions: it would simply be too much work to sort through them, and if a discussion is particularly contentious or requires input from a wider cross-section of the LLVM community, you'd expect an RFC to be posted anyway. Although I do try to highlight MLIR threads or commits, as it's not an area of LLVM I'm working in right now I probably miss some things. Thankfully Javed Absar has taken up writing an MLIR newsletter that helps plug those gaps. I'm also not currently trawling through repos under the LLVM GitHub organisation other than the main llvm-project monorepo, though perhaps I should...

I've shied away from reposting job posts as the overhead is just too high. I found dealing with requests to re-advertise (and considering if this is useful to the community) or determining if ads are sufficiently LLVM related just wasn't a good use of time when there's a good alternative. People can check the job post category on LLVM Discourse or search for LLVM on their favourite jobs site.

How it works

There are really two questions to be answered here: how I go about writing it each week, and what tools and services are used. In terms of writing:

  • I have a checklist I follow just to ensure nothing gets missed and help dive back in quickly if splitting work across multiple days.
  • tig --since=$LAST_WEEK_DATE $DIR to step through commits in the past week for each sub-project within the monorepo. Tig is a fantastic text interface for git, and I of course have an ugly script that I bind to a key that generates the [shorthash](github_link) style links I insert for each chosen commit.
  • I make a judgement call as to whether I think a commit might be of interest to others. This is bound to be somewhat flawed, but hopefully better than random selection! I always really appreciate feedback if you think I missed something important, or tips on things you think I should include next week.
    • There's a cheat that practically guarantees a mention in LLVM Weekly without even needing to drop me a note though - write documentation! It's very rare I see a commit that adds more docs and fail to highlight it.
  • Similarly, I scan through LLVM Discourse posts over the past week and pick out discussions I think readers may be interested in. Most RFCs will be picked up as part of this. In some cases if there's a lengthy discussion I might attempt to summarise or point to key messages, but honestly this is rarer than I'd like as it can be incredibly time consuming. I try very hard to remain a neutral voice and not insert personal views on technical discussions.
  • Many ask how long it takes to write, and the answer is of course that it varies. It's easy to spend a lot of time trying to figure out the importance of commits or discussions in parts of the compiler I don't work with much, or to better summarise content. The amount of activity can also vary a lot week to week (especially on Discourse). It's mostly in the 2.5-3.5h range (very rarely any more than 4 hours) to write, copyedit, and send.

There's not much to be said on the tooling side, except that I could probably benefit from refreshing my helper scripts. Mail sending is handled by Mailgun, who have changed ownership three times since I started. I handle double opt-in via a simple Python script on the server side, and mail sending costs me $3-5 a month. Otherwise, I generate the static HTML with some scripts that could do with a bit more love. The only other running costs are the domain name fees and a VPS that hosts some other things as well, so quite insignificant compared to the time commitment.

How you can help

I cannot emphasise enough that I'm not an expert on all parts of LLVM, and I'm also only human and can easily miss things. If you did something you think people may be interested in and I failed to cover it, I almost certainly didn't explicitly review it and deem it not worthy. Please do continue to help me out by dropping links and suggestions. Writing commit messages that make it clear if a change has wider impact also helps increase the chance I'll pick it up.

I noted above that it is particularly time consuming to summarise back and forth in lengthy RFC threads. Sometimes people step up and do this, and I always try to link to it when this happens. The person who initiated a thread or proposal is best placed to write such a summary, and it's also a useful tool to check that you interpreted people's suggestions/feedback correctly, but it can still be helpful if others provide a similar service.

Many people have fed back they find LLVM Weekly useful to stay on top of LLVM developments. This is gratifying, but also a pretty huge responsibility. If you have thoughts on things I could be doing differently to serve the community even better without a big difference in time commitment, I'm always keen to hear ideas and suggestions.

Miscellaneous thoughts

To state the obvious, ten years is kind of a long time. A lot has happened with me in that time - I've got married, we had a son, I co-founded and helped grow a company, and then moved on from that, kicked off the upstream RISC-V LLVM backend, and much more. One of the things I love about working with compilers is that there's always new things to learn, and writing LLVM Weekly helps me learn at least a little more each week in areas outside of where I'm currently working. There's been a lot of changes in LLVM as well. Off the top of my head: there's been the move from SVN to Git, moving the Git repo to GitHub, moving from Phabricator to GitHub PRs, Bugzilla to GitHub issues, mailing lists to Discourse, relicensing to Apache 2.0 with LLVM exception, the wider adoption of office hours and area-specific sync-up calls, and more. I think even the LLVM Foundation was set up a little bit after LLVM Weekly started. It's comforting to see the llvm.org website design remains unchanged though!

It's also been a time period where I've become increasingly involved in LLVM. Upstream work - most notably initiating the RISC-V LLVM backend, organising an LLVM conference, many talks, serving on various program committees for LLVM conferences, etc etc. When I started I felt a lot like someone outside the community looking in and documenting what I saw. That was probably accurate too, given the majority of my work was downstream. While I don't feel like an LLVM "insider" (if such a thing exists?!), I certainly feel a lot more part of the community than I did way back then.

An obvious question is whether there are other ways of pulling together the newsletter that are worth pursuing. My experience with large language models so far has been that they haven't been very helpful in reducing the effort for the more time consuming aspects of producing LLVM Weekly, but perhaps that will change in the future. If I could be automated away then that's great - perhaps I'm misjudging how much of my editorial input is signal rather than just noise, but I don't think we're there yet for AI. More collaborative approaches to producing content would be another avenue to explore. For the current format, the risk is that the communication overhead and stress of seeing if various contributions actually materialise before the intended publication date is quite high. If I did want to spread the load or hand it over, then a rotating editorship would probably be most interesting to me. Even if multiple people contribute, each week a single person would act as a backstop to make sure something goes out.

The unbroken streak of LLVM Weekly editions each Monday has become a bit totemic. It's certainly not always convenient having this fixed commitment, but it can also be nice to have this rhythm to the week. Even if it's a bad week, at least it's something in the bag that people seem to appreciate. Falling into bad habits and frequently missing weeks would be good for nobody, but I suspect that a schedule that allowed the odd break now and then would be just fine. Either way, I feel a sense of relief having hit the 10 year unbroken streak. I don't intend to start skipping weeks, but should life get in the way and the streak gets broken I'll feel rather more relaxed about it having hit that arbitrary milestone.

Looking forwards and thanks

So what's next? LLVM Weekly continues, much as before. I don't know if I'll still be writing it in another 10 years' time, but I'm not planning to stop soon. If it ceases to be a good use of my time, ceases to have value for others, or I find there's a better way of generating similar impact, then it would only be logical to move on. But for now, onwards and upwards.

Many thanks are due. Thank you to the people who make LLVM what it is - both technically and in terms of its community that I've learned so much from. Thank you to Igalia where I work for creating an environment where I'm privileged enough to be paid to contribute upstream to LLVM (get in touch if you have LLVM needs!). Thanks to my family for ongoing support and of course putting up with the times my LLVM Weekly commitment is inconvenient. Thank you to everyone who has been reading LLVM Weekly and especially those sending in feedback or tips or suggestions for future issues.

On a final note, if you've got this far you should make sure you are subscribed to LLVM Weekly and follow on Mastodon or on Twitter.

January 01, 2024 12:00 PM

December 26, 2023

Brian Kardell

Lovely Trees

Lovely Trees

You've read lots of Web Components posts lately, I think this one is a little different.

I'm thrilled that so many people are suddenly learning about, and falling in love with Custom Elements separately from Shadow DOM. It is, in my mind, best to learn about Custom Elements first anyways. And, if your goals are to use it in some simple, static sites, or blogs - then, well, that might be all you need. It's just a better version of what we used to do with jQuery, really.

But…

Here is where I want to say something about Shadow DOM, and I expect it will go something like this:

It's not so bad.

But, stick with me. It won't be that bad, I promise.

The simple light DOM way is, yes, good. But it creates a poor illusion if the component manipulates the DOM. That illusion is easily shattered, as neither component authors nor page authors can reason well about the tree in potentially important ways, because it is a transform of the page author's tree into some tree of the component author's making, with no designed coordination. And then everything starts breaking down quickly, not just CSS. Script, too, uses selectors and tree relationships. Where there is any kind of real complexity, those are the cases Shadow DOM should serve us very well for.

But it falls short

Despite this, it seems that for plenty of people who have tried, Shadow DOM is falling short. All of the posts I'm reading show that people are so put off by the current state that, until now, they'd flipped the bit on Custom Elements too. When one steps back and looks at all of the calls (from people who have been trying to use Shadow DOM for a while) for a few variations of "open style-able roots", or "slots in the light DOM", or "ability to use IDREFs across shadow roots" - or even the fact that we're bringing back scoped styles: it really starts to seem like maybe we've missed (or at least not recognized or prioritized) multiple important use cases along the way.

I think this is because we've focused mainly on giving developers a capability to build and share something that is pretty similar to native widgets - and that's not what most people think they need. Indeed, I think we've failed to wrestle with differences that seem sharp.

For example: Browsers are extremely careful not to expose their "Shadow DOM" internals because the consequences of doing so could be dire. If they didn't, then when browsers try to push an update that makes some otherwise innocuous, even welcome change, everything goes wrong. Users suddenly experience problems in tons of apps. Maybe they suddenly can't activate a control. Perhaps that prevents them from getting the information they need from their bank, or their insurance. It can be a very big deal. People start filing bugs on those websites and writing hate-filled blog posts. Devs from those websites do the same in kind. And so on. No one wins.

However, code libraries (of custom elements or anything else) are different. It's sites themselves, not the browser or the library, that are in charge of deploying upgrades to libraries, which involves testing and avoids the worst surprises. Neither the site author nor the library author, it seems, generally requires the kind of extreme upgrade guarantees that current Shadow DOM is built to grant. Largely, it seems, they would prefer other trade-offs instead.

The design of Shadow DOM also hasn't focused enough on collaboration. I believe (as I have since the beginning) that most uses of Shadow DOM are about some kind of collaboration -- more about preventing friendly fire. But what we've created is perhaps more like a programming language with only private — no protected or "friendly" concepts.

If you think that all of this sounds kind of damning of standards, it's more complicated than it seems. There are no cow paths to pave here. But, what if there were? Because, at this point, it sure seems like there could be.

I'm not saying I'd like to build a summer home there, but the Shadow Trees are actually quite lovely.

Treading Some Cow Paths

Lots of coordination is totally possible; it simply requires jumping through hoops and isn't standard. The community can, and probably should, spend some time proving out and living with a few different ideas. That would make standardizing one of them much easier (standards are at their best, in my take, when they are mostly writing down the slang that developed and was tested naturally, in the wild). Better still, it taps into the creative power of the commons to get us functional solutions now, rather than making us wait forever for solutions that might not arrive for years - or even ever! This sort of approach is how we got things like .querySelector(), .matches(), and .closest().

Today, if you create a Shadow DOM, style rules inside don't leak out to the rest of the page, and don't "leak in" from the page. There are a lot of people who dislike that second part. Tricky thing is, they don't all seem to dislike it the same way, or want the same kind of solution(s). What we need here, I think, is practical experience and, luckily, we have the raw materials to try solutions to some of this in the wild ourselves and see what pains it soothes (and probably, also realize some that it causes).

For example, here are a few major potential philosophies:

Let components decide
Authors extend a new base class which then automatically pulls down a copy of some or all of the styles provided by the page.
Let page authors decide
Lets the page say "these are the base, simple styles for all components" regardless of what they extend. I think this is kind of key, because one of the really nice things about custom elements is that many of us might like to share, find, and mix and match, which is pretty hard to do while also basing a solution on extending a particular base class.

But which one is "right"? All of them feel more natural for some use cases/scenarios. All of them are probably just terrible for others. Maybe there are more variants! Maybe what we need is a "pick one, that's how your page will work" idea. Or, maybe we need all of them to work! I think we can only learn through use and experimentation, so...

Here's a tiny library to let you try each of those things!

And a little glitch you can poke around, inspect, remix, play with, and tweak.

Go on... Pick one. Try it. Remix the glitch, make a pen, try it on your site. Love it or hate it. Let it inspire better ideas. But, most importantly - share your thoughts - regardless! Did it do good things for you? Was it tricky? I want to know!

December 26, 2023 05:00 AM

December 24, 2023

Alex Bradbury

Let the (terminal) bells ring out

I just wanted to take a few minutes to argue that the venerable terminal bell is a helpful and perhaps overlooked tool for anyone who does a lot of their work out of a terminal window. First, an important clarification. Bells ringing, chiming, or (as is appropriate for the season) jingling all sounds very noisy - but although you can configure your terminal emulator to emit a sound for the terminal bell, I'm actually advocating for configuring a non-intrusive but persistent visual notification.

BEL

Our goal is to generate a visual indicator on demand (e.g. when a long-running task has finished) and to do so with minimal fuss. This should work over ssh and without worrying about forwarding connections to some notification daemon. The ASCII BEL control character (alternatively written as BELL by those willing to spend characters extravagantly) meets these requirements. You'll just need co-operation from your terminal emulator and window manager to convert the bell to an appropriate notification.

BEL is 7 in ASCII, but can be printed using \a in printf (including the /usr/bin/printf you likely use from your shell, defined in POSIX). There's even a Rosetta Code page on ringing the terminal bell from various languages. Personally, I like to define a shell alias such as:

alias bell="printf '\aBELL!\n'"

Printing some text alongside the bell is helpful for confirming the bell was triggered as expected, even after it was dismissed. Then, when kicking off a long operation like an LLVM compile and test, I use something like:

cmake --build . && ./bin/llvm-lit -s test; bell

The ; ensures the bell is produced regardless of the exit code of the previous commands. All being well, this sets the urgent hint on the X11 window used by your terminal, and your window manager produces a subtle but persistent visual indicator that is dismissed after you next give focus to the source of the bell. Here's how it looks for me in DWM:

Screenshot of DWM showing a notification from abell

The above example shows 9 workspaces (some of them named), where the llvm workspace has been highlighted because a bell was produced there. You'll also spot that I have a timers workspace, which I tend to use for miscellaneous timers. e.g. a reminder before a meeting is due to start, or when I'm planning to switch a task. I have a small tool for this I might share in a future post.

A limitation versus triggering freedesktop.org Desktop Notifications is that there's no payload / associated message. For me this isn't a big deal, such messages are distracting, and it's easy enough to see the full context when switching workspaces. It's possible it's a problem for your preferred workflow of course.

You could put \a in your terminal prompt ($PS1), meaning a bell is triggered after every command finishes. For me this would lead to too many notifications for commands I didn't want to carefully monitor the output for, but your mileage may vary. After publishing this article, my Igalia colleague Adrian Perez pointed me to a slight variant on this that he uses: in Zsh $TTYIDLE makes it easy to configure behaviour based on the duration of a command and he configures zsh so a bell is produced for commands that take longer than 30 seconds to complete.
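I haven't tried it myself, but as I understand it, Adrian's trick looks something like the following in ~/.zshrc. This is a sketch, not his exact configuration: $TTYIDLE is zsh's idle time of the controlling tty in seconds, and the 30-second threshold and hook name are illustrative.

```shell
# Ring the bell before the prompt is drawn whenever the tty sat idle
# (i.e. a command ran without input) for 30 seconds or more.
bell_on_slow_command() {
  if (( TTYIDLE >= 30 )); then
    printf '\a'
  fi
}
autoload -Uz add-zsh-hook
add-zsh-hook precmd bell_on_slow_command
```

The nice property of keying off $TTYIDLE rather than $PS1 is that quick interactive commands stay silent.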

Terminal emulator support

Unfortunately, setting the urgent hint upon a bell is not supported by gnome-terminal, with a 15-year-old issue left unresolved. It is however supported by the otherwise very similar xfce4-terminal (just enable the visual bell in preferences), and I switched solely due to this issue.

From what I can tell, this is the status of visual bell support via setting the X11 urgent hint:

  • xfce4-terminal: Supported. In Preferences -> Advanced ensure "Visual bell" is ticked.
  • xterm: Set XTerm.vt100.bellIsUrgent: true in your .Xresources file.
  • rxvt-unicode (urxvt): Set URxvt.urgentOnBell: true in your .Xresources file.
  • alacritty: Supported. Works out of the box with no additional configuration needed.
  • gnome-terminal: Not supported.
  • konsole: As far as I can tell it isn't supported. Creating a new profile and setting the "Terminal bell mode" to "Visual Bell" doesn't seem to result in the urgent hint being set.

Article changelog
  • 2023-12-24: Add note about configuring a bell for commands taking longer than a certain threshold duration in Zsh.
  • 2023-12-24: Initial publication date.

December 24, 2023 12:00 PM

December 22, 2023

Ricardo García

Vulkan extensions Igalia helped ship in 2023

Last year I wrote a recap of the Vulkan extensions Igalia helped ship in 2022, and in this post I’ll do the exact same for 2023.

Igalia Logo next to the Vulkan Logo

For context and quoting the previous recap:

The ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions.

In addition to our CTS work, many times we review the Vulkan specification text from those extensions we develop tests for. We also do the same for other extensions and changes, and we also submit fixes and improvements of our own.

So, without further ado, this is the list of extensions we helped ship in 2023.

VK_EXT_attachment_feedback_loop_dynamic_state

This extension builds on last year’s VK_EXT_attachment_feedback_loop_layout, which is used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets. The new extension shipped this year adds support for setting attachment feedback loops dynamically on command buffers. As all extensions that add more dynamic state, the goal here is to reduce the number of pipeline objects applications need to create, which makes using the API more flexible. It was created by our beloved super good coder and Valve contractor Mike Blumenkrantz. We reviewed the spec and are listed as contributors, and we wrote dynamic variants of the existing CTS tests.

VK_EXT_depth_bias_control

A new extension proposed by Joshua Ashton that also helps with layering D3D9 on top of Vulkan. The original problem is quite specific. In D3D9 and other APIs, applications can specify what is called a “depth bias” for geometry using an offset that is to be added directly as an exact value to the original depth of each fragment. In Vulkan, however, the depth bias is expressed as a factor of “r”, where “r” is a number that depends on the depth buffer format and, furthermore, may not have a specific fixed value. Implementations can use different values of “r” in an acceptable range. The mechanism provided by Vulkan without this extension is useful to apply small offsets and solve some problems, but it’s not useful to apply large offsets and/or emulate D3D9 by applying a fixed-value bias. The new extension solves these problems by giving apps the chance to control depth bias in a precise way. We reviewed the spec and are listed as contributors, and wrote CTS tests for this extension to help ship it.

VK_EXT_dynamic_rendering_unused_attachments

This extension was proposed by Piers Daniell from NVIDIA to lift some restrictions in the original VK_KHR_dynamic_rendering extension, which is used in Vulkan to avoid having to create render passes and framebuffer objects. Dynamic rendering is very interesting because it makes the API much easier to use and, in many cases and especially on desktop platforms, it can be shipped without any associated performance loss. The new extension relaxes some restrictions that made pipelines more tightly coupled with render pass instances. Again, the goal here is to be able to reuse the same pipeline object with multiple render pass instances and remove some combinatorial explosions that may occur in some apps. We reviewed the spec and are listed as contributors, and wrote CTS tests for the new extension.

VK_EXT_image_sliced_view_of_3d

Shipped at the beginning of the year by Mike Blumenkrantz, this extension again helps with emulating other APIs on top of Vulkan. Specifically, it allows creating 3D views of 3D images such that the views contain a subset of the slices in the image, using a Z offset and range, in the same way D3D12 allows. We reviewed the spec, we’re listed as contributors, and we wrote CTS tests for it.

VK_EXT_pipeline_library_group_handles

This one comes from Valve contractor Hans-Kristian Arntzen, who is mostly known for working on Proton projects like VKD3D-Proton. The extension is related to ray tracing and adds more flexibility when creating ray tracing pipelines. Ray tracing pipelines can hold thousands of different shaders and are sometimes built incrementally by combining so-called pipeline libraries that contain subsets of those shaders. However, to properly use those pipelines we need to create a structure called a shader binding table, which is full of shader group handles that have to be retrieved from pipelines. Prior to this extension, shader group handles from pipeline libraries had to be requeried once the final pipeline was linked, as they were not guaranteed to be constant throughout the whole process. With this extension, an implementation can tell apps it will not modify shader group handles in subsequent link steps, which makes it easier for apps to build shader binding tables. More importantly, this also more closely matches functionality in DXR 1.1, making it easier to emulate DirectX Raytracing on top of Vulkan ray tracing. We reviewed the spec, we’re listed as contributors and we wrote CTS tests for it.

VK_EXT_shader_object

Shader objects is probably the most notable extension shipped this year, and we contributed small bits to it. This extension makes every piece of state dynamic and removes the need to use pipelines. It’s always used in combination with dynamic rendering, which also removes render passes and framebuffers as explained above. This results in great flexibility from the application point of view. The extension was created by Daniel Story from Nintendo, and its vast set of CTS tests was created by Žiga Markuš, but we added our grain of sand by reviewing the spec and proposing some changes (which is why we’re listed as contributors), as well as fixing some shader object tests and providing some improvements here and there once they had been merged. A good part of this work was done in coordination with Mesa developers who were working on implementing this extension for different drivers.

VK_KHR_video_encode_h264 and VK_KHR_video_encode_h265

Fresh out of the oven, these Vulkan Video extensions allow leveraging the hardware to efficiently encode H.264 and H.265 streams. This year we’ve been doing a ton of work related to Vulkan Video in drivers, libraries like GStreamer and CTS/spec, including the two extensions mentioned above. Although not listed as contributors to the spec in those two Vulkan extensions, our work played a major role in advancing the state of Vulkan Video and getting them shipped.

Epilogue

That’s it for this year! I’m looking forward to helping ship more extensions next year and to doing my part in making Vulkan drivers on Linux (and other platforms!) more stable and feature rich. My Vulkan Video colleagues at Igalia have already started work on future Vulkan Video extensions for AV1 and VP9. Hopefully some of that work is ratified next year. Fingers crossed!

December 22, 2023 02:50 PM

December 21, 2023

Eric Meyer

Pixelating Live with SVG

For reasons I’m not going to get into here, I want to be able to pixelate web pages, or even parts of web pages, entirely from the client side.  I’m using ViolentMonkey to inject scripts into pages, since it lets me easily open the ViolentMonkey browser-toolbar menu and toggle scripts on or off at will.

I’m aware I could take raster screenshots of pages and then manipulate them in an image editor.  I don’t want to do that, though  —  I want to pixelate live.  For reasons.

So far as I’m aware, my only option here is to apply SVG filters by way of CSS.  The problem I’m running into is that I can’t figure out how to construct an SVG filter that will exactly:

  • Divide the element into cells; for example, a grid of 4×4 cells
  • Find the average color of the pixels in each cell
  • Flood-fill each cell with the average color of its pixels
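As a point of reference, those three steps amount to box averaging. Here is a minimal (and deliberately slow) pure-Python sketch of the target output; it defines the math I'm after, though it does nothing to answer the SVG-filter question:

```python
def pixelate(pixels, width, height, cell=4):
    """Box-average pixelation: divide the image into cell x cell blocks,
    average each block's color, and flood-fill the block with that average.

    `pixels` is a flat list of (r, g, b) tuples in row-major order."""
    out = list(pixels)
    for cy in range(0, height, cell):
        for cx in range(0, width, cell):
            # Collect every coordinate in this block (clipped at the edges).
            block = [(x, y)
                     for y in range(cy, min(cy + cell, height))
                     for x in range(cx, min(cx + cell, width))]
            n = len(block)
            # Average each channel over the block.
            avg = tuple(sum(pixels[y * width + x][c] for x, y in block) // n
                        for c in range(3))
            # Flood-fill the block with the average color.
            for x, y in block:
                out[y * width + x] = avg
    return out
```

Because every source pixel contributes to its cell's average, text stays legible as blocky smudges instead of vanishing the way it does with sample-and-dilate filters.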

As a way of understanding the intended result, see the following screenshot of Wikipedia’s home page, and then the corresponding pixelated version, which I generated using the Pixelate filter in Acorn.

Wikipedia in the raw, and blockified.

See how the text is rendered out?  That’s key here.

I found a couple of SVG pixelators in a StackOverflow post, but what they both appear to do is sample pixels at regularly-spaced intervals, then dilate them.  This works pretty okay for things like photographs, but it falls down hard when it comes to text, or even images of diagrams.  Text is almost entirely vanished, as shown here.

The text was there a minute ago, I swear it.

I tried Gaussian blurring at the beginning of my filters in an attempt to overcome this, but that mostly washed the colors out, and didn’t make the text more obviously text, so it was a net loss.  I messed around with dilation radii, and there was no joy there.  I did find some interesting effects along the way, but none of them were what I was after.

I’ve been reading through various tutorials and MDN pages about SVG filters, and I’m unable to figure this out.  Though I may be wrong, I feel like the color-averaging step is the sticking point here, since it seems like <feTile> and <feFlood> should be able to handle the first and last steps.  I’ve wondered if there’s a way to get a convolve matrix to do the color-averaging part, but I have no idea  —  I never learned matrix math, and later-life attempts to figure it out have only gotten me as far as grasping the most general of principles.  I’ve also tried to work out if a displacement map could be of help here, but so far as I can tell, no.  But maybe I just don’t understand them well enough to tell?

It also occurred to me, as I was preparing to publish this, that maybe a solution would be to use some kind of operation (a matrix, maybe?) to downsize the image and then use another operation to upsize it to the original size.  So to pixelfy a 1200×1200 image into 10×10 blocks, smoothly downsize it to 120×120 and then nearest-neighbor it back up to 1200×1200.  That feels like it would make sense as a technique, but once again, even if it does make sense, I can’t figure out how to do it.  I searched for terms like “image scale transform matrix” but I either didn’t get good results, or didn’t understand them when I did.  Probably the latter, if we’re being honest.

So, if you have any ideas for how to make this work, I’m all ears  —  either here in the comments, on your own site, or as forks of the Codepen I set up for exactly that purpose.  My thanks for any help!


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at December 21, 2023 03:35 PM

December 20, 2023

Melissa Wen

The Rainbow Treasure Map Talk: Advanced color management on Linux with AMD/Steam Deck.

Last week marked a major milestone for me: the AMD driver-specific color management properties reached the upstream linux-next!

And to celebrate, I’m happy to share the slide notes from my 2023 XDC talk, “The Rainbow Treasure Map”, along with the individual recording that just dropped last week on YouTube – talk about happy coincidences!

Steam Deck Rainbow: Treasure Map & Magic Frogs

While I may be bubbly and chatty in everyday life, the stage isn’t exactly my comfort zone (hallway talks are more my speed). But the journey of developing the AMD color management properties was so full of discoveries that I simply had to share the experience. Witnessing the fantastic work of Jeremy and Joshua bring it all to life on the Steam Deck OLED was like uncovering magical ingredients and whipping up something truly enchanting.

For XDC 2023, we split our Rainbow journey into two talks. My focus, “The Rainbow Treasure Map,” explored the new color features we added to the Linux kernel driver, diving deep into the hardware capabilities of AMD/Steam Deck. Joshua then followed with “The Rainbow Frogs” and showed the breathtaking color magic released on Gamescope thanks to the power unlocked by the kernel driver’s Steam Deck color properties.

Packing a Rainbow into 15 Minutes

I had so much to tell, but a half-slot talk meant crafting a concise presentation. To squeeze everything into 15 minutes (and calm my pre-talk jitters a bit!), I drafted and practiced those slides and notes countless times.

So grab your map, and let’s embark on the Rainbow journey together!

Slide 1: The Rainbow Treasure Map - Advanced Color Management on Linux with AMD/SteamDeck

Intro: Hi, I’m Melissa from Igalia and welcome to the Rainbow Treasure Map, a talk about advanced color management on Linux with AMD/SteamDeck.

Slide 2: List useful links for this technical talk

Useful links: First of all, if you are not used to the topic, you may find these links useful.

  1. XDC 2022 - I’m not an AMD expert, but… - Melissa Wen
  2. XDC 2022 - Is HDR Harder? - Harry Wentland
  3. XDC 2022 Lightning - HDR Workshop Summary - Harry Wentland
  4. Color management and HDR documentation for FOSS graphics - Pekka Paalanen et al.
  5. Cinematic Color - 2012 SIGGRAPH course notes - Jeremy Selan
  6. AMD Driver-specific Properties for Color Management on Linux (Part 1) - Melissa Wen

Slide 3: Why do we need advanced color management on Linux?

Context: When we talk about colors in the graphics chain, we should keep in mind that we have a wide variety of source content colorimetry, a variety of output display devices and also the internal processing. Users expect consistent color reproduction across all these devices.

The userspace can use GPU-accelerated color management to get it. But this also requires an interface with display kernel drivers that is currently missing from the DRM/KMS framework.

Slide 4: Describe our work on AMD driver-specific color properties

Since April, I’ve been bothering the DRM community by sending patchsets from my work with Joshua to add driver-specific color properties to the AMD display driver. In parallel, discussions on defining a generic color management interface are still ongoing in the community. Moreover, we are still not clear about the diversity of color capabilities among hardware vendors.

To bridge this gap, we defined a color pipeline for Gamescope that fits the latest versions of AMD hardware. It delivers advanced color management features for gamut mapping, HDR rendering, SDR on HDR, and HDR on SDR.

Slide 5: Describe the AMD/SteamDeck - our hardware

AMD/Steam Deck hardware: AMD frequently releases new GPU and APU generations. Each generation comes with a DCN version with display hardware improvements. Therefore, keep in mind that this work uses the AMD Steam Deck hardware and its kernel driver. The Steam Deck is an APU with a DCN3.01 display driver, a DCN3 family.

It’s important to have this information, since newer AMD DCN drivers inherit implementations from previous families, but each generation of AMD hardware may also introduce new color capabilities. Therefore, I recommend familiarizing yourself with the hardware you are working on.

Slide 6: Diagram with the three layers of the AMD display driver on Linux

The AMD display driver in the kernel space: It consists of three layers, (1) the DRM/KMS framework, (2) the AMD Display Manager, and (3) the AMD Display Core. We extended the color interface exposed to userspace by leveraging existing DRM resources and connecting them using driver-specific functions for color property management.

Slide 7: Three-layers diagram highlighting AMD Display Manager, DM - the layer that connects DC and DRM

Bridging DC color capabilities and the DRM API required significant changes in the color management of AMD Display Manager - the Linux-dependent part that connects the AMD DC interface to the DRM/KMS framework.

Slide 8: Three-layers diagram highlighting AMD Display Core, DC - the shared code

The AMD DC is the OS-agnostic layer. Its code is shared between platforms and DCN versions. Examining this part helps us understand the AMD color pipeline and hardware capabilities, since the machinery for hardware settings and resource management are already there.

Slide 9: Diagram of the AMD Display Core Next architecture with main elements and data flow

The newest architecture for AMD display hardware is the AMD Display Core Next.

Slide 10: Diagram of the AMD Display Core Next where only DPP and MPC blocks are highlighted

In this architecture, two blocks have the capability to manage colors:

  • Display Pipe and Plane (DPP) - for pre-blending adjustments;
  • Multiple Pipe/Plane Combined (MPC) - for post-blending color transformations.

Let’s see what we have in the DRM API for pre-blending color management.

Slide 11: Blank slide with no content only a title 'Pre-blending: DRM plane'

DRM plane color properties:

This is the DRM color management API before blending.

Nothing!

Except two basic DRM plane properties: color_encoding and color_range for the input colorspace conversion, which is not covered by this work.

Slide 12: Diagram with color capabilities and structures in AMD DC layer without any DRM plane color interface (before blending), only the DRM CRTC color interface for post blending

In case you’re not familiar with AMD shared code, what we need to do is basically draw a map and navigate there!

We have some DRM color properties after blending, but nothing before blending yet. But much of the hardware programming was already implemented in the AMD DC layer, thanks to the shared code.

Slide 13: Previous Diagram with a rectangle to highlight the empty space in the DRM plane interface that will be filled by AMD plane properties

Still both the DRM interface and its connection to the shared code were missing. That’s when the search begins!

Slide 14: Color Pipeline Diagram with the plane color interface filled by AMD plane properties but without connections to AMD DC resources

AMD driver-specific color pipeline:

Looking at the color capabilities of the hardware, we arrive at this initial set of properties. The path wasn’t exactly like that: we had many iterations and discoveries until we reached this pipeline.

Slide 15: Color Pipeline Diagram connecting AMD plane degamma properties, LUT and TF, to AMD DC resources

The Plane Degamma is our first driver-specific property before blending. It’s used to linearize the color space from encoded values to light linear values.

Slide 16: Describe plane degamma properties and hardware capabilities

We can use a pre-defined transfer function or a user lookup table (in short, LUT) to linearize the color space.

Pre-defined transfer functions for plane degamma are hardcoded curves that go to a specific hardware block called DPP Degamma ROM. It supports the following transfer functions: sRGB EOTF, BT.709 inverse OETF, PQ EOTF, and pure power curves Gamma 2.2, Gamma 2.4 and Gamma 2.6.

We also have a one-dimensional LUT. This 1D LUT has four thousand ninety six (4096) entries, the usual 1D LUT size in the DRM/KMS. It’s an array of drm_color_lut that goes to the DPP Gamma Correction block.
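For reference, these pre-defined curves follow the standard definitions. For example, the sRGB EOTF that the Degamma ROM bakes in is the usual piecewise curve, sketched here in Python:

```python
def srgb_eotf(c):
    """sRGB EOTF: decode an encoded value in [0, 1] to linear light.
    Below the breakpoint the curve is linear; above it, a scaled
    power-2.4 segment."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4
```

The hardcoded ROM curves exist precisely so the driver doesn't have to evaluate (or approximate with a LUT) functions like this one at runtime.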

Slide 17: Color Pipeline Diagram connecting AMD plane CTM property to AMD DC resources

We also have now a color transformation matrix (CTM) for color space conversion.

Slide 18: Describe plane CTM property and hardware capabilities

It’s a 3x4 matrix of fixed points that goes to the DPP Gamut Remap Block.

Both the pre- and post-blending matrices previously went to the same color block. We worked on detaching them to clear both paths.

Now each CTM goes its own way.

Slide 19: Color Pipeline Diagram connecting AMD plane HDR multiplier property to AMD DC resources

Next, the HDR Multiplier. HDR Multiplier is a factor applied to the color values of an image to increase their overall brightness.

Slide 20: Describe plane HDR mult property and hardware capabilities

This is useful for converting images from a standard dynamic range (SDR) to a high dynamic range (HDR). As it can range beyond [0.0, 1.0], subsequent transforms need to use the PQ (HDR) transfer functions.

Slide 21: Color Pipeline Diagram connecting AMD plane shaper properties, LUT and TF, to AMD DC resources

And we need a 3D LUT. But a 3D LUT has a limited number of entries in each dimension, so we want to use it in a colorspace that is optimized for human vision, which means a non-linear space. To deliver it, userspace may need one 1D LUT before the 3D LUT to delinearize content, and another one after it to linearize content again for blending.

Slide 22: Describe plane shaper properties and hardware capabilities

The pre-3D-LUT curve is called Shaper curve. Unlike Degamma TF, there are no hardcoded curves for shaper TF, but we can use the AMD color module in the driver to build the following shaper curves from pre-defined coefficients. The color module combines the TF and the user LUT values into the LUT that goes to the DPP Shaper RAM block.

Slide 23: Color Pipeline Diagram connecting AMD plane 3D LUT property to AMD DC resources

Finally, our rockstar, the 3D LUT. 3D LUT is perfect for complex color transformations and adjustments between color channels.

Slide 24: Describe plane 3D LUT property and hardware capabilities

3D LUT is also more complex to manage and requires more computational resources; as a consequence, its number of entries is usually limited. To overcome this restriction, the array contains samples from the approximated function, and values between samples are estimated by tetrahedral interpolation. AMD supports 17 and 9 as the size of a single dimension. Blue is the outermost dimension, red the innermost.
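As an illustration of the layout just described (the exact memory layout in the driver may differ, and tetrahedral interpolation itself is omitted), the flat index of a LUT sample can be sketched as:

```python
def lut3d_index(r, g, b, size=17):
    """Flat index of sample (r, g, b) in a size^3 3D LUT laid out with
    blue as the outermost (slowest-varying) dimension and red as the
    innermost (fastest-varying) one."""
    return (b * size + g) * size + r
```

So a 17-entry-per-dimension LUT holds 17^3 = 4913 samples, which is why interpolation between samples is unavoidable.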

Slide 25: Color Pipeline Diagram connecting AMD plane blend properties, LUT and TF, to AMD DC resources

As mentioned, we need a post-3D-LUT curve to linearize the color space before blending. This is done by Blend TF and LUT.

Slide 26: Describe plane blend properties and hardware capabilities

Similar to shaper TF, there are no hardcoded curves for Blend TF. The pre-defined curves are the same as the Degamma block, but calculated by the color module. The resulting LUT goes to the DPP Blend RAM block.

Slide 27: Color Pipeline Diagram  with all AMD plane color properties connect to AMD DC resources and links showing the conflict between plane and CRTC degamma

Now we have everything connected before blending. As a conflict between plane and CRTC Degamma was inevitable, our approach doesn’t accept that both are set at the same time.

Slide 28: Color Pipeline Diagram connecting AMD CRTC gamma TF property to AMD DC resources

We also optimized the conversion of the framebuffer to wire encoding by adding support to pre-defined CRTC Gamma TF.

Slide 29: Describe CRTC gamma TF property and hardware capabilities

Again, there are no hardcoded curves and TF and LUT are combined by the AMD color module. The same types of shaper curves are supported. The resulting LUT goes to the MPC Gamma RAM block.

Slide 30: Color Pipeline Diagram with all AMD driver-specific color properties connect to AMD DC resources

Finally, we arrive at the final version of the DRM/AMD driver-specific color management pipeline. With this knowledge, you’re ready to better enjoy the rainbow treasure of AMD display hardware and the world of graphics computing.

Slide 31: SteamDeck/Gamescope Color Pipeline Diagram with rectangles labeling each block of the pipeline with the related AMD color property

With this work, Gamescope/Steam Deck embraces the color capabilities of the AMD GPU. We highlight here how we map the Gamescope color pipeline to each AMD color block.

Slide 32: Final slide. Thank you!

Future works: The search for the rainbow treasure is not over! The Linux DRM subsystem contains many hidden treasures from different vendors. We want more complex color transformations and adjustments available on Linux. We also want to expose all GPU color capabilities from all hardware vendors to the Linux userspace.

Thanks Joshua and Harry for this joint work and the Linux DRI community for all feedback and reviews.

The amazing part of this work comes in the next talk with Joshua and The Rainbow Frogs!

Any questions?


References:

  1. Slides of the talk The Rainbow Treasure Map.
  2. Youtube video of the talk The Rainbow Treasure Map.
  3. Patch series for AMD driver-specific color management properties (upstream Linux 6.8v).
  4. SteamDeck/Gamescope color management pipeline
  5. XDC 2023 website.
  6. Igalia website.

December 20, 2023 12:00 PM

Alex Bradbury

Ownership you can count on

Introduction

I came across the paper Ownership You Can Count On (by Adam Dingle and David F. Bacon, seemingly written in 2007) some years ago and it stuck with me as being an interesting variant on traditional reference counting. Since then I've come across references to it multiple times in the programming language design and implementation community and thought it might be worth jotting down some notes on the paper itself and the various attempts to adopt its ideas.

Ownership you can count on and the Gel language

The basic idea is very straightforward. Introduce the idea of an owning pointer type (not dissimilar to std::unique_ptr in C++) where each object may have only a single owning pointer and the object is freed when that pointer goes out of scope. In addition to that, allow an arbitrary number of non-owning pointers to the object, but require that all non-owning pointers have been destroyed by the time the object is freed. This requirement prevents use-after-free and is implemented using run-time checks. A reference count is incremented or decremented whenever a non-owning pointer is created or destroyed (referred to as alias counting in the paper). That count is checked when the object is freed and if it still has non-owning references (i.e. the count is non-zero), then the program exits with an error.

Contrast this with a conventional reference counting system, where objects are freed when their refcount reaches zero. In the alias counting scheme, the refcount stored on the object is simply decremented when a non-owning reference goes out of scope, and this refcount needs to be compared to 0 only upon the owning pointer going out of scope rather than upon every decrement (as an object is never freed as a result of the non-owning pointer count reaching zero). Additionally, refcount manipulation is never needed when passing around the owning pointer. The paper also describes an analysis that allows many refcount operations to be elided in regions where there aren't modifications to owned pointers of the owned object's subclass/class/superclass (which guarantees the pointed-to object can't be destructed in this region). The paper also claims support for destroying data structures with pointer cycles that can't be automatically destroyed with traditional reference counting. The authors suggest for cases where you might otherwise reach for multiple ownership, (e.g. an arbitrary graph) to allocate an array of owning pointers to hold your nodes, then use non-owning pointers between them.

The paper describes a C#-derived language called Gel which only requires two additional syntactic constructs to support the alias counting model: owned pointers indicated by a ^ (e.g. Foo ^f = new Foo();) and a take operator to take ownership of a value from an owning field. Non-owning pointers are written just as Foo f. They also achieve a rather neat erasure property, whereby if you take a Gel program and remove all ^ and take you'll have a C# program that is valid as long as the original Gel program was valid.

That all sounds pretty great, right? Does this mean we can have it all: full memory safety, deterministic destruction, low runtime and annotation overhead? As you'd expect, there are some challenges:

  • The usability in practice is going to depend a lot on how easy it is for a programmer to maintain the invariant that no unowned pointers outlive the owned pointer, especially as failure to do so results in a runtime crash.
    • Several languages have looked at integrating the idea, so there's some information later in this article on their experiences.
  • Although data structures with pointer cycles that couldn't be automatically destroyed by reference counting can be handled, there are also significant limitations. Gel can't destroy graphs containing non-owning pointers to non-ancestor nodes unless the programmer writes logic to null out those pointers. This could be frustrating and error-prone to handle.
    • The authors present a multi-phase object destruction mechanism aiming to address these limitations, though the cost (potentially recursively descending the graph of the ownership tree 3 times) depends on how much can be optimised away.
  • Although it's not a fundamental limitation of the approach, Gel doesn't yet provide any kind of polymorphism between owned and unowned references. This would be necessary for any modern language with support for generics.
  • The reference count elimination optimisation described in the paper assumes single-threaded execution.
    • Though as noted there, thread escape analysis or structuring groups of objects into independent regions (also see recent work on Verona) could provide a solution.

So in summary, an interesting idea that is meaningfully different to traditional reference counting, but the largest program written using this scheme is the Gel compiler itself and many of the obvious questions require larger scale usage to judge the practicality of the scheme.

Influence and adoption of the paper's ideas

Ownership You Can Count On was written around 2007 and as far as I can tell never published in the proceedings of a conference or workshop, or in a journal. Flicking through the Gel language repository and applying some world-class logical deduction based on the directory name holding a draft version of the paper leads me to suspect it was submitted to PLDI though. Surprisingly it has no academic citations, despite being shared publicly on David F. Bacon's site (and Bacon has a range of widely cited papers related to reference counting / garbage collection). Yet, the work has been used as the basis for memory management in one language (Inko) and was seriously evaluated (partially implemented?) for both Nim and Vale.

Inko started out with a garbage collector, but in 2021 its creator Yorick Peterse announced a plan to adopt the scheme from Ownership You Can Count On, then left his job to work on Inko full-time and successfully transitioned to the new memory management scheme in the Inko 0.10.0 release about a year later. Inko as a language is more ambitious than Gel - featuring parametric polymorphism, lightweight processes for concurrency, and more. Yet it's still early days: it's too soon, for instance, to draw conclusions about performance on modern systems, as optimisations aren't currently applied. Dusty Phillips wrote a blog post earlier this year explaining Inko's memory management through some example data structures, which also includes some thoughts on the usability of the system and some drawbacks. Some of the issues may be more a result of the language being young, e.g. the author notes it took a lot of trial and error to figure out some of the described techniques (perhaps this will be easier once common patterns are better documented and potentially supported by library functions or syntax?), or that debuggability is poor when the program exits with a dangling reference error.

Nim was at one point going to move to Gel's scheme (see the blog post and RFC). I haven't been able to find a detailed explanation of why it was rejected, though Andreas Rumpf (Nim language creator) commented in a Dlang forum discussion thread about the paper that "Nim tried to use this in production before moving to ORC and I'm not looking back, 'ownership you can count on' was actually quite a pain to work with...". Nim has since adopted a more conventional non-atomic reference counting scheme (ARC), with an optional cycle collector (ORC).

Vale previously adopted the Gel / Ownership You Can Count On scheme (calling the unowned references "constraint references"), but has since changed path slightly and now uses "generational references". Rather than a reference count, each allocation includes a generation counter which is incremented each time it is reused. Fat pointers that include the expected value of the generation counter are used, and checked before dereferencing. If they don't match, that indicates the memory location has since been reallocated and the program will fault. This also puts restrictions on the allocator's ability to reuse freed allocations without compromising safety. Vale's memory management scheme retains similarities to Gel: the language is still based around single ownership, and this is exploited to elide checks on owning references/pointers.

Conclusion and other related work

Some tangentially related things that didn't make it into the main body of text above:

  • A language that references Gel in its design but goes in a slightly different direction is Wouter van Oortmerssen's Lobster. Its memory management scheme attempts to infer ownership (and optimise in the case where single ownership can be inferred) rather than requiring single ownership like the languages listed above.
  • One of the discussions on Ownership You Can Count On referenced this Pointer Ownership Model paper as having similarities. It chooses to categorise pointers as either "responsible" or "irresponsible" - cute!

And, that's about it. If you're hoping for a definitive answer on whether alias counting is a winning idea or not I'm sorry to disappoint, but I've at least succeeded in collecting together the various places I've seen it explored and am looking forward to seeing how Inko's adoption of it evolves. I'd be very interested to hear any experience reports of adopting alias counting or using a language like Inko that tries to use it.

December 20, 2023 12:00 PM

December 14, 2023

Jacobo Aragunde

Setting up a minimal, command-line Android emulator on Linux

Android has all kinds of nice development tools, but sometimes you just want to run an apk and don’t need all the surrounding tooling. In my case, I already have my Chromium setup, which can produce binaries for several platforms, including Android.

I usually test on a physical device, a smartphone, but I would like to try a device with a tablet form factor and I don’t have one at hand.

Chromium smartphone and tablet screenshots

Chromium provides different user experiences for smartphones and tablets.

I’ve set up the most stripped-down environment possible to run an Android emulator using the same tools provided by the platform. Notice these are generic instructions, not tied to Chromium tooling, even though I’m doing this mainly to run Chromium.

The first step is to download the latest version of the command line tools, instead of Android Studio, from the Android developer website. We will extract the tools to the location ~/Android/Sdk, in the path where they like to find themselves:

mkdir -p ~/Android/Sdk/cmdline-tools
unzip -d ~/Android/Sdk/cmdline-tools ~/Downloads/commandlinetools-linux-10406996_latest.zip 
mv ~/Android/Sdk/cmdline-tools/cmdline-tools/ ~/Android/Sdk/cmdline-tools/latest

Now that we have the tools installed in ~/Android/Sdk/cmdline-tools/latest, we need to add their binaries to the path. We will also add other paths for tools we are about to install, with the command:

export PATH=~/Android/Sdk/cmdline-tools/latest/bin:~/Android/Sdk/emulator:~/Android/Sdk/platform-tools:$PATH

Run the command above for a one-time use, or add it to ~/.bashrc or your preferred shell configuration for future use.

You will also need to set up another environment variable:

export ANDROID_HOME=~/Android/Sdk

Again, run it once for a one-time use or add it to your shell configuration in ~/.bashrc or equivalent.

Now we use the tool sdkmanager, which came with Android’s command-line tools, to install the rest of the software we need. All of this will be installed inside your $ANDROID_HOME, so in ~/Android/Sdk.

sdkmanager "emulator" "platform-tools" "platforms;android-29" "system-images;android-29;google_apis;x86_64"

In the command above I have installed the emulator, the platform tools (adb and friends), and the system libraries and a system image for Android 10 (API level 29). You may check what else is available with sdkmanager --list; you will find multiple variants of the Android platform and system images.

Finally, we have to set up a virtual machine, called an AVD (“Android Virtual Device”) in Android jargon. We do that with:

avdmanager -v create avd -n Android_API_29_Google -k "system-images;android-29;google_apis;x86_64" -d pixel_c

Here I have created a virtual device called “Android_API_29_Google” with the system image I had downloaded, and the form factor and screen size of a Pixel C, which is a tablet. You may get a list of all the devices that can be simulated with avdmanager list device.

Now we can already start an emulator running the AVD with the name we have just chosen, with:

emulator -avd Android_API_29_Google

The files and configuration for this AVD are stored in ~/.android/avd/Android_API_29_Google.avd, or a sibling directory if you used a different name. Configuration is in the config.ini file, where you can set many options you would normally configure from Android Studio, and even more. I recommend changing at least this value, for your convenience:

hw.keyboard = yes

Otherwise you will find yourself typing URLs with a mouse on a virtual keyboard… Not fun.

Finally, your software will be up to date after a fresh install but, when new versions are released, you will be able to install them with:

sdkmanager --update

Make sure to run this every now and then!

At this point, your setup should be ready, and all the tools you need are in the PATH. Feel free to reuse the commands above to create any number of virtual devices with different Android system images, different form factors… Happy hacking!

by Jacobo Aragunde Pérez at December 14, 2023 05:00 PM

December 13, 2023

Melissa Wen

15 Tips for Debugging Issues in the AMD Display Kernel Driver

A self-help guide for examining and debugging the AMD display driver within the Linux kernel/DRM subsystem.

These tips are based on my experience as an external developer working on the driver, and are shared with the goal of helping others navigate the driver code.

Acknowledgments: These tips were gathered thanks to the countless help received from AMD developers during the driver development process. The list below was obtained by examining open source code, reviewing public documentation, playing with tools, asking in public forums and also with the help of my former GSoC mentor, Rodrigo Siqueira.

Pre-Debugging Steps:

Before diving into an issue, it’s crucial to perform two essential steps:

1) Check the latest changes: Ensure you’re working with the latest AMD driver modifications located in the amd-staging-drm-next branch maintained by Alex Deucher. You may also find bug fixes for newer kernel versions on branches that have the name pattern drm-fixes-<date>.

2) Examine the issue tracker: Confirm that your issue isn’t already documented and addressed in the AMD display driver issue tracker. If you find a similar issue, you can team up with others and speed up the debugging process.

Understanding the issue:

Do you really need to change this? Where should you start looking for changes?

3) Is the issue in the AMD kernel driver or in userspace?: Identifying the source of the issue is essential regardless of the GPU vendor. Sometimes this can be challenging, so here are some helpful tips:

  • Record the screen: Capture the screen using a recording app while experiencing the issue. If the bug appears in the capture, it’s likely a userspace issue, not the kernel display driver.
  • Analyze the dmesg log: Look for error messages related to the display driver in the dmesg log. If the error message appears before the message “[drm] Display Core v...”, it’s not likely a display driver issue. If this message doesn’t appear in your log, the display driver wasn’t fully loaded and you will see a notification that something went wrong here.

4) AMD Display Manager vs. AMD Display Core: The AMD display driver consists of two components:

  • Display Manager (DM): This component interacts directly with the Linux DRM infrastructure. Occasionally, issues can arise from misinterpretations of DRM properties or features. If the issue doesn’t occur on other platforms with the same AMD hardware - for example, only happens on Linux but not on Windows - it’s more likely related to the AMD DM code.
  • Display Core (DC): This is the platform-agnostic part responsible for setting and programming hardware features. Modifications to the DC usually require validation on other platforms, like Windows, to avoid regressions.

5) Identify the DC HW family: Each AMD GPU has variations in its hardware architecture. Features and helpers differ between families, so determining the relevant code for your specific hardware is crucial.

  • Find GPU product information in Linux/AMD GPU documentation
  • Check the dmesg log for the Display Core version (since this commit in Linux kernel 6.3v). For example:
    • [drm] Display Core v3.2.241 initialized on DCN 2.1
    • [drm] Display Core v3.2.237 initialized on DCN 3.0.1

Investigating the relevant driver code:

Keep unrelated driver code from affecting your investigation.

6) Narrow the code inspection down to one DC HW family: the relevant code resides in a directory named after the DC number. For example, the DCN 3.0.1 driver code is located at drivers/gpu/drm/amd/display/dc/dcn301. We all know that AMD’s shared code is huge, and you can use these boundaries to rule out code unrelated to your issue.

7) Newer families may inherit code from older ones: you can find dcn301 using code from dcn30, dcn20 and dcn10 files. It’s crucial to verify which hooks and helpers your driver utilizes so you investigate the right portion. You can leverage ftrace for supplemental validation. As an example, this was useful when I was updating DCN3 color mapping to correctly use its new post-blending color capabilities.

Additionally, you can use two different HW families to compare behaviours. If you see the issue in one but not in the other, you can compare the code and understand what has changed and if the implementation from a previous family doesn’t fit well the new HW resources or design. You can also count on the help of the community on the Linux AMD issue tracker to validate your code on other hardware and/or systems.

This approach helped me debug a 2-year-old issue where the cursor gamma adjustment was incorrect on DCN3 hardware, but working correctly for the DCN2 family. I solved the issue in two steps, thanks to community feedback and validation.

8) Check the hardware capability screening in the driver: You can currently find a list of display hardware capabilities in the drivers/gpu/drm/amd/display/dc/dcn*/dcn*_resource.c file, more precisely in the dcn*_resource_construct() function. Using DCN301 for illustration, here is the list of its hardware caps:

	/*************************************************
	 *  Resource + asic cap harcoding                *
	 *************************************************/
	pool->base.underlay_pipe_index = NO_UNDERLAY_PIPE;
	pool->base.pipe_count = pool->base.res_cap->num_timing_generator;
	pool->base.mpcc_count = pool->base.res_cap->num_timing_generator;
	dc->caps.max_downscale_ratio = 600;
	dc->caps.i2c_speed_in_khz = 100;
	dc->caps.i2c_speed_in_khz_hdcp = 5; /*1.4 w/a enabled by default*/
	dc->caps.max_cursor_size = 256;
	dc->caps.min_horizontal_blanking_period = 80;
	dc->caps.dmdata_alloc_size = 2048;
	dc->caps.max_slave_planes = 2;
	dc->caps.max_slave_yuv_planes = 2;
	dc->caps.max_slave_rgb_planes = 2;
	dc->caps.is_apu = true;
	dc->caps.post_blend_color_processing = true;
	dc->caps.force_dp_tps4_for_cp2520 = true;
	dc->caps.extended_aux_timeout_support = true;
	dc->caps.dmcub_support = true;

	/* Color pipeline capabilities */
	dc->caps.color.dpp.dcn_arch = 1;
	dc->caps.color.dpp.input_lut_shared = 0;
	dc->caps.color.dpp.icsc = 1;
	dc->caps.color.dpp.dgam_ram = 0; // must use gamma_corr
	dc->caps.color.dpp.dgam_rom_caps.srgb = 1;
	dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1;
	dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1;
	dc->caps.color.dpp.dgam_rom_caps.pq = 1;
	dc->caps.color.dpp.dgam_rom_caps.hlg = 1;
	dc->caps.color.dpp.post_csc = 1;
	dc->caps.color.dpp.gamma_corr = 1;
	dc->caps.color.dpp.dgam_rom_for_yuv = 0;

	dc->caps.color.dpp.hw_3d_lut = 1;
	dc->caps.color.dpp.ogam_ram = 1;
	// no OGAM ROM on DCN301
	dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
	dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
	dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
	dc->caps.color.dpp.ogam_rom_caps.pq = 0;
	dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
	dc->caps.color.dpp.ocsc = 0;

	dc->caps.color.mpc.gamut_remap = 1;
	dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; //2
	dc->caps.color.mpc.ogam_ram = 1;
	dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
	dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
	dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
	dc->caps.color.mpc.ogam_rom_caps.pq = 0;
	dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
	dc->caps.color.mpc.ocsc = 1;

	dc->caps.dp_hdmi21_pcon_support = true;

	/* read VBIOS LTTPR caps */
	if (ctx->dc_bios->funcs->get_lttpr_caps) {
		enum bp_result bp_query_result;
		uint8_t is_vbios_lttpr_enable = 0;

		bp_query_result = ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, &is_vbios_lttpr_enable);
		dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) && !!is_vbios_lttpr_enable;
	}

	if (ctx->dc_bios->funcs->get_lttpr_interop) {
		enum bp_result bp_query_result;
		uint8_t is_vbios_interop_enabled = 0;

		bp_query_result = ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, &is_vbios_interop_enabled);
		dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) && !!is_vbios_interop_enabled;
	}

Keep in mind that the documentation of color capabilities is available in the Linux kernel documentation.

Understanding the development history:

What has brought us to the current state?

9) Pinpoint relevant commits: Use git log and git blame to identify commits targeting the code section you’re interested in.

10) Track regressions: If you’re examining the amd-staging-drm-next branch, check for regressions between DC release versions. These are defined by DC_VER in the drivers/gpu/drm/amd/display/dc/dc.h file. Alternatively, find a commit with the format drm/amd/display: 3.2.221 that marks a display release; it’s useful for bisecting. This information helps you understand how outdated your branch is and identify potential regressions. You can expect each DC_VER bump to take around one week. Finally, check the testing log of each release in the report provided on the amd-gfx mailing list, tagged Tested-by: Daniel Wheeler.

Reducing the inspection area:

Focus on what really matters.

11) Identify involved HW blocks: This helps isolate the issue. You can find more information about DCN HW blocks in the DCN Overview documentation. In summary:

  • Plane issues are closer to HUBP and DPP.
  • Blending/Stream issues are closer to MPC, OPP and OPTC. They are related to DRM CRTC subjects.

This information was useful when debugging a hardware rotation issue where the cursor plane got clipped off in the middle of the screen.

Finally, the issue was addressed by two patches.

12) Issues around bandwidth (glitches) and clocks: May be affected by calculations done in these HW blocks and HW specific values. The recalculation equations are found in the DML folder. DML stands for Display Mode Library. It’s in charge of all required configuration parameters supported by the hardware for multiple scenarios. See more in the AMD DC Overview kernel docs. It’s a math library that optimally configures hardware to find the best balance between power efficiency and performance in a given scenario.

Finding some clk variables that affect device behavior may be a sign of it. It’s hard for an external developer to debug this part, since it involves information from HW specs and firmware programming that we don’t have access to. The best option is to provide all the relevant debugging information you have and ask AMD developers to check the values behind your suspicions.

  • Do a trick: If you suspect the power setup is degrading performance, try setting the amount of power supplied to the GPU to the maximum and see if it affects the system behavior with this command: sudo bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"

I learned this when debugging glitches with hardware cursor rotation on the Steam Deck. My first attempt was changing the clock calculation. In the end, Rodrigo Siqueira proposed the right solution, targeting bandwidth, in two steps.

Checking implicit programming and hardware limitations:

Bring implicit programming to the level of consciousness and recognize hardware limitations.

13) Implicit update types: Check if the selected type for atomic update may affect your issue. The update type depends on the mode settings, since programming some modes demands more time for hardware processing. More details in the source code:

/* Surface update type is used by dc_update_surfaces_and_stream
 * The update type is determined at the very beginning of the function based
 * on parameters passed in and decides how much programming (or updating) is
 * going to be done during the call.
 *
 * UPDATE_TYPE_FAST is used for really fast updates that do not require much
 * logical calculations or hardware register programming. This update MUST be
 * ISR safe on windows. Currently fast update will only be used to flip surface
 * address.
 *
 * UPDATE_TYPE_MED is used for slower updates which require significant hw
 * re-programming however do not affect bandwidth consumption or clock
 * requirements. At present, this is the level at which front end updates
 * that do not require us to run bw_calcs happen. These are in/out transfer func
 * updates, viewport offset changes, recout size changes and pixel depth changes.
 * This update can be done at ISR, but we want to minimize how often this happens.
 *
 * UPDATE_TYPE_FULL is slow. Really slow. This requires us to recalculate our
 * bandwidth and clocks, possibly rearrange some pipes and reprogram anything
 * front end related. Any time viewport dimensions, recout dimensions, scaling
 * ratios or gamma need to be adjusted or pipe needs to be turned on (or
 * disconnected) we do a full update. This cannot be done at ISR level and
 * should be a rare event. Unless someone is stress testing mpo enter/exit,
 * playing with colour or adjusting underscan we don't expect to see this call
 * at all.
 */

enum surface_update_type {
	UPDATE_TYPE_FAST, /* super fast, safe to execute in isr */
	UPDATE_TYPE_MED,  /* ISR safe, most of programming needed, no bw/clk change */
	UPDATE_TYPE_FULL, /* may need to shuffle resources */
};

Using tools:

Observe the current state, validate your findings, continue improvements.

14) Use AMD tools to check hardware state and driver programming: these help you understand your driver settings and check the behavior when changing them.

  • DC Visual confirmation: Check multiple planes and pipe split policy.

  • DTN logs: Check display hardware state, including rotation, size, format, underflow, blocks in use, color block values, etc.

  • UMR: Check ASIC info, register values, KMS state - links and elements (framebuffers, planes, CRTCs, connectors). Source: UMR project documentation

15) Use generic DRM/KMS tools:

  • IGT test tools: Use generic KMS tests or develop your own to isolate the issue in the kernel space. Compare results across different GPU vendors to understand their implementations and find potential solutions. AMD also has GPU-specific IGT tests that are expected to work without failures on any AMD GPU. You can check the results of HW-specific tests using different display hardware families, or you can compare expected differences between the generic workflow and the AMD workflow.

  • drm_info: This tool summarizes the current state of a display driver (capabilities, properties and formats) per element of the DRM/KMS workflow. Output can be helpful when reporting bugs.

Don’t give up!

Debugging issues in the AMD display driver can be challenging, but by following these tips and leveraging available resources, you can significantly improve your chances of success.

Worth mentioning: This blog post builds upon my talk, “I’m not an AMD expert, but…” presented at the 2022 XDC. It shares guidelines that helped me debug AMD display issues as an external developer of the driver.

Open Source Display Driver: The Linux kernel/AMD display driver is open source, allowing you to actively contribute by addressing issues listed in the official tracker. Tackling existing issues or resolving your own can be a rewarding learning experience and a valuable contribution to the community. Additionally, the tracker serves as a valuable resource for finding similar bugs, troubleshooting tips, and suggestions from AMD developers. Finally, it’s a platform for seeking help when needed.

Remember, contributing to the open source community through issue resolution and collaboration is mutually beneficial for everyone involved.

December 13, 2023 12:25 PM

December 12, 2023

Andy Wingo

sir talks-a-lot

I know, dear reader: of course you have already seen all my talks this year. Your attentions are really too kind and I thank you. But those other people, maybe you share one talk with them, and then they ask you for more, and you have to go stalking back through the archives to slake their nerd-thirst. This happens all the time, right?

I was thinking of you this morning and I said to myself, why don’t I put together a post linking to all of my talks in 2023, so that you can just send them a link; here we are. You are very welcome, it is really my pleasure.

2023 talks

Scheme + Wasm + GC = MVP: Hoot Scheme-to-Wasm compiler update. Wasm standards group, Munich, 11 Oct 2023. slides

Scheme to Wasm: Use and misuse of the GC proposal. Wasm GC subgroup, 18 Apr 2023. slides

A world to win: WebAssembly for the rest of us. BOB, Berlin, 17 Mar 2023. blog slides youtube

Cross-platform mobile UI: “Compilers, compilers everywhere”. EOSS, Prague, 27 June 2023. slides youtube blog blog blog blog blog blog

CPS Soup: A functional intermediate language. Spritely, remote, 10 May 2023. blog slides

Whippet: A new GC for Guile. FOSDEM, Brussels, 4 Feb 2023. blog event slides

but wait, there’s more

Still here? The full talks archive will surely fill your cup.

by Andy Wingo at December 12, 2023 03:18 PM

December 10, 2023

Andy Wingo

a simple hdr histogram

Good evening! This evening, a note on high-dynamic-range (HDR) histograms.

problem

How should one record garbage collector pause times?

A few options present themselves: you could just record the total pause time. Or, also record the total number of collections, which allows you to compute the average. Maximum is easy enough, too. But then you might also want the median or the p90 or the p99, and these percentile values are more gnarly: you either need to record all the pause times, which can itself become a memory leak, or you need to approximate via a histogram.

Let’s assume that you decide on the histogram approach. How should you compute the bins? It would be nice to have microsecond accuracy on the small end, but if you bin by microsecond you could end up having millions of bins, which is not what you want. On the other end you might have multi-second GC pauses, and you definitely want to be able to record those values.

Consider, though, that it’s not so important to have microsecond precision for a 2-second pause. This points in a direction of wanting bins that are relatively close to each other, but whose absolute separation can vary depending on whether we are measuring microseconds or milliseconds. You want approximately uniform precision over a high dynamic range.

logarithmic binning

The basic observation is that you should be able to make a histogram that gives you, say, 3 significant figures on measured values. Such a histogram would count anything between 1230 and 1240 in the same bin, and similarly for 12300 and 12400. The gap between bins increases as the number of digits grows.

Of course computers prefer base-2 numbers over base-10, so let’s do that. Say we are measuring nanoseconds, and the maximum number of seconds we expect is 100 or so. There are about 2^30 nanoseconds in a second, and 100 is a little less than 2^7, so that gives us a range of 37 bits. Let’s say we want a precision of 4 significant base-2 digits, or 4 bits; then we will have one set of 2^4 bins for 10-bit values, another for 11-bit values, and so on, for a total of 37 × 2^4 bins, or 592 bins. If we use a 32-bit integer count per bin, such a histogram would be 2.5kB or so, which I think is acceptable.

Say you go to compute the bin for a value. Firstly, note that there are some values that do not have 4 significant bits: if you record a measurement of 1 nanosecond, presumably that is just 1 significant figure. These are like the denormals in floating-point numbers. Let’s just say that recording a value val in [0, 2^4 - 1] goes to bin val.

If val is 2^4 or more, then we compute the major and minor components. The major component is the number of bits needed to represent val, minus the 4 precision bits. We can define it like this in C, assuming that val is a 64-bit value:

#define max_value_bits 37
#define precision 4
uint64_t major = 64ULL - __builtin_clzl(val) - precision;

The 64 - __builtin_clzl(val) gives us the ceiling of the base-2 logarithm of the value. And actually, to take into account the denormal case, we do this instead:

uint64_t major = val < (1ULL << precision)
  ? 0ULL
  : 64ULL - __builtin_clzl(val) - precision;

Then to get the minor component, we right-shift val so that just the precision bits below the leading bit remain, unless it is a denormal, in which case the minor component is the value itself:

uint64_t minor = val < (1ULL << precision)
  ? val
  : (val >> (major - 1ULL)) & ((1ULL << precision) - 1ULL);

Then the histogram bucket for a given value can be computed directly:

uint64_t idx = (major << precision) | minor;

Of course, we would prefer to bound our storage, hence the consideration about 37 total bits in 100 seconds of nanoseconds. So let’s do that, and record any out-of-bounds value in the special last bucket, indicating that we need to expand our histogram next time:

if (idx >= (max_value_bits << precision))
  idx = max_value_bits << precision;

The histogram itself can then be defined simply as having enough buckets for all major and minor components in range, plus one for overflow:

struct histogram {
  uint32_t buckets[(max_value_bits << precision) + 1];
};

Getting back the lower bound for a bucket is similarly simple, again with a case for denormals:

uint64_t major = idx >> precision;
uint64_t minor = idx & ((1ULL << precision) - 1ULL);
uint64_t min_val = major
  ? ((1ULL << precision) | minor) << (major - 1ULL)
  : minor;

y u no library

How many lines of code does something need to be before you will include it as a library instead of re-implementing? If I am honest with myself, there is no such limit, as far as code goes at least: only a limit of time. I am happy to re-implement anything; time is my only enemy. But strategically speaking, time too is the fulcrum: if I can save time by re-implementing over integrating a library, I will certainly hack.

The canonical library in this domain is HdrHistogram. But even the C port is thousands of lines of code! A histogram should not take that much code! Hence this blog post today. I think what we have above is sufficient. HdrHistogram’s documentation speaks in terms of base-10 digits of precision, but that is easily translated to base-2 digits: if you want 0.1% precision, then in base-2 you’ll need 10 bits; no problem.

I should of course mention that HdrHistogram includes an API that compensates for coordinated omission, but I think such an API is straightforward to build on top of the basics.

My code, for what it is worth, and which may indeed be buggy, is over here. But don’t re-use it: write your own. It could be much nicer in C++ or Rust, or any other language.

Finally, I would note that somehow this feels very basic; surely there is prior art? I feel like in 2003, Google would have had a better answer than today; alack. Pointers appreciated to other references, and if you find them, do tell me more about your search strategy, because mine is inadequate. Until then, gram you later!

by Andy Wingo at December 10, 2023 09:27 PM

Stéphane Cerveau

Vulkan Video encoder in GStreamer

Vulkan Video encoder in GStreamer #

During the last months of 2023, we, at Igalia, decided to focus on the latest provisional specs proposed by the Vulkan Video Khronos TSG group to support encode operations in an open reference.

As the Khronos TSG Finalizes Vulkan Video Extensions for Accelerated H.264 and H.265 Encode on the 19th of December 2023, here is an update on our work to support both the h264 and h265 codecs in GStreamer.

This work has been possible thanks to the relentless work of the Khronos TSG during the past years, including the IHV vendors, such as AMD, NVIDIA, Intel, but also Rastergrid and its massive work on the specifications wording and the vulkan tooling, and of course Cognizant for its valuable contributions to the Vulkan CTS.

This work started by moving structures around in the drivers and the specifications, along with the CTS test definitions. It has been a joint venture to validate the specifications and check that the drivers properly support the features necessary to encode a video with the h264 and h265 codecs. Breaking news: more formats should come very soon...

The GStreamer code #

Building a community #

Thanks first to Lynne and David Airlie for their RADV/FFmpeg effort: a first implementation was available last year to validate the provisional specifications of the encoders. This work helped us a lot to design the GStreamer version and offer a performant solution.

The GStreamer Vulkan bits have also been possible thanks to the tight collaboration with the CTS folks, who helped us a lot to understand the crashes we could experience in the drivers without any clues.

To write this GStreamer code, we drew inspiration from code written by the GStreamer community for the GstVA encoders and the v4l2codecs plugins. So great kudos to the GStreamer community, especially to He Junayan from Intel, the Collabora folks paving the way to stateless codecs, and of course Igalia's past contributions.

And finally, the active and accurate reviews from the Centricular folks helped us achieve valid synchronized operation in Vulkan (see for example https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/5079), in addition to their initial work to support Vulkan.

Give me the (de)coded bits #

The code is available in the GStreamer MR and we plan to have it ready for the 1.24 release.

This code needs at least Vulkan SDK version 1.3.274 with KHR Vulkan Video encoding extension support. It should become publicly available when the specifications and the tooling ship with the next SDK release in early 2024.

To build it, nothing is simpler than (after installing the LunarG SDK and the GStreamer requirements):

$ meson setup builddir -Dgst-plugins-bad:vulkan-video=enabled
$ ninja -C builddir

And to run it, you'll need an IHV driver supporting the KHR extension, which should happen very soon, or a community-driven driver such as Igalia's Intel ANV driver. You can also try the AMD RADV driver, available here, once the KHR extension lands.

To encode and validate the bitstream you can run this transcode GStreamer pipeline:

$ gst-launch-1.0 videotestsrc ! vulkanupload ! vulkanh265enc ! h265parse ! avdec_h265 ! autovideosink

As you can see, a vulkanupload element is necessary to upload the GStreamer buffers to GPU memory; the buffers are then encoded by vulkanh265enc, which generates a byte-stream, AU-aligned stream that can be decoded with any decoder, here the FFmpeg one.

We are planning to simplify this pipeline and to allow selecting an encoder in multi-device configurations.

It supports most of the Vulkan video encoder features and can encode I, P and B frames for better streaming performance. See this blog post.

Challenges #

One of the most important challenges here was putting all the bits together and understanding what the drivers are telling you when you do something wrong and all you get is an obscure segfault.

Vulkan Video

The Vulkan process to statelessly encode a raw bitstream is quite simple:

  • Initialize the video session with the correct video profile (h264, h265, etc.)
  • Initialize the session parameters, such as the SPS/PPS which will be used by the driver during the encode session
  • Reset the codec state
  • Encode the buffers:
    • Change the quality/rate control if necessary
    • Retrieve the video session parameters from the driver if necessary
    • Set the bitstream standard parameters, such as the slice header data
    • Set the begin video info to give the encoder the context for the references
    • Encode the buffer
    • Query the encoded result
    • End the video coding operation
    • Repeat...
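For the curious, the per-frame steps above map roughly onto the VK_KHR_video_encode_queue command stream. This is only an illustrative sketch: every Vk*Info structure (and its 200-odd parameters) is elided, and the variable names are placeholders:

```
/* Illustrative sketch only; all structure setup is elided. */
vkCmdBeginVideoCodingKHR(cmd, &begin_info);     /* session + reference slots */
vkCmdControlVideoCodingKHR(cmd, &control_info); /* reset state, rate control */
vkCmdEncodeVideoKHR(cmd, &encode_info);         /* picture + codec params */
vkCmdEndVideoCodingKHR(cmd, &end_info);
/* Submit, then read the written bitstream range back with an
   encode feedback query before reusing the output buffer. */
```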

The challenge here was not the state-machine design but rather understanding the underlying bits necessary to tell the driver what to encode and how to encode it. Indeed, to encode a video stream, a lot of parameters (more than 200...) are necessary, and a bad use of these parameters can lead to obscure crashes in the driver stack.

So first it was important to validate the data provided to the Vulkan drivers through the Validation Layers. This is the first and essential step when implementing an application using Vulkan, to avoid headaches; see here.

You can check out the code from here to get the latest version and configure it to validate your project. Otherwise it comes with the Vulkan SDK.

This step helped us a lot to understand the mechanics expected by the driver to provide the frames and their references and encode them properly. It also helped us use some APIs, such as rate control or the quality level.

But the Validation Layers can't do everything, and that's why designing and reviewing a solid codebase in the CTS as a reference is crucial to implementing a GStreamer application or any other third-party application. So a lot of time at Igalia has been dedicated to easing and simplifying the CTS tests to facilitate their usage.

Indeed, the Validation Layers do not validate the video standard parameters, for example the codec-specific data underlying each format.

So we spent a few hours chasing missing parameters, in the CTS as in GStreamer, which were leading to obscure crashes, especially when we were playing with references. A blog post should follow explaining this part.

To debug these missing bits, we used some facilitating tools such as the GfxReconstruct Vulkan layer or VK_LAYER_LUNARG_api_dump, which give you an essential dump of your calls and let you compare the results with other implementations.

Conclusion #

It was a great journey providing a cross-platform solution to the community to encode video using widely adopted codecs such as h264 and h265. As said before, other codecs will come very soon, so stay tuned!

I mentioned the team work in this journey because, without everyone's dedication, it wouldn't have been possible to achieve this work in such a short time span.

And last but not least, I'd like to give special thanks to Valve for sponsoring this work; without them it wouldn't have been possible to go as far and as fast!

We'll attend the Vulkanised 2024 event, where we'll demonstrate this work. Feel free to come and talk to us there; we'll be delighted to hear and discuss your doubts or brilliant ideas!

As usual, if you would like to learn more about Vulkan, GStreamer or any other open multimedia framework, please contact us!

December 10, 2023 12:00 AM

December 08, 2023

Andy Wingo

v8's mark-sweep nursery

Today, a followup to yesterday’s note with some more details on V8’s new young-generation implementation, minor mark-sweep or MinorMS.

A caveat again: these observations are just from reading the code; I haven’t run these past the MinorMS authors yet, so any of these details might be misunderstandings.

The MinorMS nursery consists of pages, each of which is 256 kB, unless huge-page mode is on, in which case they are 2 MB. The total size of the nursery defaults to 72 MB, or 144 MB if pointer compression is off.

There can be multiple threads allocating into the nursery, but let’s focus on the main allocator, which is used on the main thread. Nursery allocation is bump-pointer, whether in a MinorMS page or scavenger semi-space. Bump-pointer regions are called linear allocation buffers, and often abbreviated as Lab in the source, though the class is LinearAllocationArea.

If the current bump-pointer region is too small for the current allocation, the nursery implementation finds another one, or triggers a collection. For the MinorMS nursery, each page collects the set of allocatable spans in a free list; if the free-list is non-empty, it pops off one entry as the current and tries again.

Otherwise, MinorMS needs another page, and specifically a swept page: a page which has been visited since the last GC, and whose spans of unused memory have been collected into a free-list. There is a concurrent sweeping task which should usually run ahead of the mutator, but if there is no swept page available, the allocator might need to sweep some. This logic is in MainAllocator::RefillLabMain.

Finally, if all pages are swept and there’s no Lab big enough for the current allocation, we trigger collection from the roots. The initial roots are the remembered set: pointers from old objects to new objects. Most of the trace happens concurrently with the mutator; when the nursery utilisation rises over 90%, V8 will kick off concurrent marking tasks.

Then once the mutator actually runs out of space, it pauses, drains any pending marking work, marks conservative roots, then drains again. I am not sure whether MinorMS with conservative stack scanning visits the whole C/C++ stacks or whether it manages to install some barriers (i.e. “don’t scan deeper than 5 frames because we collected then, and so all older frames are older”); dunno. All of this logic is in MinorMarkSweepCollector::MarkLiveObjects.

Marking traces the object graph, setting object mark bits. It does not trace pages. However, the MinorMS space promotes in units of pages. So how to decide what pages to promote? The answer is that sweeping partitions the MinorMS pages into empty, recycled, aging, and promoted pages.

Empty pages have no surviving objects, and are very useful because they can be given back to the operating system if needed or shuffled around elsewhere in the system. If they are re-used for allocation, they do not need to be swept.

Recycled pages have some survivors, but not many; MinorMS keeps the page around for allocation in the next cycle, because it has enough empty space. By default, a page is recyclable if it has 50% or more free space after a minor collection, or 30% after a major collection. MinorMS also promotes a page eagerly if in the last cycle, we only managed to allocate into 30% or less of its empty space, probably due to fragmentation. These pages need to be swept before re-use.

Finally, MinorMS doesn’t let pages be recycled indefinitely: after 4 minor cycles, a page goes into the aging pool, in which it is kept unavailable for allocation for one cycle, but is not yet promoted. This allows any new allocations made on that page in the previous cycle to age out and probably die, preventing premature tenuring.

And that’s it. Next time, a note on a way in which generational collectors can run out of memory. Have a nice weekend, hackfolk!

by Andy Wingo at December 08, 2023 02:34 PM

Ricardo García

I'm playing Far Cry 6 on Linux

2023-12-10 UPDATE: From Mastodon, arcepi suggested the instability problems that I described below and served as a motivation to try Far Cry 6 on Linux could be coming from having switched from NVIDIA to AMD without reinstalling Windows, because of leftover files from the NVIDIA drivers. Today morning I reinstalled Windows to test this and, indeed, the Deathloop and Far Cry 6 crashes seem to be gone (yay!). That would have removed my original motivation to try to run the game on Linux, but it doesn’t take away the main points of the post. Do take into account that the instability doesn’t seem to exist anymore (and I hope this applies to more future titles I play) but it’s still the background story to explain why I decided to install Far Cry 6 on my Fedora 39 system, so the original post follows below.

If you’ve been paying attention to the evolution of the Linux gaming ecosystem in recent years, including the release of the Steam Deck and the new Steam Deck OLED, it’s likely your initial reaction to the blog post title is a simple “OK”. However, I’m coming from a very particular place so I wanted to explain my point of view and the significance of this, and hopefully you’ll find the story interesting.

Figure 1. Steam running on Fedora Linux 39

As a background, let me say I’ve always gamed on Windows when using my PC. If you think I’m an idiot for doing so lately, especially because my work at Igalia involves frequently interacting with Valve contractors like Samuel Pitoiset, Timur Kristóf, Mike Blumenkrantz or Hans-Kristian Arntzen, you’d be more than right. But hear me out. I’ve always gamed on Windows because it’s the safe bet. With a couple of small kids at home and very limited free time, when I game everything has to just work. No fiddling around with software, config files, or wasting time setting up the software stack. I’m supposed to boot Windows when I want to play, play, and then turn my computer off. The experience needs to be as close to a console as possible. And, for anything non-gaming, which is most of it, I’d be using my Linux system.

In the last years, thanks to the work done by Valve, the Linux gaming stack has improved a lot. Despite this, I’ve kept gaming on Windows for a variety of reasons:

  1. For a long time, my Linux disk only had a capacity of 128GB, so installing games was not a real possibility due to the amount of disk space they need.

  2. Also, I was running Slackware and installing Steam and getting the whole thing running implied a fair amount of fiddling I didn’t even want to think about.

  3. Then, when I was running Fedora on a large disk, I had kids and I didn’t want to take any risks or possibly waste time on that.

So, what changed?

Figure 2. Sapphire Pulse AMD Radeon RX 6700 box

Earlier this year I upgraded my PC and replaced an old Intel Haswell i7-4770k with a Ryzen R5 7600X, and my GPU changed from an NVIDIA GTX 1070 to a Radeon RX 6700. The jump in CPU power was much bigger and more impressive than the more modest jump in GPU power. But talking about that and the sorry state of the GPU market is a story for another blog post. In any case, I had put up with the NVIDIA proprietary driver for many years and I think, on Windows and for gaming, NVIDIA is the obvious first choice for many people, including me. Dealing with the proprietary blob under Linux was not particularly problematic, especially with the excellent way it’s handled by RPMFusion on Fedora, where essentially you only have to install a few packages and you can mostly forget about it.

However, given my recent professional background I decided to go with an AMD card for the first time. I wanted to use a fully open source graphics stack and I didn’t want to think about making compromises in Wayland support or other fronts whatsoever. Plus, at the time I upgraded my PC, the timing was almost perfect for me to switch to an AMD card, because:

  1. AMD cards were, in general, performing better for the same price than NVIDIA cards, except for ray tracing.

  2. The RX 6700 non-XT was on sale.

  3. It had the same performance as a PS5 or so.

  4. It didn’t draw a ton of power like many recent high-end GPUs (175W, similar to the 1070 and its 150W TDP).

After the system upgrade, I did notice a few more stability problems when gaming under Windows, compared to what I was used to with an NVIDIA card. You can find thousands of opinions, comments and anecdotes on the Internet about the quality of AMD drivers, and a lot of people say they’re a couple of steps below NVIDIA drivers. It’s not my intention at all to pile up on those, but it’s true that my own personal experience has involved generally more crashes in games and more weird situations since I switched to AMD. Normally, it doesn’t get to the point of being annoying at all, but sometimes it’s a bit surprising, and I could definitely notice that increase in instability without any bias on my side, I believe. Which takes us to Far Cry 6.

A few days ago I finished playing Doom Eternal and its expansions (really nice game, by the way!) and I decided to go with Far Cry 6 next. I’m slowly working my way up with some more graphically demanding games that I didn’t feel comfortable with playing on the 1070. I went ahead and installed the game on Windows. Being a big 70GB download (100GB on disk), that took a bit of time. Then I launched it, adjusted the keyboard and mouse settings to my liking and I went to the video options menu. The game had chosen the high preset for me and everything looked good, so I attempted to run the in-game benchmark to see if the game performed well with that preset (I love it when games have built-in benchmarks!). After a few seconds in a loading screen, the game crashed and I was back to the desktop. “Oh, what a bad way to start!”, I thought, without knowing what lied ahead. I launched the game again, same thing.

On the course of the 2 hours that followed, I tried everything:

  1. Launching the main game instead of the benchmark, just in case the bug only happened in the benchmark. Nope.

  2. Lowering quality and resolution.

  3. Disabling any advanced setting.

  4. Trying windowed mode, or borderless full screen.

  5. Vsync off or on.

  6. Disabling the overlays for Ubisoft, Steam, AMD.

  7. Rebooting multiple times.

  8. Uninstalling the drivers normally as well as using DDU and installing them again.

Same result every time. I also searched on the web for people having similar problems, but got no relevant search results anywhere. Yes, a lot of people both using AMD and NVIDIA had gotten crashes somewhere in the game under different circumstances, but nobody mentioned specifically being unable to reach any gameplay at all. That day I went to bed tired and a bit annoyed. I was also close to having run the game for 2 hours according to Steam, which is the limit for refunds if I recall correctly. I didn’t want to refund the game, though, I wanted to play it.

The next day I was ready to uninstall it and move on to another title in my list but, out of pure curiosity, given that I had already spent a good amount of time trying to make it run, I searched for it on the Proton compatibility database to see if it could be run on Linux, and it seemed to be possible. The game appeared to be well supported and it was verified to run on the Deck, which was good because both the Deck and my system have an RDNA2 GPU. In my head I wasn’t fully convinced this could work, because I didn’t know if the problem was in the game (maybe a bug with recent updates) or the drivers or anywhere else (like a hardware problem).

And this was, for me, when the fun started. I installed Steam on Linux from the Gnome Software app. For those who don’t know it, it’s like an app store for Gnome that acts as a frontend to the package manager.

Figure 3. Gnome Software showing Steam as an installed application

Steam showed up there with 3 possible sources: Flathub, an “rpmfusion-nonfree-steam” repo and the more typical “rpmfusion-nonfree” repo. I went with the last option and soon I had Steam in my list of apps. I launched that and authenticated using the Steam mobile app QR code scanning function for logging in (which is a really cool way to log in, by the way, without needing to recall your username and password).

My list of installed games was empty and I couldn’t find a way to install Far Cry 6 because it was not available for Linux. However, I thought there should be an easy way to install it and launch it using the famous Proton compatibility layer, and a quick web search revealed I only had to right-click on the game title, select Properties and choose to “Force the use of a specific Steam Play compatibility tool” under the Compatibility section. Click-click-click and, sure, the game was ready to install. I let it download again and launched it.

Context menu shown after right-clicking on Far Cry 6 on the Steam application, with the Properties option highlighted
Far Cry 6 Compatibility tab displaying the option to force the use of a specific Steam Play compatibility tool

Some stuff pops up about processing or downloading Vulkan shaders and I see it doing some work. In that first launch, the game takes more time to start compared to what I had seen under Windows, but it ends up launching (and subsequent launches were noticeably faster). That includes some Ubisoft Connect stuff showing up before the game starts and so on. Intro videos play normally and I reach the game menu in full screen. No indication that I was running it on Linux whatsoever. I go directly to the video options menu, see that the game again selected the high preset, I turn off VSync and launch the benchmark. Sincerely, honestly, completely and totally expecting it to crash one more time and that would’ve been OK, pointing to a game bug. But no, for the first time in two days this is what I get:

Figure 4. Far Cry 6 benchmark screenshot displaying the game running at over 100 frames per second

The benchmark runs perfectly, no graphical glitches, no stuttering, frame rates above 100FPS normally, and I had a genuinely happy and surprised grin on my face. I laughed out loud and my wife asked what was so funny. Effortless. No command lines, no config files, nothing.

As of today, I’ve played the game for over 30 hours and the game has crashed exactly once out of the blue. And I think it was an unfortunate game bug. The rest of the time it’s been running as smooth and as perfect as the first time I ran the benchmark. Framerate is completely fine and way over the 0 frames per second I got on Windows because it wouldn’t run. The only problem seems to be that when I finish playing and exit to the desktop, Steam is unable to stop the game completely for some reason (I don’t know the cause) and it shows up as still running. I usually click on the Stop button in the Steam interface after a few seconds, it stops the game and that’s it. No problem synchronizing game saves to the cloud or anything. Just that small bug that, again, only requires a single extra click.

2023-12-10 UPDATE: From Mastodon, Jaco G and Berto Garcia tell me the game not stopping problem is present in all Ubisoft games and is directly related to the Ubisoft launcher. It keeps running after closing the game, which makes Steam think the game is still running. You can try to close it from the tray if you see the Ubisoft icon there and, if that fails, you can stop the game like I described above.

Then I remember something that had happened a few months before, prior to starting to play Doom Eternal under Windows. I had tried to play Deathloop first, another game in my backlog. However, the game crashed every few minutes and an error window popped up. The amount and timing of the crashes didn’t look constant, and lowering the graphics settings sometimes would allow me to play the game a bit longer, but in any case I wasn’t able to finish the game intro level without crashes and being very annoyed. Searching for the error message on the web, I saw it looked like a game problem that was apparently affecting not only AMD users, but also NVIDIA ones, so I had mentally classified that as a game bug and, similarly to the Far Cry 6 case, I had given up on running the game without refunding it hoping to be able to play it in the future.

Now I was wondering if it was really a game bug and, even if it was, if maybe Proton could have a workaround for it and maybe it could be played on Linux. Again, ProtonDB showed the game to be verified on the Deck with encouraging recent reports. So I installed Deathloop on Linux, launched it just once and played for 20 minutes or so. No crashes and I got as far as I had gotten on Windows in the intro level. Again, no graphical glitches that I could see, smooth framerates, etc. Maybe it was a coincidence and I was lucky, but I think I will be able to play the game without issues when I’m done with Far Cry 6.

In conclusion, this story is another data point that tells us the quality of Proton as a product and software compatibility layer is outstanding. In combination with some high quality open source Mesa drivers like RADV, I’m amazed the experience can be actually better than gaming natively on Windows. Think about that: the Windows game binary running natively on a DX12 or Vulkan official driver crashes more and doesn’t work as well as the game running on top of a Windows compatibility layer with a graphics API translation layer, on top of a different OS kernel and a different Vulkan driver. Definitely amazing to me and it speaks wonders of the work Valve has been doing on Linux. Or it could also speak badly of AMD Windows drivers, or both.

Sure, some new games on launch have more compatibility issues, bugs that need fixing, maybe workarounds applied in Proton, etc. But even in those cases, if you have a bit of patience, play the game some months down the line and check ProtonDB first (ideally before buying the game), you may be in for a great experience. You don’t need to be an expert either. Not to mention that some of these details are even better and smoother if you use a Steam Deck as compared to an (officially) unsupported Linux distribution like I do.

December 08, 2023 11:38 AM

December 07, 2023

Andy Wingo

the last 5 years of V8's garbage collector

Captain, status report: I’m down here in a Jeffries tube, poking at V8’s garbage collector. However, despite working on other areas of the project recently, V8 is now so large that it’s necessary to ignore whole subsystems when working on any given task. But now I’m looking at the GC in anger: what is its deal? What does V8’s GC even look like these days?

The last public article on the structure of V8’s garbage collector was in 2019; fine enough, but dated. Now in the evening of 2023 I think it could be useful to revisit it and try to summarize the changes since then. At least, it would have been useful to me had someone else written this article.

To my mind, work on V8’s GC has had three main goals over the last 5 years: improving interactions between the managed heap and C++, improving security, and increasing concurrency. Let’s visit these in turn.

C++ and GC

Building on the 2018 integration of the Oilpan tracing garbage collector into the Blink web engine, there was some refactoring to move the implementation of Oilpan into V8 itself. Oilpan is known internally as cppgc.

I find the cppgc name a bit annoying because I can never remember what it refers to, because of the other thing that has been happening in C++ integration: a migration away from precise roots and instead towards conservative root-finding.

Some notes here: with conservative stack scanning, we can hope for better mutator throughput and fewer bugs. The throughput comes from not having to put all live pointers in memory; the compiler can keep them in registers, and avoid managing the HandleScope. You may be able to avoid the compile-time and space costs of stack maps (side tables telling the collector where the pointers are). There are also two classes of bug that we can avoid: holding on to a handle past the lifetime of a handlescope, and holding on to a raw pointer (instead of a handle) during a potential GC point.

Somewhat confusingly, it would seem that conservative stack scanning has garnered the acronym “CSS” inside V8. What does CSS have to do with GC?, I ask. I know the answer but my brain keeps asking the question.

In exchange for this goodness, conservative stack scanning means that because you can’t be sure that a word on the stack refers to an object and isn’t just a spicy integer, you can’t move objects that might be the target of a conservative root. And indeed the conservative edge might actually not point to the start of the object; it could be an interior pointer, which places additional constraints on the heap, that it be able to resolve internal pointers.

Security

Which brings us to security and the admirable nihilism of the sandbox effort. The idea is that everything is terrible, so why not just assume that no word is safe and that an attacker can modify any word they can address. The only way to limit the scope of an attacker’s modifications is then to limit the address space. This happens firstly by pointer compression, which happily also has some delightful speed and throughput benefits. Then the pointer cage is placed within a larger cage, and off-heap data such as Wasm memories and array buffers go in that larger cage. Any needed executable code or external object is accessed indirectly, through dedicated tables.

However, this indirection comes with a cost of a proliferation in the number of spaces. In the beginning, there was just an evacuating newspace, a mark-compact oldspace, and a non-moving large object space. Now there are closer to 20 spaces: a separate code space, a space for read-only objects, a space for trusted objects, a space for each kind of indirect descriptor used by the sandbox, in addition to spaces for objects that might be shared between threads, newspaces for many of the preceding kinds, and so on. From what I can see, managing this complexity has taken a significant effort. The result is pretty good to work with, but you pay for what you get. (Do you get security guarantees? I don’t know enough to say. Better pay some more to be sure.)

Finally, the C++ integration has also had an impact on the spaces structure, and with a security implication to boot. The thing is, conservative roots can’t be moved, but the original evacuating newspace required moveability. One can get around this restriction by pretenuring new allocations from C++ into the mark-compact space, but this would be a performance killer. The solution that V8 is going for is to use the block-structured mark-compact space that is already used for the old-space, but for new allocations. If an object is ever traced during a young-generation collection, its page will be promoted to the old generation, without copying. Originally called minor mark-compact or MinorMC in the commit logs, it was renamed to minor mark-sweep or MinorMS to indicate that it doesn’t actually compact. (V8’s mark-compact old-space doesn’t have to compact: V8 usually chooses to just mark in place. But we call it a mark-compact space because it has that capability.)

This last change is a performance hazard: yes, you keep the desirable bump-pointer allocation scheme for new allocations, but you lose on locality in the old generation, and the rate of promoted bytes will be higher than with the semi-space new-space. The only relief is that for a given new-space size, you can allocate twice as many objects, because you don’t need the copy reserve.

Why do I include this discussion in the security section? Well, because most MinorMS commits mention this locked bug. One day we’ll know, but not today. I speculate that evacuating is just too rich a bug farm, especially with concurrency and parallel mutators, and that never-moving collectors will have better security properties. But again, I don’t know for sure, and I prefer to preserve my ability to speculate rather than to ask for too many details.

Concurrency

Speaking of concurrency, ye gods, the last few years have been quite the ride I think. Every phase that can be done in parallel (multiple threads working together to perform GC work) is now fully parallel: semi-space evacuation, mark-space marking and compaction, and sweeping. Every phase that can be done concurrently (where the GC runs threads while the mutator is running) is concurrent: marking and sweeping. A major sweep task can run concurrently with an evacuating minor GC. And, V8 is preparing for multiple mutators running in parallel. It’s all a bit terrifying but again, with engineering investment and a huge farm of fuzzers, it seems to be a doable transition.

Concurrency and threads mean that V8 has sprouted new schedulers: should a background task have incremental or concurrent marking? How many sweepers should a given isolate have? How should you pause concurrency when the engine needs to do something gnarly?

The latest in-progress work would appear to be concurrent marking of the new-space. I think we should expect this work to result in a lower overall pause-time, though I am curious also to learn more about the model: how precise is it? Does it allow a lot of slop to get promoted? It seems to have a black allocator, so there will be some slop, but perhaps it can avoid promotion for those pages. I don’t know yet.

Summary

Yeah, GCs, man. I find the move to a non-moving young generation quite interesting, and I wish the team luck as they whittle down the last sharp edges from the conservative-stack-scanning performance profile. The sandbox is pretty fun too. All good stuff and I look forward to spending a bit more time with it; engineering out.

by Andy Wingo at December 07, 2023 12:15 PM

Eric Meyer

Three Decades of HTML

A few days ago was the 30th anniversary of the first time I wrote an HTML document.  Back in 1993, I took a Usenet posting of the “Incomplete Mystery Science Theater 3000 Episode Guide” and marked it up.  You can see the archived copy here on meyerweb.  At some point, the markup got updated for reasons I don’t remember, but I can guarantee you the original had uppercase tag names and I didn’t close any paragraphs.  That’s because I was using <P> as a shorthand for <BR><BR>, which was the style at the time.

Its last-updated date of December 3, 1993, is also the date I created it.  I was on lobby duty with the CWRU Film Society, and had lugged a laptop (I think it was an Apple PowerBook of some variety, something like a 180, borrowed from my workplace) and a printout of the HTML specification (or maybe it was “Tags in HTML”?) along with me.

I spent most of that evening in the lobby of Strosacker Auditorium, typing tags and doing find-and-replace operations in Microsoft Word, and then saving as text to a file that ended in .html, which was the style at the time.  By the end of the night, I had more or less what you see in the archived copy.

The only visual change between then and now is that a year or two later, when I put the file up in my home directory, I added the toolbars at the top and bottom of the page  —  toolbars I’d designed and made a layout standard as CWRU’s webmaster.  Which itself only happened because I learned HTML.

A couple of years ago, I was fortunate enough to be able to relate some of this story to Joel Hodgson himself.  The story delighted him, which delighted me, because delighting someone who has been a longtime hero really is one of life’s great joys.  And the fact that I got to have that conversation, to feel that joy, is inextricably rooted in my sitting in that lobby with that laptop and that printout and that Usenet post, adding tags and saving as text and hitting reload in Mosaic to instantly see the web page take shape, thirty years ago this week.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at December 07, 2023 03:40 AM

December 01, 2023

Brian Kardell

The Effects of Nuclear Maths

The Effects of Nuclear Maths

Carrying on Eric's work on “The Effects of Nuclear Weapons” - with native MathML.

As I have explained a few times since first writing about it in Harold Crick and the Web Platform in Jan 2019, I'm interested in MathML (and SVG). I'm interested in them because they are historically special: integrated into the HTML specification/parser even. Additionally, their lineage makes them unnecessarily weird and "other". Worse, they're dramatically under-invested in/de-prioritized. I believe all that really needs to change. Note that none of that says that I need MathML, personally, for the stuff that I write - but rather that I understand why it is that way and the importance for the web at large (and society). This is also the first time I've helped advocate for a thing I didn't have a bunch of first-hand experience with.

Since then, I've worked with the Math Community Group to create MathML-Core to resolve those historical issues, get a W3C TAG review, an implementation in Chromium (also done by Igalians) and recharter the MathML Working Group (which I now co-chair).

For authors, the status of things is way better. Practically speaking, you can use MathML and render some pretty darn nice looking maths for a whole lot of stuff. Some of the integrations with CSS specifically aren't aligned, but they didn't exist at all before, so maybe that’s ok.

Despite all of this, it seems lots of people and systems still use MathJax to render as SVG, sometimes also rendering MathML for accessibility reasons. But, why? Rendering that way isn't great for performance if you don't have to.

I really wanted to better understand. So I decided to tackle some test projects myself and to take you all along with me by documenting what I do, as I do it.

Nuking all the MathJax (references)

Last year, my colleague Eric Meyer wrote a piece about some work he'd been doing on a passion project Recreating “The Effects of Nuclear Weapons” for the Web, in which they bring a famous book to the web as faithfully as they can.

This piece includes a lot of math - hundreds of equations.

You can see the result of their work on atomicarchive.com. The first two chapters aren't so math heavy, but Chapter 3 alone contains 179 equations!

It looks good!

Luckily, it's on github.

Oh hey! These are just static HTML files!

Inside, I see that each file contains a block like this:

<script id="MathJax-script" async src="mathjax/tex-chtml.js"></script>
<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
MathJax.Hub.Config({
  CommonHTML: {
    linebreaks: {automatic: true}
  }
});
</script>

These lines configure how MathJax works, including how math is identified in the page. If you scan through the source of, say, chapter 3, you will see lots of things like this:

<p>where $c_0$ is the ambient speed of sound (ahead of the shock front), $p$ is the peak overpressure (behind the shock front), $P_0$ is the ambient pressure (ahead of the shock), and $\gamma$ is the ratio of the specific heats of the medium, i.e., air. If $\gamma$ is taken as 1.4, which is the value at moderate temperatures, the equation for the shock velocity becomes</p>
    
\[
U = c_0 \left( 1 + \frac{6p}{7P_0} \right)^{1/2},
\]
Code sample from chapter 3

What you can see here is that the text is peppered with those delimiters from the configuration, and inside is TeX.

People don't like writing *ML

I just want to take a minute and address the elephant in the room: Most people don't want to write MathML, and that's not as damning as it sounds. Case in point, as I am writing this post, I am not writing <p> and <h1> and <section>s and so on… I'm writing markdown, as the gods intended.

TimBL’s original idea for HTML wasn't that people would write it - it was that machines would write it. His first browser was an editor too - but a relatively basic rich text editor, a lot like the ones you use every day on github issues or a million other places. The very first project Tim did with HTML was the CERN phone book, which did all kinds of complicated stuff to dynamically write HTML - the same way we do today. In fact, it feels like 90% of engineering is transforming strings into other strings. And it's nothing new: TeX has been around almost as long as I've been alive and it's a fine thing to process.

This format, of text tokens being anywhere in the document, is far better suited to processing on a build or server than on the client. On the client we'd more ideally work with the tree and manage re-flow and so on. Here the tree is irrelevant - in the way even - because there are some conflicting escapes or encodings… But, it's just a string... And there are lots of things that can process that string and turn it into MathML.

My colleague Fred Wang has an open source project called TeXZilla, so I'm going to use that.

Nuking the scripts

First things first, let's get rid of those script tags. Since they contain the very tokens MathJax would be searching the body for, and we'll be processing the whole document, they'll just cause problems anyways.

Ok. Done.

Next, I checkout the TeXZilla project parallel to my book project and try to use the ‘streamfilter’ which lets me just cat and | (pipe) a file to process it…

 >cat chapter1.html | node "../../texzilla/node_modules/texzilla/TeXZilla.js" streamfilter

Hmm… It fails with an error that looks something like this (not exactly if you're following along, this is from a different file):

>cat chapter1.html | node "../../texzilla/node_modules/texzilla/TeXZilla.js" streamfilter

Error: Parse error on line 145:
...}\text{Neutron + }&\!\left\{\enspace
---------------------^
Expecting 'ENDMATH1', got 'COLSEP'
.....

I don’t know TeX, so this is kind of a mystery to me. I search for the quoted line and have a look. I find some code surrounded by \begin{align} and \end{align}. I turn to the interwebs and find some examples where it says \begin{aligned} and \end{aligned}, so I try just grep/changing it across the files and, sure enough, it processes further. Which is right? Who knows - let's move on...

I find a similar error like...

Error: Parse error on line 759:
...xt{ rads. } \; \textit{Answer}\]</asi
-----------------------^
Expecting '{', got 'TEXTARG'

Once again on the interwebs I find some code like Textit{...} (uppercase T) - so, I try that. Sure enough, it works fine. Or wait... Does it? No, it just leaves that code untransformed now.

But… Now we have MathML! At least in some chapters!

With a little more looking, I find that this isn't a primary sort of TeX thing - it's just a quirk of TeXZilla - and I turn those textit{...} into mathit{...} and now we're good. Finally. Burned some time on that, but not too bad.

I encounter two other problems along the way. First, escaping: the source contains ampersand escapes for a few Unicode characters, which aren't a problem when you're accessing the text from body.innerText or something, but are a problem here. Still, it takes only a couple of minutes more to replace them with their actual Unicode counterparts (&phi; ⇒ φ, &lambda; ⇒ λ, &times; ⇒ ×).

When I’m done I can send that to a file, something like this…

>cat chapter3.html | node "../../texzilla/node_modules/texzilla/TeXZilla.js" streamfilter > output.html

And then just open it in a browser.

There are some differences. MathJax adds a margin to math that's on its own line (block) that isn't there natively. The math formulas that are ‘tabular’ aren't aligning their text (the =), and the font is maybe a little small. After a little fooling around, I add this snip to the stylesheet:

math[display="block" i] {
  margin: 1rem 0;
}

math {
  font-size: 1rem;
}

mfrac {
  padding-inline-start: 0.275rem;
  padding-inline-end: 0.275rem;
}

mtd {
  text-align: left;
}

And the last problem I hit? An error in the actual inline TeX in the source, I think, where it was missing a $. MathJax simply left that un-transformed, just like my previous example because it lacked the closing $, but TeXZilla was all errory about it. I sent a pull request, and I guess we'll see.

And like… that's it.

All in all, the processing time here including learning the tools and figuring out the quirks in a language I'm unfamiliar with is maybe half a day - but that's scattered across short bits of free time here and there.

Is it done? Is it good enough? It definitely needs some proofing to see what I've missed (I'm sure there's some!), and good scrutiny generally, but... It looks pretty good at first glance, right (below)?

Native MathML on the left, MathJax on the right

You can browse it at:

https://bkardell.com/effects-of-nuclear-math/html/.

And, send me an issue if you find anything that should be improved before I eventually send a pull request, or you see something that looks bad enough that you wouldn't switch to native MathML for this.

But, all in all: I feel like you can make pretty good use of native MathML at this point.

December 01, 2023 05:00 AM

November 26, 2023

Alex Bradbury

Storing data in pointers

Introduction

On mainstream 64-bit systems, the maximum bit-width of a virtual address is somewhat lower than 64 bits (commonly 48 bits). This gives an opportunity to repurpose those unused bits for data storage, if you're willing to mask them out before using your pointer (or have a hardware feature that does that for you - more on this later). I wondered what happens to userspace programs relying on such tricks as processors gain support for wider virtual addresses, hence this little blog post. TL;DR is that there's no real change unless certain hint values to enable use of wider addresses are passed to mmap, but read on for more details as well as other notes about the general topic of storing data in pointers.

Storage in upper bits assuming 48-bit virtual addresses

Assuming your platform has 48-bit wide virtual addresses, this is pretty straightforward. You can stash whatever you want in those 16 bits, but you'll need to ensure you mask them out for every load and store (which is cheap, but has at least some cost) and would want to be confident that there are no other would-be users of these bits. The masking would be slightly different in kernel space due to rules on how the upper bits are set.

What if virtual addresses are wider than 48 bits?

So we've covered the easy case, where you can freely (ab)use the upper 16 bits for your own purposes. But what if you're running on a system that has wider than 48 bit virtual addresses? How do you know that's the case? And is it possible to limit virtual addresses to the 48-bit range if you're sure you don't need the extra bits?

You can query the virtual address width from the command-line by catting /proc/cpuinfo, which might include a line like address sizes : 39 bits physical, 48 bits virtual. I'd hope there's a way to get the same information without parsing /proc/cpuinfo, but I haven't been able to find it.

As for how to keep using those upper bits on a system with wider virtual addresses, helpfully the behaviour of mmap is defined with this compatibility in mind. It's explicitly documented for x86-64, for AArch64 and for RISC-V that addresses beyond 48-bits won't be returned unless a hint parameter beyond a certain width is used (the details are slightly different for each target). This means if you're confident that nothing within your process is going to be passing such hints to mmap (including e.g. your malloc implementation), or at least that you'll never need to try to reuse upper bits of addresses produced in this way, then you're free to presume the system uses no more than 48 bits of virtual address.

Top byte ignore and similar features

Up to this point I've completely skipped over the various architectural features that allow some of the upper bits to be ignored upon dereference, essentially providing hardware support for this type of storage of additional data within pointers by making additional masking unnecessary.

  • x86-64 keeps things interesting by having slightly different variants of this for Intel and AMD.
    • Intel introduced Linear Address Masking (LAM), documented in chapter 6 of their document on instruction set extensions and future features. If enabled this modifies the canonicality check so that, for instance, on a system with 48-bit virtual addresses bit 47 must be equal to bit 63. This would allow bits 62:48 (15 bits) to be freely used with no masking needed. "LAM57" allows 62:57 to be used (6 bits). It seems as if Linux is currently opting to only support LAM57 and not LAM48. Support for LAM can be configured separately for user and supervisor mode, but I'll refer you to the Intel docs for details.
    • AMD instead describes Upper Address Ignore (see section 5.10) which allows bits 63:57 (7 bits) to be used, and unlike LAM doesn't require bit 63 to match the upper bit of the virtual address. As documented in LWN, this caused some concern from the Linux kernel community. Unless I'm missing it, there doesn't seem to be any level of support merged in the Linux kernel at the time of writing.
  • RISC-V has the proposed pointer masking extension which defines new supervisor-level extensions Ssnpm, Smnpm, and Smmpm to control it. These allow PMLEN to potentially be set to 7 (masking the upper 7 bits) or 16 (masking the upper 16 bits). In usual RISC-V style, it's not mandated which of these are supported, but the draft RVA23 profile mandates that PMLEN=7 must be supported at a minimum. Eagle-eyed readers will note that the proposed approach has the same issue that caused concern with AMD's Upper Address Ignore, namely that the most significant bit is no longer required to be the same as the top bit of the virtual address. This is noted in the spec, with the suggestion that this is solvable at the ABI level and some operating systems may choose to mandate that the MSB not be used for tagging.
  • AArch64 has the Top Byte Ignore (TBI) feature, which as the name suggests just means that the top 8 bits of a virtual address are ignored when used for memory accesses and can be used to store data. Any other bits between the virtual address width and top byte must be set to all 0s or all 1s, as before. TBI is also used by Arm's Memory Tagging Extension (MTE), which uses 4 of those bits as the "key" to be compared against the "lock" tag bits associated with a memory location being accessed. Armv8.3 defines another potential consumer of otherwise unused address bits, pointer authentication which uses 11 to 31 bits depending on the virtual address width if TBI isn't being used, or 3 to 23 bits if it is.

A relevant historical note that multiple people pointed out: the original Motorola 68000 had a 24-bit address bus and so the top byte was simply ignored which caused well documented porting issues when trying to expand the address space.

Storing data in least significant bits

Another commonly used trick I'd be remiss not to mention is repurposing a small number of the least significant bits in a pointer. If you know a certain set of pointers will only ever be used to point to memory with a given minimal alignment, you can exploit the fact that the lower bits corresponding to that alignment will always be zero and store your own data there. As before, you'll need to account for the bits you repurpose when accessing the pointer - in this case either by masking, or by adjusting the offset used to access the address (if those least significant bits are known).

As suggested by Per Vognsen after this article was first published, you can exploit x86's scaled index addressing mode to use up to 3 bits that are unused due to alignment, while storing your data in the upper bits. The scaled index addressing mode means there's no need for separate pointer manipulation upon access: e.g. for an 8-byte aligned address, store it right-shifted by 3 and use the top 3 bits for metadata, then scale by 8 using SIB when accessing (which effectively ignores the top 3 bits). This has some trade-offs, but is such a neat trick I felt I had to include it!

Some real-world examples

To state what I hope is obvious, this is far from an exhaustive list. The point of this quick blog post was really to discuss cases where additional data is stored alongside a pointer, but of course unused bits can also be exploited to allow a more efficient tagged union representation (and this is arguably more common), so I've included some examples of that below:

  • The fixie trie, is a variant of the trie that uses 16 bits in each pointer to store a bitmap used as part of the lookup logic. It also exploits the minimum alignment of pointers to repurpose the least significant bit to indicate if a value is a branch or a leaf.
  • On the topic of storing data in the least significant bits, we have a handy PointerIntPair class in LLVM to allow the easy implementation of this optimisation. There's also an 'alignment niches' proposal for Rust which would allow this kind of optimisation to be done automatically for enums (tagged unions). Another example of repurposing the LSB found in the wild would be the Linux kernel using it for a spin lock (thanks Vegard Nossum for the tip, who notes this is used in the kernel's directory entry cache hashtable). There are surely many many more examples.
  • Go repurposes both upper and lower bits in its taggedPointer, used internally in its runtime implementation.
  • If you have complete control over your heap then there's more you can do to make use of embedded metadata, including using additional bits by avoiding allocation outside of a certain range and using redundant mappings to avoid or reduce the need for masking. OpenJDK's ZGC is a good example of this, utilising a 42-bit address space for objects and upon allocation mapping pages to different aliases to allow pointers using their metadata bits to be dereferenced without masking.
  • A fairly common trick in language runtimes is to exploit the fact that values can be stored inside the payload of double floating point NaN (not a number) values and overlap it with pointers (knowing that the full 64 bits aren't needed) and even small integers. There's a nice description of this in JavaScriptCore, but it was famously used in LuaJIT. Andy Wingo also has a helpful write-up. Along similar lines, OCaml steals just the least significant bit in order to efficiently support unboxed integers (meaning integers are 63-bit on 64-bit platforms and 31-bit on 32-bit platforms).
  • Apple's Objective-C implementation makes heavy use of unused pointer bits, with some examples documented in detail on Mike Ash's excellent blog (with a more recent scheme described on Brian T. Kelley's blog). Inlining the reference count (falling back to a hash lookup upon overflow) is a fun one. Another example of using the LSB to store small strings in-line is squoze.
  • V8 opts to limit the heap used for V8 objects to 4GiB using pointer compression, where an offset is used alongside the 32-bit value (which itself might be a pointer or a 31-bit integer, depending on the least significant bit) to refer to the memory location.
  • As this list is becoming more of a collection of things slightly outside the scope of this article I might as well round it off with the XOR linked list, which reduces the storage requirements for doubly linked lists by exploiting the reversibility of the XOR operation.
  • I've focused on storing data in conventional pointers on current commodity architectures but there is of course a huge wealth of work involving tagged memory (also an area where I've dabbled - something for a future blog post perhaps) and/or alternative pointer representations. I've touched on this with MTE (mentioned due to its interaction with TBI), but another prominent example is of course CHERI which moves to using 128-bit capabilities in order to fit in additional inline metadata. David Chisnall provided some observations based on porting code to CHERI that relies on the kind of tricks described in this post.

Fin

What did I miss? What did I get wrong? Let me know on Mastodon or email (asb@asbradbury.org).

You might be interested in the discussion of this article on lobste.rs, on HN, on various subreddits, or on Mastodon.


Article changelog
  • 2023-12-02:
    • Reference Brian T. Kelley's blog providing a more up-to-date description of "pointer tagging" in Objective-C. Spotted on Mastodon.
  • 2023-11-27:
    • Mention Squoze (thanks to Job van der Zwan).
    • Reworded the intro so as not to claim "it's quite well known" that the maximum virtual address width is typically less than 64 bits. This might be interpreted as shaming readers for not being aware of that, which wasn't my intent. Thanks to HN reader jonasmerlin for pointing this out.
    • Mention CHERI is the list of "real world examples" which is becoming dominated by instances of things somewhat different to what I was describing! Thanks to Paul Butcher for the suggestion.
    • Link to relevant posts on Mike Ash's blog (suggested by Jens Alfke).
    • Link to the various places this article is being discussed.
    • Add link to M68k article suggested by Christer Ericson (with multiple others suggesting something similar - thanks!).
  • 2023-11-26:
    • Minor typo fixes and rewordings.
    • Note the Linux kernel repurposing the LSB as a spin lock (thanks to Vegard Nossum for the suggestion).
    • Add SIB addressing idea shared by Per Vognsen.
    • Integrate note suggested by Andy Wingo that explicit masking often isn't needed when the least significant pointer bits are repurposed.
    • Add a reference to Arm Pointer Authentication (thanks to the suggestion from Tzvetan Mikov).
  • 2023-11-26: Initial publication date.

November 26, 2023 12:00 PM

November 24, 2023

Andy Wingo

tree-shaking, the horticulturally misguided algorithm

Let’s talk about tree-shaking!

looking up from the trough

But first, I need to talk about WebAssembly’s dirty secret: despite the hype, WebAssembly has had limited success on the web.

There is Photoshop, which does appear to be a real success. 5 years ago there was Figma, though they don’t talk much about Wasm these days. There are quite a number of little NPM libraries that use Wasm under the hood, usually compiled from C++ or Rust. I think Blazor probably gets used for a few in-house corporate apps, though I could be fooled by their marketing.

You might recall the hyped demos of 3D first-person-shooter games with Unreal engine again from 5 years ago, but that was the previous major release of Unreal and was always experimental; the current Unreal 5 does not support targetting WebAssembly.

Don’t get me wrong, I think WebAssembly is great. It is having fine success in off-the-web environments, and I think it is going to be a key and growing part of the Web platform. I suspect, though, that we are only just now getting past the trough of disillusionment.

It’s worth reflecting a bit on the nature of web Wasm’s successes and failures. Taking Photoshop as an example, I think we can say that Wasm does very well at bringing large C++ programs to the web. I know that it took quite some work, but I understand the end result to be essentially the same source code, just compiled for a different target.

Similarly for the JavaScript module case, Wasm finds success in getting legacy C++ code to the web, and as a way to write new web-targetting Rust code. These are often tasks that JavaScript doesn’t do very well at, or which need a shared implementation between client and server deployments.

On the other hand, WebAssembly has not been a Web success for DOM-heavy apps. Nobody is talking about rewriting the front-end of wordpress.com in Wasm, for example. Why is that? It may sound like a silly question to you: Wasm just isn’t good at that stuff. But why? If you dig down a bit, I think it’s that the programming models are just too different: the Web’s primary programming model is JavaScript, a language with dynamic typing and managed memory, whereas WebAssembly 1.0 was about static typing and linear memory. Getting to the DOM from Wasm was a hassle that was overcome only by the most ardent of the true Wasm faithful.

Relatedly, Wasm has also not really been a success for languages that aren’t, like, C or Rust. I am guessing that wordpress.com isn’t written mostly in C++. One of the sticking points for this class of language is that C#, for example, will want to ship with a garbage collector, and that it is annoying to have to do this. Check my article from March this year for more details.

Happily, this restriction is going away, as all browsers are going to ship support for reference types and garbage collection within the next months; Chrome and Firefox already ship Wasm GC, and Safari shouldn’t be far behind thanks to the efforts from my colleague Asumu Takikawa. This is an extraordinarily exciting development that I think will kick off a whole ‘nother Gartner hype cycle, as more languages start to update their toolchains to support WebAssembly.

if you don’t like my peaches

Which brings us to the meat of today’s note: web Wasm will win where compilers create compact code. If your language’s compiler toolchain can manage to produce useful Wasm in a file that is less than a handful of over-the-wire kilobytes, you can win. If your compiler can’t do that yet, you will have to instead rely on hype and captured audiences for adoption, which at best results in an unstable equilibrium until you figure out what’s next.

In the JavaScript world, managing bloat and deliverable size is a huge industry. Bundlers like esbuild are a ubiquitous part of the toolchain, compiling down a set of JS modules to a single file that should include only those functions and data types that are used in a program, and additionally applying domain-specific size-squishing strategies such as minification (making monikers more minuscule).

Let’s focus on tree-shaking. The visual metaphor is that you write a bunch of code, and you only need some of it for any given page. So you imagine a tree whose, um, branches are the modules that you use, and whose leaves are the individual definitions in the modules, and you then violently shake the tree, probably killing it and also annoying any nesting birds. The only thing that’s left still attached is what is actually needed.

This isn’t how trees work: holding the trunk doesn’t give you information as to which branches are somehow necessary for the tree’s mission. It also primes your mind to look for the wrong fixed point, removing unneeded code instead of keeping only the necessary code.

But, tree-shaking is an evocative name, and so despite its horticultural and algorithmic inaccuracies, we will stick to it.

The thing is that maximal tree-shaking for languages with a thicker run-time has not been a huge priority. Consider Go: according to the golang wiki, the most trivial program compiled to WebAssembly from Go is 2 megabytes, and adding imports can make this go to 10 megabytes or more. Or look at Pyodide, the Python WebAssembly port: the REPL example downloads about 20 megabytes of data. These are fine sizes for technology demos or, in the limit, very rich applications, but they aren’t winners for web development.

shake a different tree

To be fair, both the built-in Wasm support for Go and the Pyodide port of Python derive from the upstream toolchains, where producing small binaries is nice but not necessary: on a server, who cares how big the app is? And indeed when targeting smaller devices, we tend to see alternate implementations of the toolchain, for example MicroPython or TinyGo. TinyGo has a Wasm back-end that can apparently go down to less than a kilobyte, even!

These alternate toolchains often come with some restrictions or peculiarities, and although we can consider this to be an evil of sorts, it is to be expected that the target platform exhibits some co-design feedback on the language. In particular, running in the sea of the DOM is sufficiently weird that a Wasm-targeting Python program will necessarily be different than a “native” Python program. Still, I think as toolchain authors we aim to provide the same language, albeit possibly with a different implementation of the standard library. I am sure that the ClojureScript developers would prefer to remove their page documenting the differences with Clojure if they could, and perhaps if Wasm becomes a viable target for ClojureScript, they will.

on the algorithm

To recap: now that it supports GC, Wasm could be a winner for web development in Python and other languages. You would need a different toolchain and an effective tree-shaking algorithm, so that user experience does not degrade. So let’s talk about tree shaking!

I work on the Hoot Scheme compiler, which targets Wasm with GC. We manage to get down to 70 kB or so right now, in the minimal “main” compilation unit, and are aiming for lower; auxiliary compilation units that import run-time facilities (the current exception handler and so on) from the main module can be sub-kilobyte. Getting here has been tricky though, and I think it would be even trickier for Python.

Some background: like Whiffle, the Hoot compiler prepends a prelude onto user code. Tree-shaking happens in a number of places:

Generally speaking, procedure definitions (functions / closures) are the easy part: you just include only those functions that are referenced by the code. In a language like Scheme, this gets you a long way.

However there are three immediate challenges. One is that the evaluation model for the definitions in the prelude is letrec*: the scope is recursive but ordered. Binding values can call or refer to previously defined values, or capture values defined later. If evaluating the value of a binding requires referring to a value only defined later, then that’s an error. Again, for procedures this is trivially OK, but as soon as you have non-procedure definitions, sometimes the compiler won’t be able to prove this nice “only refers to earlier bindings” property. In that case the fixing letrec (reloaded) algorithm will end up residualizing bindings that are set!, which, of all the tree-shaking passes above, only the delicate DCE pass can remove.

Worse, some of those non-procedure definitions are record types, which have vtables that define how to print a record, how to check if a value is an instance of this record, and so on. These vtable callbacks can end up keeping a lot more code alive even if they are never used. We’ll get back to this later.

Similarly, say you print a string via display. Well now not only are you bringing in the whole buffered I/O facility, but you are also calling a highly polymorphic function: display can print anything. There’s a case for bitvectors, so you pull in code for bitvectors. There’s a case for pairs, so you pull in that code too. And so on.

One solution is to instead call write-string, which only writes strings and not general data. You’ll still get the generic buffered I/O facility (ports), though, even if your program only uses one kind of port.

This brings me to my next point, which is that optimal tree-shaking is a flow analysis problem. Consider display: if we know that a program will never have bitvectors, then any code in display that works on bitvectors is dead and we can fold the branches that guard it. But to know this, we have to know what kind of arguments display is called with, and for that we need higher-level flow analysis.

The problem is exacerbated for Python in a few ways. One, because object-oriented dispatch is higher-order programming. How do you know what foo.bar actually means? Depends on foo, which means you have to thread around representations of what foo might be everywhere and to everywhere’s caller and everywhere’s caller’s caller and so on.

Secondly, lookup in Python is generally more dynamic than in Scheme: you have __getattr__ methods (is that it? it’s been a while since I’ve done Python) everywhere and users might indeed use them. Maybe this is not so bad in practice and flow analysis can exclude this kind of dynamic lookup.

Finally, and perhaps relatedly, the object of tree-shaking in Python is a mess of modules, rather than a big term with lexical bindings. This is like JavaScript, but without the established ecosystem of tree-shaking bundlers; Python has its work cut out for some years to go.

in short

With GC, Wasm makes it thinkable to do DOM programming in languages other than JavaScript. It will only be feasible for mass use, though, if the resulting Wasm modules are small, and that means significant investment on each language’s toolchain. Often this will take the form of alternate toolchains that incorporate experimental tree-shaking algorithms, and whose alternate standard libraries facilitate the tree-shaker.

Welp, I’m off to lunch. Happy wassembling, comrades!

by Andy Wingo at November 24, 2023 11:41 AM

Adrián Pérez

Standalone Binaries With Zig CC and Meson

Have you ever wanted to run a program in some device without needing to follow a complicated, possibly slow build process? In many cases it can be simpler to use zig cc to compile a static binary and call it a day.

Zig What?

The Zig programming language is a new(ish) programming language that may not get as much rep as others like Rust, but the authors have had one of the greatest ideas ever: reuse the chops that Clang has as a cross-compiler, throw in a copy of the sources for the Musl C library, and provide an extremely convenient compiler driver that builds the parts of the C library needed by your program on-demand.

Try the following, it just works, it’s magic:

cat > hi.c <<EOF
#include <stdio.h>

int main() {
    puts("That's all, folks!");
    return 0;
}
EOF
zig cc --target=aarch64-linux-musl -o hi-arm hi.c

Go on, try it. I’ll wait.


All good? Here is what it looks like for me:

% uname -m
x86_64
% file hi-arm
hi-arm: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, with debug_info, not stripped
% ls -lh hi-arm
-rwxrwxr-x+ 1 aperez aperez 82K nov 24 16:08 hi-arm*
% llvm-strip -x --strip-unneeded hi-arm
% ls -lh hi-arm
-rwxrwxr-x+ 1 aperez aperez 8,2K nov 24 16:11 hi-arm*
% 

A compiler running on a x86_64 machine produced a 64-bit ARM program out of thin air (no cross-sysroot, no additional tooling), which does not have any runtime dependency (it is linked statically), weighing 82 KiB, and can be stripped further down to only 8,2 KiB.

While I cannot say much about Zig the programming language itself because I have not had the chance to spend much time writing code in it, I have the compiler installed just because of zig cc.

A Better Example

One of the tools I often want to run in a development board is drm_info, so let’s cross-compile that. This little program prints all the information it can gather from a DRM/KMS device, optionally formatted as JSON (with the -j command line flag) and it can be tremendously useful to understand the capabilities of GPUs and their drivers. My typical usage of this tool is knowing whether WPE WebKit would run in a given embedded device, and as guidance for working on Cog’s DRM platform plug-in.

It is also a great example because it uses Meson as its build system and has a few dependencies—so it is not a trivial example like the one above—but not so many that one would run into much trouble. Crucially, both its main dependencies, json-c and libdrm, are available as WrapDB packages that Meson itself can fetch and configure automatically. Combined with zig cc, this means the only other thing we need is installing Meson.

Well, that is not completely true. We also need to tell Meson how to cross-compile using our fancy toolchain. For this, we can write the following cross file, which I have named zig-aarch64.conf:

[binaries]
c = ['/opt/zigcc-wrapper/cc', 'aarch64-linux-musl']
objcopy = 'llvm-objcopy'
ranlib = 'llvm-ranlib'
strip = 'llvm-strip'
ar = 'llvm-ar'

[target_machine]
system = 'linux'
cpu_family = 'aarch64'
cpu = 'cortex-a53'
endian = 'little'

While ideally it should be possible to set the value for c in the cross file to zig with the needed command line flags, in practice there is a bug that prevents the linker detection in Meson from working correctly. Until the patch which fixes the issue makes it into a release, the wrapper script at /opt/zigcc-wrapper/cc defers to lld when the linker version is requested, and otherwise calls zig cc:

#! /bin/sh
set -e
target=$1
shift

if [ "$1" = '-Wl,--version' ] ; then
    exec clang -fuse-ld=lld "$@"
fi
exec zig cc --target="$target" "$@"

Now, this is how we would go about building drm_info, note the use of --default-library=static to ensure that the subprojects for the dependencies are built accordingly and that they can be linked into the resulting executable:

git clone https://gitlab.freedesktop.org/emersion/drm_info.git
meson setup drm_info.aarch64 drm_info \
  --cross-file=zig-aarch64.conf --default-library=static
meson compile -C drm_info.aarch64

A short while later we will be kicked back into our shell prompt, where we can check the resulting binary:

% file drm_info.aarch64/drm_info
drm_info.aarch64/drm_info: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, with debug_info, not stripped
% ls -lh drm_info.aarch64/drm_info
-rwxr-xr-x 1 aperez aperez 1,8M nov 24 23:07 drm_info.aarch64/drm_info*
% llvm-strip -x --strip-unneeded drm_info.aarch64/drm_info
% ls -lh drm_info.aarch64/drm_info
-rwxr-xr-x 1 aperez aperez 199K nov 24 23:09 drm_info.aarch64/drm_info*
%

That’s a ~200 KiB drm_info binary that can be copied over to any 64-bit ARM-based device running Linux (the kernel). Magic!

The General Magic logo, but its text reads

by aperez (adrian@perezdecastro.org) at November 24, 2023 11:32 AM

November 16, 2023

Andy Wingo

a whiff of whiffle

A couple nights ago I wrote about a superfluous Scheme implementation and promised to move on from sheepishly justifying my egregious behavior in my next note, and finally mention some results from this experiment. Well, no: I am back on my bullshit. Tonight I write about a couple of implementation details that discerning readers may find of interest: value representation, the tail call issue, and the standard library.

what is a value?

As a Lisp, Scheme is one of the early “dynamically typed” languages. These days when you say “type”, people immediately think propositions as types, mechanized proof of program properties, and so on. But “type” has another denotation which is all about values and almost not at all about terms: one might say that vector-ref has a type, but it’s not part of a proof; it’s just that if you try to vector-ref a pair instead of a vector, you get a run-time error. You can imagine values as being associated with type tags: annotations that can be inspected at run-time for, for example, the sort of error that vector-ref will throw if you call it on a pair.

Scheme systems usually have a finite set of type tags: there are fixnums, booleans, strings, pairs, symbols, and such, and they all have their own tag. Even a Scheme system that provides facilities for defining new disjoint types (define-record-type et al) will implement these via a secondary type tag layer: for example, all record instances have the same primary tag, and you have to retrieve their record type descriptor to discriminate instances of different record types.

Anyway. In Whiffle there are immediate types and heap types. All values have a low-bit tag which is zero for heap objects and nonzero for immediates. For heap objects, the first word of the heap object has tagging in the low byte as well. The 3-bit heap tag for pairs is chosen so that pairs can just be two words, with no header word. There is another 3-bit heap tag for forwarded objects, which is used by the GC when evacuating a value. Other objects put their heap tags in the low 8 bits of the first word. Additionally there is a “busy” tag word value, used to prevent races when evacuating from multiple threads.

Finally, for generational collection of objects that can be “large” – the definition of large depends on the collector implementation, and is not nicely documented, but is more than, like, 256 bytes – anyway these objects might need to have space for a “remembered” bit in the objects themselves. This is not the case for pairs but is the case for, say, vectors: even though they are prolly smol, they might not be, and they need space for a remembered bit in the header.

tail calls

When I started Whiffle, I thought, let’s just compile each Scheme function to a C function. Since all functions have the same type, clang and gcc will have no problem turning any tail call into a proper tail call.

This intuition was right and wrong: at optimization level -O2, this works great. We don’t even do any kind of loop recognition / contification: loop iterations are tail calls and all is fine. (Not the most optimal implementation technique, but the assumption is that for our test cases, GC costs will dominate.)

However, when something goes wrong, I will need to debug the program to see what’s up, and so you might think to compile at -O0 or -Og. In that case, somehow gcc does not compile to tail calls. One time while debugging a program I was flummoxed at a segfault during the call instruction; turns out it was just stack overflow, and the call was trying to write the return address into an unmapped page. For clang, I could use the musttail attribute; perhaps I should, to allow myself to debug properly.

Not being able to debug at -O0 with gcc is annoying. I feel like if GNU were an actual thing, we would have had the equivalent of a musttail attribute 20 years ago already. But it’s not, and we still don’t.

stdlib

So Whiffle makes C, and that C uses some primitives defined as inline functions. Whiffle actually lexically embeds user Scheme code with a prelude, having exposed a set of primitives to that prelude and to user code. The assumption is that the compiler will open-code all primitives, so that the conceit of providing a primitive from the Guile compilation host to the Whiffle guest magically works out, and that any reference to a free variable is an error. This works well enough, and it’s similar to what we currently do in Hoot as well.

This is a quick and dirty strategy but it does let us grow the language to something worth using. I think I’ll come back to this local maximum later if I manage to write about what Hoot does with modules.

coda

So, that’s Whiffle: the Guile compiler front-end for Scheme, applied to an expression that prepends a user’s program with a prelude, in a lexical context of a limited set of primitives, compiling to very simple C, in which tail calls are just return f(...), relying on the C compiler to inline and optimize and all that.

Perhaps next up: some results on using Whiffle to test Whippet. Until then, good night!

by Andy Wingo at November 16, 2023 09:11 PM

José Dapena

V8 profiling instrumentation overhead

In the last 2 years, I have been contributing to improve the performance of V8 profiling instrumentation in different scenarios, both for Linux (using Perf) and for Windows (using Windows Performance Toolkit).

My work has been mostly improving the stack walk instrumentation that allows Javascript generated code to show in the stack traces, referring to the original code, in the same way native compiled code in C or C++ can provide that information.

In this blog post I run different benchmarks to determine how much overhead this instrumentation introduces.

How can you enable profiling instrumentation?

V8 implements system native profiling instrumentation for Javascript code, both for Linux Perf and for Windows Performance Toolkit. However, these are disabled by default.

For Windows, the command line switch --enable-etw-stack-walking instruments the generated Javascript code, to emit the source code information for the compiled functions using ETW (Event Tracing for Windows). This gives a better insight of time spent in different functions, especially when stack profiling is enabled in Windows Performance Toolkit.

For Linux Perf, there are several command line switches:
--perf-basic-prof: this emits, for any Javascript generated code, its memory location and a descriptive name (that includes the source location of the function).
--perf-basic-prof-only-functions: same as the previous one, but only emitting functions, excluding other stubs such as regular expressions.
--perf-prof: this is a different way to provide the information for Linux Perf. It generates a more complex format specified here. On top of the addresses of the functions, it also includes source code information, even with details about the exact line of code inside the function that is executed in a sample.
--perf-prof-annotate-wasm: this one extends --perf-prof, to add debugging information to WASM code.
--perf-prof-unwinding-info: This last one also extends --perf-prof, but providing experimental support for unwinding info.

And then, we have interpreted code support. V8 does not generate JIT code for everything; in many cases it runs interpreted code that calls common builtin code. So, in a stack trace, instead of seeing the Javascript method that eventually runs those methods through the interpreter, we see those common methods. The solution? Using --interpreted-frames-native-stack, which adds information identifying which Javascript method in the stack is actually being interpreted. This is essential to understanding the attribution of running code, especially if it is not considered hot enough to be compiled.

They are disabled by default: is that a problem?

We would ideally want to have all of this profiling support always enabled when we are profiling code. For that, we would want to avoid having to pass any command line switch, so we can profile Javascript workloads without any additional configuration.

But all these command line switches are disabled by default. Why? There are several reasons.

What happens on Linux?

In Linux, the Perf profiling instrumentation will unconditionally generate additional files, as it cannot know whether Perf is running or not. The --perf-basic-prof backend writes the generated information to .map files, and the --perf-prof backend writes additional information to jit-*.dump files that will be used later with perf inject -j.

Additionally, the instrumentation requires code compaction to be disabled, because code compaction changes the memory location of compiled code. V8 generates CODE_MOVE internal events for profiler instrumentation, but the ETW and Linux Perf backends do not handle those events, for a variety of reasons. This has an impact on memory, as there is no way to compact the space allocated for code without moving it.

So, when we enable any of the Linux Perf command line switches we both generate extra files and take more memory.

And what about Windows?

In Windows we have the same problem with code compaction. We do not support generating CODE_MOVE events so, while profiling, we cannot enable code compaction.

And the emitted ETW events with JIT code position information do still take memory space and make the profile recordings bigger.

But there is an important difference: Windows ETW API allows applications to know when they are being profiled, and even filter that. V8 only emits the code location information for ETW if the application is being profiled, and code compaction is only disabled when profiling is ongoing.

That means the overhead for enabling --enable-etw-stack-walking is expected to be minimal.

What about --interpreted-frames-native-stack? It has also been optimized to generate the extra stack information only when a profile is being recorded.

Can we enable them by default?

For Linux, the answer is a clear no. Any V8 workload would generate profiling information in files, and disabling code compaction makes memory usage higher. These problems happen no matter if you are recording a profile or not.

But, Windows ETW support adds overhead only when profiling! It looks like it could make sense to enable both --interpreted-frames-native-stack and --enable-etw-stack-walking for Windows.

Are we there yet? Overhead analysis

So first, we need to verify the actual overhead because we do not want to introduce a regression by enabling any of these options. I also want to confirm the actual overhead in Linux matches the expectation. To do that I am showing the result of running CSuite benchmarks, included as part of V8. I am capturing both the CPU and memory usage.

Linux benchmarks

The results obtained in Linux are…

Legend:
– REC: recording a profile (Yes, No).
– No flag: running the test without any extra command line flag.
– BP: passing --perf-basic-prof.
– IFNS: passing --interpreted-frames-native-stack.
– BP+IFNS: passing both --perf-basic-prof and --interpreted-frames-native-stack.

Linux results – score:

Test       REC   Better   No flag   BP        IFNS      BP+IFNS
Octane     No    Higher   9101.2    8953.3    9077.1    9097.3
Octane     Yes   Higher   9112.8    9041.7    9004.8    9093.9
Kraken     No    Lower    4060.3    4108.9    4076.6    4119.9
Kraken     Yes   Lower    4078.1    4141.7    4083.2    4131.3
Sunspider  No    Lower    595.7     622.8     595.4     627.8
Sunspider  Yes   Lower    598.5     626.6     599.3     633.3

Linux results – memory (in Kilobytes)

Test       REC   No flag    BP         IFNS       BP+IFNS
Octane     No    244040.0   249152.0   243442.7   253533.8
Octane     Yes   242234.7   245010.7   245632.9   252108.0
Kraken     No    46039.1    46169.5    46009.1    46497.1
Kraken     Yes   46002.1    46187.0    46024.4    46520.5
Sunspider  No    90267.0    90857.2    90214.6    91100.4
Sunspider  Yes   90210.3    90948.4    90195.9    91110.8

What is seen in the results?
– There is apparently no CPU or memory overhead from --interpreted-frames-native-stack alone (there is an outlier in Octane memory usage while recording, which could be an error in the sampling process). This is expected, as the additional information is only generated while profiling instrumentation is enabled.
– There is no clear extra memory cost for recording vs. not recording. This is expected, as the extra information only depends on having the switches enabled; V8 does not detect when a recording is ongoing.
– There is, as expected, a CPU cost while Perf is running, in most of the cases.
– --perf-basic-prof has a CPU and memory impact, and combined with --interpreted-frames-native-stack the impact is even higher, as expected.

This is on top of the fact that files are generated. The benchmarks support not enabling --perf-basic-prof by default. But as --interpreted-frames-native-stack only adds overhead in combination with the profiling switches, it could make sense to consider enabling it on Linux.

Windows benchmarks

What about Windows results?

Legend:
– No flag: running the test without any extra command line flag.
– ETW: passing --enable-etw-stack-walking.
– IFNS: passing --interpreted-frames-native-stack.
– ETW+IFNS: passing both --enable-etw-stack-walking and --interpreted-frames-native-stack.

Windows results – score:

Test       REC   Better   No flag   ETW      IFNS     ETW+IFNS
Octane     No    Higher   8323.8    8336.2   8308.7   8310.4
Octane     Yes   Higher   8050.9    7991.3   8068.8   7900.6
Kraken     No    Lower    4273.0    4294.4   4303.7   4279.7
Kraken     Yes   Lower    4380.2    4416.4   4433.0   4413.9
Sunspider  No    Lower    671.1     670.8    672.0    671.5
Sunspider  Yes   Lower    693.2     703.9    690.9    716.2

Windows results – memory:

Test       REC   No flag    ETW        IFNS       ETW+IFNS
Octane     No    202535.1   204944.9   200700.4   203557.8
Octane     Yes   205125.8   204801.3   206856.9   209260.4
Kraken     No    76188.2    76095.2    76102.1    76188.5
Kraken     Yes   76031.6    76265.0    76215.6    76662.0
Sunspider  No    31784.9    31888.4    31806.9    31882.7
Sunspider  Yes   31848.9    31882.1    31802.7    32339.0

What is observed?
– No memory or CPU overhead is observed when recording is not ongoing, with --enable-etw-stack-walking and/or --interpreted-frames-native-stack.
– Even recording, there is no clearly visible overhead of --interpreted-frames-native-stack.
– When recording, --enable-etw-stack-walking has an impact on CPU. But not on memory.
– When --enable-etw-stack-walking and --interpreted-frames-native-stack are combined, while recording, both memory and CPU overheads are observed.

So, again, it should be possible to enable --interpreted-frames-native-stack on Windows by default as it only has impact while recording a profile with --enable-etw-stack-walking. And it should be possible to enable --enable-etw-stack-walking by default too as, again, the impact happens only when a profile is recorded.

Windows: there are still problems

Is it that simple? Is it OK to add overhead only when recording a profile?

One problem with this is that there is an impact on system-wide recordings, even if the focus is not on Javascript execution.

Also, the CPU impact is not linear. Most of it happens when a profile recording starts: V8 emits the code positions of all the methods of all the already existing Javascript functions and builtins. Now, imagine a regular desktop with several V8-based browsers such as Chrome and Edge, with all their tabs. Or a server with many NodeJS instances. All of them generate the information for all their methods at the same time.

So the impact of the overhead is significant, even if it only happens while recording.

Next steps

For --interpreted-frames-native-stack it looks like it would be better to propose enabling it by default on all the studied platforms (Windows and Linux). It significantly improves the profiling instrumentation, and it has an impact only when actual instrumentation is used. And it still allows disabling it for the specific cases where recording the original builtins is preferred.

For --enable-etw-stack-walking, it could make sense to also propose enabling it by default, even with the known overhead while recording profiles. And just make sure there are no surprise regressions.

But, likely, the best option is trying to reduce that overhead. First, by allowing Windows Performance Recorder to better filter which processes generate the additional information. And then, also, by reducing the initial overhead as much as possible: e.g. a big part of it is emitting the builtins and V8 snapshot compiled functions for each of the contexts again and again. Ideally we want to avoid that duplication if possible.

Wrapping up

While on Linux enabling profiling instrumentation with command line switches by default is not an option, on Windows the ability to enable instrumentation dynamically makes it worth considering enabling ETW support by default. But there are still optimization opportunities that could be addressed first.

Thanks!

This report has been done as part of the collaboration between Bloomberg and Igalia. Thanks!


by José Dapena Paz at November 16, 2023 09:40 AM

Brian Kardell

Let's Play a Game

It's Interop 24 planning time! Let's play a game: tell me what you prioritize, or don't, and why.

I've written before about how prioritization is hard. Just in the most tangible of terms, there are differing finite resources, distributed differently across organizations. That is, not all teams have the same size or composition, or current availability. They're all also generally managed and planned independently with several other factors in mind, and in totally different architectures. That's a big part of what makes standards often take so long to get sorted out to where everyone can just use them without significant concerns.

Conversely, occasionally we see things ship at what seems by comparison amazing speed. Sometimes that's a bit of an illusion: They've been brewing and aligning hard parts for years and we're only counting the end bits. But other times it's just because we've aligned. A nice way to think about this, if you're a developer, is to compare it to web page performance. If all of the resources just load as discovered and there's no thought put into optimization, things can get really out of control. When you do some optimization and coordination, you can get things many times faster pretty quickly. The big difference here is mainly that instead of milliseconds, with standards substitute weeks or something.

I think that's one of the best things about Interop. While it isn't a promise by anyone or anything, it does seem to serve as a function to get us to focus together on some things and align - doing that optimization work.

It's a damned tricky exercise though, I'm not going to lie. There are so many things to consider and align. I kind of like sharing insight into this because I spent a really long time as a developer myself with a significantly different perspective. Seeing all of this from the inside the last few years has been really eye-opening. So, I thought, let's play a game... It might be useful to both of us.

Below is a list of links to about 90 2024 proposals. It's not carefully complete and I have excluded a few that I personally feel perhaps don't fit the necessary criteria for inclusion. It would, of course, be a huge time suck for you to spend your unpaid time carefully researching all of them - so that's not the game. Note that this represents a tiny fraction of the open issues in the platform - so you can start to see how this is challenging.

So, the game is: Bubble sort as many of these as you can (at least 5-10 would be great) and, optionally, share something back about why you prioritized or de-prioritized something. That can be in a reply blog post, a few messages on social media of your choice, share a gist... Whatever you like.

Here's the list (note there is a bit more prose afterward)

  1. MediaCapture device enumeration
  2. Standardize SVG properties in CSS to support unitless geometry
  3. Fetch Web API
  4. Web Share API
  5. Full support of background properties and remove of prefixes
  6. Canvas text rendering and metrics (2024 edition)
  7. Scroll-driven Animations
  8. Scrollend Events
  9. <search>
  10. text-box-trim
  11. Web Audio API
  12. CSS :dir() selectors and dir=auto interoperability
  13. CSS Nesting
  14. WebRTC peer connections and codecs
  15. scrollbar-width CSS property
  16. JavaScript Promise Integration
  17. CSS style container queries (custom properties)
  18. Allowing <hr> inside of <select>
  19. CSS background-clip
  20. text-wrap: pretty
  21. font-size-adjust
  22. requestIdleCallback
  23. <details> and <summary> elements
  24. CSS text-indent
  25. Relative Color Syntax
  26. scrollbar-color CSS property
  27. input[type="range"] styling
  28. Top Layer Animations
  29. Gamut mapping
  30. Canvas2D filter and reset
  31. WebXR
  32. User activation (2024 edition)
  33. text-transform: full-size-kana & text-transform: full-width
  34. document.caretPositionFromPoint
  35. Unit division and multiplication for mixed units of the same type within calc()
  36. EXPAND :has() to include support for more pseudo-classes
  37. WebM AV1 video codec
  38. CSS image() function
  39. WebDriver BiDi
  40. CSS box-decoration-break
  41. text-wrap: balance
  42. inverted-colors Media Query
  43. Accessibility (computed role + accname)
  44. display: contents
  45. Accessibility issues with display properties (not including display: contents)
  46. HTMLVideoElement.requestVideoFrameCallback()
  47. Text Fragments
  48. blocking="render" attribute on scripts and style sheets
  49. Indexed DB v3
  50. Declarative Shadow DOM
  51. attr() support extended capabilities
  52. Notifications API
  53. HTTP(S) URLs for WebSocket
  54. WasmGC
  55. size-adjust
  56. video-dynamic-range Media Query
  57. CSS scrollbar-gutter
  58. CSS Typed OM Level 1 (houdini)
  59. import attributes / JSON modules / CSS modules
  60. CSS element() function
  61. View Transitions Level 1
  62. Local Network Access and Mixed Content specification
  63. WebM Opus audio codec
  64. Custom Media Queries
  65. Trusted Types
  66. WebTransport API
  67. CSS object-view-box
  68. Intersection Observer v2
  69. Custom Highlight API
  70. backdrop-filter
  71. WebRTC “end-to-end-encryption”
  72. Streams
  73. Media pseudo classes: :paused/:playing/:seeking/:buffering/:stalled/:muted/:volume-locked
  74. CSS box sizing properties with MathML Core
  75. Detect UA Transitions on same-document Navigations
  76. css fill/stroke
  77. JPEG XL image format
  78. font-family keywords
  79. CSS Multi-Column Layout block element breaking
  80. Navigation API
  81. Ready-made counter styles
  82. hidden=until-found and auto-expanding details
  83. Popover
  84. P3 All The Things
  85. URLPattern
  86. margin-trim
  87. overscroll-behavior on the root scroller
  88. CSS Painting API Level 1 (houdini)
  89. CSS Logical Properties and Values
  90. CSS caret, caret-color and caret-shape properties

The tricky thing is that there is no 'right' answer here, and one can easily make several cases. For example, perhaps we should prioritize things that cannot be polyfilled or done with pre-processors over things that can. That sounds reasonable enough, but given that the queue never empties: Do you just do that forever? Does it eventually lead to larger areas that are ignored? Similarly, do you focus on the things that are sure to be used by everyone, or do you assume those already get the most priority and focus on the things that don't? Do you pick certain compat pain points/bugs, or focus on launching new features quickly and very interoperably?

It's a hard game! Looking forward to your answers.

November 16, 2023 05:00 AM

November 14, 2023

Andy Wingo

whiffle, a purpose-built scheme

Yesterday I promised an apology but didn’t actually get past the admission of guilt. Today the defendant takes the stand, in the hope that an awkward cross-examination will persuade the jury to take pity on a poor misguided soul.

Which is to say, let’s talk about Whiffle: what it actually is, what it is doing for me, and why on earth it is that [I tell myself that] writing a new programming language implementation is somehow preferable to re-using an existing one.

garbage collection is my passion

Whiffle is purpose-built to test the Whippet garbage collection library.

Whiffle lets me create Whippet test cases in C, without actually writing C. C is fine and all, but the problem with it and garbage collection is that you have to track all stack roots manually, and this is an error-prone process. Generating C means that I can more easily ensure that each stack root is visitable by the GC, which lets me make test cases with more confidence; if there is a bug, it is probably not because of an untraced root.

Also, Whippet is mostly meant for programming language runtimes, not for direct use by application authors. In this use-case, probably you can use less “active” mechanisms for ensuring root traceability: instead of eagerly recording live values in some kind of handlescope, you can keep a side table that is only consulted as needed during garbage collection pauses. In particular since Scheme uses the stack as a data structure, I was worried that using handle scopes would somehow distort the performance characteristics of the benchmarks.

Whiffle is not, however, a high-performance Scheme compiler. It is not for number-crunching, for example: garbage collectors don’t care about that, so let’s not. Also, Whiffle should not go to any effort to remove allocations (sroa / gvn / cse); creating nodes in the heap is the purpose of the test case, and eliding them via compiler heroics doesn’t help us test the GC.

I settled on a baseline-style compiler, in which I re-use the Scheme front-end from Guile to expand macros and create an abstract syntax tree. I do run some optimizations on that AST; in the spirit of the macro writer’s bill of rights, it does make sense to provide some basic reductions. (These reductions can be surprising, but I am used to Guile’s flavor of cp0 (peval), and this project is mostly for me, so I thought it was risk-free; I was almost right!)

Anyway the result is that Whiffle uses an explicit stack. A safepoint for a thread simply records its stack pointer: everything between the stack base and the stack pointer is live. I do have a lingering doubt about the representativity of this compilation strategy; would a conclusion drawn from Whippet apply to Guile, which uses a different stack allocation strategy? I think probably so but it’s an unknown.
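To make the explicit-stack idea concrete, here is a minimal, entirely hypothetical sketch in C — the names ValueStack, push, pop, and trace_roots are illustrative, not Whiffle’s actual generated code — of what “everything between the stack base and the stack pointer is live” looks like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of an explicit value stack like the one the
 * generated C uses: every live value sits between `base` and `sp`,
 * so a safepoint only needs to record `sp`, and the collector can
 * trace the whole live span without per-root bookkeeping. */
typedef uintptr_t Value;

struct ValueStack {
  Value *base; /* bottom of the stack */
  Value *sp;   /* next free slot; [base, sp) is live */
};

static void push(struct ValueStack *st, Value v) { *st->sp++ = v; }
static Value pop(struct ValueStack *st) { return *--st->sp; }

/* At a safepoint, visit exactly the live slots; returns how many. */
static size_t trace_roots(struct ValueStack *st,
                          void (*visit)(Value *slot, void *data),
                          void *data) {
  size_t n = 0;
  for (Value *p = st->base; p < st->sp; p++, n++)
    visit(p, data);
  return n;
}
```

Because liveness is just a pointer comparison, there is no way to forget to register a root — which is exactly the error-prone part of writing such test cases in C by hand.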

what’s not to love

Whiffle also has a number of design goals that are better formulated in the negative. I mentioned compiler heroics as one thing to avoid, and in general the desire for a well-understood correspondence between source code and run-time behavior has a number of other corollaries: Whiffle is a pure ahead-of-time (AOT) compiler, as just-in-time (JIT) compilation adds noise. Worse, speculative JIT would add unpredictability, which while good on the whole would be anathema to understanding an isolated piece of a system like the GC.

Whiffle also should produce stand-alone C files, without a thick run-time. I need to be able to understand and reason about the residual C programs, and depending on third-party libraries would hinder this goal.

Oddly enough, users are also an anti-goal: as a compiler that only exists to test a specific GC library, there is no sense in spending too much time making Whiffle nicer for other humans, humans whose goal is surely not just to test Whippet. Whiffle is an interesting object, but is not meant for actual use or users.

corners: cut

Another anti-goal is completeness with regards to any specific language standard: the point is to test a GC, not to make a useful Scheme. Therefore Whiffle gets by just fine without flonums, fractions, continuations (delimited or otherwise), multiple return values, ports, or indeed any library support at all. All of that just doesn’t matter for testing a GC.

That said, it has been useful to be able to import standard Scheme garbage collection benchmarks, such as earley or nboyer. These have required very few modifications to run in Whiffle, mostly related to Whippet’s test harness, which wants to spawn multiple threads.

and so?

I think this evening we have elaborated a bit more about the “how”, complementing yesterday’s note about the “what”. Tomorrow (?) I’ll see if I can dig in more to the “why”: what questions does Whiffle let me ask of Whippet, and how good of a job does it do at getting those answers? Until then, may all your roots be traced, and happy hacking.

by Andy Wingo at November 14, 2023 10:10 PM

November 13, 2023

Andy Wingo

i accidentally a scheme

Good evening, dear hackfriends. Tonight’s missive is an apology: not quite in the sense of expiation, though not quite not that, either; rather, apology in the sense of explanation, of exegesis: apologia. See, I accidentally made a Scheme. I know I have enough Scheme implementations already, but I went and made another one. It’s for a maybe good reason, though!

one does not simply a scheme

I feel like we should make this the decade of leaning into your problems, and I have a Scheme problem, so here we are. See, I co-maintain Guile, and have been noodling on a new garbage collector (GC) for Guile, Whippet. Whippet is designed to be embedded in the project that uses it, so one day I hope it will be just copied into Guile’s source tree, replacing the venerable BDW-GC that we currently use.

The thing is, though, that GC implementations are complicated. A bug in a GC usually manifests itself far away in time and space from the code that caused the bug. Language implementations are also complicated, for similar reasons. Swapping one GC for another is something to be done very carefully. This is even more the case when the switching cost is high, which is the case with BDW-GC: as a collector written as a library to link into “uncooperative” programs, there is more cost to moving to a conventional collector than in the case where the embedding program is already aware that (for example) garbage collection may relocate objects.

So, you need to start small. First, we need to prove that the new GC implementation is promising in some way, that it might improve on BDW-GC. Then... embed it directly into Guile? That sounds like a bug farm. Is there not any intermediate step that one might take?

But also, how do you actually test that a GC algorithm or implementation is interesting? You need a workload, and you need the ability to compare the new collector to the old, for that workload. In Whippet I had been writing some benchmarks in C (example), but this approach wasn’t scaling: besides not sparking joy, I was starting to wonder if what I was testing would actually reflect usage in Guile.

I had an early approach to rewrite a simple language implementation like the other Scheme implementation I made to demonstrate JIT code generation in WebAssembly, but that soon foundered against what seemed to me an unlikely rock: the compiler itself. In my wasm-jit work, the “compiler” itself was in C++, using the C++ allocator for compile-time allocations, and the result was a tree of AST nodes that were interpreted at run-time. But to embed the benchmarks in Whippet itself I needed something C, which is less amenable to abstraction of any kind... Here I think I could have made a different choice: to somehow allow C++ or something as a dependency to write tests, or to do more mallocation in the “compiler”...

But that wasn’t fun. A lesson I learned long ago is that if something isn’t fun, I need to turn it into a compiler. So I started writing a compiler to a little bytecode VM, initially in C, then in Scheme because C is a drag and why not? Why not just generate the bytecode C file from Scheme? Same dependency set, once the C file is generated. And then, as long as you’re generating C, why go through bytecode at all? Why not just, you know, generate C?

after all, why not? why shouldn’t i keep it?

And that’s how I accidentally made a Scheme, Whiffle. Tomorrow I’ll write a little more on what Whiffle is and isn’t, and what it’s doing for me. Until then, happy hacking!

by Andy Wingo at November 13, 2023 09:36 PM

November 08, 2023

Brian Kardell

Wolvic Store Experiment

Wolvic Store Experiment

A post about cool stuff for a cause.

As of today, you can purchase some cool Wolvic swag, and proceeds will go to supporting development of the browser:

There’s mugs, and water bottles and shirts and sweatshirts and … You know, stuff.

Why?

For the last few years I’ve been writing about the health of the whole web ecosystem, and talking about it on our podcast. We just assume that the “search revenue pays for browsers” model that got us here is fine and will last forever. I’m pretty sure it won’t.

When you look at how engines and browsers are funded, it’s way too centralized on search funding. Of course, this is not a way to fund a new browser - rather, it rewards those that have already made really significant inroads. The whole thing is actually kind of fragile when you step back and look at it.

It would be nice to change that, and we only achieve that by trying.

So, in general at Igalia we’re trying several ways to diversify funding. Specifically for Wolvic, we’re looking into how we might explore other kinds of advertising, but we’ve got this far with investment from Igalia, a partnership model and an Open Collective.

It’s kind of easy to get applause when I talk about this, but the truth is very few people put money into our collectives. It reminded me of… Well… Every charity, ever.

Lots of charities (or even NPR) give you some swag with a donation. For some reason, that’s helpful in expanding the base of people making any kind of donation at all.

Of course, this isn’t especially efficient: If you buy a product from our store, maybe only a few dollars of that becomes an actual donation. It would be far more efficient to just donate a couple of dollars to the collective, but very few do. So we’re trying something new.

Will offering swag help here too? Is this a model that could help us pay for things? I don’t know, but I’m always looking for new ways to talk about this problem and new experiments we could try to keep the discussion going. So, what do you think?

November 08, 2023 05:00 AM

November 07, 2023

Melissa Wen

AMD Driver-specific Properties for Color Management on Linux (Part 2)

TL;DR:

This blog post explores the color capabilities of AMD hardware and how they are exposed to userspace through driver-specific properties. It discusses the different color blocks in the AMD Display Core Next (DCN) pipeline and their capabilities, such as predefined transfer functions, 1D and 3D lookup tables (LUTs), and color transformation matrices (CTMs). It also highlights the differences in AMD HW blocks for pre and post-blending adjustments, and how these differences are reflected in the available driver-specific properties.

Overall, this blog post provides a comprehensive overview of the color capabilities of AMD hardware and how they can be controlled by userspace applications through driver-specific properties. This information is valuable for anyone who wants to develop applications that can take advantage of the AMD color management pipeline.

Get a closer look at each hardware block’s capabilities, unlock a wealth of knowledge about AMD display hardware, and enhance your understanding of graphics and visual computing. Stay tuned for future developments as we embark on a quest for GPU color capabilities in the ever-evolving realm of rainbow treasures.


Operating Systems can use the power of GPUs to ensure consistent color reproduction across graphics devices. We can use GPU-accelerated color management to manage the diversity of color profiles, perform color transformations to convert between High-Dynamic-Range (HDR) and Standard-Dynamic-Range (SDR) content, and apply color enhancements for wide color gamut (WCG). However, to make use of GPU display capabilities, we need an interface between userspace and the kernel display drivers that is currently absent in the Linux/DRM KMS API.

In the previous blog post I presented how we are expanding the Linux/DRM color management API to expose specific properties of AMD hardware. Now, I’ll guide you through the color features of the Linux/AMD display driver. We embark on a journey through DRM/KMS, AMD Display Manager, and AMD Display Core and delve into the color blocks to uncover the secrets of color manipulation within AMD hardware. Here we’ll talk less about the color tools and more about where to find them in the hardware.

We resort to driver-specific properties to reach AMD hardware blocks with color capabilities. These blocks display features like predefined transfer functions, color transformation matrices, and 1-dimensional (1D LUT) and 3-dimensional lookup tables (3D LUT). Here, we will understand how these color features are strategically placed into color blocks both before and after blending in Display Pipe and Plane (DPP) and Multiple Pipe/Plane Combined (MPC) blocks.

That said, welcome back to the second part of our thrilling journey through AMD’s color management realm!

AMD Display Driver in the Linux/DRM Subsystem: The Journey

In my 2022 XDC talk “I’m not an AMD expert, but…”, I briefly explained the organizational structure of the Linux/AMD display driver where the driver code is bifurcated into a Linux-specific section and a shared-code portion. To reveal AMD’s color secrets through the Linux kernel DRM API, our journey led us through these layers of the Linux/AMD display driver’s software stack. It includes traversing the DRM/KMS framework, the AMD Display Manager (DM), and the AMD Display Core (DC) [1].

The DRM/KMS framework provides the atomic API for color management through KMS properties represented by struct drm_property. We extended the color management interface exposed to userspace by leveraging existing resources and connecting them with driver-specific functions for managing modeset properties.

On the AMD DC layer, the interface with hardware color blocks is established. The AMD DC layer contains OS-agnostic components that are shared across different platforms, making it an invaluable resource. This layer already implements hardware programming and resource management, simplifying the external developer’s task. While examining the DC code, we gain insights into the color pipeline and capabilities, even without direct access to specifications. Additionally, AMD developers provide essential support by answering queries and reviewing our work upstream.

The primary challenge involved identifying and understanding relevant AMD DC code to configure each color block in the color pipeline. However, the ultimate goal was to bridge the DC color capabilities with the DRM API. For this, we changed the AMD DM, the OS-dependent layer connecting the DC interface to the DRM/KMS framework. We defined and managed driver-specific color properties, facilitated the transport of user space data to the DC, and translated DRM features and settings to the DC interface. Considerations were also made for differences in the color pipeline based on hardware capabilities.

Exploring Color Capabilities of the AMD display hardware

Now, let’s dive into the exciting realm of AMD color capabilities, where an abundance of techniques and tools await to make your colors look extraordinary across diverse devices.

First, we need to know a little about the color transformation and calibration tools and techniques that you can find in different blocks of the AMD hardware. I borrowed some images from [2] [3] [4] to help you understand the information.

Predefined Transfer Functions (Named Fixed Curves):

Transfer functions serve as the bridge between the digital and visual worlds, defining the mathematical relationship between digital color values and linear scene/display values and ensuring consistent color reproduction across different devices and media. You can learn more about curves in the chapter GPU Gems 3 - The Importance of Being Linear by Larry Gritz and Eugene d’Eon.

ITU-R 2100 introduces three main types of transfer functions:

  • OETF: the opto-electronic transfer function, which converts linear scene light into the video signal, typically within a camera.
  • EOTF: electro-optical transfer function, which converts the video signal into the linear light output of the display.
  • OOTF: opto-optical transfer function, which has the role of applying the “rendering intent”.

AMD’s display driver supports the following pre-defined transfer functions (aka named fixed curves):

  • Linear/Unity: linear/identity relationship between pixel value and luminance value;
  • Gamma 2.2, Gamma 2.4, Gamma 2.6: pure power functions;
  • sRGB: the piece-wise transfer function from IEC 61966-2-1:1999 (approximately gamma 2.4);
  • BT.709: has a linear segment in the bottom part and then a power function with a 0.45 (~1/2.22) gamma for the rest of the range; standardized by ITU-R BT.709-6;
  • PQ (Perceptual Quantizer): used for HDR display, allows luminance range capability of 0 to 10,000 nits; standardized by SMPTE ST 2084.

These capabilities vary depending on the hardware block, with some utilizing hardcoded curves and others relying on AMD’s color module to construct curves from standardized coefficients. It also supports user/custom curves built from a lookup table.

1D LUTs (1-dimensional Lookup Table):

A 1D LUT is a versatile tool, defining a one-dimensional color transformation based on a single parameter. It’s very well explained by Jeremy Selan at GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations

It enables adjustments to color, brightness, and contrast, making it ideal for fine-tuning. In the Linux AMD display driver, the atomic API offers a 1D LUT with 4096 entries and 8-bit depth, while legacy gamma uses a size of 256.
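Conceptually, applying such a LUT to a normalized input is a table lookup plus linear interpolation between adjacent entries. Here is a minimal sketch with a hypothetical helper name (apply_lut1d) on a single channel of doubles, rather than the driver’s struct drm_color_lut fixed-point entries:

```c
#include <assert.h>
#include <stddef.h>

/* Apply a 1D LUT to a normalized input x in [0, 1] by linearly
 * interpolating between the two nearest entries. A 4096-entry LUT
 * gives 4095 interpolation segments across the range. */
static double apply_lut1d(const double *lut, size_t size, double x) {
  if (x <= 0.0) return lut[0];
  if (x >= 1.0) return lut[size - 1];
  double pos = x * (double)(size - 1);
  size_t i = (size_t)pos;        /* lower entry index */
  double frac = pos - (double)i; /* distance into the segment */
  return lut[i] * (1.0 - frac) + lut[i + 1] * frac;
}
```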

3D LUTs (3-dimensional Lookup Table):

These tables work in three dimensions – red, green, and blue. They’re perfect for complex color transformations and adjustments between color channels. It’s also more complex to manage and require more computational resources. Jeremy also explains 3D LUT at GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations

CTM (Color Transformation Matrices):

Color transformation matrices facilitate the transition between different color spaces, playing a crucial role in color space conversion.

HDR Multiplier:

HDR multiplier is a factor applied to the color values of an image to increase their overall brightness.

AMD Color Capabilities in the Hardware Pipeline

First, let’s take a closer look at the AMD Display Core Next hardware pipeline in the Linux kernel documentation for AMDGPU driver - Display Core Next

[Figure: AMD Display Core Next (DCN) pipeline diagram, from the Linux kernel documentation: AMDGPU driver > Display Core Next (DCN)]

In the AMD Display Core Next hardware pipeline, we encounter two hardware blocks with color capabilities: the Display Pipe and Plane (DPP) and the Multiple Pipe/Plane Combined (MPC). The DPP handles color adjustments per plane before blending, while the MPC engages in post-blending color adjustments. In short, we expect DPP color capabilities to match up with DRM plane properties, and MPC color capabilities to play nice with DRM CRTC properties.

Note: here’s the catch – there are some DRM CRTC color transformations that don’t have a corresponding AMD MPC color block, and vice versa. It’s like a puzzle, and we’re here to solve it!

AMD Color Blocks and Capabilities

We can finally talk about the color capabilities of each AMD color block. As it varies based on the generation of hardware, let’s take the DCN3+ family as reference. What’s possible to do before and after blending depends on hardware capabilities described in the kernel driver by struct dpp_color_caps and struct mpc_color_caps.

The AMD Steam Deck hardware provides a tangible example of these capabilities. Therefore, we take the Steam Deck (DCN301) driver as an example and look at the “Color pipeline capabilities” described in the file: drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resources.c

/* Color pipeline capabilities */

dc->caps.color.dpp.dcn_arch = 1; // If it is a Display Core Next (DCN): yes. Zero means DCE.
dc->caps.color.dpp.input_lut_shared = 0;
dc->caps.color.dpp.icsc = 1; // Input Color Space Conversion (CSC) matrix.
dc->caps.color.dpp.dgam_ram = 0; // The old degamma block for degamma curve (hardcoded and LUT). `Gamma correction` is the new one.
dc->caps.color.dpp.dgam_rom_caps.srgb = 1; // sRGB hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1; // BT2020 hardcoded curve support (seems not actually in use)
dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1; // Gamma 2.2 hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.pq = 1; // PQ hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.hlg = 1; // HLG hardcoded curve support
dc->caps.color.dpp.post_csc = 1; // CSC matrix
dc->caps.color.dpp.gamma_corr = 1; // New `Gamma Correction` block for degamma user LUT;
dc->caps.color.dpp.dgam_rom_for_yuv = 0;

dc->caps.color.dpp.hw_3d_lut = 1; // 3D LUT support. If so, it's always preceded by a shaper curve. 
dc->caps.color.dpp.ogam_ram = 1; // `Blend Gamma` block for custom curve just after blending
// no OGAM ROM on DCN301
dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.dpp.ogam_rom_caps.pq = 0;
dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
dc->caps.color.dpp.ocsc = 0;

dc->caps.color.mpc.gamut_remap = 1; // Post-blending CTM (pre-blending CTM is always supported)
dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; // Post-blending 3D LUT (preceded by shaper curve)
dc->caps.color.mpc.ogam_ram = 1; // Post-blending regamma.
// No pre-defined TF supported for regamma.
dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.mpc.ogam_rom_caps.pq = 0;
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1; // Output CSC matrix.

I included some inline comments in each element of the color caps to quickly describe them, but you can find the same information in the Linux kernel documentation. See more in struct dpp_color_caps, struct mpc_color_caps and struct rom_curve_caps.

Now, using this guideline, we go through color capabilities of DPP and MPC blocks and talk more about mapping driver-specific properties to corresponding color blocks.

DPP Color Pipeline: Before Blending (Per Plane)

Let’s explore the capabilities of DPP blocks and what you can achieve with each color block. The very first thing to pay attention to is the display architecture of the hardware: older AMD hardware uses a display architecture called DCE (Display and Compositing Engine), while newer hardware follows DCN (Display Core Next).

The architecture is described by: dc->caps.color.dpp.dcn_arch

AMD Plane Degamma: TF and 1D LUT

Described by: dc->caps.color.dpp.dgam_ram, dc->caps.color.dpp.dgam_rom_caps, dc->caps.color.dpp.gamma_corr

AMD Plane Degamma data is mapped to the initial stage of the DPP pipeline. It is utilized to transition from scanout/encoded values to linear values for arithmetic operations. Plane Degamma supports both pre-defined transfer functions and 1D LUTs, depending on the hardware generation. DCN2 and older families handle both types of curve in the Degamma RAM block (dc->caps.color.dpp.dgam_ram); DCN3+ separates hardcoded curves and the 1D LUT into two blocks: Degamma ROM (dc->caps.color.dpp.dgam_rom_caps) and the Gamma correction block (dc->caps.color.dpp.gamma_corr), respectively.

Pre-defined transfer functions:

  • they are hardcoded curves (read-only memory - ROM);
  • supported curves: sRGB EOTF, BT.709 inverse OETF, PQ EOTF and HLG OETF, Gamma 2.2, Gamma 2.4 and Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. Setting TF = Identity/Default and LUT as NULL means bypass.


AMD Plane 3x4 CTM (Color Transformation Matrix)

AMD Plane CTM data goes to the DPP Gamut Remap block, supporting a 3x4 fixed point (s31.32) matrix for color space conversions. The data is interpreted as a struct drm_color_ctm_3x4. Setting NULL means bypass.
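A 3x4 matrix is nine multiplicative coefficients plus a per-channel offset column. As an illustrative sketch — a hypothetical helper named apply_ctm_3x4 using plain doubles rather than the s31.32 fixed-point encoding the property actually uses — the shape of the operation is:

```c
#include <assert.h>

/* Apply a 3x4 color transformation matrix, stored row-major as
 * 12 coefficients, to an RGB triplet: out = M * [r g b 1]^T.
 * The fourth column of each row is an additive offset. */
static void apply_ctm_3x4(const double m[12], const double in[3],
                          double out[3]) {
  for (int row = 0; row < 3; row++)
    out[row] = m[4 * row + 0] * in[0]
             + m[4 * row + 1] * in[1]
             + m[4 * row + 2] * in[2]
             + m[4 * row + 3]; /* offset */
}
```

With an identity matrix and zero offsets, colors pass through unchanged, which is the same effect as the NULL bypass mentioned above.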


AMD Plane Shaper: TF + 1D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The Shaper block fine-tunes color adjustments before applying the 3D LUT, optimizing the use of the limited entries in each dimension of the 3D LUT. On AMD hardware, a 3D LUT always means a preceding shaper 1D LUT used for delinearizing and/or normalizing the color space before applying a 3D LUT, so this entry on DPP color caps dc->caps.color.dpp.hw_3d_lut means support for both shaper 1D LUT and 3D LUT.

A pre-defined transfer function enables delinearizing content with or without a shaper LUT, where the AMD color module calculates the resulting shaper curve. Shaper curves go from linear values to encoded values. If we are already in a non-linear space and/or don’t need to normalize values, we can set an Identity TF for the shaper, which works similarly to bypass and is also the default TF value.

Pre-defined transfer functions:

  • there is no DPP Shaper ROM. Curves are calculated by AMD color modules. Check calculate_curve() function in the file amd/display/modules/color/color_gamma.c.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting Plane Shaper TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT as NULL works as bypass.


AMD Plane 3D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The 3D LUT in the DPP block facilitates complex color transformations and adjustments. A 3D LUT is a three-dimensional array where each element is an RGB triplet. As mentioned before, dc->caps.color.dpp.hw_3d_lut describes whether the DPP 3D LUT is supported.

The AMD driver-specific property advertises the size of a single dimension via the LUT3D_SIZE property. Plane 3D LUT is a blob property where the data is interpreted as an array of struct drm_color_lut elements, and the number of entries is LUT3D_SIZE cubed. The array contains samples from the approximated function; values between samples are estimated by tetrahedral interpolation. The array is accessed with three indices, one for each input dimension (color channel), blue being the outermost dimension, red the innermost. This distribution is better visualized when examining the code in [RFC PATCH 5/5] drm/amd/display: Fill 3D LUT from userspace by Alex Hung:

+	for (nib = 0; nib < 17; nib++) {
+		for (nig = 0; nig < 17; nig++) {
+			for (nir = 0; nir < 17; nir++) {
+				ind_lut = 3 * (nib + 17*nig + 289*nir);
+
+				rgb_area[ind].red = rgb_lib[ind_lut + 0];
+				rgb_area[ind].green = rgb_lib[ind_lut + 1];
+				rgb_area[ind].blue = rgb_lib[ind_lut + 2];
+				ind++;
+			}
+		}
+	}
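The same layout can be captured in a hypothetical helper (lut3d_index is an illustrative name, not driver code): for a dim-size 3D LUT of RGB triplets, the triplet at grid coordinates (r, g, b) starts at element 3 * lut3d_index(r, g, b, dim), matching the loop above for dim == 17 (17² = 289):

```c
#include <assert.h>
#include <stddef.h>

/* Flat triplet index into a dim x dim x dim 3D LUT stored as RGB
 * triplets: blue is the fastest-varying coordinate, red the slowest,
 * so index = b + dim*g + dim*dim*r. */
static size_t lut3d_index(size_t r, size_t g, size_t b, size_t dim) {
  return b + dim * g + dim * dim * r;
}
```

For dim == 17 this gives 17³ = 4913 triplets in total, the 17-size figure quoted for the hardware.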

In our driver-specific approach we opted to advertise its behavior to userspace instead of implicitly dealing with it in the kernel driver. AMD’s hardware supports 3D LUTs of 17-size or 9-size (4913 and 729 entries respectively), and you can choose between 10-bit and 12-bit precision. In the current driver-specific work we focus on enabling only the 17-size 12-bit 3D LUT, as in [PATCH v3 25/32] drm/amd/display: add plane 3D LUT support:

+		/* Stride and bit depth are not programmable by API yet.
+		 * Therefore, only supports 17x17x17 3D LUT (12-bit).
+		 */
+		lut->lut_3d.use_tetrahedral_9 = false;
+		lut->lut_3d.use_12bits = true;
+		lut->state.bits.initialized = 1;
+		__drm_3dlut_to_dc_3dlut(drm_lut, drm_lut3d_size, &lut->lut_3d,
+					lut->lut_3d.use_tetrahedral_9,
+					MAX_COLOR_3DLUT_BITDEPTH);

A refined control of 3D LUT parameters should go through a follow-up version or generic API.

Setting 3D LUT to NULL means bypass.
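To make the 3D LUT sizing concrete, here is a minimal userspace sketch. The struct mirrors the UAPI struct drm_color_lut layout, and make_identity_3dlut is an illustrative helper, not part of any real API; the loop nesting is likewise illustrative, the driver expects the channel ordering described above.

```cpp
#include <cstdint>
#include <vector>

// Mirrors the UAPI struct drm_color_lut: 16-bit per channel.
struct drm_color_lut {
    uint16_t red, green, blue, reserved;
};

// Build an identity 3D LUT of dim^3 entries (dim = LUT3D_SIZE),
// so 4913 entries for dim = 17 and 729 entries for dim = 9.
std::vector<drm_color_lut> make_identity_3dlut(uint32_t dim)
{
    std::vector<drm_color_lut> lut(dim * dim * dim);
    uint32_t i = 0;
    for (uint32_t b = 0; b < dim; b++)
        for (uint32_t g = 0; g < dim; g++)
            for (uint32_t r = 0; r < dim; r++) {
                lut[i].red   = static_cast<uint16_t>(r * 0xFFFF / (dim - 1));
                lut[i].green = static_cast<uint16_t>(g * 0xFFFF / (dim - 1));
                lut[i].blue  = static_cast<uint16_t>(b * 0xFFFF / (dim - 1));
                i++;
            }
    return lut;
}
```

A real client would copy such an array into a DRM property blob (DRM_IOCTL_MODE_CREATEPROPBLOB) and attach the blob ID to the plane property.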

AMD Plane Blend/Out Gamma: TF + 1D LUT

Described by: dc->caps.color.dpp.ogam_ram

The Blend/Out Gamma block applies the final touch-up before blending, allowing users to linearize content after the 3D LUT and just before blending. It supports both 1D LUT and pre-defined TF. We can see the Shaper and Blend LUTs as 1D LUTs that sandwich the 3D LUT. So, if we don't need 3D LUT transformations, we may want to only use the Degamma block to linearize and skip Shaper, 3D LUT and Blend.

Pre-defined transfer function:

  • there is no DPP Blend ROM. Curves are calculated by AMD color modules;
  • supported curves: Identity, sRGB EOTF, BT.709 inverse OETF, PQ EOTF, HLG inverse OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit depth. The data is interpreted as an array of struct drm_color_lut elements. If plane_blend_tf_property != Identity TF, the AMD color module will combine the user LUT values with the pre-defined TF into the LUT parameters to be programmed. Setting TF = Identity/Default and LUT to NULL means bypass.
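As an illustration of a custom LUT a client could hand to this block, here is a sketch that fills a 4096-entry array with a Gamma 2.2 encoding curve. The struct mirrors the UAPI struct drm_color_lut (values are 16-bit in the UAPI regardless of the hardware LUT depth), and make_gamma22_lut is a hypothetical helper, not part of any driver API:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Mirrors the UAPI struct drm_color_lut: 16-bit per channel.
struct drm_color_lut {
    uint16_t red, green, blue, reserved;
};

// Fill a 4096-entry LUT with a Gamma 2.2 encoding curve
// (output = input^(1/2.2), scaled to the full 16-bit range).
std::vector<drm_color_lut> make_gamma22_lut()
{
    const std::size_t size = 4096;
    std::vector<drm_color_lut> lut(size);
    for (std::size_t i = 0; i < size; i++) {
        double x = static_cast<double>(i) / (size - 1);
        auto v = static_cast<uint16_t>(std::lround(std::pow(x, 1.0 / 2.2) * 0xFFFF));
        lut[i].red = lut[i].green = lut[i].blue = v;
    }
    return lut;
}
```

The resulting array would then be wrapped in a DRM property blob and attached to the plane, like any other LUT property.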

MPC Color Pipeline: After Blending (Per CRTC)

DRM CRTC Degamma 1D LUT

The degamma lookup table (LUT) for converting framebuffer pixel data before applying the color conversion matrix. The data is interpreted as an array of struct drm_color_lut elements. Setting NULL means bypass.

Not really supported. The driver is currently reusing the DPP degamma LUT block (dc->caps.color.dpp.dgam_ram and dc->caps.color.dpp.gamma_corr) to support the DRM CRTC Degamma LUT, as explained in [PATCH v3 20/32] drm/amd/display: reject atomic commit if setting both plane and CRTC degamma.

DRM CRTC 3x3 CTM

Described by: dc->caps.color.mpc.gamut_remap

It sets the current transformation matrix (CTM) applied to pixel data after the lookup through the degamma LUT and before the lookup through the gamma LUT. The data is interpreted as a struct drm_color_ctm. Setting NULL means bypass.

DRM CRTC Gamma 1D LUT + AMD CRTC Gamma TF

Described by: dc->caps.color.mpc.ogam_ram

After all that, you might still want to convert the content to wire encoding. No worries; in addition to the DRM CRTC 1D LUT, we've got an AMD CRTC gamma transfer function (TF) to make it happen. Possible TF values are defined by enum amdgpu_transfer_function.

Pre-defined transfer functions:

  • there is no MPC Gamma ROM. Curves are calculated by AMD color modules.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit depth. The data is interpreted as an array of struct drm_color_lut elements. When setting CRTC Gamma TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that's actually programmed. Setting TF = Identity/Default and LUT to NULL means bypass.

Others

AMD CRTC Shaper and 3D LUT

We have previously worked on exposing the CRTC shaper and CRTC 3D LUT, but they were removed from the AMD driver-specific color series because they lack a userspace use case. The CRTC shaper and 3D LUT work similarly to the plane shaper and 3D LUT, but after blending (MPC block). The difference here is that setting (not bypassing) the Shaper and Gamma blocks together is not expected, since both blocks are used to delinearize the input space. In summary, we either set Shaper + 3D LUT or Gamma.

Input and Output Color Space Conversion

There are two other color capabilities of AMD display hardware that were integrated into DRM by previous works and are worth a brief explanation here. The DC Input CSC sets pre-defined coefficients from the values of the DRM plane color_range and color_encoding properties. It is used for color space conversion of the input content. On the other hand, the DC Output CSC (OCSC) sets pre-defined coefficients from the DRM connector colorspace property. It is used for color space conversion of the composed image to the one supported by the sink.

The search for rainbow treasures is not over yet

If you want to understand a little more about this work, be sure to watch the two talks Joshua and I presented at XDC 2023 about AMD/Steam Deck colors on Gamescope:

In the time between the first and second parts of this blog post, Uma Shankar and Chaitanya Kumar Borah published the plane color pipeline for Intel, and Harry Wentland implemented a generic API for DRM based on VKMS support. We discussed these two proposals and the next steps for color on Linux during the Color Management workshop at XDC 2023, and I briefly shared the workshop results in the 2023 XDC lightning talk session.

The search for rainbow treasures is not over yet! We plan to meet again next year at the 2024 Display Hackfest in A Coruña, Spain (Igalia's HQ) to keep up the pace and continue advancing today's display needs on Linux.

Finally, a HUGE thank you to everyone who worked with me on exploring AMD’s color capabilities and making them available in userspace.

November 07, 2023 08:12 AM

Loïc Le Page

Use EGLStreams in a WPE WebKit backend

Build a WPE WebKit backend based on the EGLStream extension.

In the previous post we saw how to build a basic WPE Backend from scratch. Now we are going to transfer the frames produced by the WPEWebProcess to the application process without copying the graphical data out of the GPU memory.

There are different mechanisms to transfer hardware buffers from one process to another. We are going to use here the EGLStream extension that is available on compatible hardware from the EGL API.

The reference project for this WPE Backend is WPEBackend-offscreen-nvidia. It is using the EGLStream extension to produce the hardware buffers on the WPEWebProcess side, and the EGL_NV_stream_consumer_eglimage extension (that is specific to NVidia hardware) to transform those buffers into EGLImages on the application process side.

N.B. the WPEBackend-offscreen-nvidia project and the content of this post are compatible with the WPE WebKit 2.38.x or 2.40.x versions using the 1.14.x version of libwpe.

With the growing adoption of DMA Buffers on all modern Linux platforms, the WPE WebKit architecture is evolving and, in the future, the need for a WPE Backend should disappear.

Future designs of the libwpe API should allow applications to directly receive the video buffers on the application process side, without needing to implement a different WPE Backend for each hardware platform. From the application developer's point of view, it will simplify the usage of WPE WebKit by hiding all multiprocess considerations.

The EGLStream extension #

EGLStream is an extension of the EGL API. It defines mechanisms to transfer hardware video buffers from one process to another without leaving GPU memory. In this respect it fulfils the same purpose as DMA Buffers, but without the complexity of negotiating the different parameters and transparently handling multi-plane images. On the other hand, the EGLStream extension is less flexible than DMA Buffers and is unlikely to be available on non-NVidia hardware.

In other words: the EGLStream extension is quite easy to use but you will need an NVidia GPU.

The EGLStream goes in one exclusive direction from a producer in charge of creating the video buffer, to a consumer that will receive and present the video buffer. So you need two endpoints, one in the consumer process and one in the producer process.

Indeed, apart from the EGLStream extension itself, you will need other extensions to create the content of the video buffer on the producer side, and to consume this content on the consumer side.

The producer extension used in the WPEBackend-offscreen-nvidia project is EGL_KHR_stream_producer_eglsurface. It is straightforward to use: you create a specific surface the same way you would create a PBuffer surface, and then, when calling eglSwapBuffers, the EGL implementation takes care of finishing the drawing and sending the content to the consumer.

On the consumer side, you can use different extensions. The most interesting in our case are:

  • EGL_KHR_stream_consumer_gltexture, to bind the stream frames to a GL_TEXTURE_EXTERNAL_OES texture;
  • EGL_NV_stream_consumer_eglimage, to obtain the stream frames as EGLImages.

In WPEBackend-offscreen-nvidia we are using the latter approach, so we can directly transfer the obtained EGLImage to the user application for presentation.

How to initialize the EGLStream between the producer and the consumer #

The order of creation of the EGLStream endpoints, the steps for their connection and configuration of the producer and consumer extensions are quite strict:

1. You start with creating the consumer endpoint of the EGLStream: #

EGLStream createConsumerStream(EGLDisplay eglDisplay) noexcept
{
    static constexpr const EGLint s_streamAttribs[] = {EGL_STREAM_FIFO_LENGTH_KHR, 1,
                                                       EGL_CONSUMER_ACQUIRE_TIMEOUT_USEC_KHR, 1000 * 1000,
                                                       EGL_NONE};
    return eglCreateStreamKHR(eglDisplay, s_streamAttribs);
}

The EGL_STREAM_FIFO_LENGTH_KHR parameter defines the length of the EGLStream queue. If set to 0, the EGLStream works in mailbox mode, which means that each time the producer has a new frame it empties the stream content and replaces the current frame with the new one. If greater than 0, the EGLStream works in fifo mode, which means that the stream queue can contain up to EGL_STREAM_FIFO_LENGTH_KHR frames.

Here we are configuring a queue of 1 frame because the specification of EGL_KHR_stream_producer_eglsurface guarantees that, in this case, calling eglSwapBuffers(...) on the producer side will block until the consumer reads the previous frame in the EGLStream queue. This ensures proper rendering synchronization between the application process side and the WPEWebProcess side, without needing to rely on IPC which would add an extra delay between frames.

The EGL_CONSUMER_ACQUIRE_TIMEOUT_USEC_KHR parameter defines the maximum timeout in microseconds to wait on the consumer side to acquire a frame when calling eglStreamConsumerAcquireKHR(...). It is only used with the EGL_KHR_stream_consumer_gltexture extension as EGL_NV_stream_consumer_eglimage offers to set a different timeout for each call of its eglQueryStreamConsumerEventNV(...) acquire function.

2. Still on the consumer side, you get the EGLStream file descriptor by calling eglGetStreamFileDescriptorKHR(...). #

Then you need to initialize the consumer mechanism.

If you are using the EGL_KHR_stream_consumer_gltexture extension #

You initialize your OpenGL or GLES context and create the external texture associated with the EGLStream:

glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_EXTERNAL_OES, texture);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

eglStreamConsumerGLTextureExternalKHR(eglDisplay, eglStream);

By calling eglStreamConsumerGLTextureExternalKHR(...) the currently bound GL_TEXTURE_EXTERNAL_OES texture is associated with the provided EGLStream.

If you are using the EGL_NV_stream_consumer_eglimage extension #

In this case the initial configuration is easier as you only need to call the eglStreamImageConsumerConnectNV(...) function to initialize the consumer.

Sending the consumer EGLStream file descriptor to the producer #

Once the consumer is initialized, you need to send the EGLStream file descriptor to the producer process through IPC. If the producer process is the parent process of the consumer, you can just send the integer value of the file descriptor and then, on the producer process, get an open handle to the same resource by calling:

long procFD = syscall(SYS_pidfd_open, cpid, 0);
streamFD = syscall(SYS_pidfd_getfd, procFD, streamFD, 0);
close(procFD);

with cpid the PID of the consumer child process and streamFD the integer value of the EGLStream file descriptor on the consumer process side.

If the producer and consumer processes are not related, you cannot use this technique because of process access rights. In this case, the only way of transferring the file descriptor resource from the consumer to the producer process is to use a Unix socket with the sendmsg(...) and recvmsg(...) functions and a SCM_RIGHTS message type (see the IPC communication integration).

3. During all this time, on the producer side, you are waiting for the EGLStream file descriptor. #

When you finally receive the file descriptor, you can create the producer endpoint of the EGLStream and initialize the producer mechanism with the EGL_KHR_stream_producer_eglsurface extension.

EGLStream eglStream = eglCreateStreamFromFileDescriptorKHR(eglDisplay, consumerFD);

const EGLint surfaceAttribs[] = {EGL_WIDTH, width, EGL_HEIGHT, height, EGL_NONE};
EGLSurface eglSurface = eglCreateStreamProducerSurfaceKHR(eglDisplay, config, eglStream, surfaceAttribs);

Just like with a classical PBuffer surface, you specify the dimensions of the EGLStream producer surface in the surface attributes. The provided EGL config is the same configuration as the one used to create the EGLContext. When you choose the EGLConfig with eglChooseConfig(...), you must specify the value EGL_STREAM_BIT_KHR for the EGL_SURFACE_TYPE parameter.
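For instance, the producer-side config selection could look like the following sketch; only EGL_SURFACE_TYPE = EGL_STREAM_BIT_KHR is the required part, the other attributes are illustrative:

```cpp
// EGL_SURFACE_TYPE must include EGL_STREAM_BIT_KHR so that the
// returned configs can back an EGLStream producer surface.
static constexpr const EGLint s_configAttribs[] = {EGL_SURFACE_TYPE, EGL_STREAM_BIT_KHR,
                                                   EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT,
                                                   EGL_RED_SIZE, 8,
                                                   EGL_GREEN_SIZE, 8,
                                                   EGL_BLUE_SIZE, 8,
                                                   EGL_ALPHA_SIZE, 8,
                                                   EGL_NONE};

EGLConfig config = nullptr;
EGLint numConfigs = 0;
eglChooseConfig(eglDisplay, s_configAttribs, &config, 1, &numConfigs);
```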

At the moment of selecting the EGLContext before drawing, you will bind this newly created eglSurface as the target surface. Then eglSwapBuffers(...) will automatically take care of finishing the rendering and sending the video buffer content to the consumer endpoint of the EGLStream.

sequenceDiagram
    participant A as Producer
    participant B as Consumer
    B ->> B: Create the EGLStream
    B ->> B: Get the EGLStream file descriptor
    B ->> B: Initialize the consumer
    B ->> A: Send the file descriptor
    A ->> A: Create the EGLStream
    A ->> A: Initialize the producer
    A --> B: Connected
    activate A
    activate B
    loop Rendering
        A ->> A: Draw frame
        A ->> B: Send frame
        B ->> B: Consume frame
        B -->> A: EGLStream ready for next frame
    end
    deactivate A
    deactivate B

How to consume the EGLStream frames #

As explained above, on the producer side there is nothing special to do. You can do the rendering just like with a conventional surface as the whole EGLStream mechanism is transparently handled during the eglSwapBuffers(...) call.

On the consumer side, you need to manually acquire and release the frames. The principle is the same with any of the consumer extensions:

  • you acquire the next frame from the EGLStream queue
  • you do the presentation
  • you release the frame when not needed anymore

but the actual functions calls are slightly different.

With the EGL_KHR_stream_consumer_gltexture extension #

In this case, the rendering loop will look like this:

while (drawing)
{
    EGLint status = 0;
    if (!eglQueryStreamKHR(eglDisplay, eglStream, EGL_STREAM_STATE_KHR, &status))
        break;

    switch (status)
    {
    case EGL_STREAM_STATE_CREATED_KHR:
    case EGL_STREAM_STATE_CONNECTING_KHR:
        continue;

    case EGL_STREAM_STATE_DISCONNECTED_KHR:
        drawing = false;
        break;

    case EGL_STREAM_STATE_EMPTY_KHR:
    case EGL_STREAM_STATE_OLD_FRAME_AVAILABLE_KHR:
    case EGL_STREAM_STATE_NEW_FRAME_AVAILABLE_KHR:
        break;
    }
    if (!drawing)
        break;

    if (!eglStreamConsumerAcquireKHR(eglDisplay, eglStream))
        continue;

    // Drawing code...
    // The currently bound GL_TEXTURE_EXTERNAL_OES holds the content
    // of the new EGLStream frame

    eglSwapBuffers(eglDisplay, eglSurface);
    eglStreamConsumerReleaseKHR(eglDisplay, eglStream);
}

The eglStreamConsumerAcquireKHR(...) call will block for at most EGL_CONSUMER_ACQUIRE_TIMEOUT_USEC_KHR microseconds, or until receiving a frame (or until the EGLStream is disconnected). The value of EGL_CONSUMER_ACQUIRE_TIMEOUT_USEC_KHR is set once and for all when creating the consumer EGLStream endpoint.

With the EGL_NV_stream_consumer_eglimage extension #

In this case, the rendering loop will look like this:

static constexpr EGLTime ACQUIRE_MAX_TIMEOUT_USEC = 1000 * 1000;
EGLImage eglImage = EGL_NO_IMAGE;

while (drawing)
{
    EGLenum event = 0;
    EGLAttrib data = 0;
    // WARNING: specifications state that the timeout is in nanoseconds
    // (see: https://registry.khronos.org/EGL/extensions/NV/EGL_NV_stream_consumer_eglimage.txt)
    // but in reality it is in microseconds (at least with the version 535.113.01 of the NVidia drivers)
    if (!eglQueryStreamConsumerEventNV(eglDisplay, eglStream, ACQUIRE_MAX_TIMEOUT_USEC, &event, &data))
    {
        drawing = false;
        continue;
    }

    switch (event)
    {
    case EGL_STREAM_IMAGE_ADD_NV:
        if (eglImage)
            eglDestroyImage(eglDisplay, eglImage);

        eglImage = eglCreateImage(eglDisplay, EGL_NO_CONTEXT, EGL_STREAM_CONSUMER_IMAGE_NV,
                                  static_cast<EGLClientBuffer>(eglStream), nullptr);
        continue;

    case EGL_STREAM_IMAGE_REMOVE_NV:
        if (data)
        {
            EGLImage image = reinterpret_cast<EGLImage>(data);
            eglDestroyImage(eglDisplay, image);
            if (image == eglImage)
                eglImage = EGL_NO_IMAGE;
        }
        continue;

    case EGL_STREAM_IMAGE_AVAILABLE_NV:
        if (eglStreamAcquireImageNV(eglDisplay, eglStream, &eglImage, EGL_NO_SYNC))
            break;
        else
            continue;

    default:
        continue;
    }

    // Present the EGLImage...

    eglStreamReleaseImageNV(eglDisplay, eglStream, eglImage, EGL_NO_SYNC);
}

if (eglImage)
    eglDestroyImage(eglDisplay, eglImage);

Integration of EGLStreams in WPEBackend-offscreen-nvidia #

The WPEBackend-offscreen-nvidia project is based on the WPEBackend-direct architecture presented in the previous post, with the following differences:

On the application process side #

The ViewBackend initializes an EGL surfaceless display using the EGL_MESA_platform_surfaceless extension in order to be able to create a consumer EGLStream.

Then the WPE Backend consumes the EGLStream frames from a separate thread, while the end-user application callback is called from the application main thread. This separation is needed to avoid side effects between the application rendering EGL display and the consumer surfaceless display.

After acquiring a new frame, the consumer thread waits until the end-user application main thread calls the wpe_offscreen_nvidia_view_backend_dispatch_frame_complete(...) function. This notifies the WPE Backend that the frame can be released. Then the consumer thread releases the frame and tries to acquire the next one. Unlike in WPEBackend-direct, there is no need for further IPC synchronization, as the EGLStream producer (on the WPEWebProcess side) is blocked until the frame is released on the consumer side. The rendering synchronization is implicit through the EGLStream mechanisms.

sequenceDiagram
    participant A as WPE Backend - Consumer thread
    participant B as WPE Backend - Application main thread
    participant C as End-user application
    A ->> A: Acquire frame from producer
    activate A
    A ->> A: wait
    B ->> C: Call presentation callback with actual EGLImage
    C ->> C: Use the EGLImage
    C ->> B: wpe_offscreen_nvidia_view_backend_dispatch_frame_complete(...)
    B ->> A: Unlock consumer thread
    deactivate A
    A ->> A: Release frame

The IPC communication #

As far as IPC communication is concerned, WPEBackend-offscreen-nvidia adds the possibility to transfer file descriptors between processes using a Unix socket channel and the SCM_RIGHTS message type.

bool Channel::writeFileDescriptor(int fd) noexcept
{
    assert(m_localFd != -1);
    if (fd == -1)
        return false;

    union {
        cmsghdr header;
        char buffer[CMSG_SPACE(sizeof(int))];
    } control = {};

    msghdr msg = {};
    msg.msg_control = control.buffer;
    msg.msg_controllen = sizeof(control.buffer);

    cmsghdr* header = CMSG_FIRSTHDR(&msg);
    header->cmsg_len = CMSG_LEN(sizeof(int));
    header->cmsg_level = SOL_SOCKET;
    header->cmsg_type = SCM_RIGHTS;
    *reinterpret_cast<int*>(CMSG_DATA(header)) = fd;

    if (sendmsg(m_localFd, &msg, MSG_EOR | MSG_NOSIGNAL) == -1)
    {
        closeChannel();
        m_handler.handleError(*this, errno);
        return false;
    }

    return true;
}

int Channel::readFileDescriptor() noexcept
{
    assert(m_localFd != -1);

    union {
        cmsghdr header;
        char buffer[CMSG_SPACE(sizeof(int))];
    } control = {};

    msghdr msg = {};
    msg.msg_control = control.buffer;
    msg.msg_controllen = sizeof(control.buffer);

    if (recvmsg(m_localFd, &msg, MSG_WAITALL) == -1)
    {
        closeChannel();
        m_handler.handleError(*this, errno);
        return -1;
    }

    cmsghdr* header = CMSG_FIRSTHDR(&msg);
    if (header && (header->cmsg_len == CMSG_LEN(sizeof(int))) && (header->cmsg_level == SOL_SOCKET) &&
        (header->cmsg_type == SCM_RIGHTS))
    {
        int fd = *reinterpret_cast<int*>(CMSG_DATA(header));
        return fd;
    }

    return -1;
}

On the WPEWebProcess side #

The RendererBackendEGL provides a surfaceless platform with the EGL_PLATFORM_SURFACELESS_MESA value. In order to work correctly with WPE WebKit it requires a small patch in the WebKit source code (see here).

By default, the WPE WebKit code sets the EGL_WINDOW_BIT value for the EGL_SURFACE_TYPE parameter when using surfaceless EGL contexts in the GLContextEGL.cpp file. We need to set the EGL_STREAM_BIT_KHR value to select EGLConfigs compatible with an EGLStream producer surface.

Then in RendererBackendEGLTarget, the only differences are:

  • in the RendererBackendEGLTarget::frameWillRender(...) method it lazily initializes the producer EGLStream if needed (as soon as it has received the EGLStream file descriptor from the consumer side). And then it attaches the EGLStream producer surface to the current ThreadedCompositor EGLContext. This way all further drawing will be re-targeted to the producer surface.
  • in the RendererBackendEGLTarget::frameRendered(...) method it swaps the producer surface buffers (which will block until the consumer acquires the new frame) and then directly calls the wpe_renderer_backend_egl_target_dispatch_frame_complete(...) function, as the rendering synchronization has already implicitly occurred during the buffer swapping phase.

The rest of the code is the same as in WPEBackend-direct.

Using WPEBackend-offscreen-nvidia in a docker container without windowing system #

As the producer (WPEWebProcess side) and the consumer (application process side) are both using EGL surfaceless contexts, WPE WebKit running with the WPEBackend-offscreen-nvidia backend can work on a Linux system with no windowing system at all.

This is very useful when doing offscreen rendering server-side, using docker containers for example. As soon as you have an NVidia GPU correctly configured on the host, you can create lightweight containers with a minimal set of packages and do web rendering into EGLImages with full hardware acceleration and zero-copy of the video buffers from the GPU to the CPU (you may still have copies - or not - inside the GPU memory depending on the driver implementation and the usage of your final EGLImages).

To use the host NVidia GPU from within a docker container, you first need to install the NVIDIA Container Toolkit.

Then the WPEBackend-offscreen-nvidia project provides a Dockerfile example to build the whole backend with the WPE WebKit patched library, and to run it with the provided webview-sample. It also provides a docker-compose.yml example to run the previous webview-sample with the NVidia GPU.

These examples still install some X11 dependencies as the webview-sample creates an X11 window to present the content of the produced EGLImages. If you are using WPE WebKit and the WPEBackend-offscreen-nvidia backend in an environment with no windowing system, you will only need to install the following dependencies:

ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=graphics

RUN apt-get update -qq && \
    apt-get upgrade -qq && \
    apt-get install -qq --no-install-recommends \
        gstreamer1.0-plugins-good gstreamer1.0-plugins-bad \
        libgstreamer-gl1.0-0 libepoxy0 libharfbuzz-icu0 libjpeg8 \
        libxslt1.1 liblcms2-2 libopenjp2-7 libwebpdemux2 libwoff1 libcairo2 \
        libglib2.0-0 libsoup2.4-1 libegl1-mesa

WORKDIR /root
RUN mkdir -p /usr/share/glvnd/egl_vendor.d && \
    cat <<EOF > /usr/share/glvnd/egl_vendor.d/10_nvidia.json
{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}
EOF

The gstreamer plugins are only needed if you want to use multimedia features; otherwise you can remove them as well.

November 07, 2023 12:00 AM

November 06, 2023

José Dapena

Keep GCC running (2023 update)

Last week I attended BlinkOn 18 at Google's Sunnyvale offices. For the 5th time I presented the status of building Chromium with the GNU toolchain components: the GCC compiler and the libstdc++ standard library implementation.

This blog post recaps the current status in 2023, as it was presented in the conference.

Current status

First things first: GCC is still maintained, and working, on the official development releases! Though, as the official build bots do not check that, fixes usually take a few extra days to land.

This is the result of the work from contributors of several companies (Igalia, LGE, Vewd and others). But most important, from individual contributors (in the last 2 years the main contributor, with more than 50% of the commits, has been Stephan Hartmann from Gentoo).

GCC support

The work to support GCC is coordinated from the GCC Chromium meta bug.

Since I started tracking GCC support, we have seen a fairly stable number of contributions, 70-100 per year. Though in 2023 (even counting only 8 months on this chart), we are way below 40.

What happened? I am not really sure. Though, I have some ideas. First, simply as we move to newer versions of GCC, the implementation of recent C++ standards has improved. But then, this is also the result of the great work that has been done recently in both GCC and LLVM, and also in the standardization process, to get more interoperable implementations.

Main GCC problems

Now, I will focus on the most frequent causes for build breakage affecting GCC since January 2022.

Variable clashes

The rules for visibility in C++ are slightly different in GCC. An example: if a class declares a getter with the same name as a declared type in the current namespace, Clang will be mostly Ok with that. But GCC will fail with an error.

Bad:

using Foo = int;

class A {
    ...
    Foo Foo();
    ...
};

A possible fix is renaming the accessor method name:

Foo GetFoo();

Or using an explicit namespace for the type:

::Foo GetFoo();

Ambiguous constructors

GCC may sometimes fail to resolve which constructor to use when there is an implicit type conversion. To avoid that, we can make that conversion explicit or use all braces initializers.
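A minimal sketch of the first fix, with a hypothetical Pixel type standing in for the real code:

```cpp
// Two constructors compete when the argument needs an
// implicit conversion (hypothetical type for illustration).
struct Pixel {
    int value;
    Pixel(int v) : value(v) {}
    Pixel(double v) : value(static_cast<int>(v)) {}
};

int MakePixelValue(long raw)
{
    // Pixel p(raw);  // ambiguous: long -> int and long -> double are both viable
    Pixel p(static_cast<int>(raw));  // fix: make the conversion explicit
    return p.value;
}
```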

constexpr is more strict in GCC

In GCC a constexpr method demands that all its parts are also declared constexpr:

Bad

int MethodA() { ... }

constexpr int MethodB() { ... MethodA() ... }

Two possible fixes: either dropping constexpr from the calling method, or adding constexpr to all the methods it uses.
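A complete version of the second fix, with illustrative names, where every function in the chain is constexpr:

```cpp
// Everything reachable from a constexpr evaluation must itself
// be constexpr, so MethodA gets the keyword too.
constexpr int MethodA() { return 20; }
constexpr int MethodB() { return MethodA() + 22; }

static_assert(MethodB() == 42, "usable in constant expressions");
```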

noexcept strictness

noexcept strictness in GCC and Clang is the same when exceptions are enabled, requiring that all functions invoked from a noexcept function are also noexcept. Though, when exceptions are disabled with -fno-exceptions, Clang will ignore these errors while GCC will still check the rules.

This works in Clang and not in GCC if -fno-exceptions is set:

class A {
    ...
    virtual int Method();
};

class B : public A {
    ...
    int Method() noexcept override;
};

CPU intrinsics casts

Implicit casts of CPU intrinsic types will fail in GCC, requiring explicit conversion or cast.

As example:

int8_t input __attribute__((vector_size(16)));
uint16_t output = _mm_movemask_epi8(input);

If we see the declaration of the intrinsic call:

int _mm_movemask_epi8 (__m128i a);

GCC will not allow implicit casting from an int8_t __attribute__((vector_size(16))), though the storage is the same. In GCC we require a reinterpret_cast:

int8_t input __attribute__((vector_size(16)));
uint16_t output = _mm_movemask_epi8(reinterpret_cast<__m128i>(input));

Template specializations need to be in a namespace scope

Template specializations are not allowed in GCC if they are not in a namespace scope. Usually this error materializes when developers add the template specialization inside the templated class scope.

A failing example:

namespace a {
    class A {
        template<typename T>
        T Foo(const T& t) {
            ...
        }

        template<>
        size_t Foo(const size_t& t);
    };
}

The fix is moving the template specialization to the namespace scope:

namespace a {
    class A {
        template<typename T>
        T Foo(const T& t) {
            ...
        }
    };

    template<>
    size_t A::Foo(const size_t& t) {
        ...
    }
}

libstdc++ support

Regarding the C++ standard library implementation from GNU project, the work is coordinated in the libstdc++ Chromium meta bug.

We have definitely observed a big increase in required fixes in 2022, and the projection for 2023 is similar.

In this case there are two possible reasons. One is that we are more exhaustively tracking the work using the meta bug.

Main libstdc++ problems

In the case of libstdc++, these are the most frequent causes for build breakage since January 2022.

Missing includes

The libc++ implementation is different from libstdc++, which implies that some library headers indirectly included by libc++ headers are not indirectly included by libstdc++, so they need to be included explicitly.
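A typical illustration (which headers are pulled in transitively varies between library versions, so the safe rule is to always include what you use):

```cpp
#include <algorithm>  // must be explicit: libstdc++ may not pull it in via <vector>
#include <vector>

int MaxElement()
{
    std::vector<int> v{3, 1, 4, 1, 5};
    return *std::max_element(v.begin(), v.end());
}
```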

STL containers not allowing const members

Let’s see this code:

std::vector<const int> v;

In Clang (libc++), this is allowed. Though, in libstdc++ there is an explicit static assert forbidding it: std::vector must have a non-const, non-volatile value_type.

This is not only specific to std::vector, but also applies to std::unordered_*, std::list, std::set, std::deque, std::forward_list and std::multiset.

The solution? Just do not use const as members:

std::vector<int> v;

Destructor of unique_ptr requires declaration of contained type

Assigning from nullptr to std::unique_ptr requires destructor declaration

class A;

std::unique_ptr<A> GetValue() {
    return nullptr;
}

Usually it is just needed to include the full declaration of the class, as the default destructor needs the complete type definition.
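A fixed version of the snippet above; the class body is illustrative, the point is only that the full definition is visible where the std::unique_ptr<A> destructor is instantiated:

```cpp
#include <memory>

// Full definition visible: the implicit std::unique_ptr<A>
// destructor can now delete an A.
class A {
public:
    int value = 7;
};

std::unique_ptr<A> GetValue()
{
    return nullptr;
}
```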

Yocto meta-chromium layer

In my case, to verify GCC and libstdc++ support, on top of building Chromium on Ubuntu with both, I also maintain a Yocto layer that builds Chromium development releases. I usually try to verify the build less than a week after each release.

The layer is available at github.com/Igalia/meta-chromium

I am regularly testing the build using core-image-weston in both Raspberry PI 4 (64 bits) and Intel x86-64 (using Qemu). There I try both X11 and Wayland Ozone backends.

Wrapping up

I would still recommend that people move to the officially supported Chromium toolchains: Clang and libc++. With them, downstreams get a far more tested implementation, more security features, and better integration with sanitizers. But, meanwhile, I expect things will keep working, as several downstreams and distributions still ship Chromium on top of the GNU toolchains.

Even with the GNU toolchain not being officially supported in Chromium, the community has been successful providing support for both GCC and libstdc++. Thanks to all the contributors!

by José Dapena Paz at November 06, 2023 06:53 AM