Planet Igalia

October 28, 2024

Maíra Canal

Unleashing Power: Enabling Super Pages on the RPi

Unleashing the power of 3D graphics in the Raspberry Pi is a key commitment for Igalia through its collaboration with Raspberry Pi. The introduction of Super Pages for the Raspberry Pi 4 and 5 marks another step in this journey, offering some performance enhancements and more efficient memory usage. In this post, we’ll dive deep into the technical details of Super Pages, discuss the challenges we faced during implementation, and illustrate the benefits this feature brings to the Raspberry Pi ecosystem.

What are Super Pages?

A Memory Management Unit (MMU) is a hardware component responsible for handling memory access at the system level. It translates virtual addresses used by programs into physical addresses in main memory, enabling efficient memory management and protection. The MMU allows the operating system to allocate memory dynamically, isolating processes from one another to prevent them from interfering with each other’s memory.

Recommendation: 📚 Structured computer organization by Andrew Tanenbaum

The V3D MMU, which is part of the Broadcom GPU found in the Raspberry Pi 4 and 5, is responsible for translating 32-bit virtual addresses (VA) used by V3D into 40-bit physical addresses used externally to V3D. The MMU relies on a page table, stored in physical memory, which maps virtual addresses to their corresponding physical addresses. The operating system manages this page table, and the MMU uses it to perform address translation during memory access.

A fundamental principle of modern operating systems is that memory is not stored contiguously. Instead, a contiguous block of memory is divided into smaller blocks, called “pages”, which are scattered across the entire address space. These pages are typically 4KB in size. This approach enables more efficient memory management and allows for features like virtual memory and memory protection.

Over the years, the amount of available memory in computers has increased dramatically. An early IBM PC had up to 640 KiB of RAM, whereas the ThinkPad I’m typing on right now has 32 GB of RAM. Naturally, memory demands have grown alongside this increase. Today, it’s common for web browsers to consume several gigabytes of RAM, and a single shader can take up multiple megabytes.

As memory usage grows, a 4KB page size may become inefficient for managing large memory blocks. Handling a large number of small pages for a single block means the MMU must perform multiple address translations, which increases overhead. This can reduce the effectiveness of the Translation Lookaside Buffer (TLB), as it must store and handle more entries, potentially leading to more cache misses and reduced overall performance.

This is why many CPU manufacturers have introduced support for larger page sizes. For instance, x86 CPUs typically support 4KB and 2MB pages, with 1GB pages available if supported by the hardware. Similarly, ARM64 CPUs can support 4KB, 16KB, and 64KB page sizes. These larger page sizes help reduce the number of pages the MMU needs to manage, improving performance by reducing the overhead of address translation and making more efficient use of the TLB.

So, if CPUs are using bigger sizes, why shouldn’t GPUs do the same?

By default, V3D supports 4KB pages. However, by setting specific bits in the page table entry, it is possible to create 64KB “Big Pages” and 1MB “Super Pages.” The issue is that the current V3D driver available in Linux does not enable the use of Big or Super Pages, meaning this hardware feature is currently unused.

The advantage of enabling Big and Super Pages is that once an entry for any page within a Big or Super Page is cached in the MMU, it can be used to translate all virtual addresses within that page’s range without needing to fetch additional entries. In theory, this should result in improved performance, especially for applications with high memory demands, such as those using multiple large buffer objects (BOs).

As Igalia continually strives to enhance the experience for Raspberry Pi users, we decided to implement this feature in the upstream kernel. But before diving into the implementation details, let’s take a look at the real-world results and see if the theoretical benefits of Super Pages have translated into measurable improvements for Raspberry Pi users.

What Does This Feature Mean for RPi Users?

With Super Pages implemented, let’s now explore the actual performance improvements observed on the Raspberry Pi and see how impactful this feature is for users.

Benchmarking Super Pages: Traces and FPS Improvements

To measure the impact of Super Pages, we tested a variety of games and demos traces on the Raspberry Pi 4 and 5, covering genres from action to racing. On average, we observed a +1.40% FPS improvement on the Raspberry Pi 4 and a +1.30% improvement on the Raspberry Pi 5.

For instance, on the Raspberry Pi 4, Warzone 2100 saw an 8.36% FPS increase, and on the Raspberry Pi 5, Quake II enjoyed a 3.62% boost. These examples demonstrate the benefits of Super Pages in resource-demanding applications, where optimized memory handling becomes critical.

Raspberry Pi 4 FPS Improvements

Trace Before Super Pages After Super Pages Improvement
warzone2100.30secs.1024x768.trace 56.39 61.10 +8.36%
ue4_shooter_game_shooting_low_quality_640x480.gfxr 20.71 21.47 +3.65%
quake3e_capture_frames_1800_through_2400_1920x1080.gfxr 60.88 62.50 +2.67%
supertuxkart-menus_1024x768.trace 112.62 115.61 +2.65%
ue4_shooter_game_shooting_high_quality_640x480.gfxr 20.45 20.88 +2.10%
quake2-gles3-1280x720.trace 59.76 60.84 +1.82%
ue4_sun_temple_640x480.gfxr 27.60 28.03 +1.54%
vkQuake_capture_frames_1_through_1200_1280x720.gfxr 54.59 55.30 +1.29%
ue4_shooter_game_low_quality_640x480.gfxr 32.75 33.08 +1.00%
sponza_demo02_800x600.gfxr 20.90 21.03 +0.61%
supertuxkart-racing_1024x768.trace 8.58 8.63 +0.60%
ue4_shooter_game_high_quality_640x480.gfxr 19.62 19.74 +0.59%
serious_sam_trace02_1280x720.gfxr 44.00 44.21 +0.50%
ue4_vehicle_game-2_640x480.gfxr 12.59 12.65 +0.49%
sponza_demo01_800x600.gfxr 21.42 21.46 +0.19%
quake3e-1280x720.trace 84.45 84.52 +0.09%

Raspberry Pi 5 FPS Improvements

Trace Before Super Pages After Super Pages Improvement
quake2-gles3-1280x720.trace 151.77 157.26 +3.62%
supertuxkart-menus_1024x768.trace 306.79 313.88 +2.31%
warzone2100.30secs.1024x768.trace 140.92 144.03 +2.21%
vkQuake_capture_frames_1_through_1200_1280x720.gfxr 131.45 134.20 +2.10%
ue4_vehicle_game-2_640x480.gfxr 24.42 24.88 +1.89%
ue4_shooter_game_high_quality_640x480.gfxr 32.12 32.53 +1.29%
ue4_sun_temple_640x480.gfxr 42.05 42.55 +1.20%
ue4_shooter_game_shooting_high_quality_640x480.gfxr 52.77 53.31 +1.04%
quake3e-1280x720.trace 238.31 240.53 +0.93%
warzone2100.70secs.1024x768.trace 151.09 151.81 +0.48%
sponza_demo02_800x600.gfxr 50.81 51.05 +0.46%
supertuxkart-racing_1024x768.trace 20.91 20.98 +0.33%
ue4_shooter_game_low_quality_640x480.gfxr 59.68 59.86 +0.29%
quake3e_capture_frames_1_through_1800_1920x1080.gfxr 167.70 168.17 +0.29%
ue4_shooter_game_shooting_low_quality_640x480.gfxr 53.40 53.51 +0.22%
quake3e_capture_frames_1800_through_2400_1920x1080.gfxr 163.37 163.64 +0.17%
serious_sam_trace02_1280x720.gfxr 60.00 60.03 +0.06%
sponza_demo01_800x600.gfxr 45.04 45.04 <.01%

While an average +1% FPS improvement might seem modest, Super Pages can deliver more noticeable gains in memory-intensive 3D applications and when the GPU is under heavy usage. Let’s see how the Super Pages perform on Mesa CI.

Benchmarking Super Pages: Mesa CI Job Duration

To avoid introducing regressions in user-space, I usually test my custom kernels with Mesa CI, focusing on the “broadcom-postmerge” stage to verify that all Piglit and CTS tests ran smoothly. For Super Pages, I was pleasantly surprised by the job duration results, as some job durations were reduced by several minutes.

Mesa CI Jobs Duration Improvements

Job Before Super Pages After Super Pages
v3d-rpi4-traces:arm64 ~4m30s ~3m40s
v3d-rpi5-traces:arm64 ~3m30s ~2m45s
v3d-rpi4-gl-full:arm64 */6 ~24-25 minutes ~22-23 minutes
v3d-rpi5-gl-full:arm64 ~48 minutes ~48 minutes
v3dv-rpi4-vk-full:arm64 */6 ~44 minutes ~41 minutes
v3dv-rpi5-vk-full:arm64 ~102 minutes ~92 minutes

Seeing these reductions is especially rewarding. For example, the “v3dv-rpi5-vk-full:arm64” job duration decreased by 10 minutes, meaning more FPS for users and shorter wait times for Mesa developers.

Benchmarking Super Pages: PS2 Emulation

After sharing a couple of tables, I’ll admit that showcasing performance improvements solely through numbers doesn’t always convey the real impact. Personally, I find it more satisfying to see performance gains in action with real-world applications.

This led me to explore PlayStation 2 (PS2) emulation on the RPi 5. From watching YouTube videos, I noticed that PS2 is a popular console for the RPi 5. While the PlayStation (PS1) emulates well even on the RPi 4, and Nintendo 64 and Sega Saturn struggle across most hardware, PS2 hits a sweet spot for testing the RPi 5’s limits.

Fortunately, I still have my childhood PS2 — my second console after the Nintendo GameCube, and one of the most successful consoles worldwide, including in Brazil. With a library packed with titles like Metal Gear Solid, Resident Evil, Tomb Raider, and Shadow of the Colossus, the PS2 remains a great system for collectors and retro gamers alike.

I selected a few games from my collection to benchmark on the RPi 5 using a PS2 emulator. My emulator of choice was Aether SX2 with Vulkan support. Although AetherSX2 is no longer in development, it still performs well on the RPi.

Initially, many games were barely playable, especially those with large buffer objects, like Shadow of the Colossus and Gran Turismo 4. However, after enabling Super Pages support, I noticed immediate improvements. For example, Shadow of the Colossus wouldn’t even open before Super Pages, and while it’s not fully playable yet, it does load now. This isn’t a silver bullet, but it’s a step forward in improving the driver one piece at a time.

I ended up selecting four games for a video comparison: Burnout 3: Takedown, Metal Gear Solid 3: Snake Eater, Resident Evil 4, and Tekken 4.

Disclaimer: The BIOS used in the emulator was extracted from my own PS2, and I played only games I own, with ROMs I personally extracted. Neither I nor Igalia encourage using downloaded BIOS or ROM files from the internet.

From the video, we can see noticeable improvements in all four games. Although they aren’t perfectly playable yet, the performance gains are evident, particularly in Resident Evil 4, where the gameplay saw a solid 5 FPS boost. I realize 18 FPS might not satisfy most players, but I still had a lot of fun playing Resident Evil 4 on the RPi 5.

When tracking the FPS for these games, it’s clear that the performance gains go well beyond the average 1% seen in other benchmarks. Super Pages show their true potential in high-memory applications like PS2 emulation.

Having seen the performance gains Super Pages can bring to the Raspberry Pi, let’s now dive into the technical aspects of the feature.

Implementing Super Pages

The first challenge was figuring out how to allocate a contiguous block of memory using shmem. The Shared Memory Virtual Filesystem (shmem) is used as a flexible memory mechanism that allows the GPU and CPU to share access to BOs through the system’s temporary filesystem, tmpfs. tmpfs is a volatile filesystem that stores files in RAM, making it ideal for temporary or high-speed data that doesn’t need to persist on RAM.

For example, to allocate a 256KB BO across four 64KB pages, we need four contiguous 64KB memory blocks. However, by default, tmpfs only allocates memory in PAGE_SIZE chunks (as seen in shmem_file_setup()), whereas PAGE_SIZE is 4KB on the Raspberry Pi 4 and 16KB on the Raspberry Pi 5. Since the function drm_gem_object_init() — which initializes an allocated shmem-backed GEM object — relies on shmem_file_setup() to back these objects in memory, we had to consider alternatives, as the default PAGE_SIZE would divide memory into increments that are too small to ensure the large, contiguous blocks needed by the GPU.

The solution we proposed was to create drm_gem_object_init_with_mnt(), which allows us to specify the tmpfs mountpoint where the GEM object will be created. This enables us to allocate our BOs in a mountpoint that supports larger page sizes. Additionally, to ensure that our BOs are allocated in the correct mountpoint, we introduced drm_gem_shmem_create_with_mnt(), which allows the mountpoint to be specified when creating a new DRM GEM shmem object.

[PATCH v6 04/11] drm/gem: Create a drm_gem_object_init_with_mnt() function

[PATCH v6 06/11] drm/gem: Create shmem GEM object in a given mountpoint

The next challenge was figuring out how to create a new mountpoint that would allow for different page sizes based on the allocation. Simply creating a new tmpfs mountpoint with a fixed bigger page size wouldn’t suffice, as we needed flexibility for various allocations. Inspired by the i915 driver, we decided to use a tmpfs mountpoint with the “huge=within_size” flag. This flag, which requires the kernel to be configured with CONFIG_TRANSPARENT_HUGEPAGE, enables the allocation of huge pages.

Transparent Huge Pages (THP) is a kernel feature that automatically manages large memory pages to improve performance without needing changes from applications. THP dynamically combines smaller pages into larger ones, typically 2MB, reducing memory management overhead and improving cache efficiency.

To support our new allocation strategy, we created a dedicated tmpfs mountpoint for V3D, called gemfs, which provides us an ideal space for managing these larger allocations.

[PATCH v6 05/11] drm/v3d: Introduce gemfs

With everything in place for contiguous allocations, the next step was configuring V3D to enable Big/Super Page support.

We began by addressing a major source of memory pressure on the Raspberry Pi: the current 128KB alignment for allocations in the virtual memory space. This alignment wastes space when handling small BO allocations, especially since the userspace driver performs a large number of these small allocations.

As a result, we can’t fully utilize the 4GB address space available for the GPU on the Raspberry Pi 4 or 5. For example, we can currently allocate up to 32,000 BOs of 4KB (~140MB) and 3,000 BOs of 400KB (~1.3GB). This becomes a limitation for memory-intensive applications. By reducing the page alignment to 4KB, we can significantly increase the number of BOs, allowing up to 1,000,000 BOs of 4KB (~4GB) and 10,000 BOs of 400KB (~4GB).

Therefore, the first change I made was reducing the VA alignment of all allocations to 4KB.

[PATCH v6 07/11] drm/v3d: Reduce the alignment of the node allocation

With the alignment issue resolved, we can now implement the code to properly set the flags on the Page Table Entries (PTE) for Big/Super Pages. Setting these flags is straightforward — a simple bitwise operation. The challenge lies in determining which BOs can be allocated in Super Pages. For a BO to be eligible for a Big Page, its virtual address must be aligned to 64KB, and the same applies to its physical address. Same thing for Super Pages, but now the addresses must be aligned to 1MB.

If the BO qualifies for a Big/Super Page, we need to iterate over 16 4KB pages (for Big Pages) or 256 4KB pages (for Super Pages) and insert the appropriate PTE.

Additionally, we modified the way we iterate through the BO’s memory. This was necessary because the THP may not always allocate the entire BO contiguously. For example, it might only allocate contiguously 1MB of a 2MB block. To handle this, we now iterate over the blocks of contiguous memory scattered across the scatterlist, ensuring that each segment is properly handled during the allocation process.

What is a scatterlist? It is a Linux Kernel data structure that manages non-contiguous memory as if it were contiguous. It organizes separate memory blocks into a single logical buffer, allowing efficient data handling, especially in Direct Memory Access (DMA) operations, without needing a physically contiguous memory allocation.

[PATCH v6 08/11] drm/v3d: Support Big/Super Pages when writing out PTEs

However, the last few patches alone don’t fully enable the use of Super Pages. While PATCH 08/11 technically allows for Super Pages, we’re still relying on DRM GEM shmem objects, meaning allocations are still happening in PAGE_SIZE chunks. Although Big/Super Pages could potentially be used if the system naturally allocated 1MB or 64KB contiguously, this is quite rare and not our intended outcome. Our goal is to actively use Big/Super Pages as much as possible.

To achieve this, we’ll utilize the V3D-specific mountpoint we created earlier for BO allocation whenever possible. By creating BOs through drm_gem_shmem_create_with_mnt(), we can ensure that large pages are allocated contiguously when possible, enabling the consistent use of Big/Super Pages.

[PATCH v6 09/11] drm/v3d: Use gemfs/THP in BO creation if available

And there you have it — Big/Super Pages are now fully enabled in V3D. The only requirement to activate this feature in any given kernel is ensuring that CONFIG_TRANSPARENT_HUGEPAGE is enabled.

Final Words

You can learn more about ongoing enhancements to the Raspberry Pi driver stack in this XDC 2024 talk by José María “Chema” Casanova Crespo. In the talk, Chema discusses the Super Pages work I developed, along with other advancements in the driver stack.

Of course, there are still plenty of improvements on the horizon at Igalia. I’m currently experimenting with 64KB CLE allocations in user-space, and I hope to share more good news soon.

Finally, I’d like to express my gratitude to Iago Toral and Tvrtko Ursulin for their invaluable support in developing Super Pages for the V3D kernel driver. Thank you both for sharing your experience with me!

October 28, 2024 12:00 PM

October 26, 2024

Stephen Chenney

Caret Customization on the Web

A caret is the symbol used to show where text will appear in text input applications,such as a word processor, code editor or input form. It might be a vertical bar, or an underscore, or some other shape. It may flash, or pulse.

On the web, sites currently have some control over the editing caret through CSS, along with entirely custom solutions. Here I discuss the current state of caret customization on the web and look at proposals to allow more customization through CSS.

Carets on the Web #

The majority of web sites use the default editing caret. In all browsers it is a vertical bar in the text color that blinks. The blink rate and duration varies across browsers: Chrome’s cursor blinks continuously at a fixed rate; Firefox blinks the caret continuously at a rate from user settings; Safari blinks with keyboard input but switches to a strobe effect with dictation input.

All browsers support the CSS caret-color property allowing sites to color the caret independent of the text, and allow the color to be animated. You can also make the caret disappear with a transparent color.

Changing the caret shape is currently not possible through CSS in any browser. There are some ways to work around this, such as those documented in this Stack Overflow post. The general idea is hide the caret and replace it with an element controlled by script. The script relies on the selection range and other APIs to know where the element should be positioned. Or you can completely replace the default browser editing behavior with a custom javascript editor (as is done by, for example, Google Docs).

Browser support for caret customization currently leaves web developers in an awkward situation: accept the basic default cursor with color control, or implement a custom editor, but nothing in-between.

Improving Caret Customization #

There are at least two CSS properties not yet implemented in browsers that improve caret customization. The first concerns the interaction between color animation and the default blinking behavior, and the second adds support for different caret shapes.

The CSS caret-animation property #

When the caret-color is animated, there is no way to reliably synchronize the animation with the browser’s blinking rate. Different browsers blink at different rates, and it may be under user control. The CSS caret-animation property, when set to the value manual suppresses the blinking, giving the color animation complete control over the color of the caret at all times. Using caret-animation: auto (the initial value) leaves the blinking behavior under browser control.

Site control of the blinking is both an accessibility benefit and potentially harmful to users. For users sensitive to motion, disabling the blinking may be a benefit. At the same time, a cursor that does not blink is much harder for users to recognize. Please use caution when employing this property when it is available in browsers.

There is an implementation of caret-animation in Chrome 132 behind the CSSCaretAnimation flag, accessible through --enable-blink-features="CSSCaretAnimation" on the command line. Feel free to comment on the bug if you have thoughts on this feature.

The CSS caret-shape property #

The shape of the caret in native applications is most commonly a vertical bar, an underscore or a rectangular block. In addition, the shape often varies depending on the input mode, such as insert or replace. The CSS caret-shape property allows sites to choose one of these shapes for the caret, or leave the choice up to the browser. The recognized property values are auto, bar, block and underscore.

No browser currently supports the caret-shape property, and the specification needs a little work to confirm the exact location and shape of the underscore and block. Please leave feedback on the Chromium bug if you would like this feature to be implemented.

Other proposed caret properties #

The caret-shape property does not allow any control over the size of the caret, such as the thickness of the bar or block. There was, for example, a proposal to add caret-width as a means for working around browser bugs on zooming and transforming the caret. Please create a new CSS working Group issue if you would like greater customization and can provide use cases.

October 26, 2024 12:00 AM

October 22, 2024

Stephen Chenney

CSS Highlight Inheritance has Shipped!

The CSS highlight inheritance model describes the process for inheriting the CSS properties of the various highlight pseudo elements:

  • ::selection controlling the appearance of selected content
  • ::spelling-error controlling the appearance of misspelled word markers
  • ::grammar-error controlling how grammar errors are marked
  • ::target-text controlling the appearance of the string matching a target-text URL
  • ::highlight defining the appearance of a named highlight, accessed via the CSS highlight API

The feature has now launched in Chrome 131, and this post describes how to achieve the old selection styling behavior if selection is now not visible, or newly visible, on your site.

The most common problem comes when a site relies on ::selection properties only applying to the matched elements and not child elements. For example, sites don’t want <div> elements themselves to be visible, but do want the text content inside them to show the selected text. Some sites may have used such styling to work around old bugs in browsers whereby the highlight on the selected <div> would be too large.

Previously, this style block would have the intended result:

<style>
div::selection {
background-color: transparent;
}
</style>

With CSS highlight inheritance the children of the <div> will now also have transparent selection, which is probably not what was intended.

There’s a good chance that the selection styling added to work around bugs is no longer needed. So the first thing to try is removing the style rule entirely.

Should you really want to style selection on an element but not its children, simplest is to add a style rule to apply the default selection behavior to everything that is not a <div> (or whatever element you are selecting):

<style>
div::selection {
background-color: transparent;
}
:not(div)::selection {
background-color: Highlight;
}
</style>

Highlight is the system default selection background color, whatever that might be in the browser and operating system you are using.

More efficient is to use the child combinator because it will only create new pseudo style objects for the <div> and its direct children, inheriting the rest:

<style>
div::selection {
background-color: transparent;
}
div > *::selection {
background-color: Highlight;
}
</style>

Overall, the former may be better if you have multiple ::selection styles or your ::selection styles already use complex rules, because it avoids the need to have a matching child selector for every one of those.

There are other situations, but building on the examples we suggest two generalizations:

  • Make use of the CSS system colors Highlight and HighlightText to revert selection back to the platform default.
  • Augment any existing ::selection style rules with complementary rules to cancel the selection style changes on other elements that should not inherit the change.

Chrome issue 374528707 may be used to report breakage due to enabling CSS highlight inheritance and to see fixes for the reported cases.

October 22, 2024 12:00 AM

October 17, 2024

Jani Hautakangas

Bringing WebKit back to Android: Internals

In my last blog post, I introduced the WPE-Android project by providing a high-level overview of what the project aims to achieve and the motivations behind bringing WebKit back to Android. This post will take a deeper dive into the technical details and internal structure of WPE-Android, focusing on some key areas of its design and implementation.

WPE-Android is not a standalone WebKit port but rather an extension and adaptation of the existing WPE WebKit APIs. By leveraging the libwpe adaptation layer library and a platform-specific libwpebackend plugin, WPE-Android integrates seamlessly with Android. These components work together to provide WebKit with the necessary access to Android-specific functionality, enabling it to render web content efficiently on Android devices.

WPE-Android High Level

Overall Design #

At the core of WPE-Android lies the native WPE WebKit codebase, which powers much of the browser functionality. However, for Android applications to interact with this native code, a bridge must be established between the Java environment of Android apps and the C++ world of WebKit. This is where the Java Native Interface (JNI) comes into play.

The JNI allows Java code to call native methods implemented in C or C++ and vice versa. In the case of WPE-Android, a robust JNI layer is implemented to facilitate the communication between the Android system and the WebKit engine. This layer consists of several utility classes, which are designed to handle JNI calls efficiently and reduce the possibility of code duplication. These classes essentially act as intermediaries, ensuring that all interactions with the WebKit engine are managed smoothly and reliably.

Below is a simplified diagram of the overall design of WPE-Android internals, highlighting the key components involved in this architecture.

WPE-Android Design

WPEBackend-Android: Bridging the Platform #

WPE-Android relies heavily on the libwpe library to enable platform-specific functionalities. The role of libwpe is crucial because it abstracts away the platform-specific details, allowing WebKit to run on various platforms, including Android, without needing to be aware of the underlying system intricacies.

One of the primary responsibilities of libwpe is to interface with the platform-specific libwpebackend plugin, which handles tasks such as platform graphics buffer support and the sharing of buffers between the WebKit UIProcess and WebProcess. The libwpebackend plugin ensures that graphical content generated by the WebProcess can be efficiently displayed on the device’s screen by the UIProcess.

Although this libwpebackend plugin is a critical component of WPE-Android, I won’t go into describing its detailed implementation in this post. However, for those interested in the internal workings of the WPE Backend, I highly recommend reading Loïc Le Page’s comprehensive blog post on the subject: Create WPE Backends. In WPE-Android, this backend functionality is implemented in the WPEBackend-Android repository.

Recently, WPE-Android has been upgraded to WPE WebKit 2.46.0, which introduces an initial version of the new WPE adaptation API called WPE Platform API. This API is designed to provide a cleaner and more flexible way of integrating WPE WebKit with various platforms. However, since this API is still under active development, WPE-Android currently continues to use the older libwpe API.

The diagram below shows the internals of the WPEBackend-android and how it integrates to WPE WebKit

WPEBackend-android

Rendering Web Content: The Journey from WebProcess to Screen #

Rendering web content efficiently on Android involves several moving parts. Once the WebProcess has generated the graphical frames, these frames need to be displayed on the device’s screen in a seamless and performant manner. To achieve this, WPE-Android makes use of the Android SurfaceControl component.

SurfaceControl plays a key role in managing the surface that displays the rendered content. It allows buffers generated by the WebProcess to be posted directly to SurfaceFlinger, which is Android’s system compositor responsible for combining multiple surfaces into a single screen image. This direct posting of buffers to SurfaceFlinger ensures that the rendered frames are composed and displayed in a highly efficient way, with minimal overhead.

The diagram below illustrates how the WebProcess-generated frames are transferred to the Android system and eventually rendered on the device’s screen.

WPE-Android Rendering

Event loop #

WPE WebKit is built on top of GLib, which provides a wide array of functionality for managing events, network communications, and other system-level tasks. GLib is essential to the smooth operation of WPE WebKit, especially when it comes to handling asynchronous events and running the event loop.

To integrate WPE WebKit’s GLib-based event loop with the Android platform, WPE-Android uses a mechanism that drives the WebKit event loop using the Android looper. Specifically, event file descriptors (FDs) from the WPE WebKit GLib main loop are fed into the Android main looper. On each iteration of the event loop, WPE-Android checks whether any of the file descriptors in the GLib main loop have changed. Based on these changes, WPE-Android adjusts the Android looper by adding or removing FDs as necessary.

This complex logic is implemented in a class called MessagePump, which handles the synchronization between the two event loops and ensures that events are processed in a timely manner.

WPE-Android MessagePump

What’s New Since the Last Update #

Since the last update, WPE-Android has undergone a number of significant changes. These updates have brought new features, bug fixes, and performance improvements, making the project even more robust and capable of handling modern web content on Android devices. Below is a list of the most notable changes:

  • Upgrade to Cerbero 1.24.8: This upgrade ensures better compatibility and improved build processes.
  • Upgrade to NDK r27 LTS: The move to the latest NDK release provides long-term support and brings several new features and optimizations.
  • Upgrade to WPE WebKit 2.46.0: This version of WPE WebKit introduces new functionality, including support for the Skia backend.
  • Enabled Skia backend: Skia is a graphics engine that provides hardware-accelerated rasterization, which enhances rendering performance.
  • Enabled Remote Inspector: This allows developers to remotely inspect and debug web content.
  • Published WPEView to Maven Central: The WPEView library is now publicly available, making it easier for developers to integrate WPE-Android into their own projects.
  • Major bug fixes: Significant improvements have been made to the GLib main loop integration, ensuring smoother operation and fewer edge cases.
  • Dropped gnutls in favor of openssl: This change simplifies the security stack.
  • Streamlined libraries: Unnecessary deployment of libraries that are already provided by Android as system libraries has been removed.
  • Refined WPEView API: Internal methods that should not be exposed to developers have been moved to implementation details, reducing API surface and making the library more user-friendly

Demos #

Mediaplayer #

Demo shows WPE-Android mediaplayer demo on Android device. Demo source code can be found in mediaplayer

Remote Inspector #

Demo shows Remote Inspector usage on Android device. Detailed instructions how to run Remote Inspector can be found in README.md

This demo shows loading the www.igalia.com webpage in a desktop browser and connecting it to a remote inspector service on device. The video demonstrates how webpage elements can be inspected and edited in real-time using the remote inspector.

Try it yourself #

With the recent release of WPEView to Maven Central, it’s now easier than ever to experiment with WPE-Android and integrate it into your own Android projects

To get started, make sure you have mavenCentral() included in your project’s repository configuration. Here’s how you can do it:

dependencyResolutionManagement {
repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
repositories {
google()
mavenCentral()
}
}

Once that’s done, you can add WPEView as a dependency in your project:

dependencies {
implementation("org.wpewebkit.wpeview:wpeview:0.1.0")
...
}

If you’re interested in learning more or contributing to the project, you can find all the details on the WPE-Android GitHub page. We welcome feedback, contributions, and new ideas!

This project is partially funded by the NLNet Foundation, and we appreciate their support in making it possible.

October 17, 2024 12:00 AM

October 16, 2024

Enrique Ocaña

Adding buffering hysteresis to the WebKit GStreamer video player

The <video> element implementation in WebKit does its job by using a multiplatform player that relies on a platform-specific implementation. In the specific case of glib platforms, which base their multimedia on GStreamer, that’s MediaPlayerPrivateGStreamer.

WebKit GStreamer regular playback class diagram

The player private can have 3 buffering modes:

  • On-disk buffering: This is the typical mode on desktop systems, but is frequently disabled on purpose on embedded devices to avoid wearing out their flash storage memories. All the video content is downloaded to disk, and the buffering percentage refers to the total size of the video. A GstDownloader element is present in the pipeline in this case. Buffering level monitoring is done by polling the pipeline every second, using the fillTimerFired() method.
  • In-memory buffering: This is the typical mode on embedded systems and on desktop systems in case of streamed (live) content. The video is downloaded progressively and only the part of it ahead of the current playback time is buffered. A GstQueue2 element is present in the pipeline in this case. Buffering level monitoring is done by listening to GST_MESSAGE_BUFFERING bus messages and using the buffering level stored on them. This is the case that motivates the refactoring described in this blog post, what we actually wanted to correct in Broadcom platforms, and what motivated the addition of hysteresis working on all the platforms.
  • Local files: Files, MediaStream sources and other special origins of video don’t do buffering at all (no GstDownloadBuffering nor GstQueue2 element is present on the pipeline). They work like the on-disk buffering mode in the sense that fillTimerFired() is used, but the reported level is relative, much like in the streaming case. In the initial version of the refactoring I was unaware of this third case, and only realized about it when tests triggered the assert that I added to ensure that the on-disk buffering method was working in GST_BUFFERING_DOWNLOAD mode.

The current implementation (actually, its wpe-2.38 version) was showing some buffering problems on some Broadcom platforms when doing in-memory buffering. The buffering levels monitored by MediaPlayerPrivateGStreamer weren’t accurate because the Nexus multimedia subsystem used on Broadcom platforms was doing its own internal buffering. Data wasn’t being accumulated in the GstQueue2 element of playbin, because BrcmAudFilter/BrcmVidFilter was accepting all the buffers that the queue could provide. Because of that, the player private buffering logic was erratic, leading to many transitions between “buffer completely empty” and “buffer completely full”. This, it turn, caused many transitions between the HaveEnoughData, HaveFutureData and HaveCurrentData readyStates in the player, leading to frequent pauses and unpauses on Broadcom platforms.

So, one of the first thing I tried to solve this issue was to ask the Nexus PlayPump (the subsystem in charge of internal buffering in Nexus) about its internal levels, and add that to the levels reported by GstQueue2. There’s also a GstMultiqueue in the pipeline that can hold a significant amount of buffers, so I also asked it for its level. Still, the buffering level unstability was too high, so I added a moving average implementation to try to smooth it.

All these tweaks only make sense on Broadcom platforms, so they were guarded by ifdefs in a first version of the patch. Later, I migrated those dirty ifdefs to the new quirks abstraction added by Phil. A challenge of this migration was that I needed to store some attributes that were considered part of MediaPlayerPrivateGStreamer before. They still had to be somehow linked to the player private but only accessible by the platform specific code of the quirks. A special HashMap attribute stores those quirks attributes in an opaque way, so that only the specific quirk they belong to knows how to interpret them (using downcasting). I tried to use move semantics when storing the data, but was bitten by object slicing when trying to move instances of the superclass. In the end, moving the responsibility of creating the unique_ptr that stored the concrete subclass to the caller did the trick.

Even with all those changes, undesirable swings in the buffering level kept happening, and when doing a careful analysis of the causes I noticed that the monitoring of the buffering level was being done from different places (in different moments) and sometimes the level was regarded as “enough” and the moment right after, as “insufficient”. This was because the buffering level threshold was one single value. That’s something that a hysteresis mechanism (with low and high watermarks) can solve. So, a logical level change to “full” would only happen when the level goes above the high watermark, and a logical level change to “low” when it goes under the low watermark level.

For the threshold change detection to work, we need to know the previous buffering level. There’s a problem, though: the current code checked the levels from several scattered places, so only one of those places (the first one that detected the threshold crossing at a given moment) would properly react. The other places would miss the detection and operate improperly, because the “previous buffering level value” had been overwritten with the new one when the evaluation had been done before. To solve this, I centralized the detection in a single place “per cycle” (in updateBufferingStatus()), and then used the detection conclusions from updateStates().

So, with all this in mind, I refactored the buffering logic as https://commits.webkit.org/284072@main, so now WebKit GStreamer has a buffering code much more robust than before. The unstabilities observed in Broadcom devices were gone and I could, at last, close Issue 1309.

by eocanha at October 16, 2024 06:12 AM

October 15, 2024

Alex Bradbury

Accessing a cross-compiled build tree from qemu-system

I'm cross-compiling a large codebase (LLVM and sub-projects such as Clang) and want to access the full tree - the source and the build artifacts from under qemu. This post documents the results of my experiments in various ways to do this. Note that I'm explicitly choosing to run timings using something that approximates the work I want to do rather than any microbenchmark targeting just file access time.

Requirements:

  • The whole source tree and the build directories are accessible from the qemu-system VM.
  • Writes must be possible, but don't need to be reflected in the original directory (so importantly, although files are written as part of the test being run, I don't need to get them back out of the virtual machine).

Results

The test simply involves taking a cross-compiled LLVM tree (with RISC-V as the only enabled target) and running the equivalent of ninja check-llvm on it after exposing / transferring it to the VM using the listed method. Results in chart form for overall time taken in the "empty cache" case:

9pfs
13m05s 
virtiofsd
13m56s 
squashfs
11m33s 
ext4
11m42s 
  • 9pfs: 13m05s with empty cache, 12m45s with warm cache
  • virtiofsd: 13m56s with empty cache, 13m49s with warm cache
  • squashfs: 11m33s (~17s image build, 11m16s check-llvm) with empty cache, 11m16s (~3s image build, 11m13s check-llvm) with warm cache
  • new ext4 filesystem: 11m42s, (~38s image build, 11m4s check-llvm) with empty cache, 11m34s (~30s image build, 11m4s check-llvm) with warm cache.

By way of comparison, it takes ~10m40s to naively transfer the data by piping it over ssh (tarring on one end, untarring on the other).

For this use case, building and mounting a filesystem seems the most compelling option with the ext4 being overall simplest (no need to use overlayfs) assuming you have an e2fsprogs new enough to support tar input to mkfs.ext4. I'm not sure why I'm not seeing the purported performance improvements when using virtiofsd.

Notes on test setup

  • The effects of caching are accounted for by taking measurements in two scenarios below. Although neither is a precise match for the situation in a typical build system, it at least gives some insight into the impact of the page cache in a controlled manner.
    • "Empty cache". Caches are dropped (sync && echo 3 | sudo tee /proc/sys/vm/drop_caches) before building the filesystem (if necessary) and launching QEMU.
    • "Warm cache". Caches are dropped as above but immediately after that (i.e. before building any needed filesystem) we load all files that will be accessed so they will be cached. For simplicity, just do this by doing using a variant of the tar command described below.
  • Running on AMD Ryzen 9 7950X3D with 128GiB RAM.
  • QEMU 9.1.0, running qemu-system-riscv64, with 32 virtual cores matching the number 16 cores / 32 hyper-threads in the host.
  • LLVM HEAD as of 2024-10-13 (48deb3568eb2452ff).
  • Release+Asserts build of LLVM with all non-experimental targets enabled. Built in a two-stage cross targeting rv64gc_zba_zbb_zbs.
  • Running riscv64 Debian unstable within QEMU.
  • Results gathered via time ../bin/llvm-lit -sv . running from test/ within the build directory, and combining this with the timing for whatever command(s) are needed to transfer and extract the source+build tree.
  • Only the time taken to build the filesystem (if relevant) and to run the tests is measured. There may be very minor difference in the time taken to mount the different filesystems, but we assume this is too small to be worth measuring.
  • If I were to rerun this, I would invoke lit with --order=lexical to ensure runtimes aren't affected by the order of tests (e.g. long-running tests being scheduled last).
  • Total size of the files being exposed is ~8GiB or so.
  • Although not strictly necessary, I've tried to map to the uid:gid pair used in the guest where possible.

Details: 9pfs (QEMU -virtfs)

  • Add -virtfs local,path=$HOME/llvm-project/,mount_tag=llvm,security_model=none,id=llvm to the QEMU command line.
  • Within the guest sudo mount -t 9p -o trans=virtio,version=9p2000.L,msize=512000 llvm /mnt. I haven't swept different parameters for msize and mount reports 512000 is the maximum for the virtio transport.
  • Sadly 9pfs doesn't support ID-mapped mounts at the moment, which would allow us to easily remap the uid and gid used within the file tree on the host to one of an arbitrary user account in the guest without chowning everything (which would also require security_model=mapped-xattr or security_model=mapped-file to be passed to qemu. In the future you should be able to do sudo mount --bind --map-users 1000:1001:1 --map-groups 984:1001:1 /mnt mapped
  • We now need to set up an overlayfs that we can freely write to.
    • mkdir -p upper work llvm-project
    • sudo mount -t overlay overlay -o lowerdir=/mnt,upperdir=$HOME/upper,workdir=$HOME/work $HOME/llvm-project
  • You could create a user account with appropriate uid/gid to match those in the exposed directory, but for simplicity I've just run the tests as root within the VM.

Details: virtiofsd

  • Installed the virtiofsd package on Arch Linux (version 1.11.1). Refer to the documentation for examples of how to run.
  • Run /usr/lib/virtiofsd --socket-path=/tmp/vhostqemu --shared-dir=$HOME/llvm-project --cache=always
  • Add the following commands to your qemu launch command (where the memory size parameter to -object seems to need to match the -m argument you used):
  -chardev socket,id=char0,path=/tmp/vhostqemu \
  -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=llvm \
  -object memory-backend-memfd,id=mem,size=64G,share=on -numa node,memdev=mem
  • sudo mount -t virtiofs llvm /mnt
  • Unfortunately id-mapped mounts also aren't quite supported for virtiofs at the time of writing. The changes were recently merged into the upstream kernel and a merge request to virtiofsd is pending.
  • mkdir -p upper work llvm-project
  • sudo mount -t overlay overlay -o lowerdir=/mnt,upperdir=$HOME/upper,workdir=$HOME/work $HOME/llvm-project
  • As with the 9p tests, just ran the tests using sudo.

Details: squashfs

  • I couldn't get a regex that squashfs would accept to exclude all build/* directories except build/stage2cross, so use find to generate the individual directory exclusions.
  • As you can see in the commandline below, I disabled compression altogether.
  • mksquashfs ~/llvm-project llvm-project.squashfs -no-compression -force-uid 1001 -force-gid 1001 -e \ .git $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '-e build/%P ')
  • Add this to the qemu command line: -drive file=$HOME/llvm-project.squashfs,if=none,id=llvm,format=raw -device virtio-net-device,netdev=net
  • sudo mount -t squashfs /dev/vdb /mnt
  • Use the same instructions for setting up overlayfs as for 9pfs above. There's no need to run the tests using sudo as we've set the appropriate uid.

Details: new ext4 filesystem

  • Make use of the -d parameter of mkfs.ext4 which allows you to create a filesystem, using the given directory or tarball to initialise it. As mkfs.ext4 lacks features for filtering the input or forcing the uid/gid of files, we rely on tarball input (only just added in e2fsprogs 1.47.1).
  • Create a filesystem image that's big enough with fallocate -l 15GiB llvm-project.img
  • Execute:
tar --create \
  --file=- \
  --owner=1001 \
  --group=1001 \
  --exclude=.git \
  $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '--exclude=build/%P ') \
  -C $HOME/llvm-project . \
  | mkfs.ext4 -d - llvm-project.img
  • Attach the drive by adding this to the qemu command line -device virtio-blk-device,drive=hdb -drive file=$HOME/llvm-project.img,format=raw,if=none,id=hdb
  • Mount in qemu with sudo mount -t ext4 /dev/vdb $HOME/llvm-project

Details: tar piped over ssh

  • Essentially, use the tar command above and pipe it to tar running under qemu-system over ssh to extract it to root filesystem (XFS in my VM setup).
  • This could be optimised. e.g. by selecting a less expensive ssh encryption algorithm (or just netcat), a better networking backend to QEMU, and so on. But the intent is to show the cost of the naive "obvious" solution.
  • Execute:
tar --create \
  --file=- \
  --owner=1001 \
  --group=1001 \
  --exclude=.git \
  $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '--exclude=build/%P ') \
  -C $HOME/llvm-project . \
  | ssh -p10222 asb@localhost "mkdir -p llvm-project && tar xf - -C \
  llvm-project"

October 15, 2024 10:00 AM

October 09, 2024

Pablo Saavedra

Diagnosing Memory Leaks in WPEWebKit with Google Perftools

This blog post will focus on memory issue case, and I’ll walk you through my experiences in investigating and resolving it. The issue involved WPEWebKit and the Cairo graphics library. I'll also cover the tools I used for memory profiling and how I ultimately found problem. keywords: memory leak, WPEWebKit, Cairo, google-perftools. The Context "It's not a matter of if, but when". As a software developer, whether during the coding phase or ongoing maintenance, encountering memory-related issues is nearly inevitable. For those working in languages with manual memory management, especially seasoned developers of C and C++, bugs related to memory — such as double frees, invalid memory accesses, and leaks — are all too familiar. Common memory problems, including memory leaks, buffer overflows, and insufficient allocation, can quickly escalate into significant headaches. These issues often result in degraded performance, system crashes, or even security vulnerabilities. These kind of bugs often arise in user-space programs, where memory is managed without special safeguards—let alone in more complex scenarios involving CMA, mmap(), DMAbuf … Tackling these challenges requires a deep understanding of how your application handles memory, along with careful debugging and optimization to ensure efficient, safe memory usage. In my experience, you won’t always be able to rely on the same tool or approach to analyze memory-related issues. More often than not, the code where the defect is found will be complex and, quite possibly, not written by you. Additionally, in many cases, you won’t be able to create a simple test case to replicate the issue until after investing significant time and effort in understanding and pinpointing the exact root cause. One such scenario is the case I want to discuss today ...

by Pablo Saavedra at October 09, 2024 06:25 AM

October 03, 2024

Andy Wingo

preliminary notes on a nofl field-logging barrier

When you have a generational collector, you aim to trace only the part of the object graph that has been allocated recently. To do so, you need to keep a remembered set: a set of old-to-new edges, used as roots when performing a minor collection. A language run-time maintains this set by adding write barriers: little bits of collector code that run when a mutator writes to a field.

Whippet’s nofl space is a block-structured space that is appropriate for use as an old generation or as part of a sticky-mark-bit generational collector. It used to have a card-marking write barrier; see my article diving into V8’s new write barrier, for more background.

Unfortunately, when running whiffle benchmarks, I was seeing no improvement for generational configurations relative to whole-heap collection. Generational collection was doing fine in my tiny microbenchmarks that are part of Whippet itself, but when translated to larger programs (that aren’t yet proper macrobenchmarks), it was a lose.

I had planned on doing some serious tracing and instrumentation to figure out what was happening, and thereby correct the problem. I still plan on doing this, but instead for this issue I used the old noggin technique instead: just, you know, thinking about the thing, eventually concluding that unconditional card-marking barriers are inappropriate for sticky-mark-bit collectors. As I mentioned in the earlier article:

An unconditional card-marking barrier applies to stores to slots in all objects, not just those in oldspace; a store to a new object will mark a card, but that card may contain old objects which would then be re-scanned. Or consider a store to an old object in a more dense part of oldspace; scanning the card may incur more work than needed. It could also be that Whippet is being too aggressive at re-using blocks for new allocations, where it should be limiting itself to blocks that are very sparsely populated with old objects.

That’s three problems. The second is well-known. But the first and last are specific to sticky-mark-bit collectors, where pages mix old and new objects.

a precise field-logging write barrier

Back in 2019, Steve Blackburn’s paper Design and Analysis of Field-Logging Write Barriers took a look at the state of the art in precise barriers that record not regions of memory that have been updated, but the precise edges (fields) that were written to. He ends up re-using this work later in the 2022 LXR paper (see §3.4), where the write barrier is used for deferred reference counting and a snapshot-at-the-beginning (SATB) barrier for concurrent marking. All in all field-logging seems like an interesting strategy. Relative to card-marking, work during the pause is much less: you have a precise buffer of all fields that were written to, and you just iterate that, instead of iterating objects. Field-logging does impose some mutator cost, but perhaps the payoff is worth it.

To log each old-to-new edge precisely once, you need a bit per field indicating whether the field is logged already. Blackburn’s 2019 write barrier paper used bits in the object header, if the object was small enough, and otherwise bits before the object start. This requires some cooperation between the collector, the compiler, and the run-time that I wasn’t ready to pay for. The 2022 LXR paper was a bit vague on this topic, saying just that it used “a side table”.

In Whippet’s nofl space, we have a side table already, used for a number of purposes:

  1. Mark bits.

  2. Iterability / interior pointers: is there an object at a given address? If so, it will have a recognizable bit pattern.

  3. End of object, to be able to sweep without inspecting the object itself

  4. Pinning, allowing a mutator to prevent an object from being evacuated, for example because a hash code was computed from its address

  5. A hack to allow fully-conservative tracing to identify ephemerons at trace-time; this re-uses the pinning bit, since in practice such configurations never evacuate

  6. Bump-pointer allocation into holes: the mark byte table serves the purpose of Immix’s line mark byte table, but at finer granularity. Because of this though, it is swept lazily rather than eagerly.

  7. Generations. Young objects have a bit set that is cleared when they are promoted.

Well. Why not add another thing? The nofl space’s granule size is two words, so we can use two bits of the byte for field logging bits. If there is a write to a field, a barrier would first check that the object being written to is old, and then check the log bit for the field being written. The old check will be to a byte that is nearby or possibly the same as the one to check the field logging bit. If the bit is unsert, we call out to a slow path to actually record the field.

preliminary results

I disassembled the fast path as compiled by GCC and got something like this on x86-64, in AT&T syntax, for the young-generation test:

mov    %rax,%rdx
and    $0xffffffffffc00000,%rdx
shr    $0x4,%rax
and    $0x3ffff,%eax
or     %rdx,%rax
testb  $0xe,(%rax)

The first five instructions compute the location of the mark byte, from the address of the object (which is known to be in the nofl space). If it has any of the bits in 0xe set, then it’s in the old generation.

Then to test a field logging bit it’s a similar set of instructions. In one of my tests the data type looks like this:

struct Node {
  uintptr_t tag;
  struct Node *left;
  struct Node *right;
  int i, j;
};

Writing the left field will be in the same granule as the object itself, so we can just test the byte we fetched for the logging bit directly with testb against $0x80. For right, we should be able to know it’s in the same slab (aligned 4 MB region) and just add to the previously computed byte address, but the C compiler doesn’t know that right now and so recomputes. This would work better in a JIT. Anyway I think these bit-swizzling operations are just lost in the flow of memory accesses.

For the general case where you don’t statically know the offset of the field in the object, you have to compute which bit in the byte to test:

mov    %r13,%rcx
mov    $0x40,%eax
shr    $0x3,%rcx
and    $0x1,%ecx
shl    %cl,%eax
test   %al,%dil

Is it good? Well, it improves things for my whiffle benchmarks, relative to the card-marking barrier, seeing a 1.05×-1.5× speedup across a range of benchmarks. I suspect the main advantage is in avoiding the “unconditional” part of card marking, where a write to a new object could cause old objects to be added to the remembered set. There are still quite a few whiffle configurations in which the whole-heap collector outperforms the sticky-mark-bit generational collector, though; I hope to understand this a bit more by building a more classic semi-space nursery, and comparing performance to that.

Implementation links: the barrier fast-path, the slow path, and the sequential store buffers. (At some point I need to make it so that allocating edge buffers in the field set causes the nofl space to page out a corresponding amount of memory, so as to be honest when comparing GC performance at a fixed heap size.)

Until next time, onwards and upwards!

by Andy Wingo at October 03, 2024 08:54 AM

October 01, 2024

Alex Bradbury

Arch Linux on remote server setup runbook

As I develop, I tend to offload a lot of the computationally heavy build or test tasks to a remote build machine. It's convenient for this to match the distribution I use on my local machine, and I so far haven't felt it would be advantageous to add another more server-oriented distro to the mix to act as a host to an Arch Linux container. This post acts as a quick reference / runbook for me on the occasions I want to spin up a new machine with this setup. To be very explicit, this is shared as something that might happen to be helpful if you have similar requirements, or to give some ideas if you have different ones. I definitely don't advocate that you blindly copy it.

Although this is something that could be fully automated, I find structuring this kind of thing as a series of commands to copy and paste and check/adjust manually hits the sweetspot for my usage. As always, the Arch Linux wiki and its install guide is a fantastic reference that you should go and check if you're unsure about anything listed here.

Details about my setup

  • I'm typically targeting Hetzner servers with two SSDs. I'm not worried about data loss or minor downtime waiting for a disk to be replaced before re-imaging, so RAID0 is just fine. The commands below assume this setup.
  • I do want encrypted root just in case anything sensitive is ever copied to the machine. As it is a remote machine, remote unlock over SSH must be supported (i.e. initrd contains an sshd that will accept the passphrase for unlocking the root partition before continuing the boot process).
    • The described setup means that if the disk is thrown away or given to another customer, you shouldn't have anything plain text in the root partition lying around it. As the boot partition is unencrypted, there are a whole bunch of threat models where someone is able to modify that boot partition which wouldn't be addressed by this approach.
  • The setup also assumes you start from some kind of rescue environment (in my case Hetzner's) that means you can freely repartition and overwrite the disk drives.
  • If you're not on Hetzner, you'll have to use different mirrors as they're not available externally.
  • Hosts typically provide their own base images and installer scripts. The approach detailed here is good if you'd rather control the process yourself, and although there are a few hoster-specific bits in here it's easy enough to replace them.
  • For simplicity I use UEFI boot and EFI stub. If you have problems with efibootmgr, you might want to use systemd-boot instead.
  • Disclaimer: As is hopefully obvious, the below instructions will overwrite anything on the hard drive of the system you're running them on.

Setting up and chrooting into Arch bootstrap environment from rescue system

See the docs on the Hetzner rescue system for details on how to enter it. If you've previously used this server, you likely want to remove old known_hosts entries with ssh-keygen -R your.server.

Now ssh in and do the following:

wget http://mirror.hetzner.de/archlinux/iso/latest/archlinux-bootstrap-x86_64.tar.zst
tar -xvf archlinux-bootstrap-x86_64.tar.zst --numeric-owner
sed -i '1s;^;Server=https://mirror.hetzner.de/archlinux/$repo/os/$arch\n\n;' root.x86_64/etc/pacman.d/mirrorlist
mount --bind root.x86_64/ root.x86_64/ # See <https://bugs.archlinux.org/task/46169>
printf "About to enter bootstrap chroot\n===============================\n"
./root.x86_64/bin/arch-chroot root.x86_64/

You are now chrooted into the Arch boostrap environment. We will do as much work from there as possible so as to minimise dependency on tools in the Hetzner rescue system.

Set up drives and create and chroot into actual rootfs

The goal is now to set up the drives, perform an Arch Linux bootstrap, and chroot into it. The UEFI partition is placed at 1MiB pretty much guaranteed to be properly aligned for any reasonable physical sector size (remembering space must be left at the beginning of the disk for the partition table itself).

We'll start by collecting info that will be used throughout the process:

printf "Enter the desired new hostname:\n"
read NEW_HOST_NAME; export NEW_HOST_NAME
printf "Enter the desired passphrase for unlocking the root partition:\n"
read ROOT_PART_PASSPHRASE
printf "Enter the public ssh key (e.g. cat ~/.ssh/id_ed22519.pub) that will be used for access:\n"
read PUBLIC_SSH_KEY; export PUBLIC_SSH_KEY
printf "Enter the name of the main user account to be created:\n"
read NEW_USER; export NEW_USER
printf "You've entered the following information:\n"
printf "NEW_HOST_NAME: %s\n" "$NEW_HOST_NAME"
printf "ROOT_PART_PASSPHRASE: %s\n" "$ROOT_PART_PASSPHRASE"
printf "PUBLIC_SSH_KEY: %s\n" "$PUBLIC_SSH_KEY"
printf "NEW_USER: %s\n" "$NEW_USER"

And now proceed to set up the disks, create filesystems, perform an initial bootstrap and chroot into the new rootfs:

pacman-key --init
pacman-key --populate archlinux
pacman -Sy --noconfirm xfsprogs dosfstools mdadm cryptsetup

for M in /dev/md?; do mdadm --stop $M; done
for DRIVE in /dev/nvme0n1 /dev/nvme1n1; do
  sfdisk $DRIVE <<EOF
label: gpt

start=1MiB, size=255MiB, type=uefi
start=256MiB, type=raid
EOF
done

# Warning from mdadm about falling back to creating md0 via node seems fine to ignore
yes | mdadm --create --verbose --level=0 --raid-devices=2 --homehost=any /dev/md/0 /dev/nvme0n1p2 /dev/nvme1n1p2

mkfs.fat -F32 /dev/nvme0n1p1
printf "%s\n" "$ROOT_PART_PASSPHRASE" | cryptsetup -y -v luksFormat --batch-mode /dev/md0
printf "%s\n" "$ROOT_PART_PASSPHRASE" | cryptsetup open /dev/md0 root
unset ROOT_PART_PASSPHRASE

mkfs.xfs /dev/mapper/root # rely on this calaculating appropriate sunit and swidth

mount /dev/mapper/root /mnt
mkdir /mnt/boot
mount /dev/nvme0n1p1 /mnt/boot
pacstrap /mnt base linux linux-firmware efibootmgr \
  xfsprogs dosfstools mdadm cryptsetup \
  mkinitcpio-netconf mkinitcpio-tinyssh mkinitcpio-utils python3 \
  openssh sudo net-tools git man-db man-pages vim
genfstab -U /mnt >> /mnt/etc/fstab
# Note that sunit and swidth in fstab confusingly use different units vs those shown by xfs <https://superuser.com/questions/701124/xfs-mount-overrides-sunit-and-swidth-options>

mdadm --detail --scan >> /mnt/etc/mdadm.conf
printf "About to enter newrootfs chroot\n===============================\n"
arch-chroot /mnt

Do final configuration from within the new rootfs chroot

Unfortunately I wasn't able to find a way to get ipv6 configured via DHCP on Hetzner's network, so we rely on hardcoding it (which seems to be their recommendation.

We set up sudo and also configure ssh to disable password-based login. A small swapfile is configured in order to allow the kernel to move allocated but not actively used pages there if deemed worthwhile.

sed /etc/locale.gen -i -e "s/^\#en_GB.UTF-8 UTF-8.*/en_GB.UTF-8 UTF-8/"
locale-gen
# Ignore "System has not been booted with systemd" and "Failed to connect to bus" error for next command.
systemd-firstboot --locale=en_GB.UTF-8 --timezone=UTC --hostname="$NEW_HOST_NAME"
ln -s /dev/null /etc/udev/rules.d/80-net-setup-link.rules # disable persistent network names

ssh-keygen -A # generate openssh host keys for the tinyssh hook to pick up
printf "%s\n" "$PUBLIC_SSH_KEY" > /etc/tinyssh/root_key
sed /etc/mkinitcpio.conf -i -e 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block mdadm_udev netconf tinyssh encryptssh filesystems fsck)/'
# The call to tinyssh-convert in mkinitcpio-linux is incorrect, so do it
# manually <https://github.com/grazzolini/mkinitcpio-tinyssh/issues/10>
tinyssh-convert /etc/tinyssh/sshkeydir < /etc/ssh/ssh_host_ed25519_key
mkinitcpio -p linux

printf "efibootmgr before changes:\n==========================\n"
efibootmgr -u
# Clean up any other efibootmgr entries from previous installs
for BOOTNUM in $(efibootmgr | grep '^Boot0' | grep -v PXE | sed 's/^Boot\([0-9]*\).*/\1/'); do
  efibootmgr -b $BOOTNUM -B
done
# Set up efistub
efibootmgr \
  --disk /dev/nvme0n1 \
  --part 1 \
  --create \
  --label 'Arch Linux' \
  --loader /vmlinuz-linux \
  --unicode "root=/dev/mapper/root rw cryptdevice=UUID=$(blkid -s UUID -o value /dev/md0):root:allow-discards initrd=\initramfs-linux.img ip=:::::eth0:dhcp" \
  --verbose
efibootmgr -o 0001,0000 # prefer PXE boot. May need to check IDs
printf "efibootmgr after changes:\n=========================\n"
efibootmgr -u

mkswap --size=8G --file /swapfile
cat - <<EOF > /etc/systemd/system/swapfile.swap
[Unit]
Description=Swap file

[Swap]
What=/swapfile

[Install]
WantedBy=multi-user.target
EOF
systemctl enable swapfile.swap

cat - <<EOF > /etc/systemd/network/10-eth0.network
[Match]
Name=eth0

[Network]
DHCP=yes
Address=$(ip -6 addr show dev eth0 scope global | grep "scope global" | cut -d' ' -f6)
Gateway=$(ip route show | head -n 1 | cut -d' ' -f 3)
Gateway=fe80::1
EOF
systemctl enable systemd-networkd.service systemd-resolved.service systemd-timesyncd.service
printf "PasswordAuthentication no\n" > /etc/ssh/sshd_config.d/20-no-password-auth.conf
systemctl enable sshd.service
useradd -m -g users -G wheel -s /bin/bash "$NEW_USER"
usermod --pass='!' root # disable root login
chmod +w /etc/sudoers
printf "%%wheel ALL=(ALL) ALL\n" >> /etc/sudoers
chmod -w /etc/sudoers
mkdir "/home/$NEW_USER/.ssh"
printf "%s\n" "$PUBLIC_SSH_KEY" > "/home/$NEW_USER/.ssh/authorized_keys"
chmod 700 "/home/$NEW_USER/.ssh"
chmod 600 "/home/$NEW_USER/.ssh/authorized_keys"
chown -R "$NEW_USER:users" "/home/$NEW_USER/.ssh"

Set user password and reboot

# The next command will ask for the user password (needed for sudo).
passwd "$NEW_USER"

Then ctrl+d twice and reboot.

Unlocking rootfs and logging in

Firstly, you likely want to remove any saved host public keys as there will be a new one for the newly provisioned machine ssh-keygen -R your.server. The server will have a different host key for the root key unlock so I'd recommend using a different known_hosts file for unlock. e.g. ssh -o UserKnownHostsFile=~/.ssh/known_hosts_unlock root@your.server. If you put something like the following in your ~/.ssh/config you could just ssh unlock-foo-host:

Host unlock-foo-host
HostName your.server
User root
UserKnownHostsFile ~/.ssh/known_hosts_unlock

Once you've successfully entered the key to unlock the rootfs, you can just ssh as normal to the server.


Article changelog
  • 2024-10-01: Initial publication date.

October 01, 2024 12:00 PM

September 28, 2024

Qiuyi Zhang (Joyee)

Reproducible Node.js built-in snapshots, part 3 - fixing V8 startup snapshot

In the previous posts we looked into how the Node.js executable was made a bit more reproducible after the Node.js snapshot data and the V8 code cache were made reproducible, and did a bit of anatomy on the unreproducible V8 snapshot blobs

September 28, 2024 07:00 PM

Reproducible Node.js built-in snapshots, part 2 - V8 code cache and snapshot blobs

In the previous post, we covered how the Node.js built-in snapshot is generated and embedded into the executable, and how I fixed the Node

September 28, 2024 07:00 PM

Reproducible Node.js built-in snapshots, part 1 - Overview and Node.js fixes

In a recent effort to make the Node.js builds reproducible again (at least on Linux), a big remaining piece was the built-in snapshot. I wrote some notes about the journey of making the Node

September 28, 2024 07:00 PM

September 27, 2024

Carlos García Campos

Graphics improvements in WebKitGTK and WPEWebKit 2.46

WebKitGTK and WPEWebKit recently released a new stable version 2.46. This version includes important changes in the graphics implementation.

Skia

The most important change in 2.46 is the introduction of Skia to replace Cairo as the 2D graphics renderer. Skia supports rendering using the GPU, which is now the default, but we also use it for CPU rendering using the same threaded rendering model we had with Cairo. The architecture hasn’t changed much for GPU rendering: we use the same tiled rendering approach, but buffers for dirty regions are rendered in the main thread as textures. The compositor waits for textures to be ready using fences and copies them directly to the compositor texture. This was the simplest approach that already resulted in much better performance, specially in the desktop with more powerful GPUs. In embedded systems, where GPUs are not so powerful, it’s still better to use the CPU with several rendering threads in most of the cases. It’s still too early to announce anything, but we are already experimenting with different models to improve the performance even more and make a better usage of the GPU in embedded devices.

Skia has received several GCC specific optimizations lately, but it’s always more optimized when built with clang. The optimizations are more noticeable in performance when using the CPU for rendering. For this reason, since version 2.46 we recommend to build WebKit with clang for the best performance. GCC is still supported, of course, and performance when built with GCC is quite good too.

HiDPI

Even though there aren’t specific changes about HiDPI in 2.46, users of high resolution screens using a device scale factor bigger than 1 will notice much better performance thanks to scaling being a lot faster on the GPU.

Accelerated canvas

The 2D canvas can be accelerated independently on whether the CPU or the GPU is used for painting layers. In 2.46 there’s a new setting WebKitSettings:enable-2d-canvas-acceleration to control the 2D canvas acceleration. In some embedded devices the combination of CPU rendering for layer tiles and GPU for the canvas gives the best performance. The 2D canvas is normally rendered into an image buffer that is then painted in the layer as an image. We changed that for the accelerated case, so that the canvas is now rendered into a texture that is copied to a compositor texture to be directly composited instead of painted into the layer as an image. In 2.46 the offscreen canvas is enabled by default.

There are more cases where accelerating the canvas is not desired, for example when the canvas size is not big enough it’s faster to use the GPU. Also when there’s going to be many operations to “download” pixels from GPU. Since this is not always easy to predict, in 2.46 we added support for the willReadFrequently canvas setting, so that when set by the application when creating the canvas it causes the canvas to be always unaccelerated.

Filters

All the CSS filters are now implemented using Skia APIs, and accelerated when possible. The most noticeable change here is that sites using blur filters are no longer slow.

Color spaces

Skia brings native support for color spaces, which allows us to greatly simplify the color space handling code in WebKit. WebKit uses color spaces in many scenarios – but especially in case of SVG and filters. In case of some filters, color spaces are necessary as some operations are simpler to perform in linear sRGB. The good example of that is feDiffuseLighting filter – it yielded wrong visual results for a very long time in case of Cairo-based implementation as Cairo doesn’t have a support for color spaces. At some point, however, Cairo-based WebKit implementation has been fixed by converting pixels to linear in-place before applying the filter and converting pixels in-place back to sRGB afterwards. Such a workarounds are not necessary anymore as with Skia, all the pixel-level operations are handled in a color-space-transparent way as long as proper color space information is provided. This not only impacts the results of some filters that are now correct, but improves performance and opens new possibilities for acceleration.

Font rendering

Font rendering is probably the most noticeable visual change after the Skia switch with mixed feedback. Some people reported that several sites look much better, while others reported problems with kerning in other sites. In other cases it’s not really better or worse, it’s just that we were used to the way fonts were rendered before.

Damage tracking

WebKit already tracks the area of the layers that has changed to paint only the dirty regions. This means that we only repaint the areas that changed but the compositor incorporates them and the whole frame is always composited and passed to the system compositor. In 2.46 there’s experimental code to track the damage regions and pass them to the system compositor in addition to the frame. Since this is experimental it’s disabled by default, but can be enabled with the runtime feature PropagateDamagingInformation. There’s also UnifyDamagedRegions feature that can be used in combination with PropagateDamagingInformation to unify the damage regions into one before passing it to the system compositor. We still need to analyze the impact of damage tracking in performance before enabling it by default. We have also started an experiment to use the damage information in WebKit compositor and avoid compositing the entire frame every time.

GPU info

Working on graphics can be really hard in Linux, there are too many variables that can result in different outputs for different users: the driver version, the kernel version, the system compositor, the EGL extensions available, etc. When something doesn’t work for some people and work for others, it’s key for us to gather as much information as possible about the graphics stack. In 2.46 we have added more useful information to webkit://gpu, like the DMA-BUF buffer format and modifier used (for GTK port and WPE when using the new API). Very often the symptom is the same, nothing is rendered in the web view, even when the causes could be very different. For those cases, it’s even more difficult to gather the info because webkit://gpu doesn’t render anything either. In 2.46 it’s possible to load webkit://gpu/stdout to get the information as a JSON directly in stdout.

Sysprof

Another common symptom for people having problems is that a particular website is slow to render, while for others it works fine. In these cases, in addition to the graphics stack information, we need to figure out where we are slower and why. This is very difficult to fix when you can’t reproduce the problem. We added initial support for profiling in 2.46 using sysprof. The code already has some marks so that when run under sysprof we get useful information about timings of several parts of the graphics pipeline.

Next

This is just the beginning, we are already working on changes that will allow us to make a better use of both the GPU and CPU for the best performance. We have also plans to do other changes in the graphics architecture to improve synchronization, latency and security. Now that we have adopted sysprof for profiling, we are also working on improvements and new tools.

by carlos garcia campos at September 27, 2024 10:30 AM

Ricardo García

Waiter, there's an IES in my DGC!

Finally! Yesterday Khronos published Vulkan 1.3.296 including VK_EXT_device_generated_commands. Thousands of engineering hours seeing the light of day, and awesome news for Linux gaming.

Device-Generated Commands, or DGC for short, are Vulkan’s equivalent to ExecuteIndirect in Direct3D 12. Thanks to this extension, originally based on a couple of NVIDIA vendor extensions, it will be possible to prepare sequences of commands to run directly from the GPU, and executing those sequences directly without any data going through the CPU. Also, Proton now has a much-more official leg to stand on when it has to translate ExecuteIndirect from D3D12 to Vulkan while you run games such as Starfield.

The extension not only provides functionality equivalent to ExecuteIndirect. It goes beyond that and offers more fine-grained control like explicit preprocessing of command sequences, or switching shaders and pipelines with each sequence thanks to something called Indirect Execution Sets, or IES for short, that potentially work with ray tracing, compute and graphics (both regular and mesh shading).

As part of my job at Igalia, I’ve implemented CTS tests for this extension and I had the chance to work very closely with an awesome group of developers discussing specification, APIs and test needs. I hope I don’t forget anybody and apologize in advance if so.

  • Mike Blumenkrantz, of course. Valve contractor, Super Good Coder and current OpenGL Working Group chair who took the initial specification work from Patrick Doane and carried it across the finish line. Be sure to read his blog post about DGC. Also incredibly important for me: he developed, and kept up-to-date, an implementation of the extension for lavapipe, the software Vulkan driver from Mesa. This was invaluable in allowing me to create tests for the extension much faster and making sure tests were in good shape when GPU driver authors started running them.

  • Spencer Fricke from LunarG. Spencer did something fantastic here. For the first time, the needed changes in the Vulkan Validation Layers for such a large extension were developed in parallel while tests and the spec were evolving. His work will be incredibly useful for app developers using the extension in their games. It also allowed me to detect test bugs and issues much earlier and fix them faster.

  • Samuel Pitoiset (Valve contractor), Connor Abbott (Valve contractor), Lionel Landwerlin (Intel) and Vikram Kushwaha (NVIDIA) providing early implementations of the extension, discussing APIs, reporting test bugs and needs, and making sure the extension works as good as possible for a variety of hardware vendors out there.

  • To a lesser degree, most others mentioned as spec contributors for the extension, such as Hans-Kristian Arntzen (Valve contractor), Baldur Karlsson (Valve contractor), Faith Ekstrand (Collabora), etc, making sure the spec works for them too and makes sense for Proton, RenderDoc, and drivers such as NVK and others.

If you’ve noticed, a significant part of the people driving this effort work for Valve and, from my side, the work has also been carried as part of Igalia’s collaboration with them. So my explicit thanks to Valve for sponsoring all this work.

If you want to know a bit more about DGC, stay tuned for future talks about this topic. In about a couple of weeks, I’ll present a lightning talk (5 mins) with an overview at XDC 2024 in Montreal. Don’t miss it!

September 27, 2024 09:42 AM

Brian Kardell

Igalia: 2024 Mid-season Power Rankings

Igalia: 2024 Mid-season Power Rankings

Let’s take the good old annual look at how Igalia is stacking up so far in terms of Open Source contributions - you know, "The One With the Charts"

Igalia is involved in so many things it's honestly hard for me to keep track. So, every now and then, I like to step back and take stock of it, and what it means to the open source ecosystem. Let's have a look...

The Big Browser Projects

Some interesting shifts this year in browser projects and where we've made the biggest impacts.

chromium

In in 2018 Microsoft announced that they were switching to a Chromium based engine. However, in 2019 it was Igalia (not Microsoft) who wound up with the most commits to Chromium (after Google). We thought for sure that they would overtake us in 2020, but we also had the most commits in 2020. And in 2021, 2022 and 2023. 5 years later, after increases on their end and some shifts on ours, Microsoft has finally eclipsed us as the one with the most commits after Google. Congratulations Microsoft!... Enjoy it! For now :)

data

Top 10 contributors

ContributorContributions
Microsoft37.26%
Igalia17.94%
Intel12.41%
januschka.com6.72%
Opera6.09%
Samsung1.69%
Bytedance1.48%
LGE0.90%
Anton Bershanskyi0.79%
Md Hasibul Hasan0.74%

It's nice to see Intel and Opera stepping it up this year too!

It seems impossible to not comment on the fact that > 6.7% of commits this year are from a single individual, Helmut Januschka as mostly unpaid work. On the one hand, that's heroic but also, it feels a little exploitative, right? Multi-million, billion or even trillion dollar corporations with products built on chromium could surely fund a little more. If you work for one of those and would like to help change that...

Another distinction that is perhaps worth making is that this is measuring commits to the whole chromium project which is kind of a super project. The majority of the code (and commits) are in support of the browser (chromium is a also base browser implementation without some of the "Google" stuff that is in Chrome). These include graphics/rendering related standalone projects like SKIA, or wildvine for DRM, or things that deal with the different OS windowing systems (X/Wayland on Linux, for example), abstract plumbing for chromium based downstream. On the engine itself, it's mainly just Google, Microsoft, Igalia, Intel and Opera.

WebKit

Still the number one contributors after Apple to WebKit 💪🏻💪🏼💪🏽💪🏾💪🏿.

data

Top 10 contributors

ContributorContributions
Igalia65.65%
Sony16.40%
Ubie3.19%
Red Hat2.63%
softathome.com1.16%
Rose0.99%
Alexey Knyazev0.73%
GitHub0.73%
Ian Grunert0.69%
cox.net0.65%
Gecko

Hey look at that - Igalia moves into second place after Mozilla! I'm not sure how anyone could not observe that André Bargull (@anba, a contributor to Mozilla for 16 years!) is just amazingly prolific and important to the project. Mozilla has a long tail of almost 600 individual/unaffliated contributors!

data

Top 10 contributors

ContributorContributions
André Bargull23.75%
Igalia10.71%
Red Hat8.74%
Birchill6.19%
protonmail.com5.66%
longsonr3.38%
Jonatan Klemets2.43%
yahoo.com1.82%
Debadree Chatterjee1.63%
Mugurell1.52%

WHATWG Stuff

HTML

Would you expect Igalia to be #2 in HTML? We are, this year! It's astounding the size of the role Google plays in a lot of this stuff, really. Then again, they have the only economic model for it so it makes a lot of sense that they invest heavily - this is peanuts comparatively.

data

Top 10 contributors

ContributorContributions
Google40.12%
Igalia14.37%
Apple12.57%
Mozilla3.59%
ibiblio.org2.99%
w3.org2.99%
Alex Rudenko2.40%
serenityos.org1.80%
meiert.com1.80%
Tawanda Moyo1.80%
DOM

What about DOM? Hey-o it's us again! Good job Igalia!

data

Top 10 contributors

ContributorContributions
Google40.12%
Igalia14.37%
Apple12.57%
Mozilla3.59%
ibiblio.org2.99%
w3.org2.99%
Alex Rudenko2.40%
serenityos.org1.80%
meiert.com1.80%
Tawanda Moyo1.80%
Web Platform Tests

It's almost like we're the 4th browser engine - #4 in contributions to Web Platform Tests, after Google, Mozilla, Apple. Go us!

data

Top 10 contributors

ContributorContributions
Google45.51%
Mozilla18.49%
Apple7.62%
Igalia6.15%
Intel4.09%
Microsoft3.69%
Opera0.99%
Maksim Sadym0.96%
Jonathan Lee0.88%
longsonr0.78%

We are champions

There are also a few related projects where Igalia is now the champion: Servo, the independent, parallel, memory safe, Rust based engine - and Wolvic, the XR browser formerly known as Firefox Reality, and Babel where we have been playing a huge role. In each of these projects Igalia is doing sponsored work, and our own investment because we believe they are important to the ecosystem right now.

Servo

#1 in servo 🔥🔥!

It's still recent history that Igalia got involved in Servo, but it's been exciting to watch it develop! If you or your organization would be interested in funding some work, reach out to me. You can also donate via GitHub sponsors or open collective.

data

Top 10 contributors

ContributorContributions
Igalia43.94%
sagudev7.96%
Huawei4.16%
inventati.org3.71%
Taym Haddadi3.54%
Oluwatobi Sofela3.45%
Zesty2.83%
Rosemary Ajayi2.12%
Mozilla1.77%
Crab Nebula1.68%
Babel

Igalia is also #1 in Babel commits 🔥🔥! Did you expect it? Nicolò Ribaudo has been busy!

data

Top 10 contributors

ContributorContributions
Igalia38.58%
liuxingbaoyu28.19%
Huáng Jùnliàng20.18%
renovate[bot]2.08%
Amjad Yahia Robeen Hassan1.48%
Sukka1.19%
blake.id0.59%
fisker Cheung0.59%
samual.uk0.59%
ynnsuis0.30%
Wolvic

Of course, as the stewards of Wolvic, we're the #1 in contributors there too! 91.6% of commits! There's also now the wolvic-chromium

data

Top 10 contributors

ContributorContributions
Igalia91.64%
weblate.org1.55%
reimu.nl1.24%
сэр Аноним1.24%
Felipe Erias0.93%
hotmail.es0.93%
opensuse.org0.62%
Ha Anh Vu0.31%
Houssem Nasri0.31%
HoussemNasri0.31%

Accessibility

Igalia has had an investment in accessibility for years. We've got a lot of expertise, and play an important role in some key projects:

Orca

Realisically this should be above as we're pretty much the maintainers of Orca, the screen reader for Linux (shout out to my colleague Joanie!).

data

Top 10 contributors

ContributorContributions
Igalia76.43%
ubuntu.com2.42%
gnome.org2.17%
Andy Holmes1.78%
bk.ru1.15%
pickup.hu1.15%
gtu.ge1.15%
Yaron Shahrabani1.15%
ukr.net0.89%
Sabri Ünal0.89%
aria

Igalia is #3 in commits to the aria repo - unsurprising as our colleague Valerie Young is co-chair of the working group!

data

Top 10 contributors

ContributorContributions
Paciello Group21.51%
w3.org17.74%
Igalia16.13%
github-actions[bot]9.68%
Daniel Montalvo7.53%
Peter Krautzberger5.38%
Adobe4.84%
Matt Garrish4.30%
usablenet.com1.61%
pkra1.61%
CORE-AAM

The AAMs are what actually map the standards to the accessibility layer, there are several of them. Very few people work on these, and guess who is pretty important there too? If you guessed Igalia, you'd be right. Igalia is the #1 contributor in CORE-AAM, Graphics-AAM and MathML-AAMs.

data

Top 10 contributors

ContributorContributions
github-actions[bot]28.57%
Igalia28.57%
w3.org28.57%
Paciello Group14.29%

That's not remotely all

These are just some that I am closer to and feel are interesting to highlight, but the list of open source repositories in which Igalia is playing an important to critical role is way longer than this. For example:

  1. Igalia #2 in commits to GStreamer, the powerful multimedia framework. We're the main maintainers of the GStreamer ports of WebKit used in Linux browsers on desktops as well as on a plethora of embedded devices.
  2. We've been very involved in work on GL, and contributing to Khronos Vulkan, OpenGL, and OpenGL ES Conformance Tests. In fact, we are the #1 contributor to the Vulkan GL Conformance Test Suite Repository!
  3. Igalia is the #2 contributor to the CI-tron project, which was started by Valve and is currently developed mostly by Valve contractors. CI-tron lets developers easily create CI systems based on bare metal computers, which is particularly important for Mesa and other open-source graphics projects.
  4. Igalia is the #1 contributor to libsoup the HTTP client/server library for GNOME.
  5. Igalia is the #3 contributor to Mesa the 3D Graphics Library

Anyway, I love to take the time to look at these every year and see how we're doing. It's easy to get lost in the bustle of everything that's going on and miss how much we really do - it's good to take stock. I hope that others find this interesting too, and maybe learn something about Igalia. Most of the work that we do is funded by companies. If your organization has some needs, you know where to find me :)

September 27, 2024 04:00 AM

September 26, 2024

Andy Wingo

needed-bits optimizations in guile

Hey all, I had a fun bug this week and want to share it with you.

numbers and representations

First, though, some background. Guile’s numeric operations are defined over the complex numbers, not over e.g. a finite field of integers. This is generally great when writing an algorithm, because you don’t have to think about how the computer will actually represent the numbers you are working on.

In practice, Guile will represent a small exact integer as a fixnum, which is a machine word with a low-bit tag. If an integer doesn’t fit in a word (minus space for the tag), it is represented as a heap-allocated bignum. But sometimes the compiler can realize that e.g. the operands to a specific bitwise-and operation are within (say) the 64-bit range of unsigned integers, and so therefore we can use unboxed operations instead of the more generic functions that do run-time dispatch on the operand types, and which might perform heap allocation.

Unboxing is important for speed. It’s also tricky: under what circumstances can we do it? In the example above, there is information that flows from defs to uses: the operands of logand are known to be exact integers in a certain range and the operation itself is closed over its domain, so we can unbox.

But there is another case in which we can unbox, in which information flows backwards, from uses to defs: if we see (logand n #xff), we know:

  • the result will be in [0, 255]

  • that n will be an exact integer (or an exception will be thrown)

  • we are only interested in a subset of n‘s bits.

Together, these observations let us transform the more general logand to an unboxed operation, having first truncated n to a u64. And actually, the information can flow from use to def: if we know that n will be an exact integer but don’t know its range, we can transform the potentially heap-allocating computation that produces n to instead truncate its result to the u64 range where it is defined, instead of just truncating at the use; and potentially this information could travel farther up the dominator tree, to inputs of the operation that defines n, their inputs, and so on.

needed-bits: the |0 of scheme

Let’s say we have a numerical operation that produces an exact integer, but we don’t know the range. We could truncate the result to a u64 and use unboxed operations, if and only if only u64 bits are used. So we need to compute, for each variable in a program, what bits are needed from it.

I think this is generally known a needed-bits analysis, though both Google and my textbooks are failing me at the moment; perhaps this is because dynamic languages and flow analysis don’t get so much attention these days. Anyway, the analysis can be local (within a basic block), global (all blocks in a function), or interprocedural (larger than a function). Guile’s is global. Each CPS/SSA variable in the function starts as needing 0 bits. We then compute the fixpoint of visiting each term in the function; if a term causes a variable to flow out of the function, for example via return or call, the variable is recorded as needing all bits, as is also the case if the variable is an operand to some primcall that doesn’t have a specific needed-bits analyser.

Currently, only logand has a needed-bits analyser, and this is because sometimes you want to do modular arithmetic, for example in a hash function. Consider Bon Jenkins’ lookup3 string hash function:

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
#define mix(a,b,c) \
{ \
  a -= c;  a ^= rot(c, 4);  c += b; \
  b -= a;  b ^= rot(a, 6);  a += c; \
  c -= b;  c ^= rot(b, 8);  b += a; \
  a -= c;  a ^= rot(c,16);  c += b; \
  b -= a;  b ^= rot(a,19);  a += c; \
  c -= b;  c ^= rot(b, 4);  b += a; \
}
...

If we transcribe this to Scheme, we get something like:

(define (jenkins-lookup3-hashword2 str)
  (define (u32 x) (logand x #xffffFFFF))
  (define (shl x n) (u32 (ash x n)))
  (define (shr x n) (ash x (- n)))
  (define (rot x n) (logior (shl x n) (shr x (- 32 n))))
  (define (add x y) (u32 (+ x y)))
  (define (sub x y) (u32 (- x y)))
  (define (xor x y) (logxor x y))

  (define (mix a b c)
    (let* ((a (sub a c)) (a (xor a (rot c 4)))  (c (add c b))
           (b (sub b a)) (b (xor b (rot a 6)))  (a (add a c))
           (c (sub c b)) (c (xor c (rot b 8)))  (b (add b a))
           ...)
      ...))

  ...

These u32 calls are like the JavaScript |0 idiom, to tell the compiler that we really just want the low 32 bits of the number, as an integer. Guile’s compiler will propagate that information down to uses of the defined values but also back up the dominator tree, resulting in unboxed arithmetic for all of these operations.

(When writing this, I got all the way here and then realized I had already written quite a bit about this, almost a decade ago ago. Oh well, consider this your lucky day, you get two scoops of prose!)

the bug

All that was just prelude. So I said that needed-bits is a fixed-point flow analysis problem. In this case, I want to compute, for each variable, what bits are needed for its definition. Because of loops, we need to keep iterating until we have found the fixed point. We use a worklist to represent the conts we need to visit.

Visiting a cont may cause the program to require more bits from the variables that cont uses. Consider:

(define-significant-bits-handler
    ((logand/immediate label types out res) param a)
  (let ((sigbits (sigbits-intersect
                   (inferred-sigbits types label a)
                   param
                   (sigbits-ref out res))))
    (intmap-add out a sigbits sigbits-union)))

This is the sigbits (needed-bits) handler for logand when one of its operands (param) is a constant and the other (a) is variable. It adds an entry for a to the analysis out, which is an intmap from variable to a bitmask of needed bits, or #f for all bits. If a already has some computed sigbits, we add to that set via sigbits-union. The interesting point comes in the sigbits-intersect call: the bits that we will need from a are first the bits that we infer a to have, by forward type-and-range analysis; intersected with the bits from the immediate param; intersected with the needed bits from the result value res.

If the intmap-add call is idempotent—i.e., out already contains sigbits for a—then out is returned as-is. So we can check for a fixed-point by comparing out with the resulting analysis, via eq?. If they are not equal, we need to add the cont that defines a to the worklist.

The bug? The bug was that we were not enqueuing the def of a, but rather the predecessors of label. This works when there are no cycles, provided we visit the worklist in post-order; and regardless, it works for many other analyses in Guile where we compute, for each labelled cont (basic block), some set of facts about all other labels or about all other variables. In that case, enqueuing a predecessor on the worklist will cause all nodes up and to including the variable’s definition to be visited, because each step adds more information (relative to the analysis computed on the previous visit). But it doesn’t work for this case, because we aren’t computing a per-label analysis.

The solution was to rewrite that particular fixed-point to enqueue labels that define a variable (possibly multiple defs, because of joins and loop back-edges), instead of just the predecessors of the use.

Et voilà ! If you got this far, bravo. Type at y’all again soon!

by Andy Wingo at September 26, 2024 12:44 PM

September 25, 2024

Melissa Wen

Reflections on 2024 Linux Display Next Hackfest

Hey everyone!

The 2024 Linux Display Next hackfest concluded in May, and its outcomes continue to shape the Linux Display stack. Igalia hosted this year’s event in A Coruña, Spain, bringing together leading experts in the field. Samuel Iglesias and I organized this year’s edition and this blog post summarizes the experience and its fruits.

One of the highlights of this year’s hackfest was the wide range of backgrounds represented by our 40 participants (both on-site and remotely). Developers and experts from various companies and open-source projects came together to advance the Linux Display ecosystem. You can find the list of participants here.

The event covered a broad spectrum of topics affecting the development of Linux projects, user experiences, and the future of display technologies on Linux. From cutting-edge topics to long-term discussions, you can check the event agenda here.

Organization Highlights

The hackfest was marked by in-depth discussions and knowledge sharing among Linux contributors, making everyone inspired, informed, and connected to the community. Building on feedback from the previous year, we refined the unconference format to enhance participant preparation and engagement.

Structured Agenda and Timeboxes: Each session had a defined scope, time limit (1h20 or 2h10), and began with an introductory talk on the topic.

  • Participant-Led Discussions: We pre-selected in-person participants to lead discussions, allowing them to prepare introductions, resources, and scope.
  • Transparent Scheduling: The schedule was shared in advance as GitHub issues, encouraging participants to review and prepare for sessions of interest.

Engaging Sessions: The hackfest featured a variety of topics, including presentations and discussions on how participants were addressing specific subjects within their companies.

  • No Breakout Rooms, No Overlaps: All participants chose to attend all sessions, eliminating the need for separate breakout rooms. We also adapted run-time schedule to keep everybody involved in the same topics.
  • Real-time Updates: We provided notifications and updates through dedicated emails and the event matrix room.

Strengthening Community Connections: The hackfest offered ample opportunities for networking among attendees.

  • Social Events: Igalia sponsored coffee breaks, lunches, and a dinner at a local restaurant.

  • Museum Visit: Participants enjoyed a sponsored visit to the Museum of Estrela Galicia Beer (MEGA).

Fruitful Discussions and Follow-up

The structured agenda and breaks allowed us to cover multiple topics during the hackfest. These discussions have led to new display feature development and improvements, as evidenced by patches, merge requests, and implementations in project repositories and mailing lists.

With the KMS color management API taking shape, we discussed refinements and best approaches to cover the variety of color pipeline from different hardware-vendors. We are also investigating techniques for a performant SDR<->HDR content reproduction and reducing latency and power consumption when using the color blocks of the hardware.

Color Management/HDR

Color Management and HDR continued to be the hottest topic of the hackfest. We had three sessions dedicated to discuss Color and HDR across Linux Display stack layers.

Color/HDR (Kernel-Level)

Harry Wentland (AMD) led this session.

Here, kernel Developers shared the Color Management pipeline of AMD, Intel and NVidia. We counted with diagrams and explanations from HW-vendors developers that discussed differences, constraints and paths to fit them into the KMS generic color management properties such as advertising modeset needs, IN\_FORMAT, segmented LUTs, interpolation types, etc. Developers from Qualcomm and ARM also added information regarding their hardware.

Upstream work related to this session:

Color/HDR (Compositor-Level)

Sebastian Wick (RedHat) led this session.

It started with Sebastian’s presentation covering Wayland color protocols and compositor implementation. Also, an explanation of APIs provided by Wayland and how they can be used to achieve better color management for applications and discussions around ICC profiles and color representation metadata. There was also an intensive Q&A about LittleCMS with Marti Maria.

Upstream work related to this session:

Color/HDR (Use Cases and Testing)

Christopher Cameron (Google) and Melissa Wen (Igalia) led this session.

In contrast to the other sessions, here we focused less on implementation and more on brainstorming and reflections of real-world SDR and HDR transformations (use and validation) and gainmaps. Christopher gave a nice presentation explaining HDR gainmap images and how we should think of HDR. This presentation and Q&A were important to put participants at the same page of how to transition between SDR and HDR and somehow “emulating” HDR.

We also discussed on the usage of a kernel background color property.

Finally, we discussed a bit about Chamelium and the future of VKMS (future work and maintainership).

Power Savings vs Color/Latency

Mario Limonciello (AMD) led this session.

Mario gave an introductory presentation about AMD ABM (adaptive backlight management) that is similar to Intel DPST. After some discussions, we agreed on exposing a kernel property for power saving policy. This work was already merged on kernel and the userspace support is under development.

Upstream work related to this session:

Strategy for video and gaming use-cases

Leo Li (AMD) led this session.

Miguel Casas (Google) started this session with a presentation of Overlays in Chrome/OS Video, explaining the main goal of power saving by switching off GPU for accelerated compositing and the challenges of different colorspace/HDR for video on Linux.

Then Leo Li presented different strategies for video and gaming and we discussed the userspace need of more detailed feedback mechanisms to understand failures when offloading. Also, creating a debugFS interface came up as a tool for debugging and analysis.

Real-time scheduling and async KMS API

Xaver Hugl (KDE/BlueSystems) led this session.

Compositor developers have exposed some issues with doing real-time scheduling and async page flips. One is that the Kernel limits the lifetime of realtime threads and if a modeset takes too long, the thread will be killed and thus the compositor as well. Also, simple page flips take longer than expected and drivers should optimize them.

Another issue is the lack of feedback to compositors about hardware programming time and commit deadlines (the lastest possible time to commit). This is difficult to predict from drivers, since it varies greatly with the type of properties. For example, color management updates take much longer.

In this regard, we discusssed implementing a hw_done callback to timestamp when the hardware programming of the last atomic commit is complete. Also an API to pre-program color pipeline in a kind of A/B scheme. It may not be supported by all drivers, but might be useful in different ways.

VRR/Frame Limit, Display Mux, Display Control, and more… and beer

We also had sessions to discuss a new KMS API to mitigate headaches on VRR and Frame Limit as different brightness level at different refresh rates, abrupt changes of refresh rates, low frame rate compensation (LFC) and precise timing in VRR more.

On Display Control we discussed features missing in the current KMS interface for HDR mode, atomic backlight settings, source-based tone mapping, etc. We also discussed the need of a place where compositor developers can post TODOs to be developed by KMS people.

The Content-adaptive Scaling and Sharpening session focused on sharpening and scaling filters. In the Display Mux session, we discussed proposals to expose the capability of dynamic mux switching display signal between discrete and integrated GPUs.

In the last session of the 2024 Display Next Hackfest, participants representing different compositors summarized current and future work and built a Linux Display “wish list”, which includes: improvements to VTTY and HDR switching, better dmabuf API for multi-GPU support, definition of tone mapping, blending and scaling sematics, and wayland protocols for advertising to clients which colorspaces are supported.

We closed this session with a status update on feature development by compositors, including but not limited to: plane offloading (from libcamera to output) / HDR video offloading (dma-heaps) / plane-based scrolling for web pages, color management / HDR / ICC profiles support, addressing issues such as flickering when color primaries don’t match, etc.

After three days of intensive discussions, all in-person participants went to a guided tour at the Museum of Extrela Galicia beer (MEGA), pouring and tasting the most famous local beer.

Feedback and Future Directions

Participants provided valuable feedback on the hackfest, including suggestions for future improvements.

  • Schedule and Break-time Setup: Having a pre-defined agenda and schedule provided a better balance between long discussions and mental refreshments, preventing the fatigue caused by endless discussions.
  • Action Points: Some participants recommended explicitly asking for action points at the end of each session and assigning people to follow-up tasks.
  • Remote Participation: Remote attendees appreciated the inclusive setup and opportunities to actively participate in discussions.
  • Technical Challenges: There were bandwidth and video streaming issues during some sessions due to the large number of participants.

Thank you for joining the 2024 Display Next Hackfest

We can’t help but thank the 40 participants, who engaged in-person or virtually on relevant discussions, for a collaborative evolution of the Linux display stack and for building an insightful agenda.

A big thank you to the leaders and presenters of the nine sessions: Christopher Cameron (Google), Harry Wentland (AMD), Leo Li (AMD), Mario Limoncello (AMD), Sebastian Wick (RedHat) and Xaver Hugl (KDE/BlueSystems) for the effort in preparing the sessions, explaining the topic and guiding discussions. My acknowledge to the others in-person participants that made such an effort to travel to A Coruña: Alex Goins (NVIDIA), David Turner (Raspberry Pi), Georges Stavracas (Igalia), Joan Torres (SUSE), Liviu Dudau (Arm), Louis Chauvet (Bootlin), Robert Mader (Collabora), Tian Mengge (GravityXR), Victor Jaquez (Igalia) and Victoria Brekenfeld (System76). It was and awesome opportunity to meet you and chat face-to-face.

Finally, thanks virtual participants who couldn’t make it in person but organized their days to actively participate in each discussion, adding different perspectives and valuable inputs even remotely: Abhinav Kumar (Qualcomm), Chaitanya Borah (Intel), Christopher Braga (Qualcomm), Dor Askayo, Jiri Koten (RedHat), Jonas Ådahl (Red Hat), Leandro Ribeiro (Collabora), Marti Maria (Little CMS), Marijn Suijten, Mario Kleiner, Martin Stransky (Red Hat), Michel Dänzer (Red Hat), Miguel Casas-Sanchez (Google), Mitulkumar Golani (Intel), Naveen Kumar (Intel), Niels De Graef (Red Hat), Pekka Paalanen (Collabora), Pichika Uday Kiran (AMD), Shashank Sharma (AMD), Sriharsha PV (AMD), Simon Ser, Uma Shankar (Intel) and Vikas Korjani (AMD).

We look forward to another successful Display Next hackfest, continuing to drive innovation and improvement in the Linux display ecosystem!

September 25, 2024 01:50 PM

Paulo Matos

FEX x87 Stack Optimization

In this blog post, I will describe my recent work on improving and optimizing the handling of x87 code in FEX. This work landed on July 22, 2024, through commit https://github.com/FEX-Emu/FEX/commit/a1378f94ce8e7843d5a3cd27bc72847973f8e7ec (tests and updates to size benchmarks landed separately).

FEX is an ARM64 emulator of Intel assembly and, as such, needs to be able to handle things as old as x87, which was the way we did floating point pre-SSE.

Consider a function to calculate the well-known Pythagorean Identity:

float PytIdent(float x) {
return sinf(x)*sinf(x) + cosf(x)*cosf(x);
}

Here is how this can be translated into assembly:

  sub rsp, 4
  movss [rsp], xmm0
  fld [rsp]
  fsincos
  fmul st0, st0
  fxch
  fmul st0, st0
  faddp
  fstp [rsp]
  movss xmm0, [rsp]

These instructions will take x from register xmm0 and return the result of the identity (the number 1, as we expect) to the same register.

A brief visit to the x87 FPU

The x87 FPU consists of 8 registers (R0 to R7), which are aliased to the MMX registers - mm0 to mm7. The FPU uses them as a stack and tracks the top of the stack through the TOS (Top of Stack) or TOP (my preferred way to refer to it), which is kept in the control word. To understand how it all fits together, I will describe the three most important components of this setup for the purpose of this post:

  • FPU Stack-based operation
  • Status register
  • Tag register

From the Intel spec, these components look like this (image labels are from the Intel in case you need to refer to it):

Patch Details

The 8 registers are labelled as ST0-ST7 relative to the position of TOP. In the case above TOP is 3; therefore, the 3rd register is ST0. Since the stack grows down, ST1 is the 4th register and so on until it wraps around. When the FPU is initialized TOP is set to 0. Once a value is loaded, the TOP is decreased, its value becomes 7 and the first value added to the stack is in the 7th register. Because the MMX registers are aliased to the bottom 64 bits of the stack registers, we can also refer to them as MM0 to MM7. This is not strictly correct because the MMX registers are 64 bits and the FPU registers are 80 bits but if we squint, they are basically the same.

The spec provides an example of how the FPU stack works. I copy the image here verbatim for your understanding. Further information can be found in the spec itself.

Patch Details

The TOP value is stored along with other flags in the FPU Status Word, which we won't discuss in detail here.

Patch Details

And the tag word marks which registers are valid or empty. The Intel spec also allows marking the registers as zero or special. At the moment, tags in FEX are binary, marking the register as valid or invalid. Instead of two bits per register, we only use one bit of information per register.

Patch Details

Current issues

The way things worked in FEX before this patch landed is that stack based operations would essentially always lower to perform three operations:

  • Load stack operation argument register values from special memory location;
  • Call soft-float library to perform the operation;
  • Store result in special memory location;

For example, let's consider the instruction FADDP st4, st0. This instruction performs st0 <- f80add(st4, st0) && pop(). In x87.cpp, the lowering of this instruction involved the following steps:

  • Load current value of top from memory;
  • With the current value of top, we can load the values of the registers in the instruction, in this case st4 and st0 since they are at a memory offset from top;
  • Then we call the soft-float library function to perform 80-bit adds;
  • Finally we store the result into st4 (which is a location in memory);
  • And pop the stack, which means marking the current top value invalid and incrementing top;

In code with a lot of x87 usage, these operations become cumbersome and data movement makes the whole code extremely slow. Therefore, we redesigned the x87 system to cope better with stack operations. FEX support a reduced precision mode that uses 64bit floating points rather than 80 bits, avoiding the soft-float library calls and this work applies equally to the code in reduced precision mode.

The new pass

The main observation to understand the operation of the new pass is that when compiling a block of instructions, where multiple instructions are x87 instructions, before code generation we have an idea of what the stack will look like, and if we have complete stack information we can generate much better code than the one generated on a per-instruction basis.

Instead of lowering each x87 instruction in x87.cpp to its components as discussed earlier, we simply generate stack operation IR instructions (all of these added new since there were no stack operations in the IR). Then an optimization pass goes through the IR on a per block basis and optimizes the operations and lowers them.

Here’s a step-by-step explanation of how this scheme works for the code above. When the above code gets to the OpcodeDispatcher, there’ll be one function a reasonable amount of opcodes, but generally one per functionality, meaning that although there are several opcodes for different versions of fadd, there is a single function in OpcodeDispatcher for x87 (in x87.cpp) implementing it.

Each of these functions will just transform the incoming instruction into an IR node that deals with the stack. These stack IR operations have Stack in its name. See IR.json for a list of these.

x86 Asm IR nodes
fld qword [rsp] LoadMem + PushStack
fsincos F80SinCosStack
fmul st0, st0 F80MulStack
fxch F80StackXchange
fmul st0, st0 F80MulStack
faddp F80AddStack + PopStackDestroy
fstp qword [rsp] StoreStackMemory + PopStackDestroy

The IR will then look like:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1
	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
	(%12 i0) F80SINCOSStack
	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa
	(%15 i0) F80MulStack #0x0, #0x0
	(%16 i0) F80StackXchange #0x1
	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9
	(%19 i0) F80MulStack #0x0, #0x0
	(%20 i0) F80AddStack #0x1, #0x0
	(%21 i0) PopStackDestroy
	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

The Good Case

Remember that before this pass, we would have never generated this IR as is. There were no stack operations, therefore each of these F80MulStack operations, would generate loads from the MMX registers that are mapped into memory for each of the operands, call to F80Mul softfloat library function and then a write to the memory mapped MMX register to write the result.

However, once we get to the optimization pass a see a straight line block line this, we can step through the IR nodes and create a virtual stack with pointers to the IR nodes. I have sketched the stacks for the first few loops in the pass.

The pass will only process x87 instructions, these are the one whose IR nodes are marked with "X87: true in IR.json. Therefore the first two instructions above:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1

are ignored. But PushStack is not. PushStack will literally push the node into the virtual stack.

	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
Patch Details

Internally before we push the node into our internal stack, we update the necessary values like the top of the stack. It starts at zero and it's decremented to seven, so the value %10 i128 is inserted in what we'll call R7, now ST0.

Next up is the computation of SIN and COS. This will result in SIN ST0 replacing ST0 and then pushing COS ST0, where the value in ST0 here is %10.

	(%12 i0) F80SINCOSStack
Patch Details

Then we have two instructions that we just don't deal with in the pass, these set the C2 flag to zero - as required by FSINCOS.

	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa

Followed by a multiplication of the stack element to square them. So we square ST0.

	(%15 i0) F80MulStack #0x0, #0x0
Patch Details

The next instruction swaps ST0 with ST1 so that we can square the sine value.

	(%16 i0) F80StackXchange #0x1
Patch Details

Again, similarly two instructions that are ignored by the pass, which set flag C1 to zero - as required by FXCH.

	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9

And again we square the value at the top of the stack.

	(%19 i0) F80MulStack #0x0, #0x0
Patch Details

We are almost there - to finish off the computation we need to add these two values together.

	(%20 i0) F80AddStack #0x1, #0x0

In this case we add ST1 to ST0 and store the result in ST1.

Patch Details

This is followed by a pop of the stack.

	(%21 i0) PopStackDestroy

As expected this removed the F80Mul from the top of the stack, leaving us with the result of the computation at the top.

Patch Details

OK - there are three more instructions.

	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

The first two store the top of the stack to memory, which is the result of our computation and which maths assures us it's 1. And then we pop the stack just to clean it up, although this is not strictly necessary since we are finished anyway.

Note that we have finished analysing the block, and played it through in our virtual stack. Now we can generate the instructions that are necessary to calculate the stack values, without always load/stor'ing them back to memory. In addition to generating these instructions we also generate instructions to update the tag word, update the top of stack value and save whatever remaining values in the stack at the end of the block into memory - into their respective memory mapped MMX registers. This is the good case where all the values necessary for the computation were found in the block. However, this is not necessarily always the case.

The Bad Case

Let's say that FEX for some reason breaks the beautiful block we had before into two:

Block 1:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1
	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
	(%12 i0) F80SINCOSStack
	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa
	(%15 i0) F80MulStack #0x0, #0x0
	(%16 i0) F80StackXchange #0x1
	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9

Block 2:

	(%19 i0) F80MulStack #0x0, #0x0
	(%20 i0) F80AddStack #0x1, #0x0
	(%21 i0) PopStackDestroy
	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

When we generate code for the first block, there are absolutely no blockers and we run through it exactly as we did before, except that after StoreContext, we reach the end of the block and our virtual stack is in this state:

Patch Details

At this point, we'll do what we said we always do before at the end of the block. We save the values in the virtual stack to the respective MMX registers in memory mapped locations: %36 is saved to MM7 and %34 is saved to MM6. The TOP is recorded to be 6 and the tags for MM7 and MM6 are marked to be valid. Then we exit the block. And start analyzing the following block.

When we start the pass for the following block, our virtual stack is empty. We have no knowledge of virtual stacks for previous JITTed blocks, and we see this instruction:

	(%19 i0) F80MulStack #0x0, #0x0

We cannot play through this instruction in our virtual stack because it multiplies the value in ST0 with itself and our virtual stack does not have anything in ST0. It's empty after all.

In this case, we switch to the slow path that does exactly what we used to do before this work. We load the value of ST0 from memory by finding the current value for TOP and loading the respective register value, issue the F80Mul on this value and storing it back to the MMX register. This is then done for the remainder of the block.

The Ugly Case

The ugly case is when we actually force the slow path, not because we don't have all the values in the virtual stack but out of necessity.

There are a few instructions that trigger a _StackForceSlow() and forces us down the slow path. These are: FLDENV, FRSTOR, FLDCW. All of these load some part or all of the FPU environment from memory and it triggers a switch to the slow path so that values are loaded properly and we can then use those new values in the block. It is not impossible to think we could at least in some cases, load those values from memory, try to rebuild our virtual stack for the current block and continue without switching to the slow path but that hasn't been yet done.

There are a few instructions other instructions that trigger a _SyncStackToSlow(), which doesn't force us down the slow path but instead just updates the in-memory data from the virtual stack state. These are: FNSTENV, FNSAVE, and FNSTSW. All of these store some part or all of the FPU environment into memory and it ensures that the values in-memory are up-to-date so that the store doesn't use obsolete values.

Results

In addition to the huge savings in data movement from the loads and stores of register values, TOP and tags for each stack instruction, we also optimize memory copies through the stack. So if there's a bunch of value loaded to the stack, which are then stored to memory, we can just memcopy these values without going through the stack.

In our code size validation benchmarks from various Steam games, we saw significant reductions in code size. These size code reductions saw a similarly high jump in FPS numbers. For example:

Game Before After Reduction
Half-Life (Block 1) 2034 1368 32.7%
Half-Life (Block 2) 838 551 34.2%
Oblivion (Block 1) 34065 25782 24.3%
Oblivion (Block 2) 22089 16635 24.6%
Psychonauts (Block 1) 20545 16357 20.3%
Psychonauts Hot Loop (Matrix Swizzle) 2340 165 92.9%

Future work

The bulk of this work is now complete, however there are a few details that still need fixing. There's still the mystery of why Touhou: Luna Nights works with reduced precision but not normal precision (if anything you would expect it to be the other way around). Then there's also some work to be done on the interface between the FPU state and the MMX state as described in "MMX/x87 interaction is subtly broken".

This is something I am already focusing on and will continue working on this until complete.

September 25, 2024 12:00 AM

September 23, 2024

Eric Meyer

Announcing BCD Watch

One of the things I think we all struggle with is keeping up to date with changes in web development.  You might hear about a super cool new CSS feature or JavaScript API, but it’s never supported by all the browsers when you hear about it, right?  So you think “I’ll have to make sure check in on that again later” and quickly forget about it.  Then some time down the road you hear about it again, talked about like it’s been best practice for years.

To help address this, Brian Kardell and I have built a service called BCD Watch, with a nicely sleek design by Stephanie Stimac.  It’s free for all to use thanks to the generous support of Igalia in terms of our time and hosting the service.

What BCD Watch does is, it grabs releases of the Browser Compatibility Data (BCD) repository that underpins the support tables on MDN and services like caniuse.com.  It then analyzes what’s changed since the previous release.

Every Monday, BCD Watch produces two reports.  The Weekly Changes Report lists all the changes to BCD that happened in the previous week — what’s been added, removed, or renamed in the whole of BCD.  It also tells you which of the Big Three browsers newly support (or dropped support for) each listed feature, along with a progress bar showing how close the feature is to attaining Baseline status.

The Weekly Baselines Report is essentially a filter of the first report: instead of all the changes, it lists only changes to Baseline status, noting which features are newly Baseline.  Some weeks, it will have nothing to report. Other weeks, it will list everything that’s reached Baseline’s “Newly Available” tier.

Both reports are available as standalone RSS, Atom, and JSON feeds, which are linked at the bottom of each report.  So while you can drop in on the site every week to bask in the visual design if you want (and that’s fine!), you can also get a post or two in your feed reader every Monday that will get you up to date on what’s been happening in the world of web development.

If you want to look back at older reports, the home page has a details/summary collapsed list of weekly reports going back to the beginning of 2022, which we generated by downloading all the BCD releases back that far, and running the report script against them.

If you encounter any problems with BCD Watch or have suggestions for improvements, please feel free to open an issue in the repository, or submit suggested changes via pull request if you like.  We do expect the service to evolve over time, perhaps adding a report for things that have hit Baseline Widely Available status (30 months after hitting all three engines) or reports that look at more than just the Big Three engines.  Hard to say!  Always in motion, the future is.

Whatever we may add, though, we’ll keep BCD Watch centered on the idea of keeping you better up to date on web dev changes, once a week, every week.  We really hope this is useful and interesting for you!  We’ve definitely appreciated having the weekly updates as we built and tested this, and we think a lot of you will, too.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at September 23, 2024 03:54 PM

Brian Kardell

Keeping up: Signal to Noise

Keeping up: Signal to Noise

Last year in December, Eric Meyer and I did an episode of Igalia Chats titled "The Struggle to Keep Up with Web Tech". It recounts the multi-dimensional challenges that developers can face in trying to keep up with developments in the web platform.

I've thought for a long time now that we all need better ways to keep up, and especially to not miss the most important stuff. Today, I'm happy to annouce what we think is a real help at that.

Introducing bcd-watch

If you're not familiar, "bcd" is short for "browser compatibility data". The browser-compat-data repository is the source of what is shown on MDN and used for many other critical things as well. All of the browser vendors, contributors and tech writers strive together to keep the information there accurate and up to date.

Back in 2022, Eric and I had the idea to write a tool which would fetch this data from time to time, compare it with the previous data and put together a brief report that you could view, and an RSS feed you could subscribe to. I wrote a thing about BCD and the fact that you could actually even use GitHub's API to subscribe to the release notes, which are already pretty nice. This was November 2022, and it was teasing a few screenshots from the prototype we had.

A month and 5 days later, I unexpectedly had a heart attack.

So, it got shoved to the back burner for a while but today we're finally sharing it. My colleage Eric Meyer has a a much better post annoucing this that I'll just point you to: Annoucing BCD-Watch..

September 23, 2024 04:00 AM

September 18, 2024

Andy Wingo

whippet progress update: feature-complete!

Greetings, gentle readers. Today, an update on recent progress in the Whippet embeddable garbage collection library.

feature-completeness

When I started working on Whippet, two and a half years ago already, I was aiming to make a new garbage collector for Guile. In the beginning I was just focussing on proving that it would be advantageous to switch, and also learning how to write a GC. I put off features like ephemerons and heap resizing until I was satisfied with the basics.

Well now is the time: with recent development, Whippet is finally feature-complete! Huzzah! Now it’s time to actually work on getting it into Guile. (I have some performance noodling to do before then, adding tracing support, and of course I have lots of ideas for different collectors to build, but I don’t have any missing features at the current time.)

heap resizing

When you benchmark a garbage collector (or a program with garbage collection), you generally want to fix the heap size. GC overhead goes down with more memory, generally speaking, so you don’t want to compare one collector at one size to another collector at another size.

(Unfortunately, many (most?) benchmarks of dynamic language run-times and the programs that run on them fall into this trap. Imagine you are benchmarking a program before and after a change. Automatic heap sizing is on. Before your change the heap size is 200 MB, but after it is 180 MB. The benchmark shows the change to regress performance. But did it really? It could be that at the same heap size, the change improved performance. You won’t know unless you fix the heap size.)

Anyway, Whippet now has an implementation of MemBalancer. After every GC, we measure the live size of the heap, and compute a new GC speed, as a constant factor to live heap size. Every second, in a background thread, we observe the global allocation rate. The heap size is then the live data size plus the square root of the live data size times a factor. The factor has two components. One is constant, the expansiveness of the heap: the higher it is, the more room the program has. The other depends on the program, and is computed as the square root of the ratio of allocation speed to collection speed.

With MemBalancer, the heap ends up getting resized at every GC, and via the heartbeat thread. During the GC it’s easy because it’s within the pause; no mutators are running. From the heartbeat thread, mutators are active: taking the heap lock prevents concurrent resizes, but mutators are still consuming empty blocks and producing full blocks. This works out fine in the same way that concurrent mutators is fine: shrinking takes blocks from the empty list one by one, atomically, and returns them to the OS. Expanding might reclaim paged-out blocks, or allocate new slabs of blocks.

However, even with some exponentially weighted averaging on the speed observations, I have a hard time understanding whether the algorithm is overall a good thing. I like the heartbeat thread, as it can reduce memory use of idle processes. The general square-root idea sounds nice enough. But adjusting the heap size at every GC feels like giving control of your stereo’s volume knob to a hyperactive squirrel.

GC collection time vs memory usage graph comparing V8 with MemBalancer. MemBalancer is shown to be better.
Figure 5 from the MemBalancer paper

Furthermore, the graphs in the MemBalancer paper are not clear to me: the paper claims more optimal memory use even in a single-heap configuration, but the x axis of the graphs is “average heap size”, which I understand to mean that maximum heap size could be higher than V8’s maximum heap size, taking into account more frequent heap size adjustments. Also, some measurement of total time would have been welcome, in addition to the “garbage collection time” on the paper’s y axis; there are cases where less pause time doesn’t necessarily correlate to better total times.

deferred page-out

Motivated by MemBalancer’s jittery squirrel, I implemented a little queue for use in paging blocks in and out, for the mmc and pcc collectors: blocks are quarantined for a second or two before being returned to the OS via madvise(MADV_DONTNEED). That way if you release a page and then need to reacquire it again, you can do so without bothering the kernel or other threads. Does it matter? It seems to improve things marginally and conventional wisdom says to not mess with the page table too much, but who knows.

mmc rename

Relatedly, Whippet used to be three things: the project itself, consisting of an API and a collection of collectors; one specific collector; and one specific space within that collector. Last time I mentioned that I renamed the whippet space to the nofl space. Now I finally got around to renaming what was the whippet collector as well: it is now the mostly-marking collector, or mmc. Be it known!

Also as a service note, I removed the “serial copying collector” (scc). It had the same performance as the parallel copying collector with parallelism=1, and pcc can be compiled with GC_PARALLEL=0 to explicitly choose the simpler serial grey-object worklist.

per-object pinning

The nofl space has always supported pinned objects, but it was never exposed in the API. Now it is!

Of course, collectors with always-copying spaces won’t be able to pin objects. If we want to support use of these collectors with embedders that require pinning, perhaps because of conservative root scanning, we’d need to switch to some kind of mostly-copying algorithm.

safepoints without allocation

Another missing feature was a safepoint API. It hasn’t been needed up to now because my benchmarks all allocate, but for programs that have long (tens of microseconds maybe) active periods without allocation, you want to be able to stop them without waiting too long. Well we have that exposed in the API now.

removed ragged stop

Following on my article on ragged stops, I removed ragged-stop marking from mmc, for a nice net 180 line reduction in some gnarly code. Speed seems to be similar.

next up: tracing

And with that, I’m relieved to move on to the next phase of Whippet development. Thanks again to NLnet for their support of this work. Next up, adding fine-grained tracing, so that I can noodle a bit on performance. Happy allocating!

by Andy Wingo at September 18, 2024 02:18 PM

September 16, 2024

Brian Kardell

Micropayments History

Micropayments History

Hey, remember that period when there were like a ton of online 'coins' that people are trying to make happen? And they all kind of wanted to be 'the one', they all involved some kind of crypto thing you didn't totally understand, and they'd all enable micropayments and also solve all of these interesting problems about money. Remember when people were like "let's make an IETF standard!" and others were like "this is a thing we should do at the W3C"? And then some really big companies got involved? Do you remember that time... in the mid to late 90s?

Because, it totally happened.

Check out this draft of Micro Payment Transfer Protocol (MPTP) Version 0.1 from November 1995, and just have a look at the abstract and references section and note how many proposals there already were in 1995. They're pretty hard to find today, unfortunately, lost to the bitrot of time, but here's one, probably slightly updated version (it's dated May 7, 1996) that I was able to dig up as a postscript file and convert to pdf. It's focused on how you can mint coins cryptographically, and the cost of it and how you can prevent forgery and theft of coins and so on.

There are minutes from a joint CommerceNet/W3C Electronic Payments Initiative Kickoff meeting which took place in Paris December 18, 1995 which was attended by 50+ people from a wide variety of companies from banking and credit cards to telecom and hardwardware systems. In it there is a mention already of "the people looking at micropayments" and pointing out that "with credit card systems there's usually a minimum dollar amount below which a transaction isn't profitable."

There's this Jan 1996 IETF draft for Application Level Internet Payment Syntax, and in Feb 1997 there was another meeting in Paris in which they discussed micropayments and then again in Brussels in Sept 1997. There's this grab from W3C's Payments Roadmap (the oldest grab here is from 1997, but it's probably older). Look how much there is!

There's also this link is to a public w3c page which has a lot of links to things which were "already working (more or less)".

IBM tried this minipay thing. Carnegie Mellon developed this NETBILL PROJECT in 1997.

Jakob Nielsen's Alertbox had a piece in early 1998: The Case for Micropayments - in which he said "I predict that most sites that are not financed through traditional product sales will move to micropayments in less than two years." Then about two years later he doubled down saying "I now finally believe that the first wave of micropayment services will hit in 2000."

He wasn't totally wrong, because then there was Beanz.

It was huge. Here's a whole 20 minute video someone did on Beanz. But, in short, it earned multiple big rounds of tens of million dollar funding, even $5m just from Larry Ellison. It had posh offices worldwide and really got a lot of attention. I guess it started out as mainly a thing like "loyalty points" but that you could spend anywhere. Everyone had loyalty points, but you could only spend them there. So maybe you smoked Marlboro - you earned "Marlboro Miles" and could trade them in for stuff in their catalog. The idea was "why not let me spend my Marlboro miles toward kinda whatever I want". In the end, that sounds a lot like "almost money". Today, I guess credit cards have subsumed this a bit, you can get cash back rewards or other things.

But, it was also totally new and it had a lot of competitors. Around 2000 we got a big one: Flooz with Whoopi making commercials!

And in that same year we also got My Points (still around), Netscentives, CyberGold (quickly aquired by MyPoints for $157m), internetcash.com and even Confinity x.com PayPal. At the same time all of the new tech was helping regular old transaction costs come down too.

And then the internet bubble burst.

And then Clay Shirky wrote "The case against micropayments" which argued pretty strongly that they're just a bad idea.

It seems the W3C work on this also stopped around that time, in 2001.

I guess it's never totally stopped. There was, for a minute, Simpay - a consortium formed in 2003 by Orange, T-Mobile, Vodafone, and Telefónica to promote mobile payment solutions across Europe (wow that was 20 years ago?!), and as I say some of those things above still exist, as do other things like fetch.com and, others have been integrated into existing systems more as Clay Shirky predicted. And, well, of course, there was Bitcoin in 2008 and tons of crypto coins that followed.

Along the way, in March 2014 the W3C had a workshop on web payments in Paris which discussed also "new virtual currencies". Ultimately, the W3C work restarted in 2015 to streamline the online “check-out” process and make payments easier and more secure on the Web under the new "Web Payments Working Group."

Anyway, it was an interesting rabbit hole to dive down, prompted by a number of recent discussions I've had on the health of the web ecosystem and the number of times that something like micropayments seems to come up. I'll write more about some of those discussions and thoughts in a future post.

Thanks to my good friend Coralie Mercier for pointing at the rabbit hole and pushing me into it :) And also for a number of pointers and links.

September 16, 2024 04:00 AM

September 13, 2024

José Dapena

Maintaining Chromium downstream: update strategies

This is the second of a series of blog posts I am publishing for sharing some considerations about the challenges of maintaining a downstream of Chromium.

The first part, Maintaining downstreams of Chromium: why downstreams?, provided initial definitions, and an analysis of why someone would need to maintain a downstream.

In this post I will focus on the possible different strategies for tracking the Chromium upstream project changes, and integrating the downstream changes.

Applying changes

The first problem is related to the fact that our downstream will include additional changes on top of upstream (otherwise we would not need a downstream, right?).

There are two different approaches for this, based on the Git strategy used: rebase vs. merge.

Merge strategy

This consists on maintaining an upstream branch and periodically merging its changes to the downstream branch.

In the case of Chromium, the size of the repository and the amount and frequency of the changes is really big, so the chances that merging changes cause any conflict are higher than in other smaller projects.

Rebase strategy

This consists on maintaining the downstream changes as a series of patches, that are applied on top of an upstream baseline.

Updating means changing the baseline and applying all the patches. As this is done, not all patches will cause a conflict. And, for the ones that do, the complexity is far smaller.

Tip: keep the downstream as small as possible

Don’t hesitate to remove features that are more prone to cause conflicts. And at least, think carefully, for each change, if it can be done in a way that matches better upstream codebase.

And, if some of your changes are good for the commons, just contribute them to upstream!

When to update?

A critical aspect of applying upstream changes is how often to do that.

The size and structure of the team involved is highly dependent on this decision. And of course, planning the resources.

Usually this needs to be as much predictable as possible, and bound to the upstream release cycle.

Some examples used in downstream projects:

  • Weekly or daily tracking.
  • Major releases.

Upstream release policy and rebases

It is not only important how often you track upstream repository, but also what you track.

Chromium, nowadays, follows this procedure for each major release (first number in the version):

  • Main development happens in main branch. Several releases are tagged daily. And, mostly every week, one of them is released to the dev channel for users to try (with additional testing).
  • At a point, an stabilization branch is created, entering beta stage. The quality bar for what is landed in this branch is raised. Only stabilization work is done. More releases per day are tagged, and again, approximately once per week one is released to beta channel.
  • When planned, the branch becomes stable. This means it is ready for wide user adoption, so the stable channel will pick its releases from this branch.

That means main (development branch) targets version N, then N-1 branch is the beta (or stabilization branch), and N-2 is the stable branch. Nowadays Chromium targets publishing a new major stable version every four weeks (and that also means a release spends 4 weeks in beta channel).

You can read the official release cycle documentation if you want to know more, including the extended stable (or long term support) channel.

Some downstreams will track upstream main branch. Some others will just start applying the downstream changes when first release lands the stable channel. And some may just start when main is branched and stabilization work begins.

Tip: update often

The more you update, the smaller the change set you need to consider. And that means reducing the complexity of the changes.

This is specially important when merging instead of rebasing, as applying the new changes is done in one step.

This could be hard, though, depending on the size of the downstream. And, from time to time some refactors upstream will still imply non-trivial work to adapt the downstream changes.

Tip: update NOT so often

OK, OK. I just told the opposite, right?

Now, think of a complex change upstream, that implies many intermediate steps, with a moving architecture that evolves as it is developed (something not unfrequent in Chromium). Updating often means you will need to adapt your downstream to all the intermediate steps.

That definitely could mean more work.

A nice strategy for matching upstream stable releases

If you want to publish an stable release almost the same day or week of the official upstream stable release, and your downstream changes are not trivial, then a good moment for starting applying changes is when the stabilization branch happens.

But the first days there is a lot of stabilization work happening, as the quality bar is raised. So… waiting a few days after the beta branch happens could be a good idea.

An idea: just wait for the second version published in beta channel (the first one happens right after branching). That should give still three full weeks before the version is promoted to the stable channel.

Tracking main branch: automate everything!

In case you want to follow upstream main branch, in a daily basis (or even per commit), then it is just not practical to do that manually.

The solution for that is automation, at different levels:

  • Applying the changes.
  • Resolving the conflicts.
  • And, more important, testing the result is working.

Good automation is, though, expensive. It requires computing power for building Chromium often, and run tests. But, increasing test coverage will detect issues earlier, and give an extra margin for resolving them.

In any case, no matter your update strategy is, automation will always be helpful.

Next

This is the end of the post. So, what comes next?

In the next post in this series, I will talk about the downstream size problem, and different approaches for keeping it under control.

References

by José Dapena Paz at September 13, 2024 06:57 AM

September 11, 2024

Igalia Compilers Team

Summary of the July-August 2024 TC39 plenary

code { word-break: normal; }

The TC39 committee met again at the end of July, this time remotely, to discuss updated to various proposals at different stages of the standardization process. Let's go together through some of the most significant updates!

You can also read the full agenda and the meeting minutes on GitHub.

Day 1 #

Test262 Updates #

On the compilers team at Igalia we have several reviewers for Test262, which is the conformance test suite that ensures all JS implementations conform to the ECMA-262 specification. As of this TC39 meeting, RegExp.escape, Atomics.pause, Math.sumPrecise, Base64, and source phase imports now have full or almost-full test coverage. Additionally, our team is working towards the goal of having a testing plan for each new proposal, to make it easier for people to write tests.

Intl.DurationFormat igalia logo #

DurationFormat continues its approach to Stage 4 with a fix to an edge case involving the formatting of certain negative durations when using the "numeric" or "two-digit" styles. Previously if a zero-valued "numeric" style unit was the first formatted unit in a negative duration, the negative sign would get dropped.

Drop assert from import attributes igalia logo #

The import attributes proposal was originally called "import assertions" and had the following syntax:

import data from "./data.json" assert { type: "json" }

The semantics of this example is that "./data.json" is fetched/interpreted ignoring the assertions (for example, web browsers would have known to parse the file as JSON because the server would use the application/json MIME type). Then the assertions were applied, potentially throwing an error.

The proposal had then been updated from "assertions" to "attributes", and the keyword was changed to reflect the new semantics:

import data from "./data.json" with { type: "json" }

Attributes can affect how a module is loaded and interpreted, rather than just causing an error if it's the "wrong" module. Web browsers can send the proper HTTP headers to servers indicating that they expect a JSON response, and other non-HTTP-based platform can use attributes to decide how they want to parse the file.

Chrome, Node.js and Deno shipped the assert keyword a while ago, so TC39 was not sure that removing it would have been web-compatible. Chrome succesfully unshipped support for assert in Chrome 126, released in June, so now the proposal has been updated to only support with.

AsyncContext Update igalia logo #

The AsyncContext allows defining variables that are implicitly propagated following the "async flow" of your code:

const clickX = new AsyncContext.Variable();

document.addEventListener("click", e => {
clickX.run(e.clientX, async () => {
await timeout("30 seconds");
await doSomething();
})
});

async function doSomething() {
console.log("Doing something due to the click at x=" + clickX.get());
}

In the above example, doSomething will log 30 seconds after each click, logging the x coordinate of the mouse corresponding to the click that scheduled that doSomething call.

The proposal, currently at stage 2, is proceeding steadily and the champions are currently figuring out how it should be integrated with other web APIs.

Deferred imports igalia logo #

The import defer proposal allows importing modules while deferring their execution until when it's needed, to reduce their impact on application startup time:

import defer * as mod from "./dep.js";

$button.addEventListener("click", () => {
console.log(mod.number); // ./dep.js is only evaluated at this point, when we need it.
})

The main difference between using import defer and dynamic import() is that the former can be used synchronously (without having to introduce promises or async/await), at the expense of only deferring code execution and not code loading.

During this meeting proposal reached Stage 2.7, which means that it's almost ready to be implemented (it's only missing test262 tests). The committee confirmed the semantics around top-level await: given that modules using top-level await cannot be executed asynchronously, they will be executed at startup even if they are part of a set of modules pulled in by an import defer declaration.

Day 2 #

Intl Locale Info #

Frank Yung-Fong Tang of Google provided a minor update to the stage 3 Intl.LocaleInfo proposal, and received consensus for a PR fixing a minor bug with the handling of requests for the week of day and clarifying the result of using the "-u-ca-iso8601" Unicode locale extension. He also gave an update on the implementation status of Intl.LocaleInfo, noting that it is currently implemented in Chrome and Safari, with a Mozilla implementation pending.

Temporal Update igalia logo #

Our colleague Philip Chimento presented another minor update on Temporal. There were two bug fixes to approve: one to correctly handle a corner case in the time zone database where the day started at half past midnight on March 31, 1919 in Toronto due to a time zone shift, and another to disallow a particular combination of options in Temporal.Duration.prototype.round() where there were multiple interpretations possible for what the outcome should be.

Temporal is nearing completion; JS engines continue to work on implementing it. Firefox is the closest to having a complete implementation. There is a checklist that you can follow if you are interested in further updates on the TC39 side.

Fingerprinting discussion igalia logo #

Ben Allen led a fruitful discussion on the potential for fingerprinting risk should browser implementations allow users to dynamically update locale-related information, either through adding new locales beyond the ones shipped with the browser or adding additional locale-related components, such as additional scripts or additional calendars. Including this feature would greatly improve the usability of the Web for users who speak minority languages -- but letting hosts know whether users have those non-base locales installed could result in those users having their privacy compromised in a way that in some cases could be genuinely dangerous for them.

Although no browsers currently allow for dynamic locale-related updates, it is a feature under consideration among multiple browser-makers. The conversation concluded with the decision for the relevant privacy teams to consider the matter and return with feedback.

RegExp.escape for Stage 3 #

RegExp.escape is a new utility for escaping strings so that they can be safely used in regular expression.

Consider the case where you have a string ".js", and you want to check wether another strings ends with that suffix using a regular expression. You might be tempted to write something like this:

const suffix = ".js";
const matches = new RegExp(`${suffix}$`).test(filename);

However, that code has a problem: . has a special meaning in regular expressions, so matches would also be true for something.notjs! To fix this, you would need to escape the suffix:

const suffix = ".js";
const matches = new RegExp(`${RegExp.escape(suffix)}$`).test(filename);

Thanks to the proposal's champion Jordan Harband the committee approved RegExp.escape to advance to Stage 3, which means that browsers can now implement and ship it.

Day 3 #

Decimal igalia logo #

Decimal champion Jesse Alama gave an update about the current status of the proposal. There were minor changes to the API; the main update was to make the spec text considerable sharper, thanks to close cooperation with new co-author Waldemar Horwat. Jesse asked for decimal to advance to stage 2 in the TC39 process, but it did not, because the argument that decimal, in its current form, is too close to just being a library. In the discussion, some concerns were raised that if we were to go ahead with the current Decimal128-based data model, in which trailing zeroes are preserved, we close the door to a potential future where decimal exists JavaScript as a primitive type. Our work in decimal continues -- stay tuned!

September 11, 2024 12:00 AM

September 10, 2024

Enrique Ocaña

Don’t shoot yourself in the foot with the C++ move constructor

Move semantics can be very useful to transfer ownership of resources, but as many other C++ features, it’s one more double edge sword that can harm yourself in new and interesting ways if you don’t read the small print.

For instance, if object moving involves super and subclasses, you have to keep an extra eye on what’s actually happening. Consider the following classes A and B, where the latter inherits from the former:

#include <stdio.h>
#include <utility>

#define PF printf("%s %p\n", __PRETTY_FUNCTION__, this)

class A {
 public:
 A() { PF; }
 virtual ~A() { PF; }
 A(A&& other)
 {
  PF;
  std::swap(i, other.i);
 }

 int i = 0;
};

class B : public A {
 public:
 B() { PF; }
 virtual ~B() { PF; }
 B(B&& other)
 {
  PF;
  std::swap(i, other.i);
  std::swap(j, other.j);
 }

 int j = 0;
};

If your project is complex, it would be natural that your code involves abstractions, with part of the responsibility held by the superclass, and some other part by the subclass. Consider also that some of that code in the superclass involves move semantics, so a subclass object must be moved to become a superclass object, then perform some action, and then moved back to become the subclass again. That’s a really bad idea!

Consider this usage of the classes defined before:

int main(int, char* argv[]) {
 printf("Creating B b1\n");
 B b1;
 b1.i = 1;
 b1.j = 2;
 printf("b1.i = %d\n", b1.i);
 printf("b1.j = %d\n", b1.j);
 printf("Moving (B)b1 to (A)a. Which move constructor will be used?\n");
 A a(std::move(b1));
 printf("a.i = %d\n", a.i);
 // This may be reading memory beyond the object boundaries, which may not be
 // obvious if you think that (A)a is sort of a (B)b1 in disguise, but it's not!
 printf("(B)a.j = %d\n", reinterpret_cast<B&>(a).j);
 printf("Moving (A)a to (B)b2. Which move constructor will be used?\n");
 B b2(reinterpret_cast<B&&>(std::move(a)));
 printf("b2.i = %d\n", b2.i);
 printf("b2.j = %d\n", b2.j);
 printf("^^^ Oops!! Somebody forgot to copy the j field when creating (A)a. Oh, wait... (A)a never had a j field in the first place\n");
 printf("Destroying b2, a, b1\n");
 return 0;
}

If you’ve read the code, those printfs will have already given you some hints about the harsh truth: if you move a subclass object to become a superclass object, you’re losing all the subclass specific data, because no matter if the original instance was one from a subclass, only the superclass move constructor will be used. And that’s bad, very bad. This problem is called object slicing. It’s specific to C++ and can also happen with copy constructors. See it with your own eyes:

Creating B b1
A::A() 0x7ffd544ca690
B::B() 0x7ffd544ca690
b1.i = 1
b1.j = 2
Moving (B)b1 to (A)a. Which move constructor will be used?
A::A(A&&) 0x7ffd544ca6a0
a.i = 1
(B)a.j = 0
Moving (A)a to (B)b2. Which move constructor will be used?
A::A() 0x7ffd544ca6b0
B::B(B&&) 0x7ffd544ca6b0
b2.i = 1
b2.j = 0
^^^ Oops!! Somebody forgot to copy the j field when creating (A)a. Oh, wait... (A)a never had a j field in the first place
Destroying b2, a, b1
virtual B::~B() 0x7ffd544ca6b0
virtual A::~A() 0x7ffd544ca6b0
virtual A::~A() 0x7ffd544ca6a0
virtual B::~B() 0x7ffd544ca690
virtual A::~A() 0x7ffd544ca690

Why can something that seems so obvious become such a problem, you may ask? Well, it depends on the context. It’s not unusual for the codebase of a long lived project to have started using raw pointers for everything, then switching to using references as a way to get rid of null pointer issues when possible, and finally switch to whole objects and copy/move semantics to get rid or pointer issues (references are just pointers in disguise after all, and there are ways to produce null and dangling references by mistake). But this last step of moving from references to copy/move semantics on whole objects comes with the small object slicing nuance explained in this post, and when the size and all the different things to have into account about the project steals your focus, it’s easy to forget about this.

So, please remember: never use move semantics that convert your precious subclass instance to a superclass instance thinking that the subclass data will survive. You can regret about it and create difficult to debug problems inadvertedly.

Happy coding!

by eocanha at September 10, 2024 07:58 AM

September 07, 2024

Andy Wingo

conservative gc can be faster than precise gc

Should your garbage collector be precise or conservative? The prevailing wisdom is that precise is always better. Conservative GC can retain more objects than strictly necessary, making GC slow: GC has to more frequently, and it has to trace a larger heap on each collection. However the calculus is not as straightforward as most people think, and indeed there are some reasons to expect that conservative root-finding can result in faster systems.

(I have made / relayed some of these arguments before but I feel like a dedicated article can make a contribution here.)

problem precision

Let us assume that by conservative GC we mean conservative root-finding, in which the collector assumes that any integer on the stack that happens to be a heap address indicates a reference on the object containing that address. The address doesn’t have to be at the start of the object. Assume that objects on the heap are traced precisely; contrast to BDW-GC which generally traces both the stack and the heap conservatively. Assume a collector that will pin referents of conservative roots, but in which objects not referred to by a conservative root can be moved, as in Conservative Immix or Whippet’s stack-conservative-mmc collector.

With that out of the way, let’s look at some reasons why conservative GC might be faster than precise GC.

smaller lifetimes

A compiler that does precise root-finding will typically output a side-table indicating which slots in a stack frame hold references to heap objects. These lifetimes aren’t always precise, in the sense that although they precisely enumerate heap references, those heap references might actually not be used in the continuation of the stack frame. When GC occurs, it might mark more objects as live than are actually live, which is the imputed disadvantage of conservative collectors.

This is most obviously the case when you need to explicitly register roots with some kind of handle API: the handle will typically be kept live until the scope ends, but that might be an overapproximation of lifetime. A compiler that can assume conservative stack scanning may well exhibit more precision than it would if it needed to emit stack maps.

no run-time overhead

For generated code, stack maps are great. But if a compiler needs to call out to C++ or something, it needs to precisely track roots in a run-time data structure. This is overhead, and conservative collectors avoid it.

smaller stack frames

A compiler may partition spill space on a stack into a part that contains pointers to the heap and a part containing numbers or other unboxed data. This may lead to larger stack sizes than if you could just re-use a slot for two purposes, if the lifetimes don’t overlap. A similar concern applies for compilers that partition registers.

no stack maps

The need to emit stack maps is annoying for a compiler and makes binaries bigger. Of course it’s necessary for precise roots. But then there is additional overhead when tracing the stack: for each frame on the stack, you need to look up the stack map for the return continuation, which takes time. It may be faster to just test if words on the stack might be pointers to the heap.

unconstrained compiler

Having to make stack maps is a constraint imposed on the compiler. Maybe if you don’t need them, the compiler could do a better job, or you could use a different compiler entirely. A conservative compiler can sometimes have better codegen, for example by the use of interior pointers.

anecdotal evidence

The Conservative Immix paper shows that conservative stack scanning can beat precise scanning in some cases. I have reproduced these results with parallel-stack-conservative-mmc compared to parallel-mmc. It’s small—maybe a percent—but it was a surprising result to me and I thought others might want to know.

Also, Apple’s JavaScriptCore uses conservative stack scanning, and V8 is looking at switching to it. Funny, right?

conclusion

When it comes to designing a system with GC, don’t count out conservative stack scanning; the tradeoffs don’t obviously go one way or the other, and conservative scanning might be the right engineering choice for your system.

by Andy Wingo at September 07, 2024 10:00 AM

September 06, 2024

Andy Wingo

on taking advantage of ragged stops

Many years ago I read one of those Cliff Click “here’s what I learned” articles in which he was giving advice about garbage collector design, and one of the recommendations was that at a GC pause, running mutator threads should cooperate with the collector by identifying roots from their own stacks. You can read a similar assertion in their VEE2005 paper, The Pauseless GC Algorithm, though this wasn’t the source of the information.

One motivation for the idea was locality: a thread’s stack is already local to a thread. Then specifically in the context of a pauseless collector, you need to avoid races between the collector and the mutator for a thread’s stack, and having the thread visit its own stack neatly handles this problem.

However, I am not so interested any more in (so-called) pauseless collectors; though I have not measured myself, I am convinced enough by the arguments in the Distilling the real costs of production garbage collectors paper, which finds that state of the art pause-minimizing collectors actually increase both average and p99 latency, relative to a well-engineered collector with a pause phase. So, the racing argument is not so relevant to me, because a pause is happening anyway.

There was one more argument that I thought was interesting, which was that having threads visit their own stacks is a kind of cheap parallelism: the mutator threads are already there, they might as well do some work; it could be that it saves time, if other threads haven’t seen the safepoint yet. Mutators exhibit a ragged stop, in the sense that there is no clean cutoff time at which all mutators stop simultaneously, only a time after which no more mutators are running.

Visiting roots during a ragged stop introduces concurrency between the mutator and the collector, which is not exactly free; notably, it prevents objects marked while mutators are running from being evacuated. Still, it could be worth it in some cases.

Or so I thought! Let’s try to look at the problem analytically. Consider that you have a system with N processors, a stop-the-world GC with N tracing threads, and M mutator threads. Let’s assume that we want to minimize GC latency, as defined by the time between GC is triggered and the time that mutators resume. There will be one triggering thread that causes GC to begin, and then M–1 remote threads that need to reach a safepoint before the GC pause can begin.

The total amount of work that needs to be done during GC can be broken down into rootsi, the time needed to visit roots for mutator i, and then graph, the time to trace the transitive closure of live objects. We want to know whether it’s better to perform rootsi during the ragged stop or in the GC pause itself.

Let’s first look to the case where M is 1 (just one mutator thread). If we visit roots before the pause, we have

latencyragged,M=1 = roots0 + graphN

Which is to say, thread 0 triggers GC, visits its own roots, then enters the pause in which the whole graph is traced by all workers with maximum parallelism. It may be that graph tracing doesn’t fully parallelize, for example if the graph has a long singly-linked list, but the parallelism with be maximal within the pause as there are N workers tracing the graph.

If instead we visit roots within the pause, we have:

latencypause,M=1= roots0+graphN

This is strictly better than the ragged-visit latency.

If we have two threads, then we will need to add in some delay, corresponding to the time it takes for remote threads to reach a safepoint. Let’s assume that there is a maximum period (in terms of instructions) at which a mutator will check for safepoints. In that case the worst-case delay will be a constant, and we add it on to the latency. Let us assume also there are more than two threads available. The marking-roots-during-the-pause case it’s easiest to analyze:

latencypause,M=2= delay + roots0+roots1+graphN

In this case, a ragged visit could win: while the triggering thread is waiting for the remote thread to stop, it could perform roots0, moving the work out of the pause, reducing pause time, and reducing latency, for free.

latencyragged,M=2= delay + roots1 + graphN

However, we only have this win if the root-visiting time is smaller than the safepoint delay; otherwise we are just prolonging the pause. Thing is, you don’t know in general. If indeed the root-visiting time is short, relative to the delay, we can assume the roots elements of our equation are 0, and so the choice to mark during ragged stop doesn’t matter at all! If we assume instead that root-visiting time is long, then it is suboptimally parallelised: under-parallelised if we have more than M cores, oversubscribed if M is greater than N, and needlessly serializing before the pause while it waits for the last mutator to finish marking its roots. What’s worse, root-visiting actually slows down delay, because the oversubscribed threads compete with visitation for CPU time.

So in summary, I plan to switch away from doing GC work during the ragged stop. It is complexity that doesn’t pay. Onwards and upwards!

by Andy Wingo at September 06, 2024 12:15 PM

Brian Kardell

1.3 Million Subtests

1.3 Million Subtests

Today the Servo project crossed a milestone.

If you're not aware, all of the mainstream browsers that exist today use one of three browser projects/engines (Chromium/Blink, Firefox/Gecko or WebKit). These in turn both have their roots in projects begun in the late 1990's (Gecko and KHTML). Further, they have only 1 funding model, and largely 1 funding source. Over the years, the projects have grown to tens of millions of lines of code and have now had tens of thousands of person years worth of investment.

All of this is just to underscore that getting a new one is incredibly challenging. Microsoft, a web giant, tried and eventually gave up and embraced Chromium.

But... If you haven't been paying attention, something very fun and exciting has been developing over the last couple of years. New life has been breathed into a movement for what we call "novel engines" - that is, engines that don't come from those initial two (khtml/Gecko).

What's more is that most of the life at least comes from 2 engines which are being developed and funded differently - and they're making rapid progress now.

As of today, one of those novel engines - Servo (stewarded by Igalia) passes 1.3 million subtests in Web Platform Tests! Specfically: 1,303,530 as of this writing! Congratulations Servo project!

For perspective, that's over 73% of the subtests that it's currently running. Of course, that's a totally arbitrary threshold (kind of a like a birthday), but it's nice to track progress and celebrate milestones and that's a big sounding round number to stop, raise a glass and say "well done!" and "keep up the good work".

Speaking of keeping up the good work: You can join a growing number of others in helping to support the work on Servo directly by donating through GitHub sponsors or our open collective. The Servo Technical Steering Committee collectively discusses how to prioritize the spending of available funds in the public monthly calls. You can also contribute code, reviews and other effort via the Servo GitHub repository. Let's see how quickly we can reach 1.4 million :)

September 06, 2024 04:00 AM

September 04, 2024

Frédéric Wang

My recent contributions to Gecko (3/3)

Note: This blog post was written on June 2024. As of September 2024, final work to ship the feature is still in progress. Please follow bug 1797715 for the latest updates.

Introduction

This is the final blog post in a series about new web platform features implemented in Gecko, as part as an effort at Igalia to increase browser interoperability.

Let’s take a look at fetch priority attributes, which enable web developers to optimize resource loading by specifying the relative priority of resources to be fetched by the browser.

Fetch priority

The web.dev article on fetch priority explains in more detail how web developers can use fetch priority to optimize resource loading, but here’s a quick overview.

fetchpriority is a new attribute with the value auto (default behavior), high, or low. Setting the attribute on a script, link or img element indicates whether the corresponding resource should be loaded with normal, higher, or lower priority 1:

<head>
  <script src="high.js" fetchpriority="high"></script>
  <link rel="stylesheet" href="auto.css" fetchpriority="auto">
</head>
<body>
  <img src="low.png" alt="low" fetchpriority="low">
</body>

The priority can also be set in the RequestInit parameter of the fetch() method:

await fetch("high.txt", {priority: "high"});

The <link> element has some interesting features. One of them is combining rel=preload and as to fetch a resource with a particular destination 2:

<link rel="preload" as="font" href="high.woff2" fetchpriority="high">

You can even use Link in HTTP response headers and in particular early hints sent before the final response:

103 Early Hint
Link: <high.js>; rel=preload; as=script; fetchpriority=high

These are basically all the places where a fetch priority attribute can be used.

Note that other parameters are also taken into account when deciding the priority to use for resources, such as the position of the element in the page (e.g. blocking resources in <head>), other attributes on the element (<script async>, <script defer>, <link media>, <link rel>…) or the resource’s destination.

Finally, some browsers implement speculative HTML parsing, allowing them to continue fetching resources declared in the HTML markup while the parser is blocked. As far as I understand, Firefox has its own separate HTML parsing code for that purpose, which also has to take fetch priority attributes into account.

Implementation-defined prioritization

If you have not run away after reading the complexity described in the previous section, let’s talk a bit more about how fetch priority attributes are interpreted. The spec contains the following step when fetching a resource (emphasis mine):

If request’s internal priority is null, then use request’s priority, initiator, destination, and render-blocking in an implementation-defined manner to set request’s internal priority to an implementation-defined object.

So browsers would use the high/low/auto hints as well as the destination in order to calculate an internal priority value 3, but the details of this value are not provided in the specification, and it’s up to the browser to decide what to do. This is a bit unfortunate for our interoperability goal, but that’s probably the best we can do, given that each browser already has its own stategies to optimize resource loading. I think this also gives browsers some flexibility to experiment with optimizations… which can be hard to predict when you realize that web devs also try to adapt their content to the behavior of (the most popular) browsers!

In any case, the spec authors were kind enough to provide a note with more suggestions (emphasis mine):

The implementation-defined object could encompass stream weight and dependency for HTTP/2, priorities used in Extensible Prioritization Scheme for HTTP for transports where it applies (including HTTP/3), and equivalent information used to prioritize dispatch and processing of HTTP/1 fetches. [RFC9218]

OK, so what does that mean? I’m not a networking expert, but this is what I could gather after discussing with the Necko team and reading some HTTP specs:

  • HTTP/1 does not have a dedicated prioritization mechanism, but Firefox uses its internal priority to order requests.
  • HTTP/2 has a “stream priority” mechanism and Firefox uses its internal priority to implement that part of the spec. However, it was considered too complex and inefficient, and is likely poorly supported by existing web servers…
  • In upcoming releases, Firefox will use its internal priority to implement the Extensible Prioritization Scheme used by HTTP/2 and HTTP/3. See bug 1865040 and bug 1864392. Essentially, this means using its internal priority to adjust the urgency parameter.

Note that various parts of Firefox rely on NS_NewChannel to load resources, including the fetching algorithm above, which Firefox uses to implement the fetch() method. However, other cases mentioned in the first section have their own code paths with their own calls to NS_NewChannel, so these places must also be adjusted to take the fetch priority and destination into account.

Finishing the implementation work

Summarizing a bit, implementing fetch priority is a matter of:

  1. Adding fetchpriority to DOM objects for HTMLImageElement, HTMLLinkElement, HTMLScriptElement, and RequestInit.
  2. Parsing the fetch priority attribute into an auto/low/high enum.
  3. Passing the information to the callers of NS_NewChannel.
  4. Using that information to set the internal priority.
  5. Using that internal priority for HTTP requests.

Mirko Brodesser started this work in June 2023, and had already implemented almost all of the features discussed above. fetch(), <img>, and <link rel=preload as=image> were handled by Ziran Sun and I, while Valentin Gosu from Mozilla made HTTP requests use the internal priority.

The main blocker was due to that “implementation-defined” use of fetch priority. Mirko’s approach was to align Firefox with the behavior described in the web.dev article, which reflects Chromium’s implementation. But doing so would mean changing Firefox’s default behavior when fetchpriority is not specified (or explicitly set to auto), and it was not clear whether Chromium’s prioritization choices were the best fit for Firefox’s own implementation of resource loading.

After meeting with Mozilla, we agreed on a safer approach:

  1. Introduce runtime preferences to control how Firefox adjusts internal priorities when low, high, or auto is specified. By default, auto does not affect the internal priority so current behavior is preserved.
  2. Ask Mozilla’s performance team to run an experiment, so we can decide the best values for these preferences.
  3. Ship fetch priority with the chosen values, probably cleaning things up a bit. Any other ideas, including the ones described in the web.dev article, could be handled in future enhancements.

We recently entered phase 2 of this plan, so fingers crossed it works as expected!

Internal WPT tests

This project is part of the interoperability effort, but again, the “implementation-defined” part meant that we had very few WPT tests for that feature, really only those checking fetchpriority attributes for the DOM part.

Fortunately Mirko, who is a proponent of Test-driven development, had written quite a lot of internal WPT tests that use internal APIs to retrieve the internal priority. To test Link headers, he used the handy wptserve pipes. The only thing he missed was checking support in Early hints, but some WPT tests for early hints using WPT Python Handlers were available, so integrating them into Mirko’s tests was not too difficult.

It was also straightforward for Ziran and I to extend Mirko’s tests to cover fetch, img, and <link rel=preload as=image>, with one exception: when the fetch() method uses a non-default destination. In most of these code paths, we call NS_NewChannel to perform a fetch. But fetch() is tricky, because if the fetch event is intercepted, the event handler might call the fetch() method again using the same destination (e.g. image).

Handling this correctly involves multiple processes and IPC communication, which ended up not working well with the internal APIs used by Mirko’s tests. It took me a while to understand what was happening in bug 1881040, and in the end I came up with a new approach.

Upstreamable WPT tests

First, let’s pause for a moment: all the tests we have so far use an internal API to verify the internal priority, but they don’t actually check how that internal priority is used by Firefox when it sends HTTP requests. Valentin mentioned we should probably have some tests covering that, and not only would it solve the problem with fetch() calls in fetch event handlers, it would also remove the use of an internal API, making the tests potentially reusable by other browsers.

To make this kind of test possible, I added a WPT Python Handler that parses the urgency from a HTTP request and responds with an urgency-dependent resource, such as a stylesheet with different property values, an image of a different size, or an audio or video file of a different duration.

When a test uses resources with different fetch priorities, this influences the urgency values of their HTTP requests, which in turn influences the response in a way that the test can check for in JavaScript. This is a bit complicated, but it works!

Conclusion

Fetch priority has been enabled in Firefox Nightly for a while, and experiments started recently to determine the optimal priority adjustments. If everything goes well, we will be able to push this feature to the finish line after the (northern) summer.

Helping implement this feature also gave me the opportunity to work a bit on the Firefox networking code, which I had not touched since the collaboration with IPFS, and I learned a lot about resource loading and WPT features for HTTP requests.

To me, the “implementation-defined” part was still a bit awkward for the web platform. We had to write our own internal WPT tests and do extra effort to prepare the feature for shipping. But in the end, I believe things went relatively smoothly.

Acknowledgments

To conclude this series of blog posts, I’d also like to thank Alexander Surkov, Cathie Chen, Jihye Hong, Martin Robinson, Mirko Brodesser, Oriol Brufau, Ziran Sun, and others at Igalia who helped on implementing these features in Firefox. Thank you to Emilio Cobos, Olli Pettay, Valentin Gosu, Zach Hoffman, and others from the Mozilla community who helped with the implementation, reviews, tests and discussions. Finally, our spelling and grammar expert Delan Azabani deserves special thanks for reviewing this series of blog post and providing useful feedback.

  1. Other elements have been or are being considered (e.g. <iframe>, SVG <image> or SVG <script>), but these are the only ones listed in the HTML spec at the time of writing. 

  2. As mentioned below, the browser needs to know about the actual destination in order to properly calculate the priority. 

  3. As far as I know, Firefox does not take initiator into account, nor does it support render-blocking yet

September 04, 2024 10:00 PM

Tvrtko Ursulin

DRM scheduling cgroup controller

Introduction #

The topic of a Direct Rendering Manager (DRM) cgroup controller is something which has been proposed a few times in the past, but so far is still missing from the Linux graphics stack. Some of those attempts were focusing on controlling the GPU memory usage aspect, while some were concerned with scheduling. As I am continuing to explore this area as part of my work at Igalia, in this post we will discuss one possible way of implementing the latter.

General problem statement which we are trying to address is the fact many GPUs (and their respective kernel drivers) can simultaneously schedule workloads from different clients and that there are use-cases where having external control over scheduling decisions would be beneficial.

But first to clarify what we mean by “external control”. By that term we refer to the scheduling decisions being influenced from the outside of the actual process doing the rendering. If we were to draw a parallel to CPU scheduling, that would be the difference between a process (or a thread) issuing a system call such as setpriority(2) or nice(2) itself (“internal control”), versus its scheduling priority being modified by an external entity such as the user issuing the renice(1) shell command, launching the executable via the nice(1) shell command, or even using the CPU scheduling cgroup controller (“external control”).

This has two benefits. Firstly, it is the user who typically knows which tasks are higher priority and which should run in the background and therefore be as much as it is possible isolated from starving the foreground tasks from resources. Secondly, external control can be applied on any process in an unified manner, without the need for applications to individually expose the means to control their scheduling priority.

If we now return back to the world of GPU scheduling we find ourselves in a landscape where internal scheduling control is possible with many GPU drivers, but the external control is not. To improve on that there are some technical and conceptual challenges, because GPUs are not as nice and uniform in their scheduling needs and capabilities as CPUs are, but if we would be able to come up with something reasonable even if not perfect, it could bring improvements to the user experience in a variety of scenarios.

Past attempts - Priority based controllers #

The earliest attempt I can remember was from 2018, by Matt Roper[1], who proposed to implement a driver-specific priority based controller. The RFC limited itself to i915 (kernel driver for Intel GPUs) and, although the priority-based setup is well established in the world of CPU scheduling, and it is easy to understand its effects, the proposal did not gain much traction.

Because of the aforementioned advantages, when I proposed my version of the controller in 2022[2], it also included a slightly different version of a priority-based controller. In contrast to the earlier one, this proposal was in principle driver-agnostic and the priority levels were also abstracted.

The proposal was also accompanied by benchmark results showing that the approach was effective in allowing users on Linux to launch GPU tasks in the background, while leaving more GPU bandwidth to the foreground task than when not using the controller. Similarly on ChromeOS, when wired into the focused versus un-focused window cgroup management, it was able to demonstrate relatively more GPU time given to the foreground window.

Current proposal - Weight based controller #

Anticipating the potential lack of sufficient support for this approach the same RFC also included a second controller which takes a different route. It abstracts things one step further and implements a weight based controller based on GPU utilisation[3].

The basic idea is that the GPU time budget is split based on relative group weights across the cgroup hierarchy, and that the controller notifies the individual DRM drivers when their clients are over budget. From there it is left for the individual drivers to know how to best manage this situation, depending on the specific scheduling capabilities of the driver and the GPU hardware.

The user interface completely mimics the exiting CPU and IO cgroup controllers with the single drm.weight control file. The weights carry no absolute meaning and are only relative within a single group of siblings. Their only purpose is to split out the time budget between them.

Visually one potential cgroup configuration could look like this:

A visual representation of cgroup hierarchy

The DRM cgroup controller then executes a periodic scanning task which queries each DRM client for its GPU usage and notifies drivers when clients are over their allocated budget.

If we expand the concept with runtime adjustment of group weights based on window focus status, with two graphically active clients such as a game and a web browser, we can end up with the following two scenarios:

Example cgroup hierararchy re-configured based on window focus

Here we show the actual GPU utilisation of each group together with their drm.weight. On the left hand side the web browser is the focused window, with the weights 100-to-10 in its favour.

The compositor is not using its full 200 / (200 + 100) so a portion is passed on to the desktop group to the extent of the full 80% required. Inside the desktop group the game is currently using 70%, while its actual allocation is 80% * (10 / (100 + 10)) = 7.27%. Therefore it is currently consuming is more than the budget and the corresponding DRM driver will be notified by the controller and will be able to do something about it.

After the user has given focus to the game window, relative weights will be adjusted and so will the budgets. Now the web browser will be over budget and therefore it can be throttled down, limiting the effect of its background activity on the foreground game window.

First driver implementation - i915 #

Back when I started developing this idea Intel GPU’s were my main focus, which is why i915 was the first driver I wired up with the controller.

There I implemented a rather simple approach of dynamically adjusting the scheduling priority of the throttled contexts, to the amount proportional to how much client is over budget in relative terms.

Implementation would also cross-check against the physical engine utilisation, since in i915 we have easy access to that metric, and only throttle if the latter is close to being fully utilised. (Why this makes sense could be an interesting digression relating to the fact that a single cgroup can in theory contain multiple GPUs and multiple clients using a mix of those GPUs. But lets leave that for later.)

One of the scenarios I used to test how well this works is to run two demanding GPU clients, each in its own cgroup, tweak their relative weights, and see what happens. The results were encouraging and are shown in the following table.

Benchmark results under Linux

We can see that, when a clients group weight was decreased, the GPU bandwidth it was receiving also went down, as a consequence of the lowered context priority after receiving the over-budget notification.

This is a suitable moment to mention how the DRM cgroup controller does not promise perfect control, that is, achieving the actual GPU sharing ratios as expressed by group-relative weights. As we have mentioned before, GPU scheduling is not nearly at the same level of quality and granularity as in the CPU world, so the goal it sets is simply to improve things - do something which has a positive impact on user experience. At the same time, the mechanism and control interface proposed does not preclude individual drivers doing as good job as they can. Or even a future possibility of replacing the inner workings with a controller with something smarter, with no need to change the user space control interface.

Going back to the initial i915 implementation, the second test I have done was attempting to wire up with the background/foreground window focus handling in ChromeOS. There I experimented with a game (Android VM) running in parallel with a WebGL demo in a browser. At a certain point after both clients were running I lowered the weight of the background game and on the below screenshot we can see how the FPS metric in a browser jumped up.

Screenshot of foreground versus background change under ChromeOS

This illustrates how having the controller can indeed improve the user experience. The user’s focus will be at the foreground window and therefore it does make sense to prioritise GPU access to that client for better interactiveness and smoother rendering there. In fact, in this example the actual FPS jumped from around 48-49 to 60fps. Meaning that throttling the background client has allowed the foreground one to match its rendering to display’s refresh rate.

Second implementation - amdgpu #

AMD’s kernel module was the next interesting driver which I wired up with the controller.

The fact that its scheduling is built on top of the DRM scheduler with only three distinct priority levels mandated a different approach to throttling. We keep a sorted list of “most offending” clients (most out of budget, or most borrowed unused budget from the sibling group), with the idea that the top client on that list gets throttled by lowering its scheduling priority. That was relatively straightforward to implement and sounded like it could potentially satisfy the most basic use case of background task isolation.

To test the runtime behaviour we set up two sibling cgroups and vary their relative scheduling weights. In one cgroup we run glxgears with vsync turned off and log its frame rate over time, while in the second group we run glmark2.

Let us first have a look on how glxgears frame rate varies during this test, depending on three different scheduling weight ratios between the cgroups. Scheduling weight ratio is expressed as glxgears:glmark2 ie. 10:1 means glxgears scheduling weight was ten times as much as configured for glmark2.

We can observe that, as the glmark2 is progressing through its various sub-benchmarks, glxgears frame rate is changing too. But it was overall higher in the runs where the scheduling weight ratio was in its favour. That is a positive result showing that even a simple implementation seems to be having the desired effect, at least to some extent.

Scheduling weight effect on glxgears running in parallel with glmark2

For the second test we can look from the perspective of glmark2, checking how the benchmark score change depending on the ratio of scheduling weights.

Scheduling weight effect on glmark2 running in parallel with glxgears

Again we see that the scores are generally improving when the scheduling weight ratio is increased in favour of the benchmark.

However, in neither case the change of the result is proportional to actual ratios. This is because the primitive implementation is not able to precisely limit the “background” client, but is only able to achieve some throttling. Also, there is an inherent delay in how fast the controller can react given the control loop is based on periodic scanning. This period is configurable and was set to two seconds for the above tests.

Conclusion #

Hopefully this write-up has managed to demonstrate two main points:

  • First, that a generic and driver agnostic approach to DRM scheduling cgroup controller can improve user experience and enable new use cases. While at the same time following the established control interface as it exists for CPU and IO control, which makes it future-proof and extendable;

  • Secondly, that even relatively basic driver implementations can be somewhat effective in providing positive control effects.

It also probably needs to be re-iterated that neither the driver implementations or the cgroup controller implementation itself are limited by the user interface proposed. Both could be independently improved under the hood in the future.

What is next? There is more work to be done such as conducting more detailed testing, polishing the implementation and potentially attempting to wire up more drivers to the controller. Further advocacy work in the DRM community too.

References #


  1. https://lore.kernel.org/dri-devel/20180120015141.10118-1-matthew.d.roper@intel.com/ ↩︎

  2. https://lore.kernel.org/lkml/20221019173254.3361334-1-tvrtko.ursulin@linux.intel.com/ ↩︎

  3. https://lore.kernel.org/lkml/ZVE3shwiRbUQyAqs@mtj.duckdns.org/T/ ↩︎

September 04, 2024 12:00 AM

August 29, 2024

Brian Kardell

What makes it exciting?

What makes it exciting?

Some thoughts on the things that we get excited about, or don't - and why.

Hey, check this out...

That's Igalia's website running in Google Chrome in kiosk mode. In kiosk mode there is no surrounding "chrome" and so no rendered controls. It has none of that stuff that we normally associate with "browsers" (tabs, back/forward/refresh buttons, a URL bar, etc)

In fact, while I say that that's Google Chrome running in kiosk mode, it could just as well be embedded browser, like WPEWebKit which has no built in controls. Is it?

I'm not telling 😏.

But, it doesn't really matter: I'm just using it to set up the question of whether you feel like that is a browser? And perhaps whether WPEWebKit excites you as a browser? There's no "right" answer.

Ok, hold that thought, and look at this...

This is the webkit.org site running in WebKit mini-browser. It isn't Safari, it is just... er... WebKit. Is it a browser?

On the one hand, it's right there in the name, right? It's a Mini...browser. And, it does have the most important controls.

Still, you can't download it as a finished product, you have to build it. And no one thinks you should use this as your daily browser. Does it capture your attention as a browser?

Probably not?

Ok, now what about this:

That's ladybird. Yes, that's exciting, right?

I mean, I totally agree...

But the reason it's exciting has almost nothing to do with it being a browser. The Ladybird browser is hardly more than the mini-browser, in fact. It's purpose it the same and how you get it is the same. What makes it exciting is that it's based on a novel engine.

That said, excitement isn't currently (or probably for the foreseeable future) of the form "because we can actually use that browser day to day". Unlike in the mini-browsers of the others, there is a lot left to do here and the long tail and tricky bits of the web that are needed are... well, a lot, and tricky. They also have this pesky issue where the more people use your browser, the more bugs and unwritten rules you find you have to manage, all without breaking something else. So, Ladybird's got a long way to get to a place where you can practically use it -- but it is exciting.

Anyway, I asked some people a while back if they were as excited about Servo and they said "well, but it's not a browser".

Batman, thinking...
Hmmmmm...

And, indeed, recently someone from the Servo community created a GitHub repo which will be a browser based on Servo (named Verso - great name!) and the crowd seemingly did go wild on that (double the points of almost any other Servo related post on the orange website), despite the fact that at the time, it was effectively still a mostly-not-function repo.

So, anyway... I just wanted to step back and say...

Are you not entertained?

That's the Servo mini-browser running 3 tabs. And you know how I got it? I went to servo.org, downloaded it, installed it and launched it. I can currently use keyboard commands to launch and close tabs. It's also open source, so anyone could in theory have a better downstream one (maybe Verso?) or help grow a browser in Servo itself (more like Firefox, I guess).

So... Is Servo a browser?

I mean... no. But also... yes and it's super exciting.

You should be excited about it, I think.

You can help support it with some funding through GitHub or Open Collective (wealthy donors and business support is welcome too :)). How to choose and prioritize the work funded by donations is decided through collective discussions on the public monthly calls of the Servo Technical Steering Committee.

August 29, 2024 04:00 AM

August 28, 2024

Jani Hautakangas

Bringing WebKit back to Android: Progress and Perspectives

In my previous blog post, I delved into the technical aspects of reintroducing WebKit to the Android platform. It was an exciting journey, filled with the challenges and triumphs that come with working on a project as ambitious as WPE-Android. However, I realize that the technical depth of that post may have left some readers seeking more context. Today, I want to take a step back and offer a broader view of what this project is all about—why we’re doing it, how it builds on the WPEWebKit engine, and the progress we’ve made so far.

The Vision: Reviving WebKit on Android #

WebKit has a storied history in the world of web browsers, serving as the backbone for Safari, Epiphany, and many embedded browsers. However, over time, Android’s landscape has shifted toward Blink/Chromium, the engine behind Chrome. While Blink and Chromium have undoubtedly shaped the modern web, there are compelling reasons to bring WebKit back to Android.

WPE-Android is an effort to reintroduce WebKit into the Android ecosystem as a modern, efficient, and secure browser engine. Our goal is to provide developers with more options—whether they’re building full-fledged browsers, integrating web views into native apps, or exploring innovative applications in IoT and embedded systems. By leveraging WebKit’s unique strengths, we’re opening new doors for creativity and innovation on the Android platform.

Why WPEWebKit? #

At the heart of WPE-Android is WPEWebKit, a streamlined version of the WebKit engine specifically optimized for embedded systems. Unlike its desktop counterpart, WPEWebKit is designed to be lightweight, efficient, and highly adaptable to various hardware environments. This makes it an ideal foundation for bringing WebKit back to Android.

The decision to base WPE-Android on WPEWebKit is strategic. WPEWebKit is not only performant but also backed by a strong community of developers and organizations dedicated to its continuous improvement. This community-driven approach ensures that WPE-Android benefits from a robust, well-maintained codebase, with contributions from experts around the world.

Building on a Strong Foundation #

Since the inception of WPE-Android, our focus has been on making WebKit a viable option for Android developers. This involves more than just getting the engine to run on Android—it’s about ensuring that it’s stable, integrates seamlessly with Android’s unique features, and offers a developer-friendly experience.

A significant part of our work has involved optimizing the interaction between WPEWebKit and Android’s graphics stack. As part of that, we decided to focus on Android API level 30 and higher to keep the prototyping phase faster and simpler. Our efforts have aimed at achieving smooth and consistent performance, ensuring that WPE-Android can meet the needs of modern Android applications.

We are building a foundation to run instrumentation tests in CI to ensure that we don’t regress and that we get consistent results that match Android’s system WebView APIs. We continue adding more APIs that are similar to Android System WebView offerings and provide similar results.

Additionally, we’ve focused on enhancing the integration of WPE-Android with Android-specific features. This includes improving support for touch input and dialogs, refining the way web views are handled within native Android applications, and ensuring compatibility with the Android development environment. These enhancements make WPE-Android a natural fit for developers who are already familiar with the Android platform.

What’s new #

Most of the changes are under the hood improvements. The task that required the most effort was upgrading and rebasing our patches on top of Cerbero. After we upgraded to WPE WebKit 2.44.1, we required a more recent GLib version provided by the newer Cerbero version. Along with the upgrade, we managed to refactor and squash many of the patches that we had on top of Cerbero. We went from 175 patches down to 66, which will simplify the next upgrade.

Here’s a list of the most notable changes since the last update:

  • Upgraded to WPE WebKit 2.44.1.
  • Upgraded Cerbero to version 1.24.2.
  • Upgraded Android NDK to version r26d.
  • Migrated from libsoup2 to libsoup3 for HTTP/2 support.
  • Support for proper device scale ratio according to Android’s DisplayMetrics. This takes into account the screen size and pixel density, automatically adapting rendered content to show with appropriate dimensions on all devices.
  • Support for JS dialogs (Alert, Confirm, Prompt). Integrates Android dialogs with JavaScript alert(), confirm(), and prompt() prompts. Also provides an option to build custom native dialogs for these prompts.
  • Instrumentation tests for recently added features and a CI pipeline for running them.
  • API to receive HTTP errors. WPEViewClient interface onReceivedHttpError to catch HTTP error codes >= 400.
  • API to evaluate JavaScript. Provides the WPEView method evaluateJavascript to inject and evaluate JavaScript code on a loaded page.

Demos #

Dialog prompts #

The demo shows the default WPEView alert() prompt integration on the left side. On the right side, an application using WPEView has overridden the onJsAlert method from the WPEChromeClient interface and provides a custom native alert dialog for the JavaScript alert() prompt. The custom dialog is constructed using Android’s AlertDialog.Builder factory. Similar customization can be applied to JavaScript confirm() and prompt() prompts by overriding the onJsConfirm and onJsPrompt methods from the WPEChromeClient interface.

Default WPEView alert dialog
Custom WPEView alert dialog

Evaluate javascript #

The demo shows how to inject JavaScript and call functions on a loaded page from Kotlin code.

HTML and JavaScript:

<script>
function showName(message) {
document.getElementById('name').innerHTML=message;
}
</script>
<center>
<br><br><br><br>
<h1>WPEView</h1>
<p>Evaluate javascript</p>"
<br><br><br><br>
<h2>What's your name?</h2>
<h1 id="name"></h1>
</center>

Kotlin/WPEView code:

binding.toolbarButton.setOnClickListener {
webview.evaluateJavascript("showName(\"" + binding.toolbarEditText.text + "\")",null)
binding.toolbarEditText.setText("")
}

Device scale factor #

Android devices come with a variety of screen sizes, resolutions, and screen densities (pixels per inch, also known as ppi). In order for the UI to look consistent and good across all different devices, the device scale factor needs to be applied to the UI. Screen density can be fetched via the Android DisplayMetrics API, and in WPE WebKit, this corresponds to the device scale factor that can be set using wpe_view_backend_dispatch_set_device_scale_factor. Previously, in WPE-Android, we had hardcoded that value to 2.0, but now we are using proper metrics specific to each device.

Below are some screenshots from before and after applying the proper device scale. I’m using a Google Pixel 7 device, which has a density value of 2.75.

Old hardcoded device scale factor 2.0
Device scale factor from DisplayMetrics density

Looking Forward #

Our goal is to make WPE-Android even more accessible and usable for the broader Android development community. This involves ongoing performance optimizations, expanding device compatibility, and potentially providing more resources like documentation, example projects, and developer tools to ease the adoption of WPE-Android.

We believe that by offering WebKit as a viable option on Android, we’re contributing to a more diverse and innovative web ecosystem. WPE-Android is not just about bringing back a familiar engine—it’s about giving developers the tools they need to create fast, secure, and beautiful web experiences on Android devices.

Conclusion #

The journey of bringing WebKit back to Android has been both challenging and rewarding so far. By building on the strong foundation of WPEWebKit, we’re crafting a tool that empowers developers to push the boundaries of what’s possible with web technologies on Android. The progress we’ve made so far is just the beginning, and I’m excited to see how the project will continue to evolve.

If you’re interested in learning more or getting involved, you can find all the details on the WPE-Android GitHub page.

August 28, 2024 12:00 AM

August 26, 2024

Sergio Villar

A New Way to Browse: Eye Tracking Comes to Wolvic!

We’re thrilled to share some exciting news with you. Wolvic is about to transform how you interact with the web in a VR environment with the introduction of eye tracking support! Starting with the just released v1.7.0 release on the Gecko backend and the highly anticipated v1.0 release on the Chromium backend, you’ll be able to control the browser pointer just by looking at what you want to interact with. While this feature is still being refined, it’s a fantastic start, and we can’t wait for you to try it out.

August 26, 2024 06:07 AM

August 25, 2024

Andy Wingo

whippet update: faster evacuation, eager sweeping of empty blocks

Good evening. Tonight, notes on things I have learned recently while hacking on the Whippet GC library.

service update

For some time now, the name Whippet has referred to three things. Firstly, it is the project as a whole, consisting of an include-only garbage collection library containing a compile-time configurable choice of specific collector implementations. Also, it is the name of a specific Immix-derived collector. Finally, it is the name of a specific space within that collector, in which objects are mostly marked in place but can be evacuated if appropriate.

Well, naming being one of the two hard problems of computer science, I can only ask for forgiveness and understanding. I have started fixing this situation with the third component, renaming the whippet space to the nofl space. Nofl stands for no-free-list, indicating that it’s a (mostly) mark space but which does bump-pointer allocation instead of using freelists. Also, it stands for novel, in the sense that as far as I can tell, it is a design that hasn’t been tried yet.

unreliable evacuation

Following Immix, the nofl space has always had optimistic evacuation. It prefers to mark objects in place, but if fragmentation gets too high, it will try to defragment by evacuating sparse blocks into a small set of empty blocks reserved for this purpose. If the evacuation reserve fills up, nofl will dynamically switch to marking in place.

My previous implementation was a bit daft: some subset of blocks would get marked as being evacuation targets, and evacuation would allocate into those blocks in ascending address order. Parallel GC threads would share a single global atomically-updated evacuation allocation pointer. As you can imagine, this was a serialization bottleneck; I initially thought it wouldn’t be so important but for some workloads it is.

I had chosen this strategy to maximize utilization of the evacuation reserve; if you had 8 GC workers, each allocating into their own block, their blocks won’t all be full at the end of GC; that would waste space.

But reliability turns out to be unimportant. It’s more important to let parallel GC threads do their thing without synchronization, to the extent possible.

Also, this serialized allocation discipline imposed some complexity on the nofl space implementation; the evacuation allocator was not like the “normal” allocator. With worker-local allocation buffers, these two allocators are now essentially the same. (They differ in that the normal allocator interleaves lazy sweeping with allocation, and can allocate into blocks with survivors from the last GC instead of requiring empty blocks.)

eager sweeping

Another initial bad idea I had was to lean too much on lazy sweeping as a design principle. The idea was that deferring sweeping work until just before an allocator would write to a block would minimize cache overhead (you page in a block right when you will use it) and provide for workload-appropriate levels of parallelism (multiple mutator threads naturally parallelize sweeping).

Lazy sweeping was very annoying when it came to discovery of empty blocks. Empty blocks are precious: they can be returned to the OS if needed, they are useful for evacuation, and they have nice allocation properties, in that you can just do bump-pointer from beginning to end.

Nofl was discovering empty blocks just in time, from the allocator. If the allocator acquired a new block and found that it was empty, it would return it to a special list of empty blocks. Only if all sweepable pages were exhausted would an allocator use an empty block. But to prevent an allocator from pausing forever, there was a limit to the number of empty swept blocks that would be returned to the collector (10, as it happens); an 11th empty swept block would be given to a mutator for allocation. And so on and so on. Complicated, and you only know the number of empty blocks yielded by the last collection when the whole next allocation cycle has finished.

The obvious solution is some kind of separate mark on blocks, in addition to a mark on objects. I didn’t do it initially out of fear of overhead; marking is a fast path. The implementation I ended up making was a packed bitvector, with one bit per 64 kB block, at the beginning of each 4 MB slab of blocks. The beginning blocks are for metadata like this. For reasons, I don’t have the space for full bytes. When marking an object, if I see that a block’s mark is unset, I do an atomic_fetch_or_explicit on the byte with memory_order_relaxed ordering. In this way I only do the atomic op very rarely. It seems that on ARMv8.1 there is actually an instruction to do atomic bit setting; everywhere else it’s a garbage compare-and-swap thing, but on my x64 machine it’s fine.

Then after collection, during the pause, if I see a block is unmarked, I move it directly to the empty set instead of sweeping it. (I could probably do this concurrently outside the pause, but that would be for another day.)

And the overhead? Negative! Negative, in the sense that because I don’t have to sweep empty blocks, and that (for some workloads) collection can produce a significant-enough fraction of empty blocks, I actually see speedups with this strategy, relative to lazy sweeping. It also simplifies the allocator (no need for that return-the-11th-block logic).

The only wrinkle is as regards generational collection: nofl currently uses the sticky mark bit algorithm, which has to be applied also to block marks. Subtle, but not complicated.

fin

Next up is growing and shrinking the nofl-using Whippet collector (which might need another name), using the membalancer algorithm, and then I think I will be ready to move on to getting Whippet into Guile. Until then, happy hacking!

by Andy Wingo at August 25, 2024 08:29 PM

August 19, 2024

Andy Wingo

javascript weakmaps should be iterable

Good evening. Tonight, a brief position statement: it is a mistake for JavaScript’s WeakMap to not be iterable, and we should fix it.

story time

A WeakMap associates a key with a value, as long as the key is otherwise reachable in a program. (It is an ephemeron table.)

When WeakMap was added to JavaScript, back in the ES6 times, some implementors thought that it could be reasonable to implement weak maps not as a data structure in its own right, but rather as a kind of property on each object. Under the hood, adding an key→value association to a map M would set key[M] = value. GC would be free to notice dead maps and remove their associations in live objects.

If you implement weak maps like this, or are open to the idea of such an implementation, then you can’t rely on the associations being enumerable from the map itself, as they are instead spread out among all the key objects. So, ES6 specified WeakMap as not being iterable; given a map, you can’t know what’s in it.

As with many things GC-related, non-iterability of weak maps then gained a kind of legendary status: the lore states that non-iterability preserves some key flexibility for JS implementations, and therefore it is good, and you just have to accept it and move on.

dynamics

Time passed, and two things happened.

One was that this distributed WeakMap implementation strategy did not pan out; everyone ended up implementing weak maps as their own kind of object, and people use an algorithm like the one Steve Fink described a couple years ago to compute the map×key⇒value conjunction. The main original motivation for non-iterability was no longer valid.

The second development was WeakRef and FinalizationRegistry, which expose some details of reachability as viewed by the garbage collector to user JS code. With WeakRef (and WeakMap), you can build an iterable WeakMap.

(Full disclosure: I did work on ES6 and had a hand in FinalizationRegistry but don’t do JS language work currently.)

Thing is, your iterable WeakMap is strictly inferior to what the browser can provide: its implementation is extraordinarily gnarly, shipped over the wire instead of already in the browser, uses more memory, is single-threaded and high-latency (because FinalizationRegistry), and non-standard. What if instead as language engineers we just did our jobs and standardized iterability, as we do with literally every other collection in the JS standard?

Just this morning I wrote yet another iterable WeakSet (which has all the same concerns as WeakMap), and while it’s sufficient for my needs, it’s not good (lacking prompt cleanup of dead entries), and by construction can’t be great (because it has to be redundantly implemented on top of WeakSet instead of being there already).

I am sympathetic to deferring language specification decisions to allow the implementation space to be explored, but when the exploration is done and the dust has settled, we shouldn’t hesitate to pick a winner: JS weak maps and sets should be iterable. Godspeed, brave TC39 souls; should you take up this mantle, you are doing the Lord’s work!

Thanks to Philip Chimento for notes on the timeline and Joyee Cheung for notes on the iterable WeakMap implementation in the WeakRef spec. All errors mine, of course!

by Andy Wingo at August 19, 2024 08:13 PM

Emmanuele Bassi

On Vala

It seems I raised a bit of a stink on Twitter last week:

Of course, and with reason, I’ve been called out on this by various people. Luckily, it was on Twitter, so we haven’t seen articles on Slashdot and Phoronix and LWN with headlines like “GNOME developer says Vala is dead and will be removed from all servers for all eternity and you all suck”. At least, I’ve only seen a bunch of comments on Reddit about this, but nobody cares about that particular cesspool of humanity.

Sadly, 140 characters do not leave any room for nuance, so maybe I should probably clarify what I wrote on a venue with no character limit.

First of all, I’d like to apologise to people that felt I was attacking them or their technical choices: it was not my intention, but see above, re: character count. I may have only about 1000 followers on Twitter, but it seems that the network effect is still a bit greater than that, so I should be careful when wording opinions. I’d like to point out that it’s my private Twitter account, and you can only get to what it says if you follow me, or if you follow people who follow me and decide to retweet what I write.

My PSA was intended as a reflection on the state of Vala, and its impact on the GNOME ecosystem in terms of newcomers, from the perspective of a person that used Vala for his own personal projects; recommended Vala to newcomers; and has to deal with the various build issues that arise in GNOME because something broke in Vala or in projects using Vala. If you’re using Vala outside of GNOME, you have two options: either ignore all I’m saying, as it does not really apply to your case; or do a bit of soul searching, and see if what I wrote does indeed apply to you.

First of all, I’d like to qualify my assertion that Vala is a “dead language”. Of course people see activity in the Git repository, see the recent commits and think “the project is still alive”. Recent commits do not tell a complete story.

Let’s look at the project history for the past 10 cycles (roughly 2.5 years). These are the commits for every cycle, broken up in two values: one for the full repository, the other one for the whole repository except the vapi directory, which contains the VAPI files for language bindings:

Commits

Aside from the latest cycle, Vala has seen very little activity; the project itself, if we exclude binding updates, has seen less than 100 commits for every cycle — some times even far less. The latest cycle is a bit of an outlier, but we can notice a pattern of very little work for two/three cycles, followed by a spike. If we look at the currently in progress cycle, we can already see that the number of commits has decreased back to 55/42, as of this morning.

Commits

Number of commits is just a metric, though; more important is the number of contributors. After all, small, incremental changes may be a good thing in a language — though, spoiler alert: they are usually an indication of a series of larger issues, and we’ll come to that point later.

These are the number of developers over the same range of cycles, again split between committers to the full repository and to the full repository minus the vapi directory:

Developers

As you can see, the number of authors of changes is mostly stable, but still low. If we have few people that actively commit to the repository it means we have few people that can review a patch. It means patches linger longer and longer, while reviewers go through their queues; it means that contributors get discouraged; and, since nobody is paid to work full time on Vala, it means that any interruption caused by paid jobs will be a bottleneck on the project itself.

These concerns are not unique of a programming language: they exist for every volunteer-driven free and open source project. Programming languages, though, like core libraries, are problematic because any bottleneck causes ripple effects. You can take any stalled project you depend on, and vendor it into your own, but if that happens to the programming language you’re using, then you’re pretty much screwed.

For these reasons, we should also look at how well-distributed is the workload in Vala, i.e. which percentage of the work is done by the authors of those commits; the results are not encouraging. Over that range of cycles, Only two developers routinely crossed the 5% of commits:

  • Rico Tzschichholz
  • Jürg Billeter

And Rico has been the only one to consistently author >50% of the commits. This means there’s only one person dealing with the project on a day to day basis.

As the maintainer of a project who basically had to do all the work, I cannot even begin to tell you how soul-crushing that can become. You get burned out, and you feel responsible for everyone using your code, and then you get burned out some more. I honestly don’t want Rico to burn out, and you shouldn’t, either.

So, let’s go into unfair territory. These are the commits for Rust — the compiler and standard library:

Rust

These are the commits for Go — the compiler and base library:

Go

These are the commits for Vala — both compiler and bindings:

Vala

These are the number of commits over the past year. Both languages are younger than Vala, have more tools than Vala, and are more used than Vala. Of course, it’s completely unfair to compare them, but those numbers should give you a sense of scale, of what is the current high bar for a successful programming language these days. Vala is a niche language, after all; it’s heavily piggy-backing on the GNOME community because it transpiles to C and needs a standard library and an ecosystem like the one GNOME provides. I never expected Vala to rise to the level of mindshare that Go and Rust currently occupy.

Nevertheless, we need to draw some conclusions about the current state of Vala — starting from this thread, perhaps, as it best encapsulates the issues the project is facing.

Vala, as a project, is limping along. There aren’t enough developers to actively effect change on the project; there aren’t enough developers to work on ancillary tooling — like build system integration, debugging and profiling tools, documentation. Saying that “Vala compiles to C so you can use tools meant for C” is comically missing the point, and it’s effectively like saying that “C compiles to binary code, so you can disassemble a program if you want to debug it”. Being able to inspect the language using tools native to the language is a powerful thing; if you have to do the name mangling in your head in order to set a breakpoint in GDB you are elevating the barrier of contributions way above the head of many newcomers.

Being able to effect change means also being able to introduce change effectively and without fear. This means things like continuous integration and a full test suite heavily geared towards regression testing. The test suite in Vala is made of 210 units, for a total of 5000 lines of code; the code base of Vala (vala AST, codegen, C code emitter, and the compiler) is nearly 75 thousand lines of code. There is no continuous integration, outside of the one that GNOME Continuous performs when building Vala, or the one GNOME developers perform when using jhbuild. Regressions are found after days or weeks, because developers of projects using Vala update their compiler and suddenly their projects cease to build.

I don’t want to minimise the enormous amount of work that every Vala contributor brought to the project; they are heroes, all of them, and they deserve as much credit and praise as we can give. The idea of a project-oriented, community-oriented programming language has been vindicated many times over, in the past 5 years.

If I scared you, or incensed you, then you can still blame me, and my lack of tact. You can still call me an asshole, and you can think that I’m completely uncool. What I do hope, though, is that this blog post pushes you into action. Either to contribute to Vala, or to re-new your commitment to it, so that we can look at my words in 5 years and say “boy, was Emmanuele wrong”; or to look at alternatives, and explore new venues in order to make GNOME (and the larger free software ecosystem) better.

by ebassi at August 19, 2024 10:53 AM

August 10, 2024

Abhijeet Kandalkar

Exploring sandboxing in CEF and CEFSharp for Windows platform

Sandboxing is a critical feature for ensuring the security and stability of applications that embed web content. In this blog, we will delve into how sandboxing is achieved in the Chromium Embedded Framework (CEF). All analysis in this blog is performed on Windows operating system.

CEFClient and CEFSharp

The base CEF framework includes support for the C and C++ programming languages. Also it provides a test applications to demonstrate its functionality. CEFClient, a one of the C++ test application, serves this purpose.

CEF provides CAPI bindings to enable application development across various programming languages. For integrating the CEF framework with .NET, you should explore CEFSharp. It is a C# application that leverages the CEF open source project, allowing developers to create applications in C#.

CEFSharp(C#)
            \  CAPI 
              -------> CEF(C++) ------> Chromium(C++)
            /
CEFClient(C++)

What is the sandbox?

As mentioned in documentation of sandbox in Chromium repository,

The sandbox is a C++ library that allows the creation of sandboxed processes — processes that execute within a very restrictive environment. The only resources sandboxed processes can freely use are CPU cycles and memory. For example, sandboxes processes cannot write to disk or display their own windows. What exactly they can do is controlled by an explicit policy. Chromium renderers are sandboxed processes.

Understanding Sandboxing in CEF

Since CEF uses Chromium internally, we decided to investigate how sandboxing is implemented by examining Chromium’s source code. Our findings revealed that the sandbox is designed to be versatile, with no hard dependencies on the Chromium browser itself, making it usable with other applications.

CEF prepares a static library using the existing sandbox implementation in the Chromium repository and links it to the main application. This process enables the sandbox functionality in CEF applications. For more detailed information, refer to the sandbox_win.cc file in the CEF repository.

if (is_win) {
  static_library("cef_sandbox") {
    sources = [ "libcef_dll/sandbox/sandbox_win.cc" ]
    include_dirs = [ "." ]
    deps = [ "libcef/features", "//sandbox" ]
  }
}

Implementing Sandboxing in CEFClient

CEFClient achieves a sandboxing by linking to cef_sandbox library

  executable("cefclient") {
      deps += [
        ":cef_sandbox",
      ]

In the CEFClient application, we need to create a sandbox object and pass it to CEF, which then forwards it internally to Chromium to configure the sandbox.

  CefScopedSandboxInfo scoped_sandbox;
  sandbox_info = scoped_sandbox.sandbox_info();
  ...
  // Initialize the CEF browser process.
  if (!context->Initialize(main_args, settings, app, sandbox_info)) {
    return 1;
  }

The following call stack illustrates how CEFClient code calls into CEF, ultimately transitioning into the Chromium environment to configure the sandbox.

The CEF test application (CEFClient) successfully launches a sandboxed CEF, which is verified by loading the chrome://sandbox URL.


Why C# Applications Cannot Use Sandboxed CEF ?

As we have seen above, for C++ applications it is possible to use sandboxing but C# application can not use sandboxed CEF. But why ?

To achieve effective sandboxing in your CEF application, ensure that you link the cef_sandbox.lib static library. This linking must be done specifically to the main process of the application. While C++ applications can directly link static libraries without issue, C# applications do not support static linking of libraries.

Static library means that the library is going to be merged with application. This concept doesn't exist in .NET as .NET supports only dynamic link libraries

If you attempt to create a DLL for the cef_sandbox code and link it dynamically to your application, the Chromium sandbox code will detect this and fail with the error SBOX_ERROR_INVALID_LINK_STATE. Therefore, static linking of the sandboxing code to the main application is mandatory for sandbox functionality.

  // Attempt to start a sandboxed process from sandbox code hosted not within
  // the main EXE. This is an unsupported operation by the sandbox.
  SBOX_ERROR_INVALID_LINK_STATE = 64,

Conclusion

While sandboxing is a powerful feature available for C++ applications using CEF, it is not feasible for C# applications due to the inherent differences in how these languages operate within the system. The managed environment of C# introduces limitations that make it impossible to integrate the same level of sandboxing as in C++. Consequently, C# applications cannot leverage sandboxed CEF using the existing setup.


Future study

To enable sandboxing in C# applications, we may need to modify the .NET runtime (which is written in C++) to support sandboxing. This would involve linking the cef_sandbox library directly to the runtime. The .NET Core runtime includes several executables/libraries that act as the main entry point for code execution, typically referred to as the “host”. These host executables can be customized as needed.

For example, when you run the apphost executable, it initializes the .NET Core runtime and starts your application. Creating a sandbox object in the apphost and passing it to the application might help address sandboxing for C# applications. However, this approach has not been tested, so its effectiveness cannot be confirmed without further exploration and modification of the C# language runtime.


References

August 10, 2024 06:30 PM

August 08, 2024

Gyuyoung Kim

Chrome iOS Browser on Blink

At the beginning of this year, Apple allowed third-party browser engines, such as Gecko and Blink, to be used on iOS and iPadOS. The Chromium community has been developing Chrome iOS based on Blink as an experimental project. This post provides an overview of the project and examines its progress during the first half of 2024.

Structure of Chrome iOS (w/blink)

First, let’s look at the overall structure of Chrome iOS. This simple diagram illustrates the structure of Chrome iOS.

The UI components are placed in the //ios/chrome directory, which functions similarly to the //chrome layer in other implementations. The //ios/web layer also provides APIs for its initialization, content navigation, presenting UI callbacks, saving or restoring navigation sessions and browser data, and more. The //ios/web/public defines the public APIs, while various other directories within //ios/web implement them. The implementation of these public APIs invokes WebKit APIs to provide the necessary functionalities.

For Chrome iOS based on Blink, the developers decided to reuse the existing iOS Chrome UI implementation last year. Thus, one of the main developments of Chrome iOS based on Blink happened in the //ios/web/content directory which is the rectangle filled by a green color, created to implement the public APIs using Blink. As you can see in the below diagram, they added the content directory to //ios/web directory and implemented the public APIs using Blink’s content APIs. Notably, they’ve introduced a ContentWebState as a prototype implementation of WebState to replace the class in //ios/web/content/web_state. However, as shown in the diagram, the directory currently only includes in five components: web_state, navigation, UI, init, and js_messaging. Therefore, more components need to be implemented for Chrome iOS on Blink.

Igalia Contributions

Igalia has been contributing to this project since the project was made public by the community. We’ve worked mainly on UI related components e.g. file and color chooser, context menu, select list popup and integration with Chrome IOS UI. The pictures below are the screen captures of the chooser implementations that we contributed.



We also helped with the features of multimedia such as the video screen capture, hardware encoding/decoding, and audio recording.

We’ve worked on a few testing frameworks (e.g. unit tests, browser and web tests). Also, we’ve filtered out unsupported tests and failed tests. Specifically, we’ve implemented the infrastructure to run the web tests on the simulator. For your information, the test coverage of the web test on iOS Blink was about 72.2% and the pass ratio among the working tests was 91.83% at the beginning of this year. Then we’ve been maintaining the bot for Chrome iOS on Blink to keep the build and testing because keeping build and running tests are crucial for Blink to bring up to Chrome iOS.

Besides we supported the remote debugging with DevTools on Blink for iOS. Now developers are able to remotely use DevTools in a host machine (e.g. Mac) and inspect Chrome or Content Shell for development.

Moreover, we’ve worked on graphics stuff related to compositing and rendering. For instance, we supported the metal on ANGLE as well as fixed bugs in the graphics layers.

You can find the detailed contributions patch list here.

The major changes for H1 2024

Now, let’s review the major changes during the first half of this year. Firstly, the minimum iOS SDK version was bumped up to 17.4 which supports the BrowserEngineKit library. And the [browser/unit] tests began to run on the ios-blink bot with the 17.4 SDK.

By reusing the existing iOS Chrome UI, ContentWebState was introduced as a prototype implementation of WebState. During H1 2024, new methods were added further or previously empty methods were implemented. For example, GetVirtualKeyboardHeight, OnVisibilityChanged, DidFinishLoad, DidFailLoad methods were implemented.

BrowserEngineKit APIs that were announced by Apple to support third-party browser engines on iOS and iPadOS have been applied to JIT, GPU, network, and content process creation.

As mentioned above, Igalia implemented a color chooser, a file chooser, and a context menu during the period.

Lastly, the package size was reduced by removing duplicated resources.

Remaining Tasks

We’ve briefly looked at the current status of the project so far, but many functionalities still need to be supported. For example, regarding UI features, functionalities such as printing preview, download, text selection, request desktop site, zoom text, translate, find in page, and touch events are not yet implemented or are not functioning correctly
Moreover, there are numerous failing or skipped tests in unit tests, browser tests, and web tests. Ensuring that these tests are enabled and passing the test should also be a key focus moving forward.

Conclusion

The Blink-based port of Chrome iOS is a large and laborious project. Igalia has made significant contributions to this project and while it is still in an early stage, more features, infrastructure and tools need to be ported to Blink. Anyhow, we believe that we are on the right track for eventually replacing WebKit by Blink on Chromium related products for iOS.

by gyuyoung at August 08, 2024 02:35 AM

August 02, 2024

Pawel Lampe

Nuts and bolts of Canvas2D - globalCompositeOperation and shadows.

In recent months I’ve been privileged to work on the transition from Cairo to Skia for 2D graphics rendering in WPE and GTK WebKit ports. Big reworks like this are a great opportunity to explore all kinds of graphics-related APIs. One of the broader APIs in this area is the CanvasRenderingContext2D API from HTML Canvas. It’s a fairly straightforward yet extensive API allowing one to perform all kinds of drawing operations on the canvas. The comprehensiveness, however, comes at the expense of some complex situations the web engine needs to handle under the hood. One such situation was the issue I was working on recently regarding broken test cases involving drawing shadows when using Skia in WebKit. What makes it complex is that some problems are still visible due to multiple web engine layers being involved, but despite that I was eventually able to address the broken test cases.

In the next few sections I’m going to introduce the parts of the API that are involved in the problems while in the sections closer to the end I will gradually showcase the problems and explore potential paths toward fixing the entire situation.

Drawing on Canvas2D with globalCompositeOperation #

The Canvas2D API offers multiple methods for drawing various primitives such as rectangles, arcs, text etc. On top of that, it allows one to control compositing and clipping using the globalCompositeOperation property. The idea is very simple - the user of an API can change the property using one of the predefined compositing operations and immediately after that, all new drawing operations will behave according to the rules the particular compositing operation specifies:

canvas2DContext.fillRect(...); // Draws rect on top of existing content (default).
canvas2DContext.globalCompositeOperation = 'destination-atop';
canvas2DContext.fillRect(...); // Draws rect according to 'destination-atop'.

There are many compositing operations, but I’ll be focusing mostly on the ones having source and destination in their names. The source and destination terms refer to the new content to be drawn and the existing (already-drawn) content respectively.

The images below present some examples of compositing operations in action:

Compositing operations in action.

Drawing on Canvas2D with shadows #

When drawing primitives using the Canvas2D API one can use shadow* properties to enable drawing of shadows along with any content that is being drawn. The usage is very simple - one has to alter at least one property such as e.g. shadowOffsetX to make the shadow visible:

canvas2DContext.shadowColor = "#0f0";
canvas2DContext.shadowOffsetX = 10;
// From now on, any draw call will have a green shadow attached.

the above combined with simple code to draw a circle produces a following effect:

Circle with shadow.

Shadows meet globalCompositeOperation #

Things are getting interesting once one starts thinking about how globalCompositeOperation may affect the way shadows are drawn. When I thought about it for the first time, I imagined at least 3 possibilities:

  • Shadow and shadow origin are both treated as one entity (shadow always below the origin) and thus are drawn together.
  • Shadow and shadow origin are combined and then drawn as a one entity.
  • Shadow and shadow origin are drawn separately - shadow first, then the content.

When I confronted the above with the drawing model and shadows specification, it turned out the last guess was the correct one. The specification basically says that the shadow should be computed first, then composited within the clipping region over the current canvas content, and finally, the shadow origin should be composited within the clipping region over the current canvas content (the original canvas content combined with shadow).

The above can be confirmed visually using few examples (generated using chromium browser v126.0.6478.126):

Shadows combined with compositing operation.

  • The source-over operation shows the drawing order - destination first, shadow second, and shadow origin third.
  • The destination-over operation shows the reversed drawing order - destination first, shadow second (below destination), and shadow origin third (below destination and shadow).
  • The source-atop operation is more tricky as it behaves like source-over but with clipping to the destination content - therefore, destination is drawn first, then clipping is set to destination, then the shadow is drawn, and finally the shadow origin is drawn.
  • The destination-atop operation is even more tricky as it behaves like destination-over yet with the clipping region always being different. That difference can be seen on the image below that presents intermediate states of canvas after each drawing step:
    Breakdown of destination-atop operation.
    • The initial state shows a canvas after drawing the destination on it.
    • The after drawing shadow state, shows a shadow drawn below the destination. In this case, the clipping is set to new content (shadow), and hence the part of destination that is not “atop” shadow is being clipped out.
    • The after drawing shadow origin state, shows the final state after drawing the shadow origin below the previous canvas content (new destination) that is at this point “a shadow combined with destination”. Similarly as in the previous step, the clipping is set to the new content (shadow origin), and hence any part of new destination that is not “atop” the shadow origin is being clipped out.

Discrepancies between browser engines #

Whenever one realizes the drawing of shadows with globalCompositeOperation in general may be tricky, then one must also consider that when it comes to particular browser engines, the things are even more tricky as virtually no graphics library provides an API that matches the Canvas2D API 1-to-1. This means that depending on the graphics library used, the browser engine must implement more or less integration parts here and there. For example, one can imagine that some graphics library may not have native support for shadows - that would mean the browser engine has to prepare shadows itself by e.g. drawing shadow origin (no matter how complex) on extra surface, changing color, blurring etc. so that it can be used as a whole once prepared.

Having said the above, one would expect that all the above aspects should be tested and implemented really well. After all, whenever the subject matter becomes complicated, extra care is required. It turns out, however, this is not necessarily the case when it comes to globalCompositeOperation and shadows. As for the testing part, there are very few tests (2d.shadow.composite*) in WPT (Web Platform Tests) covering the use cases described above. It’s also not much better for internal web engine test suites. As for implementations, there’s a substantial amount of discrepancy.

Simple examples #

To show exactly what’s the situation, the examples from section Shadows meet globalCompositeOperation can be used again. This time using browsers representing different web engines:

  • Chromium 126.0.6478.126 Shadows combined with compositing operation - Chromium.
  • Firefox 128.0 Shadows combined with compositing operation - Firefox.
  • Gnome Web (Epiphany) 45.0 (WebKit/Cairo) Shadows combined with compositing operation - Epiphany.
  • WPE MiniBrowser build from WebKit@098c58dd13bf40fc81971361162e21d05cb1f74a (WebKit/Skia) Shadows combined with compositing operation - WPE MiniBrowser.
  • Safari 17.1 (WebKit/Core Graphics) Shadows combined with compositing operation - Safari.
  • Servo release from 2024/07/04 Shadows combined with compositing operation - Servo.
  • Ladybird build from 2024/06/29 Shadows combined with compositing operation - Ladybird

First of all, it’s evident that experimental browsers such as servo and ladybird are falling behind the competition - servo doesn’t seem to support shadows at all, while ladybird doesn’t support anything other than drawing a rect filled with color.

Second, the non-experimental browsers are pretty stable in terms of covering most of the combinations presented above.

Finally, the most tricky combination above seems to be the one including destination-atop - in that case almost every mainstream browser renders different results:

  • Chromium is the only one rendering correctly.
  • Firefox and Epiphany are pretty close, but both are suffering from a similar glitch where the red part is covered by the part of destination that should be clipped out already.
  • WPE MiniBrowser and Safari are both rendering in correct order, but the clipping is wrong.

More sophisticated examples #

Until now, the discrepancies don’t seem to be very dramatic, and hence it’s time to present more sophisticated examples that are an extended version of the test case from the WebKit source tree:

  • Chromium 126.0.6478.126

Shadows combined with compositing operation - Chromium.

  • Firefox 128.0

Shadows combined with compositing operation - Firefox.

  • Gnome Web (Epiphany) 45.0 (WebKit/Cairo)

Shadows combined with compositing operation - Epiphany.

  • WPE MiniBrowser build from WebKit@098c58dd13bf40fc81971361162e21d05cb1f74a (WebKit/Skia)

Shadows combined with compositing operation - WPE MiniBrowser.

  • Safari 17.1 (WebKit/Core Graphics)

Shadows combined with compositing operation - Safari.

  • Servo release from 2024/07/04

Shadows combined with compositing operation - Servo.

  • Ladybird build from 2024/06/29

Shadows combined with compositing operation - Ladybird.

Other than destination-out, xor, and a few simple operations presented before, all the operations presented above pose serious problems to the majority of browsers. The only browser that is correct in all the cases (to the best of my understanding) is Chromium that is using rendering engine called blink which in turn uses the Skia library. One may wonder if perhaps it’s Skia that’s responsible for the Chromium success, but given the above results where e.g. WPE MiniBrowser uses Skia as well, it’s evident that the problems lay above the particular graphics library.

Looking at the operations and browsers that render incorrectly, it’s clearly visible that even small problems - with either ordering of draw calls or clipping - lead to spectacularly broken results. The pinnacle of misery is the source-out operation that is the most variable one across browsers. One has to admit, however, that WPE MiniBrowser is slightly closer to being correct than others.

Towards unification #

Fixing the above problems is a long journey. After all, every single web engine has to be fixed in its own, specific way. If the specification would be a problem - it would be the obvious way to start. However, as mentioned in the section Shadows meet globalCompositeOperation, the specification, is pretty clear on how drawing, shadows, and globalCompositeOperation come together. In such case, the next obvious place to start improving things is a WPT test suite.

What makes WPT outstanding is that it is a de facto standard cross-browser test suite for testing the web platform stack. Thus the test suite is developed as an open collaboration effort by developers from around the globe and hence is very broad in terms of specification coverage. What’s also important, the test results are actively evaluated against the popular browser engines and published under wpt.fyi, therefore putting some pressure on web engine developers to fix the problems so that they keep up with competition.

Granted the above, extending WPE test suite by adding test cases to cover globalCompositeOperation operations combined with shadows is the reasonable first step towards the unification of browser implementations. This can be done either by directly contributing tests to WPT, or by creating an issue. Personally, I’ve decided to file an issue first (WPT#46544) and to add tests once I have some time. I haven’t contributed to WPT yet, but I’m excited to work with it soon. Once I land my first pull request, I’ll start fixing WebKit and I won’t hesitate to post some updates on this blog.

August 02, 2024 12:00 AM

July 26, 2024

Loïc Le Page

FFmpeg 101

A high-level architecture overview to start with FFmpeg.

FFmpeg package content #

FFmpeg is composed of a suite of tools and libraries.

FFmpeg tools #

The tools can be used to encode/decode/transcode a multitude of different audio and video formats, and to stream the encoded media over networks.

  • ffmpeg: a command line tool to convert multimedia files between formats
  • ffplay: a simple mediaplayer based on SDL and the FFmpeg libraries
  • ffprobe: a simple multimedia stream analyzer

FFmpeg libraries #

The libraries can be used to integrate those same features into your own product.

  • libavformat: I/O and muxing/demuxing
  • libavcodec: encoding/decoding
  • libavfilter: graph-based filters for raw media
  • libavdevice: input/output devices
  • libavutil: common multimedia utilities
  • libswresample: audio resampling, samples format conversion and audio mixing
  • libswscale: color conversion and image scaling
  • libpostproc: video post-processing (deblocking/noise filters)

FFmpeg simple player #

A basic usage of FFmpeg is to demux a multimedia stream (obtained from a file or from the network) into its audio and video streams and then to decode those streams into raw audio and raw video data.

To manage the media streams, FFmpeg uses the following structures:

  • AVFormatContext: a high level structure providing sync, metadata and muxing for the streams
  • AVStream: a continuous stream (audio or video)
  • AVCodec: defines how data are encoded and decoded
  • AVPacket: encoded data in the stream
  • AVFrame: decoded data (raw video frame or raw audio samples)

The process used to demux and decode follows this logic:

basic processing

Here is the basic code needed to read an encoded multimedia stream from a file, analyze its content and demux the audio and video streams. Those features are provided by the libavformat library and it uses the AVFormatContext and AVStream structures to store the information.

// Allocate memory for the context structure
AVFormatContext* format_context = avformat_alloc_context();

// Open a multimedia file (like an mp4 file or any format recognized by FFmpeg)
avformat_open_input(&format_context, filename, NULL, NULL);
printf("File: %s, format: %s\n", filename, format_context->iformat->name);

// Analyze the file content and identify the streams within
avformat_find_stream_info(format_context, NULL);

// List the streams
for (unsigned int i = 0; i < format_context->nb_streams; ++i)
{
    AVStream* stream = format_context->streams[i];

    printf("---- Stream %02d\n", i);
    printf("  Time base: %d/%d\n", stream->time_base.num, stream->time_base.den);
    printf("  Framerate: %d/%d\n", stream->r_frame_rate.num, stream->r_frame_rate.den);
    printf("  Start time: %" PRId64 "\n", stream->start_time);
    printf("  Duration: %" PRId64 "\n", stream->duration);
    printf("  Type: %s\n", av_get_media_type_string(stream->codecpar->codec_type));

    uint32_t fourcc = stream->codecpar->codec_tag;
    printf("  FourCC: %c%c%c%c\n", fourcc & 0xff, (fourcc >> 8) & 0xff, (fourcc >> 16) & 0xff, (fourcc >> 24) & 0xff);
}

// Close the multimedia file and free the context structure
avformat_close_input(&format_context);

Once we’ve got the different streams from inside the multimedia file, we need to find specific codecs to decode the streams to raw audio and raw video data. All codecs are statically included in libavcodec. You can easily create your own codec by just creating an instance of the FFCodec structure and registering it as an extern const FFCodec in libavcodec/allcodecs.c, but this would be a different topic for another post.

To find the codec corresponding to the content of an AVStream, we can use the following code:

// Stream obtained from the AVFormatContext structure in the former streams listing loop
AVStream* stream = format_context->streams[i];

// Search for a compatible codec
const AVCodec* codec = avcodec_find_decoder(stream->codecpar->codec_id);
if (!codec)
{
    fprintf(stderr, "Unsupported codec\n");
    continue;
}
printf("  Codec: %s, bitrate: %" PRId64 "\n", codec->name, stream->codecpar->bit_rate);

if (codec->type == AVMEDIA_TYPE_VIDEO)
{
    printf("  Video resolution: %dx%d\n", stream->codecpar->width, stream->codecpar->height);
}
else if (codec->type == AVMEDIA_TYPE_AUDIO)
{
    printf("  Audio: %d channels, sample rate: %d Hz\n",
        stream->codecpar->ch_layout.nb_channels,
        stream->codecpar->sample_rate);
}

With the right codec and codec parameters extracted from the AVStream information, we can now allocate the AVCodecContext structure that will be used to decode the corresponding stream. It is important to remember the index of the stream we want to decode from the former streams list (format_context->streams) because this index will be used later to identify the demuxed packets extracted by the AVFormatContext.

In the following code we’re going to select the first video stream contained in the multimedia file.

// first_video_stream_index is determined during the streams listing in the former loop
int first_video_stream_index = ...;

AVStream* first_video_stream = format_context->streams[first_video_stream_index];
AVCodecParameters* first_video_stream_codec_params = first_video_stream->codecpar;
const AVCodec* first_video_stream_codec = avcodec_find_decoder(first_video_stream_codec_params->codec_id);

// Allocate memory for the decoding context structure
AVCodecContext* codec_context = avcodec_alloc_context3(first_video_stream_codec);

// Configure the decoder with the codec parameters
avcodec_parameters_to_context(codec_context, first_video_stream_codec_params);

// Open the decoder
avcodec_open2(codec_context, first_video_stream_codec, NULL);

Now that we have a running decoder, we can extract the demuxed packets using the AVFormatContext structure and decode them to raw video frames. For that we need 2 different structures:

  • AVPacket which contains the encoded packets extracted from the input multimedia file,
  • AVFrame which will contain the raw video frame after the AVCodecContext has decoded the former packets.
// Allocate memory for the encoded packet structure
AVPacket* packet = av_packet_alloc();

// Allocate memory for the decoded frame structure
AVFrame* frame = av_frame_alloc();

// Demux the next packet from the input multimedia file
while (av_read_frame(format_context, packet) >= 0)
{
    // The demuxed packet uses the stream index to identify the AVStream it is coming from
    printf("Packet received for stream %02d, pts: %" PRId64 "\n", packet->stream_index, packet->pts);

    // In our example we are only decoding the first video stream identified formerly by first_video_stream_index
    if (packet->stream_index == first_video_stream_index)
    {
        // Send the packet to the previsouly initialized decoder
        int res = avcodec_send_packet(codec_context, packet);
        if (res < 0)
        {
            fprintf(stderr, "Cannot send packet to the decoder: %s\n", av_err2str(res));
            break;
        }

        // The decoder (AVCodecContext) acts like a FIFO queue, we push the encoded packets on one end and we need to
        // poll the other end to fetch the decoded frames. The codec implementation may (or may not) use different
        // threads to perform the actual decoding.

        // Poll the running decoder to fetch all available decoded frames until now
        while (res >= 0)
        {
            // Fetch the next available decoded frame
            res = avcodec_receive_frame(codec_context, frame);
            if (res == AVERROR(EAGAIN) || res == AVERROR_EOF)
            {
                // No more decoded frame is available in the decoder output queue, go to next encoded packet
                break;
            }
            else if (res < 0)
            {
                fprintf(stderr, "Error while receiving a frame from the decoder: %s\n", av_err2str(res));
                goto end;
            }

            // Now the AVFrame structure contains a decoded raw video frame, we can process it further...
            printf("Frame %02" PRId64 ", type: %c, format: %d, pts: %03" PRId64 ", keyframe: %s\n",
                codec_context->frame_num, av_get_picture_type_char(frame->pict_type), frame->format, frame->pts,
                (frame->flags & AV_FRAME_FLAG_KEY) ? "true" : "false");

            // The AVFrame internal content is automatically unreffed and recycled during the next call to
            // avcodec_receive_frame(codec_context, frame)
        }
    }

    // Unref the packet internal content to recycle it for the next demuxed packet
    av_packet_unref(packet);
}

// Free the previously allocated memory for the different FFmpeg structures
end:
    av_packet_free(&packet);
    av_frame_free(&frame);
    avcodec_free_context(&codec_context);
    avformat_close_input(&format_context);

The way the former code is acting is resumed in the next diagram:

processing diagram

You can find the full code here.

To build the example you will need meson and ninja. If you have python and pip installed, you can install them very easily by calling pip3 install meson ninja. Then, once the example archive extracted to a ffmpeg-101 folder, go to this folder and call: meson setup build. It will automatically download the right version of FFmpeg if you don’t have it already installed on your system. Then call: ninja -C build to build the code and ./build/ffmpeg-101 sample.mp4 to run it.

You should obtain the following result:

File: sample.mp4, format: mov,mp4,m4a,3gp,3g2,mj2
---- Stream 00
  Time base: 1/3000
  Framerate: 30/1
  Start time: 0
  Duration: 30000
  Type: video
  FourCC: avc1
  Codec: h264, bitrate: 47094
  Video resolution: 206x80
---- Stream 01
  Time base: 1/44100
  Framerate: 0/0
  Start time: 0
  Duration: 440320
  Type: audio
  FourCC: mp4a
  Codec: aac, bitrate: 112000
  Audio: 2 channels, sample rate: 44100 Hz
Packet received for stream 00, pts: 0
Send video packet to decoder...
Frame 01, type: I, format: 0, pts: 000, keyframe: true
Packet received for stream 00, pts: 100
Send video packet to decoder...
Frame 02, type: P, format: 0, pts: 100, keyframe: false
Packet received for stream 00, pts: 200
Send video packet to decoder...
Frame 03, type: P, format: 0, pts: 200, keyframe: false
Packet received for stream 00, pts: 300
Send video packet to decoder...
Frame 04, type: P, format: 0, pts: 300, keyframe: false
Packet received for stream 00, pts: 400
Send video packet to decoder...
Frame 05, type: P, format: 0, pts: 400, keyframe: false
Packet received for stream 00, pts: 500
Send video packet to decoder...
Frame 06, type: P, format: 0, pts: 500, keyframe: false
Packet received for stream 00, pts: 600
Send video packet to decoder...
Frame 07, type: P, format: 0, pts: 600, keyframe: false
Packet received for stream 00, pts: 700
Send video packet to decoder...
Frame 08, type: P, format: 0, pts: 700, keyframe: false
Packet received for stream 01, pts: 0
Packet received for stream 01, pts: 1024
Packet received for stream 01, pts: 2048
Packet received for stream 01, pts: 3072
Packet received for stream 01, pts: 4096
Packet received for stream 01, pts: 5120
Packet received for stream 01, pts: 6144
Packet received for stream 01, pts: 7168
Packet received for stream 01, pts: 8192
Packet received for stream 01, pts: 9216
Packet received for stream 01, pts: 10240
Packet received for stream 01, pts: 11264
Packet received for stream 01, pts: 12288
Packet received for stream 01, pts: 13312
Packet received for stream 01, pts: 14336
Packet received for stream 01, pts: 15360
Packet received for stream 01, pts: 16384
Packet received for stream 01, pts: 17408
Packet received for stream 01, pts: 18432
Packet received for stream 01, pts: 19456
Packet received for stream 01, pts: 20480
Packet received for stream 01, pts: 21504
Packet received for stream 00, pts: 800
Send video packet to decoder...
Frame 09, type: P, format: 0, pts: 800, keyframe: false
Packet received for stream 00, pts: 900
Send video packet to decoder...
Frame 10, type: P, format: 0, pts: 900, keyframe: false

July 26, 2024 12:00 AM

July 24, 2024

Andy Wingo

whippet progress update: funding, features, future

Greets greets! Today, an update on recent progress in Whippet, including sponsorship, a new collector, and a new feature.

the lob, the pitch

But first, a reminder of what the haps: Whippet is a garbage collector library. The target audience is language run-time authors, particularly “small” run-times: wasm2c, Guile, OCaml, and so on; to a first approximation, the kinds of projects that currently use the Boehm-Demers-Weiser collector.

The pitch is that if you use Whippet, you get a low-fuss small dependency to vendor into your source tree that offers you access to a choice of advanced garbage collectors: not just the conservative mark-sweep collector from BDW-GC, but also copying collectors, an Immix-derived collector, generational collectors, and so on. You can choose the GC that fits your problem domain, like Java people have done for many years. The Whippet API is designed to be a no-overhead abstraction that decouples your language run-time from the specific choice of GC.

I co-maintain Guile and will be working on integrating Whippet in the next months, and have high hopes for success.

bridgeroos!

I’m delighted to share that Whippet was granted financial support from the European Union via the NGI zero core fund, administered by the Dutch non-profit, NLnet foundation. See the NLnet project page for the overview.

This funding allows me to devote time to Whippet to bring it from proof-of-concept to production. I’ll finish the missing features, spend some time adding tracing support, measuring performance, and sanding off any rough edges, then work on integrating Whippet into Guile.

This bloggery is a first update of the progress of the funded NLnet project.

a new collector!

I landed a new collector a couple weeks ago, a parallel copying collector (PCC). It’s like a semi-space collector, in that it always evacuates objects (except large objects, which are managed in their own space). However instead of having a single global bump-pointer allocation region, it breaks the heap into 64-kB blocks. In this way it supports multiple mutator threads: mutators do local bump-pointer allocation into their own block, and when their block is full, they fetch another from the global store.

The block size is 64 kB, but really it’s 128 kB, because each block has two halves: the active region and the copy reserve. It’s a copying collector, after all. Dividing each block in two allows me to easily grow and shrink the heap while ensuring there is always enough reserve space.

Blocks are allocated in 64-MB aligned slabs, so you get 512 blocks in a slab. The first block in a slab is used by the collector itself, to keep metadata for the rest of the blocks, for example a chain pointer allowing blocks to be collected in lists, a saved allocation pointer for partially-filled blocks, whether the block is paged in or out, and so on.

The PCC not only supports parallel mutators, it can also trace in parallel. This mechanism works somewhat like allocation, in which multiple trace workers compete to evacuate objects into their local allocation buffers; when an allocation buffer is full, the trace worker grabs another, just like mutators do.

However, unlike the simple semi-space collector which uses a Cheney grey worklist, the PCC uses the fine-grained work-stealing parallel tracer originally developed for Whippet’s Immix-like collector. Each trace worker maintains a local queue of objects that need tracing, which currently has 1024 entries. If the local queue becomes full, the worker will publish 3/4 of those entries to the worker’s shared worklist. When a worker runs out of local work, it will first try to remove work from its own shared worklist, then will try to steal from other workers.

Of course, because threads compete to evacuate objects, we have to use atomic compare-and-swap instead of simple forwarding pointer updates; if you only have one mutator thread and are OK with just one tracing thread, you can avoid the ~30% performance penalty that atomic operations impose. The PCC generally starts to win over a semi-space collector when it can trace with 2 threads, and gets better with each thread you add.

I sometimes wonder whether the worklist should contain grey edges or grey objects. MMTk seems to do the former, and bundles edges into work packets, which are the unit of work sharing. I don’t know yet what is best and look forward to experimenting once I have better benchmarks.

Anyway, maintaining an external worklist is cheating in a way: unlike the Cheney worklist, this memory is not accounted for as part of the heap size. If you are targetting a microcontroller or something, probably you need to choose a different kind of collector. Fortunately, Whippet enables this kind of choice, as it contains a number of collector implementations.

What about benchmarks? Well, I’ll be doing those soon in a more rigorous way. For now I will just say that it seems to behave as expected and I am satisfied; it is useful to have a performance oracle against which to compare other collectors.

finalizers!

This week I landed support for finalizers!

Finalizers work in all the collectors: semi, pcc, whippet, and the BDW collector that is a shim to use BDW-GC behind the Whippet API. They have a well-defined relationship with ephemerons and are multi-priority, allowing embedders to build guardians or phantom references on top.

In the future I should do some more work to make finalizers support generations, if the collector is generational, allowing a minor collection to avoid visiting finalizers for old objects. But this is a straightforward extension that will happen at some point.

future!

And that’s the end of this update. Next up, I am finally going to tackle heap resizing, using the membalancer approach. Then basic Windows and Mac support, then I move on to the tracing and performance measurement phase. Onwards and upwards!

by Andy Wingo at July 24, 2024 09:19 AM

July 22, 2024

Eric Meyer

Design for Real Life News!

If you’re reading this, odds are you’ve at least heard of A Book Apart (ABA), who published Design for Real Life, which I co-wrote with Sara Wachter-Boettcher back in 2016.  What you may not have heard is that ABA has closed up shop.  There won’t be any more new ABA titles, nor will ABA continue to sell the books in their catalog.

That’s the bad news.  The great news is that ABA has transferred the rights for all of its books to their respective authors! (Not every ex-publisher does this, and not every book contract demands it, so thanks to ABA.) We’re all figuring out what to do with our books, and everyone will make their own choices.  One of the things Sara and I have decided to do is to eventually put the entire text online for free, as a booksite.  That isn’t ready yet, but it should be coming somewhere down the road.

In the meantime, we’ve decided to cut the price of print and e-book copies available through Ingram.  DfRL was the eighteenth book ABA put out, so we’ve decided to make the price of both the print and e-book $18, regardless of whether those dollars are American, Canadian, or Australian.  Also €18 and £18.  Basically, in all five currencies we can define, the price is 18 of those.

…unless you buy it through Apple Books; then it’s 17.99 of every currency, because the system forces us to make it cheaper than the list price and also have the amount end in .99.  Obversely, if you’re buying a copy (or copies) for a library, the price has to be more than the list price and also end in .99, so the library cost is 18.99 currency units.  Yeah, I dunno either.

At any rate, compared to its old price, this is a significant price cut, and in some cases (yes, Australia, we’re looking at you) it’s a huge discount.  Or, at least, it will be at a discount once online shops catch up.  The US-based shops seem to be up to date, and Apple Books as well, but some of the “foreign” (non-U.S.) sources are still at their old prices.  In those cases, maybe wishlist or bookmark or something and keep an eye out for the drop.  We hope it will become global by the end of the week.  And hey, as I write this, a couple of places have the ebook version for like 22% less than our listed price.

So!  If you’ve always thought about buying a copy but never got around to it, now’s a good time to get a great deal.  Ditto if you’ve never heard of the book but it sounds interesting, or you want it in ABA branding, or really for any other reason you have to buy a copy now.

I suppose the real question is, should you buy a copy?  We’ll grant that some parts of it are a little dated, for sure.  But the concepts and approaches we introduced can be seen in a lot of work done even today.  It made significant inroads into government design practices in the UK and elsewhere, for example, and we still hear from people who say it really changed how they think about design and UX.  We’re still very proud of it, and we think anyone who takes the job of serving their users seriously should give it a read.  But then, I guess we would, or else we’d never have written it in the first place.

And that’s the story so far.  I’ll blog again when the freebook is online, and if anything else changes as we go through the process.  Got questions?  Leave a comment or drop me a line.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at July 22, 2024 03:22 PM

Andy Wingo

finalizers, guardians, phantom references, et cetera

Consider guardians. Compared to finalizers, in which the cleanup procedures are run concurrently with the mutator, by the garbage collector, guardians allow the mutator to control concurrency. See Guile’s documentation for more notes. Java’s PhantomReference / ReferenceQueue seems to be similar in spirit, though the details differ.

questions

If we want guardians, how should we implement them in Whippet? How do they relate to ephemerons and finalizers?

It would be a shame if guardians were primitive, as they are a relatively niche language feature. Guile has them, yes, but really what Guile has is bugs: because Guile implements guardians on top of BDW-GC’s finalizers (without topological ordering), all the other things that finalizers might do in Guile (e.g. closing file ports) might run at the same time as the objects protected by guardians. For the specific object being guarded, this isn’t so much of a problem, because when you put an object in the guardian, it arranges to prepend the guardian finalizer before any existing finalizer. But when a whole clique of objects becomes unreachable, objects referred to by the guarded object may be finalized. So the object you get back from the guardian might refer to, say, already-closed file ports.

The guardians-are-primitive solution is to add a special guardian pass to the collector that will identify unreachable guarded objects. In this way, the transitive closure of all guarded objects will be already visited by the time finalizables are computed, protecting them from finalization. This would be sufficient, but is it necessary?

answers?

Thinking more abstractly, perhaps we can solve this issue and others with a set of finalizer priorities: a collector could have, say, 10 finalizer priorities, and run the finalizer fixpoint once per priority, in order. If no finalizer is registered at a given priority, there is no overhead. A given embedder (Guile, say) could use priority 0 for guardians, priority 1 for “normal” finalizers, and ignore the other priorities. I think such a facility could also support other constructs, including Java’s phantom references, weak references, and reference queues, especially when combined with ephemerons.

Anyway, all this is a note for posterity. Are there other interesting mutator-exposed GC constructs that can’t be implemented with a combination of ephemerons and multi-priority finalizers? Do let me know!

by Andy Wingo at July 22, 2024 09:27 AM

July 18, 2024

Brian Kardell

927: Thoughts on a Global Design System

927: Thoughts on a Global Design System

My thoughts on "A Global Design System" as is being discussed in OpenUI.

As you may or may not be aware, there's been recent discussion in OpenUI, brought forward by an effort by my fellow Pittsburgher Brad Frost, about the group taking on the effort of creating a global design system.

First, let me say that the problem that Brad describes is real, and also not new. He and I have discussed this in the past as well. I've spent a lot (the majority maybe) of my career (which began in the 90s) working on projects that were either using, evaluating or making their own common controls.

So much wasted energy

While explaining this, Brad frequently notes that inventing and reinventing the same things over and over wastes an enormous amount of human potential. We could be spending that time better.

I mean... Yes. I agree.

But, even more than that, the time spent re-inventing is only part of the story. The status quo is good for approximately no one. It also has multiplicative effects far beyond just the actual reinvention..

There might be 100 toolkits/component libraries which combined have 100k worth of invested hours, and yeah, that's a huge amount of time... Those hours are also wildly skewed. 1 might have 10x or even 100x the thought, care, review and testing than another.

But while there might be thousands of people spending time re-inventing, there are millions of authors who need components - and so many are spending at least a few hours, or maybe in some cases days searching for suitable components. I've been involved in corporate evaluations that were weeks of time. And it's hard to evaluate them and make good choices that consider accessibility, responsiveness, design, and internationalization. It is not only time-consuming, we often don't have the skills to do it. That is, after all, one of the reasons we want them: So that we don't each have to know all that stuff.

But then, how do we expect authors make a good choice?

Sometimes the ones with the least effort put into them can have a great looking web site, nice documentation, charismatic promotion, or be somehow associated with a big tech company. Too often we wind up choosing components by proxy and just assuming that something else must mean it's good, and will last a long time. However, history has not borne that out — see the various component toolkits and design systems from even big orgs like Microsoft and Google, for example, that fell by the wayside.

But yeah - multiply that time out... What all of this currently creates is bad all around. All of the millions of developers looking and ultimately unable to make well-informed choices is probably tens of millions of hours, by comparison.

In the end, many give up and re-implement again, making the problem even worse.

Each one might introduce tiny variations and accidentally invent something subtly new and create new challenges for users that we'll spend years sorting out too.

Ugh. It's bad. We should want a better future, and we should act on that.

Imagining a Better Future

Here's where I believe we get into trouble though: We have to be clear on what are we imagining, and whether it is practical/pragmatic to deliver adequate value in a reasonable timeframe.

Native HTML?

We could, for example, choose to imagine that HTML can be given a great and complete set of elements representing a complete UI toolkit. In addition to correcting all of the issues with the elements we've added so far, this means adding powerful grids connected to data, tabsets, notifications, carousels, charts, and so on.

Can it? Eventually, maybe, but I hope it is not controversial to suggest that it is extremely unlikely that we could accomplish this with the necessary qualities and in a reasonable timeframe. There's just no information or insight I have that gives me hope that focusing only on that scenario is a good idea.

This is a good end-goal for many components, but it's not where to start. It's hard and time consuming and gated on very specific and limited participation of a small number of people. HTML itself moves slow, on purpose.

I think HTML is at the end of 99 other steps...

The real question, I believe, is about improving how we get there, and deliver iterative partial value along the way.

New Web Components Reference Implementations?

It's been suggested that we could work on a single standard with a reference implementation for each component.

I do believe that ultimately this is a good goal, but I'd like to suggest that it's not where to start either.

The challenges to this are less than trying to add it to HTML in some ways, it doesn't require browser vendorts to act in concert, sure. We can iterate on it, sure. But the challenges are still huge and trading knowns for unknowns.

Instead of needing to convince 3 browser vendors to act in concert, we have to convince several UI kit vendors and developers to participate. We also have to convince everyone to use it and try to avoid XKCD 927 territory...

XKCD 927
Situation: There are 14 Competing Standards
Person 1: 14? Ridiculous! We need to develop one universal standard that covers everyone's use cases!
Person 2: Yeah!
Situation: There are 15 Competing Standards

This is exacerbated by the fact that it won't come all at once. It'll still be a non-trivial amount of time before we have a whole library of components which could reasonably be promoted for use. It still requires people with expertise (probably many of the same people as if it were native) to participate for reviewing accessibility, usability, internationalization, etc. In practice, there are just very finite resources available to put toward large scale, long term cooperation. Practically speaking, it seems likely we could only focus on a couple of components at a time.

Let's say we finish and ship the first component: Tabs. Can we really call it a global design system if it has just one component? Won't that really limit who can/will adopt it?

Adopt, modify and bless an library

It's been suggested that we could take up a library as a kind of a 'donation' project to provide a starting point. Specifically, maintainers from Shoelace/Web Awesome (also formerly MS components) have volunteered components for this purpose. Not as a "this is the thing" but a "this is a start". That would give us a nice leap forward.

Yeah, it would.

Except... Doesn't it raise a lot of questions we have to answer anyways?

First, but maybe not as importantly: Why that one? That goes to legitimacy. We should be able to explain why this is not just the first attractive looking opportunity that presented itself.

More importantly, it seems to me that the rest of the situation decribed above remains largely unchanged. We can't seriously promote that until it is deemed "good", and practically speaking it seems that we will approve them individually, not as a library. So, can't we define how we think it should work before we worry about picking a library?

The most obvious thing we could have ever done that with was jQuery, and we didn't.

I think that a library of reference implementations that we can agree to and recommend is still very far along the timeline...

The real question, I believe, is about improving how we get there, and deliver iterative partial value along the way.

We still don't have a great way to evolve the web - but I keep saying that I think we should.

How I think we could get there...

This is what I want more than anything: A plan to get there. Reasonable steps of value along the way, comparatively quickly.

It is effectively what I thought in 2013-2014 too. I suggested to the W3C Advisory Committee that we needed to rethink how we approach standards to include this sort of idea, which could work more like languages/dictionaries. I tried to suggest the W3C should create such a process/listing/review process.

What follows is a vague outline of what I imagine:

I'd like to create a central place where we lay out some new rules and a process where components, in a basic form that we agree to (it is as a module, should it use shadow dom or not, etc) can be submitted.

What are the criteria? That's the first few steps...

We'd define some criteria gating submission, first with IP/license agreements we agree to, possibly some kind of bar for contributors or stars or something, but mainly: A commitment of participation in this new process, especially from experts. Honestly, participation is a bigger part of the limiting factor than anyone really imagines.

Once submitted it would undergo wide review and get some kind of 'verification stamps' for each kind (a11y, i18n, etc).

For this reason, I would really love to try to include the authors of government tools here. They are legally mandated and funded to solve the problem already and seem highly incentivized to participate. A collective of government efforts also lends immediate credibility and sense of independence to it.

To me, ideally, we would begin with a call for components/participation.

A call for particpation/submissions...

You might have noticed...

You might have noticed that I didn't answer the question of "how do we pick one?" That's because I think that's like 99 steps down the road and will come naturally.

We can get a set of people who can contribute tabs, and a set of people who can review, and we can all discuss several of them at the same time. We can begin to lay out conformance criteria, and give each one little 'conformance stamps' along the way. Inevitably we can more easily get implementations to align and develop universal definitions and tests -- new stamps to be had.

Component get conformance stamps...

For authors, along the way, there's a nice central catalog somewhere, like webcomponents.org, but better. You'll know those have been submitted, and which ones have which conformance stamps. Maybe there isn't a 'the one', yet. But, it's ok? You have a smaller set, and the information you really need to choose one. Maybe all 3 of them are ... fine?

That's not the worst thing, we can sit back and evaluate it for a while while already saving ourselves collectively millions of hours and our users a lot of pain.

In fact, collecting data and a little variation is good. Probably, they continue to align, or one begins to be the clearer winner.

We have very well defined, portable criteria for testing and more or less 1 definition...

And, that's the point: As we go we would slowly, but without stopping major progress at any point. Even if nothing more happens, each of those steps has had real value. No one has just wasted time.

Then, maybe we can get somewhere where we have a single reference implementation of all of those things - or even a standard almost identical to them.

We have a true global reference implementation... Should we bake it into HTML?

In any case, that's how I would prefer to approach it. I wouldn't call it a "global design system" to start, because I wouldn't even start out assuming there would be only one of anything initially... But eventually.

July 18, 2024 04:00 AM

Igalia Compilers Team

Summary of the June 2024 TC39 plenary in Helsinki

In June, many colleagues from Igalia participated in a TC39 meeting organized in Helsinki by Aalto University and Mozilla to discuss proposed features for the JavaScript standard alongside delegates from various other organizations.

Let's delve together into some of the most exciting updates!

You can also read the full agenda and the meeting minutes on GitHub.

Day 1 #

import defer to Stage 2.7 igalia logo #

The import defer proposal allows pre-loading modules while deferring their evaluation until a later time. This proposal aims at giving developers better tools to optimize the startup performance of their application.

As soon as some code needs the variables exported by a deferred module, it will be synchronously evaluated to immediately give access to its value:

// This does not cause evaluation of my-mod or its dependencies:
import defer * as myMod from "./my-mod.js";

$button.addEventListener("click", () => {
// but this does!
console.log("val:", myMod.val);
});

This is similar to, when using CommonJS, moving a require(...) call from the top-level of a module to inside a function. Adding a similar capability to ES modules is one further step towards helping people migrate from CommonJS to ESM.

The proposal reached stage 2.7 after answering a few questions centered around the behavior of top-level await: modules with top-level await will not be deferred even if imported through import defer, because they cannot be synchronously evaluted. If you application can easily handle asynchronous deferred evaluation of modules, it can as well use dynamic import().

The proposal now needs to have test262 tests written, to be able to go to Stage 3.

Promise.try to Stage 3 #

The new Promise.try helper allows calling a functions that might or might not be asynchronous, unifying their error handling paths. This is useful for asynchronous APIs (that should always signal errors by returning a rejected promise) that interact with user-provided callbacks.

Consider this example:

type AsyncNegator = (val: number) => number | Promise<number>;
function subtract(a: number, b: number, negate: AsyncNegator): Promise<number> {
return Promise.resolve(negate(b)).then(negB => a + negB);
}

While the developer here took care of wrapping negate's result in Promise.resolve, in case negate returns a number directly, what happens in negate throws an error? In that case, subtract will throw synchronously rather than returning a rejected promise!

With Promise.try, you can easily handle both the success and error paths correctly:

type AsyncNegator = (val: number) => number | Promise<number>;
function subtract(a: number, b: number, negate: AsyncNegator): Promise<number> {
return Promise.try(() => negate(b)).then(negB => a + negB);
}

Day 2 #

Source maps update #

Source maps are an important tool in a developer's toolbox: they are what lets you debug transpiled/minified code in your editor or browser, while still stepping through your original hand-authored code.

While they are supported by most tools and browsers, there hasn't been a shared standard that defines how they should work. Tools and browsers all have to peek at what the others are doing to understand how to properly implement them, and this situation makes it very difficult to evolve the format to improve the debugging experience.

TC39 recently picked up the task of formalizing the standard, as well as adding new features such as the scopes proposal that would let devtools better understand renamed variables and inlined functions.

Iterator.zip to Stage 2.7 #

TC39 is working on many helpers for more easily working with iterators ("lazy lists", that only produce values as needed). While most of them are in the Iterator Helpers proposal, this one is advancing on its own.

Iterator.zip allows pairing values coming from multiple iterators:

function getNums(start = 0, step = 1) {
for (let val = start; ; start += step) yield step;
}

let naturals = getNums();
let evens = getNums(0, 2);
let negatives = getNums(-1, -1);

// an iterator of [a natural, an even, a negative]
let allTogether = Iterators.zip([naturals, evens, negative]);

console.log(allTogether.next().value); // [0, 0, -1]
console.log(allTogether.next().value); // [1, 2, -2]
console.log(allTogether.next().value); // [2, 4, -3]

This proposal, like import defer, just reached the new Stage 2.7: it will now need test262 tests to be eligible for Stage 3.

Temporal reduction igalia logo #

Temporal is one of the longest awaited features of JavaScript, advancing bit by bit on its path to stage 4 as obstacles are removed. For the last 6 months or so we have been working on removing one of the final obstacles: addressing feedback from JS engines on the size and complexity of the proposal, which culminated in this meeting.

As we get closer to having shipping implementations, it's become clear that the size of Temporal was an obstacle for platforms such as low-end Android devices: it added a large chunk to the size of the JS engine all at once. So, Philip Chimento and Justin Grant presented a slate of API removals to make the proposal smaller.

What was removed? Some methods previously existed for convenience, but were removed as somewhat redundant because there was a one-line way to accomplish the same thing. A more substantial removal was Temporal.Calendar and Temporal.TimeZone objects, along with the ability to extend them to implement custom calendars and custom time zones. We've received feedback that these have been the most complicated parts of the proposal for implementations, and they've also been where the most bugs have popped up. As well, due to the existence of formats like jsCalendar (RFC 8984), as well as learning more about the drawbacks of a callback-based design, we believe there are better designs possible for custom time zones and calendars than there were when the feature was designed.

Most of the slate of removals was adopted, and Temporal continues its journey to stage 4 smaller than it was before. You can follow the progress in this ticket on Temporal's issue tracker.

Day 3 #

Decimal update igalia logo #

If you're tired of the fact that 0.1 + 0.2 is not 0.3 in JavaScript, then the decimal proposal is for you! This proposal, which is currently at stage 1, was presented by Jesse Alama. The goal was to present an update about some changes to the API, and go through the status of the proposal's spec text. Although most of the committee was generally supportive of allowing this proposal to go to stage 2, it remains at stage 1 due to some concerns about missing details in the spec text and the overall motivation of the proposal.

Discard bindings update #

The intention of discard bindings is to formalize a pattern commonly seen out there in the wild of JavaScript:

let [ _, rhs ] = s.split(".");

Notice that underscore character (_)? Although our intention is to signal to the readers of our code that we don't care what is on the left-hand side of the . in our string, _ is actually a valid identifier. It might even contain a big value, which takes up memory. This is just the tip of the iceberg; things get even more complex when one imagines binding -- but not using! -- even more complex entities. We would like to use _ -- or perhaps something else, like void -- to signal to the JavaScript engine that it can throw away whatever value _ might have held. Ron Buckton presented an update about this proposal and successfully advanced it to Stage 2 in the TC39 process.

Signals update #

Signals is an ambitious proposal that takes some of the various approaches to reactivity found out there in various JS frameworks such as Vue, React, and so on. The idea is to bring some form of reactivity into the JS Core. Of course, different approaches to reactivity are possible, and should remain valid; the idea of the Signals proposal is to provide some common abstraction that different aproaches can build on. Dan Ehrenberg presented an update on this new proposal, which is currently at stage 1. A lot of work remains to be done; Dan explicitly said that a request to move Signals to stage 2 might take at least a year.

July 18, 2024 12:00 AM

July 12, 2024

Georges Stavracas

Profiling a web engine

One topic that interests me endlessly is profiling. I’ve covered this topic many times in this blog, but not enough to risk sounding like a broken record yet. So here we are again!

Not everyone may know this but GNOME has its own browser, Web (a.k.a. Epiphany, or Ephy for the intimates). It’s a fairly old project, descendant of Galeon. It uses the GTK port of WebKit as its web engine.

The recent announcement that WebKit on Linux (both WebKitGTK and WPE WebKit) switched to Skia for rendering brought with it a renewed interest in measuring the performance of WebKit.

And that was only natural; prior to that, WebKit on Linux was using Cairo, which is entirely CPU-based, whereas Skia had both CPU and GPU-based rendering easily available. The CPU renderer mostly matches Cairo in terms of performance and resource usage. Thus one of the big promises of switching to Skia was better hardware utilization and better overall performance by switching to the GPU renderer.

A Note About Cairo

Even though nowadays we often talk about Cairo as a legacy piece of software, there’s no denying that Cairo is really good at what it does. Cairo can and often is extremely fast at 2D rendering on the CPU, specially for small images with simple rendering. Cairo has received optimizations and improvements for this specific use case for almost 20 years, and it is definitely not a low bar to beat.

I think it’s important to keep this in mind because, as tempting as it may sound, simply switching to use GPU rendering doesn’t necessarily imply better performance.

Guesswork is a No-No

Optimizations should always be a byproduct of excellent profiling. Categorically speaking, meaningful optimizations are a consequence of instrumenting the code so much that the bottlenecks become obvious.

I think the most important and practical lesson I’ve learned is: when I’m guessing what are the performance issues of my code, I will be wrong pretty much 100% of the time. The only reliable way to optimize anything is to have hard data about the behavior of the app.

I mean, so many people – myself included – were convinced that GNOME Software was slow due to Flatpak that nobody thought about looking at app icons loading.

Enter the Profiler

Thanks to the fantastic work of Søren Sandmann, Christian Hergert, et al, we have a fantastic modern system profiler: Sysprof.

Sysprof offers a variety of instruments to profile the system. The most basic one uses perf to gather stack traces of the processes that are running. Sysprof also supports time marks, which allow plotting specific events and timings in a timeline. Sysprof also offers extra instrumentation for more specific metrics, such as network usage, graphics, storage, and more.

  • Screenshot of Sysprof's callgraph view
  • Screenshot of Sysprof's flamegraphs view
  • Screenshot of Sysprof's mark chart view
  • Screenshot of Sysprof's waterfall view

All these metrics are super valuable when profiling any app, but they’re particularly useful for profiling WebKit.

One challenging aspect of WebKit is that, well, it’s not exactly a small project. A WebKit build can easily take 30~50min. You need a fairly beefy machine to even be able to build a debug build of WebKit. The debug symbols can take hundreds of megabytes. This makes WebKit particularly challenging to profile.

Another problem is that Sysprof marks require integration code. Apps have to purposefully link against, and use, libsysprof-capture to send these marks to Sysprof.

Integrating with Sysprof

As a first step, Adrian brought the libsysprof-capture code into the WebKit tree. As libsysprof-capture is a static library with minimal dependencies, this was relatively easy. We’re probably going to eventually remove the in-tree copy and switch to host system libsysprof-capture, but having it in-tree was enough to kickstart the whole process.

Originally I started sprinkling Sysprof code all around the WebKit codebase, and to some degree, it worked. But eventually I learned that WebKit has its own macro-based tracing mechanism that is only ever implemented for Apple builds.

Looking at it, it didn’t seem impossible to implement these macros using Sysprof, and that’s what I’ve been doing for the past few weeks. The review was lengthy but behold, WebKit now reports Sysprof marks!

Screenshot of Sysprof with WebKit marks highlighted

Right now these marks cover a variety of JavaScript events, layout and rendering events, and web page resources. This all came for free from integrating with the preexisting tracing mechanism!

This gives us a decent understanding of how the Web process behaves. It’s not yet complete enough, but it’s a good start. I think the most interesting data to me is correlating frame timings across the whole stack, from the kernel driver to the compositor to GTK to WebKit’s UI process to WebKit’s Web process, and back:

Screenshot of Sysprof with lots of compositor and GTK and WebKit marks

But as interesting as it may be, oftentimes the fun part of profiling is being surprised by the data you collect.

For example, in WebKit, one specific, seemingly innocuous, completely bland method is in the top 3 of the callgraph chart:

Screenshot of Sysprof showing the callgraph view with an interesting result highlighted

Why is WebCore::FloatRect::contains so high in the profiling? That’s what I’m investigating right now. Who guessed this specific method would be there? Nobody, as far as I know.

Once this is out in a stable release, anyone will be able to grab a copy of GNOME Web, and run it with Sysprof, and help find out any performance issues that only reproduce in particular combinations of hardware.

Next Plans

To me this already is a game changer for WebKit, but of course we can do more. Besides the rectangular surprise, and one particular slowdown that comes from GTK loading Vulkan on startup, no other big obvious data point popped up. Specially in the marks, I think their coverage is still fairly small compared to what it could have been.

We need more data.

Some ideas that are floating right now:

  • Track individual frames and correlate them with Sysprof marks
  • Measure top-to-bottom-to-top latency
  • Measure input delay
  • Integrate with multimedia frames

Perhaps this will allow us to make WebKit the prime web engine for Linux, with top-tier performance, excellent system integration, and more. Maybe we can even redesign the whole rendering architecture of WebKit on Linux to be more GPU friendly now. I can dream high, can’t I? 🙂

In any case, I think we have a promising and exciting time ahead for WebKit on Linux!

by Georges Stavracas at July 12, 2024 12:42 PM

Madeeha Javed

Igalia's Latest Contributions to Graphics

The Igalia Graphics team has been expanding and making significant contributions in the space of open source graphics. An earlier blog post by our team member Lucas provides an excellent insight in to the team’s evolution over the past years. The following series of posts will attempt to summarize the team’s recent engagements:

  • This post covers our updates on GPU color management, Turnip, V3DV, DRM/KMS, Etnaviv and community events we have been participating in.
  • The next post will cover news from our CTS, Vulkan Video, Mesa CI, GPU reset work and talks about some new initiatives that recently we got involved in.

Before dwelling in to details, it is worth mentioning the recent highlights; Igalia hosted 2024 Linux Display Next Hackfest in May this year and X.org Developers Conference 2023 in October last year, both in the beautiful city of A Coruña. These events were a huge success in creating a hub for graphics experts to foster open innovation. Continue reading for more details on these events.

A Vibrant Linux #

Last year brought great news for AMD GPU color management: the AMD driver-specific color management properties reached the upstream linux-next! My Igalia colleague Melissa Wen has been spearheading this effort for some time now and has journalled every detail in a series of blog posts.

AMD has been improving its display color management pipeline with each new hardware generation. The new color capabilities, before and after plane composition, can be used by compositors and userspace applications to provide a vibrant experience to the end-user. Exposing AMD driver-specific color properties is a step towards advanced color management on Linux, allowing gamut mapping, HDR rendering, HDR on SDR, and SDR on HDR.

On a very high level, there are 2 parts of this support:

  • Upgrading the DRM/KMS Linux interface to expose the new features to the user-space. One major challenge was the limited DRM/KMS interface, which only exposed a small set of post-blending color properties. Latest AMD Display Core Next hardware has many more post-blending and pre-blending capabilities. Melissa’s work involved mapping these capabilities to the AMD driver’s display core interface and then to the DRM interface. Her blog post provides a brief overview of this extensive mapping effort.

  • Updating the AMD’s Linux display driver to expose the new hardware features. AMD DCN 3.0 comes with cutting edge color capabilities described by Melissa here and this blog post also talks about the AMD’s Linux display subsystem components and about the new properties.

I quote here some of Melissa’s write-ups that helped me get some understanding about this vast subject:

Turnip Upgrades #

Turnip, the open-source Vulkan driver for Qualcomm Adreno GPUs, has been receiving major upgrades this year for Qualcomm’s Adreno 7XX GPUs.

From my colleague Danylo Piliaiev’s Turnip update at FOSDEM 2024, Turnip seems to be in a great state; major Vulkan extensions and better debug support, AAA desktop games can now run via FEX + Turnip on Linux, with some from the Termux community even running desktop games on Android with Box64/FEX + Turnip.

The highlight of Danylo’s talk is the A7XX support. The team started the year with A7XX bring up and now ramping on adding support for the new features introduced in A7XX:

  • Mark Collins, who also represents Igalia at the Khronos Vulkan WG, implemented GMEM rendering for A7XX, which can be considerably faster and more power efficient than sysmem rendering depending on what’s being rendered. Followed up by support for unidirectional LRZ, bringing A7XX to parity with A6XX’s GMEM rendering feature set and further boosting performance, with more performance improvements for A7XX on the horizon.

  • Our colleague Amber Harmonia added support for allowing a shader to contain 64-bit atomic operations on signed and unsigned integers and support for allowing rasterizing wide lines while Fixed Stride Draw Table support is work-in-progress.

In addition to new feature support, we are committed to providing a robust and performant driver.

Recently, Job Noorman has joined our Turnip team to improve the IR3 compiler. He improved handling of predicate registers and added support for predication. Adreno GPUs have special registers that store the result of a condition called predicate registers, utilizing these registers can eliminate branches in the generated code thereby improving performance. Similarly, more than 10% code size reduction was observed in shader-db with his patch for using rptN instructions.

Turnip has come far and has been giving competition to the Adreno’s proprietary driver recently. Here is Assassin’s Creed running on Adreno + Turnip. Check the FPS on that screen!

Turnip Development Resources #

Danylo usually talks about analyzing some of the major Turnip issues in his series of blog posts “Turnips in the wild” with part 3 being the latest addition. This is exactly what you need to jump start Turnip development.

As always, the team also discovered many new techniques of debugging GPU issues. GPU driver developers want to modify the GPU command stream on run-time to see the outcome of editing it in different ways. Danylo implemented this highly sought out feature as a tool for Adreno and describes how this tool can be used.

DRM/KMS Improvements #

The management of the display, graphics and composition in Linux lies in the kernel DRM/KMS framework. Igalian Maíra Canal provides full disclosure on our notable contributions authoring, reviewing and testing kernel DRM patches while I privide a few highlights here:

  • My Igalia colleague André Almeida and Simon Ser have been working on Asynchronous Page Flips, an optimization that allows applications to flip a plane for immediate presentation. The support for this feature is now available in the atomic API. Plus, with André’s patch, it is enabled for all planes including the primary plane if the hardware supports it.

  • Maíra has been working on feature crucial to graphics development on RPi. She supplied per client GPU usage statistics as well as global GPU utilization.

  • In order to ensure continuous job submission to the GPU, CPU jobs submitted from userspace must be prevented. With a series of patches from Maíra moved CPU jobs mechanisms from the V3DV driver to the V3D kernel driver.

We want more Pi! #

After achieving Vulkan 1.2 conformance on V3DV, the Igalia team working on V3DV have been focusing on instrumental enhancements of the driver. V3DV is Broadcom Video Core GPU’s Vulkan driver on the image

RPi 5 was launched in October last year with a new BCM GPU. Alejandro provided an overview of the team’s journey through V3DV development since RPi 4 and then talks about challenges of RPi 5 support in V3DV:

More improvements and new Vulkan extensions were supported last year.

This year Iago landed support for Vulkan dynamic rendering extension. VK_KHR_dynamic_rendering is a popular Vulkan extension that has added flexibility to the Vulkan API by allowing users to skip render pass and frame buffer objects and start immediate rendering. And now its available on the Pi.

As mentioned in the DRM/KMS improvements above, Maíra together with José María Casanova (Chema) and Melissa supported GPU utilization stats and CPU jobs optimization. Here is a snapshot of collection of GPU stats on Pi5:

image

RPi 5 continues to use OpenGL/Wayland based Wayfire compositor on these devices. Christopher was therefore tasked with enabling Wayfire to run on RPi 3 and 4 as well. He achieved this by software rendering implementing by a Pixman back-end. Check out the demo:

Iago also made some interesting observations while experimenting with SuperTuxKart on the Pi. You will be pleasantly surprised to know how Vulkan out-performed OpenGL.

The team has been working towards Vulkan 1.3 and we will hopefully be able to share more news on that front very soon.

Etnaviv #

Christian Gmeiner, one of the maintainers of Etnaviv (open-source graphics driver for Vivante GPUs), joined our team last year. We are very excited to have him on-board because it is a testament to Igalia’s dedication towards open source graphics software development.

Christian is also enjoying being at Igalia as he discusses in blog post and also reveals his plans for Etnaviv:

  • Improving Etnaviv’s Gallium driver.
  • Exposing GLES3.
  • Moving towards a new back-end compiler.

One of his latest updates is the user-space hardware database. He explains that a user-space driver HW database has been introduced to obtain GPU specific information like GPU features and limits, corresponding to the introduction of an in-kernel hardware database. I am sure this will be super helpful for the reverse engineers out there!

News & Community Events #

Igalians are always eager to share their knowledge and expertise with the open source community by participating in key organizations and events.

Good bye ‘Xorg’ and Hello ‘Linux Foundation’ #

image

There is quite a trend in Igalians serving on the X.Org Foundation’s Board of Directors. Samuel Iglesias took on this responsibility for a number of terms but this year he is stepping down. He reminisced about his role in this blog post.

Ricardo was, however, elected as one of the board of directors in 2022 and stayed on the board till Q1 2024, leaving Christopher Michael as the only Igalian currently on the board. In his blog post, Ricardo introduces the X.Org Foundation but also tackles some questions about its future.

image

Samuel was invited to join the Linux Foundation (Europe) advisory board and he has accepted the invitation. This is a huge milestone for the whole graphics team. Congratulations Sam!

2024 Linux Display Hackfest #

This is a rather new event that has materialized in the Linux community to enhance the Linux display stack.

Melissa’s work on HDR and AMD color management together with interesting discussions during XDC 2023 Color Management workshop paved the way for the event this year and therefore, Igalia graciously offered to host it.

The event attracted key participants from Linux community, AMD, Nvidia, Google, Fedora, and Gnome, focusing on topics like HDR/color Management, variable refresh rate, tearing, multiplane/hardware overlay for video and gaming, real-time scheduling, async KMS API, power saving vs. color/latency, content-adaptive scaling and sharpening, and display control. The success of this event has highlighted the need for future editions.

Embedded Open Source Summit 2024 #

At EOSS this year, we presented the following talks:

FOSDEM 2024 #

At FOSDEM this year, we presented the following talks:

Vukanised 2024 #

At Vukanised this year, we presented the following talks:

Stéphane Cerveau & Hyunjun Ko, “Implementing a Vulkan Video Encoder From Mesa to Streamer” Iago Toral, Faith Ekstrand, “8 Years of Open Drivers, including the State of Vulkan in Mesa”

Igalians who attended the event found it quite informative on the subject.

XDC 2023 #

Igalia hosted XDC 2023 in the city of their headquarters, A Coruña. We also presented many talks and demos.

The lightning talks and demos had an equally active participation from Igalia:

Workshops were organized for discussion on larger subjects like advance color management (discussion summary) and continuous integration (discussion summary).

The Future #

Igalia graphics team has profound expertise in Mesa, Vulkan, OpenGL and Linux kernel. We have also embraced new and really interesting graphics technologies that I talk about in my next post.

July 12, 2024 12:00 AM

July 10, 2024

Andy Wingo

copying collectors with block-structured heaps are unreliable

Good day, garbage pals! This morning, a quick note on “reliability” and garbage collectors, how a common GC construction is unreliable, and why we choose it anyway.

on reliability

For context, I’m easing back in to Whippet development. One of Whippet’s collectors is a semi-space collector. Semi-space collectors are useful as correctness oracles: they always move objects, so they require their embedder to be able to precisely enumerate all edges of the object graph, to update those edges to point to the relocated objects. A semi-space collector is so simple that if there is a bug, it is probably in the mutator rather than the collector. They also have well-understood performance, as as such are useful when comparing performance of other collectors.

But one other virtue of the simple semi-space collector is that it is reliable, in the sense that given a particular set of live objects, allocated in any order, there is a single heap size at which the allocation (and collection) will succeed, and below which the program fails (not enough memory). This is because all allocations go in the same linear region, collection itself doesn’t allocate memory, the amount of free space after an object (the fragmentation) does not depend on where it is allocated, and those object extents just add up in a commutative way.

Reliability is a virtue. Sometimes it is a requirement: for example, the Toit language and run-time targets embeded microcontrollers, and you there you have finite resources and either they workload fits or it doesn’t. You can’t really tolerate a result of “it works sometimes”. So, Toit uses a generational semi-space + mark-sweep collector that never makes things worse.

on block-structured heaps

But, threads make reliability tricky. With Whippet I am targetting embedders with multiple mutator threads, and a classic semi-space collector doesn’t scale – you could access the allocation pointer atomically, but that would be a bottleneck, serializing mutators, not to mention the cache contention.

The usual solution for this problem is to arrange the heap in such a way that different threads can allocate in different areas, so they don’t need to share an allocation pointer and so they don’t write to the same cache lines. And, the most common way to do this is to use a block-structured heap; for example you might have a 256 MB heap, but divided into 4096 blocks, each of which is 64 kB. That’s enough granularity to dynamically partition out space between many threads: you keep a list of available blocks and allocator threads compete to grab fresh blocks as needed. There’s central contention on the block list, so you want blocks big enough that you aren’t fetching blocks too often.

To break a heap into blocks requires a large-object space, to allow for allocations that are larger than a block. And actually, as I mentioned in the article about block-structured heaps, usually you choose a threshold for large object allocations that is smaller than the block size, to limit the maximum and expected amount of fragmentation at the end of each block, when an allocation doesn’t fit.

on unreliability

Which brings me to my point: a copying collector with a block-structured heap is unreliable, in the sense that there is no single heap size below which the program fails and above which it succeeds.

Consider a mutator with a single active thread, allocating a range of object sizes, all smaller than the large object threshold. There is a global list of empty blocks available for allocation, and the thread grabs blocks as needed and bump-pointer allocates into that block. The last allocation in each block will fail: that’s what forces the thread to grab a new fresh block. The space left at the end of the block is fragmentation.

Assuming that the sequence of allocations performed by the mutator is deterministic, by the time the mutator has forced the first collection, the total amount of fragmentation will also be deterministic, as will the set of live roots at the time of collection. Assume also that there is a single collector thread which evacuates the live objects; this will also produce deterministic fragmentation.

However, there is no guarantee that the post-collection fragmentation is less than the pre-collection fragmentation. Unless objects are copied in such a way that preserves allocation order—generally not the case for a semi-space collector, and it would negate any advantage of a block-structured heap—then different object order could produce different end-of-block fragmentation.

causes of unreliability

The unreliability discussed above is due to non-commutative evacuation. If your collector marks objects in place, you are not affected. If you don’t commute live objects—if you preserve their allocation order, as Toit’s collector does—then you are not affected. If your evacuation commutes, as in the case of the simple semi-space collector, you are not affected. But if you have a block-structured heap and you evacuate, your collector is probably unreliable.

There are other sources of unreliability in a collector, but to my mind they are not as fundamental as this one.

  • Multiple mutator threads generally lead to a kind of unreliability, because the size of the live object graph is not deterministic at the time of collection: even if all threads have the same allocation trace, they don’t necessarily proceed in lock-step nor stop in the same place.

  • Adding collector threads to evacuate in parallel adds its own form of unreliability: if you have 8 evacuator threads, then there are 8 blocks local to the evacuator threads which also contribute to post-collection wasted space, possibly an entire block per thread.

  • Some collectors will allocate memory during collection, for example to represent a worklist of objects that need tracing. This allocation can fail. Also, annoyingly, collection-time allocation complicates comparison: you can no longer compare two collectors at the same heap size, because one of them cheats.

  • Virtual memory and paging can make you have a bad time. For example, you go to allocate a large object, so you remove some empty blocks from the main space and return them to the OS, providing you enough budget to allocate the new large object. Then the new large object is collected, so you reclaim the pages you returned to the OS, adding them to the available list. But probably you don’t page them in already, because who wants a syscall? They get paged in lazily when the mutator needs them, but that could fail because of other processes on the system.

embracing unreliability

I think it only makes sense to insist on a reliable collector if your mutator does not have threads; otherwise, the fragmentation-related unreliability pales in comparison.

What’s more, fragmentation-related unreliability can be entirely mitigated by giving the heap more memory: the maximum amount of fragmentation is an object just shy of the large object threshold, per block, so in our case 8 kB per 64 kB. So, just increase your heap by 12.5%. You will certainly not regret increasing your heap by 12.5%.

And happily, increasing heap size also works to mitigate unreliability related to multiple mutator threads. Consider 10 threads each of which has a local object graph that is usually 10 MB but briefly 100MB when calculating: usually when GC happens, total live object size is 10×10MB=100MB, but sometimes as much as 1 GB; there is a minimum heap size for which the program sometimes works, but also a minimum heap size at which it always works. The trouble is, of course, that you generally only know the minimum always-works size by experimentation, and you are unlikely to be able to actually measure the maximum heap size.

Which brings me to my final point, which is that virtual memory and growable heaps are good. Unless you have a dedicated devops team or you are evaluating a garbage collector, you should not be using a fixed heap size. The ability to just allocate some pages to keep the heap from being too tight buys us a form of soft reliability.

And with that, end of observations. Happy fragmenting, and until next time!

by Andy Wingo at July 10, 2024 08:48 AM

Stephen Chenney

Canvas Text Editing

Editing text in HTML canvas has never been easy. It requires identifying which character is under the hit point in order to place a caret, and it requires computing bounds for a range of text that is selected. The existing implementations of Canvas TextMetrics made these things possible, but not without a lot of Javascript making multiple expensive calls to compute metrics for substrings. Three new additions to the TextMetrics API are intended to support editing use cases in Canvas text. They are in the standards pipeline, and implemented in Chromium-based browsers behind the ExtendedTextMetrics flag:

  • caretPositionFromPoint gives the location in a string corresponding to a pixel length along the string. Use it to identify where the caret is in the string, and what the bounds of a selection range are.
  • getSelectionRects returns the rectangles that a browser would use to highlight a range of text. Use it to draw the selection highlight.
  • getActualBoundingBox returns the bounding box for a sub-range of text within a string. Use it if you need to know whether a point lies within a substring, rather than the entire string.

To enable the flag, use --enable-blink-features=ExtendedTextMetrics when launching Chrome from a script or command line, or enable “Experimental Web Platform features” via chrome://flags/#enable-experimental-web-platform-features.

I wrote a basic web app (opens in a new tab) in order to demonstrate the use of these features. It will function in Chrome versions beyond 128.0.6587.0 (Canary at the time of writing) with the above flags set.

The app allows the editing of a single line of text drawn in an HTML canvas. Here I’ll work through usage of the new features.

In the demo, the first instance of “new Canvas Text Metrics” is considered a link back to this blog page. Canvas Text has no notion of links, and thousands of people have looked at Stack Exchange for a way to insert hyperlinks in canvas text. Part of the problem, assuming you know where the link is in the text, is determining when the link was clicked on. The TextMetrics getActualBoundingBox(start, end) method is intended to simplify the problem by returning the bounding box of a substring of the text, in this case the link.

  onStringChanged() {
text_metrics = context.measureText(string);
link_start_position = string.indexOf(link_text);
if (link_start_position != -1) {
link_end_position = link_start_position + link_text.length;
}
}
...
linkHit(x, y) {
let bound_rect = undefined;
try {
bound_rect = text_metrics.getActualBoundingBox(link_start_position, link_end_position);
} catch (error) {
return false;
}
let relative_x = x - string_x;
let relative_y = y - string_y;
return relative_x >= bound_rect.left && relative_y >= bound_rect.top
&& relative_x < bound_rect.right && relative_y < bound_rect.bottom;
}

The first function finds the link in the string and stores the start and end string offsets. When a click event happens, the second method is called to determine if the hit point was within the link area. The text metrics object is queried for the bounding box of the link’s substring. Note the call is contained within a try...catch block because an exception will be returned if the substring is invalid. The event offset is mapped into the coordinate system of the text (in this case by subtracting the text location) and the resulting point is tested against the rectangle.

In more general situations you may need to use a regular expression to find links, and keep track of a more complex transformation chain to convert event locations into the text string’s coordinate system.

Mapping a Point to a String Index #

A primary concept of any editing application is the caret location because it indicates where typed text will appear, or what will be deleted by backspace, or where an insertion will happen. Mapping a hit point in the canvas into the caret position in the text string is a fundamental editing operation. It is possible to do this with existing methods but it is expensive (you can do a binary search using the width of substrings).

The TextMetrics caretPositionFromPoint(offset) method uses existing code in browsers to efficiently map a point to a string position. The underlying functionality is very similar to the document.caretPositionFromPoint(x,y) method, but modified for the canvas situation. The demo code uses it to position the caret and to identify the selection range.

  text_offset = event.offsetX - string_x;
caret_position = text_metrics.caretPositionFromPoint(text_offset);

The caretPositionFromPoint function takes the horizontal offset, in pixels, measured from the origin of the text (based on the textAlign property of the canvas context). The function finds the character boundary closest to the given offset, then returns the character index to the right for left-to-right text, and to the left for right-to-left text. The offset can be negative to allow characters to the left of the origin to be mapped.

In the figure below, the top string has textDirection = "ltr" and textAlign = "center". The origin for measuring offsets is the center of the string. Green shows the offsets given, while blue shows the indexes returned. The bottom string demonstrates textDirection = "rtl" and textAlign = "start".

An offset past the beginning of the text always returns 0, and past the end returns the string length. Note that the offset is always measured left-to-right, even if the text direction is right-to-left. The “beginning” and “end” of the text string do respect the text direction, so for RTL text the beginning is on the right.

The caretPositionFromPoint function may produce very counter-intuitive results when the text string has mixed bidi content, such as a latin substring within an arabic string. As the offset moves along the string the positions will not steadily increase, or decrease, but may jump around at the boundaries of a directional run. Full handling of bidi content requires incorporating bidi level information, particularly for selecting text, and is beyond the scope of this article.

Selection Rectangles #

Selected text is normally indicated by drawing a highlight over the range, but to produce such an effect in canvas requires estimating the rectangle using existing text metrics, and again making multiple queries to text metrics to obtain the left and right extents. The new TextMetrics getSelectionRects(start, end) function returns a list of browser defined selection rectangles for the given subrange of the string. There may be multiple rectangles because the browser returns one for each bidi run; you would need to draw them all to highlight the complete range. The demo assumes a single rectangle because it assumes no mixed-direction strings.

selection_rect = text_metrics.getSelectionRects(selection_range[0], selection_range[1])[0];
...
context.fillStyle = 'yellow';
context.fillRect(selection_rect.x + string_x,
selection_rect.y + string_y,
selection_rect.width,
selection_rect.height)

Like all the new methods, the rectangle returned is in the coordinate system of the string, as defined by the transform, textAlign and textBaseline.

Conclusion #

The new Canvas Text Metrics described here are in the process of standardization. When the feedback process is opened we will update this blog post with the place to raise issues with these proposed methods.

Thanks #

The implementation of Canvas Text Features was aided by Igalia S.L. funded by Bloomberg L.P.

July 10, 2024 12:00 AM

July 09, 2024

Guilherme Piccoli

Presenting kdumpst, or how to collect kernel crash logs on Arch Linux

It’s been a long time since I last posted something here – yet there are interesting news to mention, a bit far from ARM64 (the topic of my previous posts). Let’s talk today about kernel crashes, or even better, how can we collect information if a kernel panic happens on Arch Linux and on SteamOS, … Continue reading "Presenting kdumpst, or how to collect kernel crash logs on Arch Linux"

by gpiccoli at July 09, 2024 07:21 PM

July 08, 2024

Frédéric Wang

My recent contributions to Gecko (2/3)

Introduction

This is the second in a series of blog posts describing new web platform features Igalia has implemented in Gecko, as part of an effort to improve browser interoperability. I’ll talk about the task of implementing ‘content-visibility’, to which several Igalians have contributed since early 2022, and I’ll focus on two main roadblocks I had to overcome.

The ‘content-visibility’ property

In the past, Igalia worked on CSS containment, a feature allowing authors to isolate a subtree from the rest of the document to improve rendering performance. This is done using the ‘contain’ property, which accepts four kinds of containment: size, layout, style and paint.

‘content-visibility’ is a new property allowing authors to “hide” some content from the page, and save the browser unnecessary work by applying containment. The most interesting one is probably content-visibility: auto, which hides content that is not relevant to the user. This is essentially native “virtual scrolling”, allowing you to build virtualized or “recycled” lists without breaking accessibility and find-in-page.

To explain this, consider the typical example of a page with a series of posts, as shown below. By default, each post would have the four types of containment applied, plus it won’t be painted, won’t respond to hit-testing, and would use the dimensions specified in the ‘contain-intrinsic-size’ property. It’s only once a post becomes relevant to the user (e.g. when scrolled close enough to the viewport, or when focus is moved into the post) that the actual effort to properly render the content, and calculate its actual size, is performed:

div.post {
  content-visibility: auto;
  contain-intrinsic-size: 500px 1000px;
}
<div class="post">
...
</div>
<div class="post">
...
</div>
<div class="post">
...
</div>
<div class="post">
...
</div>

If a post later loses its relevance (e.g. when scrolled away, or when focus is lost) then it would use the dimensions specified by ‘contain-intrinsic-size’ again, discarding the content size that was obtained after layout. One can also avoid that and use the last remembered size instead:

div.post {
  contain-intrinsic-size: auto 500px auto 1000px;
}

Finally, there is also a content-visibility: hidden value, which is the same as content-visibility: auto but never reveals the content, enhancing other methods to hide content such as display: none or visibility: hidden.

This is just a quick overview of the feature, but I invite you to read the web.dev article on content-visibility for further details and thoughts.

Viewport distance for content-visibility: auto

As is often the case, the feature looks straightforward to implement, but issues appear when you get into the details.

In bug 1807253, my colleague Oriol Brufau raised an interoperability bug with a very simple test case, reproduced below for convenience. Chromium would report 0 and 42, whereas Firefox would sometimes report 0 twice, meaning that the post did not become relevant after a rendering update:

<!DOCTYPE html>
<div id="post" style="content-visibility: auto">
  <div style="height: 42px"></div>
</div>
<script>
console.log(post.clientHeight);
requestAnimationFrame(() => requestAnimationFrame(() => {
  console.log(post.clientHeight);
}));
</script>

It turned out that an early version of the specification relied too heavily on an modified version of IntersectionObserver to synchronously detect when an element is close to the viewport, as this was how it was implemented in Chromium. However, the initial implementation in Firefox relied on a standard IntersectionObserver (with asynchronous notifications of observers) and so failed to produce the behavior described in the specification. This issue was showing up in several WPT failures.

To solve that problem, the moment when we determine an element’s proximity to the viewport was moved into the HTML5 specification, at the step when the rendering is updated, more precisely when the ResizeObserver notifications are broadcast. My colleague Alexander Surkov had started rewriting Firefox’s implementation to align with this new behavior in early 2023, and I took over his work in November.

Since this touches the “update the rendering” step which is executed on every page, it was quite likely to break things… and indeed many regressions were caused by my patch, for example:

  • One regression was about white flickering of pages on every reload/navigation.
  • One more regression was about content-visibility: auto nodes not being rendered at all.
  • Another regression was about new resize loop errors appearing in tests.
  • Some test cases were also found where the “update the rendering step” would repeat indefinitely, causing performance regressions.
  • Last but not least, crashes were reported.

Some of these issues were due to the fact that support for the last remembered size in Firefox relied on an internal ResizeObserver. However, the CSS Box Sizing spec only says that the last remembered size is updated when ResizeObserver events are delivered, not that such an internal ResizeObserver object is actually needed. I removed this internal observer and ensured the last remembered size is computed directly in the “update the rendering” phase, making the whole thing simpler and more robust.

Dynamic changes to CSS ‘contain’ and ‘content-visibility’

Before sending the intent-to-ship, we reviewed remaining issues and stumbled on bug 1765615, which had been opened during the initial 2022 work. Mozilla indicated this performance bug was important enough to consider an optimization, so I started tackling the issue.

Elaborating a bit about what was mentioned above, a non-visible ‘content-visibility’ implies layout, style and paint containment, and when the element is not relevant to the user, it also implies size containment 1. This has certain side effects, for example paint and layout containment establish an independent formatting context and affect how the contained box interacts with floats and how margin collapsing applies. Style containment can even have more drastic consequences, since they make counter-* and *-quote properties scoped to the subtree.

When we dynamically modify the ‘contain’ or ‘content-visibility’ properties, or when the relevance of a content-visibility: auto element changes, browsers must make sure that the rendering is properly updated. It turned out that there were almost no tests for that, and unsurprisingly, Chromium and WebKit had various invalidation bugs. Firefox was always forcing a rebuild of the tree used for rendering, which avoided such bugs but is not optimal.

I wrote a couple of web platform tests for ‘contain’ and ‘content-visibility’ 2, and made sure that Firefox does the minimal invalidation effort needed, being careful not to cause any regressions. As a result, except for style containment changes, we’re now able to avoid the cost a rebuild of the tree used for rendering!

Conclusion

Almost two years after the initial work on ‘content-visibility’, I was able to send the intent-to-ship, and the feature finally became available in Firefox 125. Finishing the implementation work on this feature was challenging, but quite interesting to me.

I believe ‘content-visibility’ is a good example of why implementing a feature in different browsers is important to ensure that both the specification and tests are good enough. The lack of details in the spec regarding when we determine viewport proximity, and the absence for WPT tests for invalidation, definitely made the Firefox work take longer than expected. But finishing that implementation work was also useful for improving the spec, tests, and other implementations 3.

I’ll conclude this series of blog posts with fetch priority, which also has its own interesting story…

  1. In both cases, “implies” means the used value of ‘contain’ is modified accordingly. 

  2. One of the thing I had to handle with care was the update of the accessibility tree, since content that is not relevant to the user must not be exposed. Unfortunately it’s not possible to write WPT tests for accessibility yet so for now I had to write internal Firefox-specific non-regression tests. 

  3. Another interesting report happened after the release and is related to content-visibility: auto on elements drawn in a canvas

July 08, 2024 10:00 PM