Planet Igalia

November 19, 2024

Melissa Wen

Display/KMS Meeting at XDC 2024: Detailed Report

XDC 2024 in Montreal was another fantastic gathering for the Linux Graphics community. It was again a great time to immerse in the world of graphics development, engage in stimulating conversations, and learn from inspiring developers.

Many Igalia colleagues and I participated in the conference again, delivering multiple talks about our work on the Linux Graphics stack and also organizing the Display/KMS meeting. This blog post is a detailed report on the Display/KMS meeting held during this XDC edition.

Short on Time?

  1. Catch the lightning talk summarizing the meeting here (you can even speed it up to 2x).
  2. For a quick written summary, scroll down to the TL;DR section.

TL;DR

This meeting took 3 hours and tackled a variety of topics related to DRM/KMS (Linux DRM Kernel Mode Setting):

  • Sharing Drivers Between V4L2 and KMS: Brainstorming solutions for using a single driver for devices used in both camera capture and display pipelines.
  • Real-Time Scheduling: Addressing issues with non-blocking page flips being killed by SIGKILL under real-time scheduling.
  • HDR/Color Management: Agreement on merging the current proposal, with NVIDIA implementing its special cases on VKMS and adding missing parts on top of Harry Wentland’s (AMD) changes.
  • Display Mux: Collaborative design discussions focusing on compositor control and cross-sync considerations.
  • Better Commit Failure Feedback: Exploring ways to equip compositors with more detailed information for failure analysis.

Bringing together Linux display developers at XDC 2024

While I didn’t present a talk this year, I co-organized a Display/KMS meeting (with Rodrigo Siqueira of AMD) to build upon the momentum from the 2024 Linux Display Next hackfest. The meeting was attended by around 30 people in person and 4 remote participants.

Speakers: Melissa Wen (Igalia) and Rodrigo Siqueira (AMD)

Link: https://indico.freedesktop.org/event/6/contributions/383/

Topics: Similar to the hackfest, the meeting agenda was built over the first two days of the conference, mixing talk follow-ups with new ideas and ongoing community efforts.

The final agenda covered five topics in the scheduled order:

  1. How to share drivers between V4L2 and DRM for bridge-like components (new topic);
  2. Real-time Scheduling (problems encountered after the Display Next hackfest);
  3. HDR/Color Management (ofc);
  4. Display Mux (from Display hackfest and XDC 2024 talk, bringing AMD and NVIDIA together);
  5. (Better) Commit Failure Feedback (continuing the last minute topic of the Display Next hackfest).

Unpacking the Topics

Similar to the hackfest, the meeting agenda evolved over the conference. During the 3-hour meeting, I coordinated the room and the discussion rounds, while Rodrigo Siqueira took notes and later contacted key developers so we could provide a detailed report of the many topics discussed.

From his notes, let’s dive into the key discussions!

How to share drivers between V4L2 and KMS for bridge-like components

Led by Laurent Pinchart, we delved into the challenge of creating a unified driver for hardware devices (like scalers) that are used in both camera capture pipelines and display pipelines.

  • Problem Statement: How can we design a single kernel driver to handle devices that serve dual purposes in both V4L2 and DRM subsystems?
  • Potential Solutions:
    1. Multiple Compatible Strings: We could assign different compatible strings to the device tree node based on its usage in either the camera or display pipeline. However, this approach might raise concerns from device tree maintainers as it could be seen as a layer violation.
    2. Separate Abstractions: A single driver could expose the device to both DRM and V4L2 through separate abstractions: drm-bridge for DRM and V4L2 subdev for video. While simple, this approach requires maintaining two different abstractions for the same underlying device.
    3. Unified Kernel Abstraction: We could create a new, unified kernel abstraction that combines the best aspects of drm-bridge and V4L2 subdev. This approach offers a more elegant solution but requires significant design effort and potential migration challenges for existing hardware.

Real-Time Scheduling Challenges

We had discussed real-time scheduling during this year’s Linux Display Next hackfest and, at XDC 2024, Jonas Adahl brought up issues uncovered while making progress on this front.

  • Context: Non-blocking page flips can, on rare occasions, take a long time and, for that reason, get killed with SIGKILL if the thread doing the atomic commit runs under a real-time scheduling policy (see the sketch after this list).
  • Action items:
    • Explore alternative backtraces during the busy wait (e.g., ftrace).
    • Investigate the maximum thread time in busy wait to reproduce issues faced by compositors. Tools like RTKit (mutter) can be used for better control (Michel Dänzer can help with this setup).
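
For background on where the SIGKILL comes from: on Linux, threads promoted to a real-time policy are usually given an RLIMIT_RTTIME budget (RTKit applies one when it grants real-time priority), and a thread that busy-waits past that budget gets killed. Below is a minimal sketch of that setup, with illustrative values rather than anything mutter or RTKit actually uses:

```c
/* Minimal sketch (not mutter/RTKit code): make the calling thread real-time
 * and cap its real-time CPU budget.  The numbers are illustrative only. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

int make_realtime(void)
{
    struct sched_param sp = { .sched_priority = 10 };

    /* Switch this thread to the real-time round-robin policy. */
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }

    /* RLIMIT_RTTIME caps the CPU time (in microseconds) a real-time thread
     * may consume without making a blocking system call.  Exceeding the soft
     * limit raises SIGXCPU; hitting the hard limit results in SIGKILL, which
     * is what a long busy wait inside a non-blocking atomic commit can hit. */
    struct rlimit rl = { .rlim_cur = 200000, .rlim_max = 250000 };
    if (setrlimit(RLIMIT_RTTIME, &rl) != 0) {
        perror("setrlimit(RLIMIT_RTTIME)");
        return -1;
    }
    return 0;
}
```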

HDR/Color Management

This is a well-known topic with ongoing effort on all layers of the Linux display stack, and it has been discussed online and in person at conferences and meetings over the last few years.

Here’s a breakdown of the key points raised at this meeting:

  • Talk: Color operations for Linux color pipeline on AMD devices: On the previous day, Alex Hung (AMD) had presented the implementation of this API in the AMD display driver.
  • NVIDIA Integration: While they agree with the overall proposal, NVIDIA needs to add some missing parts. Importantly, they will implement these on top of Harry Wentland’s (AMD) proposal. Their specific requirements will be implemented on VKMS (Virtual Kernel Mode Setting driver) for further discussion. This VKMS implementation can benefit compositor developers by providing insights into NVIDIA’s specific needs.
  • Other vendors: There is a version of the KMS API applied to the Intel color pipeline. Apart from that, other vendors appear to be comfortable with the current proposal but lack the bandwidth to implement it right now.
  • Upstream Patches: The relevant upstream patches can be found here. [As humorously noted, this series is eagerly awaiting your “Acked-by” (approval).]
  • Compositor Side: The compositor developers have also made significant progress.
    • KDE has already implemented and validated the API through an experimental implementation in Kwin.
    • Gamescope currently uses a driver-specific implementation but has a draft that utilizes the generic version. However, some work is still required to fully transition away from the driver-specific approach. Action point: work on porting Gamescope to the generic KMS API.
    • Weston has also begun exploring implementation, and we might see something from them by the end of the year.
  • Kernel and Testing: The kernel API proposal is well-refined and meets the DRM subsystem requirements. Thanks to Harry Wentland’s effort, we already have the API wired up for two hardware vendors along with IGT tests and, thanks to Xaver Hugl, a compositor implementation in place.

Finally, there was a strong sense of agreement that the current proposal for HDR/Color Management is ready to be merged. In simpler terms, everything seems to be working well on the technical side - all signs point to merging and “shipping” the DRM/KMS plane color management API!

Display Mux

During the meeting, Daniel Dadap led a brainstorming session on the design of the display mux switching sequence, in which the compositor would arm the switch via sysfs, then send a modeset to the outgoing driver, followed by a modeset to the incoming driver.

  • Context: This topic follows up on the display mux discussions from the Display Next hackfest and the XDC 2024 talk that brought AMD and NVIDIA together.
  • Key Considerations:
    • HPD Handling: There was a general consensus that disabling HPD can be part of the sequence for internal panels and we don’t need to focus on it here.
    • Cross-Sync: Ensuring synchronization between the compositor and the drivers is crucial. The compositor should act as the “drm-master” to coordinate the entire sequence, but how can this be ensured?
    • Future-Proofing: The design should not assume the presence of a mux. In future scenarios, direct sharing over DP might be possible.
  • Action points:
    • Sharing DP AUX: Explore the idea of sharing DP AUX and its implications.
    • Backlight: The backlight definition represents a problem in the mux switch context, so we should explore some of the current specs available for that.

Towards Better Commit Failure Feedback

In the last part of the meeting, Xaver Hugl asked for better commit failure feedback.

  • Problem description: Compositors currently face challenges in collecting detailed information from the kernel about commit failures. This lack of granular data hinders their ability to understand and address the root causes of these failures.

To address this issue, we discussed several potential improvements:

  • Direct Kernel Log Access: One idea is to directly load relevant kernel logs into the compositor. This would provide more detailed information about the failure and potentially aid in debugging.
  • Finer-Grained Failure Reporting: We also explored the possibility of separating atomic failures into more specific categories. Not all failures are critical, and understanding the nature of the failure can help compositors take appropriate action.
  • Enhanced Logging: Currently, the dmesg log doesn’t provide enough information for user-space validation. Raising the log level to capture more detailed information during failures could be a viable solution.
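
For reference, today a compositor typically gets little more than an errno back from an atomic commit, with no indication of which object or property caused the rejection. A minimal libdrm sketch of the usual test-then-commit pattern (assuming a DRM file descriptor and an already-populated atomic request; function and flag names are from libdrm’s atomic API):

```c
/* Sketch: probing an atomic commit with libdrm.  On failure the compositor
 * only sees an errno such as EINVAL; the kernel does not say which
 * plane/CRTC/property caused the rejection, which is the gap discussed above. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int try_commit(int drm_fd, drmModeAtomicReqPtr req)
{
    /* First check the configuration without applying it. */
    int ret = drmModeAtomicCommit(drm_fd, req,
                                  DRM_MODE_ATOMIC_TEST_ONLY, NULL);
    if (ret != 0) {
        fprintf(stderr, "test-only commit rejected: %s\n", strerror(errno));
        return ret;  /* all the feedback we currently get */
    }

    /* Configuration is acceptable; apply it without blocking. */
    ret = drmModeAtomicCommit(drm_fd, req,
                              DRM_MODE_ATOMIC_NONBLOCK, NULL);
    if (ret != 0)
        fprintf(stderr, "commit failed: %s\n", strerror(errno));
    return ret;
}
```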

By implementing these improvements, we aim to equip compositors with the necessary tools to better understand and resolve commit failures, leading to a more robust and stable display system.

A Big Thank You!

Huge thanks to Rodrigo Siqueira for these detailed meeting notes. Thanks also to Laurent Pinchart, Jonas Adahl, Daniel Dadap, Xaver Hugl, and Harry Wentland for bringing up interesting topics and leading discussions. Finally, thanks to all the participants who enriched the discussions with their experience, ideas, and inputs, especially Alex Goins, Antonino Maniscalco, Austin Shafer, Daniel Stone, Demi Obenour, Jessica Zhang, Joan Torres, Leo Li, Liviu Dudau, Mario Limonciello, Michel Dänzer, Rob Clark, Simon Ser and Teddy Li.

This collaborative effort will undoubtedly contribute to the continued development of the Linux display stack.

Stay tuned for future updates!

November 19, 2024 01:00 PM

November 18, 2024

Ricardo García

My XDC 2024 talk about VK_EXT_device_generated_commands

Some days ago I wrote about the new VK_EXT_device_generated_commands Vulkan extension that had just been made public. Soon after that, I presented a talk at XDC 2024 with a brief introduction to it. It’s a lightning talk that lasts just about 7 minutes and you can find the embedded video below, as well as the slides and the talk transcription if you prefer written formats.

Truth be told, the topic deserves a longer presentation, for sure. However, when I submitted my talk proposal for XDC I wasn’t sure if the extension was going to be public by the time XDC took place. That left me with two options: submit a half-slot talk and, if the extension was still not public, spend 15 minutes talking about some general concepts and a couple of NVIDIA vendor-specific extensions (VK_NV_device_generated_commands and VK_NV_device_generated_commands_compute), which would have been awkward; or go with a lightning talk covering those general concepts and, maybe, some VK_EXT_device_generated_commands specifics if the extension was public by then, which is exactly what happened.

Fortunately, I will talk again about the extension at Vulkanised 2025. It will be a longer talk and I will cover the topic in more depth. See you in Cambridge in February and, for those not attending, stay tuned because Vulkanised talks are recorded and later uploaded to YouTube. I’ll post the link here and in social media once it’s available.

XDC 2024 recording

Talk slides and transcription

Title slide: Device-Generated Commands in Vulkan (VK_EXT_device_generated_commands)

Hello, I’m Ricardo from Igalia and I’m going to talk about Device-Generated Commands in Vulkan. This is a new extension that was released a couple of weeks ago. I wrote CTS tests for it, I helped with the spec, and I worked with some actual heroes, some of them present in this room, who managed to get this implemented in a driver.

What are device-generated commands?

Device-Generated Commands is an extension that allows apps to go one step further in GPU-driven rendering because it makes it possible to write commands to a storage buffer from the GPU and later execute the contents of the buffer without needing to go through the CPU to record those commands, like you typically do by calling vkCmd functions working with regular command buffers.

It’s one step ahead of indirect draws and dispatches, and one step behind work graphs.

Naïve CPU-based approach

Getting away from Vulkan momentarily, if you want to store commands in a storage buffer there are many possible ways to do it. A naïve approach we can think of is creating the buffer as you see in the slide. We assign a number to each Vulkan command and store it in the buffer. Then, depending on the command, more or less data follows. For example, let’s take the sequence of commands in the slide: (1) push constants followed by (2) dispatch. We can store a token number, or command id, or whatever you want to call it, to indicate push constants, then we follow with metadata about the command (which is the section in green color) containing the layout, stage flags, offset and size of the push constants. Finally, depending on the size, we store the push constant values, which is the first chunk of data in blue. For the dispatch it’s similar, only that it doesn’t need metadata because we only want the dispatch dimensions.
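
To make that naïve scheme a bit more concrete, here is a small sketch of what such an encoding could look like; the token ids and the metadata struct are invented for illustration and are not part of any API:

```c
/* Sketch of the naive encoding from the slide: a token id, optional
 * metadata, then the command's data, packed one command after another.
 * The ids and metadata layout here are made up for illustration only. */
#include <stdint.h>
#include <string.h>

enum { TOKEN_PUSH_CONSTANTS = 1, TOKEN_DISPATCH = 2 };

struct push_constants_meta {   /* the "green" metadata section */
    uint64_t layout_handle;
    uint32_t stage_flags;
    uint32_t offset;
    uint32_t size;
};

static size_t encode_commands(uint8_t *buf,
                              const struct push_constants_meta *meta,
                              const void *pc_values,
                              const uint32_t dispatch_dims[3])
{
    size_t off = 0;
    uint32_t token;

    /* (1) push constants: token, metadata, then the values themselves */
    token = TOKEN_PUSH_CONSTANTS;
    memcpy(buf + off, &token, sizeof(token));   off += sizeof(token);
    memcpy(buf + off, meta, sizeof(*meta));     off += sizeof(*meta);
    memcpy(buf + off, pc_values, meta->size);   off += meta->size;

    /* (2) dispatch: token followed only by the dispatch dimensions */
    token = TOKEN_DISPATCH;
    memcpy(buf + off, &token, sizeof(token));   off += sizeof(token);
    memcpy(buf + off, dispatch_dims, 3 * sizeof(uint32_t));
    off += 3 * sizeof(uint32_t);

    return off;  /* a variable-length stream the GPU would struggle to parse */
}
```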

But this is not how GPUs work. A GPU would have a very hard time processing this. Also, Vulkan doesn’t work like this either. We want to make it possible to process things in parallel and provide as much information in advance as possible to the driver.

VK_EXT_device_generated_commands

So in Vulkan things are different. The buffer will not contain an arbitrary sequence of commands where you don’t know which one comes next. What we do is to create an Indirect Commands Layout. This is the main concept. The layout is like a template for a short sequence of commands. We create this layout using the tokens and meta-data that we saw colored red and green in the previous slide.

We specify the layout we will use in advance and, in the buffer, we only store the actual data for each command. The result is that the buffer containing commands (let’s call it the DGC buffer) is divided into small chunks, called sequences in the spec, and the buffer can contain many such sequences, but all of them follow the layout we specified in advance.

In the example, we have push constant values of a known size followed by the dispatch dimensions. Push constant values, dispatch. Push constant values, dispatch. Etc.
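
To contrast with the naïve encoding, here is a conceptual sketch of what the DGC buffer for this example boils down to; the struct is illustrative only, since the real stride and packing are described by the indirect commands layout you created:

```c
/* Conceptual sketch of the DGC buffer for the example layout:
 * push constant values of a known size followed by dispatch dimensions,
 * repeated once per sequence.  The struct is illustrative only; the real
 * stride and packing come from the indirect commands layout object. */
#include <stdint.h>

struct example_sequence {
    uint32_t push_constant_values[4]; /* known size, agreed in the layout */
    uint32_t dispatch_x, dispatch_y, dispatch_z;
};

/* The DGC buffer is simply many such sequences back to back: */
struct example_sequence dgc_buffer[1024];
/* dgc_buffer[0], dgc_buffer[1], ... all follow the same pre-declared layout,
 * so the driver/GPU can process them in parallel without parsing tokens. */
```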

Restricted Command Selection

The second thing Vulkan does is to severely limit the selection of available commands. You can’t just start render passes or bind descriptor sets or do anything you can do in a regular command buffer. You can only do a few things, and they’re all in this slide. There’s general stuff like push constants, stuff related to graphics like draw commands and binding vertex and index buffers, and stuff to dispatch compute or ray tracing work. That’s it.

Moreover, each layout must have one token that dispatches work (draw, compute, trace rays) but you can only have one and it must be the last one in the layout.

Indirect Execution Sets

Something that’s optional (not every implementation is going to support this) is being able to switch pipelines or shaders on the fly for each sequence.

Summing up, in implementations that allow you to do it, you have to create something new called Indirect Execution Sets, which are groups or arrays of pipelines that are more or less identical in state and, basically, only differ in the shaders they include.

Inside each set, each pipeline gets an index and you can change the pipeline used for each sequence by (1) specifying the Execution Set in advance (2) using an execution set token in the layout, and (3) storing a pipeline index in the DGC buffer as the token data.

Quick How-To

The summary of how to use it would be:

First, create the commands layout and, optionally, create the indirect execution set if you’ll switch pipelines and the driver supports that.

Then, get a rough idea of the maximum number of sequences that you’ll run in a single batch.

With that, create the DGC buffer, query the required preprocess buffer size, which is an auxiliary buffer used by some implementations, and allocate both.

Then, you record the regular command buffer normally and specify the state you’ll use for DGC. This also includes some commands that dispatch work that fills the DGC buffer somehow.

Finally, you dispatch indirect work by calling vkCmdExecuteGeneratedCommandsEXT. Note you need a barrier to synchronize previous writes to the DGC buffer with reads from it.

You can also do explicit preprocessing but I won’t go into detail here.

Closing slide: Thanks for watching!

That’s it. Thanks for watching, thanks Valve for funding a big chunk of the work involved in shipping this, and thanks to everyone who contributed!

November 18, 2024 03:55 PM

Pablo Saavedra

Building a Sustainable Rack for Embedded Boards Management

In my daily work, I often interact with various embedded system boards. However, the lack of an organized setup for these devices frequently becomes a frustrating hurdle.

The challenges are twofold:

  1. Disorganized Environment: The absence of a centralized space to keep devices neatly arranged complicates workflow.
  2. On-Site Dependency: Tasks like updating operating system images, interacting with input devices across multiple boards, monitoring display outputs, or managing power cycles require physically being present in front of the systems.
[This could very well be an ode to the Flying Spaghetti Monster]

To address these issues, I’ve started planning a new personal project: building a custom rack-mounted workstation for the embedded system boards I commonly use in development.

With this setup I aim to improve my efficiency by obtaining remote management capabilities for my development tasks. Here are the key objectives I’m focusing on for this:

  1. Remote Power Management:
    Implementing the capability to remotely power the embedded boards on and off.
  2. Input Device Multiplexing:
    Organize the space by arranging the boards and cables into a rack, and centralize the management of input devices such as a mouse, touchscreen, and keyboard for multiple boards using a KVM switch. This eliminates the need for having separate peripherals connected to each board.
  3. Remote KVM Access:
    Enabling remote access to HDMI outputs for up to eight devices, allowing me to monitor screen outputs without physical proximity.

While there are still some trade-offs, like limited remote interaction with display peripherals, the setup addresses my primary needs effectively.

I really believe this solution will streamline my day-to-day workflow, minimize interruptions, and eliminate the need to be tethered to a specific location for embedded system tasks.

Once complete, I hope this project will serve as a template for others who face similar challenges in embedded systems development.


Technical Solution

To implement my goals for this custom rack-mounted workstation I’ve identified the following technical solutions:

  1. Remote Power Management:
    I plan to use a managed PoE switch (Zyxel GS1900 series) that is supported by an open source library for easy integration. The boards will be powered directly via PoE using compatible adapters.
  2. Input Device Multiplexing:
    After scanning the market, I found a KVM switch (TESmart KVM HDMI HKS801-E23) that offers LAN-based management for selecting the active output. This will allow centralized input management with remote accessibility.
  3. Remote KVM Access:
    I’ll rely on an HDMI H.264 video encoder for streaming display outputs from the embedded boards. This solution ensures efficient and high-quality remote monitoring. I will connect the HDMI output of the KVM to an HDMI splitter to duplicate the signal. One of the outputs will go to a monitor, while the other will be connected to the HDMI Livestream Video Encoder. This device will generate a video stream accessible over the network, allowing remote viewing of the KVM’s active output. Disclaimer: This solution has a notable limitation. The design does not enable remote management of peripherals. This is not a major issue for me, as I usually don’t need to interact extensively with the display. However, in cases where interaction is necessary, I mitigate this limitation by using uinput to create and simulate virtual input devices.
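
For reference, the uinput mitigation mentioned above can look roughly like the following minimal sketch, which creates a virtual keyboard and injects a single key press (device name, ids and key code are arbitrary, and error handling is trimmed):

```c
/* Sketch: create a virtual keyboard through /dev/uinput and type one key.
 * Needs a kernel with uinput enabled and permission to open the device. */
#include <fcntl.h>
#include <linux/uinput.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

static void emit(int fd, int type, int code, int val)
{
    struct input_event ie = { .type = type, .code = code, .value = val };
    write(fd, &ie, sizeof(ie));
}

int main(void)
{
    int fd = open("/dev/uinput", O_WRONLY | O_NONBLOCK);
    if (fd < 0)
        return 1;

    /* Declare that this device emits key events, and which key. */
    ioctl(fd, UI_SET_EVBIT, EV_KEY);
    ioctl(fd, UI_SET_KEYBIT, KEY_ENTER);

    struct uinput_setup usetup;
    memset(&usetup, 0, sizeof(usetup));
    usetup.id.bustype = BUS_USB;
    usetup.id.vendor  = 0x1234;  /* arbitrary ids for the virtual device */
    usetup.id.product = 0x5678;
    strcpy(usetup.name, "rack-virtual-keyboard");
    ioctl(fd, UI_DEV_SETUP, &usetup);
    ioctl(fd, UI_DEV_CREATE);

    sleep(1);                       /* give userspace time to pick it up */
    emit(fd, EV_KEY, KEY_ENTER, 1); /* press   */
    emit(fd, EV_SYN, SYN_REPORT, 0);
    emit(fd, EV_KEY, KEY_ENTER, 0); /* release */
    emit(fd, EV_SYN, SYN_REPORT, 0);

    ioctl(fd, UI_DEV_DESTROY);
    close(fd);
    return 0;
}
```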

For the rack design, I have decided on an 8U rack, which fits perfectly in the available space. Given the variety and number of embedded boards I work with, the setup needs to be flexible. To accommodate this:

  • Removable Trays:
    I’ll use sliding trays to hold the boards. Although the boards and their cables will remain loose, as they are now, the trays will allow me to keep them neatly concealed when not in use.
  • Minimal Frame Structure: Instead of using a cabinet, I will construct the rack frame using steel mounting rails. The trays and all other devices will be mounted directly onto these rails.

Stay tuned—I’ll share updates and learnings as I bring this project to life. If you have ideas or experiences to share, feel free to reach out.


Appendix: List of the purchased hardware for the project

by Pablo Saavedra at November 18, 2024 07:46 AM

November 15, 2024

Manuel Rego

Servo Revival: 2023-2024

As many of you already know, Igalia took over the maintenance of the Servo project in January 2023. We’ve been working hard on bringing the project back to life again, and this blog post is a summary of our achievements so far.

Some history first #

You can skip this section if you already know about the Servo project.

Servo is an experimental browser engine created by Mozilla in 2012. From the very beginning, it was developed alongside the Rust language, as a showcase for the new language that Mozilla was developing. Servo has always aimed to be a performant and secure web rendering engine, as those are also main characteristics of the Rust language.

Mozilla was the main force behind Servo’s development for many years, with some other companies like Samsung collaborating too. Some of Servo’s components, like Stylo and WebRender, were adopted and used in Firefox releases by 2017, and continue to be used there today.

In 2020, Mozilla laid off the whole Servo team, and the project moved to the Linux Foundation. Despite some initial interest in the project, by the end of 2020 there was barely any active work on the project, and 2021 and 2022 weren’t any better. At that point, many considered the Servo project to be abandoned.

2023: Igalia takes over maintenance of Servo #

Things changed in 2023, when Igalia got external funding to work on Servo and get the project moving again. We have previous experience working on Servo during the Mozilla years, and we also have a wide experience working on other web rendering engines.

To explore new markets and grow a stable community, the Servo project joined Linux Foundation Europe in September 2023, and is now one of the most active projects there. LF Europe is an umbrella organization that hosts the Servo project, and provides us with opportunities to showcase Servo at several events.

Over the last two years, Igalia has had a team of around five engineers working full-time on the Servo project. We’re taking care of the project maintenance, communication, community governance, and a large portion of development. We have also been looking for new partners and organizations that may be interested in the Servo project, and can devote resources or money towards moving it forward.

Servo Project Updates talk at Linux Foundation Europe Member Summit 2024 (picture by Linux Foundation)

Achievements #

Two years is not a lot of time for a project like Servo, and we’re very proud of what we’ve achieved in this period. Some highlights, including work from Igalia and the wider Servo community:

  • Maintenance: General project maintenance, tooling, documentation, community, and governance work. Lots of work has been put into upgrading Servo dependencies, especially the big ones (SpiderMonkey, Stylo and WebRender). There have been improvements in Servo’s CI and tooling. We now have a Servo book, the Servo community has been growing, and the Technical Steering Committee (TSC) is running periodic meetings every month.
  • Communications: The project is alive again and we want the world to know it, so we’ve been doing monthly blog posts, weekly updates on social media, plus conference talks and booths at several events. The project is more popular than ever, and our GitHub stars haven’t stopped growing.
  • Layout engine: Servo had two layout engines: its original one, and a newer one started around 2020, which follows specs more closely but was in very early stages. After some analysis, we decided to go with the new layout engine and start to implement features in it like floats, tables, flexbox, fonts, right-to-left, etc. We now pass 1.4 million subtests (1,401,889 at the time of publishing this blog post) in the Web Platform Tests. And of the tests we run today, our overall pass rate has gone from 40.8% in April 2023 to 62.0% (+21.2). 📈
  • New platforms: Servo used to support Linux, macOS, and Windows. That’s still the case, but we’ve also added support for Android and OpenHarmony, helping us reach a wide range of mobile devices. 📱
  • Embedded experiments: We’ve been working with the developers of Tauri, making Servo easier to embed and creating a first prototype of using Servo as web engine in WRY. Our community has also experimented with integrating Servo in other projects, including Blitz, Qt WebView, and Verso.
  • Outreachy: Outreachy is an internship program for getting people from underrepresented groups into our industry via FOSS, and we’ve rejoined the program in 2024. Our intern in the May cohort, @eerii, brought DevTools support back to Servo.

Of course, it’s impossible to summarize all of the work that happened in so much time, with so many people working together to achieve these great results. Big thanks to everyone that has collaborated on the project! 🙏

Screencast of Servo browsing the servo.org website and some Wikipedia articles

Some numbers #

While it’s hard to list all of our achievements, we can take a look at some stats. Numbers are always tricky to draw conclusions from, but we’ve been taking a look at the number of PRs merged in Servo’s main repository since 2018, to understand how the project is faring.

                     2018    2019    2020    2021    2022    2023    2024 (Jan-Oct)
PRs                 1,188     986     669     118      65     776    1,440
Contributors        27.33   27.17   14.75    4.92    2.83   11.33    26.70
Contributors ≥ 10    2.58    1.67    1.17    0.08    0.00    1.58     4.20
  • PRs: total number of PRs merged. We’re now doing more than we were doing in 2018-2019.
  • Contributors: average number of contributors per month. We’re getting very close to the 2018-2019 numbers; even though the average can still vary in the next two months, it looks like a good number.
  • Contributors ≥ 10: average number of contributors that have merged more than 10 PRs per month. In this case we already have bigger numbers than in 2018-2019.

As a clarification, these numbers don’t include PRs from bots (dependabot and Servo WPT Sync).

Our participation in Outreachy has also made a marked difference. During each month-long contribution period, we get a huge influx of new people contributing to the project. This year, we participated in both the May and December cohorts, and the contribution periods are very visible in our March and October 2024 stats:

                     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct
PRs                  103    97   255   111    71    96   103   188   167   249
Contributors          19    22    32    27    23    22    30    31    24    37
Contributors ≥ 10      2     3     9     3     2     2     2     6     5     8

Overall, we think these numbers confirm that the project is back to life, and that the community is growing and engaging with it. We’re very happy about them.

Donations #

Early this year we set up the Servo Open Collective and GitHub Sponsors, where many people and organizations have since been donating to the project. We decide how to spend this money transparently in the TSC, and so far we’ve used it to cover the project’s infrastructure costs, like self-hosted runners to speed up CI times.

We’re very grateful to everyone donating money to the project, so far adding up to over $24,500 from more than 500 donors. Thank you! 🙏

Servo booth at Linux Foundation Europe Member Summit 2024 (picture by Linux Foundation)

About the future #

Servo has a lot of potential. Several things make it a very special project:

  • Independent: Servo is managed under Linux Foundation Europe with an open governance model, which we believe is very important for the open web ecosystem as a whole.
  • Performant: Servo is the only web rendering engine that uses parallel layout for web content. Though Servo is not yet faster than other web engines that are already production-ready, we’re exploring new ways to render web content that take advantage of modern CPUs with many cores.
  • Secure: Servo is written in Rust, which prevents many security vulnerabilities by design, such as memory safety bugs. Other kinds of unsoundness, such as concurrency bugs, are also easier to manage in Rust.

Of course, the project has some challenges too:

  • Experimental: Servo is still in an experimental phase and is not a production-ready product. We still lack several features that are important for web content.
  • Competition: The dominant web rendering engines have huge teams behind them, working on making them better, faster, and more complete. Catching up with them is a really complex task.
  • Funding: This is a key topic for any open source project, and Servo is no different in this regard. We need organizations from public and private sectors to join efforts with us, and grow a healthy ecosystem around the project.

To finish on a positive note, we also have some key opportunities:

  • Single web apps: Servo may excel at rendering specific web apps in embedded scenarios, where the HTML and CSS features are known in advance.
  • UI frameworks: UI frameworks in the Rust ecosystem could start using Servo as their web engine to render their contents.
  • Web engine: Servo could become the default web engine for any Rust application.
  • Web browser: In the future, we could grow into being able to develop a full-featured web browser.

The future of Servo, as with many other open source projects, is still uncertain. We’ll see how things go in the coming years, but we would love to see that the Servo project can keep growing.

Servo is a huge project. To keep it alive and making progress, we need continuous funding on a bigger scale than crowdfunding can generally accomplish. If you’re interested in contributing to the project or sponsoring the development of specific functionality, please contact us at join@servo.org or igalia.com/contact.

Let’s hope we can walk this path together and keep working on Servo for many years ahead. 🤞

PS: Special thanks to my colleague Delan Azabani for proofreading and copyediting this blog post. 🙏

by Manuel Rego Casasnovas at November 15, 2024 12:00 AM

November 03, 2024

Mario Sanchez Prada

Igalia and WebKit: status update and plans (2024)

It’s been more than 2 years since the last time I wrote something here, and in that time a lot of things happened. Among those, one of the main highlights was me moving back to Igalia‘s WebKit team, but this time I moved as part of Igalia’s support infrastructure to help with other types of tasks such as general coordination, team facilitation and project management, among other things.

On top of those things, I’ve been also presenting our work around WebKit in different venues, such as in the Embedded Open Source Summit or in the Embedded Recipes conference, for instance. Of course, that included presenting our work in the WebKit community as part of the WebKit Contributors Meeting, a small and technically focused event that happens every year, normally around the Bay Area (California). That’s often a pretty dense presentation where, over the course of 30-40 minutes, we go through all the main areas that we at Igalia contribute to in WebKit, trying to summarize our main contributions in the previous 12 months. This includes work not just from the WebKit team, but also from other ones such as our Web Platform, Compilers or Multimedia teams.

So far I have done that only a couple of times: last year on October 24th, and again this year, just a couple of weeks ago, in the latest instance of the WebKit Contributors Meeting. I believe the session was interesting and informative but, unfortunately, it does not get recorded, so this time I thought I’d write a blog post to make it more widely accessible to people not attending that event.

This is a long read, so maybe grab a cup of your favorite beverage first…

Igalia and WebKit

So first of all, what is the relationship between Igalia and the WebKit project?


In a nutshell, we are the lead developers and the maintainers of the two Linux-based WebKit ports, known as WebKitGTK and WPE. These ports share a common baseline (e.g. GLib, GStreamer, libsoup) and also some goals (e.g. performance, security), but other than that their purpose is different, with WebKitGTK being aimed at the Linux desktop, while WPE is mainly focused on embedded devices.


This means that, while WebKitGTK is the go-to solution to embed Web content in GTK applications (e.g. GNOME Web/Epiphany, Evolution), and therefore integrates well with that graphical toolkit, WPE does not even provide a graphical toolkit since its main goal is to be able to run well on embedded devices that often don’t even have a lot of memory or processing power, or not even the usual mechanisms for I/O that we are used to in desktop computers. This is why WPE’s architecture is designed with flexibility in mind with a backends-based architecture, why it aims for using as few resources as possible, and why it tries to depend on as few libraries as possible, so you can integrate it virtually in any kind of embedded Linux platform.

Besides that port-specific work, which is what our WebKit and Multimedia teams focus a lot of their effort on, we also contribute at a different level in the port-agnostic parts of WebKit, mostly around the area of Web standards (e.g. contributing to Web specifications and to implement them) and the Javascript engine. This work is carried out by our Web Platform and Compilers team, which tirelessly contribute to the different parts of WebCore and JavaScriptCore that affect not just the WebKitGTK and WPE ports, but also the rest of them to a bigger or smaller degree.

Last but not least, we also devote a considerable amount of our time to other topics such as accessibility, performance, bug fixing, QA... and also to make sure WebKit works well on 32-bit devices, which is an important thing for a lot of WPE users out there.

Who are our users?

At Igalia we distinguish 4 main types of users of the WebKitGTK and WPE ports of WebKit:

Port users: this category would include anyone that writes a product directly against the port’s API, that is, apps such as a desktop Web browser or embedded systems that rely on a fullscreen Web view to render its Web-based content (e.g. digital signage systems).

Platform providers: in this category we would have developers that build frameworks with one of the Linux ports at its core, so that people relying on such frameworks can leverage the power of the Web without having to directly interface with the port’s API. RDK could be a good example of this use case, with WPE at the core of the so-called Thunder plugin (previously known as WPEFramework).

Web developers: of course, Web developers willing to develop and test their applications against our ports need to be considered here too, as they come with a different set of needs that need to be fulfilled, beyond rendering their Web content (e.g. using the Web Inspector).

End users: And finally, the end user is the last piece of the puzzle we need to pay attention to, as that’s what makes all this effort a task worth undertaking, even if most of them most likely don’t even need to know what WebKit is, which is perfectly fine :-)

We like to make this distinction between 4 possible types of users explicit because we think it’s important to understand the breadth of use cases and the diversity of potential users and customers we need to serve, which is what drives our decisions and the way we prioritize our work.

Strategic goals

Our main goal is that our product, the WebKit web engine, is useful for more and more people in different situations. Because of this, it is important that the platform is homogeneous and that it can be used reliably with all the engines available nowadays, and this is why compatibility and interoperability are a must, and why we work with the standards bodies to help with the design and implementation of several Web specifications.

With WPE, it is very important to be able to run the engine in small embedded devices, and that requires good performance and being efficient in multiple hardware architectures, as well as great flexibility for specific hardware, which is why we provided WPE with a backend-based architecture, and reduced dependencies to a minimum.

Then, it is also important that the QA Infrastructure is good enough to keep the releases working and with good quality, which is why I regularly maintain, evolve and keep an eye on the EWS and post-commit bots that keep WebKitGTK and WPE building, running and passing the tens of thousands of tests that we need to check continuously, to ensure we don’t regress (or that we catch issues soon enough, when there’s a problem). Then of course it’s also important to keep doing security releases, making sure that we release stable versions with fixes to the different CVEs reported as soon as possible.

Finally, we also make sure that we keep evolving our tooling as much as possible (see for instance the release of the new SDK earlier this year), as well as improving the documentation for both ports.

Last, all this effort would not be possible if we did not also consider it a goal of ours to maintain an efficient collaboration with the rest of the WebKit community in different ways, from making sure we re-use and contribute as much code as possible to other ports, to making sure we communicate well in all the available forums (e.g. Slack, mailing list, annual meeting).

Contributions to WebKit in numbers

Well, first of all the usual disclaimer: number of commits is for sure not the best possible metric, and therefore should be taken with a grain of salt. However, the point here is not to focus too much on the actual numbers but on the more general conclusions that can be extracted from them, and from that point of view I believe it’s interesting to take a look at this data at least once a year.

Igalia contributions to WebKit (2024)

With that out of the way, it’s interesting to confirm that once again we are still the 2nd biggest contributor to WebKit after Apple, with ~13% of the commits landed in this past 12-month period. More specifically, we landed 2027 patches out of the 15617 ones that took place during the past year, only surpassed by Apple and their 12456 commits. The remaining 1134 patches were landed mostly by Sony, followed by RedHat and several other contributors.

Igalia contributions to WebKit (2024)

Now, if we remove Apple from the picture, we can observe how this year our contributions represented ~64% of all the non-Apple commits, a figure that grew about ~11% compared to the past year. This confirms once again our commitment to WebKit, a project we started contributing to about 14 years ago, and where we have systematically been the 2nd top contributor for a while now.

Main areas of work

The 10 main areas we have contributed to in WebKit in the past 12 months are the following ones:

  • Web platform
  • Graphics
  • Multimedia
  • JavaScriptCore
  • New WPE API
  • WebKit on Android
  • Quality assurance
  • Security
  • Tooling
  • Documentation

In the next sections I’ll talk a bit about what we’ve done and what we’re planning to do next for each of them.

Web Platform

content-visibility:auto

This feature allows skipping painting and rendering of off-screen sections, which is particularly useful to avoid the browser spending time rendering parts of large pages that are not visible, as content outside of the view doesn’t get rendered until it becomes visible.

We completed the implementation and it’s now enabled by default.

Navigation API

This is a new API to manage browser navigation actions and examine history, which we started working on in the past cycle. There’s been a lot of work happening here and, while it’s not finished yet, the current plan is that Apple will continue working on that in the next months.

hasUAVisualTransition

This is an attribute of the NavigateEvent interface, which is meant to be true if the User Agent has performed a visual transition before a navigation event. This is something we have also finished implementing, and it is now enabled by default as well.

Secure Curves in the Web Cryptography API

In this case, we worked on fixing several Web Interop related issues, as well as on increasing test coverage within the Web Platform Tests (WPT) test suites.

On top of that we also moved the X25519 feature to the “prepare to ship” stage.

Trusted Types

This work is related to reducing DOM-based XSS attacks. Here we finished the implementation and this is now pending to be enabled by default.

MathML

We continued working on the MathML specification by working on the support for padding, border and margin, as well as by increasing the WPT score by ~5%.

The plan for next year is to continue working on core features and improve the interaction with CSS.

Cross-root ARIA

Web components have accessibility-related issues with native Shadow DOM as you cannot reference elements with ARIA attributes across boundaries. We haven’t worked on this in this period, but the plan is to work in the next months on implementing the Reference Target proposal to solve those issues.

Canvas Formatted Text

Canvas does not have a solution for adding formatted and multi-line text, so we would also like to work on exploring and prototyping the Canvas Place Element proposal in WebKit, which allows better text in canvas and more extended features.

Graphics

Completed migration from Cairo to Skia for the Linux ports

If you have followed the latest developments, you probably already know that the Linux WebKit ports (i.e. WebKitGTK and WPE) have moved from Cairo to Skia for their 2D rendering library, which was a pretty big and important decision taken after a long time trying different approaches and experiments (including developing our own HW-accelerated 2D rendering library!), as well as running several tests and measuring results in different benchmarks.

The results in the end were pretty overwhelming and we decided to give Skia a go, and we are happy to say that, as of today, the migration has been completed: we covered all the use cases in Cairo, achieving feature parity, and we are now working on implementing new features and improvements built on top of Skia (e.g. GPU-based 2D rendering).

On top of that, Skia is now the default backend for WebKitGTK and WPE since 2.46.0, released on September 17th, so if you’re building a recent version of those ports you’ll already be using Skia as their 2D rendering backend. Note that Skia is using its GPU-based backend only on desktop environments; on embedded devices the situation is trickier and, for now, the default is the CPU-based Skia backend, but we are actively working to narrow the gap and to enable GPU-based rendering on embedded too.

Architecture changes with buffer sharing APIs (DMABuf)

We did a lot of work here, such as a big refactoring of the fencing system to control the access to the buffers, or the continued work towards integrating with Apple’s DisplayLink infrastructure.

On top of that, we also enabled more efficient composition using damaging information, so that we don’t need to pass that much information to the compositor, which would slow the CPU down.

Enablement of the GPUProcess

On this front, we enabled by default the compilation for WebGL rendering using the GPU process, and we are currently working on performance review and on enabling it for other types of rendering.

New SVG engine (LBSE: Layer-Based SVG Engine)

If you are not familiar with this, here the idea is to make sure that we reuse the graphics pipeline used for HTML and CSS rendering, and use it also for SVG, instead of having its own pipeline. This means, among other things, that SVG layers will be supported as a 1st-class citizen in the engine, enabling HW-accelerated animations, as well as support for 3D transformations for individual SVG elements.


On this front, in this cycle we added support for the missing features in the LBSE, namely:

  • Implemented support for gradients & patterns (applicable to both fill and stroke)
  • Implemented support for clipping & masking (for all shapes/text)
  • Implemented support for markers
  • Helped review implementation of SVG filters (done by Apple)

Besides all this, we also improved the performance of the new layer-based engine by reducing repaints and re-layouts as much as possible (further optimizations still possible), narrowing the performance gap with the current engine for MotionMark. While we are still not at the same level of performance as the current SVG engine, we are confident that there are several key places where, with the right funding, we should be able to improve the performance to at least match the current engine, and therefore be able to push the new engine through the finish line.

General overhaul of the graphics pipeline, touching different areas (WIP):

On top of everything else commented above, we also worked on a general refactor and simplification of the graphics pipeline. For instance, we have been working on the removal of the Nicosia layer now that we are not planning to have multiple rendering implementations, among other things.

Multimedia

DMABuf-based sink for HW-accelerated video

We merged the DMABuf-based sink for HW-accelerated video in the GL-based GStreamer sink.

WebCodecs backend

We completed the implementation of audio/video encoding and decoding, and this is now enabled by default in 2.46. As for the next steps, we plan to keep working on the integration of WebCodecs with WebGL and WebAudio.

GStreamer-based WebRTC backends

We continued working on GstWebRTC, bringing it to a point where it can be used in production in some specific use cases, and we will still be working on this in the next months.

Other

Besides the points above, we also added an optional text-to-speech backend based on libspiel to the development branch, and worked on general maintenance around the support for Media Source Extensions (MSE) and Encrypted Media Extensions (EME), which are crucial for the use case of WPE running in set-top-boxes, and is a permanent task we will continue to work on in the next months.

JavaScriptCore

ARMv7/32-bit support:

A lot of work happened around 32-bit support in JavaScriptCore, especially around WebAssembly (WASM): we ported the WASM BBQJIT and ported/enabled concurrent JIT support, and we also completed 80% of the implementation for the OMG optimization level of WASM, which we plan to finish in the next months. If you are unfamiliar with what the OMG and BBQ optimization tiers in WASM are, I’d recommend you to take a look at this article in webkit.org: Assembling WebAssembly.

We also contributed to the JIT-less WASM mode, which is very useful for embedded systems that can’t support JIT due to security or memory-related constraints, and also did some work on the In-Place Interpreter (IPInt), which is a new version of the WASM Low-level Interpreter (LLInt) that uses less memory and executes WASM bytecode directly without translating it to LLInt bytecode (and should therefore be faster to execute).

Last, we also contributed most of the implementation for the WASM GC, with the exception of some Kotlin tests.

As for the next few months, we plan to investigate and optimize heap/JIT memory usage in 32-bit, as well as to finish several other improvements on ARMv7 (e.g. IPInt).

New WPE API

The new WPE API aims at making it easier to use WPE on embedded devices by removing the hassle of having to handle several libraries in tandem (i.e. WPEWebKit, libWPE and WPEBackend-FDO, for instance, available from WPE’s releases page), and by providing a more modern API in general, better aimed at the most common use cases of WPE.

A lot of effort happened this year along these lines, including the fact that we finally upstreamed and shipped its initial implementation with WPE 2.44, back in the first half of the year. Now, while we recommend users to give it a try and report feedback as much as possible, this new API is still not set in stone, with regular development still ongoing, so if you have the chance to try it out and share your experience, comments are welcome!

Besides shipping its initial implementation, we also added support for external platforms, so that other ones can be loaded beyond the Wayland, DRM and “headless” ones, which are the default platforms already included with WPE itself. This means for instance that a GTK4 platform, or another one for RDK could be easily used with WPE.

Then of course a lot of API additions were included in the new API in the latest months:

  • Screens management API: API to handle different screens, ask the display for the list of screens with their device scale factor, refresh rate, geometry…
  • Top level management API: This API allows a greater degree of control, for instance by allowing more than one WebView for the same top level, as well as allowing to retrieve properties such as size, scale or state (i.e. full screen, maximized…).
  • Maximized and minimized windows API: API to maximize/minimize a top level and monitor its state, mainly used by WebDriver.
  • Preferred DMA-BUF formats API: enables asking the platform (compositor or DRM) for the list of preferred formats and their intended use (scanout/rendering).
  • Input methods API: allows platforms to provide an implementation to handle input events (e.g. virtual keyboard, autocompletion, auto correction…).
  • Gestures API: API to handle gestures (e.g. tap, drag).
  • Buffer damaging: WebKit generates information about the areas of the buffer that actually changed and we pass that to DRM or the compositor to optimize painting.
  • Pointer lock API: allows the WebView to lock the pointer so that the movement of the pointing device (e.g. mouse) can be used for a different purpose (e.g. first-person shooters).

Last, we also added support for testing automation, and we can support WebDriver now in the new API.

With all this done so far, the plan now is to complete the new WPE API, with a focus on the Settings API and accessibility support, write API tests and documentation, and then also add an external platform to support GTK4. This is done on a best-effort basis, so there’s no specific release date.

WebKit on Android

This year was also a good year for WebKit on Android, also known as WPE Android, as this is a project that sits on top of WPE and its public API (instead of developing a fully-fledged WebKit port).

In case you’re not familiar with this, the idea here is to provide a WebKit-based alternative to the Chromium-based Web view on Android devices, in a way that leverages HW acceleration when possible and that integrates natively (and nicely) with the several Android subsystems, and of course with Android’s native mainloop. Note that this is an experimental project for now, so don’t expect production-ready quality quite yet, but hopefully something that can be used to start experimenting with selected use cases.

If you’re adventurous enough, you can already try the APKs yourself from the releases page in GitHub at https://github.com/Igalia/wpe-android/releases.

Anyway, as for the changes that happened in the past 12 months, here is a summary:

  • Updated WPE Android to WPE 2.46 and NDK 27 LTS
  • Added support for WebDriver and included WPT test suites
  • Added support for instrumentation tests, and integrated with the GitHub CI
  • Added support for the remote Web inspector, very useful for debugging
  • Enabled the Skia backend, bringing HW-accelerated 2D rendering to WebKit on Android
  • Implemented prompt delegates, allowing implementing things such as alert dialogs
  • Implemented WPEView client interfaces, allowing responding to things such as HTTP errors
  • Packaged a WPE-based Android WebView in its own library and published in Maven Central. This is a massive improvement as now apps can use WPE Android by simply referencing the library from the gradle files, no need to build everything on their own.
  • Other changes: enabled HTTP/2 support (via the migration to libsoup3), added support for the device scale factor, improved the virtual on-screen keyboard, general bug fixing…

On top of that, we published 3 different blog posts covering different topics, from a general intro to a more deep dive explanation of the internals, and showing some demos. You can check them out in Jani’s blog at https://blogs.igalia.com/jani

As for the future, we’ll focus on stabilization and regular maintenance for now, and then we’d like to work towards achieving production-ready quality for specific cases if possible.

Quality Assurance

On the QA front, we had a busy year but in general we could highlight the following topics.

  • Fixed a lot of API tests failures in the bots that were limiting our test coverage.
  • Fixed lots of assertion-related crashes in the bots, which were slowing down the bots as well as causing other types of issues, such as bots exiting early due to too many failures.
  • Enabled assertions in the release bots, which will help prevent crashes in the future, as well as with making our debug bots healthier.
  • Moved all the WebKitGTK and WPE bots to building now with Skia instead of Cairo. This means that all the bots running tests are now using Skia, and there’s only one bot still using Cairo to make sure that the compilation is not broken, but that bot does not run tests.
  • Moved all the WebKitGTK bots to use GTK4 by default. As with the move to Skia, all the WebKitGTK bots running tests now use GTK4, and the only one remaining that builds with GTK3 does not run tests; it only makes sure we don’t break the GTK3 compilation for now.
  • Working on moving all the bots to use the new SDK. This is still work in progress and will likely be completed during 2025, as it requires several changes in the infrastructure that will take some time.
  • General gardening and bot maintenance

In the next months, our main focus will be a revamp of the QA infrastructure to make sure that we can get all the bots (including the debug ones) to a healthier state, finish the migration of all the bots to the new SDK and, ideally, be able to bring back the ready-to-use WPE images that we used to have available on wpewebkit.org.

Security

The current release cadence has been working well, so we continue issuing major releases every 6 months (March, September), and then minor and unstable development releases happening on-demand when needed.

As usual, we kept aligning releases for WebKitGTK and WPE, with both of them happening at the same time (see https://webkitgtk.org/releases and https://wpewebkit.org/release), and then also publishing WebKit Security Advisories (WSA) when necessary, both for WebKitGTK and for WPE.

Last, we also shortened the time before including security fixes in stable releases this year, and we have removed support for libsoup2 from WPE, as that library is no longer maintained.

Tooling & Documentation

On tooling, the main piece of news is that this year we released the initial version of the new SDK, which is developed on top of OCI-based containers. This new SDK fixes the issues with the previous approaches based on JHBuild and Flatpak: the former was great for development but poor for testing and QA, while the latter was great for testing and QA but not very convenient for development.

This new SDK is regularly maintained and currently runs on Ubuntu 24.04 LTS with GCC 14 & Clang 18. It has been made public on GitHub and announced to the public in May 2024 in Patrick’s blog, and is now the officially recommended way of building WebKitGTK and WPE.

As for documentation, we didn’t do as much as we would have liked here, but we still landed a few contributions in docs.webkit.org, mostly related to WebKitGTK (e.g. Releases and Versioning, Security Updates, Multimedia). We plan to do more in this regard in the next months, though, mostly by writing/publishing more documentation and perhaps also some tutorials.

Final thoughts

This has been a fairly long blog post but, as you can see, it’s been quite a year for WebKit here at Igalia, with many exciting changes happening on several fronts, and so there was quite a lot of stuff to comment on here. That said, you can always check the slides of the presentation in the WebKit Contributors Meeting here if you prefer a more concise version of the same content.

In any case, what’s clear is that the next months are probably going to be quite interesting as well with all the work that’s already going on in WebKit and its Linux ports, so it’s possible that 12 months from now I might be writing an equally long essay. We’ll see.

Thanks for reading!

by mario at November 03, 2024 05:20 PM

October 28, 2024

Maíra Canal

Unleashing Power: Enabling Super Pages on the RPi

Unleashing the power of 3D graphics in the Raspberry Pi is a key commitment for Igalia through its collaboration with Raspberry Pi. The introduction of Super Pages for the Raspberry Pi 4 and 5 marks another step in this journey, offering some performance enhancements and more efficient memory usage. In this post, we’ll dive deep into the technical details of Super Pages, discuss the challenges we faced during implementation, and illustrate the benefits this feature brings to the Raspberry Pi ecosystem.

What are Super Pages?

A Memory Management Unit (MMU) is a hardware component responsible for handling memory access at the system level. It translates virtual addresses used by programs into physical addresses in main memory, enabling efficient memory management and protection. The MMU allows the operating system to allocate memory dynamically, isolating processes from one another to prevent them from interfering with each other’s memory.

Recommendation: 📚 Structured computer organization by Andrew Tanenbaum

The V3D MMU, which is part of the Broadcom GPU found in the Raspberry Pi 4 and 5, is responsible for translating 32-bit virtual addresses (VA) used by V3D into 40-bit physical addresses used externally to V3D. The MMU relies on a page table, stored in physical memory, which maps virtual addresses to their corresponding physical addresses. The operating system manages this page table, and the MMU uses it to perform address translation during memory access.

A fundamental principle of modern operating systems is that memory is not stored contiguously. Instead, a contiguous block of memory is divided into smaller blocks, called “pages”, which are scattered across the entire address space. These pages are typically 4KB in size. This approach enables more efficient memory management and allows for features like virtual memory and memory protection.

Over the years, the amount of available memory in computers has increased dramatically. An early IBM PC had up to 640 KiB of RAM, whereas the ThinkPad I’m typing on right now has 32 GB of RAM. Naturally, memory demands have grown alongside this increase. Today, it’s common for web browsers to consume several gigabytes of RAM, and a single shader can take up multiple megabytes.

As memory usage grows, a 4KB page size may become inefficient for managing large memory blocks. Handling a large number of small pages for a single block means the MMU must perform multiple address translations, which increases overhead. This can reduce the effectiveness of the Translation Lookaside Buffer (TLB), as it must store and handle more entries, potentially leading to more cache misses and reduced overall performance.

This is why many CPU manufacturers have introduced support for larger page sizes. For instance, x86 CPUs typically support 4KB and 2MB pages, with 1GB pages available if supported by the hardware. Similarly, ARM64 CPUs can support 4KB, 16KB, and 64KB page sizes. These larger page sizes help reduce the number of pages the MMU needs to manage, improving performance by reducing the overhead of address translation and making more efficient use of the TLB.

So, if CPUs are using bigger sizes, why shouldn’t GPUs do the same?

By default, V3D supports 4KB pages. However, by setting specific bits in the page table entry, it is possible to create 64KB “Big Pages” and 1MB “Super Pages.” The issue is that the current V3D driver available in Linux does not enable the use of Big or Super Pages, meaning this hardware feature is currently unused.

The advantage of enabling Big and Super Pages is that once an entry for any page within a Big or Super Page is cached in the MMU, it can be used to translate all virtual addresses within that page’s range without needing to fetch additional entries. In theory, this should result in improved performance, especially for applications with high memory demands, such as those using multiple large buffer objects (BOs).

As Igalia continually strives to enhance the experience for Raspberry Pi users, we decided to implement this feature in the upstream kernel. But before diving into the implementation details, let’s take a look at the real-world results and see if the theoretical benefits of Super Pages have translated into measurable improvements for Raspberry Pi users.

What Does This Feature Mean for RPi Users?

With Super Pages implemented, let’s now explore the actual performance improvements observed on the Raspberry Pi and see how impactful this feature is for users.

Benchmarking Super Pages: Traces and FPS Improvements

To measure the impact of Super Pages, we tested a variety of game and demo traces on the Raspberry Pi 4 and 5, covering genres from action to racing. On average, we observed a +1.40% FPS improvement on the Raspberry Pi 4 and a +1.30% improvement on the Raspberry Pi 5.

For instance, on the Raspberry Pi 4, Warzone 2100 saw an 8.36% FPS increase, and on the Raspberry Pi 5, Quake II enjoyed a 3.62% boost. These examples demonstrate the benefits of Super Pages in resource-demanding applications, where optimized memory handling becomes critical.

Raspberry Pi 4 FPS Improvements

| Trace | Before Super Pages (FPS) | After Super Pages (FPS) | Improvement |
|---|---|---|---|
| warzone2100.30secs.1024x768.trace | 56.39 | 61.10 | +8.36% |
| ue4_shooter_game_shooting_low_quality_640x480.gfxr | 20.71 | 21.47 | +3.65% |
| quake3e_capture_frames_1800_through_2400_1920x1080.gfxr | 60.88 | 62.50 | +2.67% |
| supertuxkart-menus_1024x768.trace | 112.62 | 115.61 | +2.65% |
| ue4_shooter_game_shooting_high_quality_640x480.gfxr | 20.45 | 20.88 | +2.10% |
| quake2-gles3-1280x720.trace | 59.76 | 60.84 | +1.82% |
| ue4_sun_temple_640x480.gfxr | 27.60 | 28.03 | +1.54% |
| vkQuake_capture_frames_1_through_1200_1280x720.gfxr | 54.59 | 55.30 | +1.29% |
| ue4_shooter_game_low_quality_640x480.gfxr | 32.75 | 33.08 | +1.00% |
| sponza_demo02_800x600.gfxr | 20.90 | 21.03 | +0.61% |
| supertuxkart-racing_1024x768.trace | 8.58 | 8.63 | +0.60% |
| ue4_shooter_game_high_quality_640x480.gfxr | 19.62 | 19.74 | +0.59% |
| serious_sam_trace02_1280x720.gfxr | 44.00 | 44.21 | +0.50% |
| ue4_vehicle_game-2_640x480.gfxr | 12.59 | 12.65 | +0.49% |
| sponza_demo01_800x600.gfxr | 21.42 | 21.46 | +0.19% |
| quake3e-1280x720.trace | 84.45 | 84.52 | +0.09% |

Raspberry Pi 5 FPS Improvements

| Trace | Before Super Pages (FPS) | After Super Pages (FPS) | Improvement |
|---|---|---|---|
| quake2-gles3-1280x720.trace | 151.77 | 157.26 | +3.62% |
| supertuxkart-menus_1024x768.trace | 306.79 | 313.88 | +2.31% |
| warzone2100.30secs.1024x768.trace | 140.92 | 144.03 | +2.21% |
| vkQuake_capture_frames_1_through_1200_1280x720.gfxr | 131.45 | 134.20 | +2.10% |
| ue4_vehicle_game-2_640x480.gfxr | 24.42 | 24.88 | +1.89% |
| ue4_shooter_game_high_quality_640x480.gfxr | 32.12 | 32.53 | +1.29% |
| ue4_sun_temple_640x480.gfxr | 42.05 | 42.55 | +1.20% |
| ue4_shooter_game_shooting_high_quality_640x480.gfxr | 52.77 | 53.31 | +1.04% |
| quake3e-1280x720.trace | 238.31 | 240.53 | +0.93% |
| warzone2100.70secs.1024x768.trace | 151.09 | 151.81 | +0.48% |
| sponza_demo02_800x600.gfxr | 50.81 | 51.05 | +0.46% |
| supertuxkart-racing_1024x768.trace | 20.91 | 20.98 | +0.33% |
| ue4_shooter_game_low_quality_640x480.gfxr | 59.68 | 59.86 | +0.29% |
| quake3e_capture_frames_1_through_1800_1920x1080.gfxr | 167.70 | 168.17 | +0.29% |
| ue4_shooter_game_shooting_low_quality_640x480.gfxr | 53.40 | 53.51 | +0.22% |
| quake3e_capture_frames_1800_through_2400_1920x1080.gfxr | 163.37 | 163.64 | +0.17% |
| serious_sam_trace02_1280x720.gfxr | 60.00 | 60.03 | +0.06% |
| sponza_demo01_800x600.gfxr | 45.04 | 45.04 | <.01% |

While an average +1% FPS improvement might seem modest, Super Pages can deliver more noticeable gains in memory-intensive 3D applications and when the GPU is under heavy usage. Let’s see how the Super Pages perform on Mesa CI.

Benchmarking Super Pages: Mesa CI Job Duration

To avoid introducing regressions in user-space, I usually test my custom kernels with Mesa CI, focusing on the “broadcom-postmerge” stage to verify that all Piglit and CTS tests ran smoothly. For Super Pages, I was pleasantly surprised by the job duration results, as some job durations were reduced by several minutes.

Mesa CI Jobs Duration Improvements

| Job | Before Super Pages | After Super Pages |
|---|---|---|
| v3d-rpi4-traces:arm64 | ~4m30s | ~3m40s |
| v3d-rpi5-traces:arm64 | ~3m30s | ~2m45s |
| v3d-rpi4-gl-full:arm64 */6 | ~24-25 minutes | ~22-23 minutes |
| v3d-rpi5-gl-full:arm64 | ~48 minutes | ~48 minutes |
| v3dv-rpi4-vk-full:arm64 */6 | ~44 minutes | ~41 minutes |
| v3dv-rpi5-vk-full:arm64 | ~102 minutes | ~92 minutes |

Seeing these reductions is especially rewarding. For example, the “v3dv-rpi5-vk-full:arm64” job duration decreased by 10 minutes, meaning more FPS for users and shorter wait times for Mesa developers.

Benchmarking Super Pages: PS2 Emulation

After sharing a couple of tables, I’ll admit that showcasing performance improvements solely through numbers doesn’t always convey the real impact. Personally, I find it more satisfying to see performance gains in action with real-world applications.

This led me to explore PlayStation 2 (PS2) emulation on the RPi 5. From watching YouTube videos, I noticed that PS2 is a popular console for the RPi 5. While the PlayStation (PS1) emulates well even on the RPi 4, and Nintendo 64 and Sega Saturn struggle across most hardware, PS2 hits a sweet spot for testing the RPi 5’s limits.

Fortunately, I still have my childhood PS2 — my second console after the Nintendo GameCube, and one of the most successful consoles worldwide, including in Brazil. With a library packed with titles like Metal Gear Solid, Resident Evil, Tomb Raider, and Shadow of the Colossus, the PS2 remains a great system for collectors and retro gamers alike.

I selected a few games from my collection to benchmark on the RPi 5 using a PS2 emulator. My emulator of choice was AetherSX2 with Vulkan support. Although AetherSX2 is no longer in development, it still performs well on the RPi.

Initially, many games were barely playable, especially those with large buffer objects, like Shadow of the Colossus and Gran Turismo 4. However, after enabling Super Pages support, I noticed immediate improvements. For example, Shadow of the Colossus wouldn’t even open before Super Pages, and while it’s not fully playable yet, it does load now. This isn’t a silver bullet, but it’s a step forward in improving the driver one piece at a time.

I ended up selecting four games for a video comparison: Burnout 3: Takedown, Metal Gear Solid 3: Snake Eater, Resident Evil 4, and Tekken 4.

[Video: comparison of the four games running before and after enabling Super Pages]

Disclaimer: The BIOS used in the emulator was extracted from my own PS2, and I played only games I own, with ROMs I personally extracted. Neither I nor Igalia encourage using downloaded BIOS or ROM files from the internet.

From the video, we can see noticeable improvements in all four games. Although they aren’t perfectly playable yet, the performance gains are evident, particularly in Resident Evil 4, where the gameplay saw a solid 5 FPS boost. I realize 18 FPS might not satisfy most players, but I still had a lot of fun playing Resident Evil 4 on the RPi 5.

When tracking the FPS for these games, it’s clear that the performance gains go well beyond the average 1% seen in other benchmarks. Super Pages show their true potential in high-memory applications like PS2 emulation.

Having seen the performance gains Super Pages can bring to the Raspberry Pi, let’s now dive into the technical aspects of the feature.

Implementing Super Pages

The first challenge was figuring out how to allocate a contiguous block of memory using shmem. The Shared Memory Virtual Filesystem (shmem) is used as a flexible memory mechanism that allows the GPU and CPU to share access to BOs through the system’s temporary filesystem, tmpfs. tmpfs is a volatile filesystem that stores files in RAM, making it ideal for temporary or high-speed data that doesn’t need to persist on disk.

For example, to allocate a 256KB BO across four 64KB pages, we need four contiguous 64KB memory blocks. However, by default, tmpfs only allocates memory in PAGE_SIZE chunks (as seen in shmem_file_setup()), whereas PAGE_SIZE is 4KB on the Raspberry Pi 4 and 16KB on the Raspberry Pi 5. Since the function drm_gem_object_init() — which initializes an allocated shmem-backed GEM object — relies on shmem_file_setup() to back these objects in memory, we had to consider alternatives, as the default PAGE_SIZE would divide memory into increments that are too small to ensure the large, contiguous blocks needed by the GPU.

The solution we proposed was to create drm_gem_object_init_with_mnt(), which allows us to specify the tmpfs mountpoint where the GEM object will be created. This enables us to allocate our BOs in a mountpoint that supports larger page sizes. Additionally, to ensure that our BOs are allocated in the correct mountpoint, we introduced drm_gem_shmem_create_with_mnt(), which allows the mountpoint to be specified when creating a new DRM GEM shmem object.

[PATCH v6 04/11] drm/gem: Create a drm_gem_object_init_with_mnt() function

[PATCH v6 06/11] drm/gem: Create shmem GEM object in a given mountpoint
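To make this more concrete, here is a hedged C sketch of how a driver-side BO creation path could use these new entry points. The helper name v3d_create_shmem_bo and the gemfs field are illustrative assumptions made for this post; the authoritative signatures are the ones defined in the patches linked above.

#include <drm/drm_gem_shmem_helper.h>

/* Sketch only: a BO creation path that prefers a driver-private mountpoint. */
static struct drm_gem_shmem_object *
v3d_create_shmem_bo(struct drm_device *dev, size_t size)
{
        struct v3d_dev *v3d = to_v3d_dev(dev); /* driver-specific accessor */

        /*
         * If the driver set up its own tmpfs mountpoint (gemfs), create the
         * GEM shmem object there so it can be backed by huge pages;
         * otherwise fall back to the default shmem mountpoint.
         */
        if (v3d->gemfs)
                return drm_gem_shmem_create_with_mnt(dev, size, v3d->gemfs);

        return drm_gem_shmem_create(dev, size);
}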

The next challenge was figuring out how to create a new mountpoint that would allow for different page sizes based on the allocation. Simply creating a new tmpfs mountpoint with a fixed bigger page size wouldn’t suffice, as we needed flexibility for various allocations. Inspired by the i915 driver, we decided to use a tmpfs mountpoint with the “huge=within_size” flag. This flag, which requires the kernel to be configured with CONFIG_TRANSPARENT_HUGEPAGE, enables the allocation of huge pages.

Transparent Huge Pages (THP) is a kernel feature that automatically manages large memory pages to improve performance without needing changes from applications. THP dynamically combines smaller pages into larger ones, typically 2MB, reducing memory management overhead and improving cache efficiency.

To support our new allocation strategy, we created a dedicated tmpfs mountpoint for V3D, called gemfs, which provides us an ideal space for managing these larger allocations.

[PATCH v6 05/11] drm/v3d: Introduce gemfs
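For illustration, here is a hedged C sketch of what such a mountpoint setup can look like, loosely modelled on the i915 approach mentioned above. The v3d_gemfs_init name and the gemfs field are assumptions for this post, not the actual v3d code.

#include <linux/fs.h>
#include <linux/mount.h>

static int v3d_gemfs_init(struct v3d_dev *v3d)
{
        char huge_opt[] = "huge=within_size";
        struct file_system_type *type;
        struct vfsmount *gemfs;

        /* Without THP, huge=within_size has nothing to work with. */
        if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
                return -ENODEV;

        type = get_fs_type("tmpfs");
        if (!type)
                return -ENODEV;

        /*
         * Mount a private tmpfs instance whose shmem backing will use
         * transparent huge pages up to the size of each object.
         */
        gemfs = vfs_kern_mount(type, SB_KERNMOUNT, type->name, huge_opt);
        if (IS_ERR(gemfs))
                return PTR_ERR(gemfs);

        v3d->gemfs = gemfs; /* hypothetical field storing the private mount */
        return 0;
}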

With everything in place for contiguous allocations, the next step was configuring V3D to enable Big/Super Page support.

We began by addressing a major source of memory pressure on the Raspberry Pi: the current 128KB alignment for allocations in the virtual memory space. This alignment wastes space when handling small BO allocations, especially since the userspace driver performs a large number of these small allocations.

As a result, we can’t fully utilize the 4GB address space available for the GPU on the Raspberry Pi 4 or 5. For example, we can currently allocate up to 32,000 BOs of 4KB (~140MB) and 3,000 BOs of 400KB (~1.3GB). This becomes a limitation for memory-intensive applications. By reducing the page alignment to 4KB, we can significantly increase the number of BOs, allowing up to 1,000,000 BOs of 4KB (~4GB) and 10,000 BOs of 400KB (~4GB).

Therefore, the first change I made was reducing the VA alignment of all allocations to 4KB.

[PATCH v6 07/11] drm/v3d: Reduce the alignment of the node allocation

With the alignment issue resolved, we can now implement the code to properly set the flags on the Page Table Entries (PTE) for Big/Super Pages. Setting these flags is straightforward — a simple bitwise operation. The challenge lies in determining which BOs can be allocated in Super Pages. For a BO to be eligible for a Big Page, its virtual address must be aligned to 64KB, and the same applies to its physical address. Same thing for Super Pages, but now the addresses must be aligned to 1MB.

If the BO qualifies for a Big/Super Page, we need to iterate over 16 4KB pages (for Big Pages) or 256 4KB pages (for Super Pages) and insert the appropriate PTE.

Additionally, we modified the way we iterate through the BO’s memory. This was necessary because the THP may not always allocate the entire BO contiguously. For example, it might allocate only 1MB of a 2MB block contiguously. To handle this, we now iterate over the blocks of contiguous memory scattered across the scatterlist, ensuring that each segment is properly handled during the allocation process.

What is a scatterlist? It is a Linux Kernel data structure that manages non-contiguous memory as if it were contiguous. It organizes separate memory blocks into a single logical buffer, allowing efficient data handling, especially in Direct Memory Access (DMA) operations, without needing a physically contiguous memory allocation.

[PATCH v6 08/11] drm/v3d: Support Big/Super Pages when writing out PTEs
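To give a feel for that logic, here is a hedged C sketch (not the actual v3d_mmu code): the pte_size_flag() helper and the page-table indexing are simplified placeholders for the real bit manipulation, but the page-size selection and the scatterlist walk follow the shape described above.

#include <linux/scatterlist.h>
#include <linux/sizes.h>
#include <linux/types.h>

/* Pick the largest page size whose alignment constraints are satisfied. */
static unsigned int pick_page_size(u64 va, u64 pa, size_t contig_len)
{
        if (IS_ALIGNED(va, SZ_1M) && IS_ALIGNED(pa, SZ_1M) && contig_len >= SZ_1M)
                return SZ_1M;   /* Super Page: spans 256 4KB PTEs */
        if (IS_ALIGNED(va, SZ_64K) && IS_ALIGNED(pa, SZ_64K) && contig_len >= SZ_64K)
                return SZ_64K;  /* Big Page: spans 16 4KB PTEs */
        return SZ_4K;
}

static void write_ptes(u32 *pt, u64 va, struct sg_table *sgt)
{
        struct scatterlist *sg;
        unsigned int i;

        /* Walk only the contiguous DMA segments that THP actually gave us. */
        for_each_sgtable_dma_sg(sgt, sg, i) {
                u64 pa = sg_dma_address(sg);
                size_t len = sg_dma_len(sg);

                while (len) {
                        unsigned int page_size = pick_page_size(va, pa, len);
                        unsigned int npte = page_size / SZ_4K, j;

                        /* pte_size_flag() stands in for the real PTE bits;
                         * every 4KB PTE in the range carries the size flag. */
                        for (j = 0; j < npte; j++)
                                pt[(va / SZ_4K) + j] = pte_size_flag(page_size) |
                                                       (u32)((pa / SZ_4K) + j);

                        va += page_size;
                        pa += page_size;
                        len -= page_size;
                }
        }
}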

However, the last few patches alone don’t fully enable the use of Super Pages. While PATCH 08/11 technically allows for Super Pages, we’re still relying on DRM GEM shmem objects, meaning allocations are still happening in PAGE_SIZE chunks. Although Big/Super Pages could potentially be used if the system naturally allocated 1MB or 64KB contiguously, this is quite rare and not our intended outcome. Our goal is to actively use Big/Super Pages as much as possible.

To achieve this, we’ll utilize the V3D-specific mountpoint we created earlier for BO allocation whenever possible. By creating BOs through drm_gem_shmem_create_with_mnt(), we can ensure that large pages are allocated contiguously when possible, enabling the consistent use of Big/Super Pages.

[PATCH v6 09/11] drm/v3d: Use gemfs/THP in BO creation if available

And there you have it — Big/Super Pages are now fully enabled in V3D. The only requirement to activate this feature in any given kernel is ensuring that CONFIG_TRANSPARENT_HUGEPAGE is enabled.

Final Words

You can learn more about ongoing enhancements to the Raspberry Pi driver stack in this XDC 2024 talk by José María “Chema” Casanova Crespo. In the talk, Chema discusses the Super Pages work I developed, along with other advancements in the driver stack.

Of course, there are still plenty of improvements on the horizon at Igalia. I’m currently experimenting with 64KB CLE allocations in user-space, and I hope to share more good news soon.

Finally, I’d like to express my gratitude to Iago Toral and Tvrtko Ursulin for their invaluable support in developing Super Pages for the V3D kernel driver. Thank you both for sharing your experience with me!

October 28, 2024 12:00 PM

October 26, 2024

Stephen Chenney

Caret Customization on the Web

A caret is the symbol used to show where text will appear in text input applications, such as a word processor, code editor or input form. It might be a vertical bar, or an underscore, or some other shape. It may flash, or pulse.

On the web, sites currently have some control over the editing caret through CSS, along with entirely custom solutions. Here I discuss the current state of caret customization on the web and look at proposals to allow more customization through CSS.

Carets on the Web #

The majority of web sites use the default editing caret. In all browsers it is a vertical bar in the text color that blinks. The blink rate and duration varies across browsers: Chrome’s cursor blinks continuously at a fixed rate; Firefox blinks the caret continuously at a rate from user settings; Safari blinks with keyboard input but switches to a strobe effect with dictation input.

All browsers support the CSS caret-color property, allowing sites to color the caret independently of the text and to animate that color. You can also make the caret disappear with a transparent color.

Changing the caret shape is currently not possible through CSS in any browser. There are some ways to work around this, such as those documented in this Stack Overflow post. The general idea is to hide the caret and replace it with an element controlled by script. The script relies on the selection range and other APIs to know where the element should be positioned. Or you can completely replace the default browser editing behavior with a custom JavaScript editor (as is done by, for example, Google Docs).

Browser support for caret customization currently leaves web developers in an awkward situation: accept the basic default cursor with color control, or implement a custom editor, but nothing in-between.

Improving Caret Customization #

There are at least two CSS properties not yet implemented in browsers that improve caret customization. The first concerns the interaction between color animation and the default blinking behavior, and the second adds support for different caret shapes.

The CSS caret-animation property #

When the caret-color is animated, there is no way to reliably synchronize the animation with the browser’s blinking rate. Different browsers blink at different rates, and it may be under user control. The CSS caret-animation property, when set to the value manual, suppresses the blinking, giving the color animation complete control over the color of the caret at all times. Using caret-animation: auto (the initial value) leaves the blinking behavior under browser control.

Site control of the blinking is both an accessibility benefit and potentially harmful to users. For users sensitive to motion, disabling the blinking may be a benefit. At the same time, a cursor that does not blink is much harder for users to recognize. Please use caution when employing this property when it is available in browsers.

There is an implementation of caret-animation in Chrome 132 behind the CSSCaretAnimation flag, accessible through --enable-blink-features="CSSCaretAnimation" on the command line. Feel free to comment on the bug if you have thoughts on this feature.

The CSS caret-shape property #

The shape of the caret in native applications is most commonly a vertical bar, an underscore or a rectangular block. In addition, the shape often varies depending on the input mode, such as insert or replace. The CSS caret-shape property allows sites to choose one of these shapes for the caret, or leave the choice up to the browser. The recognized property values are auto, bar, block and underscore.

No browser currently supports the caret-shape property, and the specification needs a little work to confirm the exact location and shape of the underscore and block. Please leave feedback on the Chromium bug if you would like this feature to be implemented.

Other proposed caret properties #

The caret-shape property does not allow any control over the size of the caret, such as the thickness of the bar or block. There was, for example, a proposal to add caret-width as a means for working around browser bugs on zooming and transforming the caret. Please create a new CSS Working Group issue if you would like greater customization and can provide use cases.

by Stephen Chenney at October 26, 2024 12:00 AM

October 23, 2024

Víctor Jáquez

GStreamer Conference 2024

Early this month I spent a couple weeks in Montreal, visiting the city, but mostly to attend the GStreamer Conference and the following hackfest, which happened to be co-located with the XDC Conference. It was my first time in Canada and I utterly enjoyed it. Thanks to all those from the GStreamer community that organized and attended the event.

For now you can replay the whole streams of both days of the conference, but soon the split videos for each talk will be published:

GStreamer Conference 2024 - Day 1, Room 1 - October 7, 2024

https://www.youtube.com/watch?v=KLUL1D53VQI


GStreamer Conference 2024 - Day 1, Room 2 - October 7, 2024

https://www.youtube.com/watch?v=DH64D_6gc80


GStreamer Conference 2024 - Day 2, Room 2 - October 8, 2024

https://www.youtube.com/watch?v=jt6KyV757Dk


GStreamer Conference 2024 - Day 2, Room 1 - October 8, 2024

https://www.youtube.com/watch?v=W4Pjtg0DfIo


And a couple pictures of igalians :)

GStreamer Skia plugin
GStreamer VA
GstWebRTC
GstVulkan
GStreamer Editing Services

by Víctor Jáquez at October 23, 2024 12:00 AM

October 22, 2024

Stephen Chenney

CSS Highlight Inheritance has Shipped!

The CSS highlight inheritance model describes the process for inheriting the CSS properties of the various highlight pseudo elements:

  • ::selection controlling the appearance of selected content
  • ::spelling-error controlling the appearance of misspelled word markers
  • ::grammar-error controlling how grammar errors are marked
  • ::target-text controlling the appearance of the string matching a target-text URL
  • ::highlight defining the appearance of a named highlight, accessed via the CSS highlight API

The feature has now launched in Chrome 131, and this post describes how to achieve the old selection styling behavior if selection is now not visible, or newly visible, on your site.

The most common problem comes when a site relies on ::selection properties only applying to the matched elements and not child elements. For example, sites don’t want the selection highlight on the <div> elements themselves to be visible, but do want the text content inside them to show as selected. Some sites may have used such styling to work around old bugs in browsers whereby the highlight on the selected <div> would be too large.

Previously, this style block would have the intended result:

<style>
  div::selection {
    background-color: transparent;
  }
</style>

With CSS highlight inheritance the children of the <div> will now also have transparent selection, which is probably not what was intended.

There’s a good chance that the selection styling added to work around bugs is no longer needed. So the first thing to try is removing the style rule entirely.

Should you really want to style selection on an element but not its children, the simplest approach is to add a style rule that applies the default selection behavior to everything that is not a <div> (or whatever element you are selecting):

<style>
  div::selection {
    background-color: transparent;
  }
  :not(div)::selection {
    background-color: Highlight;
  }
</style>

Highlight is the system default selection background color, whatever that might be in the browser and operating system you are using.

More efficient is to use the child combinator because it will only create new pseudo style objects for the <div> and its direct children, inheriting the rest:

<style>
  div::selection {
    background-color: transparent;
  }
  div > *::selection {
    background-color: Highlight;
  }
</style>

Overall, the former may be better if you have multiple ::selection styles or your ::selection styles already use complex rules, because it avoids the need to have a matching child selector for every one of those.

There are other situations, but building on the examples we suggest two generalizations:

  • Make use of the CSS system colors Highlight and HighlightText to revert selection back to the platform default.
  • Augment any existing ::selection style rules with complementary rules to cancel the selection style changes on other elements that should not inherit the change.

Chrome issue 374528707 may be used to report breakage due to enabling CSS highlight inheritance and to see fixes for the reported cases.

by Stephen Chenney at October 22, 2024 12:00 AM

October 17, 2024

Jani Hautakangas

Bringing WebKit back to Android: Internals

In my last blog post, I introduced the WPE-Android project by providing a high-level overview of what the project aims to achieve and the motivations behind bringing WebKit back to Android. This post will take a deeper dive into the technical details and internal structure of WPE-Android, focusing on some key areas of its design and implementation.

WPE-Android is not a standalone WebKit port but rather an extension and adaptation of the existing WPE WebKit APIs. By leveraging the libwpe adaptation layer library and a platform-specific libwpebackend plugin, WPE-Android integrates seamlessly with Android. These components work together to provide WebKit with the necessary access to Android-specific functionality, enabling it to render web content efficiently on Android devices.

WPE-Android High Level

Overall Design #

At the core of WPE-Android lies the native WPE WebKit codebase, which powers much of the browser functionality. However, for Android applications to interact with this native code, a bridge must be established between the Java environment of Android apps and the C++ world of WebKit. This is where the Java Native Interface (JNI) comes into play.

The JNI allows Java code to call native methods implemented in C or C++ and vice versa. In the case of WPE-Android, a robust JNI layer is implemented to facilitate the communication between the Android system and the WebKit engine. This layer consists of several utility classes, which are designed to handle JNI calls efficiently and reduce the possibility of code duplication. These classes essentially act as intermediaries, ensuring that all interactions with the WebKit engine are managed smoothly and reliably.
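For readers less familiar with JNI, here is a small, self-contained C sketch of the kind of bridging involved. The class and method names are hypothetical (they are not the actual WPE-Android classes); the point is only to show how a Java-side call ends up in a registered native function that can then talk to the WebKit world.

#include <jni.h>

/* Hypothetical native counterpart of a Java method such as loadUrl(). */
static void nativeLoadUrl(JNIEnv *env, jobject thiz, jstring jurl)
{
    const char *url = (*env)->GetStringUTFChars(env, jurl, NULL);
    /* ... forward the URL to the WPE WebKit view owned by this object ... */
    (*env)->ReleaseStringUTFChars(env, jurl, url);
}

static const JNINativeMethod kMethods[] = {
    { "nativeLoadUrl", "(Ljava/lang/String;)V", (void *)nativeLoadUrl },
};

JNIEXPORT jint JNICALL JNI_OnLoad(JavaVM *vm, void *reserved)
{
    JNIEnv *env;

    if ((*vm)->GetEnv(vm, (void **)&env, JNI_VERSION_1_6) != JNI_OK)
        return JNI_ERR;

    /* "org/example/WPEViewBridge" is a made-up class name for this sketch. */
    jclass cls = (*env)->FindClass(env, "org/example/WPEViewBridge");
    if (!cls || (*env)->RegisterNatives(env, cls, kMethods,
                                        sizeof(kMethods) / sizeof(kMethods[0])) < 0)
        return JNI_ERR;

    return JNI_VERSION_1_6;
}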

Below is a simplified diagram of the overall design of WPE-Android internals, highlighting the key components involved in this architecture.

WPE-Android Design

WPEBackend-Android: Bridging the Platform #

WPE-Android relies heavily on the libwpe library to enable platform-specific functionalities. The role of libwpe is crucial because it abstracts away the platform-specific details, allowing WebKit to run on various platforms, including Android, without needing to be aware of the underlying system intricacies.

One of the primary responsibilities of libwpe is to interface with the platform-specific libwpebackend plugin, which handles tasks such as platform graphics buffer support and the sharing of buffers between the WebKit UIProcess and WebProcess. The libwpebackend plugin ensures that graphical content generated by the WebProcess can be efficiently displayed on the device’s screen by the UIProcess.

Although this libwpebackend plugin is a critical component of WPE-Android, I won’t go into describing its detailed implementation in this post. However, for those interested in the internal workings of the WPE Backend, I highly recommend reading Loïc Le Page’s comprehensive blog post on the subject: Create WPE Backends. In WPE-Android, this backend functionality is implemented in the WPEBackend-Android repository.

Recently, WPE-Android has been upgraded to WPE WebKit 2.46.0, which introduces an initial version of the new WPE adaptation API called WPE Platform API. This API is designed to provide a cleaner and more flexible way of integrating WPE WebKit with various platforms. However, since this API is still under active development, WPE-Android currently continues to use the older libwpe API.

The diagram below shows the internals of WPEBackend-android and how it integrates with WPE WebKit.

WPEBackend-android

Rendering Web Content: The Journey from WebProcess to Screen #

Rendering web content efficiently on Android involves several moving parts. Once the WebProcess has generated the graphical frames, these frames need to be displayed on the device’s screen in a seamless and performant manner. To achieve this, WPE-Android makes use of the Android SurfaceControl component.

SurfaceControl plays a key role in managing the surface that displays the rendered content. It allows buffers generated by the WebProcess to be posted directly to SurfaceFlinger, which is Android’s system compositor responsible for combining multiple surfaces into a single screen image. This direct posting of buffers to SurfaceFlinger ensures that the rendered frames are composed and displayed in a highly efficient way, with minimal overhead.
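To sketch the general shape of this path (a simplified, hedged example rather than the actual WPE-Android code), the NDK exposes SurfaceControl through <android/surface_control.h>, and posting a hardware buffer to SurfaceFlinger boils down to a transaction like this:

#include <android/hardware_buffer.h>
#include <android/native_window.h>
#include <android/surface_control.h>

/* Sketch: post one AHardwareBuffer produced by the WebProcess to the screen. */
static void post_frame(ANativeWindow *window, AHardwareBuffer *buffer,
                       int acquire_fence_fd)
{
    /* A child surface of the view's window, composited by SurfaceFlinger. */
    static ASurfaceControl *surface;
    if (!surface)
        surface = ASurfaceControl_createFromWindow(window, "WPEView");

    ASurfaceTransaction *txn = ASurfaceTransaction_create();

    /* Hand the buffer (and its acquire fence, or -1) over to the compositor. */
    ASurfaceTransaction_setBuffer(txn, surface, buffer, acquire_fence_fd);
    ASurfaceTransaction_apply(txn);
    ASurfaceTransaction_delete(txn);
}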

The diagram below illustrates how the WebProcess-generated frames are transferred to the Android system and eventually rendered on the device’s screen.

WPE-Android Rendering

Event loop #

WPE WebKit is built on top of GLib, which provides a wide array of functionality for managing events, network communications, and other system-level tasks. GLib is essential to the smooth operation of WPE WebKit, especially when it comes to handling asynchronous events and running the event loop.

To integrate WPE WebKit’s GLib-based event loop with the Android platform, WPE-Android uses a mechanism that drives the WebKit event loop using the Android looper. Specifically, event file descriptors (FDs) from the WPE WebKit GLib main loop are fed into the Android main looper. On each iteration of the event loop, WPE-Android checks whether any of the file descriptors in the GLib main loop have changed. Based on these changes, WPE-Android adjusts the Android looper by adding or removing FDs as necessary.

This complex logic is implemented in a class called MessagePump, which handles the synchronization between the two event loops and ensures that events are processed in a timely manner.
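As a rough illustration of the approach (a hedged sketch, not the actual MessagePump implementation), the integration boils down to asking GLib which file descriptors it wants polled, registering them with the Android looper, and running a GLib iteration whenever one of them becomes ready:

#include <android/looper.h>
#include <glib.h>

/* Called by the Android looper when a GLib fd becomes readable. */
static int on_glib_fd_ready(int fd, int events, void *data)
{
    GMainContext *ctx = data;

    g_main_context_iteration(ctx, FALSE); /* dispatch pending GLib sources */
    return 1;                             /* keep watching this fd */
}

/* Register the fds GLib currently wants to poll with the Android looper. */
static void sync_glib_fds_with_looper(GMainContext *ctx, ALooper *looper,
                                      GPollFD *fds, gint n_alloc)
{
    gint max_priority, timeout_ms, n_fds, i;

    g_main_context_prepare(ctx, &max_priority);
    n_fds = g_main_context_query(ctx, max_priority, &timeout_ms, fds, n_alloc);
    /* A real implementation grows the array when n_fds > n_alloc and also
     * removes fds that disappeared since the previous iteration. */

    for (i = 0; i < n_fds && i < n_alloc; i++)
        ALooper_addFd(looper, fds[i].fd, ALOOPER_POLL_CALLBACK,
                      ALOOPER_EVENT_INPUT, on_glib_fd_ready, ctx);
}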

WPE-Android MessagePump

What’s New Since the Last Update #

Since the last update, WPE-Android has undergone a number of significant changes. These updates have brought new features, bug fixes, and performance improvements, making the project even more robust and capable of handling modern web content on Android devices. Below is a list of the most notable changes:

  • Upgrade to Cerbero 1.24.8: This upgrade ensures better compatibility and improved build processes.
  • Upgrade to NDK r27 LTS: The move to the latest NDK release provides long-term support and brings several new features and optimizations.
  • Upgrade to WPE WebKit 2.46.0: This version of WPE WebKit introduces new functionality, including support for the Skia backend.
  • Enabled Skia backend: Skia is a graphics engine that provides hardware-accelerated rasterization, which enhances rendering performance.
  • Enabled Remote Inspector: This allows developers to remotely inspect and debug web content.
  • Published WPEView to Maven Central: The WPEView library is now publicly available, making it easier for developers to integrate WPE-Android into their own projects.
  • Major bug fixes: Significant improvements have been made to the GLib main loop integration, ensuring smoother operation and fewer edge cases.
  • Dropped gnutls in favor of openssl: This change simplifies the security stack.
  • Streamlined libraries: Unnecessary deployment of libraries that are already provided by Android as system libraries has been removed.
  • Refined WPEView API: Internal methods that should not be exposed to developers have been moved to implementation details, reducing the API surface and making the library more user-friendly.

Demos #

Mediaplayer #

This demo shows the WPE-Android media player running on an Android device. The demo source code can be found in mediaplayer.

Remote Inspector #

This demo shows Remote Inspector usage on an Android device. Detailed instructions on how to run the Remote Inspector can be found in the README.md.

This demo shows loading the www.igalia.com webpage in a desktop browser and connecting it to a remote inspector service on the device. The video demonstrates how webpage elements can be inspected and edited in real time using the remote inspector.

Try it yourself #

With the recent release of WPEView to Maven Central, it’s now easier than ever to experiment with WPE-Android and integrate it into your own Android projects.

To get started, make sure you have mavenCentral() included in your project’s repository configuration. Here’s how you can do it:

dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        google()
        mavenCentral()
    }
}

Once that’s done, you can add WPEView as a dependency in your project:

dependencies {
    implementation("org.wpewebkit.wpeview:wpeview:0.1.0")
    ...
}

If you’re interested in learning more or contributing to the project, you can find all the details on the WPE-Android GitHub page. We welcome feedback, contributions, and new ideas!

This project is partially funded by the NLNet Foundation, and we appreciate their support in making it possible.

by Jani Hautakangas at October 17, 2024 12:00 AM

October 16, 2024

Enrique Ocaña

Adding buffering hysteresis to the WebKit GStreamer video player

The <video> element implementation in WebKit does its job by using a multiplatform player that relies on a platform-specific implementation. In the specific case of glib platforms, which base their multimedia on GStreamer, that’s MediaPlayerPrivateGStreamer.

WebKit GStreamer regular playback class diagram

The player private can have 3 buffering modes:

  • On-disk buffering: This is the typical mode on desktop systems, but it is frequently disabled on purpose on embedded devices to avoid wearing out their flash storage memories. All the video content is downloaded to disk, and the buffering percentage refers to the total size of the video. A GstDownloadBuffer element is present in the pipeline in this case. Buffering level monitoring is done by polling the pipeline every second, using the fillTimerFired() method.
  • In-memory buffering: This is the typical mode on embedded systems and on desktop systems in the case of streamed (live) content. The video is downloaded progressively and only the part of it ahead of the current playback time is buffered. A GstQueue2 element is present in the pipeline in this case. Buffering level monitoring is done by listening to GST_MESSAGE_BUFFERING bus messages and using the buffering level stored in them. This is the case that motivated the refactoring described in this blog post: it’s what we actually wanted to correct on Broadcom platforms, and what motivated the addition of hysteresis, which works on all platforms.
  • Local files: Files, MediaStream sources and other special origins of video don’t do buffering at all (no GstDownloadBuffer nor GstQueue2 element is present in the pipeline). They work like the on-disk buffering mode in the sense that fillTimerFired() is used, but the reported level is relative, much like in the streaming case. In the initial version of the refactoring I was unaware of this third case, and only realized it existed when tests triggered the assert that I added to ensure that the on-disk buffering method was working in GST_BUFFERING_DOWNLOAD mode.

The current implementation (actually, its wpe-2.38 version) was showing some buffering problems on some Broadcom platforms when doing in-memory buffering. The buffering levels monitored by MediaPlayerPrivateGStreamer weren’t accurate because the Nexus multimedia subsystem used on Broadcom platforms was doing its own internal buffering. Data wasn’t being accumulated in the GstQueue2 element of playbin, because BrcmAudFilter/BrcmVidFilter was accepting all the buffers that the queue could provide. Because of that, the player private buffering logic was erratic, leading to many transitions between “buffer completely empty” and “buffer completely full”. This, in turn, caused many transitions between the HaveEnoughData, HaveFutureData and HaveCurrentData readyStates in the player, leading to frequent pauses and unpauses on Broadcom platforms.

So, one of the first things I tried to solve this issue was to ask the Nexus PlayPump (the subsystem in charge of internal buffering in Nexus) about its internal levels, and add that to the levels reported by GstQueue2. There’s also a GstMultiqueue in the pipeline that can hold a significant amount of buffers, so I also asked it for its level. Still, the buffering level instability was too high, so I added a moving average implementation to try to smooth it.

All these tweaks only make sense on Broadcom platforms, so they were guarded by ifdefs in a first version of the patch. Later, I migrated those dirty ifdefs to the new quirks abstraction added by Phil. A challenge of this migration was that I needed to store some attributes that were considered part of MediaPlayerPrivateGStreamer before. They still had to be somehow linked to the player private but only accessible by the platform specific code of the quirks. A special HashMap attribute stores those quirks attributes in an opaque way, so that only the specific quirk they belong to knows how to interpret them (using downcasting). I tried to use move semantics when storing the data, but was bitten by object slicing when trying to move instances of the superclass. In the end, moving the responsibility of creating the unique_ptr that stored the concrete subclass to the caller did the trick.

Even with all those changes, undesirable swings in the buffering level kept happening, and when doing a careful analysis of the causes I noticed that the monitoring of the buffering level was being done from different places (in different moments) and sometimes the level was regarded as “enough” and the moment right after, as “insufficient”. This was because the buffering level threshold was one single value. That’s something that a hysteresis mechanism (with low and high watermarks) can solve. So, a logical level change to “full” would only happen when the level goes above the high watermark, and a logical level change to “low” when it goes under the low watermark level.
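Expressed as code, the watermark logic is tiny. Here is a minimal, self-contained sketch of the idea (not WebKit’s actual implementation, which lives in MediaPlayerPrivateGStreamer and uses the percentages reported by GStreamer):

/* Sketch of buffering hysteresis with low/high watermarks. */
enum buffering_state { BUFFER_LOW, BUFFER_FULL };

struct buffering_hysteresis {
    int low_watermark;            /* e.g. 20 (%) */
    int high_watermark;           /* e.g. 80 (%) */
    enum buffering_state state;   /* last logical state */
};

/* Returns 1 when the logical state changed, 0 otherwise. */
static int update_buffering(struct buffering_hysteresis *h, int level)
{
    if (h->state == BUFFER_LOW && level >= h->high_watermark) {
        h->state = BUFFER_FULL;
        return 1;
    }
    if (h->state == BUFFER_FULL && level <= h->low_watermark) {
        h->state = BUFFER_LOW;
        return 1;
    }
    return 0; /* between the watermarks: keep the previous state */
}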

For the threshold change detection to work, we need to know the previous buffering level. There’s a problem, though: the current code checked the levels from several scattered places, so only one of those places (the first one that detected the threshold crossing at a given moment) would properly react. The other places would miss the detection and operate improperly, because the “previous buffering level value” had been overwritten with the new one when the evaluation had been done before. To solve this, I centralized the detection in a single place “per cycle” (in updateBufferingStatus()), and then used the detection conclusions from updateStates().

So, with all this in mind, I refactored the buffering logic as https://commits.webkit.org/284072@main, so now WebKit GStreamer has buffering code that is much more robust than before. The instabilities observed on Broadcom devices were gone and I could, at last, close Issue 1309.

by eocanha at October 16, 2024 06:12 AM

October 15, 2024

Alex Bradbury

Accessing a cross-compiled build tree from qemu-system

I'm cross-compiling a large codebase (LLVM and sub-projects such as Clang) and want to access the full tree - the source and the build artifacts from under qemu. This post documents the results of my experiments in various ways to do this. Note that I'm explicitly choosing to run timings using something that approximates the work I want to do rather than any microbenchmark targeting just file access time.

Requirements:

  • The whole source tree and the build directories are accessible from the qemu-system VM.
  • Writes must be possible, but don't need to be reflected in the original directory (so importantly, although files are written as part of the test being run, I don't need to get them back out of the virtual machine).

Results

The test simply involves taking a cross-compiled LLVM tree (with RISC-V as the only enabled target) and running the equivalent of ninja check-llvm on it after exposing / transferring it to the VM using the listed method. Results in chart form for overall time taken in the "empty cache" case:

9pfs:      13m05s
virtiofsd: 13m56s
squashfs:  11m33s
ext4:      11m42s
  • 9pfs: 13m05s with empty cache, 12m45s with warm cache
  • virtiofsd: 13m56s with empty cache, 13m49s with warm cache
  • squashfs: 11m33s (~17s image build, 11m16s check-llvm) with empty cache, 11m16s (~3s image build, 11m13s check-llvm) with warm cache
  • new ext4 filesystem: 11m42s, (~38s image build, 11m4s check-llvm) with empty cache, 11m34s (~30s image build, 11m4s check-llvm) with warm cache.

By way of comparison, it takes ~10m40s to naively transfer the data by piping it over ssh (tarring on one end, untarring on the other).

For this use case, building and mounting a filesystem seems the most compelling option, with ext4 being the simplest overall (no need to use overlayfs), assuming you have an e2fsprogs new enough to support tar input to mkfs.ext4. I'm not sure why I'm not seeing the purported performance improvements when using virtiofsd.

Notes on test setup

  • The effects of caching are accounted for by taking measurements in two scenarios below. Although neither is a precise match for the situation in a typical build system, it at least gives some insight into the impact of the page cache in a controlled manner.
    • "Empty cache". Caches are dropped (sync && echo 3 | sudo tee /proc/sys/vm/drop_caches) before building the filesystem (if necessary) and launching QEMU.
    • "Warm cache". Caches are dropped as above but immediately after that (i.e. before building any needed filesystem) we load all files that will be accessed so they will be cached. For simplicity, just do this by using a variant of the tar command described below.
  • Running on AMD Ryzen 9 7950X3D with 128GiB RAM.
  • QEMU 9.1.0, running qemu-system-riscv64, with 32 virtual cores, matching the 16 cores / 32 hyper-threads of the host.
  • LLVM HEAD as of 2024-10-13 (48deb3568eb2452ff).
  • Release+Asserts build of LLVM with all non-experimental targets enabled. Built in a two-stage cross targeting rv64gc_zba_zbb_zbs.
  • Running riscv64 Debian unstable within QEMU.
  • Results gathered via time ../bin/llvm-lit -sv . running from test/ within the build directory, and combining this with the timing for whatever command(s) are needed to transfer and extract the source+build tree.
  • Only the time taken to build the filesystem (if relevant) and to run the tests is measured. There may be very minor difference in the time taken to mount the different filesystems, but we assume this is too small to be worth measuring.
  • If I were to rerun this, I would invoke lit with --order=lexical to ensure runtimes aren't affected by the order of tests (e.g. long-running tests being scheduled last).
  • Total size of the files being exposed is ~8GiB or so.
  • Although not strictly necessary, I've tried to map to the uid:gid pair used in the guest where possible.

Details: 9pfs (QEMU -virtfs)

  • Add -virtfs local,path=$HOME/llvm-project/,mount_tag=llvm,security_model=none,id=llvm to the QEMU command line.
  • Within the guest sudo mount -t 9p -o trans=virtio,version=9p2000.L,msize=512000 llvm /mnt. I haven't swept different parameters for msize and mount reports 512000 is the maximum for the virtio transport.
  • Sadly 9pfs doesn't support ID-mapped mounts at the moment, which would allow us to easily remap the uid and gid used within the file tree on the host to those of an arbitrary user account in the guest without chowning everything (which would also require security_model=mapped-xattr or security_model=mapped-file to be passed to qemu). In the future you should be able to do sudo mount --bind --map-users 1000:1001:1 --map-groups 984:1001:1 /mnt mapped
  • We now need to set up an overlayfs that we can freely write to.
    • mkdir -p upper work llvm-project
    • sudo mount -t overlay overlay -o lowerdir=/mnt,upperdir=$HOME/upper,workdir=$HOME/work $HOME/llvm-project
  • You could create a user account with appropriate uid/gid to match those in the exposed directory, but for simplicity I've just run the tests as root within the VM.

Details: virtiofsd

  • Installed the virtiofsd package on Arch Linux (version 1.11.1). Refer to the documentation for examples of how to run.
  • Run /usr/lib/virtiofsd --socket-path=/tmp/vhostqemu --shared-dir=$HOME/llvm-project --cache=always
  • Add the following commands to your qemu launch command (where the memory size parameter to -object seems to need to match the -m argument you used):
  -chardev socket,id=char0,path=/tmp/vhostqemu \
  -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=llvm \
  -object memory-backend-memfd,id=mem,size=64G,share=on -numa node,memdev=mem
  • sudo mount -t virtiofs llvm /mnt
  • Unfortunately id-mapped mounts also aren't quite supported for virtiofs at the time of writing. The changes were recently merged into the upstream kernel and a merge request to virtiofsd is pending.
  • mkdir -p upper work llvm-project
  • sudo mount -t overlay overlay -o lowerdir=/mnt,upperdir=$HOME/upper,workdir=$HOME/work $HOME/llvm-project
  • As with the 9p tests, just ran the tests using sudo.

Details: squashfs

  • I couldn't get a regex that squashfs would accept to exclude all build/* directories except build/stage2cross, so use find to generate the individual directory exclusions.
  • As you can see in the commandline below, I disabled compression altogether.
  • mksquashfs ~/llvm-project llvm-project.squashfs -no-compression -force-uid 1001 -force-gid 1001 -e .git $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '-e build/%P ')
  • Add this to the qemu command line: -drive file=$HOME/llvm-project.squashfs,if=none,id=llvm,format=raw -device virtio-blk-device,drive=llvm
  • sudo mount -t squashfs /dev/vdb /mnt
  • Use the same instructions for setting up overlayfs as for 9pfs above. There's no need to run the tests using sudo as we've set the appropriate uid.

Details: new ext4 filesystem

  • Make use of the -d parameter of mkfs.ext4 which allows you to create a filesystem, using the given directory or tarball to initialise it. As mkfs.ext4 lacks features for filtering the input or forcing the uid/gid of files, we rely on tarball input (only just added in e2fsprogs 1.47.1).
  • Create a filesystem image that's big enough with fallocate -l 15GiB llvm-project.img
  • Execute:
tar --create \
  --file=- \
  --owner=1001 \
  --group=1001 \
  --exclude=.git \
  $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '--exclude=build/%P ') \
  -C $HOME/llvm-project . \
  | mkfs.ext4 -d - llvm-project.img
  • Attach the drive by adding this to the qemu command line -device virtio-blk-device,drive=hdb -drive file=$HOME/llvm-project.img,format=raw,if=none,id=hdb
  • Mount in qemu with sudo mount -t ext4 /dev/vdb $HOME/llvm-project

Details: tar piped over ssh

  • Essentially, use the tar command above and pipe it to tar running under qemu-system over ssh to extract it to root filesystem (XFS in my VM setup).
  • This could be optimised. e.g. by selecting a less expensive ssh encryption algorithm (or just netcat), a better networking backend to QEMU, and so on. But the intent is to show the cost of the naive "obvious" solution.
  • Execute:
tar --create \
  --file=- \
  --owner=1001 \
  --group=1001 \
  --exclude=.git \
  $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '--exclude=build/%P ') \
  -C $HOME/llvm-project . \
  | ssh -p10222 asb@localhost "mkdir -p llvm-project && tar xf - -C \
  llvm-project"

October 15, 2024 12:00 PM

October 09, 2024

Pablo Saavedra

Diagnosing Memory Leaks in WPEWebKit with Google Perftools


This blog post will focus on a memory issue case, and I’ll walk you through my experience investigating and resolving it. The issue involved WPEWebKit and the Cairo graphics library. I’ll also cover the tools I used for memory profiling and how I ultimately found the problem.

keywords: memory leak, WPEWebKit, Cairo, google-perftools.

The Context

“It’s not a matter of if, but when”.

As a software developer, whether during the coding phase or ongoing maintenance, encountering memory-related issues is nearly inevitable.

For those working in languages with manual memory management, especially seasoned developers of C and C++, bugs related to memory — such as double frees, invalid memory accesses, and leaks — are all too familiar.

Common memory problems, including memory leaks, buffer overflows, and insufficient allocation, can quickly escalate into significant headaches. These issues often result in degraded performance, system crashes, or even security vulnerabilities.

These kinds of bugs often arise in user-space programs, where memory is managed without special safeguards, let alone in more complex scenarios involving CMA, mmap(), DMAbuf…

Tackling these challenges requires a deep understanding of how your application handles memory, along with careful debugging and optimization to ensure efficient, safe memory usage.

In my experience, you won’t always be able to rely on the same tool or approach to analyze memory-related issues. More often than not, the code where the defect is found will be complex and, quite possibly, not written by you. Additionally, in many cases, you won’t be able to create a simple test case to replicate the issue until after investing significant time and effort in understanding and pinpointing the exact root cause. One such scenario is the case I want to discuss today.

Some time ago, I had to track down a memory issue in WPEWebKit. Given the nature of WebKit, the problem naturally involved threads, subprocesses, IPC message passing, and graphical buffers. To make matters more challenging, this was on a low-profile target device with very limited RAM and CPU power. This blog post serves as a brief summary of that experience.

The Case

“Same old story”.

Every good case has its usual suspects, and this one was no exception. The issue I encountered involved a webpage running a 2D animation on a canvas. Initial observations revealed a slow but steady increase in memory usage, eventually leading to memory exhaustion.

We knew it was a memory issue. The memory consumption of the WebProcess — the process responsible for composing and rendering the webpage frames — kept growing, slowly but steadily.

From the context and a series of checks on the webpage JavaScript logic, it seemed the error was related to the canvas painting or, more simply, a common JavaScript programming pitfall. Perhaps a temporary object was being stored in the global context, maybe in an array, or there could be a bug within WebKit’s canvas rendering code.

However, early observations are rarely definitive. They provide a starting point but often lead to multiple discarded ideas before the real issue is found. This case was no different; part of the process is separating the wheat from the chaff.

The Debug

Discarding JavaScript Pitfalls

“First things first …”

… before diving into the code, we need to rule out the possibility that the issue lies within the webpage itself.

To analyze potential JavaScript pitfalls, the WebKit Web Inspector provides two powerful tools: the Memory Timeline and the JavaScript Allocations Timeline. In short, the Memory Timeline helps categorize and understand memory usage, identify spikes, and detect overall memory growth. Meanwhile, the JavaScript Allocations Timeline assists in recording, comparing, and analyzing snapshots of the JavaScript heap to track memory growth and detect leaks.

The snapshot offers valuable insights into the memory distribution of your application, helping to identify leaks and inefficiencies. For a more detailed walkthrough, I would suggest a visit to the original WebKit blog post on memory debugging, which provides a comprehensive guide on how to use these tools.

Regarding my case, to use the console.takeHeapSnapshot(<label>) command in the Web Inspector, I needed to start by opening Web Inspector in the browser while running the web application. Here’s a simple explanation of how I used this:

  1. Taking a Heap Snapshot:
  • In the Web Inspector, go to the “Timelines” tab.
  • Open the console and type console.takeHeapSnapshot("Snapshot 1"), where "Snapshot 1" is a label of your choice. This command will create a snapshot of the JavaScript heap at that point in time.
  2. Interpreting the Snapshot:
  • After taking the snapshot, switch to the “Memory” tab to view the details.
  • You’ll see a graph and a list showing memory allocations by type and context.
  • Look for areas where memory usage is unexpectedly high or where objects aren’t being released as expected.

Unfortunately for me, the analysis didn’t shed any light on the issue. Multiple runs of console.takeHeapSnapshot(<label>) showed no steady increase in objects within the JavaScript Context (growth there would have indicated a design flaw in the webpage). Therefore, the webpage itself could be ruled out as the cause.

Creating a simplified scenario

“Time to read between the lines. Less is More”.

I began my investigation by dissecting the test page. Of course, the real webpage I had to debug was far more complex than the simplified example I will use here for illustration. For clarity’s sake, I’ll present a distilled version of the relevant parts.

<!DOCTYPE html>
<html>
<body>
<h1>Test</h1>

<canvas id="myCanvas" width="620" height="620" style="border:1px solid grey"></canvas>

<div id="id-text"/>

<script>

function addText() {
    var element = document.getElementById('id-text');

    var newHtml = `
    <div id="id-text" style="top: 146px; left: 40px; width: 250px; height: 334px; background-color: transparent; position: absolute; color: rgb(0, 0, 255);">
        <div>Text</div>
    </div>`;

    element.outerHTML = newHtml;
}

const canvas = document.getElementById("myCanvas");
const max_size = 400;
canvas.width = canvas.height = max_size;
canvas.style.position = "absolute";
canvas.style.cursor = "not-allowed";
canvas.style.pointerEvents = "none";
canvas.style.backgroundColor = "black";


const ctx = canvas.getContext("2d", { alpha: true });
ctx.fillStyle = "red";


var frame_counter = 0;
const frame_rate = 60;
const duration = 5;
const margin = 20;

function drawRandomRectangle() {

    size = (max_size - 2*margin) * (frame_counter++ / (frame_rate * duration));
    // console.log("Size " + size);

    // Clear the canvas
    if (size > max_size - 2*margin)
    {
        console.log("Clear");
        ctx.clearRect(0, 0, max_size, max_size);
        frame_counter = 0;
    }
    else
    {
        // Draw the rectangle with size dimensions
        ctx.fillRect(margin, margin, size, size);
    }

    addText();

    window.requestAnimationFrame(drawRandomRectangle);
}

window.requestAnimationFrame(drawRandomRectangle);

</script>

</body>
</html>

The webpage creates a canvas element where a red rectangle is animated to grow in size over time, resetting once it reaches a maximum size. The canvas is styled with a black background, and its pointer events are disabled. Additionally, the addText function dynamically updates a text element, positioning it on the page as an absolutely positioned blue text block over the canvas. The drawRandomRectangle function handles the animation of the rectangle by gradually increasing its size, clearing the canvas when the rectangle exceeds the specified dimensions, and restarting the animation. The function is continuously called using requestAnimationFrame to ensure a smooth animation.

In the next image, we can see the evolution of memory consumption over time. It’s important to note, although it’s not very clear in the graph, that the situation eventually leads to the Linux OOM (Out-Of-Memory) killer being triggered due to memory starvation caused by the WPEWebProcess consuming too much memory.

One potential complication I forgot to mention earlier is that this issue was reported on a WPEWebKit instance running on an ARM-based board, which is a common scenario when dealing with WPEWebKit. Problems like these are often tied to a specific target machine, and two possibilities arise: either the issue is linked to the hardware or software stack, influencing the behavior, or, with some luck, the problem can be replicated on x86, avoiding the challenges of cross-compilation.

The good thing about memory issues is that they tend to be software-related rather than hardware-specific. Even so, the problem may still be caused by the libraries, versions, or specific compilation flags used to build WebKit for that target. So, the first step was to see if I could replicate the problem on x86. The guiding principle: Let’s compile on my laptop and see if we can continue debugging directly on it.

time Tools/Scripts/build-webkit --makeargs=-j14 --wpe --release --cmakeargs "-DENABLE_INTROSPECTION=OFF -DENABLE_DOCUMENTATION=OFF -DENABLE_COG=OFF" --no-bubblewrap-sandbox
./Tools/Scripts/run-minibrowser --wpe "https://people.igalia.com/psaavedra/test/20220530_issue.html"

I proceeded to measure the memory behavior and was fortunate to find that the problem could be reproduced on the laptop too, allowing us to continue debugging without the need for remote GDB servers or cross-compilation. Running everything natively on x86 simplifies the process significantly.

Let’s Get to Work: Profiling Memory with Google-Perftools

As I’ve mentioned, memory issues can be a serious problem in software development. Often, these issues are tied to segmentation faults, potentially caused by incorrect overwriting in the program logic. In those cases, we can rely on GDB and access debug symbols to perform a deep debugging session, track down the problem, and fix it.

Also, when faced with a crash caused by double-free errors or with memory leaks, tools like Valgrind or libASAN are invaluable for performing heap checks, identifying who initialized the memory, where it was released, and getting us closer to a solution. These tools are helpful for detecting direct memory leaks, where a pointer to a previously allocated memory area is lost.

However, not all memory issues fit the typical definition of leaks. In this case, I encountered a different kind of memory leak. These aren’t leaks in the strict sense, but rather logical programming errors that result in unchecked memory growth. Typically, this happens when values are stored in memory — often within data structures like stacks, lists, trees, or hashes — without proper cleanup or release when they’re no longer needed.

In these situations, the memory remains valid and accessible to the application, but due to a logical flaw, it is never freed, causing the program’s memory usage to grow uncontrollably. For these situations, a memory heap checker can be what you really need.
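To make this concrete, here is a minimal, self-contained C++ sketch (not WebKit code) of this kind of growth: every allocation stays reachable through a cache that is never pruned, so leak detectors report nothing as lost, yet memory usage climbs until the OOM killer steps in.

#include <map>
#include <string>
#include <vector>

// Hypothetical frame cache: entries are added on every frame but nothing
// ever evicts them. The memory is still reachable, so it is not a "leak"
// in the strict sense, yet usage grows without bound.
class FrameCache {
public:
    void store(const std::string& key, std::vector<unsigned char> pixels) {
        cache_[key] = std::move(pixels); // no eviction policy: grows forever
    }
private:
    std::map<std::string, std::vector<unsigned char>> cache_;
};

int main() {
    FrameCache cache;
    for (unsigned frame = 0; ; ++frame) {
        // Each iteration parks ~1 MB in the cache that is never released.
        cache.store("frame-" + std::to_string(frame),
                    std::vector<unsigned char>(1024 * 1024));
    }
}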

I found Google-Perftools to be an incredibly useful tool for tracking and analyzing memory usage in these kinds of cases. Designed as a memory profiler with a focus on performance, Google Perftools helps identify inefficiencies without adding significant overhead. Unlike AddressSanitizer (ASAN), which is primarily geared towards detecting memory-related bugs like out-of-bounds access or use-after-free errors, Google Perftools emphasizes performance optimization. It provides tcmalloc (a fast memory allocator), a heap profiler, and a memory leak detector, making it a versatile option for debugging memory issues.

The following steps describe what I did to debug the memory issue using Google Perftools.

Step 1: Installing Google Perftools

First, install the required packages:

sudo apt-get install google-perftools libgoogle-perftools-dev

Step 2: Building WPEWebKit and Linking with Perftools

To build WPE and link it against Perftools:

export LDFLAGS="-lprofiler -ltcmalloc"
time Tools/Scripts/build-webkit --makeargs=-j14 --wpe --release --cmakeargs "-DENABLE_THUNDER=OFF -DENABLE_INTROSPECTION=OFF -DENABLE_DOCUMENTATION=OFF -DENABLE_COG=OFF" --no-bubblewrap-sandbox

Step 3: Injecting Perftools Environment Variables

Next, inject Perftools’ environment variables into the WPEWebProcess:

mv /home/psaavedra/dev/WebKit/WebKitBuild/WPE/Release/bin/WPEWebProcess /home/psaavedra/dev/WebKit/WebKitBuild/WPE/Release/bin/WPEWebProcess.real

cat << EOF > /home/psaavedra/dev/WebKit/WebKitBuild/WPE/Release/bin/WPEWebProcess
#!/bin/bash
export HEAPPROFILE=/tmp/prof.log HEAP_PROFILE_INUSE_INTERVAL=1048576 HEAP_PROFILE_MMAP=true
exec /home/psaavedra/dev/WebKit/WebKitBuild/WPE/Release/bin/WPEWebProcess.real \$@
EOF

chmod +x /home/psaavedra/dev/WebKit/WebKitBuild/WPE/Release/bin/WPEWebProcess

Step 4: Run the minibrowser

./Tools/Scripts/run-minibrowser --wpe "https://people.igalia.com/psaavedra/test/20220530_issue.html"
Starting tracking the heap
Dumping heap profile to /tmp/prof.log.0001.heap (1 MB currently in use)
Dumping heap profile to /tmp/prof.log.0002.heap (2 MB currently in use)
Dumping heap profile to /tmp/prof.log.0003.heap (3 MB currently in use)
Dumping heap profile to /tmp/prof.log.0004.heap (4 MB currently in use)
...

Step 5: Generating a Visual Memory Report

To visualize the memory usage, use the google-pprof tool:

google-pprof --base=/tmp/prof.log.0009.heap --svg /home/psaavedra/dev/WebKit/WebKitBuild/WPE/Release/bin/WPEWebProcess /tmp/prof.log.0011.heap > memory_profiling.svg

The command compares the memory usage between two heap profiles (prof.log.0009.heap as the base and prof.log.0011.heap as the target), generates a visual report of the differences as an SVG file (memory_profiling.svg), and uses WPEWebProcess as the executable to analyze:

  1. --base=/tmp/prof.log.0009.heap: This option specifies the base heap profile file. The base profile represents the state of the heap at an earlier point in time. When generating a memory profile, pprof will compare the current profile against this base to show the difference in memory allocation between these two points.
  2. /tmp/prof.log.0011.heap: This file represents a memory profile captured at a later point in time compared to the base. It contains memory allocation and usage information from that moment in execution.
  3. --svg: This option instructs pprof to output the memory profiling information as an SVG file, a visual representation of the memory usage.
  4. /home/psaavedra/dev/WebKit/WebKitBuild/WPE/Release/bin/WPEWebProcess: This is the binary executable that was profiled. It’s the target program for which the memory profiling was performed (in this case, WPEWebProcess), and it’s necessary for pprof to properly analyze the memory allocations.
  5. > memory_profiling.svg: This part redirects the output of the command (the generated SVG image) to the memory_profiling.svg file.

Step 6 (Final): Graph Analysis

“Where there’s smoke, there’s fire”.

The memory profiling revealed that memory reserved around cairo_scaled_font_create was not being released. Interestingly, this wasn’t a typical memory leak but was tied to frequent caching during repeated text rendering in the 2D canvas, leading to memory exhaustion.

The issue was traced back to text rendering on the canvas, particularly when overlapping texts were involved. Simply removing the overlapping texts resolved the problem.

This information immediately pointed to the root cause of the problem, which I quickly identified as an issue already resolved in newer versions of the Cairo library. In particular, we encountered this Cairo issue (#514e3649469):

Fix font options leak in _cairo_gstate_ensure_scaled_font()

Font options are allocated in _cairo_gstate_ensure_scaled_font() for local
processing, but never freed. Run _cairo_font_options_fini() on these and
fix the leak.
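
Purely as an illustration of that class of bug (this is not the actual Cairo fix, which touches Cairo internals as quoted above), a locally created cairo_font_options_t that is never destroyed looks like this:

#include <cairo.h>

// Illustrative only: font options created for local processing but never
// released, mirroring the pattern described in the commit message above.
static void buildScaledFontLeaky(cairo_surface_t* surface) {
    cairo_font_options_t* options = cairo_font_options_create();
    cairo_surface_get_font_options(surface, options);
    // ... use the merged options to create a scaled font ...
    // BUG: missing cairo_font_options_destroy(options);
}

int main() {
    cairo_surface_t* surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 1, 1);
    buildScaledFontLeaky(surface); // leaks one cairo_font_options_t per call
    cairo_surface_destroy(surface);
}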

The Solution

“Nunca choveu que non escampara” (Galician proverb: “It never rained so long that it didn’t clear up”).

In the end, I was fortunate to encounter an issue that had already been addressed in the upstream version of the software. This significantly simplified the troubleshooting process, as a known solution was readily available. The memory leak was traced back to a bug in Cairo version 1.16, which had been identified and resolved in the more recent Cairo version 1.18.

As a result, the fix was straightforward: updating the Cairo library to version 1.18. After applying this update, the memory leak was fully resolved, highlighting the importance of keeping dependencies up to date and leveraging upstream fixes.

Here are some key lessons learned to avoid similar problems in the future:

  • Regularly updating third-party libraries and staying informed about known issues is essential for maintaining software stability.
  • Utilizing tools like Google-Perftools for memory profiling can help detect leaks early and provide a quick understanding of memory-related issues.

by Pablo Saavedra at October 09, 2024 06:25 AM

October 01, 2024

Alex Bradbury

Arch Linux on remote server setup runbook

As I develop, I tend to offload a lot of the computationally heavy build or test tasks to a remote build machine. It's convenient for this to match the distribution I use on my local machine, and I so far haven't felt it would be advantageous to add another more server-oriented distro to the mix to act as a host to an Arch Linux container. This post acts as a quick reference / runbook for me on the occasions I want to spin up a new machine with this setup. To be very explicit, this is shared as something that might happen to be helpful if you have similar requirements, or to give some ideas if you have different ones. I definitely don't advocate that you blindly copy it.

Although this is something that could be fully automated, I find structuring this kind of thing as a series of commands to copy and paste and check/adjust manually hits the sweet spot for my usage. As always, the Arch Linux wiki and its install guide are fantastic references that you should go and check if you're unsure about anything listed here.

Details about my setup

  • I'm typically targeting Hetzner servers with two SSDs. I'm not worried about data loss or minor downtime waiting for a disk to be replaced before re-imaging, so RAID0 is just fine. The commands below assume this setup.
  • I do want encrypted root just in case anything sensitive is ever copied to the machine. As it is a remote machine, remote unlock over SSH must be supported (i.e. initrd contains an sshd that will accept the passphrase for unlocking the root partition before continuing the boot process).
    • The described setup means that if the disk is thrown away or given to another customer, nothing from the root partition should be lying around on it in plain text. As the boot partition is unencrypted, there are a whole bunch of threat models where someone is able to modify that boot partition which wouldn't be addressed by this approach.
  • The setup also assumes you start from some kind of rescue environment (in my case Hetzner's) that means you can freely repartition and overwrite the disk drives.
  • If you're not on Hetzner, you'll have to use different mirrors, as Hetzner's mirrors aren't available externally.
  • Hosts typically provide their own base images and installer scripts. The approach detailed here is good if you'd rather control the process yourself, and although there are a few hoster-specific bits in here it's easy enough to replace them.
  • For simplicity I use UEFI boot and EFI stub. If you have problems with efibootmgr, you might want to use systemd-boot instead.
  • Disclaimer: As is hopefully obvious, the below instructions will overwrite anything on the hard drive of the system you're running them on.

Setting up and chrooting into Arch bootstrap environment from rescue system

See the docs on the Hetzner rescue system for details on how to enter it. If you've previously used this server, you likely want to remove old known_hosts entries with ssh-keygen -R your.server.

Now ssh in and do the following:

wget http://mirror.hetzner.de/archlinux/iso/latest/archlinux-bootstrap-x86_64.tar.zst
tar -xvf archlinux-bootstrap-x86_64.tar.zst --numeric-owner
sed -i '1s;^;Server=https://mirror.hetzner.de/archlinux/$repo/os/$arch\n\n;' root.x86_64/etc/pacman.d/mirrorlist
mount --bind root.x86_64/ root.x86_64/ # See <https://bugs.archlinux.org/task/46169>
printf "About to enter bootstrap chroot\n===============================\n"
./root.x86_64/bin/arch-chroot root.x86_64/

You are now chrooted into the Arch bootstrap environment. We will do as much work from there as possible so as to minimise dependency on tools in the Hetzner rescue system.

Set up drives and create and chroot into actual rootfs

The goal is now to set up the drives, perform an Arch Linux bootstrap, and chroot into it. The UEFI partition is placed at 1MiB, which is pretty much guaranteed to be properly aligned for any reasonable physical sector size (remembering space must be left at the beginning of the disk for the partition table itself).

We'll start by collecting info that will be used throughout the process:

printf "Enter the desired new hostname:\n"
read NEW_HOST_NAME; export NEW_HOST_NAME
printf "Enter the desired passphrase for unlocking the root partition:\n"
read ROOT_PART_PASSPHRASE
printf "Enter the public ssh key (e.g. cat ~/.ssh/id_ed22519.pub) that will be used for access:\n"
read PUBLIC_SSH_KEY; export PUBLIC_SSH_KEY
printf "Enter the name of the main user account to be created:\n"
read NEW_USER; export NEW_USER
printf "You've entered the following information:\n"
printf "NEW_HOST_NAME: %s\n" "$NEW_HOST_NAME"
printf "ROOT_PART_PASSPHRASE: %s\n" "$ROOT_PART_PASSPHRASE"
printf "PUBLIC_SSH_KEY: %s\n" "$PUBLIC_SSH_KEY"
printf "NEW_USER: %s\n" "$NEW_USER"

And now proceed to set up the disks, create filesystems, perform an initial bootstrap and chroot into the new rootfs:

pacman-key --init
pacman-key --populate archlinux
pacman -Sy --noconfirm xfsprogs dosfstools mdadm cryptsetup

for M in /dev/md?; do mdadm --stop $M; done
for DRIVE in /dev/nvme0n1 /dev/nvme1n1; do
  sfdisk $DRIVE <<EOF
label: gpt

start=1MiB, size=255MiB, type=uefi
start=256MiB, type=raid
EOF
done

# Warning from mdadm about falling back to creating md0 via node seems fine to ignore
yes | mdadm --create --verbose --level=0 --raid-devices=2 --homehost=any /dev/md/0 /dev/nvme0n1p2 /dev/nvme1n1p2

mkfs.fat -F32 /dev/nvme0n1p1
printf "%s\n" "$ROOT_PART_PASSPHRASE" | cryptsetup -y -v luksFormat --batch-mode /dev/md0
printf "%s\n" "$ROOT_PART_PASSPHRASE" | cryptsetup open /dev/md0 root
unset ROOT_PART_PASSPHRASE

mkfs.xfs /dev/mapper/root # rely on this calculating appropriate sunit and swidth

mount /dev/mapper/root /mnt
mkdir /mnt/boot
mount /dev/nvme0n1p1 /mnt/boot
pacstrap /mnt base linux linux-firmware efibootmgr \
  xfsprogs dosfstools mdadm cryptsetup \
  mkinitcpio-netconf mkinitcpio-tinyssh mkinitcpio-utils python3 \
  openssh sudo net-tools git man-db man-pages vim
genfstab -U /mnt >> /mnt/etc/fstab
# Note that sunit and swidth in fstab confusingly use different units vs those shown by xfs <https://superuser.com/questions/701124/xfs-mount-overrides-sunit-and-swidth-options>

mdadm --detail --scan >> /mnt/etc/mdadm.conf
printf "About to enter newrootfs chroot\n===============================\n"
arch-chroot /mnt

Do final configuration from within the new rootfs chroot

Unfortunately I wasn't able to find a way to get IPv6 configured via DHCP on Hetzner's network, so we rely on hardcoding it (which seems to be their recommendation).

We set up sudo and also configure ssh to disable password-based login. A small swapfile is configured in order to allow the kernel to move allocated but not actively used pages there if deemed worthwhile.

sed /etc/locale.gen -i -e "s/^\#en_GB.UTF-8 UTF-8.*/en_GB.UTF-8 UTF-8/"
locale-gen
# Ignore "System has not been booted with systemd" and "Failed to connect to bus" error for next command.
systemd-firstboot --locale=en_GB.UTF-8 --timezone=UTC --hostname="$NEW_HOST_NAME"
ln -s /dev/null /etc/udev/rules.d/80-net-setup-link.rules # disable persistent network names

ssh-keygen -A # generate openssh host keys for the tinyssh hook to pick up
printf "%s\n" "$PUBLIC_SSH_KEY" > /etc/tinyssh/root_key
sed /etc/mkinitcpio.conf -i -e 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block mdadm_udev netconf tinyssh encryptssh filesystems fsck)/'
# The call to tinyssh-convert in mkinitcpio-tinyssh is incorrect, so do it
# manually <https://github.com/grazzolini/mkinitcpio-tinyssh/issues/10>
tinyssh-convert /etc/tinyssh/sshkeydir < /etc/ssh/ssh_host_ed25519_key
mkinitcpio -p linux

printf "efibootmgr before changes:\n==========================\n"
efibootmgr -u
# Clean up any other efibootmgr entries from previous installs
for BOOTNUM in $(efibootmgr | grep '^Boot0' | grep -v PXE | sed 's/^Boot\([0-9]*\).*/\1/'); do
  efibootmgr -b $BOOTNUM -B
done
# Set up efistub
efibootmgr \
  --disk /dev/nvme0n1 \
  --part 1 \
  --create \
  --label 'Arch Linux' \
  --loader /vmlinuz-linux \
  --unicode "root=/dev/mapper/root rw cryptdevice=UUID=$(blkid -s UUID -o value /dev/md0):root:allow-discards initrd=\initramfs-linux.img ip=:::::eth0:dhcp" \
  --verbose
efibootmgr -o 0001,0000 # prefer PXE boot. May need to check IDs
printf "efibootmgr after changes:\n=========================\n"
efibootmgr -u

mkswap --size=8G --file /swapfile
cat - <<EOF > /etc/systemd/system/swapfile.swap
[Unit]
Description=Swap file

[Swap]
What=/swapfile

[Install]
WantedBy=multi-user.target
EOF
systemctl enable swapfile.swap

cat - <<EOF > /etc/systemd/network/10-eth0.network
[Match]
Name=eth0

[Network]
DHCP=yes
Address=$(ip -6 addr show dev eth0 scope global | grep "scope global" | cut -d' ' -f6)
Gateway=$(ip route show | head -n 1 | cut -d' ' -f 3)
Gateway=fe80::1
EOF
systemctl enable systemd-networkd.service systemd-resolved.service systemd-timesyncd.service
printf "PasswordAuthentication no\n" > /etc/ssh/sshd_config.d/20-no-password-auth.conf
systemctl enable sshd.service
useradd -m -g users -G wheel -s /bin/bash "$NEW_USER"
usermod --pass='!' root # disable root login
chmod +w /etc/sudoers
printf "%%wheel ALL=(ALL) ALL\n" >> /etc/sudoers
chmod -w /etc/sudoers
mkdir "/home/$NEW_USER/.ssh"
printf "%s\n" "$PUBLIC_SSH_KEY" > "/home/$NEW_USER/.ssh/authorized_keys"
chmod 700 "/home/$NEW_USER/.ssh"
chmod 600 "/home/$NEW_USER/.ssh/authorized_keys"
chown -R "$NEW_USER:users" "/home/$NEW_USER/.ssh"

Set user password and reboot

# The next command will ask for the user password (needed for sudo).
passwd "$NEW_USER"

Then ctrl+d twice and reboot.

Unlocking rootfs and logging in

Firstly, you likely want to remove any saved host public keys, as there will be a new one for the newly provisioned machine: ssh-keygen -R your.server. The server will have a different host key for the root key unlock, so I'd recommend using a different known_hosts file for unlock, e.g. ssh -o UserKnownHostsFile=~/.ssh/known_hosts_unlock root@your.server. If you put something like the following in your ~/.ssh/config you could just ssh unlock-foo-host:

Host unlock-foo-host
HostName your.server
User root
UserKnownHostsFile ~/.ssh/known_hosts_unlock

Once you've successfully entered the key to unlock the rootfs, you can just ssh as normal to the server.


Article changelog
  • 2024-10-01: Initial publication date.

October 01, 2024 12:00 PM

September 28, 2024

Qiuyi Zhang (Joyee)

Reproducible Node.js built-in snapshots, part 3 - fixing V8 startup snapshot

In the previous posts we looked into how the Node.js executable was made a bit more reproducible after the Node.js snapshot data and the V8 code cache were made reproducible, and did a bit of anatomy on the unreproducible V8 snapshot blobs

by Joyee Cheung at September 28, 2024 08:44 AM

Reproducible Node.js built-in snapshots, part 2 - V8 code cache and snapshot blobs

In the previous post, we covered how the Node.js built-in snapshot is generated and embedded into the executable, and how I fixed the Node

by Joyee Cheung at September 28, 2024 07:44 AM

Reproducible Node.js built-in snapshots, part 1 - Overview and Node.js fixes

In a recent effort to make the Node.js builds reproducible again (at least on Linux), a big remaining piece was the built-in snapshot. I wrote some notes about the journey of making the Node

by Joyee Cheung at September 28, 2024 06:44 AM

September 27, 2024

Carlos García Campos

Graphics improvements in WebKitGTK and WPEWebKit 2.46

WebKitGTK and WPEWebKit recently released a new stable version 2.46. This version includes important changes in the graphics implementation.

Skia

The most important change in 2.46 is the introduction of Skia to replace Cairo as the 2D graphics renderer. Skia supports rendering using the GPU, which is now the default, but we also use it for CPU rendering using the same threaded rendering model we had with Cairo. The architecture hasn’t changed much for GPU rendering: we use the same tiled rendering approach, but buffers for dirty regions are rendered in the main thread as textures. The compositor waits for textures to be ready using fences and copies them directly to the compositor texture. This was the simplest approach that already resulted in much better performance, especially on desktops with more powerful GPUs. In embedded systems, where GPUs are not so powerful, it’s still better to use the CPU with several rendering threads in most cases. It’s still too early to announce anything, but we are already experimenting with different models to improve the performance even more and make better usage of the GPU in embedded devices.

Skia has received several GCC-specific optimizations lately, but it’s always more optimized when built with Clang. The optimizations are more noticeable in performance when using the CPU for rendering. For this reason, since version 2.46 we recommend building WebKit with Clang for the best performance. GCC is still supported, of course, and performance when built with GCC is quite good too.

HiDPI

Even though there aren’t specific changes about HiDPI in 2.46, users of high resolution screens using a device scale factor bigger than 1 will notice much better performance thanks to scaling being a lot faster on the GPU.

Accelerated canvas

The 2D canvas can be accelerated independently of whether the CPU or the GPU is used for painting layers. In 2.46 there’s a new setting, WebKitSettings:enable-2d-canvas-acceleration, to control the 2D canvas acceleration. In some embedded devices the combination of CPU rendering for layer tiles and GPU for the canvas gives the best performance. The 2D canvas is normally rendered into an image buffer that is then painted in the layer as an image. We changed that for the accelerated case, so that the canvas is now rendered into a texture that is copied to a compositor texture to be directly composited instead of painted into the layer as an image. In 2.46 the offscreen canvas is enabled by default.

There are more cases where accelerating the canvas is not desired; for example, when the canvas size is not big enough, it’s faster to use the CPU. The same is true when there are going to be many operations to “download” pixels from the GPU. Since this is not always easy to predict, in 2.46 we added support for the willReadFrequently canvas setting, so that when set by the application when creating the canvas it causes the canvas to be always unaccelerated.

Filters

All the CSS filters are now implemented using Skia APIs, and accelerated when possible. The most noticeable change here is that sites using blur filters are no longer slow.

Color spaces

Skia brings native support for color spaces, which allows us to greatly simplify the color space handling code in WebKit. WebKit uses color spaces in many scenarios – but especially in the case of SVG and filters. For some filters, color spaces are necessary as some operations are simpler to perform in linear sRGB. A good example of that is the feDiffuseLighting filter – it yielded wrong visual results for a very long time in the Cairo-based implementation, as Cairo doesn’t have support for color spaces. At some point, however, the Cairo-based WebKit implementation was fixed by converting pixels to linear in place before applying the filter and converting pixels in place back to sRGB afterwards. Such workarounds are not necessary anymore as with Skia all the pixel-level operations are handled in a color-space-transparent way as long as proper color space information is provided. This not only makes the results of some filters correct, but improves performance and opens new possibilities for acceleration.
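For reference, the pre-Skia workaround described above boils down to something like the following sketch, using the standard sRGB transfer functions (this is an illustration, not WebKit’s actual code):

#include <cmath>
#include <vector>

// Standard sRGB <-> linear transfer functions for normalized values in [0, 1].
static float srgbToLinear(float c) {
    return c <= 0.04045f ? c / 12.92f : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

static float linearToSrgb(float l) {
    return l <= 0.0031308f ? l * 12.92f : 1.055f * std::pow(l, 1.0f / 2.4f) - 0.055f;
}

// Sketch of the Cairo-era workaround: convert pixels to linear in place,
// run the lighting filter, then convert back. With Skia's native color
// space support this explicit round-trip is no longer necessary.
template <typename Filter>
void applyFilterInLinearSpace(std::vector<float>& pixels, Filter filter) {
    for (float& c : pixels) c = srgbToLinear(c);
    filter(pixels); // e.g. a feDiffuseLighting-style pass
    for (float& c : pixels) c = linearToSrgb(c);
}

int main() {
    std::vector<float> pixels = {0.25f, 0.5f, 0.75f};
    applyFilterInLinearSpace(pixels, [](std::vector<float>&) { /* no-op filter */ });
}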

Font rendering

Font rendering is probably the most noticeable visual change after the Skia switch with mixed feedback. Some people reported that several sites look much better, while others reported problems with kerning in other sites. In other cases it’s not really better or worse, it’s just that we were used to the way fonts were rendered before.

Damage tracking

WebKit already tracks the area of the layers that has changed in order to paint only the dirty regions. This means that we only repaint the areas that changed, but the compositor incorporates them and the whole frame is always composited and passed to the system compositor. In 2.46 there’s experimental code to track the damage regions and pass them to the system compositor in addition to the frame. Since this is experimental it’s disabled by default, but can be enabled with the runtime feature PropagateDamagingInformation. There’s also a UnifyDamagedRegions feature that can be used in combination with PropagateDamagingInformation to unify the damage regions into one before passing it to the system compositor. We still need to analyze the impact of damage tracking on performance before enabling it by default. We have also started an experiment to use the damage information in the WebKit compositor and avoid compositing the entire frame every time.

GPU info

Working on graphics can be really hard on Linux; there are too many variables that can result in different outputs for different users: the driver version, the kernel version, the system compositor, the EGL extensions available, etc. When something doesn’t work for some people and works for others, it’s key for us to gather as much information as possible about the graphics stack. In 2.46 we have added more useful information to webkit://gpu, like the DMA-BUF buffer format and modifier used (for the GTK port and WPE when using the new API). Very often the symptom is the same, nothing is rendered in the web view, even when the causes could be very different. For those cases, it’s even more difficult to gather the info because webkit://gpu doesn’t render anything either. In 2.46 it’s possible to load webkit://gpu/stdout to get the information as JSON directly in stdout.

Sysprof

Another common symptom for people having problems is that a particular website is slow to render, while for others it works fine. In these cases, in addition to the graphics stack information, we need to figure out where we are slower and why. This is very difficult to fix when you can’t reproduce the problem. We added initial support for profiling in 2.46 using sysprof. The code already has some marks so that when run under sysprof we get useful information about timings of several parts of the graphics pipeline.

Next

This is just the beginning, we are already working on changes that will allow us to make a better use of both the GPU and CPU for the best performance. We have also plans to do other changes in the graphics architecture to improve synchronization, latency and security. Now that we have adopted sysprof for profiling, we are also working on improvements and new tools.

by carlos garcia campos at September 27, 2024 10:30 AM

Ricardo García

Waiter, there's an IES in my DGC!

Finally! Yesterday Khronos published Vulkan 1.3.296 including VK_EXT_device_generated_commands. Thousands of engineering hours seeing the light of day, and awesome news for Linux gaming.

Device-Generated Commands, or DGC for short, are Vulkan’s equivalent to ExecuteIndirect in Direct3D 12. Thanks to this extension, originally based on a couple of NVIDIA vendor extensions, it will be possible to prepare sequences of commands to run directly from the GPU, and to execute those sequences without any data going through the CPU. Also, Proton now has a much more official leg to stand on when it has to translate ExecuteIndirect from D3D12 to Vulkan while you run games such as Starfield.

The extension not only provides functionality equivalent to ExecuteIndirect. It goes beyond that and offers more fine-grained control like explicit preprocessing of command sequences, or switching shaders and pipelines with each sequence thanks to something called Indirect Execution Sets, or IES for short, that potentially work with ray tracing, compute and graphics (both regular and mesh shading).

As part of my job at Igalia, I’ve implemented CTS tests for this extension and I had the chance to work very closely with an awesome group of developers discussing specification, APIs and test needs. I hope I don’t forget anybody and apologize in advance if so.

  • Mike Blumenkrantz, of course. Valve contractor, Super Good Coder and current OpenGL Working Group chair who took the initial specification work from Patrick Doane and carried it across the finish line. Be sure to read his blog post about DGC. Also incredibly important for me: he developed, and kept up-to-date, an implementation of the extension for lavapipe, the software Vulkan driver from Mesa. This was invaluable in allowing me to create tests for the extension much faster and making sure tests were in good shape when GPU driver authors started running them.

  • Spencer Fricke from LunarG. Spencer did something fantastic here. For the first time, the needed changes in the Vulkan Validation Layers for such a large extension were developed in parallel while tests and the spec were evolving. His work will be incredibly useful for app developers using the extension in their games. It also allowed me to detect test bugs and issues much earlier and fix them faster.

  • Samuel Pitoiset (Valve contractor), Connor Abbott (Valve contractor), Lionel Landwerlin (Intel) and Vikram Kushwaha (NVIDIA) providing early implementations of the extension, discussing APIs, reporting test bugs and needs, and making sure the extension works as well as possible for a variety of hardware vendors out there.

  • To a lesser degree, most others mentioned as spec contributors for the extension, such as Hans-Kristian Arntzen (Valve contractor), Baldur Karlsson (Valve contractor), Faith Ekstrand (Collabora), etc, making sure the spec works for them too and makes sense for Proton, RenderDoc, and drivers such as NVK and others.

If you’ve noticed, a significant part of the people driving this effort work for Valve and, from my side, the work has also been carried as part of Igalia’s collaboration with them. So my explicit thanks to Valve for sponsoring all this work.

If you want to know a bit more about DGC, stay tuned for future talks about this topic. In about a couple of weeks, I’ll present a lightning talk (5 mins) with an overview at XDC 2024 in Montreal. Don’t miss it!

September 27, 2024 09:42 AM

September 25, 2024

Melissa Wen

Reflections on 2024 Linux Display Next Hackfest

Hey everyone!

The 2024 Linux Display Next hackfest concluded in May, and its outcomes continue to shape the Linux Display stack. Igalia hosted this year’s event in A Coruña, Spain, bringing together leading experts in the field. Samuel Iglesias and I organized this year’s edition and this blog post summarizes the experience and its fruits.

One of the highlights of this year’s hackfest was the wide range of backgrounds represented by our 40 participants (both on-site and remotely). Developers and experts from various companies and open-source projects came together to advance the Linux Display ecosystem. You can find the list of participants here.

The event covered a broad spectrum of topics affecting the development of Linux projects, user experiences, and the future of display technologies on Linux. From cutting-edge topics to long-term discussions, you can check the event agenda here.

Organization Highlights

The hackfest was marked by in-depth discussions and knowledge sharing among Linux contributors, making everyone inspired, informed, and connected to the community. Building on feedback from the previous year, we refined the unconference format to enhance participant preparation and engagement.

Structured Agenda and Timeboxes: Each session had a defined scope, time limit (1h20 or 2h10), and began with an introductory talk on the topic.

  • Participant-Led Discussions: We pre-selected in-person participants to lead discussions, allowing them to prepare introductions, resources, and scope.
  • Transparent Scheduling: The schedule was shared in advance as GitHub issues, encouraging participants to review and prepare for sessions of interest.

Engaging Sessions: The hackfest featured a variety of topics, including presentations and discussions on how participants were addressing specific subjects within their companies.

  • No Breakout Rooms, No Overlaps: All participants chose to attend all sessions, eliminating the need for separate breakout rooms. We also adapted the run-time schedule to keep everybody involved in the same topics.
  • Real-time Updates: We provided notifications and updates through dedicated emails and the event matrix room.

Strengthening Community Connections: The hackfest offered ample opportunities for networking among attendees.

  • Social Events: Igalia sponsored coffee breaks, lunches, and a dinner at a local restaurant.

  • Museum Visit: Participants enjoyed a sponsored visit to the Museum of Estrella Galicia Beer (MEGA).

Fruitful Discussions and Follow-up

The structured agenda and breaks allowed us to cover multiple topics during the hackfest. These discussions have led to new display feature development and improvements, as evidenced by patches, merge requests, and implementations in project repositories and mailing lists.

With the KMS color management API taking shape, we discussed refinements and the best approaches to cover the variety of color pipelines from different hardware vendors. We are also investigating techniques for performant SDR<->HDR content reproduction and for reducing latency and power consumption when using the color blocks of the hardware.

Color Management/HDR

Color Management and HDR continued to be the hottest topic of the hackfest. We had three sessions dedicated to discussing Color and HDR across the Linux Display stack layers.

Color/HDR (Kernel-Level)

Harry Wentland (AMD) led this session.

Here, kernel developers shared the Color Management pipeline of AMD, Intel and NVIDIA. We had diagrams and explanations from hardware vendors’ developers that discussed differences, constraints and paths to fit them into the KMS generic color management properties, such as advertising modeset needs, IN_FORMAT, segmented LUTs, interpolation types, etc. Developers from Qualcomm and ARM also added information regarding their hardware.

Upstream work related to this session:

Color/HDR (Compositor-Level)

Sebastian Wick (RedHat) led this session.

It started with Sebastian’s presentation covering Wayland color protocols and compositor implementation, followed by an explanation of the APIs provided by Wayland and how they can be used to achieve better color management for applications, plus discussions around ICC profiles and color representation metadata. There was also an intensive Q&A about LittleCMS with Marti Maria.

Upstream work related to this session:

Color/HDR (Use Cases and Testing)

Christopher Cameron (Google) and Melissa Wen (Igalia) led this session.

In contrast to the other sessions, here we focused less on implementation and more on brainstorming and reflections on real-world SDR and HDR transformations (use and validation) and gainmaps. Christopher gave a nice presentation explaining HDR gainmap images and how we should think of HDR. This presentation and Q&A were important to put participants on the same page about how to transition between SDR and HDR and somehow “emulate” HDR.

We also discussed the usage of a kernel background color property.

Finally, we discussed a bit about Chamelium and the future of VKMS (future work and maintainership).

Power Savings vs Color/Latency

Mario Limonciello (AMD) led this session.

Mario gave an introductory presentation about AMD ABM (adaptive backlight management), which is similar to Intel DPST. After some discussions, we agreed on exposing a kernel property for a power saving policy. This work was already merged into the kernel and the userspace support is under development.

Upstream work related to this session:

Strategy for video and gaming use-cases

Leo Li (AMD) led this session.

Miguel Casas (Google) started this session with a presentation of Overlays in Chrome/OS Video, explaining the main goal of power saving by switching off GPU for accelerated compositing and the challenges of different colorspace/HDR for video on Linux.

Then Leo Li presented different strategies for video and gaming and we discussed the userspace need of more detailed feedback mechanisms to understand failures when offloading. Also, creating a debugFS interface came up as a tool for debugging and analysis.

Real-time scheduling and async KMS API

Xaver Hugl (KDE/BlueSystems) led this session.

Compositor developers have exposed some issues with doing real-time scheduling and async page flips. One is that the Kernel limits the lifetime of realtime threads and if a modeset takes too long, the thread will be killed and thus the compositor as well. Also, simple page flips take longer than expected and drivers should optimize them.

Another issue is the lack of feedback to compositors about hardware programming time and commit deadlines (the latest possible time to commit). This is difficult to predict from drivers, since it varies greatly with the type of properties. For example, color management updates take much longer.

In this regard, we discussed implementing a hw_done callback to timestamp when the hardware programming of the last atomic commit is complete, as well as an API to pre-program the color pipeline in a kind of A/B scheme. It may not be supported by all drivers, but might be useful in different ways.

VRR/Frame Limit, Display Mux, Display Control, and more… and beer

We also had sessions to discuss a new KMS API to mitigate headaches around VRR and frame limits, such as different brightness levels at different refresh rates, abrupt changes of refresh rates, low frame rate compensation (LFC), precise timing in VRR, and more.

On Display Control we discussed features missing in the current KMS interface for HDR mode, atomic backlight settings, source-based tone mapping, etc. We also discussed the need for a place where compositor developers can post TODOs to be developed by KMS people.

The Content-adaptive Scaling and Sharpening session focused on sharpening and scaling filters. In the Display Mux session, we discussed proposals to expose the capability of dynamic mux switching display signal between discrete and integrated GPUs.

In the last session of the 2024 Display Next Hackfest, participants representing different compositors summarized current and future work and built a Linux Display “wish list”, which includes: improvements to VTTY and HDR switching, better dmabuf API for multi-GPU support, definition of tone mapping, blending and scaling semantics, and Wayland protocols for advertising to clients which colorspaces are supported.

We closed this session with a status update on feature development by compositors, including but not limited to: plane offloading (from libcamera to output) / HDR video offloading (dma-heaps) / plane-based scrolling for web pages, color management / HDR / ICC profiles support, addressing issues such as flickering when color primaries don’t match, etc.

After three days of intensive discussions, all in-person participants went on a guided tour of the Museum of Estrella Galicia Beer (MEGA), pouring and tasting the most famous local beer.

Feedback and Future Directions

Participants provided valuable feedback on the hackfest, including suggestions for future improvements.

  • Schedule and Break-time Setup: Having a pre-defined agenda and schedule provided a better balance between long discussions and mental refreshments, preventing the fatigue caused by endless discussions.
  • Action Points: Some participants recommended explicitly asking for action points at the end of each session and assigning people to follow-up tasks.
  • Remote Participation: Remote attendees appreciated the inclusive setup and opportunities to actively participate in discussions.
  • Technical Challenges: There were bandwidth and video streaming issues during some sessions due to the large number of participants.

Thank you for joining the 2024 Display Next Hackfest

We can’t help but thank the 40 participants, who engaged in person or virtually in relevant discussions, for a collaborative evolution of the Linux display stack and for building an insightful agenda.

A big thank you to the leaders and presenters of the nine sessions: Christopher Cameron (Google), Harry Wentland (AMD), Leo Li (AMD), Mario Limonciello (AMD), Sebastian Wick (RedHat) and Xaver Hugl (KDE/BlueSystems) for the effort in preparing the sessions, explaining the topics and guiding the discussions. My acknowledgment to the other in-person participants who made such an effort to travel to A Coruña: Alex Goins (NVIDIA), David Turner (Raspberry Pi), Georges Stavracas (Igalia), Joan Torres (SUSE), Liviu Dudau (Arm), Louis Chauvet (Bootlin), Robert Mader (Collabora), Tian Mengge (GravityXR), Victor Jaquez (Igalia) and Victoria Brekenfeld (System76). It was an awesome opportunity to meet you and chat face-to-face.

Finally, thanks to the virtual participants who couldn’t make it in person but organized their days to actively participate in each discussion, adding different perspectives and valuable inputs even remotely: Abhinav Kumar (Qualcomm), Chaitanya Borah (Intel), Christopher Braga (Qualcomm), Dor Askayo, Jiri Koten (RedHat), Jonas Ådahl (Red Hat), Leandro Ribeiro (Collabora), Marti Maria (Little CMS), Marijn Suijten, Mario Kleiner, Martin Stransky (Red Hat), Michel Dänzer (Red Hat), Miguel Casas-Sanchez (Google), Mitulkumar Golani (Intel), Naveen Kumar (Intel), Niels De Graef (Red Hat), Pekka Paalanen (Collabora), Pichika Uday Kiran (AMD), Shashank Sharma (AMD), Sriharsha PV (AMD), Simon Ser, Uma Shankar (Intel) and Vikas Korjani (AMD).

We look forward to another successful Display Next hackfest, continuing to drive innovation and improvement in the Linux display ecosystem!

September 25, 2024 01:50 PM

Paulo Matos

FEX x87 Stack Optimization

In this blog post, I will describe my recent work on improving and optimizing the handling of x87 code in FEX. This work landed on July 22, 2024, through commit https://github.com/FEX-Emu/FEX/commit/a1378f94ce8e7843d5a3cd27bc72847973f8e7ec (tests and updates to size benchmarks landed separately).

FEX is an emulator that runs x86/x86-64 (Intel) binaries on ARM64 and, as such, needs to be able to handle things as old as x87, which was the way we did floating point pre-SSE.

Consider a function to calculate the well-known Pythagorean Identity:

float PytIdent(float x) {
    return sinf(x)*sinf(x) + cosf(x)*cosf(x);
}

Here is how this can be translated into assembly:

  sub rsp, 4
  movss [rsp], xmm0
  fld [rsp]
  fsincos
  fmul st0, st0
  fxch
  fmul st0, st0
  faddp
  fstp [rsp]
  movss xmm0, [rsp]

These instructions will take x from register xmm0 and return the result of the identity (the number 1, as we expect) to the same register.

A brief visit to the x87 FPU

The x87 FPU consists of 8 registers (R0 to R7), which are aliased to the MMX registers - mm0 to mm7. The FPU uses them as a stack and tracks the top of the stack through the TOS (Top of Stack) or TOP (my preferred way to refer to it), which is kept in the status word. To understand how it all fits together, I will describe the three most important components of this setup for the purpose of this post:

  • FPU Stack-based operation
  • Status register
  • Tag register

From the Intel spec, these components look like this (image labels are from the Intel spec in case you need to refer to it):

Patch Details

The 8 registers are labelled as ST0-ST7 relative to the position of TOP. In the case above TOP is 3; therefore, the 3rd register is ST0. Since the stack grows down, ST1 is the 4th register and so on until it wraps around. When the FPU is initialized TOP is set to 0. Once a value is loaded, the TOP is decreased, its value becomes 7 and the first value added to the stack is in the 7th register. Because the MMX registers are aliased to the bottom 64 bits of the stack registers, we can also refer to them as MM0 to MM7. This is not strictly correct because the MMX registers are 64 bits and the FPU registers are 80 bits but if we squint, they are basically the same.

The spec provides an example of how the FPU stack works. I copy the image here verbatim for your understanding. Further information can be found in the spec itself.

Patch Details

The TOP value is stored along with other flags in the FPU Status Word, which we won't discuss in detail here.

Patch Details

And the tag word marks which registers are valid or empty. The Intel spec also allows marking the registers as zero or special. At the moment, tags in FEX are binary, marking the register as valid or invalid. Instead of two bits per register, we only use one bit of information per register.

Patch Details
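
Putting TOP, the ST(i) aliasing and the tags together, a tiny illustrative C++ model (my own sketch, not FEX code) behaves like this:

#include <array>
#include <cstdint>
#include <cstdio>

// Illustrative model of the x87 register stack: 8 physical registers, a
// 3-bit TOP index and a FEX-style 1-bit valid/invalid tag per register.
struct X87Stack {
    std::array<long double, 8> regs{};
    std::array<bool, 8> valid{};
    std::uint8_t top = 0;                  // TOP starts at 0 after FPU init

    // ST(i) is the physical register (TOP + i) mod 8.
    std::uint8_t phys(std::uint8_t st) const { return (top + st) & 7; }

    void push(long double v) {             // e.g. FLD: decrement TOP, then store
        top = (top - 1) & 7;
        regs[top] = v;
        valid[top] = true;
    }
    long double st(std::uint8_t i) const { return regs[phys(i)]; }
    void pop() {                           // mark old ST0 empty, increment TOP
        valid[top] = false;
        top = (top + 1) & 7;
    }
};

int main() {
    X87Stack s;
    s.push(1.0L);                          // first FLD lands in R7 (TOP wraps 0 -> 7)
    s.push(2.0L);                          // second FLD lands in R6
    std::printf("TOP=%u ST0=%Lf ST1=%Lf\n", (unsigned)s.top, s.st(0), s.st(1));
}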

Current issues

The way things worked in FEX before this patch landed is that stack-based operations would essentially always be lowered into three operations:

  • Load stack operation argument register values from special memory location;
  • Call soft-float library to perform the operation;
  • Store result in special memory location;

For example, let's consider the instruction FADDP st4, st0. This instruction performs st4 <- f80add(st4, st0) && pop(). In x87.cpp, the lowering of this instruction involved the following steps:

  • Load current value of top from memory;
  • With the current value of top, we can load the values of the registers in the instruction, in this case st4 and st0 since they are at a memory offset from top;
  • Then we call the soft-float library function to perform 80-bit adds;
  • Finally we store the result into st4 (which is a location in memory);
  • And pop the stack, which means marking the current top value invalid and incrementing top;

In code with a lot of x87 usage, these operations become cumbersome and the data movement makes the whole code extremely slow. Therefore, we redesigned the x87 system to cope better with stack operations. FEX supports a reduced-precision mode that uses 64-bit floating point rather than 80 bits, avoiding the soft-float library calls, and this work applies equally to code running in reduced-precision mode.

The new pass

The main observation behind the new pass is that when compiling a block where multiple instructions are x87 instructions, we already have an idea before code generation of what the stack will look like, and if we have complete stack information we can generate much better code than what is generated on a per-instruction basis.

Instead of lowering each x87 instruction in x87.cpp to its components as discussed earlier, we simply generate stack operation IR instructions (all of these added new since there were no stack operations in the IR). Then an optimization pass goes through the IR on a per block basis and optimizes the operations and lowers them.
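
Before the step-by-step walkthrough, here is a highly simplified C++ sketch of the idea (made-up types, not FEX's actual data structures): the pass mirrors each stack IR operation on a per-block virtual stack of IR nodes, and bails out to the slow path when it needs a value the block never produced.

#include <cstddef>
#include <optional>
#include <vector>

using IRNode = int;  // stand-in for a reference to an IR node

// Per-block virtual stack: each slot holds the IR node that currently
// produces that stack value. ST(0) is the most recent push.
class VirtualStack {
public:
    void push(IRNode n) { slots_.push_back(n); }

    std::optional<IRNode> st(std::size_t i) const {
        if (i >= slots_.size()) return std::nullopt;
        return slots_[slots_.size() - 1 - i];
    }

    bool pop() {
        if (slots_.empty()) return false;
        slots_.pop_back();
        return true;
    }

private:
    std::vector<IRNode> slots_;
};

// Handling something like "F80MulStack #0x0, #0x0" (square ST0) with full
// stack knowledge; returns false when we must fall back to the slow path.
bool handleMulStack(VirtualStack& stack, IRNode resultNode) {
    if (!stack.st(0)) return false;   // value not produced in this block
    stack.pop();                      // replace ST0 with the multiplication result
    stack.push(resultNode);
    return true;
}

int main() {
    VirtualStack stack;
    stack.push(10);                          // e.g. the node pushed by PushStack
    bool fast = handleMulStack(stack, 15);   // ST0 known: stays on the fast path
    VirtualStack other;                      // a later block starts with no info
    bool slow = !handleMulStack(other, 19);  // bails out: slow path for the block
    return (fast && slow) ? 0 : 1;
}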

Here’s a step-by-step explanation of how this scheme works for the code above. When the above code gets to the OpcodeDispatcher, there’ll be one function for a reasonable number of opcodes, but generally one per functionality, meaning that although there are several opcodes for different versions of fadd, there is a single function in the OpcodeDispatcher for x87 (in x87.cpp) implementing it.

Each of these functions will just transform the incoming instruction into an IR node that deals with the stack. These stack IR operations have Stack in their names. See IR.json for a list of these.

x86 Asm             IR nodes
fld qword [rsp]     LoadMem + PushStack
fsincos             F80SinCosStack
fmul st0, st0       F80MulStack
fxch                F80StackXchange
fmul st0, st0       F80MulStack
faddp               F80AddStack + PopStackDestroy
fstp qword [rsp]    StoreStackMemory + PopStackDestroy

The IR will then look like:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1
	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
	(%12 i0) F80SINCOSStack
	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa
	(%15 i0) F80MulStack #0x0, #0x0
	(%16 i0) F80StackXchange #0x1
	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9
	(%19 i0) F80MulStack #0x0, #0x0
	(%20 i0) F80AddStack #0x1, #0x0
	(%21 i0) PopStackDestroy
	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

The Good Case

Remember that before this pass, we would never have generated this IR as is. There were no stack operations, so each of these F80MulStack operations would generate loads from the MMX registers that are mapped into memory for each of the operands, a call to the F80Mul soft-float library function, and then a write to the memory-mapped MMX register to store the result.

However, once we get to the optimization pass and see a straight-line block like this, we can step through the IR nodes and create a virtual stack with pointers to the IR nodes. I have sketched the stacks for the first few steps of the pass.

The pass will only process x87 instructions; these are the ones whose IR nodes are marked with "X87": true in IR.json. Therefore the first two instructions above:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1

are ignored. But PushStack is not. PushStack will literally push the node into the virtual stack.

	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
Patch Details

Internally before we push the node into our internal stack, we update the necessary values like the top of the stack. It starts at zero and it's decremented to seven, so the value %10 i128 is inserted in what we'll call R7, now ST0.

Next up is the computation of SIN and COS. This will result in SIN ST0 replacing ST0 and then pushing COS ST0, where the value in ST0 here is %10.

	(%12 i0) F80SINCOSStack
Patch Details

Then we have two instructions that we just don't deal with in the pass; these set the C2 flag to zero, as required by FSINCOS.

	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa

This is followed by a multiplication of the stack element by itself to square it. So we square ST0.

	(%15 i0) F80MulStack #0x0, #0x0
Patch Details

The next instruction swaps ST0 with ST1 so that we can square the sine value.

	(%16 i0) F80StackXchange #0x1
Patch Details

Again, similarly two instructions that are ignored by the pass, which set flag C1 to zero - as required by FXCH.

	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9

And again we square the value at the top of the stack.

	(%19 i0) F80MulStack #0x0, #0x0
Patch Details

We are almost there - to finish off the computation we need to add these two values together.

	(%20 i0) F80AddStack #0x1, #0x0

In this case we add ST1 to ST0 and store the result in ST1.

Patch Details

This is followed by a pop of the stack.

	(%21 i0) PopStackDestroy

As expected this removed the F80Mul from the top of the stack, leaving us with the result of the computation at the top.

Patch Details

OK - there are three more instructions.

	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

The first two store the top of the stack to memory, which is the result of our computation and which maths assures us it's 1. And then we pop the stack just to clean it up, although this is not strictly necessary since we are finished anyway.

Note that we have finished analysing the block and played it through in our virtual stack. Now we can generate the instructions that are necessary to calculate the stack values, without always loading/storing them back to memory. In addition to generating these instructions, we also generate instructions to update the tag word, update the top-of-stack value and save whatever values remain in the stack at the end of the block into memory - into their respective memory-mapped MMX registers. This is the good case, where all the values necessary for the computation were found in the block. However, this is not always the case.

The Bad Case

Let's say that FEX for some reason breaks the beautiful block we had before into two:

Block 1:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1
	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
	(%12 i0) F80SINCOSStack
	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa
	(%15 i0) F80MulStack #0x0, #0x0
	(%16 i0) F80StackXchange #0x1
	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9

Block 2:

	(%19 i0) F80MulStack #0x0, #0x0
	(%20 i0) F80AddStack #0x1, #0x0
	(%21 i0) PopStackDestroy
	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

When we generate code for the first block, there are absolutely no blockers and we run through it exactly as we did before, except that after the StoreContext we reach the end of the block with two values still tracked in our virtual stack: the sine in ST0 and the squared cosine in ST1.

At this point, we do what we said we always do at the end of a block. We save the values in the virtual stack to their respective memory-mapped MMX registers: %36 is saved to MM7 and %34 is saved to MM6. TOP is recorded as 6 and the tags for MM7 and MM6 are marked valid. Then we exit the block and start analyzing the following one.

When we start the pass for the following block, our virtual stack is empty. We have no knowledge of virtual stacks for previous JITTed blocks, and we see this instruction:

	(%19 i0) F80MulStack #0x0, #0x0

We cannot play through this instruction in our virtual stack because it multiplies the value in ST0 with itself and our virtual stack does not have anything in ST0. It's empty after all.

In this case, we switch to the slow path, which does exactly what we used to do before this work. We find the current value of TOP, load the value of ST0 from the respective memory-mapped register, issue the F80Mul on this value, and store the result back to the MMX register. This is then done for the remainder of the block.

The Ugly Case

The ugly case is when we actually force the slow path, not because we don't have all the values in the virtual stack but out of necessity.

There are a few instructions that trigger a _StackForceSlow() and force us down the slow path: FLDENV, FRSTOR, and FLDCW. All of these load some part or all of the FPU environment from memory, and they trigger a switch to the slow path so that the values are loaded properly and we can then use those new values in the block. It is not impossible to think that we could, at least in some cases, load those values from memory, rebuild our virtual stack for the current block, and continue without switching to the slow path, but that hasn't been done yet.

There are a few other instructions that trigger a _SyncStackToSlow(), which doesn't force us down the slow path but instead just updates the in-memory data from the virtual stack state: FNSTENV, FNSAVE, and FNSTSW. All of these store some part or all of the FPU environment into memory, and the sync ensures that the in-memory values are up to date so that the store doesn't use stale values.

Results

In addition to the huge savings in data movement from the loads and stores of register values, TOP, and tags for each stack instruction, we also optimize memory copies through the stack. So if there's a bunch of values loaded onto the stack which are then stored to memory, we can just memcpy these values without going through the stack.

In our code size validation benchmarks from various Steam games, we saw significant reductions in code size. These code size reductions came with a similarly large jump in FPS numbers. For example:

Game                                     Before    After    Reduction
Half-Life (Block 1)                        2034     1368        32.7%
Half-Life (Block 2)                         838      551        34.2%
Oblivion (Block 1)                        34065    25782        24.3%
Oblivion (Block 2)                        22089    16635        24.6%
Psychonauts (Block 1)                     20545    16357        20.3%
Psychonauts Hot Loop (Matrix Swizzle)      2340      165        92.9%

Future work

The bulk of this work is now complete, however there are a few details that still need fixing. There's still the mystery of why Touhou: Luna Nights works with reduced precision but not normal precision (if anything you would expect it to be the other way around). Then there's also some work to be done on the interface between the FPU state and the MMX state as described in "MMX/x87 interaction is subtly broken".

This is something I am already focusing on and will continue working on until it is complete.

by Paulo Matos at September 25, 2024 12:00 AM

September 23, 2024

Eric Meyer

Announcing BCD Watch

One of the things I think we all struggle with is keeping up to date with changes in web development.  You might hear about a super cool new CSS feature or JavaScript API, but it’s never supported by all the browsers when you hear about it, right?  So you think “I’ll have to make sure to check in on that again later” and quickly forget about it.  Then some time down the road you hear about it again, talked about like it’s been best practice for years.

To help address this, Brian Kardell and I have built a service called BCD Watch, with a nicely sleek design by Stephanie Stimac.  It’s free for all to use thanks to the generous support of Igalia in terms of our time and hosting the service.

What BCD Watch does is, it grabs releases of the Browser Compatibility Data (BCD) repository that underpins the support tables on MDN and services like caniuse.com.  It then analyzes what’s changed since the previous release.

Every Monday, BCD Watch produces two reports.  The Weekly Changes Report lists all the changes to BCD that happened in the previous week — what’s been added, removed, or renamed in the whole of BCD.  It also tells you which of the Big Three browsers newly support (or dropped support for) each listed feature, along with a progress bar showing how close the feature is to attaining Baseline status.

The Weekly Baselines Report is essentially a filter of the first report: instead of all the changes, it lists only changes to Baseline status, noting which features are newly Baseline.  Some weeks, it will have nothing to report. Other weeks, it will list everything that’s reached Baseline’s “Newly Available” tier.

Both reports are available as standalone RSS, Atom, and JSON feeds, which are linked at the bottom of each report.  So while you can drop in on the site every week to bask in the visual design if you want (and that’s fine!), you can also get a post or two in your feed reader every Monday that will get you up to date on what’s been happening in the world of web development.

If you want to look back at older reports, the home page has a details/summary collapsed list of weekly reports going back to the beginning of 2022, which we generated by downloading all the BCD releases back that far, and running the report script against them.

If you encounter any problems with BCD Watch or have suggestions for improvements, please feel free to open an issue in the repository, or submit suggested changes via pull request if you like.  We do expect the service to evolve over time, perhaps adding a report for things that have hit Baseline Widely Available status (30 months after hitting all three engines) or reports that look at more than just the Big Three engines.  Hard to say!  Always in motion, the future is.

Whatever we may add, though, we’ll keep BCD Watch centered on the idea of keeping you better up to date on web dev changes, once a week, every week.  We really hope this is useful and interesting for you!  We’ve definitely appreciated having the weekly updates as we built and tested this, and we think a lot of you will, too.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at September 23, 2024 03:54 PM

September 13, 2024

José Dapena

Maintaining Chromium downstream: update strategies

This is the second of a series of blog posts I am publishing for sharing some considerations about the challenges of maintaining a downstream of Chromium.

The first part, Maintaining downstreams of Chromium: why downstreams?, provided initial definitions, and an analysis of why someone would need to maintain a downstream.

In this post I will focus on the possible different strategies for tracking the Chromium upstream project changes, and integrating the downstream changes.

Applying changes

The first problem is related to the fact that our downstream will include additional changes on top of upstream (otherwise we would not need a downstream, right?).

There are two different approaches for this, based on the Git strategy used: rebase vs. merge.

Merge strategy

This consists of maintaining an upstream branch and periodically merging its changes into the downstream branch.

In the case of Chromium, the size of the repository and the amount and frequency of changes are really big, so the chances that merging causes conflicts are higher than in other, smaller projects.

Rebase strategy

This consists of maintaining the downstream changes as a series of patches that are applied on top of an upstream baseline.

Updating means changing the baseline and reapplying all the patches. When this is done, not all patches will cause conflicts, and for the ones that do, the complexity is far smaller.
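
As a rough illustration of the two approaches, using example remote, branch, and tag names (your setup will differ):

# Merge strategy: periodically merge upstream into the downstream branch.
git fetch upstream
git checkout downstream
git merge upstream/main      # all new upstream changes (and conflicts) arrive in one step

# Rebase strategy: replay the downstream patch series onto a new baseline.
git fetch upstream --tags
git rebase --onto <new-upstream-tag> <old-upstream-tag> downstream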

Tip: keep the downstream as small as possible

Don’t hesitate to remove features that are more prone to cause conflicts. And, at the very least, think carefully, for each change, whether it can be done in a way that better matches the upstream codebase.

And, if some of your changes are good for the commons, just contribute them to upstream!

When to update?

A critical aspect of applying upstream changes is how often to do that.

The size and structure of the team involved are highly dependent on this decision. And so, of course, is resource planning.

Usually this needs to be as predictable as possible, and bound to the upstream release cycle.

Some examples used in downstream projects:

  • Weekly or daily tracking.
  • Major releases.

Upstream release policy and rebases

It is not only important how often you track upstream repository, but also what you track.

Chromium, nowadays, follows this procedure for each major release (first number in the version):

  • Main development happens in the main branch. Several releases are tagged daily. And, roughly every week, one of them is released to the dev channel for users to try (with additional testing).
  • At a point, a stabilization branch is created, entering the beta stage. The quality bar for what is landed in this branch is raised. Only stabilization work is done. More releases per day are tagged, and again, approximately once per week one is released to the beta channel.
  • When planned, the branch becomes stable. This means it is ready for wide user adoption, so the stable channel will pick its releases from this branch.

That means main (development branch) targets version N, then N-1 branch is the beta (or stabilization branch), and N-2 is the stable branch. Nowadays Chromium targets publishing a new major stable version every four weeks (and that also means a release spends 4 weeks in beta channel).

You can read the official release cycle documentation if you want to know more, including the extended stable (or long term support) channel.

Some downstreams will track the upstream main branch. Others will only start applying the downstream changes when the first release lands in the stable channel. And some may start when main is branched and stabilization work begins.

Tip: update often

The more you update, the smaller the change set you need to consider. And that means reducing the complexity of the changes.

This is especially important when merging instead of rebasing, as applying the new changes is done in one step.

This could be hard, though, depending on the size of the downstream. And, from time to time some refactors upstream will still imply non-trivial work to adapt the downstream changes.

Tip: update NOT so often

OK, OK. I just told the opposite, right?

Now, think of a complex change upstream that implies many intermediate steps, with a moving architecture that evolves as it is developed (something not infrequent in Chromium). Updating often means you will need to adapt your downstream to all the intermediate steps.

That definitely could mean more work.

A nice strategy for matching upstream stable releases

If you want to publish a stable release almost the same day or week as the official upstream stable release, and your downstream changes are not trivial, then a good moment to start applying changes is when the stabilization branch is created.

But the first days there is a lot of stabilization work happening, as the quality bar is raised. So… waiting a few days after the beta branch happens could be a good idea.

An idea: just wait for the second version published in the beta channel (the first one happens right after branching). That should still give three full weeks before the version is promoted to the stable channel.

Tracking main branch: automate everything!

In case you want to follow the upstream main branch on a daily basis (or even per commit), it is just not practical to do that manually.

The solution for that is automation, at different levels:

  • Applying the changes.
  • Resolving the conflicts.
  • And, more important, testing the result is working.

Good automation is, though, expensive. It requires computing power for building Chromium often and running tests. But increasing test coverage will detect issues earlier, and give an extra margin for resolving them.

In any case, no matter what your update strategy is, automation will always be helpful.

Next

This is the end of the post. So, what comes next?

In the next post in this series, I will talk about the downstream size problem, and different approaches for keeping it under control.

by José Dapena Paz at September 13, 2024 06:57 AM

September 11, 2024

Igalia Compilers Team

Summary of the July-August 2024 TC39 plenary

The TC39 committee met again at the end of July, this time remotely, to discuss updates to various proposals at different stages of the standardization process. Let's go together through some of the most significant updates!

You can also read the full agenda and the meeting minutes on GitHub.

Day 1 #

Test262 Updates #

On the compilers team at Igalia we have several reviewers for Test262, which is the conformance test suite that ensures all JS implementations conform to the ECMA-262 specification. As of this TC39 meeting, RegExp.escape, Atomics.pause, Math.sumPrecise, Base64, and source phase imports now have full or almost-full test coverage. Additionally, our team is working towards the goal of having a testing plan for each new proposal, to make it easier for people to write tests.

Intl.DurationFormat igalia logo #

DurationFormat continues its approach to Stage 4 with a fix to an edge case involving the formatting of certain negative durations when using the "numeric" or "two-digit" styles. Previously if a zero-valued "numeric" style unit was the first formatted unit in a negative duration, the negative sign would get dropped.
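
As a hypothetical illustration of the kind of case that was fixed (the exact output depends on the locale and options):

const df = new Intl.DurationFormat("en", { style: "digital" });

// A negative duration whose first formatted unit (hours) is zero-valued and "numeric".
df.format({ hours: 0, minutes: -5, seconds: -10 });
// Expected something like "-0:05:10"; before the fix the leading "-" was dropped.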

Drop assert from import attributes igalia logo #

The import attributes proposal was originally called "import assertions" and had the following syntax:

import data from "./data.json" assert { type: "json" }

The semantics of this example is that "./data.json" is fetched/interpreted ignoring the assertions (for example, web browsers would have known to parse the file as JSON because the server would use the application/json MIME type). Then the assertions were applied, potentially throwing an error.

The proposal had then been updated from "assertions" to "attributes", and the keyword was changed to reflect the new semantics:

import data from "./data.json" with { type: "json" }

Attributes can affect how a module is loaded and interpreted, rather than just causing an error if it's the "wrong" module. Web browsers can send the proper HTTP headers to servers indicating that they expect a JSON response, and other non-HTTP-based platforms can use attributes to decide how they want to parse the file.

Chrome, Node.js and Deno shipped the assert keyword a while ago, so TC39 was not sure that removing it would be web-compatible. Chrome successfully unshipped support for assert in Chrome 126, released in June, so now the proposal has been updated to only support with.

AsyncContext Update igalia logo #

The AsyncContext proposal allows defining variables that are implicitly propagated following the "async flow" of your code:

const clickX = new AsyncContext.Variable();

document.addEventListener("click", e => {
  clickX.run(e.clientX, async () => {
    await timeout("30 seconds");
    await doSomething();
  });
});

async function doSomething() {
  console.log("Doing something due to the click at x=" + clickX.get());
}

In the above example, doSomething will log 30 seconds after each click, logging the x coordinate of the mouse corresponding to the click that scheduled that doSomething call.

The proposal, currently at stage 2, is proceeding steadily and the champions are currently figuring out how it should be integrated with other web APIs.

Deferred imports igalia logo #

The import defer proposal allows importing modules while deferring their execution until when it's needed, to reduce their impact on application startup time:

import defer * as mod from "./dep.js";

$button.addEventListener("click", () => {
  console.log(mod.number); // ./dep.js is only evaluated at this point, when we need it.
});

The main difference between using import defer and dynamic import() is that the former can be used synchronously (without having to introduce promises or async/await), at the expense of only deferring code execution and not code loading.

During this meeting the proposal reached Stage 2.7, which means that it's almost ready to be implemented (it's only missing test262 tests). The committee confirmed the semantics around top-level await: given that modules using top-level await cannot be executed synchronously on demand, they will be executed at startup even if they are part of a set of modules pulled in by an import defer declaration.
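
For example (a hypothetical module graph):

// dep.js — uses top-level await
export const number = await fetch("./number.txt").then(r => r.text());

// main.js
import defer * as mod from "./dep.js";
// Because dep.js uses top-level await, it cannot be evaluated lazily at the first
// access to mod.number, so it is evaluated eagerly during startup instead.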

Day 2 #

Intl Locale Info #

Frank Yung-Fong Tang of Google provided a minor update to the stage 3 Intl.LocaleInfo proposal, and received consensus for a PR fixing a minor bug with the handling of requests for the week of day and clarifying the result of using the "-u-ca-iso8601" Unicode locale extension. He also gave an update on the implementation status of Intl.LocaleInfo, noting that it is currently implemented in Chrome and Safari, with a Mozilla implementation pending.

Temporal Update igalia logo #

Our colleague Philip Chimento presented another minor update on Temporal. There were two bug fixes to approve: one to correctly handle a corner case in the time zone database where the day started at half past midnight on March 31, 1919 in Toronto due to a time zone shift, and another to disallow a particular combination of options in Temporal.Duration.prototype.round() where there were multiple interpretations possible for what the outcome should be.

Temporal is nearing completion; JS engines continue to work on implementing it. Firefox is the closest to having a complete implementation. There is a checklist that you can follow if you are interested in further updates on the TC39 side.

Fingerprinting discussion igalia logo #

Ben Allen led a fruitful discussion on the potential for fingerprinting risk should browser implementations allow users to dynamically update locale-related information, either through adding new locales beyond the ones shipped with the browser or adding additional locale-related components, such as additional scripts or additional calendars. Including this feature would greatly improve the usability of the Web for users who speak minority languages -- but letting hosts know whether users have those non-base locales installed could result in those users having their privacy compromised in a way that in some cases could be genuinely dangerous for them.

Although no browsers currently allow for dynamic locale-related updates, it is a feature under consideration among multiple browser-makers. The conversation concluded with the decision for the relevant privacy teams to consider the matter and return with feedback.

RegExp.escape for Stage 3 #

RegExp.escape is a new utility for escaping strings so that they can be safely used in regular expressions.

Consider the case where you have a string ".js", and you want to check whether another string ends with that suffix using a regular expression. You might be tempted to write something like this:

const suffix = ".js";
const matches = new RegExp(`${suffix}$`).test(filename);

However, that code has a problem: . has a special meaning in regular expressions, so matches would also be true for something.notjs! To fix this, you would need to escape the suffix:

const suffix = ".js";
const matches = new RegExp(`${RegExp.escape(suffix)}$`).test(filename);

Thanks to the proposal's champion Jordan Harband the committee approved RegExp.escape to advance to Stage 3, which means that browsers can now implement and ship it.

Day 3 #

Decimal igalia logo #

Decimal champion Jesse Alama gave an update about the current status of the proposal. There were minor changes to the API; the main update was to make the spec text considerably sharper, thanks to close cooperation with new co-author Waldemar Horwat. Jesse asked for decimal to advance to stage 2 in the TC39 process, but it did not, because of the argument that decimal, in its current form, is too close to just being a library. In the discussion, some concerns were raised that if we were to go ahead with the current Decimal128-based data model, in which trailing zeroes are preserved, we would close the door to a potential future where decimal exists in JavaScript as a primitive type. Our work on decimal continues -- stay tuned!

by compilers at September 11, 2024 12:00 AM

September 10, 2024

Enrique Ocaña

Don’t shoot yourself in the foot with the C++ move constructor

Move semantics can be very useful to transfer ownership of resources, but as with many other C++ features, it’s one more double-edged sword that can harm you in new and interesting ways if you don’t read the small print.

For instance, if object moving involves super and subclasses, you have to keep an extra eye on what’s actually happening. Consider the following classes A and B, where the latter inherits from the former:

#include <stdio.h>
#include <utility>

#define PF printf("%s %p\n", __PRETTY_FUNCTION__, this)

class A {
 public:
 A() { PF; }
 virtual ~A() { PF; }
 A(A&& other)
 {
  PF;
  std::swap(i, other.i);
 }

 int i = 0;
};

class B : public A {
 public:
 B() { PF; }
 virtual ~B() { PF; }
 B(B&& other)
 {
  PF;
  std::swap(i, other.i);
  std::swap(j, other.j);
 }

 int j = 0;
};

If your project is complex, it would be natural that your code involves abstractions, with part of the responsibility held by the superclass, and some other part by the subclass. Consider also that some of that code in the superclass involves move semantics, so a subclass object must be moved to become a superclass object, then perform some action, and then moved back to become the subclass again. That’s a really bad idea!

Consider this usage of the classes defined before:

int main(int, char* argv[]) {
 printf("Creating B b1\n");
 B b1;
 b1.i = 1;
 b1.j = 2;
 printf("b1.i = %d\n", b1.i);
 printf("b1.j = %d\n", b1.j);
 printf("Moving (B)b1 to (A)a. Which move constructor will be used?\n");
 A a(std::move(b1));
 printf("a.i = %d\n", a.i);
 // This may be reading memory beyond the object boundaries, which may not be
 // obvious if you think that (A)a is sort of a (B)b1 in disguise, but it's not!
 printf("(B)a.j = %d\n", reinterpret_cast<B&>(a).j);
 printf("Moving (A)a to (B)b2. Which move constructor will be used?\n");
 B b2(reinterpret_cast<B&&>(std::move(a)));
 printf("b2.i = %d\n", b2.i);
 printf("b2.j = %d\n", b2.j);
 printf("^^^ Oops!! Somebody forgot to copy the j field when creating (A)a. Oh, wait... (A)a never had a j field in the first place\n");
 printf("Destroying b2, a, b1\n");
 return 0;
}

If you’ve read the code, those printfs will have already given you some hints about the harsh truth: if you move a subclass object to become a superclass object, you’re losing all the subclass-specific data, because even if the original instance was a subclass one, only the superclass move constructor will be used. And that’s bad, very bad. This problem is called object slicing. It’s specific to C++ and can also happen with copy constructors. See it with your own eyes:

Creating B b1
A::A() 0x7ffd544ca690
B::B() 0x7ffd544ca690
b1.i = 1
b1.j = 2
Moving (B)b1 to (A)a. Which move constructor will be used?
A::A(A&&) 0x7ffd544ca6a0
a.i = 1
(B)a.j = 0
Moving (A)a to (B)b2. Which move constructor will be used?
A::A() 0x7ffd544ca6b0
B::B(B&&) 0x7ffd544ca6b0
b2.i = 1
b2.j = 0
^^^ Oops!! Somebody forgot to copy the j field when creating (A)a. Oh, wait... (A)a never had a j field in the first place
Destroying b2, a, b1
virtual B::~B() 0x7ffd544ca6b0
virtual A::~A() 0x7ffd544ca6b0
virtual A::~A() 0x7ffd544ca6a0
virtual B::~B() 0x7ffd544ca690
virtual A::~A() 0x7ffd544ca690

Why can something that seems so obvious become such a problem, you may ask? Well, it depends on the context. It’s not unusual for the codebase of a long-lived project to have started using raw pointers for everything, then switched to using references as a way to get rid of null pointer issues when possible, and finally switched to whole objects and copy/move semantics to get rid of pointer issues (references are just pointers in disguise after all, and there are ways to produce null and dangling references by mistake). But this last step of moving from references to copy/move semantics on whole objects comes with the small object slicing nuance explained in this post, and when the size of the project and all the different things you have to take into account steal your focus, it’s easy to forget about this.

So, please remember: never use move semantics that convert your precious subclass instance into a superclass instance thinking that the subclass data will survive. You may regret it and inadvertently create difficult-to-debug problems.
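
One way to turn this kind of mistake into a compile-time error (following the common guideline that polymorphic base classes should suppress public copying and moving) is to make the superclass copy/move constructors protected. A minimal sketch:

class A {
 public:
  A() = default;
  virtual ~A() = default;

 protected:
  // Only derived classes may use these, so client code can no longer
  // accidentally slice an A out of a B.
  A(const A&) = default;
  A(A&&) = default;
};

class B : public A {
 public:
  B() = default;
  B(B&&) = default;  // still fine: B may use A's protected move constructor
};

// A a(std::move(b1));  // with this design, a compile error instead of silent slicing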

Happy coding!

by eocanha at September 10, 2024 07:58 AM

September 04, 2024

Frédéric Wang

My recent contributions to Gecko (3/3)

Note: This blog post was written on June 2024. As of September 2024, final work to ship the feature is still in progress. Please follow bug 1797715 for the latest updates.

Introduction

This is the final blog post in a series about new web platform features implemented in Gecko, as part of an effort at Igalia to increase browser interoperability.

Let’s take a look at fetch priority attributes, which enable web developers to optimize resource loading by specifying the relative priority of resources to be fetched by the browser.

Fetch priority

The web.dev article on fetch priority explains in more detail how web developers can use fetch priority to optimize resource loading, but here’s a quick overview.

fetchpriority is a new attribute with the value auto (default behavior), high, or low. Setting the attribute on a script, link or img element indicates whether the corresponding resource should be loaded with normal, higher, or lower priority 1:

<head>
  <script src="high.js" fetchpriority="high"></script>
  <link rel="stylesheet" href="auto.css" fetchpriority="auto">
</head>
<body>
  <img src="low.png" alt="low" fetchpriority="low">
</body>

The priority can also be set in the RequestInit parameter of the fetch() method:

await fetch("high.txt", {priority: "high"});

The <link> element has some interesting features. One of them is combining rel=preload and as to fetch a resource with a particular destination 2:

<link rel="preload" as="font" href="high.woff2" fetchpriority="high">

You can even use Link in HTTP response headers and in particular early hints sent before the final response:

103 Early Hints
Link: <high.js>; rel=preload; as=script; fetchpriority=high

These are basically all the places where a fetch priority attribute can be used.

Note that other parameters are also taken into account when deciding the priority to use for resources, such as the position of the element in the page (e.g. blocking resources in <head>), other attributes on the element (<script async>, <script defer>, <link media>, <link rel>…) or the resource’s destination.

Finally, some browsers implement speculative HTML parsing, allowing them to continue fetching resources declared in the HTML markup while the parser is blocked. As far as I understand, Firefox has its own separate HTML parsing code for that purpose, which also has to take fetch priority attributes into account.

Implementation-defined prioritization

If you have not run away after reading the complexity described in the previous section, let’s talk a bit more about how fetch priority attributes are interpreted. The spec contains the following step when fetching a resource (emphasis mine):

If request’s internal priority is null, then use request’s priority, initiator, destination, and render-blocking in an implementation-defined manner to set request’s internal priority to an implementation-defined object.

So browsers would use the high/low/auto hints as well as the destination in order to calculate an internal priority value 3, but the details of this value are not provided in the specification, and it’s up to the browser to decide what to do. This is a bit unfortunate for our interoperability goal, but that’s probably the best we can do, given that each browser already has its own strategies to optimize resource loading. I think this also gives browsers some flexibility to experiment with optimizations… which can be hard to predict when you realize that web devs also try to adapt their content to the behavior of (the most popular) browsers!

In any case, the spec authors were kind enough to provide a note with more suggestions (emphasis mine):

The implementation-defined object could encompass stream weight and dependency for HTTP/2, priorities used in Extensible Prioritization Scheme for HTTP for transports where it applies (including HTTP/3), and equivalent information used to prioritize dispatch and processing of HTTP/1 fetches. [RFC9218]

OK, so what does that mean? I’m not a networking expert, but this is what I could gather after discussing with the Necko team and reading some HTTP specs:

  • HTTP/1 does not have a dedicated prioritization mechanism, but Firefox uses its internal priority to order requests.
  • HTTP/2 has a “stream priority” mechanism and Firefox uses its internal priority to implement that part of the spec. However, it was considered too complex and inefficient, and is likely poorly supported by existing web servers…
  • In upcoming releases, Firefox will use its internal priority to implement the Extensible Prioritization Scheme used by HTTP/2 and HTTP/3. See bug 1865040 and bug 1864392. Essentially, this means using its internal priority to adjust the urgency parameter.

Note that various parts of Firefox rely on NS_NewChannel to load resources, including the fetching algorithm above, which Firefox uses to implement the fetch() method. However, other cases mentioned in the first section have their own code paths with their own calls to NS_NewChannel, so these places must also be adjusted to take the fetch priority and destination into account.

Finishing the implementation work

Summarizing a bit, implementing fetch priority is a matter of:

  1. Adding fetchpriority to DOM objects for HTMLImageElement, HTMLLinkElement, HTMLScriptElement, and RequestInit.
  2. Parsing the fetch priority attribute into an auto/low/high enum.
  3. Passing the information to the callers of NS_NewChannel.
  4. Using that information to set the internal priority.
  5. Using that internal priority for HTTP requests.

Mirko Brodesser started this work in June 2023, and had already implemented almost all of the features discussed above. fetch(), <img>, and <link rel=preload as=image> were handled by Ziran Sun and me, while Valentin Gosu from Mozilla made HTTP requests use the internal priority.

The main blocker was due to that “implementation-defined” use of fetch priority. Mirko’s approach was to align Firefox with the behavior described in the web.dev article, which reflects Chromium’s implementation. But doing so would mean changing Firefox’s default behavior when fetchpriority is not specified (or explicitly set to auto), and it was not clear whether Chromium’s prioritization choices were the best fit for Firefox’s own implementation of resource loading.

After meeting with Mozilla, we agreed on a safer approach:

  1. Introduce runtime preferences to control how Firefox adjusts internal priorities when low, high, or auto is specified. By default, auto does not affect the internal priority so current behavior is preserved.
  2. Ask Mozilla’s performance team to run an experiment, so we can decide the best values for these preferences.
  3. Ship fetch priority with the chosen values, probably cleaning things up a bit. Any other ideas, including the ones described in the web.dev article, could be handled in future enhancements.

We recently entered phase 2 of this plan, so fingers crossed it works as expected!

Internal WPT tests

This project is part of the interoperability effort, but again, the “implementation-defined” part meant that we had very few WPT tests for that feature, really only those checking fetchpriority attributes for the DOM part.

Fortunately Mirko, who is a proponent of Test-driven development, had written quite a lot of internal WPT tests that use internal APIs to retrieve the internal priority. To test Link headers, he used the handy wptserve pipes. The only thing he missed was checking support in Early hints, but some WPT tests for early hints using WPT Python Handlers were available, so integrating them into Mirko’s tests was not too difficult.

It was also straightforward for Ziran and me to extend Mirko’s tests to cover fetch, img, and <link rel=preload as=image>, with one exception: when the fetch() method uses a non-default destination. In most of these code paths, we call NS_NewChannel to perform a fetch. But fetch() is tricky, because if the fetch event is intercepted, the event handler might call the fetch() method again using the same destination (e.g. image).

Handling this correctly involves multiple processes and IPC communication, which ended up not working well with the internal APIs used by Mirko’s tests. It took me a while to understand what was happening in bug 1881040, and in the end I came up with a new approach.

Upstreamable WPT tests

First, let’s pause for a moment: all the tests we have so far use an internal API to verify the internal priority, but they don’t actually check how that internal priority is used by Firefox when it sends HTTP requests. Valentin mentioned we should probably have some tests covering that, and not only would it solve the problem with fetch() calls in fetch event handlers, it would also remove the use of an internal API, making the tests potentially reusable by other browsers.

To make this kind of test possible, I added a WPT Python Handler that parses the urgency from an HTTP request and responds with an urgency-dependent resource, such as a stylesheet with different property values, an image of a different size, or an audio or video file of a different duration.

When a test uses resources with different fetch priorities, this influences the urgency values of their HTTP requests, which in turn influences the response in a way that the test can check for in JavaScript. This is a bit complicated, but it works!

Conclusion

Fetch priority has been enabled in Firefox Nightly for a while, and experiments started recently to determine the optimal priority adjustments. If everything goes well, we will be able to push this feature to the finish line after the (northern) summer.

Helping implement this feature also gave me the opportunity to work a bit on the Firefox networking code, which I had not touched since the collaboration with IPFS, and I learned a lot about resource loading and WPT features for HTTP requests.

To me, the “implementation-defined” part was still a bit awkward for the web platform. We had to write our own internal WPT tests and do extra effort to prepare the feature for shipping. But in the end, I believe things went relatively smoothly.

Acknowledgments

To conclude this series of blog posts, I’d also like to thank Alexander Surkov, Cathie Chen, Jihye Hong, Martin Robinson, Mirko Brodesser, Oriol Brufau, Ziran Sun, and others at Igalia who helped implement these features in Firefox. Thank you to Emilio Cobos, Olli Pettay, Valentin Gosu, Zach Hoffman, and others from the Mozilla community who helped with the implementation, reviews, tests and discussions. Finally, our spelling and grammar expert Delan Azabani deserves special thanks for reviewing this series of blog posts and providing useful feedback.

  1. Other elements have been or are being considered (e.g. <iframe>, SVG <image> or SVG <script>), but these are the only ones listed in the HTML spec at the time of writing. 

  2. As mentioned below, the browser needs to know about the actual destination in order to properly calculate the priority. 

  3. As far as I know, Firefox does not take initiator into account, nor does it support render-blocking yet. 

by Frédéric Wang at September 04, 2024 10:00 PM

August 28, 2024

Jani Hautakangas

Bringing WebKit back to Android: Progress and Perspectives

In my previous blog post, I delved into the technical aspects of reintroducing WebKit to the Android platform. It was an exciting journey, filled with the challenges and triumphs that come with working on a project as ambitious as WPE-Android. However, I realize that the technical depth of that post may have left some readers seeking more context. Today, I want to take a step back and offer a broader view of what this project is all about—why we’re doing it, how it builds on the WPEWebKit engine, and the progress we’ve made so far.

The Vision: Reviving WebKit on Android #

WebKit has a storied history in the world of web browsers, serving as the backbone for Safari, Epiphany, and many embedded browsers. However, over time, Android’s landscape has shifted toward Blink/Chromium, the engine behind Chrome. While Blink and Chromium have undoubtedly shaped the modern web, there are compelling reasons to bring WebKit back to Android.

WPE-Android is an effort to reintroduce WebKit into the Android ecosystem as a modern, efficient, and secure browser engine. Our goal is to provide developers with more options—whether they’re building full-fledged browsers, integrating web views into native apps, or exploring innovative applications in IoT and embedded systems. By leveraging WebKit’s unique strengths, we’re opening new doors for creativity and innovation on the Android platform.

Why WPEWebKit? #

At the heart of WPE-Android is WPEWebKit, a streamlined version of the WebKit engine specifically optimized for embedded systems. Unlike its desktop counterpart, WPEWebKit is designed to be lightweight, efficient, and highly adaptable to various hardware environments. This makes it an ideal foundation for bringing WebKit back to Android.

The decision to base WPE-Android on WPEWebKit is strategic. WPEWebKit is not only performant but also backed by a strong community of developers and organizations dedicated to its continuous improvement. This community-driven approach ensures that WPE-Android benefits from a robust, well-maintained codebase, with contributions from experts around the world.

Building on a Strong Foundation #

Since the inception of WPE-Android, our focus has been on making WebKit a viable option for Android developers. This involves more than just getting the engine to run on Android—it’s about ensuring that it’s stable, integrates seamlessly with Android’s unique features, and offers a developer-friendly experience.

A significant part of our work has involved optimizing the interaction between WPEWebKit and Android’s graphics stack. As part of that, we decided to focus on Android API level 30 and higher to keep the prototyping phase faster and simpler. Our efforts have aimed at achieving smooth and consistent performance, ensuring that WPE-Android can meet the needs of modern Android applications.

We are building a foundation to run instrumentation tests in CI to ensure that we don’t regress and that we get consistent results that match Android’s system WebView APIs. We continue adding more APIs that are similar to Android System WebView offerings and provide similar results.

Additionally, we’ve focused on enhancing the integration of WPE-Android with Android-specific features. This includes improving support for touch input and dialogs, refining the way web views are handled within native Android applications, and ensuring compatibility with the Android development environment. These enhancements make WPE-Android a natural fit for developers who are already familiar with the Android platform.

What’s new #

Most of the changes are under the hood improvements. The task that required the most effort was upgrading and rebasing our patches on top of Cerbero. After we upgraded to WPE WebKit 2.44.1, we required a more recent GLib version provided by the newer Cerbero version. Along with the upgrade, we managed to refactor and squash many of the patches that we had on top of Cerbero. We went from 175 patches down to 66, which will simplify the next upgrade.

Here’s a list of the most notable changes since the last update:

  • Upgraded to WPE WebKit 2.44.1.
  • Upgraded Cerbero to version 1.24.2.
  • Upgraded Android NDK to version r26d.
  • Migrated from libsoup2 to libsoup3 for HTTP/2 support.
  • Support for proper device scale ratio according to Android’s DisplayMetrics. This takes into account the screen size and pixel density, automatically adapting rendered content to show with appropriate dimensions on all devices.
  • Support for JS dialogs (Alert, Confirm, Prompt). Integrates Android dialogs with JavaScript alert(), confirm(), and prompt() prompts. Also provides an option to build custom native dialogs for these prompts.
  • Instrumentation tests for recently added features and a CI pipeline for running them.
  • API to receive HTTP errors. WPEViewClient interface onReceivedHttpError to catch HTTP error codes >= 400.
  • API to evaluate JavaScript. Provides the WPEView method evaluateJavascript to inject and evaluate JavaScript code on a loaded page.

Demos #

Dialog prompts #

The demo shows the default WPEView alert() prompt integration on the left side. On the right side, an application using WPEView has overridden the onJsAlert method from the WPEChromeClient interface and provides a custom native alert dialog for the JavaScript alert() prompt. The custom dialog is constructed using Android’s AlertDialog.Builder factory. Similar customization can be applied to JavaScript confirm() and prompt() prompts by overriding the onJsConfirm and onJsPrompt methods from the WPEChromeClient interface.

Default WPEView alert dialog
Custom WPEView alert dialog

Evaluate javascript #

The demo shows how to inject JavaScript and call functions on a loaded page from Kotlin code.

HTML and JavaScript:

<script>
  function showName(message) {
    document.getElementById('name').innerHTML = message;
  }
</script>
<center>
  <br><br><br><br>
  <h1>WPEView</h1>
  <p>Evaluate javascript</p>
  <br><br><br><br>
  <h2>What's your name?</h2>
  <h1 id="name"></h1>
</center>

Kotlin/WPEView code:

binding.toolbarButton.setOnClickListener {
  webview.evaluateJavascript("showName(\"" + binding.toolbarEditText.text + "\")", null)
  binding.toolbarEditText.setText("")
}

Device scale factor #

Android devices come with a variety of screen sizes, resolutions, and screen densities (pixels per inch, also known as ppi). In order for the UI to look consistent and good across all different devices, the device scale factor needs to be applied to the UI. Screen density can be fetched via the Android DisplayMetrics API, and in WPE WebKit, this corresponds to the device scale factor that can be set using wpe_view_backend_dispatch_set_device_scale_factor. Previously, in WPE-Android, we had hardcoded that value to 2.0, but now we are using proper metrics specific to each device.
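
For reference, the density value can be read via DisplayMetrics with something like the following Kotlin snippet (how WPE-Android then plumbs it through to wpe_view_backend_dispatch_set_device_scale_factor is internal to the port, so this is only a sketch of the Android side):

// Inside an Activity, or any code with access to a Context:
val density = resources.displayMetrics.density  // e.g. 2.75 on a Pixel 7
// This density is what WPE-Android now uses as the device scale factor,
// instead of the old hardcoded value of 2.0.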

Below are some screenshots from before and after applying the proper device scale. I’m using a Google Pixel 7 device, which has a density value of 2.75.

Old hardcoded device scale factor 2.0
Device scale factor from DisplayMetrics density

Looking Forward #

Our goal is to make WPE-Android even more accessible and usable for the broader Android development community. This involves ongoing performance optimizations, expanding device compatibility, and potentially providing more resources like documentation, example projects, and developer tools to ease the adoption of WPE-Android.

We believe that by offering WebKit as a viable option on Android, we’re contributing to a more diverse and innovative web ecosystem. WPE-Android is not just about bringing back a familiar engine—it’s about giving developers the tools they need to create fast, secure, and beautiful web experiences on Android devices.

Conclusion #

The journey of bringing WebKit back to Android has been both challenging and rewarding so far. By building on the strong foundation of WPEWebKit, we’re crafting a tool that empowers developers to push the boundaries of what’s possible with web technologies on Android. The progress we’ve made so far is just the beginning, and I’m excited to see how the project will continue to evolve.

If you’re interested in learning more or getting involved, you can find all the details on the WPE-Android GitHub page.

by Jani Hautakangas at August 28, 2024 12:00 AM

August 26, 2024

Sergio Villar

A New Way to Browse: Eye Tracking Comes to Wolvic!

We’re thrilled to share some exciting news with you. Wolvic is about to transform how you interact with the web in a VR environment with the introduction of eye tracking support! Starting with the just released v1.7.0 release on the Gecko backend and the highly anticipated v1.0 release on the Chromium backend, you’ll be able to control the browser pointer just by looking at what you want to interact with. While this feature is still being refined, it’s a fantastic start, and we can’t wait for you to try it out.

August 26, 2024 06:07 AM

August 10, 2024

Abhijeet Kandalkar

Exploring sandboxing in CEF and CEFSharp for Windows platform

Sandboxing is a critical feature for ensuring the security and stability of applications that embed web content. In this blog, we will delve into how sandboxing is achieved in the Chromium Embedded Framework (CEF). All analysis in this blog was performed on the Windows operating system.

CEFClient and CEFSharp

The base CEF framework includes support for the C and C++ programming languages. It also provides test applications to demonstrate its functionality; CEFClient, one of the C++ test applications, serves this purpose.

CEF provides CAPI bindings to enable application development across various programming languages. For integrating the CEF framework with .NET, you should explore CEFSharp. It is a C# project that leverages the CEF open source project, allowing developers to create applications in C#.

CEFSharp(C#)
            \  CAPI 
              -------> CEF(C++) ------> Chromium(C++)
            /
CEFClient(C++)

What is the sandbox?

As mentioned in documentation of sandbox in Chromium repository,

The sandbox is a C++ library that allows the creation of sandboxed processes — processes that execute within a very restrictive environment. The only resources sandboxed processes can freely use are CPU cycles and memory. For example, sandboxed processes cannot write to disk or display their own windows. What exactly they can do is controlled by an explicit policy. Chromium renderers are sandboxed processes.

Understanding Sandboxing in CEF

Since CEF uses Chromium internally, we decided to investigate how sandboxing is implemented by examining Chromium’s source code. Our findings revealed that the sandbox is designed to be versatile, with no hard dependencies on the Chromium browser itself, making it usable with other applications.

CEF prepares a static library using the existing sandbox implementation in the Chromium repository and links it to the main application. This process enables the sandbox functionality in CEF applications. For more detailed information, refer to the sandbox_win.cc file in the CEF repository.

if (is_win) {
  static_library("cef_sandbox") {
    sources = [ "libcef_dll/sandbox/sandbox_win.cc" ]
    include_dirs = [ "." ]
    deps = [ "libcef/features", "//sandbox" ]
  }
}

Implementing Sandboxing in CEFClient

CEFClient achieves sandboxing by linking to the cef_sandbox library:

  executable("cefclient") {
      deps += [
        ":cef_sandbox",
      ]

In the CEFClient application, we need to create a sandbox object and pass it to CEF, which then forwards it internally to Chromium to configure the sandbox.

  CefScopedSandboxInfo scoped_sandbox;
  sandbox_info = scoped_sandbox.sandbox_info();
  ...
  // Initialize the CEF browser process.
  if (!context->Initialize(main_args, settings, app, sandbox_info)) {
    return 1;
  }

The following call stack illustrates how CEFClient code calls into CEF, ultimately transitioning into the Chromium environment to configure the sandbox.

The CEF test application (CEFClient) successfully launches a sandboxed CEF, which is verified by loading the chrome://sandbox URL.


Why C# Applications Cannot Use Sandboxed CEF ?

As we have seen above, it is possible for C++ applications to use sandboxing, but C# applications cannot use sandboxed CEF. But why?

To achieve effective sandboxing in your CEF application, ensure that you link the cef_sandbox.lib static library. This linking must be done specifically in the application's main executable. While C++ applications can directly link static libraries without issue, C# applications do not support static linking of libraries.

A static library means that the library is going to be merged with the application. This concept doesn't exist in .NET, as .NET supports only dynamic link libraries.

If you attempt to create a DLL for the cef_sandbox code and link it dynamically to your application, the Chromium sandbox code will detect this and fail with the error SBOX_ERROR_INVALID_LINK_STATE. Therefore, static linking of the sandboxing code to the main application is mandatory for sandbox functionality.

  // Attempt to start a sandboxed process from sandbox code hosted not within
  // the main EXE. This is an unsupported operation by the sandbox.
  SBOX_ERROR_INVALID_LINK_STATE = 64,

Conclusion

While sandboxing is a powerful feature available for C++ applications using CEF, it is not feasible for C# applications due to the inherent differences in how these languages operate within the system. The managed environment of C# introduces limitations that make it impossible to integrate the same level of sandboxing as in C++. Consequently, C# applications cannot leverage sandboxed CEF using the existing setup.


Future study

To enable sandboxing in C# applications, we may need to modify the .NET runtime (which is written in C++) to support sandboxing. This would involve linking the cef_sandbox library directly to the runtime. The .NET Core runtime includes several executables/libraries that act as the main entry point for code execution, typically referred to as the “host”. These host executables can be customized as needed.

For example, when you run the apphost executable, it initializes the .NET Core runtime and starts your application. Creating a sandbox object in the apphost and passing it to the application might help address sandboxing for C# applications. However, this approach has not been tested, so its effectiveness cannot be confirmed without further exploration and modification of the C# language runtime.


August 10, 2024 06:30 PM

August 08, 2024

Gyuyoung Kim

Chrome iOS Browser on Blink

At the beginning of this year, Apple allowed third-party browser engines, such as Gecko and Blink, to be used on iOS and iPadOS. The Chromium community has been developing Chrome iOS based on Blink as an experimental project. This post provides an overview of the project and examines its progress during the first half of 2024.

Structure of Chrome iOS (w/blink)

First, let’s look at the overall structure of Chrome iOS. This simple diagram illustrates the structure of Chrome iOS.

The UI components are placed in the //ios/chrome directory, which functions similarly to the //chrome layer in other implementations. The //ios/web layer also provides APIs for its initialization, content navigation, presenting UI callbacks, saving or restoring navigation sessions and browser data, and more. The //ios/web/public defines the public APIs, while various other directories within //ios/web implement them. The implementation of these public APIs invokes WebKit APIs to provide the necessary functionalities.

For Chrome iOS based on Blink, the developers decided last year to reuse the existing iOS Chrome UI implementation. Thus, one of the main developments of Chrome iOS based on Blink happened in the //ios/web/content directory (the rectangle filled with green in the diagram), created to implement the public APIs using Blink. As you can see in the diagram below, they added the content directory to the //ios/web directory and implemented the public APIs using Blink’s content APIs. Notably, they’ve introduced a ContentWebState as a prototype implementation of WebState to replace the class in //ios/web/content/web_state. However, as shown in the diagram, the directory currently only includes five components: web_state, navigation, UI, init, and js_messaging. Therefore, more components need to be implemented for Chrome iOS on Blink.

Igalia Contributions

Igalia has been contributing to this project since it was made public by the community. We’ve worked mainly on UI-related components, e.g. the file and color choosers, the context menu, the select list popup, and integration with the Chrome iOS UI. The pictures below are screen captures of the chooser implementations that we contributed.



We also helped with the features of multimedia such as the video screen capture, hardware encoding/decoding, and audio recording.

We’ve worked on a few testing frameworks (e.g. unit tests, browser and web tests). We’ve also filtered out unsupported and failing tests. Specifically, we’ve implemented the infrastructure to run the web tests on the simulator. For your information, the coverage of the web tests on iOS Blink was about 72.2%, and the pass ratio among the working tests was 91.83% at the beginning of this year. We’ve also been maintaining the bot for Chrome iOS on Blink to keep the build and tests running, because keeping the build working and the tests passing is crucial for bringing Blink up on Chrome iOS.

Besides that, we supported remote debugging with DevTools on Blink for iOS. Developers are now able to remotely use DevTools on a host machine (e.g. a Mac) and inspect Chrome or Content Shell for development.

Moreover, we’ve worked on graphics work related to compositing and rendering. For instance, we supported Metal on ANGLE as well as fixed bugs in the graphics layers.

You can find the detailed contributions patch list here.

The major changes for H1 2024

Now, let’s review the major changes during the first half of this year. Firstly, the minimum iOS SDK version was bumped up to 17.4 which supports the BrowserEngineKit library. And the [browser/unit] tests began to run on the ios-blink bot with the 17.4 SDK.

By reusing the existing iOS Chrome UI, ContentWebState was introduced as a prototype implementation of WebState. During H1 2024, further new methods were added and previously empty methods were implemented. For example, the GetVirtualKeyboardHeight, OnVisibilityChanged, DidFinishLoad, and DidFailLoad methods were implemented.

BrowserEngineKit APIs that were announced by Apple to support third-party browser engines on iOS and iPadOS have been applied to JIT, GPU, network, and content process creation.

As mentioned above, Igalia implemented a color chooser, a file chooser, and a context menu during the period.

Lastly, the package size was reduced by removing duplicated resources.

Remaining Tasks

We’ve briefly looked at the current status of the project so far, but many functionalities still need to be supported. For example, regarding UI features, functionalities such as printing preview, download, text selection, request desktop site, zoom text, translate, find in page, and touch events are not yet implemented or are not functioning correctly
Moreover, there are numerous failing or skipped tests in unit tests, browser tests, and web tests. Ensuring that these tests are enabled and passing the test should also be a key focus moving forward.

Conclusion

The Blink-based port of Chrome iOS is a large and laborious project. Igalia has made significant contributions to it and, while it is still at an early stage, more features, infrastructure, and tools need to be ported to Blink. Nevertheless, we believe that we are on the right track to eventually replace WebKit with Blink in Chromium-related products for iOS.

by gyuyoung at August 08, 2024 02:35 AM

August 02, 2024

Pawel Lampe

Nuts and bolts of Canvas2D - globalCompositeOperation and shadows.

In recent months I’ve been privileged to work on the transition from Cairo to Skia for 2D graphics rendering in WPE and GTK WebKit ports. Big reworks like this are a great opportunity to explore all kinds of graphics-related APIs. One of the broader APIs in this area is the CanvasRenderingContext2D API from HTML Canvas. It’s a fairly straightforward yet extensive API allowing one to perform all kinds of drawing operations on the canvas. The comprehensiveness, however, comes at the expense of some complex situations the web engine needs to handle under the hood. One such situation was the issue I was working on recently regarding broken test cases involving drawing shadows when using Skia in WebKit. What makes it complex is that some problems are still visible due to multiple web engine layers being involved, but despite that I was eventually able to address the broken test cases.

In the next few sections I’m going to introduce the parts of the API that are involved in the problems while in the sections closer to the end I will gradually showcase the problems and explore potential paths toward fixing the entire situation.

Drawing on Canvas2D with globalCompositeOperation #

The Canvas2D API offers multiple methods for drawing various primitives such as rectangles, arcs, text etc. On top of that, it allows one to control compositing and clipping using the globalCompositeOperation property. The idea is very simple - the user of an API can change the property using one of the predefined compositing operations and immediately after that, all new drawing operations will behave according to the rules the particular compositing operation specifies:

canvas2DContext.fillRect(...); // Draws rect on top of existing content (default).
canvas2DContext.globalCompositeOperation = 'destination-atop';
canvas2DContext.fillRect(...); // Draws rect according to 'destination-atop'.

There are many compositing operations, but I’ll be focusing mostly on the ones having source and destination in their names. The source and destination terms refer to the new content to be drawn and the existing (already-drawn) content respectively.
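
To make this concrete, here is a minimal sketch (reusing the canvas2DContext variable from the snippet above; the colors and coordinates are illustrative assumptions) that draws a blue destination first and then composites a red source with destination-atop:

// Destination: content that already exists on the canvas.
canvas2DContext.fillStyle = 'blue';
canvas2DContext.fillRect(20, 20, 80, 80);

// Source: new content, composited according to 'destination-atop'.
canvas2DContext.globalCompositeOperation = 'destination-atop';
canvas2DContext.fillStyle = 'red';
canvas2DContext.fillRect(60, 60, 80, 80);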

The images below present some examples of compositing operations in action:

Compositing operations in action.

Drawing on Canvas2D with shadows #

When drawing primitives using the Canvas2D API one can use the shadow* properties to enable drawing of shadows along with any content that is being drawn. The usage is very simple - one has to alter at least one property, e.g. shadowOffsetX, to make the shadow visible:

canvas2DContext.shadowColor = "#0f0";
canvas2DContext.shadowOffsetX = 10;
// From now on, any draw call will have a green shadow attached.

The above, combined with simple code to draw a circle, produces the following effect:

Circle with shadow.
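
For reference, the circle-drawing code alluded to above could be as simple as the sketch below (the fill color, center coordinates, and radius are illustrative assumptions, not the exact values used for the picture):

canvas2DContext.shadowColor = "#0f0";
canvas2DContext.shadowOffsetX = 10;
canvas2DContext.fillStyle = "#00f";
canvas2DContext.beginPath();
canvas2DContext.arc(100, 75, 50, 0, 2 * Math.PI); // center and radius are arbitrary
canvas2DContext.fill(); // the filled circle gets a green shadow offset to the right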

Shadows meet globalCompositeOperation #

Things get interesting once one starts thinking about how globalCompositeOperation may affect the way shadows are drawn. When I thought about it for the first time, I imagined at least 3 possibilities:

  • Shadow and shadow origin are both treated as one entity (shadow always below the origin) and thus are drawn together.
  • Shadow and shadow origin are combined and then drawn as one entity.
  • Shadow and shadow origin are drawn separately - shadow first, then the content.

When I confronted the above with the drawing model and shadows specification, it turned out the last guess was the correct one. The specification basically says that the shadow should be computed first, then composited within the clipping region over the current canvas content, and finally, the shadow origin should be composited within the clipping region over the current canvas content (the original canvas content combined with shadow).
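
In rough pseudocode - not any engine's actual implementation - the per-draw-call sequence mandated by the specification could be sketched like this (computeShadow and composite are hypothetical helpers used only for illustration):

// Rough sketch of the spec'd order; computeShadow() and composite() are hypothetical.
function drawPrimitiveWithShadow(canvasContent, primitive, op) {
  const shadow = computeShadow(primitive);               // 1. render the shadow alone
  canvasContent = composite(canvasContent, shadow, op);  // 2. composite the shadow over the canvas
  return composite(canvasContent, primitive, op);        // 3. composite the shadow origin over the result
}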

The above can be confirmed visually using a few examples (generated using the Chromium browser v126.0.6478.126):

Shadows combined with compositing operation.

  • The source-over operation shows the drawing order - destination first, shadow second, and shadow origin third.
  • The destination-over operation shows the reversed drawing order - destination first, shadow second (below destination), and shadow origin third (below destination and shadow).
  • The source-atop operation is more tricky as it behaves like source-over but with clipping to the destination content - therefore, destination is drawn first, then clipping is set to destination, then the shadow is drawn, and finally the shadow origin is drawn.
  • The destination-atop operation is even more tricky as it behaves like destination-over yet with the clipping region always being different. That difference can be seen on the image below that presents intermediate states of canvas after each drawing step:
    Breakdown of destination-atop operation.
    • The initial state shows a canvas after drawing the destination on it.
    • The after drawing shadow state, shows a shadow drawn below the destination. In this case, the clipping is set to new content (shadow), and hence the part of destination that is not “atop” shadow is being clipped out.
    • The after drawing shadow origin state, shows the final state after drawing the shadow origin below the previous canvas content (new destination) that is at this point “a shadow combined with destination”. Similarly as in the previous step, the clipping is set to the new content (shadow origin), and hence any part of new destination that is not “atop” the shadow origin is being clipped out.

Discrepancies between browser engines #

If drawing shadows with globalCompositeOperation is tricky in general, things get even trickier in particular browser engines, as virtually no graphics library provides an API that matches the Canvas2D API 1-to-1. This means that, depending on the graphics library used, the browser engine must implement more or fewer integration parts here and there. For example, some graphics library may lack native support for shadows - in that case the browser engine has to prepare the shadow itself, e.g. by drawing the shadow origin (no matter how complex) onto an extra surface, changing its color, blurring it, etc. so that it can be used as a whole once prepared.

Having said the above, one would expect that all the above aspects should be tested and implemented really well. After all, whenever the subject matter becomes complicated, extra care is required. It turns out, however, this is not necessarily the case when it comes to globalCompositeOperation and shadows. As for the testing part, there are very few tests (2d.shadow.composite*) in WPT (Web Platform Tests) covering the use cases described above. It’s also not much better for internal web engine test suites. As for implementations, there’s a substantial amount of discrepancy.

Simple examples #

To show exactly what’s the situation, the examples from section Shadows meet globalCompositeOperation can be used again. This time using browsers representing different web engines:

  • Chromium 126.0.6478.126 Shadows combined with compositing operation - Chromium.
  • Firefox 128.0 Shadows combined with compositing operation - Firefox.
  • Gnome Web (Epiphany) 45.0 (WebKit/Cairo) Shadows combined with compositing operation - Epiphany.
  • WPE MiniBrowser build from WebKit@098c58dd13bf40fc81971361162e21d05cb1f74a (WebKit/Skia) Shadows combined with compositing operation - WPE MiniBrowser.
  • Safari 17.1 (WebKit/Core Graphics) Shadows combined with compositing operation - Safari.
  • Servo release from 2024/07/04 Shadows combined with compositing operation - Servo.
  • Ladybird build from 2024/06/29 Shadows combined with compositing operation - Ladybird

First of all, it’s evident that experimental browsers such as servo and ladybird are falling behind the competition - servo doesn’t seem to support shadows at all, while ladybird doesn’t support anything other than drawing a rect filled with color.

Second, the non-experimental browsers are pretty stable in terms of covering most of the combinations presented above.

Finally, the most tricky combination above seems to be the one including destination-atop - in that case almost every mainstream browser renders different results:

  • Chromium is the only one rendering correctly.
  • Firefox and Epiphany are pretty close, but both are suffering from a similar glitch where the red part is covered by the part of destination that should be clipped out already.
  • WPE MiniBrowser and Safari are both rendering in correct order, but the clipping is wrong.

More sophisticated examples #

Until now, the discrepancies don’t seem to be very dramatic, and hence it’s time to present more sophisticated examples that are an extended version of the test case from the WebKit source tree:

  • Chromium 126.0.6478.126

Shadows combined with compositing operation - Chromium.

  • Firefox 128.0

Shadows combined with compositing operation - Firefox.

  • Gnome Web (Epiphany) 45.0 (WebKit/Cairo)

Shadows combined with compositing operation - Epiphany.

  • WPE MiniBrowser build from WebKit@098c58dd13bf40fc81971361162e21d05cb1f74a (WebKit/Skia)

Shadows combined with compositing operation - WPE MiniBrowser.

  • Safari 17.1 (WebKit/Core Graphics)

Shadows combined with compositing operation - Safari.

  • Servo release from 2024/07/04

Shadows combined with compositing operation - Servo.

  • Ladybird build from 2024/06/29

Shadows combined with compositing operation - Ladybird.

Other than destination-out, xor, and a few simple operations presented before, all the operations presented above pose serious problems to the majority of browsers. The only browser that is correct in all the cases (to the best of my understanding) is Chromium, which uses the Blink rendering engine, which in turn uses the Skia library. One may wonder if perhaps it’s Skia that’s responsible for Chromium’s success, but given the above results where e.g. WPE MiniBrowser uses Skia as well, it’s evident that the problems lie above the particular graphics library.

Looking at the operations and browsers that render incorrectly, it’s clearly visible that even small problems - with either ordering of draw calls or clipping - lead to spectacularly broken results. The pinnacle of misery is the source-out operation that is the most variable one across browsers. One has to admit, however, that WPE MiniBrowser is slightly closer to being correct than others.

Towards unification #

Fixing the above problems is a long journey. After all, every single web engine has to be fixed in its own specific way. If the specification were the problem, it would be the obvious place to start. However, as mentioned in the section Shadows meet globalCompositeOperation, the specification is pretty clear on how drawing, shadows, and globalCompositeOperation come together. In that case, the next obvious place to start improving things is the WPT test suite.

What makes WPT outstanding is that it is a de facto standard cross-browser test suite for testing the web platform stack. Thus the test suite is developed as an open collaboration effort by developers from around the globe and hence is very broad in terms of specification coverage. What’s also important, the test results are actively evaluated against the popular browser engines and published under wpt.fyi, therefore putting some pressure on web engine developers to fix the problems so that they keep up with competition.

Given the above, extending the WPT test suite with test cases that cover globalCompositeOperation combined with shadows is a reasonable first step towards the unification of browser implementations. This can be done either by directly contributing tests to WPT, or by creating an issue. Personally, I’ve decided to file an issue first (WPT#46544) and to add tests once I have some time. I haven’t contributed to WPT yet, but I’m excited to work with it soon. Once I land my first pull request, I’ll start fixing WebKit and I won’t hesitate to post some updates on this blog.

by Paweł Lampe at August 02, 2024 12:00 AM

July 26, 2024

Loïc Le Page

FFmpeg 101

A high-level architecture overview to start with FFmpeg.

FFmpeg package content #

FFmpeg is composed of a suite of tools and libraries.

FFmpeg tools #

The tools can be used to encode/decode/transcode a multitude of different audio and video formats, and to stream the encoded media over networks.

  • ffmpeg: a command line tool to convert multimedia files between formats
  • ffplay: a simple media player based on SDL and the FFmpeg libraries
  • ffprobe: a simple multimedia stream analyzer

FFmpeg libraries #

The libraries can be used to integrate those same features into your own product.

  • libavformat: I/O and muxing/demuxing
  • libavcodec: encoding/decoding
  • libavfilter: graph-based filters for raw media
  • libavdevice: input/output devices
  • libavutil: common multimedia utilities
  • libswresample: audio resampling, samples format conversion and audio mixing
  • libswscale: color conversion and image scaling
  • libpostproc: video post-processing (deblocking/noise filters)

FFmpeg simple player #

A basic usage of FFmpeg is to demux a multimedia stream (obtained from a file or from the network) into its audio and video streams and then to decode those streams into raw audio and raw video data.

To manage the media streams, FFmpeg uses the following structures:

  • AVFormatContext: a high level structure providing sync, metadata and muxing for the streams
  • AVStream: a continuous stream (audio or video)
  • AVCodec: defines how data are encoded and decoded
  • AVPacket: encoded data in the stream
  • AVFrame: decoded data (raw video frame or raw audio samples)

The process used to demux and decode follows this logic:

basic processing

Here is the basic code needed to read an encoded multimedia stream from a file, analyze its content and demux the audio and video streams. Those features are provided by the libavformat library and it uses the AVFormatContext and AVStream structures to store the information.

// Allocate memory for the context structure
AVFormatContext* format_context = avformat_alloc_context();

// Open a multimedia file (like an mp4 file or any format recognized by FFmpeg)
avformat_open_input(&format_context, filename, NULL, NULL);
printf("File: %s, format: %s\n", filename, format_context->iformat->name);

// Analyze the file content and identify the streams within
avformat_find_stream_info(format_context, NULL);

// List the streams
for (unsigned int i = 0; i < format_context->nb_streams; ++i)
{
    AVStream* stream = format_context->streams[i];

    printf("---- Stream %02d\n", i);
    printf("  Time base: %d/%d\n", stream->time_base.num, stream->time_base.den);
    printf("  Framerate: %d/%d\n", stream->r_frame_rate.num, stream->r_frame_rate.den);
    printf("  Start time: %" PRId64 "\n", stream->start_time);
    printf("  Duration: %" PRId64 "\n", stream->duration);
    printf("  Type: %s\n", av_get_media_type_string(stream->codecpar->codec_type));

    uint32_t fourcc = stream->codecpar->codec_tag;
    printf("  FourCC: %c%c%c%c\n", fourcc & 0xff, (fourcc >> 8) & 0xff, (fourcc >> 16) & 0xff, (fourcc >> 24) & 0xff);
}

// Close the multimedia file and free the context structure
avformat_close_input(&format_context);

Once we’ve got the different streams from inside the multimedia file, we need to find specific codecs to decode the streams to raw audio and raw video data. All codecs are statically included in libavcodec. You can easily create your own codec by just creating an instance of the FFCodec structure and registering it as an extern const FFCodec in libavcodec/allcodecs.c, but this would be a different topic for another post.

To find the codec corresponding to the content of an AVStream, we can use the following code:

// Stream obtained from the AVFormatContext structure in the former streams listing loop
AVStream* stream = format_context->streams[i];

// Search for a compatible codec
const AVCodec* codec = avcodec_find_decoder(stream->codecpar->codec_id);
if (!codec)
{
    fprintf(stderr, "Unsupported codec\n");
    continue;
}
printf("  Codec: %s, bitrate: %" PRId64 "\n", codec->name, stream->codecpar->bit_rate);

if (codec->type == AVMEDIA_TYPE_VIDEO)
{
    printf("  Video resolution: %dx%d\n", stream->codecpar->width, stream->codecpar->height);
}
else if (codec->type == AVMEDIA_TYPE_AUDIO)
{
    printf("  Audio: %d channels, sample rate: %d Hz\n",
        stream->codecpar->ch_layout.nb_channels,
        stream->codecpar->sample_rate);
}

With the right codec and codec parameters extracted from the AVStream information, we can now allocate the AVCodecContext structure that will be used to decode the corresponding stream. It is important to remember the index of the stream we want to decode from the former streams list (format_context->streams) because this index will be used later to identify the demuxed packets extracted by the AVFormatContext.

In the following code we’re going to select the first video stream contained in the multimedia file.

// first_video_stream_index is determined during the streams listing in the former loop
int first_video_stream_index = ...;

AVStream* first_video_stream = format_context->streams[first_video_stream_index];
AVCodecParameters* first_video_stream_codec_params = first_video_stream->codecpar;
const AVCodec* first_video_stream_codec = avcodec_find_decoder(first_video_stream_codec_params->codec_id);

// Allocate memory for the decoding context structure
AVCodecContext* codec_context = avcodec_alloc_context3(first_video_stream_codec);

// Configure the decoder with the codec parameters
avcodec_parameters_to_context(codec_context, first_video_stream_codec_params);

// Open the decoder
avcodec_open2(codec_context, first_video_stream_codec, NULL);

Now that we have a running decoder, we can extract the demuxed packets using the AVFormatContext structure and decode them to raw video frames. For that we need 2 different structures:

  • AVPacket which contains the encoded packets extracted from the input multimedia file,
  • AVFrame which will contain the raw video frame after the AVCodecContext has decoded the former packets.

// Allocate memory for the encoded packet structure
AVPacket* packet = av_packet_alloc();

// Allocate memory for the decoded frame structure
AVFrame* frame = av_frame_alloc();

// Demux the next packet from the input multimedia file
while (av_read_frame(format_context, packet) >= 0)
{
    // The demuxed packet uses the stream index to identify the AVStream it is coming from
    printf("Packet received for stream %02d, pts: %" PRId64 "\n", packet->stream_index, packet->pts);

    // In our example we are only decoding the first video stream identified formerly by first_video_stream_index
    if (packet->stream_index == first_video_stream_index)
    {
        // Send the packet to the previously initialized decoder
        int res = avcodec_send_packet(codec_context, packet);
        if (res < 0)
        {
            fprintf(stderr, "Cannot send packet to the decoder: %s\n", av_err2str(res));
            break;
        }

        // The decoder (AVCodecContext) acts like a FIFO queue, we push the encoded packets on one end and we need to
        // poll the other end to fetch the decoded frames. The codec implementation may (or may not) use different
        // threads to perform the actual decoding.

        // Poll the running decoder to fetch all available decoded frames until now
        while (res >= 0)
        {
            // Fetch the next available decoded frame
            res = avcodec_receive_frame(codec_context, frame);
            if (res == AVERROR(EAGAIN) || res == AVERROR_EOF)
            {
                // No more decoded frame is available in the decoder output queue, go to next encoded packet
                break;
            }
            else if (res < 0)
            {
                fprintf(stderr, "Error while receiving a frame from the decoder: %s\n", av_err2str(res));
                goto end;
            }

            // Now the AVFrame structure contains a decoded raw video frame, we can process it further...
            printf("Frame %02" PRId64 ", type: %c, format: %d, pts: %03" PRId64 ", keyframe: %s\n",
                codec_context->frame_num, av_get_picture_type_char(frame->pict_type), frame->format, frame->pts,
                (frame->flags & AV_FRAME_FLAG_KEY) ? "true" : "false");

            // The AVFrame internal content is automatically unreffed and recycled during the next call to
            // avcodec_receive_frame(codec_context, frame)
        }
    }

    // Unref the packet internal content to recycle it for the next demuxed packet
    av_packet_unref(packet);
}

// Free the previously allocated memory for the different FFmpeg structures
end:
    av_packet_free(&packet);
    av_frame_free(&frame);
    avcodec_free_context(&codec_context);
    avformat_close_input(&format_context);

The behavior of the preceding code is summarized in the next diagram:

processing diagram

You can find the full code here.

To build the example you will need meson and ninja. If you have python and pip installed, you can install them very easily by calling pip3 install meson ninja. Then, once the example archive is extracted to an ffmpeg-101 folder, go to this folder and call meson setup build. It will automatically download the right version of FFmpeg if you don’t already have it installed on your system. Then call ninja -C build to build the code and ./build/ffmpeg-101 sample.mp4 to run it.

You should obtain the following result:

File: sample.mp4, format: mov,mp4,m4a,3gp,3g2,mj2
---- Stream 00
  Time base: 1/3000
  Framerate: 30/1
  Start time: 0
  Duration: 30000
  Type: video
  FourCC: avc1
  Codec: h264, bitrate: 47094
  Video resolution: 206x80
---- Stream 01
  Time base: 1/44100
  Framerate: 0/0
  Start time: 0
  Duration: 440320
  Type: audio
  FourCC: mp4a
  Codec: aac, bitrate: 112000
  Audio: 2 channels, sample rate: 44100 Hz
Packet received for stream 00, pts: 0
Send video packet to decoder...
Frame 01, type: I, format: 0, pts: 000, keyframe: true
Packet received for stream 00, pts: 100
Send video packet to decoder...
Frame 02, type: P, format: 0, pts: 100, keyframe: false
Packet received for stream 00, pts: 200
Send video packet to decoder...
Frame 03, type: P, format: 0, pts: 200, keyframe: false
Packet received for stream 00, pts: 300
Send video packet to decoder...
Frame 04, type: P, format: 0, pts: 300, keyframe: false
Packet received for stream 00, pts: 400
Send video packet to decoder...
Frame 05, type: P, format: 0, pts: 400, keyframe: false
Packet received for stream 00, pts: 500
Send video packet to decoder...
Frame 06, type: P, format: 0, pts: 500, keyframe: false
Packet received for stream 00, pts: 600
Send video packet to decoder...
Frame 07, type: P, format: 0, pts: 600, keyframe: false
Packet received for stream 00, pts: 700
Send video packet to decoder...
Frame 08, type: P, format: 0, pts: 700, keyframe: false
Packet received for stream 01, pts: 0
Packet received for stream 01, pts: 1024
Packet received for stream 01, pts: 2048
Packet received for stream 01, pts: 3072
Packet received for stream 01, pts: 4096
Packet received for stream 01, pts: 5120
Packet received for stream 01, pts: 6144
Packet received for stream 01, pts: 7168
Packet received for stream 01, pts: 8192
Packet received for stream 01, pts: 9216
Packet received for stream 01, pts: 10240
Packet received for stream 01, pts: 11264
Packet received for stream 01, pts: 12288
Packet received for stream 01, pts: 13312
Packet received for stream 01, pts: 14336
Packet received for stream 01, pts: 15360
Packet received for stream 01, pts: 16384
Packet received for stream 01, pts: 17408
Packet received for stream 01, pts: 18432
Packet received for stream 01, pts: 19456
Packet received for stream 01, pts: 20480
Packet received for stream 01, pts: 21504
Packet received for stream 00, pts: 800
Send video packet to decoder...
Frame 09, type: P, format: 0, pts: 800, keyframe: false
Packet received for stream 00, pts: 900
Send video packet to decoder...
Frame 10, type: P, format: 0, pts: 900, keyframe: false

by Loïc Le Page at July 26, 2024 12:00 AM

July 22, 2024

Eric Meyer

Design for Real Life News!

If you’re reading this, odds are you’ve at least heard of A Book Apart (ABA), who published Design for Real Life, which I co-wrote with Sara Wachter-Boettcher back in 2016.  What you may not have heard is that ABA has closed up shop.  There won’t be any more new ABA titles, nor will ABA continue to sell the books in their catalog.

That’s the bad news.  The great news is that ABA has transferred the rights for all of its books to their respective authors! (Not every ex-publisher does this, and not every book contract demands it, so thanks to ABA.) We’re all figuring out what to do with our books, and everyone will make their own choices.  One of the things Sara and I have decided to do is to eventually put the entire text online for free, as a booksite.  That isn’t ready yet, but it should be coming somewhere down the road.

In the meantime, we’ve decided to cut the price of print and e-book copies available through Ingram.  DfRL was the eighteenth book ABA put out, so we’ve decided to make the price of both the print and e-book $18, regardless of whether those dollars are American, Canadian, or Australian.  Also €18 and £18.  Basically, in all five currencies we can define, the price is 18 of those.

…unless you buy it through Apple Books; then it’s 17.99 of every currency, because the system forces us to make it cheaper than the list price and also have the amount end in .99.  Obversely, if you’re buying a copy (or copies) for a library, the price has to be more than the list price and also end in .99, so the library cost is 18.99 currency units.  Yeah, I dunno either.

At any rate, compared to its old price, this is a significant price cut, and in some cases (yes, Australia, we’re looking at you) it’s a huge discount.  Or, at least, it will be at a discount once online shops catch up.  The US-based shops seem to be up to date, and Apple Books as well, but some of the “foreign” (non-U.S.) sources are still at their old prices.  In those cases, maybe wishlist or bookmark or something and keep an eye out for the drop.  We hope it will become global by the end of the week.  And hey, as I write this, a couple of places have the ebook version for like 22% less than our listed price.

So!  If you’ve always thought about buying a copy but never got around to it, now’s a good time to get a great deal.  Ditto if you’ve never heard of the book but it sounds interesting, or you want it in ABA branding, or really for any other reason you have to buy a copy now.

I suppose the real question is, should you buy a copy?  We’ll grant that some parts of it are a little dated, for sure.  But the concepts and approaches we introduced can be seen in a lot of work done even today.  It made significant inroads into government design practices in the UK and elsewhere, for example, and we still hear from people who say it really changed how they think about design and UX.  We’re still very proud of it, and we think anyone who takes the job of serving their users seriously should give it a read.  But then, I guess we would, or else we’d never have written it in the first place.

And that’s the story so far.  I’ll blog again when the freebook is online, and if anything else changes as we go through the process.  Got questions?  Leave a comment or drop me a line.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at July 22, 2024 03:22 PM

July 18, 2024

Igalia Compilers Team

Summary of the June 2024 TC39 plenary in Helsinki

In June, many colleagues from Igalia participated in a TC39 meeting organized in Helsinki by Aalto University and Mozilla to discuss proposed features for the JavaScript standard alongside delegates from various other organizations.

Let's delve together into some of the most exciting updates!

You can also read the full agenda and the meeting minutes on GitHub.

Day 1 #

import defer to Stage 2.7 igalia logo #

The import defer proposal allows pre-loading modules while deferring their evaluation until a later time. This proposal aims at giving developers better tools to optimize the startup performance of their application.

As soon as some code needs the variables exported by a deferred module, it will be synchronously evaluated to immediately give access to its value:

// This does not cause evaluation of my-mod or its dependencies:
import defer * as myMod from "./my-mod.js";

$button.addEventListener("click", () => {
  // but this does!
  console.log("val:", myMod.val);
});

This is similar to, when using CommonJS, moving a require(...) call from the top-level of a module to inside a function. Adding a similar capability to ES modules is one further step towards helping people migrate from CommonJS to ESM.
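
For comparison, a rough CommonJS sketch of that pattern (the require() call moved out of the module's top level and into the event handler, so evaluation happens lazily) might look like this:

// CommonJS analogue of the deferred-import pattern described above.
$button.addEventListener("click", () => {
  const myMod = require("./my-mod.js"); // evaluated on first click, not at startup
  console.log("val:", myMod.val);
});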

The proposal reached stage 2.7 after answering a few questions centered around the behavior of top-level await: modules with top-level await will not be deferred even if imported through import defer, because they cannot be synchronously evaluated. If your application can easily handle asynchronous deferred evaluation of modules, it can just as well use dynamic import().

The proposal now needs to have test262 tests written, to be able to go to Stage 3.

Promise.try to Stage 3 #

The new Promise.try helper allows calling a function that might or might not be asynchronous, unifying the error handling paths. This is useful for asynchronous APIs (which should always signal errors by returning a rejected promise) that interact with user-provided callbacks.

Consider this example:

type AsyncNegator = (val: number) => number | Promise<number>;

function subtract(a: number, b: number, negate: AsyncNegator): Promise<number> {
  return Promise.resolve(negate(b)).then(negB => a + negB);
}

While the developer here took care of wrapping negate's result in Promise.resolve, in case negate returns a number directly, what happens if negate throws an error? In that case, subtract will throw synchronously rather than returning a rejected promise!

With Promise.try, you can easily handle both the success and error paths correctly:

type AsyncNegator = (val: number) => number | Promise<number>;

function subtract(a: number, b: number, negate: AsyncNegator): Promise<number> {
  return Promise.try(() => negate(b)).then(negB => a + negB);
}
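
A quick way to see the difference is that, with Promise.try, a callback that throws synchronously still yields a rejected promise instead of a synchronous exception:

// A synchronous throw inside the callback becomes a rejected promise:
Promise.try(() => { throw new Error("boom"); })
  .catch(err => console.log("rejected with:", err.message)); // "rejected with: boom"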

Day 2 #

Source maps update #

Source maps are an important tool in a developer's toolbox: they are what lets you debug transpiled/minified code in your editor or browser, while still stepping through your original hand-authored code.

While they are supported by most tools and browsers, there hasn't been a shared standard that defines how they should work. Tools and browsers all have to peek at what the others are doing to understand how to properly implement them, and this situation makes it very difficult to evolve the format to improve the debugging experience.

TC39 recently picked up the task of formalizing the standard, as well as adding new features such as the scopes proposal that would let devtools better understand renamed variables and inlined functions.

Iterator.zip to Stage 2.7 #

TC39 is working on many helpers for more easily working with iterators ("lazy lists", that only produce values as needed). While most of them are in the Iterator Helpers proposal, this one is advancing on its own.

Iterator.zip allows pairing values coming from multiple iterators:

function* getNums(start = 0, step = 1) {
  for (let val = start; ; val += step) yield val;
}

let naturals = getNums();
let evens = getNums(0, 2);
let negatives = getNums(-1, -1);

// an iterator of [a natural, an even, a negative]
let allTogether = Iterator.zip([naturals, evens, negatives]);

console.log(allTogether.next().value); // [0, 0, -1]
console.log(allTogether.next().value); // [1, 2, -2]
console.log(allTogether.next().value); // [2, 4, -3]

This proposal, like import defer, just reached the new Stage 2.7: it will now need test262 tests to be eligible for Stage 3.

Temporal reduction igalia logo #

Temporal is one of the longest awaited features of JavaScript, advancing bit by bit on its path to stage 4 as obstacles are removed. For the last 6 months or so we have been working on removing one of the final obstacles: addressing feedback from JS engines on the size and complexity of the proposal, which culminated in this meeting.

As we get closer to having shipping implementations, it's become clear that the size of Temporal was an obstacle for platforms such as low-end Android devices: it added a large chunk to the size of the JS engine all at once. So, Philip Chimento and Justin Grant presented a slate of API removals to make the proposal smaller.

What was removed? Some methods previously existed for convenience, but were removed as somewhat redundant because there was a one-line way to accomplish the same thing. A more substantial removal was Temporal.Calendar and Temporal.TimeZone objects, along with the ability to extend them to implement custom calendars and custom time zones. We've received feedback that these have been the most complicated parts of the proposal for implementations, and they've also been where the most bugs have popped up. As well, due to the existence of formats like jsCalendar (RFC 8984), as well as learning more about the drawbacks of a callback-based design, we believe there are better designs possible for custom time zones and calendars than there were when the feature was designed.

Most of the slate of removals was adopted, and Temporal continues its journey to stage 4 smaller than it was before. You can follow the progress in this ticket on Temporal's issue tracker.

Day 3 #

Decimal update igalia logo #

If you're tired of the fact that 0.1 + 0.2 is not 0.3 in JavaScript, then the decimal proposal is for you! This proposal, which is currently at stage 1, was presented by Jesse Alama. The goal was to present an update about some changes to the API, and go through the status of the proposal's spec text. Although most of the committee was generally supportive of allowing this proposal to go to stage 2, it remains at stage 1 due to some concerns about missing details in the spec text and the overall motivation of the proposal.

Discard bindings update #

The intention of discard bindings is to formalize a pattern commonly seen out there in the wild of JavaScript:

let [ _, rhs ] = s.split(".");

Notice that underscore character (_)? Although our intention is to signal to the readers of our code that we don't care what is on the left-hand side of the . in our string, _ is actually a valid identifier. It might even contain a big value, which takes up memory. This is just the tip of the iceberg; things get even more complex when one imagines binding -- but not using! -- even more complex entities. We would like to use _ -- or perhaps something else, like void -- to signal to the JavaScript engine that it can throw away whatever value _ might have held. Ron Buckton presented an update about this proposal and successfully advanced it to Stage 2 in the TC39 process.

Signals update #

Signals is an ambitious proposal that draws on the various approaches to reactivity found in JS frameworks such as Vue, React, and so on. The idea is to bring some form of reactivity into the JS core. Of course, different approaches to reactivity are possible, and should remain valid; the idea of the Signals proposal is to provide some common abstraction that different approaches can build on. Dan Ehrenberg presented an update on this new proposal, which is currently at stage 1. A lot of work remains to be done; Dan explicitly said that a request to move Signals to stage 2 might take at least a year.

by compilers at July 18, 2024 12:00 AM

July 12, 2024

Georges Stavracas

Profiling a web engine

One topic that interests me endlessly is profiling. I’ve covered this topic many times in this blog, but not enough to risk sounding like a broken record yet. So here we are again!

Not everyone may know this but GNOME has its own browser, Web (a.k.a. Epiphany, or Ephy for the intimates). It’s a fairly old project, descendant of Galeon. It uses the GTK port of WebKit as its web engine.

The recent announcement that WebKit on Linux (both WebKitGTK and WPE WebKit) switched to Skia for rendering brought with it a renewed interest in measuring the performance of WebKit.

And that was only natural; prior to that, WebKit on Linux was using Cairo, which is entirely CPU-based, whereas Skia had both CPU and GPU-based rendering easily available. The CPU renderer mostly matches Cairo in terms of performance and resource usage. Thus one of the big promises of switching to Skia was better hardware utilization and better overall performance by switching to the GPU renderer.

A Note About Cairo

Even though nowadays we often talk about Cairo as a legacy piece of software, there’s no denying that Cairo is really good at what it does. Cairo can be, and often is, extremely fast at 2D rendering on the CPU, especially for small images with simple rendering. Cairo has received optimizations and improvements for this specific use case for almost 20 years, and it is definitely not a low bar to beat.

I think it’s important to keep this in mind because, as tempting as it may sound, simply switching to use GPU rendering doesn’t necessarily imply better performance.

Guesswork is a No-No

Optimizations should always be a byproduct of excellent profiling. Categorically speaking, meaningful optimizations are a consequence of instrumenting the code so much that the bottlenecks become obvious.

I think the most important and practical lesson I’ve learned is: when I’m guessing what are the performance issues of my code, I will be wrong pretty much 100% of the time. The only reliable way to optimize anything is to have hard data about the behavior of the app.

I mean, so many people – myself included – were convinced that GNOME Software was slow due to Flatpak that nobody thought about looking at app icons loading.

Enter the Profiler

Thanks to the fantastic work of Søren Sandmann, Christian Hergert, et al, we have a fantastic modern system profiler: Sysprof.

Sysprof offers a variety of instruments to profile the system. The most basic one uses perf to gather stack traces of the processes that are running. Sysprof also supports time marks, which allow plotting specific events and timings in a timeline. Sysprof also offers extra instrumentation for more specific metrics, such as network usage, graphics, storage, and more.

  • Screenshot of Sysprof's callgraph viewCallgraph
  • Screenshot of Sysprof's flamegraphs viewFlamegraphs
  • Screenshot of Sysprof's mark chart viewMark chart
  • Screenshot of Sysprof's waterfall viewMark waterfall

All these metrics are super valuable when profiling any app, but they’re particularly useful for profiling WebKit.

One challenging aspect of WebKit is that, well, it’s not exactly a small project. A WebKit build can easily take 30~50min. You need a fairly beefy machine to even be able to build a debug build of WebKit. The debug symbols can take hundreds of megabytes. This makes WebKit particularly challenging to profile.

Another problem is that Sysprof marks require integration code. Apps have to purposefully link against, and use, libsysprof-capture to send these marks to Sysprof.

Integrating with Sysprof

As a first step, Adrian brought the libsysprof-capture code into the WebKit tree. As libsysprof-capture is a static library with minimal dependencies, this was relatively easy. We’re probably going to eventually remove the in-tree copy and switch to host system libsysprof-capture, but having it in-tree was enough to kickstart the whole process.

Originally I started sprinkling Sysprof code all around the WebKit codebase, and to some degree, it worked. But eventually I learned that WebKit has its own macro-based tracing mechanism that is only ever implemented for Apple builds.

Looking at it, it didn’t seem impossible to implement these macros using Sysprof, and that’s what I’ve been doing for the past few weeks. The review was lengthy but behold, WebKit now reports Sysprof marks!

Screenshot of Sysprof with WebKit marks highlighted

Right now these marks cover a variety of JavaScript events, layout and rendering events, and web page resources. This all came for free from integrating with the preexisting tracing mechanism!

This gives us a decent understanding of how the Web process behaves. It’s not yet complete enough, but it’s a good start. I think the most interesting data to me is correlating frame timings across the whole stack, from the kernel driver to the compositor to GTK to WebKit’s UI process to WebKit’s Web process, and back:

Screenshot of Sysprof with lots of compositor and GTK and WebKit marks

But as interesting as it may be, oftentimes the fun part of profiling is being surprised by the data you collect.

For example, in WebKit, one specific, seemingly innocuous, completely bland method is in the top 3 of the callgraph chart:

Screenshot of Sysprof showing the callgraph view with an interesting result highlighted

Why is WebCore::FloatRect::contains so high in the profiling? That’s what I’m investigating right now. Who guessed this specific method would be there? Nobody, as far as I know.

Once this is out in a stable release, anyone will be able to grab a copy of GNOME Web, and run it with Sysprof, and help find out any performance issues that only reproduce in particular combinations of hardware.

Next Plans

To me this already is a game changer for WebKit, but of course we can do more. Besides the rectangular surprise, and one particular slowdown that comes from GTK loading Vulkan on startup, no other big obvious data point popped up. Especially for the marks, I think their coverage is still fairly small compared to what it could be.

We need more data.

Some ideas that are floating right now:

  • Track individual frames and correlate them with Sysprof marks
  • Measure top-to-bottom-to-top latency
  • Measure input delay
  • Integrate with multimedia frames

Perhaps this will allow us to make WebKit the prime web engine for Linux, with top-tier performance, excellent system integration, and more. Maybe we can even redesign the whole rendering architecture of WebKit on Linux to be more GPU friendly now. I can dream high, can’t I? 🙂

In any case, I think we have a promising and exciting time ahead for WebKit on Linux!

by Georges Stavracas at July 12, 2024 12:42 PM

July 10, 2024

Stephen Chenney

Canvas Text Editing

Editing text in HTML canvas has never been easy. It requires identifying which character is under the hit point in order to place a caret, and it requires computing bounds for a range of text that is selected. The existing implementations of Canvas TextMetrics made these things possible, but not without a lot of Javascript making multiple expensive calls to compute metrics for substrings. Three new additions to the TextMetrics API are intended to support editing use cases in Canvas text. They are in the standards pipeline, and implemented in Chromium-based browsers behind the ExtendedTextMetrics flag:

  • caretPositionFromPoint gives the location in a string corresponding to a pixel length along the string. Use it to identify where the caret is in the string, and what the bounds of a selection range are.
  • getSelectionRects returns the rectangles that a browser would use to highlight a range of text. Use it to draw the selection highlight.
  • getActualBoundingBox returns the bounding box for a sub-range of text within a string. Use it if you need to know whether a point lies within a substring, rather than the entire string.

To enable the flag, use --enable-blink-features=ExtendedTextMetrics when launching Chrome from a script or command line, or enable “Experimental Web Platform features” via chrome://flags/#enable-experimental-web-platform-features.

I wrote a basic web app (opens in a new tab) in order to demonstrate the use of these features. It will function in Chrome versions beyond 128.0.6587.0 (Canary at the time of writing) with the above flags set.

The app allows the editing of a single line of text drawn in an HTML canvas. Here I’ll work through usage of the new features.

In the demo, the first instance of “new Canvas Text Metrics” is considered a link back to this blog page. Canvas Text has no notion of links, and thousands of people have looked at Stack Exchange for a way to insert hyperlinks in canvas text. Part of the problem, assuming you know where the link is in the text, is determining when the link was clicked on. The TextMetrics getActualBoundingBox(start, end) method is intended to simplify the problem by returning the bounding box of a substring of the text, in this case the link.

onStringChanged() {
  text_metrics = context.measureText(string);
  link_start_position = string.indexOf(link_text);
  if (link_start_position != -1) {
    link_end_position = link_start_position + link_text.length;
  }
}
...
linkHit(x, y) {
  let bound_rect = undefined;
  try {
    bound_rect = text_metrics.getActualBoundingBox(link_start_position, link_end_position);
  } catch (error) {
    return false;
  }
  let relative_x = x - string_x;
  let relative_y = y - string_y;
  return relative_x >= bound_rect.left && relative_y >= bound_rect.top
      && relative_x < bound_rect.right && relative_y < bound_rect.bottom;
}

The first function finds the link in the string and stores the start and end string offsets. When a click event happens, the second method is called to determine if the hit point was within the link area. The text metrics object is queried for the bounding box of the link’s substring. Note the call is contained within a try...catch block because an exception will be returned if the substring is invalid. The event offset is mapped into the coordinate system of the text (in this case by subtracting the text location) and the resulting point is tested against the rectangle.

In more general situations you may need to use a regular expression to find links, and keep track of a more complex transformation chain to convert event locations into the text string’s coordinate system.

Mapping a Point to a String Index #

A primary concept of any editing application is the caret location because it indicates where typed text will appear, or what will be deleted by backspace, or where an insertion will happen. Mapping a hit point in the canvas into the caret position in the text string is a fundamental editing operation. It is possible to do this with existing methods but it is expensive (you can do a binary search using the width of substrings).

The TextMetrics caretPositionFromPoint(offset) method uses existing code in browsers to efficiently map a point to a string position. The underlying functionality is very similar to the document.caretPositionFromPoint(x,y) method, but modified for the canvas situation. The demo code uses it to position the caret and to identify the selection range.

text_offset = event.offsetX - string_x;
caret_position = text_metrics.caretPositionFromPoint(text_offset);

The caretPositionFromPoint function takes the horizontal offset, in pixels, measured from the origin of the text (based on the textAlign property of the canvas context). The function finds the character boundary closest to the given offset, then returns the character index to the right for left-to-right text, and to the left for right-to-left text. The offset can be negative to allow characters to the left of the origin to be mapped.

In the figure below, the top string has textDirection = "ltr" and textAlign = "center". The origin for measuring offsets is the center of the string. Green shows the offsets given, while blue shows the indexes returned. The bottom string demonstrates textDirection = "rtl" and textAlign = "start".

An offset past the beginning of the text always returns 0, and past the end returns the string length. Note that the offset is always measured left-to-right, even if the text direction is right-to-left. The “beginning” and “end” of the text string do respect the text direction, so for RTL text the beginning is on the right.

The caretPositionFromPoint function may produce very counter-intuitive results when the text string has mixed bidi content, such as a latin substring within an arabic string. As the offset moves along the string the positions will not steadily increase, or decrease, but may jump around at the boundaries of a directional run. Full handling of bidi content requires incorporating bidi level information, particularly for selecting text, and is beyond the scope of this article.

Selection Rectangles #

Selected text is normally indicated by drawing a highlight over the range, but to produce such an effect in canvas requires estimating the rectangle using existing text metrics, and again making multiple queries to text metrics to obtain the left and right extents. The new TextMetrics getSelectionRects(start, end) function returns a list of browser defined selection rectangles for the given subrange of the string. There may be multiple rectangles because the browser returns one for each bidi run; you would need to draw them all to highlight the complete range. The demo assumes a single rectangle because it assumes no mixed-direction strings.

selection_rect = text_metrics.getSelectionRects(selection_range[0], selection_range[1])[0];
...
context.fillStyle = 'yellow';
context.fillRect(selection_rect.x + string_x,
                 selection_rect.y + string_y,
                 selection_rect.width,
                 selection_rect.height);
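
If you did need to handle mixed-direction text, a small variation of the demo code (reusing the same variables as above) could draw every rectangle returned rather than just the first one:

// Sketch: highlight all returned rectangles, one per bidi run.
const rects = text_metrics.getSelectionRects(selection_range[0], selection_range[1]);
context.fillStyle = 'yellow';
for (const rect of rects) {
  context.fillRect(rect.x + string_x, rect.y + string_y, rect.width, rect.height);
}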

Like all the new methods, the rectangle returned is in the coordinate system of the string, as defined by the transform, textAlign and textBaseline.

Conclusion #

The new Canvas Text Metrics described here are in the process of standardization. When the feedback process is opened we will update this blog post with the place to raise issues with these proposed methods.

Thanks #

The implementation of Canvas Text Features was aided by Igalia S.L. funded by Bloomberg L.P.

by Stephen Chenney at July 10, 2024 12:00 AM

July 09, 2024

Guilherme Piccoli

Presenting kdumpst, or how to collect kernel crash logs on Arch Linux

It’s been a long time since I last posted something here – yet there are interesting news to mention, a bit far from ARM64 (the topic of my previous posts). Let’s talk today about kernel crashes, or even better, how can we collect information if a kernel panic happens on Arch Linux and on SteamOS, … Continue reading "Presenting kdumpst, or how to collect kernel crash logs on Arch Linux"

by gpiccoli at July 09, 2024 07:21 PM

July 08, 2024

Frédéric Wang

My recent contributions to Gecko (2/3)

Introduction

This is the second in a series of blog posts describing new web platform features Igalia has implemented in Gecko, as part of an effort to improve browser interoperability. I’ll talk about the task of implementing ‘content-visibility’, to which several Igalians have contributed since early 2022, and I’ll focus on two main roadblocks I had to overcome.

The ‘content-visibility’ property

In the past, Igalia worked on CSS containment, a feature allowing authors to isolate a subtree from the rest of the document to improve rendering performance. This is done using the ‘contain’ property, which accepts four kinds of containment: size, layout, style and paint.

‘content-visibility’ is a new property allowing authors to “hide” some content from the page, and save the browser unnecessary work by applying containment. The most interesting one is probably content-visibility: auto, which hides content that is not relevant to the user. This is essentially native “virtual scrolling”, allowing you to build virtualized or “recycled” lists without breaking accessibility and find-in-page.

To explain this, consider the typical example of a page with a series of posts, as shown below. By default, each post would have the four types of containment applied, plus it won’t be painted, won’t respond to hit-testing, and would use the dimensions specified in the ‘contain-intrinsic-size’ property. It’s only once a post becomes relevant to the user (e.g. when scrolled close enough to the viewport, or when focus is moved into the post) that the actual effort to properly render the content, and calculate its actual size, is performed:

div.post {
  content-visibility: auto;
  contain-intrinsic-size: 500px 1000px;
}
<div class="post">
...
</div>
<div class="post">
...
</div>
<div class="post">
...
</div>
<div class="post">
...
</div>

If a post later loses its relevance (e.g. when scrolled away, or when focus is lost) then it would use the dimensions specified by ‘contain-intrinsic-size’ again, discarding the content size that was obtained after layout. One can also avoid that and use the last remembered size instead:

div.post {
  contain-intrinsic-size: auto 500px auto 1000px;
}

Finally, there is also a content-visibility: hidden value, which is the same as content-visibility: auto but never reveals the content, enhancing other methods to hide content such as display: none or visibility: hidden.

This is just a quick overview of the feature, but I invite you to read the web.dev article on content-visibility for further details and thoughts.

Viewport distance for content-visibility: auto

As is often the case, the feature looks straightforward to implement, but issues appear when you get into the details.

In bug 1807253, my colleague Oriol Brufau raised an interoperability bug with a very simple test case, reproduced below for convenience. Chromium would report 0 and 42, whereas Firefox would sometimes report 0 twice, meaning that the post did not become relevant after a rendering update:

<!DOCTYPE html>
<div id="post" style="content-visibility: auto">
  <div style="height: 42px"></div>
</div>
<script>
console.log(post.clientHeight);
requestAnimationFrame(() => requestAnimationFrame(() => {
  console.log(post.clientHeight);
}));
</script>

It turned out that an early version of the specification relied too heavily on a modified version of IntersectionObserver to synchronously detect when an element is close to the viewport, as this was how it was implemented in Chromium. However, the initial implementation in Firefox relied on a standard IntersectionObserver (with asynchronous notifications of observers) and so failed to produce the behavior described in the specification. This issue was showing up in several WPT failures.

To solve that problem, the moment when we determine an element’s proximity to the viewport was moved into the HTML5 specification, at the step when the rendering is updated, more precisely when the ResizeObserver notifications are broadcast. My colleague Alexander Surkov had started rewriting Firefox’s implementation to align with this new behavior in early 2023, and I took over his work in November.

Since this touches the “update the rendering” step which is executed on every page, it was quite likely to break things… and indeed many regressions were caused by my patch, for example:

  • One regression was about white flickering of pages on every reload/navigation.
  • One more regression was about content-visibility: auto nodes not being rendered at all.
  • Another regression was about new resize loop errors appearing in tests.
  • Some test cases were also found where the “update the rendering step” would repeat indefinitely, causing performance regressions.
  • Last but not least, crashes were reported.

Some of these issues were due to the fact that support for the last remembered size in Firefox relied on an internal ResizeObserver. However, the CSS Box Sizing spec only says that the last remembered size is updated when ResizeObserver events are delivered, not that such an internal ResizeObserver object is actually needed. I removed this internal observer and ensured the last remembered size is computed directly in the “update the rendering” phase, making the whole thing simpler and more robust.

Dynamic changes to CSS ‘contain’ and ‘content-visibility’

Before sending the intent-to-ship, we reviewed remaining issues and stumbled on bug 1765615, which had been opened during the initial 2022 work. Mozilla indicated this performance bug was important enough to consider an optimization, so I started tackling the issue.

Elaborating a bit on what was mentioned above, a non-visible ‘content-visibility’ implies layout, style and paint containment, and when the element is not relevant to the user, it also implies size containment 1. This has certain side effects; for example, paint and layout containment establish an independent formatting context and affect how the contained box interacts with floats and how margin collapsing applies. Style containment can have even more drastic consequences, since it makes the counter-* and *-quote properties scoped to the subtree.

When we dynamically modify the ‘contain’ or ‘content-visibility’ properties, or when the relevance of a content-visibility: auto element changes, browsers must make sure that the rendering is properly updated. It turned out that there were almost no tests for that, and unsurprisingly, Chromium and WebKit had various invalidation bugs. Firefox was always forcing a rebuild of the tree used for rendering, which avoided such bugs but is not optimal.

I wrote a couple of web platform tests for ‘contain’ and ‘content-visibility’ 2, and made sure that Firefox does the minimal invalidation effort needed, being careful not to cause any regressions. As a result, except for style containment changes, we’re now able to avoid the cost of rebuilding the tree used for rendering!

Conclusion

Almost two years after the initial work on ‘content-visibility’, I was able to send the intent-to-ship, and the feature finally became available in Firefox 125. Finishing the implementation work on this feature was challenging, but quite interesting to me.

I believe ‘content-visibility’ is a good example of why implementing a feature in different browsers is important to ensure that both the specification and tests are good enough. The lack of details in the spec regarding when we determine viewport proximity, and the absence of WPT tests for invalidation, definitely made the Firefox work take longer than expected. But finishing that implementation work was also useful for improving the spec, tests, and other implementations 3.

I’ll conclude this series of blog posts with fetch priority, which also has its own interesting story…

  1. In both cases, “implies” means the used value of ‘contain’ is modified accordingly. 

  2. One of the things I had to handle with care was the update of the accessibility tree, since content that is not relevant to the user must not be exposed. Unfortunately, it’s not possible to write WPT tests for accessibility yet, so for now I had to write internal Firefox-specific non-regression tests. 

  3. Another interesting report happened after the release and is related to content-visibility: auto on elements drawn in a canvas

by Frédéric Wang at July 08, 2024 10:00 PM

July 01, 2024

Alex Bradbury

pwr

Summary

pwr (paced web reader) is a script and terminal-centric workflow I use for keeping up to date with various sources online, shared on the off chance it's useful to you too.

Motivation

The internet is (mostly) a wonderful thing, but it's kind of a lot. It can be distracting and I think we all know the unhealthy loops of scrolling and refreshing the same sites. pwr provides a structured workflow for keeping up to date with a preferred set of sites in an incremental fashion (willpower required). It takes some inspiration from a widely reported workflow that involved sending a URL to a server and having it returned via email to be read in a batch later. pwr adopts the delayed gratification aspect of this but doesn't involve downloading for offline reading.

The pwr flow

One-time setup:

  • Configure the pwr script so it supports your desired feed sources (RSS or using hand-written extractors for those that don't have a good feed).

Regular workflow (just run pwr with no arguments to initiate this sequence in one invocation):

  • Run pwr read to open any URLs that were previously queued for reading.
  • Run pwr fetch to get any new URLs from the configured sources.
  • Run pwr filter to open an editor window where you can quickly mark which retrieved articles to queue for reading.

In my preferred usage, the above is run once a day as a replacement for unstructured web browsing. This flow means you're always reading items that were identified the previous day. Although comments on sites such as Hacker News or Reddit are much maligned, I do find they can be a source of additional insight, and this flow means that by the time you're reading a post, ~24 hours after it was initially found, discussion has died down, so there's little reason to keep refreshing.

pwr filter is the main part requiring active input, and involves the editor in a way that is somewhat inspired by git rebase -i. For instance, at the time of writing it produces the following output (you would simply replace the d prefix with r for any you want to queue for reading):

------------------------------------------------------------
Filter file generated at 2024-07-01 08:51:54 UTC
DO NOT DELETE OR MOVE ANY LINES
To mark an item for reading, replace the 'd' prefix with 'r'
Exit editor with non-zero return code (:cq in vim) to abort
------------------------------------------------------------

# Rust Internals
d [Discussion] Hybrid borrow (0 replies)

# Swift Evolution
d [Pitch #2] Safe Access to Contiguous Storage (27 replies)
d [Re-Proposal] Type only Unions (69 replies)

# HN
d Programmers Should Never Trust Anyone, Not Even Themselves
d Unification in Elixir
d Quaternions in Signal and Image Processing

# lobste.rs
d Code Reviews Do Find Bugs
d Integrated assembler improvements in LLVM 19
d Cubernetes
d Grafana security update: Grafana Loki and unintended data write attempts to Amazon S3 buckets
d regreSSHion: RCE in OpenSSH's server, on glibc-based Linux systems (CVE-2024-6387)
d Elaboration of the PostgreSQL sort cost model

# /r/programminglanguages
d Rate my syntax (Array Access)

Making it your own

Ultimately pwr is a tool that happens to scratch an itch for me. It's out there in case any aspect of it is useful to you. It's very explicitly written as a script, where the expected usage is that you take a copy and make whatever modifications you need for yourself (changing sources, new fetchers, or other improvements).


Article changelog
  • 2024-07-01: Initial publication date.

July 01, 2024 12:00 PM

June 20, 2024

Frédéric Wang

My recent contributions to Gecko (1/3)

Introduction

Igalia has been contributing to the web platform implementations of different web engines for a long time. One of our goals is ensuring that these implementations are interoperable, by relying on various web standards and web platform tests. In July 2023, I happily joined a project that focuses on this goal, and I worked more specifically on the Gecko web engine. One year later, three new features I contributed to are being shipped in Firefox. In this series of blog posts, I’ll give an overview of those features (namely registered custom properties, content visibility, and fetch priority) and my journey to make them “ride the train” as Mozilla people say.

Let’s start with registered custom properties, an enhancement of traditional CSS variables.

Registered custom properties

You may already be familiar with CSS variables, these “dash dash” names that facilitate the maintenance of a large web site by allowing author-defined CSS properties. In the example below, the :root selector defines a variable --main-theme-color with value “blue”, which is used for the style applied to other elements via the var() CSS function. As you can see, this makes the usage of the main theme color in different places more readable and makes customizing that color much easier.

:root { --main-theme-color: blue; }
p { color: var(--main-theme-color); }
section {
  padding: 1em;
  border: 1px solid var(--main-theme-color);
}
.progress-bar {
  height: 10px;
  width: 100%;
  background: linear-gradient(white, var(--main-theme-color));
}
<section>
  <p>Loading...</p>
  <div class="progress-bar"></div>
</section>

In browsers supporting CSS variables, you should see a frame containing the text “Loading” and a progress bar, all of these components being blue:

Loading...

Having such CSS variables available is already nice, but they are lacking some features available to native CSS properties… For example, there is (almost) no syntax checking on specified values, they are always inherited, and their initial value is always the guaranteed invalid value. In order to improve on that situation, the CSS Properties and Values specification provides some APIs to register custom properties with further characteristics:

  • An accepted syntax for the property; for example, igalia | <url> | <integer>+ means either the custom identifier “igalia”, or a URL, or a space-separated list of integers.
  • Whether the property is inherited or non-inherited.
  • An initial value.

Custom properties can be registered via CSS or via a JS API, and these ways are equivalent. For example, to register --main-theme-color as a non-inherited color with initial value blue:

@property --main-theme-color {
  syntax: "<color>";
  inherits: false;
  initial-value: blue;
}
window.CSS.registerProperty({
  name: "--main-theme-color",
  syntax: "<color>",
  inherits: false,
  initialValue: "blue",
});

Interpolation of registered custom properties

By having custom properties registered with a specific syntax, we open up the possibility of interpolating between two values of the properties when performing an animation. Consider the following example, where the width of the animated div depends on the custom property --my-length. Defining this property as a length allows browsers to interpolate it continuously between 10px and 200px when it is animated:

 @property --my-length {
   syntax: "<length>";
   inherits: false;
   initial-value: 0px;
 }
 @keyframes test {
   from {
     --my-length: 10px;
   }
   to {
     --my-length: 200px;
   }
 }
 div#animated {
   animation: test 2s linear both;
   width: var(--my-length, 10px);
   height: 200px;
   background: lightblue;
 }

With non-registered custom properties, we can instead only animate discretely; --my-length would suddenly jump from 10px to 200px halfway through the duration of the animation, which is generally not what is desired for lengths.

Custom properties in the cascade

If you check the Interop 2023 Dashboard for custom properties, you may notice that interoperability was really bad at the beginning of the year, and this was mainly due to Firefox’s low score. Consequently, when I joined the project, I was asked to help with improving that situation.

Graph showing the 2023 evolution of scores and interop for custom properties

While the two registration methods previously mentioned had already been implemented, the main issue was that the CSS cascade was always treating custom properties as inherited and initialized with the guaranteed invalid value. This is indeed correct for unregistered custom properties, but it’s generally incorrect for registered custom properties!

In bug 1840478, bug 1855887, and others, I made registered custom properties work properly in the cascade, including non-inherited properties and registered initial values. But in the past, with the previous assumptions around inheritance and initial values, it was possible to store the computed values of custom properties on an element as a “cheap” map, considering only the properties actually specified on the element or an ancestor and (in most cases) only taking shallow copies of the parent’s map. As a result, when generalizing the cascade for registered custom properties, I had to be careful to avoid introducing performance regressions for existing content.

Custom properties in animations

Another area where the situation was pretty bad was animations. Not only was Firefox unable to interpolate registered custom properties between two values — one of the main motivations for the new spec — but it was actually unable to animate custom properties at all!

The main problem was that the existing animation code referred to CSS properties using an enum nsCSSPropertyID, with all custom properties represented by the single value nsCSSPropertyID::eCSSPropertyExtra_variable. To make this work for custom properties, I had to essentially replace that value with a structure containing the nsCSSPropertyID and the name of the custom property.

I uploaded patches to bug 1846516 to perform that change throughout the whole codebase, and with a few more tweaks, I was able to make registered custom properties animate discretely, but my patches still needed some polish before they could be reviewed. I had to move onto other tasks, but fortunately, some Mozilla folks were kind enough to take over this task, and more generally, complete the work on registered custom properties!

Conclusion

This was an interesting task to work on, and because a lot of the work happened in Stylo, the CSS engine shared by Servo and Gecko, I also had the opportunity to train more on the Rust programming language. Thanks to help from folks at Mozilla, we were able to get excellent progress on registered custom properties in Firefox in 2023, and this feature is expected to ship in Firefox 128!

As I said, I’ve since moved onto other tasks, which I’ll describe in subsequent blog posts in this series. Stay tuned for content-visibility, enabling interesting layout optimizations for web pages.

by Frédéric Wang at June 20, 2024 10:00 PM

June 06, 2024

Víctor Jáquez

GStreamer Vulkan Operation API

Two weeks ago the GStreamer Spring Hackfest took place in Thessaloniki, Greece. I had a great time. I hacked a bit on VA, Vulkan and my toy, planet-rs, but mostly I ate delicious Greek food ☻. A big thanks to our hosts: Vivia, Jordan and Sebastian!

First day of the GStreamer Spring Hackfest 2024 - https://floss.social/@gstreamer/112511912596084571

And now, writing this supposed small note, I recalled that I have in my to-do list an item to write a comment about GstVulkanOperation, an addition to GstVulkan API which helps with the synchronization of operations on frames, in order to enable Vulkan Video.

Originally, the GstVulkan API provided almost no synchronization operations besides fences, and that appeared to be enough for elements, since they do simple Vulkan operations. Nonetheless, as soon as we enabled the VK_VALIDATION_FEATURE_ENABLE_SYNCHRONIZATION_VALIDATION_EXT feature, which reports resource access conflicts due to missing or incorrect synchronization operations between actions [*], a sea of hazard operation warnings drowned us [*].

Hazard operations are sequences of read/write commands on a memory area, such as an image, that might be reordered, or even turn out to be racy.

Why are those hazard operations reported by the Vulkan Validation Layer, if the programmer pushes the commands to the queue in order? Why is explicit synchronization required? Because, as the great blog post from Hans-Kristian Arntzen, Yet another blog explaining Vulkan synchronization (make sure you read it!), states:

[…] all commands in a queue execute out of order. Reordering may happen across command buffers and even vkQueueSubmits

In order to explain how synchronization is done in Vulkan, allow me to yank a couple definitions stated by the specification:

Commands are instructions that are recorded in a device’s queue. There are four types of commands: action, state, synchronization and indirection. Synchronization commands impose ordering constraints on action commands, by introducing explicit execution and memory dependencies.

Operation is an arbitrary amount of commands recorded in a device’s queue.

Since the driver can reorder commands (perhaps for better performance, dunno), we need to send explicit synchronization commands to the device’s queue to enforce a specific sequence of action commands.

Nevertheless, Vulkan doesn’t offer fine-grained dependencies between individual operations. Instead, dependencies are expressed as a relation between two elements, where each element is composed of the intersection of a scope and an operation. A scope is a concept in the specification that, in practical terms, can be either a pipeline stage (for execution dependencies), or both a pipeline stage and a memory access type (for memory dependencies).

First let’s review execution dependencies through pipeline stages:

Every command submitted to a device’s queue goes through a sequence of steps known as pipeline stages. This sequence of steps is one of the very few implicit ordering guarantees that Vulkan has. Draw calls, copy commands and compute dispatches all go through certain sequential stages; how many stages are covered depends on the specific command and the current command buffer state.

In order to visualize an abstract execution dependency, let’s imagine two compute operations, where the first must happen before the second.

Operation 1
Sync command
Operation 2
  1. The programmer has to specify the Sync command in terms of two scopes (Scope 1 and Scope 2); in this execution dependency case, these are two pipeline stages.
  2. The driver generates an intersection between the commands in Operation 1 and Scope 1, defined as Scoped operation 1. The intersection contains all the commands in Operation 1 that go through the pipeline stages up to the one defined in Scope 1. The same is done with Operation 2 and Scope 2, generating Scoped operation 2.
  3. Finally, we get an execution dependency that guarantees that Scoped operation 1 happens before Scoped operation 2.

Now let’s talk about memory dependencies:

First we need to understand the concepts of memory availability and visibility. Their formal definitions in Vulkan are a bit hard to grasp, since they come from the Vulkan memory model, which is intended to abstract all the ways hardware can access memory. Perhaps we could say that availability is the operation that assures the existence of the required memory, while visibility is the operation that assures it’s possible to read/write the data in that memory area.

Memory dependencies are limited to an Operation 1 that has to be done before the memory is made available, and an Operation 2 that has to be done after the memory becomes visible.

But again, there’s no fine-grained way to declare that memory dependency. Instead, there are memory access types, which are the functions used by descriptor types, or by pipeline stages, to access memory, and they are used as access scopes.

All in all, a synchronization command defining a memory dependency between two operations is composed of the intersection of each operation’s commands with a pipeline stage, further intersected with the memory access type associated with the memory processed by those commands.

Now that the concepts are more or less explained, let’s see them expressed in code. The synchronization command for execution and memory dependencies is defined by VkDependencyInfoKHR, which contains a set of barrier arrays, for memory, buffers and images. Barriers express the dependency relation between two operations. For example, image barriers use VkImageMemoryBarrier2, which contains the mask for the source pipeline stage (to define Scoped operation 1) and the mask for the destination pipeline stage (to define Scoped operation 2); the mask for the source memory access type and the mask for the destination memory access type, to define the access scopes; and also a layout transition declaration.

A Vulkan synchronization example from Vulkan Documentation wiki:

vkCmdDraw(...);

... // First render pass teardown etc.

VkImageMemoryBarrier2KHR imageMemoryBarrier = {
  ...
  .srcStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT_KHR,
  .dstStageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT_KHR,
  .dstAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT_KHR,
  .oldLayout = VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL,
  .newLayout = VK_IMAGE_LAYOUT_ATTACHMENT_OPTIMAL
  /* .image and .subresourceRange should identify image subresource accessed */};

VkDependencyInfoKHR dependencyInfo = {
  ...
  1,                   // imageMemoryBarrierCount
  &imageMemoryBarrier, // pImageMemoryBarriers
  ...
};

vkCmdPipelineBarrier2KHR(commandBuffer, &dependencyInfo);

... // Second render pass setup etc.

vkCmdDraw(...);

First draw samples a texture in the fragment shader. Second draw writes to that texture as a color attachment.

This is a Write-After-Read (WAR) hazard, which you would usually only need an execution dependency for - meaning you wouldn’t need to supply any memory barriers. In this case you still need a memory barrier to do a layout transition though, but you don’t need any access types in the src access mask. The layout transition itself is considered a write operation though, so you do need the destination access mask to be correct - or there would be a Write-After-Write (WAW) hazard between the layout transition and the color attachment write.

Other explicit synchronization mechanisms, along with barriers, are semaphores and fences. Semaphores are a synchronization primitive that can be used to insert a dependency between operations without notifying the host; while fences are a synchronization primitive that can be used to insert a dependency from a queue to the host. Semaphores and fences are expressed in the VkSubmitInfo2KHR structure.
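To make this more concrete, here is a minimal sketch, not taken from GStreamer, of how a wait semaphore, a signal semaphore and a fence are attached to a queue submission with the synchronization2 API; the acquire_sem, done_sem, cmd_buf, queue and fence handles are assumptions, created elsewhere:

VkSemaphoreSubmitInfoKHR wait_info = {
  .sType = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO_KHR,
  .semaphore = acquire_sem, /* wait on this semaphore before executing */
  .stageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT_KHR,
};

VkSemaphoreSubmitInfoKHR signal_info = {
  .sType = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO_KHR,
  .semaphore = done_sem, /* signal this semaphore when the commands finish */
  .stageMask = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT_KHR,
};

VkCommandBufferSubmitInfoKHR cmd_info = {
  .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_SUBMIT_INFO_KHR,
  .commandBuffer = cmd_buf,
};

VkSubmitInfo2KHR submit_info = {
  .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO_2_KHR,
  .waitSemaphoreInfoCount = 1,
  .pWaitSemaphoreInfos = &wait_info,
  .commandBufferInfoCount = 1,
  .pCommandBufferInfos = &cmd_info,
  .signalSemaphoreInfoCount = 1,
  .pSignalSemaphoreInfos = &signal_info,
};

/* the fence is signaled to the host once the submission completes */
vkQueueSubmit2KHR (queue, 1, &submit_info, fence);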

As a preliminary conclusion, synchronization in Vulkan is hard and a helper API would be very welcome. Inspired by the FFmpeg work done by Lynne, I added the GstVulkanOperation object helper to the GStreamer Vulkan API.

The GstVulkanOperation object helper aims to represent an operation in the sense of the Vulkan specification mentioned before. It owns a command buffer as a public member, where external commands can be recorded before being pushed to the associated device’s queue.

It also has a set of helper methods, several of which appear in the example further below.

Internally, GstVulkanOperation contains two arrays:

  1. The array of dependency frames, i.e. the set of frames, each representing an operation, that will hold dependency relationships with other dependency frames.

    gst_vulkan_operation_add_dependency_frame appends frames to this array.

    When calling gst_vulkan_operation_end, the barrier state of each frame in the array is updated.

    Also, each dependency frame creates a timeline semaphore, which will be signaled when a command, associated with the frame, is executed in the device’s queue.

  2. The array of barriers, which contains a list of synchronization commands. gst_vulkan_operation_add_frame_barrier fills and appends a VkImageMemoryBarrier2KHR associated with a frame, which can be in the array of dependency frames.

Here’s a generic view of a video decoding example:

gst_vulkan_operation_begin (priv->exec, ...);

cmd_buf = priv->exec->cmd_buf->cmd;

gst_vulkan_operation_add_dependency_frame (exec, out,
    VK_PIPELINE_STAGE_2_VIDEO_DECODE_BIT_KHR,
    VK_PIPELINE_STAGE_2_VIDEO_DECODE_BIT_KHR);

/* assume an engine where out frames can be used for DPB frames, */
/* so a barrier for layout transition is required */
gst_vulkan_operation_add_frame_barrier (exec, out,
    VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT,
    VK_ACCESS_2_VIDEO_DECODE_WRITE_BIT_KHR,
    VK_IMAGE_LAYOUT_VIDEO_DECODE_DPB_KHR, NULL);

for (i = 0; i < dpb_size; i++) {
  gst_vulkan_operation_add_dependency_frame (exec, dpb_frame,
      VK_PIPELINE_STAGE_2_VIDEO_DECODE_BIT_KHR,
      VK_PIPELINE_STAGE_2_VIDEO_DECODE_BIT_KHR);
}

barriers = gst_vulkan_operation_retrieve_image_barriers (exec);
vkCmdPipelineBarrier2 (cmd_buf, &(VkDependencyInfo) {
    ...
    .pImageMemoryBarriers = barriers->data,
    .imageMemoryBarrierCount = barriers->len,
  });
g_array_unref (barriers);

vkCmdBeginVideoCodingKHR (cmd_buf, &decode_start);
vkCmdDecodeVideoKHR (cmd_buf, &decode_info);
vkCmdEndVideoCodingKHR (cmd_buf, &decode_end);

gst_vulkan_operation_end (exec, ...);

Here, just one memory barrier is required for the memory layout transition, but semaphores are required to signal when an output frame and its DPB frames are processed, so that, later, the output frame can be used as a DPB frame. Otherwise, the output frame might not be fully reconstructed when it’s used as a DPB frame for the next output frame, generating only noise.

And that’s all. Thank you.

by Víctor Jáquez at June 06, 2024 12:00 AM

June 05, 2024

Alberto Garcia

More ways to install software in SteamOS: Distrobox and Nix

Introduction

In my previous post I talked about how to use systemd-sysext to add software to the Steam Deck without modifying the root filesystem. In this post I will give a brief overview of two additional methods.

Distrobox

distrobox is a tool that uses containers to create a mutable environment on top of your OS.

Distrobox running in SteamOS

With distrobox you can open a terminal with your favorite Linux distro inside, with full access to the package manager and the ability to install additional software. Containers created by distrobox are integrated with the system so apps running inside have normal access to the user’s home directory and the Wayland/X11 session.

Since these containers are not stored in the root filesystem they can survive an OS update and continue to work fine. For this reason they are particularly suited to systems with an immutable root filesystem such as Silverblue, Endless OS or SteamOS.

Starting from SteamOS 3.5 the system comes with distrobox (and podman) preinstalled and it can be used right out of the box without having to do any previous setup.

For example, in order to create a Debian bookworm container simply open a terminal and run this:

$ distrobox create -i debian:bookworm debbox

Here debian:bookworm is the image that this container is created from (debian is the name and bookworm is the tag, see the list of supported tags here) and debbox is the name that is given to this new container.

Once the container is created you can enter it:

$ distrobox enter debbox

Or from the ‘Debian’ entry in the desktop menu -> Lost & Found.

Once inside the container you can run your Debian commands normally:

$ sudo apt update
$ sudo apt install vim-gtk3

Nix

Nix is a package manager for Linux and other Unix-like systems. It has the property that it can be installed alongside the official package manager of any distribution, allowing the user to add software without affecting the rest of the system.

Nix running in SteamOS

Nix installs everything under the /nix directory, and packages are made available to the user through a new entry in the PATH and a ~/.nix-profile symlink stored in the home directory.

Nix is many more things, including the basis of the NixOS operating system. Explaining Nix in more detail is beyond the scope of this blog post, but for SteamOS users these are perhaps its most interesting properties:

  • Nix is self-contained: all packages and their dependencies are installed under /nix.
  • Unlike software installed with pacman, Nix survives OS updates.
  • Unlike podman / distrobox, Nix does not create any containers. All packages have normal access to the rest of the system, just like native SteamOS packages.
  • Nix has a very large collection of packages; here is a search engine: https://search.nixos.org/packages

The only thing that Nix needs from SteamOS is help to set up the /nix directory so its contents are not stored in the root filesystem. This is already happening starting from SteamOS 3.5 so you can install Nix right away in single-user mode:

$ sudo chown deck:deck /nix
$ wget https://nixos.org/nix/install
$ sh ./install --no-daemon

This installs Nix and adds a line to ~/.bash_profile to set up the necessary environment variables. After that you can log in again and start using it. Here’s a very simple example (refer to the official documentation for more details):

# Install and run Midnight Commander
$ nix-env -iA nixpkgs.mc
$ mc

# List installed packages
$ nix-env -q
mc-4.8.31
nix-2.21.1

# Uninstall Midnight Commander
$ nix-env -e mc-4.8.31

What we have seen so far is how to install Nix in single-user mode, which is the simplest one and probably good enough for a single-user machine like the Steam Deck. The Nix project however recommends a multi-user installation, see here for the reasons.

Unfortunately the official multi-user installer does not work out of the box on the Steam Deck yet, but if you want to go the multi-user way you can use the Determinate Systems installer: https://github.com/DeterminateSystems/nix-installer

Conclusion

Distrobox and Nix are useful tools and they give SteamOS users the ability to add additional software to the system without having to modify the base operating system.

While for graphical applications the recommended way to install third-party software is still Flatpak, Distrobox and Nix give the user additional flexibility and are particularly useful for installing command-line utilities and other system tools.

by berto at June 05, 2024 03:53 PM

May 24, 2024

Víctor Jáquez

GStreamer Hackfest 2024

The last few weeks were a bit hectic. First, with a couple of friends, we biked around the southwest of the Netherlands for almost a week. The next week, the last one, I attended the 2024 Display Next Hackfest.

This week was Igalia’s Assembly meetings, and next week, along with other colleagues, I’ll be in Thessaloniki for the GStreamer Spring Hackfest.

I’m happy to meet friends from the GStreamer community again, to talk, and to move things forward related to Vulkan, VA-API, KMS, video codecs, etc.

by Víctor Jáquez at May 24, 2024 12:00 AM

May 23, 2024

Patrick Griffis

Introducing the WebKit Container SDK

Developing WebKitGTK and WPE has always had challenges, such as the number of dependencies or its fairly complex C++ codebase, which not all compiler versions handle well. To help with this, we’ve made a new SDK to make it easier.

Current Solutions

There have always been multiple ways to build WebKit and its dependencies on your host; however, this was never a great developer experience. Only very specific hosts could be “supported”, you often had to build a large number of dependencies, and the end result wasn’t very reproducible for others.

The current solution used by default is a Flatpak-based one. This was a big improvement for ease of use and excellent for reproducibility, but it introduced many challenges for development work. As it has a strict sandbox and provides read-only runtimes, it was difficult to use complex tooling/IDEs or develop third-party libraries in it.

The new SDK tries to take a middle ground between those two alternatives, isolating itself from the host to be somewhat reproducible, yet being a mutable environment flexible enough for a wide range of tools and workflows.

The WebKit Container SDK

At the core it is an Ubuntu OCI image with all of the dependencies and tooling needed to work on WebKit. On top of this we added some scripts to run/manage these containers with podman and aid in developing inside of the container. Its intention is to be as simple as possible and not change traditional development workflows.

You can find the SDK and follow the quickstart guide on our GitHub: https://github.com/Igalia/webkit-container-sdk

The main requirement is that this only works on Linux with podman 4.0+ installed, for example Ubuntu 23.10+.

In the simplest case, once you clone https://github.com/Igalia/webkit-container-sdk.git, using the SDK can be a few commands:

source /your/path/to/webkit-container-sdk/register-sdk-on-host.sh
wkdev-create --create-home
wkdev-enter

From there you can use WebKit’s build scripts (./Tools/Scripts/build-webkit --gtk) or CMake. As mentioned before, it is an Ubuntu installation, so you can easily install your favorite tools, like VSCode, directly. We even provide a wkdev-setup-vscode script to automate that.

Advanced Usage

Disposability

A workflow that some developers may not be familiar with is making use of entirely disposable development environments. Since these are isolated containers, you can easily make two. This allows you to do work in parallel that would otherwise interfere, without worrying about it, as well as to easily get back to a known good state:

wkdev-create --name=playground1
wkdev-create --name=playground2

podman rm playground1 # You would stop first if running.
wkdev-enter --name=playground2

Working on Dependencies

An important part of WebKit development is working on the dependencies of WebKit rather than itself, either for debugging or for new features. This can be difficult or error-prone with previous solutions. In order to make this easier we use a project called JHBuild which isn’t new but works well with containers and is a simple solution to work on our core dependencies.

Here is an example workflow working on GLib:

wkdev-create --name=glib
wkdev-enter --name=glib

# This will clone glib main, build, and install it for us. 
jhbuild build glib

# At this point you could simply test if a bug was fixed in a different version of glib.
# We can also modify and debug glib directly. All of the projects are cloned into ~/checkout.
cd ~/checkout/glib

# Modify the source however you wish then install your new version.
jhbuild make

Remember that containers are isolated from each other, so you can even have two terminals open with different builds of glib. This can also be used to test projects like Epiphany against your build of WebKit if you install it into the JHBUILD_PREFIX.

To Be Continued

In the next blog post I’ll document how to use VSCode inside of the SDK for debugging and development.

May 23, 2024 04:00 AM

May 22, 2024

Frédéric Wang

Flygskam and o caminho da Corunha

Prolegomenon

Early next June, I’m traveling from Paris to A Coruña for the Web Engines Hackfest and other internal Igalia events. In recent years I’ve done it by train, as previously mentioned. Some colleagues at Igalia were curious about it, so I decided to write this blog post, investigating possible ways to do it from various European places. I hope this can also motivate more people at Igalia and beyond who are still hesitant to give up alternatives with a heavier carbon footprint.

In addition to various trip planners, I’ve also used this nice map from Wikipedia, which gives a good overview of high-speed rail in Europe:

High Speed Railroad Map of Europe

I also sought advice from Nicolò Ribaudo who is quite familiar with train traveling and provided useful recommendations. In particular, he mentioned the ÖBB trip planner, which seems quite efficient combining trains from multiple operators.

I’ve focused on big European cities (with airports) that are close to Spain, but this is definitely not exhaustive. There is probably a lot more to discuss beyond trip planning, but hopefully this can be a good starting point.

Paris as a departure or connection

Based on my experience traveling from Paris, I found these direct trains between big cities:

  • Renfe offers several AVE and Alvia trains traveling every day between A Coruña and Madrid, with a duration of 3h30-4h 1. There are two important things to note:
    1. These local trains only go on sale maybe 2 months in advance at best.
    2. These trains connect to Madrid Chamartín, which is maybe 30 minutes away from Madrid Atocha by the Madrid metro.
  • Renfe also proposes even more options (say one every half hour during the day) between Barcelona and Madrid Atocha, with a duration of 2h30-3h 2.
  • The SNCF proposes two or three direct trains during the day between Paris Gare de Lyon and Barcelona, with a duration of 6h30-7h. If you are coming from Paris and want to take a train to Madrid, you will likely have to cross the station and pass through an X-ray baggage scanner, so be sure to keep enough time for the connection.
  • The Eurostar offers several options during the day to connect Paris with cities below. They connect to Gare du Nord or Gare de l’Est, which are very close to each other but ~30 minutes away from Gare de Lyon by public transport.

Personally, I’m doing this one-and-a-half-day trip (the inbound trip is similar):

  1. Take the train from Paris (9:42 AM) to Barcelona (4:33 PM).
  2. Keep enough time for the Barcelona Sants connection.
  3. Take a train from Barcelona to Madrid in the evening.
  4. Stay one night in Madrid.
  5. Take a train from Madrid to A Coruña in the morning.

From London, Amsterdam, Brussels, Berlin one could instead do a two-day trip:

  1. Travel to Paris in the morning 3.
  2. Keep enough time for the Paris connection.
  3. Take the train from Paris (2:42PM) to Barcelona (9:27PM).
  4. Stay one night in Barcelona.
  5. Travel from Barcelona to Madrid Atocha.
  6. Keep enough time for the Madrid Metro connection.
  7. Travel from Madrid Chamartín to A Coruña.

I also looked at the trip with the minimum number of connections to go to Barcelona from big cities in Switzerland, and a similar itinerary is possible. See later for alternatives.

Finally, Nicolò mentioned that ÖBB recently started running a night train from Berlin to Paris, which you can probably use to do a similar trip as mine.

Estimate of CO2 emissions

In order to estimate CO2 emissions for the trips suggested in the previous section, I compiled information from different sources: Eurostar, the SNCF, Ecopassenger and Ecotree.

There would be a lot to discuss about the methodology but the goal here is only to give a rough idea. Anyway, below is an estimate of kilograms of CO2 per passenger for some of the train trips previously mentioned:

Route                Eurostar  SNCF  Ecopassenger 4  Ecotree
Berlin ↔ Cologne     -         -     17-19           1-3
Cologne ↔ Paris      5.2       5.2   7.4             1-3
London ↔ Paris       2.4       2.4   1.7             1-3
Brussels ↔ Paris     1.6       1.8   1.8             1-2
Amsterdam ↔ Paris    2.6       2.9   9.3-9.5         1-3
Paris ↔ Barcelona    -         3.8   -               1-6
Barcelona ↔ Madrid   -         -     17-20           1-6
Madrid ↔ A Coruña    -         -     -               1-3

The best/worst cases are compiled into the following table and compared with a flight to A Coruña (with a connection in Madrid) as calculated by Ecotree. If we follow these data, a train from Berlin, London, Paris, Brussels or Amsterdam would at worst emit around 55 kg of CO2 per passenger and represent a reduction of at least around 90% compared to using a plane:

Departure   Flight by Madrid (Ecotree)  Train 5   CO2 reduction
Berlin      448                         6-55.4    ≥87%
London      388                         4-32      ≥91%
Brussels    396                         4-30.8    ≥92%
Amsterdam   396                         4-38.5    ≥90%
Paris       368                         3-29      ≥92%
Madrid      244                         1-3       ≥98%

More cities and trains

Disclaimer: This section is essentially based on information provided by Nicolò and what I found on the internet.

In addition to the SNCF train previously mentioned, Renfe proposes two trains that are quite important for traveling from France to Spain:

  • One train between Lyon and Barcelona per day (duration ~5h)
  • One train between Marseille and Madrid per day (duration ~8h)

From Switzerland, you can pass through the south of France using more trains/connections to arrive in Barcelona faster than what was previously proposed. For example, taking a local train from Genève to Lyon followed by the Lyon to Barcelona train mentioned above can be done in around 7h. From Zurich, with connections at Genève and Lyon, it takes around 10h, as opposed to 12h with a single connection in Paris.

From the Netherlands, Belgium or Germany it makes sense to consider the night trains from Paris to the border with Spain. Those trains do not have a fixed schedule, but vary depending on the weekday. Most of them arrive in Portbou, and from there you can take a regional train to Barcelona. Some of them arrive in Latour-de-Carol, and from there it’s three hours on a regional train to Barcelona. In any case, you’ll be early enough at the border that it’s possible to then arrive in A Coruña in the afternoon or evening. Rarely, the night train arrives at the border on the west coast, and continuing from there to A Coruña with the regional trains that go along the northern coast might be a good experience.

From Belgium it’s also possible to take a TGV from Brussels to Lyon, and then from there take the train to Barcelona. This avoids a stressful connection in Paris, where you need to move between the two stations Gare du Nord and Gare de Lyon.

From Italy, the main trouble is dealing with connections. The Turin–Lyon high-speed railway construction may help in the future, but it’s not running yet. The alternatives are either to go along the coast through Genova-Ventimiglia-Nice, or to take the EuroCity from Milan to Switzerland and finally get to France.

From Portugal, which is so close geographically and culturally to Galicia, we could think there should be an easy way to travel to A Coruña. But apparently neither Comboios de Portugal nor Renfe provides any direct train:

  • From Porto, the ÖBB trip planner suggests a connection at Vigo-Guixar for a total duration of 6h30. As a comparison, the website of the Web Engines Hackfest indicates a 3-hour trip by car and a 5-hour trip by bus.
  • Between Lisbon and Porto, Comboios de Portugal seems to propose trains taking 2h30-3h. You can combine that with the other trains to do the trip to A Coruña with one night in Porto or Vigo.

Last but not least, A Coruña railway station is a quarter-hour walk away from the Igalia office and a three-quarter-hour walk away from Palexco (the Web Engines Hackfest’s venue). This is more convenient than the A Coruña airport, which is around 10 km away from A Coruña.

Notes and references

  • Flygskam: anti-flying social movement started in Sweden, with the aim of reducing the environmental impact of aviation.
  • O Caminho de Santiago: The Way of St. James, a network of pilgrims’ ways leading to Santiago de Compostela (whose railway station you will likely stop at if you take a train to A Coruña).
  1. Incidentally, people travelling from far away are unlikely to find a direct flight to the A Coruña airport. In that case, using local trains from bigger airports like Madrid or Porto may be a better option. 

  2. Direct trains between A Coruña and Barcelona seem to be rarer, slower and no longer available as a night train. So a connection or a night stay in Madrid seems the best option. 

  3. Deutsche Bahn offers a lot of Berlin to Cologne trains per day with a duration of 4h-4h30, including early/late trains or night trains, that you can combine with the Cologne-Paris Eurostar. Deutsche Bahn also offers (non-direct) ICE trains to go from Paris to Berlin. 

  4. Ecopassenger gives information per train, so I provided some lower/upper bounds based on different trains. 

  5. Based on the previous data from Eurostar, SNCF, Ecopassenger, Ecotree, trying to find the lowest/highest sum for each individual segment. 

by Frédéric Wang at May 22, 2024 10:00 PM

May 20, 2024

Time travel debugging of WebKit with rr

Introduction

rr is a debugging tool for Linux that was originally developed by Mozilla for Firefox. It has long been adopted by Igalia and other web platform developers for Chromium and WebKit too. Back in 2019, there were breakout sessions on this topic at the Web Engines Hackfest and BlinkOn.

For WebKitGTK, the Flatpak SDK provides a copy of rr, but recently I was unable to use the instructions on trac.webkit.org. Fortunately, my colleague Adrián Pérez suggested using a direct build without flatpak or the bubblewrap sandbox, and that indeed solved my problem. I thought it might be interesting to share this information with others, so I decided to write this blog post.

Disclaimer: The build instructions below may be imperfect, will likely become outdated, and are in any case not a replacement for the official ones for WebKitGTK development. Use them at your own risk!

CMake configuration

The approach that worked for me was thus to perform a direct build from my system. I came up with the following configuration step:

cmake -S. -BWebKitBuild/Release \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_INSTALL_PREFIX=$HOME/WebKit/WebKitBuild/install/Release \
   -GNinja -DPORT=GTK -DENABLE_BUBBLEWRAP_SANDBOX=OFF \
   -DDEVELOPER_MODE=ON -DDEVELOPER_MODE_FATAL_WARNINGS=OFF \
   -DENABLE_TOOLS=ON -DENABLE_LAYOUT_TESTS=ON

where:

  • The -B option specifies the build directory, which is traditionally called WebKitBuild/ for the WebKit project.
  • CMAKE_BUILD_TYPE specifies the build type, e.g. optimized release builds (Release, corresponding to --release for the official script) or debug builds with assertions (Debug, corresponding to --debug) 1.
  • CMAKE_INSTALL_PREFIX specifies the installation directory, which I place inside WebKitBuild/install/ 2.
  • The -G option specifies the build system generator. I used Ninja, which is the default for the official script too.
  • -DPORT=GTK is for building WebKitGTK. I haven’t tested rr with other Linux ports.
  • -DENABLE_BUBBLEWRAP_SANDBOX=OFF was suggested by Adrián. The bubblewrap sandbox probably does not make sense without flatpak, so it should be safe to disable it anyway.
  • I extracted the other -D flags from the official script, trying to stay as close as possible to what it provides for WebKit development (being able to run layout tests, building the Tools/, ignoring fatal warnings, etc).

Needless to say, the advantage of using flatpak is that it automatically downloads and installs all the required dependencies. But if you do your own build, you need to figure out what they are and perform the setup manually. Generally, this is straightforward using your distribution’s package manager, but there can be some tricky exceptions 3.

While we are still at the configuration step, I believe it’s worth sharing two more tricks for WebKit developers:

  • You can use -DENABLE_SANITIZERS=address to produce Asan builds or builds with other sanitizers.
  • You can use -DCMAKE_CXX_FLAGS="-DENABLE_TREE_DEBUGGING" in release builds if you want to get access to the tree debugging functions (ShowRenderTree and the like). This flag is turned on by default for debug builds.

Building and running WebKit

Once the configure step is successful, you can build and install WebKit using the following CMake command 2.

cmake --build WebKitBuild/Release --target install

When that operation completes, you should be able to run MiniBrowser with the following command:

LD_LIBRARY_PATH=WebKitBuild/install/Release/lib ./WebKitBuild/Release/bin/MiniBrowser

For WebKitTestRunner, some extra environment variables are necessary 2:

TEST_RUNNER_INJECTED_BUNDLE_FILENAME=$HOME/WebKit/WebKitBuild/Release/lib/libTestRunnerInjectedBundle.so LD_LIBRARY_PATH=WebKitBuild/install/Release/lib ./WebKitBuild/Release/bin/WebKitTestRunner filename.html

You can also use the official scripts, Tools/Script/run-minibrowser and Tools/Script/run-webkit-tests. They expect some particular paths, but a quick workaround is to use a symbolic link:

ln -s $HOME/WebKit/WebKitBuild $HOME/WebKit/WebKitBuild/GTK

Using rr for WebKit debugging

rr is generally easily installable from your distribution’s package manager. However, as stated on the project wiki page:

Support for the latest hardware and kernel features may require building rr from Github master.

Indeed, using the source has always worked best for me to avoid mysterious execution failures when starting the recording 4.

If you are not familiar with rr, I strongly invite you to take a look at the overview on the project home page or at some of the references I mentioned in the introduction. In any case, the first step is to record a trace by passing the program and arguments to rr. For example, to record a trace for MiniBrowser:

LD_LIBRARY_PATH=WebKitBuild/install/Debug/lib rr ./WebKitBuild/Debug/bin/MiniBrowser https://www.igalia.com/

After the program exits, you can replay the recorded trace as many times as you want. For hard-to-reproduce bugs (e.g. non-deterministic issues or ones involving a lot of manual steps), that means you only need to be able to record and reproduce the bug once and then can just focus on debugging. You can even turn off your machine after hours of exhausting debugging, then continue the effort later when you have more time and energy! The trace is played in a deterministic way, always using the same timing and pointer addresses. You can use most gdb commands (to run the program, interrupt it, and inspect data), but the real power comes from new commands to perform reverse execution!

Before coming to that, let’s explain how to handle programs with multiple processes, which is the case for WebKit and modern browsers in general. After you recorded a trace, you can display the pids of all recorded processes using the rr ps command. For example, we can see in the following output that the MiniBrowser process (pid 24103) actually forked three child processes, including the Network Process (pid 24113) and the Web Process (24116):

PID     PPID    EXIT    CMD
24103   --      0       ./WebKitBuild/Debug/bin/MiniBrowser https://www.igalia.com/
24113   24103   -9      ./WebKitBuild/Debug/bin/WebKitNetworkProcess 7 12
24115   24103   1       (forked without exec)
24116   24103   -9      ./WebKitBuild/Debug/bin/WebKitWebProcess 15 15

Here is a small debugging session similar to the single-process example from Chromium Chronicle #13 5. We use the option -p 24116 to attach the debugger to the Web Process and -e to start debugging from where it exited:

rr replay -p 24116 -e
(rr) break RenderFlexibleBox::layoutBlock
(rr) rc # Run back to the last layout call
Thread 2 hit Breakpoint 1, WebCore::RenderFlexibleBox::layoutBlock (this=0x7f66699cc400, relayoutChildren=false) at /home/fred/src-obj/WebKit/Source/WebCore/rendering/RenderFlexibleBox.cpp:420
(rr) # Inspect anything you want here. To find the previous Layout call on this object:
(rr) cond 1 this == 0x7f66699cc400
(rr) rc
Thread 2 hit Breakpoint 1, WebCore::RenderFlexibleBox::layoutBlock (this=0x7f66699cc400, relayoutChildren=false) at /home/fred/src-obj/WebKit/Source/WebCore/rendering/RenderFlexibleBox.cpp:420
420     {
(rr) delete 1
(rr) watch -l m_style.m_nonInheritedFlags.effectiveDisplay # Or find the last time the effective display was changed
Thread 4 hit Hardware watchpoint 2: -location m_style.m_nonInheritedFlags.effectiveDisplay

Old value = 16
New value = 0
0x00007f6685234f39 in WebCore::RenderStyle::RenderStyle (this=0x7f66699cc4a8) at /home/fred/src-obj/WebKit/Source/WebCore/rendering/style/RenderStyle.cpp:176
176     RenderStyle::RenderStyle(RenderStyle&&) = default;

rc is an abbreviation for reverse-continue and continues execution backward. Similarly, you can use reverse-next, reverse-step and reverse-finish commands, or their abbreviations. Notice that the watchpoint change is naturally reversed compared to normal execution: the old value (sixteen) is the one after initialization, while the new value (zero) is the one before initialization!

Restarting playback from a known point in time

rr also has a concept of “event” and associates a number to each event it records. They can be obtained by the when command, or printed to the standard output using the -M option. To elaborate a bit more, suppose you add the following printf in RenderFlexibleBox::layoutBlock:

@@ -423,6 +423,8 @@ void RenderFlexibleBox::layoutBlock(bool relayoutChildren, LayoutUnit)
     if (!relayoutChildren && simplifiedLayout())
         return;

+    printf("this=%p\n", this);
+

After building, recording and replaying again, the output should look like this:

$ rr -M replay -p 70285 -e # replay with the new PID of the web process.
...
[rr 70285 57408]this=0x7f742203fa00
[rr 70285 57423]this=0x7f742203fc80
[rr 70285 57425]this=0x7f7422040200
...

Each printed output is now annotated with two numbers in bracket: a PID and an event number. So in order to restart from when an interesting output happened (let’s say [rr 70285 57425]this=0x7f7422040200), you can now execute run 57425 from the debugging session, or equivalently:

rr replay -p 70285 -g 57425

Older traces and parallel debugging

Another interesting thing to know is that traces are stored in ~/.local/share/rr/ and you can always specify an older trace to the rr command e.g. rr ps ~/.local/share/rr/MiniBrowser-0. Be aware that the executable image must not change, but you can use rr pack to be able to run old traces after a rebuild, or even to copy traces to another machine.
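For example, to make the MiniBrowser-0 trace mentioned above self-contained and archive it for another machine, something like the following can work (the tarball name is just illustrative):

rr pack ~/.local/share/rr/MiniBrowser-0
tar czf minibrowser-trace.tar.gz -C ~/.local/share/rr MiniBrowser-0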

To be honest, most of the time I’m just using the latest trace. However, one thing I’ve sometimes found useful is what I would call the “parallel debugging” technique. Basically, I’m recording one trace for a testcase that exhibits the bug and another one for a very similar testcase (e.g. with one CSS property difference) that behaves correctly. Then I replay the two traces side by side, comparing them to understand where the issue comes from and what can be done to fix it.
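Concretely, that just means replaying the two recorded traces in two terminals side by side (the trace names below are hypothetical); the -p and -e options described earlier can be used on each of them as well:

# Terminal 1: trace recorded for the testcase exhibiting the bug.
rr replay ~/.local/share/rr/MiniBrowser-1

# Terminal 2: trace recorded for the very similar testcase that behaves correctly.
rr replay ~/.local/share/rr/MiniBrowser-2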

The usage documentation also provides further tips, but this should be enough to get you started with time travel debugging in WebKit!

  1. RelWithDebInfo build type (which yields an optimized release build with debug symbols) might also be interesting to consider in some situations, e.g. debugging bugs that reproduce in release builds but not in debug builds. 

  2. Using an installation directory might not be necessary, but without that, I had trouble making the whole thing work properly (wrong libraries loaded or libraries not found).  2 3

  3. In my case, I chose the easiest path to disable some features, namely -DUSE_JPEGXL=OFF, -DUSE_LIBBACKTRACE=OFF, and -DUSE_GSTREAMER_TRANSCODER=OFF

  4. Incidentally, you are likely to get an error saying that perf_event_paranoid is required to be at most 1, which you can force using sudo sysctl kernel.perf_event_paranoid=1

  5. The equivalent example would probably have been to watch for the previous style change with watch -l m_style, but this was exceeding my hardware watchpoint limit, so I narrowed it down to a smaller observation scope. 

by Frédéric Wang at May 20, 2024 10:00 PM

May 15, 2024

Manuel A. Fernandez Montecelo

WiFi back-ends in SteamOS 3.6

As of NetworkManager v1.46.0 (and a very long while before that, since 1.12 released in 2018), there are two supported wifi.backend values: wpa_supplicant (default) and iwd (experimental).

iwd is still considered experimental after all these years; see the list of issues with the iwd backend, and especially this one.

In SteamOS, however, the default is iwd. It works relatively well in most scenarios, and the main advantage is that, when waking up the device (think of the Steam Deck returning from sleep, the "suspended to RAM" or S3 sleeping state), it regains network connectivity much faster -- typically 1-2 seconds vs. 5+ for wpa_supplicant.

However, if for some reason iwd doesn't work properly in your case, you can give wpa_supplicant a try.

With steamos-wifi-set-backend command

In 3.6, the back-end can be changed via command.

3.6 has been available in Main channel since late March and in Beta/Preview for several days, for example 20240509.100.

If you haven't done this before: to be able to run it, you must either log in via SSH (not available out of the box), or switch to desktop-mode, open a terminal and type the commands there.

Assuming that one wants to change from the default iwd to wpa_supplicant, the command is:

steamos-wifi-set-backend wpa_supplicant

The script tries to ensure that things are in place in terms of some systemd units being stopped and others brought up as appropriate. After a few seconds, if it's able to connect correctly to the WiFi network, the device should get connectivity (e.g. ping works). It is not necessary to enter the preferred networks and passwords again; NetworkManager feeds the back-ends with the necessary config and credentials.

It is possible to switch back and forth between back-ends by using alternatively iwd and wpa_supplicant as many times as desired, and without restarting the system.

The current back-end as per configuration can be obtained with:

steamos-wifi-set-backend --check

And finally, this is the output of changing to wpa_supplicant and back to iwd (i.e., the output that you should expect):

(deck@steamdeck ~)$ steamos-wifi-set-backend --check
iwd

(deck@steamdeck ~)$ steamos-wifi-set-backend wpa_supplicant
INFO: switching back-end to 'wpa_supplicant'
INFO: stopping old back-end service and restarting NetworkManager,
      networking will be disrupted (hopefully only momentary blip, max 10 seconds)...
INFO: restarting done
INFO: checking status of services ...
      (sleeping for 2 seconds to catch-up with state)
INFO: status OK

(deck@steamdeck ~)$ steamos-wifi-set-backend --check
wpa_supplicant

(deck@steamdeck ~)$ ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=56 time=16.4 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=56 time=24.8 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=56 time=16.5 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 16.370/19.195/24.755/3.931 ms

(deck@steamdeck ~)$ steamos-wifi-set-backend iwd
INFO: switching back-end to 'iwd'
INFO: stopping old back-end service and restarting NetworkManager,
      networking will be disrupted (hopefully only momentary blip, max 10 seconds)...
INFO: restarting done
INFO: checking status of services ...
      (sleeping for 2 seconds to catch-up with state)
INFO: status OK

(deck@steamdeck ~)$ steamos-wifi-set-backend --check
iwd

(deck@steamdeck ~)$ ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=56 time=16.0 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=56 time=16.5 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=56 time=21.3 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 15.991/17.940/21.295/2.382 ms

By hand

⚠️ IMPORTANT ⚠️ : Please do this only if you know what you are doing, don't fear minor breakage, and are confident that you can recover from the situation if something goes wrong. If unsure, it is better to wait until the channel that you run supports the commands to handle this automatically.

If you are running a channel that still does not have this command, e.g. Stable at the time of writing, you can achieve the same effect by executing the steps performed by the script by hand.

Note, however, that you might easily lose network connectivity if you make a mistake or something goes wrong. So be prepared to restore things by hand, by switching to desktop-mode and having a terminal ready.

Now, the actual instructions. In /etc/NetworkManager/conf.d/wifi_backend.conf, change the value of wifi.backend; for example, comment out the default line with iwd and add a line using wpa_supplicant as the value:

[device]
#wifi.backend=iwd
wifi.backend=wpa_supplicant
wifi.iwd.autoconnect=yes

⚠️ IMPORTANT ⚠️ : Leave the other lines alone. If you don't have experience with the format, you can easily make the file unparseable, and then networking could stop working entirely.

After that, things get more complicated depending on the hardware on which the system runs.

If the device is a Steam Deck, there is a difference between the original LCD and the newer OLED model. With the OLED model, when you switch away from iwd, the network device is typically destroyed, so it's necessary to re-create it. This happens automatically when booting the system, so the easiest fix, if the network doesn't come up within 15 seconds of following the instructions below, is to just restart the whole system.

If you're running SteamOS on Steam Deck models that don't exist at the time of writing, or on other types of computers, the situation can be similar to the OLED case or even more complex, depending on the WiFi hardware and drivers. So, again, tread carefully.

In the simplest scenario, the LCD model, execute the following commands as root (or prepend sudo), with OLD_BACKEND being either iwd or wpa_supplicant (whichever one you want to migrate away from):

systemctl stop NetworkManager
systemctl disable --now OLD_BACKEND
systemctl restart NetworkManager

If you have the OLED model and are switching from wpa_supplicant to iwd, that should be enough as well.

But if you have the OLED model and you changed from iwd to wpa_supplicant, as explained above, your system will not be able to connect to the network, and the easiest fix is to restart the system.

However, if you want to try to solve the problem by hand without restarting, then execute these commands as root/sudo:

systemctl stop NetworkManager
systemctl disable --now OLD_BACKEND

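# re-create the wlan0 interface if it is missing (it can be destroyed when stopping iwd, e.g. on the OLED model)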
if ! iw dev wlan0 info &>/dev/null; then iw phy phy0 interface add wlan0 type station; fi

systemctl restart NetworkManager

In any case, if after trying to change the back-end by hand and rebooting the device is still unable to connect, the best thing to do is probably to restore the previous configuration, restart the system once more, and wait until your channel is upgraded to include this command.

Change back-ends via UI?

In the future, it is conceivable that the UI will allow changing WiFi back-ends with the help of this command or by other means, but this is not possible at the moment.

by Manuel A. Fernandez Montecelo at May 15, 2024 09:21 AM

May 12, 2024

Alex Bradbury

Notes from the Carbon panel session at EuroLLVM 2024

Last month I had the pleasure of attending EuroLLVM, which featured a panel session on the Carbon programming language. It was recorded and of course we all know automated transcription can be stunningly accurate these days, yet I still took fairly extensive notes throughout. I often take notes (or near transcriptions) as I find it helps me process. I'm not sure whether I'm adding any value by writing this up, or just contributing entropy, but here it goes. You should of course assume factual mistakes or odd comments are errors in my transcription or interpretation, and if you keep an eye on the LLVM YouTube channel you should find the session recording uploaded there in the coming months.

First, a bit of background on the session, Carbon, and my interest in it. Carbon is a programming language started by Google aiming to be used in many cases where you would currently use C++ (billed as "an experimental successor to C++"). The Carbon README gives a great description of its goals and purpose, but one point I'll highlight is that ease of interoperability with C++ is a key design goal and constraint. The project recognises that this limits the options for the language to such a degree that it explicitly recommends that if you are able to make use of a modern language like Go, Swift, Kotlin, or Rust, then you should. Carbon is intended to be there for when you need that C++ interoperability. Of course, (and as mentioned during the panel) there are parallel efforts to improve Rust's ability to interoperate with C++.

Whatever other benefits the Carbon project is able to deliver, I think there's huge value in the artefacts being produced as part of the design and decision making process so far. This has definitely been true of languages like Swift and especially Rust, where going back through the development history there's masses of discussion on difficult decisions e.g. (in the case of Rust) green threads, the removal of typestate, or internal vs external iterators. The Swift Evolution and Rust Internals forums are still a great read, but obviously there's little ability to revisit fundamental decisions at this point. I try to follow Carbon's development due to a combination of its openness, the fact it's early stage enough to be making big decisions (e.g. lambdas in Carbon), and also because they're exploring different design trade-offs than those other languages (e.g. the Carbon approach to definition-checked generics). I follow because I'm a fan of programming language design discussion, not necessarily because of the language itself, which may or may not turn out to be something I'm excited to program in. As a programming language development lurker I like to imagine I'm learning something by osmosis if nothing else - but perhaps it's just a displacement activity...

If you want to keep up with Carbon, then check out their recently started newsletter, issues/PRs/discussions on GitHub and there's a lot going on on Discord too (I find group chat distracting so don't really follow there). The current Carbon roadmap is of course worth a read too. In case I wasn't clear enough already, I have no affiliation to the project. From where I'm standing, I'm impressed by the way it's being designed and developed out in the open, more than (rightly or wrongly) you might assume given the Google origin.

Notes from the panel session

The panel "Carbon: An experiment in different tradeoffs" took place on April 11th at EuroLLVM 2024. It was moderated by Hana Dusíková and the panel members were Chandler Carruth, Jon Ross-Perkins, and Richard Smith. Rather than go through each question in the order they were asked I've tried to group together questions I saw as thematically linked. Anything in quotes should be a close approximation of what was said, but I can't guarantee I didn't introduce errors.

Background and basics of Carbon

This wasn't actually the first question, but I think a good starting point is 'how would you sell Carbon to a C++ user?'. Per Chandler, "we can provide a better language, tooling, and ecosystem without needing to leave everything you built up in C++ (both existing code and the training you had). The cost of leveraging carbon should have as simple a ramp as possible. Sometimes you need performance, we can give more advanced tools to help you get the most performance from your code. In other cases, it's security and we'll be able to offer more tools here than C++. It's having the ability to unlock improvements in the language without having to walk away from your investments in C++, that of course isn't going anywhere."

When is it going to be done? Chandler: "When it's ready! We're trying to put our roadmap out there publicly where we can. It's a long term project. Our goal for this year is to get the toolchain caught up with the language design, get into building practical C++ interop this year. There are many unknowns in terms of how far we'll get this year, but next year I think you'll see a toolchain that works and you can do interesting stuff with and evaluate in a more concrete context."

As for the size of the team and how many companies are contributing, the conclusion was that Google is definitely the main backer right now but there are others starting to take a look. There are probably about 5-10 people active on Carbon in any given week, but it varies, so this can be a different set of people from week to week.

Given we've established that Google are still the main backer, one line of questioning was about what Google see in it and how management were convinced to back it. Richard commented "I think we're always interested in exploring new possibilities for how to deal with our existing C++ codebase, which is a hugely valuable asset for us. That includes both looking at how we can keep the C++ language itself happy and our tools for it being good and maintainable, but also places we might take it in the future. For a long time we've been talking about if we can build something that works better for our use case than existing tools, and had an opportunity to explore that and went with it."

A fun question that followed later aimed to establish the favourite thing about Carbon from each of the panel members (of course, it's nice that much of this motivation overlaps with the reasons for my own interest!):

  • Jon: For me it's unlocking the ability to improve the developer experience. Improving the language syntax, making things more understandable. Also C++ is very slow to compile and we're hoping for 1/3rd of the compile time for a typical file vs Clang.
  • Richard: from a language design perspective, being able to go through and systematically evaluate every decision and look at them again using things that have been learned since then.
  • Chandler: It's not so much the systematic review for me. We're building this with an eye to C++, and to integrate with C++ need a fairly big complex language (generics, inheritance, overloading etc etc). But when we look at this stuff, we keep finding principled and small/focused designs that we can use to underpin the sort of language functionality that is in C++ but fits together in a coherent and principled way. And we write this down. I think a lot of people don't realise that Carbon is writing down all the design as we go rather than having some mostly complete starting point and then making changes to that using something like an RFC process.

Finally, (for this section) there was also some intriguing discussion about Carbon's design tradeoffs. This is covered mostly elsewhere, but I think Chandler's answer focusing on the philosophy of Carbon rather than technical implementation details fits well here: "One of the philosophical goals of our language design is we don't try to solve hard problems in the compiler. [In other languages] there's often a big problem and the compiler has to solve it, freeing the programmer from worrying about it. Instead we provide the syntax and ask the programmer to tell us. e.g. for type inference, you could imagine having Hindley-Milner type inference. That's a hard problem, even if it makes ergonomics better in some cases. So we use a simpler and more direct type system. You see this crop up all over the language design."

Carbon governance

The first question related to organisation of the project and its governance was about what has been done to make it easier for contributors to get involved. My interpretation of the responses is that although they've had some success in attracting new contributors, the feeling is that there's more that could be done here. e.g. from Jon "Things like maintaining the project documentation, writing a start guide on how to build. We're making some different infrastructure choices, e.g. bazel, but in general we're trying to use things people are familiar with: GitHub, GH actions etc. There's probably still room for improvement. Listening to the LLVM newcomers session a couple of days ago gave me some ideas. Right now there's probably a lot of challenge learning how the different model works in Carbon."

The governance model for Carbon in general was something Chandler had a lot to say about:

  • On getting it in place early: "I was keen we start off with it, seeing challenges with C++ or even LLVM where as a project gets big the governance model that worked well at the beginning might not work so well and changing it at that point can be really hard. So I wanted to set up something that could scale, and can be changed if needed."
  • Initial mistakes: "The initial model was terrible. It was very elaborate, a large core team of people that pulled in people that weren't even involved in the project. We'd have weekly/biweekly meetings, a complex consensus process. In some ways it worked, we were being intentional and it was clear there was a governance project. But it was so inefficient, so people tried to avoid it. So we revamped it and simplified it, distilling it to something a little closer to the benevolent dictator for life model with minimal patches. Three leads rather than one to add some redundancy, one top-level governance body, will eventually have a rotation there."
  • The model now: "It's not that we'll try to agree on everything - if a change is uncontroversial then any one of the leads can approve, which gives us 3x bandwidth. Except for the cases if there is controversy, then the three of us step back and discuss. I had my personal pet favourite feature of carbon overturned by this governance model. The model allows us to make decisions quickly and easily, and most decisions are reversible if they turn out to be wrong. So we don't try to optimise for getting to the best outcome, rather to minimise cost for getting to an outcome. We're still early - we have a plan for scaling but that's more in the future."

Further questioning led to additional points being made about the governance model, such as the fact there are now several community members unaffiliated to Google with commit privileges, that the project goals are documented which helps contributors understand if a proposal is likely to be aligned with Carbon or not, and that these goals are potentially changeable if people come in with good arguments as to why they should be updated.

Carbon vs other languages

Naturally, there were multiple questions related to Carbon vs Rust. In terms of high-level comparisons, the panel members were keen to point out that Rust is a great language and is mature, while Carbon remains at an early stage. There are commonalities in terms of the aim to provide modern type system features to systems programmers, but the focus on C++ interoperability as a goal driving the language design is a key differentiator.

Thoughts about Rust naturally lead to questions about Carbon's memory safety story, and whether Carbon plans to have something like Rust's borrow checker. Richard commented "I think it's possible to get something like that. We're not sure exactly what it will look like yet, we're planning to look at this more between the 0.1 and 0.2 milestone. Borrow checking is an interesting option to pursue but there are some others to explore."

Concerns were also raised about whether, given that C++ interop is such a core goal of Carbon, it may have problems as C++ continues to evolve (perhaps with features that clash with features added in Carbon). Chandler's response was "As long as clang gets new C++ features, we'll get them. Similar to how swift's interop works, linking the toolchain and using that to understand the functionality. If C++ starts moving in ways addressing the safety concerns etc that would be fantastic. We could shut down carbon! I don't think that's very likely. In terms of functionality that would make interop not work well that's a bit of a concern, but of course if C++ adds something they need it to interop with existing C++ code so we face similar constraints," while Richard commented "C++ modules is a big one we need to keep an eye on. But at the moment, as modules adoption hasn't taken off in a big way yet, we're still targeting header files." There was also a brief exchange about what would happen if Carbon gets reflection before C++, with the hope that if it did happen it could help with the design process by giving another example for C++ to learn from (just as Carbon learns from Rust, Swift, Go, C++, and others).

Compiler implementation questions

EuroLLVM is of course a compilers conference, so there was a lot of curiosity about the implementation of the Carbon toolchain itself. In discussing trade-offs in the design, Jon leapt into a discussion of the choice "to go for SemIR vs a traditional AST. In Clang the language is represented by an AST, which is what you're typically taught about how compilers work. We have a semantic IR model where we produce lex tokens, then a parse tree (slightly different to an AST) and then it becomes SemIR. This is very efficient and fast to process, lowers to LLVM IR very easily, but it's a different model and a different way to think about writing a compiler. To the earlier point about newcomers, it's something people have to learn and a bit of a barrier because of that. We try to address it, but it is a trade-off." Jon has since provided more insight on Carbon's implementation approach in a recent /r/ProgrammingLanguages thread.

Given what Carbon devs have learned about applying data oriented design to optimise the frontend (see talks like Modernizing Compiler Design for Carbon Toolchain), could the same ideas be applied to the backend? Chandler commented on the tradeoff he sees "By necessity with data-oriented design you have to specialise a lot. We think we can benefit from this a lot with the frontend at the cost of reusability. The trade-off might be different within LLVM."

When asked about whether the Carbon frontend may be reusable for other tools (reducing duplicated effort writing parsers for Carbon), Jon responded "I'm pretty confident at least through the parse tree it will be reusable. I'm more hesitant to make that claim through to SemIR. It may not be the best choice for everyone, and in some cases you may want more of an AST. But providing these tools as part of the language project like a formatter, migration tool [for language or API changes], something like clang-tidy, we're going to be taking on these costs ourselves which is going to incentivise us to find ways of amortising this and finding the right balance."


Article changelog
  • 2024-05-13: Various phrasing tweaks and fixed improper heading nesting.
  • 2024-05-12: Initial publication date.

May 12, 2024 12:00 PM

May 09, 2024

Igalia Compilers Team

MessageFormat 2.0: A domain-specific language?

I recommend reading the prequel to this post, "MessageFormat 2.0: a new standard for translatable messages"; otherwise what follows won't make much sense!

This blog post represents the Igalia compilers team; all opinions expressed here are our own and may not reflect the opinions of other members of the MessageFormat Working Group or any other group.

What we've achieved #

In the previous post, I described the work done by the members of the MF2 Working Group to develop a specification for MF2, a process that spanned several years. When I joined the project in 2023, my task was to implement the spec and produce a back-end for the JavaScript Intl.MessageFormat proposal.

|-------------------------------|
| JavaScript code using the API |
|-------------------------------|
| browser engine                |
|-------------------------------|
| MF2 in ICU                    |
|-------------------------------|

For code running in a browser, the stack looks like this: web developers write JavaScript code that uses the API; their code runs in a JavaScript engine that's built into a Web browser; and that browser makes API calls to the ICU library in order to execute the JS code.

The bottom layer of the stack is what I've implemented as a tech preview. Since the major JavaScript engines are implemented in C++, I worked on ICU4C, the C++ version of ICU.

The middle level is not yet implemented. A polyfill makes it possible to try out the API in JavaScript now.

Understanding MF2 #

To understand the remainder of this post, it's useful to have read over a few more examples from the MF2 documentation.

Is MF2 a programming language? (And does it matter?) #

Different members of the working group have expressed different opinions on whether MF2 is a programming language:

  • Addison Phillips: "I think we made a choice (we can reconsider it if necessary, although I don't think we need to) that MF2 is not a resource format. I'm not sure if 'templating language' is the right characterization, but let's go with it for now." (comment on pull request 474, September 2023)
  • Stanisław Małolepszy: "...we’re neither resource format nor templating language. I tend to think we’re a storage format for variants." (comment on the same PR)
  • Mihai Niță: "For me, this is a DSL designed for i18n / l10n." (comment on PR 507)

MF2 lacks block structure and complex control flow: the .match construct can't be nested, and .local variable declarations are sequential rather than nested. Some people might expect a general-programming language to include these features.

But in implementing it, I have found that several issues arise that are familiar from my experience studying, and implementing, compilers and interpreters for more conventional programming languages.

I wrote in part 1 that when using either MF1 or MF2, messages can be separated from code -- but it turns out that maybe, messages are code! But they constitute code in a different programming language, which can be cleanly separated from the host programming language, just as with uninterpreted strings.

Abstract syntax #

As with a programming language, the syntax of MF2 is specified using a formal grammar, in this case using Augmented Backus-Naur Form. In most compilers and interpreters, a parser (that's either generated from a specification of the formal grammar, or written by hand to closely follow that grammar) reads the text of the program and generates an abstract syntax tree (AST), which is an internal representation of the program.

MF2 has a data model, which is like an AST. Users can either write messages in the text-based syntax (probably easiest for most people) or construct a data model directly using part of the API (useful for things like building external tools).

Parsers and ASTs are familiar concepts for programming language implementors, and writing the "front-end" for MF2 (the parser, and classes that define the data model) was not that different from writing a front-end for a programming language. Because it works on ASTs, the formatter itself (the part that does all the work once everything is parsed) also looks a lot like an interpreter. The "back-end", producing a formatted result, is currently quite simple since the end result is a flat string -- however, in the future, "formatting to parts" (necessary to easily support features like markup) will be supported.

For more details, see Eemeli Aro's FOSDEM 2024 talk on the data model.

Naming #

MF2 has local variable declarations, like the let construct in JavaScript and in some functional programming languages. That means that in implementing MF2, we've had to think about evaluation strategies (lazy versus eager evaluation), scope, and mutability. MF2 settled on a spec in which all local variables are immutable and no name shadowing is permitted except in a very limited way, with .input declarations. Free variables are permitted, and don't need to be declared explicitly: these correspond to runtime arguments provided to a message. Eager and lazy implementations can both conform to the spec, since variables are immutable, and lazy implementations are free to use call-by-need semantics (memoization). These design choices weren't obvious, and in some cases, inspired quite a lot of discussion.
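As a small illustration (a minimal sketch; the variable names are made up for this post, and the options shown are field options of the built-in :datetime function), a message with both kinds of declaration might look like this, where .input annotates the external argument $date and .local binds an immutable local name that the pattern can reuse without repeating the annotation:

.input {$date :datetime month=long day=numeric}
.local $when = {$date}
{{Your order shipped on {$when}. If nothing has arrived a week after {$when}, contact support.}}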

Naming can have subtle consequences. In at least one case, the presence of .local declarations necessitates a dataflow analysis (albeit a simple one). The analysis is to check for "missing selector annotation" errors; the details are outside the scope of this post, but the point is just that error checking requires "looking through" a variable reference, possibly multiple chains of references, to the variable's definition. Sounds like a programming language!

Functions #

Placeholders in MF2 may be expressions with annotations. These expressions resemble function calls in a more conventional language. Unlike in most languages, functions cannot be implemented in MF2 itself, but must be implemented in a host language. For example, in the ICU4C implementation, the functions would either be written in C++ or (less likely) another language that has a foreign function interface with C++.

Functions can be built-in (required in the spec to be provided), or custom. Programmers using an implementation of the MF2 API can write their own custom functions that format data ("formatters"), that customize the behavior of the .match construct ("selectors"), or do both. Formatters and selectors have different interfaces.

The term "annotation" in the MF2 spec is suggestive. For example, a placeholder like {$x :number} can be read: "x should be formatted as a number." But custom formatters and selectors may also do much more complicated things, as long as they conform to their interfaces. An implementor could define a custom :squareroot function, for example, such that {$x :squareroot} is replaced with a formatted number whose value is the square root of the runtime value of $x. It's unclear how common use cases like this will be, but the generality of the custom function mechanism makes them possible.

Wherever there are functions, functional programmers think about types, explicit or implicit. MF2 is designed to use a single runtime type: it's unityped. The spec uses the term "resolved value" to refer to the type of its runtime values. Moreover, the spec tries to avoid constraining the structure of "resolved values", to allow the implementation as much freedom as possible. However, the implementation has to call functions written in a host language that may have a different type system. The challenge comes in defining the "function registry", which is the part of the MF2 spec that defines both the names and behavior of built-in functions, and how to specify custom functions.

Different host languages have different type systems. Specifying the interface between MF2 and externally-defined functions is tricky, since the goal is to be able to implement MF2 in any programming language.

While a language that can call foreign functions but not define its own functions is unusual, defining foreign function interfaces that bridge the gaps between disparate programming languages is a common task when designing a language. This also sounds like a programming language.

Pattern matching #

The .match construct in MF2 looks a lot like a case expression in Haskell, or perhaps a switch statement in C/C++, depending on your perspective. Unlike switch, .match allows multiple expressions to be examined in order to choose an alternative. And unlike case, .match only compares data against specific string literals or a wildcard symbol, rather than destructuring the expressions being scrutinized.

The ability provided by MF2 to customize pattern-matching is unusual. An implementation of a selector function takes a list of keys, one per variant, and returns a sorted list of keys in order of preference. The list is then used by the pattern matching algorithm in the formatter to pick the best-matching variant given that there are multiple selectors. An abstract example is:

.match {$x :X} {$y :Y}
A1 A2 {{1}}
B1 B2 {{2}}
C1 * {{3}}
* C2 {{4}}
* * {{5}}

In this example, the implementation of X would be called with the list of keys [A1, B1, C1] (the * key is not passed to the selector implementation) and returns a list of the same keys, arranged in any order. Likewise, the implementation of Y would be called with [A2, B2, C2].

I don't know of any existing programming languages with a pattern-matching construct like this one; usually, the comparison between values and patterns is based on structural equality and can't be abstracted over. But the ability to factor out the workings of pattern matching and swap in a new kind of matching (by defining a custom selector function) is a kind of abstraction that would be found in a general-purpose programming language. Of course, the goal here is not to match patterns in general but to select a grammatically correct variant depending on data that flows in at runtime.

But is it Turing-complete? #

MF2 has no explicit looping constructs or recursion, but it does have function calls. The details of how custom functions are realized are implementation-specific; typically, using a general-purpose programming language. That means that MF2 can invoke code that does arbitrary computation, but not express it. I think it would be fair to say that the combination of MF2 and a suitable registry of custom functions is Turing-complete, but MF2 itself is not Turing-complete. For example, imagine a custom function named eval that accepts an arbitrary JavaScript program as a string, and returns its output as a string. This is not how MF2 is intended to be used; the spec notes that "execution time SHOULD be limited" for invocations of custom functions. I'm not aware of any other languages whose Turing-completeness depends on their execution environment. (Though there is at least one lengthy discussion of whether CSS is Turing-complete.)
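For concreteness, such a (purely hypothetical) message could be as small as a single placeholder whose operand is a quoted literal handed to the imagined :eval function:

{|6 * 7| :eval}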

If custom functions were removed altogether and the registry of functions was limited to a small built-in set, MF2 would look much less like a programming language; its underlying "instruction set" would be much more limited.

Code versus data #

There is an old Saturday Night Live routine: “It’s a floor wax! It’s a dessert topping! It’s both!” XML is similar. “It’s a database! It’s a document! It’s both!” -- Philip Wadler, "XQuery: a typed functional language for querying XML" (2002)

The line between code and data isn't always clear, and the MF2 data model can be seen as a representation of the input data, rather than as an AST representing a program. Likewise, the syntax can be seen as a serialized format for representing the data, rather than as the syntax of a program.

There is also a "floor wax / dessert topping" dichotomy when it comes to functions. Is MF2 an active agent that calls functions, or does it define data passed to an API, whose implementation is what calls functions?

"Language" has multiple meanings in software: it can refer to a programming language, but the "L" in "HTML" and "XML" stands for "language". We would usually say that HTML and XML are markup languages, not programming languages, but even that point is debatable. After all, HTML embeds JavaScript code; the relationship between HTML and JavaScript resembles the relationship between MF2 and function implementations. Languages differ in how much computation they can directly express, but there are common aspects that unite different languages, like having a formal grammar.

I view MF2 as a domain-specific language for formatting messages, but another perspective is that it's a representation of data passed to an API. An API itself can be viewed as a domain-specific language: it provides verbs (functions or methods that can be called) and nouns (data structures that the functions can accept.)

Summing up #

As someone with a background in programming language design and implementation, I'm naturally inclined to see everything as a language design problem. A general-purpose programming language is complex, or at least can be used to solve complex problems. In contrast, those who have worked on the MF2 spec over the years have put in a lot of effort to make it as simple and focused on a narrow domain as possible.

One of the ways in which MF2 has veered towards programming language territory is the custom function mechanism, which was added to provide extensibility. The mechanism is general, but if there was a less general solution that still supported the desired range of use cases, these years of effort would have unearthed one. The presence of name binding (with all the complexities that brings up) and an unusual form of pattern matching also suggest to me that it's appropriate to consider MF2 a programming language, and to apply known programming language techniques to its design and implementation. This shows that programming language theory has interesting applications in internationalization, which is a new finding as far as I know.

What's left to do? #

The MessageFormat spec working group welcomes feedback during the tech preview period. User feedback on the MF2 spec and implementations will influence its future design. The current tech preview spec is part of LDML 45; the beta version, to be included in LDML 46, may include improvements suggested by users and implementors.

Igalia plans to continue collaborating to advance the Intl.MessageFormat proposal in TC39. Implementation in browser engines will be part of that process.

Acknowledgments #

Thanks to my colleagues Ben Allen, Philip Chimento, Brian Kardell, Eric Meyer, Asumu Takikawa, and Andy Wingo; and MF2 working group members Eemeli Aro, Elango Cheran, Richard Gibson, Mihai Niță, and Addison Phillips for their comments and suggestions on this post and its predecessor. Special thanks to my colleague Ujjwal Sharma, whose work I borrowed from in parts of the first post.

by compilers at May 09, 2024 12:00 AM

May 07, 2024

Melissa Wen

Get Ready to 2024 Linux Display Next Hackfest in A Coruña!

We’re excited to announce the details of our upcoming 2024 Linux Display Next Hackfest in the beautiful city of A Coruña, Spain!

This year’s hackfest will be hosted by Igalia and will take place from May 14th to 16th. It will be a gathering of minds from a diverse range of companies and open source projects, all coming together to share, learn, and collaborate outside the traditional conference format.

Who’s Joining the Fun?

We’re excited to welcome participants from various backgrounds, including:

  • GPU hardware vendors;
  • Linux distributions;
  • Linux desktop environments and compositors;
  • Color experts, researchers and enthusiasts;

This diverse mix of backgrounds is represented by developers from several companies working on the Linux display stack: AMD, Arm, BlueSystems, Bootlin, Collabora, Google, GravityXR, Igalia, Intel, LittleCMS, Qualcomm, Raspberry Pi, RedHat, SUSE, and System76. It will ensure a dynamic exchange of perspectives and foster collaboration across the Linux Display community.

Please take a look at the list of participants for more info.

What’s on the Agenda?

The beauty of the hackfest is that the agenda is driven by participants! As this is a hybrid event, we decided to improve the experience for remote participants by creating a dedicated space for them to propose topics and some introductory talks in advance. From those inputs, we defined a schedule that reflects the collective interests of the group, but is still open for amendments and new proposals. Find the schedule details in the official event webpage.

Expect discussions on:

KMS Color/HDR
  • Proposal with new DRM object type:
    • Brief presentation of GPU-vendor features;
    • Status update of plane color management pipeline per vendor on Linux;
  • HDR/Color Use-cases:
    • HDR gainmap images and how should we think about HDR;
    • Google/ChromeOS GFX view about HDR/per-plane color management, VKMS and lessons learned;
  • Post-blending Color Pipeline.
  • Color/HDR testing/CI
    • VKMS status update;
    • Chamelium boards, video capture.
  • Wayland protocols
    • color-management protocol status update;
    • color-representation and video playback.
Display control
  • HDR signalling status update;
  • backlight status update;
  • EDID and DDC/CI.
Strategy for video and gaming use-cases
  • Multi-plane support in compositors
    • Underlay, overlay, or mixed strategy for video and gaming use-cases;
    • KMS Plane UAPI to simplify the plane arrangement problem;
    • Shared plane arrangement algorithm desired.
  • HDR video and hardware overlay
Frame timing and VRR
  • Frame timing:
    • Limitations of uAPI;
    • Current user space solutions;
    • Brainstorm better uAPI;
  • Cursor/overlay plane updates with VRR;
  • KMS commit and buffer-readiness deadlines;
Power Saving vs Color/Latency
  • ABM (adaptive backlight management);
  • PSR1 latencies;
  • Power optimization vs color accuracy/latency requirements.
Content-Adaptive Scaling & Sharpening
  • Content-Adaptive Scalers on display hardware;
  • New drm_colorop for content adaptive scaling;
  • Proprietary algorithms.
Display Mux
  • Laptop muxes for switching of the embedded panel between the integrated GPU and the discrete GPU;
  • Seamless/atomic hand-off between drivers on Linux desktops.
Real time scheduling & async KMS API
  • Potential benefits: lower latency input feedback, better VRR handling, buffer synchronization, etc.
  • Issues around “async” uAPI usage and async-call handling.

In-person, but also geographically-distributed event

This year Linux Display Next hackfest is a hybrid event, hosted onsite at the Igalia offices and available for remote attendance. In-person participants will find an environment for networking and brainstorming in our inspiring and collaborative office space. Additionally, A Coruña itself is a gem waiting to be explored, with stunning beaches, good food, and historical sites.

Semi-structured structure: how the 2024 Linux Display Next Hackfest will work

  • Agenda: Participants proposed the topics and talks to be discussed in sessions.
  • Interactive Sessions: Discussions, workshops, introductory talks and brainstorming sessions lasting around 1h30. There is always a starting point for discussions and new ideas will emerge in real time.
  • Immersive experience: We will have coffee-breaks between sessions and lunch time at the office for all in-person participants. Lunches and coffee-breaks are sponsored by Igalia. This will keep us sharing knowledge and in continuous interaction.
  • Spaces for all group sizes: In-person participants will find different room sizes that match various group sizes at Igalia HQ. Besides that, there will be some devices for showcasing and real-time demonstrations.

Social Activities: building connections beyond the sessions

To make the most of your time in A Coruña, we’ll be organizing some social activities:

  • First-day Dinner: In-person participants will enjoy a Galician dinner on Tuesday, after a first day of intensive discussions in the hackfest.
  • Getting to know a little of A Coruña: Finding out a little about A Coruña and current local habits.

Participants of a guided tour in one of the sectors of the Museum of Estrella Galicia (MEGA). Source: mundoestrellagalicia.es

  • On Thursday afternoon, we will close the 2024 Linux Display Next hackfest with a guided tour of the Museum of Galicia’s favorite beer brand, Estrella Galicia. The guided tour covers the eight sectors of the museum and ends with beer pouring and tasting. After this experience, a transfer bus will take us to the Maria Pita square.
  • At Maria Pita square we will see the charm of some historical landmarks of A Coruña, explore the casual and vibrant style of the city center and taste local foods while chatting with friends.

Sponsorship

Igalia sponsors lunches and coffee-breaks on hackfest days, Tuesday’s dinner, and the social event on Thursday afternoon for in-person participants.

We can’t wait to welcome hackfest attendees to A Coruña! Stay tuned for further details and outcomes of this unconventional and unique experience.

May 07, 2024 02:33 PM

May 06, 2024

Igalia Compilers Team

MessageFormat 2.0: a new standard for translatable messages

This blog post represents the Igalia compilers team; all opinions expressed here are our own and may not reflect the opinions of other members of the MessageFormat Working Group or any other group.

MessageFormat 2.0 is a new standard for expressing translatable messages in user interfaces, both in Web applications and other software. Last week, my work on implementing MessageFormat 2.0 in C++ was released in ICU 75, the latest release of the International Components for Unicode library.

As a compiler engineer, when I learned about MessageFormat 2.0, I began to see it as a programming language, albeit an unconventional one. The message formatter is analogous to an interpreter for a programming language. I was surprised to discover a programming language hiding in this unexpected place. Understanding my surprise requires some context: first of all, what "messages" are.

A story about message formatting #

Over the past 40 to 50 years, user interfaces (UIs) have grown increasingly complex. As interfaces have become more interactive and dynamic, the process of making them accessible in a multitude of natural languages has increased in complexity as well. Internationalization (i18n) refers to this general practice, while the process of localization (l10n) refers to the act of modifying a specific system for a specific natural language.

i18n history #

Localization of user interfaces (both command-line and graphical) began by translating strings embedded in code that implements UIs. Those strings are called "messages". For example, consider a text-based adventure game: messages like "It is pitch black. You are likely to be eaten by a grue." might appear anywhere in the code. As a slight improvement, the messages could all be stored in a separate "resource file" that is selected based on the user's locale. The ad hoc approaches to translating these messages and integrating them into code didn't scale well. In the late 1980s, C introduced a gettext() function into glibc, which was never standardized but was widely adopted. This function primarily provided string replacement functionality. While it was limited, it inspired the work that followed. Microsoft and Apple operating systems had more powerful i18n support during this time, but that's beyond the scope of this post.

The rise of the Web required increased flexibility. Initially, static documents and apps with mostly static content dominated. Java, PHP, and Ruby on Rails all brought more dynamic content to the web, and developers using these languages needed to tinker more with message formatting. Their applications needed to handle not just static strings, but messages that could be customized to dynamic content in a way that produced understandable and grammatically correct messages for the target audience. MessageFormat arose as one solution to this problem, implemented by Taligent (along with other i18n functionality).

The creators of Java had an interest in making Java the preferred language for implementing Web applications, so Sun Microsystems incorporated Taligent's work into Java in JDK 1.1 (1997). This version of MessageFormat supported pattern strings, formatting basic types, and choice format.

The MessageFormat implementation was then incorporated into ICU, the widely used library that implements internationalization functions. Initially, Taligent and IBM (which was one of Taligent's parent companies) worked with Sun to maintain parallel i18n implementations in ICU and in the JDK, but eventually the two implementations diverged. Moving at a faster pace, ICU added plural selection to MessageFormat and expanded other capabilities, based on feedback from developers and translators who had worked with the previous generations of localization tools. ICU also added a C++ port of this Java-based implementation; hence, the main ICU project includes ICU4J (implemented in Java) and ICU4C (implemented in C++ with some C APIs).

Since ICU MessageFormat was introduced, many much more complex Web UIs, largely based on JavaScript, have appeared. Complex programs can generate much more interesting content, which implies more complex localization. This necessitates a different model of message formatting. An update to MessageFormat has been in the works at least since 2019, and that work culminates with the recent release of the MessageFormat 2.0 specification. The goals are to provide more modularity and extensibility, and easier adaptation for different locales.

For brevity, we will refer to ICU MessageFormat, otherwise known simply as "MessageFormat", as "MF1" throughout. Likewise, "MF2" refers to MessageFormat 2.0.

Why message formatting? (Not just for translation) #

The work described in this post is motivated by the desire to make it as easy as possible to write software that is accessible to people regardless of the language(s) they use to communicate. But message formatting is needed even in user interfaces that only support a single natural language.

Consider this C code, where number is presumed to be an unsigned integer variable that's already been defined:

printf("You have %u files", number);

We've probably all used applications that informed us that we "have 1 files". While it's probably clear what that means, why should human readers have to mentally correct the grammatical errors of a computer? A programmer trying to improve things might write:

printf("You have %u file%s", number, number == 1 ? "" : "s");

which handles the case where number is 1. However, English is full of irregular plurals: consider "bunny" and "bunnies", "life" and "lives", or "mouse" and "mice". The code would be easier to read and maintain if it were easier to express "print the plural of 'file'", instead of a conditional expression that must be scrutinized for its meaning. While "printf" is short for "print formatted", its formatting is limited to simpler tasks like printing numbers as strings, not complex tasks like pluralizing "mouse".

This is just one example; another example is messages including ordinals, like "1st", "2nd", or "3rd", where the suffix (like "st") is determined in a complicated way from the number (consider "11th" versus "21st").

We've seen how messages can vary based on user input in non-local ways: the bug in "You have %u files" is that not only does the number of files vary, but so does the word "files". So you might begin to see the value of a specialized notation for expressing such messages, which would make it easier to notice bugs like the "1 files" bug and even prevent them in the first place. That notation might even resemble a domain-specific language.

Why message formatting? #

Turning now to translation, you might be wondering why you would need a separate message formatter, instead of writing code in your programming language of choice that would look something like this (if your language of choice is C/C++):

// Supposing a char* "what" and int "planet" are already defined
printf("There was %s on planet %d", what, planet);

Here, the message is the first argument to printf and the other arguments are substituted into the message at runtime; no special message formatting functionality is used, beyond what the C standard library offers.

The problem is that if you want to translate the words in the message, translation may require changing the code, not just the text of the message. Suppose that in the target language, we needed to write the equivalent of:

printf("On planet %d, there was a %s", what, planet);

Oops! We've reordered the text, but not the other arguments to printf. When using a modern C/C++ compiler, this would be a compile-time error (since what is a string and not an integer). But it's easy to imagine similar cases where both arguments are strings, and a nonsense message would result. Message formatters are necessary not only for resilience to errors, but also for division of labor: people building systems want to decouple the tasks of programming and translation.

ICU MessageFormat #

In search of a better tool for the job, we turn to one of many tools for formatting messages: ICU MessageFormat (MF1). According to the ICU4C API documentation, "MessageFormat prepares strings for display to users, with optional arguments (variables/placeholders)".

The MF2 working group describes MF1 more succinctly as "An i18n-aware printf, basically."

Here's an expanded version of the previous printf example, expressed in MF1 (and taken from the "Usage Information" of the API docs):

"At {1,time,::jmm} on {1,date,::dMMMM}, there was {2} on planet {0,number}."

Typically, the task of creating a message string like this is done by a software engineer, perhaps one who specializes in localization. The work of translating the words in the message from one natural language to another is done by a translator, who is not assumed to also be a software developer.

The designers of MF1 made it clear in the syntax what content needs to be translated, and what doesn't. Any text occurring inside curly braces is non-translatable: for example, the string "number" in the last set of curly braces doesn't mean the word "number", but rather, is a directive that the value of argument 0 should be formatted as a number. This makes it easy for both developers and translators to work with a message. In MF1, a piece of text delimited by curly braces is called a "placeholder".

The numbers inside placeholders represent arguments provided at runtime. (Arguments can be specified either by number or name.) Conceptually, time, date, and number are formatting functions, which take an argument and an optional "style" (like ::jmm and ::dMMMM in this example). MF1 has a fixed set of these functions. So {1,time,::jmm} should be read as "Format argument 1 as a time value, using the format specified by the string ::jmm". (The details of how the format strings work aren't important for this explanation.) Since {2} has no formatting function specified, the value of argument 2 is formatted based on its type. (From context, we can suppose it has a string type, but we would need to look at the arguments to the message to know for sure.) The set of types that can be formatted in this way is fixed (users can't add new types of their own).

For brevity, I won't show how to provide the arguments that are substituted for the numbers 0, 1 and 2 in the message; the API documentation shows the C++ code to create those arguments.

MF1 addresses the reordering issue we noticed in the printf example. As long as they don't change what's inside the placeholders, a translator doesn't have to worry that their translation will disturb the relationship between placeholders and arguments. Likewise, a developer doesn't have to worry about manually changing the order of arguments to reflect a translation of the message. The placeholder {2} means the same thing regardless of where it appears in the message. (Using named rather than positional arguments would make this point even clearer.)

(The ICU User Guide contains more documentation on MF1. A tutorial by Mohammad Ashour also contains some useful information on MF1.)

Enter MessageFormat 2.0 #

To summarize a few of the shortcomings of MF1:

  • Lack of modularity: the set of formatters (mostly for numbers, dates, and times) is fixed. There's no way to add your own.
    • Like formatting, selection (choosing a different pattern based on the runtime contents of one or more arguments) is not customizable. It can only be done either based on plural form, or literal string matching. This makes it hard to express the grammatical structures of various human languages.
  • No separate data model: the abstract syntax is the concrete syntax.
  • There is no way to declare a local variable so that the same piece of a message can be re-used without repeating it; all variables are external arguments.
  • Baggage: extending the existing spec with different syntax could change the meaning of existing messages, so a new spec is needed. (MF1 was not designed for forward-compatibility, and the new spec is backward-incompatible.)

MF2 represents the best efforts of a number of experts in the field to address these shortcomings and others.

I hope the brief history I gave shows how much work, by many tech workers, has gone into the problem of message formatting. Synthesizing the lessons of the past few decades into one standard has taken years, with seemingly small details provoking nuanced discussions. What follows may seem complex, but the complexity is inherent in the problem space that it addresses. The plethora of different competing tools to address the same set of problems is evidence for that.

"Inherent is Latin for not your fault." -- Rich Hickey, "Simple Made Easy"

A simple example #

Here's the same example in MF2 syntax:

At time {$time :datetime hour=numeric minute=|2-digit|}, on {$time :datetime day=numeric month=long},
there was {$what} on planet {$planet :number}

As in MF1, placeholders are enclosed in curly braces. Variables are prefixed with a $; since there are no local bindings for time, what, or planet, they are treated as external arguments. Although variable names are already delimited by curly braces, the $ prefix helps make it even clearer that variable names should not be translated.

Function names are now prefixed by a :, and options (like hour) are named. There can be multiple options, not just one as in MF1. This is a loose translation of the MF1 version, since in MF2, the skeleton option is not part of the alpha version of the spec. (It is possible to write a custom function that takes this option, and it may be added in the future as part of the built-in datetime function.) Instead, the :datetime formatter can take a variety of field options (shown in the example) or style options (not shown). The full set of options is specified in the Default Registry portion of the spec.

Literal strings that occur within placeholders, like 2-digit, are quoted. The MF2 syntax for quoting a literal is to enclose it in vertical bars (|). The vertical bars are optional in most cases, but are necessary for the literal 2-digit because it includes a hyphen. This syntax was chosen instead of literal quotation marks (") because some use cases for MF2 messages require embedding them in a file format that assigns meaning to quotation marks. Vertical bars are less commonly used in this way.

A shorter version of this example is:

At time {$time :time}, on {$time :date}, there was {$what} on planet {$planet :number}

:time formats the time portion of its operand with default options, and likewise for :date. Implementations may differ on how options are interpreted and their default values. Thus, the formatted output of this version may be slightly different from that of the first version.
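
MF2 also addresses the local-variable shortcoming listed earlier: a message can begin with declarations that bind a name to an expression, so the same piece of a message can be reused without repeating it. Here is a rough sketch (my own example, not one taken from the spec; when a message contains declarations, its patterns are quoted in double curly braces):

.local $when = {$time :datetime hour=numeric minute=|2-digit|}
{{At time {$when}, there was {$what} on planet {$planet :number}}}

Here $when is a local binding, while $time, $what, and $planet remain external arguments.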

A more complex example: selection and custom functions #

So far, the examples are single messages that contain placeholders that vary based on runtime input. But sometimes, the translatable text in a message also depends on runtime input. A common example is pluralization: in English, consider "You have one item" versus "You have 5 items." We can call these different strings "variants" of the same message.

While you can imagine code that selects the right variant with a conditional, that violates the separation between code and data that we previously discussed.

Another motivator for variants is grammatical case. English has a fairly simple case system (consider "She told me" versus "I told her"; the "she" of the first sentence changes to "her" when that pronoun is the object of the transitive verb "tell".) Some languages have much more complex case systems. For example, Russian has six grammatical cases; Basque, Estonian, and Finnish all have more than ten. The complexity of translating messages into and between languages like these is further motivation for organizing all the variants of a message together. The overall goal is to make messages as self-contained as possible, so that changing them doesn't require changing the code that manipulates them. MF2 makes that possible in more situations. While MF1 supports selection based on plural categories, MF2 also supports a general, customizable form of selection.

Here's a very simple example (necessarily simple since it's in English) of using custom selectors in MF2 to express grammatical case:

.match {$userName :hasCase}
vocative {{Hello, {$userName :person case=vocative}!}}
accusative {{Please welcome {$userName :person case=accusative}!}}
* {{Hello!}}

The keyword .match designates the beginning of a matcher.

How to read a matcher #

Although MF2 is declarative, you can read this example imperatively as follows:

  1. Apply :hasCase to the runtime value of $userName to get a string c representing a grammatical case.
  2. Compare c to each of the keys in the three variants ("vocative", "accusative", and the wildcard "*", which matches any string.)
  3. Take the matching variant v and format its pattern.

This example assumes that the runtime value of the argument $userName is a structure whose grammatical case can be determined. This example also assumes that custom :hasCase and :person functions have been defined (the details of how those functions are defined are outside the scope of this post). In this example, :person is a formatting function, like :datetime in the simpler example. :hasCase is a different kind of function: a selector function, which may extract a field from its argument or do arbitrarily complicated computation to determine a value to be matched against.

In general: a matcher includes one or more selectors and one or more variants. A variant includes a list of keys, whose length must equal the number of selectors in the matcher; and a single pattern.

In this example, the selector of the matcher is the placeholder {$userName :hasCase}. Selectors appear between the .match keyword and the beginning of the first variant. There are three variants, each of which has a single key. The strings delimited by double sets of curly braces are patterns, which in turn contain other placeholders. The selectors are used to select a pattern based on whether the runtime value of the selector matches one of the keys.
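
To make the selection step concrete, here is a small illustrative sketch in Python. It is not the API of any real MF2 implementation (all names are made up), and it simplifies matching to "first key that matches, with * as a catch-all", which is enough for a single selector with literal keys:

def select_variant(selector_value, variants):
    # variants is a list of (key, pattern) pairs; "*" is the wildcard key.
    for key, pattern in variants:
        if key == "*" or key == selector_value:
            return pattern
    raise ValueError("a well-formed matcher ends with a wildcard variant")

# Suppose the custom :hasCase selector mapped $userName to "vocative":
variants = [
    ("vocative", "Hello, {$userName :person case=vocative}!"),
    ("accusative", "Please welcome {$userName :person case=accusative}!"),
    ("*", "Hello!"),
]
print(select_variant("vocative", variants))  # prints the vocative pattern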

Formatting the selected pattern #

Supposing that :hasCase maps the value of $userName onto "vocative", the formatted pattern consists of the string "Hello, " concatenated with the result of formatting this placeholder:

{$userName :person case=vocative}

concatenated with the string "!". (This also supposes that we are formatting to a single string result; future versions of MF2 may also support formatting to a list of "parts", in which case the result strings would be returned in a more complicated data structure, not concatenated.)

You can read this placeholder like a function call with arguments: "Call the :person function on the value of $userName with an additional named argument, the name "case" bound to the string "vocative"." We also elide the details of :person, but you can suppose it uses additional fields in $userName to format it as a personal name. Such fields might include a title, first name, and last name (incidentally, see "Falsehoods Programmers Believe About Names" for why formatting personal names is more complicated than you might think).

Note that MF2 provides no constructs that mutate variables. Once a variable is bound, its value doesn't change. Moreover, built-in functions don't have side effects, so as long as custom functions are written in a reasonable way (that is, without side effects that cross the boundary between MF2 and function execution), MF2 has no side effects.

That means that $userName represents the same value when it appears in the selector and when it appears in any of the patterns. Conceptually, :hasCase returns a result that is used for selection; it doesn't change what the name $userName is bound to.

Abstraction over function implementations #

A developer could plug in an implementation of :hasCase that requires its argument to be an object or record that contains grammatical metadata, and then simply returns one of the fields of this object. Or we could plug in an implementation that can accept a string and uses a dictionary for the target language to guess its case. The message is structured the same way regardless, though to make it work as expected, the structure of the arguments must match the expectations of whatever functions consume them. Effectively, the message is parameterized over the meanings of its arguments and over the meanings of any custom functions it uses. This parameterization is a key feature of MF2.
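
As a purely illustrative sketch of that parameterization (again in Python, with made-up names rather than any implementation's real custom-function API), both of the following could back the same :hasCase selector, as long as the argument matches what the chosen implementation expects:

# Implementation 1: expects a structured argument carrying grammatical metadata.
def has_case_from_metadata(user_name):
    # e.g. user_name == {"name": "Maria", "case": "vocative"}
    return user_name["case"]

# Implementation 2: accepts a plain string and guesses the case from a
# (hypothetical) dictionary for the target language.
CASE_DICTIONARY = {"Maria": "vocative"}

def has_case_from_dictionary(user_name):
    return CASE_DICTIONARY.get(user_name, "other")

The message itself does not change; only the function plugged in as :hasCase, and the shape of the argument it expects, do.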

Summing up #

This example couldn't be expressed in MF1, since MF1 has no custom functions. The checking for different grammatical cases would be done in the underlying programming language, with ad hoc code for selecting between the different strings. This would be error-prone, and would force a code change whenever a message changes (just as in the simple example shown previously). MF1 does have an equivalent of .match that supports a few specific kinds of matching, like plural matching. In MF2, the ability to write custom selector functions allows for much richer matching.

Further reading (or watching) #

For more background both on message formatting in general and on MF2, I recommend my teammate Ujjwal Sharma's talk at FOSDEM 2024, on which portions of this post are based. A recent MessageFormat 2 open house talk by Addison Phillips and Elango Cheran also provides some great context and motivation for why a new standard is needed.

You can also read a detailed argument in favor of a new standard by visiting the spec repository on GitHub: "Why MessageFormat needs a successor".

Igalia's work on MF2, or: Who are we and what are we doing here? #

So far, I've provided a very brief overview of the syntax and semantics of MF2. More examples will be provided via links in the next post.

A question I haven't answered is why this post is on the Igalia compilers team blog. I'm a member of the Igalia compilers team; together with Ujjwal Sharma, I have been collaborating with the MF2 working group, a subgroup of the Unicode CLDR technical committee. The working group is chaired by Addison Phillips and has many other members and contributors.

Part of our work on the compilers team has been to implement the MF2 specification as a C++ API in ICU. (Mihai Niță at Google, also a member of the working group, implemented the specification as a Java API in ICU.)

So why would the compilers team work on internationalization?

Part of our work as a team is to introduce and refine proposals for new JavaScript (JS) language features and work with TC39, the JS standards committee, to advance these proposals with the goal of inclusion in the official JS specification.

One such proposal that the compilers team has been involved with is the Intl.MessageFormat proposal.

The implementation of MessageFormat 2 in ICU provides support for browser engines (the major ones are implemented in C++) to implement this proposal. Prototype implementations are part of the TC39 proposal process.

But there's another reason: as I hinted at the beginning, the more I learned about MF2, the more I began to see it as a programming language, at least a domain-specific language. In the next post on this blog, I'll talk about the implementation work that Igalia has done on MF2 and reflect on that work from a programming language design perspective. Stay tuned!

by compilers at May 06, 2024 12:00 AM

April 30, 2024

Enrique Ocaña

Dissecting GstSegments

During all these years using GStreamer, I’ve been having to deal with GstSegments in many situations. I’ve always had an intuitive understanding of the meaning of each field, but never had the time to properly write a good reference explanation for myself, ready to be checked at those times when the task at hand stops being so intuitive and nuances start being important. I used the notes I took during an interesting conversation with Alba and Alicia about those nuances, during the GStreamer Hackfest in A Coruña, as the seed that evolved into this post.

But what actually are GstSegments? They are the structures that track the values needed to synchronize the playback of a region of interest in a media file.

GstSegments are used to coordinate the translation between Presentation Timestamps (PTS), supplied by the media, and Runtime.

PTS is the timestamp that specifies, in buffer time, when the frame must be displayed on screen. This buffer time concept (called buffer running-time in the docs) refers to the ideal time flow where rate isn’t taken into account.

Decode Timestamp (DTS) is the timestamp that specifies, in buffer time, when the frame must be supplied to the decoder. On decoders supporting P-frames (forward-predicted) and B-frames (bi-directionally predicted), the PTS of the frames reaching the decoder may not be monotonic, but the PTS of the frames reaching the sinks are (the decoder outputs monotonic PTSs).

Runtime (called clock running time in the docs) is the amount of physical time that the pipeline has been playing back. More specifically, the Runtime of a specific frame indicates the physical time that has passed or must pass until that frame is displayed on screen. It starts from zero.

Base time is the point when the Runtime starts with respect to the input timestamp in buffer time (PTS or DTS). It’s the Runtime of the PTS=0.

Start, stop, duration: These fields are buffer timestamps that specify when the piece of media that is going to be played starts, when it stops, and how long that portion of the media is (the absolute difference between start and stop; I say absolute because a segment being played backwards may have a start buffer timestamp higher than its stop buffer timestamp).

Position is like the Runtime, but in buffer time. This means that in a video being played back at 2x, Runtime would flow at 1x (it’s physical time after all, and reality goes at 1x pace) and Position would flow at 2x (the video moves twice as fast as physical time).

The Stream Time is the position in the stream. It’s not exactly the same concept as buffer time. When handling multiple streams, some of them can be offset with respect to each other, not starting to be played from the beginning, or can even have loops (e.g. repeating the same sound clip from PTS=100 until PTS=200 indefinitely). In this case of repeating, the Stream time would flow from PTS=100 to PTS=200 and then go back again to the start position of the sound clip (PTS=100). There’s a nice graphic in the docs illustrating this, so I won’t repeat it here.

Time is the base of Stream Time. It’s the Stream time of the PTS of the first frame being played. In our previous example of the repeating sound clip, it would be 100.

There are also concepts such as Rate and Applied Rate, but we didn’t get into them during the discussion that motivated this post.

So, for translating between Buffer Time (PTS, DTS) and Runtime, we would apply this formula:

Runtime = BufferTime * ( Rate * AppliedRate ) + BaseTime

And for translating between Buffer Time (PTS, DTS) and Stream Time, we would apply this other formula:

StreamTime = BufferTime * AppliedRate + Time
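
As a quick reference, here is a direct transcription of those two formulas into Python (purely illustrative; in real code you would rely on GStreamer's gst_segment_to_running_time() and gst_segment_to_stream_time() helpers rather than computing this by hand):

def to_runtime(buffer_time, rate, applied_rate, base_time):
    # Runtime = BufferTime * (Rate * AppliedRate) + BaseTime
    return buffer_time * (rate * applied_rate) + base_time

def to_stream_time(buffer_time, applied_rate, time):
    # StreamTime = BufferTime * AppliedRate + Time
    return buffer_time * applied_rate + time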

And that’s it. I hope these notes in the shape of a post serve me as reference in the future. Again, thanks to Alicia, and especially to Alba, for the valuable clarifications during the discussion we had that day in the Igalia office. This post wouldn’t have been possible without them.

by eocanha at April 30, 2024 06:00 AM

April 26, 2024

Gyuyoung Kim

Web Platform Test and Chromium

I’ve been working on tasks related to web platform tests since last year. In this blog post, I’ll introduce what web platform tests are and how the Chromium project incorporates these tests into its development process.

1. Introduction

The web-platform-tests project serves as a cross-browser test suite for the Web platform stack. By crafting tests that can run seamlessly across all browsers, browser projects gain assurance that their software aligns with other implementations. This confidence extends to future implementations, ensuring compatibility. Consequently, web authors and developers can trust the Web platform to fulfill its promise of seamless functionality across browsers and devices, eliminating the need for additional layers of abstraction to address gaps introduced by specification editors and implementors.

For your information, the Web Platform Test Community operates a dashboard that tracks the pass ratio of tests across major web browsers. The chart below displays the number of failing tests in major browsers over time, with Chrome showing the lowest failure rate.

[Caption] Test failures graph on major browsers
The chart below shows the interoperability of web platform technologies among major browsers for 2023; interoperability has been improving.

[Caption] Interoperability among the major browsers 2023

2. Test Suite Design

The majority of the test suite is made up of HTML pages that are designed to be loaded in a browser. These pages may either generate results programmatically or provide a set of steps for executing the test and obtaining the outcome. Overall, the tests are concise, cross-platform, and self-contained, making them easy to run in any browser.

2.1 Test Layout

Most primary directories within the repository are dedicated to tests associated with specific web standards. For W3C specifications, these directories typically use the short name of the spec, which is the name used for snapshot publications under the /TR/ path. For WHATWG specifications, the directories are usually named after the spec’s subdomain, omitting “.spec.whatwg.org” from the URL. Other specifications follow a logical naming convention.

The css/ directory contains test suites specifically designed for the CSS Working Group specifications.

Within each specification-specific directory, tests are organized in one of two common ways: a flat structure, sometimes used for shorter specifications, or a nested structure, where each subdirectory corresponds to the ID of a heading within the specification. The nested structure, which provides implicit metadata about the tested section of the specification based on its location in the filesystem, is preferred for larger specifications.

For example, tests related to “The History interface” in HTML can be found in html/browsers/history/the-history-interface/.

Many directories also include a file named META.yml, which may define properties such as:
  • spec: a link to the specification covered by the tests in the directory
  • suggested_reviewers: a list of GitHub usernames for individuals who are notified when pull requests modify files in the directory
Various resources that tests rely on are stored in common directories, including images, fonts, media, and general resources.
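
Returning to META.yml, a minimal sketch of such a file might look like this (both the spec link and the reviewer name are illustrative, not taken from the repository):

spec: https://html.spec.whatwg.org/multipage/
suggested_reviewers:
  - some-github-username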

2.2 Test Types

Tests in this project employ various approaches to validate expected behavior, with classifications based on how expectations are expressed:

  1. Rendering Tests:
    • Reftests: Compare the graphical rendering of two (or more) web pages, asserting equality in their display (e.g., A.html and B.html must render identically). This can be done manually by users switching between tabs/windows or through automated scripts.
    • Visual Tests: Evaluate a page’s appearance, with results determined either by human observation or by comparing it with a saved screenshot specific to the user agent and platform.
  2. JavaScript Interface Tests (testharness.js tests):
    • Ensure that JavaScript interfaces behave as expected. Named after the JavaScript harness used to execute them.
  3. WebDriver Browser Automation Protocol Tests (wdspec tests):
    • Written in Python, these tests validate the WebDriver browser automation protocol.
  4. Manual Tests rely on a human to run them and determine their result.

3. Web Platform Test in Chromium

The Chromium project runs web tests used by Blink to assess various components, including, but not limited to, layout and rendering. Generally, these web tests load pages in a test renderer (content_shell) and compare the rendered output or JavaScript output with an expected output file. The web tests include the “Web platform tests” (WPT) located at web_tests/external/wpt; other directories are for Chrome-specific tests only. There are currently around 50,000 web platform tests in web_tests/external/wpt.

3.1 Web Tests in the Chromium development process

Every patch must pass all tests before it can be merged into the main source tree. In Gerrit, trybots execute a wide array of tests, including the Web Test, to ensure this. We can initiate these trybots with a specific patch, and each trybot will then run the scheduled tests using that patch.

[Caption] Trybots in Gerrit

For example, the linux-rel trybot runs the web tests in the blink_web_tests step, as shown below:

[Caption] blink_web_tests in the linux-rel trybot

3.2 Internal sequence for running the Web Test in Chromium

Let’s take a brief look at how Chromium runs the web tests internally. Basically, there are two major components involved: run_web_tests.py, acting as a server, and content_shell, acting as a client that loads a given test and returns the result. The input and output of content_shell are assumed to follow the run_web_tests protocol, exchanged through pipes that connect the stdin and stdout of run_web_tests.py and content_shell, as shown below.

[Caption] Sequence of how a test is executed between run_web_tests.py and content_shell
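
As a purely conceptual illustration of that driver/worker split (this is not the actual run_web_tests protocol; the test name and messages are made up), a Python sketch of two processes exchanging test names and results over stdin/stdout pipes could look like this:

import subprocess
import sys

# Driver ("server") side: spawn a worker and talk to it through pipes,
# in the same spirit as run_web_tests.py driving content_shell.
worker = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "for line in sys.stdin:\n"
     "    print(line.strip() + ': PASS', flush=True)\n"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

for test in ["fake/example-test.html"]:
    worker.stdin.write(test + "\n")            # send a test to the worker
    worker.stdin.flush()
    print(worker.stdout.readline().strip())    # read the result back

worker.stdin.close()
worker.wait()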

4. In conclusion

We just explored what web tests are and how Chromium runs them for web platform testing. Web platform tests play a crucial role in ensuring interoperability among various web browsers and platforms. In the following blog post, I will share how I added support for running web tests on iOS for the Blink port.

by gyuyoung at April 26, 2024 08:48 AM

April 24, 2024

Stephen Chenney

CSS Custom Properties in Highlight Pseudos

The CSS highlight inheritance model describes the process for inheriting the CSS properties of the various highlight pseudo elements:

  • ::selection controlling the appearance of selected content
  • ::spelling-error controlling the appearance of misspelled word markers
  • ::grammar-error controlling how grammar errors are marked
  • ::target-text controlling the appearance of the string matching a target-text URL
  • ::highlight defining the appearance of a named highlight, accessed via the CSS highlight API

The inheritance model is intended to generate more intuitive behavior for examples like the one below, where we would expect the second line to be entirely blue because the <em> is a child of the blue paragraph that typically would inherit properties from the paragraph:

<style>
  ::selection /* = *::selection (universal) */ {
    color: lightgreen;
  }
  .blue::selection {
    color: blue;
  }
</style>
<p>Some <em>lightgreen</em> text</p>
<p class="blue">Some <em>lightgreen</em> text that one would expect to be blue</p>
<script>
  range = new Range();
  range.setStart(document.body, 0);
  range.setEnd(document.body, 3);
  document.getSelection().addRange(range);
</script>

Before this model was standardized, web developers achieved the same effect using CSS custom properties. The selection highlight could make use of a custom property defined on the originating element (the element that is selected). Being defined on the originating element tree, those custom properties were inherited, so a universal highlight would effectively inherit the properties of its parent. Here’s what it looks like:

<style>
  :root {
    --selection-color: lightgreen;
  }
  ::selection /* = *::selection (universal) */ {
    color: var(--selection-color);
  }
  .blue {
    --selection-color: blue;
  }
</style>
<p>Some <em>lightgreen</em> text</p>
<p class="blue">Some <em>lightgreen</em> text that is also blue</p>
<script>
  range = new Range();
  range.setStart(document.body, 0);
  range.setEnd(document.body, 3);
  document.getSelection().addRange(range);
</script>

In the example, all selection highlights use the value of --selection-color as the selection text color. The <em> element inherits the property value of blue from its parent <p> and hence has a blue highlight.

This approach to selection inheritance was promoted in Stack Overflow posts and other places, even in posts that did not discuss inheritance. The problem is that the new highlight inheritance model, as previously specified, broke this behavior in a way that required significant changes for web sites making use of the former approach. At the prompting of developer advocates, the CSS Working Group decided to change the specified behavior of custom properties with highlight pseudos to support the existing recommended approach.

Custom Properties for Highlights #

The updated behavior for custom properties in highlight pseudos is that highlight properties that have var(...) references take the property values from the originating element. In addition, custom properties defined in the highlight pseudo itself will be ignored. The example above works with this new approach, so content making use of it will not break when highlight inheritance is enabled in browsers.

Note that this explicitly conflicts with the previous guidance, which was to define the custom properties on the highlight pseudos, with only the root highlight inheriting custom properties from the document root element.

The change is implemented behind a flag in Chrome 126 and higher, and is expected to be enabled by default as early as Chrome 130. To see the effects right now, enable “Experimental Web Platform features” via chrome://flags in M126 Chrome (Canary at the time of writing).

A Unified Approach for Derived Properties in Highlights #

There are other situations in which properties are disallowed on highlight pseudos yet those properties are necessary to resolve variables. For example, lengths that use font-based units, such as 0.5em, cannot use font-size from a highlight pseudo because font-size is not allowed in highlights. The same applies to container-based and viewport units. The general rule is that the highlight will get whatever value it needs from the originating element, and that now includes custom property values.
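
For instance, in a sketch like the one below (my own illustration, assuming text-shadow is among the properties allowed on highlight pseudos), the 0.25em length is resolved against the font-size of the originating element:

::selection {
  /* 0.25em resolves against the font-size of the element being selected,
     since font-size cannot be declared on the highlight pseudo itself. */
  text-shadow: 0 0 0.25em purple;
}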

by Stephen Chenney at April 24, 2024 12:00 AM

April 16, 2024

Philippe Normand

From WebKit/GStreamer to rust-av, a journey on our stack’s layers

In this post I’ll try to document the journey starting from a WebKit issue and ending up improving third-party projects that WebKitGTK and WPEWebKit depend on.

I’ve been working on WebKit’s GStreamer backends for a while. Usually some new feature needed on the WebKit side would trigger work on GStreamer. That’s quite common and healthy, actually: by improving GStreamer (bug fixes or new features) we make the whole stack stronger (hopefully). It’s not hard to imagine other web engines, such as Servo for instance, leveraging fixes made in GStreamer in the context of WebKit use-cases.

Sometimes though we have to go deeper and this is what this post is about!

Since version 2.44, WebKitGTK and WPEWebKit ship with a WebCodecs backend. That backend leverages the wide range of GStreamer audio and video decoders/encoders to give low-level access to encoded (or decoded) audio/video frames to Web developers. I delivered a lightning talk at gst-conf 2023 about this topic.

There are still some issues to fix regarding performance and some W3C web platform tests are still failing. The AV1 decoding tests were flagged early on while I was working on WebCodecs, I didn’t have time back then to investigate the failures further, but a couple weeks ago I went back to those specific issues.

The WebKit layout tests harness is executed by various post-commit bots, on various platforms. The WebKitGTK and WPEWebKit bots run on Linux. The WebCodec tests for AV1 currently make use of the GStreamer av1enc and dav1ddec elements. We currently don’t run the tests using the modern and hardware-accelerated vaav1enc and vaav1dec elements because the bots don’t have compatible GPUs.

The decoding tests were failing, this one for instance (the ?av1 variant). In that test both encoding and decoding are tested, but decoding was failing, for a couple reasons. Rabbit hole starts here. After debugging this for a while, it was clear that the colorspace information was lost between the encoded chunks and the decoded frames. The decoded video frames didn’t have the expected colorimetry values.

The VideoDecoderGStreamer class basically takes encoded chunks and notifies decoded VideoFrameGStreamer objects to the upper layers (JS) in WebCore. A video frame is basically a GstSample (Buffer and Caps) and we have code in place to interpret the colorimetry parameters exposed in the sample caps and translate those to the various WebCore equivalents. So far so good, but the caps set on the dav1ddec elements didn’t have that information! I thought the dav1ddec element could be fixed, “shouldn’t be that hard”, and I knew that code because I wrote it in 2018 :)

So let’s fix the GStreamer dav1ddec element. It’s a video decoder written in Rust, relying on the dav1d-rs bindings of the popular C libdav1d library. The dav1ddec element basically feeds encoded chunks of data to dav1d using the dav1d-rs bindings. In return, the bindings provide the decoded frames using a Dav1dPicture Rust structure and the dav1ddec GStreamer element basically makes buffers and caps out of this decoded picture. The dav1d-rs bindings are quite minimal, we implemented API on a per-need basis so far, so it wasn’t very surprising that… colorimetry information for decoded pictures was not exposed! Rabbit hole goes one level deeper.

So let’s add colorimetry API in dav1d-rs. When working on (Rust) bindings of a C library, if you need to expose additional API the answer is quite often in the C headers of the library. Every Dav1dPicture has a Dav1dSequenceHeader, in which we can see a few interesting fields:

typedef struct Dav1dSequenceHeader {
...
    enum Dav1dColorPrimaries pri; ///< color primaries (av1)
    enum Dav1dTransferCharacteristics trc; ///< transfer characteristics (av1)
    enum Dav1dMatrixCoefficients mtrx; ///< matrix coefficients (av1)
    enum Dav1dChromaSamplePosition chr; ///< chroma sample position (av1)
    ...
    uint8_t color_range;
    ...
...
} Dav1dSequenceHeader;

After sharing a naive branch with rust-av co-maintainers Luca Barbato and Sebastian Dröge, I came up with a couple of pull-requests that eventually were shipped in version 0.10.3 of dav1d-rs. I won’t deny matching primaries, transfer, matrix and chroma-site enum values to rust-av's Pixel enum was a bit challenging :P Anyway, with dav1d-rs fixed up, the rabbit hole goes back up one level :)

Now with the needed dav1d-rs API, the GStreamer dav1ddec element could be fixed. Again, matching the various enum values to their GStreamer equivalent was an interesting exercise. The merge request was merged, but to this date it’s not shipped in a stable gst-plugins-rs release yet. There’s one more complication here, ABI broke between dav1d 1.2 and 1.4 versions. The dav1d-rs 0.10.3 release expects the latter. I’m not sure how we will cope with that in terms of gst-plugins-rs release versioning…

Anyway, WebKit’s runtime environment can be adapted to ship dav1d 1.4 and development version of the dav1ddec element, which is what was done in this pull request. The rabbit is getting out of his hole.

The WebCodec AV1 tests were finally fixed in WebKit, by this pull request. Beyond colorimetry handling a few more fixes were needed, but luckily those didn’t require any fixes outside of WebKit.

Wrapping up, if you’re still reading this post, I thank you for your patience. Working on inter-connected projects can look a bit daunting at times, but eventually the whole ecosystem benefits from cross-project collaborations like this one. Thanks to Luca and Sebastian for the help and reviews in dav1d-rs and the dav1ddec element. Thanks to my fellow Igalia colleagues for the WebKit reviews.

by Philippe Normand at April 16, 2024 08:15 PM

April 02, 2024

Maíra Canal

Linux 6.8: AMD HDR and Raspberry Pi 5

The Linux kernel 6.8 came out on March 10th, 2024, bringing brand-new features and plenty of performance improvements on different subsystems. As part of Igalia, I’m happy to be an active part of many features that are released in this version, and today I’m going to review some of them.

Linux 6.8 is packed with a lot of great features, performance optimizations, and new hardware support. In this release, we can check the Intel Xe DRM driver experimentally, further support for AMD Zen 5 and other upcoming AMD hardware, initial support for the Qualcomm Snapdragon 8 Gen 3 SoC, the Imagination PowerVR DRM kernel driver, support for the Nintendo NSO controllers, and much more.

Igalia is widely known for its contributions to Web Platforms, Chromium, and Mesa. But, we also make significant contributions to the Linux kernel. This release shows some of the great work that Igalia is putting into the kernel and strengthens our desire to keep working with this great community.

Let’s take a deep dive into Igalia’s major contributions to the 6.8 release:

AMD HDR & Color Management

You may have seen the release of a new Steam Deck last year, the Steam Deck OLED. What you may not know is that Igalia helped bring this product to life by putting some effort into the AMD driver-specific color management properties implementation. Melissa Wen, together with Joshua Ashton (Valve), and Harry Wentland (AMD), implemented several driver-specific properties to allow Gamescope to manage color features provided by the AMD hardware to fit HDR content and improve gamers’ experience.

She has explained all features implemented in the AMD display kernel driver in two blog posts and a 2023 XDC talk:

Async Flip

André Almeida worked together with Simon Ser (SourceHut) to provide support for asynchronous page-flips in the atomic API. This feature targets users who want to present a new frame immediately, even if after missing a V-blank. This feature is particularly useful for applications with high frame rates, such as gaming.

Raspberry Pi 5

Raspberry Pi 5 was officially released on October 2023 and Igalia was ready to bring top-notch graphics support for it. Although we still can’t use the RPi 5 with the mainline kernel, it is superb to see some pieces coming upstream. Iago Toral worked on implementing all the kernel support needed for the V3D 7.1.x driver.

With the kernel patches, by the time the RPi 5 was released, it already included a fully compliant OpenGL ES 3.1 and Vulkan 1.2 driver implemented by Igalia.

GPU stats and CPU jobs for the Raspberry Pi 4/5

Apart from the release of the Raspberry Pi 5, Igalia is still working on improving the whole Raspberry Pi environment. I worked together with José Maria “Chema” Casanova on implementing support for GPU stats on the V3D driver. This means that RPi 4/5 users can now access the GPU usage percentage, either per process or globally.

I also worked, together with Melissa, on implementing CPU jobs for the V3D driver. As the Broadcom GPU isn’t capable of performing some operations, the Vulkan driver uses the CPU to compensate for them. In order to avoid stalls in the job submission, CPU jobs are now part of the kernel and can be easily synchronized through synchronization objects.

If you are curious about the CPU job implementation, you can check this blog post.

Other Contributions & Fixes

Sometimes we don’t contribute to a major feature in a release, but we can still help by improving documentation and sending fixes. André also contributed to this release by documenting the different AMD GPU reset methods, making them easier for future users to understand.

During Igalia’s efforts to improve the general users’ experience on the Steam Deck, Guilherme G. Piccoli noticed a message in the kernel log and readily provided a fix for this PCI issue.

Outside of the Steam Deck world, we can check some of Igalia’s work on the Qualcomm Adreno GPUs. Although most of our Adreno-related work is located at the user-space, Danylo Piliaiev sent a couple of kernel fixes to the msm driver, fixing some hangs and some CTS tests.

We also had contributions from our 2023 Igalia CE student, Nia Espera. Nia’s project was related to mobile Linux and she managed to write a couple of patches to the kernel in order to add support for the OnePlus 9 and OnePlus 9 Pro devices.

If you are a student interested in open source and would like to have a first exposure to the professional world, check if we have openings for the Igalia Coding Experience. I was a CE student myself, and being mentored by an Igalian was an incredible experience.

Check the complete list of Igalia’s contributions for the 6.8 release

Authored (57):

André Almeida (2)

Danylo Piliaiev (2)

Guilherme G. Piccoli (1)

Iago Toral Quiroga (4)

Maíra Canal (17)

Melissa Wen (27)

Nia Espera (4)

Signed-off-by (88):

André Almeida (4)

Danylo Piliaiev (2)

Guilherme G. Piccoli (1)

Iago Toral Quiroga (4)

Jose Maria Casanova Crespo (2)

Maíra Canal (28)

Melissa Wen (43)

Nia Espera (4)

Acked-by (4):

Jose Maria Casanova Crespo (2)

Maíra Canal (1)

Melissa Wen (1)

Reviewed-by (30):

André Almeida (1)

Christian Gmeiner (1)

Iago Toral Quiroga (20)

Maíra Canal (4)

Melissa Wen (4)

Tested-by (1):

Guilherme G. Piccoli (1)

April 02, 2024 11:00 AM

March 20, 2024

Jani Hautakangas

Bringing WebKit back to Android.

It’s been quite a while since the last blog post about WPE-Android, but that doesn’t mean WPE-Android hasn’t been in development. The focus has been on stabilizing the runtime and implementing the most crucial core features to make it easier to integrate new APIs into WPEView.

Main building blocks #

WPE-Android has three main building blocks:

  • Cerbero
    • Cross-platform build aggregator that is used to build WPE WebKit and all of its dependencies to Android.
  • WPEBackend-Android
    • Implements Android specific graphics buffer support and buffer sharing between WebKit UIProcess and WebProcess.
  • WPEView
    • Allows displaying web content in activity layout using WPEWebKit.

WPE-Android high-level design

What’s new #

The list of all work completed so far would be quite long, as there have been no official releases or public announcements, with the exception of the last blog post. Since that update, most of the efforts have been focused on ‘under the hood’ improvements, including enhancing stability, adding support for some core WebKit features, and making general improvements to the development infrastructure.

Here is a list of new features worth mentioning:

  • Based on WPE WebKit 2.42.1, the project was branched for the 2.42.x series. Future work will continue in the main branch for the next release.
  • Dropped support for 32-bit platforms (x86 and armv7). Only arm64-v8a and x86_64 are supported
  • Integration to Android main loop so that WPE WebKit GLib main loop is driven by Android main loop
  • Process-Swap On Navigation aka PSON
  • Added ASharedMemory support to WebKit SharedMemory
  • Hardware-accelerated multimedia playback
  • Fullscreen support
  • Cookies management
  • ANGLE based WebGL
  • Cross process fence insertion for composition synchronization with Surface Flinger
  • WebDriver support
  • GitHub Actions build bots
  • GitHub Actions WebDriver test bots

Demos #

WPEWebkit powered web view (WPEView) #

The demo uses the WPE-Android MiniBrowser sample application to show basic web page loading and touch-based scrolling, usage of a simple cookie manager to clear the page’s data usage, and finally loads the popular “Aquarium sample” to show a smooth (60 FPS) WebGL animation running thanks to HW acceleration support.

WebDriver #

The demo shows how to run a WebDriver test with the emulator. Detailed instructions on how to run WebDriver tests can be found in the README.md.

The test requests a website through the Selenium remote WebDriver. It then replaces the default behavior of window.alert on the requested page by injecting and executing a JavaScript snippet. After loading the page, it performs a click() action on the element that calls the alert. This results in the text ‘cheese’ being displayed right below the ‘click me’ link.

The test consists of three building blocks:

  • WPE WebDriver running on emulator
  • HTTP server serving test.html web page
  • Selenium python test script executing the test

What’s next #

We have ambitious plans for the coming months regarding the development of WPE Android, focusing mainly on implementing additional features, stabilization, and performance improvements.

As for implementing new features, now that the integration with the WPE WebKit runtime has reached a more robust state, it’s time to start adding more of the APIs that are still missing in WPEView compared to other webviews on Android and to enable other web-facing features supported by WebKit. This effort, along with adding support for features like HTTP/2 and the remote Web Inspector, will be a major focus.

As for stabilization and performance, having WebDriver support will be very helpful as it will enable us to identify and fix issues promptly and thus help make WPE Android more stable and feature-complete. We would also like to focus on conformance testing compared to other web views on Android, which should help us prioritize our efforts.

The broader goal for this year is to develop WPE-Android into an easy-to-use platform for third-party projects, offering a compelling alternative to other webviews on the Android platform. We extend many thanks to the NLNet Foundation, whose support for WPE Android through a grant from the NGI Zero Core fund will be instrumental in helping us achieve this goal!

Try it yourself #

WPE-Android is still considered a prototype, and it’s a bit difficult to build. However, if you want to try it out, you can follow the instructions in the README.md file. Additionally, you can use the project’s issue tracker to report problems or suggest new features.

by Jani Hautakangas at March 20, 2024 12:00 AM

March 05, 2024

José Dapena

Maintaining downstreams of Chromium: why downstream?

Chromium, the open source web browser project that Google Chrome is based on, can nowadays be considered the reference implementation of the web platform. As such, it is the first choice when implementing the web platform in a software platform or product.

Why is it like this? In this blog post I am going to introduce the topic, and then review the different reasons why a downstream Chromium is used in different projects.

A series of blog posts

This is the first of a series of blog posts, where I am going through several aspects and challenges of maintaining a downstream project using Chromium as its upstream project.

They will be mostly based on the discussions in two events. First, the Web Engines Hackfest 2023 breakout session with the same title that I chaired in A Coruña. Then, my BlinkOn 18 talk in November 2023, at Sunnyvale.

Some definitions

Before starting the discussion of the different aspects, let’s clarify how I will use several terms.

Repository vs. project

I am going to refer to a repository as a version controlled storage of code strictly. Then, a project (specifically, a software project) is the community of people that share goals and some kind of organization, to maintain one or several software products.

So, a project may use several repositories for their goals.

In this discussion I will talk about Chromium, an open source project that targets the implementation of the web platform user agent, a web browser, for different platforms. As such, it uses a number of repositories (src, v8 and more).

Downstream and upstream

I will use the downstream and upstream terms, referred to the relationship of different software projects version control repositories.

If there is a software project repository (typically open source), and a new repository is created that contains all or part of the original repository, then:

  • Upstream project will be the original repository.
  • Downstream project is the new repository.

It is important to highlight that different things can happen to the downstream repository:

  • This copy could be a one time event, so the downstream repository becomes an independent fork, and there may be no interest in tracking the upstream evolution. This happens often for abandoned repositories, where a different set of people start an independent project. But there could be other reasons.
  • There is the intent to track the upstream repository changes. So the downstream repository evolves as the upstream repository does too, but with some specific differences maintained on top of the original repository.

Why using Chromium?

Nowadays, the web platform is a solid alternative for providing content to users. It allows modern user interfaces, based on well-known standards, and integrates well with local and remote services. The gap between native applications and web content has been reduced, so it is quite often a good alternative.

But, when integrating web contents, product integrators need an implementation of the web platform. It is no surprise that Chromium is the most used, for a number of reasons:

  • It is open source, with a license that allows adapting it for new product needs.
  • It is well maintained and up to date, even pushing standardization forward to improve it continuously.
  • It is secure, both from architecture and maintenance model point of view.
  • It provides integration points to tailor the implementation to one's needs.
  • It supports the most popular software platforms (Windows, Android, Linux, …) for integrating new products.
  • On top of the web platform itself, it provides an implementation for many of the components required to build a modern web browser.

Still, there are other good alternate choices for integrating the web, such as WebKit (especially WPE for the embedded use cases), or using the system-provided web components (Android or iOS web view, …).

Though, in this blog post I will focus on the Chromium case.

Why downstreaming Chromium?

But, why do different projects need to use downstream Chromium repositories?

The main reason a project needs a downstream repository is simple: adding changes that are not upstream. This can be for a variety of reasons:

  • Downstream changes that are not allowed by upstream, e.g. because they would make the upstream project harder to maintain, or would not be tested often.
  • Downstream changes that downstream project does not want to add to upstream.

Let’s see some examples of changes of both types.

Hardware and OS adaptation

This is when downstream adds support for a hardware target or OS that is not originally supported in the upstream Chromium project.

Chromium upstream provides an abstraction layer for that purpose, named Ozone, which allows adapting it to the OS, desktop environment, and system graphics compositor. But there are other abstraction layers for media acceleration, accessibility or input methods.

The Wayland protocol adaptation started as a downstream effort, as upstream Chromium did not intend to support Wayland at that time. Eventually it evolved into an upstream official Ozone backend.

An example? LGE webOS Chromium port.

Differentiation

The previous case mostly forces having a downstream project or repository. But there are also some cases where this is intended: there is the will to have some features in the downstream repository and not in upstream, an intended differentiation.

Why would anybody want that? Some typical examples:

  • A set of features that the downstream project owners consider to make the project better in some way, and want them to be kept downstream. This can happen when a new browser is shipped, and it contains features that make the product offering different, and, in some ways, better, than upstream Chrome. That can be a different user experience, some security features, better privacy…
  • Adaptation to a different product brand. Each browser or browser-based product will want to have its specific brand instead of upstream Chromium brand.

Examples of this:

  • Brave browser, with completely different privacy and security choices.
  • ARC browser, with an innovative user experience.
  • Microsoft Edge, with tight Windows OS integration and corporate features.

Hybrid application runtimes

And one last interesting case: integrating the web platform for developing hybrid applications: those that mix parts of the user interface implemented in a native toolkit, and parts implemented using the web platform.

Though Chromium includes official support for hybrid applications for Android, with the Android Web View, other toolkits also provide web application support, and in the Chromium case the integration of those belongs to downstream projects.

Examples?

What’s next?

In this blog post I presented different reasons why projects end up maintaining a downstream fork of Chromium.

In the next blog post I will present one of the main challenges when maintaining a downstream of Chromium: the different rebase and upgrade strategies.

by José Dapena Paz at March 05, 2024 03:54 PM