Planet Igalia

September 07, 2024

Andy Wingo

conservative gc can be faster than precise gc

Should your garbage collector be precise or conservative? The prevailing wisdom is that precise is always better. Conservative GC can retain more objects than strictly necessary, making GC slow: GC has to run more frequently, and it has to trace a larger heap on each collection. However, the calculus is not as straightforward as most people think, and indeed there are some reasons to expect that conservative root-finding can result in faster systems.

(I have made / relayed some of these arguments before but I feel like a dedicated article can make a contribution here.)

problem precision

Let us assume that by conservative GC we mean conservative root-finding, in which the collector assumes that any integer on the stack that happens to be a heap address indicates a reference to the object containing that address. The address doesn’t have to be at the start of the object. Assume that objects on the heap are traced precisely; contrast to BDW-GC which generally traces both the stack and the heap conservatively. Assume a collector that will pin referents of conservative roots, but in which objects not referred to by a conservative root can be moved, as in Conservative Immix or Whippet’s stack-conservative-mmc collector.

With that out of the way, let’s look at some reasons why conservative GC might be faster than precise GC.

smaller lifetimes

A compiler that does precise root-finding will typically output a side-table indicating which slots in a stack frame hold references to heap objects. These stack maps aren’t always precise about lifetimes: although they precisely enumerate heap references, some of those references might never actually be used in the continuation of the stack frame. When GC occurs, it might therefore mark more objects as live than are actually live, which is the imputed disadvantage of conservative collectors.

This is most obviously the case when you need to explicitly register roots with some kind of handle API: the handle will typically be kept live until the scope ends, but that might be an overapproximation of lifetime. A compiler that can assume conservative stack scanning may well exhibit more precision than it would if it needed to emit stack maps.

no run-time overhead

For generated code, stack maps are great. But if a compiler needs to call out to C++ or something, it needs to precisely track roots in a run-time data structure. This is overhead, and conservative collectors avoid it.

smaller stack frames

A compiler may partition spill space on a stack into a part that contains pointers to the heap and a part containing numbers or other unboxed data. This may lead to larger stack sizes than if you could just re-use a slot for two purposes, if the lifetimes don’t overlap. A similar concern applies for compilers that partition registers.

no stack maps

The need to emit stack maps is annoying for a compiler and makes binaries bigger. Of course it’s necessary for precise roots. But then there is additional overhead when tracing the stack: for each frame on the stack, you need to look up the stack map for the return continuation, which takes time. It may be faster to just test if words on the stack might be pointers to the heap.

unconstrained compiler

Having to make stack maps is a constraint imposed on the compiler. Maybe if you don’t need them, the compiler could do a better job, or you could use a different compiler entirely. A conservative compiler can sometimes have better codegen, for example by the use of interior pointers.

anecdotal evidence

The Conservative Immix paper shows that conservative stack scanning can beat precise scanning in some cases. I have reproduced these results with parallel-stack-conservative-mmc compared to parallel-mmc. It’s small—maybe a percent—but it was a surprising result to me and I thought others might want to know.

Also, Apple’s JavaScriptCore uses conservative stack scanning, and V8 is looking at switching to it. Funny, right?

conclusion

When it comes to designing a system with GC, don’t count out conservative stack scanning; the tradeoffs don’t obviously go one way or the other, and conservative scanning might be the right engineering choice for your system.

by Andy Wingo at September 07, 2024 10:00 AM

September 06, 2024

Andy Wingo

on taking advantage of ragged stops

Many years ago I read one of those Cliff Click “here’s what I learned” articles in which he was giving advice about garbage collector design, and one of the recommendations was that at a GC pause, running mutator threads should cooperate with the collector by identifying roots from their own stacks. You can read a similar assertion in their VEE2005 paper, The Pauseless GC Algorithm, though this wasn’t the source of the information.

One motivation for the idea was locality: a thread’s stack is already local to a thread. Then specifically in the context of a pauseless collector, you need to avoid races between the collector and the mutator for a thread’s stack, and having the thread visit its own stack neatly handles this problem.

However, I am not so interested any more in (so-called) pauseless collectors; though I have not measured myself, I am convinced enough by the arguments in the Distilling the real costs of production garbage collectors paper, which finds that state of the art pause-minimizing collectors actually increase both average and p99 latency, relative to a well-engineered collector with a pause phase. So, the racing argument is not so relevant to me, because a pause is happening anyway.

There was one more argument that I thought was interesting, which was that having threads visit their own stacks is a kind of cheap parallelism: the mutator threads are already there, they might as well do some work; it could be that it saves time, if other threads haven’t seen the safepoint yet. Mutators exhibit a ragged stop, in the sense that there is no clean cutoff time at which all mutators stop simultaneously, only a time after which no more mutators are running.

Visiting roots during a ragged stop introduces concurrency between the mutator and the collector, which is not exactly free; notably, it prevents objects marked while mutators are running from being evacuated. Still, it could be worth it in some cases.

Or so I thought! Let’s try to look at the problem analytically. Consider that you have a system with N processors, a stop-the-world GC with N tracing threads, and M mutator threads. Let’s assume that we want to minimize GC latency, as defined by the time between when GC is triggered and when mutators resume. There will be one triggering thread that causes GC to begin, and then M–1 remote threads that need to reach a safepoint before the GC pause can begin.

The total amount of work that needs to be done during GC can be broken down into roots(i), the time needed to visit roots for mutator i, and graph, the time to trace the transitive closure of live objects. We want to know whether it’s better to perform roots(i) during the ragged stop or in the GC pause itself.

Let’s first look to the case where M is 1 (just one mutator thread). If we visit roots before the pause, we have

latency(ragged, M=1) = roots(0) + graph/N

Which is to say, thread 0 triggers GC, visits its own roots, then enters the pause in which the whole graph is traced by all workers with maximum parallelism. It may be that graph tracing doesn’t fully parallelize, for example if the graph has a long singly-linked list, but the parallelism will be maximal within the pause as there are N workers tracing the graph.

If instead we visit roots within the pause, we have:

latency(pause, M=1) = (roots(0) + graph)/N

This is strictly better than the ragged-visit latency.

If we have two mutator threads, then we will need to add in some delay, corresponding to the time it takes for remote threads to reach a safepoint. Let’s assume that there is a maximum period (in terms of instructions) at which a mutator will check for safepoints. In that case the worst-case delay will be a constant, and we add it on to the latency. Let us assume also that there are more than two threads available. The marking-roots-during-the-pause case is easiest to analyze:

latency(pause, M=2) = delay + (roots(0) + roots(1) + graph)/N

In this case, a ragged visit could win: while the triggering thread is waiting for the remote thread to stop, it could perform roots(0), moving the work out of the pause, reducing pause time, and reducing latency, for free.

latency(ragged, M=2) = delay + roots(1) + graph/N

However, we only have this win if the root-visiting time is smaller than the safepoint delay; otherwise we are just prolonging the pause. Thing is, you don’t know in general. If indeed the root-visiting time is short relative to the delay, we can treat the roots terms of our equation as 0, and so the choice to mark during the ragged stop doesn’t matter at all! If we assume instead that root-visiting time is long, then it is suboptimally parallelised: under-parallelised if we have more than M cores, oversubscribed if M is greater than N, and needlessly serializing before the pause while it waits for the last mutator to finish marking its roots. What’s worse, root-visiting can actually lengthen the delay, because the oversubscribed threads compete with the visitation work for CPU time.

So in summary, I plan to switch away from doing GC work during the ragged stop. It is complexity that doesn’t pay. Onwards and upwards!

by Andy Wingo at September 06, 2024 12:15 PM

Brian Kardell

1.3 Million Subtests

Today the Servo project crossed a milestone.

If you're not aware, all of the mainstream browsers that exist today use one of three browser projects/engines (Chromium/Blink, Firefox/Gecko or WebKit). These in turn all have their roots in two projects begun in the late 1990s (Gecko and KHTML). Further, they have only 1 funding model, and largely 1 funding source. Over the years, the projects have grown to tens of millions of lines of code and have now had tens of thousands of person-years worth of investment.

All of this is just to underscore that getting a new one is incredibly challenging. Microsoft, a web giant, tried and eventually gave up and embraced Chromium.

But... If you haven't been paying attention, something very fun and exciting has been developing over the last couple of years. New life has been breathed into a movement for what we call "novel engines" - that is, engines that don't come from those initial two (khtml/Gecko).

What's more is that most of the life at least comes from 2 engines which are being developed and funded differently - and they're making rapid progress now.

As of today, one of those novel engines - Servo (stewarded by Igalia) - passes 1.3 million subtests in Web Platform Tests! Specifically: 1,303,530 as of this writing! Congratulations Servo project!

For perspective, that's over 73% of the subtests that it's currently running. Of course, that's a totally arbitrary threshold, but it's nice to track progress and celebrate milestones, and that's a big-sounding round number at which to stop, raise a glass, and say "well done!" and "keep up the good work".

Speaking of keeping up the good work: You can join a growing number of others in helping to support the work on Servo directly by donating through GitHub sponsors or our open collective. The Servo Technical Steering Committee collectively discusses how to prioritize the spending of available funds in the public monthly calls. You can also contribute code, reviews and other effort via the Servo GitHub repository. Let's see how quickly we can reach 1.4 million :)

September 06, 2024 04:00 AM

September 04, 2024

Frédéric Wang

My recent contributions to Gecko (3/3)

Note: This blog post was written in June 2024. As of September 2024, final work to ship the feature is still in progress. Please follow bug 1797715 for the latest updates.

Introduction

This is the final blog post in a series about new web platform features implemented in Gecko, as part of an effort at Igalia to increase browser interoperability.

Let’s take a look at fetch priority attributes, which enable web developers to optimize resource loading by specifying the relative priority of resources to be fetched by the browser.

Fetch priority

The web.dev article on fetch priority explains in more detail how web developers can use fetch priority to optimize resource loading, but here’s a quick overview.

fetchpriority is a new attribute with the value auto (default behavior), high, or low. Setting the attribute on a script, link or img element indicates whether the corresponding resource should be loaded with normal, higher, or lower priority 1:

<head>
  <script src="high.js" fetchpriority="high"></script>
  <link rel="stylesheet" href="auto.css" fetchpriority="auto">
</head>
<body>
  <img src="low.png" alt="low" fetchpriority="low">
</body>

The priority can also be set in the RequestInit parameter of the fetch() method:

await fetch("high.txt", {priority: "high"});

The <link> element has some interesting features. One of them is combining rel=preload and as to fetch a resource with a particular destination 2:

<link rel="preload" as="font" href="high.woff2" fetchpriority="high">

You can even use Link in HTTP response headers and in particular early hints sent before the final response:

103 Early Hints
Link: <high.js>; rel=preload; as=script; fetchpriority=high

These are basically all the places where a fetch priority attribute can be used.

Note that other parameters are also taken into account when deciding the priority to use for resources, such as the position of the element in the page (e.g. blocking resources in <head>), other attributes on the element (<script async>, <script defer>, <link media>, <link rel>…) or the resource’s destination.

Finally, some browsers implement speculative HTML parsing, allowing them to continue fetching resources declared in the HTML markup while the parser is blocked. As far as I understand, Firefox has its own separate HTML parsing code for that purpose, which also has to take fetch priority attributes into account.

Implementation-defined prioritization

If you have not run away after reading the complexity described in the previous section, let’s talk a bit more about how fetch priority attributes are interpreted. The spec contains the following step when fetching a resource (emphasis mine):

If request’s internal priority is null, then use request’s priority, initiator, destination, and render-blocking in an implementation-defined manner to set request’s internal priority to an implementation-defined object.

So browsers would use the high/low/auto hints as well as the destination in order to calculate an internal priority value 3, but the details of this value are not provided in the specification, and it’s up to the browser to decide what to do. This is a bit unfortunate for our interoperability goal, but that’s probably the best we can do, given that each browser already has its own strategies to optimize resource loading. I think this also gives browsers some flexibility to experiment with optimizations… which can be hard to predict when you realize that web devs also try to adapt their content to the behavior of (the most popular) browsers!

In any case, the spec authors were kind enough to provide a note with more suggestions (emphasis mine):

The implementation-defined object could encompass stream weight and dependency for HTTP/2, priorities used in Extensible Prioritization Scheme for HTTP for transports where it applies (including HTTP/3), and equivalent information used to prioritize dispatch and processing of HTTP/1 fetches. [RFC9218]

OK, so what does that mean? I’m not a networking expert, but this is what I could gather after discussing with the Necko team and reading some HTTP specs:

  • HTTP/1 does not have a dedicated prioritization mechanism, but Firefox uses its internal priority to order requests.
  • HTTP/2 has a “stream priority” mechanism and Firefox uses its internal priority to implement that part of the spec. However, it was considered too complex and inefficient, and is likely poorly supported by existing web servers…
  • In upcoming releases, Firefox will use its internal priority to implement the Extensible Prioritization Scheme used by HTTP/2 and HTTP/3. See bug 1865040 and bug 1864392. Essentially, this means using its internal priority to adjust the urgency parameter.

Note that various parts of Firefox rely on NS_NewChannel to load resources, including the fetching algorithm above, which Firefox uses to implement the fetch() method. However, other cases mentioned in the first section have their own code paths with their own calls to NS_NewChannel, so these places must also be adjusted to take the fetch priority and destination into account.

Finishing the implementation work

Summarizing a bit, implementing fetch priority is a matter of:

  1. Adding fetchpriority to DOM objects for HTMLImageElement, HTMLLinkElement, HTMLScriptElement, and RequestInit.
  2. Parsing the fetch priority attribute into an auto/low/high enum.
  3. Passing the information to the callers of NS_NewChannel.
  4. Using that information to set the internal priority.
  5. Using that internal priority for HTTP requests.

Mirko Brodesser started this work in June 2023, and had already implemented almost all of the features discussed above. fetch(), <img>, and <link rel=preload as=image> were handled by Ziran Sun and me, while Valentin Gosu from Mozilla made HTTP requests use the internal priority.

The main blocker was due to that “implementation-defined” use of fetch priority. Mirko’s approach was to align Firefox with the behavior described in the web.dev article, which reflects Chromium’s implementation. But doing so would mean changing Firefox’s default behavior when fetchpriority is not specified (or explicitly set to auto), and it was not clear whether Chromium’s prioritization choices were the best fit for Firefox’s own implementation of resource loading.

After meeting with Mozilla, we agreed on a safer approach:

  1. Introduce runtime preferences to control how Firefox adjusts internal priorities when low, high, or auto is specified. By default, auto does not affect the internal priority so current behavior is preserved.
  2. Ask Mozilla’s performance team to run an experiment, so we can decide the best values for these preferences.
  3. Ship fetch priority with the chosen values, probably cleaning things up a bit. Any other ideas, including the ones described in the web.dev article, could be handled in future enhancements.

We recently entered phase 2 of this plan, so fingers crossed it works as expected!

Internal WPT tests

This project is part of the interoperability effort, but again, the “implementation-defined” part meant that we had very few WPT tests for that feature, really only those checking fetchpriority attributes for the DOM part.

Fortunately Mirko, who is a proponent of Test-driven development, had written quite a lot of internal WPT tests that use internal APIs to retrieve the internal priority. To test Link headers, he used the handy wptserve pipes. The only thing he missed was checking support in Early hints, but some WPT tests for early hints using WPT Python Handlers were available, so integrating them into Mirko’s tests was not too difficult.

It was also straightforward for Ziran and me to extend Mirko’s tests to cover fetch, img, and <link rel=preload as=image>, with one exception: when the fetch() method uses a non-default destination. In most of these code paths, we call NS_NewChannel to perform a fetch. But fetch() is tricky, because if the fetch event is intercepted, the event handler might call the fetch() method again using the same destination (e.g. image).
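
As a minimal illustration of that interception scenario (a sketch only, with hypothetical file names; the real tests go through the WPT infrastructure), a service worker that simply re-issues the intercepted request keeps the original destination:

// service-worker.js (hypothetical, for illustration)
self.addEventListener("fetch", event => {
  // Re-issuing the intercepted request keeps the same destination
  // (e.g. "image"), so the nested fetch() exercises the same code paths.
  event.respondWith(fetch(event.request));
});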

Handling this correctly involves multiple processes and IPC communication, which ended up not working well with the internal APIs used by Mirko’s tests. It took me a while to understand what was happening in bug 1881040, and in the end I came up with a new approach.

Upstreamable WPT tests

First, let’s pause for a moment: all the tests we have so far use an internal API to verify the internal priority, but they don’t actually check how that internal priority is used by Firefox when it sends HTTP requests. Valentin mentioned we should probably have some tests covering that, and not only would it solve the problem with fetch() calls in fetch event handlers, it would also remove the use of an internal API, making the tests potentially reusable by other browsers.

To make this kind of test possible, I added a WPT Python Handler that parses the urgency from an HTTP request and responds with an urgency-dependent resource, such as a stylesheet with different property values, an image of a different size, or an audio or video file of a different duration.

When a test uses resources with different fetch priorities, this influences the urgency values of their HTTP requests, which in turn influences the response in a way that the test can check for in JavaScript. This is a bit complicated, but it works!
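
To give an idea of the shape of such a test, here is a rough sketch in testharness.js style; the handler path and its parameters are made up for illustration and do not correspond to the actual WPT files:

promise_test(async () => {
  // Load the same hypothetical urgency-echoing resource with two different
  // fetch priorities; the handler encodes the urgency it received in the
  // dimensions of the image it returns.
  const load = priority => new Promise((resolve, reject) => {
    const img = new Image();
    img.fetchPriority = priority;
    img.onload = () => resolve(img);
    img.onerror = reject;
    img.src = `resources/echo-urgency.py?as=image&nocache=${Math.random()}`;
  });
  const [high, low] = await Promise.all([load("high"), load("low")]);
  // Differing sizes show that fetchpriority made it down to the network layer.
  assert_not_equals(high.naturalWidth, low.naturalWidth);
}, "fetchpriority on <img> influences the request's urgency");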

Conclusion

Fetch priority has been enabled in Firefox Nightly for a while, and experiments started recently to determine the optimal priority adjustments. If everything goes well, we will be able to push this feature to the finish line after the (northern) summer.

Helping implement this feature also gave me the opportunity to work a bit on the Firefox networking code, which I had not touched since the collaboration with IPFS, and I learned a lot about resource loading and WPT features for HTTP requests.

To me, the “implementation-defined” part was still a bit awkward for the web platform. We had to write our own internal WPT tests and make extra efforts to prepare the feature for shipping. But in the end, I believe things went relatively smoothly.

Acknowledgments

To conclude this series of blog posts, I’d also like to thank Alexander Surkov, Cathie Chen, Jihye Hong, Martin Robinson, Mirko Brodesser, Oriol Brufau, Ziran Sun, and others at Igalia who helped implement these features in Firefox. Thank you to Emilio Cobos, Olli Pettay, Valentin Gosu, Zach Hoffman, and others from the Mozilla community who helped with the implementation, reviews, tests and discussions. Finally, our spelling and grammar expert Delan Azabani deserves special thanks for reviewing this series of blog posts and providing useful feedback.

  1. Other elements have been or are being considered (e.g. <iframe>, SVG <image> or SVG <script>), but these are the only ones listed in the HTML spec at the time of writing. 

  2. As mentioned below, the browser needs to know about the actual destination in order to properly calculate the priority. 

  3. As far as I know, Firefox does not take initiator into account, nor does it support render-blocking yet. 

September 04, 2024 10:00 PM

Tvrtko Ursulin

DRM scheduling cgroup controller

Introduction #

The topic of a Direct Rendering Manager (DRM) cgroup controller is something which has been proposed a few times in the past, but so far is still missing from the Linux graphics stack. Some of those attempts were focusing on controlling the GPU memory usage aspect, while some were concerned with scheduling. As I am continuing to explore this area as part of my work at Igalia, in this post we will discuss one possible way of implementing the latter.

The general problem statement which we are trying to address is the fact that many GPUs (and their respective kernel drivers) can simultaneously schedule workloads from different clients, and that there are use-cases where having external control over scheduling decisions would be beneficial.

But first, let us clarify what we mean by “external control”. By that term we refer to the scheduling decisions being influenced from outside of the actual process doing the rendering. If we were to draw a parallel to CPU scheduling, that would be the difference between a process (or a thread) issuing a system call such as setpriority(2) or nice(2) itself (“internal control”), versus its scheduling priority being modified by an external entity such as the user issuing the renice(1) shell command, launching the executable via the nice(1) shell command, or even using the CPU scheduling cgroup controller (“external control”).

This has two benefits. Firstly, it is the user who typically knows which tasks are higher priority and which should run in the background, and should therefore, as much as possible, be prevented from starving the foreground tasks of resources. Secondly, external control can be applied to any process in a unified manner, without the need for applications to individually expose the means to control their scheduling priority.

If we now return to the world of GPU scheduling, we find ourselves in a landscape where internal scheduling control is possible with many GPU drivers, but external control is not. Improving on that brings some technical and conceptual challenges, because GPUs are not as nice and uniform in their scheduling needs and capabilities as CPUs are. But if we can come up with something reasonable, even if not perfect, it could bring improvements to the user experience in a variety of scenarios.

Past attempts - Priority based controllers #

The earliest attempt I can remember was from 2018, by Matt Roper[1], who proposed to implement a driver-specific priority based controller. The RFC limited itself to i915 (the kernel driver for Intel GPUs) and, although the priority-based setup is well established in the world of CPU scheduling and its effects are easy to understand, the proposal did not gain much traction.

Because of the aforementioned advantages, when I proposed my version of the controller in 2022[2], it also included a slightly different version of a priority-based controller. In contrast to the earlier one, this proposal was in principle driver-agnostic and the priority levels were also abstracted.

The proposal was also accompanied by benchmark results showing that the approach was effective in allowing users on Linux to launch GPU tasks in the background, while leaving more GPU bandwidth to the foreground task than when not using the controller. Similarly on ChromeOS, when wired into the focused versus un-focused window cgroup management, it was able to demonstrate relatively more GPU time given to the foreground window.

Current proposal - Weight based controller #

Anticipating the potential lack of sufficient support for this approach, the same RFC also included a second controller which takes a different route. It abstracts things one step further and implements a weight-based controller based on GPU utilisation[3].

The basic idea is that the GPU time budget is split based on relative group weights across the cgroup hierarchy, and that the controller notifies the individual DRM drivers when their clients are over budget. From there it is left for the individual drivers to know how to best manage this situation, depending on the specific scheduling capabilities of the driver and the GPU hardware.

The user interface completely mimics the existing CPU and IO cgroup controllers with the single drm.weight control file. The weights carry no absolute meaning and are only relative within a single group of siblings. Their only purpose is to split out the time budget between them.

Visually one potential cgroup configuration could look like this:

A visual representation of cgroup hierarchy

The DRM cgroup controller then executes a periodic scanning task which queries each DRM client for its GPU usage and notifies drivers when clients are over their allocated budget.

If we expand the concept with runtime adjustment of group weights based on window focus status, with two graphically active clients such as a game and a web browser, we can end up with the following two scenarios:

Example cgroup hierarchy re-configured based on window focus

Here we show the actual GPU utilisation of each group together with their drm.weight. On the left hand side the web browser is the focused window, with the weights 100-to-10 in its favour.

The compositor is not using its full 200 / (200 + 100) share, so a portion is passed on to the desktop group, to the extent of the full 80% required. Inside the desktop group the game is currently using 70%, while its actual allocation is 80% * (10 / (100 + 10)) = 7.27%. It is therefore currently consuming more than its budget, and the corresponding DRM driver will be notified by the controller and will be able to do something about it.

After the user has given focus to the game window, relative weights will be adjusted and so will the budgets. Now the web browser will be over budget and therefore it can be throttled down, limiting the effect of its background activity on the foreground game window.

First driver implementation - i915 #

Back when I started developing this idea, Intel GPUs were my main focus, which is why i915 was the first driver I wired up with the controller.

There I implemented a rather simple approach of dynamically adjusting the scheduling priority of the throttled contexts, by an amount proportional to how much the client is over budget in relative terms.

The implementation would also cross-check against the physical engine utilisation, since in i915 we have easy access to that metric, and only throttle if the latter is close to being fully utilised. (Why this makes sense could be an interesting digression relating to the fact that a single cgroup can in theory contain multiple GPUs and multiple clients using a mix of those GPUs. But let's leave that for later.)

One of the scenarios I used to test how well this works is to run two demanding GPU clients, each in its own cgroup, tweak their relative weights, and see what happens. The results were encouraging and are shown in the following table.

Benchmark results under Linux

We can see that, when a client's group weight was decreased, the GPU bandwidth it was receiving also went down, as a consequence of the lowered context priority after receiving the over-budget notification.

This is a suitable moment to mention that the DRM cgroup controller does not promise perfect control, that is, achieving the actual GPU sharing ratios as expressed by group-relative weights. As we have mentioned before, GPU scheduling is not nearly at the same level of quality and granularity as in the CPU world, so the goal it sets is simply to improve things - to do something which has a positive impact on user experience. At the same time, the mechanism and control interface proposed do not preclude individual drivers from doing as good a job as they can, or even a future possibility of replacing the inner workings of the controller with something smarter, with no need to change the user space control interface.

Going back to the initial i915 implementation, the second test I did was attempting to wire it up with the background/foreground window focus handling in ChromeOS. There I experimented with a game (Android VM) running in parallel with a WebGL demo in a browser. At a certain point, after both clients were running, I lowered the weight of the background game, and in the screenshot below we can see how the FPS metric in the browser jumped up.

Screenshot of foreground versus background change under ChromeOS

This illustrates how having the controller can indeed improve the user experience. The user’s focus will be on the foreground window, and therefore it does make sense to prioritise GPU access for that client, for better interactivity and smoother rendering there. In fact, in this example the actual FPS jumped from around 48-49 to 60fps, meaning that throttling the background client allowed the foreground one to match its rendering to the display’s refresh rate.

Second implementation - amdgpu #

AMD’s kernel module was the next interesting driver which I wired up with the controller.

The fact that its scheduling is built on top of the DRM scheduler with only three distinct priority levels mandated a different approach to throttling. We keep a sorted list of “most offending” clients (most out of budget, or most borrowed unused budget from the sibling group), with the idea that the top client on that list gets throttled by lowering its scheduling priority. That was relatively straightforward to implement and sounded like it could potentially satisfy the most basic use case of background task isolation.

To test the runtime behaviour we set up two sibling cgroups and vary their relative scheduling weights. In one cgroup we run glxgears with vsync turned off and log its frame rate over time, while in the second group we run glmark2.

Let us first have a look at how the glxgears frame rate varies during this test, depending on three different scheduling weight ratios between the cgroups. The scheduling weight ratio is expressed as glxgears:glmark2, i.e. 10:1 means the glxgears scheduling weight was ten times that configured for glmark2.

We can observe that, as glmark2 progresses through its various sub-benchmarks, the glxgears frame rate changes too. But it was overall higher in the runs where the scheduling weight ratio was in its favour. That is a positive result, showing that even a simple implementation seems to have the desired effect, at least to some extent.

Scheduling weight effect on glxgears running in parallel with glmark2

For the second test we can look from the perspective of glmark2, checking how the benchmark score changes depending on the ratio of scheduling weights.

Scheduling weight effect on glmark2 running in parallel with glxgears

Again we see that the scores are generally improving when the scheduling weight ratio is increased in favour of the benchmark.

However, in neither case is the change in the result proportional to the actual ratios. This is because the primitive implementation is not able to precisely limit the “background” client, but is only able to achieve some throttling. Also, there is an inherent delay in how fast the controller can react, given that the control loop is based on periodic scanning. This period is configurable and was set to two seconds for the above tests.

Conclusion #

Hopefully this write-up has managed to demonstrate two main points:

  • First, that a generic and driver-agnostic approach to a DRM scheduling cgroup controller can improve user experience and enable new use cases, while at the same time following the established control interface as it exists for CPU and IO control, which makes it future-proof and extendable;

  • Secondly, that even relatively basic driver implementations can be somewhat effective in providing positive control effects.

It also probably needs to be re-iterated that neither the driver implementations nor the cgroup controller implementation itself is limited by the user interface proposed. Both could be independently improved under the hood in the future.

What is next? There is more work to be done such as conducting more detailed testing, polishing the implementation and potentially attempting to wire up more drivers to the controller. Further advocacy work in the DRM community too.

References #


  1. https://lore.kernel.org/dri-devel/20180120015141.10118-1-matthew.d.roper@intel.com/ ↩︎

  2. https://lore.kernel.org/lkml/20221019173254.3361334-1-tvrtko.ursulin@linux.intel.com/ ↩︎

  3. https://lore.kernel.org/lkml/ZVE3shwiRbUQyAqs@mtj.duckdns.org/T/ ↩︎

September 04, 2024 12:00 AM

August 29, 2024

Brian Kardell

What makes it exciting?

Some thoughts on the things that we get excited about, or don't - and why.

Hey, check this out...

That's Igalia's website running in Google Chrome in kiosk mode. In kiosk mode there is no surrounding "chrome" and so no rendered controls. It has none of that stuff that we normally associate with "browsers" (tabs, back/forward/refresh buttons, a URL bar, etc.).

In fact, while I say that that's Google Chrome running in kiosk mode, it could just as well be an embedded browser, like WPEWebKit, which has no built-in controls. Is it?

I'm not telling 😏.

But, it doesn't really matter: I'm just using it to set up the question of whether you feel like that is a browser? And perhaps whether WPEWebKit excites you as a browser? There's no "right" answer.

Ok, hold that thought, and look at this...

This is the webkit.org site running in the WebKit mini-browser. It isn't Safari, it is just... er... WebKit. Is it a browser?

On the one hand, it's right there in the name, right? It's a Mini...browser. And, it does have the most important controls.

Still, you can't download it as a finished product, you have to build it. And no one thinks you should use this as your daily browser. Does it capture your attention as a browser?

Probably not?

Ok, now what about this:

That's Ladybird. Yes, that's exciting, right?

I mean, I totally agree...

But the reason it's exciting has almost nothing to do with it being a browser. The Ladybird browser is hardly more than the mini-browser, in fact. Its purpose is the same and how you get it is the same. What makes it exciting is that it's based on a novel engine.

That said, the excitement isn't currently (or probably for the foreseeable future) of the form "because we can actually use that browser day to day". Unlike the mini-browsers of the others, there is a lot left to do here, and the long tail and tricky bits of the web that are needed are... well, a lot, and tricky. They also have this pesky issue where the more people use your browser, the more bugs and unwritten rules you find you have to manage, all without breaking something else. So, Ladybird's got a long way to go to get to a place where you can practically use it -- but it is exciting.

Anyway, I asked some people a while back if they were as excited about Servo and they said "well, but it's not a browser".

Batman, thinking...
Hmmmmm...

And, indeed, recently someone from the Servo community created a GitHub repo which will be a browser based on Servo (named Verso - great name!) and the crowd seemingly did go wild on that (double the points of almost any other Servo-related post on the orange website), despite the fact that at the time, it was effectively still a mostly-non-functional repo.

So, anyway... I just wanted to step back and say...

Are you not entertained?

That's the Servo mini-browser running 3 tabs. And you know how I got it? I went to servo.org, downloaded it, installed it and launched it. I can currently use keyboard commands to launch and close tabs. It's also open source, so anyone could in theory have a better downstream one (maybe Verso?) or help grow a browser in Servo itself (more like Firefox, I guess).

So... Is Servo a browser?

I mean... no. But also... yes and it's super exciting.

You should be excited about it, I think.

You can help support it with some funding through GitHub or Open Collective (wealthy donors and business support is welcome too :)). How to choose and prioritize the work funded by donations is decided through collective discussions on the public monthly calls of the Servo Technical Steering Committee.

August 29, 2024 04:00 AM

August 28, 2024

Jani Hautakangas

Bringing WebKit back to Android: Progress and Perspectives

In my previous blog post, I delved into the technical aspects of reintroducing WebKit to the Android platform. It was an exciting journey, filled with the challenges and triumphs that come with working on a project as ambitious as WPE-Android. However, I realize that the technical depth of that post may have left some readers seeking more context. Today, I want to take a step back and offer a broader view of what this project is all about—why we’re doing it, how it builds on the WPEWebKit engine, and the progress we’ve made so far.

The Vision: Reviving WebKit on Android #

WebKit has a storied history in the world of web browsers, serving as the backbone for Safari, Epiphany, and many embedded browsers. However, over time, Android’s landscape has shifted toward Blink/Chromium, the engine behind Chrome. While Blink and Chromium have undoubtedly shaped the modern web, there are compelling reasons to bring WebKit back to Android.

WPE-Android is an effort to reintroduce WebKit into the Android ecosystem as a modern, efficient, and secure browser engine. Our goal is to provide developers with more options—whether they’re building full-fledged browsers, integrating web views into native apps, or exploring innovative applications in IoT and embedded systems. By leveraging WebKit’s unique strengths, we’re opening new doors for creativity and innovation on the Android platform.

Why WPEWebKit? #

At the heart of WPE-Android is WPEWebKit, a streamlined version of the WebKit engine specifically optimized for embedded systems. Unlike its desktop counterpart, WPEWebKit is designed to be lightweight, efficient, and highly adaptable to various hardware environments. This makes it an ideal foundation for bringing WebKit back to Android.

The decision to base WPE-Android on WPEWebKit is strategic. WPEWebKit is not only performant but also backed by a strong community of developers and organizations dedicated to its continuous improvement. This community-driven approach ensures that WPE-Android benefits from a robust, well-maintained codebase, with contributions from experts around the world.

Building on a Strong Foundation #

Since the inception of WPE-Android, our focus has been on making WebKit a viable option for Android developers. This involves more than just getting the engine to run on Android—it’s about ensuring that it’s stable, integrates seamlessly with Android’s unique features, and offers a developer-friendly experience.

A significant part of our work has involved optimizing the interaction between WPEWebKit and Android’s graphics stack. As part of that, we decided to focus on Android API level 30 and higher to keep the prototyping phase faster and simpler. Our efforts have aimed at achieving smooth and consistent performance, ensuring that WPE-Android can meet the needs of modern Android applications.

We are building a foundation to run instrumentation tests in CI to ensure that we don’t regress and that we get consistent results that match Android’s system WebView APIs. We continue adding more APIs that are similar to Android System WebView offerings and provide similar results.

Additionally, we’ve focused on enhancing the integration of WPE-Android with Android-specific features. This includes improving support for touch input and dialogs, refining the way web views are handled within native Android applications, and ensuring compatibility with the Android development environment. These enhancements make WPE-Android a natural fit for developers who are already familiar with the Android platform.

What’s new #

Most of the changes are under the hood improvements. The task that required the most effort was upgrading and rebasing our patches on top of Cerbero. After we upgraded to WPE WebKit 2.44.1, we required a more recent GLib version provided by the newer Cerbero version. Along with the upgrade, we managed to refactor and squash many of the patches that we had on top of Cerbero. We went from 175 patches down to 66, which will simplify the next upgrade.

Here’s a list of the most notable changes since the last update:

  • Upgraded to WPE WebKit 2.44.1.
  • Upgraded Cerbero to version 1.24.2.
  • Upgraded Android NDK to version r26d.
  • Migrated from libsoup2 to libsoup3 for HTTP/2 support.
  • Support for proper device scale ratio according to Android’s DisplayMetrics. This takes into account the screen size and pixel density, automatically adapting rendered content to show with appropriate dimensions on all devices.
  • Support for JS dialogs (Alert, Confirm, Prompt). Integrates Android dialogs with JavaScript alert(), confirm(), and prompt() prompts. Also provides an option to build custom native dialogs for these prompts.
  • Instrumentation tests for recently added features and a CI pipeline for running them.
  • API to receive HTTP errors. WPEViewClient interface onReceivedHttpError to catch HTTP error codes >= 400.
  • API to evaluate JavaScript. Provides the WPEView method evaluateJavascript to inject and evaluate JavaScript code on a loaded page.

Demos #

Dialog prompts #

The demo shows the default WPEView alert() prompt integration on the left side. On the right side, an application using WPEView has overridden the onJsAlert method from the WPEChromeClient interface and provides a custom native alert dialog for the JavaScript alert() prompt. The custom dialog is constructed using Android’s AlertDialog.Builder factory. Similar customization can be applied to JavaScript confirm() and prompt() prompts by overriding the onJsConfirm and onJsPrompt methods from the WPEChromeClient interface.

Default WPEView alert dialog
Custom WPEView alert dialog

Evaluate javascript #

The demo shows how to inject JavaScript and call functions on a loaded page from Kotlin code.

HTML and JavaScript:

<script>
  function showName(message) {
    document.getElementById('name').innerHTML = message;
  }
</script>
<center>
  <br><br><br><br>
  <h1>WPEView</h1>
  <p>Evaluate javascript</p>
  <br><br><br><br>
  <h2>What's your name?</h2>
  <h1 id="name"></h1>
</center>

Kotlin/WPEView code:

binding.toolbarButton.setOnClickListener {
    webview.evaluateJavascript("showName(\"" + binding.toolbarEditText.text + "\")", null)
    binding.toolbarEditText.setText("")
}

Device scale factor #

Android devices come with a variety of screen sizes, resolutions, and screen densities (pixels per inch, also known as ppi). In order for the UI to look consistent and good across all different devices, the device scale factor needs to be applied to the UI. Screen density can be fetched via the Android DisplayMetrics API, and in WPE WebKit, this corresponds to the device scale factor that can be set using wpe_view_backend_dispatch_set_device_scale_factor. Previously, in WPE-Android, we had hardcoded that value to 2.0, but now we are using proper metrics specific to each device.

Below are some screenshots from before and after applying the proper device scale. I’m using a Google Pixel 7 device, which has a density value of 2.75.

Old hardcoded device scale factor 2.0
Device scale factor from DisplayMetrics density

Looking Forward #

Our goal is to make WPE-Android even more accessible and usable for the broader Android development community. This involves ongoing performance optimizations, expanding device compatibility, and potentially providing more resources like documentation, example projects, and developer tools to ease the adoption of WPE-Android.

We believe that by offering WebKit as a viable option on Android, we’re contributing to a more diverse and innovative web ecosystem. WPE-Android is not just about bringing back a familiar engine—it’s about giving developers the tools they need to create fast, secure, and beautiful web experiences on Android devices.

Conclusion #

The journey of bringing WebKit back to Android has been both challenging and rewarding so far. By building on the strong foundation of WPEWebKit, we’re crafting a tool that empowers developers to push the boundaries of what’s possible with web technologies on Android. The progress we’ve made so far is just the beginning, and I’m excited to see how the project will continue to evolve.

If you’re interested in learning more or getting involved, you can find all the details on the WPE-Android GitHub page.

August 28, 2024 12:00 AM

August 26, 2024

Sergio Villar

A New Way to Browse: Eye Tracking Comes to Wolvic!

We’re thrilled to share some exciting news with you. Wolvic is about to transform how you interact with the web in a VR environment with the introduction of eye tracking support! Starting with the just released v1.7.0 release on the Gecko backend and the highly anticipated v1.0 release on the Chromium backend, you’ll be able to control the browser pointer just by looking at what you want to interact with. While this feature is still being refined, it’s a fantastic start, and we can’t wait for you to try it out.

August 26, 2024 06:07 AM

August 25, 2024

Andy Wingo

whippet update: faster evacuation, eager sweeping of empty blocks

Good evening. Tonight, notes on things I have learned recently while hacking on the Whippet GC library.

service update

For some time now, the name Whippet has referred to three things. Firstly, it is the project as a whole, consisting of an include-only garbage collection library containing a compile-time configurable choice of specific collector implementations. Also, it is the name of a specific Immix-derived collector. Finally, it is the name of a specific space within that collector, in which objects are mostly marked in place but can be evacuated if appropriate.

Well, naming being one of the two hard problems of computer science, I can only ask for forgiveness and understanding. I have started fixing this situation with the third component, renaming the whippet space to the nofl space. Nofl stands for no-free-list, indicating that it’s a (mostly) mark space but which does bump-pointer allocation instead of using freelists. Also, it stands for novel, in the sense that as far as I can tell, it is a design that hasn’t been tried yet.

unreliable evacuation

Following Immix, the nofl space has always had optimistic evacuation. It prefers to mark objects in place, but if fragmentation gets too high, it will try to defragment by evacuating sparse blocks into a small set of empty blocks reserved for this purpose. If the evacuation reserve fills up, nofl will dynamically switch to marking in place.

My previous implementation was a bit daft: some subset of blocks would get marked as being evacuation targets, and evacuation would allocate into those blocks in ascending address order. Parallel GC threads would share a single global atomically-updated evacuation allocation pointer. As you can imagine, this was a serialization bottleneck; I initially thought it wouldn’t be so important but for some workloads it is.

I had chosen this strategy to maximize utilization of the evacuation reserve; if you had 8 GC workers, each allocating into their own block, their blocks won’t all be full at the end of GC; that would waste space.

But reliability turns out to be unimportant. It’s more important to let parallel GC threads do their thing without synchronization, to the extent possible.

Also, this serialized allocation discipline imposed some complexity on the nofl space implementation; the evacuation allocator was not like the “normal” allocator. With worker-local allocation buffers, these two allocators are now essentially the same. (They differ in that the normal allocator interleaves lazy sweeping with allocation, and can allocate into blocks with survivors from the last GC instead of requiring empty blocks.)

eager sweeping

Another initial bad idea I had was to lean too much on lazy sweeping as a design principle. The idea was that deferring sweeping work until just before an allocator would write to a block would minimize cache overhead (you page in a block right when you will use it) and provide for workload-appropriate levels of parallelism (multiple mutator threads naturally parallelize sweeping).

Lazy sweeping was very annoying when it came to discovery of empty blocks. Empty blocks are precious: they can be returned to the OS if needed, they are useful for evacuation, and they have nice allocation properties, in that you can just do bump-pointer from beginning to end.

Nofl was discovering empty blocks just in time, from the allocator. If the allocator acquired a new block and found that it was empty, it would return it to a special list of empty blocks. Only if all sweepable pages were exhausted would an allocator use an empty block. But to prevent an allocator from pausing forever, there was a limit to the number of empty swept blocks that would be returned to the collector (10, as it happens); an 11th empty swept block would be given to a mutator for allocation. And so on and so on. Complicated, and you only know the number of empty blocks yielded by the last collection when the whole next allocation cycle has finished.

The obvious solution is some kind of separate mark on blocks, in addition to a mark on objects. I didn’t do it initially out of fear of overhead; marking is a fast path. The implementation I ended up making was a packed bitvector, with one bit per 64 kB block, at the beginning of each 4 MB slab of blocks. The beginning blocks are for metadata like this. For reasons, I don’t have the space for full bytes. When marking an object, if I see that a block’s mark is unset, I do an atomic_fetch_or_explicit on the byte with memory_order_relaxed ordering. In this way I only do the atomic op very rarely. It seems that on ARMv8.1 there is actually an instruction to do atomic bit setting; everywhere else it’s a garbage compare-and-swap thing, but on my x64 machine it’s fine.

Then after collection, during the pause, if I see a block is unmarked, I move it directly to the empty set instead of sweeping it. (I could probably do this concurrently outside the pause, but that would be for another day.)

And the overhead? Negative! Negative, in the sense that because I don’t have to sweep empty blocks, and that (for some workloads) collection can produce a significant-enough fraction of empty blocks, I actually see speedups with this strategy, relative to lazy sweeping. It also simplifies the allocator (no need for that return-the-11th-block logic).

The only wrinkle is as regards generational collection: nofl currently uses the sticky mark bit algorithm, which has to be applied also to block marks. Subtle, but not complicated.

fin

Next up is growing and shrinking the nofl-using Whippet collector (which might need another name), using the membalancer algorithm, and then I think I will be ready to move on to getting Whippet into Guile. Until then, happy hacking!

by Andy Wingo at August 25, 2024 08:29 PM

August 19, 2024

Andy Wingo

javascript weakmaps should be iterable

Good evening. Tonight, a brief position statement: it is a mistake for JavaScript’s WeakMap to not be iterable, and we should fix it.

story time

A WeakMap associates a key with a value, as long as the key is otherwise reachable in a program. (It is an ephemeron table.)

When WeakMap was added to JavaScript, back in the ES6 times, some implementors thought that it could be reasonable to implement weak maps not as a data structure in its own right, but rather as a kind of property on each object. Under the hood, adding a key→value association to a map M would set key[M] = value. GC would be free to notice dead maps and remove their associations in live objects.
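
Concretely, the idea was something like this toy sketch (my illustration, glossing over details such as non-extensible keys and cleaning up associations when a map dies):

class DistributedWeakMap {
  #slot = Symbol("weak-map-slot");  // one hidden property key per map instance
  set(key, value) { key[this.#slot] = value; return this; }
  get(key) { return key[this.#slot]; }
  has(key) { return this.#slot in key; }
  delete(key) { return delete key[this.#slot]; }
}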

If you implement weak maps like this, or are open to the idea of such an implementation, then you can’t rely on the associations being enumerable from the map itself, as they are instead spread out among all the key objects. So, ES6 specified WeakMap as not being iterable; given a map, you can’t know what’s in it.

As with many things GC-related, non-iterability of weak maps then gained a kind of legendary status: the lore states that non-iterability preserves some key flexibility for JS implementations, and therefore it is good, and you just have to accept it and move on.

dynamics

Time passed, and two things happened.

One was that this distributed WeakMap implementation strategy did not pan out; everyone ended up implementing weak maps as their own kind of object, and people use an algorithm like the one Steve Fink described a couple years ago to compute the map×key⇒value conjunction. The main original motivation for non-iterability was no longer valid.

The second development was WeakRef and FinalizationRegistry, which expose some details of reachability as viewed by the garbage collector to user JS code. With WeakRef (and WeakMap), you can build an iterable WeakMap.

(Full disclosure: I did work on ES6 and had a hand in FinalizationRegistry but don’t do JS language work currently.)
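
To make that concrete, here is a minimal sketch of such a user-space iterable WeakMap (my own illustration, not the construction referenced in the WeakRef spec, and sharing its caveats):

class IterableWeakMap {
  #refs = new Set();                 // WeakRefs to the live keys
  #map = new WeakMap();              // key → { value, ref }
  #registry = new FinalizationRegistry(ref => this.#refs.delete(ref));

  set(key, value) {
    const entry = this.#map.get(key);
    if (entry) { entry.value = value; return this; }
    const ref = new WeakRef(key);
    this.#refs.add(ref);
    this.#map.set(key, { value, ref });
    this.#registry.register(key, ref, ref);
    return this;
  }
  get(key) { return this.#map.get(key)?.value; }
  has(key) { return this.#map.has(key); }
  delete(key) {
    const entry = this.#map.get(key);
    if (!entry) return false;
    this.#map.delete(key);
    this.#refs.delete(entry.ref);
    this.#registry.unregister(entry.ref);
    return true;
  }
  *[Symbol.iterator]() {
    // Iterate the surviving keys; dead keys are pruned lazily by the registry.
    for (const ref of this.#refs) {
      const key = ref.deref();
      if (key !== undefined) yield [key, this.#map.get(key).value];
    }
  }
}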

Thing is, your iterable WeakMap is strictly inferior to what the browser can provide: its implementation is extraordinarily gnarly, shipped over the wire instead of already in the browser, uses more memory, is single-threaded and high-latency (because FinalizationRegistry), and non-standard. What if instead as language engineers we just did our jobs and standardized iterability, as we do with literally every other collection in the JS standard?

Just this morning I wrote yet another iterable WeakSet (which has all the same concerns as WeakMap), and while it’s sufficient for my needs, it’s not good (lacking prompt cleanup of dead entries), and by construction can’t be great (because it has to be redundantly implemented on top of WeakSet instead of being there already).

I am sympathetic to deferring language specification decisions to allow the implementation space to be explored, but when the exploration is done and the dust has settled, we shouldn’t hesitate to pick a winner: JS weak maps and sets should be iterable. Godspeed, brave TC39 souls; should you take up this mantle, you are doing the Lord’s work!

Thanks to Philip Chimento for notes on the timeline and Joyee Cheung for notes on the iterable WeakMap implementation in the WeakRef spec. All errors mine, of course!

by Andy Wingo at August 19, 2024 08:13 PM

Emmanuele Bassi

On Vala

It seems I raised a bit of a stink on Twitter last week:

Of course, and with reason, I’ve been called out on this by various people. Luckily, it was on Twitter, so we haven’t seen articles on Slashdot and Phoronix and LWN with headlines like “GNOME developer says Vala is dead and will be removed from all servers for all eternity and you all suck”. At least, I’ve only seen a bunch of comments on Reddit about this, but nobody cares about that particular cesspool of humanity.

Sadly, 140 characters do not leave any room for nuance, so I should probably clarify what I wrote on a venue with no character limit.

First of all, I’d like to apologise to people that felt I was attacking them or their technical choices: it was not my intention, but see above, re: character count. I may have only about 1000 followers on Twitter, but it seems that the network effect is still a bit greater than that, so I should be careful when wording opinions. I’d like to point out that it’s my private Twitter account, and you can only get to what it says if you follow me, or if you follow people who follow me and decide to retweet what I write.

My PSA was intended as a reflection on the state of Vala, and its impact on the GNOME ecosystem in terms of newcomers, from the perspective of a person that used Vala for his own personal projects; recommended Vala to newcomers; and has to deal with the various build issues that arise in GNOME because something broke in Vala or in projects using Vala. If you’re using Vala outside of GNOME, you have two options: either ignore all I’m saying, as it does not really apply to your case; or do a bit of soul searching, and see if what I wrote does indeed apply to you.

First of all, I’d like to qualify my assertion that Vala is a “dead language”. Of course people see activity in the Git repository, see the recent commits and think “the project is still alive”. Recent commits do not tell a complete story.

Let’s look at the project history for the past 10 cycles (roughly 2.5 years). These are the commits for every cycle, broken up into two values: one for the full repository, the other for the whole repository except the vapi directory, which contains the VAPI files for language bindings:

Commits

Aside from the latest cycle, Vala has seen very little activity; the project itself, if we exclude binding updates, has seen less than 100 commits for every cycle — sometimes even far less. The latest cycle is a bit of an outlier, but we can notice a pattern of very little work for two/three cycles, followed by a spike. If we look at the cycle currently in progress, we can already see that the number of commits has decreased back to 55/42, as of this morning.

Commits

Number of commits is just a metric, though; more important is the number of contributors. After all, small, incremental changes may be a good thing in a language — though, spoiler alert: they are usually an indication of a series of larger issues, and we’ll come to that point later.

These are the number of developers over the same range of cycles, again split between committers to the full repository and to the full repository minus the vapi directory:

Developers

As you can see, the number of authors of changes is mostly stable, but still low. If we have few people that actively commit to the repository it means we have few people that can review a patch. It means patches linger longer and longer, while reviewers go through their queues; it means that contributors get discouraged; and, since nobody is paid to work full time on Vala, it means that any interruption caused by paid jobs will be a bottleneck on the project itself.

These concerns are not unique to a programming language: they exist for every volunteer-driven free and open source project. Programming languages, though, like core libraries, are problematic because any bottleneck causes ripple effects. You can take any stalled project you depend on, and vendor it into your own, but if that happens to the programming language you’re using, then you’re pretty much screwed.

For these reasons, we should also look at how well-distributed the workload in Vala is, i.e. which percentage of the work is done by the authors of those commits; the results are not encouraging. Over that range of cycles, only two developers routinely crossed 5% of the commits:

  • Rico Tzschichholz
  • Jürg Billeter

And Rico has been the only one to consistently author >50% of the commits. This means there’s only one person dealing with the project on a day-to-day basis.

As the maintainer of a project who basically had to do all the work, I cannot even begin to tell you how soul-crushing that can become. You get burned out, and you feel responsible for everyone using your code, and then you get burned out some more. I honestly don’t want Rico to burn out, and you shouldn’t, either.

So, let’s go into unfair territory. These are the commits for Rust — the compiler and standard library:

Rust

These are the commits for Go — the compiler and base library:

Go

These are the commits for Vala — both compiler and bindings:

Vala

These are the number of commits over the past year. Both languages are younger than Vala, have more tools than Vala, and are more used than Vala. Of course, it’s completely unfair to compare them, but those numbers should give you a sense of scale, of what is the current high bar for a successful programming language these days. Vala is a niche language, after all; it’s heavily piggy-backing on the GNOME community because it transpiles to C and needs a standard library and an ecosystem like the one GNOME provides. I never expected Vala to rise to the level of mindshare that Go and Rust currently occupy.

Nevertheless, we need to draw some conclusions about the current state of Vala — starting from this thread, perhaps, as it best encapsulates the issues the project is facing.

Vala, as a project, is limping along. There aren’t enough developers to actively effect change on the project; there aren’t enough developers to work on ancillary tooling — like build system integration, debugging and profiling tools, documentation. Saying that “Vala compiles to C so you can use tools meant for C” is comically missing the point, and it’s effectively like saying that “C compiles to binary code, so you can disassemble a program if you want to debug it”. Being able to inspect the language using tools native to the language is a powerful thing; if you have to do the name mangling in your head in order to set a breakpoint in GDB you are elevating the barrier of contributions way above the head of many newcomers.

Being able to effect change means also being able to introduce change effectively and without fear. This means things like continuous integration and a full test suite heavily geared towards regression testing. The test suite in Vala is made of 210 units, for a total of 5000 lines of code; the code base of Vala (vala AST, codegen, C code emitter, and the compiler) is nearly 75 thousand lines of code. There is no continuous integration, outside of the one that GNOME Continuous performs when building Vala, or the one GNOME developers perform when using jhbuild. Regressions are found after days or weeks, because developers of projects using Vala update their compiler and suddenly their projects cease to build.

I don’t want to minimise the enormous amount of work that every Vala contributor brought to the project; they are heroes, all of them, and they deserve as much credit and praise as we can give. The idea of a project-oriented, community-oriented programming language has been vindicated many times over, in the past 5 years.

If I scared you, or incensed you, then you can still blame me, and my lack of tact. You can still call me an asshole, and you can think that I’m completely uncool. What I do hope, though, is that this blog post pushes you into action. Either to contribute to Vala, or to renew your commitment to it, so that we can look at my words in 5 years and say “boy, was Emmanuele wrong”; or to look at alternatives, and explore new venues in order to make GNOME (and the larger free software ecosystem) better.

by ebassi at August 19, 2024 10:53 AM

August 10, 2024

Abhijeet Kandalkar

Exploring sandboxing in CEF and CEFSharp for Windows platform

Sandboxing is a critical feature for ensuring the security and stability of applications that embed web content. In this blog, we will delve into how sandboxing is achieved in the Chromium Embedded Framework (CEF). All analysis in this blog is performed on the Windows operating system.

CEFClient and CEFSharp

The base CEF framework includes support for the C and C++ programming languages. It also provides test applications to demonstrate its functionality. CEFClient, one of the C++ test applications, serves this purpose.

CEF provides CAPI bindings to enable application development across various programming languages. For integrating the CEF framework with .NET, you should explore CEFSharp: a C# project that leverages the CEF open source project, allowing developers to create applications in C#.

CEFSharp(C#)
            \  CAPI 
              -------> CEF(C++) ------> Chromium(C++)
            /
CEFClient(C++)

What is the sandbox?

As mentioned in documentation of sandbox in Chromium repository,

The sandbox is a C++ library that allows the creation of sandboxed processes — processes that execute within a very restrictive environment. The only resources sandboxed processes can freely use are CPU cycles and memory. For example, sandboxed processes cannot write to disk or display their own windows. What exactly they can do is controlled by an explicit policy. Chromium renderers are sandboxed processes.

Understanding Sandboxing in CEF

Since CEF uses Chromium internally, we decided to investigate how sandboxing is implemented by examining Chromium’s source code. Our findings revealed that the sandbox is designed to be versatile, with no hard dependencies on the Chromium browser itself, making it usable with other applications.

CEF prepares a static library using the existing sandbox implementation in the Chromium repository and links it to the main application. This process enables the sandbox functionality in CEF applications. For more detailed information, refer to the sandbox_win.cc file in the CEF repository.

if (is_win) {
  static_library("cef_sandbox") {
    sources = [ "libcef_dll/sandbox/sandbox_win.cc" ]
    include_dirs = [ "." ]
    deps = [ "libcef/features", "//sandbox" ]
  }
}

Implementing Sandboxing in CEFClient

CEFClient achieves sandboxing by linking to the cef_sandbox library:

  executable("cefclient") {
      deps += [
        ":cef_sandbox",
      ]

In the CEFClient application, we need to create a sandbox object and pass it to CEF, which then forwards it internally to Chromium to configure the sandbox.

  CefScopedSandboxInfo scoped_sandbox;
  sandbox_info = scoped_sandbox.sandbox_info();
  ...
  // Initialize the CEF browser process.
  if (!context->Initialize(main_args, settings, app, sandbox_info)) {
    return 1;
  }

The following call stack illustrates how CEFClient code calls into CEF, ultimately transitioning into the Chromium environment to configure the sandbox.

The CEF test application (CEFClient) successfully launches a sandboxed CEF, which is verified by loading the chrome://sandbox URL.


Why C# Applications Cannot Use Sandboxed CEF?

As we have seen above, it is possible for C++ applications to use sandboxing, but C# applications cannot use sandboxed CEF. But why?

To achieve effective sandboxing in your CEF application, ensure that you link the cef_sandbox.lib static library. This linking must be done specifically to the main process of the application. While C++ applications can directly link static libraries without issue, C# applications do not support static linking of libraries.

A static library is merged into the application at link time. This concept doesn't exist in .NET, as .NET supports only dynamic-link libraries.

If you attempt to create a DLL for the cef_sandbox code and link it dynamically to your application, the Chromium sandbox code will detect this and fail with the error SBOX_ERROR_INVALID_LINK_STATE. Therefore, static linking of the sandboxing code to the main application is mandatory for sandbox functionality.

  // Attempt to start a sandboxed process from sandbox code hosted not within
  // the main EXE. This is an unsupported operation by the sandbox.
  SBOX_ERROR_INVALID_LINK_STATE = 64,

Conclusion

While sandboxing is a powerful feature available for C++ applications using CEF, it is not feasible for C# applications due to the inherent differences in how these languages operate within the system. The managed environment of C# introduces limitations that make it impossible to integrate the same level of sandboxing as in C++. Consequently, C# applications cannot leverage sandboxed CEF using the existing setup.


Future study

To enable sandboxing in C# applications, we may need to modify the .NET runtime (which is written in C++) to support sandboxing. This would involve linking the cef_sandbox library directly to the runtime. The .NET Core runtime includes several executables/libraries that act as the main entry point for code execution, typically referred to as the “host”. These host executables can be customized as needed.

For example, when you run the apphost executable, it initializes the .NET Core runtime and starts your application. Creating a sandbox object in the apphost and passing it to the application might help address sandboxing for C# applications. However, this approach has not been tested, so its effectiveness cannot be confirmed without further exploration and modification of the C# language runtime.


August 10, 2024 06:30 PM

August 08, 2024

Gyuyoung Kim

Chrome iOS Browser on Blink

At the beginning of this year, Apple allowed third-party browser engines, such as Gecko and Blink, to be used on iOS and iPadOS. The Chromium community has been developing Chrome iOS based on Blink as an experimental project. This post provides an overview of the project and examines its progress during the first half of 2024.

Structure of Chrome iOS (w/blink)

First, let’s look at the overall structure of Chrome iOS. This simple diagram illustrates the structure of Chrome iOS.

The UI components are placed in the //ios/chrome directory, which functions similarly to the //chrome layer in other implementations. The //ios/web layer also provides APIs for its initialization, content navigation, presenting UI callbacks, saving or restoring navigation sessions and browser data, and more. The //ios/web/public defines the public APIs, while various other directories within //ios/web implement them. The implementation of these public APIs invokes WebKit APIs to provide the necessary functionalities.

For Chrome iOS based on Blink, the developers decided last year to reuse the existing iOS Chrome UI implementation. Thus, one of the main development efforts for Chrome iOS based on Blink happened in the //ios/web/content directory (the rectangle filled with green in the diagram), created to implement the public APIs using Blink. As you can see in the diagram below, they added the content directory to the //ios/web directory and implemented the public APIs using Blink’s content APIs. Notably, they’ve introduced ContentWebState as a prototype implementation of WebState to replace the class in //ios/web/content/web_state. However, as shown in the diagram, the directory currently includes only five components: web_state, navigation, UI, init, and js_messaging. Therefore, more components need to be implemented for Chrome iOS on Blink.

Igalia Contributions

Igalia has been contributing to this project since it was made public by the community. We’ve worked mainly on UI-related components, e.g. the file and color choosers, the context menu, the select list popup, and integration with the Chrome iOS UI. The pictures below are screen captures of the chooser implementations that we contributed.



We also helped with multimedia features such as video screen capture, hardware encoding/decoding, and audio recording.

We’ve worked on a few testing frameworks (e.g. unit tests, browser tests, and web tests). We’ve also filtered out unsupported and failing tests. Specifically, we’ve implemented the infrastructure to run the web tests on the simulator. For your information, the test coverage of the web tests on iOS Blink was about 72.2%, and the pass ratio among the working tests was 91.83% at the beginning of this year. Since then we’ve been maintaining the bot for Chrome iOS on Blink, because keeping the build working and the tests running is crucial for bringing Blink up on Chrome iOS.

Besides that, we added support for remote debugging with DevTools for Blink on iOS. Developers can now use DevTools remotely from a host machine (e.g. a Mac) to inspect Chrome or Content Shell during development.

Moreover, we’ve worked on graphics matters related to compositing and rendering. For instance, we worked on Metal support in ANGLE as well as fixing bugs in the graphics layers.

You can find the detailed contributions patch list here.

The major changes for H1 2024

Now, let’s review the major changes during the first half of this year. Firstly, the minimum iOS SDK version was bumped to 17.4, which supports the BrowserEngineKit library, and the browser and unit tests began to run on the ios-blink bot with the 17.4 SDK.

By reusing the existing iOS Chrome UI, ContentWebState was introduced as a prototype implementation of WebState. During H1 2024, new methods were added and previously empty methods were implemented. For example, the GetVirtualKeyboardHeight, OnVisibilityChanged, DidFinishLoad, and DidFailLoad methods were implemented.

BrowserEngineKit APIs that were announced by Apple to support third-party browser engines on iOS and iPadOS have been applied to JIT, GPU, network, and content process creation.

As mentioned above, Igalia implemented a color chooser, a file chooser, and a context menu during the period.

Lastly, the package size was reduced by removing duplicated resources.

Remaining Tasks

We’ve briefly looked at the current status of the project so far, but many functionalities still need to be supported. For example, regarding UI features, functionalities such as printing preview, download, text selection, request desktop site, zoom text, translate, find in page, and touch events are not yet implemented or are not functioning correctly.
Moreover, there are numerous failing or skipped tests in unit tests, browser tests, and web tests. Ensuring that these tests are enabled and passing should also be a key focus moving forward.

Conclusion

The Blink-based port of Chrome for iOS is a large and laborious project. Igalia has made significant contributions to it, and while it is still at an early stage, more features, infrastructure, and tools need to be ported to Blink. Anyhow, we believe that we are on the right track towards eventually replacing WebKit with Blink in Chromium-related products for iOS.

by gyuyoung at August 08, 2024 02:35 AM

August 02, 2024

Pawel Lampe

Nuts and bolts of Canvas2D - globalCompositeOperation and shadows.

In recent months I’ve been privileged to work on the transition from Cairo to Skia for 2D graphics rendering in WPE and GTK WebKit ports. Big reworks like this are a great opportunity to explore all kinds of graphics-related APIs. One of the broader APIs in this area is the CanvasRenderingContext2D API from HTML Canvas. It’s a fairly straightforward yet extensive API allowing one to perform all kinds of drawing operations on the canvas. The comprehensiveness, however, comes at the expense of some complex situations the web engine needs to handle under the hood. One such situation was the issue I was working on recently regarding broken test cases involving drawing shadows when using Skia in WebKit. What makes it complex is that some problems are still visible due to multiple web engine layers being involved, but despite that I was eventually able to address the broken test cases.

In the next few sections I’m going to introduce the parts of the API that are involved in the problems while in the sections closer to the end I will gradually showcase the problems and explore potential paths toward fixing the entire situation.

Drawing on Canvas2D with globalCompositeOperation #

The Canvas2D API offers multiple methods for drawing various primitives such as rectangles, arcs, text etc. On top of that, it allows one to control compositing and clipping using the globalCompositeOperation property. The idea is very simple - the user of an API can change the property using one of the predefined compositing operations and immediately after that, all new drawing operations will behave according to the rules the particular compositing operation specifies:

canvas2DContext.fillRect(...); // Draws rect on top of existing content (default).
canvas2DContext.globalCompositeOperation = 'destination-atop';
canvas2DContext.fillRect(...); // Draws rect according to 'destination-atop'.

There are many compositing operations, but I’ll be focusing mostly on the ones having source and destination in their names. The source and destination terms refer to the new content to be drawn and the existing (already-drawn) content respectively.

The images below present some examples of compositing operations in action:

Compositing operations in action.

Drawing on Canvas2D with shadows #

When drawing primitives using the Canvas2D API one can use shadow* properties to enable drawing of shadows along with any content that is being drawn. The usage is very simple - one has to alter at least one property such as e.g. shadowOffsetX to make the shadow visible:

canvas2DContext.shadowColor = "#0f0";
canvas2DContext.shadowOffsetX = 10;
// From now on, any draw call will have a green shadow attached.

The above, combined with simple code to draw a circle, produces the following effect:

Circle with shadow.
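For reference, the circle in the figure above can be produced with something like the following (coordinates, radius, and fill color are illustrative):

const canvas2DContext = document.querySelector('canvas').getContext('2d');

canvas2DContext.shadowColor = "#0f0";
canvas2DContext.shadowOffsetX = 10;

// Draw a filled circle; the green shadow is attached automatically.
canvas2DContext.fillStyle = "#00f";
canvas2DContext.beginPath();
canvas2DContext.arc(100, 100, 50, 0, 2 * Math.PI);
canvas2DContext.fill();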

Shadows meet globalCompositeOperation #

Things get interesting once one starts thinking about how globalCompositeOperation may affect the way shadows are drawn. When I thought about it for the first time, I imagined at least 3 possibilities:

  • Shadow and shadow origin are both treated as one entity (shadow always below the origin) and thus are drawn together.
  • Shadow and shadow origin are combined and then drawn as one entity.
  • Shadow and shadow origin are drawn separately - shadow first, then the content.

When I confronted the above with the drawing model and shadows specification, it turned out the last guess was the correct one. The specification basically says that the shadow should be computed first, then composited within the clipping region over the current canvas content, and finally, the shadow origin should be composited within the clipping region over the current canvas content (the original canvas content combined with shadow).

The above can be confirmed visually using few examples (generated using chromium browser v126.0.6478.126):

Shadows combined with compositing operation.

  • The source-over operation shows the drawing order - destination first, shadow second, and shadow origin third.
  • The destination-over operation shows the reversed drawing order - destination first, shadow second (below destination), and shadow origin third (below destination and shadow).
  • The source-atop operation is more tricky as it behaves like source-over but with clipping to the destination content - therefore, destination is drawn first, then clipping is set to destination, then the shadow is drawn, and finally the shadow origin is drawn.
  • The destination-atop operation is even more tricky as it behaves like destination-over yet with the clipping region always being different. That difference can be seen on the image below that presents intermediate states of canvas after each drawing step:
    Breakdown of destination-atop operation.
    • The initial state shows a canvas after drawing the destination on it.
    • The after drawing shadow state, shows a shadow drawn below the destination. In this case, the clipping is set to new content (shadow), and hence the part of destination that is not “atop” shadow is being clipped out.
    • The after drawing shadow origin state, shows the final state after drawing the shadow origin below the previous canvas content (new destination) that is at this point “a shadow combined with destination”. Similarly as in the previous step, the clipping is set to the new content (shadow origin), and hence any part of new destination that is not “atop” the shadow origin is being clipped out.
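To experiment with the destination-atop case above, a test page could use something along these lines (colors and geometry are illustrative, not the exact test case from the figures):

const ctx = document.querySelector('canvas').getContext('2d');

// Step 1: draw the destination (the existing canvas content).
ctx.fillStyle = "#00f";
ctx.fillRect(20, 20, 100, 100);

// Step 2: switch the compositing mode, enable a shadow, and draw the source.
// Per the spec, the shadow is composited first (clipping out the part of the
// destination that is not "atop" it), and only then is the shadow origin
// composited over that intermediate result.
ctx.globalCompositeOperation = 'destination-atop';
ctx.shadowColor = "#0f0";
ctx.shadowOffsetX = 25;
ctx.shadowOffsetY = 25;
ctx.fillStyle = "#f00";
ctx.fillRect(60, 60, 100, 100);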

Discrepancies between browser engines #

Once one realizes that drawing shadows with globalCompositeOperation may be tricky in general, one must also consider that when it comes to particular browser engines, things are even trickier, as virtually no graphics library provides an API that matches the Canvas2D API 1-to-1. This means that, depending on the graphics library used, the browser engine must implement more or fewer integration parts here and there. For example, one can imagine that some graphics library may not have native support for shadows - that would mean the browser engine has to prepare the shadow itself, e.g. by drawing the shadow origin (no matter how complex) on an extra surface, changing its color, blurring it, etc., so that it can be used as a whole once prepared.
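As a conceptual illustration (expressed here in Canvas2D terms rather than in any engine’s real internals; the helper name and parameters are made up), such shadow emulation boils down to something like this:

// Hypothetical helper: draw some content together with an emulated shadow.
function drawWithEmulatedShadow(ctx, drawOrigin, { color, offsetX, offsetY, blur }) {
  // 1. Render the shadow origin on an extra surface.
  const scratch = document.createElement('canvas');
  scratch.width = ctx.canvas.width;
  scratch.height = ctx.canvas.height;
  const sctx = scratch.getContext('2d');
  drawOrigin(sctx);

  // 2. Recolor it in place to the shadow color.
  sctx.globalCompositeOperation = 'source-in';
  sctx.fillStyle = color;
  sctx.fillRect(0, 0, scratch.width, scratch.height);

  // 3. Composite the blurred, offset shadow, then the shadow origin itself.
  ctx.save();
  ctx.filter = `blur(${blur}px)`;
  ctx.drawImage(scratch, offsetX, offsetY);
  ctx.restore();
  drawOrigin(ctx);
}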

Having said the above, one would expect that all the above aspects should be tested and implemented really well. After all, whenever the subject matter becomes complicated, extra care is required. It turns out, however, this is not necessarily the case when it comes to globalCompositeOperation and shadows. As for the testing part, there are very few tests (2d.shadow.composite*) in WPT (Web Platform Tests) covering the use cases described above. It’s also not much better for internal web engine test suites. As for implementations, there’s a substantial amount of discrepancy.

Simple examples #

To show exactly what the situation is, the examples from the section Shadows meet globalCompositeOperation can be used again, this time using browsers representing different web engines:

  • Chromium 126.0.6478.126 Shadows combined with compositing operation - Chromium.
  • Firefox 128.0 Shadows combined with compositing operation - Firefox.
  • Gnome Web (Epiphany) 45.0 (WebKit/Cairo) Shadows combined with compositing operation - Epiphany.
  • WPE MiniBrowser build from WebKit@098c58dd13bf40fc81971361162e21d05cb1f74a (WebKit/Skia) Shadows combined with compositing operation - WPE MiniBrowser.
  • Safari 17.1 (WebKit/Core Graphics) Shadows combined with compositing operation - Safari.
  • Servo release from 2024/07/04 Shadows combined with compositing operation - Servo.
  • Ladybird build from 2024/06/29 Shadows combined with compositing operation - Ladybird

First of all, it’s evident that experimental browsers such as Servo and Ladybird are falling behind the competition - Servo doesn’t seem to support shadows at all, while Ladybird doesn’t support anything other than drawing a rect filled with color.

Second, the non-experimental browsers are pretty stable in terms of covering most of the combinations presented above.

Finally, the trickiest combination above seems to be the one including destination-atop - in that case, almost every mainstream browser renders a different result:

  • Chromium is the only one rendering correctly.
  • Firefox and Epiphany are pretty close, but both suffer from a similar glitch where the red part is covered by the part of the destination that should already be clipped out.
  • WPE MiniBrowser and Safari are both rendering in correct order, but the clipping is wrong.

More sophisticated examples #

Until now, the discrepancies don’t seem to be very dramatic, and hence it’s time to present more sophisticated examples that are an extended version of the test case from the WebKit source tree:

  • Chromium 126.0.6478.126

Shadows combined with compositing operation - Chromium.

  • Firefox 128.0

Shadows combined with compositing operation - Firefox.

  • Gnome Web (Epiphany) 45.0 (WebKit/Cairo)

Shadows combined with compositing operation - Epiphany.

  • WPE MiniBrowser build from WebKit@098c58dd13bf40fc81971361162e21d05cb1f74a (WebKit/Skia)

Shadows combined with compositing operation - WPE MiniBrowser.

  • Safari 17.1 (WebKit/Core Graphics)

Shadows combined with compositing operation - Safari.

  • Servo release from 2024/07/04

Shadows combined with compositing operation - Servo.

  • Ladybird build from 2024/06/29

Shadows combined with compositing operation - Ladybird.

Other than destination-out, xor, and a few simple operations presented before, all the operations presented above pose serious problems to the majority of browsers. The only browser that is correct in all the cases (to the best of my understanding) is Chromium, which uses the Blink rendering engine, which in turn uses the Skia library. One may wonder if perhaps it’s Skia that’s responsible for Chromium’s success, but given the above results, where e.g. WPE MiniBrowser uses Skia as well, it’s evident that the problems lie above the particular graphics library.

Looking at the operations and browsers that render incorrectly, it’s clearly visible that even small problems - with either the ordering of draw calls or clipping - lead to spectacularly broken results. The pinnacle of misery is the source-out operation, which is the most variable one across browsers. One has to admit, however, that WPE MiniBrowser is slightly closer to being correct than the others.

Towards unification #

Fixing the above problems is a long journey. After all, every single web engine has to be fixed in its own specific way. If the specification were the problem, it would be the obvious place to start. However, as mentioned in the section Shadows meet globalCompositeOperation, the specification is pretty clear on how drawing, shadows, and globalCompositeOperation come together. In that case, the next obvious place to start improving things is the WPT test suite.

What makes WPT outstanding is that it is a de facto standard cross-browser test suite for testing the web platform stack. Thus the test suite is developed as an open collaboration effort by developers from around the globe and hence is very broad in terms of specification coverage. What’s also important, the test results are actively evaluated against the popular browser engines and published under wpt.fyi, therefore putting some pressure on web engine developers to fix the problems so that they keep up with competition.

Granted the above, extending the WPT test suite by adding test cases that cover globalCompositeOperation operations combined with shadows is the reasonable first step towards the unification of browser implementations. This can be done either by directly contributing tests to WPT, or by creating an issue. Personally, I’ve decided to file an issue first (WPT#46544) and to add tests once I have some time. I haven’t contributed to WPT yet, but I’m excited to work with it soon. Once I land my first pull request, I’ll start fixing WebKit, and I won’t hesitate to post some updates on this blog.

August 02, 2024 12:00 AM

July 26, 2024

Loïc Le Page

FFmpeg 101

A high-level architecture overview to start with FFmpeg.

FFmpeg package content #

FFmpeg is composed of a suite of tools and libraries.

FFmpeg tools #

The tools can be used to encode/decode/transcode a multitude of different audio and video formats, and to stream the encoded media over networks.

  • ffmpeg: a command line tool to convert multimedia files between formats
  • ffplay: a simple media player based on SDL and the FFmpeg libraries
  • ffprobe: a simple multimedia stream analyzer

FFmpeg libraries #

The libraries can be used to integrate those same features into your own product.

  • libavformat: I/O and muxing/demuxing
  • libavcodec: encoding/decoding
  • libavfilter: graph-based filters for raw media
  • libavdevice: input/output devices
  • libavutil: common multimedia utilities
  • libswresample: audio resampling, samples format conversion and audio mixing
  • libswscale: color conversion and image scaling
  • libpostproc: video post-processing (deblocking/noise filters)

FFmpeg simple player #

A basic usage of FFmpeg is to demux a multimedia stream (obtained from a file or from the network) into its audio and video streams and then to decode those streams into raw audio and raw video data.

To manage the media streams, FFmpeg uses the following structures:

  • AVFormatContext: a high level structure providing sync, metadata and muxing for the streams
  • AVStream: a continuous stream (audio or video)
  • AVCodec: defines how data are encoded and decoded
  • AVPacket: encoded data in the stream
  • AVFrame: decoded data (raw video frame or raw audio samples)

The process used to demux and decode follows this logic:

basic processing

Here is the basic code needed to read an encoded multimedia stream from a file, analyze its content and demux the audio and video streams. Those features are provided by the libavformat library and it uses the AVFormatContext and AVStream structures to store the information.

// Allocate memory for the context structure
AVFormatContext* format_context = avformat_alloc_context();

// Open a multimedia file (like an mp4 file or any format recognized by FFmpeg)
avformat_open_input(&format_context, filename, NULL, NULL);
printf("File: %s, format: %s\n", filename, format_context->iformat->name);

// Analyze the file content and identify the streams within
avformat_find_stream_info(format_context, NULL);

// List the streams
for (unsigned int i = 0; i < format_context->nb_streams; ++i)
{
    AVStream* stream = format_context->streams[i];

    printf("---- Stream %02d\n", i);
    printf("  Time base: %d/%d\n", stream->time_base.num, stream->time_base.den);
    printf("  Framerate: %d/%d\n", stream->r_frame_rate.num, stream->r_frame_rate.den);
    printf("  Start time: %" PRId64 "\n", stream->start_time);
    printf("  Duration: %" PRId64 "\n", stream->duration);
    printf("  Type: %s\n", av_get_media_type_string(stream->codecpar->codec_type));

    uint32_t fourcc = stream->codecpar->codec_tag;
    printf("  FourCC: %c%c%c%c\n", fourcc & 0xff, (fourcc >> 8) & 0xff, (fourcc >> 16) & 0xff, (fourcc >> 24) & 0xff);
}

// Close the multimedia file and free the context structure
avformat_close_input(&format_context);

Once we’ve got the different streams from inside the multimedia file, we need to find specific codecs to decode the streams to raw audio and raw video data. All codecs are statically included in libavcodec. You can easily create your own codec by just creating an instance of the FFCodec structure and registering it as an extern const FFCodec in libavcodec/allcodecs.c, but this would be a different topic for another post.

To find the codec corresponding to the content of an AVStream, we can use the following code:

// Stream obtained from the AVFormatContext structure in the former streams listing loop
AVStream* stream = format_context->streams[i];

// Search for a compatible codec
const AVCodec* codec = avcodec_find_decoder(stream->codecpar->codec_id);
if (!codec)
{
    fprintf(stderr, "Unsupported codec\n");
    continue;
}
printf("  Codec: %s, bitrate: %" PRId64 "\n", codec->name, stream->codecpar->bit_rate);

if (codec->type == AVMEDIA_TYPE_VIDEO)
{
    printf("  Video resolution: %dx%d\n", stream->codecpar->width, stream->codecpar->height);
}
else if (codec->type == AVMEDIA_TYPE_AUDIO)
{
    printf("  Audio: %d channels, sample rate: %d Hz\n",
        stream->codecpar->ch_layout.nb_channels,
        stream->codecpar->sample_rate);
}

With the right codec and codec parameters extracted from the AVStream information, we can now allocate the AVCodecContext structure that will be used to decode the corresponding stream. It is important to remember the index of the stream we want to decode from the former streams list (format_context->streams) because this index will be used later to identify the demuxed packets extracted by the AVFormatContext.

In the following code we’re going to select the first video stream contained in the multimedia file.

// first_video_stream_index is determined during the streams listing in the former loop
int first_video_stream_index = ...;

AVStream* first_video_stream = format_context->streams[first_video_stream_index];
AVCodecParameters* first_video_stream_codec_params = first_video_stream->codecpar;
const AVCodec* first_video_stream_codec = avcodec_find_decoder(first_video_stream_codec_params->codec_id);

// Allocate memory for the decoding context structure
AVCodecContext* codec_context = avcodec_alloc_context3(first_video_stream_codec);

// Configure the decoder with the codec parameters
avcodec_parameters_to_context(codec_context, first_video_stream_codec_params);

// Open the decoder
avcodec_open2(codec_context, first_video_stream_codec, NULL);

Now that we have a running decoder, we can extract the demuxed packets using the AVFormatContext structure and decode them to raw video frames. For that we need 2 different structures:

  • AVPacket which contains the encoded packets extracted from the input multimedia file,
  • AVFrame which will contain the raw video frame after the AVCodecContext has decoded the former packets.
// Allocate memory for the encoded packet structure
AVPacket* packet = av_packet_alloc();

// Allocate memory for the decoded frame structure
AVFrame* frame = av_frame_alloc();

// Demux the next packet from the input multimedia file
while (av_read_frame(format_context, packet) >= 0)
{
    // The demuxed packet uses the stream index to identify the AVStream it is coming from
    printf("Packet received for stream %02d, pts: %" PRId64 "\n", packet->stream_index, packet->pts);

    // In our example we are only decoding the first video stream identified formerly by first_video_stream_index
    if (packet->stream_index == first_video_stream_index)
    {
        // Send the packet to the previously initialized decoder
        int res = avcodec_send_packet(codec_context, packet);
        if (res < 0)
        {
            fprintf(stderr, "Cannot send packet to the decoder: %s\n", av_err2str(res));
            break;
        }

        // The decoder (AVCodecContext) acts like a FIFO queue, we push the encoded packets on one end and we need to
        // poll the other end to fetch the decoded frames. The codec implementation may (or may not) use different
        // threads to perform the actual decoding.

        // Poll the running decoder to fetch all available decoded frames until now
        while (res >= 0)
        {
            // Fetch the next available decoded frame
            res = avcodec_receive_frame(codec_context, frame);
            if (res == AVERROR(EAGAIN) || res == AVERROR_EOF)
            {
                // No more decoded frame is available in the decoder output queue, go to next encoded packet
                break;
            }
            else if (res < 0)
            {
                fprintf(stderr, "Error while receiving a frame from the decoder: %s\n", av_err2str(res));
                goto end;
            }

            // Now the AVFrame structure contains a decoded raw video frame, we can process it further...
            printf("Frame %02" PRId64 ", type: %c, format: %d, pts: %03" PRId64 ", keyframe: %s\n",
                codec_context->frame_num, av_get_picture_type_char(frame->pict_type), frame->format, frame->pts,
                (frame->flags & AV_FRAME_FLAG_KEY) ? "true" : "false");

            // The AVFrame internal content is automatically unreffed and recycled during the next call to
            // avcodec_receive_frame(codec_context, frame)
        }
    }

    // Unref the packet internal content to recycle it for the next demuxed packet
    av_packet_unref(packet);
}

// Free the previously allocated memory for the different FFmpeg structures
end:
    av_packet_free(&packet);
    av_frame_free(&frame);
    avcodec_free_context(&codec_context);
    avformat_close_input(&format_context);

The behavior of the former code is summarized in the next diagram:

processing diagram

You can find the full code here.

To build the example you will need meson and ninja. If you have python and pip installed, you can install them very easily by calling pip3 install meson ninja. Then, once the example archive is extracted to a ffmpeg-101 folder, go to this folder and call: meson setup build. It will automatically download the right version of FFmpeg if you don’t have it already installed on your system. Then call: ninja -C build to build the code and ./build/ffmpeg-101 sample.mp4 to run it.

You should obtain the following result:

File: sample.mp4, format: mov,mp4,m4a,3gp,3g2,mj2
---- Stream 00
  Time base: 1/3000
  Framerate: 30/1
  Start time: 0
  Duration: 30000
  Type: video
  FourCC: avc1
  Codec: h264, bitrate: 47094
  Video resolution: 206x80
---- Stream 01
  Time base: 1/44100
  Framerate: 0/0
  Start time: 0
  Duration: 440320
  Type: audio
  FourCC: mp4a
  Codec: aac, bitrate: 112000
  Audio: 2 channels, sample rate: 44100 Hz
Packet received for stream 00, pts: 0
Send video packet to decoder...
Frame 01, type: I, format: 0, pts: 000, keyframe: true
Packet received for stream 00, pts: 100
Send video packet to decoder...
Frame 02, type: P, format: 0, pts: 100, keyframe: false
Packet received for stream 00, pts: 200
Send video packet to decoder...
Frame 03, type: P, format: 0, pts: 200, keyframe: false
Packet received for stream 00, pts: 300
Send video packet to decoder...
Frame 04, type: P, format: 0, pts: 300, keyframe: false
Packet received for stream 00, pts: 400
Send video packet to decoder...
Frame 05, type: P, format: 0, pts: 400, keyframe: false
Packet received for stream 00, pts: 500
Send video packet to decoder...
Frame 06, type: P, format: 0, pts: 500, keyframe: false
Packet received for stream 00, pts: 600
Send video packet to decoder...
Frame 07, type: P, format: 0, pts: 600, keyframe: false
Packet received for stream 00, pts: 700
Send video packet to decoder...
Frame 08, type: P, format: 0, pts: 700, keyframe: false
Packet received for stream 01, pts: 0
Packet received for stream 01, pts: 1024
Packet received for stream 01, pts: 2048
Packet received for stream 01, pts: 3072
Packet received for stream 01, pts: 4096
Packet received for stream 01, pts: 5120
Packet received for stream 01, pts: 6144
Packet received for stream 01, pts: 7168
Packet received for stream 01, pts: 8192
Packet received for stream 01, pts: 9216
Packet received for stream 01, pts: 10240
Packet received for stream 01, pts: 11264
Packet received for stream 01, pts: 12288
Packet received for stream 01, pts: 13312
Packet received for stream 01, pts: 14336
Packet received for stream 01, pts: 15360
Packet received for stream 01, pts: 16384
Packet received for stream 01, pts: 17408
Packet received for stream 01, pts: 18432
Packet received for stream 01, pts: 19456
Packet received for stream 01, pts: 20480
Packet received for stream 01, pts: 21504
Packet received for stream 00, pts: 800
Send video packet to decoder...
Frame 09, type: P, format: 0, pts: 800, keyframe: false
Packet received for stream 00, pts: 900
Send video packet to decoder...
Frame 10, type: P, format: 0, pts: 900, keyframe: false

July 26, 2024 12:00 AM

July 24, 2024

Andy Wingo

whippet progress update: funding, features, future

Greets greets! Today, an update on recent progress in Whippet, including sponsorship, a new collector, and a new feature.

the lob, the pitch

But first, a reminder of what the haps: Whippet is a garbage collector library. The target audience is language run-time authors, particularly “small” run-times: wasm2c, Guile, OCaml, and so on; to a first approximation, the kinds of projects that currently use the Boehm-Demers-Weiser collector.

The pitch is that if you use Whippet, you get a low-fuss small dependency to vendor into your source tree that offers you access to a choice of advanced garbage collectors: not just the conservative mark-sweep collector from BDW-GC, but also copying collectors, an Immix-derived collector, generational collectors, and so on. You can choose the GC that fits your problem domain, like Java people have done for many years. The Whippet API is designed to be a no-overhead abstraction that decouples your language run-time from the specific choice of GC.

I co-maintain Guile and will be working on integrating Whippet in the next months, and have high hopes for success.

bridgeroos!

I’m delighted to share that Whippet was granted financial support from the European Union via the NGI zero core fund, administered by the Dutch non-profit, NLnet foundation. See the NLnet project page for the overview.

This funding allows me to devote time to Whippet to bring it from proof-of-concept to production. I’ll finish the missing features, spend some time adding tracing support, measuring performance, and sanding off any rough edges, then work on integrating Whippet into Guile.

This bloggery is a first update of the progress of the funded NLnet project.

a new collector!

I landed a new collector a couple weeks ago, a parallel copying collector (PCC). It’s like a semi-space collector, in that it always evacuates objects (except large objects, which are managed in their own space). However instead of having a single global bump-pointer allocation region, it breaks the heap into 64-kB blocks. In this way it supports multiple mutator threads: mutators do local bump-pointer allocation into their own block, and when their block is full, they fetch another from the global store.

The block size is 64 kB, but really it’s 128 kB, because each block has two halves: the active region and the copy reserve. It’s a copying collector, after all. Dividing each block in two allows me to easily grow and shrink the heap while ensuring there is always enough reserve space.

Blocks are allocated in 64-MB aligned slabs, so you get 512 blocks in a slab. The first block in a slab is used by the collector itself, to keep metadata for the rest of the blocks, for example a chain pointer allowing blocks to be collected in lists, a saved allocation pointer for partially-filled blocks, whether the block is paged in or out, and so on.

The PCC not only supports parallel mutators, it can also trace in parallel. This mechanism works somewhat like allocation, in which multiple trace workers compete to evacuate objects into their local allocation buffers; when an allocation buffer is full, the trace worker grabs another, just like mutators do.

However, unlike the simple semi-space collector which uses a Cheney grey worklist, the PCC uses the fine-grained work-stealing parallel tracer originally developed for Whippet’s Immix-like collector. Each trace worker maintains a local queue of objects that need tracing, which currently has 1024 entries. If the local queue becomes full, the worker will publish 3/4 of those entries to the worker’s shared worklist. When a worker runs out of local work, it will first try to remove work from its own shared worklist, then will try to steal from other workers.

Of course, because threads compete to evacuate objects, we have to use atomic compare-and-swap instead of simple forwarding pointer updates; if you only have one mutator thread and are OK with just one tracing thread, you can avoid the ~30% performance penalty that atomic operations impose. The PCC generally starts to win over a semi-space collector when it can trace with 2 threads, and gets better with each thread you add.

I sometimes wonder whether the worklist should contain grey edges or grey objects. MMTk seems to do the former, and bundles edges into work packets, which are the unit of work sharing. I don’t know yet what is best and look forward to experimenting once I have better benchmarks.

Anyway, maintaining an external worklist is cheating in a way: unlike the Cheney worklist, this memory is not accounted for as part of the heap size. If you are targeting a microcontroller or something, probably you need to choose a different kind of collector. Fortunately, Whippet enables this kind of choice, as it contains a number of collector implementations.

What about benchmarks? Well, I’ll be doing those soon in a more rigorous way. For now I will just say that it seems to behave as expected and I am satisfied; it is useful to have a performance oracle against which to compare other collectors.

finalizers!

This week I landed support for finalizers!

Finalizers work in all the collectors: semi, pcc, whippet, and the BDW collector that is a shim to use BDW-GC behind the Whippet API. They have a well-defined relationship with ephemerons and are multi-priority, allowing embedders to build guardians or phantom references on top.

In the future I should do some more work to make finalizers support generations, if the collector is generational, allowing a minor collection to avoid visiting finalizers for old objects. But this is a straightforward extension that will happen at some point.

future!

And that’s the end of this update. Next up, I am finally going to tackle heap resizing, using the membalancer approach. Then basic Windows and Mac support, then I move on to the tracing and performance measurement phase. Onwards and upwards!

by Andy Wingo at July 24, 2024 09:19 AM

July 22, 2024

Eric Meyer

Design for Real Life News!

If you’re reading this, odds are you’ve at least heard of A Book Apart (ABA), who published Design for Real Life, which I co-wrote with Sara Wachter-Boettcher back in 2016.  What you may not have heard is that ABA has closed up shop.  There won’t be any more new ABA titles, nor will ABA continue to sell the books in their catalog.

That’s the bad news.  The great news is that ABA has transferred the rights for all of its books to their respective authors! (Not every ex-publisher does this, and not every book contract demands it, so thanks to ABA.) We’re all figuring out what to do with our books, and everyone will make their own choices.  One of the things Sara and I have decided to do is to eventually put the entire text online for free, as a booksite.  That isn’t ready yet, but it should be coming somewhere down the road.

In the meantime, we’ve decided to cut the price of print and e-book copies available through Ingram.  DfRL was the eighteenth book ABA put out, so we’ve decided to make the price of both the print and e-book $18, regardless of whether those dollars are American, Canadian, or Australian.  Also €18 and £18.  Basically, in all five currencies we can define, the price is 18 of those.

…unless you buy it through Apple Books; then it’s 17.99 of every currency, because the system forces us to make it cheaper than the list price and also have the amount end in .99.  Obversely, if you’re buying a copy (or copies) for a library, the price has to be more than the list price and also end in .99, so the library cost is 18.99 currency units.  Yeah, I dunno either.

At any rate, compared to its old price, this is a significant price cut, and in some cases (yes, Australia, we’re looking at you) it’s a huge discount.  Or, at least, it will be at a discount once online shops catch up.  The US-based shops seem to be up to date, and Apple Books as well, but some of the “foreign” (non-U.S.) sources are still at their old prices.  In those cases, maybe wishlist or bookmark or something and keep an eye out for the drop.  We hope it will become global by the end of the week.  And hey, as I write this, a couple of places have the ebook version for like 22% less than our listed price.

So!  If you’ve always thought about buying a copy but never got around to it, now’s a good time to get a great deal.  Ditto if you’ve never heard of the book but it sounds interesting, or you want it in ABA branding, or really for any other reason you have to buy a copy now.

I suppose the real question is, should you buy a copy?  We’ll grant that some parts of it are a little dated, for sure.  But the concepts and approaches we introduced can be seen in a lot of work done even today.  It made significant inroads into government design practices in the UK and elsewhere, for example, and we still hear from people who say it really changed how they think about design and UX.  We’re still very proud of it, and we think anyone who takes the job of serving their users seriously should give it a read.  But then, I guess we would, or else we’d never have written it in the first place.

And that’s the story so far.  I’ll blog again when the freebook is online, and if anything else changes as we go through the process.  Got questions?  Leave a comment or drop me a line.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at July 22, 2024 03:22 PM

Andy Wingo

finalizers, guardians, phantom references, et cetera

Consider guardians. Compared to finalizers, in which the cleanup procedures are run concurrently with the mutator, by the garbage collector, guardians allow the mutator to control concurrency. See Guile’s documentation for more notes. Java’s PhantomReference / ReferenceQueue seems to be similar in spirit, though the details differ.

questions

If we want guardians, how should we implement them in Whippet? How do they relate to ephemerons and finalizers?

It would be a shame if guardians were primitive, as they are a relatively niche language feature. Guile has them, yes, but really what Guile has is bugs: because Guile implements guardians on top of BDW-GC’s finalizers (without topological ordering), all the other things that finalizers might do in Guile (e.g. closing file ports) might run at the same time as the objects protected by guardians. For the specific object being guarded, this isn’t so much of a problem, because when you put an object in the guardian, it arranges to prepend the guardian finalizer before any existing finalizer. But when a whole clique of objects becomes unreachable, objects referred to by the guarded object may be finalized. So the object you get back from the guardian might refer to, say, already-closed file ports.

The guardians-are-primitive solution is to add a special guardian pass to the collector that will identify unreachable guarded objects. In this way, the transitive closure of all guarded objects will be already visited by the time finalizables are computed, protecting them from finalization. This would be sufficient, but is it necessary?

answers?

Thinking more abstractly, perhaps we can solve this issue and others with a set of finalizer priorities: a collector could have, say, 10 finalizer priorities, and run the finalizer fixpoint once per priority, in order. If no finalizer is registered at a given priority, there is no overhead. A given embedder (Guile, say) could use priority 0 for guardians, priority 1 for “normal” finalizers, and ignore the other priorities. I think such a facility could also support other constructs, including Java’s phantom references, weak references, and reference queues, especially when combined with ephemerons.

Anyway, all this is a note for posterity. Are there other interesting mutator-exposed GC constructs that can’t be implemented with a combination of ephemerons and multi-priority finalizers? Do let me know!

by Andy Wingo at July 22, 2024 09:27 AM

July 18, 2024

Brian Kardell

927: Thoughts on a Global Design System

My thoughts on "A Global Design System" as is being discussed in OpenUI.

As you may or may not be aware, there's been recent discussion in OpenUI, brought forward by an effort by my fellow Pittsburgher Brad Frost, about the group taking on the effort of creating a global design system.

First, let me say that the problem that Brad describes is real, and also not new. He and I have discussed this in the past as well. I've spent a lot (the majority maybe) of my career (which began in the 90s) working on projects that were either using, evaluating or making their own common controls.

So much wasted energy

While explaining this, Brad frequently notes that inventing and reinventing the same things over and over wastes an enormous amount of human potential. We could be spending that time better.

I mean... Yes. I agree.

But, even more than that, the time spent re-inventing is only part of the story. The status quo is good for approximately no one. It also has multiplicative effects far beyond just the actual reinvention.

There might be 100 toolkits/component libraries which combined represent 100k invested hours, and yeah, that's a huge amount of time... Those hours are also wildly skewed: one might have 10x or even 100x the thought, care, review and testing of another.

But while there might be thousands of people spending time re-inventing, there are millions of authors who need components - and so many are spending at least a few hours, or maybe in some cases days searching for suitable components. I've been involved in corporate evaluations that were weeks of time. And it's hard to evaluate them and make good choices that consider accessibility, responsiveness, design, and internationalization. It is not only time-consuming, we often don't have the skills to do it. That is, after all, one of the reasons we want them: So that we don't each have to know all that stuff.

But then, how do we expect authors to make a good choice?

Sometimes the ones with the least effort put into them can have a great looking web site, nice documentation, charismatic promotion, or be somehow associated with a big tech company. Too often we wind up choosing components by proxy and just assuming that something else must mean it's good, and will last a long time. However, history has not borne that out — see the various component toolkits and design systems from even big orgs like Microsoft and Google, for example, that fell by the wayside.

But yeah - multiply that time out... What all of this currently creates is bad all around. All of the millions of developers looking and ultimately unable to make well-informed choices probably adds up to tens of millions of hours, by comparison.

In the end, many give up and re-implement again, making the problem even worse.

Each one might introduce tiny variations and accidentally invent something subtly new and create new challenges for users that we'll spend years sorting out too.

Ugh. It's bad. We should want a better future, and we should act on that.

Imagining a Better Future

Here's where I believe we get into trouble though: We have to be clear on what we are imagining, and whether it is practical/pragmatic to deliver adequate value in a reasonable timeframe.

Native HTML?

We could, for example, choose to imagine that HTML can be given a great and complete set of elements representing a complete UI toolkit. In addition to correcting all of the issues with the elements we've added so far, this means adding powerful grids connected to data, tabsets, notifications, carousels, charts, and so on.

Can it? Eventually, maybe, but I hope it is not controversial to suggest that it is extremely unlikely that we could accomplish this with the necessary qualities and in a reasonable timeframe. There's just no information or insight I have that gives me hope that focusing only on that scenario is a good idea.

This is a good end-goal for many components, but it's not where to start. It's hard and time-consuming, and gated on the very specific and limited participation of a small number of people. HTML itself moves slowly, on purpose.

I think HTML is at the end of 99 other steps...

The real question, I believe, is about improving how we get there, and deliver iterative partial value along the way.

New Web Components Reference Implementations?

It's been suggested that we could work on a single standard with a reference implementation for each component.

I do believe that ultimately this is a good goal, but I'd like to suggest that it's not where to start either.

In some ways the challenges here are smaller than trying to add it to HTML: it doesn't require browser vendors to act in concert, and we can iterate on it. But the challenges are still huge, and we'd be trading knowns for unknowns.

Instead of needing to convince 3 browser vendors to act in concert, we have to convince several UI kit vendors and developers to participate. We also have to convince everyone to use it and try to avoid XKCD 927 territory...

XKCD 927
Situation: There are 14 Competing Standards
Person 1: 14? Ridiculous! We need to develop one universal standard that covers everyone's use cases!
Person 2: Yeah!
Situation: There are 15 Competing Standards

This is exacerbated by the fact that it won't come all at once. It'll still be a non-trivial amount of time before we have a whole library of components which could reasonably be promoted for use. It still requires people with expertise (probably many of the same people as if it were native) to participate for reviewing accessibility, usability, internationalization, etc. In practice, there are just very finite resources available to put toward large scale, long term cooperation. Practically speaking, it seems likely we could only focus on a couple of components at a time.

Let's say we finish and ship the first component: Tabs. Can we really call it a global design system if it has just one component? Won't that really limit who can/will adopt it?

Adopt, modify and bless a library

It's been suggested that we could take up a library as a kind of a 'donation' project to provide a starting point. Specifically, maintainers from Shoelace/Web Awesome (also formerly MS components) have volunteered components for this purpose. Not as a "this is the thing" but a "this is a start". That would give us a nice leap forward.

Yeah, it would.

Except... Doesn't it raise a lot of questions we have to answer anyways?

First, but maybe not as importantly: Why that one? That goes to legitimacy. We should be able to explain why this is not just the first attractive looking opportunity that presented itself.

More importantly, it seems to me that the rest of the situation described above remains largely unchanged. We can't seriously promote that until it is deemed "good", and practically speaking it seems that we will approve components individually, not as a library. So, can't we define how we think it should work before we worry about picking a library?

The most obvious thing we could have ever done that with was jQuery, and we didn't.

I think that a library of reference implementations that we can agree to and recommend is still very far along the timeline...

The real question, I believe, is about improving how we get there, and deliver iterative partial value along the way.

We still don't have a great way to evolve the web - but I keep saying that I think we should.

How I think we could get there...

This is what I want more than anything: A plan to get there. Reasonable steps of value along the way, comparatively quickly.

It is effectively what I thought in 2013-2014 too. I suggested to the W3C Advisory Committee that we needed to rethink how we approach standards to include this sort of idea, which could work more like languages/dictionaries, and that the W3C should create such a listing and review process.

What follows is a vague outline of what I imagine:

I'd like to create a central place where we lay out some new rules and a process through which components, in a basic form that we agree to (is it a module, should it use shadow DOM or not, etc.), can be submitted.

What are the criteria? That's the first few steps...

We'd define some criteria gating submission, first with IP/license agreements we agree to, possibly some kind of bar for contributors or stars or something, but mainly: A commitment of participation in this new process, especially from experts. Honestly, participation is a bigger part of the limiting factor than anyone really imagines.

Once submitted it would undergo wide review and get some kind of 'verification stamps' for each kind (a11y, i18n, etc).

For this reason, I would really love to try to include the authors of government tools here. They are legally mandated and funded to solve the problem already and seem highly incentivized to participate. A collective of government efforts also lends immediate credibility and sense of independence to it.

To me, ideally, we would begin with a call for components/participation.

A call for participation/submissions...

You might have noticed...

You might have noticed that I didn't answer the question of "how do we pick one?" That's because I think that's like 99 steps down the road and will come naturally.

We can get a set of people who can contribute tabs, and a set of people who can review, and we can all discuss several of them at the same time. We can begin to lay out conformance criteria, and give each one little 'conformance stamps' along the way. Inevitably we can more easily get implementations to align and develop universal definitions and tests -- new stamps to be had.

Components get conformance stamps...

For authors, along the way, there's a nice central catalog somewhere, like webcomponents.org, but better. You'll know those have been submitted, and which ones have which conformance stamps. Maybe there isn't a 'the one', yet. But, it's ok? You have a smaller set, and the information you really need to choose one. Maybe all 3 of them are ... fine?

That's not the worst thing: we can sit back and evaluate for a while, all while already saving ourselves collectively millions of hours and our users a lot of pain.

In fact, collecting data and a little variation is good. Probably, they continue to align, or one begins to be the clearer winner.

We have very well defined, portable criteria for testing and more or less 1 definition...

And, that's the point: as we go we would move slowly, but without stopping major progress at any point. Even if nothing more happens, each of those steps has had real value. No one has just wasted time.

Then, maybe we can get somewhere where we have a single reference implementation of all of those things - or even a standard almost identical to them.

We have a true global reference implementation... Should we bake it into HTML?

In any case, that's how I would prefer to approach it. I wouldn't call it a "global design system" to start, because I wouldn't even start out assuming there would be only one of anything initially... But eventually.

July 18, 2024 04:00 AM

Igalia Compilers Team

Summary of the June 2024 TC39 plenary in Helsinki

In June, many colleagues from Igalia participated in a TC39 meeting organized in Helsinki by Aalto University and Mozilla to discuss proposed features for the JavaScript standard alongside delegates from various other organizations.

Let's delve together into some of the most exciting updates!

You can also read the full agenda and the meeting minutes on GitHub.

Day 1 #

import defer to Stage 2.7 #

The import defer proposal allows pre-loading modules while deferring their evaluation until a later time. This proposal aims at giving developers better tools to optimize the startup performance of their application.

As soon as some code needs the variables exported by a deferred module, it will be synchronously evaluated to immediately give access to its value:

// This does not cause evaluation of my-mod or its dependencies:
import defer * as myMod from "./my-mod.js";

$button.addEventListener("click", () => {
  // but this does!
  console.log("val:", myMod.val);
});

This is similar to, when using CommonJS, moving a require(...) call from the top-level of a module to inside a function. Adding a similar capability to ES modules is one further step towards helping people migrate from CommonJS to ESM.
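For comparison, this is roughly the CommonJS pattern being mirrored (the module path is just the one from the example above); unlike import defer, though, the module is not only evaluated but also loaded lazily here.

// CommonJS: moving require() out of the top level defers evaluation (and
// loading) of ./my-mod.js until the first click.
$button.addEventListener("click", () => {
  const myMod = require("./my-mod.js");
  console.log("val:", myMod.val);
});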

The proposal reached stage 2.7 after answering a few questions centered around the behavior of top-level await: modules with top-level await will not be deferred even if imported through import defer, because they cannot be synchronously evaluated. If your application can easily handle asynchronous deferred evaluation of modules, it can just as well use dynamic import().

The proposal now needs to have test262 tests written, to be able to go to Stage 3.

Promise.try to Stage 3 #

The new Promise.try helper allows calling a function that might or might not be asynchronous, unifying the error handling paths. This is useful for asynchronous APIs (which should always signal errors by returning a rejected promise) that interact with user-provided callbacks.

Consider this example:

type AsyncNegator = (val: number) => number | Promise<number>;
function subtract(a: number, b: number, negate: AsyncNegator): Promise<number> {
  return Promise.resolve(negate(b)).then(negB => a + negB);
}

While the developer here took care of wrapping negate's result in Promise.resolve, in case negate returns a number directly, what happens if negate throws an error? In that case, subtract will throw synchronously rather than returning a rejected promise!

With Promise.try, you can easily handle both the success and error paths correctly:

type AsyncNegator = (val: number) => number | Promise<number>;
function subtract(a: number, b: number, negate: AsyncNegator): Promise<number> {
  return Promise.try(() => negate(b)).then(negB => a + negB);
}

Day 2 #

Source maps update #

Source maps are an important tool in a developer's toolbox: they are what lets you debug transpiled/minified code in your editor or browser, while still stepping through your original hand-authored code.

While they are supported by most tools and browsers, there hasn't been a shared standard that defines how they should work. Tools and browsers all have to peek at what the others are doing to understand how to properly implement them, and this situation makes it very difficult to evolve the format to improve the debugging experience.

TC39 recently picked up the task of formalizing the standard, as well as adding new features such as the scopes proposal that would let devtools better understand renamed variables and inlined functions.

Iterator.zip to Stage 2.7 #

TC39 is working on many helpers for more easily working with iterators ("lazy lists", that only produce values as needed). While most of them are in the Iterator Helpers proposal, this one is advancing on its own.

Iterator.zip allows pairing values coming from multiple iterators:

function* getNums(start = 0, step = 1) {
  for (let val = start; ; val += step) yield val;
}

let naturals = getNums();
let evens = getNums(0, 2);
let negatives = getNums(-1, -1);

// an iterator of [a natural, an even, a negative]
let allTogether = Iterator.zip([naturals, evens, negatives]);

console.log(allTogether.next().value); // [0, 0, -1]
console.log(allTogether.next().value); // [1, 2, -2]
console.log(allTogether.next().value); // [2, 4, -3]

This proposal, like import defer, just reached the new Stage 2.7: it will now need test262 tests to be eligible for Stage 3.

Temporal reduction #

Temporal is one of the longest awaited features of JavaScript, advancing bit by bit on its path to stage 4 as obstacles are removed. For the last 6 months or so we have been working on removing one of the final obstacles: addressing feedback from JS engines on the size and complexity of the proposal, which culminated in this meeting.

As we get closer to having shipping implementations, it's become clear that the size of Temporal was an obstacle for platforms such as low-end Android devices: it added a large chunk to the size of the JS engine all at once. So, Philip Chimento and Justin Grant presented a slate of API removals to make the proposal smaller.

What was removed? Some methods previously existed for convenience, but were removed as somewhat redundant because there was a one-line way to accomplish the same thing. A more substantial removal was Temporal.Calendar and Temporal.TimeZone objects, along with the ability to extend them to implement custom calendars and custom time zones. We've received feedback that these have been the most complicated parts of the proposal for implementations, and they've also been where the most bugs have popped up. As well, due to the existence of formats like jsCalendar (RFC 8984), as well as learning more about the drawbacks of a callback-based design, we believe there are better designs possible for custom time zones and calendars than there were when the feature was designed.

Most of the slate of removals was adopted, and Temporal continues its journey to stage 4 smaller than it was before. You can follow the progress in this ticket on Temporal's issue tracker.

Day 3 #

Decimal update #

If you're tired of the fact that 0.1 + 0.2 is not 0.3 in JavaScript, then the decimal proposal is for you! This proposal, which is currently at stage 1, was presented by Jesse Alama. The goal was to present an update about some changes to the API, and go through the status of the proposal's spec text. Although most of the committee was generally supportive of allowing this proposal to go to stage 2, it remains at stage 1 due to some concerns about missing details in the spec text and the overall motivation of the proposal.
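For reference, this is the binary floating-point behavior that motivates the proposal; since the decimal API itself is still in flux at stage 1, no example of it is shown here.

console.log(0.1 + 0.2);         // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3); // false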

Discard bindings update #

The intention of discard bindings is to formalize a pattern commonly seen out there in the wild of JavaScript:

let [ _, rhs ] = s.split(".");

Notice the underscore character (_)? Although our intention is to signal to the readers of our code that we don't care what is on the left-hand side of the . in our string, _ is actually a valid identifier. It might even hold a big value, which takes up memory. This is just the tip of the iceberg; things get even more complex when one imagines binding -- but not using! -- more complex entities. We would like to use _ -- or perhaps something else, like void -- to signal to the JavaScript engine that it can throw away whatever value _ might have held. Ron Buckton presented an update about this proposal and successfully advanced it to Stage 2 in the TC39 process.
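For this particular destructuring case, note that today's JavaScript already lets you elide the binding entirely, as in the sketch below; the proposal, as I understand it, is about making such discards explicit and available in positions where elision does not apply.

// Array destructuring already supports elision: no binding is created for the
// skipped element, so nothing is accidentally kept alive.
let [, rhs] = "left.right".split(".");
console.log(rhs); // "right"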

Signals update #

Signals is an ambitious proposal that draws on the various approaches to reactivity found in JS frameworks such as Vue, React, and so on. The idea is to bring some form of reactivity into the JS core. Of course, different approaches to reactivity are possible and should remain valid; the idea of the Signals proposal is to provide some common abstraction that different approaches can build on. Dan Ehrenberg presented an update on this new proposal, which is currently at stage 1. A lot of work remains to be done; Dan explicitly said that a request to move Signals to stage 2 might take at least a year.
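For a flavor of the direction, here is a tiny sketch based on the API shape in the proposal's explainer at the time of writing (Signal.State and Signal.Computed); since the proposal is only at stage 1, every name here may still change.

// A State signal holds a value; a Computed signal tracks the signals it reads
// and only recomputes when one of them changes.
const counter = new Signal.State(0);
const isEven = new Signal.Computed(() => (counter.get() & 1) === 0);

console.log(isEven.get()); // true
counter.set(counter.get() + 1);
console.log(isEven.get()); // false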

July 18, 2024 12:00 AM

July 12, 2024

Georges Stavracas

Profiling a web engine

One topic that interests me endlessly is profiling. I’ve covered this topic many times in this blog, but not enough to risk sounding like a broken record yet. So here we are again!

Not everyone may know this but GNOME has its own browser, Web (a.k.a. Epiphany, or Ephy for the intimates). It’s a fairly old project, descendant of Galeon. It uses the GTK port of WebKit as its web engine.

The recent announcement that WebKit on Linux (both WebKitGTK and WPE WebKit) switched to Skia for rendering brought with it a renewed interest in measuring the performance of WebKit.

And that was only natural; prior to that, WebKit on Linux was using Cairo, which is entirely CPU-based, whereas Skia had both CPU and GPU-based rendering easily available. The CPU renderer mostly matches Cairo in terms of performance and resource usage. Thus one of the big promises of switching to Skia was better hardware utilization and better overall performance by switching to the GPU renderer.

A Note About Cairo

Even though nowadays we often talk about Cairo as a legacy piece of software, there's no denying that Cairo is really good at what it does. Cairo can be, and often is, extremely fast at 2D rendering on the CPU, especially for small images with simple rendering. Cairo has received optimizations and improvements for this specific use case for almost 20 years, and it is definitely not a low bar to beat.

I think it’s important to keep this in mind because, as tempting as it may sound, simply switching to use GPU rendering doesn’t necessarily imply better performance.

Guesswork is a No-No

Optimizations should always be a byproduct of excellent profiling. Categorically speaking, meaningful optimizations are a consequence of instrumenting the code so much that the bottlenecks become obvious.

I think the most important and practical lesson I’ve learned is: when I’m guessing what are the performance issues of my code, I will be wrong pretty much 100% of the time. The only reliable way to optimize anything is to have hard data about the behavior of the app.

I mean, so many people – myself included – were so convinced that GNOME Software was slow due to Flatpak that nobody thought about looking at app icon loading.

Enter the Profiler

Thanks to the fantastic work of Søren Sandmann, Christian Hergert, et al, we have a fantastic modern system profiler: Sysprof.

Sysprof offers a variety of instruments to profile the system. The most basic one uses perf to gather stack traces of the processes that are running. Sysprof also supports time marks, which allow plotting specific events and timings in a timeline. Sysprof also offers extra instrumentation for more specific metrics, such as network usage, graphics, storage, and more.

  • Screenshot of Sysprof's callgraph view
  • Screenshot of Sysprof's flamegraphs view
  • Screenshot of Sysprof's mark chart view
  • Screenshot of Sysprof's waterfall view

All these metrics are super valuable when profiling any app, but they’re particularly useful for profiling WebKit.

One challenging aspect of WebKit is that, well, it’s not exactly a small project. A WebKit build can easily take 30~50min. You need a fairly beefy machine to even be able to build a debug build of WebKit. The debug symbols can take hundreds of megabytes. This makes WebKit particularly challenging to profile.

Another problem is that Sysprof marks require integration code. Apps have to purposefully link against, and use, libsysprof-capture to send these marks to Sysprof.

Integrating with Sysprof

As a first step, Adrian brought the libsysprof-capture code into the WebKit tree. As libsysprof-capture is a static library with minimal dependencies, this was relatively easy. We’re probably going to eventually remove the in-tree copy and switch to host system libsysprof-capture, but having it in-tree was enough to kickstart the whole process.

Originally I started sprinkling Sysprof code all around the WebKit codebase, and to some degree, it worked. But eventually I learned that WebKit has its own macro-based tracing mechanism that is only ever implemented for Apple builds.

Looking at it, it didn’t seem impossible to implement these macros using Sysprof, and that’s what I’ve been doing for the past few weeks. The review was lengthy but behold, WebKit now reports Sysprof marks!

Screenshot of Sysprof with WebKit marks highlighted

Right now these marks cover a variety of JavaScript events, layout and rendering events, and web page resources. This all came for free from integrating with the preexisting tracing mechanism!

This gives us a decent understanding of how the Web process behaves. It’s not yet complete enough, but it’s a good start. I think the most interesting data to me is correlating frame timings across the whole stack, from the kernel driver to the compositor to GTK to WebKit’s UI process to WebKit’s Web process, and back:

Screenshot of Sysprof with lots of compositor and GTK and WebKit marks

But as interesting as it may be, oftentimes the fun part of profiling is being surprised by the data you collect.

For example, in WebKit, one specific, seemingly innocuous, completely bland method is in the top 3 of the callgraph chart:

Screenshot of Sysprof showing the callgraph view with an interesting result highlighted

Why is WebCore::FloatRect::contains so high in the profiling? That’s what I’m investigating right now. Who guessed this specific method would be there? Nobody, as far as I know.

Once this is out in a stable release, anyone will be able to grab a copy of GNOME Web, and run it with Sysprof, and help find out any performance issues that only reproduce in particular combinations of hardware.

Next Plans

To me this already is a game changer for WebKit, but of course we can do more. Besides the rectangular surprise, and one particular slowdown that comes from GTK loading Vulkan on startup, no other big obvious data point popped up. Especially in the marks, I think the coverage is still fairly small compared to what it could be.

We need more data.

Some ideas that are floating right now:

  • Track individual frames and correlate them with Sysprof marks
  • Measure top-to-bottom-to-top latency
  • Measure input delay
  • Integrate with multimedia frames

Perhaps this will allow us to make WebKit the prime web engine for Linux, with top-tier performance, excellent system integration, and more. Maybe we can even redesign the whole rendering architecture of WebKit on Linux to be more GPU friendly now. I can dream high, can’t I? 🙂

In any case, I think we have a promising and exciting time ahead for WebKit on Linux!

by Georges Stavracas at July 12, 2024 12:42 PM

Madeeha Javed

Igalia's Latest Contributions to Graphics

The Igalia Graphics team has been expanding and making significant contributions in the space of open source graphics. An earlier blog post by our team member Lucas provides an excellent insight into the team's evolution over the past years. The following series of posts will attempt to summarize the team's recent engagements:

  • This post covers our updates on GPU color management, Turnip, V3DV, DRM/KMS, Etnaviv and community events we have been participating in.
  • The next post will cover news from our CTS, Vulkan Video, Mesa CI and GPU reset work, and will talk about some new initiatives that we recently got involved in.

Before delving into details, it is worth mentioning the recent highlights: Igalia hosted the 2024 Linux Display Next Hackfest in May this year and the X.Org Developers Conference 2023 in October last year, both in the beautiful city of A Coruña. These events were a huge success in creating a hub for graphics experts to foster open innovation. Continue reading for more details on these events.

A Vibrant Linux #

Last year brought great news for AMD GPU color management: the AMD driver-specific color management properties reached the upstream linux-next! My Igalia colleague Melissa Wen has been spearheading this effort for some time now and has journalled every detail in a series of blog posts.

AMD has been improving its display color management pipeline with each new hardware generation. The new color capabilities, before and after plane composition, can be used by compositors and userspace applications to provide a vibrant experience to the end-user. Exposing AMD driver-specific color properties is a step towards advanced color management on Linux, allowing gamut mapping, HDR rendering, HDR on SDR, and SDR on HDR.

On a very high level, there are 2 parts of this support:

  • Upgrading the DRM/KMS Linux interface to expose the new features to the user-space. One major challenge was the limited DRM/KMS interface, which only exposed a small set of post-blending color properties. Latest AMD Display Core Next hardware has many more post-blending and pre-blending capabilities. Melissa’s work involved mapping these capabilities to the AMD driver’s display core interface and then to the DRM interface. Her blog post provides a brief overview of this extensive mapping effort.

  • Updating the AMD’s Linux display driver to expose the new hardware features. AMD DCN 3.0 comes with cutting edge color capabilities described by Melissa here and this blog post also talks about the AMD’s Linux display subsystem components and about the new properties.

Some of Melissa's write-ups helped me get an understanding of this vast subject.

Turnip Upgrades #

Turnip, the open-source Vulkan driver for Qualcomm Adreno GPUs, has been receiving major upgrades this year for Qualcomm’s Adreno 7XX GPUs.

Judging from my colleague Danylo Piliaiev's Turnip update at FOSDEM 2024, Turnip is in a great state: it has gained major Vulkan extensions and better debug support, AAA desktop games can now run via FEX + Turnip on Linux, and some in the Termux community are even running desktop games on Android with Box64/FEX + Turnip.

The highlight of Danylo's talk is the A7XX support. The team started the year with A7XX bring-up and is now ramping up on support for the new features introduced in A7XX:

  • Mark Collins, who also represents Igalia at the Khronos Vulkan WG, implemented GMEM rendering for A7XX, which can be considerably faster and more power efficient than sysmem rendering depending on what's being rendered. He followed that up with support for unidirectional LRZ, bringing A7XX to parity with A6XX's GMEM rendering feature set and further boosting performance, with more performance improvements for A7XX on the horizon.

  • Our colleague Amber Harmonia added support for 64-bit atomic operations on signed and unsigned integers in shaders, and for rasterizing wide lines, while Fixed Stride Draw Table support is work in progress.

In addition to new feature support, we are committed to providing a robust and performant driver.

Recently, Job Noorman joined our Turnip team to improve the IR3 compiler. He improved the handling of predicate registers and added support for predication. Adreno GPUs have special registers, called predicate registers, that store the result of a condition; using them can eliminate branches in the generated code, thereby improving performance. Similarly, a more than 10% code size reduction was observed in shader-db with his patch for using rptN instructions.

Turnip has come far and has recently been giving Adreno's proprietary driver real competition. Here is Assassin's Creed running on Adreno + Turnip. Check the FPS on that screen!

Turnip Development Resources #

Danylo usually talks about analyzing some of the major Turnip issues in his series of blog posts, "Turnips in the wild", with part 3 being the latest addition. This is exactly what you need to jump-start Turnip development.

As always, the team also discovered many new techniques for debugging GPU issues. GPU driver developers want to modify the GPU command stream at run time to see the outcome of editing it in different ways. Danylo implemented this highly sought-after feature as a tool for Adreno and describes how it can be used.

DRM/KMS Improvements #

The management of display, graphics and composition in Linux lies in the kernel's DRM/KMS framework. Igalian Maíra Canal gives a full account of our notable contributions authoring, reviewing and testing kernel DRM patches, while I provide a few highlights here:

  • My Igalia colleague André Almeida and Simon Ser have been working on Asynchronous Page Flips, an optimization that allows applications to flip a plane for immediate presentation. The support for this feature is now available in the atomic API. Plus, with André’s patch, it is enabled for all planes including the primary plane if the hardware supports it.

  • Maíra has been working on features crucial to graphics development on the RPi. She supplied per-client GPU usage statistics as well as global GPU utilization.

  • In order to ensure continuous job submission to the GPU, CPU jobs handled in userspace should be avoided. A series of patches from Maíra moved the CPU job mechanisms from the V3DV driver to the V3D kernel driver.

We want more Pi! #

After achieving Vulkan 1.2 conformance on V3DV, the Igalia team working on V3DV has been focusing on further enhancements of the driver. V3DV is the Vulkan driver for the Broadcom VideoCore GPU used on Raspberry Pi devices.

RPi 5 was launched in October last year with a new BCM GPU. Alejandro provided an overview of the team's journey through V3DV development since the RPi 4 and then talked about the challenges of RPi 5 support in V3DV.

More improvements and new Vulkan extensions were supported last year.

This year Iago landed support for the Vulkan dynamic rendering extension. VK_KHR_dynamic_rendering is a popular extension that adds flexibility to the Vulkan API by allowing users to skip render pass and framebuffer objects and start rendering immediately. And now it's available on the Pi.

As mentioned in the DRM/KMS improvements above, Maíra, together with José María Casanova (Chema) and Melissa, added support for GPU utilization stats and CPU job optimizations. Here is a snapshot of GPU stats collection on the Pi 5:

Screenshot of GPU usage statistics collected on a Raspberry Pi 5

The RPi 5 continues to use the OpenGL/Wayland-based Wayfire compositor. Christopher was therefore tasked with enabling Wayfire to run on the RPi 3 and 4 as well, which he achieved by implementing software rendering with a Pixman back-end. Check out the demo:

Iago also made some interesting observations while experimenting with SuperTuxKart on the Pi. You will be pleasantly surprised to see how Vulkan outperformed OpenGL.

The team has been working towards Vulkan 1.3 and we will hopefully be able to share more news on that front very soon.

Etnaviv #

Christian Gmeiner, one of the maintainers of Etnaviv (open-source graphics driver for Vivante GPUs), joined our team last year. We are very excited to have him on-board because it is a testament to Igalia’s dedication towards open source graphics software development.

Christian is also enjoying being at Igalia, as he discusses in his blog post, where he also reveals his plans for Etnaviv:

  • Improving Etnaviv’s Gallium driver.
  • Exposing GLES3.
  • Moving towards a new back-end compiler.

One of his latest updates is the user-space hardware database. He explains that a user-space driver HW database has been introduced to obtain GPU-specific information like features and limits, mirroring the introduction of an in-kernel hardware database. I am sure this will be super helpful for the reverse engineers out there!

News & Community Events #

Igalians are always eager to share their knowledge and expertise with the open source community by participating in key organizations and events.

Good bye ‘Xorg’ and Hello ‘Linux Foundation’ #


There is quite a trend of Igalians serving on the X.Org Foundation's Board of Directors. Samuel Iglesias took on this responsibility for a number of terms, but this year he is stepping down. He reminisced about his role in this blog post.

Ricardo, however, was elected to the board of directors in 2022 and stayed on until Q1 2024, leaving Christopher Michael as the only Igalian currently on the board. In his blog post, Ricardo introduces the X.Org Foundation and also tackles some questions about its future.


Samuel was invited to join the Linux Foundation (Europe) advisory board and he has accepted the invitation. This is a huge milestone for the whole graphics team. Congratulations Sam!

2024 Linux Display Hackfest #

This is a rather new event that has materialized in the Linux community to enhance the Linux display stack.

Melissa’s work on HDR and AMD color management, together with interesting discussions during the XDC 2023 color management workshop, paved the way for the event this year, and Igalia graciously offered to host it.

The event attracted key participants from the Linux community, AMD, NVIDIA, Google, Fedora, and GNOME, focusing on topics like HDR/color management, variable refresh rate, tearing, multiplane/hardware overlays for video and gaming, real-time scheduling, the async KMS API, power saving vs. color/latency, content-adaptive scaling and sharpening, and display control. The success of this event has highlighted the need for future editions.

Embedded Open Source Summit 2024 #

At EOSS this year, we presented several talks.

FOSDEM 2024 #

At FOSDEM this year, we presented several talks.

Vulkanised 2024 #

At Vulkanised this year, we presented the following talks:

  • Stéphane Cerveau & Hyunjun Ko, “Implementing a Vulkan Video Encoder From Mesa to GStreamer”
  • Iago Toral & Faith Ekstrand, “8 Years of Open Drivers, including the State of Vulkan in Mesa”

Igalians who attended the event found it quite informative on the subject.

XDC 2023 #

Igalia hosted XDC 2023 in the city of their headquarters, A Coruña. We also presented many talks and demos.

The lightning talks and demos also saw active participation from Igalia.

Workshops were organized for discussion of larger subjects like advanced color management (discussion summary) and continuous integration (discussion summary).

The Future #

The Igalia graphics team has profound expertise in Mesa, Vulkan, OpenGL and the Linux kernel. We have also embraced new and really interesting graphics technologies, which I talk about in my next post.

July 12, 2024 12:00 AM

July 10, 2024

Andy Wingo

copying collectors with block-structured heaps are unreliable

Good day, garbage pals! This morning, a quick note on “reliability” and garbage collectors, how a common GC construction is unreliable, and why we choose it anyway.

on reliability

For context, I’m easing back in to Whippet development. One of Whippet’s collectors is a semi-space collector. Semi-space collectors are useful as correctness oracles: they always move objects, so they require their embedder to be able to precisely enumerate all edges of the object graph, to update those edges to point to the relocated objects. A semi-space collector is so simple that if there is a bug, it is probably in the mutator rather than the collector. They also have well-understood performance, and as such are useful when comparing the performance of other collectors.

But one other virtue of the simple semi-space collector is that it is reliable, in the sense that given a particular set of live objects, allocated in any order, there is a single heap size at which the allocation (and collection) will succeed, and below which the program fails (not enough memory). This is because all allocations go in the same linear region, collection itself doesn’t allocate memory, the amount of free space after an object (the fragmentation) does not depend on where it is allocated, and those object extents just add up in a commutative way.

Reliability is a virtue. Sometimes it is a requirement: for example, the Toit language and run-time target embedded microcontrollers, and there you have finite resources and either the workload fits or it doesn’t. You can’t really tolerate a result of “it works sometimes”. So, Toit uses a generational semi-space + mark-sweep collector that never makes things worse.

on block-structured heaps

But, threads make reliability tricky. With Whippet I am targeting embedders with multiple mutator threads, and a classic semi-space collector doesn’t scale – you could access the allocation pointer atomically, but that would be a bottleneck, serializing mutators, not to mention the cache contention.

The usual solution for this problem is to arrange the heap in such a way that different threads can allocate in different areas, so they don’t need to share an allocation pointer and so they don’t write to the same cache lines. And, the most common way to do this is to use a block-structured heap; for example you might have a 256 MB heap, but divided into 4096 blocks, each of which is 64 kB. That’s enough granularity to dynamically partition out space between many threads: you keep a list of available blocks and allocator threads compete to grab fresh blocks as needed. There’s central contention on the block list, so you want blocks big enough that you aren’t fetching blocks too often.

To break a heap into blocks requires a large-object space, to allow for allocations that are larger than a block. And actually, as I mentioned in the article about block-structured heaps, usually you choose a threshold for large object allocations that is smaller than the block size, to limit the maximum and expected amount of fragmentation at the end of each block, when an allocation doesn’t fit.

on unreliability

Which brings me to my point: a copying collector with a block-structured heap is unreliable, in the sense that there is no single heap size below which the program fails and above which it succeeds.

Consider a mutator with a single active thread, allocating a range of object sizes, all smaller than the large object threshold. There is a global list of empty blocks available for allocation, and the thread grabs blocks as needed and bump-pointer allocates into that block. The last allocation in each block will fail: that’s what forces the thread to grab a new fresh block. The space left at the end of the block is fragmentation.
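As a rough illustration of that allocation pattern, here is a small sketch in JavaScript; a real collector does this on raw memory with machine words, and the block size and helper names here are made up.

const BLOCK_SIZE = 64 * 1024;

// grabFreshBlock() stands in for taking a block off the shared free list.
function makeThreadLocalAllocator(grabFreshBlock) {
  let block = null;
  let cursor = 0;
  let wasted = 0; // end-of-block fragmentation accumulated so far
  return {
    allocate(size) {
      if (block === null || cursor + size > BLOCK_SIZE) {
        // The allocation that doesn't fit is what forces a fresh block; the
        // space left at the end of the old block becomes fragmentation.
        if (block !== null) wasted += BLOCK_SIZE - cursor;
        block = grabFreshBlock();
        cursor = 0;
      }
      const offset = cursor;
      cursor += size; // bump-pointer allocation within the block
      return { block, offset };
    },
    wastedBytes() { return wasted; },
  };
}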

Assuming that the sequence of allocations performed by the mutator is deterministic, by the time the mutator has forced the first collection, the total amount of fragmentation will also be deterministic, as will the set of live roots at the time of collection. Assume also that there is a single collector thread which evacuates the live objects; this will also produce deterministic fragmentation.

However, there is no guarantee that the post-collection fragmentation is less than the pre-collection fragmentation. Unless objects are copied in such a way that preserves allocation order—generally not the case for a semi-space collector, and it would negate any advantage of a block-structured heap—then different object order could produce different end-of-block fragmentation.

causes of unreliability

The unreliability discussed above is due to non-commutative evacuation. If your collector marks objects in place, you are not affected. If you don’t commute live objects—if you preserve their allocation order, as Toit’s collector does—then you are not affected. If your evacuation commutes, as in the case of the simple semi-space collector, you are not affected. But if you have a block-structured heap and you evacuate, your collector is probably unreliable.

There are other sources of unreliability in a collector, but to my mind they are not as fundamental as this one.

  • Multiple mutator threads generally lead to a kind of unreliability, because the size of the live object graph is not deterministic at the time of collection: even if all threads have the same allocation trace, they don’t necessarily proceed in lock-step nor stop in the same place.

  • Adding collector threads to evacuate in parallel adds its own form of unreliability: if you have 8 evacuator threads, then there are 8 blocks local to the evacuator threads which also contribute to post-collection wasted space, possibly an entire block per thread.

  • Some collectors will allocate memory during collection, for example to represent a worklist of objects that need tracing. This allocation can fail. Also, annoyingly, collection-time allocation complicates comparison: you can no longer compare two collectors at the same heap size, because one of them cheats.

  • Virtual memory and paging can make you have a bad time. For example, you go to allocate a large object, so you remove some empty blocks from the main space and return them to the OS, providing you enough budget to allocate the new large object. Then the new large object is collected, so you reclaim the pages you returned to the OS, adding them to the available list. But probably you don’t page them in already, because who wants a syscall? They get paged in lazily when the mutator needs them, but that could fail because of other processes on the system.

embracing unreliability

I think it only makes sense to insist on a reliable collector if your mutator does not have threads; otherwise, the fragmentation-related unreliability pales in comparison.

What’s more, fragmentation-related unreliability can be entirely mitigated by giving the heap more memory: the maximum amount of fragmentation is an object just shy of the large object threshold, per block, so in our case 8 kB per 64 kB. So, just increase your heap by 12.5%. You will certainly not regret increasing your heap by 12.5%.

And happily, increasing heap size also works to mitigate unreliability related to multiple mutator threads. Consider 10 threads each of which has a local object graph that is usually 10 MB but briefly 100MB when calculating: usually when GC happens, total live object size is 10×10MB=100MB, but sometimes as much as 1 GB; there is a minimum heap size for which the program sometimes works, but also a minimum heap size at which it always works. The trouble is, of course, that you generally only know the minimum always-works size by experimentation, and you are unlikely to be able to actually measure the maximum heap size.

Which brings me to my final point, which is that virtual memory and growable heaps are good. Unless you have a dedicated devops team or you are evaluating a garbage collector, you should not be using a fixed heap size. The ability to just allocate some pages to keep the heap from being too tight buys us a form of soft reliability.

And with that, end of observations. Happy fragmenting, and until next time!

by Andy Wingo at July 10, 2024 08:48 AM

Stephen Chenney

Canvas Text Editing

Editing text in HTML canvas has never been easy. It requires identifying which character is under the hit point in order to place a caret, and it requires computing bounds for a range of text that is selected. The existing implementations of Canvas TextMetrics made these things possible, but not without a lot of Javascript making multiple expensive calls to compute metrics for substrings. Three new additions to the TextMetrics API are intended to support editing use cases in Canvas text. They are in the standards pipeline, and implemented in Chromium-based browsers behind the ExtendedTextMetrics flag:

  • caretPositionFromPoint gives the location in a string corresponding to a pixel length along the string. Use it to identify where the caret is in the string, and what the bounds of a selection range are.
  • getSelectionRects returns the rectangles that a browser would use to highlight a range of text. Use it to draw the selection highlight.
  • getActualBoundingBox returns the bounding box for a sub-range of text within a string. Use it if you need to know whether a point lies within a substring, rather than the entire string.

To enable the flag, use --enable-blink-features=ExtendedTextMetrics when launching Chrome from a script or command line, or enable “Experimental Web Platform features” via chrome://flags/#enable-experimental-web-platform-features.

I wrote a basic web app (opens in a new tab) in order to demonstrate the use of these features. It will function in Chrome versions beyond 128.0.6587.0 (Canary at the time of writing) with the above flags set.

The app allows the editing of a single line of text drawn in an HTML canvas. Here I’ll work through usage of the new features.

In the demo, the first instance of “new Canvas Text Metrics” is considered a link back to this blog page. Canvas Text has no notion of links, and thousands of people have looked at Stack Exchange for a way to insert hyperlinks in canvas text. Part of the problem, assuming you know where the link is in the text, is determining when the link was clicked on. The TextMetrics getActualBoundingBox(start, end) method is intended to simplify the problem by returning the bounding box of a substring of the text, in this case the link.

onStringChanged() {
  text_metrics = context.measureText(string);
  link_start_position = string.indexOf(link_text);
  if (link_start_position != -1) {
    link_end_position = link_start_position + link_text.length;
  }
}
...
linkHit(x, y) {
  let bound_rect = undefined;
  try {
    bound_rect = text_metrics.getActualBoundingBox(link_start_position, link_end_position);
  } catch (error) {
    return false;
  }
  let relative_x = x - string_x;
  let relative_y = y - string_y;
  return relative_x >= bound_rect.left && relative_y >= bound_rect.top
      && relative_x < bound_rect.right && relative_y < bound_rect.bottom;
}

The first function finds the link in the string and stores the start and end string offsets. When a click event happens, the second method is called to determine if the hit point was within the link area. The text metrics object is queried for the bounding box of the link’s substring. Note the call is contained within a try...catch block because an exception is thrown if the substring range is invalid. The event offset is mapped into the coordinate system of the text (in this case by subtracting the text location) and the resulting point is tested against the rectangle.

In more general situations you may need to use a regular expression to find links, and keep track of a more complex transformation chain to convert event locations into the text string’s coordinate system.

Mapping a Point to a String Index #

A primary concept of any editing application is the caret location because it indicates where typed text will appear, or what will be deleted by backspace, or where an insertion will happen. Mapping a hit point in the canvas into the caret position in the text string is a fundamental editing operation. It is possible to do this with existing methods but it is expensive (you can do a binary search using the width of substrings).

The TextMetrics caretPositionFromPoint(offset) method uses existing code in browsers to efficiently map a point to a string position. The underlying functionality is very similar to the document.caretPositionFromPoint(x,y) method, but modified for the canvas situation. The demo code uses it to position the caret and to identify the selection range.

text_offset = event.offsetX - string_x;
caret_position = text_metrics.caretPositionFromPoint(text_offset);

The caretPositionFromPoint function takes the horizontal offset, in pixels, measured from the origin of the text (based on the textAlign property of the canvas context). The function finds the character boundary closest to the given offset, then returns the character index to the right for left-to-right text, and to the left for right-to-left text. The offset can be negative to allow characters to the left of the origin to be mapped.

In the figure below, the top string has textDirection = "ltr" and textAlign = "center". The origin for measuring offsets is the center of the string. Green shows the offsets given, while blue shows the indexes returned. The bottom string demonstrates textDirection = "rtl" and textAlign = "start".

An offset past the beginning of the text always returns 0, and past the end returns the string length. Note that the offset is always measured left-to-right, even if the text direction is right-to-left. The “beginning” and “end” of the text string do respect the text direction, so for RTL text the beginning is on the right.

The caretPositionFromPoint function may produce very counter-intuitive results when the text string has mixed bidi content, such as a latin substring within an arabic string. As the offset moves along the string the positions will not steadily increase, or decrease, but may jump around at the boundaries of a directional run. Full handling of bidi content requires incorporating bidi level information, particularly for selecting text, and is beyond the scope of this article.

Selection Rectangles #

Selected text is normally indicated by drawing a highlight over the range, but to produce such an effect in canvas requires estimating the rectangle using existing text metrics, and again making multiple queries to text metrics to obtain the left and right extents. The new TextMetrics getSelectionRects(start, end) function returns a list of browser defined selection rectangles for the given subrange of the string. There may be multiple rectangles because the browser returns one for each bidi run; you would need to draw them all to highlight the complete range. The demo assumes a single rectangle because it assumes no mixed-direction strings.

selection_rect = text_metrics.getSelectionRects(selection_range[0], selection_range[1])[0];
...
context.fillStyle = 'yellow';
context.fillRect(selection_rect.x + string_x,
                 selection_rect.y + string_y,
                 selection_rect.width,
                 selection_rect.height);

Like all the new methods, the rectangle returned is in the coordinate system of the string, as defined by the transform, textAlign and textBaseline.
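If a string can contain multiple bidi runs, a small extension of the demo code (reusing the same variables) would draw every rectangle the browser returns rather than just the first:

const rects = text_metrics.getSelectionRects(selection_range[0], selection_range[1]);
context.fillStyle = 'yellow';
for (const rect of rects) {
  context.fillRect(rect.x + string_x, rect.y + string_y, rect.width, rect.height);
}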

Conclusion #

The new Canvas Text Metrics described here are in the process of standardization. When the feedback process is opened we will update this blog post with the place to raise issues with these proposed methods.

Thanks #

The implementation of Canvas Text Features was aided by Igalia S.L. funded by Bloomberg L.P.

July 10, 2024 12:00 AM

July 09, 2024

Guilherme Piccoli

Presenting kdumpst, or how to collect kernel crash logs on Arch Linux

It’s been a long time since I last posted something here – yet there are interesting news to mention, a bit far from ARM64 (the topic of my previous posts). Let’s talk today about kernel crashes, or even better, how can we collect information if a kernel panic happens on Arch Linux and on SteamOS, … Continue reading "Presenting kdumpst, or how to collect kernel crash logs on Arch Linux"

by gpiccoli at July 09, 2024 07:21 PM

July 08, 2024

Frédéric Wang

My recent contributions to Gecko (2/3)

Introduction

This is the second in a series of blog posts describing new web platform features Igalia has implemented in Gecko, as part of an effort to improve browser interoperability. I’ll talk about the task of implementing ‘content-visibility’, to which several Igalians have contributed since early 2022, and I’ll focus on two main roadblocks I had to overcome.

The ‘content-visibility’ property

In the past, Igalia worked on CSS containment, a feature allowing authors to isolate a subtree from the rest of the document to improve rendering performance. This is done using the ‘contain’ property, which accepts four kinds of containment: size, layout, style and paint.

‘content-visibility’ is a new property allowing authors to “hide” some content from the page, and save the browser unnecessary work by applying containment. The most interesting one is probably content-visibility: auto, which hides content that is not relevant to the user. This is essentially native “virtual scrolling”, allowing you to build virtualized or “recycled” lists without breaking accessibility and find-in-page.

To explain this, consider the typical example of a page with a series of posts, as shown below. By default, each post would have the four types of containment applied; it won’t be painted, won’t respond to hit-testing, and will use the dimensions specified in the ‘contain-intrinsic-size’ property. It’s only once a post becomes relevant to the user (e.g. when scrolled close enough to the viewport, or when focus is moved into the post) that the actual effort to properly render the content, and calculate its actual size, is performed:

div.post {
  content-visibility: auto;
  contain-intrinsic-size: 500px 1000px;
}
<div class="post">
...
</div>
<div class="post">
...
</div>
<div class="post">
...
</div>
<div class="post">
...
</div>

If a post later loses its relevance (e.g. when scrolled away, or when focus is lost) then it would use the dimensions specified by ‘contain-intrinsic-size’ again, discarding the content size that was obtained after layout. One can also avoid that and use the last remembered size instead:

div.post {
  contain-intrinsic-size: auto 500px auto 1000px;
}

Finally, there is also a content-visibility: hidden value, which is the same as content-visibility: auto but never reveals the content, enhancing other methods to hide content such as display: none or visibility: hidden.

This is just a quick overview of the feature, but I invite you to read the web.dev article on content-visibility for further details and thoughts.

Viewport distance for content-visibility: auto

As is often the case, the feature looks straightforward to implement, but issues appear when you get into the details.

In bug 1807253, my colleague Oriol Brufau raised an interoperability bug with a very simple test case, reproduced below for convenience. Chromium would report 0 and 42, whereas Firefox would sometimes report 0 twice, meaning that the post did not become relevant after a rendering update:

<!DOCTYPE html>
<div id="post" style="content-visibility: auto">
  <div style="height: 42px"></div>
</div>
<script>
console.log(post.clientHeight);
requestAnimationFrame(() => requestAnimationFrame(() => {
  console.log(post.clientHeight);
}));
</script>

It turned out that an early version of the specification relied too heavily on a modified version of IntersectionObserver to synchronously detect when an element is close to the viewport, as this was how it was implemented in Chromium. However, the initial implementation in Firefox relied on a standard IntersectionObserver (with asynchronous notifications of observers) and so failed to produce the behavior described in the specification. This issue was showing up in several WPT failures.

To solve that problem, the moment when we determine an element’s proximity to the viewport was moved into the HTML5 specification, at the step when the rendering is updated, more precisely when the ResizeObserver notifications are broadcast. My colleague Alexander Surkov had started rewriting Firefox’s implementation to align with this new behavior in early 2023, and I took over his work in November.

Since this touches the “update the rendering” step which is executed on every page, it was quite likely to break things… and indeed many regressions were caused by my patch, for example:

  • One regression was about white flickering of pages on every reload/navigation.
  • One more regression was about content-visibility: auto nodes not being rendered at all.
  • Another regression was about new resize loop errors appearing in tests.
  • Some test cases were also found where the “update the rendering step” would repeat indefinitely, causing performance regressions.
  • Last but not least, crashes were reported.

Some of these issues were due to the fact that support for the last remembered size in Firefox relied on an internal ResizeObserver. However, the CSS Box Sizing spec only says that the last remembered size is updated when ResizeObserver events are delivered, not that such an internal ResizeObserver object is actually needed. I removed this internal observer and ensured the last remembered size is computed directly in the “update the rendering” phase, making the whole thing simpler and more robust.

Dynamic changes to CSS ‘contain’ and ‘content-visibility’

Before sending the intent-to-ship, we reviewed remaining issues and stumbled on bug 1765615, which had been opened during the initial 2022 work. Mozilla indicated this performance bug was important enough to consider an optimization, so I started tackling the issue.

Elaborating a bit about what was mentioned above, a non-visible ‘content-visibility’ implies layout, style and paint containment, and when the element is not relevant to the user, it also implies size containment 1. This has certain side effects: for example, paint and layout containment establish an independent formatting context and affect how the contained box interacts with floats and how margin collapsing applies. Style containment can have even more drastic consequences, since it makes the counter-* and *-quote properties scoped to the subtree.

When we dynamically modify the ‘contain’ or ‘content-visibility’ properties, or when the relevance of a content-visibility: auto element changes, browsers must make sure that the rendering is properly updated. It turned out that there were almost no tests for that, and unsurprisingly, Chromium and WebKit had various invalidation bugs. Firefox was always forcing a rebuild of the tree used for rendering, which avoided such bugs but is not optimal.

I wrote a couple of web platform tests for ‘contain’ and ‘content-visibility’ 2, and made sure that Firefox does the minimal invalidation effort needed, being careful not to cause any regressions. As a result, except for style containment changes, we’re now able to avoid the cost of rebuilding the tree used for rendering!
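
To give an idea of what such a dynamic-change test can look like, here is a hypothetical sketch in the usual testharness.js style (not one of the actual tests, and the intrinsic size value is arbitrary): it toggles ‘content-visibility’ at runtime and checks that layout is invalidated immediately.

<!DOCTYPE html>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<div id="post" style="content-visibility: hidden; contain-intrinsic-size: 100px">
  <div style="height: 42px"></div>
</div>
<script>
test(() => {
  // While hidden, size containment applies and the intrinsic size is used.
  assert_equals(post.clientHeight, 100);
  // Dynamically removing the property must invalidate layout right away.
  post.style.contentVisibility = "visible";
  assert_equals(post.clientHeight, 42);
}, "Dynamic changes to 'content-visibility' invalidate layout");
</script>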

Conclusion

Almost two years after the initial work on ‘content-visibility’, I was able to send the intent-to-ship, and the feature finally became available in Firefox 125. Finishing the implementation work on this feature was challenging, but quite interesting to me.

I believe ‘content-visibility’ is a good example of why implementing a feature in different browsers is important to ensure that both the specification and tests are good enough. The lack of details in the spec regarding when we determine viewport proximity, and the absence of WPT tests for invalidation, definitely made the Firefox work take longer than expected. But finishing that implementation work was also useful for improving the spec, tests, and other implementations 3.

I’ll conclude this series of blog posts with fetch priority, which also has its own interesting story…

  1. In both cases, “implies” means the used value of ‘contain’ is modified accordingly. 

  2. One of the things I had to handle with care was the update of the accessibility tree, since content that is not relevant to the user must not be exposed. Unfortunately it’s not possible to write WPT tests for accessibility yet, so for now I had to write internal Firefox-specific non-regression tests. 

  3. Another interesting report happened after the release and is related to content-visibility: auto on elements drawn in a canvas. 

July 08, 2024 10:00 PM

July 01, 2024

Alex Bradbury

pwr

Summary

pwr (paced web reader) is a script and terminal-centric workflow I use for keeping up to date with various sources online, shared on the off chance it's useful to you too.

Motivation

The internet is (mostly) a wonderful thing, but it's kind of a lot. It can be distracting, and I think we all know the unhealthy loops of scrolling and refreshing the same sites. pwr provides a structured workflow for keeping up to date with a preferred set of sites in an incremental fashion (willpower required). It takes some inspiration from a widely reported workflow that involved sending a URL to a server and having it returned via email to be read in a batch later. pwr adopts the delayed gratification aspect of this but doesn't involve downloading for offline reading.

The pwr flow

One-time setup:

  • Configure the pwr script so it supports your desired feed sources (RSS or using hand-written extractors for those that don't have a good feed).

Regular workflow (just run pwr with no arguments to initiate this sequence in one invocation):

  • Run pwr read to open any URLs that were previously queued for reading.
  • Run pwr fetch to get any new URLs from the configured sources.
  • Run pwr filter to open an editor window where you can quickly mark which retrieved articles to queue for reading.

In my preferred usage, the above is run once a day as a replacement for unstructured web browsing. This flow means you're always reading items that were identified the previous day. Although comments on sites such as Hacker News or Reddit are much maligned, I do find they can be a source of additional insight, and this flow means that by the time you're reading a post ~24 hours after it was initially found, discussion has died down so there's little reason to keep refreshing.

pwr filter is the main part requiring active input, and involves the editor in a way that is somewhat inspired by git rebase -i. For instance, at the time of writing it produces the following output (and you would simply replace the d prefix with r for any you want to queue to read):

------------------------------------------------------------
Filter file generated at 2024-07-01 08:51:54 UTC
DO NOT DELETE OR MOVE ANY LINES
To mark an item for reading, replace the 'd' prefix with 'r'
Exit editor with non-zero return code (:cq in vim) to abort
------------------------------------------------------------

# Rust Internals
d [Discussion] Hybrid borrow (0 replies)

# Swift Evolution
d [Pitch #2] Safe Access to Contiguous Storage (27 replies)
d [Re-Proposal] Type only Unions (69 replies)

# HN
d Programmers Should Never Trust Anyone, Not Even Themselves
d Unification in Elixir
d Quaternions in Signal and Image Processing

# lobste.rs
d Code Reviews Do Find Bugs
d Integrated assembler improvements in LLVM 19
d Cubernetes
d Grafana security update: Grafana Loki and unintended data write attempts to Amazon S3 buckets
d regreSSHion: RCE in OpenSSH's server, on glibc-based Linux systems (CVE-2024-6387)
d Elaboration of the PostgreSQL sort cost model

# /r/programminglanguages
d Rate my syntax (Array Access)

Making it your own

Ultimately pwr is a tool that happens to scratch an itch for me. It's out there in case any aspect of it is useful to you. It's very explicitly written as a script, where the expected usage is that you take a copy and make what modifications you need for yourself (changing sources, adding new fetchers, or other improvements).


Article changelog
  • 2024-07-01: Initial publication date.

July 01, 2024 10:00 AM

June 26, 2024

Lucas Fryzek

Software Rendering and Android

My current project at Igalia has had me working on Mesa’s software renderers, llvmpipe and lavapipe. I’ve been working to get them running on Android, and I wanted to document the progress I’ve made, the challenges I’ve faced, and talk a little bit about the development process for a project like this. My work is not totally merged into upstream mesa yet, but you can see the MRs I made here:

Setting up an Android development environment

Getting system-level software to build and run on Android is unfortunately not straightforward. Since we are doing software rendering we don’t need a physical device and can instead make use of the Android emulator. If you didn’t know, Android has two emulators: the common one most people use is “goldfish”, and the other, lesser-known one is “cuttlefish”. For this project I did my work on the cuttlefish emulator, as it’s meant for testing the Android OS itself rather than just Android apps, and is more reflective of real hardware. The cuttlefish emulator takes a little bit more work to set up, and I’ve found that it only works properly on Debian-based Linux distros. I run Fedora, so I had to run the emulator in a Debian VM.

Thankfully Google has good instructions for building and running cuttlefish, which you can find here. The instructions show you how to set up the emulator using nightly build images from Google. We’ll also need to set up our own Android OS images, so after we’ve confirmed we can run the emulator, we need to start looking at building AOSP.

For building our own AOSP image, we can also follow the instructions from Google here. For the target we’ll want aosp_cf_x86_64_phone-trunk_staging-eng. At this point it’s a good idea to verify that you can build the image, which you can do by following the rest of the instructions on the page. Building AOSP from source does take a while though, so prepare to wait potentially an entire day for the image to build. Also, if you get errors complaining that you’re out of memory, you can try to reduce the number of parallel builds. Google officially recommends having 64GB of RAM, and I only had 32GB, so some packages had to be built with the parallel builds set to 1 so I wouldn’t run out of RAM.
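
For reference, the overall build sequence looks roughly like the following (the lunch target is the one mentioned above; the job counts are only an illustration of the memory trade-off, not a recommendation):

# Run from the root of the AOSP checkout.
$ source build/envsetup.sh
$ lunch aosp_cf_x86_64_phone-trunk_staging-eng

# Full build; lower the parallelism if the build runs out of memory.
$ m -j$(nproc)
$ m -j1   # e.g. for the handful of packages that exhausted my 32GB of RAM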

For running this custom-built image on Cuttlefish, you can just copy all the *.img files from out/target/product/vsoc_x86_64/ to the root cuttlefish directory, and then launch cuttlefish. If everything worked successfully you should be able to see your custom built AOSP image running in the cuttlefish webui.

Building Mesa targeting Android

Working from the changes in MR !29344 building llvmpipe or lavapipe targeting Android should just work™️. To get to that stage required a few changes. First llvmpipe actually already had some support on Android, as long as it was running on a device that supports a DRM display driver. In that case it could use the dri window system integration which already works on Android. I wanted to get llvmpipe (and lavapipe) running without dri, so I had to add support for Android in the drisw window system integration.

To support Android in drisw, this mainly meant adding support for importing dmabufs as framebuffers. The Android windowing system will provide us with a “gralloc” buffer which contains a dmabuf fd that represents the framebuffer. Adding support for importing dmabufs in drisw means we can import and begin drawing to these framebuffers. Most of the changes to support that can be found in drisw_allocate_textures and the underlying changes to llvmpipe to support importing dmabufs in MR !27805. The EGL Android platform code also needed some changes to use the drisw window system code. Previously this code would only work with true dri drivers, but with some small tweaks it was possible to have it initialize the drisw window system and then use it for rendering if no hardware devices are available.

For lavapipe the changes were a lot simpler. The Android Vulkan loader requires your driver to have HAL_MODULE_INFO_SYM symbol in the binary, so that got created and populated correctly, following other Vulkan drivers in Mesa like turnip. Then the image creation code had to be modified to support the VK_ANDROID_native_buffer extension which allows the Android Vulkan loader to create images using Android native buffer handles. Under the hood this means getting the dmabuf fd from the native buffer handle. Thankfully mesa already has some common code to handle this, so I could just use that. Some other small changes were also necessary to address crashes and other failures that came up during testing.

With the changes out of the way we can now start building Mesa on Android. For this project I had to update the Android documentation for Mesa to include steps for building LLVM for Android, since the version Google ships with the NDK is missing libraries that llvmpipe/lavapipe need to function. You can see the updated documentation here and here. After sorting out LLVM, building llvmpipe/lavapipe is the same as building any other Mesa driver for Android: we set up a cross file to tell meson how to cross compile and then we run meson. At this point you could manually modify the Android image and copy these files to the VM, but I also wanted to support building a new AOSP image directly including the driver. In order to do that you also have to rename the driver binaries to match Android’s naming convention, and make sure SO_NAME matches as well. If you check out this section of the documentation I wrote, it covers how to do that.
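
As a rough sketch of the meson step, the cross file and invocation could look something like this (the toolchain names, SDK version and option values are illustrative assumptions; the updated Mesa documentation linked above has the authoritative settings):

# android-aarch64.txt -- hypothetical meson cross file pointing at an NDK toolchain
[binaries]
ar = 'llvm-ar'
c = 'aarch64-linux-android34-clang'
cpp = 'aarch64-linux-android34-clang++'
strip = 'llvm-strip'

[host_machine]
system = 'android'
cpu_family = 'aarch64'
cpu = 'armv8'
endian = 'little'

# then configure and build:
$ meson setup build-android --cross-file android-aarch64.txt \
      -Dplatforms=android -Dplatform-sdk-version=34 \
      -Dgallium-drivers=swrast -Dvulkan-drivers=swrast
$ ninja -C build-android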

If you followed all of that you should have built a version of llvmpipe and lavapipe that you can run on Android’s cuttlefish emulator.

Android running lavapipe

References

June 26, 2024 11:00 PM

June 24, 2024

Changwoo Min

sched_ext: scheduler architecture and interfaces (Part 2)

This is the second blog post about the sched_ext, a BPF-based extensible scheduler class. Just in case you didn’t read the first one, you can find it here. In this blog post, I briefly update what has been happening in sched_ext, then introduce the scheduler architecture and the sched_ext API. After reading this, you should have a good understanding of the sched_ext architecture and be ready to read the source code of any sched_ext schedulers. Let’s get started!

Now, it is happening! #

After a long debate about the sched_ext patch, Linus finally approved its inclusion in the 6.11 kernel. As of the time of this writing, the V7 patch set has been posted to the mailing list. In addition, the developer and user communities around the sched_ext are expanding. More developers have been attending the weekly office hours. If you are interested in scheduler development, please join our Slack channel. Also, many enthusiasts (especially from the CachyOS community) have been actively testing, benchmarking, and bug-reporting the sched_ext schedulers.

Where to start developing a sched_ext scheduler? #

As discussed in the first blog post, sched_ext is an extensible scheduler class, so more than one sched_ext-based scheduler exists (e.g., scx_lavd, scx_rusty, and scx_rustland). To play with sched_ext schedulers, you should use a sched_ext-enabled kernel. There are three ways to get a sched_ext-enabled kernel:

  • You can install a Linux distribution that natively supports the sched_ext kernel by default. As of this writing, CachyOS—an Arch-based Linux distribution—supports the sched_ext enabled kernel and the BPF schedulers. This might be the most straightforward if you plan to set up a new machine. See this tutorial, which explains how to use sched_ext schedulers in CachyOS.

  • If you plan to explore the sched_ext kernel as well, you can start with a virtual machine. virtme_ng is an excellent environment, which makes your VM-based kernel development seamless. See Andrea’s blog post on virtme_ng for more details.

  • If you finally decide to run your own sched_ext kernel on your machine, you should build the kernel with proper config options. See the details in my first blog post.

Now, you are ready to hack. Congrats!

Understanding the big picture of Linux schedulers #

Before getting your hands dirty, let’s step back and see the big picture of the Linux scheduler first.

The following diagram shows the architecture of the Linux scheduler. As the figure illustrates, the Linux scheduler conceptually consists of two layers. The core kernel scheduler (core.c) defines the common behavior of all scheduling policies. For example, the core kernel scheduler defines what to do upon a timer interrupt (or scheduler tick). At a high level, it is analogous to the abstract base class for all schedulers, in C++ terms.

Concrete scheduling policies are defined on top of the core kernel scheduler. For example, Linux provides real-time scheduling policies (rt.c) such as FIFO (First-In, First-Out, SCHED_FIFO) and RR (Round Robin, SCHED_RR). The fair scheduling policy (SCHED_NORMAL, EEVDF: Earliest Eligible Virtual Deadline First) governs non-realtime regular tasks.

This two-level architecture allows one to write a new scheduling policy while reusing the logic of the core kernel scheduler. In Linux, such an architecture is called a scheduler class.

                                                  +========================+
                                                  || User-space part of   ||
                                                  || your scheduler       ||
                                                  || (e.g., main.rs)      ||
                                                  +========================+
~~~~~~~~~~~~~~~~ <kernel space vs. user space boundary > ~~~~~~~~~~~~~~~~~~~
                                                  +========================+
                                                  || Your BPF scheduler   ||
                                                  || (e.g., main.bpf.c)   ||
                                                  +========================+
+-----------------------+ +---------------------+ +========================+
| Default scheduler     | | Real-time scheduler | || Sched_ext framework  ||
| (EEVDF)               | | (FIFO, RR)          | ||                      ||
| (kernel/sched/fair.c) | | (kernel/sched/rt.c) | || (kernel/sched/ext.c) ||
+-----------------------+ +---------------------+ +========================+
+--------------------------------------------------------------------------+
|   Core kernel scheduler                                                  |
|   (kernel/sched/core.c)                                                  |
+--------------------------------------------------------------------------+

The sched_ext framework inside the kernel (ext.c) is defined on top of the core kernel scheduler like any other scheduler class (e.g., fair.c and rt.c). That implies all the scheduler classes should communicate with the core kernel scheduler via the same interface; we will discuss that interface shortly.

The sched_ext framework is a vessel that ships a BPF scheduler with a user-defined policy. The sched_ext framework adds two more layers on top of it: a BPF scheduler layer and a user-space process, which interacts with the BPF scheduler. Writing a new sched_ext scheduler means writing a new BPF scheduler and its user-space counterpart.

Sharing time is different from sharing space #

You may notice that the idea of a scheduler class is somewhat similar to the idea of a VFS (Virtual File System) layer for file systems. Both define the common behavior of concrete policies (e.g., a new scheduling policy or file system layout) and the common interfaces for them.

Yet, there is a key distinction. VFS, with its focus on spatially partitioned disk space (or drive), allows for the simultaneous existence of multiple file systems, as long as each manages disjoint disk (or drive) space.

 SCHED_FIFO => SCHED_RR => SCHED_NORMAL (either EEVDF or sched_ext)

In contrast, scheduling is about how to slice and use CPU time, so at any given point, only one scheduling policy can make scheduling decisions. In that sense, the scheduler classes are hierarchical in using CPU time. Real-time schedulers will always take the CPU time first (SCHED_FIFO => SCHED_RR). Only if no real-time task wants more CPU time do the schedulers for “normal” tasks (SCHED_NORMAL) get their turn. Since EEVDF and sched_ext are both defined as SCHED_NORMAL, they cannot coexist at any given point. When the sched_ext-based scheduler becomes active, it will take over all normal class tasks, replacing EEVDF.*

(*Note that sched_ext provides a special mode for scheduling only a subset of normal class tasks, mainly for testing purposes. However, I won’t discuss it in this post for brevity of discussion.)

Interface matters #

A sched_ext scheduler consists of four layers, as discussed from bottom to top: 1) core kernel scheduler, 2) sched_ext framework, 3) BPF scheduler, and 4) BPF’s user-space counterpart. These layers interact with each other through (relatively) well-defined interfaces. We now discuss those interfaces to understand how a scheduler works in action.

+========================+
|| User-space part of   ||
|| your scheduler       ||
|| (e.g., main.rs)      ||
+========================+
   \\//   ^^^^
   \\//   ^^^^ <== Interface 4: BPF scheduler <=> user-space counter part
   \\//   ^^^^
+========================+
|| Your BPF scheduler   ||
|| (e.g., main.bpf.c)   ||
+========================+
   ^^^^   \\// <== Interface 3: BPF scheduler => sched_ext framework
   ^^^^   \\//
   ^^^^  <======== Interface 2: sched_ext framework => BPF scheduler
+========================+
|| Sched_ext framework  ||
||                      ||
|| (kernel/sched/ext.c) ||
+========================+
          ^^^^
          ^^^^ <== Interface 1: core kernel scheduler => scheduler class
          ^^^^
+------------------------+
| Core kernel scheduler  |
|  (kernel/sched/core.c) |
+------------------------+

Interface 1: core kernel scheduler ⇒ scheduler class #

The core kernel scheduler defines the common underlying behavior for scheduling and defines an interface for concrete scheduler classes (e.g., SCHED_FIFO, SCHED_NORMAL). It defines an ops structure (struct sched_class) – a struct of function pointers – that needs to be filled by a concrete scheduler class to implement a concrete scheduling policy.

In the Linux kernel, a task (thread or process) is represented by struct task_struct, and a runnable (i.e., non-sleeping, ready-to-run) task should be in one of the run queues (struct rq) associated with each CPU. Scheduling is the problem of managing runnable tasks through those run queues.

In a particular situation, when each scheduling policy needs its specific action, the core kernel scheduler calls an operation defined in struct sched_class. For example, when the core kernel scheduler needs to select a task to be scheduled, it calls the sched_class.pick_next_task(rq) callback of a concrete scheduling policy. When a task becomes runnable, the core kernel scheduler calls sched_class.enqueue_task(rq, p, flags) so the concrete scheduling policy enqueues task p to run queue rq. When a task’s runtime state needs to be updated, the core kernel scheduler calls sched_class.update_curr(rq).

/* kernel/sched/sched.h */
struct sched_class {
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)   (struct rq *rq);
    bool (*yield_to_task)(struct rq *rq, struct task_struct *p);

    void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);

    struct task_struct *(*pick_next_task)(struct rq *rq);

    void (*put_prev_task)(struct rq *rq, struct task_struct *p);
    void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

    int  (*balance)(struct rq *rq, struct task_struct *prev,
                    struct rq_flags *rf);
    int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);

    struct task_struct *(*pick_task)(struct rq *rq);

    void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

    void (*task_woken)(struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                             struct affinity_context *ctx);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);

    struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);

    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork)(struct task_struct *p);
    void (*task_dead)(struct task_struct *p);

    void (*switching_to) (struct rq *this_rq, struct task_struct *task);
    void (*switched_from)(struct rq *this_rq, struct task_struct *task);
    void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
    void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
                          int newprio);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                          int oldprio);

    unsigned int (*get_rr_interval)(struct rq *rq,
                                    struct task_struct *task);

    void (*update_curr)(struct rq *rq);
    /* ... */
};

The sched_ext kernel framework implements the functions required in struct sched_class. For example, when the core kernel scheduler enqueues a task, selects the next task to be scheduled, or updates a task’s runtime state, it eventually calls enqueue_task_scx(), pick_next_task_scx(), and update_curr_scx(), respectively, through struct sched_class.

/* kernel/sched/ext.c */
DEFINE_SCHED_CLASS(ext) = {
    .enqueue_task      = enqueue_task_scx,
    .dequeue_task      = dequeue_task_scx,
    .yield_task        = yield_task_scx,
    .yield_to_task     = yield_to_task_scx,

    .wakeup_preempt    = wakeup_preempt_scx,

    .pick_next_task    = pick_next_task_scx,

    .put_prev_task     = put_prev_task_scx,
    .set_next_task     = set_next_task_scx,

#ifdef CONFIG_SMP
    .balance           = balance_scx,
    .select_task_rq    = select_task_rq_scx,
    .set_cpus_allowed  = set_cpus_allowed_scx,

    .rq_online         = rq_online_scx,
    .rq_offline        = rq_offline_scx,
#endif

#ifdef CONFIG_SCHED_CORE
    .pick_task         = pick_task_scx,
#endif

    .task_tick         = task_tick_scx,

    .switching_to      = switching_to_scx,
    .switched_from     = switched_from_scx,
    .switched_to       = switched_to_scx,
    .reweight_task     = reweight_task_scx,
    .prio_changed      = prio_changed_scx,

    .update_curr       = update_curr_scx,

#ifdef CONFIG_UCLAMP_TASK
    .uclamp_enabled    = 1,
#endif
};

The sched_ext framework provides the common implementation for BPF schedulers. For example, when a task’s state needs to be updated, update_curr_scx() in the sched_ext framework decrements the time slice of the currently running task.

static void update_curr_scx(struct rq *rq)
{
    struct task_struct *curr = rq->curr;
    u64 now = rq_clock_task(rq);
    u64 delta_exec;

    delta_exec = now - curr->se.exec_start;
    /* ... */
    curr->scx.slice -= min(curr->scx.slice, delta_exec);
    /* ... */
}

Interface 2: sched_ext framework ⇒ BPF scheduler #

Let’s think about another case: enqueuing a runnable task to a run queue. The core kernel scheduler will call sched_class.enqueue_task(), which is actually enqueue_task_scx() in the sched_ext framework. How should the enqueue_task_scx() function be designed? The proper data structure and how tasks are organized in the queue depend on the BPF scheduler’s scheduling policy. For example, a FIFO queue would be the best data structure for the FIFO policy. For a deadline-based scheduling policy, it would be nice to maintain runnable tasks ordered by their deadlines, so any ordered map (e.g., a red-black tree) would be appropriate.

Now let’s briefly examine how sched_ext implements enqueue_task_scx(). As the following code snippet shows, enqueue_task_scx() eventually calls scx_ops.enqueue(args), which is an enqueue implementation of a BPF scheduler.

/* kernel/sched/ext.c */

static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
{
    int sticky_cpu = p->scx.sticky_cpu;

    enq_flags |= rq->scx.extra_enq_flags;
    /* ... */
    do_enqueue_task(rq, p, enq_flags, sticky_cpu);
}

static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
                            int sticky_cpu)
{
    /* ... */
    SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags);
    /* ... */
}

#define SCX_CALL_OP_TASK(mask, op, task, args...)     \
do {                                                  \
    /* ... */                                         \
    SCX_CALL_OP(mask, op, task, ##args);              \
    /* ... */                                         \
} while (0)

#define SCX_CALL_OP(mask, op, args...)                \
do {                                                  \
    /* ... */                                         \
    scx_ops.op(args);                                 \
    /* ... */                                         \
} while (0)

The sched_ext framework defines an operation structure – struct sched_ext_ops, where scx_ops.enqueue() is defined. The following code snippet is a simplified version of struct sched_ext_ops. The full version with API description can be found in the source code.

/* kernel/sched/ext.c */
struct sched_ext_ops {
    s32  (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);

    void (*enqueue)(struct task_struct *p, u64 enq_flags);
    void (*dispatch)(s32 cpu, struct task_struct *prev);

    void (*tick)(struct task_struct *p);

    void (*runnable)(struct task_struct *p, u64 enq_flags);
    void (*running)(struct task_struct *p);
    void (*stopping)(struct task_struct *p, bool runnable);
    void (*quiescent)(struct task_struct *p, u64 deq_flags);

    /* ... */

    void (*update_idle)(s32 cpu, bool idle);

    /* ... */

    s32  (*init_task)(struct task_struct *p, struct scx_init_task_args *args);
    void (*exit_task)(struct task_struct *p, struct scx_exit_task_args *args);

    /* ... */

    s32  (*init)(void);
    void (*exit)(struct scx_exit_info *info);

    /* ... */
};

static struct sched_ext_ops scx_ops;

There seem to be a lot of callbacks in a BPF scheduler, but most of them are pretty straightforward.

When a BPF scheduler is loaded and unloaded, you have control over it. The sched_ext_ops.init() and sched_ext_ops.exit() functions are your tools to initialize and de-initialize BPF scheduler-wide data structures. This is where you can create and destroy per-CPU structures and global run queues managed by the BPF scheduler.

When a new task is created, sched_ext_ops.init_task() is called. As you can expect, sched_ext_ops.exit_task() is called when a task is terminated. You can manage per-task data structures there.

During its lifetime, a task will first become runnable (sched_ext_ops.runnable()), then it will be picked by the scheduler and start running (sched_ext_ops.running()). Then, when its time slice is exhausted, the task will be stopped and scheduled out (sched_ext_ops.stopping()). The task becomes non-runnable (sched_ext_ops.quiescent()) when, for example, it waits for an IO event.

When a task wakes up, for example, when an IO event is delivered, the BPF scheduler should first decide which CPU the task should run on (sched_ext_ops.select_cpu()). Then, the task will be enqueued to a BPF-managed run queue (sched_ext_ops.enqueue()).

Each CPU will continue to execute the tasks assigned to it. When a CPU has nothing to run, it will get a task from the BPF-managed run queue (sched_ext_ops.dispatch()). If there is no task to run even on the BPF-managed run queue, the CPU will go idle to save power until there is something to run. The state transitions to and from the idle state can be tracked with sched_ext_ops.update_idle().

Upon every scheduler tick (i.e., every 1/HZ seconds), sched_ext_ops.tick() is called, so the BPF-scheduler can periodically do chores (e.g., power management, preemption, etc).

That’s (almost) all! Writing a new BPF scheduler is nothing but implementing those struct sched_ext_ops functions. Note that the sched_ext framework provides a default implementation of each callback, so when a callback implementation is not provided by the BPF scheduler, the default implementation will be executed. That’s straightforward, huh?

Interface 3: BPF scheduler ⇒ sched_ext framework #

A BPF scheduler needs to talk to the sched_ext kernel framework. For instance, after choosing a task to be run, the sched_ext framework should somehow be notified about it. There are two channels from a BPF scheduler to the sched_ext framework: 1) BPF helper functions and 2) dispatch queues (DSQs). A DSQ is itself created and manipulated through the BPF helper functions. Here is a list of some essential BPF helper functions. A full list of BPF helper functions is here, and their implementations with full API descriptions are here.

/* tools/sched_ext/include/scx/common.bpf.h */
s32 scx_bpf_create_dsq(u64 dsq_id, s32 node);
s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;

void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id,
                      u64 slice, u64 enq_flags);
void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice,
                            u64 vtime, u64 enq_flags);

bool scx_bpf_consume(u64 dsq_id);
bool scx_bpf_consume_task(unsigned long it, struct task_struct *p);

s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
                           u64 wake_flags, bool *is_idle);

void scx_bpf_kick_cpu(s32 cpu, u64 flags);

Dispatch queue (DSQ) is a core construct between a BPF scheduler and the sched_ext kernel framework, so many BPF helper functions are around DSQ. DSQ is a queue holding runnable tasks. Tasks in a DSQ can be ordered by an arrival order (FIFO) or a vtime (virtual time) order, where vtime is an integer associated with a task. The sched_ext framework also creates and maintains its internal DSQs: 1) one system-wide (named global DSQ) and 2) one for each CPU (named local DSQ). Both internal DSQs are FIFO-ordered. Each CPU in the sched_ext framework runs a task on its local DSQ in a FIFO order.

A BPF scheduler can also create DSQs (either FIFO or vtime DSQ) to manage runnable tasks in its own way (scx_bpf_create_dsq()). sched_ext_ops.init() is a typical place to create custom DSQs.

A task can be enqueued to the DSQ in a FIFO order (scx_bpf_dispatch()) or vtime order (scx_bpf_dispatch_vtime()), typically at sched_ext_ops.enqueue().

A task in a DSQ can be consumed from the tip (scx_bpf_consume()), or a specific task can be consumed (scx_bpf_consume_task()), typically at sched_ext_ops.dispatch(). Here, consuming a task means that a task is removed from a custom DSQ and moved to the internal local DSQ, so a CPU can run a task from its local DSQ.

In addition, the number of tasks in a DSQ can be queried (scx_bpf_dsq_nr_queued()). At the end, a DSQ should be destroyed (scx_bpf_destroy_dsq()).

Besides DSQ-related BPF helpers, two other helper functions are notable: scx_bpf_select_cpu_dfl() and scx_bpf_kick_cpu(). scx_bpf_select_cpu_dfl() finds an appropriate CPU to run a task using the default CPU selection policy. It can be used in sched_ext_ops.select_cpu(). scx_bpf_kick_cpu() sends an IPI (Inter-Processor Interrupt) to another CPU, either to wake up an idle CPU (SCX_KICK_IDLE flag) or to ask it to preempt the currently running task (SCX_KICK_PREEMPT flag).
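
To tie interfaces 2 and 3 together, here is a minimal, hypothetical global-FIFO scheduler sketch. It is not one of the existing schedulers; it just wires a few struct sched_ext_ops callbacks to the DSQ helpers above. The BPF_STRUCT_OPS/SCX_OPS_DEFINE conveniences are assumed to come from common.bpf.h, and the exact names can differ between sched_ext versions.

/* minimal_fifo.bpf.c -- a hypothetical, minimal global-FIFO sched_ext scheduler. */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

#define FIFO_DSQ 0  /* id of our single custom FIFO dispatch queue */

/* Create the custom DSQ when the BPF scheduler is loaded. */
s32 BPF_STRUCT_OPS_SLEEPABLE(fifo_init)
{
    return scx_bpf_create_dsq(FIFO_DSQ, -1 /* any NUMA node */);
}

/* Every runnable task goes to the tail of the single FIFO DSQ. */
void BPF_STRUCT_OPS(fifo_enqueue, struct task_struct *p, u64 enq_flags)
{
    scx_bpf_dispatch(p, FIFO_DSQ, SCX_SLICE_DFL, enq_flags);
}

/* When a CPU runs out of work, move the head of the FIFO DSQ to its local DSQ. */
void BPF_STRUCT_OPS(fifo_dispatch, s32 cpu, struct task_struct *prev)
{
    scx_bpf_consume(FIFO_DSQ);
}

SCX_OPS_DEFINE(fifo_ops,
               .init     = (void *)fifo_init,
               .enqueue  = (void *)fifo_enqueue,
               .dispatch = (void *)fifo_dispatch,
               .name     = "minimal_fifo");

Every callback not set here falls back to the default implementation provided by the sched_ext framework, as mentioned earlier.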

Interface 4: BPF scheduler ⟺ user-space counterpart #

Since the BPF scheduler is a regular BPF program, you can use any BPF tricks you like. A user-space program (typically written in C or Rust) can use any libbpf API. Most importantly, a BPF map is accessed and manipulated by a user-space program (e.g., bpf_map_lookup_elem()).
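
As a sketch of that user-space side (the names match the hypothetical minimal scheduler above, the skeleton header is assumed to be generated with `bpftool gen skeleton`, and the statistics map is made up for illustration), loading the BPF scheduler and polling a shared map could look like this:

/* minimal_fifo_user.c -- hypothetical user-space counterpart using libbpf. */
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "minimal_fifo.bpf.skel.h"

int main(void)
{
    struct minimal_fifo_bpf *skel = minimal_fifo_bpf__open_and_load();
    if (!skel)
        return 1;

    /* Attaching the struct_ops map is what activates the BPF scheduler. */
    struct bpf_link *link = bpf_map__attach_struct_ops(skel->maps.fifo_ops);
    if (!link)
        return 1;

    /* Poll a hypothetical statistics map shared with the BPF side. */
    unsigned int key = 0;
    unsigned long long nr_enqueued = 0;
    for (;;) {
        bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats), &key, &nr_enqueued);
        printf("tasks enqueued so far: %llu\n", nr_enqueued);
        sleep(1);
    }
}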

What’s next? #

The best way to digest the sched_ext concept is to read the existing BPF scheduler code. I strongly encourage you to read the source code of one of scx_lavd, scx_rusty, or scx_rustland. Then, you will have a solid foundation to write a new BPF scheduler or improve an existing one. In the next blog post, I will discuss how to find out what to improve as an author of the scx_lavd scheduler. I hope you enjoyed this blog post. Happy scheduling until the next blog post!

June 24, 2024 12:00 AM

June 20, 2024

Frédéric Wang

My recent contributions to Gecko (1/3)

Introduction

Igalia has been contributing to the web platform implementations of different web engines for a long time. One of our goals is ensuring that these implementations are interoperable, by relying on various web standards and web platform tests. In July 2023, I happily joined a project that focuses on this goal, and I worked more specifically on the Gecko web engine. One year later, three new features I contributed to are being shipped in Firefox. In this series of blog posts, I’ll give an overview of those features (namely registered custom properties, content visibility, and fetch priority) and my journey to make them “ride the train” as Mozilla people say.

Let’s start with registered custom properties, an enhancement of traditional CSS variables.

Registered custom properties

You may already be familiar with CSS variables, these “dash dash” names that facilitate the maintenance of a large web site by allowing author-defined CSS properties. In the example below, the :root selector defines a variable --main-theme-color with value “blue”, which is used for the style applied to other elements via the var() CSS function. As you can see, this makes the usage of the main theme color in different places more readable and makes customizing that color much easier.

:root { --main-theme-color: blue; }
p { color: var(--main-theme-color); }
section {
  padding: 1em;
  border: 1px solid var(--main-theme-color);
}
.progress-bar {
  height: 10px;
  width: 100%;
  background: linear-gradient(white, var(--main-theme-color));
}
<section>
  <p>Loading...</p>
  <div class="progress-bar"></div>
</section>

In browsers supporting CSS variables, this example renders a frame containing the text “Loading…” and a progress bar, all of these components being blue.


Having such CSS variables available is already nice, but they are lacking some features available to native CSS properties… For example, there is (almost) no syntax checking on specified values, they are always inherited, and their initial value is always the guaranteed invalid value. In order to improve on that situation, the CSS Properties and Values specification provides some APIs to register custom properties with further characteristics:

  • An accepted syntax for the property; for example, igalia | <url> | <integer>+ means either the custom identifier “igalia”, or a URL, or a space-separated list of integers.
  • Whether the property is inherited or non-inherited.
  • An initial value.

Custom properties can be registered via CSS or via a JS API, and these ways are equivalent. For example, to register --main-theme-color as a non-inherited color with initial value blue:

@property --main-theme-color {
  syntax: "<color>";
  inherits: false;
  initial-value: blue;
}
window.CSS.registerProperty({
  name: "--main-theme-color",
  syntax: "<color>",
  inherits: false,
  initialValue: "blue",
});

Interpolation of registered custom properties

By having custom properties registered with a specific syntax, we open up the possibility of interpolating between two values of the properties when performing an animation. Consider the following example, where the width of the animated div depends on the custom property --my-length. Defining this property as a length allows browsers to interpolate it continuously between 10px and 200px when it is animated:

 @property --my-length {
   syntax: "<length>";
   inherits: false;
   initial-value: 0px;
 }
 @keyframes test {
   from {
     --my-length: 10px;
   }
   to {
     --my-length: 200px;
   }
 }
 div#animated {
   animation: test 2s linear both;
   width: var(--my-length, 10px);
   height: 200px;
   background: lightblue;
 }

With non-registered custom properties, we can instead only animate discretely; --my-length would suddenly jump from 10px to 200px halfway through the duration of the animation, which is generally not what is desired for lengths.

Custom properties in the cascade

If you check the Interop 2023 Dashboard for custom properties, you may notice that interoperability was really bad at the beginning of the year, and this was mainly due to Firefox’s low score. Consequently, when I joined the project, I was asked to help with improving that situation.

Graph showing the 2023 evolution of scores and interop for custom properties

While the two registration methods previously mentioned had already been implemented, the main issue was that the CSS cascade was always treating custom properties as inherited and initialized with the guaranteed invalid value. This is indeed correct for unregistered custom properties, but it’s generally incorrect for registered custom properties!

In bug 1840478, bug 1855887, and others, I made registered custom properties work properly in the cascade, including non-inherited properties and registered initial values. But in the past, with the previous assumptions around inheritance and initial values, it was possible to store the computed values of custom properties on an element as a “cheap” map, considering only the properties actually specified on the element or an ancestor and (in most cases) only taking shallow copies of the parent’s map. As a result, when generalizing the cascade for registered custom properties, I had to be careful to avoid introducing performance regressions for existing content.

Custom properties in animations

Another area where the situation was pretty bad was animations. Not only was Firefox unable to interpolate registered custom properties between two values — one of the main motivations for the new spec — but it was actually unable to animate custom properties at all!

The main problem was that the existing animation code referred to CSS properties using an enum nsCSSPropertyID, with all custom properties represented by the single value nsCSSPropertyID::eCSSPropertyExtra_variable. To make this work for custom properties, I had to essentially replace that value with a structure containing both the nsCSSPropertyID and the name of the custom property.
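
Schematically, the change amounts to something like the following C++ sketch (illustrative only, not Gecko’s actual declarations):

// Before: all custom properties collapsed to nsCSSPropertyID::eCSSPropertyExtra_variable,
// so the animation code could not tell them apart.
//
// After (schematic): keep the enum, plus the custom property name when needed.
struct AnimatedPropertyID {
  nsCSSPropertyID mID;
  RefPtr<nsAtom> mCustomName;  // non-null only when mID is eCSSPropertyExtra_variable
};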

I uploaded patches to bug 1846516 to perform that change throughout the whole codebase, and with a few more tweaks, I was able to make registered custom properties animate discretely, but my patches still needed some polish before they could be reviewed. I had to move onto other tasks, but fortunately, some Mozilla folks were kind enough to take over this task, and more generally, complete the work on registered custom properties!

Conclusion

This was an interesting task to work on, and because a lot of the work happened in Stylo, the CSS engine shared by Servo and Gecko, I also had the opportunity to train more on the Rust programming language. Thanks to help from folks at Mozilla, we were able to get excellent progress on registered custom properties in Firefox in 2023, and this feature is expected to ship in Firefox 128!

As I said, I’ve since moved onto other tasks, which I’ll describe in subsequent blog posts in this series. Stay tuned for content-visibility, enabling interesting layout optimizations for web pages.

June 20, 2024 10:00 PM

June 06, 2024

Víctor Jáquez

GStreamer Vulkan Operation API

Two weeks ago the GStreamer Spring Hackfest took place in Thessaloniki, Greece. I had a great time. I hacked a bit on VA, Vulkan and my toy, planet-rs, but mostly I ate delicious Greek food ☻. A big thanks to our hosts: Vivia, Jordan and Sebastian!

First day of the GStreamer Spring Hackfest 2024 - https://floss.social/@gstreamer/112511912596084571

And now, writing this supposed small note, I recalled that I have in my to-do list an item to write a comment about GstVulkanOperation, an addition to GstVulkan API which helps with the synchronization of operations on frames, in order to enable Vulkan Video.

Originally, the GstVulkan API provided almost no synchronization operations besides fences, and that appeared to be enough for elements, since they do simple Vulkan operations. Nonetheless, as soon as we enabled the VK_VALIDATION_FEATURE_ENABLE_SYNCHRONIZATION_VALIDATION_EXT feature, which reports resource access conflicts due to missing or incorrect synchronization operations between actions, a sea of hazard operation warnings drowned us.

Hazard operations are sequences of read/write commands on a memory area, such as an image, that might be reordered, or even racy.

Why are those hazard operations reported by the Vulkan Validation Layer, if the programmer pushes the commands to execute in the queue in order? Why is explicit synchronization required? Because, as the great blog post from Hans-Kristian Arntzen, Yet another blog explaining Vulkan synchronization, (make sure you read it!) states:

[…] all commands in a queue execute out of order. Reordering may happen across command buffers and even vkQueueSubmits

In order to explain how synchronization is done in Vulkan, allow me to yank a couple definitions stated by the specification:

Commands are instructions that are recorded in a device’s queue. There are four types of commands: action, state, synchronization and indirection. Synchronization commands impose ordering constraints on action commands, by introducing explicit execution and memory dependencies.

Operation is an arbitrary amount of commands recorded in a device’s queue.

Since the driver can reorder commands (perhaps for better performance, dunno), we need to send explicit synchronization commands to the device’s queue to enforce a specific sequence of action commands.

Nevertheless, Vulkan doesn’t offer fine-grained dependencies between individual operations. Instead, dependencies are expressed as a relation of two elements, where each element is composed of the intersection of a scope and an operation. A scope is a concept in the specification that, in practical terms, can be either a pipeline stage (for execution dependencies), or both a pipeline stage and a memory access type (for memory dependencies).

First let’s review execution dependencies through pipeline stages:

Every command submitted to a device’s queue goes through a sequence of steps known as pipeline stages. This sequence of steps is one of the very few implicit ordering guarantees that Vulkan has. Draw calls, copy commands, compute dispatches, all go through certain sequential stages; how many of those stages they cover depends on the specific command and the current command buffer state.

In order to visualize an abstract execution dependency let’s imagine two compute operations and the first must happen before the second.

Operation 1
Sync command
Operation 2
  1. The programmer has to specify the Sync command in terms of two scopes (Scope 1 and Scope 2), in this execution dependency case, two pipeline stages.
  2. The driver generates an intersection between commands in Operation 1 and Scope 1 defined as Scoped operation 1. The intersection contains all the commands in Operation 1 that go through up to the pipeline stage defined in Scope 1. The same is done with Operation 2 and Scope 2 generating Scoped operation 2.
  3. Finally, we got an execution dependency that guarantees that Scoped operation 1 happens before Scoped operation 2.

Now let’s talk about memory dependencies:

First we need to understand the concepts of memory availability and visibility. Their formal definition in Vulkan are a bit hard to grasp since they come from the Vulkan memory model, which is intended to abstract all the ways of how hardware access memory. Perhaps we could say that availability is the operation that assures the existence of the required memory; while visibility is the operation that assures it’s possible to read/write the data in that memory area.

Memory dependencies are limited to an Operation 1 that has to be done before memory availability and an Operation 2 that has to be done after its visibility.

But again, there’s no fine-grained way to declare that memory dependency. Instead, there are memory access types, which are functions used by descriptor types, or functions for pipeline stage to access memory, and they are used as access scopes.

All in all, a synchronization command defining a memory dependency between two operations is composed of the intersection between each command and a pipeline stage, further intersected with the memory access type associated with the memory processed by those commands.

Now that the concepts are more or less explained, we can see them expressed in code. The synchronization command for execution and memory dependencies is defined by VkDependencyInfoKHR. It contains a set of barrier arrays, for memory, buffers and images. Barriers express the relation of dependency between two operations. For example, image barriers use VkImageMemoryBarrier2, which contains the mask for the source pipeline stage (to define Scoped operation 1) and the mask for the destination pipeline stage (to define Scoped operation 2); the mask for the source memory access type and the mask for the destination memory access type to define access scopes; and also a layout transition declaration.

A Vulkan synchronization example from Vulkan Documentation wiki:

vkCmdDraw(...);

... // First render pass teardown etc.

VkImageMemoryBarrier2KHR imageMemoryBarrier = {
    ...
    .srcStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT_KHR,
    .dstStageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT_KHR,
    .dstAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT_KHR,
    .oldLayout = VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL,
    .newLayout = VK_IMAGE_LAYOUT_ATTACHMENT_OPTIMAL
    /* .image and .subresourceRange should identify image subresource accessed */
};

VkDependencyInfoKHR dependencyInfo = {
    ...
    1,                   // imageMemoryBarrierCount
    &imageMemoryBarrier, // pImageMemoryBarriers
    ...
};

vkCmdPipelineBarrier2KHR(commandBuffer, &dependencyInfo);

... // Second render pass setup etc.

vkCmdDraw(...);

First draw samples a texture in the fragment shader. Second draw writes to that texture as a color attachment.

This is a Write-After-Read (WAR) hazard, which you would usually only need an execution dependency for - meaning you wouldn’t need to supply any memory barriers. In this case you still need a memory barrier to do a layout transition though, but you don’t need any access types in the src access mask. The layout transition itself is considered a write operation though, so you do need the destination access mask to be correct - or there would be a Write-After-Write (WAW) hazard between the layout transition and the color attachment write.

Other explicit synchronization mechanisms, along with barriers, are semaphores and fences. Semaphores are a synchronization primitive that can be used to insert a dependency between operations without notifying the host; while fences are a synchronization primitive that can be used to insert a dependency from a queue to the host. Semaphores and fences are expressed in the VkSubmitInfo2KHR structure.
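
For completeness, here is a hedged sketch of how a timeline semaphore wait/signal pair and a command buffer are expressed at submission time with VkSubmitInfo2KHR (the semaphore, command buffer and value variables are assumptions of the example; fields unrelated to synchronization are omitted):

VkSemaphoreSubmitInfoKHR wait_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO_KHR,
    .semaphore = timeline_semaphore,
    .value = wait_value,   /* timeline value to wait for */
    .stageMask = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT_KHR,
};
VkSemaphoreSubmitInfoKHR signal_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO_KHR,
    .semaphore = timeline_semaphore,
    .value = signal_value, /* timeline value to signal when done */
    .stageMask = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT_KHR,
};
VkCommandBufferSubmitInfoKHR cmd_info = {
    .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_SUBMIT_INFO_KHR,
    .commandBuffer = cmd_buf,
};
VkSubmitInfo2KHR submit = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO_2_KHR,
    .waitSemaphoreInfoCount = 1,
    .pWaitSemaphoreInfos = &wait_info,
    .commandBufferInfoCount = 1,
    .pCommandBufferInfos = &cmd_info,
    .signalSemaphoreInfoCount = 1,
    .pSignalSemaphoreInfos = &signal_info,
};
vkQueueSubmit2KHR(queue, 1, &submit, fence);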

As a preliminary conclusion, synchronization in Vulkan is hard and a helper API would be very helpful. Inspired by FFmpeg work done by Lynne, I added GstVulkanOperation object helper to GStreamer Vulkan API.

GstVulkanOperation object helper aims to represent an operation in the sense of the Vulkan specification mentioned before. It owns a command buffer as a public member, through which external commands can be pushed to the associated device’s queue.

It also provides a set of helper methods, several of which are used in the example below.

Internally, GstVulkanOperation contains two arrays:

  1. The array of dependency frames, which are the set of frames, each representing an operation, which will hold dependency relationships with other dependency frames.

    gst_vulkan_operation_add_dependency_frame appends frames to this array.

    When calling gst_vulkan_operation_end the frame’s barrier state for each frame in the array is updated.

    Also, each dependency frame creates a timeline semaphore, which will be signaled when a command, associated with the frame, is executed in the device’s queue.

  2. The array of barriers, which contains a list of synchronization commands. gst_vulkan_operation_add_frame_barrier fills and appends a VkImageMemoryBarrier2KHR associated with a frame, which can be in the array of dependency frames.

Here’s a generic view of video decoding example:

gst_vulkan_operation_begin (priv->exec, ...);

cmd_buf = priv->exec->cmd_buf->cmd;

gst_vulkan_operation_add_dependency_frame (exec, out,
    VK_PIPELINE_STAGE_2_VIDEO_DECODE_BIT_KHR,
    VK_PIPELINE_STAGE_2_VIDEO_DECODE_BIT_KHR);

/* assume an engine where out frames can be used for DPB frames, */
/* so a barrier for layout transition is required */
gst_vulkan_operation_add_frame_barrier (exec, out,
    VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT,
    VK_ACCESS_2_VIDEO_DECODE_WRITE_BIT_KHR,
    VK_IMAGE_LAYOUT_VIDEO_DECODE_DPB_KHR, NULL);

for (i = 0; i < dpb_size; i++) {
  gst_vulkan_operation_add_dependency_frame (exec, dpb_frame,
      VK_PIPELINE_STAGE_2_VIDEO_DECODE_BIT_KHR,
      VK_PIPELINE_STAGE_2_VIDEO_DECODE_BIT_KHR);
}

barriers = gst_vulkan_operation_retrieve_image_barriers (exec);
vkCmdPipelineBarrier2 (cmd_buf, &(VkDependencyInfo) {
    ...
    .pImageMemoryBarriers = barriers->data,
    .imageMemoryBarrierCount = barriers->len,
  });
g_array_unref (barriers);

vkCmdBeginVideoCodingKHR (cmd_buf, &decode_start);
vkCmdDecodeVideoKHR (cmd_buf, &decode_info);
vkCmdEndVideoCodingKHR (cmd_buf, &decode_end);

gst_vulkan_operation_end (exec, ...);

Here, just one memory barrier is required for the memory layout transition, but semaphores are required to signal when an output frame and its DPB frames are processed, so that, later, the output frame can be used as a DPB frame. Otherwise, the output frame might not be fully reconstructed when it’s used as DPB for the next output frame, generating only noise.

And that’s all. Thank you.

June 06, 2024 12:00 AM

June 05, 2024

Alberto Garcia

More ways to install software in SteamOS: Distrobox and Nix

Introduction

In my previous post I talked about how to use systemd-sysext to add software to the Steam Deck without modifying the root filesystem. In this post I will give a brief overview of two additional methods.

Distrobox

distrobox is a tool that uses containers to create a mutable environment on top of your OS.

Distrobox running in SteamOS

With distrobox you can open a terminal with your favorite Linux distro inside, with full access to the package manager and the ability to install additional software. Containers created by distrobox are integrated with the system so apps running inside have normal access to the user’s home directory and the Wayland/X11 session.

Since these containers are not stored in the root filesystem they can survive an OS update and continue to work fine. For this reason they are particularly suited to systems with an immutable root filesystem such as Silverblue, Endless OS or SteamOS.

Starting from SteamOS 3.5 the system comes with distrobox (and podman) preinstalled and it can be used right out of the box without having to do any previous setup.

For example, in order to create a Debian bookworm container simply open a terminal and run this:

$ distrobox create -i debian:bookworm debbox

Here debian:bookworm is the image that this container is created from (debian is the name and bookworm is the tag, see the list of supported tags here) and debbox is the name that is given to this new container.

Once the container is created you can enter it:

$ distrobox enter debbox

Or from the ‘Debian’ entry in the desktop menu -> Lost & Found.

Once inside the container you can run your Debian commands normally:

$ sudo apt update
$ sudo apt install vim-gtk3

Nix

Nix is a package manager for Linux and other Unix-like systems. It has the property that it can be installed alongside the official package manager of any distribution, allowing the user to add software without affecting the rest of the system.

Nix running in SteamOS

Nix installs everything under the /nix directory, and packages are made available to the user through a new entry in the PATH and a ~/.nix-profile symlink stored in the home directory.

Nix is many more things, including the basis of the NixOS operating system. Explaining Nix in more detail is beyond the scope of this blog post, but for SteamOS users these are perhaps its most interesting properties:

  • Nix is self-contained: all packages and their dependencies are installed under /nix.
  • Unlike software installed with pacman, Nix survives OS updates.
  • Unlike podman / distrobox, Nix does not create any containers. All packages have normal access to the rest of the system, just like native SteamOS packages.
  • Nix has a very large collection of packages, here is a search engine: https://search.nixos.org/packages

The only thing that Nix needs from SteamOS is help to set up the /nix directory so its contents are not stored in the root filesystem. This is already happening starting from SteamOS 3.5 so you can install Nix right away in single-user mode:

$ sudo chown deck:deck /nix
$ wget https://nixos.org/nix/install
$ sh ./install --no-daemon

This installs Nix and adds a line to ~/.bash_profile to set up the necessary environment variables. After that you can log in again and start using it. Here’s a very simple example (refer to the official documentation for more details):

# Install and run Midnight Commander
$ nix-env -iA nixpkgs.mc
$ mc

# List installed packages
$ nix-env -q
mc-4.8.31
nix-2.21.1

# Uninstall Midnight Commander
$ nix-env -e mc-4.8.31

What we have seen so far is how to install Nix in single-user mode, which is the simplest one and probably good enough for a single-user machine like the Steam Deck. The Nix project however recommends a multi-user installation, see here for the reasons.

Unfortunately the official multi-user installer does not work out of the box on the Steam Deck yet, but if you want to go the multi-user way you can use the Determinate Systems installer: https://github.com/DeterminateSystems/nix-installer

Conclusion

Distrobox and Nix are useful tools that give SteamOS users the ability to install additional software without having to modify the base operating system.

While for graphical applications the recommended way to install third-party software is still Flatpak, Distrobox and Nix give the user additional flexibility and are particularly useful for installing command-line utilities and other system tools.

by berto at June 05, 2024 03:53 PM

May 27, 2024

Andy Wingo

cps in hoot

Good morning good morning! Today I have another article on the Hoot Scheme-to-Wasm compiler, this time on Hoot’s use of the continuation-passing-style (CPS) transformation.

calls calls calls

So, just a bit of context to start out: Hoot is a Guile, Guile is a Scheme, Scheme is a Lisp, one with “proper tail calls”: function calls are either in tail position, syntactically, in which case they are tail calls, or they are not in tail position, in which case they are non-tail calls. A non-tail call suspends the calling function, putting the rest of it (the continuation) on some sort of stack, and will resume when the callee returns. Because non-tail calls push their continuation on a stack, we can call them push calls.

(define (f)
  ;; A push call to g, binding its first return value.
  (define x (g))
  ;; A tail call to h.
  (h x))

Usually the problem in implementing Scheme on other language run-times comes in tail calls, but WebAssembly supports them natively (except on JSC / Safari; should be coming at some point though). Hoot’s problem is the reverse: how to implement push calls?

The issue might seem trivial but it is not. Let me illustrate briefly by describing what Guile does natively (not compiled to WebAssembly). Firstly, note that I am discussing residual push calls, by which I mean to say that the optimizer might remove a push call in the source program via inlining: we are looking at those push calls that survive the optimizer. Secondly, note that native Guile manages its own stack instead of using the stack given to it by the OS; this allows for push-call recursion without arbitrary limits. It also lets Guile capture stack slices and rewind them, which is the fundamental building block we use to implement exception handling, Fibers and other forms of lightweight concurrency.
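For concreteness, this is roughly what that building block looks like in native Guile, using its standard delimited-continuation API (call-with-prompt, abort-to-prompt): the slice of the stack between the prompt and the abort is captured as a first-class value and can be rewound later.

(define tag (make-prompt-tag))

(call-with-prompt tag
  (lambda ()
    ;; Everything between the prompt and this abort is captured...
    (+ 1 (abort-to-prompt tag)))
  (lambda (k)
    ;; ...as the continuation k, which we can rewind whenever we like.
    (k 41)))
;; ⇒ 42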

The straightforward function call will have an artificially limited total recursion depth in most WebAssembly implementations, meaning that many idiomatic uses of Guile will throw exceptions. Unpleasant, but perhaps we could stomach this tradeoff. The greater challenge is how to slice the stack. That I am aware of, there are three possible implementation strategies.

generic slicing

One possibility is that the platform provides a generic, powerful stack-capture primitive, which is what Guile does. The good news is that one day, the WebAssembly stack-switching proposal should provide this too. And in the meantime, the so-called JS Promise Integration (JSPI) proposal gets close: if you enter Wasm from JS via a function marked as async, and you call out to JavaScript to a function marked as async (i.e. returning a promise), then on that nested Wasm-to-JS call, the engine will suspend the continuation and resume it only when the returned promise settles (i.e. completes with a value or an exception). Each entry from JS to Wasm via an async function allocates a fresh stack, so I understand you can have multiple pending promises, and thus multiple wasm coroutines in progress. It gets a little gnarly if you want to control when you wait, for example if you might want to wait on multiple promises; in that case you might not actually mark promise-returning functions as async, and instead import an async-marked async function waitFor(p) { return await p} or so, allowing you to use Promise.race and friends. The main problem though is that JSPI is only for JavaScript. Also, its stack sizes are even smaller than the default stack size.

instrumented slicing

So much for generic solutions. There is another option, to still use push calls from the target machine (WebAssembly), but to transform each function to allow it to suspend and resume. This is what I think of as Joe Marshall’s stack trick (also see §4.2 of the associated paper). The idea is that although there is no primitive to read the whole stack, each frame can access its own state. If you insert a try/catch around each push call, the catch handler can access local state for activations of that function. You can slice a stack by throwing a SaveContinuation exception, in which each frame’s catch handler saves its state and re-throws. And if we want to avoid exceptions, we can use checked returns as Asyncify does.
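To make the shape of the trick a little more concrete, here is a minimal sketch in generic Scheme; slice-requested?, record-frame!, and the frame tag are hypothetical helpers for illustration, not code from Hoot or from Marshall’s implementation.

(define (f)
  (define x
    (guard (exn ((slice-requested? exn)
                 ;; Save whatever of this frame's state is live after the
                 ;; push call (nothing, in this toy example), then keep
                 ;; unwinding so the frames above us can do the same.
                 (record-frame! 'f-after-g '())
                 (raise exn)))
      (g)))
  (h x))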

I never understood, though, how you resume a frame. The Generalized Stack Inspection paper would seem to indicate that you need the transformation to introduce a function to run “the rest of the frame” at each push call, which becomes the Invoke virtual method on the reified frame object. To avoid code duplication you would have to make normal execution flow run these Invoke snippets as well, and that might undo much of the advantages. I understand the implementation that Joe Marshall was working on was an interpreter, though, which bounds the number of sites needing such a transformation.

cps transformation

The third option is a continuation-passing-style transformation. A CPS transform results in a program whose procedures “return” by tail-calling their “continuations”, which themselves are procedures. Taking our previous example, a naïve CPS transformation would reify the following program:

(define (f' k)
  (g' (lambda (x) (h' k x))))

Here f' (“f-prime”) receives its continuation as an argument. We call g', for whose continuation argument we pass a closure. That closure is the return continuation of g, binding a name to its result and then tail-calling h with respect to f. We know their continuations are the same because it is the same binding, k.

Unfortunately we can’t really slice arbitrary ranges of a stack with the naïve CPS transformation: we can only capture the entire continuation, and can’t really inspect its structure. There is also no way to compose a captured continuation with the current continuation. And, in a naïve transformation, we would constantly be allocating these continuation closures on the heap; a push call effectively pushes a frame onto the heap as a closure, as we did above for g'.

There is also the question of when to perform the CPS transform; most optimizing compilers would like a large first-order graph to work on, which is out of step with the way CPS transformation breaks functions into many parts. Still, there is a nugget of wisdom here. What if we preserve the conventional compiler IR for most of the pipeline, and only perform the CPS transformation at the end? In that way we can have nice SSA-style optimizations. And, for return continuations of push calls, what if instead of allocating a closure, we save the continuation data on an explicit stack. As Andrew Kennedy notes, closures introduced by the CPS transform follow a stack discipline, so this seems promising; we would have:

(define (f'' k)
  (push! k)
  (push! h'')
  (g'' (lambda (x)
         (define h'' (pop!))
         (define k (pop!))
         (h'' k x))))

The explicit stack allows for generic slicing, which makes it a win for implementing delimited continuations.

hoot and cps

Hoot takes the CPS transformation approach with stack-allocated return closures. In fact, Hoot goes a little farther, too far probably:

(define (f''')
  (push! k)
  (push! h''')
  (push! (lambda (x)
           (define h''' (pop!))
           (define k (pop!))
           (h''' k x)))
  (g'''))

Here instead of passing the continuation as an argument, we pass it on the stack of saved values. Returning pops off from that stack; for example, (lambda () 42) would transform as (lambda () ((pop!) 42)). But some day I should go back and fix it to pass the continuation as an argument, to avoid excess stack traffic for leaf function calls.

There are some gnarly details though, which I know you are here for!

splits

For our function f, we had to break it into two pieces: the part before the push-call to g and the part after. If we had two successive push-calls, we would instead split into three parts. In general, each push-call introduces a split; let us use the term tails for the components produced by a split. (You could also call them continuations.) How many tails will a function have? Well, one for the entry, one for each push call, and one any time control-flow merges between two tails. This is a fixpoint problem, given that the input IR is a graph. (There is also some special logic for call-with-prompt but that is too much detail for even this post.)

where to save the variables

Guile is a dynamically-typed language, having a uniform SCM representation for every value. However in the compiler and run-time we can often unbox some values, generally as u64/s64/f64 values, but also raw pointers of some specific types, some GC-managed and some not. In native Guile, we can just splat all of these data members into 64-bit stack slots and rely on the compiler to emit stack maps to determine whether a given slot is a double or a tagged heap object reference or what. In WebAssembly though there is no sum type, and no place we can put either a u64 or a (ref eq) value. So we have not one stack but three (!) stacks: one for numeric values, implemented using a Wasm memory; one for (ref eq) values, using a table; and one for return continuations, because the func type hierarchy is disjoint from eq. It’s.... it’s gross? It’s gross.

what variables to save

Before a push-call, you save any local variables that will be live after the call. This is also a flow analysis problem. You can leave off constants, and instead reify them anew in the tail continuation.
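As a schematic illustration of the constants point, using the same save!/pop! notation as the transform output shown just below (names invented; this is not Hoot’s actual output):

(define (p x)
  (define c 42)        ; a constant: no need to save it across the call
  (define y (f x))     ; push call; x is dead afterwards, so nothing to save
  (+ y c))

;; transforms, schematically, to:

(define (p0 x)
  (save! p1)
  (f x))

(define (p1 y)
  ;; 42 is reified anew here rather than having been saved and restored.
  ((pop!) (+ y 42)))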

I realized, though, that we have some pessimality related to stacked continuations. Consider:

(define (q x)
  (define y (f))
  (define z (f))
  (+ x y z))

Hoot’s CPS transform produces something like:

(define (q0 x)
  (save! x)
  (save! q1)
  (f))

(define (q1 y)
  (restore! x)
  (save! x)
  (save! y)
  (save! q2)
  (f))

(define (q2 z)
  (restore! x)
  (restore! y)
  ((pop!) (+ x y z)))

So q0 saved x, fine, indeed we need it later. But q1 doesn’t actually use x: it restores it uselessly, only to save it again on q2’s behalf. Really we should be applying a stack discipline for saved data within a function. Given that the source IR is a graph, this means another flow analysis problem, one that I haven’t thought about how to solve yet. I am not even sure if there is a solution in the literature, given that the SSA-like flow graphs plus tail calls / CPS is a somewhat niche combination.
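For illustration, the output we would like a stack-disciplined transform to produce for q1 might look something like this, leaving x parked on the save stack until q2 needs it (hypothetical output; as noted, Hoot doesn’t do this yet, and restore! is read as naming which saved variable to bring back, as in the example above):

(define (q1 y)
  ;; x stays wherever q0 left it; no restore!/save! churn.
  (save! y)
  (save! q2)
  (f))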

calling conventions

The continuations introduced by CPS transformation have associated calling conventions: return continuations may have the generic varargs type, or the compiler may have concluded they have a fixed arity that doesn’t need checking. In any case, for a return, you call the return continuation with the returned values, and the return point then restores any live-in variables that were previously saved. But for a merge between tails, you can arrange to take the live-in variables directly as parameters; it is a direct call to a known continuation, rather than an indirect call to an unknown call site.
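A schematic example of the difference, in the same save!/pop! notation (the names and the exact split are illustrative, not Hoot’s output): kf and kg are return continuations, reached indirectly by being popped off the stack, while join is a merge tail reached by a direct call that takes its live-in y as a parameter.

(define (pick c)
  (define y (if c (+ (f) 1) (* (g) 2)))
  (- y 3))

;; transforms, schematically, to:

(define (pick0 c)
  (if c
      (begin (save! kf) (f))
      (begin (save! kg) (g))))

(define (kf a) (join (+ a 1)))   ; return tail for the call to f
(define (kg b) (join (* b 2)))   ; return tail for the call to g

(define (join y)                 ; merge tail: direct call, y as a parameter
  ((pop!) (- y 3)))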

cps soup?

Guile’s intermediate representation is called CPS soup, and you might wonder what relationship that CPS has to this CPS. The answer is not much. The continuations in CPS soup are first-order; a term in one function cannot continue to a continuation in another function. (Inlining and contification can merge graphs from different functions, but the principle is the same.)

It might help to explain that it is the same relationship as it would be if Guile represented programs using SSA: the Hoot CPS transform runs at the back-end of Guile’s compilation pipeline, where closure representations have already been made explicit. The IR is still direct-style, just that syntactically speaking, every call in a transformed program is a tail call. We had to introduce save and restore primitives to implement the saved variable stack, and some other tweaks, but generally speaking, the Hoot CPS transform ensures the run-time all-tail-calls property rather than altering the compile-time language; a transformed program is still CPS soup.

fin

Did we actually make the right call in going for a CPS transformation?

I don’t have good performance numbers at the moment, but from what I can see, the CPS transformation can impose significant overhead, even 10x penalties in some cases. But some results are quite good, improving over native Guile, so I can’t be categorical.

But really the question is, is the performance acceptable for the functionality, and there I think the answer is more clear: we have a port of Fibers that I am sure Spritely colleagues will be writing more about soon, we have good integration with JavaScript promises while not relying on JSPI or Asyncify or anything else, and we haven’t had to compromise in significant ways regarding the source language. So, for now, I am satisfied, and looking forward to experimenting with the stack slicing proposal as it becomes available.

Until next time, happy hooting!

by Andy Wingo at May 27, 2024 12:36 PM

May 24, 2024

Andy Wingo

hoot's wasm toolkit

Good morning! Today we continue our dive into the Hoot Scheme-to-WebAssembly compiler. Instead of talking about Scheme, let’s focus on WebAssembly, specifically the set of tools that we have built in Hoot to wrangle Wasm. I am peddling a thesis: if you compile to Wasm, probably you should write a low-level Wasm toolchain as well.

(Incidentally, some of this material was taken from a presentation I gave to the Wasm standardization organization back in October, which I think I haven’t shared yet in this space, so if you want some more context, have at it.)

naming things

Compilers are all about names: definitions of globals, types, local variables, and so on. An intermediate representation in a compiler is a graph of definitions and uses in which the edges are names, and the set of possible names is generally unbounded; compilers make more names when they see fit, for example when copying a subgraph via inlining, and remove names if they determine that a control or data-flow edge is not necessary. Having an unlimited set of names facilitates the graph transformation work that is the essence of a compiler.

Machines, though, generally deal with addresses, not names; one of the jobs of the compiler back-end is to tabulate the various names in a compilation unit, assigning them to addresses, for example when laying out an ELF binary. Some uses may refer to names from outside the current compilation unit, as when you use a function from the C library. The linker intervenes at the back-end to splice in definitions for dangling uses and applies the final assignment of names to addresses.

When targeting Wasm, consider what kinds of graph transformations you would like to make. You would probably like for the compiler to emit calls to functions from a low-level run-time library written in wasm. Those functions are probably going to pull in some additional definitions, such as globals, types, exception tags, and so on. Then once you have your full graph, you might want to lower it, somehow: for example, you choose to use the stringref string representation, but browsers don’t currently support it; you run a post-pass to lower to UTF-8 arrays, but then all your strings are not constant, meaning they can’t be used as global initializers; so you run another post-pass to initialize globals in order from the start function. You might want to make other global optimizations as well, for example to turn references to named locals into unnamed stack operands (not yet working :).

Anyway what I am getting at is that you need a representation for Wasm in your compiler, and that representation needs to be fairly complete. At the very minimum, you need a facility to transform that in-memory representation to the standard WebAssembly text format, which allows you to use a third-party assembler and linker such as Binaryen’s wasm-opt. But since you have to have the in-memory representation for your own back-end purposes, probably you also implement the names-to-addresses mapping that will allow you to output binary WebAssembly also. Also it could be that Binaryen doesn’t support something you want to do; for example Hoot uses block parameters, which are supported fine in browsers but not in Binaryen.

(I exaggerate a little; Binaryen is a more reasonable choice now than it was before the GC proposal was stabilised. But it has been useful to be able to control Hoot’s output, for example as the exception-handling proposal has evolved.)
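As for the names-to-addresses mapping mentioned above, at its core it is nothing exotic: assign each definition its index in declaration order, as the binary format requires. A minimal sketch in Scheme, with an invented representation that is nothing like Hoot’s actual one:

(define (resolve-names names)
  ;; Map each name to its index in declaration order.
  (let loop ((names names) (i 0) (table '()))
    (if (null? names)
        (reverse table)
        (loop (cdr names) (+ i 1) (cons (cons (car names) i) table)))))

(resolve-names '($main $add $mul))
;; ⇒ (($main . 0) ($add . 1) ($mul . 2))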

one thing leads to another

Once you have a textual and binary writer, and an in-memory representation, perhaps you want to be able to read binaries as well; and perhaps you want to be able to read text. Reading the text format is a little annoying, but I had implemented it already in JavaScript a few years ago; and porting it to Scheme was a no-brainer, allowing me to easily author the run-time Wasm library as text.

And so now you have the beginnings of a full toolchain, built just out of necessity: reading, writing, in-memory construction and transformation. But how are you going to test the output? Are you going to require a browser? That’s gross. Node? Sure, we have to check against production Wasm engines, and that’s probably the easiest path to take; still, would be nice if this were optional. Wasmtime? But that doesn’t do GC.

No, of course not, you are a dirty little compilers developer, you are just going to implement a little wasm interpreter, aren’t you. Of course you are. That way you can build nice debugging tools to help you understand when things go wrong. Hoot’s interpreter doesn’t pretend to be high-performance—it is not—but it is simple and it just works. Massive kudos to Spritely hacker David Thompson for implementing this. I think implementing a Wasm VM also had the pleasant side effect that David is now a Wasm expert; implementation is the best way to learn.

Finally, one more benefit of having a Wasm toolchain as part of the compiler: %inline-wasm. In my example from last time, I had this snippet that makes a new bytevector:

(%inline-wasm
 '(func (param $len i32) (param $init i32)
    (result (ref eq))
    (struct.new
     $mutable-bytevector
     (i32.const 0)
     (array.new $raw-bytevector
                (local.get $init)
                (local.get $len))))
 len init)

%inline-wasm takes a literal as its first argument, which should parse as a Wasm function. Parsing guarantees that the wasm is syntactically valid, and allows the arity of the wasm to become apparent: we just read off the function’s type. Knowing the number of parameters and results is one thing, but we can do better, in that we also know their type, which we use for intentional types, requiring in this case that the parameters be exact integers which get wrapped to the signed i32 range. The resulting term is spliced into the CPS graph, can be analyzed for its side effects, and ultimately when written to the binary we replace each local reference in the Wasm with a reference of the appropriate local variable. All this is possible because we have the tools to work on Wasm itself.

fin

Hoot’s Wasm toolchain is about 10K lines of code, and is fairly complete. I think it pays off for Hoot. If you are building a compiler targeting Wasm, consider budgeting for a 10K SLOC Wasm toolchain; you won’t regret it.

Next time, an article on Hoot’s use of CPS. Until then, happy hacking!

by Andy Wingo at May 24, 2024 10:37 AM

Víctor Jáquez

GStreamer Hackfest 2024

The last few weeks were a bit hectic. First, with a couple of friends, I biked through the southwest of the Netherlands for almost a week. The week after that (last week) I attended the 2024 Display Next Hackfest.

This week was Igalia’s Assembly meetings, and next week, along with other colleagues, I’ll be in Thessaloniki for the GStreamer Spring Hackfest.

I’m happy to meet friends from the GStreamer community again and to talk and move things forward related to Vulkan, VA-API, KMS, video codecs, etc.

May 24, 2024 12:00 AM

May 23, 2024

Patrick Griffis

Introducing the WebKit Container SDK

Developing WebKitGTK and WPE has always had challenges, such as the number of dependencies or its fairly complex C++ codebase, which not all compiler versions handle well. To help with this we’ve made a new SDK to make it easier.

Current Solutions

There have always been multiple ways to build WebKit and its dependencies on your host, however this was never a great developer experience. Only very specific hosts could be “supported”, you often had to build a large number of dependencies, and the end result wasn’t very reproducible for others.

The current solution used by default is a Flatpak-based one. This was a big improvement for ease of use and excellent for reproducibility, but it introduced many challenges for development work. As it has a strict sandbox and provides read-only runtimes, it was difficult to use complex tooling/IDEs or develop third-party libraries in it.

The new SDK tries to take a middle ground between those two alternatives, isolating itself from the host to be somewhat reproducible, yet being a mutable environment flexible enough for a wide range of tools and workflows.

The WebKit Container SDK

At its core it is an Ubuntu OCI image with all of the dependencies and tooling needed to work on WebKit. On top of this we added some scripts to run/manage these containers with podman and aid in developing inside of the container. Its intention is to be as simple as possible and not change traditional development workflows.

You can find the SDK and follow the quickstart guide on our GitHub: https://github.com/Igalia/webkit-container-sdk

The main requirement is that this only works on Linux with podman 4.0+ installed, for example Ubuntu 23.10+.

In the most simple case, once you clone https://github.com/Igalia/webkit-container-sdk.git, using the SDK can be a few commands:

source /your/path/to/webkit-container-sdk/register-sdk-on-host.sh
wkdev-create --create-home
wkdev-enter

From there you can use WebKit’s build scripts (./Tools/Scripts/build-webkit --gtk) or CMake. As mentioned before it is an Ubuntu installation so you can easily install your favorite tools directly like VSCode. We even provide a wkdev-setup-vscode script to automate that.

Advanced Usage

Disposability

A workflow that some developers may not be familiar with is making use of entirely disposable development environments. Since these are isolated containers you can easily make two. This allows you to do work in parallel that would otherwise interfere, without worrying about it, and also makes it easy to get back to a known good state:

wkdev-create --name=playground1
wkdev-create --name=playground2

podman rm playground1 # You would stop first if running.
wkdev-enter --name=playground2

Working on Dependencies

An important part of WebKit development is working on the dependencies of WebKit rather than WebKit itself, either for debugging or for new features. This can be difficult or error-prone with previous solutions. In order to make this easier we use a project called JHBuild, which isn’t new but works well with containers and is a simple solution for working on our core dependencies.

Here is an example workflow working on GLib:

wkdev-create --name=glib
wkdev-enter --name=glib

# This will clone glib main, build, and install it for us. 
jhbuild build glib

# At this point you could simply test if a bug was fixed in a different version of glib.
# We can also modify and debug glib directly. All of the projects are cloned into ~/checkout.
cd ~/checkout/glib

# Modify the source however you wish then install your new version.
jhbuild make

Remember that containers are isolated from each other, so you can even have two terminals open with different builds of glib. This can also be used to test projects like Epiphany against your build of WebKit if you install it into the JHBUILD_PREFIX.

To Be Continued

In the next blog post I’ll document how to use VSCode inside of the SDK for debugging and development.

May 23, 2024 04:00 AM

May 22, 2024

Frédéric Wang

Flygskam and o caminho da Corunha

Prolegomenon

Early next June, I’m traveling from Paris to A Coruña for the Web Engines Hackfest and other internal Igalia events. In recent years I’ve done it by train, as previously mentioned. Some colleagues at Igalia were curious about it, so I decided to write this blog post, investigating possible ways to do it from various European places. I hope this can also motivate more people at Igalia and beyond who are still hesitant to give up alternatives with a heavier carbon footprint.

In addition to various trip planners, I’ve also used this nice map from Wikipedia, which gives a good overview of high-speed rail in Europe:

I also sought advice from Nicolò Ribaudo, who is quite familiar with train traveling and provided useful recommendations. In particular, he mentioned the ÖBB trip planner, which seems quite efficient at combining trains from multiple operators.

I’ve focused on big European cities (with airports) that are close to Spain, but this is definitely not exhaustive. There is probably a lot more to discuss beyond trip planning, but hopefully this can be a good starting point.

Paris as a departure or connection

Based on my experience traveling from Paris, I found these direct trains between big cities:

  • Renfe offers several AVE and Alvia trains traveling every day between A Coruña and Madrid, with a duration of 3h30-4h 1. There are two important things to note:
    1. These local trains are only available for sale maybe 2 months in advance at best.
    2. These trains are connecting to Madrid Chamartín which is maybe 30 minutes away from Madrid Atocha by the Madrid metro.
  • Renfe also proposes even more options (say one every half hour during the day) between Barcelona and Madrid Atocha, with a duration of 2h30-3h 2.
  • The SNCF proposes two or three direct trains during the day between Paris Gare de Lyon and Barcelona, with a duration of 6h30-7h. If you are coming from Paris and want to take a train to Madrid, you will likely have to cross the station and pass some x-ray baggage scanner, so be sure to keep enough time for the connection.
  • The Eurostar offers several options during the day to connect Paris with cities below. They connect to Gare du Nord or Gare de l’Est, which are very close to each other but ~30 minutes away from Gare de Lyon by public transport.

Personally, I’m doing this one-day-and-half trip (inbound trip is similar):

  1. Take the train from Paris (9:42 AM) to Barcelona (4:33 PM).
  2. Keep enough time for the Barcelona Sants connection.
  3. Take a train from Barcelona to Madrid in the evening.
  4. Stay one night in Madrid.
  5. Take a train from Madrid to A Coruña in the morning.

From London, Amsterdam, Brussels, Berlin one could instead do a two-days trip:

  1. Travel to Paris in the morning 3.
  2. Keep enough time for the Paris connection.
  3. Take the train from Paris (2:42PM) to Barcelona (9:27PM).
  4. Stay one night in Barcelona.
  5. Travel from Barcelona to Madrid Atocha.
  6. Keep enough time for the Madrid Metro connection.
  7. Travel from Madrid Chamartín to A Coruña.

I also looked at the trip with the minimum number of connections to go to Barcelona from big cities in Switzerland, and a similar route is possible. See later for alternatives.

Finally, Nicolò mentioned that ÖBB recently started running a night train from Berlin to Paris, which you can probably use to do a similar trip as mine.

Estimate of CO2 emissions

In order to estimate CO2 emission for the trips suggested in the previous section, I compiled information from different sources:

There would be a lot to discuss about the methodology but the goal here is only to give a rough idea. Anyway, below is an estimate of kilograms of CO2 per passenger for some of the train trips previously mentioned:

Route               Eurostar  SNCF  Ecopassenger 4  Ecotree
Berlin ↔ Cologne    -         -     17-19           1-3
Cologne ↔ Paris     5.2       5.2   7.4             1-3
London ↔ Paris      2.4       2.4   1.7             1-3
Brussels ↔ Paris    1.6       1.8   1.8             1-2
Amsterdam ↔ Paris   2.6       2.9   9.3-9.5         1-3
Paris ↔ Barcelona   -         3.8   -               1-6
Barcelona ↔ Madrid  -         -     17-20           1-6
Madrid ↔ A Coruña   -         -     -               1-3

The best/worst cases are compiled into the following table and compared with a flight to A Coruña (with a connection in Madrid) as calculated by Ecotree. If we follow these data, a train from Berlin, London, Paris, Brussels or Amsterdam will at worst emit around 50 kg of CO2 per passenger, which represents a reduction of at least around 90% compared to using a plane:

Departure   Flight by Madrid (Ecotree)  Train 5   CO2 reduction
Berlin      448                         6-55.4    ≥87%
London      388                         4-32      ≥91%
Brussels    396                         4-30.8    ≥92%
Amsterdam   396                         4-38.5    ≥90%
Paris       368                         3-29      ≥92%
Madrid      244                         1-3       ≥98%

More cities and trains

Disclaimer: This section is essentially based on information provided by Nicolò and what I found on the internet.

In addition to the SNCF train previously mentioned, Renfe proposes two trains that are quite important for traveling from France to Spain:

  • One train between Lyon and Barcelona per day (duration ~5h)
  • One train between Marseille and Madrid per day (duration ~8h)

From Switzerland, you can pass by the south of France using more trains/connections to arrive in Barcelona faster than what was previously proposed. For example, taking a local train from Genève to Lyon followed by the Lyon to Barcelona train mentioned above can be done in around 7h. From Zurich, with connections at Genève and Lyon, it takes around 10h as opposed to 12h for a single connection in Paris.

From the Netherlands, Belgium or Germany it makes sense to consider the night trains from Paris to the border with Spain. Those trains do not have a fixed schedule, but vary depending on the weekday. Most of them arrive in Portbou and from there you can take a regional train to Barcelona. Some of them arrive at Latour-de-Carol, and from there it’s three hours on a regional train to Barcelona. In any case, you’ll be early enough at the border that it’s possible to then arrive in A Coruña in the afternoon or evening. Rarely, the night train arrives at the border on the west coast, and continuing from there to A Coruña with the regional trains that go along the northern coast might be a good experience.

From Belgium it’s also possible to take a TGV from Brussels to Lyon, and then from there take the train to Barcelona. This avoids a stressful connection in Paris, where you need to move between the two stations Gare du Nord and Gare de Lyon.

From Italy, the main trouble is dealing with connections. The Turin–Lyon high-speed railway construction may help in the future, but it’s not running yet. The alternatives are either to go along the coast through Genova-Ventimiglia-Nice, or to take the Eurocity from Milan to Switzerland and from there finally get to France.

From Portugal, which is so close geographically and culturally to Galicia, we could think there should be an easy way to travel to A Coruña. But apparently neither Comboios de Portugal nor Renfe provides any direct train:

  • From Porto, the ÖBB trip planner suggests a connection at Vigo-Guixar for a total duration of 6h30. As a comparison, the website of the Web Engines Hackfest indicates a 3-hour trip by car and a 5-hour trip by bus.
  • Between Lisbon and Porto, Comboios de Portugal seems to propose trains taking 2h30-3h. You can combine that with the other trains to do the trip to A Coruña with one night in Porto or Vigo.

Last but not least, A Coruña railway station is about a quarter of an hour’s walk from the Igalia office and three quarters of an hour’s walk from Palexco (the Web Engines Hackfest’s venue). This is more convenient than the A Coruña airport, which is around 10 km away from A Coruña.

Notes and references

  • Flygskam: anti-flying social movement started in Sweden, with the aim of reducing the environmental impact of aviation.
  • O Caminho de Santiago: The Way of St. James, a network of pilgrims’ ways leading to Santiago de Compostela (whose railway station you will likely stop at if you take a train to A Coruña).
  1. Incidentally, people travelling from far away are unlikely to find a direct flight to the A Coruña airport. In that case, using local trains from bigger airports like Madrid or Porto may be a better option. 

  2. Direct trains between A Coruña and Barcelona seem to be rarer and slower, and are no longer available as a night train. So a connection or night stay in Madrid seems the best option.

  3. Deutsche Bahn offers a lot of Berlin to Cologne trains per day with a duration of 4h-4h30, including early/late trains or night trains, that you can combine with the Cologne-Paris Eurostar. Deutsche Bahn also offers (non-direct) ICE trains to go from Paris to Berlin. 

  4. Ecopassenger gives information per train, so I’ve provided some lower/upper bounds based on different trains.

  5. Based on the previous data from Eurostar, SNCF, Ecopassenger, Ecotree, trying to find the lowest/highest sum for each individual segment. 

May 22, 2024 10:00 PM

Andy Wingo

growing a bootie

Following on last week’s egregious discussion of the Hoot Scheme-to-WebAssembly compiler bootie, today I would like to examine another axis of boot, which is a kind of rebased branch of history: not the hack as it happened, but the logic inside the hack, the structure of the built thing, the history as it might have been. Instead of describing the layers of shims and props that we used while discovering what we were building, let’s look at how we would build Hoot again, if we had to.

I think many readers of this blog will have seen Growing a Language, a talk / performance art piece in which Guy L. Steele—I once mentioned to him that Guy L. was one of the back-justifications for the name Guile; he did not take it well—takes the set of monosyllabic words as primitives and builds up a tower of terms on top, bootstrapping a language as he goes. I just watched it again and I think it holds up, probably well enough to forgive the superfluous presence of the gender binary in the intro; ideas were different in the 1900s.

It is in the sense of that talk that I would like to look at growing a Hoot: how Hoot defines nouns and verbs in terms of smaller, more primitive terms: terms in terms of terms.

[Figure: dependency graph of the roughly 50 libraries shipped with Hoot, rooted in (hoot primitives), with the other (hoot ...) modules built on top, the R7RS (scheme ...) libraries layered above those, and (guile) and Fibers at the top.]
If you are reading this on the web, you should see above a graph of dependencies among the 50 or so libraries that are shipped as part of Hoot. (Somehow I doubt that a feed reader will plumb through the inline SVG, but who knows.) It’s a bit of a mess, but still I think it’s a useful illustration of a number of properties of how the Hoot language is grown from small to large. Click on any box to visit the source code for that module.

the root of the boot

Firstly, let us note that the graph is not a forest: it is a single tree. There is no module that does not depend (possibly indirectly) on (hoot primitives). This is because there are no capabilities that Hoot libraries can access without importing them, and the only way into the Hootosphere from outside is via the definitions in the primitives module.

So what are these definitions, you might ask? Well, these are the “well-known” bindings, for example + for which the compiler might have some special understanding, the sort of binding that gets translated to a primitive operation at the compiler IR level. They are used in careful ways by the modules that use (hoot primitives) to ensure that their uses are all open-coded by the compiler. (“Open coding” is inlining. But inlining to me implies that the whole implementation is inlined, with no slow-path callouts, whereas open coding implies to me that it’s the compiler that knows what the op does and may or may not inline the actual asm.)

But, (hoot primitives) also exposes some other definitions, for example define and let and lambda and all that. Scheme doesn’t have keywords in the sense that Python has def and with and such: there is no privileged way to associate a name with its meaning. It is in this sense that it is impossible to avoid (hoot primitives): the most simple (define x 42) depends on the lexical meaning of define, which is provided by the primitives module.

Syntax definitions are an expander construct; they are not present at run-time. Using a syntax definition causes the expander to invoke code, and the expander runs on the host system, which is Guile and not WebAssembly. So, syntax definitions belong to the host. This goes also for some first-order definitions such as syntax->datum and so on, which are only used in syntax expanders; these definitions are plumbed through (hoot primitives), but can only ever be used by macro definitions, which run on the meta-level.
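For example, a transformer like the following calls length and syntax->datum only at expand time, on the host; nothing of it survives into the residual Wasm. (This is a generic Scheme illustration, not code from Hoot.)

(define-syntax compile-time-length
  (lambda (stx)
    (syntax-case stx ()
      ((_ lst)
       ;; Runs in the expander, on host Guile: the macro's output is
       ;; just the literal length, here 3.
       (datum->syntax stx (length (syntax->datum #'lst)))))))

(compile-time-length (a b c))   ; ⇒ 3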

(Is this too heavy? Allow me to lighten the mood: when I was 22 or so and working in Namibia, I somehow got an advance copy of Notes from the Metalevel. I was working on algorithmic music synthesis, and my chief strategy was knocking hubris together with itself, as one does. I sent the author a bunch of uninvited corrections to his book. I think it was completely unwelcome! Anyway, moral of the story, at 22 you get a free pass to do whatever you want, and come to think of it, now that I am 44 I think I should get some kind of hubris loyalty award or something.)

powerful primitives

So, there are expand-time primitives and run-time primitives. The expander knows about expand-time primitives and the compiler knows about run-time primitives. One particularly powerful primitive is %inline-wasm, which takes an inline snippet of WebAssembly as an s-expression and applies it to a number of arguments passed at run-time. Consider make-bytevector:

(define* (make-bytevector len #:optional (init 0))
  (%inline-wasm
   '(func (param $len i32) (param $init i32)
      (result (ref eq))
      (struct.new
       $mutable-bytevector
       (i32.const 0)
       (array.new $raw-bytevector
                  (local.get $init)
                  (local.get $len))))
   len init))

We have an inline snippet of wasm that makes a $mutable-bytevector. It passes 0 as the hash field, meaning that the hashq of this value will be lazily initialized, and the contents are a new array of a given size and initial value. Inputs will be unboxed to the appropriate type (two i32s in this case), and likewise with outputs; here we produce the universal (ref eq) representation.
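Used from Scheme, the result behaves like the bytevector constructor you would expect:

(make-bytevector 4 7)   ; ⇒ #vu8(7 7 7 7)
(make-bytevector 3)     ; init defaults to 0, so ⇒ #vu8(0 0 0)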

The nice thing about %inline-wasm is that the compiler didn’t have to be taught about make-bytevector: this definition suffices, because %inline-wasm can access a number of lower-level capabilities.

dual denotations

But as we learned in my notes on whole-program compilation, any run-time definition is available at compile-time, if it is reachable from a syntax transformer. So this definition above isn’t quite sufficient; we can’t call make-bytevector as part of a procedural macro, which we might want to do. What we need instead is to provide one definition when residualizing wasm at run-time, and another when loading a module at expand-time.

In Hoot we do this with cond-expand, where we expand to %inline-wasm when targeting Hoot, and... what, precisely, at expand-time? Really we need to make a Guile bytevector, so in this sort of case, we end up having to include a run-time make-bytevector definition in the (hoot primitives) module. This happens wherever we end up using %inline-wasm.
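The shape is roughly the following sketch; the hoot feature test and the host-side fallback are my own illustration of the pattern, not a copy of Hoot’s (hoot primitives) source.

(cond-expand
  (hoot
   ;; Compiling to Wasm: open-code the constructor, as above.
   (define* (make-bytevector len #:optional (init 0))
     (%inline-wasm
      '(func (param $len i32) (param $init i32)
         (result (ref eq))
         (struct.new
          $mutable-bytevector
          (i32.const 0)
          (array.new $raw-bytevector
                     (local.get $init)
                     (local.get $len))))
      len init)))
  (else
   ;; Loaded at expand time on host Guile: the native constructor
   ;; already has the right behavior, so (hypothetically) just reuse it.
   (define make-bytevector (@ (rnrs bytevectors) make-bytevector))))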

building to guile

Returning to our graph, we see that there is a red-colored block for Hoot modules, a teal-colored layer on top for those modules that are defined by R7RS, a few oddballs, and then (guile) and Fibers built on top. The (guile) module provides a shim that implements Guile’s own default set of bindings, allowing Guile modules to be loaded on a Hoot system. (guile) is layered on top of the low-level Hoot libraries, and out of convenience, on top of the various R7RS libraries as well, because it was easier to remember what was where in R7RS than in our ad-hoc nest of Hoot internal libraries.

Having (guile) lets Guile hackers build on Hoot. It’s still incomplete but I think eventually it will be capital-G Good. Even for a library that needed more porting like Fibers (Hoot has no threads so much of the parallel concurrent ML implementation can be simplified, and we use an event loop from the Wasm run-time instead of an epoll-based scheduler), it was still pleasant to be able to use define-module and keyword arguments and all of that.
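As a tiny taste of the flavor this enables, the module form and keyword arguments below are plain Guile; whether every such feature is already covered by the shim is a separate question, and the names here are invented for illustration.

(define-module (example greeter)
  #:export (greet))

(define* (greet name #:key (greeting "hello"))
  (string-append greeting ", " name "!"))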

next layers

I mentioned that this tower of terms is incomplete, and so that is one of the next work items for Hoot: complete support for Guile’s run-time library. At that point we’d probably want to merge it into Guile, but that is another topic.

But let’s leave that for another day; until then, happy hacking!

by Andy Wingo at May 22, 2024 08:16 AM

May 21, 2024

Brian Kardell

Improving the WPT Dashboard

Improving the WPT Dashboard

Thoughts on things I'd like to see as part of the WPT dashboard.

In my last post I dug into the data behind wpt.fyi's Browser Specific Failures chart (below) and the site's general reporting capabilities.

I suggested that linking to data on specifically what failed (at least what is queryable) would probably be really helpful (perhaps some kind of "understanding this chart" link as well).

While these aren't part of the design today, I think this is mainly because the primary audience of that chart was originally mainly the vendors themselves. It was intended to allow for certain simple kinds of tracking, planning and prioritization. For example "Let's set a goal to not let the failure exceed such and such threshold" or "Let's aim to lower the failures by X this quarter". It wasn't critical to link to the tests because the audience knew how to interrogate the data - the purpose was just to get a quantifiable number you can easily report on.

But, now we see this chart shared a lot and it's pretty clear that people are curious so we should probably adjust it for the wider audience.

Additionally though, that's also only a single view of the data, and I'd like to argue that we could make some other improvements too.

Prioritization

BSF made the observation that if we can identify a test that fails in only 1 browser, then that browser's team can easily prioritize something that has significant impact. That browser is the boat anchor holding things back. Except, it's not quite that cut and dry in reality.

Real management of software projects is hard. I think that anyone who's worked on software projects can relate to this, at least a bit, if we take some time to consider all of the things that go into choosing how to apply our limited resources. Obviously, not all failures are equal - especially when we're talking about projects which are a quarter of a century old. The reality is that all of that decisioning and prioritization is happening independently across different organizations, with different views on the web, different budgets, different legacy challenges, etc.

That's where I think there are some things to learn from Interop.

What I learned from Interop Is...

If you think about it, Interop is about trying to achieve, thematically, basically the same thing as BSF: make more things "green across the board". But it is a very different thing than BSF.

I've really learned a lot from the process of helping organize Interop every year about why this takes so long to happen naturally. There are so many limits and signals and opinions. One of the things we do as part of the process is to take all of the submissions and independently order them in terms of what we think are their priorities. There are 6 organizations doing that: Apple, Bocoup, Google, Igalia, Microsoft and Mozilla. How many do you think chose the same #1? The answer is 0.

It really highlights how waiting for all of the stars to align by chance winds up often being a painfully slow process and full of problems.

However, a huge part of Interop is dedicated to dealing with the stuff BSF doesn't really consider - aligning agreement on:

  1. what features are most important
  2. which tests regarding those are valid/important
  3. are all the spec questions really answered?
  4. is this actually (hopefully) achievable in the next year?

In that, I believe it has been extremely successful in creating way more "green across the board" than anything else. I think this is true beyond even what is officially part of Interop, because we're all able to discuss and see where others are probably going to invest in work, since things that were important to them didn't make the cut.

In a way, each year is sort of like doing what we used to do with "CSS2" and "HTML4"... Creating a more focused discussion that is the goal floor, not the ceiling.

It's not enough... Sure, I believe this gives us much better results by helping alignment. I think this is obvious given how rapidly and smoothly we've found so much high-quality alignment in recent years. However, there's something I want to stress in all of this: Choosing what to prioritize is also inherently choosing what to collectively deprioritize. It is inevitable, because at the end of the day there is just too much.

The only real solution to this problem is wider investment in the platform and, ultimately, almost certainly, changing how we fund it.

Alignment vs Passing

Interop also showed us that a simple, individual pass/fail can be incomplete and misleading. If 3 browsers reach a point of passing 50% of measured tests, the number of tests that pass in all browsers might actually be 0, as illustrated in the table below...

[Table omitted: Chrome, Firefox and WebKit each pass 50% of the tests, but their passing sets don't overlap.]
Lots of tests pass, but not even one passes universally!

In fact, here's a real world example of exactly this kind of misleading view in a set of SVG tests. If we look at the numbers across the bottom:

  • chrome: 166 / 191
  • edge: 166 / 191
  • firefox: 175 / 191
  • safari: 132 / 191

It's not terrible if you're only looking at those numbers. But, if you scroll down through that table you'll see that there are ragged failures all over that. In fact, only 52 of 189 are "green across the board"!

We can only realistically solve this by having a more holistic view and working together. BSF is just the slice that is theoretically actionable individually, not everything that matters.

What about a focus on Universally Passing?

In the Interop project we track the difference above as its own data point: The Interop number, and we put it as a separate column in the test tables:

[Image: a table containing a column for each individual browser's scores on different features, and a column for the number that pass in all]
The interop column reports how many tests pass on all of the tracked browsers

Similarly, we track it over time:

[Image: a graph showing the scores of each browser over time, as well as an "interop line"]

Could we learn something from this? Wouldn't something like that be great to have in general?

For example, in the wpt.fyi tables? Now, it couldn't look just like that, because those numbers are all in percentages, and this only really works because the Interop project has a careful governance process for defining and agreeing on what the tests are. But it would be enough to add a column to the table in the same form: a count of how many tests pass in every engine.

That might help us uncover situations like the SVG one above and present opportunities, as Interop does, for us to collectively decide to try to address them.

Similarly, we could track it over time. Sort of the opposite of BSF: we want to see the simple number of subtests passing in browsers, and it should always be going up (even as new tests are added, no existing ones should stop passing - those are just more opportunities to go up). Further, the Universally Passing number ideally shouldn't drift significantly further away from that over time, or we're making less of the platform universal. That is, you could see, over time, when we are cooperating better and when we are not.

We do better when we are. In my mind, that's an explicit goal, and this would be a view into it.

May 21, 2024 04:00 AM

May 20, 2024

Frédéric Wang

Time travel debugging of WebKit with rr

Introduction

rr is a debugging tool for Linux that was originally developed by Mozilla for Firefox. It has long been adopted by Igalia and other web platform developers for Chromium and WebKit too. Back in 2019, there were breakout sessions on this topic at the Web Engines Hackfest and BlinkOn.

For WebKitGTK, the Flatpak SDK provides a copy of rr, but recently I was unable to use the instructions on trac.webkit.org. Fortunately, my colleague Adrián Pérez suggested using a direct build without flatpak or the bubblewrap sandbox, and that indeed solved my problem. I thought it might be interesting to share this information with others, so I decided to write this blog post.

Disclaimer: The build instructions below may be imperfect, will likely become outdated, and are in any case not a replacement for the official ones for WebKitGTK development. Use them at your own risk!

CMake configuration

The approach that worked for me was thus to perform a direct build from my system. I came up with the following configuration step:

cmake -S. -BWebKitBuild/Release \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_INSTALL_PREFIX=$HOME/WebKit/WebKitBuild/install/Release \
   -GNinja -DPORT=GTK -DENABLE_BUBBLEWRAP_SANDBOX=OFF \
   -DDEVELOPER_MODE=ON -DDEVELOPER_MODE_FATAL_WARNINGS=OFF \
   -DENABLE_TOOLS=ON -DENABLE_LAYOUT_TESTS=ON

where:

  • The -B option specifies the build directory, which is traditionally called WebKitBuild/ for the WebKit project.
  • CMAKE_BUILD_TYPE specifies the build type, e.g. optimized release builds (Release, corresponding to --release for the official script) or debug builds with assertions (Debug, corresponding to --debug) 1.
  • CMAKE_INSTALL_PREFIX specifies the installation directory, which I place inside WebKitBuild/install/ 2.
  • The -G option specifies the build system generator. I used Ninja, which is the default for the official script too.
  • -DPORT=GTK is for building WebKitGTK. I haven’t tested rr with other Linux ports.
  • -DENABLE_BUBBLEWRAP_SANDBOX=OFF was suggested by Adrián. The bubblewrap sandbox probably does not make sense without flatpak, so it should be safe to disable it anyway.
  • I extracted the other -D flags from the official script, trying to stay as close as possible to what it provides for WebKit development (being able to run layout tests, building the Tools/, ignoring fatal warnings, etc).

Needless to say, the advantage of using flatpak is that it automatically downloads and installs all the required dependencies. But if you do your own build, you need to figure out what they are and perform the setup manually. Generally, this is straightforward using your distribution’s package manager, but there can be some tricky exceptions 3.

While we are still at the configuration step, I believe it’s worth sharing two more tricks for WebKit developers:

  • You can use -DENABLE_SANITIZERS=address to produce Asan builds or builds with other sanitizers.
  • You can use -DCMAKE_CXX_FLAGS="-DENABLE_TREE_DEBUGGING" in release builds if you want to get access to the tree debugging functions (ShowRenderTree and the like). This flag is turned on by default for debug builds. (See the example right after this list.)
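
As a concrete illustration, here is roughly how one of those flags could be applied on top of the configuration shown earlier, by re-running CMake on the existing build directory (a minimal sketch; CMake keeps previously cached options, so only the new -D flag needs to be passed):

# reconfigure the existing Release build in place; the next build picks up the new flag
cmake -S. -BWebKitBuild/Release \
   -DCMAKE_CXX_FLAGS="-DENABLE_TREE_DEBUGGING"

The same pattern should work for -DENABLE_SANITIZERS=address, although sanitizer builds are usually kept in a separate build directory.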

Building and running WebKit

Once the configure step is successful, you can build and install WebKit using the following CMake command 2.

cmake --build WebKitBuild/Release --target install

When that operation completes, you should be able to run MiniBrowser with the following command:

LD_LIBRARY_PATH=WebKitBuild/install/Release/lib ./WebKitBuild/Release/bin/MiniBrowser

For WebKitTestRunner, some extra environment variables are necessary 2:

TEST_RUNNER_INJECTED_BUNDLE_FILENAME=$HOME/WebKit/WebKitBuild/Release/lib/libTestRunnerInjectedBundle.so LD_LIBRARY_PATH=WebKitBuild/install/Release/lib ./WebKitBuild/Release/bin/WebKitTestRunner filename.html

You can also use the official scripts, Tools/Scripts/run-minibrowser and Tools/Scripts/run-webkit-tests. They expect some particular paths, but a quick workaround is to use a symbolic link:

ln -s $HOME/WebKit/WebKitBuild $HOME/WebKit/WebKitBuild/GTK
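
With the symbolic link in place, invoking the scripts should look something like the following (a sketch, not taken from the official documentation; --gtk selects the WebKitGTK port, --release matches the build type configured above, and extra arguments are passed through):

# run MiniBrowser through the official wrapper script
Tools/Scripts/run-minibrowser --gtk --release https://www.igalia.com/
# run a subset of the layout tests (fast/css is just an example directory)
Tools/Scripts/run-webkit-tests --gtk --release fast/css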

Using rr for WebKit debugging

rr is generally easily installable from your distribution’s package manager. However, as stated on the project wiki page:

Support for the latest hardware and kernel features may require building rr from Github master.

Indeed, using the source has always worked best for me to avoid mysterious execution failures when starting the recording 4.
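
If you do need to build rr from source, the recipe looks roughly like the following (a rough sketch; refer to the rr project's own build documentation for the authoritative steps and the current dependency list):

# fetch and build rr from master
git clone https://github.com/rr-debugger/rr.git
mkdir rr/obj && cd rr/obj
cmake ..                  # rr uses CMake; add -DCMAKE_INSTALL_PREFIX=... for a local install
make -j"$(nproc)"
sudo make install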

If you are not familiar with rr, I strongly invite you to take a look at the overview on the project home page or at some of the references I mentioned in the introduction. In any case, the first step is to record a trace by passing the program and arguments to rr. For example, to record a trace for MiniBrowser:

LD_LIBRARY_PATH=WebKitBuild/install/Debug/lib rr ./WebKitBuild/Debug/bin/MiniBrowser https://www.igalia.com/

After the program exits, you can replay the recorded trace as many times as you want. For hard-to-reproduce bugs (e.g. non-deterministic issues or involving a lot of manual steps), that means you only need to be able to record and reproduce the bug once and then can just focus on debugging. You can even turn off your machine after hours of exhausting debugging, then continue the effort later when you have more time and energy! The trace is played in a deterministic way, always using the same timing and pointer addresses. You can use most gdb commands (to run the program, interrupt it, and inspect data), but the real power comes from new commands to perform reverse execution!

Before coming to that, let’s explain how to handle programs with multiple processes, which is the case for WebKit and modern browsers in general. After you recorded a trace, you can display the pids of all recorded processes using the rr ps command. For example, we can see in the following output that the MiniBrowser process (pid 24103) actually forked three child processes, including the Network Process (pid 24113) and the Web Process (24116):

PID     PPID    EXIT    CMD
24103   --      0       ./WebKitBuild/Debug/bin/MiniBrowser https://www.igalia.com/
24113   24103   -9      ./WebKitBuild/Debug/bin/WebKitNetworkProcess 7 12
24115   24103   1       (forked without exec)
24116   24103   -9      ./WebKitBuild/Debug/bin/WebKitWebProcess 15 15

Here is a small debugging session similar to the single-process example from Chromium Chronicle #13 5. We use the option -p 24116 to attach the debugger to the Web Process and -e to start debugging from where it exited:

rr replay -p 24116 -e
(rr) break RenderFlexibleBox::layoutBlock
(rr) rc # Run back to the last layout call
Thread 2 hit Breakpoint 1, WebCore::RenderFlexibleBox::layoutBlock (this=0x7f66699cc400, relayoutChildren=false) at /home/fred/src-obj/WebKit/Source/WebCore/rendering/RenderFlexibleBox.cpp:420
(rr) # Inspect anything you want here. To find the previous Layout call on this object:
(rr) cond 1 this == 0x7f66699cc400
(rr) rc
Thread 2 hit Breakpoint 1, WebCore::RenderFlexibleBox::layoutBlock (this=0x7f66699cc400, relayoutChildren=false) at /home/fred/src-obj/WebKit/Source/WebCore/rendering/RenderFlexibleBox.cpp:420
420     {
(rr) delete 1
(rr) watch -l m_style.m_nonInheritedFlags.effectiveDisplay # Or find the last time the effective display was changed
Thread 4 hit Hardware watchpoint 2: -location m_style.m_nonInheritedFlags.effectiveDisplay

Old value = 16
New value = 0
0x00007f6685234f39 in WebCore::RenderStyle::RenderStyle (this=0x7f66699cc4a8) at /home/fred/src-obj/WebKit/Source/WebCore/rendering/style/RenderStyle.cpp:176
176     RenderStyle::RenderStyle(RenderStyle&&) = default;

rc is an abbreviation for reverse-continue and continues execution backward. Similarly, you can use the reverse-next, reverse-step and reverse-finish commands, or their abbreviations. Notice that the watchpoint change is naturally reversed compared to normal execution: the old value (sixteen) is the one after initialization, while the new value (zero) is the one before initialization!

Restarting playback from a known point in time

rr also has a concept of “event” and associates a number to each event it records. They can be obtained by the when command, or printed to the standard output using the -M option. To elaborate a bit more, suppose you add the following printf in RenderFlexibleBox::layoutBlock:

@@ -423,6 +423,8 @@ void RenderFlexibleBox::layoutBlock(bool relayoutChildren, LayoutUnit)
     if (!relayoutChildren && simplifiedLayout())
         return;

+    printf("this=%p\n", this);
+

After building, recording and replaying again, the output should look like this:

$ rr -M replay -p 70285 -e # replay with the new PID of the web process.
...
[rr 70285 57408]this=0x7f742203fa00
[rr 70285 57423]this=0x7f742203fc80
[rr 70285 57425]this=0x7f7422040200
...

Each printed output is now annotated with two numbers in brackets: a PID and an event number. So in order to restart from when an interesting output happened (let’s say [rr 70285 57425]this=0x7f7422040200), you can now execute run 57425 from the debugging session, or equivalently:

rr replay -p 70285 -g 57425

Older traces and parallel debugging

Another interesting thing to know is that traces are stored in ~/.local/share/rr/ and you can always specify an older trace to the rr command e.g. rr ps ~/.local/share/rr/MiniBrowser-0. Be aware that the executable image must not change, but you can use rr pack to be able to run old traces after a rebuild, or even to copy traces to another machine.
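
For example, to keep the latest MiniBrowser trace usable after a rebuild, or to copy it to another machine, something like this should do (a sketch; rr pack copies the needed files into the trace directory so that it no longer depends on the original binaries on disk):

# make the trace self-contained, then replay it as usual
rr pack ~/.local/share/rr/MiniBrowser-0
rr replay ~/.local/share/rr/MiniBrowser-0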

To be honest, most of the time I’m just using the latest trace. However, one thing I’ve sometimes found useful is what I would call the “parallel debugging” technique. Basically, I record one trace for a testcase that exhibits the bug and another one for a very similar testcase (e.g. with one CSS property difference) that behaves correctly. Then I replay the two traces side by side, comparing them to understand where the issue comes from and what can be done to fix it.
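
Concretely, that just means replaying two named traces, one per terminal, and stepping through them in lockstep (a sketch; the directory names follow rr's usual <binary>-<n> naming):

# terminal 1: trace recorded for the failing testcase
rr replay ~/.local/share/rr/MiniBrowser-0
# terminal 2: trace recorded for the nearly identical passing testcase
rr replay ~/.local/share/rr/MiniBrowser-1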

The usage documentation also provides further tips, but this should be enough to get you started with time travel debugging in WebKit!

  1. RelWithDebInfo build type (which yields an optimized release build with debug symbols) might also be interesting to consider in some situations, e.g. debugging bugs that reproduce in release builds but not in debug builds. 

  2. Using an installation directory might not be necessary, but without that, I had trouble making the whole thing work properly (wrong libraries loaded or libraries not found). 

  3. In my case, I chose the easiest path to disable some features, namely -DUSE_JPEGXL=OFF, -DUSE_LIBBACKTRACE=OFF, and -DUSE_GSTREAMER_TRANSCODER=OFF

  4. Incidentally, you are likely to get an error saying that perf_event_paranoid is required to be at most 1, which you can force using sudo sysctl kernel.perf_event_paranoid=1

  5. The equivalent example would probably have been to watch for the previous style change with watch -l m_style, but this was exceeding my hardware watchpoint limit, so I narrowed it down to a smaller observation scope. 

May 20, 2024 10:00 PM

May 16, 2024

Andy Wingo

on hoot, on boot

I realized recently that I haven’t been writing much about the Hoot Scheme-to-WebAssembly compiler. Upon reflection, I have been too conscious of its limitations to give it verbal tribute, preferring to spend each marginal hour fixing bugs and filling in features rather than publicising progress.

In the last month or so, though, Hoot has gotten to a point that pleases me. Not to the point where I would say “accept no substitutes” by any means, but good already for some things, and worth writing about.

So let’s start today by talking about bootie. Boot, I mean! The boot, the boot, the boot of Hoot.

hoot boot: temporal tunnel

The first axis of boot is time. In the beginning, there was nary a toot, and now, through boot, there is Hoot.

The first boot of Hoot was on paper. Christine Lemmer-Webber had asked me, ages ago, what I thought Guile should do about the web. After thinking a bit, I concluded that it would be best to avoid compromises when building an in-browser Guile: if you have to pollute Guile to match what JavaScript offers, you might as well program in JavaScript. JS is cute of course, but Guile is a bit different in some interesting ways, the most important of which is control: delimited continuations, multiple values, tail calls, dynamic binding, threads, and all that. If Guile’s web bootie doesn’t pack all the funk in its trunk, probably it’s just junk.

So I wrote up a plan, something to which I attributed the name tailification. In retrospect, this is simply a specific flavor of a continuation-passing-style (CPS) transmutation, late in the compiler pipeline. I’ll elocute more in a future dispatch. I did end up writing the tailification pass back then; I could have continued to target JS, but it was sufficiently annoying and I didn’t prosecute. It sat around unused for a few years, until Christine’s irresistible charisma managed to conjure some resources for Hoot.

In the meantime, the GC extension for WebAssembly shipped (woot woot!), and to boot Hoot, I filled in the missing piece: a backend for Guile’s compiler that tailified and then translated primitive operations to snippets of WebAssembly.

It was, well, hirsute, but cute and it did compute, so we continued to boot. From this root we grew a small run-time library, written in raw WebAssembly, used for slow-paths for the various primitive operations that are part of Guile’s compiler back-end. We filled out Guile primcalls, in minute commits, growing the WebAssembly runtime library and toolchain as we went.

Eventually we started constituting facilities defined in terms of those primitives, via a Scheme prelude that was prepended to all programs, within a nested lexical environment. It was never our intention though to drown the user’s programs in a sea of predefined bindings, as if the ultimate program were but a vestigial inhabitant of the lexical lake—don’t dilute the newt!, we would often say [ed: we did not]— so eventually when the prelude became unmanageable, we finally figured out how to do whole-program compilation of a set of modules.

Then followed a long month in which I would uproot the loot from the boot: take each binding from the prelude and reattribute it into an appropriate module. User code could import all the modules that suit, as long as they were known to Hoot, but no others; it was only once we added the ability for users to programmatically constitute an environment from their modules that Hoot became a language implementation of any repute.

Which brings us to the work of the last month, about which I cannot be mute. When you have existing Guile code that you want to distribute via the web, Hoot required you to transmute its module definitions into the more precise R6RS syntax. Precise, meaning that R6RS modules are static, in a way that Guile modules, at least in absolute terms, are not: Guile programs can use first-class accessors on the module system to pull out bindings. This is yet another example of what I impute as the original sin of 1990s language development, that modules are just mutable hash maps. You see it in Python, for example: because you don’t know for sure to what values global names are bound, it is easy for any discussion of what a particular piece of code means to end in dispute.

The question is, though, are the semantics of name binding in a language fixed and absolute? Once your language is booted, are its aspects definitively attributed? I think some perfection, in the sense of becoming more perfect or more like the thing you should be, is something to salute. Anyway, in Guile it would be coherent with Scheme’s lexical binding heritage to restitute some certainty as to the meanings of names, at least in a default compilation mode. Lexical binding is, after all, the foundation of the Macro Writer’s Statute of Rights. Of course if you are making a build for development purposes, not to distribute, then you might prefer a build that marks all bindings as dynamic. Otherwise I think it’s reasonable to require the user to explicitly indicate which definitions are denotations, and which constitute locations.

Hoot therefore now includes an implementation of the static semantics of Guile’s define-module: it can load Guile modules directly, and as a tribute, it also has an implementation of the ambient (guile) module that constitutes the lexical soup of modules that aren’t #:pure. (I agree, it would be better if all modules were explicit about the language they are written in—their imported bindings and so on—but there is an existing corpus to accommodate; the point is moot.)

The astute reader (whom I salute!) will note that we have a full boot: Hoot is a Guile. Not an implementation to substitute the original, but more of an alternate route to the same destination. So, probably we should scoot the two implementations together, to knock their boots, so to speak, merging the offshoot Hoot into Guile itself.

But do I circumlocute: I can only plead a case of acute Hoot. Tomorrow, we elocute on a second axis of boot. Until then, happy compute!

by Andy Wingo at May 16, 2024 08:01 PM

May 15, 2024

Manuel A. Fernandez Montecelo

WiFi back-ends in SteamOS 3.6

As of NetworkManager v1.46.0 (and a very long while before that, since 1.12 released in 2018), there are two supported wifi.backend values: wpa_supplicant (the default) and iwd (experimental).

iwd is still considered experimental after all these years; see the list of issues with the iwd backend, and especially this one.

In SteamOS, however, the default is iwd. It works relatively well in most scenarios; and the main advantage is that, when waking up the device (think of Steam Deck returning from sleep, "suspended to RAM" or S3 sleeping state), it regains network connectivity much faster -- typically 1-2 seconds vs. 5+ for wpa_supplicant.

However, if for some reason iwd doesn't work properly in your case, you can give wpa_supplicant a try.

With steamos-wifi-set-backend command

In 3.6, the back-end can be changed via command.

3.6 has been available in Main channel since late March and in Beta/Preview for several days, for example 20240509.100.

If you haven't done this before: to be able to run it, you must either log in via SSH (not available out of the box), or switch to desktop-mode, open a terminal and type commands there.

Assuming that one wants to change from the default iwd to wpa_supplicant, the command is:

steamos-wifi-set-backend wpa_supplicant

The script tries to ensure that things are in place in terms of some systemd units being stopped and others brought up as appropriate. After a few seconds, if it's able to connect correctly to the WiFi network, the device should get connectivity (e.g. ping works). It is not necessary to enter the preferred networks and passwords again; NetworkManager feeds the back-ends with the necessary config and credentials.

It is possible to switch back and forth between back-ends by using alternatively iwd and wpa_supplicant as many times as desired, and without restarting the system.

The current back-end as per configuration can be obtained with:

steamos-wifi-set-backend --check

And finally, this is the output of changing to wpa_supplicant and back to iwd (i.e., the output that you should expect):

(deck@steamdeck ~)$ steamos-wifi-set-backend --check
iwd

(deck@steamdeck ~)$ steamos-wifi-set-backend wpa_supplicant
INFO: switching back-end to 'wpa_supplicant'
INFO: stopping old back-end service and restarting NetworkManager,
      networking will be disrupted (hopefully only momentary blip, max 10 seconds)...
INFO: restarting done
INFO: checking status of services ...
      (sleeping for 2 seconds to catch-up with state)
INFO: status OK

(deck@steamdeck ~)$ steamos-wifi-set-backend --check
wpa_supplicant

(deck@steamdeck ~)$ ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=56 time=16.4 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=56 time=24.8 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=56 time=16.5 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 16.370/19.195/24.755/3.931 ms

(deck@steamdeck ~)$ steamos-wifi-set-backend iwd
INFO: switching back-end to 'iwd'
INFO: stopping old back-end service and restarting NetworkManager,
      networking will be disrupted (hopefully only momentary blip, max 10 seconds)...
INFO: restarting done
INFO: checking status of services ...
      (sleeping for 2 seconds to catch-up with state)
INFO: status OK

(deck@steamdeck ~)$ steamos-wifi-set-backend --check
iwd

(deck@steamdeck ~)$ ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=56 time=16.0 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=56 time=16.5 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=56 time=21.3 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 15.991/17.940/21.295/2.382 ms

By hand

⚠️ IMPORTANT ⚠️ : Please do this only if you know what you are doing, you don't fear minor breakage and you are confident that you can recover from the situation if something goes wrong. If unsure, better wait until the channel that you run supports the commands to handle this automatically.

If you are running a channel that still does not have this command, e.g. Stable at the time of writing this, you can achieve the same effect by executing the instructions run by this script by hand.

Note, however, that you might easily lose network connectivity if you make some mistake or something goes wrong. So be prepared to restore things by hand by switching to desktop-mode and having a terminal ready.

Now, the actual instructions. In /etc/NetworkManager/conf.d/wifi_backend.conf, change the value of wifi.backend: for example, comment out the default line with iwd and add a line using wpa_supplicant as the value:

[device]
#wifi.backend=iwd
wifi.backend=wpa_supplicant
wifi.iwd.autoconnect=yes

⚠️ IMPORTANT ⚠️ : Leave the other lines alone. If you don't have experience with the format, you can easily make the file unparseable, and then networking could stop working entirely.
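
If you want to double-check which back-end NetworkManager will pick up after editing the file, something along these lines can help (a sketch; NetworkManager's --print-config option prints the merged configuration it would load):

# show the merged NetworkManager configuration and look for the back-end setting
sudo NetworkManager --print-config | grep wifi.backend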

After that, things get more complicated depending on the hardware on which the system runs.

If the device is the Steam Deck, there is a difference between the original LCD and the newer OLED model. With the OLED model, when you switch away from iwd, typically the network device is destroyed, so it's necessary to re-create it. This happens automatically when booting the system, so the easiest fix, if the network doesn't come up within 15 seconds of running the following instructions, is to just restart the whole system.

If you're running SteamOS on models of the Steam Deck that don't exist at the time of writing this, or on other types of computers, the situation can be similar to the OLED or even more complex, depending on the WiFi hardware and drivers. So, again, tread carefully.

In the simplest scenario (the LCD model), execute the following commands as root (or prepending sudo), with OLD_BACKEND being either iwd or wpa_supplicant, the one that you want to migrate away from:

systemctl stop NetworkManager
systemctl disable --now OLD_BACKEND
systemctl restart NetworkManager

If you have the OLED model and are switching from wpa_supplicant to iwd, that should be enough as well.

But if you have the OLED model and you changed from iwd to wpa_supplicant, as explained above, your system will not be able to connect to the network, and the easiest fix is to restart the system.

However, if you want to try to solve the problem by hand without restarting, then execute these commands as root/sudo:

systemctl stop NetworkManager
systemctl disable --now OLD_BACKEND

if ! iw dev wlan0 info &>/dev/null; then iw phy phy0 interface add wlan0 type station; fi

systemctl restart NetworkManager

In any case, if after changing the back-end by hand and rebooting the system is still not able to connect, the best thing to do is probably to restore the previous configuration, restart the system again, and wait until your channel is upgraded to include this command.

Change back-ends via UI?

In the future, it is conceivable that the UI will allow changing WiFi back-ends, with the help of this command or by other means, but this is not possible at the moment.

by Manuel A. Fernandez Montecelo at May 15, 2024 09:21 AM

May 13, 2024

Andy Wingo

partitioning pitfalls for generational collectors

You might have heard of long-pole problems: when you are packing a tent, its bag needs to be at least as big as the longest pole. (This is my preferred etymology; there are others.) In garbage collection, the long pole is the longest chain of object references; there is no way you can algorithmically speed up pointer-chasing of (say) a long linked-list.

As a GC author, some of your standard tools can mitigate the long-pole problem, and some don’t apply.

Parallelism doesn’t help: a baby takes 9 months, no matter how many people you assign to the problem. You need to visit each node in the chain, one after the other, and having multiple workers available to process those nodes does not get us there any faster. Parallelism does help in general, of course, but doesn’t help for long poles.

You can apply concurrency: instead of stopping the user’s program (the mutator) to enumerate live objects, you trace while the mutator is running. This can happen on mutator threads, interleaved with allocation, via what is known as incremental tracing. Or it can happen in dedicated tracing threads, which is what people usually mean when they refer to concurrent tracing. Though it does impose some coordination overhead on the mutator, the pause-time benefits are such that most production garbage collectors do trace concurrently with the mutator, if they have the development resources to manage the complexity.

Then there is partitioning: instead of tracing the whole graph all the time, try to just trace part of it and just reclaim memory for that part. This bounds the size of a long pole—it can’t be bigger than the partition you trace—and so tracing a graph subset should reduce total pause time.

The usual way to partition is generational, in which the main program allocates into a nursery, and objects that survive a few collections then get promoted into old space. But there may be other partitions, for example to put “large objects” (for example, bigger than a few virtual memory pages) in their own section, to be managed with their own algorithm.

partitions, take one

And that brings me to today’s article: generational partitioning has a failure mode which manifests itself as spurious out-of-memory. For example, in V8, running this small JavaScript program that repeatedly allocates a 1-megabyte buffer grows to consume all memory in the system and eventually panics:

while (1) new Uint8Array(1e6);

This is a funny result, to say the least. Let’s dig in a bit to see why.

First off, note that allocating a 1-megabyte Uint8Array makes a large object, because it is bigger than half a V8 page, which is 256 kB on most systems. There is a backing store for the array that will get allocated into the large object space, and then the JS object wrapper that gets allocated in the young generation of the regular object space.

Let’s imagine the heap consists of the nursery, the old space, and a large object space (lospace, for short). (See the next section for a refinement of this model.) In the loop, we cause allocation in the nursery and the lospace. When the nursery starts to get full, we run a minor collection, to trace the newly allocated part of the object graph, possibly promoting some objects to the old generation.

Promoting an object adds to the byte count in the old generation. You could say that promoting causes allocation in the old-gen, though it might happen just on an accounting level if an object is promoted in place. In V8, old-gen allocation is subject to a limit, and that limit is set by a gnarly series of heuristics. Exceeding the limit will cause a major GC, in which the whole heap is traced.

So, minor collections cause major collections via promotion. But what if a minor collection never promotes? This can happen in request-oriented workflows, in which a request comes in, you handle it in JS, write out the result, and then handle the next. If the nursery is at least as large as the memory allocated in one request, no object will ever survive more than one collection. But in our new Uint8Array(1e6) example above, we are allocating to newspace and the lospace. If we never cause promotion, we will always collect newspace but never trigger the full GC that would also collect the large object space, triggering this out-of-memory phenomenon.

partitions, take two

In general, the phenomenon we are looking for is nursery allocations that cause non-nursery allocations, where those non-nursery allocations will not themselves bring about a major GC.

In our example above, it was a typed array object with an associated lospace backing store, assuming the lospace allocation wouldn’t eventually trigger GC. But V8’s heap is not exactly like our model, for one because it actually has separate nursery and old-generation lospaces, and for two because allocation to the old-generation lospace does count towards the old-gen allocation limit. And yet, that simple loop does still blow through all memory. So what is the deal?

The answer is that V8’s heap now has around two dozen spaces, and allocations to some of those spaces escape the limits and allocation counters. In this case, V8’s sandbox makes the link between the JS typed array object and its backing store pass through a table of indirections. At each allocation, we make an object in the nursery, allocate a backing store in the nursery lospace, and then allocate a new entry in the external pointer table, storing the index of that entry in the JS object. When the object needs to get its backing store, it dereferences the corresponding entry in the external pointer table.

Our tight loop above would therefore cause an allocation (in the external pointer table) that itself would not hasten a major collection.

Seen this way, one solution immediately presents itself: maybe we should find a way to make external pointer table entry allocations trigger a GC based on some heuristic, perhaps the old-gen allocated bytes counter. But then you have to actually trigger GC, and this is annoying and not always possible, as EPT entries can be allocated in response to a pointer write. V8 hacker Michael Lippautz summed up the issue in a comment that can be summarized as “it’s gnarly”.

In the end, it would be better if new-space allocations did not cause old-space allocations. We should separate the external pointer table (EPT) into old and new spaces. Because there is at most one “host” object that is associated with any EPT entry—if it’s zero objects, the entry is dead—then the heuristics that apply to host objects will do the right thing with regards to EPT entries.

A couple weeks ago, I landed a patch to do this. It was much easier said than done; the patch went through upwards of 40 revisions, as it took me a while to understand the delicate interactions between the concurrent and parallel parts of the collector, which were themselves evolving as I was working on the patch.

The challenge mostly came from the fact that V8 has two nursery implementations that operate in different ways.

The semi-space nursery (the scavenger) is a parallel stop-the-world collector, which is relatively straightforward... except that there can be a concurrent/incremental major trace in progress while the scavenger runs (a so-called interleaved minor GC). External pointer table entries have mark bits, which the scavenger collector will want to use to determine which entries are in use and which can be reclaimed, but the concurrent major marker will also want to use those mark bits. In the end we decided that if a major mark is in progress, an interleaved collection will neither mark nor sweep the external pointer table nursery; a major collection will finish soon, and reclaiming dead EPT entries will be its responsibility.

The minor mark-sweep nursery does not run concurrently with a major concurrent/incremental trace, which is something of a relief, but it does run concurrently with the mutator. When we promote a page, we also need to promote all EPT entries for objects on that page. To keep track of which objects have external pointers, we had to add a new remembered set, built up during a minor trace, and cleared when the page is swept (also lazily / concurrently). Promoting a page iterates through that set, evacuating EPT entries to the old space. This is additional work during the stop-the-world pause, but hopefully it is not too bad.

To be honest I don’t have the performance numbers for this one. It rides the train this week to Chromium 126 though, and hopefully it fixes this problem in a robust way.

partitions, take three

The V8 sandboxing effort has sprouted a number of indirect tables: external pointers, external buffers, trusted objects, and code objects. Recently, to better integrate with compressed pointers, there is also a table for V8-to-managed-C++ (Oilpan) references. I expect the Oilpan reference table will soon grow a nursery along the same lines as the regular external pointer table.

In general, if you have a generational partition of your main object space, it would seem that you almost always need a generational partition of every other space. Otherwise either you cause new allocations to occur in what is effectively an old space, perhaps needlessly hastening a major GC, or you forget to track allocations in that space, leading to a memory leak. If the generational hypothesis applies for a wrapper object, it probably also applies for any auxiliary allocations as well.

fin

I have a few cleanups remaining in this area but I am relieved to have landed this patch, and pleased to have spent time under the hood of V8’s collector. Many many thanks to Samuel Groß, Michael Lippautz, and Dominik Inführ for their review and patience. Until the next cycle, happy allocating!

by Andy Wingo at May 13, 2024 01:03 PM