Planet Igalia

June 01, 2023

Manuel Rego

Web Engines Hackfest 2023 is coming

Next week Igalia is hosting a new edition of the Web Engines Hackfest in A Coruña.

As last year, we’ll be back at Palexco, an amazing venue, and we have around 100 people registered to participate onsite. You can check the full schedule of the event at the wiki page.

We hope it’s going to be a great week for everyone, and we’re looking forward to the event!


On Monday 5th there will be five talks:

  • JavaScript Modules: Past, Present, and Future by Nicolò Ribaudo
  • Inside Kotlin/Wasm (or how your language could benefit from new proposals, like GC, EH, TFR) by Zalim Bashorov
  • Status of the WPE & GTK WebKit ports by Žan Doberšek
  • Servo 2023 by Delan Azabani
  • Ladybird: Building a new browser from scratch by Andreas Kling

I’m really happy about the set of talks, and particularly excited about the chance to see presentations about some lesser-known web rendering engines like WPE, Servo and LibWeb. BTW, the talks will be live streamed on the Web Engines Hackfest YouTube channel.

Breakout Sessions

Apart from that we’ll have breakout sessions as usual. But this year, remote participation will be allowed on these sessions; you don’t need to register or anything like that, just join the room on the GitHub issues of each breakout session at the planned time.

Breakout Session | Facilitator | Issue
Cross-Shadow Root IDREF associations | Alice Boxhall | #10
Getting into web engine contributing | CanadaHonk | #20
Maintenance of Chromium downstream | José Dapena Paz | #9
Servo | Martin Robinson | #16
Standards and Web Performance | Daniel Ehrenberg | #8
Test262, Testing JavaScript Conformance | Philip Chimento | #19
Updates on accelerated compositing in WebKitGTK | Carlos García Campos | #18
Wasm GC in JavaScriptCore | Zalim Bashorov | #12
Wayland | Antonio Gomes | #13
WebKit and Linux graphics | Žan Doberšek | #15
WebViews and Apps | Jonas Kruckenberg | #11
WinterCG | Andreu Botella | #14
Wolvic: An open source XR browser | Javier Fernández García-Boente | #17

More sessions might be scheduled during the event, so keep an eye on the hackfest wiki page and issues.


Last, but not least, thanks to the Web Engines Hackfest sponsors Arm, Google and Igalia; without your support this event wouldn’t be possible.

Web Engines Hackfest 2023 Sponsors

June 01, 2023 10:00 PM

May 31, 2023

Emmanuele Bassi

Constraints editing

Last year I talked about the newly added support for Apple’s Visual Format Language in Emeus, which lets you quickly describe layouts using a cross between ASCII art and predicates. For instance, I can use:


and obtain a layout like this one:

Boxes approximate widgets

Thanks to the contribution of my colleague Martin Abente Lahaye, now Emeus supports extensions to the VFL, namely:

  • arithmetic operators for constant and multiplication factors inside predicates, like [button1(button2 * 2 + 16)]
  • explicit attribute references, like [button1(button1.height / 2)]

This allows more expressive layout descriptions, like keeping aspect ratios between UI elements, without requiring hitting the code base.
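To make the arithmetic concrete, here is a minimal Python sketch of what the two predicates above work out to. It assumes a horizontal layout, so an unqualified predicate refers to the widget's width; the function name and the sample sizes are made up for illustration and are not part of Emeus.

```python
def predicate_value(reference, multiplier=1.0, constant=0.0):
    """Evaluate a linear predicate: reference * multiplier + constant."""
    return reference * multiplier + constant


# [button1(button2 * 2 + 16)]
# button1 is twice as wide as button2, plus 16 units.
button2_width = 100.0  # made-up size
button1_width = predicate_value(button2_width, multiplier=2.0, constant=16.0)

# [button1(button1.height / 2)]
# button1's width is tied to its own height, which keeps an aspect ratio.
button1_height = 48.0  # made-up size
aspect_width = predicate_value(button1_height, multiplier=0.5)

print(button1_width, aspect_width)
```

In a real layout the constraint solver works these values out together with every other constraint; the sketch only shows the linear relation each predicate describes.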

Of course, editing VFL descriptions blindly is not what I consider a fun activity, so I took some time to write a simple, primitive editing tool that lets you visualize a layout expressed through VFL constraints:

I warned you that it was primitive and simple

Here’s a couple of videos showing it in action:

At some point, this could lead to a new UI tool to lay out widgets inside Builder and/or Glade.

As of now, I consider Emeus in a stable enough state for other people to experiment with it; I’ll probably make a release soon-ish. The Emeus website is up to date, as is the API reference, and I’m happy to review pull requests and feature requests.

by ebassi at May 31, 2023 02:36 PM

May 30, 2023

Byungwoo Lee

Emmanuele Bassi

Configuring portals

One of the things I’ve been recently working on at Igalia is the desktop portals implementation, the middleware layer of API for application and toolkit developers that allows sandboxed applications to interact with the host system. Sandboxing technologies like Flatpak and Snap expose the portal D-Bus interfaces inside the sandbox they manage, to handle user-mediated interactions like opening a file that exists outside of the locations available to the sandboxed process, or talking to privileged components like the compositor to obtain a screenshot.

Outside of allowing dynamic permissions for sandboxed applications, portals act as a vendor-neutral API for applications to target when dealing with Linux as an OS; this is mostly helpful for commercial applications that are not tied to a specific desktop environment, but don’t want to re-implement the layer of system integration from the first principles of POSIX primitives.

The architecture of desktop portals has been described pretty well in a blog post by Peter Hutterer, but to recap:

  • desktop portals are a series of D-Bus interfaces
  • toolkits and applications call methods on those D-Bus interfaces
  • there is a user session daemon called xdg-desktop-portal that provides a service for the D-Bus interfaces
  • xdg-desktop-portal implements some of those interfaces directly
  • for the interfaces that involve user interaction, or interaction with desktop-specific services, we have separate services that are proxied by xdg-desktop-portal; GNOME has xdg-desktop-portal-gnome, KDE has xdg-desktop-portal-kde; Sway and wlroots-based compositors have xdg-desktop-portal-wlr; and so on, and so forth

There’s also xdg-desktop-portal-gtk, which acts a bit as a reference portal implementation, and a shared desktop portal implementation for a lot of GTK-based environments. Ideally, every desktop environment should have their own desktop portal implementation, so that applications using the portal API can be fully integrated with each desktop’s interface guidelines and specialised services.

One thing that is currently messy is the mechanism by which xdg-desktop-portal finds the portal implementations available on the system, and decides which implementation should be used for a specific interface.

Up until the current stable version of xdg-desktop-portal, the configuration worked this way:

  1. each portal implementation (xdg-desktop-portal-gtk, -gnome, -kde, …) ships a ${NAME}.portal file; the file is a simple INI-like desktop entry file with the following keys:
    • DBusName, which contains the service name of the portal, for instance, org.freedesktop.impl.portal.desktop.gnome for the GNOME portals; this name is used by xdg-desktop-portal to launch the portal implementation
    • Interfaces, which contains a list of D-Bus interfaces under the org.freedesktop.impl.portal.* namespace that are implemented by the desktop-specific portal; xdg-desktop-portal will match the portal implementation with the public facing D-Bus interface internally
    • UseIn, which contains the name of the desktop to be matched with the contents of the $XDG_CURRENT_DESKTOP environment variable
  2. once xdg-desktop-portal starts, it finds all the .portal files in a well-known location and builds a list of portal implementations currently installed in the system, containing all the interfaces they implement as well as their preferred desktop environment
  3. whenever something calls a method on an interface in the org.freedesktop.portal.* namespace, xdg-desktop-portal will check the current desktop using the XDG_CURRENT_DESKTOP environment variable, and check whether there is a portal with a UseIn key that matches the current desktop
  4. once there’s a match, xdg-desktop-portal will activate the portal implementation and proxy the calls made on the org.freedesktop.portal interfaces over to the org.freedesktop.impl.portal ones
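The lookup described in the steps above can be sketched in a few lines of Python. This is an illustration, not the xdg-desktop-portal code: the [portal] group name, the semicolon-separated lists, the colon-separated XDG_CURRENT_DESKTOP value and the case-insensitive comparison are all assumptions of the sketch, and the file contents are made up.

```python
import configparser

# Made-up contents of a gnome.portal file, using the three keys described above.
GNOME_PORTAL = """
[portal]
DBusName=org.freedesktop.impl.portal.desktop.gnome
Interfaces=org.freedesktop.impl.portal.FileChooser;org.freedesktop.impl.portal.Screenshot;
UseIn=gnome
"""


def split_list(value):
    """Split a semicolon-separated value, dropping empty entries."""
    return [item for item in value.split(";") if item]


def parse_portal(text):
    """Read one .portal file into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(text)
    group = parser["portal"]
    return {
        "dbus_name": group["DBusName"],
        "interfaces": split_list(group["Interfaces"]),
        "use_in": [name.lower() for name in split_list(group["UseIn"])],
    }


def find_implementation(portals, interface, current_desktop):
    """Return the D-Bus name of the first portal whose UseIn matches the desktop."""
    desktops = [name.lower() for name in current_desktop.split(":") if name]
    for portal in portals:
        if interface not in portal["interfaces"]:
            continue
        if any(name in portal["use_in"] for name in desktops):
            return portal["dbus_name"]
    return None


portals = [parse_portal(GNOME_PORTAL)]
print(find_implementation(portals, "org.freedesktop.impl.portal.Screenshot", "GNOME"))
```

Note how the decision depends entirely on what each portal declares about itself, which is where the problems described next come from.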

This works perfectly fine for the average case of a Linux installation with a single session, using a single desktop environment, and a single desktop portal. Where things get messy is the case where you have multiple sessions on the same system, each with its own desktop and portals, or even no portals whatsoever. In a bad scenario, you may get the wrong desktop portal just because the name sorts before the one you’re interested in, so you get the GTK “reference” portals instead of the KDE-specific ones; in the worst case scenario, you may get a stall when launching an application just because the wrong desktop portal is trying to contact a session service that simply does not exist, and you have to wait 30 seconds for a D-Bus timeout.

The problem is that some desktop portal implementations are shared across desktops, or cover only a limited number of interfaces; a mandatory list of desktop environments is far too coarse a tool to deal with this. Additionally, xdg-desktop-portal has to have enough fallbacks to ensure that, if it cannot find any implementation for the current desktop, it will proxy to the first implementation it can find in order to give a meaningful answer. Finally, since the supported desktops are shipped by the portals themselves, there’s no way for packagers, admins, or users to override this information.

After iterating over the issue, I ended up writing the support for a new configuration file. Instead of having portals say what kind of desktop environment they require, we have desktop environments saying which portal implementations they prefer. Now, each desktop should ship a ${NAME}-portals.conf INI-like desktop entry file listing each interface, and what kind of desktop portal should be used for it; for instance, the GNOME desktop should ship a gnome-portals.conf configuration file that specifies a default for every interface:


On the other hand, you could have a Foo desktop that relies on the GTK portal for everything, except for specific interfaces that are implemented by the “foo” portal:


You could also disable all portals except for a specific interface (and its dependencies):


Or, finally, you could disable all portal implementations:


A nice side effect of this work is that you can configure your own system, by dropping a portals.conf configuration file inside the XDG_CONFIG_HOME/xdg-desktop-portal directory; this should cover all the cases in which people assemble their desktop out of disparate components.
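The four configurations described above can be sketched in Python, together with the lookup they imply. Treat this as an illustration under stated assumptions: the [preferred] group, the default key and the none value follow the portals.conf format as I understand it; the Screenshot, Screencast and Secret choices and the "gnome-keyring" implementation name are examples, not a recommendation; and preferred_portal is a hypothetical helper, not part of xdg-desktop-portal.

```python
import configparser

# GNOME: one default for every interface.
GNOME_CONF = """
[preferred]
default=gnome
"""

# Foo desktop: GTK portal for everything, "foo" for specific interfaces.
FOO_CONF = """
[preferred]
default=gtk
org.freedesktop.impl.portal.Screenshot=foo
org.freedesktop.impl.portal.Screencast=foo
"""

# Everything disabled except one interface.
SINGLE_CONF = """
[preferred]
default=none
org.freedesktop.impl.portal.Secret=gnome-keyring
"""

# All portal implementations disabled.
NONE_CONF = """
[preferred]
default=none
"""


def preferred_portal(conf_text, interface):
    """Return the implementation chosen for an interface, or None if disabled."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # D-Bus interface names are case-sensitive
    parser.read_string(conf_text)
    preferred = parser["preferred"]
    default = preferred.get("default", fallback="none")
    choice = preferred.get(interface, fallback=default)
    return None if choice == "none" else choice


print(preferred_portal(FOO_CONF, "org.freedesktop.impl.portal.Screenshot"))
print(preferred_portal(FOO_CONF, "org.freedesktop.impl.portal.FileChooser"))
```

The direction of the lookup is the important part: the desktop's file is consulted first, and the portal implementations no longer need to say anything about where they should be used.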

By having desktop environments (or, in a pinch, the user themselves) owning the kind of portals they require, we can avoid messy configurations in the portal implementations, and clarify the intended behaviour to downstream packagers; at the same time, generic portal implementations can be adopted by multiple environments without necessarily having to know which ones upfront.

In a way, the desktop portals project is trying to fulfill the original mission of freedesktop.org’s Cross-desktop Group: a set of APIs that are not bound to a single environment, and can be used to define “the Linux desktop” as a platform.

Of course, there’s a lot of work involved in creating a vendor-neutral platform API, especially when it comes to designing both the user and the developer experiences; ideally, more people should be involved in this effort, so if you want to contribute to the Linux ecosystem, this is an area where you can make a difference.

by ebassi at May 30, 2023 01:59 PM

May 25, 2023

Samuel Iglesias

Closing a cycle

For the last four years I’ve served as a member of the X.Org Foundation Board of Directors, but a few days ago I stepped down: my term ended and I did not run for re-election.

I started contributing to Mesa in 2014 and joined the amazing freedesktop community. Soon after, I joined the X.Org Foundation as a regular member in order to participate in the elections and get access to some interesting perks (VESA, Khronos Group). You can learn more about what the X.Org Foundation does in Ricardo’s blogpost.

But everything changed in 2018. That year, Chema and I organized XDC 2018 in A Coruña, Spain.

XDC 2018 photo

The following year, I ran in the yearly election for the X.Org Foundation’s board of directors (as it is a two-year term, we renew half of the board every year)… and I was elected! It was awesome! Almost immediately, I started coordinating XDC, and looking for organization proposals for the following XDC. I documented my experience organizing XDC 2018 in an attempt to make the job easier for future organizers, reducing the burden that organizing such a conference entails.

In 2021, I was re-elected and everything continued without changes (well, except the pandemic and having our first 2 virtual XDCs: 2020 and 2021).

Unfortunately, my term finished this year… and I did not run for re-election. The reasons were a mix of personal life commitments (having 2 kids changes your life completely) and new professional responsibilities. After those changes, I could not contribute as much as I wanted, and that was enough for me to pass the torch and let others contribute to the X.Org Foundation instead. Congratulations to Christopher Michael and Arek Hiler, I’m pretty sure you are going to do great!

Surprisingly enough, I am closing the cycle as it started: organizing X.Org Developers Conference 2023 in A Coruña, Spain from 17th to 19th October 2023.

A Coruña

I leave the board of directors, but I gained friends and great memories. If you are interested in participating in the community via the board of directors, prepare your candidacy for next year!

See you in A Coruña!

May 25, 2023 10:09 AM

May 24, 2023

Ricardo García

What is the X.Org Foundation, anyway?

A few weeks ago the annual X.Org Foundation Board of Directors election took place. The Board of Directors has 8 members at any given moment, and members are elected for 2-year terms. Instead of renewing the whole board every 2 years, half the board is renewed every year. Foundation members, which must apply for or renew membership every year, are the electorate in the process. Their main duty is voting in board elections and occasionally voting in other changes proposed by the board.

As you may know, thanks to the work I do at Igalia, and the trust of other Foundation members, I’m part of the board and currently serving the second year of my term, which will end in Q1 2024. Despite my merits coming from my professional life, I do not represent Igalia as a board member. However, to prevent companies from taking over the board, I must disclose my professional affiliation and we must abide by the rule that prohibits more than two people with the same affiliation from being on the board at the same time.

X.Org Logo
Figure 1. X.Org Logo by Wikipedia user Sven, released under the terms of the GNU Free Documentation License

Because of the name of the Foundation and for historical reasons, some people are confused about its purpose and sometimes think it acts as a governance body for some projects, particularly the X server, but this is not the case. The X.Org Foundation wiki page has some bits of information, but I wanted to clarify a few points, like mentioning that the Foundation has no paid employees, and explain what we do at the Foundation and the tasks of the Board of Directors in practical terms.

Cue the music.

(“The Who - Who Are You?” starts playing)

The main points would be:

  1. The Foundation acts as an umbrella for multiple projects, including the X server, Wayland and others.

  2. The board of directors has no power to decide who has to work on what.

  3. The largest task is probably organizing XDC.

  4. Being a director is not a paid position.

  5. The Foundation pays for project infrastructure.

  6. The Foundation, or its financial liaison, acts as an intermediary with other orgs.

Umbrella for multiple projects

Some directors have argued in the past that we need to change the Foundation name to something different, like the Freedesktop.org Foundation. With some healthy sense of humor, others have advocated for names like Freedesktop Software Foundation, or FSF for short, which should be totally not confusing. Humor or not, the truth is the X.Org Foundation is essentially the Freedesktop Foundation, so the name change would be nice in my own personal opinion.

If you take a look at the Freedesktop Gitlab instance, you can navigate to a list of projects and sort them by stars. Notable mentions you’ll find in the list: Mesa, PipeWire, GStreamer, Wayland, the X server, Weston, PulseAudio, NetworkManager, libinput, etc. Most of them closely related to a free and open source graphics stack, or free and open source desktop systems in general.

X.Org server unmaintained? I feel you

As I mentioned above, the Foundation has no paid employees and the board has no power to direct engineering resources to a particular project under its umbrella. It’s not a legal question, but a practical one. Is the X.Org server dying and nobody wants to touch it anymore? Certainly. Many people who worked on the X server are now working on Wayland and creating and improving something that works better in a modern computer, with a GPU that’s capable of doing things which were not available 25 years ago. It’s their decision and the board can do nothing.

On a tangent, I’m feeling a bit old now, so let me say when I started using Linux more than 20 years ago people were already mentioning most toolkits were drawing stuff to pixmaps and putting those pixmaps on the screen, ignoring most of the drawing capabilities of the X server. I’ve seen tearing when playing movies on Linux many times, and choppy animations everywhere. Attempting to use the X11 protocol over a slow network resulted in broken elements and generally unusable screens, problems which would not be present when falling back to a good VNC server and client (they do only one specialized thing and do it better).

For the last 3 or 4 years I’ve been using Wayland (first on my work laptop, nowadays also on my personal desktop) and I’ve seen it improve all the time. As far as my experience goes, animations are never choppy on Wayland, tearing is unheard of and things work more smoothly. Thanks to using the hardware better, Wayland may also give you improved battery life. I’ve posted in the past that you can even use NVIDIA with Gnome on Wayland these days, and things are even simpler if you use an Intel or AMD GPU.

Naturally, there may be a few things which may not be ready for you yet. For example, maybe you use a DE which only works on X11. Or perhaps you use an app or DE which works on Wayland, but its support is not great and has problems there. If it’s an app, likely power users or people working on distributions can tune it to make it use XWayland by default, instead of Wayland, while bugs are ironed out.

X.Org Developers Conference

Ouch, there we have the “X.Org” moniker again…​

Back on track, if the Foundation can do nothing about the lack of people maintaining the X server and does not set any technical direction for projects, what does it do? (I hear you shouting “nothing!” while waving your fist at me.) One of the most time-consuming tasks is organizing XDC every year, which is arguably one of the most important conferences, if not the most important one, for open source graphics right now.

Specifically, the board of directors will set up a commission composed of several board members and other Foundation members to review talk proposals, select which ones will have a place at the conference, talk to speakers about shortening or lengthening their talks, and put them on a schedule to be used at the conference, which typically lasts 3 days. I chaired the paper committee for XDC 2022 and spent quite a lot of time on this.

The conference is free to attend for anyone and usually alternates location between Europe and the Americas. Some people may want to travel to the conference to present talks there but they may lack the budget to do so. Maybe they’re a student or they don’t have enough money, or their company will not sponsor travel to the conference. For that, we have travel grants. The board of directors also reviews requests for travel grants and approves them when they make sense.

But that is only the final part. The board of directors selects the conference contents and prepares the schedule, but the job of running the conference itself (finding an appropriate venue, paying for it, maybe providing some free lunches or breakfasts for attendees, handling audio and video, streaming, etc) falls in the hands of the organizer. Kid you not, it’s not easy to find someone willing to spend the needed amount of time and money organizing such a conference, so the work of the board starts a bit earlier. We have to contact people and request proposals to organize the conference. If we get more than one proposal, we have to evaluate them and select one.

As the conference nears, we have to fire some more emails and convince companies to sponsor XDC. This is also really important and takes time as well. Money gathered from sponsors is not only used for the conference itself and travel grants, but also to pay for infrastructure and project hosting throughout the whole year. Which takes us to…​

Spending millions in director salaries

No, that’s not happening.

Being a director of the Foundation is not a paid position. Every year we suffer a bit to be able to get enough candidates for the 4 positions that will be elected. Many times we have to extend the nomination period.

If you read news about the Foundation having trouble finding candidates for the board, that barely qualifies as news because it’s almost the same every year. Which doesn’t mean we’re not happy when people spread the news and we receive some more nominations, thank you!

Just as being an open source maintainer is sometimes a thankless task, not everybody wants to volunteer and do time-consuming tasks for free. Running the board elections themselves, approving membership renewals and requests every year, and sending voting reminders also take time. Believe me, I just did that a few weeks ago with help from Mark Filion from Collabora and technical assistance from Martin Roukala.

Project infrastructure

The Foundation spends a lot of money on project hosting costs, including Gitlab and CI systems, for projects under the umbrella. These systems are used every day and are fundamental for some projects and software you may be using if you run Linux. Running our own Gitlab instance and associated services helps keep the web decentralized and healthy, and provides more technical flexibility. Many people seem to appreciate those details, judging by the number of projects we host.

Speaking on behalf of the community

The Foundation also approaches other organizations on behalf of the community to achieve some stuff that would be difficult otherwise.

To pick one example, we’ve worked with VESA to provide members with access to various specifications that are needed to properly implement some features. Our financial liaison, formerly SPI and soon SFC, signs agreements with the Khronos Group that let them waive fees for certifying open source implementations of their standards.

For example, you know RADV is certified to comply with the Vulkan 1.3 spec and the submission was made on behalf of Software in the Public Interest, Inc. Same thing for lavapipe. Similar for Turnip, which is Vulkan 1.1 conformant.


The song is probably over by now and you have a better idea of what the Foundation does, and what the board members do to keep the lights on. If you have any questions, please let me know.

May 24, 2023 07:52 AM

May 23, 2023

Samuel Iglesias

Joining the Linux Foundation Europe Advisory Board

Last year, the Linux Foundation announced the creation of the Linux Foundation Europe.

Linux Foundation Europe

The goal of the Linux Foundation Europe is, in a nutshell, to promote Open Source in Europe not only to individuals (via events and courses), but to companies (guidance and hosting projects) and European organizations. However, this effort needs the help of European experts in Open Source.

Thus, the Linux Foundation Europe (LFE) has formed an advisory board called the Linux Foundation Europe Advisory Board (LFEAB), which includes representatives from a cross-section of 20 leading European organizations within the EU, the UK, and beyond. The Advisory Board will play an important role in stewarding Linux Foundation Europe’s growing community, which now spans 100 member organizations from across the European region.

Early this year, I was invited to join the LFEAB as an inaugural member. I would not be in this position without the huge amount of work done by the rest of my colleagues at Igalia since the company was founded in 2001, which has paved the way for us to be one of the landmark consultancies specialized in Open Source, both globally and in Europe.

My presence in the LFEAB will help to share our experience, and help the Linux Foundation Europe to grow and spread Open Source everywhere in Europe.

Samuel Iglesias presented as Linux Foundation Europe Advisory Board member

I’m excited to participate in the Linux Foundation Europe Advisory Board! The rest of the LFEAB and I will be at the Open Source Summit Europe. Send me an email if you want to meet me to learn more about the LFEAB, about Igalia or about how you can contribute more to Open Source.

Happy hacking!

May 23, 2023 11:30 AM

Brian Kardell

Says Who?

Thoughts on standards and the new baseline effort.

If you've been around me, or my writing, for any time at all, you've probably heard me ask "but what really makes it a standard"?

It is, for example, possible to have words approved in a standards body for which there is, for all intents and purposes, not much in the way of actual implementation or users. Conversely, it is possible to have things that had nothing to do with standards bodies and yet have dozens or hundreds of interoperable implementations and open licenses and are, in reality, much more universal.

At the end of the day, any real judgement kind of involves looking at the reality on the ground. It is a standard... when it is a standard.

I come from Pittsburgh, and in the Steel City, outside the locker room of the Steelers (our amazing football team) it says...

Text painted on the wall that says 'The standard... is the standard'.

See? It's right there on the wall.

Still, if this makes you uncomfortable, think about an English dictionary. The words in it are simply recognized as standards... because they are.

Where is the invisible line?

At some point, it seems, a thing crosses an invisible line and then it's standard. But only after a gradual process to reach that point. The very end of that process is really rather boring because it's really just stating the obvious.

But where is that magical line when it comes to "the reality on the ground" for web standards?

There's a new effort by the WebDX Community Group called "Baseline" which attempts to identify it and I'm excited because it feels like it could be really valuable in several ways.

One that I am most keen on is using it to create a really high signal-to-noise channel for developers to subscribe to. If we define a line right, then something reaching that line is very newsworthy and pretty rare, so we can all afford to pay some attention to it. Imagine an RSS Feed and social media accounts that posted very rarely and only told you this Very Serious Amazing News. Yes, please, give that access to my notifications! I feel like that would make everything feel a lot less overwhelming and also probably markedly speed real adoption at scale.

The really tricky thing here seems to be, mainly, that it's just really hard to define where that line is in a way everyone agrees with that is actually still useful as more than a kind of historical artifact. That's not to say that such an artifact isn't useful to future learners, but again, by that point this will just be common knowledge.

Stage {x}?

The new Baseline idea has a definition (as of the time of this writing, that is "supported in the last 2 major releases of certain popular browsers (Firefox, Samsung Internet, Safari and Chrome)"). There was a lot of debate about it before arriving at that. It also currently has a whole slew of issues about why that definition isn't great.
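To see how mechanical that definition is, here is a minimal Python sketch of it. It assumes major releases can be treated as plain integers and that "supported in the last 2 major releases" means the feature shipped no later than the release before the latest one; the browser keys and version numbers are invented, and nothing here comes from the WebDX group's actual tooling.

```python
BASELINE_BROWSERS = ("firefox", "samsung_internet", "safari", "chrome")


def is_baseline(supported_since, latest_release):
    """Check the 'last 2 major releases' rule for every listed browser.

    supported_since maps a browser to the first major release that shipped
    the feature (missing or None means it has not shipped); latest_release
    maps a browser to its current major release.
    """
    for browser in BASELINE_BROWSERS:
        since = supported_since.get(browser)
        if since is None:
            return False
        # Both the latest release and the one before it must have the feature.
        if since > latest_release[browser] - 1:
            return False
    return True


# Invented version numbers, for illustration only.
latest = {"firefox": 113, "samsung_internet": 21, "safari": 16, "chrome": 113}
feature = {"firefox": 110, "samsung_internet": 19, "safari": 16, "chrome": 105}

print(is_baseline(feature, latest))
```

In this made-up data the feature only arrived in the latest Safari release, so it is not yet Baseline; one more Safari major release would flip the answer without anything changing for users on the ground, which is part of why the line is so hard to place.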

But maybe that's because there are actually several different lines and all of them are interesting in different ways. Think about a progress meter: there can be lots of lines along the way to "done".

The thing I like about the ECMA "Stages" model is that it's easy to visualize like that, and has no clever names: Just 0, 1, 2, 3, 4. Each of those is a 'line' you pass on the way to done. Maybe that kind of model works to discuss here too - we just need more numbers, because those are about ECMA 'done-ness' and not something like what baseline is trying to convey.

Something reaching stage 4 is a huge day, but it doesn't mean the on-the-ground-reality of "all users have support on their devices". In theory, at least, that could still take years to reach.

Conceptually speaking, we could imagine more interesting "lines" (plural) a thing would cross on the way to that day.

For example, the day we learn there is a final engine implementation in experimental builds passing tests is an interesting line. Maybe that's "Stage 5" in my analogy.

The day when it "ships without caveat in the last of the 'steward' browsers" is, in my opinion, a super interesting line (that is, the steward browser that primarily maintains the engine itself). That seems like an especially newsworthy day we should pay attention to because many people will be working on projects that won't ship for a long time, and maybe it's worth considering using it. Maybe that's like "Stage 6" in my analogy.

But that also doesn't mean all of the downstream browsers have released support - that can take time. If you're just starting on a months or year long project, that's probably pretty safe. There's always also a risk that downstream browsers can choose not to adopt a new feature for some reason. Many downstream browsers have considerable support differences with certain APIs (web speech, for example). Not much you can do about that, but what does it mean? Is Web Speech a standard? Is it “baseline”? There are at least tens of millions of browser instances out there that lack support, but a few billion that don't.

Even in the steward's own browser (ie, Chrome, Firefox or Safari), it's not as if releasing a new version is a lightswitch that updates all of the devices in the world. There are lots of things that prevent or delay updating: Corporate controls, OS limits/settings. In some cases, a user interaction is simply required and for whatever reason, people just... don't, for long stretches of time.

So, what should baseline use? Any of them? There are probably several useful 'lines' (or stages, if that is easier to imagine) worth discussing. I guess one of those can be called "baseline" - I'm just not really sure where that is in this spectrum. I'm curious to hear your thoughts!

Feel free to hit me up on any of the social medias. Tweet, toot or skeet at me if you like. Or, even better: If you're interested in contributing to the thinking around this, it's part of the Web Platform DX Community Group which you can participate in. This work is being tracked and discussed mainly in its Feature-Set Repository on GitHub. Participation is welcome.

May 23, 2023 04:00 AM

May 22, 2023

Alex Bradbury

2023Q2 week log

I tend to keep quite a lot of notes on the development-related work (sometimes at work, sometimes not) I do on a week-by-week basis, and thought it might be fun to write up the parts that were public. This may or may not be of wider interest, but it aims to be a useful aide-mémoire for my purposes at least. Weeks with few entries might be due to focusing on downstream work (or perhaps just a less productive week - I am only human!).

Week of 15th May 2023

Week of 17th April 2023

  • Still pinging for an updated riscv-bfloat16 spec version that incorporates the fcvt.bf16.s encoding fix.
  • Bumped the version of the experimental Zfa RISC-V extension supported by LLVM to 0.2 (D146834). This was very straightforward as after inspecting the spec history, it was clear there were no changes that would impact the compiler.
  • Filed a couple of pull requests against the riscv-zacas repo (RISC-V Atomic Compare and Swap extension).
    • #8 made the dependency on the A extension explicit.
    • #7 attempted to explicitly reference the extension for misaligned atomics, though it seems it won't be merged. I do feel uncomfortable with RISC-V extensions that can have their semantics changed by other standard extensions without this possibility being called out very explicitly. As I note in the PR, failure to appreciate this might mean that conformance tests written for zacas might fail on a system with zacas_zam. I see a slight parallel to a recent discussion about RISC-V profiles.
  • Fixed the canonical ordering used for ISA naming strings in RISCVISAInfo (this will mainly affect the string stored in build attributes). This was fixed in D148615 which built on the pre-committed test case.
  • A whole bunch of upstream LLVM reviews. As noted in D148315 I'm thinking we should probably relax the ordering rules for ISA strings in -march in order to avoid issues due to spec changes and incompatibilities between GCC and Clang.
  • LLVM Weekly #485.

Week of 10th April 2023

Week of 3rd April 2023

Article changelog
  • 2023-05-22: Added notes for the week of 15th May 2023.
  • 2023-04-24: Added notes for the week of 17th April 2023.
  • 2023-04-17: Added notes for the week of 10th April 2023.
  • 2023-04-10: Initial publication date.

May 22, 2023 10:00 AM

Emmanuele Bassi

Dream About Flying

as usual — long time, no blog.

my only excuse is that I was busy with other things: new job, new office, holidays… you know, whatever happens between coding. :-)

it’s that time of year again, and we’re nearing another Clutter release — this time it’s a special one, though, as it is 1.0.0, which also means that the API will be frozen for the entire duration of the 1.x branch: only additions and deprecations will be allowed. no worries about stagnation, though — we are already planning for 2.0 (even though it’ll take at least a couple of years to get there).

since we’re in the process of finalizing the 1.0 API I thought about writing something about what changed, what was added and what has been removed for good.

let’s start with the Effects API. the Effects were meant to provide a high level API for simple, fire-and-forget animations (even though people always tried to find new ways to abuse the term “fire-and-forget”). they were sub-optimal in the memory management — you had to keep around the EffectTemplate, the effects copied the timelines — and they weren’t extensible — writing your own effect would have been impossible without reimplementing the whole machinery. after the experiments done by Øyvind and myself, and after looking at what the high-level languages provided, I implemented a new implicit animation API — all based around a single object, with the most automagic memory management possible:

/* resize the actor in 250 milliseconds using a cubic easing
 * and attach a callback at the end of the animation
 */
 ClutterAnimation *animation =
   clutter_actor_animate (actor, 250, CLUTTER_EASE_IN_CUBIC,
                          "width", 200,
                          "height", 200,
                          "color", &new_color,
                          NULL); /* the property list is NULL-terminated */
  g_signal_connect (animation, "completed",
                    G_CALLBACK (on_animation_complete),
                    NULL); /* no user data */

this should make a lot of people happy. the easing modes in particular are the same ones shared among various animation frameworks, like tweener and jQuery.

what might make some people slightly less happy is the big API churn that removed both ClutterLabel and ClutterEntry and added ClutterText. the trade-off, though, is clearly in favour of ClutterText, as this is a base class for both editable and non-editable text displays; it supports pointer and keyboard selection, and multi-line as well as single-line editing.

another big change happened on the low level COGL API, with the introduction of vertex buffers — which allow you to efficiently store arrays of vertex attributes; and, more importantly, with the introduction of the Materials which decouple the drawing operations from the fill operations. it also adds support for multi-texturing, colors and other GL features — on both GL and GLES.


after unifying Label and Entry, we also decided to unify BehaviourPath and BehaviourBspline; after that we added support for creating paths using SVG-like descriptions and for “replaying” a Path on a cairo_t. well, the Cairo integration is also another feature — clutter-cairo has been deprecated and its functionality moved inside ClutterCairoTexture.

one of the last minute additions has been ClutterClone, an efficient way to clone generic actors without using FBOs — which also supersedes the CloneTexture actor.

the Pango integration has been extended, and the internal Pango API exposed and officially supported — now you can display text using the Pango renderer and glyphs cache inside your own custom actors without using internal/unstable API.

thanks to Johan Dahlin and Owen Taylor, Clutter now generates GObject-Introspection data at compile time, so that runtime language bindings will be ready as soon as 1.0.0 hits the internets.

finally, there’s a ton of bug fixes in how we use GL, how we render text, how we relayout actors, etc.

hope you’ll have fun with Clutter!

by ebassi at May 22, 2023 09:27 AM

quiet strain

as promised during GUADEC, I’m going to blog a bit more about the development of GSK — and now that I have some code, it’s actually easier to do.

so, let’s start from the top, and speak about GDK.

in April 2008 I was in Berlin, enjoying the city, the company, and good food, and incidentally attending the first GTK+ hackfest. those were the days of Project Ridley, and when the plan for GTK+ 3.0 was to release without deprecated symbols and with all the instance structures sealed.

in the long discussions about the issue of a “blessed” canvas library to be used by GTK app developers and by the GNOME project, we ended up discussing the support of the OpenGL API in GDK and GTK+. the original bug had been opened by Owen about 5 years prior, and while we had ancillary libraries like GtkGLExt and GtkGLArea, the integration was a pretty sore point. the consensus at the end of the hackfest was to provide wrappers around the platform-specific bits of OpenGL inside GDK, enough to create a GL context and bind it to a specific GdkWindow, to let people draw with OpenGL commands at the right time in the drawing cycle of GTK+ widgets. the consensus was also that I would look at the bug, as a person that at the time was dealing with OpenGL inside tool kits for his day job.

well, that didn’t really work out, because, cut to 6 years after that hackfest, the bug is still open.

to be fair, the landscape of GTK and GDK has changed a lot since those days. we actually released GTK+ 3.0, and with a lot more features than just deprecations removal; the whole frame cycle is much better, and the paint sequence is reliable and completely different than before. yet, we still have to rely on poorly integrated external libraries to deal with OpenGL.

right after GUADEC, I started hacking on getting the minimal amount of API necessary to create a GL context, and being able to use it to draw on a GTK widget. it turns out that it wasn’t that big of a job to get something on the screen in a semi-reliable way — after all, we already had libraries like GtkGLExt and GtkGLArea living outside of the GTK git repository that did that, even if they had to use deprecated or broken API. the complex part of this work involved being able to draw GL inside the same infrastructure that we currently use for Cairo. we need to be able to synchronise the frame drawing, and we need to be able to blend the contents of the GL area with both content that was drawn before and after, likely with Cairo — otherwise we would not be able to do things like drawing an overlay notification on top of the usual spinning gears, while keeping the background color of the window:

welcome to the world of tomorrow (for values of tomorrow close to 2005)

luckily, thanks to Alex, the amount of changes in the internals of GDK was kept to a minimum, and we can enjoy GL rendering running natively on X11 and Wayland, using GLX or EGL respectively.

on top of the low level API, we have a GtkGLArea widget that renders all the GL commands you submit to it, and it behaves like any other GTK+ widget.

today, Matthias merged the topic branch into master, which means that, barring disastrous regressions, GTK+ 3.16 will finally have native OpenGL support — and we’ll be one step closer to GSK as well.

right now, there’s still some work to do — namely: examples, performance, documentation, porting to MacOS and Windows — but the API is already fairly solid, so we’d all like to get feedback from the users of libraries like GtkGLExt and GtkGLArea, to see what they need or what we missed. feedback is, as usual, best directed at the gtk-devel mailing list, or on the #gtk+ IRC channel.

by ebassi at May 22, 2023 09:27 AM

Who wrote GTK+ (Reprise)

As I’ve been asked by different people about data from older releases of GTK+, after the previous article on Who Wrote GTK+ 3.18, I ran the git-dm script on every release and generated some more data:

Release  Lines added  Lines removed    Delta  Changesets  Contributors
2.0 [1]       666495         345348   321147        2503           106
2.2           301943         227762    74181        1026            89
2.4           601707         116402   485305        2118           109
2.6           181478          88050    93428        1421           101
2.8            93734          47609    46125        1155            86
2.10          215734          54757   160977        1614           110
2.12          232831          43172   189659        1966           148
2.14          215151         102888   112263        1952           140
2.16           71335          23272    48063         929           118
2.18           52228          23490    28738        1079            90
2.20           80397         104504   -24107         761            82
2.22           51115          71439   -20324         438            70
2.24            4984           2168     2816         184            37
3.0 [1]       354665         580207  -225542        4792           115
3.2           227778         168616    59162        2435            98
3.4           126934          83313    43621        2201            84
3.6           206620          34965   171655        1011            89
3.8            84693          34826    49867        1105            90
3.10          143711         204684   -60973        1722           111
3.12           86342          54037    32305        1453            92
3.14          130387         144926   -14539        2553            84
3.16           80321          37037    43284        1725            94
3.18*          78997          54614    24383        1638            83

Here you can see the history of the GTK releases, since 2.0.

These numbers are to be taken with a truckload of salt, especially the ones from the 2.x era. During the early 2.x cycle, releases did not follow the GNOME timed release schedule; instead, they were done whenever needed:

Release Date
2.0 March 2002
2.2 December 2002
2.4 March 2004
2.6 December 2004
2.8 August 2005
2.10 July 2006
2.12 September 2007
2.14 September 2008
2.16 March 2009
2.18 September 2009
2.20 March 2010
2.22 September 2010
2.24 January 2011

Starting with 2.14, we settled on the same cycle as GNOME, as it made releasing GNOME and packaging GTK+ on your favourite distribution a lot easier.

This disparity in the length of the development cycles explains why the 2.12 and 2.14 cycles, which lasted a year, represent an anomaly in terms of contributors (148 and 140, respectively) and in terms of absolute lines changed.

The reduced activity between 2.20 and 2.24.0 is easily attributable to the fact that people were working hard on the 2.90 branch that would become 3.0.

In general, once you adjust by release time, it’s easy to see that the number of contributors is pretty much stable at around 90:

The average is 94.5, which means we have a hobbit somewhere in the commit log

Another interesting data point would be to look at the ecosystem of companies spawned around GTK+ and GNOME, and how it has changed over the years — but that’s part of a larger discussion that would probably take more than a couple of blog posts to unpack.

I guess the larger point is that GTK+ is definitely not dying; it’s pretty much being worked on by the same number of people — which includes long timers as well as newcomers — as it was during the 2.x cycle.

  1. Both 2.0 and 3.0 are not wholly accurate; I used, as a starting point for the changeset period, the previous released branch point; for GTK+ 2.0, I started from the GTK_1_3_1 tag, whereas for GTK+ 3.0 I used the 2.90.0 tag. There are commits preceding both tags, but not enough to skew the results. 

by ebassi at May 22, 2023 09:26 AM

GSK Demystified (II) — Rendering

See the previous article for an introduction to GSK.

In order to render with GSK we need to get acquainted with two classes:

  • GskRenderNode, a single element in the rendering tree
  • GskRenderer, the object that effectively turns the rendering tree into rendering commands


The usual way to put things on the screen involves asking the windowing system to give us a memory region, filling it with something, and then asking the windowing system to present it to the graphics hardware, in the hope that everything ends up on the display. This is pretty much how every windowing system works. The only difference lies in that “filling it with something”.

With Cairo you get a surface that represents that memory region, and a (stateful) drawing context; every time you need to draw you set up your state and emit a series of commands. This happens on every frame, starting from the top level window down into every leaf object. At the end of the frame, the content of the window is swapped with the content of the buffer. Every frame is drawn while we’re traversing the widget tree, and we have no control on the rendering outside of the state of the drawing context.

A tree of GTK widgets

With GSK we change this process with a small layer of indirection; every widget, from the top level to the leaves, creates a series of render nodes, small objects that each hold the drawing state for their contents. Each node is, at its simplest, a collection of:

  • a rectangle, representing the region used to draw the contents
  • a transformation matrix, representing the parent-relative set of transformations applied to the contents when drawing
  • the contents of the node

Every frame, thus, is composed of a tree of render nodes.

A tree of GTK widgets and GSK render nodes

The important thing is that the render tree does not draw anything; it describes what to draw (which can be a rasterization generated using Cairo) and how and where to draw it. The actual drawing is deferred to the GskRenderer instance, and will happen only once the tree has been built.

After the rendering is complete we can discard the render tree. Since the rendering is decoupled from the widget state, the widgets will hold all the state across frames — as they already do. Each GskRenderNode instance is, thus, a very simple instance type instead of a full GObject, whose lifetime is determined by the renderer.


The renderer is the object that turns a render tree into the actual draw commands. At its most basic, it’s a simple compositor, taking the content of each node and its state and blending it on a rendering surface, which then gets pushed to the windowing system. In practice, it’s a tad more complicated than that.

Each top-level has its own renderer instance, as it requires access to windowing system resources, like a GL context. When the frame is started, the renderer will take a render tree and a drawing context, and will proceed to traverse the render tree in order to translate it into actual render commands.

As we want to offload the rendering and blending to the GPU, the GskRenderer instance you’ll most likely get is one that uses OpenGL to perform the rendering. The GL renderer will take the render tree and convert it into a (mostly flat) list of data structures that represent the state to be pushed on the state machine — the blending mode, the shading program, the textures to sample, and the vertex buffer objects and attributes that describe the rendering. This “translation” stage allows the renderer to decide which render nodes should be used and which should be discarded; it also allows us to create, or recycle, all the needed resources when the frame starts, and minimize the state transitions when doing the actual rendering.
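The split between describing a frame and drawing it can be illustrated with a small sketch. This is not the GSK API: `RenderNode`, `flatten`, and the translation-only `offset` (standing in for a full transformation matrix) are hypothetical names and simplifications chosen for the example. It only shows the idea of walking a render tree once, dropping nodes that have nothing to draw, and producing a flat list of commands.

```python
from dataclasses import dataclass, field


@dataclass
class RenderNode:
    """Hypothetical stand-in for a render node; not the GSK API."""
    name: str
    bounds: tuple            # (x, y, width, height), relative to the node origin
    offset: tuple = (0, 0)   # simplified "transform": a parent-relative translation
    contents: str = ""       # placeholder for the rasterized contents
    children: list = field(default_factory=list)


def flatten(node, origin=(0, 0)):
    """Turn the tree into a flat, back-to-front list of draw commands."""
    x = origin[0] + node.offset[0]
    y = origin[1] + node.offset[1]
    commands = []
    if node.contents:        # nodes with nothing to draw are discarded
        commands.append((node.name, x + node.bounds[0], y + node.bounds[1],
                         node.bounds[2], node.bounds[3]))
    for child in node.children:
        commands.extend(flatten(child, (x, y)))
    return commands


button = RenderNode("button", (0, 0, 80, 30), offset=(10, 20), contents="label")
spacer = RenderNode("spacer", (0, 0, 80, 10), offset=(10, 60))
window = RenderNode("window", (0, 0, 200, 100), contents="background",
                    children=[button, spacer])
```

Calling `flatten(window)` yields one command per node that has contents, with positions already resolved to window coordinates. A real renderer would do much more at this stage (resource creation, state sorting), which this sketch leaves out.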

Going from here to there

Widgets provided by GTK will automatically start using render nodes instead of rendering directly to a Cairo context.

There are various fallback code paths in place in the existing code, which means that, luckily, we don’t have to break any existing out of tree widget: they will simply draw themselves (and their children) on an implicit render node. If you want to port your custom widgets or containers, on the other hand, you’ll have to remove the GtkWidget::draw virtual function implementation or signal handler you use, and override the GtkWidget::get_render_node() virtual function instead.

Containers simply need to create a render node for their own background, border, or custom drawing; then they will have to retrieve the render node for each of their children. We’ll provide convenience API for that, so the chances of getting something wrong will be, hopefully, reduced to zero.

Leaf widgets can remain unported a bit longer, unless they are composed of multiple rendering elements, in which case they simply need to create a new render node for each element.

I’ll provide more examples of porting widgets in a later article, as soon as the API has stabilized.

by ebassi at May 22, 2023 09:26 AM


Speaking at GUADEC 2016

I’m going to talk about the evolution of GTK+ rendering, from its humble origins of X11 graphics contexts, to Cairo, to GSK. If you are interested in this kind of stuff, you can either attend my presentation on Saturday at 11 in the Grace Room, or you can just find me and have a chat.

I’m also going to stick around during the BoF days — especially for the usual GTK+ team meeting, which will be on the 15th.

See you all in Karlsruhe.

by ebassi at May 22, 2023 09:26 AM

May 20, 2023

Andy Wingo

approaching cps soup

Good evening, hackers. Today's missive is more of a massive, in the sense that it's another presentation transcript-alike; these things always translate to many vertical pixels.

In my defense, I hardly ever give a presentation twice, so not only do I miss out on the usual per-presentation cost amortization and on the incremental improvements of repetition, the more dire error is that whatever message I might have can only ever reach a subset of those that it might interest; here at least I can be more or less sure that if the presentation would interest someone, that they will find it.

So for the time being I will try to share presentations here, in the spirit of, well, why the hell not.

CPS Soup

A functional intermediate language

10 May 2023 – Spritely

Andy Wingo

Igalia, S.L.

Last week I gave a training talk to Spritely Institute collaborators on the intermediate representation used by Guile's compiler.

CPS Soup

Compiler: Front-end to Middle-end to Back-end

Middle-end spans gap between high-level source code (AST) and low-level machine code

Programs in middle-end expressed in intermediate language

CPS Soup is the language of Guile’s middle-end

An intermediate representation (IR) (or intermediate language, IL) is just another way to express a computer program. Specifically it's the kind of language that is appropriate for the middle-end of a compiler, and by "appropriate" I mean that an IR serves a purpose: there has to be a straightforward transformation to the IR from high-level abstract syntax trees (ASTs) from the front-end, and there has to be a straightforward translation from IR to machine code.

There is also usually a set of necessary source-to-source transformations on IR to "lower" it, meaning to make it closer to the back-end than to the front-end. There is usually a set of optional transformations to the IR to make the program run faster or allocate less memory or be simpler: these are the optimizations.

"CPS soup" is Guile's IR. This talk presents the essentials of CPS soup in the context of more traditional IRs.

How to lower?


(+ 1 (if x 42 69))


  cmpi $x, #f
  je L1
  movi $t, 42
  j L2
L1:
  movi $t, 69
L2:
  addi $t, 1

How to get from here to there?

Before we dive in, consider what we might call the dynamic range of an intermediate representation: we start with what is usually an algebraic formulation of a program and we need to get down to a specific sequence of instructions operating on registers (unlimited in number, at this stage; allocating to a fixed set of registers is a back-end concern), with explicit control flow between them. What kind of a language might be good for this? Let's attempt to answer the question by looking into what the standard solutions are for this problem domain.


Control-flow graph (CFG)

graph := array<block>
block := tuple<preds, succs, insts>
inst  := goto B
       | if x then BT else BF
       | z = const C
       | z = add x, y

BB0: if x then BB1 else BB2
BB1: t = const 42; goto BB3
BB2: t = const 69; goto BB3
BB3: t2 = addi t, 1; ret t2

Assignment, not definition

Of course in the early days, there was no intermediate language; compilers translated ASTs directly to machine code. It's been a while since I dove into all this but the milestone I have in my head is that it's the 70s when compiler middle-ends come into their own right, with Fran Allen's work on flow analysis and optimization.

In those days the intermediate representation for a compiler was a graph of basic blocks, but unlike today the paradigm was assignment to locations rather than definition of values. By that I mean that in our example program, we get t assigned to in two places (BB1 and BB2); the actual definition of t is implicit, as a storage location, and our graph consists of assignments to the set of storage locations in the program.


Static single assignment (SSA) CFG

graph := array<block>
block := tuple<preds, succs, phis, insts>
phi   := z := φ(x, y, ...)
inst  := z := const C
       | z := add x, y
BB0: if x then BB1 else BB2
BB1: v0 := const 42; goto BB3
BB2: v1 := const 69; goto BB3
BB3: v2 := φ(v0,v1); v3 := addi v2, 1; ret v3

Phi is phony function: v2 is v0 if coming from first predecessor, or v1 from second predecessor

These days we still live in Fran Allen's world, but with a twist: we no longer model programs as graphs of assignments, but rather graphs of definitions. The introduction in the mid-80s of so-called "static single-assignment" (SSA) form graphs means that instead of having two assignments to t, we would define two different values v0 and v1. Then later instead of reading the value of the storage location associated with t, we define v2 to be either v0 or v1: the former if we reach the use of t in BB3 from BB1, the latter if we are coming from BB2.

If you think on the machine level, in terms of what the resulting machine code will be, this either function isn't a real operation; probably register allocation will put v0, v1, and v2 in the same place, say $rax. The function linking the definition of v2 to the inputs v0 and v1 is purely notational; in a way, you could say that it is phony, or not real. But when the creators of SSA went to submit this notation for publication they knew that they would need something that sounded more rigorous than "phony function", so they instead called it a "phi" (φ) function. Really.

2003: MLton

Refinement: phi variables are basic block args

graph := array<block>
block := tuple<preds, succs, args, insts>

Inputs of phis implicitly computed from preds

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

SSA is still where it's at, as a conventional solution to the IR problem. There have been some refinements, though. I learned of one of them from MLton; I don't know if they were first but they had the idea of interpreting phi variables as arguments to basic blocks. In this formulation, you don't have explicit phi instructions; rather the "v2 is either v1 or v0" property is expressed by v2 being a parameter of a block which is "called" with either v0 or v1 as an argument. It's the same semantics, but an interesting notational change.

Refinement: Control tail

Often nice to know how a block ends (e.g. to compute phi input vars)

graph := array<block>
block := tuple<preds, succs, args, insts, control>
control := if v then L1 else L2
         | L(v, ...)
         | switch(v, L1, L2, ...)
         | ret v

One other refinement to SSA is to note that basic blocks consist of some number of instructions that can define values or have side effects but which otherwise exhibit fall-through control flow, followed by a single instruction that transfers control to another block. We might as well store that control instruction separately; this would let us easily know how a block ends, and in the case of phi block arguments, easily say what values are the inputs of a phi variable. So let's do that.

Refinement: DRY

Block successors directly computable from control

Predecessors graph is inverse of successors graph

graph := array<block>
block := tuple<args, insts, control>

Can we simplify further?

At this point we notice that we are repeating ourselves; the successors of a block can be computed directly from the block's terminal control instruction. Let's drop those as a distinct part of a block, because when you transform a program it's unpleasant to have to needlessly update something in two places.

While we're doing that, we note that the predecessors array is also redundant, as it can be computed from the graph of block successors. Here we start to wonder: am I simplifying or am I removing something that is fundamental to the algorithmic complexity of the various graph transformations that I need to do? We press on, though, hoping we will get somewhere interesting.
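To see why both arrays are redundant, here is a short sketch that derives them from the control instruction alone. The tuple encoding and the function names are assumptions for this example; labels are array indices, as in the `graph := array<block>` formulation.

```python
from collections import defaultdict

# Hypothetical encoding: block = (args, insts, control), where control is
# ("if", var, L1, L2), ("jump", L, vars), ("switch", var, [L, ...]) or ("ret", var).
GRAPH = [
    (["a0"], [], ("if", "a0", 1, 2)),
    ([], [("const", "v0", 42)], ("jump", 3, ["v0"])),
    ([], [("const", "v1", 69)], ("jump", 3, ["v1"])),
    (["v2"], [("addi", "v3", "v2", 1)], ("ret", "v3")),
]


def successors(block):
    """Read the successor labels off a block's control instruction."""
    control = block[2]
    kind = control[0]
    if kind == "if":
        return [control[2], control[3]]
    if kind == "jump":
        return [control[1]]
    if kind == "switch":
        return list(control[2])
    return []  # "ret" leaves the function


def predecessors(graph):
    """Invert the successors relation over the whole graph."""
    preds = defaultdict(list)
    for label, block in enumerate(graph):
        for succ in successors(block):
            preds[succ].append(label)
    return dict(preds)
```

The cost hinted at in the text is visible here: `successors` is a constant-time lookup, but `predecessors` walks the whole graph, so a pass that needs predecessors would compute them once and keep the result around.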

Basic blocks are annoying

Ceremony about managing insts; array or doubly-linked list?

Nonuniformity: “local” vs “global” transformations

Optimizations transform graph A to graph B; mutability complicates this task

  • Desire to keep A in mind while making B
  • Bugs because of spooky action at a distance

Recall that the context for this meander is Guile's compiler, which is written in Scheme. Scheme doesn't have expandable arrays built-in. You can build them, of course, but it is annoying. Also, in Scheme-land, functions with side-effects are conventionally suffixed with an exclamation mark; after too many of them, both the writer and the reader get fatigued. I know it's a silly argument but it's one of the things that made me grumpy about basic blocks.

If you permit me to continue with this introspection, I find there is an uneasy relationship between instructions and locations in an IR that is structured around basic blocks. Do instructions live in a function-level array and a basic block is an array of instruction indices? How do you get from instruction to basic block? How would you hoist an instruction to another basic block, might you need to reallocate the block itself?

And when you go to transform a graph of blocks... well how do you do that? Is it in-place? That would be efficient; but what if you need to refer to the original program during the transformation? Might you risk reading a stale graph?

It seems to me that there are too many concepts, that in the same way that SSA itself moved away from assignment to a more declarative language, that perhaps there is something else here that might be more appropriate to the task of a middle-end.

Basic blocks, phi vars redundant

Blocks: label with args sufficient; “containing” multiple instructions is superfluous

Unify the two ways of naming values: every var is a phi

graph := array<block>
block := tuple<args, inst>
inst  := L(expr)
       | if v then L1() else L2()
expr  := const C
       | add x, y

I took a number of tacks here, but the one I ended up on was to declare that basic blocks themselves are redundant. Instead of containing an array of instructions with fallthrough control-flow, why not just make every instruction a control instruction? (Yes, there are arguments against this, but do come along for the ride, we get to a funny place.)

While you are doing that, you might as well unify the two ways in which values are named in an MLton-style compiler: instead of distinguishing between basic block arguments and values defined within a basic block, we might as well make all names into basic block arguments.

Arrays annoying

Array of blocks implicitly associates a label with each block

Optimizations add and remove blocks; annoying to have dead array entries

Keep labels as small integers, but use a map instead of an array

graph := map<label, block>

In the traditional SSA CFG IR, a graph transformation would often not touch the structure of the graph of blocks. But now having given each instruction its own basic block, we find that transformations of the program necessarily change the graph. Consider an instruction that we elide; before, we would just remove it from its basic block, or replace it with a no-op. Now, we have to find its predecessor(s), and forward them to the instruction's successor. It would be useful to have a more capable data structure to represent this graph. We might as well keep labels as being small integers, but allow for sparse maps and growth by using an integer-specialized map instead of an array.
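Eliding an instruction in this form can be sketched as a function from one map to another. The encoding below is an assumption for illustration: terms are `("continue", label, expr)` or `("if", var, label, label)`, a hypothetical `("values", vars)` expression forwards variables unchanged, and the label `None` stands for returning to the caller. A plain Python dict stands in for the integer-specialized map.

```python
def retarget(term, old, new):
    """Return a copy of `term` with jumps to `old` redirected to `new`."""
    def fix(label):
        return new if label == old else label
    if term[0] == "if":
        return ("if", term[1], fix(term[2]), fix(term[3]))
    return ("continue", fix(term[1]), term[2])


def elide(graph, label):
    """Remove a block that only forwards its arguments to its successor.

    Builds a new map and leaves `graph` untouched, so the original
    program stays available while the new one is constructed.
    """
    args, term = graph[label]
    if term[0] != "continue" or term[2] != ("values", args):
        raise ValueError("only trivial forwarding blocks can be elided")
    succ = term[1]
    return {l: (a, retarget(t, label, succ))
            for l, (a, t) in graph.items() if l != label}


# Labels are small integers but need not be contiguous.
GRAPH = {
    0: ([], ("continue", 1, ("const", 42))),
    1: (["v"], ("continue", 5, ("values", ["v"]))),   # forwards v unchanged
    5: (["w"], ("continue", None, ("values", ["w"]))),
}
SMALLER = elide(GRAPH, 1)
```

After the call, `SMALLER` has no entry for label 1 and block 0 continues straight to block 5, while `GRAPH` still holds the program as it was before the transformation.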

This is CPS soup

graph := map<label, cont>
cont  := tuple<args, term>
term  := continue to L
           with values from expr
       | if v then L1() else L2()
expr  := const C
       | add x, y


This is exactly what CPS soup is! We came at it "from below", so to speak; instead of the heady fumes of the lambda calculus, we get here from down-to-earth basic blocks. (If you prefer the other way around, you might enjoy this article from a long time ago.) The remainder of this presentation goes deeper into what it is like to work with CPS soup in practice.
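As a rough illustration of the grammar above, the running example can be written out and interpreted in a few lines. This is a sketch under stated assumptions, not Guile's implementation: the tuple encoding, the `run_soup` helper, and the use of the label `None` to mean "return to the caller" are all made up for the example.

```python
# Hypothetical encoding of the grammar above:
#   graph : dict label -> (args, term)
#   term  : ("continue", L, expr) | ("if", v, L1, L2)
#   expr  : ("const", C) | ("add", x, y)
SOUP = {
    0: (["x"], ("if", "x", 1, 2)),
    1: ([], ("continue", 3, ("const", 42))),
    2: ([], ("continue", 3, ("const", 69))),
    3: (["t"], ("continue", 4, ("const", 1))),
    4: (["one"], ("continue", None, ("add", "t", "one"))),
}


def run_soup(graph, entry, values):
    """Follow continuations until control returns to the caller."""
    env = {}
    label, values = entry, list(values)
    while label is not None:
        args, term = graph[label]
        env.update(zip(args, values))     # bind incoming values to names
        if term[0] == "if":
            _, var, if_true, if_false = term
            label, values = (if_true if env[var] else if_false), []
        else:
            _, label, expr = term
            if expr[0] == "const":
                values = [expr[1]]
            else:                         # "add"
                values = [env[expr[1]] + env[expr[2]]]
    return values
```

Every name in the program (`x`, `t`, `one`) is an argument of some continuation, and every entry in the map holds exactly one term, which is the shape the preceding slides were converging on.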

Scope and dominators

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

What vars are “in scope” at BB3? a0 and v2.

Not v0; not all paths from BB0 to BB3 define v0.

a0 always defined: its definition dominates all uses.

BB0 dominates BB3: All paths to BB3 go through BB0.

Before moving on, though, we should discuss what it means in an SSA-style IR that variables are defined rather than assigned. If you consider variables as locations to which values can be assigned and which initially hold garbage, you can read them at any point in your program. You might get garbage, though, if the variable wasn't assigned something sensible on the path that led to reading the location's value. It sounds bonkers but it is still the C and C++ semantic model.

If we switch instead to a definition-oriented IR, then a variable never has garbage; the single definition always precedes any uses of the variable. That is to say that all paths from the function entry to the use of a variable must pass through the variable's definition, or, in the jargon, that definitions dominate uses. This is an invariant of an SSA-style IR, that all variable uses be dominated by their associated definition.

You can flip the question around to ask what variables are available for use at a given program point, which might be read equivalently as which variables are in scope; the answer is, all definitions from all program points that dominate the use site. The "CPS" in "CPS soup" stands for continuation-passing style, a dialect of the lambda calculus, which also has a history of use as a compiler intermediate representation. But it turns out that if we use the lambda calculus in its conventional form, we end up needing to maintain a lexical scope nesting at the same time that we maintain the control-flow graph, and the lexical scope tree can fail to reflect the dominator tree. I go into this topic in more detail in an old article, and if it interests you, please do go deep.
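The "in scope means defined by a dominator" rule can be checked mechanically on the slide's example. The sketch below uses the textbook iterative dataflow formulation of dominators, which is one common approach and not necessarily what Guile does; the table names `SUCCS`, `PARAMS`, and `BODY_DEFS` are assumptions for the example.

```python
def dominators(succs, entry):
    """Map each label to the set of labels that dominate it."""
    labels = set(succs)
    preds = {label: set() for label in labels}
    for label, targets in succs.items():
        for target in targets:
            preds[target].add(label)
    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {label: set(labels) for label in labels}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for label in labels - {entry}:
            incoming = [dom[p] for p in preds[label]]
            new = {label} | (set.intersection(*incoming) if incoming else set())
            if new != dom[label]:
                dom[label] = new
                changed = True
    return dom


SUCCS = {"BB0": ["BB1", "BB2"], "BB1": ["BB3"], "BB2": ["BB3"], "BB3": []}
PARAMS = {"BB0": ["a0"], "BB1": [], "BB2": [], "BB3": ["v2"]}
BODY_DEFS = {"BB0": [], "BB1": ["v0"], "BB2": ["v1"], "BB3": ["v3"]}


def in_scope_at_entry(label):
    """Variables usable by the first instruction of `label`."""
    names = set()
    for d in dominators(SUCCS, "BB0")[label]:
        names.update(PARAMS[d])
        if d != label:          # the block's own body has not run yet
            names.update(BODY_DEFS[d])
    return sorted(names)
```

For `BB3` this gives `a0` and `v2`, matching the slide: `v0` and `v1` are excluded because neither `BB1` nor `BB2` dominates `BB3`.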

CPS soup in Guile

Compilation unit is intmap of label to cont

cont := $kargs names vars term
      | ...
term := $continue k src expr
      | ...
expr := $const C
      | $primcall 'add #f (a b)
      | ...

Conventionally, entry point is lowest-numbered label

Anyway! In Guile, the concrete form that CPS soup takes is that a program is an intmap of label to cont. A cont is the smallest labellable unit of code. You can call them blocks if that makes you feel better. One kind of cont, $kargs, binds incoming values to variables. It has a list of variables, vars, and also has an associated list of human-readable names, names, for debugging purposes.

A $kargs contains a term, which is like a control instruction. One kind of term is $continue, which passes control to a continuation k. Using our earlier language, this is just goto *k*, with values, as in MLton. (The src is a source location for the term.) The values come from the term's expr, of which there are a dozen kinds or so, for example $const which passes a literal constant, or $primcall, which invokes some kind of primitive operation, which above is add. The primcall may have an immediate operand, in this case #f, and some variables that it uses, in this case a and b. The number and type of the produced values is a property of the primcall; some are just for effect, some produce one value, some more.

CPS soup

term := $continue k src expr
      | $branch kf kt src op param args
      | $switch kf kt* src arg
      | $prompt k kh src escape? tag
      | $throw src op param args

Expressions can have effects, produce values

expr := $const val
      | $primcall name param args
      | $values args
      | $call proc args
      | ...

There are other kinds of terms besides $continue: there is $branch, which proceeds either to the false continuation kf or the true continuation kt depending on the result of performing op on the variables args, with immediate operand param. In our running example, we might have made the initial term via:

  (build-term
    ($branch BB1 BB2 'false? #f (a0)))

The definition of build-term (and build-cont and build-exp) is in the (language cps) module.

There is also $switch, which takes an unboxed unsigned integer arg and performs an array dispatch to the continuations in the list kt, or kf otherwise.

There is $prompt which continues to its k, having pushed on a new continuation delimiter associated with the var tag; if code aborts to tag before the prompt exits via an unwind primcall, the stack will be unwound and control passed to the handler continuation kh. If escape? is true, the continuation is escape-only and aborting to the prompt doesn't need to capture the suspended continuation.

Finally there is $throw, which doesn't continue at all, because it causes a non-resumable exception to be thrown. And that's it; it's just a handful of kinds of term, determined by the different shapes of control-flow (how many continuations the term has).

When it comes to values, we have about a dozen expression kinds. We saw $const and $primcall, but I want to explicitly mention $values, which simply passes on some number of values. Often a $values expression corresponds to passing an input to a phi variable, though $kargs vars can get their definitions from any expression that produces the right number of values.
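As a way to see how these pieces fit together, here is a toy Python model of a few cont, term and expression kinds, with a tiny interpreter for the running example. This is only a sketch: the class names mirror the grammar above, but the src fields are dropped, the "addi" primcall and the dispatch tables are hypothetical, and Guile's real definitions live in Scheme in the (language cps) module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:            # $const val
    val: object

@dataclass(frozen=True)
class Primcall:         # $primcall name param args
    name: str
    param: object
    args: tuple

@dataclass(frozen=True)
class Values:           # $values args
    args: tuple

@dataclass(frozen=True)
class Continue:         # $continue k src expr (src omitted)
    k: int
    expr: object

@dataclass(frozen=True)
class Branch:           # $branch kf kt src op param args (src omitted)
    kf: int
    kt: int
    op: str
    param: object
    args: tuple

@dataclass(frozen=True)
class Kargs:            # $kargs names vars term
    names: tuple
    vars: tuple
    term: object

@dataclass(frozen=True)
class Ktail:            # $ktail: return from the function
    pass


# Hypothetical primitive tables, just enough for the example.
PRIMCALLS = {"addi": lambda param, x: x + param}
BRANCH_OPS = {"false?": lambda param, x: x is False}

# The running example; labels and vars are small integers.
PROGRAM = {
    0: Kargs(("a0",), (0,), Branch(1, 2, "false?", None, (0,))),
    1: Kargs((), (), Continue(3, Const(42))),
    2: Kargs((), (), Continue(3, Const(69))),
    3: Kargs(("v2",), (1,), Continue(4, Primcall("addi", 1, (1,)))),
    4: Kargs(("v3",), (2,), Continue(5, Values((2,)))),
    5: Ktail(),
}


def evaluate(expr, env):
    """Return the tuple of values that an expression produces."""
    if isinstance(expr, Const):
        return (expr.val,)
    if isinstance(expr, Values):
        return tuple(env[a] for a in expr.args)
    if isinstance(expr, Primcall):
        return (PRIMCALLS[expr.name](expr.param, *(env[a] for a in expr.args)),)
    raise TypeError(f"unhandled expression: {expr!r}")


def run(program, entry, args):
    """Follow continuations from `entry` until reaching a Ktail."""
    env, label, values = {}, entry, tuple(args)
    while True:
        cont = program[label]
        if isinstance(cont, Ktail):
            return values
        assert len(values) == len(cont.vars), "wrong number of values"
        env.update(zip(cont.vars, values))      # bind incoming values to vars
        term = cont.term
        if isinstance(term, Branch):
            test = BRANCH_OPS[term.op](term.param, *(env[a] for a in term.args))
            label, values = (term.kt if test else term.kf), ()
        else:
            label, values = term.k, evaluate(term.expr, env)
```

Note how cont 3 plays the role of the phi: it binds v2 to whatever value its predecessor passed, so run(PROGRAM, 0, (True,)) yields (43,) and run(PROGRAM, 0, (False,)) yields (70,).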

Kinds of continuations

Guile functions untyped, can return multiple values

Error if too few values, possibly truncate too many values, possibly cons as rest arg...

Calling convention: contract between val producer & consumer

  • both on call and return side

Continuation of $call unlike that of $const

When a $continue term continues to a $kargs with a $const 42 expression, there are a number of invariants that the compiler can ensure: that the $kargs continuation is always passed the expected number of values, that the vars that it binds can be allocated to specific locations (e.g. registers), and that because all predecessors of the $kargs are known, that those predecessors can place their values directly into the variable's storage locations. Effectively, the compiler determines a custom calling convention between each $kargs and its predecessors.

Consider the $call expression, though; in general you don't know what the callee will do to produce its values. You don't even generally know that it will produce the right number of values. Therefore $call can't (in general) continue to $kargs; instead it continues to $kreceive, which expects the return values in well-known places. $kreceive will check that it is getting the right number of values and then continue to a $kargs, shuffling those values into place. A standard calling convention defines how functions return values to callers.
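The kind of check that a $kreceive performs can be sketched in Python as follows. This is an illustration of the policy described on the slide (error on too few values, possibly truncate, possibly collect a rest argument), not Guile's implementation; the function name and its parameters are assumptions.

```python
def receive_values(values, nreq, rest=False, allow_truncate=False):
    """Adapt values returned by an unknown callee to what a continuation binds.

    nreq: number of required values; rest: collect surplus values in a list;
    allow_truncate: silently drop surplus values instead of raising.
    """
    values = tuple(values)
    if len(values) < nreq:
        raise ValueError(f"expected at least {nreq} values, got {len(values)}")
    required, extra = values[:nreq], values[nreq:]
    if rest:
        return required + (list(extra),)    # surplus values become a rest list
    if extra and not allow_truncate:
        raise ValueError(f"expected {nreq} values, got {len(values)}")
    return required                         # surplus values, if any, are dropped
```

The tuple that comes back has exactly the shape the following $kargs expects, which is the "shuffling into place" step in miniature.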

The conts

cont := $kfun src meta self ktail kentry
      | $kclause arity kbody kalternate
      | $kargs names syms term
      | $kreceive arity kbody
      | $ktail

$kclause, $kreceive very similar

Continue to $ktail: return

$call and return (and $throw, $prompt) exit first-order flow graph

Of course, a $call expression could be a tail-call, in which case it would continue instead to $ktail, indicating an exit from the first-order function-local control-flow graph.

The calling convention also specifies how to pass arguments to callees, and likewise those continuations have a fixed calling convention; in Guile we start functions with $kfun, which has some metadata attached, and then proceed to $kclause which bridges the boundary between the standard calling convention and the specialized graph of $kargs continuations. (Many details of this could be tweaked, for example that the case-lambda dispatch built-in to $kclause could instead dispatch to distinct functions instead of to different places in the same function; historical accidents abound.)

As a detail, if a function is well-known, in that all its callers are known, then we can lighten the calling convention, moving the argument-count check to callees. In that case $kfun continues directly to $kargs. Similarly for return values, optimizations can make $call continue to $kargs, though there is still some value-shuffling to do.

High and low

CPS bridges AST (Tree-IL) and target code

High-level: vars in outer functions in scope

Closure conversion between high and low

Low-level: Explicit closure representations; access free vars through closure

CPS soup is the bridge between parsed Scheme and machine code. It starts out quite high-level, notably allowing for nested scope, in which expressions can directly refer to free variables. Variables are small integers, and for high-level CPS, variable indices have to be unique across all functions in a program. CPS gets lowered via closure conversion, which chooses specific representations for each closure that remains after optimization. After closure conversion, all variable access is local to the function; free variables are accessed via explicit loads from a function's closure.

Optimizations at all levels

Optimizations before and after lowering

Some exprs only present in one level

Some high-level optimizations can merge functions (higher-order to first-order)

Because of the broad remit of CPS, the language itself has two dialects, high and low. The high level dialect has cross-function variable references, first-class abstract functions (whose representation hasn't been chosen), and recursive function binding. The low-level dialect has only specific ways to refer to functions: labels and specific closure representations. It also includes calls to function labels instead of just function values. But these are minor variations; some optimization and transformation passes can work on either dialect.


Intmap, intset: Clojure-style persistent functional data structures

Program: intmap<label,cont>

Optimization: program→program

Identify functions: (program,label)→intset<label>

Edges: intmap<label,intset<label>>

Compute succs: (program,label)→edges

Compute preds: edges→edges

I mentioned that programs were intmaps, and specifically in Guile they are Clojure/Bagwell-style persistent functional data structures. By functional I mean that intmaps (and intsets) are values that can't be mutated in place (though we do have the transient optimization).

I find that immutability has the effect of deploying a sense of calm to the compiler hacker -- I don't need to worry about data structures changing out from under me; instead I just structure all the transformations that I need to do as functions. An optimization is just a function that takes an intmap and produces another intmap. An analysis associating some data with each program label is just a function that computes an intmap, given a program; that analysis will never be invalidated by subsequent transformations, because the program to which it applies will never be mutated.
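For instance, the "compute succs" and "compute preds" signatures from the slide can be sketched as two pure functions. This Python version uses plain dicts and frozensets as stand-ins for intmaps and intsets (they are not persistent data structures), and the successors_of parameter is a hypothetical helper that extracts the successor labels of a cont.

```python
def compute_successors(program, entry, successors_of):
    """(program, label) -> edges, for the conts reachable from `entry`."""
    edges = {}
    worklist = [entry]
    while worklist:
        label = worklist.pop()
        if label in edges:
            continue
        succs = frozenset(successors_of(program[label]))
        edges[label] = succs
        worklist.extend(succs)
    return edges


def compute_predecessors(succs):
    """edges -> edges: invert a successor map without modifying it."""
    preds = {label: set() for label in succs}
    for label, targets in succs.items():
        for target in targets:
            preds[target].add(label)
    return {label: frozenset(sources) for label, sources in preds.items()}


# Toy program in which each "cont" is just its tuple of successor labels.
TOY_PROGRAM = {0: (1, 2), 1: (3,), 2: (3,), 3: ()}
TOY_SUCCS = compute_successors(TOY_PROGRAM, 0, lambda cont: cont)
TOY_PREDS = compute_predecessors(TOY_SUCCS)
```

Neither function touches its input, so TOY_SUCCS stays valid for as long as TOY_PROGRAM itself is not replaced by a transformed program.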

This pervasive feeling of calm allows me to tackle problems that I wouldn't have otherwise been able to fit into my head. One example is the novel online CSE pass; one day I'll either wrap that up as a paper or just capitulate and blog it instead.

Flow analysis

A[k] = meet(A[p] for p in preds[k])
         - kill[k] + gen[k]

Compute available values at labels:

  • A: intmap<label,intset<val>>
  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union
  • kill[k]: values invalidated by cont because of side effects
  • gen[k]: values defined at k

But to keep it concrete, let's take the example of flow analysis. You might want to compute "available values" at a given label: these are the values that are candidates for common subexpression elimination. For example, if a term is dominated by a car x primcall whose value is bound to v, and no path from the definition of v to a subsequent car x primcall invalidates that value, we can replace that second duplicate operation with $values (v) instead.

There is a standard solution for this problem, which is to solve the flow equation above. I wrote about this at length ages ago, but looking back on it, the thing that pleases me is how easy it is to decompose the task of flow analysis into manageable parts, and how the types tell you exactly what you need to do. It's easy to compute an initial analysis A, easy to define your meet function when your maps and sets have built-in intersect and union operators, easy to define what addition and subtraction mean over sets, and so on.
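Here is a minimal Python sketch of solving the flow equation above by iterating to a fixpoint. It uses frozensets in place of intsets and a plain dict in place of the intmap, assumes the entry label has no predecessors, and the function and table names are made up; Guile's actual solver is written in Scheme and differs in detail.

```python
def solve_available(preds, gen, kill, entry):
    """Solve A[k] = meet(A[p] for p in preds[k]) - kill[k] + gen[k]."""
    universe = frozenset().union(*gen.values())
    # Optimistic start: everything is available, except at the entry.
    avail = {k: universe for k in preds}
    avail[entry] = frozenset(gen[entry])
    changed = True
    while changed:
        changed = False
        for k in preds:
            if k == entry:
                continue
            incoming = [avail[p] for p in preds[k]]
            met = frozenset.intersection(*incoming) if incoming else frozenset()
            new = (met - kill[k]) | gen[k]
            if new != avail[k]:
                avail[k] = new
                changed = True
    return avail


# Diamond-shaped graph: 0 branches to 1 and 2, which both continue to 3.
PREDS = {0: frozenset(), 1: frozenset({0}), 2: frozenset({0}), 3: frozenset({1, 2})}
GEN = {0: frozenset({"car x"}), 1: frozenset({"cdr x"}), 2: frozenset(), 3: frozenset()}
NO_KILL = {k: frozenset() for k in PREDS}
# Variant in which label 2 has a side effect that invalidates "car x".
KILL_AT_2 = dict(NO_KILL)
KILL_AT_2[2] = frozenset({"car x"})
```

With NO_KILL, "car x" is still available at label 3, so a second car x there could be replaced. With KILL_AT_2 it is not, because one of the two paths into label 3 invalidated it.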

Persistent data structures FTW

  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union

Naïve: O(nconts * nvals)

Structure-sharing: O(nconts * log(nvals))

Computing an analysis isn't free, but it is manageable in cost: the structure-sharing means that meet is usually trivial (for fallthrough control flow) and the cost of + and - is proportional to the log of the problem size.

CPS soup: strengths

Relatively uniform, orthogonal

Facilitates functional transformations and analyses, lowering mental load: “I just have to write a function from foo to bar; I can do that”

Encourages global optimizations

Some kinds of bugs prevented by construction (unintended shared mutable state)

We get the SSA optimization literature

Well, we're getting to the end here, and I want to take a step back. Guile has used CPS soup as its middle-end IR for about 8 years now, enough time to appreciate its fine points while also understanding its weaknesses.

On the plus side, it has what to me is a kind of low cognitive overhead, and I say that not just because I came up with it: Guile's development team is small and not particularly well-resourced, and we can't afford complicated things. The simplicity of CPS soup works well for our development process (flawed though that process may be!).

I also like how, by having every variable be potentially a phi, any optimization that we implement will be global (i.e. not local to a basic block) by default.

Perhaps best of all, we get these benefits while also being able to use the existing SSA transformation literature. Because CPS is SSA, the lessons learned in SSA (e.g. loop peeling) apply directly.

CPS soup: weaknesses

Pointer-chasing, indirection through intmaps

Heavier than basic blocks: more control-flow edges

Names bound at continuation only; phi predecessors share a name

Over-linearizes control, relative to sea-of-nodes

Overhead of re-computation of analyses

CPS soup is not without its drawbacks, though. It's not suitable for JIT compilers, because it imposes some significant constant-factor (and sometimes algorithmic) overheads. You are always indirecting through intmaps and intsets, and these data structures involve significant pointer-chasing.

Also, there are some forms of lightweight flow analysis that can be performed naturally on a graph of basic blocks without looking too much at the contents of the blocks; for example in our available variables analysis you could run it over blocks instead of individual instructions. In these cases, basic blocks themselves are an optimization, as they can reduce the size of the problem space, with corresponding reductions in time and memory use for analyses and transformations. Of course you could overlay a basic block graph on top of CPS soup, but it's not a well-worn path.

There is a little detail that not all phi predecessor values have names, since names are bound at successors (continuations). But this is a detail; if these names are important, little $values trampolines can be inserted.

Probably the main drawback as an IR is that the graph of conts in CPS soup over-linearizes the program. There are other intermediate representations that don't encode ordering constraints where there are none; perhaps it would be useful to marry CPS soup with sea-of-nodes, at least during some transformations.

Finally, CPS soup does not encourage a style of programming where an analysis is incrementally kept up to date as a program is transformed in small ways. The result is that we end up performing much redundant computation within each individual optimization pass.


CPS soup is SSA, distilled

Labels and vars are small integers

Programs map labels to conts

Conts are the smallest labellable unit of code

Conts can have terms that continue to other conts

Compilation simplifies and lowers programs

Wasm vs VM backend: a question for another day :)

But all in all, CPS soup has been good for Guile. It's just SSA by another name, in a simpler form, with a functional flavor. Or, it's just CPS, but first-order only, without lambda.

In the near future, I am interested in seeing what a new GC will do for CPS soup; will bump-pointer allocation palliate some of the costs of pointer-chasing? We'll see. A tricky thing about CPS soup is that I don't think that anyone else has tried it in other languages, so it's hard to objectively understand its characteristics independent of Guile itself.

Finally, it would be nice to engage in the academic conversation by publishing a paper somewhere; I would like to see interesting criticism, and blog posts don't really participate in the citation graph. But in the limited time available to me, faced with the choice between hacking on something and writing a paper, it's always been hacking, so far :)

Speaking of limited time, I probably need to hit publish on this one and move on. Happy hacking to all, and until next time.

by Andy Wingo at May 20, 2023 07:10 AM

May 18, 2023

José Dapena

Javascript memory profiling with heap snapshot

In both the web and NodeJS worlds, the main runtime for executing program logic is the Javascript runtime. Because of that, a huge number of applications and user interfaces use it. Like any software component, Javascript code uses system resources, which are not unlimited. We should be careful when using CPU time, application storage, or memory.

In this blog post we are going to focus on the latter.

Where’s my memory!

Usually a web page does not allocate that many objects, so they do not eat a huge amount of memory on a modern, beefy computer. But we find problems like:

  • Oh, but I don’t have a single web page loaded. I like those 40-80 tabs all open for some reason… Well, no, there’s no reason for that! But that’s another topic.
  • Many users are not using beefy phones or computers. So using memory has an impact on what they can do.

The user may not be happy with the web application developer implementation choices. And this developer may want to be… more efficient. Do something.

Where’s my memory! The cloud strikes back

Now… Think about the cloud providers. And developers implementing software using NodeJS in the cloud. The contract with the provider may limit the available memory… Or charge money depending on the actual usage.

So… An innocent script that takes 10MB, but is run thousands or millions of times for a few seconds. That is expensive!

These developers will need to make their apps… again, more efficient.

A new hope

In performance problems, we usually want to have reliable data about what is happening, and when. Memory problems are no different. We need some observability of the memory usage.

Chromium and NodeJS share their Javascript runtime, V8, and it provides some tools to help with memory investigation.

In this post we are going to focus on the family of tools around a V8 feature named heap snapshot, that allows capturing the memory usage at any time in a Javascript execution context.

About the heap

❗ This is a quick recap of how the Javascript heap works; you can skip it if you want.

In the V8 Javascript runtime, variables, no matter their scope, are allocated on a heap. Whether it is a number, a string, an object or a function, all of them are stored there. Not only that, in V8 even the code is stored in the heap.

But, in Javascript, memory is freed lazily, by garbage collection. This means that, when an object is not used anymore, its memory is not immediately released. The garbage collector will later work out which objects are disposable, and free them when it is convenient.

How do we know if an object is still used? The idea is simple: objects are used if they can be accessed. To find out which ones, the runtime will take the root objects, and explore recursively all the object references. Any object that has not been found in that exploration can be discarded.

OK, and what is a root object? In a script it can be the objects in the global context. But also Javascript objects referenced from native objects.
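The reachability idea can be sketched in a few lines of Python. This is only a model of the marking step over a toy object graph, with made-up object names; it says nothing about how V8 actually schedules or implements collection.

```python
def reachable(roots, references):
    """Return the set of objects reachable from `roots`.

    `references` maps each object to the objects it refers to.
    """
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(references.get(obj, ()))
    return marked


# Toy heap: the two "orphan" objects refer to each other, but no root
# leads to them.
HEAP = {
    "global": ["app", "cache"],
    "app": ["config"],
    "cache": [],
    "config": [],
    "orphan": ["orphan_child"],
    "orphan_child": ["orphan"],
}
LIVE = reachable(["global"], HEAP)
GARBAGE = set(HEAP) - LIVE
```

Here GARBAGE ends up holding the two orphan objects: they still reference each other, but since nothing reachable from a root points at them, they can be discarded.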

More details of how the V8 garbage collector works are out of the scope of this post. If you want to learn more, this post should provide a good overview of current implementation: Trash talk: the Orinoco garbage collector.

Heap snapshot: how does it work?

OK, so we know all the Javascript memory allocation goes through the heap. And, as I said, heap snapshot is a tool to investigate memory problems.

The name is quite explicit about how it works. Heap snapshot will stop the Javascript execution, traverse all the heap, analyze it, and dump it in a meaningful format that can be investigated.

What kind of information does it have?

  • Which objects are in the heap, and their types.
  • How much memory each object takes.
  • The references between them, so we can understand which object is keeping another one from being disposed.
  • In some of the tools, it can also store the stack trace of the code that allocated that memory.

Those snapshots are stored in a JSON-based format, and they can be opened from the Chromium developer tools for analysis.
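Since the snapshot is JSON, it can also be inspected with a small script. The Python sketch below sums the self size of objects per type. It rests on an assumption about the file layout that this post does not describe: that the snapshot has a flat nodes array whose per-node fields are listed in snapshot.meta.node_fields, with the type names in the matching entry of snapshot.meta.node_types. Treat that layout as an assumption and check it against a real file; the EXAMPLE data is synthetic.

```python
import json
from collections import Counter


def self_size_by_type(snapshot):
    """Sum self_size per node type for a parsed heap snapshot (assumed layout)."""
    meta = snapshot["snapshot"]["meta"]
    fields = meta["node_fields"]
    type_at = fields.index("type")
    size_at = fields.index("self_size")
    type_names = meta["node_types"][type_at]
    nodes, stride = snapshot["nodes"], len(fields)
    totals = Counter()
    for start in range(0, len(nodes), stride):
        totals[type_names[nodes[start + type_at]]] += nodes[start + size_at]
    return totals


def load_snapshot(path):
    """Parse a snapshot file from disk; `path` is whatever file you captured."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Synthetic three-node snapshot following the assumed layout.
EXAMPLE = {
    "snapshot": {"meta": {
        "node_fields": ["type", "name", "id", "self_size", "edge_count"],
        "node_types": [["object", "string"], "string", "number", "number", "number"],
    }},
    "nodes": [0, 0, 1, 32, 0,
              1, 1, 3, 16, 0,
              0, 2, 5, 24, 0],
    "strings": ["Foo", "hello", "Bar"],
}
```

For real investigations the developer tools UI is far more capable; a script like this is mostly useful for quick comparisons between two captures.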

Heap snapshots from Chromium

In the Chromium browser, heap snapshots can be obtained from the Chrome developer tools, accessed through the Inspect option of the right-click menu.

This is common to any browser based on Chromium that exposes those developer tools locally or remotely.

Once the developer tools are visible, there is the Memory tab:

We can select three profiling types:

  • Heap snapshot: it just captures the heap at the specific moment it is captured.
  • Allocation instrumentation on timeline: this records all the allocations over time, in a session, making it possible to check the allocations that happened in a specific time range. This is quite expensive, and suitable only for short profiling sessions.
  • Allocation sampling: instead of capturing all allocations, this one records them with sampling. Not as accurate as allocation instrumentation, but very lightweight, so it can give a good approximation for a long profiling session.

In all cases, we will get a profiling report that we can analyze later.

Heap snapshots from NodeJS

Using Chromium dev tools UI

In NodeJS, we can attach the Chrome dev tools by passing --inspect through the command line or the NODE_OPTIONS environment variable. This will attach the inspector to NodeJS, but it does not stop execution. The variant --inspect-brk will break in the debugger at the start of the user script.

How does it work? It will open a port on localhost:9229, and then this can be accessed from the Chromium browser at the URL chrome://inspect. The UI allows users to select which hosts to listen to for Node sessions. The end point can be modified using --inspect=[HOST:]PORT, --inspect-brk=[HOST:]PORT or with the specific command line argument --inspect-port=[HOST:]PORT.

Once you attach the dev tools inspector, you can access the Memory tab as in the case of Chromium.

There is a problem, though, when we are using NODE_OPTIONS. All instances of NodeJS will take the same parameter, so they will try to attach to the same host and port. And only the first instance will get the port. So it is less useful than we would expect for a session running multiple NodeJS processes (which can happen just by using NPM or Yarn to run stuff).

Oh, but there are some tricks:

  • If you pass port 0 it will allocate a port (and report it through the console!). So you can inspect any arbitrary session (more details).
  • In POSIX systems such as Linux, the inspector will be enabled if the process receives SIGUSR1. It will listen on the default localhost:9229 unless a different setting is specified with --inspect-port=[HOST:]PORT (more details).

Using command line

Also, there are other ways to obtain heap snapshots directly, without using the developer tools UI. NodeJS accepts several command line parameters for triggering heap snapshot capture/profiling:

  • --heapsnapshot-near-heap-limit=N will dump a heap snapshot when the V8 heap is close to its maximum size limit. The N parameter is the number of times it will dump a new snapshot. This is important because, when V8 is reaching the heap limit, it will take measures to free memory through garbage collection, so in a pattern of growing usage we will hit the limit several times.
  • --heapsnapshot-signal=SIGNAL will dump heap snapshots every time the NodeJS process gets the UNIX signal SIGNAL.
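As an illustration of how these two options could be driven from a test harness, here is a Python sketch that builds the node command line and later asks the process for a snapshot by sending the signal. The helper names, the script name and the choice of SIGUSR2 are assumptions made for the example (SIGUSR1 is avoided because it enables the inspector, as noted above); it only makes sense on POSIX systems with node on the PATH.

```python
import os
import signal
import subprocess


def node_command(script, snapshot_signal="SIGUSR2", near_heap_limit=None):
    """Build an argv that runs `script` with heap snapshot capture enabled."""
    argv = ["node", f"--heapsnapshot-signal={snapshot_signal}"]
    if near_heap_limit is not None:
        argv.append(f"--heapsnapshot-near-heap-limit={near_heap_limit}")
    argv.append(script)
    return argv


def start(script, **options):
    """Launch node; the caller is responsible for waiting on the process."""
    return subprocess.Popen(node_command(script, **options))


def request_snapshot(process, snapshot_signal="SIGUSR2"):
    """Send the configured signal so that the process dumps a heap snapshot."""
    os.kill(process.pid, getattr(signal, snapshot_signal))


# Example use (not executed here):
#   proc = start("server.js", near_heap_limit=3)
#   ... exercise the application ...
#   request_snapshot(proc)
```

Each signal delivered this way should produce a new snapshot file, which makes it easy to capture one before and one after a suspicious operation.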

We can also record a heap profiling session from the start of the process to the end (the same kind of profiling we obtain from Dev Tools using the Allocation sampling option) using the command line option --heap-prof. This will continuously sample the memory allocations, and can be tuned using different command line parameters as documented here.

Analysis of heap snapshots

This post focuses on how to capture heap snapshots in different scenarios. But… once you have them… You will want to use that information to actually understand memory usage. Here are some good reads about how to use heap snapshots.

First, from Chrome DevTools documentation:

  • Memory terminology: it gives a great tour on how memory is allocated, and what heap snapshots try to represent.
  • Fix memory problems: this one provides some examples of how to use different tools in Chromium to understand memory usage, including some heap snapshot and profiling examples.
  • View snapshots: a high level view of the different heap snapshot and profiling tools.
  • How to Use the Allocation Profiler Tool: this one specific to the allocation profiler.

And then, from NodeJS, you also have a couple of interesting things:

  • Memory Diagnostics: some of this has been covered in this post, but still has an example of how to find a memory leak using Comparison.
  • Heap snapshot exercise: this is an exercise including a memory leak, that you can hunt with heap snapshot.


  • Memory is a valuable resource that Javascript (both web and NodeJS) application developers may want to profile.
  • As usual, when there are resource allocation problems, we need reliable and accurate information about what is happening and when.
  • V8 heap snapshots provide such information, integrated with Chromium and NodeJS.


In a follow up post, I will talk about several optimizations we worked on, that make V8 heap snapshot implementation faster. Stay tuned!


This work has been possible thanks to the sponsorship from Igalia and Bloomberg.

by José Dapena Paz at May 18, 2023 03:26 PM

May 11, 2023

Lucas Fryzek

Igalia’s Mesa 23.1 Contributions - Behind the Scenes

It’s an exciting time for Mesa as its next major release is unveiled this week. Igalia has played an important role in this milestone, with Eric Engestrom managing the release and 11 other Igalians contributing over 110 merge requests. A sample of these contributions is detailed below.

radv: Implement vk.check_status

As part of an effort to enhance the reliability of GPU resets on amdgpu, Tony implemented a GPU reset notification feature in the RADV Vulkan driver. This new function improves the robustness of the RADV driver. The driver can now check if the GPU has been reset by a userspace application, allowing the driver to recover their contexts, exit, or engage in some other appropriate action.

You can read more about Tony’s changes in the link below

turnip: KGSL backend rewrite

With a goal of improving feature parity of the KGSL kernel mode driver with its drm counterpart, Mark has been rewriting the backend for KGSL. These changes leverage the new, common backend Vulkan infrastructure inside Mesa and fix multiple bugs. In addition, they introduce support for importing/exporting sync FDs, pre-signalled fences, and timeline semaphore support.

If you’re interested in taking a deeper dive into Mark’s changes, you can read the following MR:

turnip: a7xx preparation, transition to C++

Danylo has adopted a significant role for two major changes inside turnip: 1) contributing to the effort to migrate turnip to C++ and 2) supporting the next generation a7xx Adreno GPUs from Qualcomm. A more detailed overview of Danylo’s changes can be found in the linked MRs below:

v3d/v3dv various fixes & CTS conformance

Igalia maintains the v3d OpenGL driver and v3dv Vulkan driver for Broadcom VideoCore GPUs, which can be found on devices such as the Raspberry Pi. Iago, Alex and Juan have combined their expertise to implement multiple fixes for both the v3d gallium driver and the v3dv Vulkan driver on the Raspberry Pi. These changes include CPU performance optimizations, support for 16-bit floating point vertex attributes, and raising support in the driver to OpenGL 3.1 level functionality. This Igalian trio has also been addressing fixes for conformance issues raised in the Vulkan 1.3.5 conformance test suite (CTS).

You can dive into some of their Raspberry Pi driver changes here:

ci, build system, and cleanup

In addition to managing the 23.1 release, Eric has also implemented many fixes in Mesa’s infrastructure. He has assisted with addressing a number of CI issues within Mesa on various drivers from v3d to panfrost. Eric also dedicated part of his time to general clean-up of the Mesa code (e.g. removing duplicate functions, fixing and improving the meson-based build system, and removing dead code).

If you’re interested in seeing some of his work, check out some of the MRs below:

May 11, 2023 04:00 AM

May 10, 2023

Igalia Compilers Team

Compiling Bigloo Scheme to WebAssembly

In the JavaScript world, browser implementations have focused on JIT compilation as a high-performance implementation technique. Recently, new applications of JS, such as on cloud compute and edge compute platforms, have driven interest in non-JIT implementations of the language. For these kinds of use cases, fast startup and predictable performance can make traditional implementation approaches appealing. An example implementation is QuickJS, which compiles JS to a bytecode format and interprets the bytecodes. Another approach is Manuel Serrano's work on Hopc, which is a performant AOT JS compiler that uses Scheme as an intermediate language.

Another direction that is gaining interest is compiling JavaScript to WebAssembly (Wasm). The motivations for this approach are explained very clearly in Lin Clark's article on making JS run fast on Wasm, and some of my Igalia colleagues are spearheading this effort with the SpiderMonkey JS engine in collaboration with Bytecode Alliance partners.

There is still an open question of whether we can apply these techniques for AOT compilation of JS to compile JS to Wasm in a practical way (though the componentize-js effort appears to be building up to this using partial evaluation). One way to test this out would be to apply the previously mentioned Hopc compiler. Hopc compiles to Scheme which, via the Bigloo Scheme implementation, compiles to C. Using the standard C toolchain for Wasm (i.e., Emscripten), we can compile that C code to Wasm.

To even attempt this, we would have to first make sure Bigloo Scheme's C output can be compiled to Wasm, which is the main topic of this blog post.

Bigloo on Wasm

In theory, it's simple to get Bigloo working with Wasm because it can emit C code, which you can compile with the C compiler of your choice. For example, you could use Emscripten's emcc to generate the final executable. In practice, it's more complicated than that.

For one, if you only compile the user Bigloo code to Wasm, it will fail to execute. The binary relies on several libraries that make up the runtime system, which themselves have to be compiled to Wasm in order to link a final executable.

The diagram below illustrates the compilation pipeline. The purple boxes at the lower right are the runtime libraries that need to be linked in.

Diagram illustrating the steps in the compilation pipeline from Hopc to Wasm

As a result, we will need to build Bigloo twice: once natively and once to Wasm. The latter build to Wasm will create the needed runtime libraries. This approach is also suggested in the Emscripten documentation for building projects that use self-execution.

I've scripted this approach in a Dockerfile that contains a reproducible setup for reliably compiling Bigloo to Wasm. You can see that starting at line 21 an ordinary native Bigloo is built, with a number of features disabled that won't work well in Wasm. Starting at line 42 a very similar build is done using the emconfigure wrapper that handles the configure script process for Emscripten. The options passed mirror the native build, but with some extra options needed for Wasm.

Like many projects that use Emscripten, some modifications are needed to get Bigloo to compile properly: for example, making C types more precise, backporting Emscripten compatibility patches for included libraries, and adjusting autoconf tests to return a desired result with Emscripten.

1 + 1 = 4?

There are some tricky details that you need to get right to have working Wasm programs in the end. For example, when I first got a working docker environment to run Bigloo-on-Wasm programs, I got the following result:

$ cat num.scm # this is a Bigloo scheme module
(module num)
(display (+ 1 1)) (newline)
$ /opt/bigloo/bin/bigloo -O3 num.scm -o num.js -cc /emsdk/upstream/emscripten/emcc # compile to wasm, more arguments are needed in practice, this is a simplified example
$ emsdk/node/14.18.2_64bit/bin/node num.js # execute the compiled wasm in nodejs
4

(Side note: if you haven't used a Wasm toolchain before, you may be confused why the output is num.js. Wasm toolchains often produce JS glue code that you use to load the actual Wasm code in a browser/JS engine.)

The Scheme program num.scm is supposed to print the result of "1 + 1". The wasm binary helpfully prints... 4. Other programs that I tried, like printing "hello world", resulted in the IO system trying to print random parts of Wasm's linear memory.

The proximate cause of this failure was that the value tagging code in the Bigloo runtime was being configured incorrectly. If you look at the Bigloo tagging code, you see these C preprocessor definitions:

# define TAG_MASK ((1 << PTR_ALIGNMENT) - 1)

# define TAG(_v, shift, tag) \
((long)(((unsigned long)(_v) << shift) | tag))

# define UNTAG(_v, shift, tag) \
((long)((long)(_v) >> shift))

/* ... */

# define TAG_INT 0 /* integer tagging ....00 */

The TAG operation is used throughout compiled Bigloo code to tag values into the unityped Scheme value representation. The default tagging scheme (see TAG_SHIFT) is a typical one that depends on the pointer alignment, which depends on the word size (4 bytes on 32-bit, 8 bytes on 64-bit). The PTR_ALIGNMENT definition is defined to be the log base 2 of the word size. This means 2 bits of the value are used for a tag on 32-bit platforms and 3 bits are used on 64-bit platforms.
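A minimal sketch of that arithmetic, written in Python rather than the runtime's C. The helper names are made up for illustration and are not Bigloo's; only the relationship between word size, alignment and mask is taken from the definitions above.

```python
def ptr_alignment(word_size_bytes):
    """log2 of the word size: 2 for 4-byte words, 3 for 8-byte words."""
    return word_size_bytes.bit_length() - 1

def tag_mask(word_size_bytes):
    """Mirror of TAG_MASK: the low bits that are free to hold a tag."""
    return (1 << ptr_alignment(word_size_bytes)) - 1

for word_size in (4, 8):
    bits = ptr_alignment(word_size)
    print(f"{word_size}-byte words: {bits} tag bits, mask {tag_mask(word_size):#b}")
```

On a 32-bit target this gives a two-bit mask, and on a 64-bit target a three-bit mask, which is the difference that matters in what follows.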

In the case of numbers, the tag is 0 (TAG_INT above) so a discrepancy in tagging will produce a mis-shifted number value. That's exactly why the num.js program printed 4 above. It's the right answer, 2, shifted left by one extra bit.

The reason for that shift is that I was compiling native Bigloo in a 64-bit configuration since that's the architecture of the host machine. Wasm, however, is specified to have a 32-bit address space (unless the memory64 proposal is used). This discrepancy caused values to get shifted by 2 bits in some places and by 3 bits in others during tagging/untagging. After figuring this out, it was relatively easy to compile native Bigloo with a 32-bit (i686) toolchain instead.
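The off-by-one-bit effect is easy to reproduce outside Bigloo. The sketch below is in Python, not the runtime's C, and it assumes (for illustration only) that the 64-bit side did the tagging and the 32-bit side did the untagging; the function names are hypothetical.

```python
TAG_INT = 0  # integers carry the tag ....00, as in the definitions above

def tag(value, shift, tag_bits):
    return (value << shift) | tag_bits

def untag(tagged, shift):
    return tagged >> shift

SHIFT_64 = 3  # tag bits assumed by a 64-bit configuration
SHIFT_32 = 2  # tag bits assumed by a 32-bit configuration

one = tag(1, SHIFT_64, TAG_INT)
total = one + one  # adding two tag-0 integers needs no untagging

print(untag(total, SHIFT_64))  # consistent shifts: 2
print(untag(total, SHIFT_32))  # mismatched shifts: 4
```

With consistent shifts the round trip is lossless; with a one-bit disagreement every integer comes out doubled, which matches the 4 printed by num.js.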

Function pointers

After fixing 32-bit/64-bit discrepancies, simple Bigloo programs would run in a Wasm engine. On more complex examples, however, I was running into function pointer cast errors like the following:

$ emsdk/node/15.14.0_64bit/bin/node bigloo-compiled-program.js
RuntimeError: function signature mismatch
at <anonymous>:wasm-function[184]:0x3b430
at <anonymous>:wasm-function[1397]:0x3400bb
at <anonymous>:wasm-function[505]:0x118046
at <anonymous>:wasm-function[325]:0xd693e
at <anonymous>:wasm-function[4224]:0x71dd82
at Ya (<anonymous>:wasm-function[18267]:0x143d987)
at ret.<computed> (/test-output.js:1:112711)
at Object.doRewind (/test-output.js:1:114339)
at /test-output.js:1:114922
at /test-output.js:1:99074

This is a documented issue that comes up when porting systems to Emscripten. It's not Emscripten's fault, because oftentimes the programs are relying on undefined behavior (UB) in C.

In particular, CPPReference says the following about function pointer casts:

Any pointer to function can be cast to a pointer to any other function type. If the resulting pointer is converted back to the original type, it compares equal to the original value. If the converted pointer is used to make a function call, the behavior is undefined (unless the function types are compatible)

which means that generally function pointer casts are undefined unless the source and target types are compatible. Bigloo has many cases where function pointers need to be cast. For example, the representation of Scheme procedures contains a field for a C function pointer:

/* procedure (closures) */
struct procedure {
   header_t header;
   union scmobj *(*entry)();     // <-- function pointer for the procedure entrypoint
   union scmobj *(*va_entry)();
   union scmobj *attr;
   int arity;
   union scmobj *obj0;
} procedure;

The function pointer's C type cannot precisely capture the actual behavior even with a uniform value representation, as the arity of the Scheme procedure needs to be represented. C does not prevent you from calling the function with whatever arity you like though, as you can see in the GStreamer API code:

obj_t proc = cb->proc; // A Scheme procedure object
switch( cb->arity ) {
   case 0:
      PROCEDURE_ENTRY( proc ) ( proc, BEOA ); // Extract the function entry pointer and call it

   case 1:
      PROCEDURE_ENTRY( proc ) ( proc, convert( cb->args[ 0 ], BTRUE ), BEOA );

   /* ... */

In practice, this kind of UB shouldn't cause problems for a typical C compiler because its output (assembly/machine code) is untyped. What matters is whether the calling convention is followed (it should be fine in Bigloo since these functions uniformly take and return scmobj* pointers).

Since Wasm has a sound static type system, it doesn't allow such loose typing of functions and will crash with a runtime type check if the types do not match. It's possible to work around this by using the EMULATE_FUNCTION_POINTER_CASTS Emscripten option to generate stubs that emulate the cast, but it adds significant overheads as the Emscripten docs note (emphasis mine):

Use EMULATE_FUNCTION_POINTER_CASTS. When you build with -sEMULATE_FUNCTION_POINTER_CASTS, Emscripten emits code to emulate function pointer casts at runtime, adding extra arguments/dropping them/changing their type/adding or dropping a return type/etc. This can add significant runtime overhead, so it is not recommended, but is be worth trying.
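To make the check and its emulation concrete, here is a toy model in Python. It is not how Emscripten or a Wasm engine implements anything: signatures are reduced to a parameter count, and `call_indirect`, `make_stub`, `call_emulated` and `MAX_PARAMS` are names invented for this sketch.

```python
class Trap(Exception):
    """Stands in for the engine's RuntimeError."""

def call_indirect(table, index, expected_sig, *args):
    """Call table[index], trapping unless its declared signature matches."""
    actual_sig, fn = table[index]
    if actual_sig != expected_sig:
        raise Trap("function signature mismatch")
    return fn(*args)

def add2(a, b):
    return a + b

# Each entry is (number of parameters, function). Real Wasm signatures also
# include parameter and result types; arity alone is enough for this sketch.
table = [(2, add2)]

# Calling through a pointer that was "cast" to a 3-parameter type traps.
try:
    call_indirect(table, 0, 3, 1, 2, 0)
    outcome = "called"
except Trap as err:
    outcome = f"trap: {err}"
print(outcome)

MAX_PARAMS = 4  # hypothetical uniform arity for the emulated calls

def make_stub(fn, arity):
    """Wrap fn so it accepts the padded argument list and ignores the extras."""
    def stub(*padded):
        return fn(*padded[:arity])
    return stub

def call_emulated(table, index, *args):
    """Pad the arguments with zeros so every call site has the same signature."""
    padded = args + (0,) * (MAX_PARAMS - len(args))
    return call_indirect(table, index, MAX_PARAMS, *padded)

emulated = [(MAX_PARAMS, make_stub(add2, 2))]
print(call_emulated(emulated, 0, 1, 2))
```

The second half is the idea behind the emulation: every indirect call goes through one wide signature, so nothing traps, but every call carries arguments that are never used.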

The overhead is clear in the generated code, because the option adds dummy function parameters to virtualize calls. Here's an example showing the decompiled Wasm code with emulated casts:

Screenshot of decompiled Wasm showing a large number of function parameters

You can see that the function being called with call_indirect has a huge number of arguments (the highlighted (param i64 ...) shows the type of the function being called, and the (i64.const 0) ... above the call are the concrete arguments). There are more than 70 arguments here, and most of them are unused and are present only for the virtualization of the call. This can add up to a huge binary size cost, since there can also be a large number of functions in the Wasm module's indirect function table:

Screenshot of the V8 debugger showing a Wasm module with more than 20,000 entries in its function table

The screenshot from the V8 debugger above is showing the contents of the running module. In this case the module's table (highlighted in red) has over 20,000 function entries. Calls to many of these will incur the emulation overhead. It's not clear to me that there is any good way to avoid this cost without significantly changing the representation of values in Bigloo.

What about Hopc?

After getting Bigloo to compile to Wasm, I did go back to the initial motivation of this blog post and tried to get Hopc (the JS to Scheme compiler) working in order to have a whole pipeline to compile JS to Wasm. While I was able to get a working build, I had some trouble producing a final Wasm program that could serve as a demo without crashing. At some point, some of the runtime initialization code hits a call_indirect on a null function pointer and crashes.

I suspect that even if I could resolve the crashes, there would be more work needed to make this practical for the use cases I described at the beginning. The best code size I've been able to get for a minimal JS program compiled to Wasm using this pipeline was 29MB, which is rather large. For comparison, Guy Bedford, quoted in the JS to Wasm article linked earlier, suggested that 5-6MB was a reasonable number for a SpiderMonkey embedding.

There may be opportunities to reduce this overhead. For example, disabling asyncify and function pointer cast emulation reduces the binary to 9.8MB, albeit a non-working one. Asyncify appears to be required to use the default BDW garbage collector due to the use of emscripten_scan_registers(). There is some discussion of possibly making the asyncify use optional (and possibly using Binaryen's "spill pointers" pass), but it looks like this will take more time to materialize. To avoid the asyncify overhead at the Bigloo level, it could be interesting to look into alternative GC configurations that don't use BDW at all. For the function pointer issue, maybe future changes that leverage the Wasm GC proposal (which has a ref.cast instruction that can cast a function reference to a more precise type) could provide a workaround.

Wrap-up

It was fun to explore this possibility for AOT compiling JS to Wasm, and more generally it was a good exercise in porting a programming language to Wasm. While there were some tricky problems, the Emscripten tools were good at handling many parts of the process automatically.

I also had to debug a bunch of crashing Wasm code, and found that the debug support was better than I expected. Passing the debug mode flag -g to emcc helped in getting useful stack traces and in utilizing the Chrome debugger. Though I did wish I had access to an rr-style time travel debugger to continue backwards from a crash site.

With regard to Hopc, I think it could be worth exploring further if the runtime crashes in Wasm can be resolved and if the binary size can be brought down using some of the approaches I suggested above. For the time being though, if you want to compile Scheme to Wasm, you have an option available now with Bigloo. The Bigloo setup can compile some non-trivial Scheme programs too, such as this demo page that uses Olin Shivers' maze program compiled to Wasm.

For another path to use Scheme on Wasm, also check out my colleagues' work to compile Guile to Wasm.

May 10, 2023 12:00 AM

May 09, 2023

Jesse Alama

Announcing decimal128: JavaScript implementation of Decimal128

Announcing decimal128.js, an NPM package for IEEE 754 Decimal128 floating-point decimal numbers

I’m happy to announce decimal128.js, an NPM package I made for simulating IEEE 754 Decimal128 numbers in JavaScript.

(This is my first NPM package. I made it in TypeScript; it’s my first go at the language.)


Decimal128 is an IEEE standard for floating-point decimal numbers. These numbers aren’t the binary floating-point numbers that you know and love (?), but decimal numbers. You know, the kind we learn about before we’re even ten years old. In the binary world, things like 0.1 + 0.2 aren’t exactly equal to 0.3; in the decimal world they are, and calculations like 0.7 * 1.05 work out to exactly 0.735. These kinds of numbers are what we use when doing all sorts of everyday calculations, especially those having to do with money.

Decimal128 encodes decimal numbers into 128 bits. It is a fixed-width encoding, unlike arbitrary-precision numbers, which, of course, require an arbitrary amount of space. The encoding can represent numbers with up to 34 significant digits and an exponent of –6143 to 6144. That is a truly vast amount of space if one keeps the intended use cases involving human-readable and -writable numbers (read: money) in mind.
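For a quick feel of the difference, here is a small sketch using Python's standard decimal module as a stand-in. It is not the decimal128.js API, and it does not model the 128-bit encoding; it only configures a context with the 34 digits and the exponent range quoted above.

```python
from decimal import Context, Decimal

# 34 significant digits and the exponent range quoted above.
ctx = Context(prec=34, Emin=-6143, Emax=6144)

# Binary floating point: the familiar surprise.
print(0.1 + 0.2 == 0.3)

# Decimal arithmetic: the sums and products we learned at school.
print(ctx.add(Decimal("0.1"), Decimal("0.2")) == Decimal("0.3"))
print(ctx.multiply(Decimal("0.7"), Decimal("1.05")))

# Results that cannot be represented exactly are rounded to 34 digits.
third = ctx.divide(Decimal(1), Decimal(3))
print(len(third.as_tuple().digits))
```

The first comparison is false and the second is true; the product prints as 0.735, and one third is cut off after 34 significant digits.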


I’m working on extending the JavaScript language with decimal numbers (proposal-decimal). One of the design decisions that has to be made there is whether to implement arbitrary-precision decimal numbers or to implement some kind of approximation thereof, with Decimal128 being the main contender. As far as I could tell, there was no implementation of Decimal128 in JavaScript, so I made one.

The intention isn’t to support the full Decimal128 standard, nor should one expect to achieve the performance that, say, a C/C++ library would give you in userland JavaScript. (To say nothing of having machine-native decimal instructions, which is truly exotic.) The intention is to give JavaScript developers something that genuinely strives to approximate Decimal128 for JS programs.

In particular, the hope is that this library offers the JS community a chance to get a feel for what Decimal128 might be like.

How to use

Just do

$ npm install decimal128

and start using the provided Decimal128 class.


If you find any bugs or would like to request a feature, just open an issue and I’ll get on it.

May 09, 2023 11:26 AM

May 03, 2023

Brian Kardell

Wolvic Behind-the-Scenes

A little post about power dynamics, frustrations, mistakes and things I’m still learning along the way about the challenges of making The World’s Greatest Cross Device Open Source Browser for XR (or any browser, really).

The next minor (not patch) release of Wolvic will very likely include a fairly major API addition, but if it does, you won’t even find it in the release notes. Still I find myself wanting to write about it.

We are … {drum roll} … very probably [1] re-adding support for the WebVR API.


If you're not familiar: The WebVR API was first conceived in 2014 and shipped in Firefox in 2015. It ran as an origin trial in Chrome itself from 2016 until 2018, but was enabled in some downstream browsers. It has been superseded since 2018 by WebXR, which supports a superset of its use cases and addresses some UX issues caused by WebVR. It was removed from Firefox and Wolvic in early 2022.

And now, in spite of all of that, we're very probably [1] re-enabling it in Wolvic. For now, at least.

Record scratch
Freeze Frame
Yup, that’s me.
You’re probably wondering how I got here…

Reality is Messy

To reiterate, this technology

  • Was supposed to be experimental;
  • Was superseded five years ago by WebXR, which is under active development and/or partly shipping in multiple browser engines;
  • Has an easy replacement (many cases just need a library update).

However, none of that really matters if you’re a “young” browser and your users complain in the app stores, because that really hampers your ability to grow.

And at least currently, too many do.

“user-agent” cuts browsers both ways

I’ve written before (a few times actually) about our trials with the user agent string, and sites just blocking our browser from getting immersive content and why we have to tell A Few Good Lies on behalf of the user. We are the user agent, that’s our job.

But also, if we don’t tell those lies, people will give us bad ratings in the stores.

It makes sense, right? I don’t really blame them. At the end of the day users just don’t care who did The Bad Thing™.

They don’t care that websites shouldn’t try to use the UA string like that, or any of that other unseen stuff. What they care about is that from their perspective it just

doesn’t work in your shitty upstart browser. ☆ 1 star.

For a young browser, app store reviews matter a lot[2]. And reviews will typically skew negative, because happy users don’t really have a good motivation to give you a positive review; they just want to do whatever they do in the browser, not jump over to the app store to say

It got right out of my way. ☆☆☆☆☆ five stars


Unhappy users, though: they’ll leave a review, because they’re frustrated and this is the only real chance they have to do anything about it.

On the one hand, that kind of mechanism can be generally helpful: It gives users at least some way to apply pressure on apps to make users happy. Stores also give us ways to reply and ask questions, and to let them know their issue is fixed so that hopefully they can update their review. That happens, and in an ideal world, maybe that’s fine.

At the same time: You might not want to tell everyone exactly what you were doing...

However, users don’t always want to give details on what precise thing they were doing that led to them getting frustrated enough to leave a bad review.

So, for example, a surprising number of sites which offer streaming video don’t use the <video> element. Often, instead they’ll use some library which rolls its own video streaming support, usually with device feature targeting, based on some lower-level APIs. So, they'll offer you an immersive experience if they think that's possible. If support for those APIs is dropped, it looks like video streaming broke, with no warning. Libraries might have updates available, but sites don't always get updates like that.

So... For example, if your upstart browser appears to have problems with some sexually explicit streaming video sites, those people might be highly motivated to tell you that your browser is shitty or broken.

But also, that’s about all they will tell you.

Again, I get it! I mean, they don’t care why it’s broken, and they don’t want to tell you what they were looking at. I can respect both of those things!

But… that also means it’s very hard to actually do anything about it.

Look, the web is a really terribly, mind-bogglingly huge place and certainly by far the vast majority of it appears to be working just fine. If we can identify a thing that isn’t working, we’d like to fix it. But we’re unlikely to just stumble across it ourselves.

The worst part about this is that I imagine it will be very hard to get those people to reconsider because it’s impossible to know when to tell them their issue is fixed, and I assume at some point they’ll just stop looking.

Wolvic is adding a new way to report a broken site, optionally anonymously. Maybe that will help this specific problem, but I’m not sure.

Help a browser out...

I hope it’s interesting to come on this adventure of building The World’s Greatest Cross Device Open Source Browser for XR with me.

I hope if you know of any WebVR stuff floating around out there you’ll encourage people to update their libraries already so we can hopefully all get rid of it someday.

If you have Wolvic, maybe you’ll think about popping over and giving it an honest rating if you haven’t already - or telling someone to try it out.


  1. Assuming that we ultimately judge that this won't actually cause even worse problems after some more research. Part of the reason we were eager to remove it early in the life of Wolvic was that there are so many of exactly the kinds of problems described here with websites or libraries trying to make a good accommodation but making a poor assumption. We suspect that many things were checking for WebVR before WebXR, which would prevent the preferred APIs from being used and further entrench the focus on the deprecated ones!
  2. I think about these rating things in stores every time I’m shopping. It feels like they’re generally not great. They are a dull instrument, full of problems, bad incentives and ways to game them. We should be able to do better than that. For example, a small number of reviews from people I have some reason to trust is worth thousands from random internet accounts I have plenty of reason to mistrust. I’m not sure how we build that yet, but maybe one of you can figure it out. Please :) And share it with the rest of the class, if you don’t mind. Thanks.

May 03, 2023 04:00 AM

May 02, 2023

Andy Wingo

structure and interpretation of ark

Hello, dear readers! Today's article describes Ark, a new JavaScript-based mobile development platform. If you haven't read them yet, you might want to start by having a look at my past articles on Capacitor, React Native, NativeScript, and Flutter; having a common understanding of the design space will help us understand where Ark is similar and where it differs.

Ark, what it is

If I had to bet, I would guess that you have not heard of Ark. (I certainly hadn't either, when commissioned to do this research series.) To a first approximation, Ark—or rather, what I am calling Ark; I don't actually know the name for the whole architecture—is a loosely Flutter-like UI library implemented on top of a dialect of JavaScript, with build-time compilation to bytecode (like Hermes) but also with support for just-in-time and ahead-of-time compilation of bytecode to native code. It is made by Huawei.

At this point if you are already interested in this research series, I am sure this description raises more questions than it answers. Flutter-like? A dialect? Native compilation? Targeting what platforms? From Huawei? We'll get to all of these, but I think we need to start with the last question.

How did we get here?

In my last article on Flutter, I told a kind of just-so history of how Dart and Flutter came to their point in the design space. Thanks to corrections from a kind reader, it happened to also be more or less correct. In this article, though, I haven't talked with Ark developers at all; I don't have the benefit of a true claim on history. And yet, the only way I can understand Ark is by inventing a narrative, so here we go. It might even be true!

Recall that in 2018, Huawei was a dominant presence in the smartphone market. They were shipping excellent hardware at good prices both to the Chinese and to the global markets. Like most non-Apple, non-Google manufacturers, they shipped Android, and like most Android OEMs, they shipped Google's proprietary apps (mail, maps, etc.).

But then, over the next couple years, the US decided that allowing Huawei to continue on as before was, like, against national security interests or something. Huawei was barred from American markets, a number of suppliers were forbidden from selling hardware components to Huawei, and even Google was prohibited from shipping its mobile apps on Huawei devices. The effect on Huawei's market share for mobile devices was enormous: its revenue was cut in half over a period of a couple years.

In this position, as Huawei, what do you do? I can't even imagine, but specifically looking at smartphones, I think I would probably do about what they did. I'd fork Android, for starters, because that's what you already know and ship, and Android is mostly open source. I'd probably plan on continuing to use its lower-level operating system pieces indefinitely (kernel and so on) because that's not a value differentiator. I'd probably ship the same apps on top at first, because otherwise you slip all the release schedules and lose revenue entirely.

But, gosh, there is the risk that your product will be perceived as just a worse version of Android: that's not a good position to be in. You need to be different, and ideally better. That will take time. In the meantime, you claim that you're different, without actually being different yet. It's a somewhat ridiculous position to be in, but I can understand how you get here; Ars Technica published a scathing review poking fun at the situation. But, you are big enough to ride it out, knowing that somehow eventually you will be different.

Up to now, this part of the story is relatively well-known; the part that follows is more speculative on my part. Firstly, I would note that Huawei had been working for a while on a compiler and language run-time called Ark Compiler, with the goal of getting better performance out of Android applications. If I understand correctly, this compiler took the Java / Dalvik / Android Run Time bytecodes as its input, and outputted native binaries along with a new run-time implementation.

As I can attest from personal experience, having a compiler leads to hubris: you start to consider source languages like a hungry person looks at a restaurant menu. "Wouldn't it be nice to ingest that?" That's what we say at restaurants, right, fellow humans? So in 2019 and 2020 when the Android rug was pulled out from underneath Huawei, I think having in-house compiler expertise allowed them to consider whether they wanted to stick with Java at all, or whether it might be better to choose a more fashionable language.

Like black, JavaScript is always in fashion. What would it mean, then, to retool Huawei's operating system -- by then known by the name "HarmonyOS" -- to expose a JavaScript-based API as its primary app development framework? You could use your Ark compiler somehow to implement JavaScript (hubris!) and then you need a UI framework. Having ditched Java, it is now thinkable to ditch all the other Android standard libraries, including the UI toolkit: you start anew, in a way. So are you going to build a Capacitor, a React Native, a NativeScript, a Flutter? Surely not precisely any of these, but what will it be like, and how will it differ?

Incidentally, I don't know the origin story for the name Ark, but to me it brings to mind tragedy and rebuilding: in the midst of being cut off from your rich Android ecosystem, you launch a boat into the sea, holding a promise of a new future built differently. Hope and hubris in one vessel.

Two programming interfaces

In the end, Huawei builds two things: something web-like and something like Flutter. (I don't mean to suggest copying or degeneracy here; it's rather that I can only understand things in relation to other things, and these are my closest points of comparison for what they built.)

The web-like programming interface specifies UIs using an XML dialect, HML, and styles the resulting node tree with CSS. You augment these nodes with JavaScript behavior; the main app is a set of DOM-like event handlers. There is an API to dynamically create DOM nodes, but unlike the other systems we have examined, the HarmonyOS documentation doesn't really sell you on using a high-level framework like Angular.

If this were it, I think Ark would not be so compelling: the programming model is more like what was available back in the DHTML days. I wouldn't expect people to be able to make rich applications that delight users, given these primitives, though CSS animation and the HML loop and conditional rendering from the template system might be just expressive enough for simple applications.

The more interesting side is the so-called "declarative" UI programming model which exposes a Flutter/React-like interface. The programmer describes the "what" of the UI by providing a tree of UI nodes in its build function, and the framework takes care of calling build when necessary and of rendering that tree to the screen.

Here I need to show some example code, because it is... weird. Well, I find it weird, but it's not too far from SwiftUI in flavor. A small example from the fine manual:

@Entry
@Component
struct MyComponent {
  build() {
    Stack() {
      Image(/* ... */)
      Text(/* ... */)
    }
  }
}

The @Entry decorator (*) marks this struct (**) as being the main entry point for the app. @Component marks it as being a component, like a React functional component. Components conform to an interface (***) which defines them as having a build method which takes no arguments and returns no values: it creates the tree in a somewhat imperative way.

But as you see the flavor is somewhat declarative, so how does that work? Also, build() { ... } looks syntactically a lot like Stack() { ... }; what's the deal, are they the same?

Before going on to answer this, note my asterisks above: these are concepts that aren't in JavaScript. Indeed, programs written for HarmonyOS's declarative framework aren't JavaScript; they are in a dialect of TypeScript that Huawei calls ArkTS. In this case, an interface is a TypeScript concept. Decorators would appear to correspond to an experimental TypeScript feature, looking at the source code.

But struct is an ArkTS-specific extension, and Huawei has actually extended the TypeScript compiler to specifically recognize the @Component decorator, such that when you "call" a struct, for example as above in Stack() { ... }, TypeScript will parse that as a new expression type EtsComponentExpression, which may optionally be followed by a block. When Stack() is invoked, its children (instances of Image and Text, in this case) will be populated via running the block.

Now, though TypeScript isn't everyone's bag, it's quite normalized in the JavaScript community and not a hard sell. Language extensions like the handling of @Component pose a more challenging problem. Still, Facebook managed to sell people on JSX, so perhaps Huawei can do the same for their dialect. More on that later.

Under the hood, it would seem that we have a similar architecture to Flutter: invoking the components creates a corresponding tree of elements (as with React Native's shadow tree), which then are lowered to render nodes, which draw themselves onto layers using Skia, in a multi-threaded rendering pipeline. Underneath, the UI code actually re-uses some parts of Flutter, though from what I can tell HarmonyOS developers are replacing those over time.

Restrictions and extensions

So we see that the source language for the declarative UI framework is TypeScript, but with some extensions. It also has its restrictions, and to explain these, we have to talk about implementation.

Of the JavaScript mobile application development frameworks we discussed, Capacitor and NativeScript used "normal" JS engines from web browsers, while React Native built their own Hermes implementation. Hermes is also restricted, in a way, but mostly inasmuch as it lags the browser JS implementations; it relies on source-to-source transpilers to get access to new language features. ArkTS—that's the name of HarmonyOS's "extended TypeScript" implementation—has more fundamental restrictions.

Recall that the Ark compiler was originally built for Android apps. There you don't really have the ability to load new Java or Kotlin source code at run-time; in Java you have class loaders, but those load bytecode. On an Android device, you don't have to deal with the Java source language. If we use a similar architecture for JavaScript, though, what do we do about eval?

ArkTS's answer is: don't. As in, eval is not supported on HarmonyOS. In this way the implementation of ArkTS can be divided into two parts, a frontend that produces bytecode and a runtime that runs the bytecode, and you never have to deal with the source language on the device where the runtime is running. Like Hermes, the developer produces bytecode when building the application and ships it to the device for the runtime to handle.
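The same split can be imitated with CPython's own bytecode, purely as an analogy. Nothing here relates to the Ark file format or tooling; it just shows a "build" step that consumes source text and a "device" step that only ever sees serialized bytecode.

```python
import marshal

# "Build machine": the front end turns source text into bytecode and ships it.
source = "result = 6 * 7"
shipped = marshal.dumps(compile(source, "<app>", "exec"))

# "Device": only the serialized bytecode arrives; no source is parsed here.
namespace = {}
exec(marshal.loads(shipped), namespace)
print(namespace["result"])
```

In this sketch `exec` is handed an already-compiled code object, so the second half needs an interpreter but no parser or compiler, which is the property the paragraph above describes.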

Incidentally, before we move on to discuss the runtime, there are actually two front-ends that generate ArkTS bytecode: one written in C++ that seems to only handle standard TypeScript and JavaScript, and one written in TypeScript that also handles "extended TypeScript". The former has a test262 runner with about 10k skipped tests, and the latter doesn't appear to have a test262 runner. Note, I haven't actually built either one of these (or any of the other frameworks, for that matter).

The ArkTS runtime is itself built on a non-language-specific common Ark runtime, and the set of supported instructions is the union of the core ISA and the JavaScript-specific instructions. Bytecode can be interpreted, JIT-compiled, or AOT-compiled.
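As a rough illustration of "core ISA plus language-specific instructions", here is a tiny stack interpreter in Python whose dispatch table is the union of two opcode sets. The opcode names and semantics are hypothetical and are not taken from the Ark ISA.

```python
# Hypothetical opcodes; none of these names come from the Ark ISA.
CORE_OPS = {
    "push": lambda stack, arg: stack.append(arg),
    "add": lambda stack, arg: stack.append(stack.pop() + stack.pop()),
}
JS_OPS = {
    # A language-specific instruction: JavaScript's typeof, very roughly.
    "typeof": lambda stack, arg: stack.append(
        "number" if isinstance(stack.pop(), (int, float)) else "object"
    ),
}
SUPPORTED = {**CORE_OPS, **JS_OPS}  # the runtime understands the union

def interpret(program):
    """Run a list of (opcode, argument) pairs and return the top of the stack."""
    stack = []
    for opcode, arg in program:
        SUPPORTED[opcode](stack, arg)
    return stack.pop()

print(interpret([("push", 1), ("push", 2), ("add", None)]))
print(interpret([("push", 1), ("typeof", None)]))
```

A different source language would contribute its own extension table while reusing the core one unchanged.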

On the side of design documentation, it's somewhat sparse. There are some core design docs; readers may be interested in the rationale to use a bytecode interface for Ark as a whole, or the optimization overview.

Indeed ArkTS as a whole has a surfeit of optimizations, to an extent that makes me wonder which ones are actually needed. There are source-to-source optimizations on bytecode, which I expect are useful if you are generating ArkTS bytecode from JavaScript, where you probably don't have a full compiler implementation. There is a completely separate optimizer in the eTS part of the run-time, based on what would appear to be a novel "circuit-based" IR that bears some similarity to sea-of-nodes. Finally the whole thing appears to bottom out in LLVM, which of course has its own optimizer. I can only assume that this situation is somewhat transitory. Also, ArkTS does appear to generate its own native code sometimes, notably for inline cache stubs.

Of course, when it comes to JavaScript, there are some fundamental language semantics and there is also a large and growing standard library. In the case of ArkTS, this standard library is part of the run-time, like the interpreter, compilers, and the garbage collector (generational concurrent mark-sweep with optional compaction).

All in all, when I step back from it, it's a huge undertaking. Implementing JavaScript is no joke. It appears that ArkTS has done the first 90% though; the proverbial second 90% should only take a few more years :)


If you told a younger me that a major smartphone vendor switched from Java to JavaScript for their UI, you would probably hear me react in terms of the relative virtues of the programming languages in question. At this point in my career, though, the only thing that comes to mind is what an expensive proposition it is to change everything about an application development framework. 200 people over 5 years would be my estimate, though of course teams are variable. So what is it that we can imagine that Huawei bought with a thousand person-years of investment? Towards what other local maximum are we heading?

Startup latency

I didn't mention it before, but it would seem that one of the goals of HarmonyOS is in the name: Huawei wants to harmonize development across its wide range of deployment targets. To the extent possible, it would be nice to be able to write the same kinds of programs for IoT devices as you do for feature-rich smartphones and tablets and the like. In that regard one can see through all the source code how there is a culture of doing work ahead-of-time and avoiding work at run-time; for example see the design doc for the interpreter, or for the file format, or indeed the lack of JavaScript eval.

Of course, this wide range of targets also means that the HarmonyOS platform bears the burden of a high degree of abstraction; not only can you change the kernel, but also the JavaScript engine (using JerryScript on "lite" targets).

I mention this background because sometimes in news articles and indeed official communication from recent years there would seem to be some confusion that HarmonyOS is just for IoT, or aimed to be super-small, or something. In this evaluation I am mostly focussed on the feature-rich side of things, and there my understanding is that the developer will generate bytecode ahead-of-time. When an app is installed on-device, the AOT compiler will turn it into a single ELF image. This should generally lead to fast start-up.

However it would seem that the rendering library that paints UI nodes into layers and then composites those layers uses Skia in the way that Flutter did pre-Impeller, which to be fair is a quite recent change to Flutter. I expect therefore that Ark (ArkTS + ArkUI) applications also experience shader compilation jank at startup, and that they may be well-served by tessellating their shapes into primitives like Impeller does so that they can precompile a fixed, smaller set of shaders.


Maybe it's just that apparently I think Flutter is great, but ArkUI's fundamental architectural similarity to Flutter makes me think that jank will not be a big issue. There is a render thread that is separate from the ArkTS thread, so like with Flutter, async communication with main-thread interfaces is the main jank worry. And on the ArkTS side, ArkTS even has a number of extensions to be able to share objects between threads without copying, should that be needed. I am not sure how well-developed and well-supported these extensions are, though.

I am hedging my words, of course, because I am missing a bit of social proof; HarmonyOS is still in its infancy, and it doesn't have much in the way of a user base outside China, from what I can tell, and my ability to read Chinese is limited to what Google Translate can do for me :) Unlike other frameworks, therefore, I haven't been as able to get a feel for the pulse of the ArkUI user community: what people are happy about, what the pain points are.

It's also interesting that unlike iOS or Android, HarmonyOS is only exposing these "web-like" and "declarative" UI frameworks for app development. This makes it so that the same organization is responsible for the software from top to bottom, which can lead to interesting cross-cutting optimizations: functional reactive programming isn't just a developer-experience narrative, but it can directly affect the shape of the rendering pipeline. If there is jank, someone in the building is responsible for it and should be able to fix it, whether it is in the GPU driver, the kernel, the ArkTS compiler, or the application itself.

Peak performance

I don't know how to evaluate ArkTS for peak performance. Although there is a JIT compiler, I don't have the feeling that it is as tuned for adaptive optimization as V8 is.

At the same time, I find it interesting that HarmonyOS has chosen to modify JavaScript. While it is doing that, could they switch to a sound type system, to allow the kinds of AOT optimizations that Dart can do? It would be an interesting experiment.

As it is, though, if I had to guess, I would say that ArkTS is well-positioned for predictably good performance with AOT compilation, although I would be interested in seeing the results of actually running it.

Aside: On the importance of storytelling

In this series I have tried to be charitable towards the frameworks that I review, to give credit to what they are trying to do, even while noting where they aren't currently there yet. That's part of why I need a plausible narrative for how the frameworks got where they are, because that lets me have an idea of where they are going.

In that sense I think that Ark is at an interesting inflection point. When I started reading documentation about ArkUI and HarmonyOS and all that, I bounced out—there were too many architectural box diagrams, too many generic descriptions of components, too many promises with buzzwords. It felt to me like the project was trying to justify itself to a kind of clueless management chain. Was there actually anything here?

But now when I see the adoption of a modern rendering architecture and a bold new implementation of JavaScript, along with the willingness to experiment with the language, I think that there is an interesting story to be told, but this time not to management but to app developers.

Of course you wouldn't want to market to app developers when your system is still a mess because you haven't finished rebuilding an MVP yet. Retaking my charitable approach, then, I can only think that all the architectural box diagrams were a clever blind to avoid piquing outside interest while the app development kit wasn't ready yet :) As and when the system starts working well, presumably over the next year or so, I would expect HarmonyOS to invest much more heavily in marketing and developer advocacy; the story is interesting, but you have to actually tell it.

Aside: O platform, my platform

All of the previous app development frameworks that we looked at were cross-platform; Ark is not. It could be, of course: it does appear to be thoroughly open source. But HarmonyOS devices are the main target. What implications does this have?

A similar question arises in perhaps a more concrete way if we start with the mature Flutter framework: what would it mean to make a Flutter phone?

The first thought that comes to mind is that having a Flutter OS would open up more optimizations that cut across abstraction layers. But then I think, what does Flutter really need? It has the GPU drivers, and we aren't going to re-implement those. It has the bridge to the platform-native SDK, which is not such a large and important part of the app. You get input from the platform, but that's also not so specific. So maybe optimization is not the answer.

On the other hand, a Flutter OS would not have to solve the make-it-look-native problem; because there would be no other "native" toolkit, your apps won't look out of place. That's nice. It's not something that would make the platform compelling, though.

HarmonyOS does have this embryonic concept of app mobility, where like you could put an app from your phone on your fridge, or something. Clearly I am not doing it justice here, but let's assume it's a compelling use case. In that situation it would be nice for all devices to present similar abstractions, so you could somehow install the same app on two different kinds of devices, and they could communicate to transfer data. As you can see here though, I am straying far from my domain of expertise.

One reasonable way to "move" an app is to have it stay running on your phone and the phone just communicates pixels with your fridge (or whatever); this is the low-level solution. I think HarmonyOS appears to be going for the higher-level solution where the app actually runs logic on the device. In that case it would make sense to ship UI assets and JavaScript / extended TypeScript bytecode to the device, which would run the app with an interpreter (for low-powered devices) or use JIT/AOT compilation. The Ark runtime itself would live on all devices, specialized to their capabilities.

In a way this is the Apple WatchOS solution (as I understand it); developers publish their apps as LLVM bitcode, and Apple compiles it for the specific devices. A FlutterOS with a Flutter run-time on all devices could do something similar. As with WatchOS you wouldn't have to ship the framework itself in the app bundle; it would be on the device already.

Finally, publishing apps as some kind of intermediate representation also has security benefits: as the OS developer, you can ensure some invariants via the toolchain that you control. Of course, you would have to ensure that the Flutter API is sufficiently expressive for high-performance applications, while also not having arbitrary machine code execution vulnerabilities; there is a question of language and framework design as well as toolchain and runtime quality of implementation. HarmonyOS could be headed in this direction.


Ark is a fascinating effort that holds much promise. It's also still in motion; where will it be when it anneals to its own local optimum? It would appear that the system is approaching usability, but I expect a degree of churn in the near-term as Ark designers decide which language abstractions work for them and how to, well, harmonize them with the rest of JavaScript.

For me, the biggest open question is whether developers will love Ark in the way they love, say, React. In a market where Huawei is still a dominant vendor, I think the material conditions are there for a good developer experience: people tend to like Flutter and React, and Ark is similar. Huawei "just" needs to explain their framework well (and where it's hard to explain, to go back and change it so that it is explainable).

But in a more heterogeneous market, to succeed Ark would need to make a cross-platform runtime like the one Flutter has and engage in some serious marketing efforts, so that developers don't have to limit themselves to targeting the currently-marginal HarmonyOS. Selling extensions to JavaScript will be much more difficult in a context where the competition is already established, but perhaps Ark will be able to productively engage with TypeScript maintainers to move the language so it captures some of the benefits of Dart that facilitate ahead-of-time compilation.

Well, that's it for my review round-up; hope you have enjoyed the series. I have one more pending article, speculating about some future technologies. Until then, happy hacking, and see you next time.

by Andy Wingo at May 02, 2023 09:13 AM

April 28, 2023

Loïc Le Page

WebRTC, GStreamer and HTML5 - Part 2

An easy 360º solution for realtime multimedia communication.

First part is available in Part 1 - The story so far...

Part 2 - The GstWebRTC API

The new gstwebrtc-api JavaScript library developed at Igalia offers a full integration of the GStreamer webrtcsrc/webrtcsink protocols within a web browser or a mobile WebView.

You can easily and transparently interconnect a realtime streaming web application with GStreamer native components. Some interesting use-cases may be:

  • ingestion of audio/video from a mobile phone to a backend infrastructure with low latency,
  • broadcasting of low-latency media to a web page,
  • video conferencing with native desktop clients and web clients,
  • mobile video heavy post-processing (like AI) on server-side with direct feedback on client-side,
  • mixing videos from different remote sources on the server-side and broadcasting back with low end-to-end latency,
  • etc...

A full-featured demo showing how to use the gstwebrtc-api is available on the source code repository in the index.html file.

Following the instructions in the repository’s README, you can launch the demo and interact with GStreamer pipelines.

This demo opens a simple web page that, on one hand, offers to stream out the device webcam as a producer and, on the other hand, automatically detects any new remote producer available on the signalling network and offers to connect to them to consume their WebRTC streams.

The gstwebrtc-api is very simple and is organized around three groups of functions:

1 - Management of the connection to the signalling server

The client is automatically connected to the signalling server when the DOMContentLoaded event is triggered and, in case of unwanted disconnection, it is automatically reconnected. The signalling server address and reconnection timeout can be configured in the gstWebRTCConfig configuration.

const connectionListener = {
  connected: function(clientId) {
    // New connection established with the unique identifier: clientId
  },

  disconnected: function() {
    // Disconnected from the signalling server
  }
};


2 - Management of the producer role

Only one producer session can be created at once. While a producer session is active, it is offering the provided MediaStream to the signalling network. The producer session can be stopped at any moment by calling the close() method on it.

gstWebRTCAPI.SessionState = Object.freeze({
  idle: 0, // Session has just been created, you need to call start() or connect()
  connecting: 1, // Session is connecting to the other peer
  streaming: 2, // Session is actively streaming (out for a producer, in for a consumer)
  closed: 3 // Session is definitively closed and object can be recycled
});

class gstWebRTCAPI.ProducerSession {
  get stream() -> MediaStream; // The stream broadcast by this producer
  get state() -> SessionState; // Current producer session state
  start() -> boolean; // Must be called to start the session, after registering any event listeners
  close(); // Closes the session definitively
}

// The provided MediaStream is the stream that will be broadcast by the created producer session
gstWebRTCAPI.createProducerSession(stream) -> ProducerSession;

The ProducerSession class inherits from EventTarget and emits the following events:

  • error when a network or streaming error occurs,
  • stateChanged each time the session state changes,
  • closed when the session has been closed,
  • clientConsumerAdded each time a new remote consumer connects to the session,
  • clientConsumerRemoved each time a remote consumer disconnects from the session.
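
As a rough sketch of how the producer pieces above fit together, the helper below creates a producer session, registers listeners for the listed events, and only then calls start(). It takes the API object as a parameter so that it can be exercised outside a browser. The names startProducer and log are illustrative and not part of gstwebrtc-api, and treating a falsy result from createProducerSession() as "no session available" is an assumption.

```javascript
// Sketch only: `startProducer` and `log` are hypothetical helpers, not part of gstwebrtc-api.
// `api` is expected to look like the gstWebRTCAPI object described above.
function startProducer(api, stream, log) {
  const session = api.createProducerSession(stream);
  if (!session) {
    // Assumption: a falsy result means a producer session could not be created
    // (for example because one is already active).
    log("producer session unavailable");
    return null;
  }

  // Register listeners before start(), as the API description requires.
  session.addEventListener("error", () => log("error"));
  session.addEventListener("stateChanged", () => log("state: " + session.state));
  session.addEventListener("closed", () => log("closed"));
  session.addEventListener("clientConsumerAdded", () => log("consumer added"));
  session.addEventListener("clientConsumerRemoved", () => log("consumer removed"));

  if (!session.start()) {
    log("start() refused");
    session.close();
    return null;
  }
  return session;
}
```

In a web page, the stream argument would typically be the MediaStream obtained from navigator.mediaDevices.getUserMedia(), and api would be the global gstWebRTCAPI object.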

3 - Management of the consumer role

You can connect to any remote producer using its unique identifier. You can get the list of remote producers by calling gstWebRTCAPI.getAvailableProducers(). You can also register a listener to stay informed when a producer appears or disappears.

gstWebRTCAPI.Producer = Object.freeze({
  id: {non-empty string},
  meta: {non-null object}
});

gstWebRTCAPI.getAvailableProducers() -> Producer[];

const producersListener = {
  producerAdded: function(producer) {
    // A new producer is available (producer is a gstWebRTCAPI.Producer object)
  },

  producerRemoved: function(producer) {
    // A producer has been closed (producer is a gstWebRTCAPI.Producer object)
  }
};


class gstWebRTCAPI.ConsumerSession {
  get peerId() -> string; // Unique identifier of the producer peer to which this session is (or must be)
                          // connected (always non-empty)
  get sessionId() -> string; // Unique identifier of this session provided by the signalling server during
                             // connection (empty until connection succeeds)
  get state() -> SessionState; // Current consumer session state
  get streams() -> MediaStream[]; // List of remote streams received by this consumer session
  connect() -> boolean; // Must be called to connect the session, after registering any event listeners
  close(); // Closes the session definitively
}

gstWebRTCAPI.createConsumerSession(producerId) -> ConsumerSession;

The ConsumerSession class inherits from EventTarget and emits the following events:

  • error when a network or streaming error occurs,
  • stateChanged each time the session state changes,
  • closed when the session has been closed,
  • streamsChanged each time the underlying media streams change when media tracks are added or removed.
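
The consumer side can be sketched in the same spirit. The helper below picks the first producer returned by getAvailableProducers(), creates a consumer session for it, registers its listeners and then calls connect(). The names consumeFirstProducer and onStreams are illustrative and not part of gstwebrtc-api; closing the session on the first error is one possible policy among others, and treating a falsy result from createConsumerSession() as a failure is an assumption.

```javascript
// Sketch only: `consumeFirstProducer` and `onStreams` are hypothetical names, not part of gstwebrtc-api.
// `api` is expected to look like the gstWebRTCAPI object described above.
function consumeFirstProducer(api, onStreams) {
  const producers = api.getAvailableProducers();
  if (producers.length === 0) {
    return null; // nothing to consume yet
  }

  const session = api.createConsumerSession(producers[0].id);
  if (!session) {
    return null; // assumption: a falsy result signals failure
  }

  // Register listeners before connect(), as the API description requires.
  session.addEventListener("streamsChanged", () => onStreams(session.streams));
  // Design choice for this sketch: give up on the first error.
  session.addEventListener("error", () => session.close());

  if (!session.connect()) {
    session.close();
    return null;
  }
  return session;
}
```

In a page, the onStreams callback could for instance assign the first received stream to the srcObject property of a video element.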

You can find a reference usage of the gstwebrtc-api in the index.html file.

Conclusion

By using the new gstwebrtc-api in your web application in conjunction with the new webrtcsrc and webrtcsink GStreamer elements, you can now easily build a 360º realtime audio/video communication product.

You can transparently integrate web components and native components for desktop and/or backend applications.

The web integration is as easy as including a single script with a few straightforward functions (see index.html), whereas you can write your desktop or backend application using any wrapping language (C/C++, Rust, Python, .NET, Java, JavaScript, Shell Script, etc.) and/or UI framework (GTK, Qt, wxWidgets, Tcl/Tk, Avalonia, etc.) supported by GStreamer.

If you want to integrate multimedia and realtime communication into your project, don't hesitate to contact us by email or through our website. We will be pleased to offer our expertise to unblock your developments, reduce your innovation risks, or to help you add astonishing features to your products.

April 28, 2023 12:00 AM

April 26, 2023

Andy Wingo

structure and interpretation of flutter

Good day, gentle hackfolk. Like an old-time fiddler I would appear to be deep in the groove, playing endless variations on a theme, in this case mobile application frameworks. But one can only recognize novelty in relation to the familiar, and today's note is a departure: we are going to look at Flutter, a UI toolkit based not on JavaScript but on the Dart language.

The present, from the past

Where to start, even? The problem is big enough that I'll approach it from three different angles: from the past, from the top, and from the bottom.

With the other frameworks we looked at, we didn't have to say much about their use of JavaScript. JavaScript is an obvious choice, in 2023 at least: it is ubiquitous, has high quality implementations, and as a language it is quite OK and progressively getting better. Up to now, "always bet on JS" has had an uninterrupted winning streak.

But winning is not the same as unanimity, and Flutter and Dart represent an interesting pole of contestation. To understand how we got here, we have to go back in time. Ten years ago, JavaScript just wasn't a great language: there were no modules, no async functions, no destructuring, no classes, no extensible iteration, no optional arguments to functions. In addition it was hobbled with a significant degree of what can only be called accidental sloppiness: the with statement, which can dynamically alter a lexical scope; direct eval, which can define new local variables; Function.caller; and so on. Finally, larger teams were starting to feel the need for more rigorous language tooling that could use types to prohibit some classes of invalid programs.

All of these problems in JavaScript have been addressed over the last decade, mostly successfully. But in 2010 or so if you were a virtual machine engineer, you might look at JavaScript and think that in some other world, things could be a lot better. That's effectively what happened: the team that originally built V8 broke off and started to work on what became Dart.

Initially, Dart was targeted for inclusion in the Chrome web browser as an alternate "native" browser language. This didn't work, for various reasons, but since then Dart grew the Flutter UI toolkit, which has breathed new life into the language. And this is a review of Flutter, not a review of Dart, not really anyway; to my eyes, Dart is spiritually another JavaScript, different but in the same family. Dart's implementation has many interesting aspects as well that we'll get into later on, but all of these differences are incidental: they could just as well be implemented on top of JavaScript, TypeScript, or another source language in that family. Even if Flutter isn't strictly part of the JavaScript-based mobile application development frameworks that we are comparing, it is valuable to the extent that it shows what is possible, and in that regard there is much to say.

Flutter, from the top

At its most basic, Flutter is a UI toolkit for Dart. In many ways it is like React. Like React, its interface follows the functional-reactive paradigm: programmers describe the "what", and Flutter takes care of the "how". Also, like the phenomenon in which new developers can learn React without really knowing JavaScript, Flutter is the killer app for Dart: Flutter developers mostly learn Dart at the same time that they pick up Flutter.

In some other ways, Flutter is the logical progression of React, going in the same direction but farther along. Whereas React-on-the-web takes the user's declarative specifications of what the UI should look like and lowers them into DOM trees, and React Native lowers them to platform-native UI widgets, Flutter has its own built-in layout, rasterization, and compositing engine: Flutter draws all the pixels.

This has the predictable challenge that Flutter has to make significant investments so that its applications don't feel out-of-place on their platform, but on the other hand it opens up a huge space for experimentation and potential optimization: Flutter has the potential to beat native at its own game. Recall that with React Native, the result of the render-commit-mount process is a tree of native widgets. The native platform will surely then perform a kind of layout on those widgets, divide them into layers that correspond to GPU textures, paint those layers, then composite them to the screen -- basically, what a web engine will do.

What if we could instead skip the native tree and go directly to the lower GPU layer? That is the promise of Flutter. Flutter has the potential to create much smoother and more complex animations than the other application development frameworks we have mentioned, with lower overhead and energy consumption.

In practice... that's always the question, isn't it? Again, please accept my usual caveat that I am a compilers guy moonlighting in the user interface domain, but my understanding is that Flutter mostly lives up to its promise, but with one significant qualification which we'll get to in a minute. But before that, let's traverse Flutter from the other direction, coming up from Dart.

Dart, from the bottom

To explain some aspects of Dart I'm going to tell a just-so story that may or may not be true. I know and like many of the Dart developers, and we have similar instincts, so it's probably not too far from the truth.

Let's say you are the team that originally developed V8, and you decide to create a new language. You write a new virtual machine that looks like V8, taking Dart source code as input and applying advanced adaptive compilation techniques to get good performance. You can even be faster than JS because your language is just a bit more rigid than JavaScript is: you have traded off expressivity for performance. (Recall from our discussion of NativeScript that expressivity isn't a value judgment: there can be reasons to pay for more "mathematically appealing operational equivalences", in Felleisen's language, in exchange for applying more constraints on a language.)

But, you fail to ship the VM in a browser; what do you do? The project could die; that would be annoying, but you work for Google, so it happens all the time. However, a few interesting things happen around the same time that will cause you to pivot. One is a concurrent experiment by Chrome developers to pare the web platform down to its foundations and rebuild it. This effort will eventually become Flutter; while it was originally based on JS, eventually they will choose to switch to Dart.

The second thing that happens is that recently-invented smart phones become ubiquitous. Most people have one, and the two platforms are iOS and Android. Flutter wants to target them. You are looking for your niche, and you see that mobile application development might be it. As the Flutter people continue to experiment, you start to think about what it would mean to target mobile devices with Dart.

The initial Dart VM was made to JIT, but as we know, Apple doesn't let people do this on iOS. So instead you look to write a quick-and-dirty ahead-of-time compiler, based on your JIT compiler that takes your program as input, parses and baseline-compiles it, and generates an image that can be loaded at runtime. It ships on iOS. Funnily enough, it ships on Android too, because AOT compilation allows you to avoid some startup costs; forced to choose between peak performance via JIT and fast startup via AOT, you choose fast startup.

It's a success, you hit your product objectives, and you start to look further to a proper ahead-of-time compiler that generates native code that can stand alone without the full Dart run-time. After all, if you have to compile at build-time, you might as well take the time to do some proper optimizations. You actually change the language to have a sound type system so that the compiler can make program transformations that are valid as long as it can rely on the program's types.

Fun fact: I am told that the shift to a sound type system actually started before Flutter and thus before AOT, because of a Dart-to-JavaScript compiler that you inherited from the time in which you thought the web would be the main target. The Dart-to-JS compiler used to be a whole-program compiler; this enabled it to do flow-sensitive type inference, resulting in faster and smaller emitted JavaScript. But whole-program compilation doesn't scale well in terms of compilation time, so Dart-to-JS switched to separate per-module compilation. But then you lose lots of types! The way to recover the fast-and-small-emitted-JS property was through a stronger, sound type system for Dart.

At this point, you still have your virtual machine, plus your ahead-of-time compiler, plus your Dart-to-JS compiler. Such riches, such bounty! It is not a bad situation to be in, in 2023: you can offer a good development experience via the just-in-time compiled virtual machine. Apparently you can even use the JIT on iOS in developer mode, because attaching ptrace to a binary allows for native code generation. Then when you go to deploy, you make a native binary that includes everything.

For the web, you also have your nice story, even nicer than with JavaScript in some ways: because the type checker and ahead-of-time compiler are integrated in Dart, you don't have to worry about WebPack or Vite or minifiers or uglifiers or TypeScript or JSX or Babel or any of the other things that JavaScript people are used to. Granted, the tradeoff is that innovation is mostly centralized with the Dart maintainers, but currently Google seems to be investing enough so that's OK.

Stepping back, this story is not unique to Dart; many of its scenes also played out in the world of JavaScript over the last 5 or 10 years as well. Hermes (and QuickJS, for that matter) does ahead-of-time compilation, albeit only to bytecode, and V8's snapshot facility is a form of native AOT compilation. But the tooling in the JavaScript world is more diffuse than with Dart. With the perspective of developing a new JavaScript-based mobile operating system in mind, the advantages that Dart (and thus Flutter) has won over the years are also on the table for JavaScript to win. Perhaps even TypeScript could eventually migrate to have a sound type system, over time; it would take a significant investment but the JS ecosystem does evolve, if slowly.

(Full disclosure: while the other articles in this series were written without input from the authors of the frameworks under review, through what I can only think was good URL guesswork, a draft copy of this article leaked to Flutter developers. Dart hacker Slava Egorov kindly sent me a mail correcting a number of misconceptions I had about Dart's history. Fair play on whoever guessed the URL, and many thanks to Slava for the corrections; any remaining errors are wholly mine, of course!)


So how do we expect Flutter applications to perform? If we were writing a new mobile OS based on JavaScript, what would it mean in terms of performance to adopt a Flutter-like architecture?

Startup latency

Flutter applications are well-positioned to start fast, with ahead-of-time compilation. However they have had problems realizing this potential, with many users seeing a big stutter when they launch a Flutter app.

To explain this situation, consider the structure of a typical low-end Android mobile device: you have a small number of not-terribly-powerful CPU cores, but attached to the same memory you also have a decent GPU with many cores. For example, the SoC in the low-end Moto E7 Plus has 8 CPU cores and 128 GPU cores (texture shader units). You could paint widget pixels into memory from either the CPU or the GPU, but you'd rather do it in the GPU because it has so many more cores: in the time it takes to compute the color of a single pixel on the CPU, on the GPU you could do, like, 128 times as many, given that the comparison is often between multi-threaded rasterization on the GPU versus single-threaded rasterization on the CPU.

Flutter has always tried to paint on the GPU. Historically it has done so via a GPU back-end to the Skia graphics library, notably used by Chrome among other projects. But, Skia's API is a drawing API, not a GPU API; Skia is the one responsible for configuring the GPU to draw what we want. And here's the problem: configuring the GPU takes time. Skia generates shader code at run-time for rasterizing the specific widgets used by the Flutter programmer. That shader code then needs to be compiled to the language the GPU driver wants, which looks more like Vulkan or Metal. The process of compilation and linking takes time, potentially seconds, even.

The solution to "too much startup shader compilation" is much like the solution to "too much startup JavaScript compilation": move this phase to build time. The new Impeller rendering library does just that. However to do that, it had to change the way that Flutter renders: instead of having Skia generate specialized shaders at run-time, Impeller instead lowers the shapes that it draws to a fixed set of primitives, and then renders those primitives using a smaller, fixed set of shaders. These primitive shaders are pre-compiled at build time and included in the binary. By switching to this new renderer, Flutter should be able to avoid startup jank.
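
To make the idea of lowering shapes to a fixed set of primitives concrete, here is a minimal sketch that tessellates a circle into a fan of triangles. It is not Impeller's code and makes no claim about its actual algorithms; the function names tessellateCircle and triangleArea, and the choice of segment count, are illustrative assumptions. The point is only that once every shape is reduced to triangles, a single pre-compiled triangle-filling shader can draw them all.

```javascript
// Illustrative only: not Impeller's algorithm, just the general tessellation idea.
// Approximates a circle with `segments` triangles that share the centre vertex.
function tessellateCircle(cx, cy, radius, segments) {
  if (!Number.isInteger(segments) || segments < 3) {
    throw new RangeError("segments must be an integer >= 3");
  }
  const triangles = [];
  for (let i = 0; i < segments; i++) {
    const a0 = (2 * Math.PI * i) / segments;
    const a1 = (2 * Math.PI * (i + 1)) / segments;
    triangles.push([
      [cx, cy],
      [cx + radius * Math.cos(a0), cy + radius * Math.sin(a0)],
      [cx + radius * Math.cos(a1), cy + radius * Math.sin(a1)],
    ]);
  }
  return triangles;
}

// Area of one triangle (half the absolute cross product), used to check the approximation.
function triangleArea([[x0, y0], [x1, y1], [x2, y2]]) {
  return Math.abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2;
}
```

With 64 segments, the summed area of the triangles for a unit circle is within about 0.2% of π, which gives a feel for the tradeoff: more triangles mean a closer fit, but the set of shaders needed to draw them stays the same.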


Of all the application development frameworks we have considered, to my mind Flutter is the best positioned to avoid jank. It has the React-like asynchronous functional layout model, but "closer to the metal"; by skipping the tree of native UI widgets, it can potentially spend less time for each frame render.

When you start up a Flutter app on iOS, the shell of the application is actually written in Objective-C++. On Android it's the same, except that it's Java. That shell then creates a FlutterView widget and spawns a new thread to actually run Flutter (and the user's Dart code). Mostly, Flutter runs on its own, rendering frames to the GPU resources backing the FlutterView directly.

If a Flutter app needs to communicate with the platform, it passes messages across an asynchronous channel back to the main thread. Although these messages are asynchronous, this is probably the largest potential source of jank in a Flutter app, outside the initial frame paint: any graphical update which depends on the answer to an asynchronous call may lag.
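
A bit of back-of-the-envelope arithmetic shows why such a lag is visible. The sketch below counts how many frames get drawn before the reply to an asynchronous call is available, under the simplifying assumptions that the request is sent at the start of a frame and that frames are produced at a fixed interval. The function name staleFrames and its parameters are illustrative, not part of Flutter.

```javascript
// Illustrative arithmetic only; the function name and parameters are assumptions.
// Frames start every `frameMs`; a reply requested at t = 0 arrives at t = `replyMs`.
// A frame can show the reply only if it starts at or after the arrival time.
function staleFrames(replyMs, frameMs) {
  if (frameMs <= 0 || replyMs < 0) {
    throw new RangeError("frameMs must be positive and replyMs non-negative");
  }
  return Math.ceil(replyMs / frameMs);
}
```

At a 60 Hz refresh rate the frame budget is about 16.7 ms, so a 40 ms round trip to the main thread, staleFrames(40, 1000 / 60), leaves three frames drawn without the answer.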

Peak performance

Dart's type system and ahead-of-time compiler optimize for predictable good performance rather than the more variable but potentially higher peak performance that could be provided by just-in-time compilation.

This story should probably serve as a lesson to any future platform. The people that developed the original Dart virtual machine had a built-in bias towards just-in-time compilation, because it allows the VM to generate code that is specialized not just to the program but also to the problem at hand. A given system with ahead-of-time compilation can always be made to perform better via the addition of a just-in-time compiler, so the initial focus was on JIT compilation. On iOS of course this was not possible, but on Android and other platforms where this was available it was the default deployment model.

However, even Android switched to ahead-of-time compilation instead of the JIT model in order to reduce startup latency: doing any machine code generation at all at program startup was more work than was needed to get to the first frame. One could add JIT back again on top of AOT but it does not appear to be a high priority.

I would expect that Capacitor could beat Dart in some raw throughput benchmarks, given that Capacitor's JavaScript implementation can take advantage of the platform's native JIT capability. Does it matter, though, as long as you are hitting your frame budget? I do not know.

Aside: An escape hatch to the platform

What happens if you want to embed a web view into a Flutter app?

If you think on the problem for a moment I suspect you will arrive at the unsatisfactory answer, which is that for better or for worse, at this point it is too expensive even for Google to make a new web engine. Therefore Flutter will have to embed the native WebView. However Flutter runs on its own threads; the native WebView has its own process and threads but its interface to the app is tied to the main UI thread.

Therefore either you need to make the native WebView (or indeed any other native widget) render itself to (a region of) Flutter's GPU backing buffer, or you need to copy the native widget's pixels into their own texture and then composite them in Flutter-land. It's not so nice! The Android and iOS platform view documentation discuss some of the tradeoffs and mitigations.

Aside: For want of a canvas

There is a very funny situation in the React Native world in which, if the application programmer wants to draw to a canvas, they have to embed a whole WebView into the React Native app and then proxy the canvas calls into the WebView. Flutter is happily able to avoid this problem, because it includes its own drawing library with a canvas-like API. Of course, Flutter also has the luxury of defining its own set of standard libraries instead of necessarily inheriting them from the web, so when and if they want to provide equivalent but differently-shaped interfaces, they can do so.

Flutter manages to be more expressive than React Native in this case, without losing much in the way of understandability. Few people will have to reach to the canvas layer, but it is nice to know it is there.


Dart and Flutter are terribly attractive from an engineering perspective. They offer a delightful API and a high-performance, flexible runtime with a built-in toolchain. Could this experience be brought to a new mobile operating system as its primary programming interface, based on JavaScript? React Native is giving it a try, but I think there may be room to take things further to own the application from the program all the way down to the pixels.

Well, that's all from me on Flutter and Dart for the time being. Next up, a mystery guest; see you then!

by Andy Wingo at April 26, 2023 01:50 PM

April 24, 2023

Andy Wingo

structure and interpretation of nativescript

Greetings, hackers tall and hackers small!

We're only a few articles into this series on mobile application development frameworks, but I feel like we are already well into our journey. We started our trip through the design space with a look at Ionic / Capacitor, which defines its user interface in terms of the web platform, and only calls out to iOS or Android native features as needed. We proceeded on to React Native, which moves closer to native by rendering to platform-provided UI widgets, layering a cross-platform development interface on top.

Today's article takes an in-depth look at NativeScript, whose point in the design space is further on the road towards the platform, unabashedly embracing the specificities of the API available on iOS and Android, exposing these interfaces directly to the application programmer.

In practice what this looks like is that a NativeScript app is a native app which simply happens to call JavaScript on the main UI thread. That JavaScript has access to all native APIs, directly, without the mediation of serialization or message-passing over a bridge or message queue.

The first time I heard this I thought that it couldn't actually be all native APIs. After all, new versions of iOS and Android come out quite frequently, and surely it would take some effort on the part of NativeScript developers to expose the new APIs to JavaScript. But no, it really includes all of the various native APIs: the NativeScript developers wrote a build-time inspector that uses the platform's native reflection capabilities to grovel through all available APIs and to automatically generate JavaScript bindings, with associated TypeScript type definitions so that the developer knows what is available.

Some of these generated files are checked into source, so you can get an idea of the range of interfaces that are accessible to programmers; for example, see the iOS type definitions for x86-64. There are bindings for, like, everything.

Given access to all the native APIs, how do you go about making an app? You could write the same kind of programs that you would in Swift or Kotlin, but in JavaScript. But this would require more than just the ability to access native capabilities when needed: it needs a thorough knowledge of the platform interfaces, plus NativeScript itself on top. Most people don't have this knowledge, and those that do are probably programming directly in Swift or Kotlin already.

On one level, NativeScript's approach is to take refuge in that most ecumenical of adjectives, "unopinionated". Whereas Ionic / Capacitor encourages use of web platform interfaces, and React Native only really supports React as a programming paradigm, NativeScript provides a low-level platform onto which you can layer a number of different high-level frameworks.

Now, most high-level JavaScript application development frameworks are oriented to targeting the web: they take descriptions of user interfaces and translate them to the DOM. When targeting NativeScript, you could make it so that they target native UI widgets instead. However, given the baked-in assumptions of how widgets should be laid out (notably via CSS), there is some impedance-matching to do between DOM-like APIs and native toolkits.

NativeScript's answer to this problem is a middle layer: a cross-platform UI library that provides DOM-like abstractions and CSS layout in a way that bridges the gap between web-like and native. You can even define parts of the UI using a NativeScript-specific XML vocabulary, which NativeScript compiles to native UI widget calls at run-time. Of course, there is no CSS engine in UIKit or Android's UI toolkit, so NativeScript includes its own, implemented in JavaScript of course.

You could program directly to this middle layer, but I suspect that its real purpose is in enabling Angular, Vue, Svelte, or the like. The pitch would be that NativeScript lets app developers use pleasant high-level abstractions, but while remaining close to the native APIs; you can always drop down for more power and expressiveness if needed.

Diving back down to the low level, as we mentioned all of the interactions between JavaScript and the native platform APIs happen on the main application UI thread. NativeScript does also allow programmers to create background threads, using an implementation of the Web Worker API. One could even in theory run a React-based UI in a worker thread and proxy native UI updates to the main thread; as an unopinionated platform, NativeScript can support many different frameworks and paradigms.

Finally, there is the question of how NativeScript runs the JavaScript in an application. Recall that Ionic / Capacitor uses the native JS engine, by virtue of using the native WebView, and that React Native used to use JavaScriptCore on both platforms but now uses its own Hermes implementation. NativeScript is another point in the design space, using V8 on both platforms. (They used to use JavaScriptCore on iOS but switched to V8 once V8 was able to run on iOS in "jitless" mode.) Besides the reduced maintenance burden of using a single implementation on all platforms, this also has the advantage of being able to use V8 snapshots to move JavaScript parse-and-compile work to build-time, even on iOS.


NativeScript is fundamentally simple: it's V8 running in an application's main UI thread, with access to all platform native APIs. So how do we expect it to perform?

Startup latency

In theory, applications with a NativeScript-like architecture should have no problem with startup time, because they can pre-compile all of their JavaScript into V8 snapshots. Snapshots are cheap to load up because they are already in a format that V8 is ready to consume.

In practice, it would seem that V8 snapshots do not perform as expected for NativeScript. There are a number of aspects about this situation that I don't understand, which I suspect relate to the state of the tooling around V8 rather than to the fundamental approach of ahead-of-time compilation. V8 is really made for Chrome, and it could be that not enough maintenance resources have been devoted to this snapshot facility.

In the meantime, NativeScript instead uses V8's code cache feature, which caches the result of parsing and compiling JavaScript files on the device. In this way the first time an app is installed or updated, it might start up slowly, but subsequent runs are faster. If you were designing a new operating system, you'd probably want to move this work to app install-time.
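To make the trade-off concrete, here is a minimal sketch of the caching idea, not V8's actual code cache API. The `makeCodeCache` helper and the stand-in `compile` function are hypothetical names of mine; the point is only that the first load of a given source pays for compilation, and later loads of the same source do not.

```typescript
// Hypothetical sketch: cache the result of an expensive "compile" step,
// keyed by file name and source text, so only the first load pays for it.
function makeCodeCache<T>(compile: (source: string) => T) {
  const entries = new Map<string, T>();
  let compileCount = 0;
  return {
    load(fileName: string, source: string): T {
      // An updated app ships different source text, so the key changes
      // and the entry is compiled again.
      const key = `${fileName}\u0000${source}`;
      if (entries.has(key)) {
        return entries.get(key) as T;
      }
      compileCount++;
      const compiled = compile(source);
      entries.set(key, compiled);
      return compiled;
    },
    compileCount: (): number => compileCount,
  };
}

// Stand-in for parsing and compiling; a real engine does far more work here.
const codeCache = makeCodeCache((source: string) => source.toUpperCase());
codeCache.load("main.js", "let a = 1;"); // first run: compiles
codeCache.load("main.js", "let a = 1;"); // later run: served from the cache
codeCache.load("main.js", "let a = 2;"); // after an update: compiles again
```

Moving this work to install time would amount to filling the cache before the user ever launches the app.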

As we mentioned above, NativeScript apps have access to all native APIs. That is a lot of APIs, and only some of those interfaces will actually be used by any given app. In an ideal world, we would expect the build process to only include JavaScript code for those APIs that are needed by the application. However in the presence of eval and dynamic property lookup, pruning the native API surface to the precise minimum is a hard problem for a bundler to perform on its own. The solution for the time being is to manually allow and deny subsets of the platform native API. It's not an automatic process though, so it can be error-prone.
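As an illustration of what manual allow and deny rules could look like, here is a small sketch. The `ApiFilter` shape and the prefix-matching rule are assumptions of mine for the sake of the example; this is not NativeScript's actual configuration format.

```typescript
// Hypothetical rule shape; not NativeScript's real configuration format.
interface ApiFilter {
  allow: string[];
  deny: string[];
}

function matchesPrefix(apiName: string, prefixes: string[]): boolean {
  return prefixes.some((prefix) => apiName.startsWith(prefix));
}

// Keep an API only if it matches an allow rule and no deny rule.
function pruneApis(available: string[], filter: ApiFilter): string[] {
  return available.filter(
    (apiName) =>
      matchesPrefix(apiName, filter.allow) && !matchesPrefix(apiName, filter.deny),
  );
}

const availableApis = [
  "UIKit.UIButton",
  "UIKit.UILabel",
  "ARKit.ARSession",
  "CoreData.NSManagedObject",
];
const keptApis = pruneApis(availableApis, {
  allow: ["UIKit."],
  deny: ["UIKit.UILabel"],
});
```

The error-prone part is visible even here: if the app reaches `UIKit.UILabel` through a dynamic property lookup, nothing in the filter will notice until run-time.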

Besides the work that the JavaScript engine has to do to load an application's code, the other startup overhead involves whatever work that JavaScript might need to perform before the first frame is shown. In the case of NativeScript, more work is done before the initial layout than one would think: the main UI XML file is parsed by an XML parser written in JavaScript, any needed CSS files are parsed and loaded (again by JavaScript), and the tree of XML elements is translated to a tree of UI elements. The layout of the items in the view tree is then computed (in JavaScript, but calling into native code to measure text and so on), and then the app is ready.

At this point, I am again going to wave my "I am just a compiler engineer" flag: I am not a UI specialist, much less a NativeScript specialist. As in compilers, performance measurement and monitoring are key to UI development, but I suspect that also as in compilers there is a role for gut instinct. Incremental improvements are best driven by metrics, but qualitative leaps are often the result of somewhat ineffable hunches or even guesswork. In that spirit I can only surmise that React Native has an advantage over NativeScript in time-to-first-frame, because its layout is performed in C++ and because its element tree is computed directly from JavaScript instead of having JavaScript interpret XML and CSS files. In any case, I look forward to the forthcoming part 2 of the NativeScript and React Native performance investigations that were started in November 2022.

If I were NativeScript and using NativeScript's UI framework, and if startup latency proves to actually be a problem, I would lean into something in the shape of Angular's ahead-of-time compilation mode, but for the middle NativeScript UI layer.


Jank

On the face of it, NativeScript is the most jank-prone of the three frameworks we have examined, because it runs JavaScript on the main application UI thread, interleaved with UI event handling and painting and all of that. If an app's JavaScript takes too long to run, the app might miss frames or fail to promptly handle an event.

On the other hand, relative to React Native, the user's code is much closer to the application's behavior. There's no asynchrony between the application's logic and its main loop: in NativeScript it is easy to identify the code causing jank and eventually fix it.

The other classic JavaScript-on-the-main-thread worry relates to garbage collection pauses. V8's garbage collector does try to minimize the stop-the-world phase by tracing the heap concurrently and leveraging parallelism during pauses. Also, the user interface of a mobile app runs in an event loop, and typically spends most of its time idle; V8 exposes some API that can take advantage of this idle time to perform housekeeping tasks instead of needing to do them when handling high-priority events.
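The idle-time idea can be sketched independently of any engine. The following is a simulation under my own assumptions (the `IdleTask` shape and the per-task cost estimates are hypothetical), not V8's API: an event loop that knows how long it expects to be idle runs queued housekeeping tasks only while they fit in that budget.

```typescript
// Hypothetical task shape: each housekeeping task carries a cost estimate.
interface IdleTask {
  label: string;
  estimatedMs: number;
  run: () => void;
}

// Run queued tasks in order while their estimated cost fits in the idle
// budget. Tasks that do not fit stay queued for the next idle period.
function runDuringIdle(queue: IdleTask[], idleBudgetMs: number): string[] {
  const ran: string[] = [];
  let remainingMs = idleBudgetMs;
  for (;;) {
    const next = queue[0];
    if (next === undefined || next.estimatedMs > remainingMs) {
      break;
    }
    queue.shift();
    next.run();
    remainingMs -= next.estimatedMs;
    ran.push(next.label);
  }
  return ran;
}

let housekeepingRuns = 0;
const idleQueue: IdleTask[] = [
  { label: "sweep", estimatedMs: 3, run: () => { housekeepingRuns++; } },
  { label: "compact", estimatedMs: 5, run: () => { housekeepingRuns++; } },
  { label: "finalize", estimatedMs: 4, run: () => { housekeepingRuns++; } },
];
const ranLabels = runDuringIdle(idleQueue, 10);
```

With a 10 ms budget the first two tasks run and the third waits, which is the behaviour you want when a high-priority event may arrive at any moment.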

That said, having looked into the code of both the iOS and Android run-times, NativeScript does not currently take advantage of this facility. I dug deeper and it would seem that V8 itself is in flux, as the IdleNotificationDeadline API is on its way out; is the thought that concurrent tracing is largely sufficient? I would expect that if conservative stack scanning lands, we will see a re-introduction of this kind of API, as it does make sense to synchronize with the event loop when scanning the main thread stack.

Peak performance

As we have seen in our previous evaluations, this question boils down to "is the JavaScript engine state-of-the-art, and can it perform just-in-time compilation". In the case of NativeScript, the answers are yes and maybe, respectively: V8 is state-of-the-art, and it can JIT on Android, but not on iOS.

Perhaps the mitigation here is that the hardware that iOS runs on tends to be significantly more powerful than median Android devices; if you had to pick a subset of users to penalize with an interpreter-only run-time, people with iPhones are the obvious choice, because they can afford it.

Aside: Are markets wise?

Recall that our perspective in this series is that of the designer of a new JavaScript-based mobile development platform. We are trying to answer the question of what would it look like if a new platform offered a NativeScript-like experience. In this regard, only the structure of NativeScript is of interest, and notably its "market success" is not relevant, except perhaps in some Hayekian conception of the world in which markets are necessarily smarter than, well, me, or you, or any one of us.

It must be said, though, that React Native is the 800-pound gorilla of JavaScript mobile application development. The 2022 State of JS survey shows that among survey respondents, more people are aware of React Native than any other mobile framework, and people are generally more positive about React Native than other frameworks. Does NativeScript's mitigated market share indicate something about its architecture, or does it speak more to the size of Facebook's budget, both on the developer experience side and on marketing?

Aside: On the expressive power of application frameworks

Oddly, I think the answer to the market wisdom question might be found in a 35-year-old computer science paper, "On the expressive power of programming languages" (PDF).

In this paper, Matthias Felleisen considers the notion of what it means for one programming language to be more expressive than another. For example, is a language with just for less expressive than a language with both for and while? Intuitively we would say no, these are similar things; you can make a simple local transformation of while (x) {...} to for (;x;) {...} and you have exactly the same program semantics. On the other hand a language with just for is less expressive than one which also has goto; there is no simple local rewrite that can turn goto into for.
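The while-to-for rewrite is easy to check on a small example. The two functions below are my own illustration, not taken from the paper; they differ only by that local transformation and compute the same result.

```typescript
// Count how many times n can be halved before reaching zero, using `while`.
function halvingsWithWhile(n: number): number {
  let steps = 0;
  let x = n;
  while (x > 0) {
    x = Math.floor(x / 2);
    steps++;
  }
  return steps;
}

// The same loop after the local rewrite `while (c) {...}` to `for (;c;) {...}`.
function halvingsWithFor(n: number): number {
  let steps = 0;
  let x = n;
  for (; x > 0; ) {
    x = Math.floor(x / 2);
    steps++;
  }
  return steps;
}
```

Nothing outside the loop had to change, which is what makes the transformation local; no comparably small edit would do the same job for an arbitrary goto.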

In the same way, we can consider the question of what it would mean for one library to be more expressive than another. After all, the API of a library exposes a language in which its user can write programs; we should be able to reason about these languages. So between React Native and NativeScript, which one is more expressive?

By Felleisen's definitions, NativeScript is clearly the more expressive language: there is no simple local transformation that can turn imperative operations on native UI widgets into equivalent functional-reactive programs. Yes, with enough glue code React Native can reach directly to native APIs in a similar way as NativeScript, but everything that touches the native UI tree is expressly under React Native's control: there is no sanctioned escape hatch.

You might think that "more expressive" is always better, but Felleisen's take is more nuanced than that. Yes, he says, more expressive languages do allow programmers to make more concise programs, because they allow programmers to define abstractions that encapsulate patterns, and this is a good thing. However he also concludes that "an increase in expressive power is related to a decrease of the set of 'natural' (mathematically appealing) operational equivalences." Less expressive programming languages are easier to reason about, in general, and indeed that is one of the recognized strengths of React's programming model: it is easy to compose components and have confidence that the result will work.


A NativeScript-like architecture offers the possibility of performance: the developer has all the capabilities needed for writing pleasant-to-use applications that blend in with the platform-native experience. It is up to the developers to choose how to use the power at their disposal. In the wild, I expect that the low-level layer of NativeScript's API is used mainly by expert developers, who know how to assemble well-functioning machines from the parts on offer.

As a primary programming interface for a new JavaScript-based mobile platform, though, just providing a low-level API would seem to be not enough. NativeScript rightly promotes the use of more well-known high-level frameworks on top: Angular, Vue, Svelte, and such. Less experienced developers should use an opinionated high-level UI framework; these developers don't have good opinions yet and the API should lead them in the right direction.

That's it for today. Thanks for reading these articles, by the way; I have enjoyed diving into this space. Next up, we'll take a look beyond JavaScript, to Flutter and Dart. Until then, happy hacking!

by Andy Wingo at April 24, 2023 09:10 AM

April 21, 2023

Andy Wingo

structure and interpretation of react native

Hey hey! Today's missive continues exploring the space of JavaScript and mobile application development.

Yesterday we looked into Ionic / Capacitor, giving a brief structural overview of what Capacitor apps look like under the hood and how this translates to three aspects of performance: startup latency, jank, and peak performance. Today we'll apply that same approach to another popular development framework, React Native.

Background: React

I don't know about you, but I find that there is so much marketing smoke and lights around the whole phenomenon that is React and React Native that sometimes it's hard to see what's actually there. This is compounded by the fact that the programming paradigm espoused by React (and its "native" cousin that we are looking at here) is so effective at enabling JavaScript UI programmers to focus on the "what" and not the "how" that the machinery supporting React recedes into the background.

At its most basic, React is what they call a functional reactive programming model. It is functional in the sense that the user interface elements render as a function of the global application state. The reactive comes into how user input is handled, but I'm not going to focus on that here.

React's rendering process starts with a root element tree, describing the root node of the user interface. An element is a JavaScript object with a type property. To render an element tree, if the value of the type property is a string, then the element is terminal and doesn't need further lowering, though React will visit any node in the children property of the element to render them as needed.

Otherwise if the type property of an element is a function, then the element node is functional. In that case React invokes the node's render function (the type property), passing the JavaScript element object as the argument. React will then recursively re-render the element tree produced as a result of rendering the component until all nodes are terminal. (Functional element nodes can instead have a class as their type property, but the concerns are pretty much the same.)

(In the language of React Native, a terminal node is a React Host Component, and a functional node is a React Composite Component, and both are React Elements. There are many imprecisely-used terms in React and I will continue this tradition by using the terms I mention above.)

The rendering phase of a React application is thus a function from an element tree to a terminal element tree. Nodes of element trees can be either functional or terminal. Terminal element trees are composed only of terminal elements. Rendering lowers all functional nodes to terminal nodes. This description applies both to React (targeting the web) and React Native (which we are reviewing here).
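Here is a minimal sketch of that lowering, assuming a much-simplified element shape. The names `UiElement`, `Component` and `lowerToTerminal` are mine, and real React elements carry more fields and much more machinery (state, hooks, keys), so treat this as an illustration of the recursion only.

```typescript
// Hypothetical, simplified element shape; not React's actual data structure.
type Component = (element: UiElement) => UiElement;

interface UiElement {
  type: string | Component;
  props: { [key: string]: unknown };
  children: UiElement[];
}

// Lower an element tree until every node is terminal (its type is a string).
function lowerToTerminal(element: UiElement): UiElement {
  if (typeof element.type === "string") {
    // Terminal node: keep it, but still visit its children.
    return {
      type: element.type,
      props: element.props,
      children: element.children.map(lowerToTerminal),
    };
  }
  // Functional node: invoke its render function, then keep lowering the result.
  return lowerToTerminal(element.type(element));
}

const Greeting: Component = (element) => ({
  type: "text",
  props: { value: `Hello, ${String(element.props.who)}` },
  children: [],
});

const appTree: UiElement = {
  type: "view",
  props: {},
  children: [{ type: Greeting, props: { who: "world" }, children: [] }],
};

const loweredTree = lowerToTerminal(appTree);
```

After lowering, `loweredTree` contains only string-typed nodes: a `view` with a single `text` child.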

It's probably useful to go deeper into what React does with a terminal element tree, before building to the more complex pipeline used in React Native, so here we go. The basic idea is that React-on-the-web does impedance matching between the functional description of what the UI should have, as described by a terminal element tree, and the stateful tree of DOM nodes that a web browser uses to actually paint and display the UI. When rendering yields a new terminal element tree, React will compute the difference between the new and old trees. From that difference React then computes the set of imperative actions needed to mutate the DOM tree to correspond to what the new terminal element tree describes, and finally applies those changes.

In this way, small changes to the leaves of a React element tree should correspond to small changes in the DOM. Additionally, since rendering is a pure function of the global application state, we can avoid rendering at all when the application state hasn't changed. We'll dive into performance more deeply later on in the article.
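The diffing step can be sketched in a few lines, under heavy simplifying assumptions: nodes are matched by position only, and the `TerminalNode` and `Mutation` shapes are hypothetical. React's actual reconciliation is considerably more involved, so this only illustrates how a tree difference turns into a list of imperative actions.

```typescript
interface TerminalNode {
  tag: string;
  text: string;
  children: TerminalNode[];
}

// Imperative actions, each addressed by a path of child indices from the root.
type Mutation =
  | { op: "replace"; path: number[]; node: TerminalNode }
  | { op: "setText"; path: number[]; text: string }
  | { op: "insert"; path: number[]; node: TerminalNode }
  | { op: "remove"; path: number[] };

function diffTrees(
  oldNode: TerminalNode,
  newNode: TerminalNode,
  path: number[] = [],
): Mutation[] {
  // Different kinds of node: replace the whole subtree.
  if (oldNode.tag !== newNode.tag) {
    return [{ op: "replace", path, node: newNode }];
  }
  const mutations: Mutation[] = [];
  if (oldNode.text !== newNode.text) {
    mutations.push({ op: "setText", path, text: newNode.text });
  }
  newNode.children.forEach((child, index) => {
    const previous = oldNode.children[index];
    if (previous === undefined) {
      mutations.push({ op: "insert", path: [...path, index], node: child });
    } else {
      mutations.push(...diffTrees(previous, child, [...path, index]));
    }
  });
  // Remove surplus old children from the end so earlier indices stay valid.
  for (let i = oldNode.children.length - 1; i >= newNode.children.length; i--) {
    mutations.push({ op: "remove", path: [...path, i] });
  }
  return mutations;
}

const oldTree: TerminalNode = {
  tag: "view",
  text: "",
  children: [
    { tag: "text", text: "a", children: [] },
    { tag: "text", text: "b", children: [] },
  ],
};
const newTree: TerminalNode = {
  tag: "view",
  text: "",
  children: [
    { tag: "text", text: "a", children: [] },
    { tag: "text", text: "c", children: [] },
  ],
};
const treeChanges = diffTrees(oldTree, newTree);
```

Changing one leaf yields a single `setText` mutation at path `[1]`, and diffing a tree against itself yields no mutations at all.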

React Native doesn't use a WebView

React Native is similar to React-on-the-web in intent but different in structure. Instead of using a WebView on native platforms, as Ionic / Capacitor does, React Native renders the terminal element tree to platform-native UI widgets.

When a React Native functional element renders to a terminal element, it will create not just a JS object for the terminal node as React-on-the-web does, but also a corresponding C++ shadow object. The fully lowered tree of terminal elements will thus have a corresponding tree of C++ shadow objects. React Native will then calculate the layout for each node in the shadow tree, and then commit the shadow tree: as on the web, React Native computes the set of imperative actions needed to change the current UI so that it corresponds to what the shadow tree describes. These changes are then applied on the main thread of the application.

The twisty path that leads one to implement JavaScript

The description above of React Native's rendering pipeline applies to the so-called "new architecture", which has been in the works for some years and is only now (April 2023) starting to be deployed. The key development that has allowed React Native to move over to this architecture is tighter integration and control over its JavaScript implementation. Instead of using the platform's JavaScript engine (JavaScriptCore on iOS or V8 on Android), Facebook went and made their own whole new JavaScript implementation, Hermes. Let's step back a bit to see if we can imagine why anyone in their right mind would make a new JS implementation.

In the last article, I mentioned that the only way to get peak JS performance on iOS is to use the platform's WKWebView, which enables JIT compilation of JavaScript code. React Native doesn't want a WebView, though. I guess you could create an invisible WebView and just run your JavaScript in it, but the real issue is that the interface to the JavaScript engine is so narrow as to be insufficiently expressive. You can't cheaply synchronously create a shadow tree of layout objects, for example, because every interaction with JavaScript has to cross a process boundary.

So, it may be that JIT is just not worth paying for, if it means having to keep JavaScript at arm's distance from other parts of the application. How do you do JavaScript without a browser on mobile, though? Either you use the platform's JavaScript engine, or you ship your own. It would be nice to use the same engine on iOS and Android, though. When React Native was first made, V8 wasn't able to operate in a mode that didn't JIT, so React Native went with JavaScriptCore on both platforms.

Bundling your own JavaScript engine has the nice effect that you can easily augment it with native extensions, for example to talk to the Swift or Java app that actually runs the main UI. That's what I describe above with the creation of the shadow tree, but that's not quite what the original React Native did; I can only speculate but I suspect that there was a fear that JavaScript rendering work (or garbage collection!) could be heavy enough to cause the main UI to drop frames. Phones were less powerful in 2016, and JavaScript engines were less good. So the original React Native instead ran JavaScript in a separate thread. When a render would complete, the resulting terminal element tree would be serialized as JSON and shipped over to the "native" side of the application, which would actually apply the changes.
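The serialize-and-ship arrangement can be pictured as follows. This is a sketch with hypothetical names (`BridgeNode`, `serializeForBridge`, `receiveFromBridge`), not React Native's actual bridge code; it only shows that the two sides share no objects and communicate through a string.

```typescript
interface BridgeNode {
  tag: string;
  children: BridgeNode[];
}

// "JavaScript side": flatten the rendered terminal tree into a string message.
function serializeForBridge(tree: BridgeNode): string {
  return JSON.stringify(tree);
}

// "Native side": rebuild a private copy of the tree from the message.
function receiveFromBridge(message: string): BridgeNode {
  return JSON.parse(message) as BridgeNode;
}

const renderedTree: BridgeNode = {
  tag: "view",
  children: [{ tag: "text", children: [] }],
};
const bridgeMessage = serializeForBridge(renderedTree);
const nativeCopy = receiveFromBridge(bridgeMessage);
```

Every render pays for one stringify and one parse over the whole message, and the receiving side can never ask the sending side a synchronous question, which is where the layout trouble described next comes from.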

This arrangement did work, but it ran into problems whenever the system needed synchronous communication between native and JavaScript subsystems. As I understand it, this was notably the case when React layout would need the dimensions of a native UI widget; to avoid a stall, React would assume something about the dimensions of the native UI, and then asynchronously re-layout once the actual dimensions were known. This was particularly gnarly with regards to text measurements, which depend on low-level platform-specific rendering details.

To recap: React Native had to interpret its JS on iOS and was using a "foreign" JS engine on Android, so they weren't gaining anything by using a platform JS interpreter. They would sometimes have some annoying layout jank when measuring native components. And what's more, React Native apps would still experience the same problem as Ionic / Capacitor apps, in that application startup time was dominated by parsing and compiling the JavaScript source files.

The solution to this problem was partly to switch to the so-called "new architecture", which doesn't serialize and parse so much data in the course of rendering. But the other side of it was to find a way to move parsing and compiling JavaScript to the build phase, instead of having to parse and compile JS every time the app was run. On V8, you would do this by generating a snapshot. On JavaScriptCore, which React Native used, there was no such facility. Faced with this problem and armed with Facebook's bank account, the React Native developers decided that the best solution would be to make a new JavaScript implementation optimized for ahead-of-time compilation.

The result is Hermes. If you are familiar with JavaScript engines, it is what you might expect: a JavaScript parser, originally built to match the behavior of Esprima; an SSA-based intermediate representation; a set of basic optimizations; a custom bytecode format; an interpreter to run that bytecode; a GC to manage JS objects; and so on. Of course, given the presence of eval, Hermes needs to include the parser and compiler as part of the virtual machine, but the hope is that most user code will be parsed and compiled ahead-of-time.

If this were it, I would say that Hermes seems to me to be a dead end. V8 is complete; Hermes is not. For example, Hermes doesn't have with, async function implementation has been lagging, and so on. Why Hermes when you can V8 (with snapshots), now that V8 doesn't require JIT code generation?

I thought about this for a while and in the end, given that V8's main target isn't as an embedded library in a mobile app, perhaps the binary size question is the one differentiating factor (in theory) for Hermes. By focussing on lowering distribution size, perhaps Hermes will be a compelling JS engine in its own right. In any case, Facebook can afford to keep Hermes running for a while, regardless of whether it has a competitive advantage or not.

It sounds like I'm criticising Hermes here but that's not really the point. If you can afford it, it's good to have code you control. For example one benefit that I see React Native getting from Hermes is that they control the threading model; they can mostly execute JS in its own thread, but interrupt that thread and switch to synchronous main-thread execution in response to high-priority events coming from the user. You might be able to do that with V8 at some point but the mobile-apps-with-JS domain is still in flux, so it's nice to have a sandbox that React Native developers can use to explore the system design space.


With that long overview out of the way, let's take a look at what kinds of performance we can expect out of a React Native system.

Startup latency

Because React Native apps have their JavaScript code pre-compiled to Hermes bytecode, we can expect that the latency imposed by JavaScript during application startup is lower than is the case with Ionic / Capacitor, which needs to parse and compile the JavaScript at run-time.

However, it must be said that as a framework, React tends to result in large application sizes and incurs significant work at startup time. One of React's strengths is that it allows development teams inside an organization to compose well: because rendering is a pure function, it's easy to break down the task of making an app into subtasks to be handled by separate groups of people. Could this strength lead to a kind of weakness, in that there is less of a need for overall coordination on the project management level, such that in the end nobody feels responsible for overall application performance? I don't know. I think the concrete differences between React Native and React (the C++ shadow object tree, the multithreading design, precompilation) could mean that React Native is closer to an optimum in the design space than React. It does seem to me though that whether a platform's primary development toolkit should be React-like remains an open question.


In theory React Native is well-positioned to avoid jank. JavaScript execution is mostly off the main UI thread. The threading model changes to allow JavaScript rendering to be pre-empted onto the main thread do make me wonder, though: what if that work takes too much time, or what if there is a GC pause during that pre-emption? I would not be surprised to see an article in the next year or two from the Hermes team about efforts to avoid GC during high-priority event processing.

Another question I would have about jank relates to interactivity. Say the user is dragging around a UI element on the screen, and the UI needs to re-layout itself. If rendering is slow, then we might expect to see a lag between UI updates and the dragging motion; the app technically isn't dropping frames, but the render can't complete in the 16 milliseconds needed for a 60 frames-per-second update frequency.
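To put numbers on that budget, here is a small Python sketch that computes the per-frame time budget and how many frame intervals a render of a given duration occupies. The function names are illustrative and do not come from any framework.

```python
import math


def frame_budget_ms(fps):
    """Time available to produce one frame at the given refresh rate."""
    return 1000.0 / fps


def frame_intervals(render_ms, fps=60):
    """Number of frame intervals a render occupies; 1 means it fits the budget."""
    return max(1, math.ceil(render_ms / frame_budget_ms(fps)))


print(round(frame_budget_ms(60), 2))               # budget per frame at 60 fps
print(frame_intervals(10), frame_intervals(40))    # a fast render vs. a slow one
```

A render that takes 40 ms spans three frame intervals at 60 frames per second, so the UI would trail the user's finger by that much even though each frame is eventually delivered.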

Peak perf

But why might rendering be slow? On the one side, there is the fact that Hermes is not a high-performance JavaScript implementation. It uses a simple bytecode interpreter, and will never be able to meet the performance of V8 with JIT compilation.

However the other side of this is the design of the application framework. In the limit, React suffers from the O(n) problem: any change to the application state requires the whole element tree to be recomputed. Rendering and layout work is proportional to the size of the application, which may have thousands of nodes.

Of course, React tries to minimize this work, by detecting subtrees whose layout does not change, by avoiding re-renders when state doesn't change, by minimizing the set of mutations to the native widget tree. But the native widgets aren't the problem: the programming model is, or it can be anyway.
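As a toy model of that trade-off (not React's actual reconciliation algorithm), the Python sketch below renders a list "app" whose children are memoized on their inputs. The names `render_app`, `cache` and `stats` are hypothetical. The point is that memoization skips the child renders, but the parent still walks every child to find out which ones changed.

```python
def render_app(items, cache, stats):
    """Render one root with one child per item.

    cache maps (index, label) to previously rendered output, standing in for
    a memoized pure component. stats counts the child renders that really ran.
    """
    out = []
    for index, label in enumerate(items):   # still visits every child: O(n)
        key = (index, label)
        if key not in cache:
            stats["child_renders"] += 1
            cache[key] = f"<item {index}: {label}>"
        out.append(cache[key])
    return out


cache, stats = {}, {"child_renders": 0}
items = [f"row {i}" for i in range(1000)]

render_app(items, cache, stats)              # first render: every child runs
first = stats["child_renders"]

items[42] = "row 42 (edited)"                # change a single piece of state
render_app(items, cache, stats)
second = stats["child_renders"] - first

print(first, second)
```

Only one child is re-rendered after the edit, yet the loop still touches all 1000 entries, which is the sense in which the cost stays proportional to the size of the application.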

Aside: As good as native?

Again in theory, React Native can be used to write apps that are as good as if they were written directly against platform-native APIs in Kotlin or Swift, because it uses the same platform UI toolkits as native applications. React Native can also do this at the same time as being cross-platform, targeting iOS and Android with the same code. In practice, besides the challenge of designing suitable cross-platform abstractions, React Native has to grapple with potential performance and memory use overheads of JavaScript, but the result has the potential to be quite satisfactory.

Aside: Haven't I seen that rendering model somewhere?

As I mentioned in the last article, I am a compiler engineer, not a UI specialist. In the course of my work I do interact with a number of colleagues working on graphics and user interfaces, notably in the context of browser engines. I was struck, when reading about React Native's rendering pipeline, by how much it resembled what a browser itself will do as part of the layout, paint, and render pipeline: translate a tree of objects to a tree of immutable layout objects, clip those to the viewport, paint the ones that are dirty, and composite the resulting textures to the screen.

It's funny to think about how many levels we have here: the element tree, the recursively expanded terminal element tree, the shadow object tree, the platform-native widget tree, surely a corresponding platform-native layout tree, and then the GPU backing buffers that are eventually composited together for the user to see. Could we do better? I could certainly imagine any of these mobile application development frameworks switching to their own Metal/Vulkan-based rendering architecture at some point, to flatten out these layers.


By all accounts, React Native is a real delight to program for; it makes developers happy. The challenge is to make it perform well for users. With its new rendering architecture based on Hermes, React Native may well be on the path to addressing many of these problems. Bytecode pre-compilation should go a long way towards solving startup latency, provided that React's expands-to-fit-all-available-space tendency is kept in check.

If you were designing a new mobile operating system from the ground up, though, I am not sure that you would necessarily end up with React Native as it is. At the very least, you would include Hermes and the base run-time as part of your standard library, so that every app doesn't have to incur the space costs of shipping the run-time. Also, in the same way that Android can ahead-of-time and just-in-time compile its bytecode, I would expect that a mobile operating system based on React Native would extend its compiler with on-device post-install compilation and possibly JIT compilation as well. And at that point, why not switch back to V8?

Well, that's food for thought. Next up, NativeScript. Until then, happy hacking!

by Andy Wingo at April 21, 2023 08:20 AM

April 20, 2023

Andy Wingo

structure and interpretation of capacitor programs

Good day, hackers! Today's note is a bit of a departure from compiler internals. A client at work recently asked me to look into cross-platform mobile application development and is happy for the results to be shared publicly. This, then, is the first in a series of articles.

Mobile apps and JavaScript: how does it work?

I'll be starting by taking a look at Ionic/Capacitor, React Native, NativeScript, Flutter/Dart, and then a mystery guest. This article will set the stage and then look into Ionic/Capacitor.

The angle I am taking is, if you were designing a new mobile operating system that uses JavaScript as its native application development language, what would it mean to adopt one of these as your primary app development toolkit? It's a broad question but I hope we can come up with some useful conclusions.

I'm going to approach the problem from the perspective of a compiler engineer. Compilers translate high-level languages that people like to write into low-level languages that machines like to run. Compilation is a bridge between developers and users: developers experience the source language interface to the compiler, while users experience the result of compiling those source languages. Note, though, that my expertise is mostly on the language and compiler side of things; though I have worked on a few user-interface libraries in my time, I am certainly not a graphics specialist, nor a user-experience specialist.

I'm explicitly not including the native application development toolkits from Android and iOS, because these are most useful to me as a kind of "control" to the experiment. A cross-platform toolkit that compiles down to use native widgets can only be as good as the official SDK, but it might be worse; and a toolkit that routes around the official widgets by using alternate drawing primitives has the possibility to be better, but at the risk of being inconsistent with other platform apps, along with the more general risk of not being as good as the native UI libraries.

And, of course, there is the irritation that SwiftUI / UIKit / AppKit aren't open source, and that neither iOS nor Android's native UI library is cross platform. If we are designing a new JavaScript-based mobile operating system (that's the conceit of the original problem statement), then we might as well focus on the existing JS-based toolkits. (Flutter of course is the odd one out, but it's interesting enough to include in the investigation.)

Ionic / Capacitor

Let's begin with the Ionic Framework UI toolkit, based on the Capacitor web run-time.


Capacitor is like an alternate web browser development kit for phones. It uses the platform's native WebView: WKWebView on iOS or WebView on Android. The JavaScript APIs available within that WebView are extended with Capacitor plugins which can access native APIs; these plugins let JavaScript do the sort of things you wouldn't be able to do directly in a web browser.

(Over time, most of the features provided by Capacitor end up in the web platform in some way. For example, there are web standards to allow JavaScript to detect or lock screen rotation. The Fugu project is particularly avant-garde in pushing these capabilities to the web. But, the wheel of web standards grinds very finely, and for an app to have access to, say, the user's contact list now, it is pragmatic to use an extended-webview solution like Capacitor.)

I call Capacitor a "web browser development kit" because creating a Capacitor project effectively copies the shell of a specialized browser into your source tree, usually with an iOS and Android version. Building the app makes a web browser with your extensions that runs your JS, CSS, image, and other assets. Running the app creates a WebView and invokes your JavaScript; your JS should then create the user interface using the DOM APIs that are standard in a WebView.

As you can see, Capacitor is quite full-featured when it comes to native platform capabilities but is a bit bare-bones when it comes to UI. To provide a richer development environment, the Ionic Framework adds a set of widgets and some application life-cycle abstractions on top of the Capacitor WebView. Ionic is not like a "normal" app framework like Vue or Angular because it takes advantage of Capacitor APIs, and because it tries to mimic the presentation and behavior of the native platform (iOS or Android, mainly), so that apps built with it don't look out-of-place.

The Ionic Framework itself is built using Web Components, which end up creating styled DOM nodes with JavaScript behavior. These components can compose with other app frameworks such as Vue or Angular, allowing developers to share some of their app code between web and mobile apps---you write the main app in Vue, and just implement the components one way on the web and using something like Ionic Framework on mobile.

To recap, an Ionic app uses Ionic Framework libraries on top of a Capacitor run-time. You could use other libraries on top of Capacitor instead of Ionic. In any case, I'm most interested in Capacitor, so let's continue our focus there.


Performance has a direct effect on the experience that users have when interacting with an application, and here I would like to break down the performance in three ways: one, startup latency; two, jank; and three, peak throughput. As Capacitor is structurally much like a web browser, many of these concerns, pitfalls, and mitigation techniques are similar to those on the web.

Startup latency

Startup latency is how long the app makes you wait after you run it before you can interact with it. The app may hide some of this work behind a splash screen, but if there is too much work to do at startup-time, the risk is the user might get bored or frustrated and switch away. The goal, therefore, is to minimize startup latency. Most people consider time-to-interactive of less than a second to be sufficient; there's room to improve but for better or for worse, user expectations here are lower than they could be.

In the case of Capacitor, when a user launches an application, the app loads a minimal skeleton program that loads a WebView. On iOS and most Android systems, the rendering and JS parts of the WebView run in a separate process that has access to the app's bundled assets: JavaScript files, images, CSS, and so on. The WebView then executes the JavaScript main function to create the UI.

Capacitor application startup time will be dominated by parsing and compiling the JavaScript source files, as well as any initial work needed to boot the app framework (which could require running a significant amount of JS), and will depend on how fast the user's device is. The most effective performance mitigation here is to reduce the amount of JavaScript loaded, but this can be difficult for complex apps. The main techniques to reduce app size are toolchain-based. Tree-shaking, bundling, and minification reduce the number of JS bytes that the engine needs to parse without changing application logic. Code splitting can defer some work until it is needed, but this might just result in jank later on.

Common Ionic application sizes would seem to be between 500 kB and 5 MB of JavaScript. In 2021, Alex Russell suggested that the soon-to-be standard Android performance baseline should be the Moto E7 Plus which, if its performance relative to the mid-range Moto G4 phone that he measured in 2019 translates to JavaScript engine speed, should be able to parse and compile uncompressed JavaScript at a rate of about 1 MB/s. That's not very much, if you want to get startup latency under a second, and a JS payload size of 5 MB would lead to 5-second startup delays for many users. To my mind, this is the biggest challenge for a Capacitor-based app.

(Calculation detail: The Moto G4 JavaScript parse-and-compile throughput is measured at 170 kB/s, for compressed JS. Assuming a compression ratio of 4 and that the Moto E7 Plus is 50% faster than the Moto G4, that gets us to 1 MB/s for uncompressed JS, for what was projected to be a performance baseline in 2023.)
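The arithmetic behind that estimate can be written out as a short Python sketch. The function names are mine; the inputs are the figures quoted above, with the compression ratio and the 50% speedup being the stated assumptions.

```python
def uncompressed_parse_rate_kb_s(compressed_rate_kb_s, compression_ratio, speedup):
    """Scale a measured compressed-JS parse rate to an uncompressed rate on a faster device."""
    return compressed_rate_kb_s * compression_ratio * speedup


def startup_delay_s(payload_kb, rate_kb_s):
    """Time spent parsing and compiling a JS payload at the given rate."""
    return payload_kb / rate_kb_s


# Moto G4: 170 kB/s of compressed JS; assumed compression ratio of 4;
# Moto E7 Plus assumed to be 50% faster.
rate = uncompressed_parse_rate_kb_s(170, 4, 1.5)
print(rate)  # kB/s of uncompressed JS, i.e. about 1 MB/s

for payload_kb in (500, 5000):
    print(payload_kb, "kB ->", round(startup_delay_s(payload_kb, rate), 1), "s")
```

Under these assumptions a 500 kB payload costs roughly half a second of parse-and-compile time, while a 5 MB payload costs close to five seconds.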


Jank is when an application's animation and interaction aren't smooth, because the application somehow missed rendering one or more frames. Technically startup time is a form of jank, but it is useful to treat startup separately.

Generally speaking, the question of whether a Capacitor application will show jank depends on who is doing the rendering: if application JavaScript is driving an animation, then this could cause jank, in the same way as it would on the web. The mitigation is to lean on the web platform so that the WebView is the one in charge of ensuring smooth interaction, for example by using CSS animations or the native scrolling capability.

There may always be an impetus to do some animation in JavaScript, though, for example if the design guidelines require a specific behavior that can't be created using raw CSS.

Peak perf

Peak throughput is how much work an app can do per unit time. For an application written in JavaScript, this is a question of how fast the JavaScript engine can make a user's code run.

Here we need to make an important aside on the strategies that a JavaScript engine uses to run a user's code. A standard technique that JS engines use to make JavaScript run fast is just-in-time (JIT) compilation, in which the engine emits specialized machine code at run-time and then runs that code. However, on iOS the platform prohibits JIT code generation for most applications. The usual justification for this restriction is that the low-level capability granted to a program that allows it to JIT-compile increases the likelihood of security exploits. The only current exception to the no-JIT policy on iOS is for WebView (including the one in the system web browser), which iOS maintainers consider to be sufficiently sandboxed, so that any security exploit that uses JIT code generation won't be able to access other capabilities possessed by the application process.

So much for the motivation, as I understand it. The upshot is, there is no JIT on iOS outside web browsers or web views. If a cross-platform application development toolkit wants peak JavaScript performance on iOS, it has to run that JS in a WebView. Capacitor does this, so it has fast JS on iOS. Note that these restrictions are not in place on Android; you can have fast JS on Android without using a WebView, by using the system's JavaScript libraries.

Of course, actually getting peak throughput out of JavaScript is an art but the well-known techniques for JavaScript-on-the-web apply here; we won't cover them in this series.

The bridge

All of these performance observations are common to all web browsers, but with Capacitor there is the additional wrinkle that a Capacitor application can access native capabilities, for example access to a user's contacts. When application JavaScript goes to access the contacts API, Capacitor will send a message to the native side of the application over a bridge. The message will be serialized to JSON. Capacitor will include an implementation for the native side of the bridge into your app's source code, written in Swift for iOS or Java for Android. The native end of the bridge parses the JSON message, performs the requested action, and sends a message back with the result. Some messages may need to be proxied to the main application thread, because some native APIs can only be processed there.
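The shape of such a round trip can be sketched in Python. This is only a model of the idea: the message fields (`id`, `plugin`, `method`, `args`) and the `Contacts.find` handler are hypothetical, not Capacitor's actual wire format or plugin API.

```python
import json


def js_side_call(plugin, method, args, call_id):
    """Serialize a call the way the JS side might before posting it over the bridge."""
    return json.dumps({"id": call_id, "plugin": plugin, "method": method, "args": args})


def native_side_handle(message, handlers):
    """Parse the message, dispatch to a native handler, and serialize the reply."""
    call = json.loads(message)
    handler = handlers[(call["plugin"], call["method"])]
    result = handler(**call["args"])
    return json.dumps({"id": call["id"], "result": result})


# Hypothetical native handler standing in for a contacts API.
handlers = {
    ("Contacts", "find"): lambda query: [n for n in ("Ada", "Alan", "Grace") if query in n],
}

request = js_side_call("Contacts", "find", {"query": "A"}, call_id=1)
reply = json.loads(native_side_handle(request, handlers))
print(reply)
```

Every call pays for two encodes and two decodes, and the `id` field is what lets the asynchronous reply be matched back to the pending JavaScript promise.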

As you can imagine, the bridge has an overhead. Apps that have high-bandwidth access to native capabilities will also have JSON encoding overhead, as well as general asynchronous coordination overhead. It may even be possible that encoding or decoding a large JSON message causes the WebView to miss a frame, for example when accessing a large file in local storage.

The bridge is a necessary component to the design, though; an iOS WebView can't have direct in-process access to native capabilities. For Android, the WebView APIs do not appear to allow this either, though it is theoretically possible to ship your own WebView that could access native APIs directly. In any case, the Capacitor multi-process solution does allow for some parallelism, and the enforced asynchronous nature of the APIs should lead to less modal application programming.

Aside: Can WebView act like native widgets?

Besides performance, one way that users experience application behavior is by way of expectations: users expect (but don't require) a degree of uniformity between different apps on the same platform, for example the same scrolling behavior or the same kinds of widgets. This aspect isn't the most important one to my investigation, because I'm more concerned with a new operating system that might come with its own design language with its own standard JavaScript-based component library, but it's one to note; an Ionic app is starting at a disadvantage relative to a platform-native app. Sometimes ensuring a platform-native UI is just a question of CSS styling and using the right library (like Ionic Framework), but sometimes it might take significant engineering or possibly even new additions to the web platform.

Aside: Is a WebView as good as native widgets?

As a final observation on user experience, it's worth asking the question of whether you can actually make a user interface using the web platform that is "as good" as native widgets. After all, the API and widget set (DOM) available to a WebView is very different from that available to a native application; is a WebView able to use CPU and GPU resources efficiently to create a smooth interface? Are web standards, CSS, and the DOM API somehow a constraint that prevents efficient user interface construction? Here, somewhat embarrassingly, I can't actually say. I am instead going to just play the "compiler engineer" card, that I'm not a UI specialist; to me it is an open question.


Capacitor, with or without the Ionic Framework on top, is the always-bet-on-the-web solution to the cross-platform mobile application development problem. It uses stock system WebViews and JavaScript, augmented with native capabilities as needed. Consistency with the UX of the specific platform (iOS or Android) may require significant investment, though.

Performance-wise, my evaluation is that Capacitor apps are well-positioned for peak JS performance due to JIT code generation and low jank due to offloading animations to the web platform, but may have problems with startup latency, as Capacitor needs to parse and compile a possibly significant amount of JavaScript each time it is run.

Next time, a different beast: React Native. Until then, happy hacking!

by Andy Wingo at April 20, 2023 10:20 AM

April 19, 2023

Stéphane Cerveau

ESExtractor: how to integrate a dependency-free library to the Khronos CTS


Since the Vulkan CTS is now able to test and check Vulkan Video support including video decoding, it was necessary to define the kind of media container to be used inside the test cases and the library to extract the necessary encoded data.

In a first attempt, the FFmpeg media toolkit was chosen to extract the video packets from the A/V ISO base media format chosen as a container reference. This library was provided as a binary package and loaded dynamically at each test run.

As Vulkan video aims to test only video contents, it was not necessary to choose a complex media container, so first all the videos were converted to the elementary stream format for H264 and H265 contents. This is a very elementary format based on MPEG start codes and NAL unit identification.
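To give a feel for how simple this format is, here is a minimal Python sketch that splits an H264 byte stream on its start codes and reads the NAL unit type from the first byte of each unit. It is an illustration under my own assumptions, not ESExtractor's implementation, and the sample bytes are made up.

```python
def split_nal_units(data: bytes):
    """Split a start-code-delimited byte stream into NAL units (start codes removed)."""
    units = []
    start = None
    i = 0
    while i + 3 <= len(data):
        if data[i:i + 3] == b"\x00\x00\x01":
            if start is not None:
                # Trailing zero bytes belong to the next (4-byte) start code.
                units.append(data[start:i].rstrip(b"\x00"))
            start = i + 3
            i += 3
        else:
            i += 1
    if start is not None:
        units.append(data[start:].rstrip(b"\x00"))
    return units


# Made-up stream: three units, using both 4-byte and 3-byte start codes.
stream = (b"\x00\x00\x00\x01\x67\x42"
          b"\x00\x00\x01\x68\xce"
          b"\x00\x00\x00\x01\x65\x88\x84")

units = split_nal_units(stream)
# In H264 the NAL unit type is the low 5 bits of the first byte.
print([u[0] & 0x1F for u in units])
```

For H264, types 7, 8 and 5 are a sequence parameter set, a picture parameter set and an IDR slice respectively; H265 uses a two-byte NAL header, so the type extraction differs there.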

To avoid a multimedia solution that could only be integrated as a binary, the first attempt to replace FFmpeg was to use GStreamer and an in-house helper library called demuxeres. It was smaller but still needed to be shipped as a binary, to avoid the glib/gstreamer system dependencies (a self-contained library). It was still a no-go because the binary package would be awkward to support on the various platforms targeted by the Khronos CTS.

So at Igalia, we decided to implement a minimal, dependency-free, custom library, written in C++ to be compliant with the Khronos CTS and simple to integrate into any build system.

This library is called ESExtractor.

What is ESExtractor?

ESExtractor aims to be a simple elementary stream extractor. For the first revision it was able to extract video data from a file in the NAL format. The first official release was 0.2.4, which supported only NAL-based streams, for both H264 and H265.

As Vulkan Video aims to support more than H264 and H265, including formats such as AV1 or VP9, ESExtractor had to support multiple formats. A redesign was started to support multiple formats and is now available in version 0.3.2.

How ESExtractor works

A simple C interface is provided to maximise portability and use by other languages.

es_extractor_new

ESExtractor extractor = es_extractor_new(filePath, "options");

This interface returns the main object which will give you access to the packets according to the file path and the options given in the arguments. This interface returns a ready to use object where the stream has been initially parsed to determine the kind of video during the object creation.

Then you can check the video format with:

ESEVideoFormat eVideoFormat = es_extractor_video_format(extractor);

or the video codec with:

ESEVideoCodec eVideoCodec = es_extractor_video_codec(extractor);

It supports H264, H265, AV1 and VP9 for now.

es_extractor_read_packet

This API is the main function to retrieve the available packets from the file. Each time this API is called, the library will return the next available packet according to the format and the specific alignment (i.e. NAL) and a status to let the application decide what to do next. The packet should be freed using es_extractor_clear_packet.

CI powered by GitHub

To test the library usage, we have implemented a testing framework in addition to a CI infrastructure. As GitHub offers a very powerful workflow, we decided to use this platform to test the library on various architectures and platforms. The CI is now configured to release packages for 64 and 32 bits on Linux and Windows.

As usual, if you would like to learn more about Vulkan Video, ESExtractor or any other open multimedia framework, please contact us!

April 19, 2023 12:00 AM

April 18, 2023

Andy Wingo

sticking point

Good evening, gentle readers. A brief note tonight, on a sticky place.

See, I have too many projects right now.

In and of itself this is not so much of a problem as a condition. I know my limits; I keep myself from burning out by shedding load, and there is a kind of priority list of which projects keep adequate service levels.

First come the tiny humans that are in my care who need their butts wiped and bodies translated to and from school or daycare and who -- well you know the old Hegelian trope, that the dialectic crank of history doesn't turn itself, that it takes actions from people to synthesize the thesis and the antithesis, and that even History itself isn't always monotonic; in the same way, bedtime is a reality, there are the material conditions of sleepiness and you're-gonna-be-so-tired-tomorrow but without the World-Historical Actor whose name is Dada ain't nobody getting into pyjamas, not to mention getting in bed and going to sleep.

Lower in the priority queue there is work, and work is precious: a whole day, a day of just thinking about things and solving problems; yes, I wish I could choose all of the problems that I work on but the simple fact of being able to think and solve is a gift, or rather a reward, time stolen from the arbitrary chaotic preliterate interruptors. I love my kids -- and here I have to choose my conjunction wisely -- and also they love me. Which is nice! They express this through wanting to sit on my lap and demand that I read to them when I am thinking about things. We all have our love languages, I suppose.

And the house? There are 5 of us now, and that's, you know, more than 5 times the work it was before because the 3 wee ones are more chaotic than the adults. There is so much laundry. We are now a cook-the-whole-500g-bag-of-pasta family, and we're far from the teenager stage.

Finally there are the weird side projects, of which there are a few. I have some sporty things I do that I can't avoid because I am socially coerced into doing them; all in all a fine configuration. I have some volunteer work, which has parts I like that take time, which I am fine with, and parts I don't that I neglect. There's the garden, also neglected but nice to muck about in sometimes. Sometimes I see friends? Rarely, so rarely, but sometimes.

But, netfriends, if I find a free couple hours, I hack. Not less than a couple hours, because otherwise I can't get into things. More? Yes please! But usually no, because priorities.

I write this note because I realized that in the last month or so, hain't been no hacktime; there's barely 30 minutes between the end of the evening kitchen-clean and the onset of zzzz. That's fine, really; ultimately I need to carve out work time for this sort of thing, somehow or other.

But I was also realizing that I painted myself into a corner with my GC work. See, I need some new benchmarks, and I can't afford to use Guile itself as my benchmark yet, because I can't afford to fix deep bugs right now. I need something small. Specifically I need a little language implementation with efficient GC integration. I was looking at porting the little Scheme implementation from my Wasm JIT work to C -- because the rest of whippet is in C -- but is that a good enough reason? I got mired in the swamps of precise GC roots in C and let me tell you, it is disgusting. You don't have RAII like you do in C++, without GCC extensions. You don't have stack maps, like you would like. I might be able to push through but it's the sort of project you need a day for, and those days have not been forthcoming recently.

Anyway, just a little hack log tonight, not detailing a success, but rather a condition. Perhaps observing the position of this logjam will change its velocity. Here's hoping, and until next time, happy hacking!

by Andy Wingo at April 18, 2023 08:23 PM

April 11, 2023

Clayton Craft

Wait, I can manage DNS config without losing hair??

In my unending quest to move the configuration for everything I manage into version control, I was always annoyed with configuring DNS providers. I run or use some services for domains that often require adding various records beyond the "standard" ones. Some examples are using DNS-01 for ACME stuff, configuring mail handling / verification, adding sub-domains and redirects, and so on. And backing this config up was always a chore. Put your hand up if you've thought to back this up before.

I should say at this point that I tend to pay for DNS providers to implement stuff for me. I assume that I don't have time to run an authoritative server, but I could be wrong...

So... the amount of DNS configuration I end up managing is more than I'd like. Not to mention that entering and saving config from DNS providers is absolutely crap. For the few I've used in the past, they all had different web forms for adding/managing records, and the forms all had limitations around acceptable input and how they were presented. I remember one that required splitting the record into two because of some arbitrary limit on number of characters for an input field. Fun, right?

This led me on a massive hunt a few years ago to find something better. That's when I discovered LuaDNS. This is a DNS provider that loads configuration, written in Lua, from your git repository. Do I really need to say more??

Ok fine, here are some reasons why I like it:

Example: Configuring a domain for email service

This is an example of configuring a domain to use migadu for email service:


-- migadu email setup
txt(_a, 'hosted-email-verify=aaaaaaaa')
txt(_a, 'v=spf1 -all')
txt('_dmarc.' .. _a, 'v=DMARC1; p=quarantine;')

mx(_a, '', 0)
mx(_a, '', 1)

cname('key1._domainkey.' .. _a, 'key1.' .. _a .. '')
cname('key2._domainkey.' .. _a, 'key2.' .. _a .. '')
cname('key3._domainkey.' .. _a, 'key3.' .. _a .. '')

srv('_submissions._tcp.' .. _a, '', 465)
srv('_imaps._tcp.' .. _a, '', 993)

The set_defaults(_a) at the top is a template I created. It does some configuration that is applicable for all domains, like setting 'www' CNAME <domain>. _a is a placeholder for the domain.

Yes, this config snippet could itself be a template, which I could then use like this in a domain config file: set_migadu(_a).

Version control

LuaDNS can be triggered by a webhook to load configuration from a git repo. This means that all of the Lua config written can be in git, with branches, or whatever you want. And pushed anywhere, triggering automation at LuaDNS to pull and deploy it.


This provider has an API that many ACME clients support for DNS-01, but I use it too for updating records with changed IP addresses. Yeah, I know many providers have this too, but none(?) of them have the other benefits above.

Oh, there is one more benefit: I can't afford to lose any more hair :D

For anyone (rightfully) wondering: I wasn't paid or anything by LuaDNS for this, I'm just a happy paying customer.

by Unknown at April 11, 2023 12:00 AM

April 10, 2023

Alex Bradbury

Updating Wren's benchmarks

Wren is a "small, fast, class-based, concurrent scripting language", originally designed by Bob Nystrom (who you might recognise as the author of Game Programming Patterns and Crafting Interpreters). It's a really fun language to study - the implementation is compact and easily readable, and although class-based languages aren't considered very hip these days there's a real elegance to its design. I saw Wren's performance page hadn't been updated for a very long time, and especially given the recent upstream interpreter performance work on Python, was interested in seeing how performance on these microbenchmarks has changed. Hence this quick post to share some new numbers.

New results

To cut to the chase, here are the results I get running the same set of benchmarks across a collection of Python, Ruby, and Lua versions (those available in current Arch Linux).

[Bar charts not reproduced: Method Call, Delta Blue, Binary Trees, Recursive Fibonacci.]

I've used essentially the same presentation and methodology as in the original benchmark, partly to save time pondering the optimal approach, partly so I can redirect any critiques to the original author (sorry Bob!). Benchmarks do not measure interpreter startup time, and each benchmark is run ten times with the median used (thermal throttling could potentially mean this isn't the best methodology, but changing the number of test repetitions to e.g. 1000 seems to have little effect).
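The median-of-N approach can be sketched in a few lines of Python. This is an illustration of the methodology only, not the script used for the numbers here; `median_runtime` and the `fib` workload are stand-ins I picked for the example.

```python
import statistics
import time


def median_runtime(fn, runs=10):
    """Time fn() several times in-process and return (median, all samples) in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


def fib(n):
    """Naive recursive Fibonacci, used here only as a small CPU-bound workload."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


median_s, samples = median_runtime(lambda: fib(18))
print(f"median {median_s:.4f}s over {len(samples)} runs")
```

Timing inside the process keeps interpreter startup out of the measurement, and taking the median rather than the mean makes a single slow outlier run matter less.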

The tests were run on a machine with an AMD Ryzen 9 5950X processor. wren 0.4 as of commit c2a75f1 was used as well as the following Arch Linux packages:

  • lua52-5.2.4-5
  • lua53-5.3.6-1
  • lua-5.4.4-3
  • luajit-2.1.0.beta3.r471.g505e2c03-1
  • mruby-3.1.0-1
  • python-3.10.10-1
  • python-3.11.3-1 (taken from Arch Linux's staging repo)
  • ruby2.7-2.7.7-1
  • ruby-3.0.5-1

The Python 3.10 and 3.11 packages were compiled with the same GCC version (12.2.1 according to python -VV), though this won't necessarily be true for all other packages (e.g. the lua52 and lua53 packages are several years old so will have been built with an older GCC).

I've submitted a pull request to update the Wren performance page.

Old results

The following results are copied from the Wren performance page for ease of comparison. They were run on a MacBook Pro 2.3GHz Intel Core i7 with Lua 5.2.3, LuaJIT 2.0.2, Python 2.7.5, Python 3.3.4, ruby 2.0.0p247.

[Bar charts of benchmark times: Method Call, Binary Trees, Recursive Fibonacci.]

A few takeaways:

  • LuaJIT's bytecode interpreter remains incredibly fast (though see this blog post for a methodology to produce an even faster interpreter).
  • The performance improvements in Python 3.11 were well documented and are very visible on this set of microbenchmarks.
  • I was more surprised by the performance jump with Lua 5.4, especially as the release notes give few hints of performance improvements that would be reflected in these microbenchmarks. The LWN article about the Lua 5.4 release however did note improved performance on a range of benchmarks.
  • Wren remains speedy (for these workloads at least), but engineering work on other interpreters has narrowed that gap for some of these benchmarks.
  • I haven't taken the time to compare the January 2015 version of Wren used for the benchmarks vs present-day Wren 0.4. It would be interesting to explore that though.
  • A tiny number of microbenchmarks have been used in this performance test. It wouldn't be wise to draw general conclusions - this is just a bit of fun.

Appendix: Benchmark script

Health warning: this is incredibly quick and dirty (especially the repeated switching between the python packages to allow testing both 3.10 and 3.11):

#!/usr/bin/env python3

# Copyright Muxup contributors.
# Distributed under the terms of the MIT license, see LICENSE for details.
# SPDX-License-Identifier: MIT

import statistics
import subprocess

out = open("", "w", encoding="utf-8")

def run_single_bench(bench_name, bench_file, runner_name):
    bench_file = "./test/benchmark/" + bench_file
    if runner_name == "lua5.2":
        bench_file += ".lua"
        cmdline = ["lua5.2", bench_file]
    elif runner_name == "lua5.3":
        bench_file += ".lua"
        cmdline = ["lua5.3", bench_file]
    elif runner_name == "lua5.4":
        bench_file += ".lua"
        cmdline = ["lua5.4", bench_file]
    elif runner_name == "luajit2.1 -joff":
        bench_file += ".lua"
        cmdline = ["luajit", "-joff", bench_file]
    elif runner_name == "mruby":
        bench_file += ".rb"
        cmdline = ["mruby", bench_file]
    elif runner_name == "python3.10":
        bench_file += ".py"
        cmdline = ["python", bench_file]
    elif runner_name == "python3.11":
        bench_file += ".py"
        cmdline = ["python", bench_file]
    elif runner_name == "ruby2.7":
        bench_file += ".rb"
        cmdline = ["ruby-2.7", bench_file]
    elif runner_name == "ruby3.0":
        bench_file += ".rb"
        cmdline = ["ruby", bench_file]
    elif runner_name == "wren0.4":
        bench_file += ".wren"
        cmdline = ["./bin/wren_test", bench_file]
    else:
        raise SystemExit("Unrecognised runner")

    times = []
    for _ in range(10):
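        # Run the benchmark and keep its stdout; the elapsed time is the
        # last ": "-separated field of the output.
        bench_out = subprocess.run(
            cmdline, capture_output=True, check=True, encoding="utf-8"
        ).stdout
        times.append(float(bench_out.split(": ")[-1].strip()))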
    return statistics.median(times)

def do_bench(name, file_base, runners):
    results = {}
    for runner in runners:
        results[runner] = run_single_bench(name, file_base, runner)
    results = dict(sorted(results.items(), key=lambda kv: kv[1]))
    longest_result = max(results.values())
    out.write('<table class="chart">\n')
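    for runner, result in results.items():
        # Bar width is relative to the slowest runner.
        percent = round((result / longest_result) * 100)
        out.write(
            f'  <tr><th>{runner}</th><td><div class="chart-bar" style="width: {percent}%;">'
            f"{result:.3f}s&nbsp;</div></td></tr>\n"
        )
    out.write("</table>\n\n")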

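all_runners = [
    "lua5.2",
    "lua5.3",
    "lua5.4",
    "luajit2.1 -joff",
    "mruby",
    "python3.10",
    "python3.11",
    "ruby2.7",
    "ruby3.0",
    "wren0.4",
]
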
do_bench("Method Call", "method_call", all_runners)
do_bench("Delta Blue", "delta_blue", ["python3.10", "python3.11", "wren0.4"])
do_bench("Binary Trees", "binary_trees", all_runners)
do_bench("Recursive Fibonacci", "fib", all_runners)
print("Output written to")

April 10, 2023 12:00 PM

Loïc Le Page

WebRTC, GStreamer and HTML5 - Part 1

An easy 360º solution for realtime multimedia communication.

Part 1 - The story so far... #

For a few years now we have been able to communicate in realtime from one web browser to another using the WebRTC protocol. The same protocol also allows broadcasting or ingestion of multimedia streams with very low latency, generally less than half a second. All web browsers started integrating the protocol around 2013/2014 and today, in 2023, we have pretty stable and efficient support everywhere.

The GStreamer multimedia framework has also been integrating WebRTC since 2017 through the webrtcbin plugin. Using this plugin you can connect to a web browser and stream audio and video in realtime. Unfortunately, webrtcbin is a low-level component. It implements the peer-to-peer connection handshake (using ICE and external STUN servers), packet rerouting if a direct connection is not possible (using external TURN servers) and then maintains the underlying RTP session which transports the actual audio and video data.

To build a 100% working product you need to write a lot of code above webrtcbin. Not only do you have to design your own signalling protocol but also implement a signalling server, take care of packet loss and retransmission, manage network congestion and adapt the encoding bitrates to maintain an acceptable user experience over networks of different qualities.

With all that in mind, in 2021 Mathieu Duponchelle started implementing a new GStreamer element called webrtcsink. This element, based on webrtcbin and written in Rust, makes it possible to produce a WebRTC stream and to maintain the underlying connections to multiple remote peers, with retransmission of lost packets, network congestion control and adaptive encoding bitrates. The webrtcsink project comes with a signalling server that helps normalize communication between peers.

N.B. as far as signalling normalization is concerned, it is interesting to point out the WHIP protocol that is exclusively used for media ingestion.

Since then, at Igalia, we have continuously worked to improve webrtcsink. Last year, in particular, Thibault Saunier implemented the complete Google Congestion Control Algorithm from its RFC. More recently he ported the whole project to gst-plugins-rs in order to make the plugin available with the official GStreamer distribution and foster broader collaboration from the GStreamer community.

So far, webrtcsink is the best-looking starting point for building a 360º solution including bi-directional and realtime communications between multiple peers, transparently using web browsers and/or native code.

Improvements #

The webrtcsink element is dedicated to broadcasting a WebRTC stream to multiple remote clients (this is a sink as its name suggests), and the original signalling server is designed with this objective in mind.

For one of our clients, we needed to be able to also consume a remote WebRTC stream and interact with web browsers and mobile WebViews on Android and iOS.

For that we decided at Igalia to:

  • Improve the signalling protocol allowing:
    • to use a single WebSocket connection (and single ID) per connected peer,
    • to take on multiple roles at the same time (listener, producer and consumer),
    • to have producer and consumer sessions isolated and independent.
  • Develop a webrtcsrc element dedicated to consume WebRTC streams produced by the webrtcsink element (or from a web browser).
  • Develop gstwebrtc-api, a Javascript API allowing developers to communicate transparently with webrtcsink and webrtcsrc elements from a web browser.

Signalling #

The signalling protocol relies on the same original concepts but with the improvements listed above.

Each peer connects to the signalling server using a WebSocket (can be secured over SSL/TLS) and receives a unique identifier used for further commands. This unique identifier remains until the WebSocket connection is closed.

signalling diagram

By default, all peers inherit the role of consumer and can connect to a remote WebRTC stream to consume. To do so, a peer needs to create a session with a remote producer peer. This session has its own unique identifier and is in charge of exchanging SDP and ICE messages between peers until the WebRTC link can be established.

A peer can explicitly request to become a producer, in which case it is announced as such to all other connected peers and gets ready to receive WebRTC connection requests from remote consumer peers.

Finally, a peer can also explicitly request to become a listener, in which case it will receive a message each time a producer appears on or disappears from the signalling network. The list of currently available producers can also be requested independently of the peer roles.

The webrtcsink element always registers as a producer whereas the webrtcsrc element is a consumer. The gstwebrtc-api allows activating and deactivating the producer mode from the API.

The novelty here is that, from now on, it is possible to activate and deactivate the listener and producer roles without having to open several WebSocket connections or to disconnect from the signalling server. You can also create more than one consumer session but you are still limited to one producer session per peer.

To sum up: if you are creating a web application you can connect to as many remote streams as you want but you can only produce one unique stream (which can have several video and audio tracks). If you are creating a native application with GStreamer, you can use as many webrtcsrc elements as you want to connect to several remote streams and one webrtcsink element to produce one WebRTC stream.

With these combinations you can easily create low-latency streaming applications, media ingestion tools or any kind of video conferencing software, among other examples.
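The role and session rules described in this section can be modelled with a small in-memory sketch. This is not the real signalling server or its message format: the class name, method names and event strings below are hypothetical, there is no networking, and ending sessions when a producer goes away is an assumption made for illustration.

```python
import uuid


class SignallingModel:
    """Toy model of peers, roles and sessions (hypothetical, no networking)."""

    def __init__(self):
        self.roles = {}          # peer id -> set of roles
        self.sessions = {}       # session id -> (consumer id, producer id)
        self.notifications = []  # (listener id, event, producer id)

    def connect(self):
        # One connection yields one identifier; every peer starts as a consumer.
        peer_id = str(uuid.uuid4())
        self.roles[peer_id] = {"consumer"}
        return peer_id

    def set_role(self, peer_id, role, enabled):
        # Producer and listener roles are toggled without reconnecting.
        if role not in ("producer", "listener"):
            raise ValueError(f"role cannot be toggled: {role}")
        roles = self.roles[peer_id]
        if enabled == (role in roles):
            return
        if enabled:
            roles.add(role)
        else:
            roles.discard(role)
        if role == "producer":
            self._notify("producer-added" if enabled else "producer-removed", peer_id)
            if not enabled:
                # Assumption: sessions towards a removed producer are dropped.
                self.sessions = {
                    sid: pair for sid, pair in self.sessions.items() if pair[1] != peer_id
                }

    def _notify(self, event, producer_id):
        # Only peers holding the listener role are told about producers.
        for peer_id, roles in self.roles.items():
            if "listener" in roles:
                self.notifications.append((peer_id, event, producer_id))

    def list_producers(self):
        # Available to any peer, whatever its roles.
        return [pid for pid, roles in self.roles.items() if "producer" in roles]

    def start_session(self, consumer_id, producer_id):
        # Each session has its own identifier, separate from the peer ids.
        if consumer_id not in self.roles:
            raise KeyError(consumer_id)
        if "producer" not in self.roles.get(producer_id, ()):
            raise ValueError("target peer is not a producer")
        session_id = str(uuid.uuid4())
        self.sessions[session_id] = (consumer_id, producer_id)
        return session_id


server = SignallingModel()
browser = server.connect()
camera = server.connect()
server.set_role(browser, "listener", True)
server.set_role(camera, "producer", True)
session = server.start_session(browser, camera)
print(server.list_producers() == [camera], len(server.sessions))
```

Because the producer role is a single flag per peer, the model allows any number of consumer sessions but at most one producer per connection, which mirrors the limits summarised above.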

Webrtcsrc #

The new webrtcsrc element, developed in Rust by Thibault Saunier, offers a full-featured WebRTC consumer. The element connects to the signalling server and automatically manages the WebRTC session with a remote peer identified by its unique identifier.

It also registers the gstwebrtc(s):// schemes to be used directly with uridecodebin, playbin or playbin3 elements. Connecting to a remote WebRTC stream has never been so easy:

gst-launch-1.0 playbin uri=gstwebrtc://

or directly:

gst-play-1.0 gstwebrtc://

The structure of a webrtcsrc URI is as follows:

  • gstwebrtc:// for a non-secured WebSocket connection to the signalling server
  • or gstwebrtcs:// for a secured connection over SSL/TLS.
  • The signalling server address (hostname or IP address and port).
  • The remote producer peer to connect to specified with ?peer-id=[uuid].
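As a rough illustration of that structure, the sketch below splits such a URI into the signalling server's WebSocket address and the producer identifier. It is not how webrtcsrc itself parses URIs: the function name is hypothetical, the ws/wss mapping is inferred from the description above, and the host, port and UUID in the example are made up.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed mapping from the element's URI schemes to WebSocket schemes.
WEBSOCKET_SCHEMES = {"gstwebrtc": "ws", "gstwebrtcs": "wss"}


def parse_webrtcsrc_uri(uri):
    """Return (signalling server URL, producer peer id) for a gstwebrtc(s) URI."""
    parts = urlsplit(uri)
    if parts.scheme not in WEBSOCKET_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("missing signalling server address")
    peer_ids = parse_qs(parts.query).get("peer-id", [])
    if len(peer_ids) != 1:
        raise ValueError("expected exactly one peer-id parameter")
    return f"{WEBSOCKET_SCHEMES[parts.scheme]}://{parts.netloc}", peer_ids[0]


# Example values only; none of them come from a real deployment.
example = "gstwebrtcs://signalling.example.org:8443?peer-id=1b2c3d4e-0000-4000-8000-123456789abc"
server_url, producer_id = parse_webrtcsrc_uri(example)
print(server_url, producer_id)
```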

The webrtcsrc element also provides access to its internal signaller object, so you can communicate with the signalling server and, for example, listen to all producers created on the signalling network without needing to implement the signalling protocol by yourself. The following example shows how to create an instance of the signaller to receive information about new producers.

class Application;
extern Application* myApp;

GstElement* elem = gst_element_factory_make("webrtcsrc", nullptr);
if (elem) {
  // Retrieve the signaller object owned by the element
  GObject* signaller = nullptr;
  g_object_get(elem, "signaller", &signaller, nullptr);

  if (signaller) {
    // Create a second signaller of the same type, acting as a listener
    GObject* listener = G_OBJECT(g_object_new(G_OBJECT_TYPE(signaller),
      "uri", "ws://",
      "role", "listener", nullptr));

    if (listener) {
      g_signal_connect_swapped(listener, "error", G_CALLBACK(+[](Application* app, const char* error) {
        // Manage errors...
      }), myApp);

      g_signal_connect_swapped(listener, "producer-added",
        G_CALLBACK(+[](Application* app, const char* producerId, const GstStructure* producerMeta) {
          // Manage a new producer...
          // You can, for example, create a new pipeline/branch using a webrtcsrc
          // element and the producerId value to consume the new remote stream
        }), myApp);

      g_signal_connect_swapped(listener, "producer-removed",
        G_CALLBACK(+[](Application* app, const char* producerId, const GstStructure* producerMeta) {
          // Cleanup a removed producer...
          // Any local consumer pipeline/branch previously created
          // with a webrtcsrc element will receive an EOS event
        }), myApp);
    }
  }
}

April 10, 2023 12:00 AM

April 04, 2023

Eric Meyer

Ventura Vexations

I’ve been a bit over a month now on my new 14” MacBook Pro, and I have complaints.  Not about the hardware, which is solid yet lightweight, super-quiet yet incredibly fast and powerful, long-lived on battery, and decent enough under the fingertips.  Plus, all the keyboard keys Just Work™, unlike the MBP it replaced!  So that’s nice.

No, my complaints are entirely about the user environment.  At first I thought this was because I skipped directly from OS X 10.14 to macOS 13, and simply wasn’t used to How The Kids Do Things These Days®, but apparently I would’ve felt the same even if I’d kept current with OS updates.  So I’m going to gripe here in hopes someone who knows more than me will have recommendations to ameliorate my annoyance.

DragThing Dismay

This isn’t on Apple, but still, it’s a huge loss for me.  I know I already complained about the lack of DragThing, but I really, really do miss what it did for me.  You never know what you’ve got ’til it’s gone, right?  But let me be clear about exactly what it did for me, which so far as I can tell no macOS application does, nor does macOS itself.

The way I used DragThing was to have a long shelf down the right side of my monitor containing small-but-recognizable icons representing my most-used folders (home directory, Downloads, Documents, Applications, a few other folders) and a number of applications.  It stayed there all the time, and the icons were always there whether or not the application was running.

When I launched, say, Firefox, then there would be a little indicator next to its application icon in DragThing to indicate it was running.  When I quit Firefox, the indicator went away but the Firefox icon stayed.  And also, if I launched an application that wasn’t in the DragThing shelf, it did not add an icon for that application to the shelf. (I used the Dock at the bottom of the screen to show me that.)

There are super-powered application switchers available for macOS, but as far as I’ve seen, they only list the applications actually running.  Launch an application, its icon is added.  Quit an application, its icon disappears.  None of these switchers let me keep persistent static one-click shortcuts to launch a variety of applications and open commonly-used folders.

Dock Folder Disgruntlement

Now I’m on to macOS itself.  Given the previous problem, the Dock is the only thing available to me, and I have gripes about it.  One of the bigger ones is rooted in folders kept on the Dock, to the right of the bar that divides them from the application icons.  When I click on them, I get a popup (wince) or a Stack (shudder) instead of them just opening the target folder in the Finder.

In the Before Times, I could create an alias to the folder and drop that in the Dock, the icon in the Dock would look like the target folder, and clicking on the alias opened the folder’s window.  If I do that now, the click-to-open part works, but the aliases all look like blank text documents with tiny arrows.  What the hell?

If I instead add actual folders (not aliases) to the Dock, holding down ⌥⌘ (option-command) when I click them does exactly what I want.  Only, I don’t want to have to hold down modifier keys, especially when using the trackpad.  I’ve mostly adapted to the key combo, but even on desktop I still sometimes click a folder and blink in irritation at the popup thingy for a second before remembering that things are stupider now.

Translucency Tribulation

The other problem with the Dock is that mine is too opaque.  That’s because the nearly-transparent Finder menu bar was really not doing it for me, so acting on a helpful tip, I went and checked the “Reduce Transparency” option in the Accessibility settings.  That fixed the menu bar nicely, but it also made the Dock opaque, which I didn’t actually want.  I can pretty easily live with it, but I do wish I could make just the menu bar opaque (without having to resort to desktop wallpaper hacks, which I suspect do not do well with changes of display resolution).

Shortcut Stupidity

Seriously, Apple, what the hell.

And while I’m on the subject of the menu bar: no matter the application or even the Finder itself, dropdown menus from the menu bar render the actions you can do in black and the actions you can’t do in washed-out gray.  Cool.  But also, all the keyboard shortcuts are now a washed-out gray, which I keep instinctively thinking means they’ve been disabled or something.  They’re also a lot more difficult for my older eyes to pick out, and I have to flick my eyes back and forth to make sure a given keyboard shortcut corresponds to a thing I actually can do.  Seriously, Apple, what the hell?

Trash Can Troubles

I used to have the Trash can on the desktop, down in the lower right corner, and now I guess I can’t.  I vaguely recall this is something DragThing made possible, so maybe that’s another reason to gripe about the lack of it, but it’s still bananas to me that the Trash can is not there by default.  I understand that I may be very old.

Preview Problems

On my old machine, Preview was probably the most rock-solid application on there.  On the new machine, Preview occasionally hangs on closing heavily-commented PDFs when I choose not to save changes.  I can force-quit it and so far haven’t experienced any data corruption, but it’s still annoying.

Those are the things that have stood out the most to me about Ventura.  How about you?  What bothers you about your operating system (whichever one that is) and how would you like to see it fixed?

Oh, and I’ll follow this up soon with a post about what I like in Ventura, because it’s not all frowns and grumbles.


by Eric Meyer at April 04, 2023 03:57 PM

April 03, 2023

Alex Bradbury

2023Q1 week log

I tend to keep quite a lot of notes on the development-related work (sometimes at work, sometimes not) I do on a week-by-week basis, and thought it might be fun to write up the parts that were public. This may or may not be of wider interest, but it aims to be a useful aide-mémoire for my purposes at least. Weeks with few entries might be due to focusing on downstream work (or perhaps just a less productive week - I am only human!).

Week of 27th March 2023

Week of 20th March 2023

Week of 13th March 2023

  • Most importantly, added some more footer images for this site from the Quick Draw dataset. Thanks to my son (Archie, 5) for the assistance.
  • Reviewed submissions for EuroLLVM (I'm on the program committee).
  • Added note to the commercially available RISC-V silicon post about a hardware bug in the Renesas RZ/Five.
  • Finished writing and published what's new for RISC-V in LLVM 16 article and took part in some of the discussions in the HN and Reddit threads (it's on too, but that didn't generate any comments).
  • Investigated an issue where inline asm with the m constraint was generating worse code on LLVM vs GCC, finding that LLVM conservatively lowers this to a single register, while GCC treats m as reg+imm, relying on users indicating A when using a memory operand with an instruction that can't take an immediate offset. Worked with a colleague who posted D146245 to fix this.
  • Set agenda for and ran the biweekly RISC-V LLVM contributor sync call as usual.
  • Bisected reported LLVM bug #61412, which as it happens was fixed that evening by D145474 being committed. We hope to backport this to 16.0.1.
  • Did some digging on a regression (compiler crash) for -Oz, bisecting it to the commit that enabled machine copy propagation by default. I found the issue was due to machine copy propagation running after the machine outliner, and incorrectly determining that some register writes in outlined functions were not live-out. I posted and landed D146037 to fix this by running machine copy propagation earlier in the pipeline, though a more principled fix would be desirable.
  • Filed a PR against the riscv-isa-manual to disambiguate the use of the term "reserved" for HINT instructions. I've also been looking at the proposed bfloat16 extension recently and filed an issue to clarify if Zfbfinxmin will be defined (as all the other floating point extensions so far have an *inx twin).
  • Almost finished work to resolve issues related to overzealous error checking on RISC-V ISA naming strings (with llvm-objdump and related tools being the final piece).
    • Landed D145879 and D145882 to expand RISCVISAInfo test coverage and fix an issue that surfaced through that.
    • Posted a pair of patches that make llvm-objdump and related tools tolerant of unrecognised versions of ISA extensions. D146070 resolves this for the base ISA in a minimally invasive way, while D146114 solves this for other extensions, moving the parsing logic to using the parseNormalizedArchString function I introduced to fix a similar issue in LLD. This built on some directly committed work to expand testing.
  • The usual assortment of upstream LLVM reviews.
  • LLVM Weekly #480.

Week of 6th March 2023

Week of 27th February 2023

  • Completed (to the point I was happy to publish at least) my attempt to enumerate the commercially available RISC-V SoCs. I'm very grateful to have received a whole range of suggested additions and clarifications over the weekend, which have all been incorporated.
  • Ran the usual biweekly RISC-V LLVM sync-up call. Topics included outstanding issues for LLVM 16.x (no major issues now my backport request to fix an LLD regression was merged), an overview of _Float16 ABI lowering fixes, GP relaxation in LLD, my recent RISC-V buildbot, and some vectorisation related issues.
  • Investigated and largely resolved issues related to ABI lowering of _Float16 for RISC-V. Primarily, we weren't handling the cases where a GPR+FPR or a pair of FPRs are used to pass small structs including _Float16.
    • Part of this work involved rebasing my previous patches to refactor our RISC-V ABI lowering tests in Clang. Now that a version of my improvements to --function-signature (required for the refactor) landed as part of D144963, this can hopefully be completed.
    • Committed a number of simple test improvements related to half floats. e.g. 570995e, 81979c3, 34b412d.
    • Posted D145070 to add proper coverage for _Float16 ABI lowering, and D145074 to fix it. Also D145071 to set the HasLegalHalfType property, but the semantics of that are less clear.
    • Posted a strawman psABI patch for __bf16, needed for the RISC-V bfloat16 extension.
  • Attended the Cambridge RISC-V Meetup.
  • After seeing the Helix editor discussed on, retried my previously shared large Markdown file test case. Unfortunately it's still unusably slow to edit, seemingly due to a tree-sitter related issue.
  • Cleaned up the static site generator used for this site a bit. e.g. now my fixes (#157, #158, #159) for the traverse() helper in mistletoe were merged upstream, I removed my downstream version.
  • The usual mix of upstream LLVM reviews.
  • Had a day off for my birthday.
  • Publicly shared this week log for the first time.
  • LLVM Weekly #478.

Week of 20th February 2023

Week of 13th February 2023

Week of 6th February 2023

Article changelog
  • 2023-04-03: Added notes for the week of 27th March 2023.
  • 2023-03-27: Added notes for the week of 20th March 2023.
  • 2023-03-20: Added notes for the week of 13th March 2023.
  • 2023-03-13: Added notes for the week of 6th March 2023.
  • 2023-03-06: Added notes for the week of 27th February 2023.
  • 2023-02-27: Added in a forgotten note about trivial buildbot doc improvements.
  • 2023-02-27: Initial publication date.

April 03, 2023 12:00 PM

Carlos García Campos

WebKitGTK accelerated compositing rendering

Initial accelerated compositing support

When accelerated compositing support was added to WebKitGTK, there was only X11. Our first approach was quite simple: we sent the web view widget Xwindow ID to the web process to be used as the rendering target using GLX. This was very efficient, but we soon realized it broke the GTK rendering model, so it was not possible to use a web view inside a GtkOverlay, for example, to show status messages on top. The solution was to use a redirected Xcomposite window in the web process, and use its ID as the render target using GLX. The pixmap ID of the redirected Xcomposite window was sent to the UI process to be painted in the web view widget using a Cairo Xlib surface. Since the rendering happens in the web process, this approach required using Xdamage to monitor when the redirected Xcomposite window was updated in order to schedule a web view redraw.

Wayland support

To support accelerated compositing under Wayland we initially added a nested Wayland compositor running in the UI process. The web process connected to the nested Wayland compositor and created a surface to be used as the rendering target using EGL. The good thing about this approach compared to the X11 one, is that we can create an EGLImage from Wayland buffers and use a GDK GL context to paint the contents in the web view. This is more efficient than X11 because we can use OpenGL both in web and UI processes.

WPE, when using the fdo backend, uses the same approach of running a nested Wayland compositor, but in a more efficient way, using DMABUF instead of Wayland buffers when available. So, we decided to use libwpe in the GTK port only for rendering under Wayland, and eventually remove our Wayland compositor implementation.

Before the removal of the custom Wayland compositor we had all these possible combinations:

  • UI Process
    • X11: Cairo Xlib surface
    • Wayland: EGL
  • Web Process
    • X11: GLX using redirected Xwindow
    • Wayland (nested Wayland compositor): EGL using Wayland surface
    • Wayland (libwpe): EGL using libwpe to get the Wayland surface

To reduce the differences a bit, and to make it easier to support WebGL with ANGLE, we decided to change X11 to prefer EGL if possible, falling back to GLX only if EGL failed.


GTK4 was released and we added support for it. The fact that GTK4 uses GL by default should make the rendering more efficient in accelerated compositing mode. This is definitely true under Wayland, because we are using a GL context already, so we just keep passing a texture to GTK to paint the contents in the web view. However, in the case of X11 we still have a Cairo Xlib surface that GTK paints into a Cairo image surface to be uploaded to the GPU. With GTK4 we now have two more combinations on the UI process side: X11 + GTK3, X11 + GTK4, Wayland + GTK3 and Wayland + GTK4.

Reducing all the combinations to (almost) one: DMABUF

All these combinations to support the different platforms made the code quite difficult to maintain. Every time we get a bug report about something not working in accelerated compositing mode we have to figure out the combination actually used by the reporter: GTK3 or GTK4? X11 or Wayland? Using EGL or GLX? Custom Wayland compositor or libwpe? Driver? Version? Etc.

We are already using DMABUF in WebKit for different things like WebGL and media rendering, so we thought that we could also use it for sharing the rendered buffer between the web and UI processes. That would not only be a more efficient solution, it would also drastically reduce the number of combinations to maintain. The web process always uses the surfaceless platform, so it doesn’t matter if it’s under Wayland or X11. Then we create a surfaceless context as the render target and use EGL and GBM APIs to export the contents as a DMABUF buffer. The UI process imports the DMABUF buffer using EGL and GBM too, to be passed to GTK as a texture that is painted in the web view.

This theoretically reduces all the previous combinations to just one (note that we removed GLX support entirely, making EGL a requirement for accelerated compositing), but there’s a problem under X11: GTK3 doesn’t support EGL on X11 and GTK4 defaults to EGL but falls back to GLX if it doesn’t find an EGL config that perfectly matches the screen visual. On my system it never finds that EGL config because mesa doesn’t expose any 32 bit depth config. So, in the case of GTK3 we have to manually download the buffer to CPU and paint normally using Cairo, but in the case of GTK4 + GLX, GTK uploads the buffer again to be painted using GLX. I don’t think it’s possible to force GTK to use EGL from the API, but at least you can use GDK_DEBUG=gl-egl.

WebKitGTK 2.41.1

WebKitGTK 2.41.1 is the first unstable release of this cycle and already includes the DMABUF support that is used by default. We encourage everybody to try it out and provide feedback or report any issue. Please export the contents of webkit://gpu and attach it to the bug report when reporting any problem related to graphics. To check if the issue is a regression of the DMABUF implementation you can use WEBKIT_DISABLE_DMABUF_RENDERER=1 to use the WPE renderer or X11 instead. This environment variable and the WPE renderer/X11 code will eventually be removed if DMABUF works fine.
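When comparing the two code paths for a bug report, it can help to launch the same application with and without these variables. The following is only a sketch: the helper name is hypothetical, and the command passed to subprocess.run in the commented usage is a placeholder for whatever WebKitGTK-based application you are testing. The only settings it relies on are WEBKIT_DISABLE_DMABUF_RENDERER=1 and GDK_DEBUG=gl-egl, as described in this post.

```python
import os


def webkitgtk_debug_env(disable_dmabuf=False, force_gtk4_egl=False, base=None):
    """Return a copy of the environment with the requested debug variables set."""
    env = dict(os.environ if base is None else base)
    if disable_dmabuf:
        # Fall back to the previous WPE renderer / X11 code path.
        env["WEBKIT_DISABLE_DMABUF_RENDERER"] = "1"
    if force_gtk4_egl:
        # Ask GTK4 to use EGL on X11 instead of falling back to GLX.
        # Note: this overwrites any GDK_DEBUG value already present.
        env["GDK_DEBUG"] = "gl-egl"
    return env


env = webkitgtk_debug_env(disable_dmabuf=True, force_gtk4_egl=True, base={})
print(env)

# Usage (placeholder command, replace with the application under test):
#   import subprocess
#   subprocess.run(["your-webkitgtk-app"], env=webkitgtk_debug_env(disable_dmabuf=True))
```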


If this approach works fine we plan to use something similar for the WPE port and get rid of the nested Wayland compositor there too.

by carlos garcia campos at April 03, 2023 08:57 AM

March 31, 2023

Eric Meyer

Echoed Whisper

The two videos I was using Whisper on have been published, so you can see for yourself how the captioning worked out.  Designed as trade-show booth reel pieces, they’re below three minutes each, so watching both should take less than ten minutes, even with pauses to scrutinize specific bits of captioning.

As I noted in my previous post about this, I only had to make one text correction to the second video, plus a quick find-and-replace to turn “WPE WebKit” into “WPEWebKit”.  For the first video, I did make a couple of edits beyond fixing transcription errors; specifically, I added the dashes and line breaking in this part of the final SubRip Subtitle (SRT) file uploaded to YouTube:

00:00:25,000 --> 00:00:32,000
- Hey tell me, is Michael coming out?
- Affirmative, Mike's coming out.

This small snippet actually embodies the two things where Whisper falls down a bit: multiple voices, and caption line lengths.

Right now, Whisper doesn’t even try to distinguish between different voices, the technical term for which is “speaker diarisation”.  This means Whisper is ideal for transcribing, say, a conference talk or a single-narrator video.  It’s a lot less useful for things like podcasts, because while it will probably get (nearly) all the words right, it won’t even throw in a marker that the voice changed, let alone try to tell which bits belong to a given voice.  You have to go into the output and add those yourself, which for an hourlong podcast could be… quite the task.

There are requests for adding this to Whisper scattered in their GitHub discussions, but I didn’t see any open pull requests or mention of it in the README, so I don’t know if that’s coming or not.  If you do, please leave a comment!

As for the length of captions, I agree with J David Eisenberg: Whisper too frequently errs on the side of “too long”.  For example, here’s one of the bits Whisper output:

00:01:45,000 --> 00:01:56,000
Here is the dash.js player using MSE, running in a page, and using Widevine DRM to decrypt and play rights-managed video with EME, all fluidly.

That’s eleven seconds of static subtitling, with 143 characters of line length.  The BBC recommends line lengths at or below 37 characters, and Netflix suggests a limit of 42 characters, with actual hard limits for a few languages.  You can throw in line breaks to reduce line length, but should never have more than three lines, which wouldn’t be possible with 143 characters.  But let’s be real, that 11-second caption really should be split in twain, at the absolute minimum.

Whisper does not, as of yet, have a way to request limiting caption lengths, either in time or in text.  There is a fairly detailed discussion of this over on Whisper’s repository, with some code graciously shared by people working to address this, but it would be a lot better if Whisper accepted an argument to limit the length of any given bit of output.  And also if it threw in line breaks on its own, say around 40 characters in English, even when not requested.
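Until something like that exists, the limit can be applied after the fact. The following is a rough sketch, not anything Whisper provides: `split_cue` is a hypothetical helper that wraps a caption at 42 characters, groups the wrapped lines two at a time (the two-line limit is an assumption), and divides the original time span in proportion to the amount of text in each piece. Proportional timing is only a guess at where the words actually fall.

```python
import textwrap

MAX_CHARS = 42   # Netflix's suggested line length, as quoted above
MAX_LINES = 2    # assumption: at most two lines per caption


def split_cue(start, end, text, max_chars=MAX_CHARS, max_lines=MAX_LINES):
    """Split one caption (times in seconds) into shorter captions."""
    lines = textwrap.wrap(text, width=max_chars)
    if not lines:
        return [(start, end, text)]
    # Group the wrapped lines into chunks of at most max_lines lines.
    chunks = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
    total = sum(len(line) for line in lines)
    cues, cursor = [], start
    for chunk in chunks:
        # Give each chunk a share of the time proportional to its text.
        share = sum(len(line) for line in chunk) / total
        chunk_end = cursor + (end - start) * share
        cues.append((cursor, chunk_end, "\n".join(chunk)))
        cursor = chunk_end
    # Pin the last caption to the original end time to avoid float drift.
    last_start, _, last_text = cues[-1]
    cues[-1] = (last_start, end, last_text)
    return cues
```

Fed the eleven-second caption quoted earlier (01:45 to 01:56, or 105 to 116 seconds), this yields more than one caption, each with lines no longer than 42 characters.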

The last thing I’d like to see improved is speed.  It’s not terribly slow as is, to be clear.  Using the default model size (small), which is what I used for the videos I wrote about, Whisper worked at about 2:1 speed: a two-minute video took about a minute to process.  I tried the next size up, the medium model, and it worked at roughly 1:1.5 speed, taking about an hour fifteen to process a 46-minute video.

The thing is, all that is running solely on the CPU, which in my case is a 12-core M2.  According to this pull request, problems in one of Whisper’s dependencies, PyTorch, mean GPU utilization is essentially unavailable on the hardware I have. (Thanks to Chris Adams for the pointer.) I expect that will be cleared up sooner or later, so the limitation feels minor.

Overall, it’s a powerful tool, with accuracy I still find astounding, only coming up short in quality-of-life features that aren’t critical in some applications (transcribing a talk) or relatively easily worked around in others (hand-correcting caption length in short videos; using a small script to insert line breaks in longer videos).  The lack of speaker diarisation is the real letdown for me, and definitely the hardest to work around, so I hope it gets addressed soon.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at March 31, 2023 03:12 PM

March 23, 2023

Eric Meyer

Peerless Whisper

What happened was, I was hanging out in an online chatter channel when a little birdy named Bruce chirped about OpenAI’s Whisper and how he was using it to transcribe audio.  And I thought, Hey, I have audio that needs to be transcribed.  Brucie Bird also mentioned it would output text, SRT, and WebVTT formats, and I thought, Hey, I have videos I’ll need to upload with transcription to YouTube!  And then he said you could run it from the command line, and I thought, Hey, I have a command line!

So off I went to install it and try it out, and immediately ran smack into some hurdles I thought I’d document here in case someone else has similar problems.  All of this took place on my M2 MacBook Pro, though I believe most of the below should be relevant to anyone trying to do this at the command line.

The first thing I did was what the GitHub repository’s README recommended, which is:

$ pip install -U openai-whisper

That failed because I didn’t have pip installed.  Okay, fair enough.  I figured out how to install that, setting up an alias of python for python3 along the way, and then tried again.  This time, the install started and then bombed out:

Collecting openai-whisper
  Using cached openai-whisper-20230314.tar.gz (792 kB)
  Installing build dependencies ...  done
  Getting requirements to build wheel ...  done
  Preparing metadata (pyproject.toml) ...  done
Collecting numba
  Using cached numba-0.56.4.tar.gz (2.4 MB)
  Preparing metadata ( ...  error
  error: subprocess-exited-with-error

…followed by some stack trace stuff, none of which was really useful until ten or so lines down, where I found:

RuntimeError: Cannot install on Python version 3.11.2; only versions >=3.7,<3.11 are supported.

In other words, the version of Python I have installed is too modern to run AI.  What a world.

I DuckDucked around a bit and hit upon pyenv, which is I guess a way of installing and running older versions of Python without having to overwrite whatever version(s) you already have.  I’ll skip over the error part of my trial-and-error process and give you the commands that made it all work:

$ brew install pyenv

$ pyenv install 3.10

$ PATH="$HOME/.pyenv/shims:${PATH}"

$ pyenv local 3.10

$ pip install -U openai-whisper

That got Whisper to install.  It didn’t take very long.

At that point, I wondered what I’d have to configure to transcribe something, and the answer turned out to be precisely zilch.  Once the install was done, I dropped into the directory containing my MP4 video, and typed this:

$ whisper wpe-mse-eme-v2.mp4

Here’s what I got back.  I’ve marked the very few errors.

[00:00.000 --> 00:07.000]  In this video, we'll show you several demos showcasing multi-media capabilities in WPE WebKit,
[00:07.000 --> 00:11.000]  the official port of the WebKit engine for embedded devices.
[00:11.000 --> 00:18.000]  Each of these demos are running on the low-powered Raspberry Pi 3 seen in the lower right-hand side of the screen here.
[00:18.000 --> 00:25.000]  Infotainment systems and media players often need to consume digital rights-managed videos.
[00:25.000 --> 00:32.000]  They tell me, is Michael coming out?  Affirmative, Mike's coming out.
[00:32.000 --> 00:45.000]  Here you can see just that, smooth streaming playback using encrypted media extensions, or EME, with PlayReady 4.
[00:45.000 --> 00:52.000]  Media source extensions, or MSE, are used by many players for greater control over playback.
[00:52.000 --> 01:00.000]  YouTube TV has a whole conformance test suite for this, which WPE has been passing since 2021.
[01:00.000 --> 01:09.000]  The loan exceptions here are those tests requiring hardware support not available on the Raspberry Pi 4, but available for other platforms.
[01:09.000 --> 01:16.000]  YouTube TV has a conformance test for EME, which WPE WebKit passes with flying colors.
[01:22.000 --> 01:40.000]  Music
[01:40.000 --> 01:45.000]  Finally, perhaps most impressively, we can put all these things together.
[01:45.000 --> 01:56.000]  Here is the dash.js player using MSE, running in a page, and using Widevine DRM to decrypt and play rights-managed video with EME all fluidly.
[01:56.000 --> 02:04.000]  Music
[02:04.000 --> 02:09.000]  Remember, all of this is being played back on the same low-powered Raspberry Pi 3.
[02:27.000 --> 02:34.000]  For more about WPE WebKit, please visit WPE
[02:34.000 --> 02:42.000]  For more information about EGALIA, or to find out how we can help with your embedded device needs, please visit us at  

I am, frankly, astonished.  This has no business being as accurate as it is, for all kinds of reasons.  There’s a lot of jargon and very specific terminology in there, and Whisper nailed pretty much every last bit of it, first time in, no special configuration, nothing.  I didn’t even bump up the model size from the default of small.  I felt a little like that Froyo guy in the animated Hunchback of Notre Dame meme yelling about sorcery or whatever.

True, the output isn’t absolutely perfect.  Let’s review the glitches in reverse order.  The last two errors, turning “Igalia” into “EGALIA”, seem fair enough given I didn’t specify that there would be languages other than English involved.  I routinely have to spell it for my fellow Americans, so no reason to think a codebase could do any better.

The space inserted into “WPEWebKit” (which happens throughout) is similarly understandable.  I’m impressed it understood “WebKit” at all, never mind that it was properly capitalized and not-spaced.

The place where it says Music and I marked it as an error: This is essentially an echoing countdown and then a white-noise roar from rocket engines.  There’s a “music today is just noise” joke in here somewhere, but I’m too hip to find it.

Whisper turning “lone” into “loan” doesn’t particularly faze me, given the difficulty of handling soundalike words.  Hell, just yesterday, I was scribing a conference call and mistakenly recorded “gamut” as “gamma”, and those aren’t even technically homophones.  They just sound like they are.

Rounding out the glitch tour, “Hey” got turned into “They”, which (given the audio quality of that particular part of the video) is still pretty good.

There is one other error I couldn’t mark because there’s nothing to mark, but if you scrutinize the timeline, you’ll see a gap between 02:09.000 and 02:27.000.  In there, a short clip from a movie plays, and there’s a brief dialogue between two characters in not-very-Dutch-accented English there.  It’s definitely louder and clearer than the 00:25.000 --> 00:32.000 bit, so I’m not sure why Whisper just skipped over it.  Manually transcribing that part isn’t a big deal, but it’s odd to see it perform so flawlessly on every other piece of speech and then drop this completely on the floor.
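Gaps like that are easy to miss by eye, so here is a small sketch of a script that could flag them. It assumes the console output has been saved as text in the `[MM:SS.mmm --> MM:SS.mmm]  text` form shown above (longer recordings may use a different timestamp layout, which this does not handle), and the five-second threshold is an arbitrary choice. The function names are made up for illustration.

```python
import re

# Matches lines such as "[01:09.000 --> 01:16.000]  Some text"
LINE_RE = re.compile(r"\[(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\]\s*(.*)")


def parse_segments(output):
    """Return (start_seconds, end_seconds, text) for each transcript line."""
    segments = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            m1, s1, m2, s2, text = match.groups()
            segments.append((int(m1) * 60 + float(s1),
                             int(m2) * 60 + float(s2),
                             text))
    return segments


def find_gaps(segments, min_gap=5.0):
    """Return (gap_start, gap_end) pairs that no segment covers."""
    gaps = []
    for (_, prev_end, _), (next_start, _, _) in zip(segments, segments[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps
```

A gap is not necessarily a mistake, since it may just be silence or music, but each one is a place worth checking by ear.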

Before posting, I decided to give Whisper another go, this time on a different video:

$ whisper wpe-gamepad-support-v3.mp4

This was the result, with the one actual error marked:

[00:00.000 --> 00:13.760]  In this video, we demonstrate WPE WebKit's support for the W3C's GamePad API.
[00:13.760 --> 00:20.080]  Here we're running WPE WebKit on a Raspberry Pi 4, but any device that will run WPE WebKit
[00:20.080 --> 00:22.960]  can benefit from this support.
[00:22.960 --> 00:28.560]  The GamePad API provides a JavaScript interface that makes it possible for developers to access
[00:28.560 --> 00:35.600]  and respond to signals from GamePads and other game controllers in a simple, consistent way.
[00:35.600 --> 00:40.320]  Having connected a standard Xbox controller, we boot up the Raspberry Pi with a customized
[00:40.320 --> 00:43.040]  build route image.
[00:43.040 --> 00:48.560]  Once the device is booted, we run cog, which is a small, single window launcher made specifically
[00:48.560 --> 00:51.080]  for WPE WebKit.
[00:51.080 --> 00:57.360]  The window cog creates can be full screen, which is what we're doing here.
[00:57.360 --> 01:01.800]  The game is loaded from a website that hosts a version of the classic video arcade game
[01:01.800 --> 01:05.480]  Asteroids.
[01:05.480 --> 01:11.240]  Once the game has loaded, the Xbox controller is used to start the game and control the spaceship.
[01:11.240 --> 01:17.040]  All the GamePad inputs are handled by the JavaScript GamePad API.
[01:17.040 --> 01:22.560]  This GamePad support is now possible thanks to work done by Igalia in 2022 and is available
[01:22.560 --> 01:27.160]  to anyone who uses WPE WebKit on their embedded device.
[01:27.160 --> 01:32.000]  For more about WPE WebKit, please visit
[01:32.000 --> 01:35.840]  For more information about Igalia, or to find out how we can help with your embedded device
[01:35.840 --> 01:39.000]  needs, please visit us at  

That should have been “buildroot”.  Again, an entirely reasonable error.  I’ve made at least an order of magnitude more typos writing this post than Whisper has in transcribing these videos.  And this time, it got the spelling of Igalia correct.  I didn’t make any changes between the two runs.  It just… figured it out.

I don’t have a lot to say about this other than, wow.  Just WOW.  This is some real Clarke’s Third Law stuff right here, and the technovertigo is Marianas deep.

Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at March 23, 2023 12:52 PM

March 20, 2023

Danylo Piliaiev

Command stream editing as an effective method to debug driver issues

  1. How the tool is used

In previous posts, “Graphics Flight Recorder - unknown but handy tool to debug GPU hangs” and “Debugging Unrecoverable GPU Hangs”, I demonstrated a few tricks for identifying the location of a GPU fault.

But what’s the next step once you’ve roughly pinpointed the issue? What if the problem is only sporadically reproducible and the only way to ensure consistent results is by replaying a trace of raw GPU commands? How can you precisely determine the cause and find a proper fix?

Sometimes, you may have an inkling of what’s causing the problem, and then you can simply modify the driver’s code to see if it resolves the issue. However, there are instances where the root cause remains elusive or you only want to change a specific value without affecting the same register before and after it.

The optimal approach in these situations is to directly modify the commands sent to the GPU. The ability to arbitrarily edit the command stream was always an obvious idea and has crossed my mind numerous times (and not only mine – proprietary driver developers seem to employ similar techniques). Finally, the stars aligned: my frustration with a recent bug, the kernel’s new support for user-space-defined GPU addresses for buffer objects, the tool I wrote to replay command stream traces not so long ago, and the realization that implementing a command stream editor was not as complicated as initially thought.

The end result is a tool for Adreno GPUs (with msm kernel driver) to decompile, edit, and compile back command streams: “freedreno,turnip: Add tooling to edit command streams and use them in ‘replay’”.

The primary advantage of this command stream editing tool lies in the ability to rapidly iterate over hypotheses. Another highly valuable feature (which I have plans for) would be the automatic bisection of the command stream, which would be particularly beneficial in instances where only the bug reporter has the necessary hardware to reproduce the issue at hand.
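To make the bisection idea concrete, here is a minimal sketch in Python. It is not part of the tool: `first_bad_prefix` and the `reproduces` callback are hypothetical, and the sketch assumes that a truncated command stream can still be replayed and that once a prefix triggers the bug, every longer prefix does too. Neither assumption is guaranteed to hold for real GPU command streams.

```python
def first_bad_prefix(commands, reproduces):
    """Return the smallest n such that reproduces(commands[:n]) is true.

    Assumes reproduces(commands) is true and reproduces([]) is false.
    """
    lo, hi = 0, len(commands)   # invariant: prefix lo is good, prefix hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(commands[:mid]):
            hi = mid            # the culprit is within the first mid commands
        else:
            lo = mid            # the culprit comes after the first mid commands
    return hi


# Toy usage: the "bug" appears as soon as the prefix contains "bad".
stream = ["setup", "clear", "bad", "draw"]
culprit = first_bad_prefix(stream, lambda prefix: "bad" in prefix)
print(stream[culprit - 1])
```

Each step halves the search range, so the number of replays grows with the logarithm of the command count, which matters when every replay has to run on the reporter's hardware.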

How the tool is used

# Decompile one command stream from the trace
./rddecompiler -s 0 gpu_trace.rd > generate_rd.c

# Compile the executable which would output the command stream
meson setup . build
ninja -C build

# Override the command stream with the commands from the generator
./replay gpu_trace.rd --override=0 --generator=./build/generate_rd
Reading dEQP-VK.renderpass.suballocation.formats.r5g6b5_unorm_pack16.clear.clear.rd...
gpuid: 660
Uploading iova 0x100000000 size = 0x82000
Uploading iova 0x100089000 size = 0x4000
cmdstream 0: 207 dwords
generating cmdstream './generate_rd --vastart=21441282048 --vasize=33554432 gpu_trace.rd'
Uploading iova 0x4fff00000 size = 0x1d4
override cmdstream: 117 dwords
skipped cmdstream 1: 248 dwords
skipped cmdstream 2: 223 dwords

The decompiled code isn’t pretty:

/* pkt4: GRAS_SC_SCREEN_SCISSOR[0].TL = { X = 0 | Y = 0 } */
pkt4(cs, REG_A6XX_GRAS_SC_SCREEN_SCISSOR_TL(0), (2), 0);
/* pkt4: GRAS_SC_SCREEN_SCISSOR[0].BR = { X = 32767 | Y = 32767 } */
pkt(cs, 2147450879);
/* pkt4: VFD_INDEX_OFFSET = 0 */
pkt4(cs, REG_A6XX_VFD_INDEX_OFFSET, (2), 0);
pkt(cs, 0);
/* pkt4: SP_FS_OUTPUT[0].REG = { REGID = r0.x } */
pkt4(cs, REG_A6XX_SP_FS_OUTPUT_REG(0), (1), 0);
pkt4(cs, REG_A6XX_SP_TP_RAS_MSAA_CNTL, (2), 2);
pkt(cs, 2);
pkt4(cs, REG_A6XX_GRAS_RAS_MSAA_CNTL, (2), 2);
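
Part of what makes this output hard to read is that multi-field registers show up as one raw integer, such as the 2147450879 above. As a sketch, assuming X sits in the low 16 bits and Y in the high 16 bits (a layout guessed from the comment and the value, not taken from the register definitions), the value can be taken apart and rebuilt like this:

```python
def unpack_xy(value):
    """Split a 32-bit word into (X, Y): X from bits 0-15, Y from bits 16-31."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF


def pack_xy(x, y):
    """Inverse of unpack_xy, useful when editing the value by hand."""
    return (x & 0xFFFF) | ((y & 0xFFFF) << 16)


print(unpack_xy(2147450879))   # (32767, 32767)
print(pack_xy(1920, 1080))     # value to substitute for a 1920x1080 corner
```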

Shader assembly is editable:

const char *source = R"(
  shps #l37
  getone #l37
  cov.u32f32 r1.w, c504.z
  cov.u32f32 r2.x, c504.w
  cov.u32f32 r1.y, c504.x
)";
upload_shader(&ctx, 0x100200d80, source);
emit_shader_iova(&ctx, cs, 0x100200d80);

However, not everything is currently editable, such as descriptors. Despite these limitations, the existing functionality is sufficient for the majority of cases.

by Danylo Piliaiev at March 20, 2023 11:00 PM

Andy Wingo

a world to win: webassembly for the rest of us

Good day, comrades!

Today I'd like to share the good news that WebAssembly is finally coming for the rest of us weirdos.

A world to win

WebAssembly for the rest of us

17 Mar 2023 – BOB 2023

Andy Wingo

Igalia, S.L.

This is a transcript-alike of a talk that I gave last week at BOB 2023, a gathering in Berlin of people that are using "technologies beyond the mainstream" to get things done: Haskell, Clojure, Elixir, and so on. PDF slides here, and I'll link the video too when it becomes available.

WebAssembly, the story

WebAssembly is an exciting new universal compute platform

WebAssembly: what even is it? Not a programming language that you would write software in, but rather a compilation target: a sort of assembly language, if you will.

WebAssembly, the pitch

Predictable portable performance

  • Low-level
  • Within 10% of native

Reliable composition via isolation

  • Modules share nothing by default
  • No nasal demons
  • Memory sandboxing

Compile your code to WebAssembly for easier distribution and composition

If you look at what the characteristics of WebAssembly are as an abstract machine, to me there are two main areas in which it is an advance over the alternatives.

Firstly it's "close to the metal" -- if you compile for example an image-processing library to WebAssembly and run it, you'll get similar performance when compared to compiling it to x86-64 or ARMv8 or what have you. (For image processing in particular, native still generally wins because the SIMD primitives in WebAssembly are more narrow and because getting the image into and out of WebAssembly may imply a copy, but the general point remains.) WebAssembly's instruction set covers a broad range of low-level operations that allows compilers to produce efficient code.

The novelty here is that WebAssembly is portable while also being successful. We language weirdos know that it's not enough to do something technically better: you have to also succeed in getting traction for your alternative.

The second interesting characteristic is that WebAssembly is (generally speaking) a principle-of-least-authority architecture: a WebAssembly module starts with access to nothing but itself. Any capabilities that an instance of a module has must be explicitly shared with it by the host at instantiation-time. This is unlike DLLs which have access to all of main memory, or JavaScript libraries which can mutate global objects. This characteristic allows WebAssembly modules to be reliably composed into larger systems.

WebAssembly, the hype

It’s in all browsers! Serve your code to anyone in the world!

It’s on the edge! Run code from your web site close to your users!

Compose a library (eg: Expat) into your program (eg: Firefox), without risk!

It’s the new lightweight virtualization: Wasm is what containers were to VMs! Give me that Kubernetes cash!!!

Again, the remarkable thing about WebAssembly is that it is succeeding! It's on all of your phones, all your desktop web browsers, all of the content distribution networks, and in some cases it seems set to replace containers in the cloud. Launch the rocket emojis!

WebAssembly, the reality

WebAssembly is a weird backend for a C compiler

Only some source languages are having success on WebAssembly

What about Haskell, Ocaml, Scheme, F#, and so on – what about us?

Are we just lazy? (Well...)

So why aren't we there? Where is Clojure-on-WebAssembly? Where are the F#, the Elixir, the Haskell compilers? Some early efforts exist, but they aren't really succeeding. Why is that? Are we just not putting in the effort? Why is it that Rust gets to ride on the rocket ship but Scheme does not?

WebAssembly, the reality (2)

WebAssembly (1.0, 2.0) is not well-suited to garbage-collected languages

Let’s look into why

As it turns out, there is a reason that there is no good Scheme implementation on WebAssembly: the initial version of WebAssembly is a terrible target if your language relies on the presence of a garbage collector. There have been some advances but this observation still applies to the current standardized and deployed versions of WebAssembly. To better understand this issue, let's dig into the guts of the system to see what the limitations are.

GC and WebAssembly 1.0

Where do garbage-collected values live?

For WebAssembly 1.0, only possible answer: linear memory

(module
  (global $hp (mut i32) (i32.const 0))
  (memory $mem 10)) ;; 640 kB

The primitive that WebAssembly 1.0 gives you to represent your data is what is called linear memory: just a buffer of bytes to which you can read and write. It's pretty much like what you get when compiling natively, except that the memory layout is simpler. You can obtain this memory in units of 64-kilobyte pages. In the example above we're going to request 10 pages, for 640 kB. Should be enough, right? We'll just use it all for the garbage collector, with a bump-pointer allocator. The heap pointer / allocation pointer is kept in the mutable global variable $hp.

(func $alloc (param $size i32) (result i32)
  (local $ret i32)
  (loop $retry
    (local.set $ret (global.get $hp))
    (global.set $hp
      (i32.add (local.get $size) (local.get $ret)))

    (br_if 1
      (i32.lt_u (i32.shr_u (global.get $hp) 16)
                (memory.size))
      (local.get $ret))

    (call $gc)
    (br $retry)))

Here's what an allocation function might look like. The allocation function $alloc is like malloc: it takes a number of bytes and returns a pointer. In WebAssembly, a pointer to memory is just an offset, which is a 32-bit integer (i32). (Having the option of a 64-bit address space is planned but not yet standard.)
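
For readers who find the text format hard to follow, the same bump-pointer idea can be modelled in a few lines of Python. This is only a sketch: the class and its `collect` callback are hypothetical stand-ins for the `$hp` global and `$gc`, and unlike the loop above it retries once and then gives up rather than looping forever.

```python
PAGE_SIZE = 64 * 1024   # WebAssembly page size in bytes


class BumpAllocator:
    def __init__(self, pages=10, collect=None):
        self.limit = pages * PAGE_SIZE   # end of linear memory, in bytes
        self.hp = 0                      # heap pointer, like the $hp global
        self.collect = collect           # hypothetical stand-in for $gc

    def alloc(self, size):
        """Return the offset of `size` fresh bytes, collecting once if full."""
        for attempt in range(2):
            ret = self.hp
            if ret + size <= self.limit:
                self.hp = ret + size     # bump the pointer past the new object
                return ret
            if attempt == 0 and self.collect is not None:
                self.collect(self)       # expected to lower self.hp
        raise MemoryError("linear memory exhausted")


heap = BumpAllocator(pages=10)
print(heap.alloc(16), heap.alloc(8), heap.hp)   # 0 16 24
```

Allocation is just an addition and a comparison; all of the difficulty is hidden in what the collection callback has to do.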

If this is your first time seeing the text representation of a WebAssembly function, you're in for a treat, but that's not the point of the presentation :) What I'd like to focus on is the (call $gc) -- what happens when the allocation pointer reaches the end of the region?

GC and WebAssembly 1.0 (2)

What hides behind (call $gc) ?

Ship a GC over linear memory

Stop-the-world, not parallel, not concurrent

But... roots.

The first thing to note is that you have to provide the $gc yourself. Of course, this is doable -- this is what we do when compiling to a native target.

Unfortunately though the multithreading support in WebAssembly is somewhat underpowered; it lets you share memory and use atomic operations but you have to create the threads outside WebAssembly. In practice probably the GC that you ship will not take advantage of threads and so it will be rather primitive, deferring all collection work to a stop-the-world phase.

GC and WebAssembly 1.0 (3)

Live objects are

  • the roots
  • any object referenced by a live object

Roots are globals and locals in active stack frames

No way to visit active stack frames

What's worse though is that you have no access to roots on the stack. A GC has to keep live objects, defined circularly as any object referenced by a root, or any object referenced by a live object. It starts with the roots: global variables and any GC-managed object referenced by an active stack frame.
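
That circular definition is just graph reachability. As a minimal sketch, with a dictionary of made-up object names standing in for a real heap, the set of live objects can be computed with a worklist:

```python
def mark(roots, edges):
    """Return the set of live object ids.

    `roots` is an iterable of object ids; `edges` maps an object id to the
    ids it references. Both are hypothetical stand-ins for a real heap.
    """
    live = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in live:
            continue                       # already visited; handles cycles
        live.add(obj)
        worklist.extend(edges.get(obj, ()))
    return live


heap = {"a": ["b"], "b": ["a"], "c": ["d"]}
print(sorted(mark(["a"], heap)))   # ['a', 'b']
```

The algorithm is simple once `roots` is known; the trouble described next is that WebAssembly gives you no way to enumerate the stack part of it.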

But there we run into problems, because in WebAssembly (any version, not just 1.0) you can't iterate over the stack, so you can't find active stack frames, so you can't find the stack roots. (Sometimes people want to support this as a low-level capability but generally speaking the consensus would appear to be that overall performance will be better if the engine is the one that is responsible for implementing the GC; but that is foreshadowing!)

GC and WebAssembly 1.0 (3)


  • handle stack for precise roots
  • spill all possibly-pointer values to linear memory and collect conservatively

Handle book-keeping a drag for compiled code

Given the noniterability of the stack, there are basically two work-arounds. One is to have the compiler and run-time maintain an explicit stack of object roots, which the garbage collector can know for sure are pointers. This is nice because it lets you move objects. But, maintaining the stack is overhead; the state of the art solution is rather to create a side table (a "stack map") associating each potential point at which GC can be called with instructions on how to find the roots.
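
To give a feel for the book-keeping the first workaround implies, here is a sketch of an explicit handle stack in Python. The class and method names are invented for illustration; in a real compiler the equivalent operations would be emitted into every function that holds GC-managed values across a potential collection.

```python
class HandleStack:
    """Explicit stack of root slots that a collector can scan precisely."""

    def __init__(self):
        self.slots = []

    def push_frame(self):
        return len(self.slots)        # marker to restore on function exit

    def add_root(self, value):
        self.slots.append(value)
        return len(self.slots) - 1    # a handle is an index into the stack

    def get(self, handle):
        return self.slots[handle]     # re-read after GC: the object may have moved

    def pop_frame(self, marker):
        del self.slots[marker:]       # drop this frame's roots

    def roots(self):
        return list(self.slots)


stack = HandleStack()
frame = stack.push_frame()
h = stack.add_root("pair@0x10")
stack.slots[h] = "pair@0x40"          # what a moving collector would do
print(stack.get(h))                   # pair@0x40
stack.pop_frame(frame)
```

Because the collector knows every slot holds a reference, it can rewrite slots when it moves objects, which is what the conservative approach cannot do.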

The other workaround is to spill the whole stack to memory. Or, possibly just pointer-like values; anyway, you conservatively scan all words for things that might be roots. But instead of having access to the memory to which the WebAssembly implementation would spill your stack, you have to do it yourself. This can be OK but it's sub-optimal; see my recent post on the Whippet garbage collector for a deeper discussion of the implications of conservative root-finding.

GC and WebAssembly 1.0 (4)

Cycles with external objects (e.g. JavaScript) uncollectable

A pointer to a GC-managed object is an offset to linear memory, need capability over linear memory to read/write object from outside world

No way to give back memory to the OS

Gut check: gut says no

If that were all, it would already be not so great, but it gets worse! Another problem with linear-memory GC is that it limits the potential for composing a number of modules and the host together, because the garbage collector that manages JavaScript objects in a web browser knows nothing about your garbage collector over your linear memory. You can easily create memory leaks in a system like that.

Also, it's pretty gross that a reference to an object in linear memory requires arbitrary read-write access over all of linear memory in order to read or write object fields. How do you build a reliable system without invariants?

Finally, once you collect garbage, and maybe you manage to compact memory, you can't give anything back to the OS. There are proposals in the works but they are not there yet.

If the BOB audience had to choose between Worse is Better and The Right Thing, I think the BOB audience is much closer to the Right Thing. People like that feel instinctual revulsion to ugly systems and I think GC over linear memory describes an ugly system.

GC and WebAssembly 1.0 (5)

There is already a high-performance concurrent parallel compacting GC in the browser

Halftime: C++ N – Altlangs 0

The kicker is that WebAssembly 1.0 requires you to write and deliver a terrible GC when there is already probably a great GC just sitting there in the host, one that has hundreds of person-years of effort invested in it, one that will surely do a better job than you could ever do. WebAssembly as hosted in a web browser should have access to the browser's garbage collector!

I have the feeling that while those of us with a soft spot for languages with garbage collection have been standing on the sidelines, Rust and C++ people have been busy on the playing field scoring goals. Tripping over the ball, yes, but eventually they do manage to make it within striking distance.

Change is coming!

Support for built-in GC set to ship in Q4 2023

With GC, the material conditions are now in place

Let’s compile our languages to WebAssembly

But to continue the sportsball metaphor, I think in the second half our players will finally be able to get out on the pitch and give it the proverbial 110%. Support for garbage collection is coming to WebAssembly users, and I think even by the end of the year it will be shipping in major browsers. This is going to be big! We have a chance and we need to seize it.

Scheme to Wasm

Spritely + Igalia working on Scheme to WebAssembly

Avoid truncating language to platform; bring whole self

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Even with GC, though, WebAssembly is still a weird machine. It would help to see the concrete approaches that some languages of interest manage to take when compiling to WebAssembly.

In that spirit, the rest of this article/presentation is a walkthrough of the approach that I am taking as I work on a WebAssembly compiler for Scheme. (Thanks to Spritely for supporting this work!)

Before diving in, a meta-note: when you go to compile a language to, say, JavaScript, you are mightily tempted to cut corners. For example you might implement numbers as JavaScript numbers, or you might omit implementing continuations. In this work I am trying to not cut corners, and instead to implement the language faithfully. Sometimes this means I have to work around weirdness in WebAssembly, and that's OK.

When thinking about Scheme, I'd like to highlight a few specific areas that have interesting translations. We'll start with value representation, which stays in the GC theme from the introduction.

Scheme to Wasm: Values

;;       any  extern  func
;;        |
;;        eq
;;     /  |   \
;; i31 struct  array

The unitype: (ref eq)

Immediate values in (ref i31)

  • fixnums with 30-bit range
  • chars, bools, etc

Explicit nullability: (ref null eq) vs (ref eq)

The GC extensions for WebAssembly are phrased in terms of a type system. Oddly, there are three top types; as far as I understand it, this is the result of a compromise about how WebAssembly engines might want to represent these different kinds of values. For example, an opaque JavaScript value flowing into a WebAssembly program would have type (ref extern). On a system with NaN boxing, you would need 64 bits to represent a JS value. On the other hand a native WebAssembly object would be a subtype of (ref any), and might be representable in 32 bits, either because it's a 32-bit system or because of pointer compression.

Anyway, three top types. The user can define subtypes of struct and array, instantiate values of those types, and access their fields. The life cycle of reference-typed objects is automatically managed by the run-time, which is just another way of saying they are garbage-collected.

For Scheme, we need a common supertype for all values: the unitype, in Bob Harper's memorable formulation. We can use (ref any), but actually we'll use (ref eq) -- this is the supertype of values that can be compared by (pointer) identity. So now we can code up eq?:

(func $eq? (param (ref eq) (ref eq))
           (result i32)
  (ref.eq (local.get 0) (local.get 1)))

Generally speaking in a Scheme implementation there are immediates and heap objects. Immediates can be encoded in the bits of a value, whereas for heap objects the bits of a value encode a reference (pointer) to an object on the garbage-collected heap. We usually represent small integers as immediates, as well as booleans and other oddball values.

Happily, WebAssembly gives us an immediate value type, i31. We'll encode our immediates there, and otherwise represent heap objects as instances of struct subtypes.
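
As an illustration of how fixnums with a 30-bit range can share the 31 bits with other immediates, here is a sketch of one possible encoding in Python. The choice of the low bit as the fixnum marker is an assumption made for this example, not necessarily the layout the compiler uses.

```python
I31_MIN, I31_MAX = -(1 << 30), (1 << 30) - 1        # range of a signed i31
FIXNUM_MIN, FIXNUM_MAX = -(1 << 29), (1 << 29) - 1  # 30-bit fixnum range


def encode_fixnum(n):
    """Encode a small integer in an i31 payload; low bit 0 marks a fixnum."""
    if not FIXNUM_MIN <= n <= FIXNUM_MAX:
        raise OverflowError("not a fixnum; needs a heap-allocated number")
    return n << 1


def is_fixnum(bits):
    return bits & 1 == 0


def decode_fixnum(bits):
    return bits >> 1   # arithmetic shift, so the sign is preserved


print(encode_fixnum(21), decode_fixnum(encode_fixnum(-5)))   # 42 -5
```

Payloads with the low bit set are then free for chars, booleans and the other oddball values.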

Scheme to Wasm: Values (2)

Heap objects subtypes of struct; concretely:

(type $heap-object
  (struct (field $tag-and-hash i32)))
(type $pair
  (sub $heap-object
    (struct i32 (ref eq) (ref eq))))

GC proposal allows subtyping on structs, functions, arrays

Structural type equivalence: explicit tag useful

We actually need to have a common struct supertype as well, for two reasons. One is that we need to be able to hash Scheme values by identity, but for this we need an embedded lazily-initialized hash code. It's a bit annoying to take the per-object memory hit but it's a reality, and the JVM does it this way, so it must not be so terrible.

The other reason is more subtle: WebAssembly's type system is built in such a way that types that are "structurally" equivalent are indistinguishable. So a pair has two fields, besides the hash, but there might be a number of other fundamental object types that have the same shape; you can't fully rely on WebAssembly's dynamic type checks (ref.test et al) to be able to query the type of a value. Instead we re-use the low bits of the hash word to include a type tag, which might be 1 for pairs, 2 for vectors, 3 for closures, and so on.
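
A sketch of what sharing the hash word with a type tag could look like, written in Python for brevity. The 8-bit tag width is an assumption for this example; the tag values are the ones suggested above.

```python
TAG_BITS = 8                       # assumption: room for 256 type tags
TAG_MASK = (1 << TAG_BITS) - 1
PAIR, VECTOR, CLOSURE = 1, 2, 3    # example tags from the text


def make_header(tag, hash_code=0):
    """Pack a type tag and an identity hash into one 32-bit word."""
    return ((hash_code << TAG_BITS) | tag) & 0xFFFFFFFF


def header_tag(word):
    return word & TAG_MASK


def header_hash(word):
    return word >> TAG_BITS        # 0 can mean "hash not yet initialized"


word = make_header(PAIR, 0xABCDE)
print(header_tag(word) == PAIR, hex(header_hash(word)))   # True 0xabcde
```

With this split the hash has 24 bits left, and a type check becomes a mask and a compare on the first field.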

Scheme to Wasm: Values (3)

(func $cons (param (ref eq)
                   (ref eq))
            (result (ref $pair))
  (struct.new_canon $pair
    ;; Assume heap tag for pairs is 1.
    (i32.const 1)
    ;; Car and cdr.
    (local.get 0)
    (local.get 1)))

(func $%car (param (ref $pair))
            (result (ref eq))
  (struct.get $pair 1 (local.get 0)))

With this knowledge we can define cons as a simple call to struct.new_canon $pair.

I didn't have time for this in the talk, but there is a ghost haunting this code: the ghost of nominal typing. See, in a web browser at least, every heap object will have its first word point to its "hidden class" / "structure" / "map" word. If the engine ever needs to check that a value is of a specific shape, it can do a quick check on the map word's value; if it needs to do deeper introspection, it can dereference that word to get more details.

Under the hood, testing whether a (ref eq) is a pair or not should be a simple check that it's a (ref struct) (and not a fixnum), and then a comparison of its map word to the run-time type corresponding to $pair. If subtyping of $pair is allowed, we start to want inline caches to handle polymorphism, but checking the map word is still the basic mechanism.

However, as I mentioned, we only have structural equality of types; two (struct (ref eq)) type definitions will define the same type and have the same map word (run-time type / RTT). Hence the _canon in the name of struct.new_canon $pair: we create an instance of $pair, with the canonical run-time-type for objects having $pair-shape.

In earlier drafts of the WebAssembly GC extensions, users could define their own RTTs, which effectively amounts to nominal typing: not only does this object have the right structure, but was it created with respect to this particular RTT. But, this facility was cut from the first release, and it left ghosts in the form of these _canon suffixes on type constructor instructions.

For the Scheme-to-WebAssembly effort, we effectively add back in a degree of nominal typing via type tags. For better or for worse this results in a so-called "open-world" system: you can instantiate a separately-compiled WebAssembly module that happens to define the same types and use the same type tags and it will be able to happily access the contents of Scheme values from another module. If you were to use nominal types, you wouldn't be able to do so, unless there were some common base module that defined and exported the types of interest, and which any extension module would need to import.

(func $car (param (ref eq)) (result (ref eq))
  (local (ref $pair))
  (block $not-pair
    (br_if $not-pair
      (i32.eqz (ref.test $pair (local.get 0))))
    (local.set 1 (ref.cast $pair (local.get 0)))
    (br_if $not-pair
      (i32.ne
        (i32.const 1)
        (i32.and
          (i32.const 0xff)
          (struct.get $heap-object 0 (local.get 1)))))
    (return_call $%car (local.get 1)))

  (call $type-error)
  (unreachable))

In the previous example we had $%car, with a funny % in the name, taking a (ref $pair) as an argument. But in the general case (barring compiler heroics) car will take an instance of the unitype (ref eq). To know that it's actually a pair we have to make two checks: one, that it is a struct and has the $pair shape, and two, that it has the right tag. Oh well!

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

But with all of that I think we have a solid story on how to represent values. I went through all of the basic value types in Guile and checked that they could all be represented using GC types, and it seems that all is good. Now on to the next point: varargs.

Scheme to Wasm: Varargs

(list 'hey)      ;; => (hey)
(list 'hey 'bob) ;; => (hey bob)

Problem: Wasm functions strongly typed

(func $list (param ???) (result (ref eq))

Solution: Virtualize calling convention

In WebAssembly, you define functions with a type, and it is impossible to call them in an unsound way. You must call $car with exactly 1 argument or it will not compile, and that argument has to be of a specific type, and so on. But Scheme doesn't enforce these restrictions on the language level, bless its little miscreant heart. You can call car with 5 arguments, and you'll get a run-time error. There are some functions that can take a variable number of arguments, doing different things depending on incoming argument count.

How do we square these two approaches to function types?

;; "Registers" for args 0 to 3
(global $arg0 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg1 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg2 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg3 (mut (ref eq)) (i31.new (i32.const 0)))

;; "Memory" for the rest
(type $argv (array (ref eq)))
(global $argN (ref $argv)
        (array.new_canon
          $argv (i31.new (i32.const 0)) (i32.const 42)))

Uniform function type: argument count as sole parameter

Callee moves args to locals, possibly clearing roots

The approach we are taking is to virtualize the calling convention. Just as, when calling an x86-64 function, you pass the first argument in $rdi, then $rsi, and eventually, if you run out of registers, put arguments in memory, we'll pass the first argument in the $arg0 global, then $arg1, and eventually in memory if needed. The function will receive the number of incoming arguments as its sole parameter; in fact, all functions will be of type (func (param i32)).

The expectation is that after checking argument count, the callee will load its arguments from globals / memory to locals, which the compiler can do a better job on than globals. We might not even emit code to null out the argument globals; might leak a little memory but probably would be a win.

You can imagine a world in which $arg0 actually gets globally allocated to $rdi, because it is only live during the call sequence; but I don't think that world is this one :)
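The virtualized calling convention can be modelled in a few lines of Python. This is only a sketch under assumptions: the four "registers" mirror $arg0 to $arg3, a plain list stands in for the $argN array, and the helper names (pass_arguments, load_arguments, call) are invented for the illustration.

```python
from typing import Callable, List

NUM_ARG_REGISTERS = 4                        # $arg0 .. $arg3 in the text
arg_registers: List[object] = [None] * NUM_ARG_REGISTERS
arg_overflow: List[object] = []              # stands in for the $argN array


def pass_arguments(args: List[object]) -> int:
    """Caller side: fill the "registers" first, spill the rest to memory."""
    for i, value in enumerate(args[:NUM_ARG_REGISTERS]):
        arg_registers[i] = value
    arg_overflow[:] = args[NUM_ARG_REGISTERS:]
    return len(args)                         # the sole real parameter


def load_arguments(nargs: int) -> List[object]:
    """Callee side: copy the incoming arguments into locals."""
    in_registers = arg_registers[:min(nargs, NUM_ARG_REGISTERS)]
    in_memory = arg_overflow[:max(0, nargs - NUM_ARG_REGISTERS)]
    return in_registers + in_memory


def scheme_list(nargs: int) -> List[object]:
    """A varargs function: any argument count is acceptable."""
    return load_arguments(nargs)


def scheme_car(nargs: int) -> object:
    """A fixed-arity function: the count is checked at run time."""
    if nargs != 1:
        raise TypeError(f"car: expected 1 argument, got {nargs}")
    (pair,) = load_arguments(nargs)
    return pair[0]


def call(callee: Callable[[int], object], *args: object) -> object:
    """Every callee has the same type: it takes only the argument count."""
    return callee(pass_arguments(list(args)))
```

With this model, call(scheme_list, 'hey', 'bob') yields ['hey', 'bob'], while call(scheme_car, 1, 2, 3, 4, 5) raises an error at run time rather than being rejected up front, which matches the Scheme behaviour described above.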

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Great, two points out of the way! Next up, tail calls.

Scheme to Wasm: Tail calls

;; Call known function
(return_call $f arg ...)

;; Call function by value
(return_call_ref $type callee arg ...)

Friends -- I almost cried making this slide. We Schemers are used to working around the lack of tail calls, and I could have done so here, but it's just such a relief that these functions are just going to be there and I don't have to think much more about them. Technically speaking the proposal isn't merged yet; checking the phases document it's at the last station before heading to the great depot in the sky. But, soon soon it will be present and enabled in all WebAssembly implementations, and we should build systems now that rely on it.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Next up, my favorite favorite topic: delimited continuations.

Scheme to Wasm: Prompts (1)

Problem: Lightweight threads/fibers, exceptions

Possible solutions

  • Eventually, built-in coroutines
  • binaryen’s asyncify (not yet ready for GC); see Julia
  • Delimited continuations

“Bring your whole self”

Before diving in though, one might wonder why bother. Delimited continuations are a building-block that one can use to build other, more useful things, notably exceptions and light-weight threading / fibers. Could there be another way of achieving these end goals without having to implement this relatively uncommon primitive?

For fibers, it is possible to implement them in terms of a built-in coroutine facility. The standards body seems willing to include a coroutine primitive, but it seems far off to me; not within the next 3-4 years I would say. So let's put that to one side.

There is a more near-term solution, to use asyncify to implement coroutines somehow; but my understanding is that asyncify is not ready for GC yet.

For the Guile flavor of Scheme at least, delimited continuations are table stakes in their own right, so given that we will have them on WebAssembly, we might as well use them to implement fibers and exceptions in the same way as we do on native targets. Why compromise if you don't have to?

Scheme to Wasm: Prompts (2)

Prompts delimit continuations

(define k
  (call-with-prompt 'foo
    ; body
    (lambda ()
      (+ 34 (abort-to-prompt 'foo)))
    ; handler
    (lambda (continuation)
      continuation)))

(k 10)       ;; ⇒ 44
(- (k 10) 2) ;; ⇒ 42

k is the _ in (lambda () (+ 34 _))

There are a few ways to implement delimited continuations, but my usual way of thinking about them is that a delimited continuation is a slice of the stack. One end of the slice is the prompt established by call-with-prompt, and the other by the continuation of the call to abort-to-prompt. Capturing a slice pops it off the stack, copying it out to the heap as a callable function. Calling that function splats the captured slice back on the stack and resumes it where it left off.

Scheme to Wasm: Prompts (3)

Delimited continuations are stack slices

Make stack explicit via minimal continuation-passing-style conversion

  • Turn all calls into tail calls
  • Allocate return continuations on explicit stack
  • Breaks functions into pieces at non-tail calls

This low-level intuition of what a delimited continuation is leads naturally to an implementation; the only problem is that we can't slice the WebAssembly call stack. The workaround here is similar to the varargs case: we virtualize the stack.

The mechanism to do so is a continuation-passing-style (CPS) transformation of each function. Functions that make no calls, such as leaf functions, don't need to change at all. The same goes for functions that make only tail calls. For functions that make non-tail calls, we split them into pieces that preserve the only-tail-calls property.

Scheme to Wasm: Prompts (4)

Before a non-tail-call:

  • Push live-out vars on stacks (one stack per top type)
  • Push continuation as funcref
  • Tail-call callee

Return from call via pop and tail call:

(return_call_ref (call $pop-return)
                 (i32.const 0))

After return, continuation pops state from stacks

Consider a simple function:

(define (f x y)
  (+ x (g y)))

Before making a non-tail call, a "tailified" function will instead push all live data onto an explicitly-managed stack and tail-call the callee. It also pushes on the return continuation. Returning from the callee pops the return continuation and tail-calls it. The return continuation pops the previously-saved live data and continues.

In this concrete case, tailification would split f into two pieces:

(define (f x y)
  (push! x)
  (push-return! f-return-continuation-0)
  (g y))

(define (f-return-continuation-0 g-of-y)
  (define k (pop-return!))
  (define x (pop!))
  (k (+ x g-of-y)))

Now there are no non-tail calls, besides calls to run-time routines like push! and + and so on. This transformation is implemented by tailify.scm.
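To see the two pieces of f run, here is a small Python model of the tailified code. Several things are assumptions for the sake of the sketch: Python has no tail calls, so a tail call is modelled as returning a (function, arguments) pair to a trampoline; saved values and return continuations live on two separate lists; and the body of g (doubling its argument) is made up, since the text leaves g unspecified.

```python
from typing import Callable, List

value_stack: List[object] = []       # saved live variables
return_stack: List[Callable] = []    # return continuations


def g(y):
    # Hypothetical callee: doubles y, then "returns" by popping the
    # return continuation and tail-calling it with the result.
    k = return_stack.pop()
    return (k, (y * 2,))


def f(x, y):
    value_stack.append(x)                          # (push! x)
    return_stack.append(f_return_continuation_0)   # (push-return! ...)
    return (g, (y,))                               # tail-call (g y)


def f_return_continuation_0(g_of_y):
    k = return_stack.pop()           # (pop-return!): continuation of f's caller
    x = value_stack.pop()            # (pop!): restore the saved x
    return (k, (x + g_of_y,))        # tail-call k with (+ x g-of-y)


def run(fn: Callable, *args):
    """Trampoline: perform tail calls until the final continuation fires."""
    result = []

    def halt(value):
        result.append(value)
        return None                  # no further tail call: stop

    return_stack.append(halt)
    step = (fn, args)
    while step is not None:
        fn, args = step
        step = fn(*args)
    return result[0]


print(run(f, 1, 20))  # → 41
```

Because every function ends in a tail call, all of the state that would normally live on the native call stack now lives in value_stack and return_stack, where the program itself can inspect and slice it.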

Scheme to Wasm: Prompts (5)


abort-to-prompt:

  • Pop stack slice to reified continuation object
  • Tail-call new top of stack: prompt handler

Calling a reified continuation:

  • Push stack slice
  • Tail-call new top of stack

No need to wait for effect handlers proposal; you can have it all now!

The salient point is that the stacks on which push! operates (in reality, probably four or five stacks: one in linear memory or an array for types like i32 or f64, one for each of the three managed top types any, extern, and func, and one for the stack of return continuations) are managed by us, so we can slice them.

Someone asked in the talk about whether the explicit memory traffic and avoiding the return-address-buffer branch prediction is a source of inefficiency in the transformation and I have to say, yes, but I don't know by how much. I guess we'll find out soon.
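To make the stack-slicing idea concrete, here is a much-simplified Python model of prompts on an explicit stack. It is a sketch under assumptions, not the actual implementation: a single list holds both prompts and return continuations, ordinary Python calls stand in for tail calls, the captured slice excludes the prompt itself, and the names Prompt and return_value are invented for the example.

```python
from typing import List

stack: List[object] = []   # explicit stack of prompts and return continuations


class Prompt:
    """Marker frame pushed by call_with_prompt."""

    def __init__(self, tag, handler):
        self.tag, self.handler = tag, handler


def return_value(value):
    """Return to the topmost continuation, dropping prompts that were exited."""
    while stack:
        frame = stack.pop()
        if not isinstance(frame, Prompt):
            return frame(value)          # resume the continuation
    return value                         # empty stack: final result


def call_with_prompt(tag, body, handler):
    stack.append(Prompt(tag, handler))
    return body()


def abort_to_prompt(tag):
    """Pop the slice above the nearest matching prompt; call the handler."""
    for i in range(len(stack) - 1, -1, -1):
        frame = stack[i]
        if isinstance(frame, Prompt) and frame.tag == tag:
            captured = stack[i + 1:]     # the slice, without the prompt
            del stack[i:]                # pop the slice and the prompt

            def continuation(value):
                stack.extend(captured)   # push the slice back
                return return_value(value)

            return frame.handler(continuation)
    raise RuntimeError(f"no prompt with tag {tag!r}")


def body():
    # Tailified (+ 34 (abort-to-prompt 'foo)): push the return
    # continuation for (+ 34 _), then tail-call abort_to_prompt.
    stack.append(lambda v: return_value(34 + v))
    return abort_to_prompt('foo')


k = call_with_prompt('foo', body, lambda continuation: return_value(continuation))
print(k(10))       # → 44
print(k(10) - 2)   # → 42
```

The captured slice is an ordinary list of frames, so the continuation can be called more than once; each call pushes a fresh copy of the slice.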

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Okeydokes, last point!

Scheme to Wasm: Numbers

Numbers can be immediate: fixnums

Or on the heap: bignums, fractions, flonums, complex

Supertype is still ref eq

Consider imports to implement bignums

  • On web: BigInt
  • On edge: Wasm support module (mini-gmp?)

Dynamic dispatch for polymorphic ops, as usual

First, I would note that sometimes the compiler can unbox numeric operations. For example if it infers that a result will be an inexact real, it can use unboxed f64 instead of library routines working on heap flonums ((struct i32 f64); the initial i32 is for the hash and tag). But we still need a story for the general case that involves dynamic type checks.

The basic idea is that we get to have fixnums and heap numbers. Fixnums will handle most of the integer arithmetic that we need, and will avoid allocation. We'll inline most fixnum operations as a fast path and call out to library routines otherwise. Of course fixnum inputs may produce a bignum output as well, so the fast path sometimes includes another slow-path callout.

We want to minimize binary module size. In an ideal compile-to-WebAssembly situation, a small program will have a small module size, down to a minimum of a kilobyte or so; larger programs can be megabytes, if the user experience allows for the download delay. Binary module size will be dominated by code, so that means we need to plan for aggressive dead-code elimination, minimize the size of fast paths, and also minimize the size of the standard library.

For numbers, we try to keep module size down by leaning on the platform. In the case of bignums, we can punt some of this work to the host; on a JavaScript host, we would use BigInt, and on a WASI host we'd compile an external bignum library. So that's the general story: inlined fixnum fast paths with dynamic checks, and otherwise library routine callouts, combined with aggressive whole-program dead-code elimination.
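The fast-path / slow-path split can be sketched in Python. The assumptions here are that fixnums span the full signed 31-bit range of an i31, that a small Bignum class stands in for a host BigInt or library bignum, and that slow_add plays the role of the library routine; none of these names come from the original.

```python
from dataclasses import dataclass

FIXNUM_MIN = -(1 << 30)        # smallest signed 31-bit value
FIXNUM_MAX = (1 << 30) - 1     # largest signed 31-bit value


@dataclass(frozen=True)
class Bignum:
    """Stand-in for a heap-allocated big integer."""
    value: int


def normalize(n: int):
    """Represent n as a fixnum if it fits, otherwise as a heap bignum."""
    return n if FIXNUM_MIN <= n <= FIXNUM_MAX else Bignum(n)


def slow_add(a, b):
    """Library routine: handles every case the inline path does not."""
    def to_int(x):
        return x.value if isinstance(x, Bignum) else x
    return normalize(to_int(a) + to_int(b))


def add(a, b):
    """Inline fast path for fixnums, with a callout otherwise."""
    if type(a) is int and type(b) is int:           # both are fixnums
        result = a + b
        if FIXNUM_MIN <= result <= FIXNUM_MAX:      # no overflow
            return result
    return slow_add(a, b)                           # overflow or heap numbers
```

Note how the overflow case falls through to the same library routine as the mixed-representation case, so the inlined code stays small.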

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Hey I think we did it! Always before when I thought about compiling Scheme or Guile to the web, I got stuck on some point or another, was tempted down the corner-cutting alleys, and eventually gave up before starting. But finally it would seem that the stars are aligned: we get to have our Scheme and run it too.


Debugging: The wild west of DWARF; prompts

Strings: stringref host strings spark joy

JS interop: Export accessors; Wasm objects opaque to JS. externref.

JIT: A whole ’nother talk!

AOT: wasm2c

Of course, like I said, WebAssembly is still a weird machine: as a compilation target but also at run-time. Debugging is a right proper mess; perhaps some other article on that some time.

How to represent strings is a surprisingly gnarly question; there is tension within the WebAssembly standards community between those that think that it's possible for JavaScript and WebAssembly to share an underlying string representation, and those that think that it's a fool's errand and that copying is the only way to go. I don't know which side will prevail; perhaps more on that as well later on.

Similarly the whole interoperation with JavaScript question is very much in its early stages, with the current situation choosing to err on the side of nothing rather than the wrong thing. You can pass a WebAssembly (ref eq) to JavaScript, but JavaScript can't do anything with it: it has no prototype. The state of the art is to also ship a JS run-time that wraps each wasm object, proxying exported functions from the wasm module as object methods.

Finally, some language implementations really need JIT support, like PyPy. There, that's a whole 'nother talk!

WebAssembly for the rest of us

With GC, WebAssembly is now ready for us

Getting our languages on WebAssembly now a S.M.O.P.

Let’s score some goals in the second half!


WebAssembly has proven to have some great wins for C, C++, Rust, and so on -- but now it's our turn to get in the game. GC is coming and we as a community need to be getting our compilers and language run-times ready. Let's put on the coffee and bang some bytes together; it's still early days and there's a world to win out there for the language community with the best WebAssembly experience. The game is afoot: happy consing!

by Andy Wingo at March 20, 2023 09:06 AM

March 18, 2023

Alex Bradbury

What's new for RISC-V in LLVM 16

LLVM 16.0.0 was just released today, and as I did for LLVM 15, I wanted to highlight some of the RISC-V specific changes and improvements. This is very much a tour of a chosen subset of additions rather than an attempt to be exhaustive. If you're interested in RISC-V, you may also want to check out my recent attempt to enumerate the commercially available RISC-V SoCs and if you want to find out what's going on in LLVM as a whole on a week-by-week basis, then I've got the perfect newsletter for you.

In case you're not familiar with LLVM's release schedule, it's worth noting that there are two major LLVM releases a year (i.e. one roughly every 6 months) and these are timed releases as opposed to being cut when a pre-agreed set of feature targets have been met. We're very fortunate to benefit from an active and growing set of contributors working on RISC-V support in LLVM projects, who are responsible for the work I describe below - thank you! I coordinate biweekly sync-up calls for RISC-V LLVM contributors, so if you're working in this area please consider dropping in.


LLVM 16 is the first release featuring a user guide for the RISC-V target (16.0.0 version, current HEAD). This fills a long-standing gap in our documentation, whereby it was difficult to tell at a glance the expected level of support for the various RISC-V instruction set extensions (standard, vendor-specific, and experimental extensions of either type) in a given LLVM release. We've tried to keep it concise but informative, and add a brief note to describe any known limitations that end users should know about. Thanks again to Philip Reames for kicking this off, and the reviewers and contributors for ensuring it's kept up to date.


LLVM 16 was a big release for vectorisation. As well as a long-running strand of work making incremental improvements (e.g. better cost modelling) and fixes, scalable vectorization was enabled by default. This allows LLVM's loop vectorizer to use scalable vectors when profitable. Follow-on work enabled support for loop vectorization using fixed length vectors and disabled vectorization of epilogue loops. See the talk optimizing code for scalable vector architectures (slides) by Sander de Smalen for more information about scalable vectorization in LLVM and introduction to the RISC-V vector extension by Roger Ferrer Ibáñez for an overview of the vector extension and some of its codegen challenges.

The RISC-V vector intrinsics supported by Clang have changed (to match e.g. this and this) during the 16.x development process in a backwards incompatible way, as the RISC-V Vector Extension Intrinsics specification evolves towards a v1.0. In retrospect, it would have been better to keep the intrinsics behind an experimental flag when the vector codegen and MC layer (assembler/disassembler) support became stable, and this is something we'll be more careful of for future extensions. The good news is that thanks to Yueh-Ting Chen, headers are available that provide the old-style intrinsics mapped to the new version.

Support for new instruction set extensions

I refer to 'experimental' support many times below. See the documentation on experimental extensions within RISC-V LLVM for guidance on what that means. One point to highlight is that the extensions remain experimental until they are ratified, which is why some extensions on the list below are 'experimental' despite the fact the LLVM support needed is trivial. On to the list of newly added instruction set extensions:

  • Experimental support for the Zca, Zcf, and Zcd instruction set extensions. These are all 16-bit instructions and are being defined as part of the output of the RISC-V code size reduction working group.
    • Zca is just a subset of the standard 'C' compressed instruction set extension but without floating point loads/stores.
    • Zcf is also a subset of the standard 'C' compressed instruction set extension, including just the single precision floating point loads and stores (c.flw, c.flwsp, c.fsw, c.fswsp).
    • Zcd, as you might have guessed, just includes the double precision floating point loads and stores from the standard 'C' compressed instruction set extension (c.fld, c.fldsp, c.fsd, c.fsdsp).
  • Experimental assembler/disassembler support for the Zihintntl instruction set extension. This provides a small set of instructions that can be used to hint that the memory accesses of the following instruction exhibit poor temporal locality.
  • Experimental assembler/disassembler support for the Zawrs instruction set extension, providing a pair of instructions meant for use in a polling loop allowing a core to enter a low-power state and wait on a store to a memory location.
  • Experimental support for the Ztso extension, which for now just means setting the appropriate ELF header flag. If a core implements Ztso, it implements the Total Store Ordering memory consistency model. Future releases will provide alternate lowerings of atomics operations that take advantage of this.
  • Code generation support for the Zfhmin extension (load/store, conversion, and GPR/FPR move support for 16-bit floating point values).
  • Codegen and assembler/disassembler support for the XVentanaCondOps vendor extension, which provides conditional arithmetic and move/select operations.
  • Codegen and assembler/disassembler support for the XTHeadVdot vendor extension, which implements vector integer four 8-bit multiply and 32-bit add.


LLDB has started to become usable for RISC-V in this period due to work by contributor 'Emmer'. As they summarise here, LLDB should be usable for debugging RV64 programs locally but support is lacking for remote debug (e.g. via the gdb server protocol). During the LLVM 16 development window, LLDB gained support for software single stepping on RISC-V, support in EmulateInstructionRISCV for RV{32,64}I, as well as extensions A and M, C, RV32F and RV64F, and D.

Short forward branch optimisation

Another improvement that's fun to look more closely at is support for "short forward branch optimisation" for the SiFive 7 series cores. What does this mean? Well, let's start by looking at the problem it's trying to solve. The base RISC-V ISA doesn't include conditional moves or predicated instructions, which can be a downside if your code features unpredictable short forward branches (with the ensuing cost in terms of branch mispredictions and bloating branch predictor state). The ISA spec includes commentary on this decision (page 23), noting some disadvantages of adding such instructions to the specification and noting microarchitectural techniques exist to convert short forward branches into predicated code internally. In the case of the SiFive 7 series, this is achieved using macro-op fusion where a branch over a single ALU instruction is fused and executed as a single conditional instruction.

In the LLVM 16 cycle, compiler optimisations targeting this microarchitectural feature were enabled for conditional move style sequences (i.e. branch over a register move) as well as for other ALU operations. The job of the compiler here is of course to emit a sequence compatible with the micro-architectural optimisation when possible and profitable. I'm not aware of other RISC-V designs implementing a similar optimisation - although there are developments in terms of instructions to support such operations directly in the ISA which would avoid the need for such microarchitectural tricks. See XVentanaCondOps, XTheadCondMov, the previously proposed but now abandoned Zbt extension (part of the earlier bitmanip spec) and more recently the proposed Zicond (integer conditional operations) standard extension.


It's perhaps not surprising that code generation for atomics can be tricky to understand, and the LLVM documentation on atomics codegen and libcalls is actually one of the best references on the topic I've found. A particularly important note in that document is that if a backend supports any inline lock-free atomic operations at a given size, all operations of that size must be supported in a lock-free manner. If targeting a RISC-V CPU without the atomics extension, all atomics operations would usually be lowered to __atomic_* libcalls. But if we know a bit more about the target, it's possible to do better - for instance, a single-core microcontroller could implement an atomic operation in a lock-free manner by disabling interrupts (and conventionally, lock-free implementations of atomics are provided through __sync_* libcalls). This kind of setup is exactly what the +forced-atomics feature enables, where atomic load/store can be lowered to a load/store with appropriate fences (as is supported in the base ISA) while other atomic operations generate a __sync_* libcall.

There's also been a very minor improvement for targets with native atomics support (the 'A' instruction set extension) that I may as well mention while on the topic. As you might know, atomic operations such as compare and swap are lowered to an instruction sequence involving lr.{w,d} (load reserved) and sc.{w,d} (store conditional). There are very specific rules about these instruction sequences that must be met to align with the architectural forward progress guarantee (section 8.3, page 51), which is why we expand to a fixed instruction sequence at a very late stage in compilation (see original RFC). This means the sequence of instructions implementing the atomic operation is opaque to LLVM's optimisation passes and is treated as a single unit. The obvious disadvantage of avoiding LLVM's optimisations is that sometimes there are optimisations that would be helpful and wouldn't break that forward-progress guarantee. One that came up in real-world code was the lack of branch folding, which would have simplified a branch in the expanded cmpxchg sequence that just targets another branch with the same condition (by just folding in the eventual target). With some relatively simple logic, this suboptimal codegen is resolved.

; Before                 => After
.loop:                   => .loop:
  lr.w.aqrl a3, (a0)     => lr.w.aqrl a3, (a0)
  bne a3, a1, .afterloop => bne a3, a1, .loop
  sc.w.aqrl a4, a2, (a0) => sc.w.aqrl a4, a2, (a0)
  bnez a4, .loop         => bnez a4, .loop
.afterloop:              =>
  bne a3, a1, .loop      =>
  ret                    => ret

Assorted optimisations

As you can imagine, there have been a lot of incremental minor improvements over the past ~6 months. I unfortunately only have space (and patience) to highlight a few of them.

A new pre-regalloc pseudo instruction expansion pass was added in order to allow optimising the global address access instruction sequences such as those found in the medany code model (and was later broadened further). This results in improvements such as the following (note: this transformation was already supported for the medlow code model):

; Before                            => After
.Lpcrel_hi1:                        => .Lpcrel_hi1:
auipc a0, %pcrel_hi(ga)             => auipc a0, %pcrel_hi(ga+4)
addi a0, a0, %pcrel_lo(.Lpcrel_hi1) =>
lw a0, 4(a0)                        => lw a0, %pcrel_lo(.Lpcrel_hi1)(a0)

A missing target hook (isUsedByReturnOnly) had been preventing tail calling libcalls in some cases. This was fixed, and later support was added for generating an inlined sequence of instructions for some of the floating point libcalls.

The RISC-V compressed instruction set extension defines a number of 16-bit encodings that map to a longer 32-bit form (with restrictions on addressable registers in the compressed form of course). The conversion of 32-bit instructions to their 16-bit forms, when possible, happens at a very late stage, after instruction selection. But of course over time, we've introduced more tuning to influence codegen decisions in cases where a choice can be made to produce an instruction that can be compressed, rather than one that can't. A recent addition to this was the RISCVStripWSuffix pass, which for RV64 targets will convert addw and slliw to add or slli respectively when it can be determined that all the users of its result only use the lower 32 bits. This is a minor code size saving, as slliw has no matching compressed instruction and c.addw can address a more restricted set of registers than c.add.
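The reason this substitution is safe comes down to modular arithmetic: the W-suffixed instructions compute the same low 32 bits as their full-width counterparts and differ only in sign-extending the result into the upper half of the register. The following Python model of the four instructions on 64-bit register values is a sketch written for this illustration (it is not code from LLVM), and it only covers shift amounts from 0 to 31.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def sign_extend_32(x: int) -> int:
    """Sign-extend the low 32 bits of x into a 64-bit register value."""
    x &= MASK32
    return (x | ~MASK32) & MASK64 if x & 0x80000000 else x


def add(a: int, b: int) -> int:
    return (a + b) & MASK64


def addw(a: int, b: int) -> int:
    return sign_extend_32(a + b)


def slli(a: int, shamt: int) -> int:
    return (a << shamt) & MASK64


def slliw(a: int, shamt: int) -> int:
    return sign_extend_32(a << shamt)     # shamt assumed to be in 0..31


a, b = 0x7FFFFFFF, 1
print(hex(add(a, b)))    # → 0x80000000
print(hex(addw(a, b)))   # → 0xffffffff80000000
print(add(a, b) & MASK32 == addw(a, b) & MASK32)  # → True
```

The upper 32 bits differ, but a user that only reads the lower 32 bits cannot observe the difference, which is the condition the pass checks for.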


At the risk of repeating myself, this has been a selective tour of some additions I thought it would be fun to write about. Apologies if I've missed your favourite new feature or improvement - the LLVM release notes will include some things I haven't had space for here. Thanks again to everyone who has been contributing to make RISC-V support in LLVM even better.

If you have a RISC-V project you think my colleagues and I at Igalia may be able to help with, then do get in touch regarding our services.

Article changelog
  • 2023-03-19: Clarified that Zawrs and Zihintntl support just involves the MC layer (assembler/disassembler).
  • 2023-03-18: Initial publication date.

March 18, 2023 12:00 PM

March 14, 2023

José Dapena

Stack walk profiling NodeJS in Windows

Last year I wrote a series of blog posts (1, 2, 3) about stack walk profiling Chromium using Windows native tools around ETW.

A fast recap: ETW support for stack walking in V8 allows showing V8 JIT generated code in the Windows Performance Analyzer. This is a powerful tool to analyze workloads where JavaScript execution time is significant.

In this blog post, I will cover the usage of this very same tool, but to analyze NodeJS execution.

Enabling stack walk JIT information in NodeJS

In an ideal situation, V8 engines would always generate stack walk information when Windows is profiling. This is something we will want to consider in the future, once we prove that enabling it has no cost if we are not in a tracing session.

Meanwhile, we need to set the V8 flag --enable-etw-stack-walking somehow. This will install hooks that, when a profiling session starts, will emit the JIT generated code addresses, and the information about the source code associated with them.

For a command line execution of NodeJS runtime, it is as simple as passing the command line flag:

node --enable-etw-stack-walking

This enables ETW stack walking for that specific NodeJS session… Good, but not very useful.

Enabling ETW stack walking for a session

What’s the problem here? Usually, NodeJS is invoked indirectly through other tools (whether based on NodeJS or not). Some examples are Yarn, NPM, or even some Windows scripts or link files.

We could tune all the existing launching scripts to pass --enable-etw-stack-walking to the NodeJS runtime when it is called. But that is not very convenient.

There is a better way though: using the NODE_OPTIONS environment variable. This way, stack walking support can be enabled for all NodeJS calls in a shell session, or even system wide.

Bad news and good news

Some bad news: NodeJS was refusing --enable-etw-stack-walking in NODE_OPTIONS. There is a filter for which V8 options it accepts (mostly for security purposes), and ETW support was not considered.

Good news? I implemented a fix adding the flag to the list accepted by NODE_OPTIONS. It has already landed, and it is available from NodeJS 19.6.0. Unfortunately, if you are using an older version, then you may need to backport the patch.
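If NodeJS is launched from a script rather than from an interactive shell, the same idea can be applied programmatically. The following Python helper is a hypothetical sketch (the function name and its behaviour are assumptions made for this illustration): it returns a copy of an environment in which NODE_OPTIONS contains the flag exactly once, keeping whatever options were already set. It splits NODE_OPTIONS on whitespace, so it does not handle option values that contain quoted spaces.

```python
import os
from typing import Dict, Mapping, Optional

ETW_FLAG = "--enable-etw-stack-walking"


def with_node_option(flag: str,
                     env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of env whose NODE_OPTIONS includes flag exactly once."""
    new_env = dict(os.environ if env is None else env)
    options = new_env.get("NODE_OPTIONS", "").split()
    if flag not in options:
        options.append(flag)
    new_env["NODE_OPTIONS"] = " ".join(options)
    return new_env
```

The resulting mapping can be passed as the env argument of subprocess.run when starting a tool that ends up invoking NodeJS, so that child processes inherit the flag.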

Using it: linting TypeScript

To explain how this can be used, I will analyse ESLint on a known workload: TypeScript. For simplicity, we are using the lint task provided by TypeScript.

This example assumes the usage of Git Bash.

First, clone TypeScript from GitHub, and go to the cloned copy:

git clone
cd TypeScript

Then, install hereby and the dependencies of TypeScript:

npm install -g hereby
npm ci

Now, we are ready to profile the lint task. First, set NODE_OPTIONS:

export NODE_OPTIONS="--enable-etw-stack-walking"

Then, launch UIForETW. This tool simplifies capturing traces, and provides good defaults for JavaScript ETW analysis. It also provides a very useful keyboard shortcut, <Ctrl>+<Win>+R, to start and then stop a recording.

Switch to Git Bash terminal and do this sequence:

  • Write (without pressing <Enter>): hereby lint
  • Press <Ctrl>+<Win>+R to start recording. Wait 3-4 seconds as recording does not start immediately.
  • Press <Enter>. ESLint will traverse all the TypeScript code.
  • Press again <Ctrl>+<Win>+R to stop recording.

After a few seconds UIForETW will automatically open the trace in Windows Performance Analyzer. Thanks to setting NODE_OPTIONS, all the child processes of the parent node.exe execution also have stack walk information.

Randomascii inclusive (stack) analysis

Focusing on node.exe instances, in Randomascii inclusive (stack) view, we can see where time is spent for each of the node.exe processes. If I take the bigger one (that is the longest of the benchmarks I executed), I get some nice insights.

The worker threads take 40% of the CPU processing. What is happening there? I basically see JIT compilation and garbage collection concurrent marking. V8 offloads that work, so there is a benefit from a multicore machine.

Most of the work happens in the main thread, as expected. And most of the time is spent parsing and applying the lint rules (half for each).

If we go deeper in the rules processing, we can see which rules are more expensive.

Memory allocation

In total commit view, we can observe the memory usage pattern of the process running ESLint. For most of the seconds of the workload, allocation grows steadily (to over 2GB of RAM). Then there is a first garbage collection, and a bit later, the process finishes and all the memory is deallocated.

More findings

At first sight, I observe we are creating the rules objects throughout the whole execution of ESLint. What does it mean? Could we run faster by reusing those? I can also observe that a big part of the time in the main thread ends in leaves doing garbage collection.

This is a good start! You can see how ETW can give you insights into what is happening and how much time it takes. And you can even correlate that with memory usage, file I/O, etc.

Builtins fix

Using NodeJS as it is today will still show many missing lines in the stack. I could run those tests and do a useful analysis only because I applied a very recent patch I landed in V8.

Before the fix, we would have this sequence:

  • Enable ETW recording
  • Run several NodeJS tests.
  • Each of the tests creates one or more JS contexts.
  • That context then sends to ETW the information of any code compiled with JIT.

But there was a problem: any JS context already has a lot of pre-compiled code associated with it: builtins and V8 snapshot code. Those were missing from the captured ETW traces.

The fix, as said, has already landed in V8, and hopefully will be available soon in future NodeJS releases.

Wrapping up

There is more work to do:

  • WASM is still not supported.
  • Ideally, we would want to have --enable-etw-stack-walking set by default, as the impact while not tracing is minimal.

In any case, after these new fixes, capturing ETW stack walks of code executed by NodeJS runtime is a bit easier. I hope this gives some joy to your performance research.

One last thing! My work for these fixes is possible thanks to the sponsorship from Igalia and Bloomberg.

by José Dapena Paz at March 14, 2023 05:34 PM

Víctor Jáquez

Review of Igalia Multimedia activities (2022)

We, Igalia’s multimedia team, would like to share with you our list of achievements during 2022.

WebKit Multimedia


Phil already wrote a first blog post, of a series, in this regard: WebRTC in WebKitGTK and WPE, status updates, part I. Please be sure to give it a glance; it has nice videos.

Long story short, last year we started to support Media Capture and Streams in WebKitGTK and WPE using GStreamer, for input devices (camera and microphone), desktop sharing, WebAudio, and web canvas. But this is just the first step. We are currently working on RTCPeerConnection, also using GStreamer, to share all these captured streams with other web peers. Meanwhile, we’ll wait for the second episode of Phil’s series 🙂


We worked on an initial implementation of MediaRecorder with GStreamer (1.20 or later). The specification allows a web browser to record a selected stream. For example, a voice-memo or video application could encode and upload a capture of your microphone / camera.


While WebKitGTK already has Gamepad support, WPE lacked it. We did the implementation last year, and there’s a blog post about it: Gamepad in WPEWebkit, with video showing a demo of it.

Capture encoded video streams from webcams

Some webcams only provide high resolution frames encoded in H.264 or similar formats. In order to support these resolutions with those webcams, we added support for negotiating those formats and decoding them internally to handle the streams. We are still just at the beginning of more efficient support, though.

Flatpak SDK maintenance

A lot of effort went into maintaining the Flatpak SDK for WebKit. It is a set of runtimes that allows a reproducible build of WebKit, independently of the Linux distribution used. Nowadays the Flatpak SDK is used in WebKit’s EWS, and by many developers.

Among all the features added during the year, we can highlight Rust support, a full integrity check before upgrading, and a way to override dependencies as local projects.

MSE/EME enhancements

As every year, massive work was done in WebKit ports using GStreamer for Media Source Extensions and Encrypted Media Extensions, improving user experience with different streaming services in the Web, such as Odysee, Amazon, DAZN, etc.

In the case of encrypted media, GStreamer-based WebKit ports provide the stubs to communicate with an external Content Decryption Module (CDM). If you would like to support this on your platform, you can reach us.

We also worked on a video demo showing how MSE/EME works on a Raspberry Pi 3 using WPE:

WebAudio demo

We also spent time recording video demos, such as this one, showing WebAudio using WPE on a desktop computer.


We managed to merge a lot of bug fixes in GStreamer, which in many cases can be harder than implementing new features, though the latter are more interesting to tell, such as those related to making Rust the main development language for GStreamer besides C.

Rust bindings and GStreamer elements for Vonage Video API / OpenTok

OpenTok is the legacy name of Vonage Video API, and is a PaaS (Platform As a Service) to ease the development and deployment of WebRTC services and applications.

We published on GitHub our Rust bindings both for the Client SDK for Linux and for the Server SDK using the REST API, along with a GStreamer plugin to publish and subscribe to video and audio streams.


In the beginning there was webrtcbin, an element that implements the majority of the W3C RTCPeerConnection API. It’s so flexible and powerful that it’s rather hard to use for the most common cases. Then came webrtcsink, a wrapper of webrtcbin, written in Rust, which receives GStreamer streams and offers and streams them to web peers. Later on, we developed webrtcsrc, the webrtcsink counterpart: an element whose source pads push streams from web peers, such as another browser, forwarding those web streams as GStreamer ones in a pipeline. Both webrtcsink and webrtcsrc are written in Rust.

Behavior-Driven Development test framework for GStreamer

Behavior-Driven Development is gaining relevance with tools like Cucumber for Java and its domain specific language, Gherkin, to define software behaviors. Rustaceans have picked up these ideas and developed cucumber-rs. The logical consequence was obvious: Why not GStreamer?

Last year we tinkered with GStreamer-Cucumber, a BDD framework to define behavior tests for GStreamer pipelines.

GstValidate Rust bindings

There has been some discussion about whether BDD is the best way to test GStreamer pipelines. There is also GstValidate, and last year we added Rust bindings for it.

GStreamer Editing Services

Not everything was Rust, though. We work hard on GStreamer’s nuts and bolts.

Last year, we gathered the team to hack on GStreamer Editing Services, particularly to explore adding OpenGL and DMABuf support, such as downloading or uploading a texture before processing, and selecting a proper filter to avoid those transfers.

GstVA and GStreamer-VAAPI

We helped with the maintenance of GStreamer-VAAPI and the development of its near-future replacement, GstVA, adding new elements such as the H.264 encoder, the compositor and the JPEG decoder. We also took part in the debate and code review around negotiating DMABuf streams in the pipeline.

Vulkan decoder and parser library for CTS

You might have heard that Vulkan has now integrated video decoding into its API, while encoding is currently a work in progress. We devoted time to helping Khronos with the Vulkan Video Conformance Tests (CTS), particularly with a parser based on GStreamer, and to developing an H.264 decoder in GStreamer using the Vulkan Video API.

You can check the presentation we gave at the last Vulkanised.

WPE Android Experiment

In a joint adventure with Igalia’s WebKit team, we did some experiments to port WPE to Android. This is just an internal proof of concept so far, but we are looking forward to seeing how this will evolve in the future, and what new possibilities this might open up.

If you have any questions about WebKit, GStreamer, Linux video stack, compilers, etc., please contact us.

by vjaquez at March 14, 2023 10:45 AM

March 12, 2023

Clayton Craft

Presence detection, without compromising privacy

I use Home Assistant (HA) to control my WiFi-enabled thermostat, which in turn is walled off from the Internet. So no matter how "smart" my thermostat wants to be, it's forced to live in isolation and follow orders from HA. For the past year, I've been trying out crude methods for detecting when humans are home, or not, so that the HVAC can be set accordingly. The last attempt required using the "ping" integration in HA. You basically provide a list of IPs, add users that are associated with the IP / device tracker for it, then you can conditionally run automation based on the status of those things.

This has worked pretty well, however there are some pain points that have been really digging at me:

  1. Setting this up requires using static DHCP for devices on the network that I want to associate with someone being home, like a cell phone. This is tricky since some phones use random MACs, and it's cumbersome to add new people to the "who could be home?" list.

  2. If people come over, I don't want the HVAC to shut off if the regular inhabitants leave. E.g. if my partner and I have to leave the house, and my mother-in-law stays at home, I'll get in trouble if the heat shuts off because HA thinks everyone is gone. Don't ask me how I know.

  3. Setting this all up requires basically configuring HA to track inhabitants and guests in my home. Even though I've gone to great lengths to try to prevent HA from sending any data externally, I still don't like that this data is being stored anywhere. I actually don't care if my partner or some particular guest is home and I'm not; it's easy enough to ask if I suddenly do. But seeing that someone is home is fine. I've heard that some folks use the "cloud" to manage HA and its data. Ouch.

So I present my latest attempt to alleviate those 3 things as much as possible:

  • Monitor IP ranges and detect if any devices in those ranges are "online" (responding to ping)
  • Use DHCP to "put" devices that humans carry or use when they are home into the appropriate monitored IP range
  • HA can use this to determine if any humans are connected (or not), and thus at home (or not)
  • Guests who are on WiFi are counted as "home" by HA without me having to do anything more than just give them the guest WiFi password (which I would have done previously anyways).
  • HA doesn't have any information about specific users and their home/not home status. And it doesn't care if devices use a consistent MAC or not.

All three pain points above are addressed... although maybe not 100% solved in every situation. But it still seems like an improvement.

At the heart of this is a script that runs ping in parallel on an arbitrary number of IP addresses given to it. I'm sure it could be improved... Like, I doubt this would turn out well if the list gets too large. And the run time of the script can be kinda long if there are no devices and it ends up retrying the max number of times. The retry is there because I've seen that some devices might not respond to ping immediately (they were sleeping, or had a WiFi reconnect event at the wrong time, etc...) So if you have any ideas, let me know!

# Pings the given addresses and prints "up" if any of
# them respond to the ping, else it prints "down". A
# non-zero return value is an error.
# A simple backoff delay is used between attempts to
# ping the group when none of them responded to the
# last attempt.

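function pings(){
    local PIDS=()
    local STATUS=()
    local IP=( "$@" )

    # start one ping per address, all in parallel
    for ip in "${IP[@]}"; do
        ping  -c1 -W1 "$ip" &>/dev/null &
        PIDS+=( "$!" )
    done

    # collect the exit status of every ping
    for p in "${PIDS[@]}"; do
        wait "${p}"
        STATUS+=( "$?" )
    done

    # success if at least one address answered
    for s in "${STATUS[@]}"; do
        if [[ $s -eq 0 ]]; then
            return 0
        fi
    done

    return 1
}

if [ -z "$1" ]; then
    echo "Parameter required!!"
    exit 1
fi

ADDRS=( "$@" )
tries=0
max_tries=3     # assumed value, missing from this listing
try_interval=1  # starting interval, before backoff

while ! pings "${ADDRS[@]}"; do
    tries=$(( tries + 1 ))
    if [[ $tries -ge $max_tries ]]; then
        echo down
        exit 0
    fi

    sleep "$try_interval"
    # assumed backoff policy: double the delay each round
    try_interval=$(( try_interval * 2 ))
done

echo up
exit 0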

This script should be saved in some directory that HA can access. Next, we need to actually run the script in a way that HA can consume the output directly and use it. A binary sensor seemed like the best choice:

  - platform: command_line
    name: 'network_users'
    device_class: presence
    scan_interval: 30 # seconds
    command_timeout: 30 # seconds
    command: 'bash -c "tools/ 192.168.20.{100..150}"'
    payload_on: 'up'
    payload_off: 'down'

The parameters to that script can be ranges that the shell can expand, so you don't have to list out every IP. Triggering every 30 seconds may be too often, I may back that off later to be a minute or more... so I'll run this for a bit to see if I start to form any opinions about changing it.

Next we need to create a "device tracker" so that this can be associated with some "user" in HA for presence detection. I used device_tracker.see in a scheduled automation to add and update a device tracker thing in HA specifically for network users:

- alias: Update presence tracking sensors
  description: This is done periodically so that device trackers for each sensor have
    updated values after HA startup / config reload
  trigger:
  - platform: time_pattern
    minutes: /1
  condition: []
  action:
  - parallel:
    - service: device_tracker.see
      data:
        dev_id: network_users
        location_name: '{{ "home" if is_state("binary_sensor.network_users", "on")
          else "not_home" }}'
      alias: Update Network Users Tracker Status
    alias: Update device trackers
    # .... other binary sensors used for determining presence can be listed here too!
  mode: single

I set this automation to fire every 1 minute, so that the current status is "known" to HA soon after it starts up, or config is reloaded. Maybe 1 minute is "too often", but it's just reading a value from the HA database, so, meh. Basically, I've learned that triggering automations based on state changes is flaky, and it's often more reliable to trigger some automations on a schedule and then check entity states in the condition or action.

Lastly, there needs to be a "person" in HA to associate this device tracker with. I created a generic "person" and associated the tracker with it. This can be done through Settings->People->Add Person. I called the fake person I made for this "Pamela", but the name doesn't matter.

The presence status of this person can now be used in automation and/or dashboard cards:

Home Assistant Card showing that someone is connected to the network, and HVAC is enabled by automation

Now, this obviously isn't perfect... One of the issues is that this can lead to the HVAC being left on if someone forgets a phone at home, for example. But that could have happened over the last year with the old implementation, and I've found that 1) it's rare, and 2) the savings from having a "smarter" HVAC outweigh the handful of situations where it's being wasteful. Also, this could be circumvented if some smart guest assigns a static IP outside any expected range I pass to the script above. But if they do that, then the joke is on them since they'll just slowly freeze as the HVAC is shut off because HA thinks no one is home. I do have some other presence detection things (like IR motion detectors) that help out a little too.

With all of that, HA can tell if someone is home if they're connected to a network, and have an IP from DHCP (or otherwise) that's being monitored. And HA is completely oblivious to any specific people. Again, that's mostly just a personal concern and probably not very existential for my setup, but maybe this will be helpful for folks who do use some external thing for managing HA data, and are concerned about how location status for household members is handled.

by Unknown at March 12, 2023 12:00 AM

March 10, 2023

Andy Wingo

pre-initialization of garbage-collected webassembly heaps

Hey comrades, I just had an idea that I won't be able to work on in the next couple months and wanted to release it into the wild. They say if you love your ideas, you should let them go and see if they come back to you, right? In that spirit I abandon this idea to the woods.

Basically the idea is Wizer-like pre-initialization of WebAssembly modules, but for modules that store their data on the GC-managed heap instead of just in linear memory.

Say you have a WebAssembly module with GC types. It might look like this:

  (type $t0 (struct (ref eq)))
  (type $t1 (struct (ref $t0) i32))
  (type $t2 (array (mut (ref $t1))))
  (global $g0 (ref null eq)
    (ref.null eq))
  (global $g1 (ref $t1)
    (array.new_canon $t0 ( (i32.const 42))))
  (function $f0 ...)

You define some struct and array types, there are some global variables, and some functions to actually do the work. (There are probably also tables and other things but I am simplifying.)

If you consider the object graph of an instantiated module, you will have some set of roots R that point to GC-managed objects. The live objects in the heap are the roots and any object referenced by a live object.

Let us assume a standalone WebAssembly module. In that case the set of types T of all objects in the heap is closed: it can only be one of the types $t0, $t1, and so on that are defined in the module. These types have a partial order and can thus be sorted from most to least specific. Let's assume that this sort order is just the reverse of the definition order, for now. Therefore we can write a general type introspection function for any object in the graph:

(func $introspect (param $obj anyref)
  (block $L2 (ref $t2)
    (block $L1 (ref $t1)
      (block $L0 (ref $t0)
        (br_on_cast $L2 $t2 (local.get $obj))
        (br_on_cast $L1 $t1 (local.get $obj))
        (br_on_cast $L0 $t0 (local.get $obj))
        (unreachable))
      ;; Do $t0 things...
      (return))
    ;; Do $t1 things...
    (return))
  ;; Do $t2 things...
  (return))

In particular, given a WebAssembly module, we can generate a function to trace edges in an object graph of its types. Using this, we can identify all live objects, and what's more, we can take a snapshot of those objects:

(func $snapshot (result (ref (array (mut anyref))))
  ;; Start from roots, use introspect to find concrete types
  ;; and trace edges, use a worklist, return an array of
  ;; all live objects in topological sort order
  ...)
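To make the worklist idea concrete, here is a minimal sketch in JavaScript rather than WebAssembly. The edgesOf callback is a hypothetical stand-in for the generated per-type tracing function, and the result is in discovery order, which is simpler than a true topological sort.

```javascript
// Sketch: collect every object reachable from a set of roots.
// `edgesOf` is a hypothetical callback that returns the objects
// referenced by a given object.
function snapshot(roots, edgesOf) {
  const seen = new Set(); // relies on object identity
  const live = [];
  const worklist = [...roots];
  while (worklist.length > 0) {
    const obj = worklist.pop();
    if (obj === null || seen.has(obj)) continue;
    seen.add(obj);
    live.push(obj);
    for (const child of edgesOf(obj)) worklist.push(child);
  }
  return live;
}

// Tiny graph with a cycle: a -> b -> a, and a -> c.
// Object d points at a, but nothing points at d.
const a = { name: "a", refs: [] };
const b = { name: "b", refs: [a] };
const c = { name: "c", refs: [] };
a.refs.push(b, c);
const d = { name: "d", refs: [a] };

const liveNames = snapshot([a], (o) => o.refs).map((o) => o.name);
console.log(liveNames.sort()); // d is not in the list
```

The visited set is what keeps the cycle between a and b from looping forever, which is why object identity matters for this kind of traversal.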

Having a heap snapshot is interesting for introspection purposes, but my interest is in having fast start-up. Many programs have a kind of "initialization" phase where they get the system up and running, and only then proceed to actually work on the problem at hand. For example, when you run a script with python3, Python will first spend some time parsing and byte-compiling it, importing the modules it uses and so on, and only then will it actually run the script's code. Wizer lets you snapshot the state of a module after initialization but before the real work begins, which can save on startup time.

For a GC heap, we actually have similar possibilities, but the mechanism is different. Instead of generating an array of all live objects, we could generate a serialized state of the heap as bytecode, and another function to read the bytecode and reload the heap:

(func $pickle (result (ref (array (mut i8))))
  ;; Return an array of bytecode which, when interpreted,
  ;; can reconstruct the object graph and set the roots
  ...)
(func $unpickle (param (ref (array (mut i8))))
  ;; Interpret the bytecode, building object graph in
  ;; topological order
  ...)

The unpickler is module-dependent: it will need one case to construct each concrete type $tN in the module. Therefore the bytecode grammar would be module-dependent too.
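A toy version of such a pair, sketched in JavaScript under heavy assumptions: the two object shapes ("t0" with one reference, "t1" with one reference and an integer) and the three-number instruction format are invented for this illustration. It also patches references in a second pass, which needs mutable fields; immutable fields would require building objects in topological order instead.

```javascript
// Hypothetical object model: { type: "t0", ref } and { type: "t1", t0, n }.
const OP_T0 = 0, OP_T1 = 1, NONE = -1;

function pickle(root) {
  // Number every reachable object, in discovery order (root is 0).
  const ids = new Map();
  const order = [];
  const visit = (obj) => {
    if (obj === null || ids.has(obj)) return;
    ids.set(obj, order.length);
    order.push(obj);
    visit(obj.type === "t0" ? obj.ref : obj.t0);
  };
  visit(root);
  const idOf = (obj) => (obj === null ? NONE : ids.get(obj));
  // One fixed-width instruction per object: [opcode, ref id, integer].
  const code = [];
  for (const obj of order) {
    if (obj.type === "t0") code.push(OP_T0, idOf(obj.ref), 0);
    else code.push(OP_T1, idOf(obj.t0), obj.n);
  }
  return code;
}

function unpickle(code) {
  const objs = [];
  // Pass 1: allocate objects, one case per concrete type.
  for (let i = 0; i < code.length; i += 3) {
    objs.push(code[i] === OP_T0
      ? { type: "t0", ref: null }
      : { type: "t1", t0: null, n: code[i + 2] });
  }
  // Pass 2: patch references now that every object exists.
  objs.forEach((obj, k) => {
    const target = code[3 * k + 1];
    const ref = target === NONE ? null : objs[target];
    if (obj.type === "t0") obj.ref = ref;
    else obj.t0 = ref;
  });
  return objs.length > 0 ? objs[0] : null;
}

// Round-trip a two-object cycle.
const leaf = { type: "t0", ref: null };
const pair = { type: "t1", t0: leaf, n: 42 };
leaf.ref = pair;
const copy = unpickle(pickle(pair));
console.log(copy.n, copy.t0.ref === copy);
```

Note how both functions need one case per type: adding a third shape would change the instruction set, which is the sense in which the bytecode grammar depends on the module.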

What you would get with a bytecode-based $pickle/$unpickle pair would be the ability to serialize and reload heap state many times. But for the pre-initialization case, probably that's not precisely what you want: you want to residualize a new WebAssembly module that, when loaded, will rehydrate the heap. In that case you want a function like:

(func $make-init (result (ref (array (mut i8))))
  ;; Return an array of WebAssembly code which, when
  ;; added to the module as a function and invoked,
  ;; can reconstruct the object graph and set the roots.
  ...)

Then you would use binary tools to add that newly generated function to the module.

In short, there is a space open for a tool which takes a WebAssembly+GC module M and produces M', a module which contains a $make-init function. Then you use a WebAssembly+GC host to load the module and call the $make-init function, resulting in a WebAssembly function $init which you then patch in to the original M to make M'', which is M pre-initialized for a given task.


Some of the object graph is constant; for example, an instance of a struct type that has no mutable fields. These objects don't have to be created in the init function; they can be declared as new constant global variables, which an engine may be able to initialize more efficiently.

The pre-initialized module will still have an initialization phase in which it builds the heap. This is a constant function and it would be nice to avoid it. Some WebAssembly hosts will be able to run pre-initialization and then snapshot the GC heap using lower-level facilities (copy-on-write mappings, pointer compression and relocatable cages, pre-initialization on an internal level...). This would potentially decrease latency and may allow for cross-instance memory sharing.


There are five preconditions to be able to pickle and unpickle the GC heap:

  1. The set of concrete types in a module must be closed.

  2. The roots of the GC graph must be enumerable.

  3. The object-graph edges from each live object must be enumerable.

  4. To prevent cycles, we have to know when an object has been visited: objects must have identity.

  5. We must be able to create each type in a module.

I think there are three limitations to this pre-initialization idea in practice.

One is externref; these values come from the host and are by definition not introspectable by WebAssembly. Let's keep the closed-world assumption and consider the case where the set of external reference types is closed also. In that case if a module allows for external references, we can perhaps make its pickling routines call out to the host to (2) provide any external roots (3) identify edges on externref values (4) compare externref values for identity and (5) indicate some imported functions which can be called to re-create external objects.

Another limitation is funcref. In practice in the current state of WebAssembly and GC, you will only have a funcref which is created by ref.func, and which (3) therefore has no edges and (5) can be re-created by ref.func. However neither WebAssembly nor the JS API has any way of knowing which function index corresponds to a given funcref. Including function references in the graph would therefore require some sort of host-specific API. Relatedly, function references are not comparable for equality (func is not a subtype of eq), which is a little annoying but not so bad considering that function references can't participate in a cycle. Perhaps a solution though would be to assume (!) that the host representation of a funcref is constant: the JavaScript (e.g.) representations of (ref.func 0) and (ref.func 0) are the same value (in terms of ===). Then you could compare a given function reference against a set of known values to determine its index. Note, when function references are expanded to include closures, we will have more problems in this area.
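On the JavaScript side, that comparison against known values could look roughly like the sketch below. It assumes, as above, that the host hands back the same value every time for a given function index; the knownFuncs array and the helper name are invented for the example, and plain JavaScript functions stand in for exported WebAssembly functions.

```javascript
// Hypothetical: knownFuncs[i] is the host value for (ref.func i).
function makeFuncIndexLookup(knownFuncs) {
  const indexOf = new Map(); // Map keys compare objects by identity
  knownFuncs.forEach((f, i) => indexOf.set(f, i));
  // Returns the function index, or -1 for an unknown reference.
  return (f) => (indexOf.has(f) ? indexOf.get(f) : -1);
}

// Stand-ins for the host representations of two functions.
const f0 = () => 0;
const f1 = () => 1;
const lookup = makeFuncIndexLookup([f0, f1]);

console.log(lookup(f1));      // 1
console.log(lookup(() => 2)); // -1
```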

Finally, there is the question of roots. Given a module, we can generate a function to read the values of all reference-typed globals and of all entries in all tables. What we can't get at are any references from the stack, so our object graph may be incomplete. Perhaps this is not a problem though, because when we unpickle the graph we won't be able to re-create the stack anyway.

OK, that's my idea. Have at it, hackers!

by Andy Wingo at March 10, 2023 09:20 AM

Jesse Alama

Binary floats can let us down! When close enough isn't enough

How binary floating points can let us down

If you've played Monopoly, you'll know about the Bank Error in Your Favor card in the Community Chest. Remember this?

A bank error in your favor? Sweet! But what if the bank makes an error in its favor? Surely that's just as possible, right?

I'm here to tell you that if you're doing everyday financial calculations (nothing fancy, but involving money that you care about), then you might need to know that, if you're using binary floating-point numbers, something might be going wrong. Let's see how binary floating-point numbers might yield bank errors in your favor, or the bank's.

In a wonderful paper on decimal floating-point numbers, Mike Cowlishaw gives an example.

Here's how you can reproduce that in JavaScript:

(1.05 * 0.7).toPrecision(2); // "0.73"

Some programmers might not be aware of this, but many are. By pointing this out I'm not trying to be a smartypants who knows something you don't. For me, this example illustrates just how common this sort of error might be.

For programmers who are aware of the issue, one typical approach to dealing with it is this: Never work with sub-units of a currency. (Some currencies don't have this issue. If that's you and your problem domain, you can kick back and be glad that you don't need to engage in the following sorts of headaches.) For instance, when working with US dollars or euros, this approach mandates that one never works with dollars and cents, but only with cents. In this setting, dollars exist only as an abstraction on top of cents. As far as possible, calculations never use floats. But if a floating-point number threatens to come up, some form of rounding is used.
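A minimal sketch of the cents-only approach, redoing the 1.05 * 0.7 calculation from above as 70 cents plus 5%. The function name addPercent and the choice of round-half-up are assumptions made for this example, and it only handles non-negative amounts.

```javascript
// Add a percentage to an amount held in integer cents, rounding
// half up. BigInt keeps every intermediate value exact.
// Assumes non-negative inputs.
function addPercent(cents, percent) {
  const scaled = cents * (100n + percent); // in hundredths of a cent
  return (scaled + 50n) / 100n;            // BigInt division truncates
}

console.log(addPercent(70n, 5n));         // 74n
console.log((1.05 * 0.7).toPrecision(2)); // "0.73"
```

The integer version lands on 74 cents because 73.5 is represented exactly and rounds up, whereas the float product is slightly below 0.735 and so rounds down.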

Another approach for a programmer is to delegate financial calculations to an external system, such as a relational database, that natively supports proper decimal calculations. One difficulty is that even if one delegates these calculations to an external system, if one lets a floating-point value flow into one's program, even a value that can be trusted, it may become tainted just by being imported into a language that doesn't properly support decimals. If, for instance, the result of a calculation done in, say, Postgres, is exactly 0.1, and that flows into your JavaScript program as a number, it's possible that you'll be dealing with a contaminated value. For instance:

(0.1).toPrecision(25); // "0.1000000000000000055511151"

This example, admittedly, requires quite a lot of decimals (19!) before the ugly reality of the situation rears its head. The reality is that 0.1 does not, and cannot, have an exact representation in binary. The earlier example with the cost of a phone call is there to raise your awareness of the possibility that one doesn't need to go 19 decimal places before one starts to see some weirdness showing up.

There are all sorts of examples of this. It's exceedingly rare for a decimal number to have an exact representation in binary. Of the numbers 0.1, 0.2, …, 0.9, only 0.5 can be exactly represented in binary.
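One way to check that claim, sketched here with exact BigInt arithmetic: a fraction has a finite binary expansion only if its denominator, in lowest terms, is a power of two. The helper names are made up for this example, and a finite expansion is only a necessary condition in general, since a double also has a limited number of bits.

```javascript
// Greatest common divisor of two positive BigInts.
function gcd(a, b) {
  while (b !== 0n) {
    [a, b] = [b, a % b];
  }
  return a;
}

// True if n/d, reduced to lowest terms, has a power-of-two denominator.
function hasFiniteBinaryExpansion(n, d) {
  let reduced = d / gcd(n, d);
  while (reduced % 2n === 0n) reduced /= 2n;
  return reduced === 1n;
}

// Check 0.1, 0.2, ..., 0.9, written as k/10.
const exact = [];
for (let k = 1n; k <= 9n; k++) {
  if (hasFiniteBinaryExpansion(k, 10n)) exact.push(`0.${k}`);
}
console.log(exact); // [ '0.5' ]
```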

Next time you look at a bank statement, or a bill where some tax is calculated, I invite you to ask how that was calculated. Are they using decimals, or floats? Is it correct?

I'm working on the decimal proposal for TC39 to try to work out what it might be like to add proper decimal numbers to JavaScript. There are a few very interesting degrees of freedom in the design space (such as the precise datatype to be used to represent these kinds of numbers), but I'm optimistic that a reasonable path forward exists, and that consensus between JS programmers and JS engine implementors can be found and eventually implemented. If you're interested in these issues, check out the README in the proposal and get in touch!

March 10, 2023 01:44 AM

March 09, 2023

Ziran Sun

Igalia on Reforestation

Igalia started the Reforestation project in 2019. As one of the company’s Corporate Social Responsibility (CSR) efforts, the Reforestation project focuses on conserving and expanding native, old growth forests to capture carbon emissions and store them over the long term. Partnering with Galnus, we have been working on the reforestation of a 10.993 hectare area with 11,000 trees, which absorbs 66 tons of carbon dioxide each year.

Phase I – Rois

The first land where the project started lies in the woods of the communal land of “San Miguel de Costa” in Rois. Rois is a municipality of northwestern Spain in the province of A Coruña, in the autonomous community of Galicia. The environment in this land was highly altered by human action, and the predominance of eucalyptus plantations left few examples of native forest.

After the agreement for land transfer was signed on the 29th June 2019, the project started with creating maps and technical plans for the areas affected. The purpose of these plans is not only to build a cartographic base for managing and carrying out the works, but also to designate these zones accurately as “Reserved Areas” in future management plans.

Work carried out

  • Clear and crush remaining eucalyptus and other exotic plants

Eucalyptus is an exotic and invasive species. It spread widely and had a negative impact on the environmental situation of this habitat. This work is to clear the existing growth and eliminate the regrowths of eucalyptus and other exotic plants before the start of planting work. In Q1 2020, 100% of exotic trees were cut as part of the Atlantic forest sponsorship. After that the focus was moved to eliminating the sprouts of eucalyptus grown from the seeds present in the soil in early years. Sprout and seeds elimination continues throughout the duration of the project.

  • Acquire trees, shrubs and all other materials needed.

To improve forest structure and increase biodiversity for the affected area, the following native trees and shrubs were chosen and purchased in Q3 2019 –
– 1000 downy birch (Betula pubescens)
– 650 common oak (Quercus robur)
– 350 wild cherry (Prunus avium)
– 150 common holly (Ilex aquifolium)

A lot of other materials such as tree protectors, stakes etc. were also acquired at the same time.

  • Planting

Planting started in Q4 2019. Downy birches (Betula pubescens) and some common oaks (Quercus robur) were the first batch planted. This planting campaign continued in 2020 on the arrival of the rest of the trees.

  • Care during the first year

The first year after planting is key to the future development and success of the project.

In Q1 2020, 100% of planted trees were already sprouting and alive. At this stage 100% of the restoration area has already been planted with at least 65% of the trees.

The first summer after planting is vital for the trees and shrubs to settle. During the first summer, trees adapt and take root in their new location. If they survive, their chances of developing correctly will increase exponentially and the following years they will be able to focus all their energy on their growth. The beginning of the summer 2020 was difficult due to hot and dry weather. However, the tree mortality rate remains within the expected range.

In Q4 2020, most of the planted trees are well-established and ready for spring sprouting. By Q1 2021 many of them have reached over 2 meters high.

The trees and shrubs settled happily after the first year of growth. This is the photo taken in Spring 2020

And see how they had grown in a year’s time –

  • Wildlife studies

With plants sprouting all around the forest, insects, reptiles and birds start thriving too. The development of the ecosystem will provide for better results in future inventories and wildlife studies. In summer 2020 the wildlife studies started and the first inventory was completed. In winter 2020 wildlife nest boxes were being manufactured, and some bat boxes were installed in Spring 2021.

  • Improving biodiversity

In addition to ensuring the introduction of a wide variety of tree species to reforestation areas, the project has also put effort into enhancing biodiversity in existing forest areas. For example, one of the areas targeted by the project is one hectare of young secondary forest made up mainly of oaks (Quercus robur). In this forest we have been planting understory species such as holly (Ilex aquifolium), laurel (Laurus nobilis) or Atlantic pear (Pyrus cordata) to increase biodiversity and improve the structure and complexity of the forest.

Phase II – Eume and Rois

Phase II of the reforestation project set sail in winter 2020. While the Rois project had moved to a steady stage with most trees and shrubs planted and settled through the first year’s growth, this phase is to expand restoration of new forest area in Rois and to start a new project in “Fragas do Eume” natural park.

  • Rois

In order to explain the progress in the Rois expansion project, the maps below distinguish three areas in different stages of development.
– Phase 1 (Green) – The green area is completely planted with native trees and free of eucalyptus sprouts.
– Phase 2 (Yellow) – Acacias have been eliminated in this area. The work to control this species and also the eucalyptus trees will continue.
– Phase 3 (Red) – The entire area is covered with dense plantations of eucalyptus and acacia. Work is in the preparation stage.

The following maps show the progress made between spring and fall 2021.

Spring 2021:

Fall 2021:

  • Eume

Unlike the Rois project, habitats in “Fragas do Eume” natural park are some of the largest and best preserved examples in the world of coastal Atlantic rainforest, where much of its biodiversity and original structure is still preserved. Unfortunately, most of these forests are young secondary forests under numerous threats such as the presence of eucalyptus and other invasive species.

Our project represents one of the largest actions in recent years for the elimination and control of exotic species and the restoration of native forest in the area, increasing native forest surface area, reconnecting fragmented forest patches and improving the landscape in the “Fragas do Eume Natural Park”.

In order to explain the progress in the Eume project, the maps below also distinguish three areas in different stages of development.
– Phase 1 (Green) – All the environmental restoration works have been completed. Only maintenance tasks remain for the next few years.
– Phase 2 (Yellow) – This area is in the process of controlling and eliminating eucalyptus so that the restoration of native vegetation can start with the planting of native trees.
– Phase 3 (Red) – Waiting for the loggers to cut the big eucalyptus trees in this area, in order to start the work associated with the project.

The following maps show the progress made between spring and fall 2021.

Spring 2021:

Fall 2021:


In 2021 Igalia started working on a 0.538-hectare plot of land owned by the Esmelle Valley Association. This land is bordered by a road and a forest track on each side, and the Esmelle stream flows across it. This once mighty stream and its tributary springs are currently used to provide drinking water to the houses in the surrounding area. Unfortunately, in summer the flow of the stream can drop drastically due to dry weather and heavy use. In addition, this area has also been invaded by numerous eucalyptus trees and other exotic species. Recovering and restoring a good example of the Atlantic Forest will bring great benefit to this enormously altered and humanized area.

After preparation and planning, major work was carried out in 2022, including –
– Elimination and control of eucalyptus and its sprouts.
– Clearing and ground preparation
– Tree seedling, plantation and protections (placement of tree stakes and protectors etc.)
– Maintenance and replacement of dead trees

This work will continue in the next two years. Main focus will be on reinforcement and enrichment of the plantation based on the availability of tree seedlings of certain species, and finishing all the remaining maintenance work.

What’s next?

Igalia sees Reforestation as a long term effort. While maintaining and developing the current projects, Igalia doesn’t stop looking for new candidates. In Q4 2022, Igalia started preparing a new Reforestation project – Galnus: O Courel.

This is the first time that Igalia has considered developing an environmental compensation project in one of the most rural and least densely populated areas of Galicia. Igalia believes that the environmental, social and economic characteristics associated with this area offer an opportunity to carry out environmental restoration projects. If it goes as planned, major work will happen in the next two years. Something to look forward to!


The Reforestation project is making contributions to our environment, and it has gone further. Here we’d like to share a picture of a 7-year-old boy’s work. This young student is from the community owning the Rois forest. He took advantage of a school newspaper project to communicate about our work in Rois forest.

Isn’t it a joy to see that Igalia’s Reforestation project is making an impact on our children, and our future? 🙂

by zsun at March 09, 2023 07:08 PM

March 03, 2023

Alex Bradbury

Commercially available RISC-V silicon

The RISC-V instruction set architecture has seen huge excitement and growth in recent years (10B cores estimated to have shipped as of Dec 2022) and I've been keeping very busy with RISC-V related work at Igalia. I thought it would be fun to look beyond the cores I've been working with and to enumerate the SoCs that are available for direct purchase or in development boards that feature RISC-V cores programmable by the end user. I'm certain to be missing some SoCs or have some mistakes or missing information - any corrections very gratefully received at or @asbradbury. I'm focusing almost exclusively on the RISC-V aspects of each SoC - i.e. don't expect a detailed listing of other on-chip peripherals or accelerators.

A few thoughts

  • It was absolutely astonishing how difficult it was to get basic information about the RISC-V specification implemented by many of these SoCs. In a number of cases there is just a description of an "RV32 core at xxMHz", with further detective work needed to find any information at all about even the standard instruction set extensions supported.
  • The way I've focused on the details of individual cores does a bit of a disservice to those SoCs with compute clusters or interesting cache hierarchies. If I were to do this again (or if I revisit this in the future), I'd look to rectify that. There are a whole bunch of other micro-architectural details it would be interesting to cover too.
    • I've picked up CoreMark numbers where available, but of course that's a very limited metric to compare cores. It's also not always clear which compiler was used, which extensions were targeted and so on. Plus when the figure is taken from an IP core product page, the numbers may refer to a newer version of the IP. Where there are multiple numbers floating about I've listed them all.
  • There's a lot of chips here - but although I've likely missed some, not so many that it's impossible to enumerate. RISC-V is growing rapidly, so perhaps this will change in the next year or two.
  • A large proportion of the SoC designs listed are based on proprietary core designs. The exceptions are the collection of SoCs based on the T-Head cores, the SiFive E31 and Kendryte K210 (Rocket-derived) and the GreenWaves GAP8/GAP9 (PULP-derived). As a long-term proponent of open source silicon I'd hope to see this change over time. Once a company has moved from a proprietary ISA to an open standard (RISC-V), there's a much easier transition path to switch from a proprietary IP core to one that's open source.

64-bit Linux-capable application processors

  • StarFive JH7110
    • Core design:
      • 4 x RV64GC_Zba_Zbb SiFive U74 application cores, 1 x RV64IMAC SiFive S7 (this?) monitor core, and 1 x RV32IMFC SiFive E24 (ref, ref).
      • The U74 is a dual-issue in-order pipeline with 8 stages.
    • Key stats:
      • 1.5 GHz, fabbed on TSMC 28nm (ref).
      • StarFive report a CoreMark/MHz of 5.09.
    • Development board:
  • T-Head C910 ICE
    • Core design:
      • 2 x RV64GC T-Head C910 application cores and an additional T-Head C910 RV64GCV core (i.e., with the vector extension).
      • The C910 is a 3-issue out-of-order pipeline with 12 stages.
    • Key stats:
      • 1.2 GHz, fabbed on a 28nm process (ref).
    • Development board:
  • Allwinner D1-H (datasheet, user manual)
    • Core design:
      • 1 x RV64GC T-Head C906 application core. Additionally supports the unratified, v0.7.1 RISC-V vector specification (ref).
      • Single-issue in-order pipeline with 5 stages.
      • Verilog for the core is on GitHub under the Apache License (see discussion on what is included).
      • At least early versions of the chip incorrectly trapped on fence.tso. It's unclear if this has been fixed in later revisions.
    • Key stats:
      • 1GHz, taped out on a 22nm process node.
      • Reportedly 3.8 CoreMark/MHz.
    • Development board:
  • StarFive JH7100
    • Core design:
      • 2 x RV64GC SiFive U74 application cores and 1 x RV32IMAFC SiFive E24.
      • The U74 is a dual-issue in-order pipeline with 8 stages.
    • Key stats:
      • 1.2GHz (as listed on StarFive's page but articles about the V1 board claimed 1.5GHz), presumably fabbed on TSMC 28nm (the JH7110 is).
      • The current U74 product page claims 5.75 CoreMark/MHz (previous reports suggested 4.9 CoreMark/MHz).
    • Development board:
  • Kendryte K210 (datasheet)
    • Core design:
      • 2 x RV64GC application cores (reportedly implementations of the open-source Rocket core design).
      • If it's correct the K210 uses Rocket, it's a single-issue in-order pipeline with 5 stages.
      • Has a non-standard (older version of the privileged spec?) MMU (ref), so the nommu Linux port is typically used.
    • Key stats:
      • 400MHz, fabbed on TSMC 28nm.
    • Development board:
  • MicroChip PolarFire SoC MPFSxxxT
    • Core design:
      • 4 x RV64GC SiFive U54 application cores, 1 x RV64IMAC SiFive E51 (now renamed to S51) monitor core.
      • The U54 is a single-issue, in-order pipeline with 5 stages.
    • Key stats:
      • 667 MHz (ref), fabbed on a 28nm node (ref).
      • Microchip report 3.125 CoreMark/MHz.
    • Development board:
      • Available in the 'Icicle' development board.
  • SiFive FU740
    • Core design:
      • 4 x RV64GC SiFive U74 application cores, 1 x RV64IMAC SiFive S71 monitor core.
      • It's hard to find details for the S71 core, the FU740 manual refers to it as an S7 while the HiFive Unmatched refers to it as the S71 - but neither have a page on SiFive's site. I'm told that the S71 has the same pipeline as the S76, just no support for the F and D extensions.
      • The U74 is a dual-issue in-order pipeline with 8 stages.
    • Key stats:
      • 1.2 GHz, fabbed on TSMC 28nm (ref).
      • The current U74 product page claims 5.75 CoreMark/MHz (previous reports suggested 4.9 CoreMark/MHz).
    • Development board:
  • SiFive FU540
    • Core design:
      • 4 x RV64GC SiFive U54 application cores, 1 x RV64IMAC SiFive E51 (now renamed to S51) monitor core.
      • The U54 is a single-issue, in-order pipeline with 5 stages.
    • Key stats:
      • 1.5GHz, fabbed on TSMC 28nm (ref).
      • The current U54 product page claims 3.16 CoreMark/MHz but it was 2.75 in 2017.
    • Development board:
  • Bouffalo Lab BL808
    • Core design:
    • Key stats:
      • The C906 runs at 480 MHz, the E907 at 320 MHz and the E902 at 150 MHz.
    • Development board:
      • Available in the Ox64 from Pine64.
  • Renesas RZ/Five
  • Kendryte K510
    • Core design:
      • 2 x RV64GC application cores and 1 x RV64GC core with DSP extensions. The Andes AX25MP appears to be used (ref).
    • Key stats:
      • 800 MHz.
    • Development board:
  • (Upcoming) Intel-SiFive Horse Creek SoC
    • Core design:
      • 4 x "RV64GBC" SiFive P550 application cores (docs not yet available, but an overview is here). As the 'B' extension was broken up into smaller sub-extensions, this is perhaps RV64GC_Zba_Zbb like the SiFive U74.
      • 13 stage, 3 issue, out-of-order pipeline.
      • As the bit manipulation extension was split into a range of sub-extensions it's unclear exactly which of the 'B' family extensions will be supported.
    • Key stats:
      • 2.2 GHz, fabbed on Intel's '4' process node (ref).
    • Development board:
  • (Upcoming) T-Head TH1520 (announcement)
    • Core design:
      • 4 x RV64GC T-Head C910 application cores, 1 x RV64GC T-Head C906, 1 x RV32IMC T-Head E902.
      • The C910 is a 3-issue out-of-order pipeline with 12 stages, the C906 is single-issue in-order with 5 stages, and the E902 is single-issue in-order with 2 stages.
      • Verilog for the cores is up on GitHub under the Apache license: C910, C906, E902 (see discussion on what is included).
    • Key stats:
      • 2.4 GHz, fabbed on a 12nm process (ref).
    • Development board:
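
The ISA naming strings used in the entries above (RV64GC_Zba_Zbb, RV64IMAC and so on) can be decoded mechanically. The following is a minimal sketch rather than a full implementation of the RISC-V naming rules: it assumes every multi-letter extension is preceded by an underscore, as in the strings used in this article, and it ignores version numbers. The function name parse_isa_string is made up for illustration. The expansion of 'G' to IMAFD plus Zicsr and Zifencei follows the ratified unprivileged specification.

```python
import re

# 'G' is shorthand for the general-purpose base plus standard extensions.
G_EXPANSION = ["I", "M", "A", "F", "D", "Zicsr", "Zifencei"]

# RV32/RV64, then single-letter extensions, then "_Zxxx" style extensions.
ISA_PATTERN = re.compile(r"RV(32|64)([A-Za-z]+)((?:_[A-Za-z][A-Za-z0-9]*)*)")


def parse_isa_string(isa):
    """Split a simple ISA string into (xlen, list of extension names)."""
    match = ISA_PATTERN.fullmatch(isa)
    if match is None:
        raise ValueError(f"unrecognised ISA string: {isa!r}")

    xlen = int(match.group(1))
    extensions = []

    # Single-letter part, expanding 'G' and dropping duplicates.
    for letter in match.group(2).upper():
        for ext in (G_EXPANSION if letter == "G" else [letter]):
            if ext not in extensions:
                extensions.append(ext)

    # Multi-letter part, e.g. "_Zba_Zbb".
    for part in match.group(3).split("_"):
        if part:
            name = part[0].upper() + part[1:].lower()
            if name not in extensions:
                extensions.append(name)

    return xlen, extensions


if __name__ == "__main__":
    for isa in ("RV64GC_Zba_Zbb", "RV64IMAC", "RV32IMFC", "RV32EC"):
        print(isa, "->", parse_isa_string(isa))
```

A string like "RV64GBC" parses too, but as noted for the P550, a bare 'B' does not say which of the bit manipulation sub-extensions are actually implemented.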

Embedded / specialised SoCs (mostly 32-bit)

  • SiFive FE310
  • GigaDevice GD32VF103 series
    • Core design:
      • 1 x RV32IMAC Nuclei Bumblebee N200 ("jointly developed by Nuclei System Technology and Andes Technology.")
      • No support for PMP (Physical Memory Protection), includes the 'ECLIC' interrupt controller derived from the CLIC design.
      • Single-issue, in-order pipeline with 2 stages.
    • Key stats:
      • 108 MHz. 360 CoreMark (ref), implying 3.33 CoreMark/MHz.
    • Development board:
  • GreenWaves GAP8
    • Core design:
    • Key stats:
      • 175 MHz Fabric Controller, 250 MHz cluster. Fabbed on TSMC's 55nm process (ref).
      • 22.65 GOPS at 4.24mW/GOP (ref).
      • Shipped 150,000 units, composed of roughly 80% open source and 20% proprietary IP (ref).
    • Development board:
  • GreenWaves GAP9
    • Core design:
      • Fabric Controller (FC) and compute cluster of 9 cores. Extends the RV32IMC (plus extensions) GAP8 core design with additional custom extensions (ref).
    • Key stats:
      • 400 MHz Fabric Controller and compute cluster. Fabbed on Global Foundries 22nm FDX process (ref).
      • 150.8 GOPS at 0.33mW/GOP (ref).
    • Development board:
      • GAP9 evaluation kit listed on the GreenWaves store, but you must email to receive access to order it.
  • Renesas RH850/U2B
    • Core design:
      • Features an NSITEXE DR1000C RISC-V parallel coprocessor, comprised of RV32I scalar processor units, a control core unit, and a vector processing unit based on the RISC-V vector extension.
    • Key stats:
      • 400 MHz. Fabbed on a 28nm process.
    • Development board:
      • None I can find.
  • Renesas R9A02G020
    • Core design:
      • 1 x RV32IMC AndesCore N22 (additionally supporting the Andes 'Performance' and 'CoDense' instruction set extensions).
      • Single-issue in-order with a 2-stage pipeline.
    • Key stats:
      • 32 MHz.
    • Development board:
  • Analog Devices MAX78000
    • Core design:
      • Features an RV32 RISC-V coprocessor of unknown (to me!) design and unknown ISA naming string.
    • Key stats:
      • 60 MHz (for the RISC-V co-processor), fabbed on a TSMC 40nm process (ref).
    • Development board:
  • Espressif ESP32-C3 / ESP8685
    • Core design:
      • 1 x RV32IMC core of unknown design, single issue in-order 4-stage pipeline (ref).
    • Key stats:
      • 160 MHz, fabbed on a TSMC 40nm process (ref).
      • 2.55 CoreMark/MHz.
    • Development board:
  • Espressif ESP32-C2 / ESP8684
    • Core design:
      • 1 x RV32IMC core of unknown design, single issue in-order 4-stage pipeline (ref).
    • Key stats:
      • 160 MHz, fabbed on a TSMC 40nm process.
      • 2.55 CoreMark/MHz.
      • Die photo available in this article.
    • Development board:
  • Espressif ESP32-C6
    • Core design:
      • 1 x RV32IMAC core of unknown design (four stage pipeline) and 1 x RV32IMAC core of unknown design (two stage pipeline) for low power operation (ref).
    • Key stats:
      • 160 MHz high performance (HP) core with 2.76 CoreMark/MHz, 20 MHz low power (LP) core.
    • Development board:
  • HiSilicon Hi3861
    • Core design:
      • 1 x RV32IM core of unknown design, supporting additional non-standard compressed instruction set extensions (ref).
    • Key stats:
      • 160 MHz.
    • Development board:
      • A low-cost board is available advertising support for Harmony OS.
  • Bouffalo Lab BL616/BL618
    • Core design:
      • 1 x RV32GC core of unknown design, also with support for the unratified 'P' packed SIMD extension (ref).
    • Key stats:
      • 320 MHz.
    • Development board:
  • Bouffalo Lab BL602/BL604
    • Core design:
    • Key stats:
      • 192 MHz, 3.1 CoreMark/MHz (ref).
    • Development board:
      • Available in the very low cost Pinecone evaluation board.
    • Other:
      • The BL702 appears to have the same core, so I haven't listed it separately.
  • Bluetrum AB5301A
  • WCH CH583/CH582/CH581
    • Core design:
      • 1 x RV32IMAC QingKe V4a core, which also supports a "hardware prologue/epilogue" extension.
      • Single issue in-order pipeline with 2 stages.
      • Unlike the other QingKe V4 series core designs, the V4a doesn't support the custom 'extended instruction' (XW) instruction set extension.
    • Key stats:
      • 20MHz.
    • Development board:
  • WCH CH32V307
    • Core design:
    • Key stats:
      • 144 MHz.
    • Development board:
  • WCH CH32V208
    • Core design:
      • 1 x RV32IMAC QingKe V4c core with custom instruction set extensions ('XW' for sign-extended byte and half word operations).
    • Key stats:
      • 144 MHz.
    • Development board:
    • Other:
      • The CH32V203 is also available but I haven't listed it separately as it's not clear how the QingKe V4b core in that chip differs to the V4c in this one.
  • WCH CH569 / WCH CH573 / WCH CH32V103
    • Core design:
    • Key stats:
      • 120 MHz (CH569), 20 MHz (CH573), 80 MHz (CH32V103).
    • Development board:
  • WCH CH32V003
    • Core design:
      • 1 x RV32EC QingKe V2A with custom instruction set extensions ('XW' for sign-extended byte and half word operations).
    • Key stats:
    • Development board:
  • PicoCom PC802
    • Core design:
    • Key stats:
      • Fabbed on TSMC 12nm process (ref).
    • Development board:
  • CSM32RV20
    • Core design:
      • 1 x RV32IMAC core of unknown design.
    • Key stats:
      • 2.6 CoreMark/MHz.
    • Development board:
  • HPMicro HPM6750
    • Core design:
      • 2 x RV32IMAFDC cores (AndesCore D45) with an implementation of the draft 'P' packed SIMD spec (ref).
      • In-order dual-issue 8-stage pipeline.
    • Key stats:
      • 800 MHz
    • Development board:
      • Available in the HPM6750EVK as well as other variants.
  • Renesas R9A06G150
    • Core design:
      • 1 x RV32IMAFC AndesCore D25F with additional vendor-specific instruction set extensions.
      • 5 stage pipeline, single issue in-order.
    • Key stats:
      • 100 MHz
    • Development board:
      • None available currently.

Bonus: Other SoCs that don't match the above criteria or where there's insufficient info

  • The Espressif ESP32-P4 was announced, featuring a dual-core 400MHz RISC-V CPU with "an AI instructions extension". I look forward to incorporating it into the list above when more information is available.
  • I won't try to enumerate every use of RISC-V in chips that aren't programmable by end users or where development boards aren't available, but it's worth noting the use of RISC-V in Google's Titan M2.
  • In January 2022 Intel Mobileye announced the EyeQ Ultra featuring 12 RISC-V cores (of unknown design), but there hasn't been any news since.

Article changelog
  • 2023-03-31: Add Renesas R9A06G150 to the list.
  • 2023-03-26: Add HPM6750 to the list.
  • 2023-03-19: Add note on silicon bug in the Renesas RZ/Five.
  • 2023-03-12: Clarified details of several SiFive cores and made the listing of SoCs using cores derived from open source RTL exhaustive.
  • 2023-03-04:
    • Added in the CSM32RV20 and some extra Bouffalo Lab and WCH chips (thanks to Reddit user 1r0n_m6n).
    • Confirmed likely core design in the K510 (thanks to Reddit user zephray_wenting).
    • Further Espressif information and new ESP32-C2 entry contributed by Ivan Grokhotkov.
    • Clarified cores in the JH7110 and JH7100 (thanks to Conor Dooley for the tip).
    • Added T-Head C910-ICE (thanks to a tip via email).
  • 2023-03-03:
    • Added note about CoreMark scores.
    • Added the Renesas R9A02G020 (thanks to Giancarlo Parodi for the tip!).
    • Various typo fixes.
    • Add link to PicoCom RISC-V Summit talk and clarify the ISA extensions supported by the PC802 Andes N25F clusters.
  • 2023-03-03: Initial publication date.

March 03, 2023 12:00 PM

André Almeida

Installing kernel modules faster with multithread XZ

As a kernel developer, every day I need to compile and install custom kernels, and any improvement in this workflow makes me more productive. While installing my freshly compiled modules, I noticed that it would get stuck compressing amdgpu for some time:

XZ      /usr/lib/modules/6.2.0-tonyk/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.xz

XZ format

My target machine is the Steam Deck, which uses .xz for compressing the modules. Given that we want gamers to be able to install as many games as possible, the OS shouldn’t waste much disk space. amdgpu, when compiled with debug symbols, can use a good chunk of space. Here’s a comparison of the on-disk size of the module uncompressed, and then with .zst and .xz compression:

360M amdgpu.ko
61M  amdgpu.ko.zst
38M  amdgpu.ko.xz
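
To put those sizes in perspective, the compression ratios can be worked out in a few lines of Python. This is only a sketch over the three figures listed above, taking the M suffix at face value; the names sizes and compression_ratios are made up for illustration.

```python
# Module sizes as listed above (units taken at face value from the listing).
sizes = {
    "amdgpu.ko": 360,
    "amdgpu.ko.zst": 61,
    "amdgpu.ko.xz": 38,
}


def compression_ratios(table, baseline="amdgpu.ko"):
    """Return how many times smaller each file is than the baseline."""
    base = table[baseline]
    return {name: round(base / size, 2) for name, size in table.items()}


ratios = compression_ratios(sizes)
for name, ratio in ratios.items():
    print(f"{name:<14} {sizes[name]:>4}M  {ratio:>5.2f}x smaller")
```

By these numbers the .xz module is roughly 9.5 times smaller than the uncompressed one, against roughly 5.9 times for .zst.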

This more compact module comes with a cost: more CPU time for compression.

Multithread compression

When I opened htop, I saw that only a lonely thread was doing the hard work of compressing amdgpu, even though compression is an easily parallelizable task. I then hacked scripts/Makefile.modinst so XZ would use as many threads as possible, with the option -T0. On my main build machine, modules_install ran about 3.5 times faster!

# before the patch
$ time make modules_install -j16
Executed in  100.08 secs

# after the patch
$ time make  modules_install -j16
Executed in   28.60 secs

Then, I submitted a patch to make this default for everyone: [PATCH] kbuild: modinst: Enable multithread xz compression

However, as Masahiro Yamada pointed out, we shouldn’t be spawning numerous threads in the build system without the user requesting it. Until now, the number of threads to run has been specified manually with make -jX.

Fortunately, Nathan Chancellor suggested that the same results can be achieved using XZ_OPT=-T0, so we can still benefit from this without the patch. I experimented with different -TX and -jY values, but on my notebook the most efficient values were X = Y = nproc. You can check some results below:

$ make modules_install
174.83 secs

$ make modules_install -j8
100.55 secs

$ make modules_install XZ_OPT=-T0
81.51 secs

$ make modules_install -j8 XZ_OPT=-T0
53.22 secs
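
The relative gains are easier to compare as speedup factors over the plain make modules_install run. The snippet below is a small sketch that only reuses the four timings listed above; the names timings and speedups are made up for illustration.

```python
# Wall-clock times in seconds, copied from the measurements above.
timings = {
    "make modules_install": 174.83,
    "make modules_install -j8": 100.55,
    "make modules_install XZ_OPT=-T0": 81.51,
    "make modules_install -j8 XZ_OPT=-T0": 53.22,
}


def speedups(results, baseline):
    """Return each run's speedup factor relative to the baseline command."""
    base = results[baseline]
    return {command: base / seconds for command, seconds in results.items()}


relative = speedups(timings, "make modules_install")
for command, factor in relative.items():
    print(f"{factor:4.2f}x  {command}")
```

On these numbers, multithreaded XZ alone gives a bigger gain than -j8 alone, and combining the two is a bit over three times faster than the single-threaded run.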

March 03, 2023 12:00 AM

February 28, 2023

Lucas Fryzek

Journey Through Freedreno

Android running Freedreno

As part of my training at Igalia I’ve been attempting to write a new backend for Freedreno that targets the proprietary “KGSL” kernel mode driver. For those unaware, there are two “main” kernel mode drivers for the GPU on Qualcomm SoCs: “MSM” and “KGSL”. “MSM” is DRM compliant, and Freedreno is already able to run on this driver. “KGSL” is the proprietary KMD that Qualcomm’s proprietary userspace driver targets. Now why would you want to run Freedreno against KGSL when MSM exists? Well, there are a few reasons. First, MSM only really works on an upstream kernel, so if you have to run a downstream kernel you can continue using the version of KGSL that the manufacturer shipped with your device. Second, this allows you to run both the proprietary Adreno driver and the open source Freedreno driver on the same device just by swapping libraries, which can be very nice for quickly testing something against both drivers.

When “DRM” isn’t just “DRM”

When working on a new backend, one of the critical things to do is to make use of as much “common code” as possible. This has a number of benefits, not least reducing the amount of code you have to write. It also reduces the number of bugs that are likely to exist, as you are relying on well-tested code, and it makes it more likely that the backend will continue to work with new driver updates.

When I started the work for a new backend I looked inside mesa’s src/freedreno/drm folder. This has the current backend code for Freedreno, and it’s already modularized to support multiple backends. It currently has support for the above mentioned MSM kernel mode driver as well as virtio (a backend that allows Freedreno to be used from within a virtualized environment). From the name of this path, you would think that the code in this module would only work with kernel mode drivers that implement DRM, but actually there are only a handful of places in this module where DRM support is assumed. This made it a good starting point to introduce the KGSL backend and piggyback off the common code.

For example, the drm module has a lot of code to deal with the management of synchronization primitives, buffer objects, and command submit lists. All of this is managed at an abstraction level above “DRM”, and re-implementing this code would be a bad idea.

How to get Android to behave

One of the big struggles with getting the KGSL backend working was figuring out how I could get Android to load mesa instead of the Qualcomm blob driver that is shipped with the device image. Thankfully a good chunk of this work had already been done when the Turnip developers (Turnip is the open source Vulkan implementation for Adreno GPUs) figured out how to get Turnip running on Android with KGSL. One of my coworkers, Danylo, is one of those Turnip developers, and he gave me a lot of guidance on getting Android set up. One thing to watch out for is the outdated instructions here. These instructions almost work, but require some modifications. First, if you’re using a more modern version of the Android NDK, the compiler has been replaced with LLVM/Clang, so you need to change which compiler is being used. Second, flags like system in the cross compiler script incorrectly set the system as linux instead of android. I had success using the cross compiler script below. Take note that the compiler paths need to be updated to match where you extracted the Android NDK on your system.

[binaries]
ar = '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar'
c = ['ccache', '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang']
cpp = ['ccache', '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++', '-fno-exceptions', '-fno-unwind-tables', '-fno-asynchronous-unwind-tables', '-static-libstdc++']
c_ld = 'lld'
cpp_ld = 'lld'
strip = '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-strip'
# Android doesn't come with a pkg-config, but we need one for Meson to be happy not
# finding all the optional deps it looks for.  Use system pkg-config pointing at a
# directory we get to populate with any .pc files we want to add for Android
pkgconfig = ['env', 'PKG_CONFIG_LIBDIR=/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/pkgconfig:/home/lfryzek/Documents/projects/igalia/freedreno/install-android/lib/pkgconfig', '/usr/bin/pkg-config']

[host_machine]
system = 'android'
cpu_family = 'arm'
cpu = 'armv8'
endian = 'little'
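
Since the same NDK path is repeated in most of these settings, one way to avoid editing each line by hand is to generate the file from a single NDK root. The following is a rough sketch and not part of mesa or the Android NDK: the function name render_cross_file and its parameters are made up for illustration, and the example paths are placeholders. Meson cross files group these keys under [binaries] and [host_machine] sections, so the sketch emits both; the values themselves mirror the settings above.

```python
from pathlib import PurePosixPath


def render_cross_file(ndk_root, pkgconfig_dirs, api_level=29):
    """Return Meson cross-file text for an aarch64 Android build."""
    bin_dir = PurePosixPath(ndk_root) / "toolchains/llvm/prebuilt/linux-x86_64/bin"
    clang = str(bin_dir / f"aarch64-linux-android{api_level}-clang")

    ar = str(bin_dir / "llvm-ar")
    strip = str(bin_dir / "llvm-strip")
    c = ["ccache", clang]
    cpp = [
        "ccache", clang + "++",
        "-fno-exceptions", "-fno-unwind-tables",
        "-fno-asynchronous-unwind-tables", "-static-libstdc++",
    ]
    # Android has no pkg-config of its own, so point the host pkg-config
    # at directories we control.
    libdir = ":".join(str(d) for d in pkgconfig_dirs)
    pkgconfig = ["env", "PKG_CONFIG_LIBDIR=" + libdir, "/usr/bin/pkg-config"]

    lines = [
        "[binaries]",
        f"ar = {ar!r}",
        f"c = {c!r}",
        f"cpp = {cpp!r}",
        "c_ld = 'lld'",
        "cpp_ld = 'lld'",
        f"strip = {strip!r}",
        f"pkgconfig = {pkgconfig!r}",
        "",
        "[host_machine]",
        "system = 'android'",
        "cpu_family = 'arm'",
        "cpu = 'armv8'",
        "endian = 'little'",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder paths; substitute wherever the NDK was extracted.
    print(render_cross_file("/opt/android-ndk-r25b", ["/opt/android-pkgconfig"]))
```

The sketch relies on Python's repr of strings and lists of strings happening to match Meson's quoting for simple paths like these; paths containing quotes or backslashes would need proper escaping.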

Another thing I had to figure out with Android, which differed from these instructions, was how to get Android to load the mesa versions of the libraries. That’s when my colleague Mark pointed out to me that Android is open source and I could just check the source code myself. Sure enough, you can find the OpenGL driver loader in Android’s source code. From this code we can see that Android will try to load a few different files based on some settings, and in my case it would try to load 3 different shared libraries from the /vendor/lib64/egl folder. I could just replace these libraries with the versions built from mesa and voilà, you’re now loading a custom driver! This realization that I could just “read the code” was very powerful in debugging some more Android-specific issues I ran into, like dealing with gralloc.

Something cool that the open source Freedreno & Turnip driver developers figured out was getting Android to run test OpenGL applications from the adb shell without building Android APKs. If you check out the freedreno repo, they have a script that can build tests in the tests-* folder. The nice benefit of this is that it provides an easy way to run simple test cases without worrying about the Android window system integration. Another nifty feature of this repo is the libwrap tool that lets you trace the commands being submitted to the GPU.

What even is Gralloc?

Gralloc is the graphics memory allocator in Android, and the OS will use it to allocate the surface for “windows”. This means that the memory we want to render the display to is managed by gralloc and not our KGSL backend, so we have to get all the information about this surface from gralloc. If you look in src/egl/driver/dri2/platform_android.c you will see existing code for handling gralloc. You would think “Hey, there is no work for me here then”, but you would be wrong. The handle gralloc provides is hardware specific, and the code in platform_android.c assumes a DRM gralloc implementation. Thankfully the Turnip developers had already gone through this struggle, and if you look in src/freedreno/vulkan/tu_android.c you can see they have implemented a separate path for when a Qualcomm msm implementation of gralloc is detected. I could copy this detection logic and add a separate path to platform_android.c.

Working with the Freedreno community

When working on any project (open-source or otherwise), it’s nice to know that you aren’t working alone. Thankfully the #freedreno channel on IRC is very active and full of helpful people to answer any questions you may have. While working on the backend, one area I wasn’t really sure how to address was the synchronization code for buffer objects. The backend exposed a function called cpu_prep. This function was just there to call the DRM implementation of cpu_prep on the buffer object. I wasn’t exactly sure how to implement this functionality with KGSL since it doesn’t use DRM buffer objects.

I ended up reaching out to the IRC channel, and Rob Clark explained to me that he was actually working on moving a lot of the code for cpu_prep into common code, so that a non-DRM driver (like the KGSL backend I was working on) would just need to implement that operation as a NOP (no operation).

Dealing with bugs & reverse engineering the blob

I encountered a few different bugs when implementing the KGSL backend, but most of them consisted of me calling KGSL wrong, or handling synchronization incorrectly. Thankfully, since Turnip is already running on KGSL, I could just compare my code more carefully to what Turnip is doing and figure out my logical mistake.

Some of the bugs I encountered required the backend interface in Freedreno to be modified to expose a new per-driver implementation of a backend function, instead of just using a common implementation. For example, the existing function to map a buffer object into userspace assumed that the same fd for the device could be used for the buffer object in the mmap call. This worked fine for any buffer objects we created through KGSL but would not work for buffer objects created from gralloc (remember the above section on surface memory for windows coming from gralloc). To resolve this issue I exposed a new per-backend implementation of “map” where I could take a different path if the buffer object came from gralloc.
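To make the shape of that change concrete, here is a minimal sketch of a backend hook with a common fallback. All of the names (struct bo, backend_funcs, bo_map and so on) are hypothetical and simplified for illustration; they are not the real Freedreno structures, and the two pointer fields merely stand in for the results of the two different mmap calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types: NOT the real Freedreno structures. */
struct bo;

struct backend_funcs {
    /* Optional per-backend override; NULL means "use the common path". */
    void *(*map)(struct bo *bo);
};

struct bo {
    const struct backend_funcs *funcs;
    int from_gralloc;      /* non-zero if the buffer was imported from gralloc */
    void *device_mapping;  /* stand-in for mmap() on the device fd */
    void *import_mapping;  /* stand-in for mmap() on the buffer's own fd */
};

/* Common implementation: assumes the device fd can be used for the mapping. */
static void *common_bo_map(struct bo *bo)
{
    return bo->device_mapping;
}

/* Per-backend implementation: imported buffers take a different path. */
static void *kgsl_like_bo_map(struct bo *bo)
{
    if (bo->from_gralloc)
        return bo->import_mapping;
    return common_bo_map(bo);
}

/* Dispatcher: prefer the backend hook when one is provided. */
void *bo_map(struct bo *bo)
{
    assert(bo != NULL && bo->funcs != NULL);
    if (bo->funcs->map)
        return bo->funcs->map(bo);
    return common_bo_map(bo);
}

/* A backend that overrides map, and one that relies on the common code. */
const struct backend_funcs kgsl_like_funcs = { .map = kgsl_like_bo_map };
const struct backend_funcs default_funcs = { .map = NULL };
```

Backends that are happy with the common behaviour leave the hook unset, so only the backend that needs the special case pays for it.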

While testing the KGSL backend I did encounter a new bug that seems to affect both my new KGSL backend and the Turnip KGSL backend. The bug is an iommu fault that occurs when the surface allocated by gralloc does not have a height that is aligned to 4. The blitting engine on a6xx GPUs copies in 16x4 chunks, so if the height is not aligned to 4 the GPU will try to write to pixels that exist outside the allocated memory. This issue only happens with KGSL backends since we import memory from gralloc, and gralloc allocates exactly enough memory for the surface, with no alignment on the height. If running on any other platform, the fdl (Freedreno Layout) code would be called to compute the minimum required size for a surface, which would take into account the alignment requirement for the height. The Qualcomm blob driver didn’t seem to have this problem, even though it’s getting the exact same buffer from gralloc. So it must be doing something different to handle the non-aligned height.
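The arithmetic behind the fault is simple enough to sketch. The helpers below are hypothetical names written for this illustration, not fdl code; they only show how many rows a blit working in 4-row chunks would touch compared to what an exactly-sized allocation holds.

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to the next multiple of alignment (alignment > 0). */
uint32_t align_up(uint32_t value, uint32_t alignment)
{
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

/* Rows a blitter may touch if it works in chunks that are chunk_h rows tall. */
uint32_t rows_touched(uint32_t height, uint32_t chunk_h)
{
    return align_up(height, chunk_h);
}

/* Rows past the end of an allocation sized for exactly height rows. */
uint32_t rows_overrun(uint32_t height, uint32_t chunk_h)
{
    return rows_touched(height, chunk_h) - height;
}
```

For a surface that is 1081 rows tall, rows_touched(1081, 4) is 1084, so three rows land outside a buffer that was sized for exactly 1081 rows; a 1080-row surface has no overrun.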

Because this issue relied on gralloc, the application needed to be running as an Android APK to get a surface from gralloc. The best way to fix this issue would be to figure out what the blob driver is doing and try to replicate this behavior in Freedreno (assuming it isn’t doing something silly like switching to sysmem rendering). Unfortunately it didn’t look like the libwrap library worked to trace an APK.

The libwrap library relies on a Linux feature known as LD_PRELOAD to load when the application starts and replace system functions like open and ioctl with its own implementations that trace what is being submitted to the KGSL kernel mode driver. Thankfully Android exposes this LD_PRELOAD mechanism through its “wrap” interface, where you create a property called wrap.<app-name> with a value LD_PRELOAD=<path to>. Android will then load your library as would be done in a normal Linux shell. If you tried to do this with libwrap, though, you would find very quickly that you get corrupted traces. When Android launches your APK, it doesn’t only launch your application: there are different threads for different Android system related functions, and some of them can also use OpenGL. The libwrap library is not designed to handle multiple threads using KGSL at the same time. After discovering this issue I created an MR that stores the tracing file handles in TLS (thread local storage), preventing the clobbering of the trace file and also allowing you to view the traces generated by different threads separately from each other.
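As a rough sketch of the thread-local idea (this is not the code from the MR, and the function names are made up for the example), each thread can lazily create its own trace handle so that records from different threads never interleave in one file. Here tmpfile() stands in for opening a real per-thread trace path.

```c
#include <assert.h>
#include <stdio.h>

/* One trace handle per thread; _Thread_local is standard C11. */
static _Thread_local FILE *trace_file;

/* Return the calling thread's trace file, creating it on first use.
 * tmpfile() is only a stand-in for opening a per-thread trace path. */
FILE *trace_get_file(void)
{
    if (trace_file == NULL)
        trace_file = tmpfile();
    return trace_file;
}

/* Append one record to the calling thread's trace; returns 0 on success. */
int trace_write(const char *record)
{
    assert(record != NULL);
    FILE *f = trace_get_file();
    if (f == NULL)
        return -1;
    return fputs(record, f) == EOF ? -1 : 0;
}
```

Because the handle lives in thread local storage there is no shared state to lock, and each thread's trace can be inspected on its own afterwards.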

With this in hand one could begin investigating what the blob driver is doing to handle these unaligned surfaces.

What’s next?

Well, the next obvious thing to fix is the unaligned height issue, which is still open. I’ve also worked on upstreaming my changes with this WIP MR.

Freedreno running 3d-mark

February 28, 2023 05:00 AM

February 23, 2023

Eric Meyer

A Leap of Decades

I’ve heard it said there are two kinds of tech power users: the ones who constantly update to stay on the bleeding edge, and the ones who update only when absolutely forced to do so.  I’m in the latter camp.  If a program, setup, or piece of hardware works for me, I stick by it like it’s the last raft off a sinking island.

And so it has been for my early 2013 MacBook Pro, which has served me incredibly well across all those years and many continents, but was sliding into the software update chasm: some applications, and for that matter its operating system, could no longer be run on its hardware.  Oh and also, the top row of letter keys was becoming unresponsive, in particular the E-R-T sequence.  Which I kind of need if I’m going to be writing English text, never mind reloading pages and opening new browser tabs.

Stepping Up

An early 2013 MacBook Pro sitting on a desk next to the box of an early 2023 MacBook Pro, the latter illuminated by shafts of sunlight.
The grizzled old veteran on the verge of retirement and the fresh new recruit that just transferred in to replace them.

So on Monday, I dropped by the Apple Store and picked up a custom-built early 2023 MacBook Pro: M2 Max with 38 GPU cores, 64GB RAM, and 2TB SSD.  (Thus quadrupling the active memory and nearly trebling the storage capacity of its predecessor.)  I went with that balance, or perhaps imbalance, because I intend to have this machine last me another ten years, and in that time, RAM is more likely to be in demand than SSD.  If I’m wrong about that, I can always plug in an external SSD.  Many thanks to the many people in my Mastodon herd who nudged me in that direction.

I chose the 14” model over the 16”, so it is a wee bit smaller than my old 15” workhorse.  The thing that surprises me is the new machine looks boxier, somehow.  Probably it’s that the corners of the case are not nearly as rounded as the 2013 model, and I think the thickness ratio of display to body is closer to 1:1 than before.  It isn’t a problem or anything, it’s just a thing that I notice.  I’ll probably forget about it soon enough.

Some things I find mildly-to-moderately annoying:

  • DragThing doesn’t work any more.  It had stopped being updated before the 64-bit revolution, never mind the shift to Apple silicon, so this was expected, but wow do I miss it.  Like a stick-shift driver uselessly stomping the floorboards and blindly grasping air while driving an automatic car, I still flip the mouse pointer toward the right edge of the screen, where I kept my DragThing dock, before remembering it’s gone.  I’ve looked at alternatives, but none of them seem like they’re meant as straight up replacements, so I’ve yet to commit to one.  Maybe some day I’ll ask Daniel to teach me Swift so I can build my own. (Because I definitely need more demands on my time.)
  • The twisty arrows in the Finder to open and close folders don’t have enough visual weight.  Really, the overall UI feels like a movie’s toy representation of an operating system, not an actual operating system.  I mean, the visual presentation of the OS looks like something I would create, and brother, that is not a compliment.
  • The Finder’s menu bar has no visually distinct background.  What the hell.  No, seriously, what the hell?  The Notch I’m actually okay with, but removing the distinction between the active area of the menu bar and the inert rest of the desktop seems… ill-advised.  Do not like.  HARK, A FIX: Cory Birdsong pointed me to “System Settings… > Accessibility > Display > Reduce Transparency”, which fixes this, over on Mastodon.  Thanks, Cory!
  • I’m not used to the system default font(s) yet, which I imagine will come with time, but still catches me here and there.
  • The alert and other system sounds are different, and I don’t like them.  Sosumi.

Oh, and it’s weird to me that the Apple logo on the back of the display doesn’t glow.  Not annoying, just weird.

Otherwise, I’m happy with it so far.  Great display, great battery life, and the keyboard works!

Getting Migratory

The 2013 MBP was backed up nightly to a 1TB Samsung SSD, so that was how I managed the migration: plugged the SSD into the new MBP and let Migration Assistant do its thing.  This got me 90% of the way there, really.  The remaining 10% is what I’ll talk about in a bit, in case anyone else finds themselves in a similar situation.

The only major hardware hurdle I hit was that my Dell U2713HM monitor, also of mid-2010s vintage, seems to limit HDMI signals to 1920×1080 despite supposedly supporting HDMI 1.4, which caught me by surprise.  When connected to a machine via DisplayPort, even my 2013 MBP, the Dell will go up to 2560×1440.  The new MBP only has one HDMI port and three USB-C ports.  Fortunately, the USB-C ports can be used as DisplayPorts, so I acquired a DisplayPort–to–USB-C cable and that fixed the situation right up.

Yes, I could upgrade to a monitor that supports USB-C directly, but the Dell is a good size for my work environment, it still looks pretty good, and did I mention I’m the cling-tightly-to-what-works kind of user?

Otherwise, in the hardware space, I’ll have to figure out how I want to manage connecting all the USB-A devices I have (podcasting microphone, wireless headset, desktop speaker, secondary HD camera, etc., etc.) to the USB-C ports.  I expected that to be the case, just as I expected some applications would no longer work.  I expect an adapter cable or two will be necessary, at least for a while.

Trouble Brewing

I said earlier that Migration Assistant got me 90% of the way to being switched over.  Were I someone who doesn’t install stuff via the Terminal, I suspect it would have been 100% successful, but I’m not, so it wasn’t.  As with the cables, I anticipated this would happen.  What I didn’t expect was that covering that last 10% would take me only an hour or so of actual work, most of it spent waiting on downloads and installs.

First, the serious and quite unexpected problem: my version of Homebrew used an old installation prefix, one that could break newer packages.  So, I needed to migrate Homebrew itself from /usr/local to /opt/homebrew.  Some searching around indicated that the best way to do this was uninstall Homebrew entirely, then install it fresh.

Okay, except that would also remove everything I’d installed with Homebrew.  Which was maybe not as much as some of y’all, but it was still a fair number of fairly essential packages.  When I ran brew list, I got over a hundred packages, of which most were dependencies.  What I found through further searching was that brew leaves returns a list of the packages I’d installed, without their dependencies.  Here’s what I got:


That felt a lot more manageable.  After a bit more research, boiled down to its essentials, the New Brew Shuffle I came up with was:

$ brew leaves > brewlist.txt

$ /bin/bash -c "$(curl -fsSL"

$ xcode-select --install

$ /bin/bash -c "$(curl -fsSL"

$ xargs brew install < brewlist.txt

The above does elide a few things.  In step two, the Homebrew uninstall script identified a bunch of directories that it couldn’t remove, and would have to be deleted manually.  I saved all that to a text file (thanks to Warp’s “Copy output” feature) for later study, and pressed onward.  I probably also had to sudo some of those steps; I no longer remember.

In addition to all the above, I elected to delete a few of the packages in brewlist.txt before I fed it back to brew install in the last step (things like ckan, left over from my Kerbal Space Program days) and to remove the version dependencies for PHP and Python.  Overall, the process was pretty smooth.  I just had to sit there and watch Homebrew chew through all the installs, including all the dependencies.


Once all the reinstalls from the last step had finished, I was left with a few things to clean up.  For example, Python didn’t seem to have installed.  Eventually I realized it had actually installed as python3 instead of just plain python, so that was mostly fine and I’m sure there’s a way to alias one to the other that I might get around to looking up one day.

Ruby also didn’t seem to reinstall cleanly: there was a library it was looking for that complained about the chip architecture, and attempts to overcome that spawned even more errors, and none of them were decipherable to me or my searches.  Multiple attempts at uninstalling and then reinstalling Ruby through a variety of means, some with Homebrew, some other ways, either got me the same indecipherable errors or a whole new set of indecipherable errors.  In the end, I just uninstalled Ruby, as I don’t actually use it for anything I’m aware of, and the default Ruby that comes with macOS is still there.  If I run into some script I need for work that requires something more, I’ll revisit this, probably with many muttered imprecations.

Finally, httpd wasn’t working as intended.  I could launch it with brew services start httpd, but the resulting server was pointing to a page that just said “It works!”, and not bringing up any of my local hosts.  Eventually, I found where Homebrew had stuffed httpd and its various files, and then replaced its configuration files with my old configuration files.  Then I went through the cycle of typing sudo apachectl start, addressing the errors it threw over directories or PHP installs or whatever by editing httpd.conf, and then trying again.

After only three or four rounds of that, everything was up and running as intended  —  and as a bonus, I was able to mark httpd as a Login item in the Finder’s System Settings, so it will automatically come back up whenever I reboot!  Which my old machine wouldn’t do, for some reason I never got around to figuring out.

Now I just need to decide what to call this thing.  The old MBP was “CoCo”, as in the TRS-80 Color Computer, meant as a wry commentary on the feel of the keyboard and a callback to the first home computer I ever used.  That joke still works, but I’m thinking the new machine will be “C64” in honor of the first actually powerful home computer I ever used and its 64 kilobytes of RAM.  There’s a pleasing echo between that and the 64 gigabytes of RAM I now have at my literal fingertips, four decades later.

Now that I’m up to date on hardware and operating system, I’d be interested to hear what y’all recommend for good quality-of-life improvement applications or configuration changes.  Link me up!


by Eric Meyer at February 23, 2023 03:35 PM

February 22, 2023

Emmanuele Bassi

Writing Bindable API, 2023 Edition

First of all, you should go on the gobject-introspection website and read the page on how to write bindable API. What I’m going to write here is going to build upon what’s already documented, or will update the best practices, so if you maintain a GObject/C library, or you’re writing one, you must be familiar with the basics of gobject-introspection. It’s 2023: it’s already too bad we’re still writing C libraries, we should at the very least be responsible about it.

A specific note for people maintaining an existing GObject/C library with an API designed before the mainstream establishment of gobject-introspection (basically, anything written prior to 2011): you should really consider writing all new types and entry points with gobject-introspection in mind, and you should also consider phasing out older API and replacing it piecemeal with a bindable one. You should have done this 10 years ago, and I can already hear the objections, but: too bad. Just because you made an effort 10 years ago it doesn’t mean things are frozen in time, and you don’t get to fix things. Maintenance means constantly tending to your code, and that doubly applies if you’re exposing an API to other people.

Let’s take the “how to write bindable API” recommendations, and elaborate them a bit.

Structures with custom memory management

The recommendation is to use GBoxed as a way to specify a copy and a free function, in order to clearly define the memory management semantics of a type.

The important caveat is that boxed types are necessary for:

  • opaque types that can only be heap allocated
  • using a type as a GObject property
  • using a type as an argument or return value for a GObject signal

You don’t need a boxed type for the following cases:

  • your type is an argument or return value for a method, function, or virtual function
  • your type can be placed on the stack, or can be allocated with malloc()/free()

Additionally, starting with gobject-introspection 1.76, you can specify the copy and free function of a type without necessarily registering a boxed type, which leaves boxed types for the thing they were created for: signals and properties.
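For a plain old data type, a copy and free pair can be as small as the following sketch. SomeRecord and its functions are hypothetical names invented for this example, written in plain C without GLib so the block stands on its own; the introspection annotations that would point at these functions are deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical plain-old-data type used only for this example. */
typedef struct {
    char *name;
    int   age;
} SomeRecord;

/* Allocate a record that owns its own copy of name. */
SomeRecord *some_record_new(const char *name, int age)
{
    assert(name != NULL);
    SomeRecord *record = malloc(sizeof *record);
    if (record == NULL)
        return NULL;
    size_t len = strlen(name) + 1;
    record->name = malloc(len);
    if (record->name == NULL) {
        free(record);
        return NULL;
    }
    memcpy(record->name, name, len);
    record->age = age;
    return record;
}

/* Deep copy: the caller owns the result and releases it with
 * some_record_free(). */
SomeRecord *some_record_copy(const SomeRecord *src)
{
    assert(src != NULL);
    return some_record_new(src->name, src->age);
}

/* Release a record and everything it owns; NULL is accepted. */
void some_record_free(SomeRecord *record)
{
    if (record == NULL)
        return;
    free(record->name);
    free(record);
}
```

The point is that the pair fully defines the memory management semantics: a copy never shares storage with the original, and free releases everything the record owns.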

Addendum: object types

Boxed types should only ever be used for plain old data types; if you need inheritance, then the strong recommendation is to use GObject. You can use GTypeInstance, but only if you know what you’re doing; for more information on that, see my old blog post about typed instances.

Functionality only accessible through a C macro

This ought to be fairly uncontroversial. C pre-processor symbols don’t exist at the ABI level, and gobject-introspection is a mechanism to describe a C ABI. Never, ever expose API only through C macros; those are for C developers. C macros can be used to create convenience wrappers, but remember that anything they call must be public API, and that other people will need to re-implement the convenience wrappers themselves, so don’t overdo it. C developers deserve some convenience, but not at the expense of everyone else.

Addendum: inline functions

Static inline functions are also not part of the introspectable ABI of a library, because they cannot be used with dlsym(); you can provide inlined functions for performance reasons, but remember to always provide their non-inlined equivalent.

Direct C structure access for objects

Again, another fairly uncontroversial rule. You shouldn’t be putting anything into an instance structure, as it makes your API harder to future-proof, and direct access cannot do things like change notification, or memoization.

Always provide accessor functions.
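Here is a small sketch of what accessors buy you, in plain C with hypothetical names. In a real library only the typedef would live in the public header, with the structure definition kept private; a simple counter stands in for a real change notification mechanism.

```c
#include <assert.h>
#include <stdlib.h>

/* Public header would contain only this forward declaration. */
typedef struct SomeObject SomeObject;

/* Private definition: free to change without breaking callers. */
struct SomeObject {
    int width;
    unsigned int n_notifications; /* stand-in for change notification */
};

SomeObject *some_object_new(void)
{
    return calloc(1, sizeof(SomeObject));
}

void some_object_free(SomeObject *self)
{
    free(self);
}

int some_object_get_width(const SomeObject *self)
{
    assert(self != NULL);
    return self->width;
}

/* The setter can validate, skip no-op changes, and notify; direct
 * field access can do none of that. */
void some_object_set_width(SomeObject *self, int width)
{
    assert(self != NULL);
    if (self->width == width)
        return;
    self->width = width;
    self->n_notifications++;
}

unsigned int some_object_get_n_notifications(const SomeObject *self)
{
    assert(self != NULL);
    return self->n_notifications;
}
```

Setting the same width twice only counts as one change, which is exactly the kind of behaviour that cannot be added later if callers write to the field themselves.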


Variadic argument functions are mainly C convenience. Yes, some languages can support them, but it’s a bad idea to have this kind of API exposed as the only way to do things.

Any variadic argument function should have two additional variants:

  • a vector based version, using C arrays (zero terminated, or with an explicit length)
  • a va_list version, to be used when creating wrappers with variadic arguments themselves

The va_list variant is kind of optional, since not many people go around writing variadic argument C wrappers, these days, but at the end of the day you might be going to write an internal function that takes a va_list anyway, so it’s not particularly strange to expose it as part of your public API.

The vector-based variant, on the other hand, is fundamental.
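The three variants can be sketched side by side. The function names are hypothetical and the example is plain C rather than GObject code; it just adds up string lengths so that each variant has something to do.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Vector-based variant: the one language bindings can use.
 * strv must be NULL-terminated. */
size_t some_total_lengthv(const char *const strv[])
{
    size_t total = 0;
    assert(strv != NULL);
    for (size_t i = 0; strv[i] != NULL; i++)
        total += strlen(strv[i]);
    return total;
}

/* va_list variant: for C code that wraps the variadic function. */
size_t some_total_length_valist(const char *first, va_list args)
{
    size_t total = 0;
    for (const char *s = first; s != NULL; s = va_arg(args, const char *))
        total += strlen(s);
    return total;
}

/* Variadic convenience for C callers; the list is NULL-terminated. */
size_t some_total_length(const char *first, ...)
{
    va_list args;
    va_start(args, first);
    size_t total = some_total_length_valist(first, args);
    va_end(args);
    return total;
}
```

The variadic entry point is a thin wrapper around the va_list one, so exposing the latter costs nothing extra; the vector version is the only one a non-C caller can realistically reach.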

Incidentally, if you’re using variadic arguments as a way to collect similarly typed values, e.g.:

// void
// some_object_method (SomeObject *self,
//                     ...) G_GNUC_NULL_TERMINATED

some_object_method (obj, "foo", "bar", "baz", NULL);

there’s very little difference to using a vector and C99’s compound literals:

// void
// some_object_method (SomeObject *self,
//                     const char *args[])

some_object_method (obj, (const char *[]) {
  "foo",
  "bar",
  "baz",
  NULL,
});

Except that now the compiler will be able to do some basic type check and scream at you if you’re doing something egregiously bad.

Compound literals and designated initialisers also help when dealing with key/value pairs:

typedef struct {
  int column;
  union {
    const char *v_str;
    int v_int;
  } value;
} ColumnValue;

enum {
  COLUMN_NAME,
  COLUMN_AGE,
};

// void
// some_object_method (SomeObject *self,
//                     size_t n_columns,
//                     const ColumnValue values[])

some_object_method (obj, 2,
  (ColumnValue []) {
    { .column = COLUMN_NAME, .value = { .v_str = "Emmanuele" } },
    { .column = COLUMN_AGE, .value = { .v_int = 42 } },
  });

So you should seriously reconsider the number of variadic argument convenience functions you expose.

Multiple out parameters

Using a structured type with an out direction is a good recommendation as a way to both limit the number of out arguments and provide some future-proofing for your API. It’s easy to expand an opaque pointer type with accessors, whereas adding more out arguments requires an ABI break.

Addendum: inout arguments

Don’t use in-out arguments. Just don’t.

Pass an in argument to the callable for its input, and take an out argument or a return value for the output.

Memory management and ownership of inout arguments is incredibly hard to capture with static annotations; it mainly works for scalar values, so:

void
some_object_update_matrix (SomeObject *self,
                           double *xx,
                           double *yy,
                           double *xy,
                           double *yx)

can work with xx, yy, xy, yx as inout arguments, because there’s no ownership transfer; but as soon as you start throwing things in like pointers to structures, or vectors of string, you open yourself to questions like:

  • who allocates the argument when it goes in?
  • who is responsible for freeing the argument when it comes out?
  • what happens if the function frees the argument in the in direction and then re-allocates the out?
  • what happens if the function uses a different allocator than the one used by the caller?
  • what happens if the function has to allocate more memory?
  • what happens if the function modifies the argument and frees memory?

Even if gobject-introspection nailed down the rules, they could not be enforced, or validated, and could lead to leaks or, worse, crashes.

So, once again: don’t use inout arguments. If your API already exposes inout arguments, especially for non-scalar types, consider deprecations and adding new entry points.

Addendum: GValue

Sadly, GValue is one of the most notable cases of inout abuse. The oldest parts of the GNOME stack use GValue in a way that requires inout annotations because they expect the caller to:

  • initialise a GValue with the desired type
  • pass the address of the value
  • let the function fill the value

The caller is then left with calling g_value_unset() in order to free the resources associated with a GValue. This means that you’re passing an initialised value to a callable, the callable will do something to it (which may or may not even entail re-allocating the value) and then you’re going to get it back at the same address.

It would be a lot easier if the API left the job of initialising the GValue to the callee; then functions could annotate the GValue argument with out and caller-allocates=1. This would leave the ownership to the caller, and remove a whole lot of uncertainty.

Various new (comparatively speaking) APIs allow the caller to pass an uninitialised GValue, and will leave initialisation to the callee, which is how it should be, but this kind of change isn’t always possible in a backward compatible way.


You can use three types of C arrays in your API:

  • zero-terminated arrays, which are the easiest to use, especially for pointers and strings
  • fixed-size arrays
  • arrays with length arguments

Addendum: strings and byte arrays

A const char* argument for C strings with a length argument is not an array:

/**
 * some_object_load_data:
 * @self: ...
 * @str: the data to load
 * @len: length of @str in bytes, or -1
 * ...
 */
void
some_object_load_data (SomeObject *self,
                       const char *str,
                       ssize_t len)

Never annotate the str argument with array length=len. Ideally, this kind of function should not exist in the first place. You should always use const char* for NUL-terminated strings, possibly UTF-8 encoded; if you allow embedded NUL characters then use a bytes array:

/**
 * some_object_load_data:
 * @self: ...
 * @data: (array length=len) (element-type uint8): the data to load
 * @len: the length of the data in bytes
 * ...
 */
void
some_object_load_data (SomeObject *self,
                       const unsigned char *data,
                       size_t len)

Instead of unsigned char you can also use uint8_t, just to drive the point home.

Yes, it’s slightly nicer to have a single entry point for strings and byte arrays, but that’s just a C convenience: decent languages will have a proper string type, which always comes with a length; and string types are not binary data.

Addendum: GArray, GPtrArray, GByteArray

Whatever you do, however low you feel on the day, whatever particular tragedy befell your family at some point, please: never use GLib array types in your API. Nothing good will ever come of it, and you’ll just spend your days regretting this choice.

Yes: gobject-introspection transparently converts between GLib array types and C types, to the point of allowing you to annotate the contents of the array. The problem is that that information is static, and only exists at the introspection level. There’s nothing that prevents you from putting other random data into a GPtrArray, as long as it’s pointer-sized. There’s nothing that prevents a version of a library from saying that you own the data inside a GArray, and have the next version assign a clear function to the array to avoid leaking it all over the place on error conditions, or when using g_autoptr.

Adding support for GLib array types in the introspection was a well-intentioned mistake that worked in very specific cases—for instance, in a library that is private to an application. Any well-behaved, well-designed general purpose library should not expose this kind of API to its consumers.

You should use GArray, GPtrArray, and GByteArray internally; they are good types, and remove a lot of the pain of dealing with C arrays. Those types should never be exposed at the API boundary: always convert them to C arrays, or wrap them into your own data types, with proper argument validation and ownership rules.

Addendum: GHashTable

What’s worse than a type that contains data with unclear ownership rules decided at run time? A type that contains twice the amount of data with unclear ownership rules decided at run time.

Just like the GLib array types, hash tables should be used but never directly exposed to consumers of an API.

Addendum: GList, GSList, GQueue

See above, re: pain and misery. On top of that, linked lists are a terrible data type that people should rarely, if ever, use in the first place.


Your callbacks should always be in the form of a simple callable with a data argument:

typedef void (* SomeCallback) (SomeObject *obj,
                               gpointer data);

Any function that takes a callback should also take a “user data” argument that will be passed as is to the callback:

// scope: call; the callback data is valid until the
// function returns
void
some_object_do_stuff_immediately (SomeObject *self,
                                  SomeCallback callback,
                                  gpointer data);

// scope: notify; the callback data is valid until the
// notify function gets called
void
some_object_do_stuff_with_a_delay (SomeObject *self,
                                   SomeCallback callback,
                                   gpointer data,
                                   GDestroyNotify notify);

// scope: async; the callback data is valid until the async
// callback is called
void
some_object_do_stuff_but_async (SomeObject *self,
                                GCancellable *cancellable,
                                GAsyncReadyCallback callback,
                                gpointer data);

// not pictured here: scope forever; the data is valid for
// the entirety of the process lifetime

If your function takes more than one callback argument, you should make sure that it also takes a different user data for each callback, and that the lifetimes of the callbacks are well defined. The alternative is to use GClosure instead of a simple C function pointer, but that comes at the cost of GValue marshalling, so the recommendation is to stick with one callback per function.
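To show how the user data and the destroy notification travel together, here is a sketch in plain C. Every name in it is hypothetical (SomeDestroyNotify plays the role GDestroyNotify has in GLib), and it only covers two of the lifetimes described above: data used during the call, and data kept until a notify function runs.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*SomeCallback)(int value, void *data);
typedef void (*SomeDestroyNotify)(void *data);

/* "call" lifetime: callback and data are only used before returning. */
void some_for_each(const int *values, size_t n_values,
                   SomeCallback callback, void *data)
{
    assert(callback != NULL);
    for (size_t i = 0; i < n_values; i++)
        callback(values[i], data);
}

/* Stored handler: the data stays valid until notify is called on it. */
typedef struct {
    SomeCallback callback;
    void *data;
    SomeDestroyNotify notify;
} SomeHandler;

void some_handler_set(SomeHandler *handler, SomeCallback callback,
                      void *data, SomeDestroyNotify notify)
{
    assert(handler != NULL);
    /* Release the previous user data before replacing it. */
    if (handler->notify != NULL)
        handler->notify(handler->data);
    handler->callback = callback;
    handler->data = data;
    handler->notify = notify;
}

void some_handler_emit(const SomeHandler *handler, int value)
{
    assert(handler != NULL);
    if (handler->callback != NULL)
        handler->callback(value, handler->data);
}

void some_handler_clear(SomeHandler *handler)
{
    some_handler_set(handler, NULL, NULL, NULL);
}

/* Example callbacks: data points at an int owned by the caller. */
void add_to_sum(int value, void *data)
{
    *(int *)data += value;
}

void mark_released(void *data)
{
    *(int *)data = -1;
}
```

With one callback, one data pointer and one notify function there is never any doubt about which data belongs to which callback, or about when it stops being used.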

Addendum: the closure annotation

It seems that many people are unclear about the closure annotation.

Whenever you’re describing a function that takes a callback, you should always annotate the callback argument with the argument that contains the user data using the (closure argument) annotation, e.g.

/**
 * some_object_do_stuff_immediately:
 * @self: ...
 * @callback: (scope call) (closure data): the callback
 * @data: the data to be passed to the @callback
 * ...
 */

You should not annotate the data argument with a unary (closure).

The unary (closure) is meant to be used when annotating the callback type:

/**
 * SomeCallback:
 * @self: ...
 * @data: (closure): ...
 * ...
 */
typedef void (* SomeCallback) (SomeObject *self,
                               gpointer data);

Yes, it’s confusing, I know.

Sadly, the introspection parser isn’t very clear about this, but in the future it will emit a warning if it finds a unary closure on anything that isn’t a callback type.

Ideally, you don’t really need to annotate anything when you call your argument user_data, but it does not hurt to be explicit.

A cleaned up version of this blog post will go up on the gobject-introspection website, and we should really have a proper set of best API design practices on the Developer Documentation website by now; nevertheless, I do hope people will actually follow these recommendations at some point, and that they will be prepared for new recommendations in the future. Only dead and unmaintained projects don’t change, after all, and I expect the GNOME stack to last a bit longer than the 25 years it already spans today.

by ebassi at February 22, 2023 01:09 PM

February 20, 2023

Iago Toral

SuperTuxKart Vulkan vs OpenGL and Zink status on Raspberry Pi 4

SuperTuxKart Vulkan vs OpenGL

The latest SuperTuxKart release comes with an experimental Vulkan renderer and I was eager to check it out on my Raspberry Pi 4 and see how well it worked.

The short story is that while I have only tested a few tracks it seems to perform really well overall. In my tests, even with a debug build of Mesa, I saw the FPS ranging from 60 to 110 depending on the track. I think the game might actually be able to produce more than 110 fps: since various tracks reached exactly 110 fps, I suspect the limiting factor here was the display.

I was then naturally interested in comparing this to the GL renderer and I was a bit surprised to see that, with the same settings, the GL renderer would be somewhere in the 8-20 fps range for the same tracks. The game was clearly hitting a very bad path in the GL driver so I had to fix that before I could make a fair comparison between both.

A perf session quickly pointed me to the issue: Mesa has code to transparently translate vertex attribute formats that are not natively supported to a supported format. While this is great for compatibility it is obviously going to be very slow. In particular, SuperTuxKart uses rgba16f and rg16f with some vertex buffers and Mesa was silently translating these to 32-bit counterparts because the GL driver was not advertising support for the 16-bit variants. The hardware does support 16-bit floating point vertex attributes though, so this was very easy to fix.
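To make the cost concrete, here is a minimal Python sketch of what translating a buffer of 16-bit floats to 32-bit floats involves. It is not Mesa's code: the function name `translate_f16_to_f32` and the sample data are made up for illustration, and it assumes a tightly packed little-endian buffer.

```python
import struct


def translate_f16_to_f32(buf: bytes) -> bytes:
    """Re-encode a tightly packed buffer of 16-bit floats as 32-bit floats."""
    count = len(buf) // 2
    # 'e' is the struct format character for IEEE 754 half precision.
    values = struct.unpack(f"<{count}e", buf)
    return struct.pack(f"<{count}f", *values)


# Two vertices, each with one rg16f attribute (hypothetical texture coordinates).
rg16f = struct.pack("<4e", 0.0, 0.5, 1.0, 0.25)
rg32f = translate_f16_to_f32(rg16f)

print(len(rg16f), len(rg32f))  # prints "8 16"
```

Every component has to be read, widened and written back, and the translated buffer is twice the size of the original. That is work the hardware can skip entirely when the driver advertises the 16-bit formats.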

The Vulkan driver was exposing support for this already, which explains the dramatic difference in performance between both drivers. Indeed, with that change SuperTuxKart now plays smoothly on OpenGL too, with framerates always above 30 fps and up to 110 fps depending on the track. We should probably have an option in Mesa to make these kinds of under-the-hood compatibility translations more obvious to users so we can catch silly issues like this more easily.

With that said, even if GL is now a lot better, Vulkan is still ahead by quite a lot, producing 35-50% better framerate than OpenGL depending on the track, at least for the tracks that don’t hit the 110 fps mark, which as I said above, looks like it is a display maximum, at least with my setup.


During my presentation at XDC last year I mentioned Zink wasn’t supported on Raspberry Pi 4 any more due to feature requirements we could not fulfill.

In the past, Zink used to abort when it detected unsupported features, but it seems this policy has been changed and now it simply drops a warning and points to the possibility of incorrect rendering as a result.

Also, I have been talking to zmike about one of the features we could not support natively: scalarBlockLayout. Particularly, the issue with this is that we can’t make it work with vectors in all cases and the only alternative for us would be to scalarize everything through a lowering, which would probably have a performance impact. However, zmike confirmed that Zink is already doing this, so in practice we would not see vector load/stores from Zink, in which case it should work fine.
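For context, the difference between the standard (std430-style) layout rules and scalarBlockLayout can be shown with a small offset calculator. This is a simplified Python sketch: it only handles 32-bit float scalars and vectors (no arrays, matrices or nested structs), and `member_offsets` and the sample block are hypothetical names, not anything taken from Mesa or Zink.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def member_offsets(members, scalar_layout):
    """Compute byte offsets for (name, component_count) float members."""
    offsets = {}
    offset = 0
    for name, components in members:
        if scalar_layout:
            # Scalar block layout: align to the size of one component.
            alignment = 4
        else:
            # std430-style rules: vec2 aligns to 8 bytes, vec3 and vec4 to 16.
            alignment = {1: 4, 2: 8, 3: 16, 4: 16}[components]
        offset = align_up(offset, alignment)
        offsets[name] = offset
        offset += 4 * components
    return offsets


# Hypothetical block: { float a; vec3 b; float c; }
block = [("a", 1), ("b", 3), ("c", 1)]
print(member_offsets(block, scalar_layout=False))
print(member_offsets(block, scalar_layout=True))
```

With the standard rules the vec3 starts at byte 16; with scalar block layout it starts at byte 4. In other words, a vector can begin at any 4-byte boundary, which is the case a lowering pass handles by splitting vector accesses into scalar ones.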

So with all that in mind, I did give Zink a go and indeed, I get the warning that we don’t support scalar block layouts (and some other feature I don’t remember now) but otherwise it mostly works. It is not as stable as the native driver, and some things that work with the native driver don’t work with Zink at present; examples I saw include the WebGL Aquarium demo in Chromium and SuperTuxKart.

As far as performance goes, it has been a huge leap from when I tested it maybe 2 years ago. With VkQuake3’s OpenGL renderer, performance with Zink used to be ~40% of the native OpenGL driver, but it is now on par with it, if not a tiny bit better. Kudos to zmike and all the other contributors to Zink for all the work they put into this over the last 2 years; it really shows.

With all that said, I didn’t do too much testing with Zink myself so if anyone here decides to give it a more thorough go, please let me know how it went in the comments.

by Iago Toral at February 20, 2023 10:30 AM

Emmanuele Bassi

High Leap

I’ve been working at Endless for two years, now.

I’m incredibly lucky to be working at a great company, with great colleagues, on cool projects, using technologies I love, towards a goal I care deeply about.

We’ve been operating a bit under the radar for a while, but now it’s time to unveil what we’ve been doing — and we’re doing it via a Kickstarter campaign:

The computer for the entire world

The OS for the entire world


It’s been an honour and a privilege working on this little, huge project for the past two years, and I can’t wait to see what another two years are going to bring us.

by ebassi at February 20, 2023 12:38 AM


GUI toolkits have different ways to lay out the elements that compose an application’s UI. You can go from the fixed layout management — somewhat best represented by the old ‘90s Visual tools from Microsoft; to the “springs and struts” model employed by the Apple toolkits until recently; to the “boxes inside boxes inside boxes” model that GTK+ uses to this day. All of these layout policies have their own distinct pros and cons, and it’s not unreasonable to find that many toolkits provide support for more than one policy, in order to cater to more use cases.

For instance, while GTK+ user interfaces are mostly built using nested boxes to control margins, spacing, and alignment of widgets, there’s a sizeable portion of GTK+ developers that end up using GtkFixed or GtkLayout containers because they need fixed positioning of child widgets. That is, until they regret it, because now they have to handle things like reflowing, flipping contents in right-to-left locales, or font size changes.

Additionally, most UI designers do not tend to “think with boxes”, unless it’s for Web pages, and even in that case CSS affords a certain freedom that cannot be replicated in a GUI toolkit. This usually results in engineers translating a UI specification made of ties and relations between UI elements into something that can be expressed with a pile of grids, boxes, bins, and stacks — with all the back and forth, validation, and resources that the translation entails.

It would certainly be easier if we could express a GUI layout in the same set of relationships that can be traced on a piece of paper, a UI design tool, or a design document:

  • this label is at 8px from the leading edge of the box
  • this entry is on the same horizontal line as the label, its leading edge at 12px from the trailing edge of the label
  • the entry has a minimum size of 250px, but can grow to fill the available space
  • there’s a 90px button that sits between the trailing edge of the entry and the trailing edge of the box, with 8px between either edges and itself

Sure, all of these constraints can be replaced by a couple of boxes; some packing properties; margins; and minimum preferred sizes. If the design changes, though, like it often does, reconstructing the UI can become arbitrarily hard. This, in turn, leads to pushback to design changes from engineers — and the cost of iterating over a GUI is compounded by technical inertia.

For my daily work at Endless I’ve been interacting with our design team for a while, and trying to get from design specs to applications more quickly, and with less inertia. Having CSS available allowed designers to be more involved in the iterative development process, but the CSS subset that GTK+ implements is not allowed — for eminently good reasons — to change the UI layout. We could go “full Web”, but that comes with a very large set of drawbacks — performance on low end desktop devices, distribution, interaction with system services being just the most glaring ones. A native toolkit is still the preferred target for our platform, so I started looking at ways to improve the lives of UI designers with the tools at our disposal.

Expressing layout through easier to understand relationships between its parts is not a new problem, and as such it does not have new solutions; other platforms, like the Apple operating systems, or Google’s Android, have started to provide this kind of functionality — mostly available through their own IDE and UI building tools, but also available programmatically. It’s even available for platforms like the Web.

What many of these solutions seem to have in common is using more or less the same solving algorithm — Cassowary.

Cassowary is:

an incremental constraint solving toolkit that efficiently solves systems of linear equalities and inequalities. Constraints may be either requirements or preferences. Client code specifies the constraints to be maintained, and the solver updates the constrained variables to have values that satisfy the constraints.

This makes it particularly suited for user interfaces.
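As a rough illustration, the label, entry and button relationships listed earlier are all linear equalities and inequalities, so for that one row they can be resolved by hand. The Python sketch below is not Cassowary: it hard-codes the order in which the constraints are applied instead of solving a general system, and `Span` and `solve_row` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Horizontal extent of a widget, in pixels."""
    start: float
    width: float

    @property
    def end(self) -> float:
        return self.start + self.width


def solve_row(box_width: float, label_width: float):
    # label.start == box.start + 8
    label = Span(8.0, label_width)
    # entry.start == label.end + 12
    entry_start = label.end + 12.0
    # button.width == 90, with 8px between the entry and the button
    # and 8px between the button and the trailing edge of the box
    button_width = 90.0
    entry_width = box_width - 8.0 - button_width - 8.0 - entry_start
    # entry.width >= 250; the row overflows the box if it is too narrow
    entry_width = max(entry_width, 250.0)
    entry = Span(entry_start, entry_width)
    button = Span(entry.end + 8.0, button_width)
    return label, entry, button


label, entry, button = solve_row(box_width=500.0, label_width=60.0)
print(entry.start, entry.width, button.start, button.end)
```

For a 500px box and a 60px label this gives an entry that is 314px wide and a button ending 8px short of the trailing edge. A real solver earns its keep when the constraints cannot be ordered by hand like this, or when some of them are preferences rather than requirements.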

The original implementation of Cassowary was written in 1998, in Java, C++, and Smalltalk; since then, various other re-implementations surfaced: Python, JavaScript, Haskell, slightly-more-modern-C++, etc.

To continue in the naming policy of Cassowary implementations, this small library is named after yet another flightless bird

To that collection, I’ve now added my own — written in C/GObject — called Emeus, which provides a GTK+ container and layout manager that uses the Cassowary constraint solving algorithm to compute the allocation of each child.

In spirit, the implementation is pretty simple: you create a new EmeusConstraintLayout widget instance, add a bunch of widgets to it, and then use EmeusConstraint objects to determine the relations between children of the layout:

simple-grid.js, lines 89-170:
        let button1 = new Gtk.Button({ label: 'Child 1' });
        this._layout.pack(button1, 'child1');

        let button2 = new Gtk.Button({ label: 'Child 2' });
        this._layout.pack(button2, 'child2');

        let button3 = new Gtk.Button({ label: 'Child 3' });
        this._layout.pack(button3, 'child3');

            new Emeus.Constraint({ target_attribute: Emeus.ConstraintAttribute.START,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button1,
                                   source_attribute: Emeus.ConstraintAttribute.START,
                                   constant: -8.0 }),
            new Emeus.Constraint({ target_object: button1,
                                   target_attribute: Emeus.ConstraintAttribute.WIDTH,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button2,
                                   source_attribute: Emeus.ConstraintAttribute.WIDTH }),
            new Emeus.Constraint({ target_object: button1,
                                   target_attribute: Emeus.ConstraintAttribute.END,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button2,
                                   source_attribute: Emeus.ConstraintAttribute.START,
                                   constant: -12.0 }),
            new Emeus.Constraint({ target_object: button2,
                                   target_attribute: Emeus.ConstraintAttribute.END,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_attribute: Emeus.ConstraintAttribute.END,
                                   constant: -8.0 }),
            new Emeus.Constraint({ target_attribute: Emeus.ConstraintAttribute.START,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button3,
                                   source_attribute: Emeus.ConstraintAttribute.START,
                                   constant: -8.0 }),
            new Emeus.Constraint({ target_object: button3,
                                   target_attribute: Emeus.ConstraintAttribute.END,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_attribute: Emeus.ConstraintAttribute.END,
                                   constant: -8.0 }),
            new Emeus.Constraint({ target_attribute: Emeus.ConstraintAttribute.TOP,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button1,
                                   source_attribute: Emeus.ConstraintAttribute.TOP,
                                   constant: -8.0 }),
            new Emeus.Constraint({ target_attribute: Emeus.ConstraintAttribute.TOP,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button2,
                                   source_attribute: Emeus.ConstraintAttribute.TOP,
                                   constant: -8.0 }),
            new Emeus.Constraint({ target_object: button1,
                                   target_attribute: Emeus.ConstraintAttribute.BOTTOM,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button3,
                                   source_attribute: Emeus.ConstraintAttribute.TOP,
                                   constant: -12.0 }),
            new Emeus.Constraint({ target_object: button2,
                                   target_attribute: Emeus.ConstraintAttribute.BOTTOM,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button3,
                                   source_attribute: Emeus.ConstraintAttribute.TOP,
                                   constant: -12.0 }),
            new Emeus.Constraint({ target_object: button3,
                                   target_attribute: Emeus.ConstraintAttribute.HEIGHT,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button1,
                                   source_attribute: Emeus.ConstraintAttribute.HEIGHT }),
            new Emeus.Constraint({ target_object: button3,
                                   target_attribute: Emeus.ConstraintAttribute.HEIGHT,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_object: button2,
                                   source_attribute: Emeus.ConstraintAttribute.HEIGHT }),
            new Emeus.Constraint({ target_object: button3,
                                   target_attribute: Emeus.ConstraintAttribute.BOTTOM,
                                   relation: Emeus.ConstraintRelation.EQ,
                                   source_attribute: Emeus.ConstraintAttribute.BOTTOM,
                                   constant: -8.0 }),

A simple grid

This obviously looks like a ton of code, which is why I added the ability to describe constraints inside GtkBuilder XML:

centered.ui, lines 28-45:
              <constraint target-object="button_child"
              <constraint target-object="button_child"
              <constraint target-object="button_child"

Additionally, I’m writing a small parser for the Visual Format Language used by Apple for their own auto layout implementation — even though it does look like ASCII art of Perl format strings, it’s easy to grasp.

The overall idea is to prototype UIs on top of this, and then take advantage of GTK+’s new development cycle to introduce something like this and see if we can get people to migrate from GtkFixed/GtkLayout.

by ebassi at February 20, 2023 12:37 AM

Recipes hackfest

The Recipes application started as a celebration of GNOME’s community and history, and it’s grown to be a great showcase for what GNOME is about:

  • design guidelines and attention to detail
  • a software development platform for modern applications
  • new technologies, strongly integrated with the OS
  • people-centered development

Additionally, Recipes has become a place to iterate on design and technology for the rest of the GNOME applications.

Nevertheless, while design patterns, toolkit features, Flatpak and portals, are part of the development experience, without content provided by the people using Recipes there would not be an application to begin with.

If we look at the work Endless has been doing on its own framework for content-driven applications, there’s a natural fit — which is why I was really happy to attend the Recipes hackfest in Yogyakarta, this week.

Fried Javanese noodles make a healthy breakfast

In the Endless framework we take structured data (like a web page, or a PDF document, or a mix of video and text) and we construct “shards”, which embed the content, its metadata, and a Xapian database that can be used for querying the data. We take the shards and distribute them through Flatpak as a runtime extension for our applications, which means we can take advantage of Flatpak for shipping updates efficiently.

During the hackfest we talked about how to take advantage of the data model Endless applications use, as well as its distribution model; instead of taking tarballs with the recipe text, the images, and the metadata attached to each, we can create shards that can be mapped to a custom data model. Additionally, we can generate those shards locally when exporting the recipes created by new chefs, and easily re-integrate them with the shared recipe shards — with the possibility, in the future, to have a whole web application that lets you submit new recipes, and the maintainers review them without necessarily going through Matthias’s email. 😉

The data model discussion segued into how to display that data. The Endless framework has the concept of cards, which are context-aware data views; depending on context, they can have more or less details exposed to the user — and all those details are populated from the data model itself. Recipes has custom widgets that do a very similar job, so we talked about how to create a shared layer that can be reused both by Endless applications and by GNOME applications.

Sadly, I don’t remember the name of this soup, only that it had chicken hearts in it, and that Cosimo loved it

At the end of the hackfest we were able to have a proof of concept of Recipes loading the data from a custom shard, and using the Endless framework to display it; translating that into shareable code and libraries that can be used by other projects is the next step of the roadmap.

All of this, of course, will benefit more than just the Recipes application. For instance, we could have a Dictionary application that worked offline, and used Wiktionary as a source, and allowed better queries than just substring matching; we could have applications like Photos and Documents reuse the same UI elements as Recipes for their collection views; Software and Recipes already share a similar “landing page” design (and widgets), which means that Software could also use the “card” UI elements.

There’s lots for everyone to do, but exciting times are ahead!

And after we’re done we can relax by the pool

I’d be remiss if I didn’t thank our hosts at the Amikom university.

Yogyakarta is a great city; I had never been to Indonesia before, and I’ve greatly enjoyed my time here. There’s lots to see, and I strongly recommend visiting. I’ve loved the food, and the people’s warmth.

I’d like to thank my employer, Endless, for letting me take some time to attend the hackfest; and the GNOME Foundation, for sponsoring my travel.

The travelling Wilber

Sponsored by the GNOME Foundation

by ebassi at February 20, 2023 12:36 AM

On Vala

It seems I raised a bit of a stink on Twitter last week:

Of course, and with reason, I’ve been called out on this by various people. Luckily, it was on Twitter, so we haven’t seen articles on Slashdot and Phoronix and LWN with headlines like “GNOME developer says Vala is dead and will be removed from all servers for all eternity and you all suck”. At least, I’ve only seen a bunch of comments on Reddit about this, but nobody cares about that particular cesspool of humanity.

Sadly, 140 characters do not leave any room for nuance, so I should probably clarify what I wrote in a venue with no character limit.

First of all, I’d like to apologise to people that felt I was attacking them or their technical choices: it was not my intention, but see above, re: character count. I may have only about 1000 followers on Twitter, but it seems that the network effect is still a bit greater than that, so I should be careful when wording opinions. I’d like to point out that it’s my private Twitter account, and you can only get to what it says if you follow me, or if you follow people who follow me and decide to retweet what I write.

My PSA was intended as a reflection on the state of Vala, and its impact on the GNOME ecosystem in terms of newcomers, from the perspective of a person that used Vala for his own personal projects; recommended Vala to newcomers; and has to deal with the various build issues that arise in GNOME because something broke in Vala or in projects using Vala. If you’re using Vala outside of GNOME, you have two options: either ignore all I’m saying, as it does not really apply to your case; or do a bit of soul searching, and see if what I wrote does indeed apply to you.

First of all, I’d like to qualify my assertion that Vala is a “dead language”. Of course people see activity in the Git repository, see the recent commits and think “the project is still alive”. Recent commits do not tell a complete story.

Let’s look at the project history for the past 10 cycles (roughly 2.5 years). These are the commits for every cycle, broken up in two values: one for the full repository, the other one for the whole repository except the vapi directory, which contains the VAPI files for language bindings:


Aside from the latest cycle, Vala has seen very little activity; the project itself, if we exclude binding updates, has seen fewer than 100 commits for every cycle, and sometimes far fewer. The latest cycle is a bit of an outlier, but we can notice a pattern of very little work for two or three cycles, followed by a spike. If we look at the currently in-progress cycle, we can already see that the number of commits has decreased back to 55/42, as of this morning.


Number of commits is just a metric, though; more important is the number of contributors. After all, small, incremental changes may be a good thing in a language — though, spoiler alert: they are usually an indication of a series of larger issues, and we’ll come to that point later.

These are the number of developers over the same range of cycles, again split between committers to the full repository and to the full repository minus the vapi directory:


As you can see, the number of authors of changes is mostly stable, but still low. If we have few people that actively commit to the repository it means we have few people that can review a patch. It means patches linger longer and longer, while reviewers go through their queues; it means that contributors get discouraged; and, since nobody is paid to work full time on Vala, it means that any interruption caused by paid jobs will be a bottleneck on the project itself.

These concerns are not unique to programming languages: they exist for every volunteer-driven free and open source project. Programming languages, though, like core libraries, are problematic because any bottleneck causes ripple effects. You can take any stalled project you depend on, and vendor it into your own, but if that happens to the programming language you’re using, then you’re pretty much screwed.

For these reasons, we should also look at how well distributed the workload in Vala is, i.e. what percentage of the work is done by each of the authors of those commits; the results are not encouraging. Over that range of cycles, only two developers routinely authored more than 5% of the commits:

  • Rico Tzschichholz
  • Jürg Billeter

And Rico has been the only one to consistently author >50% of the commits. This means there’s only one person dealing with the project on a day to day basis.
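For anyone who wants to run the same kind of numbers on another repository, the sketch below shows one way to compute each author's share of commits in Python. It assumes `git` is on the PATH; the helper names, the repository path and the revision range are placeholders, and this is not necessarily how the figures in this post were produced.

```python
import subprocess
from collections import Counter


def commit_authors(repo, rev_range, exclude=None):
    """Return the author name of every commit in rev_range, one per commit."""
    cmd = ["git", "-C", repo, "log", "--format=%an", rev_range]
    if exclude:
        # Pathspec magic: look at everything except the given directory.
        cmd += ["--", ".", f":(exclude){exclude}"]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if line]


def commit_shares(authors):
    """Map each author to their percentage of the commits."""
    counts = Counter(authors)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical data, only to show the shape of the result.
sample = ["A"] * 6 + ["B"] * 3 + ["C"]
print(commit_shares(sample))
```

On a real checkout the two helpers would be chained, passing the tags that delimit a cycle as the revision range and `exclude="vapi"` to leave out the bindings.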

As the maintainer of a project who basically had to do all the work, I cannot even begin to tell you how soul-crushing that can become. You get burned out, and you feel responsible for everyone using your code, and then you get burned out some more. I honestly don’t want Rico to burn out, and you shouldn’t, either.

So, let’s go into unfair territory. These are the commits for Rust — the compiler and standard library:


These are the commits for Go — the compiler and base library:


These are the commits for Vala — both compiler and bindings:


These are the numbers of commits over the past year. Both languages are younger than Vala, have more tools than Vala, and are more used than Vala. Of course, it’s completely unfair to compare them, but those numbers should give you a sense of scale, of what the current high bar is for a successful programming language these days. Vala is a niche language, after all; it’s heavily piggy-backing on the GNOME community because it transpiles to C and needs a standard library and an ecosystem like the one GNOME provides. I never expected Vala to rise to the level of mindshare that Go and Rust currently occupy.

Nevertheless, we need to draw some conclusions about the current state of Vala — starting from this thread, perhaps, as it best encapsulates the issues the project is facing.

Vala, as a project, is limping along. There aren’t enough developers to actively effect change on the project; there aren’t enough developers to work on ancillary tooling — like build system integration, debugging and profiling tools, documentation. Saying that “Vala compiles to C so you can use tools meant for C” is comically missing the point, and it’s effectively like saying that “C compiles to binary code, so you can disassemble a program if you want to debug it”. Being able to inspect the language using tools native to the language is a powerful thing; if you have to do the name mangling in your head in order to set a breakpoint in GDB you are elevating the barrier of contributions way above the head of many newcomers.

Being able to effect change means also being able to introduce change effectively and without fear. This means things like continuous integration and a full test suite heavily geared towards regression testing. The test suite in Vala is made of 210 units, for a total of 5000 lines of code; the code base of Vala (vala AST, codegen, C code emitter, and the compiler) is nearly 75 thousand lines of code. There is no continuous integration, outside of the one that GNOME Continuous performs when building Vala, or the one GNOME developers perform when using jhbuild. Regressions are found after days or weeks, because developers of projects using Vala update their compiler and suddenly their projects cease to build.

I don’t want to minimise the enormous amount of work that every Vala contributor brought to the project; they are heroes, all of them, and they deserve as much credit and praise as we can give. The idea of a project-oriented, community-oriented programming language has been vindicated many times over, in the past 5 years.

If I scared you, or incensed you, then you can still blame me, and my lack of tact. You can still call me an asshole, and you can think that I’m completely uncool. What I do hope, though, is that this blog post pushes you into action. Either to contribute to Vala, or to renew your commitment to it, so that we can look at my words in 5 years and say “boy, was Emmanuele wrong”; or to look at alternatives, and explore new avenues in order to make GNOME (and the larger free software ecosystem) better.

by ebassi at February 20, 2023 12:36 AM

codes of conduct

a discussion on guadec-list about adopting a code of conduct for GUADEC prompted me to write down some thoughts about the issue.

Photo credit: Jonathan Thorne, CC by-nc-2.0

to be perfectly honest, I never thought an anti-harassment policy would be a controversial issue at all in 2014, after the rate of adoption of codes of conduct, and of anti-harassment policies, at conventions and conferences all over the world. there have been high profile cases, and speakers as well as attendees have finally started to stand up, and publicly state that they won’t attend a convention or a conference (even if sponsored) if the organizers do not put in place these kinds of documents.

GNOME, as a community, was if not the first, one of the first high profile free software foundations to define and implement a code of conduct. I’ll actually come back to that later, but that was a point of pride for me.

yet, I have to admit that my heart sank a bit for every email in the discussion on guadec-list, especially because they were from members of our own community.

as I said, GNOME already has a code of conduct, pitiful and neutered as it may be, and it applies to every venue of communication we have: mailing list, our web servers, and also our conferences. to have and implement a code of conduct is not a per-GUADEC-edition, local-team-only decision to take — and how do I know that? because it’s the board that approved the code of conduct for the mailing lists, IRC, and web servers, and it was not the moderation team, or the IRC operators, or the system administrators that took this decision.

I do understand that we like our conferences like we like our software: free-as-in-speech, and interesting to work on. that does not imply that we should just assume bad stuff won’t happen, or that people will automatically find the right person to help them because somebody else decided to be a jerk. to be fair, our software is full of well-defined rules for redistribution, and our conferences should be equally well-defined when it comes to acceptable behaviour and responsible people to contact. why do we do that? because it helps to have a clear set of rules, and people responsible for dealing with abuse if it happens.

I honestly have zero patience for the people saying that «everyone can be offended by something, thus we shouldn’t do anything». the usual, trite, argument is that having an anti-harassment policy will provide a “chilling effect” on attendees; it should also not be necessary to have these policies, because we trust our community to be composed of good people, and these policies automatically assume that everyone will be misbehaving. those are both ridiculous positions, even if they are sadly fairly common in the free and open source community at large. they are based on a fairly obvious misunderstanding; the code of conduct is not a sword for preventing people from misbehaving: it’s a shield for people being the object of discrimination and harassment.

Photo credit: Jenn and Tony Bot, CC by-nc-2.0

I am lucky enough, and privileged enough, that I have not been discriminated against for who I am, what I like, who I like, what I do, or how I do it. others are not in such a position, and they attend GUADEC. we want them to attend GUADEC, because they are our next contributor, our next user, our next tester, our next designer, our next bugsquad member, our next person submitting a documentation patch. I don’t want them to be placed in a position where they balk at the idea of participating at GUADEC because they don’t feel safe enough, because they aren’t part of our community yet. you want to talk about a chilling effect? that is the chilling effect. I have no concerns for members of our community: I know that most of them can actually behave like actual human beings in a social context. that knowledge comes from 10+ years in this community. I don’t expect, and I’d be foolish to do so, that new people that have not been at GUADEC yet, or have been newly introduced to our community, also possess that knowledge.

Photo credit: diffendale, CC by-nc-sa 2.0

a final thought: I actually want a better code of conduct for GNOME’s online services as well, one that is clear on responsibility and consequences, because our current one is a defanged travesty, which was implemented to be the lowest common denominator possible. it does not require responsibility for enforcing it, and it does not provide accountability for actually respecting it. it is, for all intents and purposes, like not having one. it probably was thought as a good compromise eight years ago, but it clearly is not enough any more, and it makes GNOME look bad. changing the code of conduct is a topic for the new board, one that I expect will be handled this year; I’ll make sure to prod them. ;-)

I’d like to thank Marina and Karen for reviewing the draft of this article and for their suggestions.

more information

  1. How will our Code of Conduct improve our harassment handling?
  2. Code of Conduct
  3. Codes of Conduct 101 + FAQ
  4. Conference anti-harassment
  5. My New Convention Harassment Policy
  6. Convention Harassment Policy Follow-Up

by ebassi at February 20, 2023 12:34 AM

Dream Road

Right at the moment I’m writing this blog post, the Endless Kickstarter campaign page looks like this:

With 26 days to spare

I’m incredibly humbled and proud. Thank you all so much for your support and your help in bringing Endless to the world.

The campaign goes on, though; we added various new perks, including:

  • the option to donate an Endless computer to Habitat for Humanity or Funsepa, two charities that are involved in housing and education projects in developing countries
  • the full package — computer, carabiner, mug, and t-shirt; this one ships everywhere in the world, while we’re still working out the kinks of international delivery of the merch

Again, thank you all for your support.

by ebassi at February 20, 2023 12:34 AM

Berlin DX Hackfest / Day 3

the third, and last, day of the DX hackfest opened with a quick recap of what people had been working on in the past couple of days.

we had a nice lunch nearby, and then we went back to the Endocode office to tackle the biggest topic: a road map for GTK+.

we made good progress on all the items, and we have a fairly clear idea of who is going to work on what. sadly, my optimism on GProperty landing soon did not survive a discussion with Ryan; it turns out that there are many more layers of yak to be shaved, though we kinda agreed on the assumption that there is, in fact, a yak underneath all those layers. to be fair, the work on GProperty enabled a lot of the optimizations of GObject: property notifications, bulk installation of properties, and the private instance data reorganization of last year are just examples. both Ryan and I agreed that we should not increase the cost for callers of property setters, which right now would require asking the class of the instance that we’re modifying for the GProperty instance; that implies taking locks and other unpleasant stuff. luckily, we do have access to private class data, and with a few minor modifications we can use that private data to store the properties; thus, getting the properties of a class can be achieved with simple pointer offsets and dereferences, without locks being involved. I’ll start working on this very soon, and hopefully we’ll be able to revisit the issue at GUADEC, in time for the next development cycle of GLib.
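
to make the "pointer offsets and dereferences" idea a bit more concrete, here is a minimal sketch in plain C. it is not GLib code and not the actual GProperty design: every type and function name below (PropDesc, ClassPrivate, class_get_property, object_set_int, the Button example) is hypothetical, and the only assumption is that the per-class private data is filled in once at class initialization and is read-only afterwards, so reading it needs no lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical property descriptor; NOT the real GProperty type. */
typedef struct {
  const char *name;
  size_t field_offset;            /* offset of the backing field from the instance start */
} PropDesc;

/* Hypothetical per-class private data: written once, then only read. */
typedef struct {
  const PropDesc *props;
  size_t n_props;
} ClassPrivate;

/* Public part of a class; records where its private data lives. */
typedef struct {
  const char *type_name;
  ptrdiff_t class_private_offset; /* offset from the class struct to ClassPrivate */
} ClassBase;

/* Example class and instance layouts (assumed for illustration). */
typedef struct { ClassBase base; ClassPrivate priv; } ButtonClass;
typedef struct { int width; int height; } ButtonPrivate;
typedef struct { const ClassBase *klass; ButtonPrivate priv; } Button;

enum { PROP_WIDTH, PROP_HEIGHT, N_PROPS };

static const PropDesc button_props[N_PROPS] = {
  [PROP_WIDTH]  = { "width",  offsetof(Button, priv) + offsetof(ButtonPrivate, width) },
  [PROP_HEIGHT] = { "height", offsetof(Button, priv) + offsetof(ButtonPrivate, height) },
};

static const ButtonClass button_class = {
  .base = { "Button", offsetof(ButtonClass, priv) },
  .priv = { button_props, N_PROPS },
};

/* Class private data is reached with one pointer addition: no lookup, no lock. */
static const ClassPrivate *class_get_private(const ClassBase *klass) {
  return (const ClassPrivate *)((const char *)klass + klass->class_private_offset);
}

/* The property descriptor is an index into the class private array. */
static const PropDesc *class_get_property(const ClassBase *klass, size_t prop_id) {
  const ClassPrivate *priv = class_get_private(klass);
  assert(prop_id < priv->n_props);
  return &priv->props[prop_id];
}

/* The setter path: instance -> class -> descriptor -> field, all by offsets.
 * Assumes the class pointer is the first member of every instance. */
static void object_set_int(void *instance, size_t prop_id, int value) {
  const ClassBase *klass = *(const ClassBase *const *)instance;
  const PropDesc *desc = class_get_property(klass, prop_id);
  *(int *)((char *)instance + desc->field_offset) = value;
}

static int object_get_int(const void *instance, size_t prop_id) {
  const ClassBase *klass = *(const ClassBase *const *)instance;
  const PropDesc *desc = class_get_property(klass, prop_id);
  return *(const int *)((const char *)instance + desc->field_offset);
}
```

with an instance initialized as `Button b = { &button_class.base, { 0, 0 } };`, a call like `object_set_int(&b, PROP_WIDTH, 640)` writes straight into `b.priv.width`. the point of the sketch is only that the whole path from the instance to the property descriptor is pointer arithmetic on data that never changes after class setup.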

in the meantime, I kept hacking on my little helper library that provides data types for canvases — and about which I’ll blog soon — as well as figuring out what’s missing from the initial code drop of the GTK+ scene graph that will be ready to be shown by the time GUADEC 2014 rolls around.

I’m flying back home on Saturday, so this is the last full day in Berlin for me. it was a pleasure to be here, and I’d really like to thank Endocode for generously giving us access to their office; Chris Kühl, for being a gracious and mindful host; and the GNOME Foundation, for sponsoring the attendance of all these fine people and contributors, myself included.

Sponsored by the GNOME Foundation

by ebassi at February 20, 2023 12:34 AM


like many fellow GNOME developers I will be in Strasbourg for GUADEC 2014.

on Monday morning, I will give a talk on the GTK+ Scene Graph Tool Kit, or GSK for short, but you should make sure to attend the many interesting talks that we have planned for you this year.

see you at GUADEC!

by ebassi at February 20, 2023 12:33 AM

GUADEC 2014 talk notes

I put the notes of the GSK talk I gave at GUADEC 2014 online; I believe there should be a video coming soon as well.

the notes are available on this very website.

by ebassi at February 20, 2023 12:33 AM