Planet Igalia

January 30, 2023

Yeunjoo Choi

Tips for using Vim when developing Chromium

There are lots of powerful IDEs that are widely used by developers. Vim is a text editor, but it can be turned into a good IDE with some great plugins and a little bit of configuration.

Many people already prefer Vim for software development because it is lightweight and available everywhere. I am one of them, and I always use Vim to develop Chromium. However, some may think it is hard to make Vim as capable as other IDEs, since Chromium is such a huge project.

For those potential users, this post introduces some tips for using Vim as an IDE when developing Chromium.

This document is aimed at Linux users who are already comfortable with Vim.

Code Formatting

I strongly recommend using vim-codefmt for code formatting. It supports most of the file types used in Chromium, including GN (see the GN docs), and it also supports autoformatting. However, I don't like autoformatting, so I just bind :FormatLines to ctrl-I to format the selected visual block.

map <C-I> :FormatLines<CR>


Another option is to use clang-format from depot_tools via clang-format.vim in tools/vim. See the next section for more about tools/vim.

tools/vim

You can easily find a bunch of Vim files in tools/vim. I create a .vimrc file locally in a working directory and load only what I need.

let chtool_path=getcwd().'/tools'
filetype off
let &rtp.=','.chtool_path.'/vim/mojom'
exec 'source' chtool_path.'/vim/filetypes.vim'
exec 'source' chtool_path.'/vim/ninja-build.vim'
filetype plugin indent on

Chromium provides Vim script files for syntax highlighting and file type detection of Mojom, the IDL for Mojo interfaces (IPC) between Chromium services. And ninja-build.vim lets you compile the current file with ctrl-O or build a specific target with the :CrBuild command. See each Vim file for details.

Running Tests

tools/autotest.py is very useful for running Chromium tests. As its description says, autotest.py finds the appropriate test suite, builds it, then runs it. You can run autotest.py from inside Vim with !, but typing all the parameters is sometimes a hassle. Why not write simple commands (or functions) on top of the scripts under tools/vim?

This is an example script for running the test under the current line. Some code is copied from ninja-build.vim, and GetNinjaOutputDirectory is imported from ninja_output.py.

pythonx << endpython
import os, sys, vim

def path_to_build_dir():
  # Code from tools/vim/ninja-build.vim.
  chrome_root = os.path.dirname(vim.current.buffer.name)
  fingerprints = ['chrome', 'net', 'v8', 'build', 'skia']
  while chrome_root and not all(
      [os.path.isdir(os.path.join(chrome_root, fp)) for fp in fingerprints]):
    chrome_root = os.path.dirname(chrome_root)

  sys.path.append(os.path.join(chrome_root, 'tools', 'vim'))
  # Import GetNinjaOutputDirectory from tools/vim/ninja_output.py.
  from ninja_output import GetNinjaOutputDirectory
  return GetNinjaOutputDirectory(chrome_root)

def run_test_for_line(linenumb):
  run_cmd = ' '.join(['!tools/autotest.py', '-C', path_to_build_dir(),
                      '%', '--line', linenumb])
  vim.command(run_cmd)
endpython

fun! RunTestForCurrentLine()
  let l:current_line = shellescape(line('.'))
  pythonx run_test_for_line(vim.eval('l:current_line'))
endfun

map <C-T> :call RunTestForCurrentLine()<CR>

Place the cursor on the test you want to run, press ctrl-T, and here you go.

YouCompleteMe

tools/vim has an example configuration for YouCompleteMe (a code completion engine for Vim). See tools/vim/chromium.ycm_extra_conf.py.

let g:ycm_extra_conf_globlist = ['../.ycm_extra_conf.py']
let g:ycm_goto_buffer_command = 'split-or-existing-window'

As you may already know, YouCompleteMe uses clangd for C-family completion. Fortunately, Chromium already supports clangd, and a remote index server provides daily index snapshots.

Do not skip the documents about using clangd with Chromium and the Chromium remote index server.

Here are short demos of completion suggestions and of code jumping (:YcmCompleter GoTo, :YcmCompleter GoToReferences) during Chromium development.

Commenter

A good developer writes good comments, so being a good commenter makes you a good developer (joke). In my humble opinion, NERD Commenter will make you a good developer (joke again). You can get all the details from the NERD Commenter docs; for Chromium you only need one additional option, for Mojom files.

let g:NERDCustomDelimiters = { 'mojom': {'left': '//'} }

File Navigation

Chromium is really huge, so we benefit from efficient file navigation tools. There are lots of awesome fuzzy finder plugins for this purpose, and I have been using command-t for a long time. But command-t requires a Vim executable with Ruby/Lua support, or Neovim. If you need a simpler yet still powerful navigator, fzf.vim is a great alternative.

Since Chromium is a Git project, search commands based on git ls-files return results very quickly. Set let g:CommandTFileScanner = 'git' for command-t, and use the :GFiles command for fzf.vim.


If you can't rely on git ls-files based commands, make sure your plugin uses rg or fd, which are among the fastest command-line search tools. For example, fzf respects the FZF_DEFAULT_COMMAND environment variable, and its manual suggests: export FZF_DEFAULT_COMMAND='fd --type f'.

Conclusion

Thanks for reading, and I hope this blog post helps you with your Chromium development environment. Comments sharing other cool tips are always welcome, so that we can make our Vim setups even more productive. (Right, this is the real purpose of this blog post.)

… I will be back someday with a post on debugging Chromium in Vim.

January 30, 2023 02:14 AM

January 27, 2023

Andy Wingo

three approaches to heap sizing

How much memory should a program get? Tonight, a quick note on sizing for garbage-collected heaps. There are a few possible answers, depending on what your goals are for the system.

you: doctor science

Sometimes you build a system and you want to study it: to identify its principal components and see how they work together, or to isolate the effect of altering a single component. In that case, what you want is a fixed heap size. You run your program a few times and determine a heap size that is sufficient for your problem, and then in future run the program with that new fixed heap size. This allows you to concentrate on the other components of the system.

A good approach to choosing the fixed heap size for a program is to determine the minimum heap size a program can have by bisection, then multiplying that size by a constant factor. Garbage collection is a space/time tradeoff: the factor you choose represents a point on the space/time tradeoff curve. I would choose 1.5 in general, but this is arbitrary; I'd go more with 3 or even 5 if memory isn't scarce and I'm really optimizing for throughput.
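A trivial sketch in C, with made-up names, just to make the arithmetic concrete:

#include <stddef.h>

/* Fixed-size policy: the bisected minimum heap size times a space/time
   factor; the factor is the point you pick on the tradeoff curve. */
static size_t fixed_heap_size(size_t minimum_heap_bytes, double factor) {
  return (size_t)((double)minimum_heap_bytes * factor);
}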

Note that a fixed-size heap is not generally what you want. It's not good user experience for running ./foo at the command line, for example. The reason for this is that program memory use is usually a function of the program's input, and only in some cases do you know what the input might look like, and until you run the program you don't know what the exact effect of input on memory is. Still, if you have a team of operations people that knows what input patterns look like and has experience with a GC-using server-side process, fixed heap sizes could be a good solution there too.

you: average josé/fina

On the other end of the spectrum is the average user. You just want to run your program. The program should have the memory it needs! Not too much of course; that would be wasteful. Not too little either; I can tell you, my house is less than 100m², and I spend way too much time shuffling things from one surface to another. If I had more space I could avoid this wasted effort, and in a similar way, you don't want to be too stingy with a program's heap. Do the right thing!

Of course, you probably have multiple programs running on a system that are making similar heap sizing choices at the same time, and the relative needs and importances of these programs could change over time, for example as you switch tabs in a web browser, so the right thing really refers to overall system performance, whereas what you are controlling is just one process' heap size; what is the Right Thing, anyway?

My corner of the GC discourse agrees that something like the right solution was outlined by Kirisame, Shenoy, and Panchekha in a 2022 OOPSLA paper, in which the optimum heap size depends on the allocation rate and the gc cost for a process, which you measure on an ongoing basis. Interestingly, their formulation of heap size calculation can be made by each process without coordination, but results in a whole-system optimum.

There are some details but you can imagine some instinctive results: for example, when a program stops allocating because it's waiting for some external event like user input, it doesn't need so much memory, so it can start shrinking its heap. After all, it might be quite a while before the program has new input. If the program starts allocating again, perhaps because there is new input, it can grow its heap rapidly, and might then shrink again later. The mechanism by which this happens is pleasantly simple, and I salute (again!) the authors for identifying the practical benefits that an abstract model brings to the problem domain.
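To give a feel for the shape of the result, here is a sketch in C; the names and the exact expression are mine, not the paper's, so consult the paper for the real formulation. Roughly, the heap limit comes out as the live size plus an extra term that grows with the square root of live size times allocation rate over collection speed, scaled by a space/time tuning constant:

#include <math.h>
#include <stddef.h>

/* MemBalancer-flavored sketch; approximate, see the paper for the details. */
static size_t adaptive_heap_limit(size_t live_bytes,
                                  double bytes_allocated_per_sec,
                                  double bytes_collected_per_sec,
                                  double c /* space/time tuning knob */) {
  double extra = sqrt(c * (double)live_bytes * bytes_allocated_per_sec
                      / bytes_collected_per_sec);
  return live_bytes + (size_t)extra;
}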

you: a damaged, suspicious individual

Hoo, friends-- I don't know. I've seen some things. Not to exaggerate, I like to think I'm a well-balanced sort of fellow, but there's some suspicion too, right? So when I imagine a background thread determining that my web server hasn't gotten so much action in the last 100ms and that really what it needs to be doing is shrinking its heap, kicking off additional work to mark-compact it or whatever, when the whole point of the virtual machine is to run that web server and not much else, only to have to probably give it more heap 50ms later, I-- well, again, I exaggerate. The MemBalancer paper has a heartbeat period of 1 Hz and a smoothing function for the heap size, but it just smells like danger. Do I need danger? I mean, maybe? Probably in most cases? But maybe it would be better to avoid danger if I can. Heap growth is usually both necessary and cheap when it happens, but shrinkage is never necessary and is sometimes expensive because you have to shuffle around data.

So, I think there is probably a case for a third mode: not fixed, not adaptive like the MemBalancer approach, but just growable: grow the heap when and if its size is less than a configurable multiplier (e.g. 1.5) of live data. Never shrink the heap. If you ever notice that a process is taking too much memory, manually kill it and start over, or whatever. Default to adaptive, of course, but when you start to troubleshoot a high GC overhead in a long-lived process, perhaps switch to growable to see its effect.
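A sketch of that growable policy, again with made-up names:

#include <stddef.h>

/* Growable policy: the limit only ever ratchets up, to a multiple of live data. */
static size_t growable_heap_limit(size_t current_limit, size_t live_bytes,
                                  double multiplier /* e.g. 1.5 */) {
  size_t wanted = (size_t)((double)live_bytes * multiplier);
  return wanted > current_limit ? wanted : current_limit;
}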

unavoidable badness

There is some heuristic badness that one cannot avoid: even with the adaptive MemBalancer approach, you have to choose a point on the space/time tradeoff curve. Regardless of what you do, your system will grow a hairy nest of knobs and dials, and if your system is successful there will be a lively aftermarket industry of tuning articles: "Are you experiencing poor object transit? One knob you must know"; "Four knobs to heaven"; "It's raining knobs"; "GC engineers DO NOT want you to grab this knob!!"; etc. (I hope that my British readers are enjoying this.)

These ad-hoc heuristics are just part of the domain. What I want to say though is that having a general framework for how you approach heap sizing can limit knob profusion, and can help you organize what you have into a structure of sorts.

At least, this is what I tell myself; inshallah. Now I have told you too. Until next time, happy hacking!

by Andy Wingo at January 27, 2023 09:45 PM

January 24, 2023

Andy Wingo

parallel ephemeron tracing

Hello all, and happy new year. Today's note continues the series on implementing ephemerons in a garbage collector.

In our last dispatch we looked at a serial algorithm to trace ephemerons. However, production garbage collectors are parallel: during collection, they trace the object graph using multiple worker threads. Our problem is to extend the ephemeron-tracing algorithm with support for multiple tracing threads, without introducing stalls or serial bottlenecks.

Recall that we ended up having to define a table of pending ephemerons:

struct gc_pending_ephemeron_table {
  struct gc_ephemeron *resolved;
  size_t nbuckets;
  struct gc_ephemeron *buckets[0];
};

This table holds pending ephemerons that have been visited by the graph tracer but whose keys haven't been found yet, as well as a singly-linked list of resolved ephemerons that are waiting to have their values traced. As a global data structure, the pending ephemeron table is a point of contention between tracing threads that we need to design around.

a confession

Allow me to confess my sins: things would be a bit simpler if I didn't allow tracing workers to race.

As background, if your GC supports marking in place instead of always evacuating, then there is a mark bit associated with each object. To reduce the overhead of contention, a common strategy is to actually use a whole byte for the mark bit, and to write to it using relaxed atomics (or even raw stores). This avoids the cost of a compare-and-swap, but at the cost that multiple marking threads might see that an object's mark was unset, go to mark the object, and think that they were the thread that marked the object. As far as the mark byte goes, that's OK because everybody is writing the same value. The object gets pushed on the to-be-traced grey object queues multiple times, but that's OK too because tracing should be idempotent.
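Here is a little illustration of that trick, with a hypothetical object header; this is not the collector's actual code, just the idea:

#include <stdatomic.h>
#include <stdint.h>

struct object_header {
  _Atomic uint8_t mark;   /* a whole byte, so no compare-and-swap is needed */
  /* ... */
};

/* Returns nonzero if the caller should trace the object.  With relaxed
   accesses, two threads may both see 0 and both return nonzero; that's fine,
   because everybody writes the same value and tracing is idempotent: the
   object just lands on a grey queue more than once. */
static int try_mark_relaxed(struct object_header *obj) {
  if (atomic_load_explicit(&obj->mark, memory_order_relaxed))
    return 0;
  atomic_store_explicit(&obj->mark, 1, memory_order_relaxed);
  return 1;
}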

This is a common optimization for parallel marking, and it doesn't have any significant impact on other parts of the GC--except ephemeron marking. For ephemerons, because the state transition isn't simply from unmarked to marked, we need more coordination.

high level

The parallel ephemeron marking algorithm modifies the serial algorithm in just a few ways:

  1. We have an atomically-updated state field in the ephemeron, used to know if e.g. an ephemeron is pending or resolved;

  2. We use separate fields for the pending and resolved links, to allow for concurrent readers across a state change;

  3. We introduce "traced" and "claimed" states to resolve races between parallel tracers on the same ephemeron, and track the "epoch" at which an ephemeron was last traced;

  4. We remove resolved ephemerons from the pending ephemeron hash table lazily, and use atomic swaps to pop from the resolved ephemerons list;

  5. We have to re-check key liveness after publishing an ephemeron to the pending ephemeron table.

Regarding the first point, there are four possible values for the ephemeron's state field:

enum {
  TRACED, CLAIMED, PENDING, RESOLVED
};

The state transition diagram looks like this:

  ,----->TRACED<-----.
 ,         | ^        .
,          v |         .
|        CLAIMED        |
|  ,-----/     \---.    |
|  v               v    |
PENDING--------->RESOLVED

With this information, we can start to flesh out the ephemeron object itself:

struct gc_ephemeron {
  uint8_t state;
  uint8_t is_dead;
  unsigned epoch;
  struct gc_ephemeron *pending;
  struct gc_ephemeron *resolved;
  void *key;
  void *value;
};

The state field holds one of the four state values; is_dead indicates if a live ephemeron was ever proven to have a dead key, or if the user explicitly killed the ephemeron; and epoch is the GC count at which the ephemeron was last traced. Ephemerons are born TRACED in the current GC epoch, and the collector is responsible for incrementing the current epoch before each collection.

algorithm: tracing ephemerons

When the collector first finds an ephemeron, it does a compare-and-swap (CAS) on the state from TRACED to CLAIMED. If that succeeds, we check the epoch; if it's current, we revert to the TRACED state: there's nothing to do.

(Without marking races, you wouldn't need either TRACED or CLAIMED states, or the epoch; it would be implicit in the fact that the ephemeron was being traced at all that you had a TRACED ephemeron with an old epoch.)

So now we have a CLAIMED ephemeron with an out-of-date epoch. We update the epoch and clear the pending and resolved fields, setting them to NULL. If, then, the ephemeron is_dead, we are done, and we go back to TRACED.

Otherwise we check if the key has already been traced. If so we forward it (if evacuating) and then trace the value edge as well, and transition to TRACED.

Otherwise we have a live E but we don't know about K; this ephemeron is pending. We transition E's state to PENDING and add it to the front of K's hash bucket in the pending ephemerons table, using CAS to avoid locks.

We then have to re-check if K is live, after publishing E, to account for other threads racing to mark K while we mark E; if indeed K is live, then we transition to RESOLVED and push E on the global resolved ephemeron list, using CAS, via the resolved link.

So far, so good: either the ephemeron is fully traced, or it's pending and published, or (rarely) published-then-resolved and waiting to be traced.
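To make the first of those transitions concrete, here is a sketch of the claim step in C11; the helper name is mine, and it assumes the state field is an atomic byte rather than the plain uint8_t shown above:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: returns true if this thread claimed the ephemeron and should go on
   to the key check; otherwise another thread owns it or it is already done. */
static bool claim_ephemeron(struct gc_ephemeron *e, unsigned current_epoch) {
  uint8_t expected = TRACED;
  if (!atomic_compare_exchange_strong(&e->state, &expected, CLAIMED))
    return false;                      /* pending, resolved, or claimed elsewhere */
  if (e->epoch == current_epoch) {
    atomic_store(&e->state, TRACED);   /* already traced in this cycle */
    return false;
  }
  e->epoch = current_epoch;
  e->pending = NULL;
  e->resolved = NULL;
  if (e->is_dead) {
    atomic_store(&e->state, TRACED);
    return false;
  }
  return true;  /* caller checks the key, then goes to TRACED, PENDING or RESOLVED */
}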

algorithm: tracing objects

The annoying thing about tracing ephemerons is that it potentially impacts tracing of all objects: any object could be the key that resolves a pending ephemeron.

When we trace an object, we look it up in the pending ephemeron hash table. But, as we traverse the chains in a bucket, we also load each node's state. If we find a node that's not in the PENDING state, we atomically forward its predecessor to point to its successor. This is correct for concurrent readers because the end of the chain is always reachable: we only skip nodes that are not PENDING, nodes never become PENDING after they transition away from being PENDING, and we only add PENDING nodes to the front of the chain. We even leave the pending field in place, so that any concurrent reader of the chain can still find the tail, even when the ephemeron has gone on to be RESOLVED or even TRACED.
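As a rough sketch of that walk, again assuming the state and link fields are atomic and leaving out the key forwarding:

/* Scan one bucket for PENDING ephemerons keyed by `key`, lazily splicing out
   nodes that are no longer PENDING; `link` starts at the bucket head. */
static void scan_bucket_for_key(struct gc_ephemeron *_Atomic *link, void *key) {
  for (struct gc_ephemeron *e = atomic_load(link); e; e = atomic_load(link)) {
    if (atomic_load(&e->state) != PENDING) {
      /* Forward the predecessor past this node; e->pending stays in place so
         a concurrent reader already holding e can still reach the tail. */
      atomic_store(link, atomic_load(&e->pending));
      continue;
    }
    if (e->key == key) {
      uint8_t expected = PENDING;
      if (atomic_compare_exchange_strong(&e->state, &expected, RESOLVED)) {
        /* ...forward the key (for a moving GC) and CAS e onto the global
           resolved list via e->resolved... */
      }
    }
    link = &e->pending;   /* advance to the next node in the chain */
  }
}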

(I had thought I would need Tim Harris' atomic list implementation, but it turns out that since I only ever insert items at the head, having annotated links is not necessary.)

If we find a PENDING ephemeron that has K as its key, then we CAS its state from PENDING to RESOLVED. If this works, we CAS it onto the front of the resolved list. (Note that we also have to forward the key at this point, for a moving GC; this was a bug in my original implementation.)

algorithm: resolved ephemerons

Periodically a thread tracing the graph will run out of objects to trace (its mark stack is empty). That's a good time to check if there are resolved ephemerons to trace. We atomically exchange the global resolved list with NULL, and then if there were resolved ephemerons, then we trace their values and transition them to TRACED.
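A sketch of that drain step, with a made-up trace_value_edge helper and an atomic list head:

void trace_value_edge(struct gc_ephemeron *e);  /* hypothetical: visits e->value */

/* Swap the whole resolved list out at once, then trace each value. */
static void drain_resolved(struct gc_ephemeron *_Atomic *resolved_head) {
  struct gc_ephemeron *e = atomic_exchange(resolved_head, NULL);
  while (e) {
    struct gc_ephemeron *next = e->resolved;
    trace_value_edge(e);
    atomic_store(&e->state, TRACED);
    e = next;
  }
}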

At the very end of the GC cycle, we sweep the pending ephemeron table, marking any ephemeron that's still there as is_dead, transitioning them back to TRACED, clearing the buckets of the pending ephemeron table as we go.

nits

So that's it. There are some drawbacks, for example that this solution takes at least three words per ephemeron. Oh well.

There is also an annoying point of serialization, which is related to the lazy ephemeron resolution optimization. Consider that checking the pending ephemeron table on every object visit is overhead; it would be nice to avoid this. So instead, we start in "lazy" mode, in which pending ephemerons are never resolved by marking; and then once the mark stack / grey object worklist fully empties, we sweep through the pending ephemeron table, checking each ephemeron's key to see if it was visited in the end, and resolving those ephemerons; we then switch to "eager" mode in which each object visit could potentially resolve ephemerons. In this way the cost of ephemeron tracing is avoided for that part of the graph that is strongly reachable. However, with parallel markers, would you switch to eager mode when any thread runs out of objects to mark, or when all threads run out of objects? You would get greatest parallelism with the former, but you run the risk of some workers prematurely running out of data while there is still a significant part of the strongly-reachable graph to traverse. If you wait for all threads to be done, you introduce a serialization point. There is a related question of when to pump the resolved ephemerons list. But these are engineering details.

Speaking of details, there are some gnarly pitfalls, particularly that you have to be very careful about pre-visit versus post-visit object addresses; for a semi-space collector, visiting an object will move it, so for example in the pending ephemeron table which by definition is keyed by pre-visit (fromspace) object addresses, you need to be sure to trace the ephemeron key for any transition to RESOLVED, and there are a few places this happens (the re-check after publish, sweeping the table after transitioning from lazy to eager, and when resolving eagerly).

implementation

If you've read this far, you may be interested in the implementation; it's only a few hundred lines long. It took me quite a while to whittle it down!

Ephemerons are challenging from a software engineering perspective, because they are logically a separate module, but they interact both with users of the GC and with the collector implementations. It's tricky to find the abstractions that work for all GC algorithms, whether they mark in place or move their objects, and whether they mark the heap precisely or if there are some conservative edges. But if this is the sort of thing that interests you, voilà the API for users and the API to and from collector implementations.

And, that's it! I am looking forward to climbing out of this GC hole, one blog at a time. There are just a few more features before I can seriously attack integrating this into Guile. Until the next time, happy hacking :)

by Andy Wingo at January 24, 2023 10:48 AM

January 23, 2023

Jacobo Aragunde

Accessible name computation in Chromium chapter III: name from hidden subtrees

Back in March of 2021, as part of my maintenance work on Chrome/Chromium accessibility at Igalia, I inadvertently started a long, deep dive into the accessible name calculation code that has kept me intermittently busy since then.

This is the third and last post in the series. We started with an introduction to the problem, and followed by describing how we fixed things in Chromium. In this post, we will describe some problems we detected in the accessible name computation spec, and the process to address them.

This post extends what was recently presented in a lightning talk at BlinkOn 17, in November 2022.

Hidden content and aria-labelledby

Authors may know they can use aria-labelledby to name a node after another, hidden node. The attribute aria-describedby works analogously for descriptions, but in this post we will refer to names for simplicity.

This is a simple example, where the produced name for the input is “foo”:

<input aria-labelledby="label">
<div id="label" class="hidden">foo</div>

This is because the spec says, in step 2B:

  • if computing a name, and the current node has an aria-labelledby attribute that contains at least one valid IDREF, and the current node is not already part of an aria-labelledby traversal, process its IDREFs in the order they occur:
  • or, if computing a description, and the current node has an aria-describedby attribute that contains at least one valid IDREF, and the current node is not already part of an aria-describedby traversal, process its IDREFs in the order they occur:
    i. Set the accumulated text to the empty string.
    ii. For each IDREF:
     
    a. Set the current node to the node referenced by the IDREF.
    b. Compute the text alternative of the current node beginning with step 2. Set the result to that text alternative.
    c. Append the result, with a space, to the accumulated text.
     
    iii. Return the accumulated text.

When processing each IDREF per 2B.ii.b, it will stumble upon this condition in 2A:

If the current node is hidden and is not directly referenced by aria-labelledby or aria-describedby, nor directly referenced by a native host language text alternative element (e.g. label in HTML) or attribute, return the empty string. 

So, a hidden node referenced by aria-labelledby or aria-describedby will not return the empty string. Instead, it will go on with the subsequent steps in the procedure.

Unfortunately, the spec did not provide clear answers to some corner cases: what happens if there are elements that are explicitly hidden inside the first hidden node? What about nodes that aren’t directly referenced? How deep are we expected to go inside a hidden tree to calculate its name? And what about aria-hidden nodes?

Consider another example:

  <input id="test" aria-labelledby="t1">
  <div id="t1" style="visibility:hidden">
    <span aria-hidden="true">a</span>
    <span style="visibility:hidden">b</span>
    <span>c</span>
    d
  </div>

Given the markup above, WebKit generated the name “d”, Firefox said “abcd” and Chrome said “bcd”. In the absence of clear answers, user agent behaviors differ.

There was an ongoing discussion in Github regarding this topic, where I shared my findings and concerns. Some conclusions that came out from it:

  • It appears that the original intention of the spec was allowing the naming of nodes from hidden subtrees, if done explicitly, and to include only those children that were not explicitly hidden.
  • It’s really hard to implement a check for explicitly hidden nodes inside a hidden subtree. For optimization purposes, that information is simplified away in early stages of stylesheet processing, and it wouldn’t be practical to recover or keep it.
  • Given the current state of affairs and how long the current behavior has been in place, changing it radically would cause confusion among authors, backwards compatibility problems, etc.

The agreement was to change the spec to match the actual behavior in browsers, and clarify the points where implementations differed. But, before that…

A dead end: the innerText proposal

An approach we discussed which looked promising was to use the innerText property for the specific case of producing a name from a hidden subtree. The spec would say something like: “if node is hidden and it’s the target of an aria-labelledby relation, return the innerText property”.

It’s easy to explain, and saves user agents the trouble of implementing the traversal. It went as far as to have a proposed wording and I coded a prototype for Chromium.

Unfortunately, innerText has a peculiar behavior with regard to hidden content: it will return empty if the node was hidden with the visibility:hidden rule.

The relevant spec says:

The innerText and outerText getter steps are:
 
1. If this is not being rendered or if the user agent is a non-CSS user agent, then return this’s descendant text content.

Where:

An element is being rendered if it has any associated CSS layout boxes, SVG layout boxes, or some equivalent in other styling languages.

An element with visibility:hidden has an associated layout box, hence it’s considered “rendered” and follows the steps 2 and beyond:

  1. Let results be a new empty list.
     
  2. For each child node node of this:
    i. Let current be the list resulting in running the rendered text collection steps with node. Each item in results will either be a string or a positive integer (a required line break count).

Where “rendered text collection steps” says:

  1. Let items be the result of running the rendered text collection steps with each child node of node in tree order, and then concatenating the results to a single list.
     
  2. If node’s computed value of ‘visibility’ is not ‘visible’, then return items.

The value of visibility is hidden, hence it returns items, which is empty because we had just started.

This behavior makes using innerText too big of a change to be acceptable for backwards compatibility reasons.

Clarifying the spec: behavior on naming after hidden subtrees

In the face of the challenges and preconditions described above, we agreed that the spec would document what most browsers are already doing in this case, which means exposing the entire, hidden subtree, down to the bottom. There was one detail that diverged: what about nodes with aria-hidden="true"? Firefox included them in the name, but Chrome and WebKit didn’t.

Letting authors use aria-hidden to really omit a node in a hidden subtree appeared to be a nice feature to have, so we first tried to codify that behavior into words. Unfortunately, the result became too convoluted: it added new steps, it created differences in what “hidden” means… We decided not to submit this proposal, but you can still take a look at it if you’re curious.

Instead, the proposal we submitted does not add or change many words in the spec text. The new text is:

“If the current node is hidden and is not part of an aria-labelledby or aria-describedby traversal, where the node directly referenced by that relation was hidden […], return the empty string.”

Instead, it adds a lot of clarifications and examples. Although it includes guidelines on what’s considered hidden, it does not differentiate between the specific rules used to hide; as a result, aria-hidden nodes are expected to be included in the name when naming after a hidden subtree.

Updating Chromium to match the spec: aria-hidden in hidden subtrees

We finally settled on including aria-hidden nodes when calculating a name or a description after a hidden subtree, via the aria-labelledby and aria-describedby relations. Firefox already behaved like this but Chromium did not. It appeared to be an easy change, like removing only one condition somewhere in the code… The reality is never like that, though.

The change landed successfully, but it was much more than removing a condition: we found out the removal had side effects in other parts of the code. Additionally, we had some corrections in Chromium to prevent authors from using aria-hidden on focusable elements, but they weren’t explicit in the code; our change codifies them and adds tests. All this was addressed in the patch and, in general, I think we left things in a better shape than they were.

A recap

In the process of fixing a couple of bugs about accessible name computation in Chrome, we detected undocumented cases in the spec and TODOs in the implementation. To address all of it, we ended up rewriting the traversal code, discussing and updating the spec, and adding even more tests to document corner cases.

Many people were involved in the process, not only Googlers but also people from other browser vendors, independent agents, and members of the ARIA working group. I’d like to thank them all for their hard work building and maintaining accessibility on the Web.

But there is still a lot to do! There are some known bugs, and most likely some unknown ones. Some discussions around the spec still remain open and could require changes in browser behavior. Contributions are welcome!

Thanks for reading, and happy hacking!

by Jacobo Aragunde Pérez at January 23, 2023 05:00 PM

January 20, 2023

Gyuyoung Kim

WebGPU Conformance Test Suite

WebGPU Introduction

WebGPU is a new Web API that exposes modern computer graphics capabilities, specifically Direct3D 12, Metal, and Vulkan, for performing rendering and computation operations on a GPU. Its goal is similar to that of the WebGL family of APIs, but WebGPU enables access to more advanced features of GPUs. Whereas WebGL is mostly for drawing images but can be repurposed (with great effort) for other kinds of computations, WebGPU provides first-class support for performing general computations on the GPU. In other words, the WebGPU API is the successor to the WebGL and WebGL2 graphics APIs for the Web. It provides modern features such as “GPU compute” as well as lower-overhead access to GPU hardware and better, more predictable performance. A simplified diagram of the WebGPU architecture can be found in [1].

WebGPU CTS

There is currently a CTS (Conformance Test Suite) that tests the behaviors defined by the WebGPU specification. The CTS is written in TypeScript and developed on GitHub, and it can be run through a sort of ‘adapter layer’ to work under other test infrastructures like WPT and Telemetry. At this point, there are around 1,400 tests that check the validation and operation of WebGPU implementations. There is a site for running the WebGPU CTS: if you go to the URL below with a browser that has the WebGPU feature enabled, you can run the CTS in your browser. https://gpuweb.github.io/cts/standalone/?runnow=0&worker=0&debug=0&q=webgpu:

Contribution

The WebGPU CTS has progressed in GitHub. If you want to contribute to the CTS project, you can start by taking a look at the issues that have been filed. Please follow the sequence below:
  1. Set up the CTS developing environment.
  2. Add to or edit the test plan in the CTS project site.
  3. Implement the test.
  4. Upload the test and request a review.

Organization

The CTS is largely composed of four categories – API, Shader, IDL, and Web Platform. First, the API category tests full coverage of the JavaScript API surface of WebGPU. The Shader category tests the full coverage of the shaders that can be passed to WebGPU. The IDL category checks that the WebGPU IDL is correctly implemented. Lastly, the Web Platform category tests Web Platform-specific interactions.

Example

This is a simple test for checking writeBuffer validation. Basically, a typical test consists of three blocks: description, parameters, and function body.
  • desc: Describes what it tests and summarize how it tests that functionality.
  • params: Defines parameters with a list of objects to be tested. Each such instance of the test is a “case”
  • fn: Implementation body of the test.

Parameterization

Let’s look at parameterization a bit more deeply. The CTS provides helpers like params() and others for creating large Cartesian products of test parameters. These generate “test cases”, further subdivided into “test subcases”. Thanks to the parameterization support, we can test with more varied input values.

Contributors

The WebGPU CTS has been developed by several contributors. Google (let’s assume that chromium.org is Google) has been writing most of the test cases; beyond that, Intel, gmail.com accounts, and Igalia have contributed to the CTS tests.
Grouping commit counts by email account shows that Google and Intel are the main contributors to the WebGPU CTS. Igalia has also participated in the CTS project since June 2022, and has written test cases checking the validity of the texture format and copy functionality, as well as the operation of the index format and the depth/stencil state.

The next steps

The Chromium WebGPU team announced their plan via an intent-to-ship posted to blink-dev recently. The Origin Trial for WebGPU expires starting in M110 and they aim to ship the WebGPU feature to Chromium in M113.

References

[1] https://developer.chrome.com/en/docs/web-platform/webgpu/

by gyuyoung at January 20, 2023 03:05 AM

January 18, 2023

Alexander Dunaev

Deprecating old stuff

Chromium 109 (released January 10, 2023) has dropped support for xdg_shell_unstable_v6, one of Wayland shell protocols.

For years, Chromium's Wayland backend was aware of two versions of the XDG shell interface. There were two parallel implementations of the client part, alike as twins; in certain parts, the only difference was the names of the functions. Wayland has a naming convention that requires these names to start with the name of the interface they work with, so where one implementation would call xdg_toplevel_resize(), the other one would call zxdg_toplevel_v6_resize().

The shell is a now-standard Wayland protocol extension that describes semantics for the “windows”. It does so by assigning “roles” to the Wayland surfaces, and by providing means to set relative positions of these surfaces. That altogether makes it possible to build tree-like hierarchies of windows, which is absolutely necessary to render a window that has a menu with submenus.

The XDG shell protocol was introduced in Wayland protocols a long time ago. The definition file mentions 2008, but I was unable to track its history back that far. The Git repository where these files are hosted these days was created in October 2015, and the earlier history is not preserved. (Maybe the older repository still exists somewhere?)

Back in 2015, the XDG shell interface was termed “unstable”. Wayland waives backwards compatibility for unstable stuff, and requires the versions to be explicitly mentioned in the names of protocols and their definition files. Following that rule, the XDG shell protocol was named xdg_shell_unstable_v5. Soon after the initial migration of the project, version 6 was introduced, named xdg_shell_unstable_v6. Since then, there have been two manifests in the unstable section, xdg-shell-unstable-v5.xml and xdg-shell-unstable-v6.xml. I assume versions prior to 5 were dropped at the migration.

This may be confusing, so let me clarify. Wayland defines several entities to describe stuff. A protocol is a grouping entity: it is the root element of the XML file where it is defined. A protocol may contain a number of interfaces that describe behaviour of objects. The real objects and calls are bound to interfaces, but the header files are generated for the protocol as a whole, so even if the client uses only one interface, it will import the whole protocol.

The XDG shell allows the application to, quoting its description, “create desktop-style surfaces”. A number of desktop environments implemented support for the unstable protocol, providing feedback and proposing improvements. In 2017, the shell protocol was declared stable, and got a new name: xdg_shell. The shell interface was renamed from zxdg_shell_v6 to xdg_wm_base. Since then, the clients had to choose between two protocols, or to support both.

For the stable stuff, the convention is simpler and easier to follow: the changes are assumed to be backwards compatible, so names of interfaces and protocols are fixed and no longer change. The version tracking is built into the protocol definition and uses the self-explanatory version and since attributes. The client can import one protocol and decide at run time which calls are possible and which are not depending on the version exposed by the compositor.
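For illustration, a generic libwayland client can handle this at bind time; this is a sketch, not Chromium's code, and the version number is only an example:

#include <stdint.h>
#include <string.h>
#include <wayland-client.h>
#include "xdg-shell-client-protocol.h"  /* generated by wayland-scanner */

#define MY_XDG_WM_BASE_VERSION 4u  /* highest version this client understands */

static struct xdg_wm_base *wm_base;
static uint32_t wm_base_version;

static void handle_global(void *data, struct wl_registry *registry,
                          uint32_t name, const char *interface,
                          uint32_t version) {
  (void)data;
  if (strcmp(interface, xdg_wm_base_interface.name) == 0) {
    wm_base_version = version < MY_XDG_WM_BASE_VERSION
                          ? version : MY_XDG_WM_BASE_VERSION;
    /* Bind at the highest version both sides support; calls added in later
       versions are then guarded by run-time checks against wm_base_version. */
    wm_base = wl_registry_bind(registry, name, &xdg_wm_base_interface,
                               wm_base_version);
  }
}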

The stable version continued to evolve. Eventually, the unstable version became redundant, and in 2022 major Wayland compositors started to drop support for it. By November of last year, we realised that the code that implemented the unstable shell protocol in Chromium was no longer used by real clients. Of 30 thousand recorded uses of the Wayland backend, only 3 (yeah, just three) used the unstable interface. The time had come to get rid of the legacy, which was more than one thousand lines of code.

Some bits of the support for alternative shell interfaces are preserved, though. A special factory class named ui::ShellObjectFactory is used to hide the implementation details of the shell objects. Before getting rid of the unstable shell implementation, the primary use of the factory was to select the XDG shell interface supported by the host. However, it also made it possible to integrate easily with alternative shells. For that purpose, mainly for the downstream projects, the factory will remain.

by Alex at January 18, 2023 01:21 PM

January 17, 2023

Manuel Rego

10 years ago

Once upon a time…

10 years ago I landed my first WebKit patch 🎂:

Bug 107275 - [GTK] Implement LayoutTestController::addUserScript

That was my first patch to a web engine related project, and also my first professional C++ patch. My previous work at Igalia was around web applications development with PHP and Java. At that time, 2013, Igalia had been already involved on web rendering engines for several years, and the work around web applications was fading out inside the company, so moving to the other side of the platform looked like a good idea. 10 years have passed and lots of great things have happened.

Since that first patch, many more have come, mostly in Chromium/Blink and WebKit, but also in WPT, CSS, HTML and other specs, and just a few, sadly, in Gecko (I’d love to find the opportunity to contribute more there at some point). I’ve become a committer and reviewer in both the Chromium and WebKit projects. I’m also a member of the CSS Working Group (though not very active lately) and a Blink API owner.

During these years I have had the chance to attend several events, speak at a few conferences, and give some interviews. I’ve also been helping to organize the Web Engines Hackfest since 2014, and a CSSWG face-to-face meeting at the Igalia headquarters in A Coruña in January 2020 (I have such great memories of the last dinner).

These years have allowed me to meet lots of wonderful people from whom I’ve learnt, and still learn every day, many things. I have the pleasure of working with amazing folks on a daily basis, in the open. Every new feature or project I work on is a new learning experience, and that’s something I really like about this kind of work.

In this period I’ve seen Igalia grow a lot in the web platform community, to the point that these days we’re the top world consultancy on the web platform, with an important position in several projects and standards bodies. We’re getting fairly well known in this ecosystem, and we’re very proud of having achieved that.

Looking back, I’m so grateful for the opportunity given by Igalia. Thanks to the amazing colleagues and the community who have helped me contribute work to the different projects. Also thanks to the different customers that have allowed me to work upstream and develop my career around web browsers. 🙏 These days the source code I write (together with many others) is used daily by millions of people; that’s totally unbelievable if you stop for a minute to think about it.

Looking forward to the next decade of involvement in the web platform. See you all on the web! 🚀

January 17, 2023 11:00 PM

Amanda Falke

Tutorial: Building WPE WebKit for Raspberry Pi 3

A lightning guide for building WPE WebKit with buildroot

This tutorial is about getting “up and running” with WPE WebKit on a Raspberry Pi 3, using a laptop/desktop with Fedora Linux installed. WPE WebKit has many benefits; you may read here about why WPE WebKit is a great choice for embedded devices. WPE WebKit has minimal dependencies and it displays high-quality animations, WebGL and videos on embedded devices.

WebPlatformForEmbedded is our focus; for this tutorial, we’ll be building the WPE WebKit image with buildroot, so make sure to clone the buildroot repository.

You will need:

Raspberry pi 3 items:

  • A raspberry pi 3.
  • A microSD card for the pi. (I usually choose 32GB microSD cards).
  • An external monitor, extra mouse, and extra keyboard for our rpi3, separately from the laptop.
  • An HDMI cable to connect the rpi3 to its own external monitor.

Laptop items:

  • Linux laptop/desktop.
    • This tutorial will be based on Fedora, but you can use Debian or any distro of your choice.
  • You also need a way to interface the microSD card with your laptop. You can get an SD Adapter for laptops that have an SD Card adapter slot, or you can use an external SD Card adapter interface for your computer.
    • This tutorial will be based on having a laptop with an SD card slot, and hence an SD Card Adapter will work just fine.

Items for laptop to communicate with rpi3:

  • An ethernet cable to connect the rpi3 to your laptop.
  • You need some way to get ethernet into your laptop. This is either in the form of an ethernet port on your laptop (not likely), or an adapter of some sort (likely a USB adapter).

Steps: High level overview

This is a high level overview of the steps we will be taking.

  1. Partition the blank SD card.
  2. In the buildroot repository, make <desired_config>.
  3. Run the buildroot menuconfig with make menuconfig to set up .config file.
  4. Run make to build sdcard.img in buildroot/output dir; change .config settings as needed.
  5. Write sdcard.img file to the SD card.
  6. Connect the rpi3 to its own external monitor, and its own mouse and keyboard.
  7. Connect the rpi3 to the laptop using ethernet cable.
  8. Put the SD card into the rpi3.
  9. Setup a shared ethernet connection between the laptop and rpi to get the IP address of rpi.
  10. ssh into the rpi and start WPE WebKit.

Steps: Detailed overview/sequence

1. Partition the blank SD card using fdisk. Create a boot partition and a root partition.

  • Note: this is only needed in case the output image format is root.tar. If it’s sdcard.img, that is dumped directly to the SD card; that image is already partitioned internally.
  • If you’re unfamiliar with fdisk, this is a good tutorial.

2. In the buildroot repository root directory, make the desired config.

  • Make sure to clone the buildroot repository.
  • Since things change a lot over time, it’s important to note the specific buildroot commit this tutorial was built on, and that this tutorial was built on January 12th, 2023. It is recommended to build from that commit for consistency to ensure that the tutorial works for you.
  • We are building the cog 2.28 wpe with buildroot. See the build options from the buildroot repository.
  • Run make list-defconfigs to get a list of configurations.
  • Copy raspberrypi3_wpe_2_28_cog_defconfig and run it: make raspberrypi3_wpe_2_28_cog_defconfig.
  • You will quickly get output which indicates that a .config file has been written in the root directory of the buildroot repository.

3. Run the buildroot menuconfig with make menuconfig to set up .config file.

  • Run make menuconfig. You’ll see options here for configuration. Go slowly and be careful.
  • Change these settings. Help menus are available for menuconfig, you’ll see them displayed on the screen.
    • ENABLE dosfstools (under Target packages -> Filesystem and flash utilities)
    • ENABLE mtools (under Target packages -> Filesystem and flash utilities)
    • ENABLE ext2/3/4 root filesystem (under Filesystem images)
    • SET VALUE ext2/3/4 variant = ext4 (under Filesystem images -> ext2/3/4 root filesystem)
    • DISABLE initial RAM filesystem linked into linux kernel (under Filesystem images)

4. Run make to build sdcard.img in buildroot/output dir; change .config settings as needed.

  • Run make. Then get a coffee, as the build and cross-compilation will take a while.

  • In reality, you may encounter some errors along the way, as cross-compilation can be an intricate matter. This tutorial will guide you through those potential errors.
  • When you encounter errors, you’ll follow a “loop” of sorts:

    Run make -> encounter errors -> manually edit .config file -> remove buildroot/output dir -> run make again, until sdcard.img is built successfully.

  • If you encounter CMake errors, such as fatal error: stdlib.h: No such file or directory, compilation terminated, and you have a relatively new version of CMake on your system, the reason for the error may be that buildroot is using your local CMake instead of the one specified in the buildroot configuration.
    • We will fix this error by setting in .config file: BR2_FORCE_HOST_BUILD=y. Then remove buildroot/output dir, and run make again.
  • If you encounter an error such as path/to/buildroot/output/host/lib/gcc/arm-buildroot-linux-gnueabihf/9.2.0/plugin/include/builtins.h:23:10: fatal error: mpc.h: No such file or directory, then we can fix it by editing ./output/build/linux-rpi-5.10.y/scripts/gcc-plugins/Makefile and adding -I path/to/buildroot/output/host/include to the plugin_cxxflags stanza. Then, as usual, remove the buildroot/output dir and run make again.

5. Write sdcard.img file to the SD card.

  • At this point after the make process, we should have sdcard.img file in buildroot/output/images directory.
  • Write this file to the SD card.
  • Consider using Etcher to do so.

6. Connect the rpi3 to its own external monitor, and its own mouse and keyboard.

  • We’ll have separate monitor, mouse and keyboard all connected to the raspberry pi so that we can use it independently from the laptop.

7. Connect the rpi3 to the laptop using ethernet cable.

8. Put the SD card into the rpi3.

9. Setup a shared ethernet connection between the laptop and rpi to get the IP address of rpi.

In general, one of the main problems when connecting to the Raspberry Pi via ssh is knowing the IP address of the device. This is very simple with Raspbian OS: simply turn on the raspberry pi and edit configurations to enable ssh, often over wifi.

This is where the ethernet capabilities of the raspberry pi come in.

Goal: find the DHCPACK syslog message acknowledging and assigning the IP address, after setting up a shared connection between the raspberry pi and the laptop.

Throughout this process, continually look at logs. Eventually we will see a message DHCPACK which will likely be preceded by several DHCP handshake related messages such as DHCP DISCOVER, REQUEST etc. The DHCPACK message will contain the IP address of the ethernet device, and we will then be able to ssh into it.

    1. Tail the syslogs of the laptop. On Debian distributions, this is often /var/log/syslog. Since we are using Fedora, we’ll be using systemd's journald with the journalctl command:
      • sudo journalctl -f
      • Keep this open in a terminal window.
      • You can also come up with a better solution like grepping logs, if you like, or piping output of stdout elsewhere.
    2. In a second terminal window, open up the NetworkManager.
      • Become familiar with existing devices prior to powering on the raspberry pi, by running nmcli.
    3. Power on the raspberry pi. Watch your system logs.
      • Syslogs will detail the raspberry pi’s name.
    4. Look for that name in NetworkManager nmcli device.
      • Using NetworkManager nmcli, set up shared connection for the ethernet device.
      • Setting up a shared connection is as simple as nmcli connection add type ethernet ifname $ETHERNET_DEVICE_NAME ipv4.method shared con-name local
      • This is a good tutorial for setting up a shared connection with NetworkManager.
    5. Once the connection is shared, syslogs will show a DHCPACK message acknowledging the ethernet device and its IP address. (You may need to power cycle the rpi to see this message, but it will happen).

10. ssh into the rpi and start WPE WebKit.

  • Now that we have the IP address of the raspberry pi, we can ssh into it from the laptop: ssh root@<RPI3_IP_ADDRESS>. (The default password is ‘root’. You can also add your user public key to /root/.ssh/authorized_keys on the pi. You can simplify this process by creating an overlay/root/.ssh/authorized_keys on your computer and by specifying the path to the overlay directory in the BR2_ROOTFS_OVERLAY config variable. That will copy everything in the overlay dir to the image.)
  • After that, export these env variables WPE_BCMRPI_TOUCH=1 and WPE_BCMRPI_CURSOR=1 to enable keyboard and mouse control.
    • Why: Recall that generally WPE WebKit is for embedded devices, such as kiosks, or set-top boxes requiring control with a remote control or similar device or touch interaction. We are exporting these environment variables so that we can “test” WPE WebKit with our separate mouse and keyboard for our raspberry pi without the need for a touch screen or special hardware targets, or a Wayland compositor such as weston. If this piques your curiosity, please see the WPE WebKit FAQ on Wayland.
  • Start WPE WebKit with cog: cog "http://www.igalia.com/"
  • A browser will launch in the external monitor connected to the raspberry pi 3, and we can control the browser with the raspberry pi’s mouse and keyboard!

That’s all for now. Feel free to reach out in the support channels for WPE on Matrix.

WPE’s Frequently Asked Questions

by Amanda Falke at January 17, 2023 08:03 AM

January 16, 2023

Eric Meyer

Minimal Dark Mode Styling

Spurred by a toot in reply to a goofy dev joke I made, I’ve set up a dirty-rough Dark Mode handler for meyerweb:

@media (prefers-color-scheme: dark) {
	html body {filter: invert(1);}
	/* the following really should be managed by a cascade layer */
	html img, 
	html img.book.cover, 
	html img.book.cover.big, 
	html #archipelago a:hover img {filter: invert(1);}
	html #thoughts figure.standalone img {
	  box-shadow: 0.25em 0.25em 0.67em #FFF8;
	}
}

It’s the work of about five minutes’ thought and typing, so I suspect it’s teetering on the edge of Minimum Viable Product, but I’m not sure which side of that line it lands on.

Other than restructuring things so that I can use Cascade Layers to handle the Light and Dark Mode styling, what am I missing or overlooking?  <video> elements, I suppose, but anything else jump out at you as in need of correction?  Let me know!

(P.S. “Use only classes, never IDs” is not something that needs to be corrected.  Yes, I know why people think that, and this situation seems to be an argument in favor of that view, but I have a different view and will not be converting IDs to classes.  Thanks in advance for your forbearance.)



by Eric Meyer at January 16, 2023 09:04 PM

Alejandro Piñeiro

v3dv status update 2023-01

We haven’t posted updates to the work done on the V3DV driver since we announced the driver becoming Vulkan 1.2 Conformant.

But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.

Moved more functionality to the GPU

Our implementations of Events and Occlusion Queries were both mostly CPU-based. We have refactored both features with a new GPU-side implementation based on the use of compute shaders.

In addition to being more “the Vulkan way”, this has additional benefits. For example, in the case of events, we no longer need to stall on the CPU to handle GPU-side event commands, and it allowed us to re-enable sync_fd import/export.

VK_KHR_sampler_ycbcr_conversion

We have just landed a real implementation for this extension, based on the work of Ella Stanforth as part of her Igalia Coding Experience with us. This was really complex work, as this feature added support for multi-plane formats and required modifying various parts of the driver. A big kudos to Ella for getting this tricky feature going. Also thanks to Jason Ekstrand, who worked on a common Mesa framework for ycbcr support.

Support for new extensions

Since 1.2 was announced, the following extensions have been exposed:

  • VK_EXT_texel_buffer_alignment
  • VK_KHR_maintenance4
  • VK_KHR_zero_initialize_workgroup_memory
  • VK_KHR_synchronization2
  • VK_KHR_workgroup_memory_explicit_layout
  • VK_EXT_tooling_info (0 tools exposed though)
  • VK_EXT_border_color_swizzle
  • VK_EXT_shader_module_identifier
  • VK_EXT_depth_clip_control
  • VK_EXT_attachment_feedback_loop_layout
  • VK_EXT_memory_budget
  • VK_EXT_primitive_topology_list_restart
  • VK_EXT_load_store_op_none
  • VK_EXT_image_robustness
  • VK_EXT_pipeline_robustness
  • VK_KHR_shader_integer_dot_product

Some miscellanea

In addition to those, we also worked on the following:

  • Implemented a heuristic to decide when to enable double-buffer mode, which could help improve performance in some cases. It still needs to be enabled through the V3D_DEBUG environment variable.

  • Getting v3dv and v3d to use the same shader optimization method, which allows more code to be reused between the OpenGL and Vulkan drivers.

  • Getting the driver working with the fossilize-db tools

  • Bugfixing, mostly related to bugs identified through new Khronos CTS releases

by infapi00 at January 16, 2023 03:47 PM

January 12, 2023

Miyoung Shin

Removing Reference FrameTreeNode from RenderFrameHost

MPArch stands for Multiple Pages Architecture. The Chromium teams at Igalia and Google are working together on the MPArch project, which unifies several features that support multiple pages in the same tab (back/forward cache, prerendering, guest views, and portals) but were implemented with different architecture models, under a single architecture. If you are not familiar with MPArch, I recommend you first read the recently posted blog by Que; it will help you understand the detailed history and structure of MPArch. This post assumes that the reader is familiar with the //content API, and describes the ongoing task of removing references to FrameTreeNode from RenderFrameHost as part of the MPArch project.

Relationship between FrameTreeNode and RenderFrameHost in MPArch

The main frame and iframes are mirrored as FrameTreeNodes, and the document of each frame is mirrored as a RenderFrameHost. Each FrameTreeNode has a current RenderFrameHost, which can change over time as the frame is navigated. Let’s look at the relationship between FrameTreeNode and RenderFrameHost for the features using the MPArch structure.

BFCache
BFCache (back/forward cache) is a feature that caches pages in memory (preserving the JavaScript state and the DOM state) when the user navigates away from them. When the user navigates to a new page, the existing document’s RenderFrameHost is saved in the BFCache, and when a back/forward action occurs, that RenderFrameHost is restored from the BFCache. The FrameTreeNode is kept while the RenderFrameHost is saved and restored.

Prerendering
Prerender2 is a feature to pre-render the next page for faster navigation. In prerender, a document will be invisible to the user and isn’t allowed to show any UI changes, but the page is allowed to load and run in the background.

Navigation changes to support multiple pages in WebContents

When the prerendered page is activated, the FrameTreeNode’s current RenderFrameHost is replaced with the prerendered page’s RenderFrameHost.

Fenced Frame
The fenced frame enforces a boundary between the embedding page and the cross-site embedded document such that user data visible to the two sites is not able to be joined together. The fenced frame works similarly to the main frame for security and privacy-related access, and to the subframe for other cases on the browser side. To this end, it is wrapped as a dummy FrameTreeNode in FrameTree, and the fenced frame can be accessed through a delegate.

GuestView & Portals
Chromium implements a tag with GuestView, which is the templated base class for out-of-process frames in the chrome layer. For more detail, please refer to Julie’s blog.

Portals allow the embedding of a page inside another page, which can later be activated and replace the main page.

Navigation changes to support multiple pages in WebContents


GuestView and Portals are implemented with a multiple-WebContents model, and refactoring them to MPArch is being considered.

Pending Deletion
When a RenderFrameHost has started running unload handlers (this includes handlers for the unload, pagehide, and visibilitychange events), or when it has completed running them, the RenderFrameHost’s lifecycle state is pending deletion.

Reference FrameTreeNode from RenderFrameHost

Accessing the FrameTreeNode from a RenderFrameHost that has entered the BFCache, from an old RenderFrameHost that has already been swapped out after prerendered page activation, or from a RenderFrameHost in the pending deletion state all have security issues. Currently, RenderFrameHost has direct access to FrameTreeNode internally, and exposes methods that allow access to FrameTreeNode externally.

A meta bug 1179502 addresses this issue.

Introduce RenderFrameHostOwner
RenderFrameHostOwner is an interface through which a RenderFrameHost communicates with the FrameTreeNode that owns it. It can be null, or can change during the lifetime of the RenderFrameHost, to prevent accidental violations of the implicit “associated FrameTreeNode stays the same” assumption. RenderFrameHostOwner is:
  // - Owning FrameTreeNode for main RenderFrameHosts in kActive, kPrerender,
  //   kSpeculative or kPendingCommit lifecycle states.
  // - Null for main RenderFrameHosts in kPendingDeletion lifecycle state.
  // - Null for main RenderFrameHosts stored in BFCache.
  // - Owning FrameTreeNode for subframes (which stays the same for the entire
  //   lifetime of a subframe RenderFrameHostImpl).

//content/browser/renderer_host/render_frame_host_impl.h


We are auditing all the places where FrameTreeNode is used to see whether it can be replaced with RenderFrameHostOwner, after checking which lifecycle states the RenderFrameHost can be in at that point. However, there are still some places where RenderFrameHostOwner cannot be used (e.g. the unload handler), and we are still investigating how to deal with these cases. RenderFrameHost::frame_tree_node(), which is widely used both inside and outside RenderFrameHost, also needs to be refactored, and that work is ongoing.
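
To illustrate the pattern in isolation, here is a minimal, self-contained C++ sketch. All of the names below are hypothetical and simplified (this is not Chromium’s actual interface); it only shows the idea described above: the owner pointer is nullable and may change, so callers must handle the null case instead of assuming an always-present FrameTreeNode.

// Hypothetical sketch, not Chromium code: a frame host holding a nullable,
// changeable pointer to the object that owns it.
#include <iostream>

class OwnerSketch {
 public:
  virtual ~OwnerSketch() = default;
  virtual void ReloadFrame() = 0;
};

class FrameHostSketch {
 public:
  // The owner may be null (e.g. while in the back/forward cache or pending
  // deletion) and may change over this object's lifetime.
  void SetOwner(OwnerSketch* owner) { owner_ = owner; }

  void RequestReload() {
    if (!owner_) {
      std::cout << "no owner right now, ignoring reload request\n";
      return;
    }
    owner_->ReloadFrame();
  }

 private:
  OwnerSketch* owner_ = nullptr;  // Nullable by design.
};

class NodeSketch : public OwnerSketch {
 public:
  void ReloadFrame() override { std::cout << "reloading frame\n"; }
};

int main() {
  FrameHostSketch host;
  host.RequestReload();    // No owner yet: safely ignored.
  NodeSketch node;
  host.SetOwner(&node);
  host.RequestReload();    // Delegates to the owning node.
  host.SetOwner(nullptr);  // E.g. the host is stored in the BFCache.
  host.RequestReload();    // Safely ignored again.
}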

by mshin at January 12, 2023 10:52 AM

January 06, 2023

Enrique Ocaña

Cat’s Panic

It’s been 8 years since the last time I wrote a videogame just for personal fun. As it’s now become a tradition, I took advantage of the extra focused personal time I usually have during the Christmas season and gave Processing a try to do my own “advent of code”. It’s a programming environment based on Java that offers a visual, canvas-based experience similar to the one I enjoyed as a child on 8-bit computers. I certainly found coding there to be a pleasant and fun experience.

So, what I coded is called Cat’s Panic, my own version of a known arcade game with a similar name. In this version, the player has to unveil the outline of a hidden cute cat on each stage.

The player uses the arrow keys to control a cursor that can freely move inside a border line. When pressing space, the cursor can start an excursion to try to cover a new area of the image to be unveiled. If any of the enemies touches the excursion path, the player loses a life. The excursion can be canceled at any time by releasing the space key. Enemies can be killed by trapping them in a released area. A stage is completed when 85% of the outline is unveiled.

Although this game is released under GPLv2, I don’t recommend anybody look at its source code. It breaks all principles of good software design; it’s messy, ugly, and its only purpose was to make the development process entertaining for me. You’ve been warned.

I’m open to contributions in the way of new cat pictures that add more stages to the already existing ones, though.

You can get the source code in the GitHub repository and a binary release for Linux here (with all the Java dependencies, which weigh a lot).

Meow, enjoy!

by eocanha at January 06, 2023 03:32 AM

January 05, 2023

Ziran Sun

2022 Review – Igalia Web Platform Team

2022 was a good year for the Web Platform team in Igalia in a lot of ways. We worked on some really interesting Web technologies, and explored some really interesting new ways of funding them. We hired some great new people, and attended and spoke at some great events. And we did some of our meetings in person!

Projects

Open Prioritization Projects

In 2020 Igalia launched the Open Prioritization experiment. It is an experiment in crowd-funding prioritization of new feature implementations for web browsers.

  • :focus-visible in WebKit

Our first Open Prioritization experiment to choose a project to crowdfund launched with 6 possible candidates, representing work in every browser engine, across many areas of the platform, and at various stages of completeness. Based on the pledges that the different projects received, :focus-visible in WebKit stood out and became the winner. Rego worked on this feature in 2021 and reported regularly on his progress. :focus-visible was finally enabled by default in WebKit and shipped in Safari 15.4.

Bear in mind that Apple Did Not Crowdfund :focus-visible in Safari. Eric made it very clear in his blog that, “the addition of :focus-visible to WebKit was lead by the community, done by Igalia, and contributed to WebKit without any involvement from Apple except in the sense of their reviewing patches and accepting the contributions”.

  • Enter: Wolvic

In February 2022, Igalia announced Wolvic, a new browser project with an initial focus on picking up where Firefox Reality leaves off. Brian published Enter: Wolvic, explaining what WebXR could bring and Igalia’s initial focus with Wolvic. As an open collective project, the Wolvic Collective is seeking more partners and collective funding to further explore the potential and capability of WebXR technology. To understand this project in detail, please follow Brian’s talk on Wolvic’s vision.

The Igalians involved in the Wolvic project have put a lot of effort into updating and nurturing the Firefox Reality codebase. Follow the commits by Sergio (@svillar), Felipe (@felipeerias) and Edu (@elima) from the Web Platform team in the Wolvic GitHub repository.

  • MathML

As part of the Open Prioritization effort, it’s exciting that MathML is going to be shipped in Chrome 109. The MathML collective is also part of our open collective effort. The plan is to bring MathML-Core support to other browsers for interoperability. MathML-Core support is Level 1 for this project; we have envisioned more work to follow and will move to Level 2 after that. Your support would make a lot of difference in moving this forward.

Other Collaborative Projects

Most projects happening in the team are collaborative efforts with our clients and other parties in the Open Source community.

  • :has()

The :has() pseudo-class is a selector that matches an element which has at least one element matching the relative selector passed as an argument. Byungwoo (@byungwoo_bw on Twitter) started the prototyping work in 2021, and it shipped in Chromium 105 this year.

The CSS :has() pseudo-class is a very exciting feature. Igalians have been talking about it since the initial stage of filing an intent to prototype. Byungwoo presented implementation and performance details at BlinkOn 16, and you can find a recent blog post by Eric with examples of using it. In addition, Byungwoo gave a talk on updates since :has() was enabled in Chrome.

  • CSS highlight pseudos

Since 2021, Delan has been working on spelling and grammar features in Chromium. She has been writing a series on the development of this work (Chromium spelling and grammar features and Chromium spelling and grammar, part 2). As the project progressed, Delan produced the third part of the series, on CSS highlight pseudos.

  • Custom Handlers component for Chrome

Igalia has been working on improving the integration of the IPFS protocol in the main browsers. Javier has been blogging about his work, including New Custom Handlers component for Chrome, which introduces some architectural changes in Chrome to move the Custom Handlers logic into a new component, and Discovering Chrome’s pre-defined Custom Handlers, which discusses feasible solutions.

  • Automated accessibility testing

The Web Platform team has been engaging in various aspects of accessibility work in browsers, from web standards to solutions. Automated accessibility testing by Alex is a reflection of our knowledge of testing.

  • WPT Python 3 Migration

In 2020, Igalia was involved in the Python 3 migration work for the web-platform-tests (WPT) project. The WPT Python 3 Migration Post walks through challenges, approaches and some implementations for this migration work.

As always, there are a lot of other Web technology topics Igalians are working on. We recommend following Brian’s blog and Eric’s blog for updates.

Conferences & Events

The Web Platform team was able to attend some conferences and events on-site this year after the COVID-19 pandemic. Here are some highlights.

  • Igalia Summit and Web Engines Hackfest

With employees distributed across over 20 countries and most working remotely, Igalia holds a summit every year to give employees the opportunity to meet face-to-face. The summit normally runs over a week, with a hackfest, code camps, team building, recreational activities, and so on.

As it is a great opportunity to bring together people working on the different browsers and related standards to discuss ideas and plans, Igalia normally hosts the Web Engines Hackfest during the summit week, inviting members from all parts of the Web Platform community. This year’s Hackfest was a success, with over 70 attendees over two days and lots of discussions and conversations around different features. Rego noted down his personal highlights for the event. Frédéric also published a status update on OpenType MATH fonts.

The summit is a great event for Igalians to gather together. For a new joiner, it’s a great first summit, and for an experienced Igalian there is a lot to take away too.

  • BlinkOn 17

BlinkOn 17 in November was a hybrid event, hosted in the San Francisco Bay Area as well as over the Internet. Igalia had a strong presence at the event, with seven lightning talks and three breakout talks. From the Web Platform team, Delan presented her work on Faster Style & Paint for CSS Highlights, and Frédéric gave a status update on MathML in his breakout talk Shipping MathML in Chrome 109. Andreu’s lightning talk on Specifying Line-clamp is very interesting, and you might find the live session on the History of the Web by Brian and Eric fascinating to follow.

  • TPAC

TPAC this year was a hybrid event, and Rego was happy that he managed to travel to TPAC 2022 after the pandemic and meet so many people in real life again.

Hiring

Igalia has been hiring actively, and the Web Platform team is no exception. In his blog post Igalia Web Platform team is hiring, Rego explains how Igalia’s cooperative flat structure works and what you could expect from working at Igalia. The Web Platform team is a very friendly and hard-working team, and the team’s blogs probably show only a small part of the work happening in it. If you’d like to know more, contact us. Even better, if you are interested in joining the team, check Rego’s blog post on how to apply.

We were all super excited that Alice Boxhall (@sundress on Twitter) joined the team in November 2022. Alice represented Igalia as a candidate for the W3C TAG election. Brian tells you in his blog Alice 2023 why we thought that Alice could be a good addition to the W3C TAG. Although the final result didn’t go our way, our support for Alice continues for the future :-).

Looking forward to 2023

With some exciting work planned for 2023, we are looking forward to another productive year ahead!

by zsun at January 05, 2023 02:41 PM

January 04, 2023

Danylo Piliaiev

Turnips in the wild (Part 3)

This is the third part of my “Turnips in the wild” blog post series where I describe how I found and fixed graphical issues in the Mesa Turnip Vulkan driver for Adreno GPUs. If you missed the first two parts, you can find them here:

  1. Psychonauts 2
    1. Step 1 - Toggling driver options
    2. Step 2 - Finding the Draw Call and staring at it intensively
    3. Step 3 - Bisecting the shader until nothing is left
    4. Step 4 - Going deeper
      1. Loading varyings
      2. Packing varyings
      3. Interpolating varyings
  2. Injustice 2
  3. Monster Hunter: World

Psychonauts 2

A few months ago it was reported that “Psychonauts 2” has rendering artifacts in the main menu, though only recently did I get my hands on it.

Screenshot of the main menu with the rendering artifacts
Notice the mark on the top right: the game was running directly on a Qualcomm board via FEX-Emu

Step 1 - Toggling driver options

Forcing direct rendering, forcing tiled rendering, disabling UBWC compression, forcing synchronizations everywhere, and so on, nothing helped or changed the outcome.

Step 2 - Finding the Draw Call and staring at it intensively

Screenshot of the RenderDoc with problematic draw call being selected
The first draw call with visible corruption

When looking around the draw call, everything looks good except for a single input image:

Input image of the draw call with corruption already present
One of the input images

Ughhh, it seems like this image comes from the previous frame, that’s bad. Inter-frame issues are hard to debug since there is no tooling to inspect two frames together…

* Looks around nervously *

Ok, let’s forget about it, maybe it doesn’t matter. The next step would be looking at the pixel values in the corrupted region:

color = (35.0625, 18.15625, 2.2382, 0.00)

Now, let’s see whether RenderDoc’s built-in shader debugger would give us the same value as GPU or not.

color = (0.0335, 0.0459, 0.0226, 0.50)

(Or not)

After looking at the same pixel on RADV, RenderDoc seems right. So the issue is somewhere in how the driver compiled the shader.

Step 3 - Bisecting the shader until nothing is left

A good start is to print the shader values with debugPrintfEXT, much nicer than looking at color values. Adding the debugPrintfEXT aaaand, the issue goes away, great, not like I wanted to debug it or anything.

Adding a printf changes the shader which affects the compilation process, so the changed result is not unexpected, though it’s much better when it works. So now we are stuck with observing pixel colors.

Bisecting a shader isn’t hard, especially if there is a reference GPU with the same capture opened to compare against. You delete pieces of the shader until the results become the same; when they do, you take one step back, start removing other expressions, and repeat until nothing can be removed.

First iteration of the elimination, most of the shader is gone (click me)
float _258 = 1.0 / gl_FragCoord.w;
vec4 _273 = vec4(_222, _222, _222, 1.0) * _258;
vec4 _280 = View.View_SVPositionToTranslatedWorld * vec4(gl_FragCoord.xyz, 1.0);
vec3 _284 = _280.xyz / vec3(_280.w);
vec3 _287 = _284 - View.View_PreViewTranslation;
vec3 _289 = normalize(-_284);
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y) * Material.Material_ScalarExpressions[0].x;
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec2 _312 = (_309.xy * vec2(2.0)) - vec2(1.0);
vec3 _331 = normalize(mat3(in_var_TEXCOORD10_centroid.xyz, cross(in_var_TEXCOORD11_centroid.xyz, in_var_TEXCOORD10_centroid.xyz) * in_var_TEXCOORD11_centroid.w, in_var_TEXCOORD11_centroid.xyz) * normalize((vec4(_312, sqrt(clamp(1.0 - dot(_312, _312), 0.0, 1.0)), 1.0).xyz * View.View_NormalOverrideParameter.w) + View.View_NormalOverrideParameter.xyz)) * ((View.View_CullingSign * View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID * 37u) + 4u].w) * float(gl_FrontFacing ? (-1) : 1));


vec2 _405 = in_var_TEXCOORD4.xy * vec2(1.0, 0.5);
vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), _405 + vec2(0.0, 0.5));
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-
float _447 = _331.y;


vec3 _531 = (((max(0.0, dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_447, _331.zx, 1.0))))) * View.View_IndirectLightingColorScale);

bool _1313 = TranslucentBasePass.TranslucentBasePass_Shared_Fog_ApplyVolumetricFog > 0.0;
vec4 _1364;

vec4 _1322 = View.View_WorldToClip * vec4(_287, 1.0);
float _1323 = _1322.w;
vec4 _1352;
if (_1313)
{
    _1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(((_1322.xy / vec2(_1323)).xy * vec2(0.5, -0.5)) + vec2(0.5), (log2((_1323 * View.View_VolumetricFogGridZParams.x) + View.View_VolumetricFogGridZParams.y) * View.View_VolumetricFogGridZParams.z) * View.View_VolumetricFogInvGridSize.z), 0.0);
}
else
{
    _1352 = vec4(0.0, 0.0, 0.0, 1.0);
}
_1364 = vec4(_1352.xyz + (in_var_TEXCOORD7.xyz * _1352.w), _1352.w * in_var_TEXCOORD7.w);

out_var_SV_Target0 = vec4(_531.x, _1364.w, 0, 1);


After that it became harder to reduce the code.

More elimination (click me)
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y);
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec3 _331 = normalize(mat3(in_var_TEXCOORD10_centroid.xyz, in_var_TEXCOORD11_centroid.www, in_var_TEXCOORD11_centroid.xyz) * normalize((vec4(_309.xy, 1.0, 1.0).xyz))) * ((View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID)].w) );

vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), in_var_TEXCOORD4.xy);
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-

vec3 _531 = (((dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_331.x, 1,1, 1.0)))) * View.View_IndirectLightingColorScale);

vec4 _1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(vec2(0.5), View.View_VolumetricFogInvGridSize.z), 0.0);

out_var_SV_Target0 = vec4(_531.x, in_var_TEXCOORD7.w, 0, 1);


And finally, the end result:

vec3 a = in_var_TEXCOORD10_centroid.xyz + in_var_TEXCOORD11_centroid.xyz;
float b = a.x + a.y + a.z + in_var_TEXCOORD11_centroid.w + in_var_TEXCOORD0[0].x + in_var_TEXCOORD0[0].y + in_var_PRIMITIVE_ID.x;
float c = b + in_var_TEXCOORD4.x + in_var_TEXCOORD4.y + in_var_LIGHTMAP_ID;

out_var_SV_Target0 = vec4(c, in_var_TEXCOORD7.w, 0, 1);

Nothing left but loading of varyings and the simplest operations on them in order to prevent their elimination by the compiler.

in_var_TEXCOORD7.w values are several orders of magnitude different from the expected ones and if any varying is removed the issue goes away. Seems like an issue with loading of varyings.

I created a simple standalone reproducer in vkrunner to isolate this case and make my life easier, but the same fragment shader passed without any trouble. This should have pointed me to undefined behavior somewhere.

Step 4 - Going deeper

Anyway, one major difference is the vertex shader: changing it does “fix” the issue. However, changing it also changed the varyings layout, and without changing the layout the issue remains. Thus the vertex shader is an unlikely culprit here.

Let’s take a look at the fragment shader assembly:

bary.f r0.z, 0, r0.x
bary.f r0.w, 3, r0.x
bary.f r1.x, 1, r0.x
bary.f r1.z, 4, r0.x
bary.f r1.y, 2, r0.x
bary.f r1.w, 5, r0.x
bary.f r2.x, 6, r0.x
bary.f r2.y, 7, r0.x
flat.b r2.z, 11, 16
bary.f r2.w, 8, r0.x
bary.f r3.x, 9, r0.x
flat.b r3.y, 12, 17
bary.f r3.z, 10, r0.x
bary.f (ei)r1.x, 16, r0.x
....

Loading varyings

bary.f loads interpolated varying, flat.b loads it without interpolation. bary.f (ei)r1.x, 16, r0.x is what loads the problematic varying, though it doesn’t look suspicious at all. Looking through the state which defines how varyings are passed between VS and FS also doesn’t yield anything useful.

Ok, but what does the second operand of flat.b r2.z, 11, 16 mean (the command format is flat.b dst, src1, src2)? The first one is the location from which the varying is loaded, and looking through Turnip’s code the second one should be equal to the first, otherwise “some bad things may happen”. I forced the sources to be equal, and nothing changed… What did I expect? After all, the standalone reproducer with the same assembly works fine.

The same description which promised bad things to happen also said that using two immediate sources for flat.b isn’t really expected. Let’s revert the change and get something like flat.b r2.z, 11, r0.x. Nothing changed, again.

Packing varyings

What else happens with these varyings? They are being packed, to remove their unused components! So let’s stop packing them. Aha! Now it works correctly!

Looking several times through the code, nothing is wrong. Changing the order of the varyings helps, aligning them helps, aligning only the flat varyings also helps. But the code is entirely correct.

One thing did change, though: while shuffling the order of the varyings I noticed that the resulting misrendering changed, so it’s likely not the order but the location that is cursed.

Interpolating varyings

What’s left? How the interpolation of varyings is specified. The code emits interpolation state only for used varyings, but looking closer, the “used varyings” part isn’t that clearly defined. Emitting the whole interpolation state fixes the issue!

The culprit is found: stale varying interpolation data is being read. The resulting fix can be found in tu: Fix varyings interpolation reading stale values + cosmetic changes

Correctly rendered draw call after the changes
Correctly rendered draw call after the changes

Injustice 2

Another corruption in the main menu.

Bad draw call on Turnip

How it should look:

The same draw call on RADV

The draw call inputs and state look good enough. So it’s time to bisect the shader.

Here is the output of the reduced shader on Turnip:

Enabling the display of NaNs and Infs shows that there are NaNs in the output on Turnip (NaNs have green color here):

While the correct rendering on RADV is:

Carefully reducing the shader further resulted in the following fragment which reproduces the issue:

r12 = uintBitsToFloat(uvec4(texelFetch(t34, _1195 + 0).x, texelFetch(t34, _1195 + 1).x, texelFetch(t34, _1195 + 2).x, texelFetch(t34, _1195 + 3).x));
....
vec4 _1268 = r12;
_1268.w = uintBitsToFloat(floatBitsToUint(r12.w) & 65535u);
_1275.w = unpackHalf2x16(floatBitsToUint(r12.w)).x;

On Turnip this _1275.w is NaN, while on RADV it is a proper number. Looking at assembly, the calculation of _1275.w from the above is translated into:

isaml.base0 (u16)(x)hr2.z, r0.w, r0.y, s#0, t#12
(sy)cov.f16f32 r1.z, hr2.z

In GLSL there is a read of uint32, stripping it of the high 16 bits, then converting the lower 16 bits to a half float.

In assembly the “read and strip the high 16 bits” part is done in a single command isaml, where the stripping is done via (u16) conversion.

At this point I wrote a simple reproducer to speed up iteration on the issue:

result = uint(unpackHalf2x16(texelFetch(t34, 0).x & 65535u).x);

After testing different values I confirmed that the (u16) conversion doesn’t strip the higher 16 bits, but clamps the value to a 16-bit unsigned integer. Running the reproducer on the proprietary driver showed that it doesn’t fold the u32 -> u16 conversion into isaml.
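
To make the difference concrete, here is a small standalone snippet (my own illustration, not driver code) contrasting the two behaviours: masking keeps only the low 16 bits, while a saturating conversion clamps anything above 65535.

// Toy illustration of "strip the high 16 bits" vs. "clamp to a 16-bit unsigned".
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t raw = 0x0003ABCDu;  // A value with bits set above bit 15.

  // What the GLSL intends: keep only the low 16 bits.
  const uint32_t masked = raw & 0xFFFFu;                      // 0xABCD

  // What a saturating u32 -> u16 conversion does instead.
  const uint32_t clamped = std::min<uint32_t>(raw, 0xFFFFu);  // 0xFFFF

  std::printf("raw=0x%08X masked=0x%04X clamped=0x%04X\n",
              (unsigned)raw, (unsigned)masked, (unsigned)clamped);
  // The two results only agree when the value already fits in 16 bits.
}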

Knowing that, the fix is easy: ir3: Do 16b tex dst folding only for floats

Monster Hunter: World

Main menu, again =) Before we even got here, two other issues had already been fixed, including one which seems like a HW bug that the proprietary driver is not aware of.

In this case of misrendering the culprit is a compute shader.

How it should look:

Compute shaders are generally easier to deal with since much less state is involved.

None of the debug options helped, and shader printf didn’t work at the time for some reason. So I decided to look at the shader assembly, trying to spot something funny.

ldl.u32 r6.w, l[r6.z-4016], 1
ldl.u32 r7.x, l[r6.z-4012], 1
ldl.u32 r7.y, l[r6.z-4032], 1
ldl.u32 r7.z, l[r6.z-4028], 1
ldl.u32 r0.z, l[r6.z-4024], 1
ldl.u32 r2.z, l[r6.z-4020], 1

Negative offsets into shared memory are not suspicious at all. Were they always there? How does it look right before being passed into our backend compiler?

vec1 32 ssa_206 = intrinsic load_shared (ssa_138) (base=4176, align_mul=4, align_offset=0)
vec1 32 ssa_207 = intrinsic load_shared (ssa_138) (base=4180, align_mul=4, align_offset=0)
vec1 32 ssa_208 = intrinsic load_shared (ssa_138) (base=4160, align_mul=4, align_offset=0)
vec1 32 ssa_209 = intrinsic load_shared (ssa_138) (base=4164, align_mul=4, align_offset=0)
vec1 32 ssa_210 = intrinsic load_shared (ssa_138) (base=4168, align_mul=4, align_offset=0)
vec1 32 ssa_211 = intrinsic load_shared (ssa_138) (base=4172, align_mul=4, align_offset=0)
vec1 32 ssa_212 = intrinsic load_shared (ssa_138) (base=4192, align_mul=4, align_offset=0)

Nope, no negative offsets, just a number of offsets close to 4096. Looks like offsets got wrapped around!

Looking at the ldl definition, it has 13 bits for the offset:

<pattern pos="0"           >1</pattern>
<field   low="1"  high="13" name="OFF" type="offset"/>    <--- This is the offset field
<field   low="14" high="21" name="SRC" type="#reg-gpr"/>
<pattern pos="22"          >x</pattern>
<pattern pos="23"          >1</pattern>

With the offset type being a signed integer (so one bit is for the sign), we are left with 12 bits of magnitude, meaning an upper bound of 4095. Case closed!
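
As a quick sanity check (my own arithmetic, not taken from the ISA docs), here is the range of a 13-bit signed immediate compared with the bound you would get by treating the field as unsigned:

// 13-bit signed immediate: one sign bit plus 12 magnitude bits.
#include <cstdio>

int main() {
  const int bits = 13;
  const int max_signed = (1 << (bits - 1)) - 1;  //  4095: largest positive offset
  const int min_signed = -(1 << (bits - 1));     // -4096
  const int unsigned_bound = (1 << bits) - 1;    //  8191: what (1 << 13) - 1 allows
  std::printf("signed range [%d, %d], unsigned bound %d\n",
              min_signed, max_signed, unsigned_bound);
}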

I know that there is an upper bound set on the offset during optimizations, but where and how is it set?

The upper bound is set via nir_opt_offsets_options::shared_max and is equal to (1 << 13) - 1, which we saw is incorrect. Who set it?

Subject: [PATCH] ir3: Limit the maximum imm offset in nir_opt_offset for
 shared vars

STL/LDL have 13 bits to store imm offset.

Fixes crash in CS compilation in Monster Hunter World.

Fixes: b024102d7c2959451bfef323432beaa4dca4dd88
("freedreno/ir3: Use nir_opt_offset for removing constant adds for shared vars.")

Signed-off-by: Danylo Piliaiev
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14968>
---
 src/freedreno/ir3/ir3_nir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

@@ -124,7 +124,7 @@ ir3_optimize_loop(struct ir3_compiler *compiler, nir_shader *s)
           */
          .uniform_max = (1 << 9) - 1,

-         .shared_max = ~0,
+         .shared_max = (1 << 13) - 1,

Weeeell, totally unexpected, it was me! Fixing the same game, maybe even the same shader…

Let’s set the shared_max to a correct value . . . . . Nothing changed, not even the assembly. The same incorrect offset is still there.

After a bit of wandering around the optimization pass, it was found that in one case the upper bound is not enforced correctly. Fixing it fixed the rendering.

The final changes were:

by Danylo Piliaiev at January 04, 2023 11:00 PM

Brian Kardell

The NASCAR Model

The NASCAR Model

Let's talk about interesting solutions to problems and how we can (today) begin to change things...

Over the past two years I’ve gone into a whole lot about the web ecosystem, and in particular the economics of it. I’ve made the case that the model we’ve built and grown for the last 20 years is ultimately fatally flawed. It is fundamentally dependent on a very few organizations. It is already inadequate and showing signs of strain.

I've been making a case for a while now that we need to re-think how we look at and fund web engines.

It seems really easy at some level, right? Just diversify the investment that is sponsoring work on engines. Every tech company in the world benefits from these projects, and lots of them make money hand over fist - giving back should be easy? But yeah, it's (currently) not and this should surprise no one. It isn't the norm, and companies don't like to spend money that they don't have to. Neither do we. We don't (currently) want to buy a browser or pay to use the open source project if we can avoid it; it's not the norm we're used to.

I was in the middle of trying to write something up on all of this just before the holidays (which got strange for me). I wanted to explain one such opportunity that we're building now, attempting to bend the arc of things in a healthier way. Before I could get back to post something, I saw Zach Leatherman post a thing on Mastodon:

the nascar car advertising model but for t-shirts on speakers at tech conferences

In reply, Chris Coyier shared a screenshot of himself wearing an F1 jacket that's full of logos like Red Bull, Tezos and Oracle.

To which I replied...

This, but for browser funding

While it sounds funny, and it obviously needs more words to explain it, I'm kind of serious.

It makes sense to look at history and try to learn from what worked (or didn't) elsewhere. It's interesting how many things seem to have followed a trend of getting a single original backer and ultimately diversifying. Lots of early TV shows had individual sponsors - soaps like Lux and cigarette companies like Winston, for example were, for decades even, lucrative single sources of funding.

While it's not what Zach was saying, it's interesting that for over 30 years, the NASCAR Cup Series was called the Winston Cup, until that became a problem. In 2004 it became the Nextel Cup, then the Sprint Cup, and then the Monster Energy Cup. Finally, in 2020 it just became the NASCAR Cup Series with a tiered sponsorship model and diverse sponsors.

You know, just like a lot of tech events.

Have a look at this page for AWS summit in DC sponsors for example. Or check out the video in this page for Salesforce's Dreamforce Sponsors. These pages are basically the NASCAR model. They are similarly tiered: you get different things for different levels of sponsorship. And those, I think, learn some things from what works for sports teams too. Here in Pittsburgh, for example, you'll see and hear ads that say "Proud Partner of the Pittsburgh Steelers". It costs money to be able to say that, and one of the things you get for that money is the ability to use the team's logo in your ad.

Spot the difference

A single event sponsorship can range from several thousand to literally millions of dollars. So, why do we get hundreds of tech companies to invest in sponsoring lots of events, or even putting their name on an actual NASCAR jacket (or car), but very few invest in web browsers?

I think the answer is pretty simple: Different departments and budgets.

Funding (from non-vendors) that goes into the browser engines today is coming from engineering departments of downstream products like Microsoft Edge or ad blocking extensions or the PlayStation 5. That's great, but engineering departments are constantly too tight already. Engineering is about efficiency, right? Do more with less.

Those sponsorships, on the other hand, are often marketing dollars (just like default search). Entirely different budgets/intents.

But, if you think about it, shouldn't those companies get some NASCAR-style credit for their browser support? Don't you think it's worth mentioning that Bloomberg Tech was really key to making CSS Grid Happen? Or that EyeO was really key to making :has() happen? Isn't that a kind of marketing? Don't we kind of need both kinds of investment?

Hey, there it is!

I haven't had much of a chance to talk about this yet, but have a look at this...

A screenshot of the Wolvic browser's sponsors page...

In February 2022, in a joint announcement with Mozilla, Igalia announced we'd be taking over stewardship of the project "Firefox Reality" which would now be called "Wolvic" with the tagline "Join the Pack". That's not just a cute slogan, it's kind of the whole idea. Rather than a singularly badass wolf, we need a pack.

We're trying some really interesting ideas with Wolvic to reimagine how we can fund the development of the browser, upstream work, and manage priorities. I talked about our plans in a small 'launch' event recently which you can watch if you like, but a brief summary is:

  1. We're partnering with some device manufacturers directly at several different levels. On some (Lynx R1, Huawei Lens, HTC Vive) we will be the default browser and they will commit to some funding, and we'll work with them closely.
  2. We're exploring other ideas with a new Wolvic collective. There are a lot of details there explaining all this, but just like other sponsorships, you get different things. One of those is being on a Sponsors page which will be in the browser itself too. Another is that you can help us figure out priorities.

I would encourage you to have a look at the collective page. Share it with your engineering department. Share it with your marketing department.

Or, just help be a part of sparking the change and contribute yourself. I think this isn't the worst idea - many of us (myself included) owe our entire careers to the web - your mileage and personal comfort levels may vary of course, but it doesn't feel wrong for me to throw a cup of coffee's worth of funding back to it every month.

I hope you'll help us explore these ideas, either through some funding, or just sharing the idea and encouraging organizations to participate (or all of the above).

January 04, 2023 05:00 AM

January 03, 2023

Víctor Jáquez

GStreamer compilation with third party libraries

Suppose that you have to hack a GStreamer element which requires a library that is not (yet) packaged by your distribution, nor wrapped as a Meson subproject. What do you do?

In our case, we needed the latest versions of Vulkan-Headers, Vulkan-Loader and Vulkan-Tools, which are interrelated CMake projects.

For these cases, GStreamer’s uninstalled development scripts can use a special directory: gstreamer/prefix. As the README.md says:

NOTE: In the development environment, a fully usable prefix is also configured in gstreamer/prefix where you can install any extra dependency/project.

This means that the gstenv.py script (responsible for setting up the uninstalled development environment) will add:

  • gstreamer/prefix/bin in PATH for executable files.
  • gstreamer/prefix/lib and gstreamer/prefix/share/gstreamer-1.0 in GST_PLUGIN_PATH, for out-of-tree elements.
  • gstreamer/prefix/lib in GI_TYPELIB_PATH for GObject Introspection metadata.
  • gstreamer/prefix/lib/pkgconfig in PKG_CONFIG_PATH for third party dependencies (our case!)
  • gstreamer/prefix/etc/xdg in XDG_CONFIG_DIRS for XDG-compliant configuration files.
  • gstreamer/prefix/lib and gstreamer/prefix/lib64 in LD_LIBRARY_PATH for third party libraries.

Therefore, the general idea is to compile those third-party libraries with their installation prefix set to gstreamer/prefix.

In our case, the Vulkan repositories are interrelated, so they need to be compiled in a certain order. Also, for self-containment, we decided to clone them in gstreamer/subprojects.

Vulkan-Headers

$ cd ~/gst/gstreamer/subprojects
$ git clone git@github.com:KhronosGroup/Vulkan-Headers.git
$ cd Vulkan-Headers
$ mkdir build
$ cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --install

Vulkan-Loader

$ cd ~/gst/gstreamer/subprojects
$ git clone git@github.com:KhronosGroup/Vulkan-Loader.git
$ cd Vulkan-Loader
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug -DVULKAN_HEADERS_INSTALL_DIR=/home/vjaquez/gst/gstreamer/prefix -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --install

Vulkan-Tools

$ cd ~/gst/gstreamer/subprojects
$ git clone git@github.com:KhronosGroup/Vulkan-Tools.git
$ cd Vulkan-Tools
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug -DVULKAN_HEADERS_INSTALL_DIR=/home/vjaquez/gst/gstreamer/prefix -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --install

Right now we have the Vulkan headers and the Vulkan loader pkg-config file in place. And we should be able to compile GStreamer. Right?

Not exactly, because gstenv.py only sets the environment variables for the development environment, not for the GStreamer compilation. But the solution is simple, since we have everything set up in the proper place: just set PKG_CONFIG_PATH when executing meson setup:

$ PKG_CONFIG_PATH=/home/vjaquez/gst/gstreamer/prefix/lib/pkgconfig meson setup --buildtype=debug build

by vjaquez at January 03, 2023 06:35 PM

December 30, 2022

Brian Kardell

Surreal Holiday

Surreal Holiday

Hello, yes, I had a heart attack.

In the middle of the night last Tuesday (technically Wednesday morning), I had a heart attack. Ambulance, EKGs - you get the idea. In the end they placed two stents in my heart and I spent two days in the ICU, one more in the general hospital before being sent home Friday afternoon.

Mainly I just wanted to say...

  • Sorry if I haven't replied somewhere
  • That's why
  • Thanks for the concern if you were worried
  • I'm gonna be ok
  • I am mostly ok now. I just happen to also have planned holidays, so it's been a good spot to disconnect for a bit.
  • Yes, I quit smoking (8 days cold and counting).
  • ... The whole thing was just... very weird

Surreal

In my experience - our brains are actually not very good at imagining the big "smack you in the face" things like this. When they happen I'm often like "Oh wow!... That was nothing like I imagined it would be."

Yes, I have imagined many of the bits here many times (I have an anxiety disorder) and this wasn't an exception. Literally no aspect of what unfolded was anything like I imagined it would be.

Short of literally sitting and imagining bad scenarios: I'm sure you think you have some idea of what a heart attack would feel like. Like, vaguely, at least. But... maybe you don't? Maybe it is very different from that? It certainly was for me.

I guess the natural question is "different how?" and the answer is "yes".

Different like zucchini and cucumber, maybe? Superficially similar, but much different. If you bit into something thinking it contained lots of cucumber and that turned out to be zucchini (or vice versa) you would say "Oh wow!... That was nothing like I imagined it would be."

Anyway... I've been sitting here for way too long trying to think of how to explain all of the ways in which none of this was what I expected or how surreal it all felt, but honestly, I think I can't.

I spent a lot of time thinking about my family. Would they be ok, even if I wasn't? Pretty shitty for them to have this happen at Christmas. It was the middle of the night, most of them wouldn't find out for a while still probably. What would they be finding out?

I don't know, it was just many layers of strange, really.

One especially surreal moment I'd like to write down: As the doctors who were getting ready to put a stent in my heart were chatting, one of them said something like...

Man, can you imagine? So sad. Just before 50. He was looking great just yesterday, and then you just die. So sad.

Here's the thing: They weren't talking about me and I knew it. But, lying there on the table, not so short of 50 myself, at that moment... It totally could have... It still might, you know? They hadn't gone in yet. Did they just not think of that?

What they were talking about was the fact that Franco Harris had just died the same morning. It was just a few days shy of the 50th anniversary of The Immaculate Reception, with a rematch game and so many events planned around it and him.

You don't really have to know about any of this, I guess - except that Franco's impact in Pittsburgh was huge and began just before I was born. My grandparents were part of the large Italian community that loved the Steelers - "Franco's Italian Army". I have a lot of memories of shared moments with family around the Steelers - ultimately because of all of this. I always felt a kind of connection with him. It was very strange to think that we both seemed pretty great just yesterday, and now here we were.

Anyways.... It was weird.

Happily for my family, I made it through and I'm doing great.

I was home for the game too, which I was able to watch with my son. It was great - and happily for Pittsburgh, we won.

December 30, 2022 05:00 AM

December 29, 2022

Eric Meyer

Styling a ‘pre’ That Contains a ‘code’

I’ve just committed my first :has() selector to production CSS and want to share it, so I will!  But first, a little context that will feel familiar to nerds like me who post snippets of computer code to the Web, and have markup with this kind of tree structure (not raw source; we’d never allow this much formatting whitespace inside a <pre>):

<pre>
	<code>
		{{content goes here}}
	</code>
</pre>

It’s nicely semantic, indicating that the contents of the <pre> are in fact code of some kind, as opposed to just plain preformatted text like, say, output from a shell script or npm job, which isn’t code and thus should, perhaps, be styled in a distinct way.

Given cases like that, you’ve probably written rules that go a little something like this:

pre > code {
	display: block;
	background: #f1f1f1;
	padding: 1.33em;
	border-radius: 0.33em;
}

Which says: if a <code> element is the child of a <pre> element, then turn the <code> into a block box and give it some background color, padding, etc.

It works out fine, but it always feels a little fragile.  What if there are already <pre> styles that throw off the intended effect on code blocks?  Do we get into specificity wars between those rules and the code-block rules?  Find other ways to figure out what should be adjusted in which cases?  Those are probably manageable problems, but it would be better not to have them.

It’s also, when you step back for a moment, a little weird.  The <pre> is already a block box and the container for the code; why aren’t we styling that?  Because unless you wrote some scripting, whether server-side or client-side, to add a class to the <pre> in scenarios like this, there wasn’t a way to address it directly based on its structural contents.

There is now:

pre:has(> code) {
	background: #f1f1f1;
	padding: 1.33em;
	border-radius: 0.33em;
}

Now I’m styling any <pre> that has a <code> as a child, which is why I took out the display: block.  I don’t need it any more!

But suppose you have a framework or ancient script or something that inserts classed <span> elements between the <pre> and the <code>, like this:

<pre>
	<span class="pref">
		<code>
			{{content goes here}}
		</code>
	</span>
</pre>

First of all, ew, address the root problem here if at all possible.  But if that isn’t possible for whatever reason, you can still style the <pre> based on the presence of a <code> by removing the child combinator from the selector.  In other words:

pre:has(code) {
	background: #f1f1f1;
	padding: 1.33em;
	border-radius: 0.33em;
}

Now I’m styling any <pre> that has a <code> as a descendant  —  child, grandchild, great-great-great-great grandchild, whatever.

Which is not only more robust, it’s a lot more future-proof: even if some hot new front-end framework that sticks in <span> elements or something gets added to the site next year, this style will just keep chugging along, styling <pre> elements that contain <code> elements until long after that hot new framework has cooled to ash and been chucked into the bit-bucket.

There is one thing to keep in mind here, as pointed out by Emmanuel over on Mastodon: if you have a scenario where <pre> elements can contain child text nodes in addition to <code> blocks, the <pre> will still be styled in its entirety.  Consider:

<pre>
	{{some text is here}}
	<code>
		{{content goes here}}
	</code>
	{{or text is here}}
</pre>

pre:has(> code) and pre:has(code) will still match the <pre> element here, which means all of the text (both inside and outside the <code> elements)  will sit inside the light-gray box with the rounded corners.  If that’s fine for your use case, great!  If not, then don’t use :has() in this scenario, and stick with the pre > code {…} or pre code {…} approach of yore.  That will style just the <code> elements instead of the whole <pre>, as in the example at the beginning of this article.

As I write this, the code hasn’t gone into production on wpewebkit.org yet, but I think it will within the next week or two, and will be my first wide-production use of :has().  I feel like it’s a great way to close out 2022 and kick off 2023, because I am that kind of nerd.  If you are too, I hope you enjoyed this quick dive into the world of :has().


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at December 29, 2022 03:36 PM

December 21, 2022

José Dapena

Native call stack profiling (3/3): 2022 work in V8

This is the last blog post of the series. In first post I presented some concepts of call stack profiling, and why it is useful. In second post I reviewed Event Tracing for Windows, the native tool for the purpose, and how it can be used to trace Chromium.

This last post will review the work done in 2022 to improve the support in V8 of call stack profiling in Windows.

I worked on several of the fixes this year. This work has been sponsored by Bloomberg and Igalia.

This work was presented as a lightning talk in BlinkOn 17.

Some bad news to start… and a fix

In March I started working on the report that Windows event traces were not properly resolving the JavaScript symbols.

After some bisecting I found this was a regression introduced by this commit, which changed the --js-flags handling to a later stage. This happened to be after V8 initialization, so the code that would enable instrumentation would not consider the flag.

The fix I implemented moved flags processing to happen right before platform initialization, so instrumentation worked again.

Simplified method names

Another fix I worked on was improving the generation of method names. Windows tracing would show a quite redundant description of each level, and that was making analysis more difficult.

Before my work, the entries would look like this:

string-tagcloud.js!LazyCompile:~makeTagCloud- string-tagcloud.js:231-232:22 0x0

After my change, now it looks like this:

string-tagcloud.js!makeTagCloud-231:22 0x0

The fix adds a specific implementation for ETW. Instead of reusing the method name that is also used for Perf, it has a specific implementation that takes into account what the ETW backend already exports, to avoid redundancy. It also takes advantage of the existing method DebugNameCStr to retrieve inferred method names in case there is no name available.

Problem with Javascript code compiled before tracing

The way V8 ETW worked was that, when tracing was ongoing and a new function was compiled in JIT, it would emit information to ETW.

This implied a big problem. If a function was compiled by V8 before tracing started, then ETW would not properly resolve the function names so, when analyzing the traces, it would not be possible to know which function was called at any of the samples.

The solution is conceptually simple. When tracing starts, V8 traverses the living JavaScript contexts and emits all the symbols. This adds noise to the tracing, as it is an expensive process. But, as it happens at the start of the tracing, it is very easy to isolate in the captured trace.

And a performance fix

I also fixed a huge performance penalty when tracing code from snapshots, caused by repeatedly calculating the end line numbers of code instead of caching them.

Initialization improvements

Paolo Severini improved the initialization code, so the initialization of an ETW session is lighter, and tracing is started and stopped correctly.

Benchmarking ETW overhead

After all these changes I did some benchmarking with and without ETW. The goal was to know whether it would be good to enable ETW support in V8 by default, without requiring any JS flag to be passed.

With SunSpider in a Windows 64-bit build:

Image showing slight overhead with ETW and bigger one with interpreted frames.

Other benchmarks I tried gave similar numbers.

So far, on the 64-bit architecture I could not detect any overhead from enabling ETW support when recording is not happening, and the cost when it is enabled is very low.

Though, when combined with interpreted frames native stack, the overhead is close to 10%. This was expected as explained here.

So, good news so far. We still need to benchmark the 32-bit architecture to see if the impact is similar.

Try it!

The work described in this post is available in V8 10.9.0. I hope you enjoy the improvements, and I especially hope these tools help in the investigation of performance issues around JavaScript, in NodeJS, Google Chrome or Microsoft Edge.

What next?

There is still a lot of things to do, and I hope I can continue working on improvements for V8 ETW support next year:

  • First, finishing the benchmarks, and considering to enable ETW instrumentation by default in V8 and derivatives.
  • Add full support for WASM.
  • Bugfixing, as we still see segments missing in certain benchmarks.
  • Create specific events for when the JIT information of already compiled symbols is sent to ETW, to make it easier to differentiate it from the code compiled while recording a trace.

If you want to track the work, keep an eye on V8 issue 11043.

The end

This is the last post in the series.

Thanks to Bloomberg and Igalia for sponsoring my work in ETW Chromium integration improvements!

by José Dapena Paz at December 21, 2022 08:30 AM

December 18, 2022

Víctor Jáquez

Video decoding in GStreamer with Vulkan Video extension (part 2)

It has been a while since I reported on my tinkering with the Vulkan Video provisional extension. Now the specification will have its final release soonish, and there has also been more engagement within the open source communities, such as the work-in-progress FFMpeg implementation by Lynne (please, please, read that post), and the also work-in-progress Mesa 3D drivers for both AMD and Intel by Dave Airlie! Along with the well-known NVIDIA beta drivers for Vulkan.

From our side, we have been trying to provide an open source alternative to the video parser used by the Conformance Test Suite and the NVIDIA vk_video_samples, using GStreamer: GstVkVideoParser, which intends to be a drop-in replacement of the current proprietary parser library.

Along the way, we have sketched out Vulkan Video support in gfxreconstruct, for getting traces of the API usage. Sadly, it’s kind of bit-rotten right now, even more so because the specification has changed since then.

Regarding the H.264 decoder for GStreamer, we have just restarted hacking on it. The merge request was moved to the monorepo, but for the sake of the much needed complete rewrite, we changed the branch to this one (vkh264dec). We needed to rewrite it because, besides the specification updates, we have learned many things along the journey, such as the out-of-band parameters update, Vulkan’s recommendation to pre-allocate memory as much as possible, the DPB/references handling, the debate about buffer vs. slice uploading, and other friction points that Lynne has spotted for future early adopters.

The way to compile it is to grab the branch and compile it as GStreamer is usually compiled, with Meson:

meson setup builddir -Dgst-plugins-bad:vulkan-video=enabled --buildtype=debug
ninja -C builddir

And run simple pipelines such as

gst-launch-1.0 filesrc location=INPUT ! parsebin ! vulkanh264dec ! fakesink -v

Our objective is to have a functional demo for the next Vulkanised in February. We are very ambitious: we want it to work on Linux and Windows, and on as many GPUs as possible. Wish us luck. And happy December festivities!

by vjaquez at December 18, 2022 01:14 PM

December 15, 2022

Andy Wingo

ephemeral success

Good evening, patient hackers :) Today finishes off my series on implementing ephemerons in a garbage collector.

Last time, we had a working solution for ephemerons, but it involved recursively visiting any pending ephemerons from within the copy routine—the bit of a semi-space collector that is called when traversing the object graph and we see an object that we hadn't seen yet. This recursive visit could itself recurse, and so we could overflow the control stack.

The solution, of course, is "don't do that": instead of visiting recursively, enqueue the ephemeron for visiting later. Iterate, don't recurse. But here we run into a funny problem: how do we add an ephemeron to a queue or worklist? It's such a pedestrian question ("just... enqueue it?") but I think it illustrates some of the particular concerns of garbage collection hacking.

speak, memory

The issue is that we are in the land of "can't use my tools because I broke my tools with my tools". You can't make a standard List<T> because you can't allocate list nodes inside the tracing routine: if you had memory in which you could allocate, you wouldn't be calling the garbage collector.

If the collector needs a data structure whose size doesn't depend on the connectivity of the object graph, you can pre-allocate it in a reserved part of the heap. This adds memory overhead, of course; for a 1000 MB heap, say, you used to be able to make graphs 500 MB in size (for a semi-space collector), but now you can only do 475 MB because you have to reserve 50 MB (say) for your data structures. Another way to look at it is, if you have a 400 MB live set and then you allocate 2GB of garbage, if your heap limit is 500 MB you will collect 20 times, but if it's 475 MB you'll collect 26 times, which is more expensive. This is part of why GC algorithms are so primitive; implementors have to be so stingy that we don't get to have nice things / data structures.
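
Spelling out the arithmetic behind those numbers (a back-of-the-envelope sketch of the figures above, mine rather than the original text's): each collection frees the headroom between the live set and the heap limit, so a smaller limit means less garbage allocated per cycle and therefore more collections.

// Rough collection-count estimate: garbage volume divided by per-cycle headroom.
#include <cstdio>

int main() {
  const double live_mb = 400.0;      // live set
  const double garbage_mb = 2000.0;  // total garbage allocated over the run
  const double limits_mb[] = {500.0, 475.0};
  for (double limit_mb : limits_mb) {
    double headroom = limit_mb - live_mb;        // reclaimed per collection
    double collections = garbage_mb / headroom;  // ~20 vs ~26.7
    std::printf("limit %.0f MB -> about %.1f collections\n", limit_mb, collections);
  }
}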

However in the case of ephemerons, we will potentially need one worklist entry per ephemeron in the object graph. There is no optimal fixed size for a worklist of ephemerons. Most object graphs will have no or few ephemerons. Some, though, will have practically the whole heap.

For data structure needs like this, the standard solution is to reserve the needed space for a GC-managed data structure in the object itself. For example, for concurrent copying collectors, the GC might reserve a word in the object for a forwarding pointer, instead of just clobbering the first word. If you needed a GC-managed binary tree for a specific kind of object, you'd reserve two words. Again there are strong pressures to minimize this overhead, but in the case of ephemerons it seems sensible to make them pay their way on a per-ephemeron basis.

so let's retake the thing

So sometimes we might need to put an ephemeron in a worklist. Let's add a member to the ephemeron structure:

struct gc_ephemeron {
  struct gc_obj header;
  int dead;
  struct gc_obj *key;
  struct gc_obj *value;
  struct gc_ephemeron *gc_link; // *
};

Incidentally this also solves the problem of how to represent the struct gc_pending_ephemeron_table; just reserve 0.5% of the heap or so as a bucket array for a buckets-and-chains hash table, and use the gc_link as the intrachain links.

struct gc_pending_ephemeron_table {
  struct gc_ephemeron *resolved;
  size_t nbuckets;
  struct gc_ephemeron *buckets[0];
};

An ephemeron can end up in three states, then:

  1. Outside a collection: gc_link can be whatever.

  2. In a collection, the ephemeron is in the pending ephemeron table: gc_link is part of a hash table.

  3. In a collection, the ephemeron's key has been visited, and the ephemeron is on the to-visit worklist; gc_link is part of the resolved singly-linked list.

Instead of phrasing the interface to ephemerons in terms of visiting edges in the graph, the verb is to resolve ephemerons. Resolving an ephemeron adds it to a worklist instead of immediately visiting any edge.

struct gc_ephemeron **
pending_ephemeron_bucket(struct gc_pending_ephemeron_table *table,
                         struct gc_obj *key) {
  return &table->buckets[hash_pointer(key) % table->nbuckets];
}

void add_pending_ephemeron(struct gc_pending_ephemeron_table *table,
                           struct gc_obj *key,
                           struct gc_ephemeron *ephemeron) {
  struct gc_ephemeron **bucket = pending_ephemeron_bucket(table, key);
  ephemeron->gc_link = *bucket;
  *bucket = ephemeron;
}

void resolve_pending_ephemerons(struct gc_pending_ephemeron_table *table,
                                struct gc_obj *obj) {
  struct gc_ephemeron **link = pending_ephemeron_bucket(table, obj);
  struct gc_ephemeron *ephemeron;
  while ((ephemeron = *link)) {
    if (ephemeron->key == obj) {
      *link = ephemeron->gc_link;
      add_resolved_ephemeron(table, ephemeron);
    } else {
      link = &ephemeron->gc_link;
    }
  }
}
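
One helper used above but not shown is add_resolved_ephemeron. A minimal sketch, assuming the resolved worklist is simply the singly-linked list threaded through gc_link (state 3 above):

// Sketch: push an ephemeron onto the resolved worklist, reusing gc_link
// as the intrusive list link.
void add_resolved_ephemeron(struct gc_pending_ephemeron_table *table,
                            struct gc_ephemeron *ephemeron) {
  ephemeron->gc_link = table->resolved;
  table->resolved = ephemeron;
}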

Copying an object may add it to the set of pending ephemerons, if it is itself an ephemeron, and also may resolve other pending ephemerons.

void resolve_ephemerons(struct gc_heap *heap, struct gc_obj *obj) {
  resolve_pending_ephemerons(heap->pending_ephemerons, obj);

  struct gc_ephemeron *ephemeron;
  if ((ephemeron = as_ephemeron(forwarded(obj)))
      && !ephemeron->dead) {
    if (is_forwarded(ephemeron->key))
      add_resolved_ephemeron(heap->pending_ephemerons,
                             ephemeron);
    else
      add_pending_ephemeron(heap->pending_ephemerons,
                            ephemeron->key, ephemeron);
  }
}

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  ...
  resolve_ephemerons(heap, obj); // *
  return new_obj;
}

Finally, we need to add something to the core collector to scan resolved ephemerons:

int trace_some_ephemerons(struct gc_heap *heap) {
  struct gc_ephemeron *resolved = heap->pending_ephemerons->resolved;
  if (!resolved) return 0;
  heap->pending_ephemerons->resolved = NULL;
  while (resolved) {
    resolved->key = forwarded(resolved->key);
    visit_field(&resolved->value, heap);
    resolved = resolved->gc_link;
  }
  return 1;
}

void kill_pending_ephemerons(struct gc_heap *heap) {
  struct gc_ephemeron *ephemeron;
  struct gc_pending_ephemeron_table *table = heap->pending_ephemerons;
  for (size_t i = 0; i < table->nbuckets; i++) {
    for (struct gc_ephemeron *chain = table->buckets[i];
         chain;
         chain = chain->gc_link)
      chain->dead = 1;    
    table->buckets[i] = NULL;
  }
}

void collect(struct gc_heap *heap) {
  flip(heap);
  uintptr_t scan = heap->hp;
  trace_roots(heap, visit_field);
  do { // *
    while(scan < heap->hp) {
      struct gc_obj *obj = (struct gc_obj*)scan;
      scan += align_size(trace_heap_object(obj, heap, visit_field));
    }
  } while (trace_some_ephemerons(heap)); // *
  kill_pending_ephemerons(heap); // *
}

The result is... not so bad? It makes sense to make ephemerons pay their own way in terms of memory, having an internal field managed by the GC. In fact I must confess that in the implementation I have been woodshedding, I actually have three of these damn things; perhaps more on that in some other post. But the perturbation to the core algorithm is perhaps less than the original code. There are still some optimizations to make, notably postponing hash-table lookups until the whole strongly-reachable graph is discovered; but again, another day.

And with that, thanks for coming along with me for my journeys into ephemeron-space.

I would like to specifically thank Erik Corry and Steve Blackburn for their advice over the years, and patience with my ignorance; I can only imagine that it's quite amusing when you have experience in a domain to see someone new and eager come in and make many of the classic mistakes. They have both had a kind of generous parsimony in the sense of allowing me to make the necessary gaffes but also providing insight where it can be helpful.

I'm thinking of many occasions but I especially appreciate the advice to start with a semi-space collector when trying new things, be it benchmarks or test cases or API design or new functionality, as it's a simple algorithm, hard to get wrong on the implementation side, and perfect for bringing out any bugs in other parts of the system. In this case the distinction between fromspace and tospace pointers makes a material difference to how you structure the ephemeron implementation; it's not something you can do just in a trace_heap_object function, as you don't have the old pointers there, and the pending ephemeron table is indexed by old object addresses.

Well, until some other time, gentle hackfolk, do accept my sincerest waste disposal greetings. As always, yours in garbage, etc.,

by Andy Wingo at December 15, 2022 09:21 PM

December 14, 2022

Eric Meyer

Highlighting Image Accessibility on Mastodon

Some time ago, I published user styles to visually highlight images on Twitter that didn’t have alt text  —  this was in the time before the “ALT” badge and the much improved accessibility features Twitter deployed in the pre-Musk era.

With the mass Mastodon migration currently underway in the circles I frequent, I spend more time there, and I missed the quick visual indication of images having alt text, as well as my de-emphasis styles for those images that don’t have useful alt text.  So I put the two together and wrote a new user stylesheet, which I apply via the Stylus browser extension.  If you’d like to also use it, please do so!

/* ==UserStyle==
@name           mastodon.social - 12/14/2022, 9:37:56 AM
@namespace      github.com/openstyles/stylus
@version        1.0.0
@description    Styles images posted on mastodon.social based on whether or not they have alt text.
@author         @Meyerweb@mastodon.social, https://meyerweb.com/
==/UserStyle== */

@-moz-document regexp(".*\.social.*") {
	.media-gallery__item-thumbnail img:not([alt]) {
		filter: grayscale(1) contrast(0.5);
	}
	.media-gallery__item-thumbnail img:not([alt]):hover {
		filter: none;
	}
	.media-gallery__item-thumbnail:has(img[alt]) {
		position: relative;
	}
	.media-gallery__item-thumbnail:has(img[alt])::after {
		content: "ALT";
		position: absolute;
		bottom: 0;
		right: 0;
		border-radius: 5px 0 0 0;
		border: 1px solid #FFF3;
		color: #FFFD;
		background: #000A;
		border-width: 1px 0 0 1px;
		padding-block: 0.5em 0.4em;
		padding-inline: 0.7em 0.8em;
		font: bold 90% sans-serif;
	}
}

Because most of my (admittedly limited and sporadic) Mastodon time is spent on mastodon.social, the styles I wrote are attuned to mastodon.social’s markup.  I set things up so these styles should be applied to any *.social site, but only those who use the same markup mastodon.social uses will get the benefits.  pinafore.social, for example, has different markup (I think they’re using Svelte).

You can always adapt the selectors to fit the markup of whatever Mastodon instance you use, if you’re so inclined.  Please feel free to share your changes in the comments, or in posts of your own.  And with any luck, this will be a temporary solution before Mastodon adds these sorts of things natively, just as Twitter eventually did.


Addendum: It was rightly pointed out to me that Firefox does not, as of this writing, support :has() by default.  If you want to use this in Firefox, as I do, set the layout.css.has-selector.enabled flag in about:config to true.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at December 14, 2022 07:48 PM

December 13, 2022

Jesse Alama

A comprehensive, authoritative FAQ on decimal arithmetic

Mike Cowlishaw’s FAQ on decimal arithmetic

If you’re interested in decimal arithmetic in computers, you’ve got to check out Mike Cowlishaw’s FAQ on the subject. There’s a ton of insight to be had there. If you like the kind of writing that makes you feel smarter as you read it, this one is worth your time.

For context: Cowlishaw is the editor of the 2008 edition of the IEEE 754 standard, updating the 1985 and 1987 standards. The words thus carry a lot of authority, and it would be quite unwise to ignore Mike in these matters.

If you prefer similar information in article form, take a look at Mike’s Decimal Floating-Point: Algorism for Computers. (Note the delightful use of algorism. Yes, it’s a word.)

The FAQ focuses mainly on floating-point decimal arithmetic, not arbitrary-precision decimal arithmetic (which is what one might immediately think of when one hears decimal arithmetic). Arbitrary-precision decimal arithmetic is a whole other ball of wax. In that setting, we’re talking about sequences of decimal digits whose length cannot be specified in advance. Proposals such as decimal128 are about a fixed bit width—128 bits—which allows for a lot of precision, but not arbitrary precision.

One crucial insight I take away from Mike’s FAQ—a real misunderstanding on my part which is a bit embarrassing to admit—is that decimal128 is not just a 128-bit version of the same old binary floating-point arithmetic we all know about (and might find broken). It’s not as though adding more bits meets the demands of those who want high-precision arithmetic. No! Although decimal128 is a fixed-width encoding (128 bits), the underlying encoding is decimal, not binary. That is, decimal128 isn’t just binary floating-point with extra juice. Just adding bits won’t unbreak busted floating-point arithmetic; some new ideas are needed. And decimal128 is a way forward. It is a new (well, relatively new) format that addresses all sorts of use cases that motivate decimal arithmetic, including needs in business, finance, accounting, and anything that uses human decimal numbers. What probably led to my confusion is thinking that the adjective floating-point, regardless of what it modifies, must be some kind of variation of binary floating-point arithmetic.
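
A minimal illustration of that difference, assuming a compiler with decimal floating-point support (for example GCC's _Decimal64 extension and its dd literal suffix, which are not available everywhere):

#include <stdio.h>

int main(void) {
  double     b = 0.1 + 0.2;       // binary: neither 0.1 nor 0.2 is exactly representable
  _Decimal64 d = 0.1dd + 0.2dd;   // decimal: both are exact, and so is the sum
  printf("binary  0.1 + 0.2 == 0.3 ? %s\n", b == 0.3   ? "yes" : "no");  // no
  printf("decimal 0.1 + 0.2 == 0.3 ? %s\n", d == 0.3dd ? "yes" : "no");  // yes
  return 0;
}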

December 13, 2022 08:36 AM

December 12, 2022

Ricardo García

Vulkan extensions Igalia helped ship in 2022

Igalia Logo next to the Vulkan Logo

The end of 2022 is very close so I’m just in time for some self-promotion. As you may know, the ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions.

In addition to our CTS work, many times we review the Vulkan specification text from those extensions we develop tests for. We also do the same for other extensions and changes, and we also submit fixes and improvements of our own.

In 2022, our work was important to be able to ship a bunch of extensions you can probably see implemented in Mesa and used by VKD3D-Proton when playing your favorite games on Linux, be it on your PC or perhaps on the fantastic Steam Deck. Or maybe used by Zink when implementing OpenGL on top of your favorite Vulkan driver. Anyway, without further ado, let’s take a look.

VK_EXT_image_2d_view_of_3d

This extension was created by our beloved super good coder Mike Blumenkrantz to be able to create a 2D view of a single slice of a 3D image. It helps emulate functionality which was already possible with OpenGL, and is used by Zink. Siru developed tests for this one but we reviewed the spec and are listed as contributors.

VK_EXT_shader_module_identifier

One of my favorite extensions shipped in 2022. Created by Hans-Kristian Arntzen to be used by Proton, this extension lets applications query identifiers (hashes, if you want to think about them like that) for existing VkShaderModule objects and also to provide said identifiers in lieu of actual VkShaderModule objects when creating a pipeline. This apparently simple change has real-world impact when downloading and playing games on a Steam Deck, for example.

You see, DX12 games ship their shaders typically in an intermediate assembly-like representation called DXIL. This is the equivalent of the assembly-like SPIR-V language when used with Vulkan. But Proton has to, when implementing DX12 on top of Vulkan, translate this DXIL to SPIR-V before passing the shader down to Vulkan, and this translation takes some time that may result in stuttering that, if done correctly, would not be present in the game when it runs natively on Windows.

Ideally, we would bypass this cost by shipping a Proton translation cache with the game when you download it on Linux. This cache would allow us to hash the DXIL module and use the resulting hash as an index into a database to find a pre-translated SPIR-V module, which can be super-fast. Hooray, no more stuttering from that! You may still get stuttering when the Vulkan driver has to compile the SPIR-V module to native GPU instructions, just like the DX12 driver would when translating DXIL to native instructions, if the game does not, or cannot, pre-compile shaders somehow. Yet there’s a second workaround for that.

If you’re playing on a known platform with known hardware and drivers (think Steam Deck), you can also ship a shader cache for that particular driver and hardware. Mesa drivers already have shader caches, so shipping a RADV cache with the game makes total sense and we would avoid stuttering once more, because the driver can hash the SPIR-V module and use the resulting hash to find the native GPU module. Again, this can be super-fast so it’s fantastic! But now we have a problem, you see? We are shipping a cache that translates DXIL hashes to SPIR-V modules, and a driver cache that translates SPIR-V hashes to native modules. And both are big. Quite big for some games. And what do we want the SPIR-V modules for? For the driver to calculate their hashes and find the native module? Wouldn’t it be much more efficient if we could pass the SPIR-V hash directly to the driver instead of the actual module? That way, the database translating DXIL hashes to SPIR-V modules could be replaced with a database that translates DXIL hashes to SPIR-V hashes. This can save space in the order of gigabytes for some games, and this is precisely what this extension allows. Enjoy your extra disk space on the Deck and thank Hans-Kristian for it! We reviewed the spec, contributed to it, and created tests for this one.

VK_EXT_attachment_feedback_loop_layout

This one was written by Joshua Ashton and allows applications to put images in the special VK_IMAGE_LAYOUT_ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT layout, in which they can both be used to render to and to sample from at the same time. It’s used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets. We reviewed, created tests and contributed to this one.

VK_EXT_mesh_shader

I don’t need to tell you more about this one. You saw the Khronos blog post. You watched my XDC 2022 talk (and Timur’s). You read my slides. You attended the Vulkan Webinar. It was important to actually have mesh shaders on Vulkan like you have in DX12, since emulation of DX12 on Vulkan was a top goal. We contributed to the spec and created tests.

VK_EXT_mutable_descriptor_type

“Hey, hey, hey!” I hear you protest. “This is just a rename of the VK_VALVE_mutable_descriptor_type extension which was released at the end of 2020.” Right, but I didn’t get to tell you about it then, so bear with me for a moment. This extension was created by Hans-Kristian Arntzen and Joshua Ashton and it helps efficiently emulate the raw D3D12 binding model on top of Vulkan. For that, it allows you to have descriptors with a type that is only known at runtime, and also to have descriptor pools and sets that reside only in host memory. We had reviewed the spec and created tests for the Valve version of the extension. Those same tests are the VK_EXT_mutable_descriptor_type tests today.

VK_EXT_extended_dynamic_state3

The final boss of dynamic state, which helps you reduce the number of pipeline objects in your application as much as is possibly allowed. Combine some of the old and new dynamic states with graphics pipeline libraries and you may enjoy stutter-free gaming. Guaranteed!1 This one will be used by native apps, translation layers (including Zink) and you-name-it. We developed tests and reviewed the spec for it.

1Disclaimer: not actually guaranteed.

December 12, 2022 09:37 PM

Andy Wingo

i'm throwing ephemeron party & you're invited

Good day, hackfolk. Today's note tries to extend our semi-space collector with support for ephemerons. Spoiler alert: we fail in a subtle and interesting way. See if you can spot it before the end :)

Recall that, as we concluded in an earlier article, a memory manager needs to incorporate ephemerons as a core part of the tracing algorithm. Ephemerons are not macro-expressible in terms of object trace functions.

Instead, to support ephemerons we need to augment our core trace routine. When we see an ephemeron E, we need to check if the key K is already visited (and therefore live); if so, we trace the value V directly, and we're done. Otherwise, we add E to a global table of pending ephemerons T, indexed under K. Finally whenever we trace a new object O, ephemerons included, we look up O in T, to trace any pending ephemerons for O.

So, taking our semi-space collector as a workbench, let's start by defining what an ephemeron is.

struct gc_ephemeron {
  struct gc_obj header;
  int dead;
  struct gc_obj *key;
  struct gc_obj *value;
};

enum gc_obj_kind { ..., EPHEMERON, ... };

static struct gc_ephemeron* as_ephemeron(struct gc_obj *obj) {
  uintptr_t ephemeron_tag = NOT_FORWARDED_BIT | (EPHEMERON << 1);
  if (obj->tag == ephemeron_tag)
    return (struct gc_ephemeron*)obj;
  return NULL;
}

First we need to allow the GC to know when an object is an ephemeron or not. This is somewhat annoying, as you would like to make this concern entirely the responsibility of the user, and let the GC be indifferent to the kinds of objects it's dealing with, but it seems to be unavoidable.

The heap will need some kind of data structure to track pending ephemerons:

struct gc_pending_ephemeron_table;

struct gc_heap {
  ...
  struct gc_pending_ephemeron_table *pending_ephemerons;
}

struct gc_ephemeron *
pop_pending_ephemeron(struct gc_pending_ephemeron_table*,
                      struct gc_obj*);
void
add_pending_ephemeron(struct gc_pending_ephemeron_table*,
                      struct gc_obj*, struct gc_ephemeron*);
struct gc_ephemeron *
pop_any_pending_ephemeron(struct gc_pending_ephemeron_table*);

Now let's define a function to handle ephemeron shenanigans:

void visit_ephemerons(struct gc_heap *heap, struct gc_obj *obj) {
  // We are visiting OBJ for the first time.
  // OBJ is the old address, but it is already forwarded.
  ASSERT(is_forwarded(obj));

  // First, visit any pending ephemeron for OBJ.
  struct gc_ephemeron *ephemeron;
  while ((ephemeron =
            pop_pending_ephemeron(heap->pending_ephemerons, obj))) {
    ASSERT(obj == ephemeron->key);
    ephemeron->key = forwarded(obj);
    visit_field(&ephemeron->value, heap);
  }

  // Then if OBJ is itself an ephemeron, trace it.
  if ((ephemeron = as_ephemeron(forwarded(obj)))
      && !ephemeron->dead) {
    if (is_forwarded(ephemeron->key)) {
      ephemeron->key = forwarded(ephemeron->key);
      visit_field(&ephemeron->value, heap);
    } else {
      add_pending_ephemeron(heap->pending_ephemerons,
                            ephemeron->key, ephemeron);
    }
  }
}

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  ...
  visit_ephemerons(heap, obj); // *
  return new_obj;
}

We wire it into the copy routine, as that's the bit of the collector that is called only once per object and which has access to the old address. We actually can't process ephemerons during the Cheney field scan, as there we don't have old object addresses.

Then at the end of collection, we kill any ephemeron whose key hasn't been traced:

void kill_pending_ephemerons(struct gc_heap *heap) {
  struct gc_ephemeron *ephemeron;
  while ((ephemeron =
            pop_any_pending_ephemeron(heap->pending_ephemerons)))
    ephemeron->dead = 1;    
}

void collect(struct gc_heap *heap) {
  // ...
  kill_pending_ephemerons(heap);
}

First observation: Gosh, this is quite a mess. It's more code than the core collector, and it's gnarly. There's a hash table, for goodness' sake. Goodbye, elegant algorithm!

Second observation: Well, at least it works.

Third observation: Oh. It works in the same way as tracing in the copy routine works: well enough for shallow graphs, but catastrophically for arbitrary graphs. Calling visit_field from within copy introduces unbounded recursion, as tracing one value can cause more ephemerons to resolve, ad infinitum.

Well. We seem to have reached a dead-end, for now. Will our hero wrest victory from the jaws of defeat? Tune in next time to find out: same garbage time (unpredictable), same garbage channel (my wordhoard). Happy hacking!

by Andy Wingo at December 12, 2022 02:02 PM

December 11, 2022

Andy Wingo

we iterate so that you can recurse

Sometimes when you see an elegant algorithm, you think "looks great, I just need it to also do X". Perhaps you are able to build X directly out of what the algorithm gives you; fantastic. Or, perhaps you can alter the algorithm a bit, and it works just as well while also doing X. Sometimes, though, you alter the algorithm and things go pear-shaped.

Tonight's little note builds on yesterday's semi-space collector article and discusses a worse alternative to the Cheney scanning algorithm.

To recall, we had this visit_field function that takes an edge in the object graph, as the address of a field in memory containing a struct gc_obj*. If the edge points to an object that was already copied, visit_field updates it to the forwarded address. Otherwise it copies the object, thus computing the new address, and then updates the field.

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  return new_obj;
}

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

Although a newly copied object is in tospace, all of its fields still point to fromspace. The Cheney scan algorithm later visits the fields in the newly copied object with visit_field, which both discovers new objects and updates the fields to point to tospace.

One disadvantage of this approach is that the order in which the objects are copied is a bit random. Given a hierarchical memory system, it's better if objects that are accessed together in time are close together in space. This is an impossible task without instrumenting the actual data access in a program and then assuming future accesses will be like the past. Instead, the generally-accepted solution is to ensure that objects that are allocated close together in time be adjacent in space. The bump-pointer allocator in a semi-space collector provides this property, but the evacuation algorithm above does not: it would need to preserve allocation order, but instead its order is driven by graph connectivity.

I say that the copying algorithm above is random but really it favors a breadth-first traversal; if you have a binary tree, first you will copy the left and the right nodes of the root, then the left and right children of the left, then the left and right children of the right, then grandchildren, and so on. Maybe it would be better to keep parent and child nodes together? After all they are probably allocated that way.

So, what if we change the algorithm:

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  trace_heap_object(new_obj, heap, visit_field); // *
  return new_obj;
}

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

void collect(struct gc_heap *heap) {
  flip(heap);
  trace_roots(heap, visit_field);
}

Here we favor a depth-first traversal: we eagerly call trace_heap_object within copy. No need for the Cheney scan algorithm; tracing does it all.

The thing is, this works! It might even have better performance for some workloads, depending on access patterns. And yet, nobody does this. Why?

Well, consider a linked list with a million nodes; you'll end up with a million recursive calls to copy, as visiting each link eagerly traverses the next. While I am all about unbounded recursion, an infinitely extensible stack is something that a language runtime has to provide to a user, and here we're deep into implementing-the-language-runtime territory. At some point a user's deep heap graph is going to cause a gnarly system failure via stack overflow.

Ultimately stack space needed by a GC algorithm counts towards collector memory overhead. In the case of a semi-space collector you already need twice the amount of memory as your live object graph, and if you recursed instead of iterated this might balloon to 3x or more, depending on the heap graph shape.

Hey that's my note! All this has been context for some future article, so this will be on the final exam. Until then!

by Andy Wingo at December 11, 2022 09:19 PM

December 10, 2022

Qiuyi Zhang (Joyee)

Fixing snapshot support of class fields in V8

Up until V8 10.0, the class field initializers had been

December 10, 2022 09:11 PM

Uncaught exceptions in Node.js

In this post, I’ll jot down some notes that I took when refactoring the uncaught exception handling routines in Node.js. Hopefully it

December 10, 2022 09:11 PM

On deps/v8 in Node.js

I recently ran into a V8 test failure that only showed up in the V8 fork of Node.js but not in the upstream. Here I’ll write down my

December 10, 2022 09:11 PM

Andy Wingo

a simple semi-space collector

Good day, hackfolk. Today's article is about semi-space collectors. Many of you know what these are, but perhaps not so many have seen an annotated implementation, so let's do that.

Just to recap, the big picture here is that a semi-space collector divides a chunk of memory into two equal halves or spaces, called the fromspace and the tospace. Allocation proceeds linearly across tospace, from one end to the other. When the tospace is full, we flip the spaces: the tospace becomes the fromspace, and the fromspace becomes the tospace. The collector copies out all live data from the fromspace to the tospace (hence the names), starting from some set of root objects. Once the copy is done, allocation then proceeds in the new tospace.

In practice when you build a GC, it's parameterized in a few ways, one of them being how the user of the GC will represent objects. Let's take as an example a simple tag-in-the-first-word scheme:

struct gc_obj {
  union {
    uintptr_t tag;
    struct gc_obj *forwarded; // for GC
  };
  uintptr_t payload[0];
};

We'll divide all the code in the system into GC code and user code. Users of the GC define how objects are represented. When user code wants to know what the type of an object is, it looks at the first word to check the tag. But, you see that GC has a say in what the representation of user objects needs to be: there's a forwarded member too.

static const uintptr_t NOT_FORWARDED_BIT = 1;
int is_forwarded(struct gc_obj *obj) {
  return (obj->tag & NOT_FORWARDED_BIT) == 1;
}
void* forwarded(struct gc_obj *obj) { 
  return obj->forwarded;
}
void forward(struct gc_obj *from, struct gc_obj *to) {
  from->forwarded = to;
}

forwarded is a forwarding pointer. When GC copies an object from fromspace to tospace, it clobbers the first word of the old copy in fromspace, writing the new address there. It's like when you move to a new flat and have your mail forwarded from your old to your new address.

There is a contract between the GC and the user in which the user agrees to always set the NOT_FORWARDED_BIT in the first word of its objects. That bit is a way for the GC to check if an object is forwarded or not: a forwarded pointer will never have its low bit set, because allocations are aligned on some power-of-two boundary, for example 8 bytes.

struct gc_heap;

// To implement by the user:
size_t heap_object_size(struct gc_obj *obj);
size_t trace_heap_object(struct gc_obj *obj, struct gc_heap *heap,
                         void (*visit)(struct gc_obj **field,
                                       struct gc_heap *heap));
size_t trace_roots(struct gc_heap *heap,
                   void (*visit)(struct gc_obj **field,
                                 struct gc_heap *heap));

The contract between GC and user is in practice one of the most important details of a memory management system. As a GC author, you want to expose the absolute minimum interface, to preserve your freedom to change implementations. The GC-user interface does need to have some minimum surface area, though, for example to enable inlining of the hot path for object allocation. Also, as we see here, there are some operations needed by the GC which are usually implemented by the user: computing the size of an object, tracing its references, and tracing the root references. If this aspect of GC design interests you, I would strongly recommend having a look at MMTk, which has been fruitfully exploring this space over the last two decades.
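
As a concrete illustration of the user side of this contract, here is a hypothetical runtime with a single two-field "pair" kind; the PAIR tag and the pair layout are assumptions for the sake of the example (trace_roots is sketched further down, together with an allocation loop):

// User-side sketch: a pair object, tagged as (PAIR << 1) | NOT_FORWARDED_BIT.
enum { PAIR = 1 };

struct pair {
  struct gc_obj header;
  struct gc_obj *car;   // NULL doubles as a non-heap immediate in this sketch
  struct gc_obj *cdr;
};

size_t heap_object_size(struct gc_obj *obj) {
  // Only one kind here; a real runtime would switch on obj->tag >> 1.
  return sizeof(struct pair);
}

size_t trace_heap_object(struct gc_obj *obj, struct gc_heap *heap,
                         void (*visit)(struct gc_obj **field,
                                       struct gc_heap *heap)) {
  struct pair *p = (struct pair*)obj;
  if (p->car) visit(&p->car, heap);
  if (p->cdr) visit(&p->cdr, heap);
  return sizeof(struct pair);
}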

struct gc_heap {
  uintptr_t hp;
  uintptr_t limit;
  uintptr_t from_space;
  uintptr_t to_space;
  size_t size;
};

Now we get to the implementation of the GC. With the exception of how to inline the allocation hot-path, none of this needs to be exposed to the user. We start with a basic definition of what a semi-space heap is, above, and below we will implement collection and allocation.

static uintptr_t align(uintptr_t val, uintptr_t alignment) {
  return (val + alignment - 1) & ~(alignment - 1);
}
static uintptr_t align_size(uintptr_t size) {
  return align(size, sizeof(uintptr_t));
}

All allocators have some minimum alignment, which is usually a power of two at least as large as the target language's ABI alignment. Usually it's a word or two; here we just use one word (4 or 8 bytes).

struct gc_heap* make_heap(size_t size) {
  size = align(size, getpagesize());
  struct gc_heap *heap = malloc(sizeof(struct gc_heap));
  void *mem = mmap(NULL, size, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  heap->to_space = heap->hp = (uintptr_t) mem;
  heap->from_space = heap->limit = heap->hp + size / 2;
  heap->size = size;
  return heap;
}

Making a heap is just requesting a bunch of memory and dividing it in two. How you get that space differs depending on your platform; here we use mmap and also the platform malloc for the struct gc_heap metadata. Of course you will want to check that both the mmap and the malloc succeed :)

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  return new_obj;
}

void flip(struct gc_heap *heap) {
  heap->hp = heap->from_space;
  heap->from_space = heap->to_space;
  heap->to_space = heap->hp;
  heap->limit = heap->hp + heap->size / 2;
}  

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

void collect(struct gc_heap *heap) {
  flip(heap);
  uintptr_t scan = heap->hp;
  trace_roots(heap, visit_field);
  while(scan < heap->hp) {
    struct gc_obj *obj = (struct gc_obj*)scan;
    scan += align_size(trace_heap_object(obj, heap, visit_field));
  }
}

Here we have the actual semi-space collection algorithm! It's a tiny bit of code about which people have written reams of prose, and to be fair there are many things to say—too many for here.

Personally I think the most interesting aspect of a semi-space collector is the so-called "Cheney scanning algorithm": when we see an object that's not yet traced, in visit_field, we copy() it to tospace, but don't actually look at its fields. Instead collect keeps track of the partition of tospace that contains copied objects which have not yet been traced, which are those in [scan, heap->hp). The Cheney scan sweeps through this space, advancing scan, possibly copying more objects and extending heap->hp, until such a time as the needs-tracing partition is empty. It's quite a neat solution that requires no additional memory.

inline struct gc_obj* allocate(struct gc_heap *heap, size_t size) {
retry:
  uintptr_t addr = heap->hp;
  uintptr_t new_hp = align_size(addr + size);
  if (heap->limit < new_hp) {
    collect(heap);
    if (heap->limit - heap->hp < size) {
      fprintf(stderr, "out of memory\n");
      abort();
    }
    goto retry;
  }
  heap->hp = new_hp;
  return (struct gc_obj*)addr;
}

Finally, we have the allocator: the reason we have the GC in the first place. The fast path just returns heap->hp, and arranges for the next allocation to return heap->hp + size. The slow path calls collect() and then retries.
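
To tie the pieces together, a tiny hypothetical driver using the pair kind sketched earlier: it keeps one global root, allocates a stream of mostly short-lived pairs, and lets allocation pressure drive collections. The single-root trace_roots is an assumption for illustration, matching the declaration above:

// Sketch of user code driving the collector; builds on the pair sketch above.
static struct gc_obj *root;   // the only root in this example

size_t trace_roots(struct gc_heap *heap,
                   void (*visit)(struct gc_obj **field,
                                 struct gc_heap *heap)) {
  if (root) visit(&root, heap);
  return 0;
}

int main(void) {
  struct gc_heap *heap = make_heap(64 * 1024);
  for (int i = 0; i < 1000000; i++) {
    struct pair *p = (struct pair*)allocate(heap, sizeof(struct pair));
    p->header.tag = (PAIR << 1) | NOT_FORWARDED_BIT;
    p->car = NULL;
    p->cdr = root;                               // keep the previous pair alive...
    if (root) ((struct pair*)root)->cdr = NULL;  // ...but cut the chain at depth two
    root = &p->header;
  }
  return 0;
}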

Welp, that's a semi-space collector. Until next time for some notes on ephemerons again. Until then, have a garbage holiday season!

by Andy Wingo at December 10, 2022 08:50 PM

December 02, 2022

Emmanuele Bassi

On PyGObject

Okay, I can’t believe I have to do this again.

This time, let’s leave the Hellsite out of it, and let’s try to be nuanced from the start. I’d like to avoid getting grief online.

The current state of the Python bindings for GObject-based libraries is making it really hard to recommend using Python as a language for developing GTK and GNOME applications.

PyGObject is currently undermaintained, even after the heroic efforts of Christoph Reiter to keep the fires burning through the long night. The Python community needs more people to work on the bindings, if we want Python to be a first class citizen of the ecosystem.

There’s a lot to do, and not nearly enough people left to do it.

Case study: typed instances

Yes, thou shall use GObject should be the law of the land; but there are legitimate reasons to use typed instances, and GTK 4 has a few of them: events (GdkEvent), render nodes (GskRenderNode), and the expressions API (GtkExpression).

At this very moment, it is impossible to use the types above from Python. PyGObject will literally error out if you try to do so. There are technical reasons why that was a reasonable choice 15+ years ago, but most language bindings written since then can handle typed instances just fine. In fact, PyGObject does handle them, since GParamSpec is a GTypeInstance; of course, that’s because PyGObject has some ad hoc code for them.

Dealing with events and render nodes is not so important in GTK4; but not having access to the expressions API makes writing list widgets incredibly more complicated, requiring everything to be set up through UI definition files and never modified programmatically.

Case study: constructing and disposing

While most of the API in PyGObject is built through introspection, the base wrapper for the GObject class is still very much written in CPython. This requires, among other things, wiring the class’s virtual functions manually, and figuring out the interactions between Python, GObject, and the type system wrappers. This means Python types that inherit from GObject don’t have automatic access to the GObjectClass.constructed and GObjectClass.dispose virtual functions. Normally, this would not be an issue, but modern GObject-based libraries have started to depend on being able to control construction and destruction sequences.

For instance, it is necessary for any type that inherits from GtkWidget to ensure that all its child widgets are disposed manually. While that was possible through the “destroy” signal in GTK3, in GTK4 the signal was removed and everything should go through the GObjectClass.dispose virtual function. Since PyGObject does not allow overriding or implementing that virtual function, your Python class cannot inherit from GtkWidget, but must inherit from ancillary classes like GtkBox or AdwBin. That’s even more relevant for the disposal of resources created through the composite template API. Theoretically, using Gtk.Template would take care of this, but since we cannot plug the Python code into the underlying CPython, we’re stuck.
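
For reference, this is roughly what the corresponding C looks like; MyWidget is a hypothetical type, and the point is the GObjectClass.dispose override that a Python subclass currently cannot provide through PyGObject:

/* Hypothetical GTK4 widget in C: child widgets are released in dispose. */
#include <gtk/gtk.h>

G_DECLARE_FINAL_TYPE (MyWidget, my_widget, MY, WIDGET, GtkWidget)

struct _MyWidget {
  GtkWidget parent_instance;
  GtkWidget *child;   /* owned; must be unparented on dispose in GTK4 */
};

G_DEFINE_TYPE (MyWidget, my_widget, GTK_TYPE_WIDGET)

static void
my_widget_dispose (GObject *object)
{
  MyWidget *self = (MyWidget *) object;
  /* No "destroy" signal in GTK4: children are torn down here. */
  g_clear_pointer (&self->child, gtk_widget_unparent);
  G_OBJECT_CLASS (my_widget_parent_class)->dispose (object);
}

static void
my_widget_class_init (MyWidgetClass *klass)
{
  G_OBJECT_CLASS (klass)->dispose = my_widget_dispose;
}

static void
my_widget_init (MyWidget *self)
{
  self->child = gtk_label_new ("hello");
  gtk_widget_set_parent (self->child, GTK_WIDGET (self));
}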

Case study: documentation, examples, and tutorials

While I have been trying to write Python examples for the GNOME developers documentation, I cannot write everything by myself. Plus, I can’t convince Google to only link to what I write. The result is that searching for “Python” and “GNOME” or “GTK” will inevitably lead people to the GTK3 tutorials and references.

The fragmentation of the documentation is also an issue. The PyGObject website is off on readthedocs.org, which is understandable for a Python project; then we have the GTK3 tutorial, which hasn’t been updated in a while, and for which there are no plans to have a GTK 4 version.

Additionally, the Python API reference is currently stuck on GTK3, with Python references for GTK4 and friends off on the side.

It would be great to unify the reference sites, and possibly have them under the GNOME infrastructure; but at this point, I’d settle for having a single place to find everything, like GJS did.

What can you do?

Pick up PyGObject. Learn how it works, and how the GObject and CPython API interact.

Write Python overrides to ensure that API like GTK and GIO are nice to use with idiomatic Python.

Write tests for PyGObject.

Write documentation.

Port existing tutorials, examples, and demos to GTK4.

Help Christoph out when it comes to triaging issues, and reviewing merge requests.

Join GNOME Python on Matrix, or the #gnome-python channel on Libera.

Ask questions and provide answers on Discourse.

What happens if nobody does anything?

Right now, we’re in maintenance mode. Things work because of inertia, and because nobody is really pushing the bindings outside of their existing functionality.

Let’s be positive, for a change, and assume that people will show up. They did for Vala when I ranted about it five years ago after a particularly frustrating week dealing with constant build failures in GNOME, so maybe the magic will happen again.

If people do not show up, though, what will likely happen is that Python will just fall on the wayside inside GNOME. Python developers won’t use GTK to write GUI applications; existing Python applications targeting GNOME will either wither and die, or get ported to other languages. The Python bindings themselves may stop working with newer versions of Python, which will inevitably lead downstream distributors to jettison the bindings themselves.

We have been through this dance with the C# bindings and Mono, and the GNOME and Mono communities are all the poorer for it, so I’d like to avoid losing another community of talented developers. History does not need to repeat itself.

by ebassi at December 02, 2022 06:56 PM

November 30, 2022

Eric Meyer

How to Verify Site Ownership on Mastodon Profiles

Like many of you, I’ve been checking out Mastodon and finding more and more things I like.  Including the use of XFN (XHTML Friends Network) semantics to verify ownership of sites you link from your profile’s metadata!  What that means is, you can add up to four links in your profile, and if you have an XFN-compliant link on that URL pointing to your Mastodon profile, it will show up as verified as actually being your site.

Okay, that probably also comes off a little confusing.  Let me walk through the process.

First, go to your home Mastodon server and edit your profile.  On servers like mastodon.social, there should be an “Edit profile” link under your user avatar.

Here’s what it looks like for me.  Yes, I prefer Light Mode.  No, I don’t want to have a debate about it.

I saw the same thing on another Mastodon server where I have an account, so it seems to be common to Mastodon in general.  I can’t know what every Mastodon server does, though, so you might have to root around to find how you edit your profile.  (Similarly, I can’t be sure that everything will be exactly as I depict it below, but hopefully it will be at least reasonably close.)

Under “Appearance” in the profile editing screen, which I believe is the default profile editing page, there should be a section called “Profile metadata”.  You’ll probably have to scroll a bit to reach it.  You can add up to four labels with content, and a very common label is “Web” or “Homepage” with the URL of your personal server.  They don’t all have to be links to sites; you could add your favorite color or relationship preference(s) or whatever.

I filled a couple of these in for demonstration purposes, but now they’re officially part of my profile.  No, I don’t want to debate indentation preferences, either.

But look, over there next to the table, there’s a “Verification” section, with a little bit of explanation and a field containing some markup you can copy, but it’s cut off before the end of the markup.  Here’s what mine looks like in full:

<a rel="me" href="https://mastodon.social/@Meyerweb">Mastodon</a>

If I take this markup and add it to any URL I list in my metadata, then that entry in my metadata table will get special treatment, because it will mean I’ve verified it.

The important part is the rel="me", which establishes a shared identity.  Here’s how it’s (partially) described by XFN 1.1:

A link to yourself at a different URL. Exclusive of all other XFN values. Required symmetric.

I admit, that’s written in terse spec-speak, so let’s see how this works out in practice.

First, let’s look at the markup in my Mastodon profile’s page.  Any link to another site in the table of profile metadata has a me value in the rel attribute, like so:

<a href="https://meyerweb.com/" rel="nofollow noopener noreferrer me">

That means I’ve claimed via Mastodon that meyerweb.com is me at another URL.

But that’s not enough, because I could point at the home page of, say, Wikipedia as if it were mine.  That’s why XFN requires the relationship to be symmetric, which is to say, there needs to be a rel="me" annotated link on each end.  (On both ends.  However you want to say that.)

So on the page being pointed to, which in my case is https://meyerweb.com/, I need to include a link back to my Mastodon profile page, and that link also has to have rel="me".  That’s the markup Mastodon provided for me to copy, which we saw before and I’ll repeat here:

<a rel="me" href="https://mastodon.social/@Meyerweb">Mastodon</a>

Again, the important part is that the href points to my Mastodon profile page, and there’s a rel attribute containing me.  It can contain other things, like noreferrer, but needs to have me for the verification to work.  Note that the content of the link element doesn’t have to be the text “Mastodon”.  In my case, I’m using a Mastodon logo, with the markup looking like this:

<a rel="me" href="https://mastodon.social/@Meyerweb">
 	<img src="/pix/icons/mastodon.svg" alt="Mastodon">
</a>

With that in place, there’s a “me” link pointing to a page that contains a “me” link.  That’s a symmetric relationship, as XFN requires, and it verifies that the two pages have a shared owner.  Who is me!

Thus, if you go to my Mastodon profile page, in the table of my profile metadata, the entry for my homepage is specially styled to indicate it’s been verified as actually belonging to me.

The table as it appears on my profile page for me.  Your colors may vary.

And that’s how it works.

Next question: how can I verify my GitHub page?  At the moment, I’d have to put my Mastodon profile page’s URL into the one open field for URLs in GitHub profiles, because GitHub also does the rel="me" thing for its profile links.  But if I do that, I’d have to remove the link to my homepage, which I don’t want to do.

Until GitHub either provides a dedicated Mastodon profile field the way it provides a dedicated Twitter profile field, or else allows people to add multiple URLs to their profiles the way Mastodon does, I won’t be able to verify my GitHub page on Mastodon.  Not a huge deal for me, personally, but in general it would be nice to see GitHub become more flexible in this area.  Very smart people are also asking for this, so hopefully that will happen soon(ish).


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at November 30, 2022 06:45 PM

November 29, 2022

José Dapena

Native call stack profiling (2/3): Event Tracing for Windows and Chromium

In the last blog post, I introduced call stack profiling, why it is useful, and how system-wide support can be useful. This new blog post will talk about Windows native call stack tracing, and how it is integrated into Chromium.

Event Tracing for Windows (ETW)

Event Tracing for Windows, usually known by the acronym ETW, is a Windows kernel-based tool that allows logging kernel and application events to a file.

A good description of its architecture is available at Microsoft Learn: About Event Tracing.

Essentially, it is an efficient event capturing tool, in some ways similar to LTTng. Its event recording stage is as lightweight as possible, so that collecting the data perturbs the results as little as possible, reducing the observer effect.

The main participants are:
– Providers: kernel components (including device drivers) or applications that emit log events.
– Controllers: tools that control when a recording session starts and stops, which providers to record, and what each provider is expected to log. Controllers also decide where to dump the recorded data (typically a file).
– Consumers: tools that can read and analyze the recorded data, and combine with system information (i.e. debugging information). Consumers will usually get the data from previously recorded files, but it is also possible to consume tracing information in real time.

What about call stack profiling? ETW supports call stack sampling, allowing it to capture call stacks when certain events happen and associate each call stack with its event. Bruce Dawson has written a fantastic blog post about the topic.

Chromium call stack profiling in Windows

Chromium provides support for call stack profiling. This is done at different levels of the stack:
– It allows building with frame pointers, so CPU profile samplers can properly capture the full call stack.
– V8 can generate symbol information for JIT-compiled code. This is supported for ETW (and also for Linux Perf).

Compilation

On any platform, compilation will usually benefit from the GN flag enable_profiling=true. This will enable frame pointer support. On Windows, it will also enable generation of function information for code generated by V8.

Also, at least symbol_level=1 should be set, so that function names from the compilation stage are available.

Chrome startup

To enable generation of V8 profiling information in Windows, these flags should be passed to chrome on launch:

chrome --js-flags="--enable-etw-stack-walking --interpreted-frames-native-stack"

--enable-etw-stack-walking will emit information of the functions compiled by V8 JIT engine, so they can be recorded while sampling the stack.

--interpreted-frames-native-stack will show the frames of interpreted code in the native stack, so external profilers as ETW can properly show those in the profiling samples.

Recording

Then, a session of the workload to analyze can be captured with Windows Performance Recorder.

An alternate tool with specific Chromium support, UIForETW, can be used too. Its main advantage is that it allows selecting specific Chromium tracing categories, which will be emitted in the same trace. Its author, Bruce Dawson, has explained very well how to use it.

Analysis

For analysis, the tool Windows Performance Analyzer (WPA) can be used. Both UIForETW and Windows Performance Recorder will offer opening the obtained trace at the end of the capture for analysis.

Before starting analysis, in WPA, add the paths where the .PDB files with debugging information are available.

Then, select Computation/CPU Usage (Sampled):


From the available charts, we are interested in the ones providing stackwalk information:

Next

In the last post of this series, I will present the work done in 2022 to improve V8 support for Windows ETW.

by José Dapena Paz at November 29, 2022 09:15 AM

November 28, 2022

Andy Wingo

are ephemerons primitive?

Good evening :) A quick note, tonight: I've long thought that ephemerons are primitive and can't be implemented with mark functions and/or finalizers, but today I think I have a counterexample.

For context, one of the goals of the GC implementation I have been working on is to replace Guile's current use of the Boehm-Demers-Weiser (BDW) conservative collector. Of course, changing a garbage collector for a production language runtime is risky, and for Guile one of the mitigation strategies for this work is that the new collector is behind an abstract API whose implementation can be chosen at compile-time, without requiring changes to user code. That way we can first switch to BDW-implementing-the-new-GC-API, then switch the implementation behind that API to something else.

Abstracting GC is a tricky problem to get right, and I thank the MMTk project for showing that this is possible -- you have user-facing APIs that need to be implemented by concrete collectors, but also extension points so that the user can provide some compile-time configuration too, for example to provide field-tracing visitors that take into account how a user wants to lay out objects.

Anyway. As we discussed last time, ephemerons usually have explicit support from the GC, so we need an ephemeron abstraction as part of the abstract GC API. The question is, can BDW-GC provide an implementation of this API?

I think the answer is "yes, but it's very gnarly and will kill performance so bad that you won't want to do it."

the contenders

Consider that the primitives that you get with BDW-GC are custom mark functions, run on objects when they are found to be live by the mark workers; disappearing links, a kind of weak reference; and finalizers, which receive the object being finalized, can allocate, and indeed can resurrect the object.

BDW-GC's finalizers are a powerful primitive, but not one that is useful for implementing the "conjunction" aspect of ephemerons, as they cannot constrain the marker's idea of graph connectivity: a finalizer can only prolong the life of an object subgraph, not cut it short. So let's put finalizers aside.

Weak references have a tantalizingly close kind of conjunction property: if the weak reference itself is alive, and the referent is also otherwise reachable, then the weak reference can be dereferenced. However this primitive only involves the two objects E and K; there's no way to then condition traceability of a third object V to E and K.

We are left with mark functions. These are an extraordinarily powerful interface in BDW-GC, but somewhat expensive also: not inlined, and going against the grain of what BDW-GC is really about (heaps in which the majority of all references are conservative). But, OK. The way they work is, your program allocates a number of GC "kinds", and associates mark functions with those kinds. Then when you allocate objects, you use those kinds. BDW-GC will call your mark functions when tracing an object of those kinds.

Let's assume firstly that you have a kind for ephemerons; then when you go to mark an ephemeron E, you mark the value V only if the key K has been marked. Problem solved, right? Only halfway: you also have to handle the case in which E is marked first, then K. So you publish E to a global hash table, and... well. You would mark V when you mark a K for which there is a published E. But, for that you need a hook into marking V, and V can be any object...

So now we assume additionally that all objects are allocated with user-provided custom mark functions, and that all mark functions check if the marked object is in the published table of pending ephemerons, and if so marks values. This is essentially what a proper ephemeron implementation would do, though there are some optimizations one can do to avoid checking the table for each object before the mark stack runs empty for the first time. In this case, yes you can do it! Additionally if you register disappearing links for the K field in each E, you can know if an ephemeron E was marked dead in a previous collection. Add a pre-mark hook (something BDW-GC provides) to clear the pending ephemeron table, and you are in business.

yes, but no

So, it is possible to implement ephemerons with just custom mark functions. I wouldn't want to do it, though: missing the mostly-avoid-pending-ephemeron-check optimization would be devastating, and really what you want is support in the GC implementation. I think that for the BDW-GC implementation in whippet I'll just implement weak-key associations, in which the value is always marked strongly unless the key was dead on a previous collection, using disappearing links on the key field. That way a (possibly indirect) reference from a value V to a key K can indeed keep K alive, but oh well: it's a conservative approximation of what should happen, and not worse than what Guile has currently.
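
For the curious, a bare-bones sketch of the disappearing-link mechanism that such weak-key associations would build on, assuming BDW-GC's documented gc.h API (header path and details vary by version; the full association logic, including dropping the value once the key is observed dead, is not shown):

/* Sketch only: a weak reference to `key` via BDW-GC disappearing links.
 * The stored pointer is hidden so the conservative scan does not itself
 * keep the key alive; the collector clears the slot when the key dies. */
#include <gc/gc.h>   /* may be <gc.h> depending on the installation */

struct weak_key_assoc {
  GC_word hidden_key;   /* GC_HIDE_POINTER(key), or 0 once the key is dead */
  void *value;          /* traced strongly while the association is live */
};

struct weak_key_assoc* make_weak_key_assoc(void *key, void *value) {
  struct weak_key_assoc *a = GC_malloc(sizeof *a);
  a->hidden_key = GC_HIDE_POINTER(key);
  a->value = value;
  GC_general_register_disappearing_link((void**)&a->hidden_key, key);
  return a;
}

void* weak_key_assoc_key(struct weak_key_assoc *a) {
  /* With incremental or parallel marking, read this under
     GC_call_with_alloc_lock to avoid racing with the clearing. */
  return a->hidden_key ? (void*) GC_REVEAL_POINTER(a->hidden_key) : NULL;
}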

Good night and happy hacking!

by Andy Wingo at November 28, 2022 09:11 PM

November 25, 2022

Brian Kardell

Alice 2023

Alice 2023

There's an upcoming election in the W3C...

If you're not familiar with W3C elections already, I wrote a brief (~400 word) "Primer for Busy People" on the general "parts" of the W3C, introducing the two main elected bodies, including the Technical Architecture Group (TAG), a kind of steering committee for the W3C/Web at large.

The W3C TAG is charged with overseeing and helping direct and hold together the broad view and direction of the web over time. While that sounds perhaps a little abstract and esoteric, it is mostly grounded in very concrete things. Proposals go through TAG review and the TAG discusses and asks questions at many levels to ensure that things are safe, ethical, well designed, forward looking, and consistently reasoned across APIs and working groups. As part of this they also try to gather and publish articulations of principles and advice which can be reapplied and referred back to. These principles help shape the web that we'll all be developing on in significant ways, for the rest of our lives. So, I think it matters that we have a good TAG.

The TAG has changed a lot since 2012 when I first started advocating that it needed reform and that regular developers should publicly support candidates and lobby member orgs to vote on our behalf. I think that a lot of that change is because it worked. A concerted effort was made to ensure that we had as good a slate of candidates as we could muster. Candidates threw their support to other candidates they believed were great choices too, etc. For the next few years, those candidates were elected something like 25:2 in terms of seats. I'm happy to say that I think that most of those changes have been really positive. The TAG is relevant again, it is productive and useful. They do outreach in the community. It is much more diverse on many levels. The work is better connected to practice. However, another outcome of this is that the W3C also changed how elections work. I'm not going to get into why I think this is complicated and a net loss (I do that in several other posts like Optimizing The W3C Sandwiches if you're really curious), but it means that each member has only one vote counted and candidates are left to convince you that they are the best candidate. The net result is that I've pretty much stepped away from advocating for candidates like I used to and focused instead on just helping to make sure there are good nominees, at least.

A notable exception to this was in 2019 when Alice Boxhall (@sundress on Twitter) ran. I wrote about why I was excited and supported her. She is truly a unique candidate in my mind. After her stint on the TAG, she stepped away from Google and took a sabbatical only to return this year to work with me at Igalia! In this election, she's running again and I couldn't support it more. All of the rationale I had in 2019 holds true today, and additionally now you can say she'll have experience and bring perspective from a very unique organization that represents a lot that is good and positive change in the web ecosystem.

I really wish I could say more about this election in a productive way, but all I can offer is these two points:

First... I hope you're similarly excited for Alice to run and join me in supporting her.

Vote for Alice

Second... Encourage people to actually vote, and vote together

We have a surplus of good candidates. There are many permutations of choices that would be arguably good. However, the truth is that with this voting system, if turnout is low (it often is, amazingly few orgs vote), it is very likely that someone at the very bottom of the preferences of the vast majority gets one of those seats. Instead of optimizing one of the many good permutations, it can ensure that we lose one of those seats. Just having people turn up at all makes that kind of failure harder, so please - send a DM, tweet something, remind people - make sure they vote.

To be a little stronger, while you're at it - discuss with some other members in a community that you belong to. Get their thoughts and insights, and see if you can build some consensus on an order of preference - in doing so you'll greatly help the odds of being happy with the results.

But, also, again...

Vote for Alice

Polls close at 04:59 UTC on 2022-12-14 (23:59, Boston time on 2022-12-13).

November 25, 2022 05:00 AM

November 17, 2022

Brian Kardell

The Cool Thing You Didn’t Know Existed

The Cool Thing You Didn’t Know Existed

Recently, in discussions with my colleague Eric Meyer, we struck on an idea that we thought would be really helpful - even built a thing for our own use to try it out. Turned out it kind of already exists and I’d like to tell you about it, and how we hope to make it better.

Information about web-related things is just one among many, many kinds of information flying at us through various channels a mile a minute (or ~ 1.61km per minute, if you prefer). Some of those things are about “standards”. I say “standards” in quotes because much of what we hear about are things we hope will someday be universally implemented.

Many ultimately fail, take wrong turns, or are re-imagined. Those that do get universally implemented and reach Definitely A Standard™ status often take years to do so. This presents numerous challenges, so it’s really helpful to be able to really quickly understand what it is you’re investing your time in, and make sure you don’t miss the important stuff.

Luckily there are some kind of more newsworthy ‘events’ along the way - mainly, implementations landing. The more implementations there are, the more confidence you can have that this is valuable - whether that is just taking a few minutes to find out about it, or actually taking some time to try to learn it and put it to use. But… the last one is special. It is the finish line. There aren’t more questions or doubts. Ironically though, the timeframes and noise before this point make it sometimes easy to miss.

That is, in fact, a pretty rare occurrence. Wouldn’t it be nice if there was something that would only tell you that? That would be a pretty quiet channel, and if you saw something there you could say “oh, that’s important”. Interestingly, it kind of already exists.

The MDN browser-compat-data repo is a community maintained source that all of the browser implementers and standards folks keep up. I think most people know this already, but what you might not know is that they create a release update to it regularly (currently twice a week) with, actually, very good release notes.

GitHub, it seems, also has a built-in pattern for providing an RSS Feed for project releases. The practical upshot of this is that you can subscribe to the RSS Feed of the browser compat data releases.

Wow, cool.

We can do better...

This feed is great, but it still contains some noise you probably don’t care about, and because of that it can be a little confusing…

All of these updates are additions to BCD data features, and renamings and removals along the way, for example. Those are just metadata that implementations might or might not ultimately be connected to. There is no new implementation news in this release, but that’s not especially obvious. There are folks (like me) who like to follow that, but really it’s the implementations that are the high-order bit here (even for me).

My colleague Eric Meyer and I were talking about all this, and ways to make the information coming out of BCD versions more digestible for ourselves and the public at large… and decided what the heck, let’s build something rough and see where it lands. So, in the thing we built, it's more focused.

A screenshot of a similar page we made (described below)

In this view, additions/removals and that sort of data are summarized as a count on a single line. Further details are collapsed by default. You can open them if you want, of course, but they don’t get in the way when you are trying to quickly visually scan it.

The main content that is expanded by default is about the additions (or removals) of implementations - and those are organized, linked to the relevant MDN reference, and you’re told who added it, and you can easily see, visually, whether it has reached fully complete implementation status.

Our experiments led to us reaching out to our peers at MDN and we’re currently working with them to see if we can make some improvements like these. In the meantime - subscribe to that RSS feed and check it every week, I think you’ll be happy you did.

November 17, 2022 05:00 AM

November 16, 2022

José Dapena

Native call stack profiling (1/3): introduction

This week I presented a lightning talk at BlinkOn 17. There I talked about the work to improve native stack profiling support on Windows.

This post starts a series where I will provide more context and details
to the presentation.

Why callstack profiling

First, a definition:

Callstack profiling: a performance analysis tool that periodically samples the call stacks of all threads, for a specific workload.

Why is it useful? It provides a better understanding of performance problems, especially if they are caused by CPU-bound bottlenecks.

As we sample the full stack for each thread, we capture a wealth of information:
– Which functions are using more CPU directly.
– As we capture the full stack trace, we also know which functions involve more CPU usage, even if it is indirectly through the calls they make.

But it is not only useful for CPU usage. It will also capture when a method is waiting for something (e.g. because of networking, or a semaphore).

The provided information is useful for an initial analysis of the problem, as it gives a high level view of where time could be spent by the application. But it will also be useful in further stages of the analysis, and even for comparing different implementations and considering possible changes.

How does it work?

For call stack sampling, we need some infrastructure to be able to capture and properly traverse the call stack of each thread.

At the compilation stage, information is added for function names and frame pointers. This makes it possible to later resolve, for a captured stack, the actual function names and even the lines of code.

At the runtime stage, function information will also be required for generated code, e.g. in a web browser, the JavaScript code that is compiled at runtime.

Then, every sample will extract the callstack of all the threads of all the analysed processes. This will happen periodically, at the rate established by the profiling tool.

System wide native callstack profiling

When possible, sampling the call stacks of the full system can be beneficial for the analysis.

First, we may want to include system libraries and other dependencies of our component in the analysis. But also, system analyzers can provide other metrics that can give a better context to the analysed workload (network or CPU load, memory usage, swappiness, …).

In the end, many problems are not bound to a single component, so capturing the interaction with other components can be useful.

Next

In the next blog posts in this series, I will present native stack profiling for Windows, and how it is integrated with Chromium.

by José Dapena Paz at November 16, 2022 10:13 PM

Jacobo Aragunde

Accessible name computation in Chromium Chapter II: The traversal problem

Back in March last year, as part of my maintenance work on Chrome/Chromium accessibility at Igalia, I inadvertently started a long, deep dive into the accessible name calculation code, that had me intermittently busy since then. Part of this work was already presented in a lightning talk at BlinkOn 15, in November 2021:

This is the second post in a series bringing a conclusion to the work presented in that talk. You may read the previous one, with an introduction to accessible name computation and a description of the problems we wanted to address, here.

Why traverse the DOM?

In our previous post, we found out we used a DOM tree traversal in AXNodeObject::TextFromDescendants to calculate the name for a node based on its descendants. There are some reasons this is not a good idea:

  • There was duplicate code related to selection of relevant nodes (cause of the whitespace bug). For example, the accessibility tree creation already selects relevant nodes, and simplifies whitespace nodes to one single unit whenever relevant.
  • The DOM tree does not contain certain relevant information we need for accessible names (cause of the ::before/after bug). Relevant pseudo-elements are already included in the accessibility tree.
  • In general, it would be more efficient to reuse the cached accessibility tree.

There must have been a good reason to traverse the DOM again instead of the cached accessibility tree, and it was easy enough to switch our code to do the latter and find out what was breaking. Our tests were kind enough to point out multiple failures related to accessible names from hidden subtrees: it is possible, under certain conditions, to name elements after other hidden elements. Because hidden elements are generally excluded from the accessibility tree, to be able to compute a name after them we had resorted to the DOM tree.

Hidden content and aria-labelledby

Authors may know they can use aria-labelledby to name a node after another, hidden node. This is a simple example, where the produced name for the input is “foo”:

<input aria-labelledby="label">
<div id="label" class="hidden">foo</div>

Unfortunately, the spec did not provide clear answers to some corner cases: what happens if there are elements that are explicitly hidden inside the first hidden node? What about nodes that aren’t directly referenced? How deep are we expected to go inside a hidden tree to calculate its name? And what about aria-hidden nodes?

The discussion and changes to the spec that resulted from it are worth their own blog post! For this one, let me jump directly to the conclusion: we agreed to keep browsers current and past behavior for the most part, and change the spec to reflect it.

Addressing the traversal problem in Chromium

The goal now is to change name computation code to work exclusively with the accessibility tree. It would address the root cause of the problems we had identified, and simplify our code, potentially fixing other issues. To do so, we need to include in the accessibility tree all the information required to get a name from a hidden subtree like explained previously.

We have the concept of “ignored” nodes in the Blink accessibility tree: these nodes exist in the tree, are kept up-to-date, etc. But they are not exposed to the platform accessibility. We will use ignored nodes to keep name-related items in the tree.

At first, I attempted to include in the accessibility tree those nodes whose parent was a target of an aria-labelledby or aria-describedby relation, even if they were hidden. For that purpose, I had to track the DOM for changes in these relations, and maintain the tree up-to-date. This proved to be complicated, as we had to add more code, more conditions… Exactly the opposite to our goal of simplifying things. We ended up abandoning this line of work.

We considered the best solution would be the simplest one: keep all hidden nodes in the tree, just in case we need them. The patch had an impact on memory consumption that we decided to accept, taking into account that the advantages of the change (less and cleaner code, bug fixes) outweighed the disadvantages, and its neutral-to-slightly-positive impact on other performance metrics.

There were some non-hidden nodes that were also excluded from the accessibility tree: that was the case of <label> elements, to prevent ATs from double-speaking the text label and the labeled element next to it. We also had to include these labels in the Blink accessibility tree, although marked “ignored”.

Finally, with all the necessary nodes in the tree, we could land a patch to simplify traversal for accessible name and description calculation.

Along the way, we made a number of small fixes and improvements in surrounding and related code, which hopefully made things more stable and performant.

Next in line: spec work

In this post, we have explained the big rework in accessible naming code to be able to use the accessibility tree as the only source of names. It was a long road, with a lot of unexpected problems and dead-ends, but worth it for the better code, improved test coverage and bugs fixed.

In the next and final post, I’ll explain the work we did in the spec, and the adjustments required in Chromium, to address imprecision when naming from hidden subtrees.

Thanks for reading, and happy hacking!

by Jacobo Aragunde Pérez at November 16, 2022 12:31 PM

November 15, 2022

Jesse Alama

QuickJS already supports arbitrary-precision decimals

QuickJS already supports arbitrary-precision decimals

QuickJS is a neat JavaScript engine by Fabrice Bellard. It’s fast and small (version 2021-03-27 clocks in at 759 Kb). It includes support for arbitrary-precision decimals, even though the TC39 decimal proposal is (at the time of writing) still at Stage 2. You can install it using the tool of your choice; I was able to install it using Homebrew for macOS (the formula is called quickjs) and FreeBSD and OpenBSD. It can also be installed using esvu. (It doesn’t seem to be available as a package on Ubuntu.)

To get started with arbitrary precision decimals, you need to fire up QuickJS with the bignum flag:

$ qjs --bignum
QuickJS - Type "\h" for help
qjs > 0.1 + 0.2 === 0.3
false
qjs > 0.1m + 0.2m === 0.3m
true
qjs > 0.12345678910111213141516171819m * 0.19181716151413121110987654321m
0.0236811308550240594910066199325923006903586430678413779899m

(The m is the proposed suffix for literal high-precision decimal numbers that proposal-decimal is supposed to give us.) Notice how we nicely unbreak JavaScript decimal arithmetic, without having to load a library.

The final API in the official TC39 Decimal proposal still has not been worked out. Indeed, a core question there remains outstanding at the time of writing: what kind of numeric precision should be supported? (The two main contenders are arbitrary precision and 128-bit IEEE 754 (high, but not *arbitrary*, precision). QuickJS does arbitrary precision.) Nonetheless, QuickJS provides a BigDecimal function:

qjs > BigDecimal("123456789.0123456789")
123456789.0123456789m

Moreover, you can do basic arithmetic with decimals: addition, subtraction, multiplication, division, modulo, square roots, and rounding. Everything is done with infinite precision (no loss of information). If you know, in advance, what precision is needed, you can tweak the arithmetical operations by passing in an options argument. Here’s an example of adding two big decimal numbers:

qjs > var a = 0.12345678910111213141516171819m;
undefined
qjs > var b = 0.19181716151413121110987654321m;
undefined
qjs > BigDecimal.add(a,b)
0.3152739506152433425250382614m
qjs > BigDecimal.add(a,b, { "roundingMode": "up", "maximumFractionDigits": 10 })
0.3152739507m

November 15, 2022 08:42 AM

November 14, 2022

Jesse Alama

Here’s how to unbreak floating-point math in JavaScript

Unbreaking floating-point math in JavaScript

Because computers are limited, they work in a finite range of numbers, namely, those that can be represented straightforwardly as fixed-length (usually 32 or 64) sequences of bits. If you’ve only got 32 or 64 bits, it’s clear that there are only so many numbers you can represent, whether we’re talking about integers or decimals. For integers, it’s clear that there’s a way to exactly represent mathematical integers (within the finite domain permitted by 32 or 64 bits). For decimals, we have to deal with the limits imposed by having only a fixed number of bits: most decimal numbers cannot be exactly represented. This leads to headaches in all sorts of contexts where decimals arise, such as finance, science, engineering, and machine learning.

It has to do with our use of base 10 and the computer’s use of base 2. Math strikes again! Exactness of decimal numbers isn’t an abstruse, edge case-y problem that some mathematicians thought up to poke fun at programmers and engineers who aren’t blessed to work in an infinite domain. Consider a simple example. Fire up your favorite JavaScript engine and evaluate this:

1 + 2 === 3

You should get true. Duh. But take that example and work it with decimals:

0.1 + 0.2 === 0.3

You’ll get false.

How can that be? Is floating-point math broken in JavaScript? Short answer: yes, it is. But if it’s any consolation, it’s not just JavaScript that’s broken in this regard. You’ll get the same result in all sorts of other languages. This isn’t wat. This is the unavoidable burden we programmers bear when dealing with decimal numbers on machines with limited precision.
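
For what it’s worth, the same surprise shows up in a plain C++ program using IEEE 754 doubles, which is a quick way to convince yourself that the culprit is the representation, not JavaScript:

#include <cstdio>

int main() {
  double a = 0.1;
  double b = 0.2;
  // Neither 0.1 nor 0.2 has an exact base-2 representation, so the sum is
  // the nearest representable double, which is not exactly 0.3.
  std::printf("%.17g\n", a + b);      // prints 0.30000000000000004
  std::printf("%d\n", a + b == 0.3);  // prints 0
  return 0;
}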

Maybe you’re thinking OK, but if that’s right, how in the world do decimal numbers get handled at all? Think of all the financial applications out there that must be doing the wrong thing countless times a day. You’re quite right! One way of getting around oddities like the one above is by always rounding. Another is by handling decimal numbers as strings (sequences of digits). You would then define operations such as addition, multiplication, and equality by doing elementary school math, digit by digit (or, rather, character by character).

So what to do?

Numbers in JavaScript are supposed to be IEEE 754 floating-point numbers. A consequence of this is, effectively, that 0.1 + 0.2 will never be 0.3 (in the sense of the === operator in JavaScript). So what can be done?

There’s an npm library out there, decimal.js, that provides support for arbitrary precision decimals. There are probably other libraries out there that have similar or equivalent functionality.

As you might imagine, the issue under discussion is old. There are workarounds using a library.

But what about extending the language of JavaScript so that the equation does get validated? Can we make JavaScript work with decimals correctly, without using a library?

Yes, we can.

Aside: Huge integers

It’s worth thinking about a similar issue that also arises from the finiteness of our machines: arbitrarily large integers in JavaScript. Out of the box, JavaScript didn’t support extremely large integers. You’ve got 32-bit or (more likely) 64-bit signed integers. But even though that’s a big range, it’s still, of course, limited. BigInt, a proposal to extend JS with precisely this kind of thing, reached Stage 4 in 2019, so it should be available in pretty much every JavaScript engine you can find. Go ahead and fire up Node or open your browser’s inspector and plug in the number of nanoseconds since the Big Bang:

13_787_000_000_000n // years
  * 365n   // days
  * 24n    // hours
  * 60n    // minutes
  * 60n    // seconds
  * 1000n  // milliseconds
  * 1000n  // microseconds
  * 1000n  // nanoseconds

(Not a scientician. May not be true. Not intended to be a factual claim.)

Adding big decimals to the language

OK, enough about big integers. What about adding support for arbitrary precision decimals in JavaScript? Or, at least, high-precision decimals? As we see above, we don’t even need to wrack our brains trying to think of complicated scenarios where a ton of digits after the decimal point are needed. Just look at 0.1 + 0.2 = 0.3. That’s pretty low-precision, and it still doesn’t work. Is there anything analogous to BigInt for non-integer decimal numbers? No, not as a library; we already discussed that. Can we add it to the language, so that, out of the box, with no third-party library, we can work with decimals?

The answer is yes. Work is proceeding on this matter, but things remain unsettled. The relevant proposal is BigDecimal. I’ll be working on this for a while. I want to get big decimals into JavaScript. There are all sorts of issues to resolve, but they’re definitely resolvable. We have experience with arbitrary precision arithmetic in other languages. It can be done.

So yes, floating-point math is broken in JavaScript, but help is on the way. You’ll see more from me here as I tackle this interesting problem; stay tuned!

November 14, 2022 11:33 AM

Javier Fernández

Discovering Chrome’s pre-defined Custom Handlers

In a previous post I described some architectural changes I’ve applied in Chrome to move the Custom Handlers logic into a new component. One of the advantages of this new architecture is an easier extensibility for embedders and less friction with the //chrome layer.

Igalia and Protocol Labs have been collaborating for some years already to improve the multi-protocol capabilities in the main browsers; most of our work has been done in Chrome, but we have interesting contributions in Firefox and of course we have plans to increase our activity in Safari.

As I said in the mentioned post, one of the long-term goals that Protocol Labs has with this work is to improve the integration of the IPFS protocol in the main browsers. In this regard, it's undeniable that the Brave browser is the one leading this effort, providing a native implementation of the IPFS protocol that can work with both a local IPFS node and the HTTPS gateway. For the rest of the browsers, Protocol Labs would like to at least implement the HTTPS gateway support, as an intermediate step and to increase the adoption and use of the protocol.

This is where Chrome’s pre-defined custom handlers feature could provide us a possible approach. This feature allows Chrome to register handlers for specific schemes at compile time. The ChromeOS embedders use this mechanism to register handlers for the mailto and webcal schemes, to redirect the HTTP requests to the mail.google.com and google.com/calendar sites respectively.

The idea would be that Chrome could use the same approach to register pre-defined handlers for the IPFS scheme, redirecting the IPFS request to the HTTPS gateways.

In order to understand better how this feature would work, I’ll explain first how Chrome deals with the needs of the different embedders regarding the schemes.

The SchemeRegistry

The SchemeRegistry data structure is used to declare the schemes registered in the browser and assign some specific properties to different subsets of such group of schemes.

The data structure defines different lists to provide these specific features to the schemes; the main and most basic one is perhaps the standard schemes list, which is defined as follows:

A standard-format scheme adheres to what RFC 3986 calls “generic URI syntax” (https://tools.ietf.org/html/rfc3986#section-3).

There are other lists to lend specific properties to a well-defined set of schemes:

// Schemes that are allowed for referrers.
std::vector<SchemeWithType> referrer_schemes = {
  {kHttpsScheme, SCHEME_WITH_HOST_PORT_AND_USER_INFORMATION},
  {kHttpScheme, SCHEME_WITH_HOST_PORT_AND_USER_INFORMATION},
};
// Schemes that do not trigger mixed content warning.
std::vector<std::string> secure_schemes = {
  kHttpsScheme, kAboutScheme, kDataScheme,
  kQuicTransportScheme, kWssScheme,
};

The following class diagram shows how the SchemeRegistry data structure is defined, with some duplication to avoid layering violations, and its relationship with the Content Public API, used by the Chrome embedders to define their own behavior for some schemes.

It’s worth mentioning that this relationship between the //url, //content and //blink layers shows some room for improvement, as was noted by Dmitry Gozman (one of the Content Public API owners) in one of the reviews of the patches I’ve been working on. I hope we can propose some improvements to this code, as part of the goals that Igalia and Protocol Labs will define for 2023.

Additional schemes for Embedders

As was commented in the preliminary discussions I had as part of the analysis of the initial refactoring work described in the first section, Chrome already provides a way for embedders to define new schemes and even modify the behavior of the ones already registered. This is done via the Content Public API, as usual.

Chrome’s Content Layer offers a Public API that embedders can implement to define their own behavior for certain features.

One of these abstract features is the addition of new schemes to be handled internally. The ContentClient interface has a method called AddAdditionalSchemes precisely for this purpose. On the other hand, the ContentBrowserClient interface provides the HasCustomSchemeHandler and IsHandledURL methods to allow embedders to implement their specific behavior to handle such schemes.

The following class diagram shows how embedders implement the Content Client interfaces to provide specific behavior for some schemes:

I’ve applied a refactoring to allow defining pre-defined handlers via the SchemeRegistry. This is some excerpt of such refactoring:

// Schemes with a predefined default custom handler.
std::map<std::string, std::string> predefined_handler_schemes;

void AddPredefinedHandlerScheme(const char* new_scheme, const char* handler) {
  DoAddSchemeWithHandler(
      new_scheme, handler,
      GetSchemeRegistryWithoutLocking().predefined_handler_schemes);
}

const std::map<std::string, std::string>
GetPredefinedHandlerSchemes() {
  return GetSchemeRegistry().predefined_handler_schemes;
}

Finally, embedders would just need to implement the ChromeContentClient::AddAdditionalSchemes interface to register a scheme and its associated handler.

For instance, in order to install a predefined handler for the IPFS protocol on Android, we would just need to add the following logic in the AwContentClient::AddAdditionalSchemes function:

  
schemes.predefined_handler_schemes.emplace_back(
    "ipfs", "https://dweb.link/ipfs/?uri=%s");
schemes.predefined_handler_schemes.emplace_back(
    "ipns", "https://dweb.link/ipns/?uri=%s");

Conclusion

The pre-defined handlers allow us to introduce in the browser a custom behavior for specific schemes.

Unlike the registerProtocolHandler method, we could bypass some of the restrictions imposed by the HTML API, given that in the context of an embedder browser it’s possible to have more control over the navigation use cases.

The custom handlers, either using the pre-defined handlers approach or the registerProtocolHandler method, are not the ideal way of introducing multi-protocol support in the browser. I personally consider them a promising intermediate step, but a more solid solution must be designed for these use cases of the browser.

by jfernandez at November 14, 2022 11:07 AM

November 10, 2022

Danylo Piliaiev

Debugging Unrecoverable GPU Hangs

  1. Breadcrumbs
  2. Turnip’s take on the breadcrumbs
  3. Limitations

I already talked about debugging hangs in “Graphics Flight Recorder - unknown but handy tool to debug GPU hangs”, now I want to talk about the most nasty kind of GPU hangs - the ones which cannot be recovered from, where your computer becomes completely unresponsive and you cannot even ssh into it.

How would one debug this? There is no data to get after the hang, and it’s incredibly frustrating to even try different debug options and hypotheses: if you are wrong, you get to reboot the machine!

If you are a hardware manufacturer creating a driver for your own GPU, you could just run the workload in your fancy hardware simulator, wait for a few hours for the result and call it a day. But what if you don’t have access to a simulator, or to some debug side channel?

There are a few things one could try:

  • Try all debug options you have, try to disable compute dispatches, some draw calls, blits, and so on. The downside is that you have to reboot every time you hit the issue.
  • Eyeball the GPU packets until they start eyeballing you.
  • Breadcrumbs!

Today we will talk about the breadcrumbs. Unfortunately, Graphics Flight Recorder, which I already wrote about, isn’t of much help here. The hang is unrecoverable so GFR isn’t able to gather the results and write them to the disk.

But the idea of breadcrumbs is still useful! What if, instead of gathering results post factum, we stream them to some other machine? This would allow us to get the results even if our target becomes unresponsive. Though, the requirement to get the results ASAP considerably changes the workflow.

What if we write breadcrumbs on GPU after each command and spin a thread on CPU reading it in a busy loop?

In practice, the amount of breadcrumbs between the one sent over the network and the one currently executed is just too big to be practical.

So we have to make the GPU and CPU run in lockstep:

  • GPU writes a breadcrumb and immediately waits for this value to be acknowledged;
  • CPU in a busy loop checks the last written breadcrumb value, sends it over socket, and writes it back to the fixed address;
  • GPU sees a new value and continues execution.

This way the most recent breadcrumb gets immediately sent over the network. In practice, some breadcrumbs are still lost between the last one sent over the network and the one where the GPU hangs. But the difference is only a few of them.
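
As a rough illustration (and not the actual Turnip code), the CPU side of that lockstep could look something like the sketch below, where gpu_breadcrumb and cpu_ack are hypothetical pointers into memory shared with the GPU and send_breadcrumb stands in for whatever pushes the value over the socket. A real implementation also has to deal with memory barriers and cache maintenance, which are omitted here.

#include <cstdint>

void breadcrumb_ack_loop(volatile uint32_t* gpu_breadcrumb,  // written by the GPU
                         volatile uint32_t* cpu_ack,         // the GPU spins on this
                         void (*send_breadcrumb)(uint32_t)) {
  uint32_t last_seen = *cpu_ack;
  for (;;) {  // busy loop: latency matters much more than CPU usage here
    uint32_t current = *gpu_breadcrumb;
    if (current != last_seen) {
      send_breadcrumb(current);  // stream it out before a potential hang
      *cpu_ack = current;        // acknowledge, letting the GPU continue
      last_seen = current;
    }
  }
}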

With the lockstep execution we could narrow down the hanging command even further. For this we have to wait for a certain time after each breadcrumb before proceeding to the next one. I chose to just prompt the user for explicit keyboard input for each breadcrumb.

I ended up with the following workflow:

  • Run with breadcrumbs enabled but without requiring explicit user input;
  • On another machine receive the stream of breadcrumbs via network;
  • Note the last received breadcrumb, the hanging command would be nearby;
  • Reboot the target;
  • Enable breadcrumbs starting from a few breadcrumbs before the last one received, require explicit ack from the user on each breadcrumb;
  • In a few steps GPU should hang;
  • Now that we know the closest breadcrumb to the real hang location - we can get the command stream and see what happens right after the breadcrumb;
  • Knowing which command(s) is causing our hang - it’s now possible to test various changes in the driver.

Could Graphics Flight Recorder do this?

In theory yes, but it would require an additional VK extension to be able to wait for a value in memory. However, it would still have a crucial limitation.

Even with Vulkan being really close to the hardware there are still many cases where one Vulkan command is translated into many GPU commands under the hood. Things like image copies, blits, renderpass boundaries. For unrecoverable hangs we want to narrow down the hanging GPU command as much as possible, so it makes more sense to implement such functionality in a driver.

Turnip’s take on the breadcrumbs

I recently implemented it in Turnip (open-source Vulkan driver for Qualcomm’s GPUs) and used a few times with good results.

The current implementation in Turnip is rather spartan and gives a minimal set of instruments to achieve the workflow described above, which looks like this:

1) Launch workload with TU_BREADCRUMBS envvar (TU_BREADCRUMBS=$IP:$PORT,break=$BREAKPOINT:$BREAKPOINT_HITS):

TU_BREADCRUMBS=$IP:$PORT,break=-1:0 ./some_workload

2) Receive breadcrumbs on another machine via this bash spaghetti:

> nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

Received packet from 10.42.0.19:49116 -> 10.42.0.1:11113 (local)
1:0
7:0
8:0
9:0
10:0
11:0
12:0
13:0
14:0
[...]
10:3
11:3
12:3
13:3
14:3
15:3
16:3
17:3
18:3

Each line is a breadcrumb № and how many times it was repeated (either if the command buffer is reusable or if the breadcrumb is in a command stream which repeats for each tile).

3) Increase hang timeout:

echo -n 120000 > /sys/kernel/debug/dri/0/hangcheck_period_ms

4) Launch workload and break on the last known executed breadcrumb:

TU_BREADCRUMBS=$IP:$PORT,break=15:3 ./some_workload
GPU is on breadcrumb 18, continue?y
GPU is on breadcrumb 19, continue?y
GPU is on breadcrumb 20, continue?y
GPU is on breadcrumb 21, continue?y
...

5) Continue until GPU hangs.

🎉 Now you know two breadcrumbs between which there is a command which causes the hang.

Limitations

  • If the hang was caused by a lack of synchronization, or a lack of cache flushing - it may not be reproducible with breadcrumbs enabled.
  • While breadcrumbs could help to narrow down the hang to a single command - a single command may access many different states and have many steps under the hood (e.g. draw or blit commands).
  • The method for breadcrumbs described here does affect performance, especially if tiled rendering is used, since breadcrumbs are repeated for each tile.

by Danylo Piliaiev at November 10, 2022 11:00 PM

Melissa Wen

V3D enablement in mainline kernel

Hey,

If you enjoy using the upstream Linux kernel in your Raspberry Pi system or just want to give the freshest kernel graphics drivers there a try, the good news is that now you can compile and boot the V3D driver from mainline in your Raspberry Pi 4. Thanks to the work of Stefan, Peter and Nicolas [1] [2], the V3D enablement reached the Linux kernel mainline. That means hacking and using new features available in the upstream V3D driver directly from the source.

However, even for those used to compiling and installing a custom kernel in the Raspberry Pi, there are some quirks to getting the mainline v3d module available in 32-bit and 64-bit systems. I’ve quickly summarized how to compile and install upstream kernel versions (>=6.0) in this short blog post.

Note: V3D driver is not present in Raspberry Pi models 0-3.

First, I’m taking into account that you already know how to cross-compile a custom kernel to your system. If it is not your case, a good tutorial is already available in the Raspberry Pi documentation, but it targets the kernel in the rpi-linux repository (downstream kernel).

From this documentation, the main differences in the upstream kernel are presented below:

Raspberry Pi 4 64-bit (arm64)

Diff short summary:

  1. instead of getting the .config file from bcm2711_defconfig, get it by running make ARCH=arm64 defconfig
  2. compile and install the kernel image and modules as usual, but just copy the dtb file arch/arm64/boot/dts/broadcom/bcm2711-rpi-4-b.dtb to the /boot of your target machine (no /overlays directory to copy)
  3. change /boot/config.txt:
    • comment/remove the dtoverlay=vc4-kms-v3d entry
    • add a device_tree=bcm2711-rpi-4-b.dtb entry

Raspberry Pi 4 32-bits (arm)

Diff short summary:

  1. get the .config file by running make ARCH=arm multi_v7_defconfig
  2. using make ARCH=arm menuconfig or a similar tool, enable CONFIG_ARM_LPAE=y
  3. compile and install the kernel image and modules as usual, but just copy the dtb file arch/arm/boot/dts/bcm2711-rpi-4-b.dtb to the /boot of your target machine (no /overlays directory to copy)
  4. change /boot/config.txt:
    • comment/remove the dtoverlay=vc4-kms-v3d entry
    • add a device_tree=bcm2711-rpi-4-b.dtb entry

Step-by-step for remote deployment:

Set variables

Raspberry Pi 4 64-bit (arm64)

cd <path-to-upstream-linux-directory>
KERNEL=`make kernelrelease`
ARCH="arm64"
CROSS_COMPILE="aarch64-linux-gnu-"
DEFCONFIG=defconfig
IMAGE=Image
DTB_PATH="broadcom/bcm2711-rpi-4-b.dtb"
RPI4=<ip>
TMP=`mktemp -d`

Raspberry Pi 4 32-bits (arm)

cd <path-to-upstream-linux-directory>
KERNEL=`make kernelrelease`
ARCH="arm"
CROSS_COMPILE="arm-linux-gnueabihf-"
DEFCONFIG=multi_v7_defconfig
IMAGE=zImage
DTB_PATH="bcm2711-rpi-4-b.dtb"
RPI4=<ip>
TMP=`mktemp -d`

Get default .config file

make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE $DEFCONFIG

Raspberry Pi 4 32-bit (arm)

Additional step for 32-bit system. Enable CONFIG_ARM_LPAE=y using make ARCH=arm menuconfig

Cross-compile the mainline kernel

make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE $IMAGE modules dtbs

Install modules to send

make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE INSTALL_MOD_PATH=$TMP modules_install

Copy kernel image, modules and the dtb to your remote system

ssh $RPI4 mkdir -p /tmp/new_modules /tmp/new_kernel
rsync -av $TMP/ $RPI4:/tmp/new_modules/
scp arch/$ARCH/boot/$IMAGE $RPI4:/tmp/new_kernel/$KERNEL
scp arch/$ARCH/boot/dts/$DTB_PATH $RPI4:/tmp/new_kernel
ssh $RPI4 sudo rsync -av /tmp/new_modules/lib/modules/ /lib/modules/
ssh $RPI4 sudo rsync -av /tmp/new_kernel/ /boot/
rm -rf $TMP

Set config.txt of your RPi 4

In your Raspberry Pi 4, open the config file /boot/config.txt

  • comment/remove the dtoverlay=vc4-kms-v3d entry
  • add a device_tree=bcm2711-rpi-4-b.dtb entry
  • add a kernel=<image-name> entry

Why not Kworkflow?

You can safely use the steps above, but if you are hacking the kernel and need to repeat these compile-and-install steps over and over, why not try Kworkflow?

Kworkflow is a set of scripts that synthesizes all the steps to get a custom kernel compiled and installed on local and remote machines, and it supports kernel building and deployment to Raspberry Pi machines for Raspbian 32-bit and 64-bit.

After learning the kernel compilation and installation step by step, you can simply use kw bd command and have a custom kernel installed in your Raspberry Pi 4.

Note: the userspace 3D acceleration (glx/mesa) is working as expected on arm64, but the driver is not loaded yet for arm. Besides that, a bunch of pte invalid errors may appear when using 3D acceleration; it’s a known issue that is still under investigation.

November 10, 2022 02:40 PM

November 01, 2022

Abhijeet Kandalkar

Improving disk read performance

When I started working on analyzing the startup time of a browser, I didn’t have an idea where to start. For me, performance was a numbers game and a trial-and-error approach. The challenge I was looking into is described in bug 1270977, in which the chrome binary hosted on two different partitions behaved differently.

BROWSER | LaCrOS         | Mount point                      | Partition | fs type  | comment
B1      | stateful       | /run/imageloader/lacros-*/chrome | stateful  | squashfs | downloaded
B2      | deployedLacros | /usr/local/lacros-chrome         | stateful  | ext4     | pushed by developer

You can learn more about LaCrOS in my previous blog. From the above table, the difference in filesystem was the first suspect. Squashfs is a compressed read-only file system for Linux.

To emulate the first launch, all caches are cleared using drop_caches as below.

echo 3 > /proc/sys/vm/drop_caches

How to measure a file system read time?

For investigation and testing/benchmarking, sequential read requests are emulated using the dd command and random read requests are emulated using fio (Flexible IO Tester).

For sequential reads, B1 managed around 92 MB/s while B2 managed 242 MB/s (a higher read speed means faster). It’s a huge difference: ext4 is faster than squashfs.

localhost ~ # echo 3 > /proc/sys/vm/drop_caches
localhost ~ # time dd if=/run/imageloader/lacros-dogfood-dev/99.0.4824.0/chrome of=/dev/null bs=4096
43910+1 records in
43910+1 records out
179859368 bytes (180 MB, 172 MiB) copied, 1.93517 s, 92.9 MB/s

real    0m1.951s
user    0m0.020s
sys 0m0.637s
localhost ~ # echo 3 > /proc/sys/vm/drop_caches
localhost ~ # time dd if=/usr/local/lacros-chrome/chrome of=/dev/null bs=4096
43910+1 records in
43910+1 records out
179859368 bytes (180 MB, 172 MiB) copied, 0.744337 s, 242 MB/s

real    0m0.767s
user    0m0.056s
sys 0m0.285s

As squashfs is a compressed file system, it first uncompresses the data and then performs the read. So it is definitely going to be slower than ext4, but the difference (242 vs 92 MB/s) was huge and unacceptable.

Can we do better?

The next step was to find some tools that could provide more information.

  • strace - a Linux utility to find out which system calls take more time or have blocking IO. We couldn’t get much information out of this analysis.
  • iostat - This provides more insight information for input/output

$echo 3 > /proc/sys/vm/drop_caches && iostat -d -tc -xm 1 20

After running this command for both B1 and B2, it appears that B1 is not merging read requests. In other words, the rrqm/s and %rrqm values are ZERO. Refer to man iostat:

%rrqm The percentage of read requests merged together before being sent to the device.

rrqm/s The number of read requests merged per second that were queued to the device.

The data to be read is scattered across different sectors of different blocks. As a result of this scattering, the read requests were not merged, so there could be a lot of read requests going down to the block device.

The next step was to find out who is responsible for merging read requests.

IO scheduler

The IO scheduler optimizes disk access requests by merging I/O requests to similar locations on disk. To get the current IO scheduler for a disk device:

$ cat /sys/block/<disk device>/queue/scheduler

for B1,

localhost ~ # cat /sys/block/dm-1/queue/scheduler 
none

For B2, it uses the bfq scheduler

localhost ~ # cat /sys/block/mmcblk0/queue/scheduler 
mq-deadline kyber [bfq] none

As you can notice, B1 does not have a scheduler. For B2, the scheduler is bfq (Budget Fair Queueing).

SUSE and IBM have good documentation about IO scheduling.

It has been found so far that there are differences between file systems, read request merge patterns, and IO schedulers. But there is one more difference which we haven’t noticed yet.

If you look at how B1 is mounted, you will see that we are using a virtual device.

localhost ~ # df -TH
Filesystem  Type      Size  Used  Avail  Use%  Mounted on
/dev/dm-1   squashfs  143M  143M      0  100%  /run/imageloader/lacros-dogfood-dev/99.0.4824.0

A twist in my story

For B1, we are not directly communicating with the block device; instead, the requests go through a virtual device using the device mapper. As you might have noticed above, the dm-1 device mapper virtual device is used by B1.

Read more about the device mapper and refer to the diagram on its Wikipedia page.

We are using dm-verity as the virtual device.

Device-Mapper’s “verity” target provides transparent integrity checking of block devices using a cryptographic digest provided by the kernel crypto API. This target is read-only.

From a security point of view, it makes sense to host your executable binary on a read-only file system that provides integrity checking.

There is a twist in that dm-verity does not have a scheduler, and scheduling is delegated to the scheduler of the underlying block device. The I/O will not be merged at the device-mapper layer, but at an underlying layer.

Several layers are different between B1 and B2, including partition integrity checking (dm-verity) and compression (squashfs).

In addition, we found that B1 defaults to using --direct-io.

What is DIRECT IO ?

Direct I/O is a feature of the file system whereby file reads and writes go directly from the applications to the storage device, bypassing the operating system read and write caches.

Although direct I/O can reduce CPU usage, using it typically results in the process taking longer to complete, especially for relatively small requests. This penalty is caused by the fundamental differences between normal cached I/O and direct I/O.

Every direct I/O read causes a synchronous read from the disk; unlike the normal cached I/O policy where the read may be satisfied from the cache. This can result in very poor performance if the data was likely to be in memory under the normal caching policy.
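
To make the difference tangible, here is a small, hypothetical C++ snippet (not ChromeOS code) that reads the first block of the deployed binary with O_DIRECT. The buffer, size and offset must be suitably aligned, and every read goes straight to the device instead of being satisfied from the page cache:

#include <fcntl.h>    // open, O_DIRECT (Linux-specific flag)
#include <unistd.h>   // read, close
#include <cstdio>
#include <cstdlib>    // posix_memalign, free

int main() {
  const size_t kAlignment = 4096;  // should match or exceed the logical block size
  const size_t kReadSize = 4096;

  void* buf = nullptr;
  if (posix_memalign(&buf, kAlignment, kReadSize) != 0) return 1;

  int fd = open("/usr/local/lacros-chrome/chrome", O_RDONLY | O_DIRECT);
  if (fd < 0) { std::perror("open"); return 1; }

  // Synchronous read straight from the device: the page cache is bypassed,
  // so re-running this hits the disk every single time.
  ssize_t n = read(fd, buf, kReadSize);
  std::printf("read %zd bytes with O_DIRECT\n", n);

  close(fd);
  std::free(buf);
  return 0;
}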

Is it possible to control how many read requests are made?

We decided to go deeper and analyze the block layer, where the request queue is handled and I/O scheduling is done. To see how the flow of IO goes through the block layer, there is a suite of tools consisting of

  • blktrace (tracing the IO requests through the block layer),
  • blkparse (parsing the output of blktrace to make it human readable),
  • btrace (script to combine blktrace and blkparse), and
  • btt (a blktrace output post-processing tool), among others.

Below is the result of block-level parsing.

For B1,

ALL MIN AVG MAX N
Q2Q 0.000000001 0.000069861 0.004097547 94883
Q2G 0.000000242 0.000000543 0.000132620 94884
G2I 0.000000613 0.000001455 0.003585980 94884
I2D 0.000001399 0.000002456 0.003777911 94884
D2C 0.000133042 0.000176883 0.001771526 23721
Q2C 0.000136308 0.000181337 0.003968613 23721

For B2,

ALL MIN AVG MAX N
Q2Q 0.000000001 0.000016582 0.003283451 76171
Q2G 0.000000178 0.000000228 0.000016893 76172
G2I 0.000000626 0.000005904 0.000587304 76172
I2D 0.000002011 0.000007582 0.000861783 76172
D2C 0.000004959 0.000067955 0.001275667 19043
Q2C 0.000009132 0.000081670 0.001315828 19043
  • Number of read requests(N) are more for B1 than B2.
    • For B1, N = 94883 (partition with squashfs + dm-verity)
    • For B2, N = 76171 (ext4 partition)
  • Average D2C time for B1 is more than twice that of B2 (0.000176883 > 2*0.000067955).

D2C time: the average time from when the actual IO was issued to the driver until is completed (completion trace) back to the block IO layer.

--d2c-latencies option: This option instructs btt to generate the D2C latency file. This file has the first column (X values) representing runtime (seconds), while the second column (Y values) shows the actual latency for command at that time (either Q2D, D2C or Q2C).

FIO(Flexible IO Tester)

Up until now, all analysis has been performed using dd, which emulates sequential read requests.

Read access patterns also affect IO performance, so to emulate random IO workloads, tools like FIO are useful.

When you specify --size, FIO creates a file with the given filename, and when you don’t specify it, FIO reads from the filename.

We are testing an existing chrome binary without providing its size.

localhost~# fio --name=4k_read --ioengine=libaio --iodepth=1 --rw=randread --bs=4k --norandommap --filename=<dir>/chrome

Hotspots

A “hotspot” is a place where the app is spending a lot of time. After performing the above analysis, find those areas and speed them up. This is easily done using a profiling tool.

For optimal performance,

  • reduce the number of requests (N)
    • by tuning the block size of the filesystem (squashfs) of the partition
  • reduce the time taken to process each request (D2C)
    • by tuning the compression algorithm used by the squashfs filesystem
    • by tuning the hashing function used by dm-verity

Tuning a blocksize

A block is a contiguous location on the hard drive where data is stored. In general, a filesystem stores data as a collection of blocks. When the block size is small, the data gets divided into and distributed across more blocks. As more blocks are created, there will be more read requests.
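
As a back-of-the-envelope illustration using the numbers from the dd run above: the chrome binary is 179,859,368 bytes, so with 4 KiB blocks that is roughly 179859368 / 4096 ≈ 43,911 blocks (matching the 43910+1 records dd reported), while with 128 KiB blocks it would be only about 179859368 / 131072 ≈ 1,373 blocks, i.e. far fewer read requests to issue and merge.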

Squashfs compresses files, inodes, and directories, and supports block sizes from 4 KiB up to 1 MiB for greater compression.

We experimented with different block sizes: 4K, 64K, 128K and 256K. As the block size increases, the read speed for dd increases. We have chosen 128K.

mksquashfs /tmp/squashfs-root /home/chronos/cros-components/lacros-dogfood-dev/100.0.4881.0/image.squash.test.<block-size> -b <block-size>K

Tuning a compression algorithm

Squashfs supports different compression algorithms. mksquashfs command has the option -comp <comp> to configure the compression algorithm. There are different compressors available:

  • gzip (default)
  • lzma
  • lzo
  • lz4
  • xz
  • zstd

mksquashfs /tmp/squashfs-root /home/chronos/cros-components/lacros-dogfood-dev/100.0.4881.0/image.squash.test.<block-size> -b <block-size>K -comp gzip

Tuning a hashing algorithm

This is related to dm-verity, which verifies each read request against a cryptographic digest. It can use different hashing algorithms.

  • sha256 (default)
  • sha224
  • sha1
  • md5

We can use hardware implementations of the above algorithms to speed things up.

Conclusion

The purpose of this blog post is to demonstrate how to identify the hotspot and how to explore different parameters affecting the system’s performance. It was an interesting experience and I learned a lot. By sharing all the details about how the problem was broken down and improved, I hope others can learn something as well. As a result of playing around with the various combinations of parameters such as blocksize, compression, and direct-io, we have decided to go with the following configuration.

  1. For Filesystem,
    • Blocksize : Changed from 4K to 128K
    • Compression algorithm : Changed from gzip to xz
    • direct-io : Changed from on to off
  2. dm-verity :
    • Hashing algorithm : Currently it uses sha256. Either use a hardware implementation of sha256 or switch to another algorithm.

Your first step should be to identify your current stack and then analyze each layer of your file system stack using the available tools and optimize them. The information provided in this blog may be helpful to you in deciding what to do about your own performance issues. You can get in touch with me if you need any help.

November 01, 2022 06:30 PM

October 31, 2022

Andy Wingo

ephemerons and finalizers

Good day, hackfolk. Today we continue the series on garbage collection with some notes on ephemerons and finalizers.

conjunctions and disjunctions

First described in a 1997 paper by Barry Hayes, which attributes the invention to George Bosworth, ephemerons are a kind of weak key-value association.

Thinking about the problem abstractly, consider that the garbage collector's job is to keep live objects and recycle memory for dead objects, making that memory available for future allocations. Formally speaking, we can say:

  • An object is live if it is in the root set

  • An object is live if it is referenced by any live object.

This circular definition uses the word any, indicating a disjunction: a single incoming reference from a live object is sufficient to mark a referent object as live.

Ephemerons augment this definition with a conjunction:

  • An object V is live if, for an ephemeron E containing an association between objects K and V, both E and K are live.

This is a more annoying property for a garbage collector to track. If you happen to mark K as live and then you mark E as live, then you can just continue to trace V. But if you see E first and then you mark K, you don't really have a direct edge to V. (Indeed this is one of the main purposes for ephemerons: associating data with an object, here K, without actually modifying that object.)

During a trace of the object graph, you can know if an object is definitely alive by checking if it was visited already, but if it wasn't visited yet that doesn't mean it's not live: we might just have not gotten to it yet. Therefore one common implementation strategy is to wait until tracing the object graph is done before tracing ephemerons. But then we have another annoying problem, which is that tracing ephemerons can result in finding more live ephemerons, requiring another tracing cycle, and so on. Mozilla's Steve Fink wrote a nice article on this issue earlier this year, with some mitigations.
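
As a rough sketch of that strategy (with simplified stand-in types, not any collector's real code), the post-trace ephemeron pass can be written as a fixpoint loop: keep re-scanning the ephemerons discovered during the trace, because marking one value may make another ephemeron's key live.

#include <vector>

struct Object {
  bool marked = false;
  std::vector<Object*> strong_refs;
};

struct Ephemeron {
  Object* key;
  Object* value;
};

// Mark an object and everything strongly reachable from it.
void mark(Object* o) {
  if (o == nullptr || o->marked) return;
  o->marked = true;
  for (Object* ref : o->strong_refs) mark(ref);
}

// Called once the ordinary trace from the roots is done.
void trace_ephemerons(const std::vector<Ephemeron*>& live_ephemerons) {
  bool progress = true;
  while (progress) {
    progress = false;
    for (Ephemeron* e : live_ephemerons) {
      if (e->key->marked && !e->value->marked) {
        mark(e->value);  // conjunction satisfied: E is in the live table and K is marked
        progress = true;
      }
    }
  }
  // Ephemerons whose keys are still unmarked hold dead associations.
}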

finalizers aren't quite ephemerons

All that is by way of introduction. If you just have an object graph with strong references and ephemerons, our definitions are clear and consistent. However, if we add some more features, we muddy the waters.

Consider finalizers. The basic idea is that you can attach one or a number of finalizers to an object, and that when the object becomes unreachable (not live), the system will invoke a function. One way to imagine this is a global association from finalizable object O to finalizer F.

As it is, this definition is underspecified in a few ways. One, what happens if F references O? It could be a GC-managed closure, after all. Would that prevent O from being collected?

Ephemerons solve this problem, in a way; we could trace the table of finalizers like a table of ephemerons. In that way F would only be traced if O is live already, so that by itself it wouldn't keep O alive. But then if O becomes dead, you'd want to invoke F, so you'd need it to be live, so reachability of finalizers is not quite the same as ephemeron-reachability: indeed logically all F values in the finalizer table are live, because they all will be invoked at some point.

In the end, if F references O, then F actually keeps O alive. Whether this prevents O from being finalized depends on our definition for finalizability. We could say that an object is finalizable if it is found to be unreachable after a full trace, and the finalizers F are in the root set. Or we could say that an object is finalizable if it is unreachable after a partial trace, in which finalizers are not themselves in the initial root set, and instead we trace them after determining the finalizable set.

Having finalizers in the initial root set is unfortunate: there's no quick check you can make when adding a finalizer to signal this problem to the user, and it's very hard to convey to a user exactly how it is that an object is referenced. You'd have to add lots of gnarly documentation on top of the already unavoidable gnarliness that you already had to write. But, perhaps it is a local maximum.

Incidentally, you might think that you can get around these issues by saying "don't reference objects from their finalizers", and that's true in a way. However it's not uncommon for finalizers to receive the object being finalized as an argument; after all, it's that object which probably encapsulates the information necessary for its finalization. Of course this can lead to the finalizer prolonging the longevity of an object, perhaps by storing it to a shared data structure. This is a risk for correct program construction (the finalized object might reference live-but-already-finalized objects), but not really a burden for the garbage collector, except in that it's a serialization point in the collection algorithm: you trace, you compute the finalizable set, then you have to trace the finalizables again.

ephemerons vs finalizers

The gnarliness continues! Imagine that O is associated with a finalizer F, and also, via ephemeron E, some auxiliary data V. Imagine that at the end of the trace, O is unreachable and so will be dead. Imagine that F receives O as an argument, and that F looks up the association for O in E. Is the association to V still there?

Guile's documentation on guardians, a finalization-like facility, specifies that weak associations (i.e. ephemerons) remain in place when an object becomes collectable, though I think in practice this has been broken since Guile switched to the BDW-GC collector some 20 years ago or so and I would like to fix it.

One nice solution falls out if you prohibit resuscitation by not including finalizer closures in the root set and not passing the finalizable object to the finalizer function. In that way you will never be able to look up E×O → V, because you don't have O. This is the path that JavaScript has taken, for example, with WeakMap and FinalizationRegistry.

However if you allow for resuscitation, for example by passing finalizable objects as an argument to finalizers, I am not sure that there is an optimal answer. Recall that with resuscitation, the trace proceeds in three phases: first trace the graph, then compute and enqueue the finalizables, then trace the finalizables. When do you perform the conjunction for the ephemeron trace? You could do so after the initial trace, which might augment the live set, protecting some objects from finalization, but possibly missing ephemeron associations added in the later trace of finalizable objects. Or you could trace ephemerons at the very end, preserving all associations for finalizable objects (and their referents), which would allow more objects to be finalized at the same time.

Probably if you trace ephemerons early you will also want to trace them later: you trace them early because you think ephemeron associations are important and want them to prevent objects from being finalized, and it would be weird if they were not present for finalizable objects. This adds more serialization to the trace algorithm, though:

  1. (Add finalizers to the root set?)

  2. Trace from the roots

  3. Trace ephemerons?

  4. Compute finalizables

  5. Trace finalizables (and finalizer closures if not done in 1)

  6. Trace ephemerons again?

These last few paragraphs are the reason for today's post. It's not clear to me that there is an optimal way to compose ephemerons and finalizers in the presence of resuscitation. If you add finalizers to the root set, you might prevent objects from being collected. If you defer them until later, you lose the optimization that you can skip steps 5 and 6 if there are no finalizables. If you trace (not-yet-visited) ephemerons twice, that's overhead; if you trace them only once, the user could get what they perceive as premature finalization of otherwise reachable objects.
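
To make these orderings concrete, here is a toy sketch of the whole collection in Python. Everything in it (the Obj class, the helper names, the finalizers_are_roots flag) is invented for illustration; real collectors interleave ephemeron discovery with marking rather than running separate fixed-point passes, so read it as a map of the numbered steps above rather than an implementation.

class Obj:
    def __init__(self, *refs):
        self.refs = list(refs)   # ordinary strong edges

def trace(objects, live):
    # Standard transitive marking from a set of starting objects.
    stack = list(objects)
    while stack:
        obj = stack.pop()
        if obj in live:
            continue
        live.add(obj)
        stack.extend(obj.refs)

def trace_ephemerons(ephemerons, live):
    # Trace each value whose key is already live, repeating to a fixed point.
    changed = True
    while changed:
        changed = False
        for key, value in ephemerons.items():
            if key in live and value not in live:
                trace([value], live)
                changed = True

def collect(roots, ephemerons, finalizers, finalizers_are_roots=False):
    live = set()
    if finalizers_are_roots:                                   # 1. (optional)
        trace(list(finalizers.values()), live)
    trace(roots, live)                                         # 2. trace from roots
    trace_ephemerons(ephemerons, live)                         # 3. early pass (optional)
    finalizable = [o for o in finalizers if o not in live]     # 4. compute finalizables
    if finalizable:
        trace(finalizable, live)                               # 5. trace finalizables...
        if not finalizers_are_roots:
            trace([finalizers[o] for o in finalizable], live)  # ...and their closures
        trace_ephemerons(ephemerons, live)                     # 6. late pass (optional)
    return live, finalizable

As written, calling collect with finalizers_are_roots=True corresponds roughly to the plan described below for Guile: finalizers in the root set, ephemerons traced after the main trace and a second time only when there are finalizable objects.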

In Guile I think I am going to try to add finalizers to the root set, pass the finalizable to the finalizer as an argument, and trace ephemerons twice if there are finalizable objects. I think this will minimize incoming bug reports. I am bummed though that I can't eliminate them by construction.

Until next time, happy hacking!

by Andy Wingo at October 31, 2022 12:21 PM

October 28, 2022

Jacobo Aragunde

Accessible name computation in Chromium Chapter I: Introduction

Back in March last year, as part of my maintenance work on Chrome/Chromium accessibility at Igalia, I inadvertently started a long, deep dive into the accessible name calculation code, which has kept me intermittently busy since then.

As a matter of fact, I started drafting this post over a year ago, and I already presented part of my work in a lightning talk at BlinkOn 15, in November 2021:

This series of posts is the promised closure from that talk, extending and updating what was presented there.

Introduction

Accessible name and description computation is an important matter in web accessibility. It describes how user agents (web browsers) are expected to expose the information provided by web content in text form, in a way that’s meaningful for users of assistive technologies (ATs). This text is meant to be spoken, or routed to a braille printer.

We know a lot of the Web is text, but computing a name is not always that easy. Style rules can affect what is visible and hence what is part of a name or description and what is not. There are specific properties to make content hidden for accessibility purposes (aria-hidden), and also to establish relations between elements for naming and description purposes (aria-labelledby, aria-describedby) beyond the markup hierarchy, potentially introducing loops. A variety of properties are used to provide naming, especially for non-text content, but authors may apply them anywhere (aria-label, alt, title) and we need to decide their priorities. There are a lot of factors in play.

There is a document maintained by the ARIA Working Group at W3C, specifying how to do this: https://w3c.github.io/accname/. The spec text is complex, as it has to lay out the different ways to name content, settle their priorities when they interact and define how other properties affect them.

Accessible name/description in Chromium

You may have heard the browser keeps multiple tree structures in memory to be able to display your web sites. There is the DOM tree, which corresponds with the nodes as specified in the HTML document after it’s been parsed, any wrong markup corrected, etc. It’s a live tree, meaning that it’s not static; JavaScript code and user interaction can alter it.

There is a layout tree, which I won’t dare to explain because it’s outside my area of expertise, but we can say it’s related to how the DOM is displayed: what is hidden and what is visible, and where it’s placed.

And then we have the accessibility tree: it’s a structure that keeps the hierarchy and information needed for ATs. It’s based on the DOM tree but they don’t have the same information: nodes in those trees have different attributes but, more importantly, there is not a 1:1 relation between DOM and accessibility nodes. Not every DOM node needs to be exposed in the accessibility tree and, because of aria-owns relations, tree hierarchies can be different too. It’s also a live tree, updated by JS code and user interaction.

In Chromium implementation, we must also talk about several accessibility trees. There is one tree per site, managed by Blink code in the renderer process, and represented in an internal and platform-agnostic way for the most part. But the browser process needs to compose the tree for the active tab contents with the browser UI and expose it through the means provided by the operating system. More importantly, these accessibility objects are different from the ones used in Blink, because they are defined by platform-specific abstractions: NSAccessibility, ATK/AT-SPI, MSAA/UIA/IAccessible2…

When working on accessible name computation, most of the code I deal with is related to the Blink accessibility tree, save for bugs in the translation to platform abstractions. It is, still, a more complex piece of code than I initially imagined. The logic is spread across twelve functions belonging to different classes, and they call each other, creating cycles. It traversed the accessibility tree, but also the DOM tree on certain occasions.

The problems

Last year, my colleague Joanmarie Diggs landed a set of tests in Chromium, adapted from the test cases used by the Working Group as exit criteria for Candidate Recommendations. There were almost 300 tests, most of them passing just fine, but a few exposed bugs that I thought could be a good introduction to this part of the accessibility code.

The first issue I tried to fix was related to whitespace in names, as it looked simple enough to be a good first contact with the code. The issue consisted of missing whitespace in the accessible names: while empty or whitespace nodes translated to spaces in the rendered text, the same didn’t happen in the accessible names and descriptions. Consider this example:

<label for="test">
  <span>1</span><span>2</span><span>3</span>
  <span>4</span><span>5</span>
</label>
<input id="test"/>

The label is rendered as “123 45” but Chromium produced the accessible name “12345”. It must be noted that the spec is somewhat vague with regard to whitespace, which has already been reported, but we considered Chromium’s behavior wrong for two reasons: it didn’t match the rendered text, and it was different from the other implementations. I reported the issue at the Chromium bug tracker.

Another, bigger problem I set to fix was that the CSS pseudo-elements ::before and ::after weren’t added to accessible names. Consider the markup:

<style>
  label:after { content:" fruit"; }
</style>
<label for="test">fancy</label>
<input id="test"/>

The rendered text is “fancy fruit”, but Chromium was producing the accessible name “fancy”. In addition to not matching the rendered text, the spec explicitly states those pseudo-elements must contribute to the accessible name. I reported this issue, too.

I located the source of both problems at AXNodeObject::TextFromDescendants: this function triggers the calculation of the name for a node based on its descendants, and it was traversing the DOM tree to obtain them. The DOM tree does not contain the aforementioned pseudo-elements and, for that reason, they were not being added to the name. Additionally, we were explicitly excluding whitespace nodes in this traversal, with the intention of preventing big chunks of whitespace being added to names, like new lines and tabs to indent the markup.

A question hit me at this point: why were we using the DOM for this in the first place? That meant doing an extra traversal for every name calculated, and duplicating code that selected which nodes are important. It seems the reasons for this traversal were lost in time, according to code comments:

// TODO: Why isn't this just using cached children?

While I could land (and I did) temporary fixes for these individual bugs, it looked like we could fix “the traversal problem” instead, addressing the root cause of the problems while streamlining our code and, potentially, fixing other issues. I will address this problem, the reasons to traverse the DOM in the first place and the interesting detours this work led me on, in a future post. Stay tuned, and happy hacking!

EDIT: next chapter available here!

by Jacobo Aragunde Pérez at October 28, 2022 02:00 PM

October 27, 2022

Eric Meyer

Masked Gradient Dashed Lines

I talked in my last post about how I used linear gradients to recreate dashed lines for the navlinks and navbar of wpewebkit.org, but that wasn’t the last instance of dashing gradients in the design.  I had occasion to use that technique again, except this time as a way to create a gradient dash.  I mean, a dashed line that has a visible gradient from one color to another, as well as a dashed line that’s constructed using a linear gradient.  Only this time, the dashed gradient is used to create the gaps, not the dashes.

To set the stage, here’s the bit of the design I had to create:

Design!

The vertical dashed line down the left side of the design is a dashed linear gradient, but that’s not actually relevant.  It could have been a dashed border style, or an SVG, or really anything.  And that image on the right is a PNG, but that’s also not really relevant.  What’s relevant is I had to make sure the image was centered in the content column, and yet its dashed line connected to the left-side dashed line, regardless of where the image landed in the actual content.

Furthermore, the colors of the two dashed lines were different at the point I was doing the implementation: the left-side line was flat black, and the line in the image was more of a dark gray.  I could have just connected a dark gray line to the black, but it didn’t look right.  A color transition was needed, while still being a dashed line.

Ideally, I was looking for a solution that would allow a smooth color fade over the connecting line’s length while also allowing the page background color to show through.  Because the page background might sometimes be white and sometimes be a light blue and might in the future be a lime green wavy pattern or something, who knows.

So I used a dashed linear gradient as a mask.  The CSS goes like this:

.banner::before {
	content: '';
	position: absolute;
	top: 50%;
	left: -5rem;
	width: 5rem;
	height: 1px;
	background: linear-gradient(90deg, #222, #888);
	mask-image: repeating-linear-gradient(
		270deg, transparent, #999 1px 3px, transparent 4px 7px);
}

Please allow me to break it down a step at a time.

First, there’s the positioning of the pseudo-element, reaching leftward 5rem from the left edge of the content column, which here I’ve annotated with a red outline. (I prefer outlines to borders because outlines don’t participate in layout, so they can’t shift things around.)

.banner::before {
	content: '';
	position: absolute;
	top: 50%;
	left: -5rem;
	width: 5rem;
	height: 1px;
}
The very skinny-short red box is where the connecting line needs to be drawn.

To that pseudo-element, I added a 90-degree-pointing linear gradient from black to gray to its background.

	…
	background: linear-gradient(90deg, #222, #888);
}
The gradient filling the entire background of the pseudo-element.

The pseudo-element does happen to touch the end of one of the vertical dashes, but that’s purely by coincidence.  It could have landed anywhere, including between two dashes.

So now it was time for a mask.  CSS Masks are images used to hide parts of an element based on the contents of the masking image, usually its alpha channel. (Using luminosity to define transparency levels via the mask-mode property is also an option, but Chrome doesn’t support that as yet.)  For example, you can use small images to clip off the corners of an element.

In this case, I defined a repeating linear gradient because I knew what size the dashes should be, and I didn’t want to mess with mask-size and mask-repeat (ironically enough, as you’ll see in a bit).  This way, the mask is 100% the size of the element’s box, and I just need to repeat the gradient pattern however many times are necessary to cross the entire width of the element’s background.

Given the design constraints, I wanted the dash pattern to start from the right side, the one next to the image, and repeat leftward, to ensure that the dash pattern would look unbroken.  Thus I set its direction to be 270 degrees.  Here I’ll have it alternate between red and transparent, because the actual color used for the opaque parts of the mask doesn’t matter, only its opacity:

	…
	mask-image: repeating-linear-gradient(
		270deg, transparent, red 1px 3px, transparent 4px 7px);
}
The pseudo-element being masked over and over again to create a dashed line that allows the backdrop to show through.

The way the mask works, the transparent parts of the masking image cause the corresponding parts of the element to be transparent  —  to be clipped away, in effect.  The opaque parts, whatever color they actually have, cause the corresponding parts of the element to be drawn.  Thus, the parts of the background’s black-to-gray gradient that line up with the opaque parts of the mask are rendered, and the rest of the gradient is not.  Thus, a color-fading dashed line.

This is actually why I separated the ends of the color stops by a pixel: by defining a one-pixel distance for the transition from transparent to opaque (and vice versa), the browser fills those pixels in with something partway between those two states.  It’s probably 50% opaque, but that’s really up to the browser.

The pixels of the repeating gradient pattern, shown here in the reading order of the CSS value.  In practice, this pattern is flipped horizontally, since its gradient arrow points to 270 degrees (leftward).

The result is a softening effect, which matches well with the dashed line in the image itself and doesn’t look out of place when it meets up with the left-hand vertical dashed line.

At least, everything I just showed you is what happens in Firefox, which proudly supports all the masking properties.  In Chromium and WebKit browsers, you need vendor prefixes for masking to work.  So here’s how the CSS turned out:

.banner::before {
	content: '';
	position: absolute;
	top: 50%;
	left: -5rem;
	width: 5rem;
	height: 1px;
	background: linear-gradient(90deg, #222, #888);
	-webkit-mask-image: repeating-linear-gradient(
		270deg, transparent, red 1px 3px, transparent 4px 7px);
	mask-image: repeating-linear-gradient(
		270deg, transparent, red 1px 3px, transparent 4px 7px);
}

And that’s where we were at launch.

It still bugged me a little, though, because the dashes I created were one pixel tall, but the dashes in the image weren’t.  They were more like a pixel and a half tall once you take the aliasing into account, so they probably started out 2 (or more)  pixels tall and then got scaled down.  The transition from those visually-slightly-larger dashes to the crisply-one-pixel-tall pseudo-element didn’t look quite right.

I’d hoped that just increasing the height of the pseudo-element to 2px would work, but that made the line look too thick.  That meant I’d have to basically recreate the aliasing myself.

At first I considered leaving the mask as it was and setting up two linear gradients, the existing one on the bottom and a lighter one on top.  Instead, I chose to keep the single background gradient and set up two masks, the original on the bottom and a more transparent one on top.  Which meant I’d have to size and place them, and keep them from tiling.  So, despite my earlier avoidance, I had to mess with mask sizing and repeating anyway.

mask-image:
	repeating-linear-gradient(
		270deg, transparent, #89A4 1px 3px, transparent 4px 7px),
	repeating-linear-gradient(
		270deg, transparent, #89A 1px 3px, transparent 4px 7px);
mask-size: 100% 1px;
mask-repeat: no-repeat;
mask-position: 100% 0%, 100% 100%;
Two masks, one effect.

And that’s how it stands as of today.

In the end, this general technique is pretty much infinitely adaptable, in that you could define any dash pattern you like and have it repeat over a gradient, background image, or even just a plain background color.  It could be used to break up an element containing text, so it looks like the text has been projected onto a thick dashed line, or put some diagonal slashes through text, or an image, or a combination.

SLASHED TEXT
No PNG, no SVG, just text and CSS.

Or use a tiled radial gradient to make your own dotted line, one that’s fully responsive without ever clipping a dot partway through.  Wacky line patterns with tiled, repeated conic gradients?  Sure, why not?

The point being, masks let you break up the rectangularity of elements, which can go a long way toward making designs feel more alive.  Give ’em a try and see what you come up with!


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at October 27, 2022 07:37 PM

October 22, 2022

Ricardo García

My mesh shaders talk at XDC 2022

In my previous post I talked about the VK_EXT_mesh_shader extension that had just been released for Vulkan, and in which I had participated by reviewing the spec and writing CTS tests. Back then I referred readers to external documentation sources like the Vulkan mesh shading post on the Khronos Blog, but today I can add one more interesting resource. A couple of weeks ago I went to Minneapolis to participate in XDC 2022, where I gave an introductory talk about mesh shaders that’s now available on YouTube. In the talk I give some details about the concepts, the Vulkan API and how the new shading stages work.

Video: Replacing the geometry pipeline with mesh shaders

Just after me, Timur Kristóf also presented an excellent talk with details about the Mesa mesh shader implementation for RADV, available as part of the same playlist.

As an additional resource, I’m going to participate together with Timur, Steven Winston and Christoph Kubisch in a Khronos Vulkanised Webinar to talk a bit more about mesh shaders on October 27. You must register to attend, but attendance is free.

Back to XDC, crossing the Atlantic Ocean to participate in the event was definitely tiring, but I had a lot of fun at the conference. It was my first in-person XDC and a special one too, this year hosted together with WineConf and FOSS XR. Seeing everyone there and shaking some hands, even with our masks on most of the time, made me realize how much I missed traveling to events. Special thanks to Codeweavers for organizing the conference, and in particular to Jeremy White and especially to Arek Hiler for taking care of most technical details and acting as host and manager in the XDC room.

Apart from my mesh shaders talk, do not miss the other talks by Igalians at the conference.

And, of course, take a look at the whole event playlist for more super-interesting content, like the one by Alyssa Rosenzweig and Lina Asahi about reverse-engineered GPU drivers, the one about Zink by Mike Blumenkrantz (thanks for the many shout-outs!) or the guide to write Vulkan drivers by Jason Ekstrand which includes some info about the new open source Vulkan driver for NVIDIA cards.

That said, if you’re really interested in my talk but don’t want to watch a video (or the closed captions are giving you trouble), you can find the slides and the script to my talk below. Each thumbnail can be clicked to view a full-HD image.

Talk slides and script

Title slide: Replacing the geometry pipeline with mesh shaders

Hi everyone, I’m Ricardo Garcia from Igalia. Most of my work revolves around CTS tests and the Vulkan specification, and today I’ll be talking about the new mesh shader extension for Vulkan that was published a month ago. I participated in the release process of this extension by writing thousands of tests and reviewing and discussing the specification text for Vulkan, GLSL and SPIR-V.

Mesh shaders are a new way of processing geometry in the graphics pipeline. Essentially, they introduce an alternative way of creating graphics pipelines in Vulkan, but they don’t introduce a completely new type of pipeline.

The new extension is multi-vendor and heavily based on the NVIDIA-only extension that existed before, but some details have been fine-tuned to make it closer to the DX12 version of mesh shaders and to make it easier to implement for other vendors.

Main Points

I want to cover what mesh shaders are, how they compare to the classic pipelines and how they solve some problems.

Then we’ll take a look at what a mesh shader looks like and how it works, and we’ll also talk about drawbacks mesh shaders have.

What is mesh shading?

Mesh shading introduces a new type of graphics pipeline with a much smaller number of stages compared to the classic one. One of the new stages is called the mesh shading stage.

These new pipelines try to address some issues and shortcomings with classic pipelines on modern GPUs. The new pipeline stages have many things in common with compute shaders, as we’ll see.

Classic Graphics Pipeline

This is a simplified version of the classic pipeline.

Basically the pipeline can be divided in two parts. The first stages are in charge of generating primitives for the rasterizer.

Then the rasterizer does a lot of magic including primitive clipping, barycentric interpolation and preparing fragment data for fragment shader invocations.

It’s technically possible to replace the whole pipeline with a compute shader (there’s a talk on Thursday about this), but mesh shaders do not touch the rasterizer and everything that comes after it.

Mesh shaders try to apply a compute model to replace some of this with a shader that’s similar to compute, but the changes are restricted to the first part of the pipeline.

Mesh Shading Pipeline

If I have to cut the presentation short, this is perhaps one of the slides you should focus on.

Mesh shading employs a shader that’s similar to compute to generate geometry for the rasterizer.

There’s no input assembly, no vertex shader, no tessellation, etc. Everything that you did with those stages is now done in the new mesh shading stage, which is a bit more flexible and powerful.

Mesh Shading Pipeline (Full)

In reality, the mesh shading extension actually introduces two new stages. There’s an optional task shader that runs before the mesh shader, but we’ll forget about it for now.

Classic Pipeline Problems

These are the problems mesh shading tries to solve.

Vertex inputs are a bit annoying to implement in drivers and in some hardware they use specific fixed function units that may be a bottleneck in some cases, at least in theory.

The main pain point is that vertex shaders work at the per-vertex level, so you don’t generally have control of how geometry is arranged in primitives. You may run several vertex shader invocations that end up forming a primitive that faces back and is not visible and there’s no easy way to filter those out, so you waste computing power and memory bandwidth reading data for those vertices. Some implementations do some clever stuff here trying to avoid these issues.

Finally, tessellation and geometry shaders should perhaps be simpler and more powerful, and should work like compute shaders so we process vertices in parallel more efficiently.

How do they look? GLSL Extension

So far we know mesh shaders look a bit like compute shaders and they need to generate geometry somehow because their output goes to the rasterizer, so lets take a look.

The example will be in GLSL to make it easier to read. As you can see, it needs a new extension which, when translated to SPIR-V will be converted into a SPIR-V extension that gives access to some new opcodes and functionality.

How do they look? Local Size

The first similarity to compute is that mesh shaders are dispatched in 3d work groups like compute shaders, and each of them has a number of invocations in 3d controlled by the shader itself. Same deal. There’s a limit to the size of each work group, but the minimum mandatory limit by the spec is 128 invocations. If the hardware does not support work groups of that size, they will be emulated. We also have a properties structure in Vulkan where you can check the recommended maximum size for work groups according to the driver.

Inside the body of your shader you get access to the typical built-ins for compute shaders, like the number of work groups, work group id, local invocation indices, etc.

If subgroups are supported, you can also use subgroup operations in them.

How do they look? Type of output geometry

But mesh shaders also have to generate geometry. The type cannot be chosen at runtime. When writing the shader you have to decide if your shader will output triangles, lines or points.

How do they look? Maximum vertices and primitives

You must also indicate an upper limit in the number of vertices and primitives that each work group will generate.

Generally speaking, this will be a small-ish number. Several implementations will limit you to 256 vertices and primitives at most, which is the minimum required limit.

To handle big meshes with this, you’ll need several work groups and each work group will handle a piece of the whole mesh.

In each work group, the local invocations will cooperate to generate arrays of vertex and primitive data.

How do they look? Output geometry arrays

And here you can see how. After, perhaps, some initial processing not seen here, you have to indicate how many actual vertices and primitives the work group will emit, using the SetMeshOutputsEXT call.

That call goes first before filling any output array and you can reason about it as letting the implementation allocate the appropriate amount of memory for those output arrays.

Mesh shaders output indexed geometry, like when you use vertex and index buffers together.

You need to write data for each vertex to an output array, and primitive indices to another output array. Typically, each local invocation handles one position or a chunk of those arrays so they cooperate together to fill the whole thing. In the slide here you see a couple of those arrays, the most typical ones.

The built-in mesh vertices ext array contains per-vertex built-ins, like the vertex position. Indices used with this array go from 0 to ACTUAL_V-1

Then, the primitive triangle indices ext array contains, for each triangle, 3 uint indices into the previous vertices array. The primitive indices array itself is accessed using indices from 0 to ACTUAL_P-1. If there’s a second slide that I want you to remember, it’s this one. What we have here is an initial template to start writing any mesh shader.

How do they look? Output attributes

There are a few more details we can add. For example, mesh shaders can also generate custom output attributes that will be interpolated and used as inputs to the fragment shader, just like vertex shaders can.

The difference is that in mesh shaders they form arrays. If we say nothing, like in the first output here, they’re considered per-vertex and have the same index range as the mesh vertices array.

A nice addition for mesh shaders is that you can use the perprimitiveEXT keyword to indicate output attributes are per-primitive and do not need to be interpolated, like the second output here. If you use these, you need to declare them with the same keyword in the fragment shader so the interfaces match. Indices to these arrays have the same range as the built-in primitive indices array.

And, of course, if there’s no input assembly we need to read data from somewhere. Typically from descriptors like storage buffers containing vertex and maybe index information, but we could also generate geometry procedurally.

Some built-in arrays

Just to show you a few more details, these are the built-in arrays used for geometry. There are arrays of indices for triangles, lines or points depending on what the shader is supposed to generate.

The mesh vertices ext array that we saw before can contain a bit more data apart from the position (point size, clip and cull distances).

The third array was not used before. It’s the first time I mention it and as you can see it’s per-primitive instead of per-vertex. You can indicate a few things like the primitive id, the layer or viewport index, etc.

Meshlets

As I mentioned before, each work group can only emit a relatively small number of primitives and vertices, so for big models, several work groups are dispatched and each of them is in charge of generating and processing a meshlet, which are the colored patches you see here on the bunny.

It’s worth mentioning the subdivision of big meshes into meshlets is typically done when preparing assets for the application, meaning there shouldn’t be any runtime delay.

Dispatching Work Groups

Mesh shading work groups are dispatched with specific commands inside a render pass, and they look similar to compute dispatches as you can see here, with a 3d size.

Mesh Shading Pipeline (Full)

Let’s talk a bit about task shaders, which are optional. If present, they go before mesh shaders, and the dispatch commands do not control the number of mesh shader work groups, but the number of task shader work groups that are dispatched and each task shader work group will dispatch a number of mesh shader work groups.

Task (Amplification) Shader

Each task shader work group also follows the compute model, with a number of local invocations that cooperate together.

Each work group typically pre-processes geometry in some way and amplifies or reduces the amount of work that needs to be done. That’s why it’s called the amplification shader in DX12.

Once that pre-processing is done, each task work group decides, at runtime, how many mesh work groups to launch as children, forming a tree with two levels.

Task (Amplification) Shader: Example Dispatch

One interesting detail about this is that compute built-ins in mesh shaders may not be unique when using task shaders. They are only unique per branch. In this example, we dispatched a couple of task shader work groups and each of them decided to dispatch 2 and 3 mesh shader work groups. Some mesh shader work groups will have the same work group id and, if the second task shader work group had launched 2 children instead of 3, even the number of work groups would be the same.

But we probably want them all to process different things, so the way to tell them apart from inside the mesh shader code is to use a payload: a piece of data that is generated in each task work group and passed down to its children as read-only data.

Combining the payload with existing built-ins allows you to process different things in each mesh shader work group.

Payload

This is done like this. On the left you have a task shader.

You can see it also works like a compute shader and invocations cooperate to pre-process stuff and generate the payload. The payload is a variable declared with the task payload shared ext qualifier. These payloads work like shared memory. That’s why they have “shared” in the qualifier.

In the mesh shader they are read-only. You can declare the same payload and read from it.

Mesh Shading Pros

Advantages:

Avoiding input assembly bottlenecks if they exist.

Pre-compute data and discard geometry in advance, saving processing power and memory bandwidth.

Geometry and tessellation can be applied freely, in more flexible ways.

The use of a model similar to compute shaders allows us to take advantage of the GPU processing power more effectively.

Many games use a compute pre-pass to process some data and calculate things that will be needed at draw time. With mesh shaders it may be possible to streamline this and integrate this processing into the mesh or task shaders.

You can also abuse mesh shading pipelines as compute pipelines with two levels if needed.

Mesh Shading Cons

Disadvantages:

Mesh shading is problematic for tiling GPUs as you can imagine, for the same reasons tessellation and geometry shaders suck on those platforms.

Giving users freedom in this part of the pipeline may allow them to shoot themselves in the foot and end up with sub-optimal performance. If not used properly, mesh shaders may be slower than classic pipelines.

The structure of vertex and index buffers needs to be declared explicitly in shaders, increasing coupling between CPU and GPU code, which is not nice.

Most importantly, right now it’s hard or even impossible to write a single mesh shader that performs great on all implementations.

Some vendor preferences are exposed as properties by the extension.

NVIDIA loves smaller work groups and using loops in code to generate geometry with each invocation (several vertices and triangles per invocation).

Threads on AMD can only generate at most one vertex and one primitive, so they’d love you to use bigger work groups and use the local invocation index to access per-vertex and per-primitive arrays.

As you can imagine, this probably results in different mesh shaders for each vendor.

Questions and Answers

In the Q&A section, someone asked about differences between the Vulkan version of mesh shaders and the Metal version of them. Unfortunately, I’m not familiar with Metal so I said I didn’t know. Then, I was asked if it was possible to implement the classic pipeline on top of mesh shaders, and I replied it was theoretically possible. The programming model of mesh shaders is more flexible, so the classic pipeline can be implemented on top of it. However, I didn’t reply (because I have doubts) about how efficient that could be. Finally, the last question asked me to elaborate on stuff that could be done with task shaders, and I replied that apart from integrating compute pre-processing in the pipeline, they were also typically used to select LOD levels or discard meshlets and, hence, to avoid launching mesh work groups to deal with them.

Closing Slide

October 22, 2022 09:07 PM

Andy Wingo

the sticky mark-bit algorithm

Good day, hackfolk!

The Sticky Mark-Bit Algorithm

Also an intro to mark-sweep GC

7 Oct 2022 – Igalia

Andy Wingo

A funny post today; I gave an internal presentation at work recently describing the so-called "sticky mark bit" algorithm. I figured I might as well post it here, as a gift to you from your local garbage human.

Automatic Memory Management

“Don’t free, the system will do it for you”

Eliminate a class of bugs: use-after-free

Relative to bare malloc/free, qualitative performance improvements

  • cheap bump-pointer allocation
  • cheap reclamation/recycling
  • better locality

Continuum: bmalloc / tcmalloc grow towards GC

Before diving in though, we start with some broad context about automatic memory management. The term mostly means "garbage collection" these days, but really it describes a component of a system that provides fresh memory for new objects and automatically reclaims memory for objects that won't be needed in the program's future. This stands in contrast to manual memory management, which relies on the programmer to free their objects.

Of course, automatic memory management ensures some valuable system-wide properties, like lack of use-after-free vulnerabilities. But also by enlarging the scope of the memory management system to include full object lifetimes, we gain some potential speed benefits, for example eliminating any cost for free, in the case of e.g. a semi-space collector.

Automatic Memory Management

Two strategies to determine live object graph

  • Reference counting
  • Tracing

What to do if you trace

  • Mark, and then sweep or compact
  • Evacuate

Tracing O(n) in live object count

I should mention that reference counting is a form of automatic memory management. It's not enough on its own; unreachable cycles in the object reference graph have to be detected either by a heap tracer or broken by weak references.

It used to be that we GC nerds made fun of reference counting as being an expensive, half-assed solution that didn't work very well, but there have been some fundamental advances in the state of the art in the last 10 years or so.

But this talk is more about the other kind of memory management, which involves periodically tracing the graph of objects in the heap. Generally speaking, as you trace you can do one of two things: mark the object, simply setting a bit indicating that an object is live, or evacuate the object to some other location. If you mark, you may choose to then compact by sliding all objects down to lower addresses, squeezing out any holes, or you might sweep all holes into a free list for use by further allocations.

Mark-sweep GC (1/3)

freelist := []

allocate():
  if freelist is empty: collect()
  return freelist.pop()

collect():
  mark()
  sweep()
  if freelist is empty: abort

Concretely, let's look closer at mark-sweep. Let's assume for the moment that all objects are the same size. Allocation pops fresh objects off a freelist, and collects if there is none. Collection does a mark and then a sweep, aborting if sweeping yielded no free objects.

Mark-sweep GC (2/3)

mark():
  worklist := []
  for ref in get_roots():
    if mark_one(ref):
      worklist.add(ref)
  while worklist is not empty:
    for ref in trace(worklist.pop()):
      if mark_one(ref):
        worklist.add(ref)

sweep():
  for ref in heap:
    if marked(ref):
      unmark_one(ref)
    else:
      freelist.add(ref)

Going a bit deeper, here we have some basic implementations of mark and sweep. Marking starts with the roots: edges from outside the automatically-managed heap indicating a set of initial live objects. You might get these by maintaining a stack of objects that are currently in use. Then it traces references from these roots to other objects, until there are no more references to trace. It will visit each live object exactly once, and so is O(n) in the number of live objects.

Sweeping requires the ability to iterate the heap. With the precondition here that collect is only ever called with an empty freelist, it will clear the mark bit from each live object it sees, and otherwise add newly-freed objects to the global freelist. Sweep is O(n) in total heap size, but some optimizations can amortize this cost.

Mark-sweep GC (3/3)

marked := 1

get_tag(ref):
  return *(uintptr_t*)ref
set_tag(ref, tag):
  *(uintptr_t*)ref = tag

marked(ref):
  return (get_tag(ref) & 1) == marked
mark_one(ref):
  if marked(ref): return false;
  set_tag(ref, (get_tag(ref) & ~1) | marked)
  return true
unmark_one(ref):
  set_tag(ref, (get_tag(ref) ^ 1))

Finally, some details on how you might represent a mark bit. If a ref is a pointer, we could store the mark bit in the first word of the objects, as we do here. You can choose instead to store them in a side table, but it doesn't matter for today's example.

Observations

Freelist implementation crucial to allocation speed

Non-contiguous allocation suboptimal for locality

World is stopped during collect(): “GC pause”

mark O(n) in live data, sweep O(n) in total heap size

Touches a lot of memory

The salient point is that these O(n) operations happen when the world is stopped. This can be noticeable, even taking seconds for the largest heap sizes. It sure would be nice to have the benefits of GC, but with lower pause times.

Optimization: rotate mark bit

flip():
  marked ^= 1

collect():
  flip()
  mark()
  sweep()
  if freelist is empty: abort

unmark_one(ref):
  pass

Avoid touching mark bits for live data

Incidentally, before moving on, I should mention an optimization to mark bit representation: instead of clearing the mark bit for live objects during the sweep phase, we could just choose to flip our interpretation of what the mark bit means. This allows unmark_one to become a no-op.

Reducing pause time

Parallel tracing: parallelize mark. Clear improvement, but speedup depends on object graph shape (e.g. linked lists).

Concurrent tracing: mark while your program is running. Tricky, and not always a win (“Retrofitting Parallelism onto OCaml”, ICFP 2020).

Partial tracing: mark only a subgraph. Divide space into regions, record inter-region links, collect one region only. Overhead to keep track of inter-region edges.

Now, let's revisit the pause time question. What can we do about it? In general there are three strategies.

Generational GC

Partial tracing

Two spaces: nursery and oldgen

Allocations in nursery (usually)

Objects can be promoted/tenured from nursery to oldgen

Minor GC: just trace the nursery

Major GC: trace nursery and oldgen

“Objects tend to die young”

Overhead of old-to-new edges offset by less amortized time spent tracing

Today's talk is about partial tracing. The basic idea is that instead of tracing the whole graph, just trace a part of it, ideally a small part.

A simple and effective strategy for partitioning a heap into subgraphs is generational garbage collection. The idea is that objects tend to die young, and that therefore it can be profitable to focus attention on collecting objects that were allocated more recently. You therefore partition the heap graph into two parts, young and old, and you generally try to trace just the young generation.

The difficulty with partitioning the heap graph is that you need to maintain a set of inter-partition edges, and you do so by imposing overhead on the user program. But a generational partition minimizes this cost because you never do an only-old-generation collection, so you don't need to remember new-to-old edges, and mutations of old objects are less common than new.

Generational GC

Usual implementation: semispace nursery and mark-compact oldgen

Tenuring via evacuation from nursery to oldgen

Excellent locality in nursery

Very cheap allocation (bump-pointer)

But... evacuation requires all incoming edges to an object to be updated to new location

Requires precise enumeration of all edges

Usually the generational partition is reflected in the address space: there is a nursery and it is in these pages and an oldgen in these other pages, and never the twain shall meet. To tenure an object is to actually move it from the nursery to the old generation. But moving objects requires that the collector be able to enumerate all incoming edges to that object, and then to have the collector update them, which can be a bit of a hassle.

JavaScriptCore

No precise stack roots, neither in generated nor C++ code

Compare to V8’s Handle<> in C++, stack maps in generated code

Stack roots conservative: integers that happen to hold addresses of objects treated as object graph edges

(Cheaper implementation strategy, can eliminate some bugs)

Specifically in JavaScriptCore, the JavaScript engine of WebKit and the Safari browser, we have a problem. JavaScriptCore uses a technique known as "conservative root-finding": it just iterates over the words in a thread's stack to see if any of those words might reference an object on the heap. If they do, JSC conservatively assumes that it is indeed a reference, and keeps that object live.

Of course a given word on the stack could just be an integer which happens to be an object's address. In that case we would hold on to too much data, but that's not so terrible.
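
In pseudocode, the root-finding part amounts to something like the sketch below. Here heap_objects is a stand-in for a way of asking the heap whether a machine word is the address of an allocated object; a real implementation checks address ranges and block metadata rather than a dictionary, but the idea is the same.

def conservative_roots(stack_words, heap_objects):
    roots = []
    for word in stack_words:
        obj = heap_objects.get(word)
        if obj is not None:   # might be a real pointer, or an unlucky integer
            roots.append(obj)
    return roots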

Conservative root-finding is again one of those things that GC nerds like to make fun of, but the pendulum seems to be swinging back its way; perhaps another article on that some other day.

JavaScriptCore

Automatic memory management eliminates use-after-free...

...except when combined with manual memory management

Prevent type confusion due to reuse of memory for object of different shape

addrof/fakeobj primitives: phrack.org/issues/70/3.html

Type-segregated heaps

No evacuation: no generational GC?

The other thing about JSC is that it is constantly under attack by malicious web sites, and that any bug in it is a step towards hackers taking over your phone. Besides bugs inside JSC, there are bugs also in the objects exposed to JavaScript from the web UI. Although use-after-free bugs are impossible with a fully traceable object graph, references to and from DOM objects might not be traceable by the collector, instead referencing GC-managed objects by reference counting or weak references or even manual memory management. Bugs in these interfaces are a source of exploitable vulnerabilities.

In brief, there seems to be a decent case for trying to mitigate use-after-free bugs. Beyond the nuclear option of not freeing, one step we could take would be to avoid re-using memory between objects of different shapes. So you have a heap for objects with 3 fields, another for objects with 4 fields, and so on.

But it would seem that this mitigation is at least somewhat incompatible with the usual strategy of generational collection, where we use a semi-space nursery. The nursery memory gets re-used all the time for all kinds of objects. So does that rule out generational collection?

Sticky mark bit algorithm

collect(is_major=false):
  if is_major: flip()
  mark(is_major)
  sweep()
  if freelist is empty:
    if is_major: abort
    collect(true)

mark(is_major):
  worklist := []
  if not is_major:
    worklist += remembered_set
    remembered_set := []
  ...

Turns out, you can generationally partition a mark-sweep heap.

Recall that to visit each live object, you trace the heap, setting mark bits. To visit them all again, you have to clear the mark bit between traces. Our first collect implementation did so in sweep, via unmark_one; then with the optimization we switched to clear them all before the next trace in flip().

Here, then, the trick is that you just don't clear the mark bit between traces for a minor collection (tracing just the nursery). In that way all objects that were live at the previous collection are considered the old generation. Marking an object is tenuring, in-place.

There are just two tiny modifications to mark-sweep to implement sticky mark bit collection: one, flip the mark bit only on major collections; and two, include a remembered set in the roots for minor collections.

Sticky mark bit algorithm

Mark bit from previous trace “sticky”: avoid flip for minor collections

Consequence: old objects not traced, as they are already marked

Old-to-young edges: the “remembered set”

Write barrier

write_field(object, offset, value):
  remember(object)
  object[offset] = value

The remembered set is maintained by instrumenting each write that the program makes with a little call out to code from the garbage collector. This code is the write barrier, and here we use it to add to the set of objects that might reference new objects. There are many ways to implement this write barrier but that's a topic for another day.
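
Just to make that concrete, here is one simple way to implement it, in the same spirit as the pseudocode above: a global list plus a per-object flag for deduplication. This is only a sketch (it assumes each object carries a remembered flag and a fields array), not how JavaScriptCore's barrier actually works.

remembered_set = []

def remember(obj):
    # Log obj at most once per minor collection, using a per-object flag.
    # A production barrier would usually also filter, e.g. only remembering
    # old-generation objects, or only stores of young-generation values.
    if not obj.remembered:
        obj.remembered = True
        remembered_set.append(obj)

def write_field(obj, offset, value):
    remember(obj)
    obj.fields[offset] = value

def take_remembered_set():
    # Drained at the start of a minor collection; the drained objects become
    # extra roots, as in mark(is_major) above.
    global remembered_set
    drained, remembered_set = remembered_set, []
    for obj in drained:
        obj.remembered = False
    return drained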

JavaScriptCore

Parallel GC: Multiple collector threads

Concurrent GC: mark runs while JS program running; “riptide”; interaction with write barriers

Generational GC: in-place, non-moving GC generational via sticky mark bit algorithm

Alan Demers, “Combining generational and conservative garbage collection: framework and implementations”, POPL ’90

So returning to JavaScriptCore and the general techniques for reducing pause times, I can summarize to note that it does them all. It traces both in parallel and concurrently, and it tries to trace just newly-allocated objects using the sticky mark bit algorithm.

Conclusions

A little-used algorithm

Motivation for JSC: conservative roots

Original motivation: conservative roots; write barrier enforced by OS-level page protections

Revived in “Sticky Immix”

Better than nothing, not quite as good as semi-space nursery

I find that people that are interested in generational GC go straight for the semispace nursery. There are some advantages to that approach: allocation is generally cheaper in a semispace than in a mark space, locality among new objects is better, locality after tenuring is better, and you have better access locality during a nursery collection.

But if for some reason you find yourself unable to enumerate all roots, you can still take advantage of generational collection via the sticky mark-bit algorithm. It's a simple change that improves performance, as long as you are able to insert write barriers on all heap object mutations.

The challenge with a sticky-mark-bit approach to generations is avoiding the O(n) sweep phase. There are a few strategies, but more on that another day perhaps.

And with that, presentation done. Until next time, happy hacking!

by Andy Wingo at October 22, 2022 08:42 AM

October 19, 2022

Eric Meyer

A Dashing Navbar Solution


One of the many things Igalia does is maintain an official port of WebKit for embedded devices called WPE WebKit, and as you might expect, it has a web site.  The design had gotten a little stale since its launch a few years ago, so we asked Denis Radenković at 38one to come up with a new design, which we launched yesterday.  And I got to turn it into HTML and CSS!  Which was mostly normal stuff, margins and font sizing and all that, but also had some bits that called for some creativity.

There was one aspect of the design that I honestly thought wasn’t going to be possible, which was the way the “current page” link in the site navbar connected up to the rest of the page’s design via a dashed line. You can see it on the site, or in this animation about how each navlink was designed to appear.

Navigation link styles, not including the Home button, which is essentially the same but not nearly as obvious because of its specific layout.

I thought about using bog-standard dashed element borders: one on a filler element or pseudo-element that spanned the mostly-empty left side of the navbar, and one on each navlink before the current one, and then… I don’t know. I didn’t get that far, because I realized the dashed borders would almost certainly stutter, visually. What I mean is, the dashes wouldn’t follow a regular on-off pattern, which would look fairly broken.

My next thought was to figure out how to size a filler element or pseudo-element so that it was the exact width needed to reach the middle of the active navlink. Maybe there’s a clever way to do that with just HTML and CSS, but I couldn’t think of a way to do it that wasn’t either structurally nauseating or dependent on me writing clever JavaScript. So I tossed that idea, though I’ll return to it at the end of the post.

In the end, it was the interim development styles that eventually got me there. See, while building out other parts of the design, I just threw a dashed border on the bottom of the navbar, which (as is the way of borders) spanned its entire bottom edge. At some point I glanced up at it and thought, If only I could figure out a mask that would clip off the part of the border I don’t need. And slowly, I realized that with a little coding sleight-of-hand, I could almost exactly do that.

First, I removed the bottom border from the navbar and replaced it with a dashed linear gradient background, like this:

nav.global {
	background-image: linear-gradient(
		90deg,
		currentColor 25%,
		transparent 25% 75%,
		currentColor 75%
	);
	background-position: 0 100%;
	background-size: 8px 1px;
	background-repeat: repeat-x;
}

(Okay, I actually used the background shorthand form of the above — linear-gradient(…) 0 100% / 8px 1px repeat-x — but the result is the same.)

I thought about using repeating-linear-gradient, which would have let me skip having to declare a background-repeat, but would have required me to size the gradient’s color stops using length units instead of percentages. I liked how, with the above, I could set the color stops with percentages, and then experiment with the size of the image using background-size. (7px 1px? 9px 1px? 8px 1.33px? Tried ’em all, and then some.) But that’s mostly a personal preference, not based in any obvious performance or clarity win, so if you want to try this with a repeating gradient, go for it.

But wait. What was the point of recreating the border’s built-in dash effect with a gradient background? Well, as I said, it made it easier to experiment with different sizes. It also allowed me to create a dash pattern that would be very consistent across browsers, which border-style dashes very much are not.

But primarily, I wanted the dashes to be in the background of the navbar because I could then add small solid-color gradients to the backgrounds of the navlinks that come after the active one.

nav.global li.currentPage ~ li {
	background: linear-gradient(0deg, #FFF 2px, transparent 2px);
}

That selects all the following-sibling list items of the currentPage-classed list item, which here is the “Learn & Discover” link.

<ul class="about off">
	<li><a class="nav-link" href="…">Home</a></li>
	<li class="currentPage"><a class="nav-link" href="…">Learn &amp; Discover</a></li>
	<li><a class="nav-link" href="…">Blog</a></li>
	<li><a class="nav-link" href="…">Developers</a></li>
	<li><a class="btn cta" href="…">Get Started</a></li>
</ul>

Here’s the result, with the gradient set to be visible with a pinkish fill instead of #FFF so we can see it:

The “masking” gradients, here set to be a nicely rosy pink instead of the usual solid white.

You can see how the pink hides the dashes. In the actual styles, because the #FFF is the same as the design’s page background, those list items’ background gradients are placed over top of (and thus hide) the navbar’s background dash.

I should point out that the color stop on the white solid gradient could have been at 1px rather than 2px and still worked, but I decided to give myself a little bit of extra coverage, just for peace of mind. I should also point out that I didn’t just fill the backgrounds of the list items with background-color: #FFF because the navbar has a semitransparent white background fill and a blurring backdrop-filter, so the page content can be hazily visible through the navbar, as shown here.

The backdrop-blurring of content behind the navbar, not blurring nearly as much as I thought they should, but oh well.

The white line is slightly suboptimal in this situation, but it doesn’t really stand out and does add a tiny bit of visual accent to the navbar’s edge, so I was willing to go with it.

The next step was a little trickier: there needs to be a vertical dashed line “connecting” to the navbar’s dashed line at the horizontal center of the link, and also the navbar’s dashed line needs to be hidden, but only to the right of the vertical dashed line. I thought about doing two background gradients on the list item, one for the vertical dashed line and one for the white “mask”, but I realized that constraining the vertical dashed line to be half the height of the list item while also getting the repeating pattern correct was too much for my brain to figure out.

Instead, I positioned and sized a generated pseudo-element like this:

nav.global ul li.currentPage {
	position: relative;
}
nav.global ul li.currentPage::before {
	content: '';
	position: absolute;
	z-index: 1;
	top: 50%;
	bottom: 0;
	left: 50%;
	right: 0;
	background:
		linear-gradient(180deg,
			currentColor 25%,
			transparent 25% 75%, 
			currentColor 75%) 0 0 / 1px 8px repeat-y, 
		linear-gradient(0deg, #FFFF 2px, transparent 2px);
	background-size: 1px 0.5em, auto;
}

That has the generated pseudo-element fill the bottom right quadrant of the active link’s list item, with a vertical dashed linear gradient running along its left edge and a solid white two-pixel gradient along its bottom edge, with the solid white below the vertical dash. Here it is, with the background pinkishly filled in to be visible behind the two gradients.

That’s the ::before with its background filled in a rosy pink instead of the usual transparency.

With that handled, the last step was to add the dash across the bottom of the current-page link and then mask the vertical line with a background color, like so:

nav.global ul li.currentPage a {
	position: relative;
	z-index: 2;
	background: var(--dashH); /* converted the dashes to variables */
	background-size: 0.5em 1px;
	background-position: 50% 100%;
	background-color: #FFF;
}
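As for --dashH, it’s just the horizontal dash gradient stored in a custom property so it can be reused. A rough sketch of the shape it could take, with the stops mirroring the vertical gradient above rather than copied from the real stylesheet:

:root {
	/* hypothetical definition; the real stops live in the actual styles */
	--dashH: linear-gradient(90deg,
		currentColor 25%,
		transparent 25% 75%,
		currentColor 75%);
}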

And that was it. Now, any of the navbar links can be flagged as the current page (with a class of currentPage on the enclosing list item) and will automatically link up with the dashes across the bottom of the navbar, with the remainder of that navbar dash hidden by the various solid white gradient background images.

An animation of the link styles, only now you know how they work.

So, it’s kind of a hack, and I wish there were a cleaner way to do this. And maybe there is! I pondered setting up a fine-grained grid and adding an SVG with a dashed path, or maybe a filler <span>, to join the current link to the line. That also feels like a hack, but maybe less of one. Or maybe not!

What I believe I want are the capabilities promised by the Anchored Positioning proposal. I think I could have done something like:

nav.global {position: relative;}
nav.global .currentPage {anchor-name: --navLink;}
nav.global::before {
	position: absolute;
	left: 0;
	bottom: 0;
	right: var(--center);
	--center: anchor(--navLink 50%);
	top: anchor(--navLink bottom);
}

…and then used that to run the dashed line over from the left side of the page to underneath the midpoint of the current link, and then up to the bottom edge of that link. Which would have done away with the need for the li ~ li overlaid-background hack, and nearly all the other hackery. I mean, I enjoy hackery as much as the next codemonaut, but I’m happier when the hacks are more elegant and minimal.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at October 19, 2022 01:00 PM

October 18, 2022

Julie Kim

A <webview> tag in Chromium

I often use “webview” as a generic term. For instance, when I worked on the AGL project, we integrated a Web Runtime to provide a webview to AGL native applications, much like Android WebView. However, I wasn’t aware of the <webview> tag implementation for web pages in Chromium.

<webview>

<webview> is a tag that Chrome Apps and chrome://… pages can use.

Chromium has some samples using <webview> for testing. I launched one of these examples with a small modification to include a <div> at the same level as the <webview>.

The green box is the <webview> and the blue box is the <div>.
The web page includes a <webview> that renders another web page inside it. Note that the <webview> tag is not a web standard; it is only used in Chrome Apps and chrome:// pages.


WebContents is

A reusable component that is the main class of the Content module. It’s easily embeddable to allow multiprocess rendering of HTML into a view.
This image shows the conceptual application layers in Chromium.
https://www.chromium.org/developers/design-documents/displaying-a-web-page-in-chrome/
When you open a web page in Chromium, you can imagine Chromium creating a WebContents that shows you the rendered web page.


Current implementation

When a <webview> exists in a web page, how does it work, and where does it sit among the layers in the image above?

Based on the current implementation, a WebContents can have inner WebContents. Chromium implements the <webview> tag with a GuestView, which is the templated base class for out-of-process frames in the chrome layer. A GuestView maintains an association between a guest WebContents and an owner WebContents.

For the <webview> interface in JavaScript, Chromium has custom bindings. When a <webview> tag is attached, the API wired up by those custom bindings is called from the renderer, and the request to attach it is sent to the browser through a Mojo interface.


interface GuestViewHost {
  // We have a RenderFrame with routing id of |embedder_local_frame_routing_id|.
  // We want this local frame to be replaced with a remote frame that points
  // to a GuestView. This message will attach the local frame to the guest.
  // The GuestView is identified by its ID: |guest_instance_id|.
  AttachToEmbedderFrame(
      int32 embedder_local_frame_routing_id,
      int32 element_instance_id,
      int32 guest_instance_id,
      mojo_base.mojom.DictionaryValue params) => ();
};
It triggers WebContentsImpl::AttachInnerWebContents() to attach the inner WebContents to the outer WebContents.

Chromium also has another type of GuestView internally: MimeHandlerViewGuest, which handles MIME types loaded through tags such as <object> or <embed>. For instance, if an <embed> tag points to a PDF file, it is also managed by a GuestView.


MPArch

These days I’m working on MPArch, which
• enables one WebContents to host multiple (active / inactive, visible / invisible) pages (from the MPArch document).
For more detail on MPArch, you can refer to the blog post ‘MPArch (Multiple Page Architecture) project in Chromium‘.

Since a GuestView is implemented with a Multiple-WebContents model, it is also in scope for MPArch.
After working intensively on prerendering and fenced frames, our team started migrating GuestView tests in order to avoid direct access to WebContents and to manage things based on RenderFrameHost.
Once we make some progress on GuestView, I hope we can share the status as well.


We’re hiring. https://www.igalia.com/jobs/


by jkim at October 18, 2022 09:54 AM

October 13, 2022

Gyuyoung Kim

MPArch(Multiple Page Architecture) project in Chromium

This article was written with reference to the Multiple Page Architecture document.

Motivation

According to the Google Chromium team, some years ago it turned out that BFCache (Back/Forward Cache), Portals, Guest Views, and Prerender faced the same fundamental problem of having to support multiple pages in the same tab, but were implemented with different architecture models. BFCache and pending-deletion documents were implemented with a Single-WebContents model. On the other hand, Portals and Guest Views were implemented with a Multiple-WebContents model. After extensive discussions between the Chrome Security Architecture team, the BFCache team, the Blink Architecture team, the Portals team, and others, they concluded that BFCache and Portals should converge on a single shared architecture. This is crucial to minimize both the complexity of writing browser features on top of this architecture and the maintenance cost across the codebase.
Multiple Page Architecture [1]
In addition to BFCache and Portals, Prerendering, FencedFrames, GuestViews, and pending-deletion frames were facing the same problem: how to support multiple pages in a given tab. The table below shows each feature’s requirements.
Multiple Page Architecture [1]
As you can see in the table, if the Chrome team didn’t invest in this architecture, the six use cases would have to solve the same problems individually. Browser features would then have to deal with multiple incoherent cases, which was expected to greatly increase the complexity across the codebase.

Goal

The goal of Multiple Page Architecture is to enable projects including BFCache, Portals, Prerendering, and FencedFrames while minimizing the complexity exposed to browser features. Specifically, Multiple Page Architecture aims to
  • Enable one WebContents to host multiple (active/inactive, visible/invisible) pages
  • Allow navigating a WebContents to an existing page
Multiple Page Architecture [1]
For instance, when a page moves to BFCache, the ownership of the page is transferred from the main frame tree to BFCache. When a portal is activated, the ownership of the page is transferred to the main frame tree. This requires clearly splitting the tab and page / main-document concepts. This is closely related to and greatly benefits from Site Isolation’s effort to split page and document concepts.

Design Overview

Let’s take a brief look at the high-level design. To apply MPArch to Chromium, we have to change both core navigation and existing features. First, we need to make core navigation provide the ability to host multiple pages in the same WebContents, allow one page to embed another, and allow a WebContents to navigate to an existing page.

In particular, this includes,
  • Allow multiple FrameTrees to be created in the given WebContents in addition to the existing main one
  • Add a new navigation type, which navigates a given FrameTree to an existing page. (For example, BFCache will use it for history navigation restoring the page from the cache. Prerendering will use it to instantly navigate to a prerendered page.)
  • Modify RenderFrameHost to avoid knowing its current FrameTreeNode/FrameTree to allow the main RenderFrameHost to be transferred between different FrameTreeNodes.
We also need to fix existing features that rely on now-invalid assumptions, such as one tab only having one page at any given moment. The MPArch team is evolving the //content/public APIs to simplify these fixes and to provide guidance and documentation to browser feature developers.

Igalia Participation

As we’ve seen so far, the MPArch project is a significant Chromium architecture change that requires a lot of modification across Chromium. The Igalia team began participating in the MPArch project with the prerendering task in April 2021. We mainly worked on the prerendering content restrictions as well as the long tail of fixes for MPArch-related features. After finishing the prerendering work and those long-tail fixes, we started working on FencedFrames, another target of the MPArch project, in September 2021. The Igalia team has merged over 550 patches for MPArch since April 2021.

References

[1] Multiple Page Architecture
[2] Multi Page Architecture (BlinkOn 13)
[3] Prerendering
[4] Prerendering contents restriction
[5] Fixing long tail of features for MPArch

by gyuyoung at October 13, 2022 12:13 AM