Planet Igalia WebKit

January 27, 2021

Manuel Rego

:focus-visible in WebKit - January 2021

Let’s do a small introduction as this is a kind of special post.

As you might already know, last summer Igalia launched the Open Prioritization experiment, and :focus-visible in WebKit was the winner according to the pledges that the different projects got. Now it has moved into collecting funds stage, so far we’ve reached 80% of the goal and Igalia has already started to work on this. If you are interested and want to help sponsoring this work, please visit the project page at Open Collective.

In our regular client projects in Igalia, we provide periodic progress reports about the status of tasks and next plans. This blog post is a kind of monthly report, but this time the project has many customers, so it looks like this is a better format to share information about the status of things. Thank you all for supporting us in this development! 🙏

Understanding :focus-visible

Disclaimer: This is not a blog post explaining how :focus-visible works or the implications it has, you can read other articles if you’re looking for that.

First things first, my initial thoughts were that :focus-visible was a pseduo-class which would match an element when the browser natively shows a focus indicator (focus ring, outline) when an element of a page is focused. And that’s more or less what the spec says on the first sentence:

The :focus-visible pseudo-class applies while an element matches the :focus pseudo-class and the user agent determines via heuristics that the focus should be made evident on the element.

They key part here is that native behavior doesn’t show the focus indicator on purpose in some situations when the :focus pseudo-class matches, mainly because usability studies indicate that showing it in all the cases is not what the user expects and wants. Before having :focus-visible the web authors have not way to access the same criteria to style the focus indicator only when it’s going to be shown natively, and still keep the website being accessible.

Apart from that the spec has a set of heuristics that despite being non-normative, it looks like all implementations are following them. Summarizing them briefly they’d be something like:

  • If you use the mouse to focus an element it won’t match :focus-visible.
  • If you use the keyboard it’ll match :focus-visible.
  • Elements that support keyboard input (like <input> or contenteditable) always match :focus-visible.
  • When a script focuses a new element it’ll match or not :focus-visible depending on the previous active element.

This is just a quick & dirty summary, please read the spec for all the details. There have been years of research around these topics (how focus should work or not on the different use cases, what are the users and accessibility needs, how websites are managing focus, etc.) and these heuristics are somehow the result of all that work.

:focus-visible in the default UA style sheet

At this point it looks like we can more or less understand what :focus-visible is about. So let’s start playing with it. The definition seems very clear, but testing things in the current implementations (Chromium and Firefox) you might find some unexpected situations.

Let’s use a very basic example:

<style>
  :focus-visible { background: lime; }
</style>
<div tabindex="0">Focus me.</div>
#example1:focus-visible { background: lime; } #warning1 { display: none; color: red; font-size: 75%; font-style: italic; } @supports not (selector(:focus-visible)) { #warning1 { display: block; } }
WARNING: :focus-visible is not supported on your browser.
Focus me.

If you focus the <div> with a mouse click, :focus-visible doesn’t match per spec, so in this case the background doesn’t become green (if you use the keyboard to focus it will match :focus-visible and the background would be green). This works the same in Chromium and Firefox, but Chromium (despite the element doesn’t match :focus-visible) shows a focus indicator. Somehow the first spec definition is already not working as expected on Chromium… The issue here is that Chromium still uses :focus { outline: auto; } in the default UA style sheet, and the element matches :focus after the mouse click, that’s why it’s showing a focus indicator while not matching :focus-visible.

Actually this was already on the spec, but Chromium is not following that yet:

User agents should also use :focus-visible to specify the default focus style, so that authors using :focus-visible will not also need to disable the default :focus style.

There was already a related CSSWG issue on the topic, as the spec currently suggests the following code:

:focus:not(:focus-visible) {
  outline: 0;
}

This works as a kind of workaround for this issue, but if the default UA style sheet uses :focus-visible that won’t be needed.

Anyway, I’ve reported the Chromium bug and created WPT tests, during the tests review Emilio Cobos realized that this needed a change on the HTML spec and he wrote a PR to update it. After some discussion with Alice Boxhall the HTML change and the tests were approved and merged. I even was brave enough to write a patch to change this in Chromium which is still under review and needs some extra work.

The tests

WebKit is the third browser engine adding support for :focus-visible so the first natural step was to import the WPT tests related to the feature. This looked like something simple but it ended up needing some work to improve the tests.

There was a bunch of :focus-visible tests already in the WPT suite, but they need some love:

  • Some parts of the spec were not covered by tests, so I added some new tests.
  • Some tests were passing in WebKit, even when there was not an implementation yet, so I modified them to fail if there’s no :focus-visible support.

Then I imported the tests in WebKit and I discovered a bug related to focus event and :focus pseudo-class. :focus pseudo-class was not matching inside the focus event handler. This is probably not important for web authors, but :focus-visbile tests were relying on that. Actually this had been fixed in Chromium more than 5 years ago, so first I moved to WPT the Chromium internal test and used it to fix the problem in WebKit.

Once the tests were imported in WebKit, the problem was that a bunch of them were timing out in Mac platforms. After investigating the issue I realized that it’s because focus event is not dispatched when you click on a button in Mac. Digging deeper I found this old bug with lots of discussions on the topic, it looks like this is done to keep alignment with the native platform behavior, and also in order to avoid showing a focus indicator. Even Firefox has the same behavior on Mac. However, Chromium always dispatch the event independently of the platform. This makes that some of the tests don’t work automatically on Mac, as they wait for a focus event that is never dispatched. Anyway maybe once :focus-visible is implemented, it could be rediscussed the possibility of modifying this behavior, thought it might be not possible anyway. In any case, WebKitGTK port, the one I’m using for the development, does trigger the focus event in this case; and I’ve also changed WPE port to do the same (maybe Windows port will follow too).

One more thing about the tests, lots of these :focus-visible tests use testdriver.js to simulate user actions. For example for clicking an element they use test_driver.click(element), however that simple instruction is causing some kind of memory leak on Chromium when running the tests. The actual Chromium bug hasn’t been fixed yet, but I landed some workarounds that prevent the issue in these tests (waiting for the promise to be resolved before marking the test as done).

Status of WPT :focus-visible tests in the different browsers Status of WPT :focus-visible tests in the different browsers

To close the tests part, you can check the status in wpt.fyi, most of them are passing in all implementations which is great, but there are some interoperability issues that we’ll review next.

Interop issues

As I mentioned the wpt.fyi website helps to easily identify the interop issues between the different implementations.

  • :focus-visible on the default UA style sheet: This has been already commented before, but this is the reason why Chromium fails focus-visible-018.html test. Firefox fails focus-visible-017.html because the default UA style sheet mentions outline: auto, but Firefox uses a dotted outline.

  • :focus-visible on <select> element: There’s a Firefox failure on focus-visible-002.html because it doesn’t match :focus-visible when you click a <select> element. I opened a CSSWG issue to discuss this, and I initially thought that the agreement was that Firefox behavior is the right one. So I did a patch to change Chromium’s behavior and update the tests, but during the review I was pointed to a Chromium bug about this topic that was closed as WONTFIX, the reason is that when you click a <select> element you can type letters to select the option from the keyboard. Right now the discussion has been reopened and we’ll need to wait for the final resolution on the topic, to see which is the right implementation.

  • Keyboard interaction once an element is focused: This is tested by focus-visible-007.html. The example here is that you click an element to focus it, initially the element doesn’t match :focus-visible but then you use the keyboard (for example you type a letter), in that situation Chromium will start matching :focus-visible while Firefox won’t. The spec is quite explicit on the topic so it looks like a Firefox bug:

    If the user interacts with the page via the keyboard, the currently focused element should match :focus-visible (i.e. keyboard usage may change whether this pseudo-class matches even if it doesn’t affect :focus).

  • Programmatic focus and :focus-visible: What should happen with :focus-visible when the website uses element.focus() from a script to move the focus? The spec has some heuristics that depend on if the active element before focus() is called was matching (or not) :focus-visible. But I’ve opened a CSSWG issue to discuss what should happen when there’s no active element. The discussion is still ongoing and depending on that there might be changes in the current implementations. Right now there are some subtle differences between Chromium and Firefox here.

:-webkit-direct-focus?

Probably you don’t know what’s that, but it’s somehow related to :focus-visible so I believe it’s worth to mention it here.

WebKit is the browser that supports better :focus pseudo-class behavior on Shadow DOM (see the WPT tests results). The issue here is that the ShadowRoot should match :focus if some of the descendants are focused, so if you have an <input> element in the Shadow Tree, and you focus it, you’ll have 2 elements matching :focus the ShadowRoot and the <input>.

<style>
  #host {  padding: 1em; background: lightgrey; }
  #host:focus { background: lime; }
</style>
<div id="host"></div>
<script>
  shadowRoot = host.attachShadow(
    {mode: 'open', delegatesFocus: true});
  shadowRoot.innerHTML =
    '<input value="Focus me">';
</script>
#example2host { padding: 1em; background: lightgrey; } #example2host:focus { background: lime; }

In Chromium if you use delegatesFocus=true in element.attachShadow(), and you have an example like the one described above, you’ll get two focus indicators, one in the ShadowRoot and one in the <input>. Firefox doesn’t match :focus in the ShadowRoot so the issue is not present there.

WebKit matches :focus independently of delegatesFocus value (which is the right behavior per spec), so it’d be even more common to have a situation of getting two focus indicators. To avoid that WebKit introduced :-webkit-direct-focus pseudo-class, that is not web exposed, but it’s used in the default UA style sheet to avoid this bad effect of having a focus indicator on the ShadowRoot.

I believe :focus-visible spec should describe that behavior regarding how it works on ShadowRoot so it doesn’t match on those situations. That way WebKit could get rid of :-webkit-direct-focus and use :focus-visible once it’s implemented. I’ve reported a CSSWG issue to discuss this topic.

WIP implementation

So far I haven’t talked about the implementation at all, but the reason is that all the previous work is required in order to be able to do a proper implementation, with good quality and that is interoperable between the different browsers. :focus-visbile is a new feature, and despite all the interop mess regarding how focus works in the different browsers and platforms, we should aim to have a :focus-visible implementation as much interoperable as possible.

Despite all this related work, I’ve also found some time to work on a patch. It’s still not ready to be sent upstream but it’s already doing some things and passing some of the WPT tests. Of course several things are still missing, but next you can see quick screen recording with :focus-visible working on WebKit.

:focus-visible example running on WebKitGTK MiniBrowser

Some numbers

I know this is not really relevant, but it helps to get a grasp on what has been happening during this month:

  • 3 CSSWG issues reported.
  • 13 PRs merged in WPT.
  • 5 patches landed in WebKit.
  • 4 patches landed in Chromium.
  • And many discussions with different people, special thanks to Alice and Emilio that have been really helpful.

Next steps

The plan for February is to try to find an agreement on the CSSWG issues, close them, and update the WPT tests accordingly. Maybe this work could include even landing some patches on the current implementations. And of course, focus (pun intended) the effort on implementation of :focus-visible in WebKit.

I hope this blog post helps you to understand better the work that goes behind the scenes when a web platform feature is implemented, especially if you want to do it on a way that ensures browser interoperability and reduces web authors’ pain.

If you enjoyed this project update, stay tuned as there will more in the future.

January 27, 2021 11:00 PM

January 20, 2021

Sergio Villar

Flexbox Cats (a.k.a fixing images in flexbox)

In my previous post I discussed my most recent contributions to flexbox code in WebKit mainly targeted at reducing the number of interoperability issues among the most popular browsers. The ultimate goal was of course to make the life of web developers easier. It got quite some attention (I loved Alan Stearns’ description of the post) so I decided to write another one, this time focused in the changes I recently landed in WebKit (Safari’s engine) to improve the handling of elements with aspect ratio inside flexbox, a.k.a make images work inside flexbox. Some of them have been already released in the Safari 118 Tech Preview so it’s now possible to help test them and provide early feedback.

(BTW if you wonder about the blog post title I couldn’t resist the temptation of writing “Flexbox Cats” which sounded really great after the previous “Flexbox Gaps”. After all, image support was added to the Web just to post pictures of 🐱, wasn’t it?)

Same as I did before, I think it’d be useful to review some of the more relevant changes with examples so you could have any of those so inspiring a-ha moments when you realize that the issue you just couldn’t figure out was actually a problem in the implementation.

What was done

Images as flex items in column flows

Web engines are in charge of taking an element tree, and accompanying CSS and creating a box tree from this. All of this relies on Formatting Contexts. Each formatting context has specific ideas about how layout behaves. Both flex and grid, for example, created new, interesting formatting contexts which allow them to size their children by shrinking and or stretching them. But how all this works can vary. While there is “general” box code that is consulted by each formatting text, there are also special cases which require specialized overrides. Replaced elements (images, for example), should work a little differently in flex and grid containers. Consider this:

.flexbox {
    display: flex;
    flex-direction: column;
    height: 500px;
    justify-content: flex-start;
    align-items: flex-start;
}

.flexbox > * {
    flex: 1;
    min-width: 0;
    min-height: 0;
}


<div class="flexbox">
      <img src="cat1.jpg>
</div>

Ideally, the aspect ratio of the replaced element (the image, in the example) would be preserved as the flex context calculated its size in the relevant direction (column is the block direction/vertical in western writing modes, for example)…. But in WebKit, they weren’t. They are now.

Black and white cat by pixabay

Images as flex items in row flows

This second issue is kind of the specular twin of the previous one. The same issue that existed for block sizes was also there for inline sizes. Overriding inline sizes were not used to compute block sizes of items with aspect ratio (again the intrinsic inline size was used) and thus the aspect ratio of the image (replaced elements in general) was not preserved at all. Some examples of this issue:

.flexbox {
  display: flex;
  flex-direction: row;
  width: 500px;
  justify-content: flex-start;
  align-items: flex-start;
}
.flexbox > * {
  flex: 1;
  min-width: 0;
  min-height: 0;
}


<div class="flexbox">
    <img src="cat2.jpg">
</div>

Gray Cat by Gabriel Criçan

Images as flex items in auto-height flex containers

The two fixes above allowed us to “easily” fix this one because we can now rely on the computations done by the replaced elements code to compute sizes for items with aspect ratio even if they’re inside special formatting contexts as grid or flex. This fix was precisely about delegating that computation to the replaced elements code instead of duplicating all the aspect-ratio machinery in the flexbox code. This fix has apparently the potential to be a game changer:
This is a key bug to fix so that Authors can use Flexbox as intended. At the moment, no one can use Flexbox in a DOM structure where images are flex children.
Jen Simmons in bug 209983 Also don’t miss the opportunity to check this visually appealing demo by Jen which should work as expected now. For those of you not having a WebKit based browser I’ve recorded a screencast for you to compare (all circles should be round).
Left: old WebKit. Right: new WebKit (tested using WebKitGtk)
Apart from the screen cast, I’m also showcasing the issue with some actual code.

.flexbox {
    width: 500px;
    display: flex;
}
.flexbox > * {
    min-width: 0;
}


<div class="flexbox">  
  <img style="flex: auto;" src="cat3.jpg">
</div>

Tabby Cat by Bekka Mongeau

Flexbox additional cases for definite sizes

This was likely the trickiest one. I remember having nightmares with all the definite/indefinite stuff back then when I was implementing grid layout with other Igalia colleages. The whole thing about definite/indefinite sizes although sensible and relatively easy to understand is actually a huge challenge for web engines which were not really designed with them in mind. Laying out web content traditionally means taking a width as input to produce a height as output. However formatting contexts like grid or flex make the whole picture much more complicated.
This particular issue was not a malfunction but something that was not implemented. Essentially the flex specs define some cases where indefinite sizes should be considered as definite although the general rule considers them indefinite. For example, if a single-line flex container has a definite cross size we could assume that flex items have a definite size in the cross axis which is indeed equal to the flex container inner cross size.
In the following example the flex item, the image, has height:auto (by default) which is an indefinite size. However the flex container has a definite height (a fixed 300px). This means that when laying out the image, we could assume that its height is definite and equal to the height of the container. Having a definite height then allows you to properly compute the width using an aspect ratio.

.flexbox {
    display: flex;
    width: 0;
    height: 300px;
}


<div class="flexbox">
  <img src="cat4.png">
</div>

White and Black Cat With Blue Eyes by Thomas Svensson

Aspect ratio computations and box-sizing

Very common overlook in layout code. When dealing with layout bugs we (browser engineers) usually forget about box-sizing because the standard box model is the truth and the whole truth and the sole truth in our minds. Jokes aside, in this case the aspect ratio was applied to the border box (content + border + padding) instead of to the content box as it should. The result were distorted images because border and padding where altering the aspect ratio computations.

.flexbox {
  display: flex;
}
.flexbox > * {
  border-top: 150px solid blue;
  border-left: 30px solid orange;
  height: 300px;
  box-sizing: border-box;
}


<div class=flexbox>
  <img src="cat5.png"/>
</div>

Grayscale Photo of Long Fur Cat by Skyler Ewin

Conclusions

I mentioned this in the previous post but I’ll do it again here, having the web platform test suite has been an an absolute game changer for web browser engineers. They have helped us in many ways, from easily allowing us to verify our implementations to acting as a safety net against potential regressions we might add while fixing issues in the engines. We no longer have to manually test stuff in different browsers to check how other developers have interpreted the specs. We now have the test, period.
In this case, I’ve been using them in a different way. They have served me both as a guide, directing my efforts to reduce the flexbox interoperability issues and also as a nice metric to measure the progress of the task. Talking about metrics, this work made WebKit based browsers pass an additional 64 test cases from the WPT test suite, a very nice step forward for interoperability.
I’m attaching a screenshot with the current status of images as flex items from the WPT point of view. Each html file on the left column is a test, and each test performs multiple checks. For example the image-as-flexitem-* ones run 19 different checks (use cases) each. Each column show how many tests each browser successfully run. A quarter ago Safari’s (WebKit’s) figures for most of them were 11/19, 13/19 but now the last Tech Preview it’s passing all of them. Not bad huh?
image-as-flexitem-* flexbox tests in WPT as of 2021/01/20

Acknowledgements

Again many thanks to the different awesome folks at Apple, Google and my beloved Igalia that helped me with very insightful reviews and strong support at all levels.
Also I am thankful to all the photographers from whom I borrowed their nice cat pictures (including the Brown and Black Cat on top by pixabay).

by svillar at January 20, 2021 09:45 AM

December 22, 2020

Manuel Rego

2020 Recap

2020 is not a great year to do any kind of recap, but there have been some positive things happening in Igalia during this year. Next you can find a highlight of some of these things in no particular order.

CSS Working Group A Coruña F2F

The year couldn’t start better, on January Igalia hosted a CSS Working Group face-to-face meeting in our office in A Coruña (Galicia, Spain). Igalia has experience arranging other events in our office, but this was the first time that the CSSWG came here. It was an amazing week and I believe everyone enjoined the visit to this corner of the world. 🌍

Brian Kardell from Igalia was talking to everybody about Container Queries. This is one of the features that web authors have been asking for since ever, and Brian was trying to push the topic forward and find some kind of solution (even if not 100% feature complete) for this topic. In that week there were discussions about the relationship with other topics like Resize Observer or CSS Containment, and new ideas appeared too. Brian posted a blog post after the event, explaining some of those ideas. Later my colleague Javi Fernández worked on an experiment that Brian mentioned on a recent post. The good news is that all these conversations managed to bring this topic back to life, and past November Google announced that they have started working on a Container Queries prototype in Chromium.

During the meeting Jen Simmons (in Mozilla at that time, now in Apple) presented some topics from Mozilla, including a detailed proposal for Masonry Layout based on Grid, this has been something authors have also showed interest, and Firefox has already a prototype implementation behind a runtime flag.

Apart from the three days full of meetings and interesting discussions, some of the CSSWG members participated in a local meetup giving 4 nice talks:

Finally, I remember some corridor conversations about the Mozilla layoffs that had just happened just a few days before the event, but nobody could expect what was going to happen during the summer. It looks like 2020 has been a bad year for Mozilla in general and Servo in particular. 😢

Open Prioritization

This summer Igalia launched the Open Prioritization campaign, where we proposed a list of topics to be implemented on the different browser engines, and people supported them with different pledges; I wrote a blog post about it by that time.

Open Prioritization: :focus-visible in Safari/WebKit: $30.8K pledged out of $35K. Open Prioritization: :focus-visible in Safari/WebKit

This was a cool experiment, and it looks like a successful one, as :focus-visible in WebKit/Safari has been the winner. Igalia is currently collecting funds through Open Collective in order to start the implementation of :focus-visible in WebKit, you still have time to support it if you’re interested. If everything goes fine this should happen during the first quarter of 2021. 🚀

Igalia Chats

This actually started in later 2019, but it has been ongoing during the whole 2020. Brian Kardell has been recording a podcast series about the web platform and some of its features with different people from the industry. They have been getting more and popular, and Brian was even asked to record one of these for the last BlinkOn edition.

So far 8 episodes of around 1 hour length have been published, with 13 different guests. More to come in 2021! If you are curious and want to know more, you can find them at Igalia website or in your favourite podcasting platform.

Igalia contributions

This is not a comprehensive list but just some highlights of what Igalia has been doing in 2020 around CSS:

We’re working on a demo about these features, that we’ll be publishing next year.

In February Chromium published the requirements to become API owner. Due to my involvement on the Blink project since the fork from WebKit back in 2013, I was nominated and became Blink API Owner past March. 🥳

Yoav Weiss on the BlinkOn 13 Keynote announcing me as API owner Yoav Weiss on the BlinkOn 13 Keynote announcing me as API owner

The API owners met on a weekly basis to review the intent threads and discuss about them, it’s an amazing learning experience to be part of this group. In my case when reviewing intents I usually pay attention to things related to interoperability, like the status of the spec, test suites and other implementations. In addition, I have the support from all my awesome colleagues at Igalia that help me to play this role, thank you all!

2021 and beyond…

Igalia keeps growing and a bunch of amazing folks will join us soon, particularly Delan Azabani and Felipe Erias are already starting these days as part of the Web Platform team.

Open Prioritization should have the first successful project, as :focus-visible is advancing funding and it gets implemented in WebKit. We hope this can lead to new similar experiments in the future.

And I’m sure many other cool things will happen at Igalia next year, stay tuned!

December 22, 2020 11:00 PM

November 29, 2020

Philippe Normand

Catching up on WebKit GStreamer WebAudio backends maintenance

Over the past few months the WebKit development team has been working on modernizing support for the WebAudio specification. This post highlights some of the changes that were recently merged, focusing on the GStreamer ports.

My fellow WebKit colleague, Chris Dumez, has been very active lately, updating the WebAudio implementation …

by Philippe Normand at November 29, 2020 12:45 PM

November 26, 2020

Víctor Jáquez

Notes on using Emacs (LSP/ccls) for WebKit

I used to regard myself as an austere programmer in terms of tooling: Emacs —with a plain configuration— and grep. This approach forces you to understand all the elements involved in a project.

Some time ago I have to code in Rust, so I needed to learn the language as fast as possible. I looked for packages in MELPA that could help me to be productive quickly. Obviously, I installed rust-mode, but I also found racer for auto-completion. I tried it out. It was messy to setup and unstable, but it helped me to code while learning. When I felt comfortable with the base code, I uninstalled it.

This year I returned to work on WebKit. The last time I contributed to it was around five years ago, but now in a different area (still in the multimedia stack). WebKit is huge, and because of C++, I found gtags rather limited. Out of curiosity I looked for something similar to racer but for C++. And I spent a while digging on it.

The solution consists in the integration of three MELPA packages:

  • lsp-mode: a client for Language Server Protocol for Emacs.
  • company-mode: a text completion framework.
  • ccls: A C/C++ language server. Besides emacs-ccls adds more functionality to lsp-mode.

(I known, there’s a simpler alternative to lsp-mode, but I haven’t tried it yet).

First we might explain what’s LSP. It stands for Language Server Protocol, defined with JSON-RPC messages, between the editor and the language server. It was orginally developed by Microsoft for Visual Studio, which purpose is to support auto-completion, finding symbol’s definition, to show early error markers, etc., inside the editor. Therefore, lsp-mode is an Emacs mode that communicates with different language servers in LSP and operates in Emacs accordingly.

In order to support the auto-completion use-case lsp-mode uses the company-mode. This Emacs mode is capable to create a floating context menu where the editing cursor is placed.

The third part of the puzzle is, of course, the language server. There’s a language servers for different programming languages. For C & C++ there are two servers: clangd and ccls. The former uses Clang compiler, the last can use either Clang, GCC or MSVC. Along this text ccls will be used for reasons exposed later. In between, emacs-ccls leverages and extends the support of ccls in lsp-mode, though it’s not mandatory.

In short, the basic .emacs configuration, using use-package, would have these lines:


(use-package company
  :diminish
  :config (global-company-mode 1))

(use-package lsp-mode
  :diminish "L"
  :init (setq lsp-keymap-prefix "C-l"
              lsp-enable-file-watchers nil
              lsp-enable-on-type-formatting nil
              lsp-enable-snippet nil)
  :hook (c-mode-common . lsp-deferred)
  :commands (lsp lsp-deferred))

(use-package ccls
  :init (setq ccls-sem-highlight-method 'font-lock)
  :hook ((c-mode c++-mode objc-mode) . (lambda () (require 'ccls) (lsp-deferred))))

The snippet first configures company-mode. It is enabled globally because, normally, it is a nice feature to have, even in non-coding buffers, such as this very one, for writing a blog post in markdown format. Diminish mode hides or abbreviates the mode description in the Emacs’ mode line.

Later comes lsp-mode. It’s big and aims to do a lot of things, basically we have to tell it to disable certain features, such as file watcher, something not viable in massive projects as WebKit; as I don’t use snippet (generic text templates), I also disable it; and finally, lsp-mode tries to format the code at typing, I don’t know how the code style is figured out, but in my experience, it’s always detected wrong, so I disabled it too. Finally, lsp-mode is launched when a text uses the c-mode-common, shared by c++-mode too. lsp-mode is launched deferred, meaning it’ll startup until the buffer is visible; this is important since we might want to delay ccls session creation until the buffer’s .dir-locals.el file is processed, where it is configured for the specific project.

And lastly, ccls-mode configuration, hooked until c-mode or c++-mode are loaded up in a deferred fashion (already explained).

It’s important to understand how ccls works in order to integrate it in our workflow of a specific project, since it might need to be configured using Emacs’ per-directory local variales.

We are living in a post-Makefile world (almost), proof of that is ccls, which instead of a makefile, it uses a compilation database, a record of the compile options used to build the files in a project. It’s commonly described in JSON and it’s generated automatically by build systems such as meson or cmake, and later consumed by ninja or ccls to execute the compilation. Bear in mind that ccls uses a cache, which can eat a couple gigabytes of disk.

Now, let’s review the concrete details of using these features with WebKit. Let me assume that WebKit local repository is cloned in ~/WebKit.

As you may know, the cool way to compile WebKit is with flatpak. Flatpak adds an indirection in the compilation process, since it’s done in an isolated environment, above the native system. As a consequence, ccls has to be the one inside the Flatpak environment. In ~/.local/bin/webkit-ccls:

#!/bin/sh
set -eu
cd $HOME/WebKit/
exec Tools/Scripts/webkit-flatpak -c ccls "$@"

Basically the scripts calls ccls inside flatpak, which is available in the SDK. And this is why ccls instead of clang, since clang is not provided.

By default ccls assumes the compilation database is in the project’s root directory, but in our case, it’s not, thus it is required to configure the database directory for our WebKit setup. For it, as we already said, a .dir-locals.el file is used.


((c-mode
  (indent-tabs-mode . nil)
  (c-basic-offset . 4))
 (c++-mode
  (indent-tabs-mode . nil)
  (c-basic-offset . 4))
 (java-mode
  (indent-tabs-mode . nil)
  (c-basic-offset . 4))
 (change-log-mode
  (indent-tabs-mode . nil))
 (nil
  (fill-column . 100)
  (ccls-executable . "/home/vjaquez/.local/bin/webkit-ccls")
  (ccls-initialization-options . (:compilationDatabaseDirectory "/app/webkit/WebKitBuild/Release"
                                  :cache (:directory ".ccls-cache")))
  (compile-command . "build-webkit --gtk --debug")))

As you can notice, ccls-execute is defined here, though it’s not a safe local variable. Also the ccls-initialization-options, which is a safe local variable. It is important to notice that the compilation database directory is a path inside flatpak, and always use the Release path. I don’t understand why, but Debug path didn’t work for me. This mean that WebKit should be compiled as Release frequently, even if we only use Debug type for coding (as you may see in my compile-command).

Update: Now we can explain why it’s important to configure lsp-mode as deferred: to avoid connections to ccls before processing the .dir-locals.el file.

And that’s all. Now I have early programming errors detection, auto-completion, and so on. I hope you find these notes helpful.

Update: Sadly, because of flatpak indirection, symbols’ definition finding won’t work because the file paths stored in ccls cache are relative to flatpak’s file system. For that I still rely on global and its Emacs mode.

by vjaquez at November 26, 2020 04:20 PM

November 20, 2020

Paulo Matos

A tour of the for..of implementation for 32bits JSC

We look at the implementation of the for-of intrinsic in 32bit JSC (JavaScriptCore).

More…

by Paulo Matos at November 20, 2020 02:00 PM

October 29, 2020

Claudio Saavedra

Thu 2020/Oct/29

In this line of work, we all stumble at least once upon a problem that turns out to be extremely elusive and very tricky to narrow down and solve. If we&aposre lucky, we might have everything at our disposal to diagnose the problem but sometimes that&aposs not the case – and in embedded development it&aposs often not the case. Add to the mix proprietary drivers, lack of debugging symbols, a bug that&aposs very hard to reproduce under a controlled environment, and weeks in partial confinement due to a pandemic and what you have is better described as a very long lucid nightmare. Thankfully, even the worst of nightmares end when morning comes, even if sometimes morning might be several days away. And when the fix to the problem is in an inimaginable place, the story is definitely one worth telling.

The problem

It all started with one of Igalia&aposs customers deploying a WPE WebKit-based browser in their embedded devices. Their CI infrastructure had detected a problem caused when the browser was tasked with creating a new webview (in layman terms, you can imagine that to be the same as opening a new tab in your browser). Occasionally, this view would never load, causing ongoing tests to fail. For some reason, the test failure had a reproducibility of ~75% in the CI environment, but during manual testing it would occur with less than a 1% of probability. For reasons that are beyond the scope of this post, the CI infrastructure was not reachable in a way that would allow to have access to running processes in order to diagnose the problem more easily. So with only logs at hand and less than a 1/100 chances of reproducing the bug myself, I set to debug this problem locally.

Diagnosis

The first that became evident was that, whenever this bug would occur, the WebKit feature known as web extension (an application-specific loadable module that is used to allow the program to have access to the internals of a web page, as well to enable customizable communication with the process where the page contents are loaded – the web process) wouldn&apost work. The browser would be forever waiting that the web extension loads, and since that wouldn&apost happen, the expected page wouldn&apost load. The first place to look into then is the web process and to try to understand what is preventing the web extension from loading. Enter here, our good friend GDB, with less than spectacular results thanks to stripped libraries.

#0  0x7500ab9c in poll () from target:/lib/libc.so.6
#1  0x73c08c0c in ?? () from target:/usr/lib/libEGL.so.1
#2  0x73c08d2c in ?? () from target:/usr/lib/libEGL.so.1
#3  0x73c08e0c in ?? () from target:/usr/lib/libEGL.so.1
#4  0x73bold6a8 in ?? () from target:/usr/lib/libEGL.so.1
#5  0x75f84208 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#6  0x75fa0b7e in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#7  0x7561eda2 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#8  0x755a176a in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#9  0x753cd842 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#10 0x75451660 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#11 0x75452882 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#12 0x75452fa8 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#13 0x76b1de62 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#14 0x76b5a970 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#15 0x74bee44c in g_main_context_dispatch () from target:/usr/lib/libglib-2.0.so.0
#16 0x74bee808 in ?? () from target:/usr/lib/libglib-2.0.so.0
#17 0x74beeba8 in g_main_loop_run () from target:/usr/lib/libglib-2.0.so.0
#18 0x76b5b11c in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#19 0x75622338 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#20 0x74f59b58 in __libc_start_main () from target:/lib/libc.so.6
#21 0x0045d8d0 in _start ()

From all threads in the web process, after much tinkering around it slowly became clear that one of the places to look into is that poll() call. I will spare you the details related to what other threads were doing, suffice to say that whenever the browser would hit the bug, there was a similar stacktrace in one thread, going through libEGL to a call to poll() on top of the stack, that would never return. Unfortunately, a stripped EGL driver coming from a proprietary graphics vendor was a bit of a showstopper, as it was the inability to have proper debugging symbols running inside the device (did you know that a non-stripped WebKit library binary with debugging symbols can easily get GDB and your device out of memory?). The best one could do to improve that was to use the gcore feature in GDB, and extract a core from the device for post-mortem analysis. But for some reason, such a stacktrace wouldn&apost give anything interesting below the poll() call to understand what&aposs being polled here. Did I say this was tricky?

What polls?

Because WebKit is a multiprocess web engine, having system calls that signal, read, and write in sockets communicating with other processes is an everyday thing. Not knowing what a poll() call is doing and who is it that it&aposs trying to listen to, not very good. Because the call is happening under the EGL library, one can presume that it&aposs graphics related, but there are still different possibilities, so trying to find out what is this polling is a good idea.

A trick I learned while debugging this is that, in absence of debugging symbols that would give a straightforward look into variables and parameters, one can examine the CPU registers and try to figure out from them what the parameters to function calls are. Let&aposs do that with poll(). First, its signature.

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

Now, let's examine the registers.

(gdb) f 0
#0  0x7500ab9c in poll () from target:/lib/libc.so.6
(gdb) info registers
r0             0x7ea55e58	2124766808
r1             0x1	1
r2             0x64	100
r3             0x0	0
r4             0x0	0

Registers r0, r1, and r2 contain poll()&aposs three parameters. Because r1 is 1, we know that there is only one file descriptor being polled. fds is a pointer to an array with one element then. Where is that first element? Well, right there, in the memory pointed to directly by r0. What does struct pollfd look like?

struct pollfd {
  int   fd;         /* file descriptor */
  short events;     /* requested events */
  short revents;    /* returned events */
};

What we are interested in here is the contents of fd, the file descriptor that is being polled. Memory alignment is again in our side, we don&apost need any pointer arithmetic here. We can inspect directly the register r0 and find out what the value of fd is.

(gdb) print *0x7ea55e58
$3 = 8

So we now know that the EGL library is polling the file descriptor with an identifier of 8. But where is this file descriptor coming from? What is on the other end? The /proc file system can be helpful here.

# pidof WPEWebProcess
1944 1196
# ls -lh /proc/1944/fd/8
lrwx------    1 x x      64 Oct 22 13:59 /proc/1944/fd/8 -> socket:[32166]

So we have a socket. What else can we find out about it? Turns out, not much without the unix_diag kernel module, which was not available in our device. But we are slowly getting closer. Time to call another good friend.

Where GDB fails, printf() triumphs

Something I have learned from many years working with a project as large as WebKit, is that debugging symbols can be very difficult to work with. To begin with, it takes ages to build WebKit with them. When cross-compiling, it&aposs even worse. And then, very often the target device doesn&apost even have enough memory to load the symbols when debugging. So they can be pretty useless. It&aposs then when just using fprintf() and logging useful information can simplify things. Since we know that it&aposs at some point during initialization of the web process that we end up stuck, and we also know that we&aposre polling a file descriptor, let&aposs find some early calls in the code of the web process and add some fprintf() calls with a bit of information, specially in those that might have something to do with EGL. What can we find out now?

Oct 19 10:13:27.700335 WPEWebProcess[92]: Starting
Oct 19 10:13:27.720575 WPEWebProcess[92]: Initializing WebProcess platform.
Oct 19 10:13:27.727850 WPEWebProcess[92]: wpe_loader_init() done.
Oct 19 10:13:27.729054 WPEWebProcess[92]: Initializing PlatformDisplayLibWPE (hostFD: 8).
Oct 19 10:13:27.730166 WPEWebProcess[92]: egl backend created.
Oct 19 10:13:27.741556 WPEWebProcess[92]: got native display.
Oct 19 10:13:27.742565 WPEWebProcess[92]: initializeEGLDisplay() starting.

Two interesting findings from the fprintf()-powered logging here: first, it seems that file descriptor 8 is one known to libwpe (the general-purpose library that powers the WPE WebKit port). Second, that the last EGL API call right before the web process hangs on poll() is a call to eglInitialize(). fprintf(), thanks for your service.

Number 8

We now know that the file descriptor 8 is coming from WPE and is not internal to the EGL library. libwpe gets this file descriptor from the UI process, as one of the many creation parameters that are passed via IPC to the nascent process in order to initialize it. Turns out that this file descriptor in particular, the so-called host client file descriptor, is the one that the freedesktop backend of libWPE, from here onwards WPEBackend-fdo, creates when a new client is set to connect to its Wayland display. In a nutshell, in presence of a new client, a Wayland display is supposed to create a pair of connected sockets, create a new client on the Display-side, give it one of the file descriptors, and pass the other one to the client process. Because this will be useful later on, let&aposs see how is that currently implemented in WPEBackend-fdo.

    int pair[2];
    if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, pair)  0)
        return -1;

    int clientFd = dup(pair[1]);
    close(pair[1]);

    wl_client_create(m_display, pair[0]);

The file descriptor we are tracking down is the client file descriptor, clientFd. So we now know what&aposs going on in this socket: Wayland-specific communication. Let&aposs enable Wayland debugging next, by running all relevant process with WAYLAND_DEBUG=1. We&aposll get back to that code fragment later on.

A Heisenbug is a Heisenbug is a Heisenbug

Turns out that enabling Wayland debugging output for a few processes is enough to alter the state of the system in such a way that the bug does not happen at all when doing manual testing. Thankfully the CI&aposs reproducibility is much higher, so after waiting overnight for the CI to continuously run until it hit the bug, we have logs. What do the logs say?

WPEWebProcess[41]: initializeEGLDisplay() starting.
  -> wl_display@1.get_registry(new id wl_registry@2)
  -> wl_display@1.sync(new id wl_callback@3)

So the EGL library is trying to fetch the Wayland registry and it&aposs doing a wl_display_sync() call afterwards, which will block until the server responds. That&aposs where the blocking poll() call comes from. So, it turns out, the problem is not necessarily on this end of the Wayland socket, but perhaps on the other side, that is, in the so-called UI process (the main browser process). Why is the Wayland display not replying?

The loop

Something that is worth mentioning before we move on is how the WPEBackend-fdo Wayland display integrates with the system. This display is a nested display, with each web view a client, while it is itself a client of the system&aposs Wayland display. This can be a bit confusing if you&aposre not very familiar with how Wayland works, but fortunately there is good documentation about Wayland elsewhere.

The way that the Wayland display in the UI process of a WPEWebKit browser is integrated with the rest of the program, when it uses WPEBackend-fdo, is through the GLib main event loop. Wayland itself has an event loop implementation for servers, but for a GLib-powered application it can be useful to use GLib&aposs and integrate Wayland&aposs event processing with the different stages of the GLib main loop. That is precisely how WPEBackend-fdo is handling its clients&apos events. As discussed earlier, when a new client is created a pair of connected sockets are created and one end is given to Wayland to control communication with the client. GSourceFunc functions are used to integrate Wayland with the application main loop. In these functions, we make sure that whenever there are pending messages to be sent to clients, those are sent, and whenever any of the client sockets has pending data to be read, Wayland reads from them, and to dispatch the events that might be necessary in response to the incoming data. And here is where things start getting really strange, because after doing a bit of fprintf()-powered debugging inside the Wayland-GSourceFuncs functions, it became clear that the Wayland events from the clients were never dispatched, because the dispatch() GSourceFunc was not being called, as if there was nothing coming from any Wayland client. But how is that possible, if we already know that the web process client is actually trying to get the Wayland registry?

To move forward, one needs to understand how the GLib main loop works, in particular, with Unix file descriptor sources. A very brief summary of this is that, during an iteration of the main loop, GLib will poll file descriptors to see if there are any interesting events to be reported back to their respective sources, in which case the sources will decide whether to trigger the dispatch() phase. A simple source might decide in its dispatch() method to directly read or write from/to the file descriptor; a Wayland display source (as in our case), will call wl_event_loop_dispatch() to do this for us. However, if the source doesn&apost find any interesting events, or if the source decides that it doesn&apost want to handle them, the dispatch() invocation will not happen. More on the GLib main event loop in its API documentation.

So it seems that for some reason the dispatch() method is not being called. Does that mean that there are no interesting events to read from? Let&aposs find out.

System call tracing

Here we resort to another helpful tool, strace. With strace we can try to figure out what is happening when the main loop polls file descriptors. The strace output is huge (because it takes easily over a hundred attempts to reproduce this), but we know already some of the calls that involve file descriptors from the code we looked at above, when the client is created. So we can use those calls as a starting point in when searching through the several MBs of logs. Fast-forward to the relevant logs.

socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [128, 130]) = 0
dup(130)               = 131
close(130)             = 0
fcntl64(128, F_DUPFD_CLOEXEC, 0) = 130
epoll_ctl(34, EPOLL_CTL_ADD, 130, {EPOLLIN, {u32=1639599928, u64=1639599928}}) = 0

What we see there is, first, WPEBackend-fdo creating a new socket pair (128, 130) and then, when file descriptor 130 is passed to wl_client_create() to create a new client, Wayland adds that file descriptor to its epoll() instance for monitoring clients, which is referred to by file descriptor 34. This way, whenever there are events in file descriptor 130, we will hear about them in file descriptor 34.

So what we would expect to see next is that, after the web process is spawned, when a Wayland client is created using the passed file descriptor and the EGL driver requests the Wayland registry from the display, there should be a POLLIN event coming in file descriptor 34 and, if the dispatch() call for the source was called, a epoll_wait() call on it, as that is what wl_event_loop_dispatch() would do when called from the source&aposs dispatch() method. But what do we have instead?

poll([{fd=30, events=POLLIN}, {fd=34, events=POLLIN}, {fd=59, events=POLLIN}, {fd=110, events=POLLIN}, {fd=114, events=POLLIN}, {fd=132, events=POLLIN}], 6, 0) = 1 ([{fd=34, revents=POLLIN}])
recvmsg(30, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)

strace can be a bit cryptic, so let&aposs explain those two function calls. The first one is a poll in a series of file descriptors (including 30 and 34) for POLLIN events. The return value of that call tells us that there is a POLLIN event in file descriptor 34 (the Wayland display epoll() instance for clients). But unintuitively, the call right after is trying to read a message from socket 30 instead, which we know doesn&apost have any pending data at the moment, and consequently returns an error value with an errno of EAGAIN (Resource temporarily unavailable).

Why is the GLib main loop triggering a read from 30 instead of 34? And who is 30?

We can answer the latter question first. Breaking on a running UI process instance at the right time shows who is reading from the file descriptor 30:

#1  0x70ae1394 in wl_os_recvmsg_cloexec (sockfd=30, msg=msg@entry=0x700fea54, flags=flags@entry=64)
#2  0x70adf644 in wl_connection_read (connection=0x6f70b7e8)
#3  0x70ade70c in read_events (display=0x6f709c90)
#4  wl_display_read_events (display=0x6f709c90)
#5  0x70277d98 in pwl_source_check (source=0x6f71cb80)
#6  0x743f2140 in g_main_context_check (context=context@entry=0x2111978, max_priority=, fds=fds@entry=0x6165f718, n_fds=n_fds@entry=4)
#7  0x743f277c in g_main_context_iterate (context=0x2111978, block=block@entry=1, dispatch=dispatch@entry=1, self=)
#8  0x743f2ba8 in g_main_loop_run (loop=0x20ece40)
#9  0x00537b38 in ?? ()

So it&aposs also Wayland, but on a different level. This is the Wayland client source (remember that the browser is also a Wayland client?), which is installed by cog (a thin browser layer on top of WPE WebKit that makes writing browsers easier to do) to process, among others, input events coming from the parent Wayland display. Looking at the cog code, we can see that the wl_display_read_events() call happens only if GLib reports that there is a G_IO_IN (POLLIN) event in its file descriptor, but we already know that this is not the case, as per the strace output. So at this point we know that there are two things here that are not right:

  1. A FD source with a G_IO_IN condition is not being dispatched.
  2. A FD source without a G_IO_IN condition is being dispatched.

Someone here is not telling the truth, and as a result the main loop is dispatching the wrong sources.

The loop (part II)

It is at this point that it would be a good idea to look at what exactly the GLib main loop is doing internally in each of its stages and how it tracks the sources and file descriptors that are polled and that need to be processed. Fortunately, debugging symbols for GLib are very small, so debugging this step by step inside the device is rather easy.

Let&aposs look at how the main loop decides which sources to dispatch, since for some reason it&aposs dispatching the wrong ones. Dispatching happens in the g_main_dispatch() method. This method goes over a list of pending source dispatches and after a few checks and setting the stage, the dispatch method for the source gets called. How is a source set as having a pending dispatch? This happens in g_main_context_check(), where the main loop checks the results of the polling done in this iteration and runs the check() method for sources that are not ready yet so that they can decide whether they are ready to be dispatched or not. Breaking into the Wayland display source, I know that the check() method is called. How does this method decide to be dispatched or not?

    [](GSource* base) -> gboolean
    {
        auto& source = *reinterpret_cast(base);
        return !!source.pfd.revents;
    },

In this lambda function we&aposre returning TRUE or FALSE, depending on whether the revents field in the GPollFD structure have been filled during the polling stage of this iteration of the loop. A return value of TRUE indicates the main loop that we want our source to be dispatched. From the strace output, we know that there is a POLLIN (or G_IO_IN) condition, but we also know that the main loop is not dispatching it. So let&aposs look at what&aposs in this GPollFD structure.

For this, let&aposs go back to g_main_context_check() and inspect the array of GPollFD structures that it received when called. What do we find?

(gdb) print *fds
$35 = {fd = 30, events = 1, revents = 0}
(gdb) print *(fds+1)
$36 = {fd = 34, events = 1, revents = 1}

That&aposs the result of the poll() call! So far so good. Now the method is supposed to update the polling records it keeps and it uses when calling each of the sources check() functions. What do these records hold?

(gdb) print *pollrec->fd
$45 = {fd = 19, events = 1, revents = 0}
(gdb) print *(pollrec->next->fd)
$47 = {fd = 30, events = 25, revents = 1}
(gdb) print *(pollrec->next->next->fd)
$49 = {fd = 34, events = 25, revents = 0}

We&aposre not interested in the first record quite yet, but clearly there&aposs something odd here. The polling records are showing a different value in the revent fields for both 30 and 34. Are these records updated correctly? Let&aposs look at the algorithm that is doing this update, because it will be relevant later on.

  pollrec = context->poll_records;
  i = 0;
  while (pollrec && i  n_fds)
    {
      while (pollrec && pollrec->fd->fd == fds[i].fd)
        {
          if (pollrec->priority = max_priority)
            {
              pollrec->fd->revents =
                fds[i].revents & (pollrec->fd->events | G_IO_ERR | G_IO_HUP | G_IO_NVAL);
            }
          pollrec = pollrec->next;
        }

      i++;
    }

In simple words, what this algorithm is doing is to traverse simultaneously the polling records and the GPollFD array, updating the polling records revents with the results of polling. From reading how the pollrec linked list is built internally, it&aposs possible to see that it&aposs purposely sorted by increasing file descriptor identifier value. So the first item in the list will have the record for the lowest file descriptor identifier, and so on. The GPollFD array is also built in this way, allowing for a nice optimization: if more than one polling record – that is, more than one polling source – needs to poll the same file descriptor, this can be done at once. This is why this otherwise O(n^2) nested loop can actually be reduced to linear time.

One thing stands out here though: the linked list is only advanced when we find a match. Does this mean that we always have a match between polling records and the file descriptors that have just been polled? To answer that question we need to check how is the array of GPollFD structures filled. This is done in g_main_context_query(), as we hinted before. I&aposll spare you the details, and just focus on what seems relevant here: when is a poll record not used to fill a GPollFD?

  n_poll = 0;
  lastpollrec = NULL;
  for (pollrec = context->poll_records; pollrec; pollrec = pollrec->next)
    {
      if (pollrec->priority > max_priority)
        continue;
  ...

Interesting! If a polling record belongs to a source whose priority is lower than the maximum priority that the current iteration is going to process, the polling record is skipped. Why is this?

In simple terms, this happens because each iteration of the main loop finds out the highest priority between the sources that are ready in the prepare() stage, before polling, and then only those file descriptor sources with at least such a a priority are polled. The idea behind this is to make sure that high-priority sources are processed first, and that no file descriptor sources with lower priority are polled in vain, as they shouldn&apost be dispatched in the current iteration.

GDB tells me that the maximum priority in this iteration is -60. From an earlier GDB output, we also know that there&aposs a source for a file descriptor 19 with a priority 0.

(gdb) print *pollrec
$44 = {fd = 0x7369c8, prev = 0x0, next = 0x6f701560, priority = 0}
(gdb) print *pollrec->fd
$45 = {fd = 19, events = 1, revents = 0}

Since 19 is lower than 30 and 34, we know that this record is before theirs in the linked list (and so it happens, it&aposs the first one in the list too). But we know that, because its priority is 0, it is too low to be added to the file descriptor array to be polled. Let&aposs look at the loop again.

  pollrec = context->poll_records;
  i = 0;
  while (pollrec && i  n_fds)
    {
      while (pollrec && pollrec->fd->fd == fds[i].fd)
        {
          if (pollrec->priority = max_priority)
            {
              pollrec->fd->revents =
                fds[i].revents & (pollrec->fd->events | G_IO_ERR | G_IO_HUP | G_IO_NVAL);
            }
          pollrec = pollrec->next;
        }

      i++;
    }

The first polling record was skipped during the update of the GPollFD array, so the condition pollrec && pollrec->fd->fd == fds[i].fd is never going to be satisfied, because 19 is not in the array. The innermost while() is not entered, and as such the pollrec list pointer never moves forward to the next record. So no polling record is updated here, even if we have updated revent information from the polling results.

What happens next should be easy to see. The check() method for all polled sources are called with outdated revents. In the case of the source for file descriptor 30, we wrongly tell it there&aposs a G_IO_IN condition, so it asks the main loop to call dispatch it triggering a a wl_connection_read() call in a socket with no incoming data. For the source with file descriptor 34, we tell it that there&aposs no incoming data and its dispatch() method is not invoked, even when on the other side of the socket we have a client waiting for data to come and blocking in the meantime. This explains what we see in the strace output above. If the source with file descriptor 19 continues to be ready and with its priority unchanged, then this situation repeats in every further iteration of the main loop, leading to a hang in the web process that is forever waiting that the UI process reads its socket pipe.

The bug – explained

I have been using GLib for a very long time, and I have only fixed a couple of minor bugs in it over the years. Very few actually, which is why it was very difficult for me to come to accept that I had found a bug in one of the most reliable and complex parts of the library. Impostor syndrome is a thing and it really gets in the way.

But in a nutshell, the bug in the GLib main loop is that the very clever linear update of registers is missing something very important: it should skip to the first polling record matching before attempting to update its revents. Without this, in the presence of a file descriptor source with the lowest file descriptor identifier and also a lower priority than the cutting priority in the current main loop iteration, revents in the polling registers are not updated and therefore the wrong sources can be dispatched. The simplest patch to avoid this, would look as follows.

   i = 0;
   while (pollrec && i  n_fds)
     {
+      while (pollrec && pollrec->fd->fd != fds[i].fd)
+        pollrec = pollrec->next;
+
       while (pollrec && pollrec->fd->fd == fds[i].fd)
         {
           if (pollrec->priority = max_priority)

Once we find the first matching record, let&aposs update all consecutive records that also match and need an update, then let&aposs skip to the next record, rinse and repeat. With this two-line patch, the web process was finally unlocked, the EGL display initialized properly, the web extension and the web page were loaded, CI tests starting passing again, and this exhausted developer could finally put his mind to rest.

A complete patch, including improvements to the code comments around this fascinating part of GLib and also a minimal test case reproducing the bug have already been reviewed by the GLib maintainers and merged to both stable and development branches. I expect that at least some GLib sources will start being called in a different (but correct) order from now on, so keep an eye on your GLib sources. :-)

Standing on the shoulders of giants

At this point I should acknowledge that without the support from my colleagues in the WebKit team in Igalia, getting to the bottom of this problem would have probably been much harder and perhaps my sanity would have been at stake. I want to thank Adrián and &Zcaronan for their input on Wayland, debugging techniques, and for allowing me to bounce back and forth ideas and findings as I went deeper into this rabbit hole, helping me to step out of dead-ends, reminding me to use tools out of my everyday box, and ultimately, to be brave enough to doubt GLib&aposs correctness, something that much more often than not I take for granted.

Thanks also to Philip and Sebastian for their feedback and prompt code review!

October 29, 2020 01:10 PM