Planet Igalia

October 20, 2016

Iago Toral

OpenGL terrain renderer update: Bloom and Cascaded Shadow Maps

Bloom Filter

The bloom filter makes bright objects glow and bleed through other objects positioned in between them and the camera. It is a common post-processing effect used all the time in video games and animated movies. The demo supports a couple of configuration options that control the intensity and behavior of the filter, here are some screenshots with different settings:

Bloom filter Off
Bloom filter On, default settings
Bloom filter On, intensity increased

I particularly like the glow effect that this brings to the specular reflections on the water surface, although to really appreciate that you need to run the demo and see it in motion.

Cascaded Shadow Maps

I should really write a post about basic shadow mapping before going into the details of Cascaded Shadow Maps, so for now I’ll just focus on the problem they try to solve.

One of the problems with shadow mapping is rendering high resolution shadows, specially for shadows that are rendered close to the camera. Generally, basic shadow mapping provides two ways in which we can improve the resolution of the shadows we render:

1. Increase the resolution of the shadow map textures. This one is obvious but comes at a high performance (and memory) hit.

2. Reduce the distance at which we can render shadows. But this is not ideal of course.

One compromise solution is to notice that, as usual with 3D computer graphics, it is far more important to render nearby objects in high quality than distant ones.

Cascaded Shadow Maps allow us to use different levels of detail for shadows that are rendered at different distances from the camera. Instead of having a single shadow map for all the shadows, we split the viewing frustum into slices and render shadows in each slice to a different shadow map.

There are two immediate benefits of this technique:

1. We have flexibility to define the resolution of the shadow maps for each level of the cascade, allowing us, for example, to increase the resolution of the levels closest to the camera and maybe reduce those that are further away.

2. Each level only records shadows in a slice of the viewing frustum, which increases shadow resolution even if we keep the same texture resolution we used with the original shadow map implementation for each shadow map level.

This approach also has some issues:

1. We need to render multiple shadow maps, which can be a serious performance hit depending on the resolutions of the shadow maps involved. This is why we usually lower the resolution of the shadow maps as distance from the camera increases.

2. As we move closer to or further from shadowed objects we can see the changes in shadow quality pop-in. Of course we can control this by avoiding drastic quality changes between consecutive levels in the cascade.

Here is an example that illustrates the second issue (in this case I have lowered the resolution of the 2nd and 3d cascade levels to 50% and 25% respectively so the effect was more obvious). The screenshots show the rendering of the shadows at different distances. We can see how the shadows in the close-up shot are very sharp and as the distance increases they become blurrier due to the use of a lower resolution shadow map:

CSM level 0 (4096×4096)
CSM level 1 (2048×2048)
CSM level 2 (1024×1024)

The demo supports up to 4 shadow map levels although the default configuration is to use 3. The resolution of each level can be configured separately too, in the default configuration I lowered the shadow resolution of the second and third levels to 75% and 50% respectively. If we configure the demo to run on a single level (with 100% texture resolution), we are back to the original shadow map implementation, so it is easy to experiment with both techniques.

I intend to cover the details behind shadow mapping and the implementation of the bloom filter in more detail in a future post, so again, stay tuned for more!

by Iago Toral at October 20, 2016 10:32 AM

October 13, 2016

Iago Toral

OpenGL terrain renderer: rendering the terrain mesh

In this post I’ll discuss how I setup and render terrain mesh in the OpenGL terrain rendering demo. Most of the relevant code for this is in the ter-terrain.cpp file.

Setting up a grid of vertices

Unless you know how to use a 3D modeling program properly, a reasonable way to create a decent mesh for a terrain consists in using a grid of vertices and elevate them according to a height map image. In order to create the grid we only need to decide how many rows and columns we want. This, in the end, determines the number of polygons and the resolution of the terrain.

We need to map these vertices to world coordinates too. We do that by defining a tile size, which is the distance between consecutive vertices in world units. Larger tile sizes increase the size of the terrain but lower the resolution by creating larger polygons.

The image below shows an 8×6 grid that defines 35 tiles. Each tile is rendered using 2 triangles:

8×6 terrain grid

The next step is to elevate these vertices so we don’t end up with a boring flat surface. We do this by sampling the height map image for each vertex in the grid. A height map is a gray scale image where the values of the pixels represent altitudes at different positions. The closer the color is to white, the more elevated it is.

Heightmap image

Adding more vertices to the grid increases the number of sampling points from the height map and reduces the sampling distances, leading to a smoother and more precise representation of the height map in the resulting terrain.

Sampling the heightmap to compute vertex heights

Of course, we still need to map the height map samples (gray scale colors) to altitudes in world units. In the demo I do this by normalizing the color values to [-1,+1] and then applying a scale factor to compute the altitude values in world space. By playing with the scaling factor we can make our terrain look more or less abrupt.

Altitude scale=6.0
Altitude scale=12.0

For reference, the height map sampling is implemented in ter_terrain_set_heights_from_texture().

Creating the mesh

At this point we know the full position (x, y, z) in world coordinates of all the vertices in our grid. The next step is to build the actual triangle mesh that we will use to render the terrain and the normal vectors for each triangle. This process is described below and is implemented in the ter_terrain_build_mesh() function.

Computing normals

In order to get nice lighting on our terrain we need to compute normals for each vertex in the mesh. A simple way to achieve this is would be to compute the normal for each face (triangle) and use that normal for each vertex in the triangle. This works, but it has 3 problems:

1. Every vertex of each triangle has the same exact normal, which leads to a rather flat result.

2. Adjacent triangles with different orientations showcase abrupt changes in the normal value, leading to significantly different lighting across the surfaces that highlight the individual triangles in the mesh.

3. Because each vertex in the mesh can have a different normal value for each triangle it participates in, we need to replicate the vertices when we render, which is not optimal.

Alternatively, we can compute the normals for each vertex considering the heights of its neighboring vertices. This solves all the problems mentioned above and leads to much better results thanks to the interpolation of the normal vectors across the triangles, which leads to smooth lighting reflection transitions:

Flat normals
Smooth normals

The implementation for this is in the function calculate_normal(), which takes the column and row indices of the vertex in the grid and calculates the Y coordinate by sampling the heights of the 4 nearby vertices in the grid.

Preparing the draw call

Now that we know the positions of all the vertices and their normal vectors we have all the information that we need to render the terrain. We still have to decide how exactly we want to render all the polygons.

The simplest way to render the terrain using a single draw call is to setup a vertex buffer with data for each triangle in the mesh (including position and normal information) and use GL_TRIANGLES for the primitive of the draw call. This, however, is not the best option from the point of view of performance.

Because the terrain will typically contain a large number of vertices and most of them participate in multiple triangles, we end up uploading a large amount of vertex data to the GPU and processing a lot of vertices in the draw call. The result is large memory requirements and suboptimal performance.

For reference, the terrain I used in the demo from my original post used a 251×251 grid. This grid represents 250×250 tiles, each one rendered as two triangles (6 vertices/tile), so we end up with 250x250x6=375,000 vertices. For each of these vertices we need to upload 24 bytes of vertex data with the position and normal, so we end up with a GPU buffer that is almost 9MB large.

One obvious way to reduce this is to render the terrain using triangle strips. The problem with this is that in theory, we can’t render the terrain with just one strip, we would need one strip (and so, one draw call) per tile column or one strip per tile row. Fortunately, we can use degenerate triangles to link separate strips for each column into a single draw call. With this we trim down the number of vertices to 126,000 and the size of the buffer to a bit below 3 MB. This alone produced a 15%-20% performance increase in the demo.

We can do better though. A lot of the vertices in the terrain mesh participate in various triangles across the large triangle strip in the draw call, so we can reduce memory requirements by using an index buffer to render the strip. If we do this, we trim things down to 63,000 vertices and ~1.5MB. This added another 4%-5% performance bonus over the original implementation.


So far we have been rendering the full mesh of the terrain in each frame and we do this by uploading the vertex data to the GPU just once (for example in the first frame). However, depending on where the camera is located and where it is looking at, just a fraction of the terrain may be visible.

Although the GPU will discard all the geometry and fragments that fall outside the viewport, it still has to process each vertex in the vertex shader stage before it can clip non-visible triangles. Because the number of triangles in the terrain is large, this is suboptimal and to address this we want to do CPU-side clipping before we render.

Doing CPU-side clippig comes with some additional complexities though: it requires that we compute the visible region of the terrain and upload new vertex data to the GPU in each frame preventin GPU stalls.

In the demo, we implement the clipping by computing a quad sub-region of the terrain that includes the visible area that we need to render. Once we know the sub-region that we want to render, we compute the new indices of the vertices that participate in the region so we can render it using a single triangle strip. Finally, we upload the new index data to the index buffer for use in the follow-up draw call.

Avoiding GPU stalls

Although all the above is correct, it actually leads, as described, to much worse performance in general. The reason for this is that our uploads of vertex data in each frame lead to frequent GPU stalls. This happens in two scenarios:

1. In the same frame, because we need to upload different vertex data for the rendering of the terrain for the shadow map and the scene (the shadow map renders the terrain from the point of view of the light, so the visible region of the terrain is different). This creates stalls because the rendering of the terrain for the shadow map might not have completed before we attempt to upload new data to the index buffer in order to render the terrain for the scene.

2. Between different frames. Because the GPU might not be completely done rendering the previous frame (and thus, stills needs the index buffer data available) before we start preparing the next frame and attempt to upload new terrain index data for it.

In the case of the Intel Mesa driver, these GPU stalls can be easily identified by using the environment variable INTEL_DEBUG=perf. When using this, the driver will detect these situations and produce warnings informing about the stalls, the buffers affected and the regions of the buffers that generate the stall, such as:

Stalling on glBufferSubData(0, 503992) (492kb) to a busy (0-1007984)
buffer object.  Use glMapBufferRange() to avoid this.

The solution to this problem that I implemented (other than trying to put as much work as possible between read/write accesses to the index buffer) comes in two forms:

1. Circular buffers

In this case, we allocate a larger buffer than we need so that each subsequent upload of new index data happens in a separate sub-region of the allocated buffer. I set up the demo so that each circular buffer is large enough to hold the index data required for all updates of the index buffer happening in each frame (the shadow map and the scene).

2. Multi-buffering

We allocate more than one circular buffer. When we don’t have enough free space at the end of the current buffer to upload the new index buffer data, we upload it to a different circular buffer instead. When we run out of buffers we circle back to the first one (which at this point will hopefully be free to be re-used again).

So why not just use a single, very large circular buffer? Mostly because there are limits to the size of the buffers that the GPU may be able to handle correctly (or efficiently). Also, why not having many smaller independent buffers instead of circular buffers? That would work just fine, but using fewer, larger buffers reduces the number of objects we need to bind/unbind and is better to prevent memory fragmentation, so that’s a plus.

Final touches

We are almost done, at this point we only need to add a texture to the terrain surface, add some slight fog effect for distant pixels to create a more realistic look, add a skybox (it is important to choose the color of the fog so it matches the color of the sky!) and tweak the lighting parameters to get a nice result:

Final rendering

I hope to cover some of these aspects in future posts, so stay tuned for more!

by Iago Toral at October 13, 2016 02:38 PM

October 12, 2016

Manuel Rego

Grid Layout Summertime

Summer season is over and despite of some nice holidays to rest, the Igalia team kept doing progress on the CSS Grid Layout implementation in Chromium/Blink and Safari/WebKit as part of our ongoing collaboration with Bloomberg.

By the end of September the spec has transitioned to Candidate Recommendation (CR), Rachel Andrew wrote a blog post explaining what does it mean. However in the previous months there were a few changes here and there that require the implementations to get updated.

If you remember my presentation on the last BlinkOn we were reviewing the status of the implementation. By that time there were a bunch things marked as WIP or TODO, but now most of them have been already implemented as I’ll explain in this post.

Auto repeat

My mate Sergio Villar already explained this feature thoroughly in a blog post so I won’t repeat it.

The new thing is that auto-fit keyword implementation has already landed, so you can use it too. auto-fit allows you to collapse the tracks that doesn’t contain any item.

So now you can do something like this:

<div style="display: grid; width: 700px;
            grid-template-columns: repeat(auto-fit, 150px); grid-template-rows: 100px;">
    <div style="grid-column: 1 / span 2;">grid-column: 1 / span 2</div>
    <div style="grid-column: 4;">grid-column: 4</div>

auto-fit example auto-fit example

In the example, there’s room for 4 columns of 150px however as the third column doesn’t have any item, so it’s collapsed to 0px, and you end up like having only 3 columns on your grid container. If you were using auto-fill keyword, the 3rd column won’t be collapsed and it will remain empty.

Multiple tracks

One of the changes on the spec was adding the possibility to specify more than one track in some of the properties: grid-auto-columns & grid-auto-rows and repeat() notation.

So for example you can now pass a track list to the grid-auto-columns property like:

<div style="display: grid;
            grid-auto-columns: 200px 50px; grid-template-rows: 100px;">
    <div style="grid-row: 1;">A</div>
    <div style="grid-row: 1;">B</div>
    <div style="grid-row: 1;">C</div>
    <div style="grid-row: 1;">D</div>

Multiple tracks on grid-auto-columns example Multiple tracks on grid-auto-columns example

As you see the first and third columns will have a 200px width, and the second and fourth ones will have a 50px width.

On top of that, you can also use a track list on the repeat() notation:

<div style="display: grid;
            grid-template-columns: repeat(3, 200px 50px); grid-template-rows: 100px;">
    <div style="grid-row: 1;">A</div>
    <div style="grid-row: 1;">B</div>
    <div style="grid-row: 1;">C</div>
    <div style="grid-row: 1;">D</div>
    <div style="grid-row: 1;">E</div>
    <div style="grid-row: 1;">F</div>

Multiple tracks on repeat() notation example Multiple tracks on repeat() notation example

This seems a small change, and it was not very hard to implement. However, this combined with the auto repeat feature required a few more changes than expected to make everything work properly.

fit-content() cap

fit-content keyword has been updated and now you can pass an argument to it that is used as a maximum size. As an argument for the fit-content() function you can specify either a length or a percentage (that will be resolved like for percentage tracks).

So in Grid Layout now you can use fit-content(argument) and the track size will be clamped at the argument:

<div style="display: inline-grid;
            grid-template-columns: repeat(2, fit-content(200px)); grid-template-rows: 100px;">
    <div>an item</div>
    <div>a very long grid item</div>

fit-content() function example fit-content() function example

As you can see the first column uses fit-content so its size is adapted to the item inside. However the second column would need more than 200px in order that the whole item fits, so its size is clamped to 200px.

Percentage support for grid gutters

Grid gutters have been added a while ago into the different implementations, but we were lacking support for percentage gaps.

We’ve been updating the implementation so now you can use percentages too (again they’re resolved like percentage tracks):

<div style="display: grid; width: 400px;
            grid-column-gap: 10%;
            grid-auto-columns: 1fr; grid-auto-rows: 100px;">
    <div style="grid-row: 1;">A</div>
    <div style="grid-row: 1;">B</div>
    <div style="grid-row: 1;">C</div>
    <div style="grid-row: 1;">D</div>

Percentage gaps example Percentage gaps example

The 10% column gap is resolved against the width of the grid container to 40px. Then the 1fr tracks take the available space, which in this case means 70px for each column.

The percentage support for grid gaps is marked as at-risk in the spec, however it has been already implemented in Gecko and Blink, so probably it won’t be removed from this level.

New syntax for grid shorthand

The grid shorthand syntax has been modified so now you can specify in just one property, the explicit grid in one axis and the implicit grid and the auto-placement algorithm mode in the other one.

For example, if you want a grid that has 2 columns of a given size and the rows will be created on demand, you could use the following syntax:

<div style="display: grid;
            grid: auto-flow 100px / 200px 100px;">

grid shorthand example grid shorthand example

Orthogonal flows

Orthogonal flows refers to situations when the writing mode of the grid container uses a different direction than some of its items. For example, if you have a horizontal grid container and some vertical items inside it.

This has been a long task but right now you can use orthogonal flows on grid layout already. The basic support has been implemented including alignment of orthogonal items.

So now you can use vertical items inside a horizontal grid layout:

<div style="display: inline-grid;">
  <div style="grid-row: 1;">Test</div>
  <div style="grid-row: 1; writing-mode: vertical-lr;">Chrome</div>
  <div style="grid-row: 1; writing-mode: vertical-lr;">Firefox</div>
  <div style="grid-row: 1; writing-mode: vertical-lr;">Safari</div>

Orthogonal flows example Orthogonal flows example

New normal value for alignment properties

This is a pretty complex issue from the specs and implementations point of view (as you can see in the epic review for this patch), but probably not very important for the end users because the default behavior on Grid Layout won’t be altered. Now there’s a new value normal for the alignment properties: justify-content, align-content, justify-items, align-items, justify-self and align-self.

The interesting thing regarding this new normal value is that it behaves differently depending on the layout model. In general, the behavior in the Grid Layout case is the same than when you use the stretch keyword.


This is just a high level overview of the main things that happened during the summer on the Blink and WebKit implementations leaded by Igalia. Of course, I’m missing some much more stuff like lots of bug fixes and even performance optimizations (e.g. nested grids are now 350% faster than before).

Maybe for some of you some of these changes are really great, maybe for other they’re not very relevant. Anyway the good news are that things keep moving forward, and the implementations are getting closer and closer to the specification. This means that we’re on the right path to have CSS Grid Layout enabled on most of the main browsers soon.

Igalia and Bloomberg working together to build a better web Igalia and Bloomberg working together to build a better web

Last, but not least, as usual we’d like to thank our friends at Bloomberg for all their amazing support which allows us to keep pushing Grid Layout.

October 12, 2016 10:00 PM

Andy Wingo

An incomplete history of language facilities for concurrency

I have lately been in the market for better concurrency facilities in Guile. I want to be able to write network servers and peers that can gracefully, elegantly, and efficiently handle many tens of thousands of clients and other connections, but without blowing the complexity budget. It's a hard nut to crack.

Part of the problem is implementation, but a large part is just figuring out what to do. I have often thought that modern musicians must be crushed under the weight of recorded music history, but it turns out in our humble field that's also the case; there are as many concurrency designs as languages, just about. In this regard, what follows is an incomplete, nuanced, somewhat opinionated history of concurrency facilities in programming languages, with an eye towards what I should "buy" for the Fibers library I have been tinkering on for Guile.

* * *

Modern machines have the raw capability to serve hundreds of thousands of simultaneous long-lived connections, but it’s often hard to manage this at the software level. Fibers tries to solve this problem in a nice way. Before discussing the approach taken in Fibers, it’s worth spending some time on history to see how we got here.

One of the most dominant patterns for concurrency these days is “callbacks”, notably in the Twisted library for Python and the Node.js run-time for JavaScript. The basic observation in the callback approach to concurrency is that the efficient way to handle tens of thousands of connections at once is with low-level operating system facilities like poll or epoll. You add all of the file descriptors that you are interested in to a “poll set” and then ask the operating system which ones are readable or writable, as appropriate. Once the operating system says “yes, file descriptor 7145 is readable”, you can do something with that socket; but what? With callbacks, the answer is “call a user-supplied closure”: a callback, representing the continuation of the computation on that socket.

Building a network service with a callback-oriented concurrency system means breaking the program into little chunks that can run without blocking. Whereever a program could block, instead of just continuing the program, you register a callback. Unfortunately this requirement permeates the program, from top to bottom: you always pay the mental cost of inverting your program’s control flow by turning it into callbacks, and you always incur run-time cost of closure creation, even when the particular I/O could proceed without blocking. It’s a somewhat galling requirement, given that this contortion is required of the programmer, but could be done by the compiler. We Schemers demand better abstractions than manual, obligatory continuation-passing-style conversion.

Callback-based systems also encourage unstructured concurrency, as in practice callbacks are not the only path for data and control flow in a system: usually there is mutable global state as well. Without strong patterns and conventions, callback-based systems often exhibit bugs caused by concurrent reads and writes to global state.

Some of the problems of callbacks can be mitigated by using “promises” or other library-level abstractions; if you’re a Haskell person, you can think of this as lifting all possibly-blocking operations into a monad. If you’re not a Haskeller, that’s cool, neither am I! But if your typey spidey senses are tingling, it’s for good reason: with promises, your whole program has to be transformed to return promises-for-values instead of values anywhere it would block.

An obvious solution to the control-flow problem of callbacks is to use threads. In the most generic sense, a thread is a language feature which denotes an independent computation. Threads are created by other threads, but fork off and run independently instead of returning to their caller. In a system with threads, there is implicitly a scheduler somewhere that multiplexes the threads so that when one suspends, another can run.

In practice, the concept of threads is often conflated with a particular implementation, kernel threads. Kernel threads are very low-level abstractions that are provided by the operating system. The nice thing about kernel threads is that they can use any CPU that is the kernel knows about. That’s an important factor in today’s computing landscape, where Moore’s law seems to be giving us more cores instead of more gigahertz.

However, as a building block for a highly concurrent system, kernel threads have a few important problems.

One is that kernel threads simply aren’t designed to be allocated in huge numbers, and instead are more optimized to run in a one-per-CPU-core fashion. Their memory usage is relatively high for what should be a lightweight abstraction: some 10 kilobytes at least and often some megabytes, in the form of the thread’s stack. There are ongoing efforts to reduce this for some systems but we cannot expect wide deployment in the next 5 years, if ever. Even in the best case, a hundred thousand kernel threads will take at least a gigabyte of memory, which seems a bit excessive for book-keeping overhead.

Kernel threads can be a bit irritating to schedule, too: when one thread suspends, it’s for a reason, and it can be that user-space knows a good next thread that should run. However because kernel threads are scheduled in the kernel, it’s rarely possible for the kernel to make informed decisions. There are some “user-mode scheduling” facilities that are in development for some systems, but again only for some systems.

The other significant problem is that building non-crashy systems on top of kernel threads is hard to do, not to mention “correct” systems. It’s an embarrassing situation. For one thing, the low-level synchronization primitives that are typically provided with kernel threads, mutexes and condition variables, are not composable. Also, as with callback-oriented concurrency, one thread can silently corrupt another via unstructured mutation of shared state. It’s worse with kernel threads, though: a kernel thread can be interrupted at any point, not just at I/O. And though callback-oriented systems can theoretically operate on multiple CPUs at once, in practice they don’t. This restriction is sometimes touted as a benefit by proponents of callback-oriented systems, because in such a system, the callback invocations have a single, sequential order. With multiple CPUs, this is not the case, as multiple threads can run at the same time, in parallel.

Kernel threads can work. The Java virtual machine does at least manage to prevent low-level memory corruption and to do so with high performance, but still, even Java-based systems that aim for maximum concurrency avoid using a thread per connection because threads use too much memory.

In this context it’s no wonder that there’s a third strain of concurrency: shared-nothing message-passing systems like Erlang. Erlang isolates each thread (called processes in the Erlang world), giving each it its own heap and “mailbox”. Processes can spawn other processes, and the concurrency primitive is message-passing. A process that tries receive a message from an empty mailbox will “block”, from its perspective. In the meantime the system will run other processes. Message sends never block, oddly; instead, sending to a process with many messages pending makes it more likely that Erlang will pre-empt the sending process. It’s a strange tradeoff, but it makes sense when you realize that Erlang was designed for network transparency: the same message send/receive interface can be used to send messages to processes on remote machines as well.

No network is truly transparent, however. At the most basic level, the performance of network sends should be much slower than local sends. Whereas a message sent to a remote process has to be written out byte-by-byte over the network, there is no need to copy immutable data within the same address space. The complexity of a remote message send is O(n) in the size of the message, whereas a local immutable send is O(1). This suggests that hiding the different complexities behind one operator is the wrong thing to do. And indeed, given byte read and write operators over sockets, it’s possible to implement remote message send and receive as a process that serializes and parses messages between a channel and a byte sink or source. In this way we get cheap local channels, and network shims are under the programmer’s control. This is the approach that the Go language takes, and is the one we use in Fibers.

Structuring a concurrent program as separate threads that communicate over channels is an old idea that goes back to Tony Hoare’s work on “Communicating Sequential Processes” (CSP). CSP is an elegant tower of mathematical abstraction whose layers form a pattern language for building concurrent systems that you can still reason about. Interestingly, it does so without any concept of time at all, instead representing a thread’s behavior as a trace of instantaneous events. Threads themselves are like functions that unfold over the possible events to produce the actual event trace seen at run-time.

This view of events as instantaneous happenings extends to communication as well. In CSP, one communication between two threads is modelled as an instantaneous event, partitioning the traces of the two threads into “before” and “after” segments.

Practically speaking, this has ramifications in the Go language, which was heavily inspired by CSP. You might think that a channel is just a an asynchronous queue that blocks when writing to a full queue, or when reading from an empty queue. That’s a bit closer to the Erlang conception of how things should work, though as we mentioned, Erlang simply slows down writes to full mailboxes rather than blocking them entirely. However, that’s not what Go and other systems in the CSP family do; sending a message on a channel will block until there is a receiver available, and vice versa. The threads are said to “rendezvous” at the event.

Unbuffered channels have the interesting property that you can select between sending a message on channel a or channel b, and in the end only one message will be sent; nothing happens until there is a receiver ready to take the message. In this way messages are really owned by threads and never by the channels themselves. You can of course add buffering if you like, simply by making a thread that waits on either sends or receives on a channel, and which buffers sends and makes them available to receives. It’s also possible to add explicit support for buffered channels, as Go, core.async, and many other systems do, which can reduce the number of context switches as there is no explicit buffer thread.

Whether to buffer or not to buffer is a tricky choice. It’s possible to implement singly-buffered channels in a system like Erlang via an explicit send/acknowlege protocol, though it seems difficult to implement completely unbuffered channels. As we mentioned, it’s possible to add buffering to an unbuffered system by the introduction of explicit buffer threads. In the end though in Fibers we follow CSP’s lead so that we can implement the nice select behavior that we mentioned above.

As a final point, select is OK but is not a great language abstraction. Say you call a function and it returns some kind of asynchronous result which you then have to select on. It could return this result as a channel, and that would be fine: you can add that channel to the other channels in your select set and you are good. However, what if what the function does is receive a message on a channel, then do something with the message? In that case the function should return a channel, plus a continuation (as a closure or something). If select results in a message being received over that channel, then we call the continuation on the message. Fine. But, what if the function itself wanted to select over some channels? It could return multiple channels and continuations, but that becomes unwieldy.

What we need is an abstraction over asynchronous operations, and that is the main idea of a CSP-derived system called “Concurrent ML” (CML). Originally implemented as a library on top of Standard ML of New Jersey by John Reppy, CML provides this abstraction, which in Fibers is called an operation1. Calling send-operation on a channel returns an operation, which is just a value. Operations are like closures in a way; a closure wraps up code in its environment, which can be later called many times or not at all. Operations likewise can be performed2 many times or not at all; performing an operation is like calling a function. The interesting part is that you can compose operations via the wrap-operation and choice-operation combinators. The former lets you bundle up an operation and a continuation. The latter lets you construct an operation that chooses over a number of operations. Calling perform-operation on a choice operation will perform one and only one of the choices. Performing an operation will call its wrap-operation continuation on the resulting values.

While it’s possible to implement Concurrent ML in terms of Go’s channels and baked-in select statement, it’s more expressive to do it the other way around, as that also lets us implement other operations types besides channel send and receive, for example timeouts and condition variables.

1 CML uses the term event, but I find this to be a confusing name. In this isolated article my terminology probably looks confusing, but in the context of the library I think it can be OK. The jury is out though.

2 In CML, synchronized.

* * *

Well, that's my limited understanding of the crushing weight of history. Note that part of this article is now in the Fibers manual.

Thanks very much to Matthew Flatt, Matthias Felleisen, and Michael Sperber for pushing me towards CML. In the beginning I thought its benefits were small and complication large, but now I see it as being the reverse. Happy hacking :)

by Andy Wingo at October 12, 2016 01:45 PM

October 09, 2016

Javier Fernández

Web Engines Hackfest 2016

Last week I attended the Web Engines Hackfest 2016, hosted by Igalia at the HQ premises in A Coruña. For those still unaware, it’s a unconference like event focused on pure hacking and technical discussions about the main Web Engines supporting the Web Platform.


This year there were a very interesting group of hackers, representing most of the main web engines, like Mozilla’s Gecko and Servo, Google’s Blink, Apple’s WebKit and Igalia’s WebKitGTK+.

Hacking log

This year I was totally focused on hacking Blink web engine to solve some of the most complex issues of the CSS Grid Layout feature. I’d like to start giving thanks to Google to join the hackfest sending Christian Biesinger, specially thanks to him because of the long flight to attend. It was a pleasure to work work with him on-site and have the opportunity to discuss these complex issues face to face.

Day 1

During the weeks previous to the hackfest I’ve been working on fixing the bug 628565, reported several months ago but hitting me quite much recently. It’s a very ugly issue, since it shows an unpredictable behavior of Grid layout logic, so I decided that I might fix it once for all. I managed to provide 2 different approaches, which can be analyzed in the following code review issues: Issue 2361373002 and Issue 2333583002. Basically, the root cause of both issues is the same; grid’s container intrinsic size is not computed correctly when there are grid items with orthogonal flows.

There are 2 fundamental concepts that are key for understanding this problem:

  • Intrinsic size computation must be done before layout.
  • Orthogonal flow boxes need to be laid out in order to compute min-content contribution to their container’s intrinsic width.

So, we had as single bug to fix for solving two different problems: unpredictable grid layout logic and incorrect grid’s intrinsic size computation. The first problem is the one reported in bug 628565. I’ll try to describe it here briefly, but further details can be obtained from the bug report. Let’s consider the following code:

   body { overflow: hidden; }
<div style="width: fit-content; display: grid; grid-template-rows: 50px; border: 5px solid; font: 25px/1 Ahem;">
   <div style="writing-mode: vertical-lr; color: magenta; background: cyan;">XX X</div>

The following pictures illustrate the issue when loading the code above with Google Chrome 55.0.2859.0 (Official Build) dev (64-bit) Revision 3f63c614e8c4501b1bfa3f608e32a9d12618b0a0-refs/heads/master@{#418117} under Linux operating system:

Chrome BEFORE resizing

Chrome AFTER resizing

Christian and I analyzed carefully Blink’s layout code and how it deals with orthogonal boxes. It’s worth mentioning that there are important issues with orthogonal flows in almost every web engine, but Blink has made lately some improvements on this regard. See how a very basic example works in Blink, Gecko and WebKit engines:

<div style="float: left; border: 5px solid; font: font: 25px/1 Ahem;"></div>
  <div style="writing-mode: vertical-lr; color: magenta; background: cyan;">XX X</div>


Blink has implemented a kind of pre-layout logic which is executed for any orthogonal flow box even before doing the actual layout, when box’s intrinsic size computation takes place. This solution allows using a more precise info about an orthogonal box’s height when computing its container’s intrinsic width; otherwise zero width would be assumed, as it’s the case of WebKit and Gecko engines. However, this approach leads to an incorrect behavior when using Grid Layout as I’ll explain later.

My first approach to solve these two issues was to update grid container’s intrinsic size after its orthogonal children were laid out. Hence, we could use the actual size of the children to compute their container’s intrinsic size. However, this violates the first rule of the two assertions defined before: no layout should be done for intrinsic size computation.

Layout Breakout Session

During the evening we held the Layout Breakout Session, where we had a nice discussion about the future of Layout in the different web engines. Christian Biesinger, one of the members of the Google’s Blink Layout team, talked about the new LayoutNG project; an experiment to implement a new Layout from scratch, cleaning up quite old code paths and solving some problems that were not possible to address with the current legacy code. This new LayoutNG idea is related to the new Layout API and the Houdini project, a new mechanism for defining new layout models without the requirement of a native support inside the browser. We had also some discussions about the current state of the Flexible Box specification in Blink and how WebKit’s implementation is quite abandoned and unmaintained nowadays.

In addition, we discussed about the current state of CSS Grid Layout implementation in the different engines. The implementation is almost complete in most of the main engines. Thanks to the collaboration between Igalia and Bloomberg we can confirm that WebKit and Blink’s implementations are almost completed. We have been evaluating Mozilla’s Gecko implementation of Grid and we verified it’s in a similar status. We talked about the recent news from TPAC, which Manuel Rego attended, about the CSS Grid Layout specification becoming Candidate Recommendation. Because of all these reasons, we have agreed with Christian that it’d be good to send the Blink intent-to-ship request as soon as possible; in case it’s accepted, it could be enabled by default in the next Chrome release.

Day 2

The day started with a meeting with Christian for evaluating the different approaches we implemented so far. We have discarded the ones requiring updating intrinsic size. We also decided to avoid solving the issue during the pre-layout of orthogonal items. This approach would have been the one with less impact on performance for grid layout and it would also solve the incorrect intrinsic size issue, however it would add penalties for cases not using grid at all.

Finally, Christian and I agreed on solving first the unpredictable behavior of grid layout logic. We would skip the issue of incorrect intrinsic size, overall because we think the Grid Layout specification is contradictory on this regard. For what is worth, I’ve created a new issue for the CSSWG in the W3C’s github. Even though we should wait for the issue to get clarified, we have already some possible approaches for getting what seems a more natural result for grid’s intrinsic size. The following example could help to understand the problem:


Both test cases were loaded using the same Chrome version commented before. They clearly show that neither min-content or max-content sizes are applied correctly to the grid container. The reason of this weird behavior is how content-sized tracks are handled by the Grid Tracks sizing algorithm in case of rendering grid items with orthogonal flow. From the last draft specification:

Then, if the min-content contribution of any grid items have changed based on the row sizes calculated in step 2, steps 1 and 2 are repeated with the new min-content contribution and max-content contribution (once only).

This, with the fact that orthogonal boxes pre-layout is performed before the tracks have been defined, causes that grid’s container intrinsic size is computed incorrectly. This problem is explained in detail in the W3C’s github issue mentioned before. So if anybody is interested on additional details or, even better, willing to participate in the ongoing discussion, just follow the link above.

Day 3

In addition to the orthogonal flow issues, Baseline Alignment was the other hot topic of my work during the hackfest. Baseline Alignment is the only feature still missing to complete the implementation of the CSS Box Alignment specification for Grid Layout. There is a preliminary approach in this code review issue, but it’s still not complete. The problem is that as it’s stated in the Alignment spec, Baseline Alignment may affect grid’s container intrinsic size:

When specified for align-self/justify-self, these values trigger baseline self-alignment, shifting the entire box within its container, which may affect the sizing of its container.

This fact implies that I should integrate Baseline offset computation inside the Grid Tracks sizing algorithm. Christian and I have been analyzing the sizing algorithm and we designed a possible approach, quite similar to what Flexbox implements in its layout logic.

Finally, we met again with Christian to discuss about the Minimum Implied size for Grid issues. Manuel had the chance to discuss it with the Grid spec editors at TPAC, but it seems there are still unresolved issues. There are an ongoing discussion at W3C’s github about this problem, you can get the details in issues 283 and 523. Christian suggested that he could add some feedback to the discussion, so we can clarify it as soon as possible. This is an important issue that may affect browser interoperability.

The hackfest

The Web Engines Hackfest is an event to share experiences between hackers of different web engines and brainstorming about the future of the Web Platform. This year there were hackers representing most of the main web engines, including Google’s Blink, Mozilla’s Gecko and Servo, Apple’s WebKit and Igalia’s WebKitGTK+.


There were scheduled talks from hackers of each engine so everybody could get an idea of their current state and future plans. In addition, some breakout sessions were scheduled during the kick-off session driven by my colleague Martin Robinson. We embraced everybody to propose new breakout sessions on the fly, every time an ongoing discussion needed a deeper debate or analysis.


I could attend most of the talks and breakout sessions, so I’ll give now my impressions about them. All the talks were recorded (will be available soon) and slides are available in the wiki, so I recommend to watch them if you haven’t already.


The first speaker in the schedule was Jack Moffitt (Mozilla) “Servo: Today & Tomorrow”. He gave a great overview of the progress they made on the new Servo rendering engine, emphasizing the multi-thread support and showing some performance metrics for different engine’s components, CSS parsing, scripting, layout, … He remarked the technical advantages of using Rust on the development of this new engine.


During Day 1 I attended two breakout sessions. The first one about the WebKitGTK+ engine was mainly focused on the recently published roadmap and specially about the new Threaded Compositor enabled by default. There was an interesting discussion about graphics support in WebKitGTK+ between the Collabora developers and Andrey Fedorov (Huawei).


The other breakout session I could attend during Day 1 was the one about the Servo engine. There was a nice discussion between Mozilla developers and Christian Biesinger about how both engines handle rendering in a different way. Servo hackers explained with more detail Servos’s threading model and its implication for layout.


Day 2 started with the awesome talk by Behdad Esfahbod (Google) “HarfBuzz, the First Ten Years”. He gave an overview of the evolution of the HarfBuff library along the last years, including the past experiences in the Web Engines Hackfest and former events under previous name WebKitGTK+ Hackfest. Clearly, rendering fonts is a huge technical challenge. It was also awesome to have Behdad here to work closely with my colleague Fred Wang on the fonts support for MathML.


I attended the MathML breakout session, where the hot topic was Fred’s prototype implemented for the Blink engine. The result of this experiment was really promising, so we started to discuss its chances of getting back in Chrome, even behind an experimental flag. Both Christian and Behdad expressed doubts about that being possible nowadays. They suggested that it may be feasible to implement it through Houdini and the new Blink’s Layout API. We have confirmed that the refactoring we have done for WebKit applies perfectly in Blink, since both share a substantial portion of the layout codebase. In addition, after our work for removing the Flexbox dependency and further code clean up, we can be confident on the MathML independence in the layout logic. After some discussion, we have agreed on continue our experiments in Blink, independently of its chances to be integrated in Blink in the future, since I don’t think the Houdini approach makes sense for MathML, at least now.

Later on Youenn Fabler (Apple) gave a talk about the Fetch API, “Fetching Bytes and Words on the Web”. He described the implementation of this specification in WebKit and its relation with the Streams API. The development is still ongoing and there are still quite much work to do.


I missed the talks of Day 3, so hopefully I’ll watch them as soon as the videos are available. It’s specially interesting the talk about WPE (aka WebKit For Wayland) , by my colleague Žan Doberšek. I’ve spent the day trying to get the most of having Christian Biesinger at our premises. As I commented before, we discussed about the Grid Layout’s Implied Minimum Size issue, one of the most complex problems we’ll have to solve in the short term.

by jfernandez at October 09, 2016 01:59 PM

October 06, 2016

Eduardo Lima Mitev

Example: Run a headless OpenGL (ES) compute shader via DRM render-nodes

It has been a long time indeed since my last entry here. But I have actually been quite busy on a new adventure: graphics driver development.

Two years ago I started contributing to Mesa, mostly to the Intel i965 backend, as a member of the Igalia Graphics Team. During this time I have been building my knowledge around GPU driver development, the different parts of the Linux graphics stack, its different projects, tools, and the developer communities around them. It is still the beginning of the journey, but I’m already very proud of the multiple contributions our team has made.

But today I want to start a series of articles discussing basic examples of how to do cool things with the Khronos APIs and the Linux graphics stack. It will be a collection of short and concise programs that achieve a concrete goal, pretty much what I wish existed when I looked myself up for them in the public domain.

If you, like me, are the kind of person that learns by doing, growing existing examples; then I hope you will find this series interesting, and also encouraging to write your own examples.

Before finishing this sort of introduction and before we enter into the matter, I want to leave my stance on what I consider a good (minimal) example program, because I think it will be useful for this and future examples, to help the reader understand what to expect and how are they different from similar code found online. That said, not all examples in the series will be minimal, though.

A minimal example should:

  • provide as minimum boilerplate as possible. E.g, wrapping the example in a C++ class is unnecessary noise unless your are demoing OOP.
  • be contained in a single source code file if possible. Following the code flow across multiple files adds mental overhead.
  • have none or as minimum dependencies as possible. The reader should not need to install stuff that are not strictly necessary to try the example.
  • not favor function over readability. It doesn’t matter if it is not a complete or well-behaving program if that adds stuff not related with the one thing being showcased.

Ok, now we are ready to move to the first example in this series: the simplest way to embed and run an OpenGL-ES compute shader in your C program on Linux, without the need of any GUI window or connection to the X (or Wayland) server. Actually, it isn’t even necessary to be running any window system at all. Also, the program doesn’t need any special privilege, other than access to the unprivileged DRI/DRM infrastructure. In modern distros, this access is typically granted by being member of the video group.

The code is fairly short, so I’m inline-ing it below. It can also be checked-up on my gpu-playground repository. See below for a quick explanation of its most interesting parts.

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES3/gl31.h>
#include <assert.h>
#include <fcntl.h>
#include <gbm.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* a dummy compute shader that does nothing */
#define COMPUTE_SHADER_SRC "          \
#version 310 es\n                                                       \
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;       \
void main(void) {                                                       \
   /* awesome compute code here */                                      \
}                                                                       \

main (int32_t argc, char* argv[])
   bool res;

   int32_t fd = open ("/dev/dri/renderD128", O_RDWR);
   assert (fd > 0);

   struct gbm_device *gbm = gbm_create_device (fd);
   assert (gbm != NULL);

   /* setup EGL from the GBM device */
   EGLDisplay egl_dpy = eglGetPlatformDisplay (EGL_PLATFORM_GBM_MESA, gbm, NULL);
   assert (egl_dpy != NULL);

   res = eglInitialize (egl_dpy, NULL, NULL);
   assert (res);

   const char *egl_extension_st = eglQueryString (egl_dpy, EGL_EXTENSIONS);
   assert (strstr (egl_extension_st, "EGL_KHR_create_context") != NULL);
   assert (strstr (egl_extension_st, "EGL_KHR_surfaceless_context") != NULL);

   static const EGLint config_attribs[] = {
   EGLConfig cfg;
   EGLint count;

   res = eglChooseConfig (egl_dpy, config_attribs, &cfg, 1, &count);
   assert (res);

   res = eglBindAPI (EGL_OPENGL_ES_API);
   assert (res);

   static const EGLint attribs[] = {
   EGLContext core_ctx = eglCreateContext (egl_dpy,
   assert (core_ctx != EGL_NO_CONTEXT);

   res = eglMakeCurrent (egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, core_ctx);
   assert (res);

   /* setup a compute shader */
   GLuint compute_shader = glCreateShader (GL_COMPUTE_SHADER);
   assert (glGetError () == GL_NO_ERROR);

   const char *shader_source = COMPUTE_SHADER_SRC;
   glShaderSource (compute_shader, 1, &shader_source, NULL);
   assert (glGetError () == GL_NO_ERROR);

   glCompileShader (compute_shader);
   assert (glGetError () == GL_NO_ERROR);

   GLuint shader_program = glCreateProgram ();

   glAttachShader (shader_program, compute_shader);
   assert (glGetError () == GL_NO_ERROR);

   glLinkProgram (shader_program);
   assert (glGetError () == GL_NO_ERROR);

   glDeleteShader (compute_shader);

   glUseProgram (shader_program);
   assert (glGetError () == GL_NO_ERROR);

   /* dispatch computation */
   glDispatchCompute (1, 1, 1);
   assert (glGetError () == GL_NO_ERROR);

   printf ("Compute shader dispatched and finished successfully\n");

   /* free stuff */
   glDeleteProgram (shader_program);
   eglDestroyContext (egl_dpy, core_ctx);
   eglTerminate (egl_dpy);
   gbm_device_destroy (gbm);
   close (fd);

   return 0;

You have probably noticed that the program can be divided in 4 main parts:

  1. Creating a GBM device from a render-node
  2. Setting up a (surfaceless) EGL/OpenGL-ES context
  3. Creating a compute shader program
  4. Dispatching the compute shader

The first two parts are the most relevant to the purpose of this article, since they allow our program to run without requiring a window system. The rest is standard OpenGL code to setup and execute a compute shader, which is out of scope for this example.

Creating a GBM device from a render-node

What’s a render-node anyway? Render-nodes are the “new” DRM interface to access the unprivileged functions of a DRI-capable GPU. It is actually not new, only that this API was previously part of the single (privileged) interface exposed at /dev/dri/cardX. During Linux kernel 3.X series, the DRM driver started exposing the unprivileged part of its user-space API via the render-node interface, as a separate device file (/dev/dri/renderDXX). If you want to know more about render-nodes, there is a section about it in the Linux Kernel documentation and also a brief explanation on Wikipedia.

   int32_t fd = open ("/dev/dri/renderD128", O_RDWR);

The first step is to open the render-node file for reading and writing. On my system, it is exposed as /dev/dri/renderD128. It may be different on other systems, and any serious code would want to detect and select the appropiate render node file first. There can be more than one, one for each DRI-capable GPU in your system.

It is the render-node interface that ultimately allows us to use the GPU for computing only, from an unprivileged program. However, to be able to plug into this interface, we need to wrap it up with a Generic Buffer Manager (GBM) device, since this is the interface that Mesa (the EGL/OpenGL driver we are using) understands.

   struct gbm_device *gbm = gbm_create_device (fd);

That’s it. We now have a GBM device that is able to send commands to a GPU via its render-node interface. Next step is to setup an EGL and OpenGL context.

Setting up a (surfaceless) EGL/OpenGL-ES context

EGLDisplay egl_dpy = eglGetPlatformDisplay (EGL_PLATFORM_GBM_MESA, gbm, NULL);

This is the most interesting line of code in this section. We get access to an EGL display using the previously created GBM device as the platform-specific handle. Notice the EGL_PLATFORM_GBM_MESA identifier, which is the Mesa specific enum to interface a GBM device.

   const char *egl_extension_st = eglQueryString (egl_dpy, EGL_EXTENSIONS);
   assert (strstr (egl_extension_st, "EGL_KHR_surfaceless_context") != NULL);

Here we check that we got a valid EGL display that is able to create a surfaceless context.

   res = eglMakeCurrent (egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, core_ctx);

After setting up the EGL context and binding the OpenGL API we want (ES 3.1 in this case), we have the last line of code that requires attention. When activating the EGL/OpenGL-ES context, we want to make sure we specify EGL_NO_SURFACE, because that’s what we want right?

And that’s it. We now have an OpenGL context and we can start making OpenGL calls normally. Setting up a compute shader and dispatching it should be uncontroversial, so we leave it out of this article for simplicity.

What next?

Ok, with around 100 lines of code we were able to dispatch a (useless) compute shader program. Here are some basic ideas on how to grow this example to do something useful:

  • Dynamically detect and select among the different render-node interfaces available.
  • Add some input and output buffers to get data in and out of the compute shader execution.
  • Do some cool processing of the input data in the shader, then use the result back in the C program.
  • Add proper error checking, obviously.

In future entries, I would like to explore also minimal examples on how to do this using the Vulkan and OpenCL APIs. And also implement some cool routine in the shader that does something interesting.


With this example I tried to demonstrate how easy is to exploit the capabilities of modern GPUs for general purpose computing. Whether for off-screen image processing, computer vision, crypto, physics simulation, machine learning, etc; there are potentially many cases where off-loading work from the CPU to the GPU results in greater performance and/or energy efficiency.

Using a fairly modern Linux graphics stack, Mesa, and a DRI-capable GPU such as an Intel integrated graphics card; any C program or library can embed a compute shader today, and use the GPU as another computing unit.

Does your application have routines that could potentially be moved to an GLSL compute program? I would love to hear about it.

by elima at October 06, 2016 10:31 AM

October 05, 2016

Frédéric Wang

Fonts at the Web Engines Hackfest 2016

Last week I travelled to Galicia for one of the regular gatherings organized by Igalia. It was a great pleasure to meet again all the Igalians and friends. Moreover, this time was a bit special since we celebrated our 15th anniversary :-)

Igalia Summit October 2016
Photo by Alberto Garcia licensed under CC BY-SA 2.0.

I also attended the third edition of the Web Engines Hackfest, sponsored by Igalia, Collabora and Mozilla. This year, we had various participants from the Web Platform including folks from Apple, Collabora, Google, Igalia, Huawei, Mozilla or Red Hat. For my first hackfest as an Igalian, I invited some experts on fonts & math rendering to collaborate on OpenType MATH support HarfBuzz and its use in math rendering engines. In this blog post, I am going to focus on the work I have made with Behdad Esfahbod and Khaled Hosny. I think it was again a great and productive hackfest and I am looking forward to attending the next edition!

OpenType MATH in HarfBuzz

Web Engines Hackfest, main room
Photo by @webengineshackfest licensed under CC BY-SA 2.0.

Behdad gave a talk with a nice overview of the work accomplished in HarfBuzz during ten years. One thing appearing recently in HarfBuzz is the need for APIs to parse OpenType tables on all platforms. As part of my job at Igalia, I had started to experiment adding support for the MATH table some months ago and it was nice to have Behdad finally available to review, fix and improve commits.

When I talked to Mozilla employee Karl Tomlinson, it became apparent that the simple shaping API for stretchy operators proposed in my blog post would not cover all the special cases currently implemented in Gecko. Moreover, this shaping API is also very similar to another one existing in HarfBuzz for non-math script so we would have to decide the best way to share the logic.

As a consequence, we decided for now to focus on providing an API to access all the data of the MATH table. After the Web Engines Hackfest, such a math API is now integrated into the development repository of HarfBuzz and will available in version 1.3.3 :-)

MathML in Web Rendering Engines

Currently, several math rendering engines have their own code to parse the data of the OpenType MATH table. But many of them actually use HarfBuzz for normal text shaping and hence could just rely on the new math API the math rendering too. Before the hackfest, Khaled already had tested my work-in-progress branch with libmathview and I had done the same for Igalia’s Chromium MathML branch.

MathML Fraction parameters test
MathML test for OpenType MATH Fraction parameters in Gecko, Blink and WebKit.

Once the new API landed into HarfBuzz, Khaled was also able to use it for the XeTeX typesetting system. I also started to experiment this for Gecko and WebKit. This seems to work pretty well and we get consistent results for Gecko, Blink and WebKit! Some random thoughts:

  • The math data is exposed through a hb_font_t which contains the text size. This means that the various values are now directly resolved and returned as a fixed-point number which should allow to avoid rounding errors we may currently have in Gecko or WebKit when multiplying by float factors.
  • HarfBuzz has some magic to automatically handle invalid offsets and sizes that greatly simplifies the code, compared to what exist in Gecko and WebKit.
  • Contrary to Gecko’s implementation, HarfBuzz does not cache the latest result for glyph-specific data. Maybe we want to keep that?
  • The WebKit changes were tested on the GTK port, where HarfBuzz is enabled. Other ports may still need to use the existing parsing code from the WebKit tree. Perhaps Apple should consider adding support for the OpenType MATH table to CoreText?

Brotli/WOFF2/OTS libraries

Web Engines Hackfest, main room
Photo by @webengineshackfest licensed under CC BY-SA 2.0.

We also updated the copies of WOFF2 and OTS libraries in WebKit and Gecko respectively. This improves one requirement from the WOFF2 specification and allows to pass the corresponding conformance test.

Gecko, WebKit and Chromium bundle their own copy of the Brotli, WOFF2 or OTS libraries in their source repositories. However:

  • We have to use more or less automated mechanisms to keep these bundled copies up-to-date. This is especially annoying for Brotli and WOFF2 since they are still in development and we must be sure to always integrate the latest security fixes. Also, we get compiler warnings or coding style errors that do not exist upstream and that must be disabled or patched until they are fixed upstream and imported again.

  • This obviously is not an optimal sharing of system library and may increase the size of binaries. Using shared libraries is what maintainers of Linux (or other FLOSS systems) generally ask and this was raised during the WebKitGTK+ session. Similarly, we should really use the system Brotli/WOFF2 bundled in future releases of Apple’s operating systems.

There are several issues that make hard for package maintainers to provide these libraries: no released binaries or release tags, no proper build system to generate shared libraries, use of git submodule to include one library source code into another etc Things have gotten a bit better for Brotli and I was able to tweak the CMake script to produce shared libraries. For WOFF2, issue 40 and issue 49 have been inactive but hopefully these will be addressed in the future…

October 05, 2016 10:00 PM

Iago Toral

Source code for the OpenGL terrain renderer demo

I have been quite busy with various things in the last few weeks, but I have finally found some time to clean up and upload the code of the OpenGL terrain render demo to Github.

Since this was intended as a programming exercise I have not tried to be very elegant or correct during the implementation, so expect things like error handling to be a bit rough around the edges, but otherwise I think the code should be easy enough to follow.

Notice that I have only tested this on Intel GPUs. I know it works on NVIDIA too (thanks to Samuel and Chema for testing this) but there are a couple of rendering artifacts there, specifically at the edges of the skybox and some “pillars” showing up in the distance some times, probably because I am rendering one to many “rows” of the terrain and I end up rendering garbage. I may fix these some day.

The code I uploaded to the repository includes a few new features too:

  • Model variants, which are basically color variations of the same model
  • A couple of additional models (a new tree and plant) and a different rock type
  • Collision detection, which makes navigating the terrain more pleasent

Here is a new screenshot:


In future posts I will talk a bit about some of the implementation details, so having the source code around will be useful. Enjoy!

by Iago Toral at October 05, 2016 12:25 PM

October 03, 2016

Xabier Rodríguez Calvar

WebRTC in WebKit/WPE

For some time I worked at Igalia to enable WebRTC on WebKitForWayland or WPE for the Raspberry Pi 2.

The goal was to have the WebKit WebRTC tests working for a demo. My fellow Igalian Alex was working on the platform itself in WebKit and assisting with some tuning for the Pi on WebKit but the main work needed to be done in OpenWebRTC.

My other fellow Igalian Phil had begun a branch to work on this that was half way with some workarounds. My first task was getting into combat/workaround mode and make OpenWebRTC work with compressed streams from gst-rpicamsrc. OpenWebRTC supported only raw video streams and that Raspberry Pi Cam module GStreamer element provides only H264 encoded ones. I moved some encoders and parsers around, some caps modifications, removed some elements that didn’t work on the Pi and made it work eventually. You can see the result at:

To make this work by yourselves you needed a custom branch of Buildroot where you could build with the proper plugins enabled also selected the appropriate branches in WPE and OpenWebRTC.

Unfortunately the work was far from being finished so I continued the effort to try to make the arch changes in OpenWebRTC have production quality and that meant do some tasks step by step:

  • Rework the video orientation code: The workaround deactivated it as so far it was being done in GStreamer. In the case of rpicamsrc that can be done by the hardware itself so I cooked a GStreamer interface to enable rotation the same way it was done for the [gl]videoflip elements. The idea would be deprecate the original ones and use the new interface. These landed both in videoflip and glvideoflip. Of course I also implemented it on gst-rpicamsrc here and here and eventually in OpenWebRTC sources.
  • Rework video flip: Once OpenWebRTC sources got orientation support, I could rework the flip both for local and remote feeds.
  • Add gl{down|up}load elements back: There were some issues with the gl elements to upload and download textures, which we had removed. I readded them again.
  • Reworked bins linking: In OpenWebRTC there are some bins that are created to perform some tasks and depending on different circumstances you add or not some elements. I reworked the way those elements are linked so that we don’t have to take into account all the use cases to link them. Now this is easier as the elements are linked as they are the added to the bin.
  • Reworked the renderer_disabled: As in the case for orientation, some elements such as gst-rpicamsrc are able to change color and balance so I added support for that to avoid having that done by GStreamer elements if not necessary. In this case the proper interfaces were already there in GStreamer.
  • Moved the decoding/parsing from the source to the renderer: Before our changes the source was parsing/decoding the remote feeds, local sources were not decoded, just raw was supported. Our workarounds made the local sources decode too but this was not working for all cases. So why decoding at the sources when GStreamer has caps and you can just chain all that to the renderers? So I did eventually, I moved the parsing/decoding to the renderers. This took fixing all the caps negotiation from sources to renderers. Unfortunatelly I think I broke audio on the way, but surely nothing difficult to fix.

This is still a work in progress and now I am changing tasks and handing this over back to my fellow Igalians Phil, who I am sure will do an awesome job together with Alex.

And again, thanks to Igalia for letting me work on this and to Metrological that is sponsoring this work.

by calvaris at October 03, 2016 04:15 PM

September 29, 2016

Antonio Gomes

Making Chromium’s PDFium greater

One of the coolest things about working at Igalia, is that we have the chance to work on real world problems from day 1, either by improving an existing solution or creating new ones, based on open source technologies.

During the past month, I worked on a specific aspect of Chromium’s PDF engine, PDFium: extending PDFium’s capability so that Annotations can be manipulated by Acrobat JS scripting.

In practice, the goal was to make the specific scenario described on bug 492 to behave similarly in Chromium/PDFium, when compared to IE/Acrobat.

“Mission given, is mission accomplished”

The following Acrobat JS APIs were implemented:

Document::syncAnnotScan (only stubbed out)

… and Acrobat JS scripts like the snippet below can now be successfully executed by PDFium, making 200k+ PDF files to behave in Chromium/PDFium similarly to IE/Acrobat –
with regards to the way Annotations are manipulated.

1 this.disclosed = true;
2 this.syncAnnotScan();
3 var u = this.URL;
4 var args = u.split('calcrtid=');
5 var tags;
6 if (args.length > 1) {
7   tags = args[1].split('&');
8   this.gotoNamedDest(tags[0]);
9   var a = this.getAnnot(this.pageNum, tags[0]);
10   if (a != null) {
11     a.hidden = false;
12     var rect = a.rect;
13     if (rect != null) {
14       this.scroll(rect[2] + 100, rect[3] + 100);
15     }
16   }
17 }

Also as part of this project, it was possible to fix two discrepancies in PDFium, again if compared to Acrobat: 1) Hidden annotations were shown; 2) PDFium’s positioning Text Markup annotations.

About (2) specifically, on Acrobat if an “/AP” clause is present as part of an “/Annot” definition, the coordinates used to draw annotations come from “/Rect” values. On the other hand, when “/AP” is not defined, “/QuadPoints” values are used to grab the annotation coordinates from. The divergence was coming from the fact that PDFium uses “/Rect” regardless the presence or absence of “/AP”. This CL fixed it.

To wrap up the work, I was also able to extend PDFium’s main testing tool (pdfium_test) to actually run Acrobat JS tests (follow up), as well as make lots of driven-by clean ups, and change PDFium’s main client (Chromium) to deal better when its async nature.

The PDFium community

Throughout the project, interactions with the PDFium community were smooth:

  • PDFium contribution flow follows pretty much the Chromium’s: to report issues and Codereview to get the work reviewed.
  • Once approved, a CL is committed through a “Commit Queue” bot – which ensures a CLs correctness by running PDFium regression suite.
  • Owners are really responsive.


My work in PDFium is sponsored by Bloomberg. Thank you all!


by agomes at September 29, 2016 04:04 PM

September 27, 2016

Javier Muñoz

Attending ApacheCon and Apache Big Data Europe 2016

This year I will be attending my first ApacheCon and Apache Big Data here in Europe. Two major events related to open source technologies, techniques and best practices shaping the data ecosystem and cloud computing.

I will be in Seville (Spain) all week, November 14-18, under the sponsorship of my company Igalia. It will be great meeting with some fellows in the Apache Libcloud community, one of the open source projects where I contribute code upstream to support Ceph and custom solutions for cloud providers.

If you are interested in topics such as cloud, distributed systems, data, massive storage, scalability, security, devops, ML... or you would like to chat about some Apache project, feel free to approach me and talk about anything. You can also contact me via mail, linkedin, twitter, etc. See you there!

by Javier at September 27, 2016 10:00 PM

September 26, 2016

Michael Catanzaro

Epiphany Icon Refresh

We have a nice new app icon for Epiphany 3.24, thanks to Jakub Steiner (Update: and also Lapo Calamandrei):

Our new icon. Ignore the version numbers, it’s for 3.24.

Wow pretty!

The old icon was not actually specific to Epiphany, but was taken from the system, so it could be totally different depending on your icon theme. Here’s the icon currently used in GNOME, for comparison:

The old icon, for comparison

You can view the new icon it in its full 512×512 glory by navigating to about:web:

It’s big (click for full size)

(The old GNOME icon was a mere 256×256.)

Thanks Jakub!

by Michael Catanzaro at September 26, 2016 02:00 PM

September 21, 2016

Andy Wingo

is go an acceptable cml?

Yesterday I tried to summarize the things I know about Concurrent ML, and I came to the tentative conclusion that Go (and any Go-like system) was an acceptable CML. Turns out I was both wrong and right.

you were wrong when you said everything's gonna be all right

I was wrong, in the sense that programming against the CML abstractions lets you do more things than programming against channels-and-goroutines. Thanks to Sam Tobin-Hochstadt to pointing this out. As an example, consider a little process that tries to receive a message off a channel, and times out otherwise:

func withTimeout(ch chan int, timeout int) (result int) {
  var timeoutChannel chan int;
  var msg int;
  go func() {
    timeoutChannel <- 0
  select {
    case msg = <-ch: return msg;
    case msg = <-timeoutChannel: return 0;

I think that's the first Go I've ever written. I don't even know if it's syntactically valid. Anyway, I think we see how it should work. We return the message from the channel, unless the timeout happens before.

But, what if the message is itself a composite message somehow? For example, say we have a transformer that reads a value from a channel and adds 1 to it:

func onePlus(in chan int) (result chan int) {
  var out chan int
  go func () { out <- 1 + <-in }()
  return out

What if we do a withTimeout(onePlus(numbers), 0)? Assume the timeout fires first and that's the result that select chooses. There's still that onePlus goroutine out there trying to read from in and at some point probably it will succeed, but nobody will read its value. At that point the number just vanishes into the ether. Maybe that's OK in certain domains, but certainly not in general!

What CML gives you is the ability to express an event (which is kinda like a possibility of sending or receiving a message on a channel) in such a way that we don't run into this situation. Specifically with the wrap combinator, we would make an event such that receiving on numbers would run a function on the received message and return that as the message value -- which is of course the same as what we have, except that in CML the select wouldn't actually read the message off unless it select'd that channel for input.

Of course in Go you could just rewrite your program, so that the select statement looks like this:

select {
  case msg = <-ch: return msg + 1;
  case msg = <-timeoutChannel: return 0;

But here we're operating at a lower level of abstraction; we were forced to intertwingle our concerns of adding 1 and our concerns of timeout. CML is more expressive than Go.

you were right when you said we're all just bricks in the wall

However! I was right in the sense that you can build a CML system on top of Go-like systems (though possibly not Go in particular). Thanks to Vesa Karvonen for this comment and the link to their proof-of-concept CML implementation in Clojure's core.async. I understand Vesa also has an implementation in F# as well.

Folks should read Vesa's code, after reading the Reppy papers of course; it's delightfully short and expressive. The basic idea is that event composition operators like choose and wrap build up data structures instead of doing things. The sync operation then grovels through those data structures to collect a list of channels to pass on to core.async's equivalent of select. When select returns, sync determines which event that chosen channel and message corresponds to, and proceeds to "activate" the event (and, as a side effect, possibly issue NACK messages to other channels).

Provided you can map from the chosen select channel/message back to the event, (something that core.async can mostly do, with a caveat; see the code), then you can build CML on top of channels and goroutines.

o/~ yeah you were wrong o/~

On the other hand! One advantage of CML is that its events are not limited to channel sends and receives. I understand that timeouts, thread joins, and maybe some other event types are first-class event kinds in many CML systems. Michael Sperber, current Scheme48 maintainer and functional programmer, tells me that simply wrapping events in channels+goroutines works but can incur a big performance overhead relative to supporting those event types natively, due to the need to make the new goroutine and channel and the scheduling costs. He quotes 10X as the overhead!

So although CML and Go appear to be inter-expressible, maybe a proper solution will base the simple channel send/receive interface on CML rather than the other way around.

Also, since these events are now second-class, it must be OK to lose these events, for the same reason that the naïve withTimeout could lose a message from numbers. This is the case for timeouts usually but maybe you have to think about this more, and possibly provide an infinite stream of the message. (Of course the wrapper goroutine would be collected if the channel becomes unreachable.)

you were right when you said this is the end

I've long wondered how contemporary musicians deal with the enormous, crushing weight of recorded music. I don't really pick any more but hoo am I feeling this now. I think for Guile, I will continue hacking on fibers in a separate library, and I think that things will remain that way for the next couple years and possibly more. We need more experience and more mistakes before blessing and supporting any particular formulation of highly concurrent programming. I will say though that I am delighted that we are able to actually do this experimentation on a library level and I look forward to seeing what works out :)

Thanks again to Vesa, Michael, and Sam for sharing their time and knowledge; all errors are of course mine. Happy hacking!

by Andy Wingo at September 21, 2016 09:29 PM

September 20, 2016

Andy Wingo

concurrent ml versus go

Peoples! Lately I've been navigating the guile-ship through waters unknown. This post is something of an echolocation to figure out where the hell this ship is and where it should go.

Concretely, I have been working on getting a nice lightweight concurrency system rolling for Guile. I'll write more about that later, but you can think of it as being modelled on Go, though built as a library. (I had previously described it as "Erlang-like", but that's just not accurate.)

Earlier this year at Curry On this topic was burning in my mind and of course when I saw the language-hacker fam there I had to bend their ears. My targets: Matthew Flatt, the amazing boundary-crossing engineer, hacker, teacher, researcher, and implementor of Racket, and Matthias Felleisen, the godfather of the PLT research family. I saw them sitting together and I thought, you know what, what can they have to say to each other? These people have been talking together for 30 years right? Surely they are actually waiting for some ignorant dude to saunter up to the PL genius bar, right?

So saunter I do, saying, "if someone says to you that they want to build a server that will handle 100K or so simultaneous connections on Racket, what abstraction do you tell them to use? Racket threads?" Apparently: yes. A definitive yes, in the case of Matthias, with a pointer to Robby Findler's paper on kill-safe abstractions; and still a yes from Matthew with the caveat that for the concrete level of concurrency that I described, you'd have to run tests. More fundamentally, I was advised to look at Concurrent ML (on which Racket's concurrency facilities were based), that CML was much better put together than many modern variants like Go.

This was very interesting and new to me. As y'all probably know, I don't have a formal background in programming languages, and although I've read a lot of literature, reading things only makes you aware of the growing dimension of the not-yet-read. Concurrent ML was even beyond my not-yet-read horizon.

So I went back and read a bunch of papers. Turns out Concurrent ML is like Lisp in that it has a tribe and a tightly-clutched history and a diaspora that reimplements it in whatever language they happen to be working in at the moment. Kinda cool, and, um... a bit hard to appreciate in the current-day context when the only good references are papers from 10 or 20 years ago.

However, after reading a bunch of John Reppy papers, here is my understanding of what Concurrent ML is. I welcome corrections; surely I am getting this wrong.

1. CML is like Go, composed of channels and goroutines. (Forgive the modern referent; I assume most folks know Go at this point.)

2. Unlike Go, in CML a channel is never buffered. To make a buffered channel in CML, you spawn a thread that manages a buffer between two channels.

3. Message send and receive operations in CML are built on a lower-level primitive called "events". (send ch x) is instead euivalent to (sync (send-event ch x)). It's like an event is the derivative of a message send with respect to time, or something.

4. Events can be combined and transformed using the choose and wrap combinators.

5. Doing a sync on an event created by choose allows a user to build select in "user-space", as a library. Cool stuff. So this is what events are for.

6. There are separate event type implementations for timeouts, channel send/recv blocking operations, file descriptor blocking operations, syscalls, thread joins, and the like. These are supported by the CML implementation.

7. The early implementations of Concurrent ML were concurrent but not parallel; they did not run multiple "goroutines" on separate CPU cores at the same time. It was only in like 2009 that people started to do CML in parallel. I do not know if this late parallelism has a practical impact on the viability of CML.

ok go

What is the relationship of CML to Go? Specifically, is CML more expressive than Go? (I assume the reverse is not the case, but that would also be an interesting result!)

There are a few languages that only allow you to select over message receives (not sends), but Go's select doesn't have this limitation, so that's not a differentiator.

Some people say that it's nice to have events as the common denominator, but I don't get this argument. If the only event under consideration is message send or receive over a channel, events + choose + sync is the same in expressive power as a built-in select, as far as I can see. If there are other events, then your runtime already has to support them either way, and something like (let ((ch (make-channel))) (spawn-fiber (lambda () (put-message ch exp))) (get-message ch)) should be sufficient for any runtime-supported event in exp, like sleeps or timeouts or thread joins or whatever.

To me it seems like Go has made the right choices here. I do not see the difference, and that's why I wrote all this, is to be shown the error of my ways. Choosing channels, send, receive, and select as the primitives seems to have the same power as SML events.

Let this post be a pentagram on the floor, then, to summon the CML cognoscenti. Well-actuallies are very welcome; hit me up in the comments!

[edit: Sam Tobin-Hochstadt tells me I got it wrong and I believe him :) In the meantime while I work out how I was wrong, examples are welcome!]

by Andy Wingo at September 20, 2016 09:33 PM

Carlos García Campos

WebKitGTK+ 2.14

These six months has gone so fast and here we are again excited about the new WebKitGTK+ stable release. This is a release with almost no new API, but with major internal changes that we hope will improve all the applications using WebKitGTK+.

The threaded compositor

This is the most important change introduced in WebKitGTK+ 2.14 and what kept us busy for most of this release cycle. The idea is simple, we still render everything in the web process, but the accelerated compositing (all the OpenGL calls) has been moved to a secondary thread, leaving the main thread free to run all other heavy tasks like layout, JavaScript, etc. The result is a smoother experience in general, since the main thread is no longer busy rendering frames, it can process the JavaScript faster improving the responsiveness significantly. For all the details about the threaded compositor, read Yoon’s post here.

So, the idea is indeed simple, but the implementation required a lot of important changes in the whole graphics stack of WebKitGTK+.

  • Accelerated compositing always enabled: first of all, with the threaded compositor the accelerated mode is always enabled, so we no longer enter/exit the accelerating compositing mode when visiting pages depending on whether the contents require acceleration or not. This was the first challenge because there were several bugs related to accelerating compositing being always enabled, and even missing features like the web view background colors that didn’t work in accelerated mode.
  • Coordinated Graphics: it was introduced in WebKit when other ports switched to do the compositing in the UI process. We are still doing the compositing in the web process, but being in a different thread also needs coordination between the main thread and the compositing thread. We switched to use coordinated graphics too, but with some modifications for the threaded compositor case. This is the major change in the graphics stack compared to the previous one.
  • Adaptation to the new model: finally we had to adapt to the threaded model, mainly due to the fact that some tasks that were expected to be synchronous before became asyncrhonous, like resizing the web view.

This is a big change that we expect will drastically improve the performance of WebKitGTK+, especially in embedded systems with limited resources, but like all big changes it can also introduce new bugs or issues. Please, file a bug report if you notice any regression in your application. If you have any problem running WebKitGTK+ in your system or with your GPU drivers, please let us know. It’s still possible to disable the threaded compositor in two different ways. You can use the environment variable WEBKIT_DISABLE_COMPOSITING_MODE at runtime, but this will disable accelerated compositing support, so websites requiring acceleration might not work. To disable the threaded compositor and bring back the previous model you have to recompile WebKitGTK+ with the option ENABLE_THREADED_COMPOSITOR=OFF.


WebKitGTK+ 2.14 is the first release that we can consider feature complete in Wayland. While previous versions worked in Wayland there were two important features missing that made it quite annoying to use: accelerated compositing and clipboard support.

Accelerated compositing

More and more websites require acceleration to properly work and it’s now a requirement of the threaded compositor too. WebKitGTK+ has supported accelerated compositing for a long time, but the implementation was specific to X11. The main challenge is compositing in the web process and sending the results to the UI process to be rendered on the actual screen. In X11 we use an offscreen redirected XComposite window to render in the web process, sending the XPixmap ID to the UI process that renders the window offscreen contents in the web view and uses XDamage extension to track the repaints happening in the XWindow. In Wayland we use a nested compositor in the UI process that implements the Wayland surface interface and a private WebKitGTK+ protocol interface to associate surfaces in the UI process to the web pages in the web process. The web process connects to the nested Wayland compositor and creates a new surface for the web page that is used to render accelerated contents. On every swap buffers operation in the web process, the nested compositor in the UI process is automatically notified through the Wayland surface protocol, and  new contents are rendered in the web view. The main difference compared to the X11 model, is that Wayland uses EGL in both the web and UI processes, so what we have in the UI process in the end is not a bitmap but a GL texture that can be used to render the contents to the screen using the GPU directly. We use gdk_cairo_draw_from_gl() when available to do that, falling back to using glReadPixels() and a cairo image surface for older versions of GTK+. This can make a huge difference, especially on embedded devices, so we are considering to use the nested Wayland compositor even on X11 in the future if possible.


The WebKitGTK+ clipboard implementation relies on GTK+, and there’s nothing X11 specific in there, however clipboard was read/written directly by the web processes. That doesn’t work in Wayland, even though we use GtkClipboard, because Wayland only allows clipboard operations between compositor clients, and web processes are not Wayland clients. This required to move the clipboard handling from the web process to the UI process. Clipboard handling is now centralized in the UI process and clipboard contents to be read/written are sent to the different WebKit processes using the internal IPC.

Memory pressure handler

The WebKit memory pressure handler is a monitor that watches the system memory (not only the memory used by the web engine processes) and tries to release memory under low memory conditions. This is quite important feature in embedded devices with memory limitations. This has been supported in WebKitGTK+ for some time, but the implementation is based on cgroups and systemd, that is not available in all systems, and requires user configuration. So, in practice nobody was actually using the memory pressure handler. Watching system memory in Linux is a challenge, mainly because /proc/meminfo is not pollable, so you need manual polling. In WebKit, there’s a memory pressure handler on every secondary process (Web, Plugin and Network), so waking up every second to read /proc/meminfo from every web process would not be acceptable. This is not a problem when using cgroups, because the kernel interface provides a way to poll an EventFD to be notified when memory usage is critical.

WebKitGTK+ 2.14 has a new memory monitor, used only when cgroups/systemd is not available or configured, based on polling /proc/meminfo to ensure the memory pressure handler is always available. The monitor lives in the UI process, to ensure there’s only one process doing the polling, and uses a dynamic poll interval based on the last system memory usage to read and parse /proc/meminfo in a secondary thread. Once memory usage is critical all the secondary processes are notified using an EventFD. Using EventFD for this monitor too, not only is more efficient than using a pipe or sending an IPC message, but also allows us to keep almost the same implementation in the secondary processes that either monitor the cgroups EventFD or the UI process one.

Other improvements and bug fixes

Like in all other major releases there are a lot of other improvements, features and bug fixes. The most relevant ones in WebKitGTK+ 2.14 are:

  • The HTTP disk cache implements speculative revalidation of resources.
  • The media backend now supports video orientation.
  • Several bugs have been fixed in the media backend to prevent deadlocks when playing HLS videos.
  • The amount of file descriptors that are kept open has been drastically reduced.
  • Fix the poor performance with the modesetting intel driver and DRI3 enabled.

by carlos garcia campos at September 20, 2016 07:42 AM

September 19, 2016

Michael Catanzaro

Epiphany 3.22 (and a couple new stable releases too!)

It’s that time of year again! A new major release of Epiphany is out now, representing another six months of incremental progress. That’s a fancy way of saying that not too much has changed (so how did this blog post get so long?). It’s not for lack of development effort, though. There’s actually lot of action in git master and on sidebranches right now, most of it thanks to my awesome Google Summer of Code students, Gabriel Ivascu and Iulian Radu. However, I decided that most of the exciting changes we’re working on would be deferred to Epiphany 3.24, to give them more time to mature and to ensure quality. And since this is a blog post about Epiphany 3.22, that means you’ll have to wait until next time if you want details about the return of the traditional address bar, the brand-new user interface for bookmarks, the new support for syncing data between Epiphany browsers on different computers with Firefox Sync, or Prism source code view, all features that are brewing for 3.24. This blog also does not cover the cool new stuff in WebKitGTK+ 2.14, like new support for copy/paste and accelerated compositing in Wayland.

New stuff

So, what’s new in 3.22?

  • A new Paste and Go context menu option in the address bar, implemented by Iulian. It’s so simple, but it’s also the greatest thing ever. Why did nobody implement this earlier?
  • A new Duplicate Tab context menu option on tabs, implemented by Gabriel. It’s not something I use myself, but it seems some folks who use it in other browsers were disappointed it was missing in Epiphany.
  • A new keyboard shortcuts dialog is available in the app menu, implemented by Gabriel.

Gabriel also redesigned all the error pages. My favorite one is the new TLS error page, based on a mockup from Jakub Steiner:

Web app improvements

Pivoting to web apps, Daniel Aleksandersen turned his attention to the algorithm we use to pick a desktop icon for newly-created web apps. It was, to say the least, subpar; in Epiphany 3.20, it normally always fell back to using the website’s 16×16 favicon, which doesn’t look so great in a desktop environment where all app icons are expected to be at least 256×256. Epiphany 3.22 will try to pick better icons when websites make it possible. Read more on Daniel’s blog, which goes into detail on how to pick good web app icons.

Also new is support for system-installed web apps. Previously, Epiphany could only handle web apps installed in home directories, which meant it was impossible to package a web app in an RPM or Debian package. That limitation has now been removed. (Update: I had forgotten that limitation was actually removed for GNOME 3.20, but the web apps only worked when running in GNOME and not in other desktops, so it wasn’t really usable. That’s fixed now in 3.22.) This was needed to support packaging Fedora Developer Portal, but of course it can be used to package up any website. It’s probably only interesting to distributions that ship Epiphany by default, though. (Epiphany is installed by default in Fedora Workstation as it’s needed by GNOME Software to run web apps, it’s just hidden from the shell overview unless you “install” it.) At least one media outlet has amusingly reported this as Epiphany attempting to compete generally with Electron, something I did write in a commit message, but which is only true in the specific case where you need to just show a website with absolutely no changes in the GNOME desktop. So if you were expecting to see Visual Studio running in Epiphany: haha, no.

Shortcut woes

On another note, I’m pleased to announce that we managed to accidentally stomp on both shortcuts for opening the GTK+ inspector this cycle, by mapping Duplicate Tab to Ctrl+Shift+D, and by adding a new Ctrl+Shift+I shortcut to open the WebKit web inspector (in addition to F12). Go team! We caught the problem with Ctrl+Shift+D and removed the shortcut in time for the release, so at least you can still use that to open the GTK+ inspector, but I didn’t notice the issue with the web inspector until it was too late, and Ctrl+Shift+I will no longer work as expected in GTK+ apps. Suggestions welcome for whether we should leave the clashing Ctrl+Shift+I shortcut or get rid of it. I am leaning towards removing it, because we normally match Epiphany behavior with GTK+, and only match other browsers when it doesn’t conflict with GTK+. That’s called desktop integration, and it’s worked well for us so far. But a case can be made for matching other browsers, too.

Stable releases

On top of Epiphany 3.22, I’ve also rolled new stable releases 3.20.4 and 3.18.8. I don’t normally blog about stable releases since they only include bugfixes and are usually boring, so why are these worth mentioning here? Two reasons. First, one of the fixes in these releases is quite significant: I discovered that a few important features were broken when multiple tabs share the same web process behind the scenes (a somewhat unusual condition): the load anyway button on the unacceptable TLS certificate error page, password storage with GNOME keyring, removing pages from the new tab overview, and deleting web applications. It was one subtle bug that was to blame for breaking all of those features in this odd corner case, which finally explains some difficult-to-reproduce complaints we’d been getting, so it’s good to put out that bug of the way. Of course, that’s also fixed in Epiphany 3.22, but new stable releases ensure users don’t need a full distribution upgrade to pick up a simple bugfix.

Additionally, the new stable releases are compatible with WebKitGTK+ 2.14 (to be released later this week). The Epiphany 3.20.4 and 3.18.8 releases will intentionally no longer build with older versions of WebKitGTK+, as new WebKitGTK+ releases are important and all distributions must upgrade. But wait, if WebKitGTK+ is kept API and ABI stable in order to encourage distributions to release updates, then why is the new release incompatible with older versions of Epiphany? Well, in addition to stable API, there’s also an unstable DOM API that changes willy-nilly without any soname bumps; we don’t normally notice when it changes, since it’s autogenerated from web IDL files. Sounds terrible, right? In practice, no application has (to my knowledge) ever been affected by an unstable DOM API break before now, but that has changed with WebKitGTK+ 2.14, and an Epiphany update is required. Most applications don’t have to worry about this, though; the unstable API is totally undocumented and not available unless you #define a macro to make it visible, so applications that use it know to expect breakage. But unannounced ABI changes without soname bumps are obviously a big a problem for distributions, which is why we’re fixing this problem once and for all in WebKitGTK+ 2.16. Look out for a future blog post about that, probably from Carlos Garcia.

elementary OS

Lastly, I’m pleased to note that elementary OS Loki is out now. elementary is kinda (totally) competing with us GNOME folks, but it’s cool too, and the default browser has changed from Midori to Epiphany in this release due to unfixed security problems with Midori. They’ve shipped Epiphany 3.18.5, so if there are any elementary fans in the audience, it’s worth asking them to upgrade to 3.18.8. elementary does have some downstream patches to improve desktop integration with their OS — notably, they’ve jumped ahead of us in bringing back the traditional address bar — but desktop integration is kinda the whole point of Epiphany, so I can’t complain. Check it out! (But be sure to complain if they are not releasing WebKit security updates when advised to do so.)

by Michael Catanzaro at September 19, 2016 02:00 PM

September 18, 2016

Michael Catanzaro

A WebKit Update for Ubuntu

I’m pleased to learn that Ubuntu has just updated WebKitGTK+ from 2.10.9 to 2.12.5 in Ubuntu 16.04. To my knowledge, this is the first time Ubuntu has released a major WebKit update. It includes fixes for 16 security vulnerabilities detailed in WSA-2016-0004 and WSA-2016-0005.

This is really great. Of course, it would have been better if it didn’t take three and a half months to respond to WSA-2016-0004, and the week before WebKitGTK+ 2.12 becomes obsolete was not the greatest timing, but late security updates are much better than no security updates. It remains to be seen if Ubuntu will keep up with WebKit updates in the future, but I think I can tentatively stop complaining about Ubuntu for now. Debian is looking increasingly isolated in not offering WebKit security updates to its users.

Thanks, Ubuntu!

by Michael Catanzaro at September 18, 2016 04:34 AM

September 14, 2016

Manuel Rego

TPAC, Web Engines Hackfest & Igalia 15th anniversary on the horizon


Next week I’ll be in Lisbon attending TPAC. This is the annual conference organized by the W3C where all the different groups meet face to face during one week. It seems a huge event where you can meet lots of important people working on the web. Looking forward to being there. 😃

Due to our involvement on the implementation of CSS Grid Layout specification in Chromium/Blink and and Safari/WebKit, we’ve been interacting quite a lot with the CSS Working Group (CSS WG). Thus, I’ll be participating on their meetings during the event and also following the work around Houdini, with a close eye on the Layout API. BTW, thanks to the CSS WG chairs for letting me join them.

Like past year, Igalia will have a booth in the conference where you can chat with us about our involvement on the W3C, from the specs edition to the implementation of web standards on the different browsers. My colleagues Joanie and Juanjo will be attending the event too, so don’t hesitate to ping any of us to talk about Igalia and our contributions to the open web platform.

Web Engines Hackfest

This year Igalia is organizing and hosting again a new edition of the Web Engines Hackfest. The event will take place during the last week of the month (26-28th September), it used to be on December but we looked for a better date this year and it seems we’ve been successful as we’ll be around 40 people (more than ever). Thanks everyone attending, we just hope you really enjoy the event.

As usual my main goal for the hackfest is related to CSS Grid Layout implementation. Probably it’d be a good moment to draft a plan for shipping it on Chromium as we’ll have Christian Biesinger around, who is usually reviewing most of our Grid Layout patches.

On top of that and due to my involvement on the MathML refactoring we did on WebKit, I’ll be very interested on keep discussing about MathML and the next steps to make it a reality on the different web engines. Some fonts experts will be around, so we won’t miss the opportunity to try to improve OpenType MATH table support in HarfBuzz.

Apart from this, it’s worth highlighting the big number of Servo contributors we’ll have in the event, from Mozilla employees to external collaborators (like my colleague Martin). I’m eager to check firsthand the status of this engine and their future plans.

Last, but not least, hopefully during the hackfest we’ll find some time to discuss about the upstreaming process of WebKit for Wayland, trying to convert it in an official WebKit port like WebKitGTK+.

And there’ll be more topics that will be discussed there like: accessibility, multimedia, V8, WebRTC, etc. We’re quite a lot of people, so surely we’ll have productive meetings on most of these things.

Igalia 15th Anniversary

This month Igalia is becoming 15 years old! It’s amazing that a company with a completely different model (focused on free software and with a flat cooperative-like structure) has survived during all these years. I’m very grateful to the people who founded a company with such wonderful values, and all the people that have made possible to reach this point in our history. Thanks for letting me be part of it! 😊

Igalia 15th Anniversary Logo Igalia 15th Anniversary Logo

We’ll be celebrating the anniversary on the last week of September, we’ll have several parties during that week and we’ll have one of our summits at the weekend.

It’s awesome to see back in time and realize how many contributions we’ve been doing to a lot of different free software projects, from our first days inside the GNOME project, to our current work on the different browsers or graphics drivers, among others. Lots of programs you’re using every day have some code contributed by Igalia; from your computer to your phones, TVs, watches, etc.

Happy birthday Igalia, I wish we’ll have many more years of success on the free world! 🎂

September 14, 2016 10:00 PM

August 30, 2016

Javier Muñoz

AWS4 chunked upload goes upstream in Ceph RGW S3

With AWS Signature Version 4 (AWS4) you have the option of uploading the payload in fixed or variable-size chunks.

This chunked upload option, also known as Transfer payload in multiple chunks or STREAMING-AWS4-HMAC-SHA256-PAYLOAD feature in the Amazon S3 ecosystem, avoids reading the payload twice (or buffer it in memory) to compute the signature in the client side.

AWS4 chunked upload support is now upstream code in Ceph. It will also ship with the next Jewel release.

Signing single vs multiple chunks payloads

In AWS Signature Version 4 using the HTTP Authorization header is the most common method of providing authentication information. There are two supported options in S3:

The first option, 'Transfer payload in a single chunk', is the way how Ceph RGW S3 authenticates requests under AWS4 right now. It supports signed and unsigned payloads.

For this case the major client-side drawback with signed payloads uploads is the necessity to read the payload twice or buffer it in memory. For big files this approach may be inefficient so you might prefer to upload data in chunks instead.

The second option, 'Transfer payload in multiple chunks (chunked upload)', is the new functionality being added upstream and backported to Jewel.

In this case you break up the payload in fixed or variable-size chunks. By uploading data in chunks, you avoid accesing the payload twice to calculate the signature in the client side.

Signing the chunks

Each chunk signature computation includes the signature of the previous chunk. The seed signature is generated using the request headers only:

Bear in mind the seed signature is used in the signature computation of the first chunk only. For each subsequent chunk, you create a chunk signature that includes the signature of the previous chunk.

The string to sign used in each subsequent chunk signature is:

The chunk signatures chaining ensures you send the chunks in correct order.

Uploading chunked payloads with the AWS SDK for Java

To test chunked uploads with Ceph RGW S3 you can use the AWS SDK for Java. It supports AWS4 chunked uploads. The last version is available here.

Feel free to use this code based on the AWS SDK Java's UploadObjectSingleOperation example.



My work in Ceph is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ceph development team. Thank you all!

by Javier at August 30, 2016 10:00 PM

August 29, 2016

Iago Toral

OpenGL terrain renderer

Lately I have been working on a simple terrain OpenGL renderer demo, mostly to have a playground where I could try some techniques like shadow mapping and water rendering in a scenario with a non trivial amount of geometry, and I thought it would be interesting to write a bit about it.

But first, here is a video of the demo running on my old Intel IvyBridge GPU:

And some screenshots too:

OpenGL Terrain screenshot 1 OpenGL Terrain screenshot 2
OpenGL Terrain screenshot 3 OpenGL Terrain screenshot 4

Note that I did not create any of the textures or 3D models featured in the video.

With that out of the way, let’s dig into some of the technical aspects:

The terrain is built as a 251×251 grid of vertices elevated with a heightmap texture, so it contains 63,000 vertices and 125,000 triangles. It uses a single 512×512 texture to color the surface.

The water is rendered in 3 passes: refraction, reflection and the final rendering. Distortion is done via a dudv map and it also uses a normal map for lighting. From a geometry perspective it is also implemented as a grid of vertices with 750 triangles.

I wrote a simple OBJ file parser so I could load some basic 3D models for the trees, the rock and the plant models. The parser is limited, but sufficient to load vertex data and simple materials. This demo features 4 models with these specs:

  • Tree A: 280 triangles, 2 materials.
  • Tree B: 380 triangles, 2 materials.
  • Rock: 192 triangles, 1 material (textured)
  • Grass: 896 triangles (yes, really!), 1 material.

The scene renders 200 instances of Tree A another 200 instances of Tree B, 50 instances of Rock and 150 instances of Grass, so 600 objects in total.

Object locations in the terrain are randomized at start-up, but the demo prevents trees and grass to be under water (except for maybe their base section only) because it would very weird otherwise :), rocks can be fully submerged though.

Rendered objects fade in and out smoothly via alpha blending (so there is no pop-in/pop-out effect as they reach clipping planes). This cannot be observed in the video because it uses a static camera but the demo supports moving the camera around in real-time using the keyboard.

Lighting is implemented using the traditional Phong reflection model with a single directional light.

Shadows are implemented using a 4096×4096 shadow map and Percentage Closer Filter with a 3x3 kernel, which, I read is (or was?) a very common technique for shadow rendering, at least in the times of the PS3 and Xbox 360.

The demo features dynamic directional lighting (that is, the sun light changes position every frame), which is rather taxing. The demo also supports static lighting, which is significantly less demanding.

There is also a slight haze that builds up progressively with the distance from the camera. This can be seen slightly in the video, but it is more obvious in some of the screenshots above.

The demo in the video was also configured to use 4-sample multisampling.

As for the rendering pipeline, it mostly has 4 stages:

  • Shadow map.
  • Water refraction.
  • Water reflection.
  • Final scene rendering.

A few notes on performance as well: the implementation supports a number of configurable parameters that affect the framerate: resolution, shadow rendering quality, clipping distances, multi-sampling, some aspects of the water rendering, N-buffering of dynamic VBO data, etc.

The video I show above runs at locked 60fps at 800×600 but it uses relatively high quality shadows and dynamic lighting, which are very expensive. Lowering some of these settings (very specially turning off dynamic lighting, multisampling and shadow quality) yields framerates around 110fps-200fps. With these settings it can also do fullscreen 1600×900 with an unlocked framerate that varies in the range of 80fps-170fps.

That’s all in the IvyBridge GPU. I also tested this on an Intel Haswell GPU for significantly better results: 160fps-400fps with the “low” settings at 800×600 and roughly 80fps-200fps with the same settings used in the video.

So that’s it for today, I had a lot of fun coding this and I hope the post was interesting to some of you. If time permits I intend to write follow-up posts that go deeper into how I implemented the various elements of the demo and I’ll probably also write some more posts about the optimization process I followed. If you are interested in any of that, stay tuned for more.

by Iago Toral at August 29, 2016 11:50 AM

August 22, 2016

Xavier Castaño

Igalia will be at IBC 2016

Do you work in the media, entertainment, broadcasting or content industry? If so, please join us at IBC2016, 9-13 September in Amsterdam. The IBC Exhibition covers fifteen halls across the Amsterdam RAI and hosts over 1,600 exhibitors spanning the creation, management and delivery of electronic and media entertainment.
You will be able to meet us at our booth where we will be showcasing demos of our latest work in browsers. The demo of WebKit for Wayland, a new port of WebKit that is currently being upstreamed, on an IVI shell is especially worth checking out.

Igalia will be in Hall 5, booth 19 so don’t miss your chance to attend and visit us! Find us on the interactive floorplan.

by xavier castano at August 22, 2016 10:15 AM

August 18, 2016

Adrián Pérez

SnabbWall's Traffic Analyzer: L7Spy

Wow, two posts just a few days apart from each other! Today I am writing about L7Spy, the application from the SnabbWall suite used to analyze network traffic and detect which applications they belong to. SnabbWall is being developed at Igalia with sponsorship from the NLnet Foundation. (Thanks!)


A Digression

There are two things related to L7Spy I want to mention beforehand:

nDPI 1.8 is Now Supported

The changes from the [recently released][ljndpi0.3.3] ljndpi v0.3.3 have been merged into the SnabbWall repository, which means that now it also supports using version 1.8 of the nDPI library.

Following Upstream

Snabb follows a monthly release schedule. Initially, I thought it would be better to do all the development for SnabbWall starting from a given Snabb release, and incorporate the changes from upstream later on, after the first one of SnabbWall is done. I noticed how it might end up being difficult to rebase SnabbWall on top of the changes accumulated from the upstream repository, and switched to the following scheme:

  • The branch containing the most up-to-date development code is always in the snabbwall branch.
  • Whenever a new Snabb version is released, changes are merged from the upstream repository using its corresponding release tag.
  • When a SnabbWall release is done, it tagged from current contents of the snabbwall branch, which means it is automatically based on the latest Snabb release.
  • If a released version needs fixing, a maintenance branch snabbwall-vX.Y for it is created from the release tag. Patches have to be backported or purposedfully made for that this new branch.

This approach reduces the burden of keeping up with changes in Snabb, while still allowing to make bugfix releases of any SnabbWall version when needed.


The L7Spy is a Snabb application, and as such it can be reused in your own application networks. It is made to be connected in a few different ways, which should allow for painless integration. The application has two bidirectional endpoints called south and north:

The L7Spy application.

Packets entering one of the ends are forwarded unmodified to the other end, in both directions, and they are scanned as they pass through the application.

The only mandatory connection is the receiving side of either south or north: this allows for using L7Spy as a “kitchen sink”, swallowing the scanned packets without sending them anywhere. As an example, this can be used to connect a RawSocket application to L7Spy, attach the RawSocket to an existing network interface, and do passive analysis of the traffic passing through that interface:

Passive analysis made easy

If you swap the RawSocket application with the one driving Intel NICs, and divert a copy of packets passing through your firewall towards the NIC, you would be doing essentially the same but at lightning speed and moving the load of packet analysis to a separate machine. Add a bit more of smarts on top, and this kind of setup would already look like a standalone Intrusion Detection System!

snabb wall

While implementing the L7Spy application, the snabb wall command has also come to life. Like all the other Snabb programs, the wall command is built into the snabb executable, and it has subcommands itself:

aperez@hikari ~/snabb/src % sudo ./snabb wall
[sudo] password for aperez:
  snabb wall <subcommand> <arguments...>
  snabb wall --help

Available subcommands:

  spy     Analyze traffic and report statistics

Use --help for per-command usage. Example:

  snabb wall spy --help

aperez@hikari ~/snabb/src %

For the moment being, only the spy subcommand is provided: it exercises the L7Spy application by feeding into it packets captured from one of the supported sources. It can work in batch mode —the default—, which is useful to analyze a static pcap file. The following example analyzes the full contents of the pcap file, reporting at the end all the traffic flows detected:

aperez@hikari ~/snabb/src % sudo ./snabb wall spy \
      pcap program/wall/tests/data/EmergeSync.cap
0xfffffffff19ba8be   18p -  RSYNC
0xffffffffc6f96d0f 4538p -  RSYNC
aperez@hikari ~/snabb/src %

Two flows are identified, which correspond to two different connections between the client with IPv4 address (from two different ports) to the server with address Both flows are correctly identified as belonging to the rsync application. Notice how the application is properly detected even though the rsync daemon is running in port 26883 which is a non-standard one for this kind of service.

It is also possible to ingest traffic directly from a network interface in live mode. Apart from the drivers for Intel cards included with Snabb (selectable using the intel10g and intel1g sources) there are two hardware-independent sources for TAP devices (tap) and RAW sockets (raw). Let's use a RAW socket to sniff a copy of every packet which passes through the a kernel-managed network interface, while opening a well known search engine in a browser using their https:// URL:

aperez@hikari ~/snabb/src % sudo ./snabb wall spy --live eth0
0x77becdcb    2p    -  SERVICE_GOOGLE:DNS
0x6e111c73    4p   -  SERVICE_GOOGLE:SSL

Note how even the connection is encrypted, our nDPI-based analyzer has been able to correlate a DNS query immediately followed by a connection to the address from the DNS response at port 443 as a typical pattern used to browse web pages, and it has gone the extra mile to tell us that it belongs to a Google-operated service.

There is one more feature in snabb wall spy. Passing --stats will print a line at the end (in normal mode), or every two seconds (in live mode), with a summary of the amount of data processed:

aperez@hikari ~/snabb/src % sudo ./snabb wall spy --live --stats eth0
=== 2016-08-17T18:05:54+0300 === 71866 Bytes, 99 packets, 35933.272 B/s, 49.500 PPS
=== 2016-08-17T18:19:35+0300 === 3105473 Bytes, 2884 packets, 1.599 B/s, 0.001 PPS

One last note: even though it is not shown in the examples, scanning and identifying IPv6 traffic is already supported.

Get that IPv6 flowing!

What Next?

Now that traffic analysis and flow identification is working nicely, it has to be used for actual filtering. I am now in the process of implementing the firewall component of SnabbWall (L7fw) as a Snabb application. Also, there will be a snabb wall filter subcommand, providing an off-the-shelf L7 filtering solution. Expect a couple more of blog posts about them.

In the meanwhile, if you want to play with L7Spy, I will be updating its WIP documentation in the following days. Happy hacking!

by aperez ( at August 18, 2016 08:51 AM

August 11, 2016

Xabier Rodríguez Calvar

Going to GUADEC 2016

I am at the airport getting ready to board beginning my trip to Karlsruhe for GUADEC 2016. I’ll be there with my colleague Michael Catanzaro.Please talk to us if you are interested in browsers or Igalia.

by calvaris at August 11, 2016 07:30 AM

July 31, 2016

Adrián Pérez

ljndpi 0.0.3 released with nDPI 1.8 support

Although I have been away from Snabb-related work for a while, the fact that nDPI 1.8 was released didn't went unnoticed. This library is an important building block for SnabbWall, the Layer-7 firewall which I am developing at Igalia with sponsorship from the NLnet Foundation (thanks!).


SnabbWall uses the nDPI library via the ljndpi binding, about which I wrote back in January. As any other component which uses a native library, ljndpi is a consumer of the API and ABI exported by it, so it needs to be updated to accomodate to changes introduced in new releases of nDPI.

Fortuntately most of the API (and ABI) from nDPI 1.7 remains unchanged in version 1.8, making it possible to maintain the interface exposed to the Lua world. This is very good news, and it means that any program using ljndpi does not need to be modified to work with nDPI 1.8: just update ljndpi to version 0.0.3 and you are ready to go.

Nitty-Gritty Details

In order to support versions 1.7 and 1.8 of the nDPI library, ndpi_revision() is used to obtain the version string of the loaded library, and parsed into a (major, minor, patch) version triplet. The ndpi.c module uses this version information to determine which symbols to make known to the FFI extension via ffi.cdef:

if lib_version.minor == 7 then
  -- Declare functions available only in nDPI 1.7
  ffi.cdef [[
elseif lib_version.minor == 8 then
  -- Ditto, for version 1.8
  ffi.cdef [[

The high-level ndpi.wrap module does the same, ensuring that only functions available in the version of the nDPI library currently loaded are used. In some cases, parameters passed to functions with the same name in nDPI may differ, and ndpi.wrap shuffles them around as needed in order to avoid changes in the interface exposed to Lua:

if lib_version.major == 7 then
   -- ...
   -- ...

   -- In nDPI 1.8 the second parameter (uint8_t proto) has been dropped.
   detection_module.find_port_based_protocol = function (dm, dummy, ...)
      local proto = lib.ndpi_find_port_based_protocol(dm, ...)
      return proto.master_protocol, proto.protocol

Last but not least, the numeric values which identify each protocol supported by nDPI have changed as well, due to the additional protocols supported by the new version of the library. The existing ndpi.protocol_ids module was renamed to ndpi.protocol_ids_1_7, and then a new ndpi.protocol_ids_1_8 generated using the update-protocol-ids tool. Finally, the ndpi.wrap module was updated to load the module which corresponds to the version of the nDPI library being used:

-- Return the module table.
return {
  protocol = require("ndpi.protocol_ids_"
            .. lib_version.major .. "_"
            .. lib_version.minor);
  -- ...

Easier Installation

As a bonus, it is now possible to install ljndpi using the LuaRocks package manager. LuaRocks provides similar functionality to NodeJS' npm or Python's pip. If your LuaJIT installation includes LuaRocks, getting the latest stable release of ljndpi is easier than ever:

luarocks install ljndpi

It is also possible to install a development snapshot directly. The following will fetch the source from the Git repository and install it using LuaRocks — though it is strongly recommended to use the stable releases:

luarocks install --server= ljndpi

Hopefully, this will make ljndpi easier to discover and used by projects other than SnabbWall. Who knows, it may even make it easier for distributions to package it! (Hint: LuaRocks supports specifying a custom tree.)

One More Thing...

He said it 31 times on stage (look it up!)

My previous post ended up in a cliffhanger which promised another post about SnabbWall. Supporting nDPI 1.8 in ljndpi seemed important enough to write about it, but rest assured that the post on SnabbWall is next in my “polish and publish” queue.

by aperez ( at July 31, 2016 08:15 PM

July 29, 2016

Alejandro Piñeiro

One year of Mesa


During the last year and something, my work at Igalia was focused on the Intel i965 driver for Mesa, the open source OpenGL implementation. Iago and Samuel were working for some time, then joined Edu and Antía, and then I joined myself, as the fifth igalian.

One of the reasons I will always remember working on this project is that I was also being a father for a year and something, so it will be always easy to relate one and the other. Changes! A project change plus a total life change.

In fact the parenthood affected a little how I joined the project. We (as Igalia) decided that I would be a good candidate to join the project around April. But as my leave of absence was estimated to be around the last weeks of May, and it would be strange to start a project and suddenly disappear, we decided to postpone the joining just after the parental leave. So I spent some time closing and transferring the stuff I was doing at Igalia at that moment, and starting to get into the specifics of Intel driver, and then at the end of May, my parental leave started. Pedro is here!


Pedro’s day zero

Mesa can be a challenging project, but nothing compared to how to parent. A lot of changes. In Spain the more usual parental leave for the father is 15 days, where the only  flexibility is that you need to take 2 just after the child has born, and chose when take the other 13. But those 13 needs to be taken in a row. So I decided to took 15 days in a row. Igalia gives you 15 extra days. In fact, the equivalent to 15 days, as they gives you flexibility of how to take them. So instead of took them as 15 full-day, I decided to take 30 part-time days. Additionally Pedro’s grandmother (maternal) came to live with us for a month. That allowed us to do a step by step go back to work. So it was something like:

  • Pedro was born, I am in full parental leave, plus grandmother living at our house helping.
  • I start to work part-time
  • Grandmother goes back to Brazil
  • I start to work full-time.
  • Pedro’s mother goes back to work part-time

And probably the more complicated moment was when Pedro’s mother went back to work. More or less at that time we switched the usual grandparents (paternal) weekend visits to in-week visits. So what it is my usual timetable? Three days per week I work 8:00-14:30 at my home, and I take care of Pedro on the afternoon, when his mother goes to work. Two days per week is 8:00-12:30 at my home, and 15:00-20:00 at my parents home. And that is more or less, as no day is the same that the other 😉 And now I go more to the parks!

Pedro playing on a park
Pedro playing on a park

There were other changes. For example switched from being one of the usual suspects at Igalia office to start to work mostly from home. In any case the experience is totally worthing, and it is incredible how fast the time passed. Pedro is one year old already!

Pedro destroying his birthday cake
Mesa tasks

I almost forgot that this blog post started to talk about my work on Mesa. So many photos! During this year, I have participated (although not alone, specially on spec implementations) in the following tasks (and except last one, in chronological order):

About which task I liked the most, I think that it was the work done on optimizing the NIR to vec4 pass. And not forget that the work done by Igalia to implement internalformat_query2 and vertex_attrib_64bit, plus the work done by Iago and Samuel with fp64, helped to get Mesa 12.0 exposing OpenGL 4.2 support on Broadwell and later.

What happens with accessibility?

Being working full time for the same project for a full year, plus learning how to parent, means that I didn’t have too much spare time for accessibility development. Having said so, I have been reviewing ATK patches, doing the ATK releases regularly and keeping an eye on the GNOME accessibility mailing lists, so I’m still around. And we have Joanmarie Diggs, also fellow igalian, still rocking on improving Orca, WebKitGTK and WAI-ARIA.


Finally, I’d like to thank both Intel and Igalia for supporting my work on Mesa and i965 all this time. Specially on allowing me so flexible timetable, where the important is what you deliver, and not when you do the work, allowing me to enjoy parenthood. I also want to thanks the igalian colleagues I has been working during this year, Iago, Samuel, Antía, Edu, Andrés, and Juan, and all those at Intel who have been helping and reviewing my work during all this year, like Jason Ekstrand, Kenneth Graunke, Matt Turner and Francisco Jerez.


by infapi00 at July 29, 2016 09:02 AM

July 14, 2016

Javier Muñoz

Virtual Data and Control Paths in Software-Defined Storage

Two of the promised promises in Software-Defined Storage (SDS) are higher automation and flexibility in order to reduce the costs around the storage administration tasks.

This entry will go over the required concepts to understand the virtual data and control paths in SDS solutions and how metadata are used to convey the data requirements into the automation software with the proper granularity and flexibility.

The entry will also include one of the most simple use cases ('x-amz-website-redirect-location') you can find to illustrate how all these concepts fit in Ceph.

Software-Defined Storage

Although there is no formal definition for Storage-Defined Storage (SDS), it is all about decoupling the hardware and the intelligence.

In SDS, the hardware becomes "irrelevant", the intelligence lives into the software, and the hardware may be considered a cheap commodity.

From an architectural perspective, SDS sees the storage in three layers: orchestration, data services (compression, deduplication...) and hardware.

SDS abstracts storage resources to enable pooling, replication, and on-demand provisioning of resources. The result is the ability to pool storage arrays into logical pools.


Beyond of being a stand-alone technology, SDS is also a building block of the Software-Defined Data Center (SDDC), together with Software-Defined Compute (SDC) and Software-Defined Networking (SDS) technologies.

In SDDC, as introduced by the former VMware CTO Steve Herrod in 2012, "compute, storage, networking, security, and availability services are pooled, aggregated, and delivered as software, and managed by intelligent, policy-driven software".

Software-Defined technology brings new and challenging changes into the industry but all of them are part of a movement towards promoting a greater role for software systems above the hardware.

Many of the complex and advanced functions that used to require proprietary hardware are implemented now in software running in commodity hardware.

Data Services and Storage Virtualization

Data Services and Storage Virtualization are fundamental blocks to understand the virtual data and control paths in Software-Defined solutions.

Data Services are standards-based and uniform means of supplying a need such as provisioning, protection, availability, performance, security, etc. over the stored data. The behaviours of those data services are defined by policies.

On the other hand, Storage Virtualization is the process of grouping the physical storage from multiple network storage devices so that it looks like a single storage device. Note that this logical storage may be the result of stacking multiple physical and/or logical abstractions.

Virtual Data and Control Paths

As described by the Storage Networking Industry Association (SNIA), SDS builds on the virtualization of the Data Path although SDS is not virtualization alone. Bear in mind the Control Path needs to be abstracted as a service as well.

The virtual data path is formed by block, file and object interfaces that support applications written to these interfaces. At the same time, the control path uses metadata to express requirements, data services control and service level capabilities.

This is the way how SDS simplify the management in order to reduce the cost of maintaining the storage infrastructure. This data services management and defining storage policies avoids the manual administration. It enables flexibility and automation.

Following this approach, each data object is able to convey its own requirements independent of which virtual storage device it resides on.

Virtual Data and Control Paths in Ceph RGW S3

Let's see how all these concepts fit in Ceph. The RGW S3 component will be used to support this point.

In the previous picture, virtual data and control paths in RGW S3 are surrounded by one black line.

The data path comes defined by the own AWS S3 interface specification. This path is a virtual path because the virtual RGW S3 object store is built on the top of the native Ceph object store.

The control path comes also defined by the AWS S3 interface as part of the metadata attached over the data.

The data services are enabled/configured by the storage administrator using 'pools'. These pools are the divisions or partitions used to define key parameters such as the number of placement groups, the CRUSH ruleset or the ownership. RGW S3 runs on top of those pools to provide the final storage service.

In some use cases, choosing the right pool or attaching metadata (zone/region) is enough to specify the place where the data will live. In other cases this configuration is not needed.

Take as example the 'x-amz-website-redirect-location' system-defined metadata in the S3 static website hosting feature. It is used to redirect requests for the associated object to another object in the same bucket or an external URL.

This metadata is a good example to see how the policy allows request redirections specified by the user instead of the administator.

While metadata attachment could look like an arbitrary action to get the job done, it is a recurrent pattern in S3. Here the metadata are part of the control path and they drive the behaviour of the use case along the whole storage stack.


Along this blog post we explored the virtual data and control paths in SDS, its relationship with SDDC and how metadata can be used to drive services in Ceph RGW S3.

If you are interested in SDS and metadata you might find the SNIA' Software Defined Storage White Paper a great resource. The paper covers the role of metadata in SDS and the user's view in detail.

In the case you are interested in SDDC, the SDDC VMware documentation is available on-line. Bear in mind the SDDC term covers a broad concept and this documentation covers the VMware perspective only.

As usual, the Ceph documentation is awesome to deepen more on Ceph and its S3 Object Gateway.

by Javier at July 14, 2016 10:00 PM

July 13, 2016

Iago Toral

Story and status of ARB_gpu_shader_fp64 on Intel GPUs

In case you haven’t heard yet, with the recently announced Mesa 12.0 release, Intel gen8+ GPUs expose OpenGL 4.3, which is quite a big leap from the previous OpenGL 3.3!

OpenGL 4.3

The Mesa i965 Intel driver now exposes OpenGL 4.3 on Broadwell and later!

Although this might surprise some, the truth is that even if the i965 driver only exposed OpenGL 3.3 it had been exposing many of the OpenGL 4.x extensions for quite some time, however, there was one OpenGL 4.0 extension in particular that was still missing and preventing the driver from exposing a higher version: ARB_gpu_shader_fp64 (fp64 for short). There was a good reason for this: it is a very large feature that has been in the works by Intel first and Igalia later for quite some time. We first started to work on this as far back as November 2015 and by that time Intel had already been working on it for months.

I won’t cover here what made this such a large effort because there would be a lot of stuff to cover and I don’t feel like spending weeks writing a series of posts on the subject :). Hopefully I will get a chance to talk about all that at XDC in September, so instead I’ll focus on explaining why we only have this working in gen8+ at the moment and the status of gen7 hardware.

The plan for ARB_gpu_shader_fp64 was always to focus on gen8+ hardware (Broadwell and later) first because it has better support for the feature. I must add that it also has fewer hardware bugs too, although we only found out about that later ;). So the plan was to do gen8+ and then extend the implementation to cover the quirks required by gen7 hardware (IvyBridge, Haswell, ValleyView).

At this point I should explain that Intel GPUs have two code generation backends: scalar and vector. The main difference between both backends is that the vector backend (also known as align16) operates on vectors (surprise, right?) and has native support for things like swizzles and writemasks, while the scalar backend (known as align1) operates on scalars, which means that, for example, a vec4 GLSL operation running is broken up into 4 separate instructions, each one operating on a single component. You might think that this makes the scalar backend slower, but that would not be accurate. In fact it is usually faster because it allows the GPU to exploit SIMD better than the vector backend.

The thing is that different hardware generations use one backend or the other for different shader stages. For example, gen8+ used to run Vertex, Fragment and Compute shaders through the scalar backend and Geometry and Tessellation shaders via the vector backend, whereas Haswell and IvyBridge use the vector backend also for Vertex shaders.

Because you can use 64-bit floating point in any shader stage, the original plan was to implement fp64 support on both backends. Implementing fp64 requires a lot of changes throughout the driver compiler backends, which makes the task anything but trivial, but the vector backend is particularly difficult to implement because the hardware only supports 32-bit swizzles. This restriction means that a hardware swizzle such as XYZW only selects components XY in a dvecN and therefore, there is no direct mechanism to access components ZW. As a consequence, dealing with anything bigger than a dvec2 requires more creative solutions, which then need to face some other hardware limitations and bugs, etc, which eventually makes the vector backend require a significantly larger development effort than the scalar backend.

Thankfully, gen8+ hardware supports scalar Geometry and Tessellation shaders and Intel‘s Kenneth Graunke had been working on enabling that for a while. When we realized that the vector fp64 backend was going to require much more effort than what we had initially thought, he gave a final push to the full scalar gen8+ implementation, which in turn allowed us to have a full fp64 implementation for this hardware and expose OpenGL 4.0, and soon after, OpenGL 4.3.

That does not mean that we don’t care about gen7 though. As I said above, the plan has always been to bring fp64 and OpenGL4 to gen7 as well. In fact, we have been hard at work on that since even before we started sending the gen8+ implementation for review and we have made some good progress.

Besides addressing the quirks of fp64 for IvyBridge and Haswell (yes, they have different implementation requirements) we also need to implement the full fp64 vector backend support from scratch, which as I said, is not a trivial undertaking. Because Haswell seems to require fewer changes we have started with that and I am happy to report that we have a working version already. In fact, we have already sent a small set of patches for review that implement Haswell‘s requirements for the scalar backend and as I write this I am cleaning-up an initial implementation of the vector backend in preparation for review (currently at about 100 patches, but I hope to trim it down a bit before we start the review process). IvyBridge and ValleView will come next.

The initial implementation for the vector backend has room for improvement since the focus was on getting it working first so we can expose OpenGL4 in gen7 as soon as possible. The good thing is that it is more or less clear how we can improve the implementation going forward (you can see an excellent post by Curro on that topic here).

You might also be wondering about OpenGL 4.1’s ARB_vertex_attrib_64bit, after all, that kind of goes hand in hand with ARB_gpu_shader_fp64 and we implemented the extension for gen8+ too. There is good news here too, as my colleague Juan Suárez has already implemented this for Haswell and I would expect it to mostly work on IvyBridge as is or with minor tweaks. With that we should be able to expose at least OpenGL 4.2 on all gen7 hardware once we are done.

So far, implementing ARB_gpu_shader_fp64 has been quite the ride and I have learned a lot of interesting stuff about how the i965 driver and Intel GPUs operate in the process. Hopefully, I’ll get to talk about all this in more detail at XDC later this year. If you are planning to attend and you are interested in discussing this or other Mesa stuff with me, please find me there, I’ll be looking forward to it.

Finally, I’d like to thank both Intel and Igalia for supporting my work on Mesa and i965 all this time, my igalian friends Samuel Iglesias, who has been hard at work with me on the fp64 implementation all this time, Juan Suárez and Andrés Gómez, who have done a lot of work to improve the fp64 test suite in Piglit and all the friends at Intel who have been helping us in the process, very especially Connor Abbot, Francisco Jerez, Jason Ekstrand and Kenneth Graunke.

by Iago Toral at July 13, 2016 02:48 PM

July 12, 2016

Frédéric Wang

MathML Improvements in WebKit

In a previous blog post, I explained the work made by Igalia’s web platform team to refactor WebKit’s MathML layout classes. I stated that although some rendering improvements were a nice side effect, the main goal of the first phase was really to clean the code up so that it is easier for developers to work on MathML in the future. Indeed this really made things easier to review: Quite unexpectedly to me, the second phase only took 4 days to be upstreamed… Kudos to Brent Fulgham for having reviewed so many patches in such a short period of time!

In this blog post, I am going to give an overview of the improvements made during these two first phases taking changeset r203109 as a reference. The changes will be available in WebKitGTK+ 2.14 in September and are likely to be included this month in the next Safari Technology Preview. It definitely remains more work to do such as the third phase or other rendering improvements, but I believe we have already made a big step forward!

Mathematical Fonts

Two years ago, basic support for operator stretching via the OpenType MATH table was added to WebKit. During the refactoring, we improved that support and also made use of more parameters to improve the math layout (see section about OpenType MATH parameters below). While Latin Modern Math will be used in most screenshots, the following one shows that you can use any math fonts. By default WebKit will try and use one of these fonts but if none are available or if you force a non-math font-family then the rendering quality may not be good.

The following screenshot gives the rendering for various fonts. For the last one we used the value sans-serif to illustrate issues with non-math fonts (displaystyle integral too small, mathvariant italic glyphs taken from another font, missing italic correction etc).

Screenshots of a formula with various fonts Integral obtained by contour integration
WebKitGTK+ r203109 with various font-family values.
From top to bottom: TeX Gyre Schola, Latin Modern, DejaVu Math, STIX, and sans-serif.

The href Attribute

This new feature is obvious: You can now create a hyperlink for any part of a mathematical formula! Here is a screenshot of the MathML Torture Test 21 with additional links, as displayed in WebKit r203109. Click the image to load the test case in your browser and test it.

Screenshot of MathML Torture test 21 with hyperlinks

The mathvariant Attribute

Unicode contains Mathematical Alphanumeric Symbols to convey special meaning such as double-struck or specific Arabic styles. Characters for these symbols are generally provided by math fonts. In MathML, mathematical variables are automatically rendered using the italic characters from this Unicode block. One can also access these characters via the mathvariant attribute and that attribute is actually used by many LaTeX-to-MathML converters.

In the following screenshot, you can see that the letters f, x and y are now drawn with this special mathematical italic glyphs and that WebKit uses the conventional fraktur style for the Lie algebra g. Note that the prime is still too small because WebKit does not make use of the ssty feature yet.

Screenshot of Wikipedia article on Lie Algebra Homomorphism of Lie algebra
Top: Safari 9.1.1. Bottom: Safari r203109.

Operators and Spacing

As said in my previous blog post, the rendering of large and stretchy operators have been rewritten a lot and as a consequence the rendering has improved. Also, I mentioned that the width of operators may depend on their height. This may cause accumulated approximations during the computation of preferred widths. The old flexbox-based implementation incorrectly forced layout during preferred computation to avoid that but a quick workaround for that security concern caused the approximate preferred widths to be used for the logical widths. With our new implementation, the logical width is now correctly calculated. Finally, we added partial support for the mpadded element which is often used to tweak spacing in mathematical formulas.

The screenshot below illustrates the fix for a serious regression with large operator (summation symbol) as well as improvements in the rendering of stretchy operators (horizontal braces). Note that the formula has a hack with a zero-width mpadded element which used to cause improper spacing (large gap between the group of a’s and the group of b’s).

Screenshot from Mozilla's MathML torture test using Safari 9.1.1 or the the current refactoring Tests 21 and 22 from the MathML torture test
Left: Safari 9.1.1. Right: Safari r203109.

The following screenshot shows how incorrect width computations used to cause excessive spacing after the stretchy radical and slash symbols:

Screenshot of Wikipedia article on Trigonometric Functions Sine of 1 degree
Top: Safari 9.1.1. Bottom: Safari r203109.

The displaystyle Property

Mathematical formulas can be integrated inside a paragraph of text (inline math in TeX terminology) or displayed in its own horizontally centered paragraph (display math in TeX terminology). In the latter case, the formula is in displaystyle and does not have any restrictions on vertical spacing. In the former case, the layout of the mathematical formula is modified a bit to optimize this vertical spacing and to better integrate within the surrounding text. The displaystyle property can also be set using the corresponding attribute or can change automatically in subformulas (e.g. in fractions).

In the following screenshot the fix for the large operator regression is obvious but you can also notice that the summation is now slightly different for the definition of a Bézier curve (top) and for the one of a rational Bézier curve (bottom). For example, to save some vertical space in the fractions, the sigma symbol is drawn smaller and the scripts attached to it are moved on its right. However, the script size could still be improved when we implement the scriptlevel property.

Screenshot of Wikipedia article on Bézier Curves Definitions of Bézier Curve
Left: Safari 9.1.1. Right: Safari r203109.

OpenType MATH Parameters

Many new OpenType MATH features have been implemented following the MathML in HTML5 Implementation Note:

  • Use of the AxisHeight parameter to set vertical position of fractions, tables and symmetric operators.
  • Use of layout constants for radicals (some of them were already used), scripts and fractions. This improves math spacing and positioning and allows to adjust them according to value of the displaystyle property discussed in the previous section.
  • Use of the italic correction of large operator glyph to set the position of subscripts.

The screenshots below illustrate some of these improvements. For the first one, the use of AxisHeight allows to better align the fraction bar with the plus sign. For the second one, the use of layout constants for scripts as well as the italic correction of the integral improves the placement of limits.

Screenshot of Wikipedia article on the Continued Fractions Generalized Continued Fraction
Top: Safari 9.1.1. Bottom: Safari r203109.
Screenshot of Wikipedia article on the Gamma Function Definition of the Gamma Function
Top: Safari 9.1.1. Bottom: Safari r203109.

Right-to-left radicals

One of the advantage of the old flexbox-based implementation is that right-to-left layout was available for free. This support has of course been preserved in the new implementation. We also added a simple workaround to mirror radicals using a scale transform as shown in the screenshot below. However, general glyph-level mirroring is still missing.

Screenshot of Wikipedia article on Lie Algebra Test 13 from the MathML torture test (Machrek Style)
Top: Safari 9.1.1. Bottom: Safari r203109.


Igalia’s web platform team has been able to follow the MathML in HTML5 Implementation Note in order to significantly improve the rendering of mathematical expressions in WebKit. More work remains to do but we will definitely appreciate any feedback that can help improving native MathML support in web engines. We are also excited to continue work and collaboration at the next Web Engines Hackfest!

July 12, 2016 10:00 PM

July 08, 2016

Frédéric Wang

MathML Refactoring in WebKit


If you follow WebKit developments, you are certainly aware that Igalia has been working on WebKit’s MathML implementation for some time. More recently, effort has been made to write a clean implementation addressing issues reported by WebKit reviewers in the past. After joining Igalia in March, I have been in charge of getting this work reviewed and merged into WebKit’s development branch. In the past four months, we have been successful in upstreaming the first phase of the refactoring and the work accomplished is described in this blog post.

Screenshot from Mozilla's MathML torture test using Safari 9.1.1 or the the current refactoring
MathML torture tests with the Latin Modern Math font
Left: Safari 9.1.1. Right: Current MathML branch.

Note that the focus was on code refactoring so the improvement may not be obvious for non-developers. Nevertheless many issues have already been fixed as a consequence of that work: math italic, position of scripts, stretchy and large operators, rendering update and more. More importantly, this preliminary step opens the way for beautiful math rendering based on TeX rules and the OpenType MATH table. Rendering improvements and implementation of new features have already started in the next phase of the refactoring, so stay tuned…

Design Issues

As explained in a previous report, the main design issues in the flexbox-based implementation released in 2013 can essentially be summarized in three points:

  1. WebKit’s code to stretch operator was not efficient at all and was limited to some basic fences buildable via Unicode characters.
  2. WebKit’s MathML code violated many layout invariants, making the code unreliable.
  3. WebKit’s MathML code relied heavily on the C++ renderer classes for flexboxes and had to manage too many anonymous renderers.

For the first point, the performance issue had been fixed by Igalia developers right after the initial feedback from WebKit developers and we improved that again during our refactoring. Partial support for the OpenType MATH table was added during my crowdfunding project and allowed to stretch more operators with the help of math fonts. For the second point, the main issue was also fixed right after the initial feedback. However one could still have some doubts regarding the layout steps, given the complexity implied by the third point. That last issue was still problematic so far and addressing it was the main achievement of our refactoring.

Technically, the dependence on flexbox is unnecessary and the implementation actually only used a limited set of flexbox features. Thus executing the whole flexbox code was overkill. It can also be a burden for people working on other places of the layout code. For example Javi Fernández has worked on improving the box alignments in the past and he had hard time fixing the MathML code impacted by his changes. This is probably the cause of the bad position of the summation symbol that can be seen in the screenshot above.

From the layout perspective, most of the rendering logic was implemented in the flexbox classes and the MathML “renderer” classes were really just managing the creation and update of anonymous nodes and style. Although this sounds good code reuse it actually made impossible to understand how and when layout steps happen or to add more advanced features. The new implementation replaced this manipulation of the render tree with simple arithmetic calculations on box metrics which is safer and more reliable. Also, complex renderers such as RenderMathMLScripts or RenderMathMLRoot actually achieve better rendering quality with less code!

As an example of the complexity, RenderMathMLUnderOver can behave as a RenderMathMLScripts in some situation so we really want the former class to reuse the latter class. As we will see below the old implementation of the two renderers were quite different: RenderMathMLUnderOver only relied on setting column direction in the user agent stylesheet while RenderMathMLScripts created a complex render tree structures with anonymous style. Hence it seemed difficult to share the two cases or to handle DOM changes causing to move from one case to the other one. With our new implementation, this is simply reduced to simple C++ inheritance.

When I started to work on WebKit some years ago, I made the mistake of continuing with the existing approach. The implementation of multiscripts or automatic italic mathvariant added more anonymous objects and made the situation even worse. After the end of my crowdfunding project, Alex Castro did more cleanup and tried to implement important features such as displaystyle but he also soon realized that it was too hard to work with the current code base…

Layout Refactoring

In order to solve the issues discussed in the previous section, Javi and Alex worked on a new MathML branch where the first step was to remove the inheritance on the flexbox layout classes. During the Web Engines Hackfest 2015, I collaborated with the Igalia’s web platform team team to continue the work on this branch. In a second step, we rewrote many MathML renderer functions so that they stop creating anonymous nodes or style. We obtained very encouraging results: The implementation looked much simpler and much more understandable!

Alex announced the initial plan on the webkit-dev mailing list. He started opening bugs and attaching patches to merge the first step. However, that step still required many of the flexbox logic and so made code hard to understand for reviewers. Hence when I joined Igalia four months ago Alex asked me to try and see how to reorganize patches so that the two initial steps can be submitted in one go. This corresponds to the first phase mentioned in the introduction. As indicated on the wiki page, the layout refactoring consisted in rewriting the following member functions of each renderer class:

  • computePreferredLogicalWidths: calculate preferred widths, based on the preferred widths of child renderers.
  • layoutBlock: set final position and size of child renderers.
  • firstLineBaseLine: calculate the ascent of the renderer.
  • paint (optional): perform special painting such as fraction bars.

Refactored renderers no longer rely on any flexbox code nor anonymous renderers and the functions mentioned above essentially perform arithmetic computations. By reading the code, one can be sure that we follow standard layout rules and that we do not perform unnecessary reflow. Moreover, the rules specific to math rendering are only located in the MathML renderers and can be better understood. Details for each class are provided in the next subsections. After all the layout functions were rewritten and the code managing the render tree structure removed, we were able to make the RenderMathMLBlock class inherit from RenderBlock instead of RenderFlexibleBox. Many of the bugs could then be immediately closed or otherwise fixed with small follow-up patches.


RenderMathMLSpace is a simple class inserting blank boxes for adjusting spacing of math formulas. Obviously, we do not need any of the complexity of flexbox so it was straightforward to write the layout functions.

3x Large space between 3 and x.


RenderMathMLRow performs rendering of a row of math items. Since WebKit does not support linebreaking in MathML at the moment, this is just putting child boxes on a same baseline. One specificity is that some operators can be stretched vertically and so their width may depend on their height.

{ 2x x 3 Row containing a stretched brace, a fraction and a scripted element.

Again, flexbox features are useless here. With the old code, it was not clear whether we were violating the CSS invariant with preferred and logical widths and which kind of relayout or render tree changes would happen when doing the stretch call. By properly implementing the layout functions previously mentioned all of this became much more trustable.


RenderMathMLFraction draws a fraction with numerator and denominator.

x+1y+2 Simple fraction.

This used to be implemented using a column direction for the fraction element. Numerator and denominator were wrapped into anonymous nodes with additional style to leave space for the fraction bar and to adjust the horizontal alignments.

RenderMathMLFraction (flexbox with column direction)
  RenderMathMLBlock (anonymous flexbox)
    RenderMathMLRow (numerator)
  RenderMathMLBlock (anonymous flexbox)
    RenderMathMLRow (denominator)

It was relatively easy to implement this without any anonymous nodes and again the use of flexbox did not sound justified. For example, to calculate the preferred width we just take the maximum of the preferred widths of the numerator and denominator. For the layout, the calculation of the logical width is similar and we calculate the horizontal coordinates of numerator and denominator so that their centers are aligned. Vertical metrics are similarly calculated from the vertical metrics of the numerator and denominator. During that step, we also fixed some bugs with the linethickness attribute and added support for some OpenType MATH table constants.

Scripts above and below

RenderMathMLUnderOver is used to attach some scripts above and below a base. Each child can itself be a horizontal stretchy operator.

base→ Base with stretchy arrow over it.

This was implemented in the user agent stylesheet by using flexboxes with column direction for the corresponding MathML elements and the C++ class had additional rules to fire the stretching. So the problems and solutions for this class were essentially a mixed of the cases of RenderMathMLFraction and RenderMathMLRow we just discussed.

Subscripts and Superscripts

RenderMathMLScripts is used for a base with some arbitrary number of scripts. All the scripts can have different positions (pre, post, sub, super) and metrics (width, ascent and descent). We must avoid collisions and take care of horizontal and vertical alignements.

base a b c d e f Base with pre and post scripts.

The old code used a complex render tree with additional style to achieve the best possible result. However, the quality was still bad as you can see for the script attached to the integral in the screenshot above. Managing the render tree was a nightmare: Just to give the idea, additional anonymous node and style were used to allow horizontal and vertical adjustments (similar to RenderMathMLFraction above) and prescripts had negative order property so that they were positioned before the base.

    Base Wrapper (anonymous flexbox)
      RenderMathMLRow (base)
    SubSupPair Wrapper (anonymous flexbox with column direction)
      RenderMathMLRow (post-subscript)
      RenderMathMLRow (subperscript)
    SubSupPair Wrapper (anonymous flexbox with column direction)
      RenderMathMLRow (post-subscript)
      RenderMathMLRow (post-superscript)
    ... (more postscripts)
    RenderMathMLBlock (prescripts separator)
    SubSupPair Wrapper (anonymous flexbox with column direction and order -1)
      RenderMathMLRow (pre-subscript)
      RenderMathMLRow (pre-subperscript)
    SubSupPair Wrapper (anonymous flexbox with column direction and order -1)
      RenderMathMLRow (pre-subscript)
      RenderMathMLRow (pre-superscript)
    ... (more prescripts)

Rules from TeX and the OpenType MATH table are relatively complex and we decided to implement them directly in the new refactoring as otherwise it was impossible to get decent quality. The code is still complex but we now have clear rules, we only perform simple calculations and the render tree structure matches the DOM tree.

“Enclosing” Notations

RenderMathMLMenclose is a row of math items with some additional notations. Gurpreet Kaur implemented this element two years ago but she followed the same approch, combining anonymous nodes and style for some simple notations and special painting for others.

x+1 circle and strike notations

During the refactoring, the code has been completely rewritten so that RenderMathMLMenclose is now essentially a derived class of RenderMathMLRow with the measuring and painting functions adjusted to take into account the additional notations. During that refactoring, we also removed support for unused radical notation, which was implemented using an anonymous RenderMathMLSquareRoot (see Radicals section below).

Helper Classes for Operators

The RenderMathMLOperator class is used for math operators. It was quite complex class and we decided to extract from it two features that are unrelated to layout:

The remaining code was indeed the real layout part but the mess with anonymous node and style was only removed later (see Text Classes below). Although it seems we just needed to move the code out of RenderMathMLOperator into those new classes, the case of MathOperator was particularly difficult. We had to split the effort into several small steps to make review possible and also fixed many issues due to the entanglement and confusion of these three different features of the RenderMathMLOperator class… The work done for MathOperator actually improved the rendering of stretchy operators as you can see for the horizontal braces in the screenshot above.


RenderMathMLRoot is used for square root or arbitrary N-th root. Many of the TeX and OpenType MATH table rules were already used by the old implementation with anonymous nodes and style. However, there were bugs difficult to fix related to zooming, child removal or style change due to the management of the anonymous RenderMathMLOperator to draw the radical sign.

x+1 + x+1 3 square and cube roots

The old implementation actually had two classes for the square and general cases (RenderMathMLSquareRoot and RenderMathMLRoot). The usual technique with various anonymous wrappers and style was used. After the refactoring, we were able to merge everything in a single RenderMathMLRoot class. Because the square root behaves as an mrow, we also made that class derive from RenderMathMLRow to reuse as much code as possible. Here is are how the render trees used to look like:

  RenderMathMLBlock (anonymous used for metric adjustements)
    RenderMathMLRadicalOperator (anonymous used for the radical symbol)
  RenderMathMLRootWrapper (anonymous used for the children)
    RenderMathMLRow (child 1)
    RenderMathMLRow (child 2)
    RenderMathMLRow (child N)

  RenderMathMLRootWrapper (anonymous for the index)
  RenderMathMLBlock (anonymous used for metric adjustements)
     RenderMathMLRadicalOperator (anonymous used for the radical symbol)
  RenderMathMLRootWrapper (anonymous for the base)

Again, we rewrote the implementation using only simple box positioning. The difficult part was to get rid of the anonymous RenderMathMLRadicalOperator to draw the radical symbol. This class was derived from RenderMathMLOperator and extended it with some fallback drawing when math fonts were not available. After having extracted stretchy operator shaping from RenderMathMLOperator it became possible to use the MathOperator helper class to draw the radical symbol. We implemented the fallback for missing math fonts the same as Gecko: Use a scale transform to stretch the base glyph for U+221A SQUARE ROOT. As a bonus, we used such transform to implement glyph mirroring, as required to draw right-to-left radicals in some Arabic mathematical notations.

Text Classes

These classes are containers for math text such as variables or operators. There is a generic RenderMathMLToken class and a derived class RenderMathMLOperator adding features specific to operators such as spacing, dictionary property, stretching… Anonymous wrappers and style were used to implement automatic italic mathvariant or operator spacing. The RenderText child of RenderMathMLOperator was (re)built as an anonymous text node so that is was possible to convert U+002D HYPHEN-MINUS into U+2212 MINUS SIGN or to provide some text for anonymous operators created by RenderMathMLFenced (see Unchanged Classes section).

RenderMathMLToken (e.g. mi element)
  RenderMathMLBlock (anonymous flexbox used to apply CSS italic)
    RenderBlock (anonymous created elsewhere to honor CSS rules)
        text run "x"

RenderMathMLOperator (mo element)
   RenderMathMLBlock (anonymous flexbox used for spacing)
    RenderBlock (anonymous created elsewhere to honor CSS rules)
      RenderText (anonymous destroyed and built again)
         text run "−"

We did a big refactoring to remove all the anonymous nodes created by the MathML renderer classes. Just like for MathOperator, we had to be careful and submit various small pieces as the text rendering was quite sensible to code change.

The simplified operator spacing that was supported by WebKit was easy to implement with the new approach. To do automatic italic mathvariant, we modified the paint function to use Mathematical Alphanumeric Symbols instead of CSS italic as you can notice for the variables displayed in the screenshot above. Hence we could remove the RenderMathMLBlock anonymous wrapper.

The use of an anonymous node for the text prevented it to appear in the dumped render tree of layout tests and also required some hacks in the accessibility code to expose that text. In order to address the cases of the minus sign and of mfenced operators, we decided to use our new MathOperator class again. Indeed MathOperator is actually also able to draw unstretched operators made of a single character and this works for the minus sign and for mfenced operators used in practice.

Unchanged Classes

Two classes have not been modified but such modifications were not needed to remove the dependency on RenderFlexibleBox:

  • RenderMathMLFenced is used for an mrow-like element that is defined in the MathML specification as strictly equivalent to constructions with rows and operators. It is implemented as a derived class of RenderMathMLRow and creates anonymous RenderMathMLOperators. This is the only remaining class that modifies the render tree structure. Note that prominent MathML websites and generators do not use the mfenced element, so it is not a big concern.

  • RenderMathMLTable is used for table layout. It is just derived from RenderTable, not RenderFlexibleBox. We did not change anything for now but we considered creating our own implementation in order to make our code independent from HTML table, to support MathML-specific table features and to make it better integrated with the rest of the MathML code.


Even if our main focus was on rendering, the changes we made also had impact on the MathML accessibility code. Indeed, the accessibility tree is generated from the MathML renderer classes: Since we changed the latter during the refactoring, we also had to adjust the accessibility code. Fortunately, we are lucky to have Joanmarie Diggs in our team and she was able to provide some help here.

First, the accessibility code exposes the linethickness of fractions to implement Apple’s AXMathLineThickness attribute. In practice, this is really necessary to know whether the linethickness is null or not (e.g. binomial coefficient VS the Legendre symbol). Apple’s unit test seemed to expose the ratio between the actual thickness and the default thickness but the accessibility code really just reads the actual thickness calculated by RenderMathMLFraction. Our fix and improvement for linethickness made the Apple’s unit test fail so we had to adjust RenderMathMLFraction to expose the value expected by that test.

In general, the accessibility code does not care about anonymous nodes created for layout purpose and there was some code to avoid exposing them in the accessibility tree. So removing all the anonymous during the layout refactoring was actually a good and safe thing to do. There were some helper functions to implement Apple’s AXMathRootRadicand and AXMathRootIndex attributes that had to be adjusted, though. These functions used to do some work to skip the anonymous wrappers and we were actually able to simplify them.

There was also some specific code for the RenderMathMLOperators and their anonymous RenderText that were necessary to expose the text content. Actually, there was an old bug in the accessibility code and the anonymous RenderMathMLOperators created by mfenced were not correctly exposed. The unit test we had for mfenced operators was only checking the text content but it was still passing and so the regression had never been detected before. After the layout refactoring we removed the anonymous RenderText of mfenced operators and so broke that test… We thus spent some time to fix the RenderMathMLOperator code. Essentially, we removed all the old hacks and only left a specific handling for mfenced operators. We also used this opportunity to improve and extend our MathML accessibility tests.

Finally, the MathML accessibility code was directly implemented into a generic AccessibilityRenderObject class. There was some functions to access math nodes and properties but also specific cases scattered all over the code (anonymous boxes, mfenced operators, math roles etc). In order to facilitate future work and maintenance we decided to move all the MathML code into a new AccessibilityMathMLElement class. Hence the work implied by the layout refactoring actually encouraged us to improve the organization and testing of our accessibility code!


In the past four months, Igalia’s web platform team has successfully upstreamed the refactoring of WebKit’s MathML renderer classes and we are now very confident about the quality of the layout code. In addition to the people mentioned above I would personally like to thank everybody who helped with this work. More specifically, I am very grateful to other people from Igalia (Martin Robinson, Sergio Villar and Manuel Rego) or Apple (Brent Fulgham and Darin Adler) who have spent some time to review patches. As a nice side effect of this work, mathematical formulas look better and the accessibility code has been improved. More is happenning in the next two phases. We are looking forward to continuing implementation of Web standards and collaboration with browser vendors at the next Web Engines Hackfest!

July 08, 2016 10:00 PM

July 01, 2016

Víctor Jáquez

VA-API and DRM/KMS in MinnowBoard

In 2012 I started to work on a video renderer for GStreamer which uses directly the DRM/KMS kernel subsystem to display images. I even blogged about it, but I didn’t finished it.

Nonetheless, in December last year, a customer asked us to finish the element, but this time focusing on i.MX6, rather than OMAP4. Consequently, early this year, kmssink got merged into upstream.

The video sink has some nice features, such as DMA-buf importing through the PRIME kernel interface. This feature makes possible a zero-copy path when the video decoder delivers dmabuf based frames. For this particular project, the kernel we used supports the CODA media subsystem, and through the GStreamer element v4l2videodec we could link pipelines negotiating dmabuf sharing, and thus we played videos very efficiently.

During the last GStreamer Hackfest, Nicolas Dufresne got working the MFC media subsystem, for Exynos, with v4l2videodec and kmssink, but I don’t remember if he also got the dmabuf sharing working.

All in all, kmssink had only been tested in a couple of ARM devices, so I wonder, as a gstremer-vaapi co-maintainer, if it also works in x86 devices.

Some colleagues at Igalia, use a Minnowboard to test WebKit4Wayland, so I borrowed it, installed a minimal Fedora 23 in it, and, surprisingly, I found out that it has a nice VA-API support:

   $ vainfo
   error: can't connect to X server!
   libva info: VA-API version 0.38.1
   libva info: va_getDriverName() returns 0
   libva info: Trying to open /usr/lib64/dri/
   libva info: Found init function __vaDriverInit_0_38
   libva info: va_openDriver() returns 0
   vainfo: VA-API version: 0.38 (libva 1.6.2)
   vainfo: Driver version: Intel i965 driver for Intel(R) Bay Trail - 1.6.2
   vainfo: Supported profile and entrypoints
         VAProfileMPEG2Simple            : VAEntrypointVLD
         VAProfileMPEG2Simple            : VAEntrypointEncSlice
         VAProfileMPEG2Main              : VAEntrypointVLD
         VAProfileMPEG2Main              : VAEntrypointEncSlice
         VAProfileH264ConstrainedBaseline: VAEntrypointVLD
         VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
         VAProfileH264Main               : VAEntrypointVLD
         VAProfileH264Main               : VAEntrypointEncSlice
         VAProfileH264High               : VAEntrypointVLD
         VAProfileH264High               : VAEntrypointEncSlice
         VAProfileH264StereoHigh         : VAEntrypointVLD
         VAProfileVC1Simple              : VAEntrypointVLD
         VAProfileVC1Main                : VAEntrypointVLD
         VAProfileVC1Advanced            : VAEntrypointVLD
         VAProfileNone                   : VAEntrypointVideoProc
         VAProfileJPEGBaseline           : VAEntrypointVLD

Afterwards, I compiled GStreamer using gst-uninstalled in order to have kmssink and the master branch of gstreamer-vaapi. And got it! I had hardware accelerated decoding with VA-API, and video rendering using KMS/DRM. But, sadly, no zero copy, since, in order to have dmabuf sharing, we have to finish and to merge bug 755072 in gstreamer-vaapi.

Running the typical Big Bunny video (1080p, H.264) the playback consumes less than the 50% of the CPU usage, compared with the 100% of CPU usage with software decoders (and a lot of frames dropped).

I recorded a small video with the experiment as a proof.

Update: Javer Martinez told me in twitter, that MFC can do dma-buf export since uses vb2 and Exynos DRM can import, but he couldn’t get zero copy to work due missing format support.

by vjaquez at July 01, 2016 05:00 PM