Planet Igalia

December 02, 2016

Frédéric Wang

Chromium on R-Car M3 & AGL/Wayland

As my fellow Igalian Antonio Gomes explained on his blog, we have recently been working on making the master branch of Chromium run on Linux Desktop using the upstream Ozone/Wayland backend. This effort is supported by Renesas, who were interested in relying on these recent developments to showcase Chromium running on the latest generation of their R-Car systems-on-chip for automotive applications. Ideally, they wanted to see this happening on the Automotive Grade Linux distribution.


Luckily, support for R-Car M3 was already available in the development branch of AGL and the AGL instructions for R-Car M2 actually worked pretty well for M3 too mutatis mutandis. There is also an OpenEmbedded/Yocto BSP layer for Chromium but unfortunately Wayland is only supported up to m48 (using the Ozone backend from Intel’s fork) and X11 is only supported up to m52. Hence we created a (for-now-private) fork of meta-browser that:

  • Uses the latest development version of Chromium, in particular the Linux/Ozone/Mash/Wayland support Igalia has been working on recently.
  • Relies on the new GN build system rather than the deprecated GYP one.
  • Supports the ARM64 architecture instead of just ARM32.
  • Installs more resources such as Mojo service manifests.
  • Applies more patches (e.g. build fixes or the workaround for issue 2485673002).

Renesas also provided us with some R-Car M3 boards. I received my package this week and hence was able to start playing with it on Wednesday. Again, the AGL instructions for R-Car M2 apply well to M3 and the elinux page is very helpful to understand the various parts of the board. After having started AGL on the board and opened a Weston terminal (using mouse & keyboard as in the video or using ssh) I was able to run Chromium via the following command:

/usr/bin/chromium/chrome --mash --ozone-platform=wayland \
                         --user-data-dir=/tmp/user-data-dir \

The first line is the usual command currently used to execute Ozone/Mash/Wayland. The second line works around the fact that the AGL demo is running as root. The third line disables the sandbox because the deprecated setuid sandbox does not work with Ozone and the newer user namespaces sandbox is not supported by the AGL kernel.

The following video gives an overview of the hardware preparation as well as some basic tests.

You can see that Chromium runs relatively smoothly, but we will do more testing to confirm these initial experiments. Also, we see the usual issues with Linux/Ozone/Mash (internal vs. external windows, no fullscreen, keyboard restricted to US layout, unwanted ChromeOS widgets, etc.) but we expect to continue to collaborate with Google to fix these bugs!

December 02, 2016 11:00 PM

Jacobo Aragunde

Sowing free software across Galicia

This post, originally written in Galician, covers the free software activities I have recently taken part in at the local level, as part of Igalia’s mission to promote free software.

To begin with, I had the honor of being invited to give a seminar, for the third year in a row, as part of the Information Systems Design (Deseño de Sistemas de Información) course in the Master’s in Computer Engineering taught at the Universidade da Coruña. The goal was to connect the course content with the reality of a community-developed software project, and to that end we analyzed the LibreOffice project and its community. We recalled its long history, studied the composition and workings of its community, and took a look at the application’s architecture and at the techniques and tools used for QA.

In addition, last week I visited Monforte de Lemos to talk about Igalia’s business project and about how it is possible to do business with free software technologies. The invitation came from IES A Pinguela, which offers several vocational training programmes related to ICT. It is always a pleasure to present the Igalia project and to stress not only the technical aspects but also the human ones, such as our democratic and egalitarian organizational model or our strong principles of social responsibility. On the technical side, we analyzed different business models and their relationship with free software, along with case studies of real companies and communities.

Talk at IES A Pinguela

The contents of both talks follow in hybrid PDF format (editable with LibreOffice):

by Jacobo Aragunde Pérez at December 02, 2016 08:55 AM

December 01, 2016

Asumu Takikawa

Pflua to assembly via dynasm

Today’s post is about the compiler implementation work I’ve been doing at Igalia for the past month and a half or so. The networking team at Igalia maintains the pflua library for fast packet filtering in Lua. The library implements the pflang language, which is what we call the filtering language that’s built into tools like tcpdump and libpcap.

The main use case for pflua is packet filtering integrated into programs written using Snabb, a high-speed networking framework implemented with LuaJIT.

(For more info about pflua, check out Andy Wingo’s blog posts about it and also Kat Barone-Adesi’s blog posts.)

The big selling point of pflua is that it’s a very high speed implementation of pflang filtering (see here). Pflua gets its speed by compiling pflang expressions cleverly (via ANF and SSA) to Lua code that’s executed by LuaJIT, which makes code very fast via tracing JIT compilation.

Compiling to Lua lets pflua take advantage of all the optimizations of LuaJIT, which is great when compiling simpler filters (such as the “accept all” filter) or with filters that reject packets quickly. With more complex filters and filters that accept packets, the performance can be more variable.

For example, the following chart from the pflua benchmark suite shows a situation in which benchmarks are paired off, with one filter accepting most packets and the second rejecting most. The Lua performance (light orange, furthest bar on the right) is better in the rejecting case:

Meanwhile, for filters that mostly accept packets, the Linux BPF implementations (blue & purple) are faster. As Andy explains in his original blog post on these benchmarks, this poor performance can be attributed to the execution of side traces that jump back to the original trace’s loop header, thus triggering unnecessary checks.

That brings me to what I’ve been working on. Instead of compiling down to Lua and relying on LuaJIT’s optimizations, I have been experimenting with compiling pflang filters directly to x64 assembly via pflua’s intermediate representation. The compilation is still implemented in Lua, and its code generation pass uses dynasm, a dynamic assembler that augments C/Lua with a preprocessor-based assembly language (it was originally developed for use with C in the internals of LuaJIT). More precisely, the compiler uses the Lua port of dynasm.

The hope is that by compiling to x64 asm directly, we can get better performance out of cases where compiling to Lua does poorly. To spoil the ending, let me show you the same benchmark graph from earlier but with a newer Pflua and the new native backend (in dark orange on the far right):

As you can see, the performance of the native backend is almost uniformly better than the other implementations on these particular filters. The last case is the only one in which the default Lua backend (light orange, 2nd from right) does better.

In particular, the native backend does better than the Linux BPF cases even when the filter tends to accept packets. That said, I don’t want to give the impression that the native backend is always better than the Lua backend. For example, the Lua backend does better in the following examples:

Here the filters are simpler (such as the empty “accept all” filter) and execute fewer checks. In these cases, the Lua backend is faster, perhaps because LuaJIT does a good job of inlining away the checks or the function entirely. The native backend is still competitive though.

Using filters that are more complex appears to tip things favorably towards the dynasm backend. For example, here are the results on two more complex filters that target HTTP packets:

The two filters used for the graph above are the following:

tcp port 80 and
  (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)

tcp[2:2] = 80 and
  (tcp[20:4] = 1347375956 or
   tcp[24:4] = 1347375956 or
   tcp[28:4] = 1347375956 or
   tcp[32:4] = 1347375956 or
   tcp[36:4] = 1347375956 or
   tcp[40:4] = 1347375956 or
   tcp[44:4] = 1347375956 or
   tcp[48:4] = 1347375956 or
   tcp[52:4] = 1347375956 or
   tcp[56:4] = 1347375956 or
   tcp[60:4] = 1347375956)

The first is from the pcap-filter language manpage and targets HTTP data packets. The other is from a Stackexchange answer and looks for HTTP POST packets. As is pretty obvious from the graph, the native backend is winning by a large margin for both.
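To make the arithmetic in these two filters concrete, here is a minimal Python sketch (not pflua code) that evaluates the same checks on a raw IPv4 packet. The function names and the hand-built test packet are assumptions made for illustration. The sketch skips the protocol and fragment checks that pflang adds implicitly, and its bounds handling is simplified. Note that 1347375956 is simply the ASCII bytes of "POST" read as a big-endian 32-bit integer.

```python
import struct

# "POST" as a big-endian 32-bit integer, the constant used in the second filter
POST = int.from_bytes(b"POST", "big")

def tcp_offset(ip: bytes) -> int:
    """Start of the TCP header: IHL (low nibble of ip[0]) counted in 32-bit words."""
    return (ip[0] & 0x0F) << 2

def has_tcp_payload(ip: bytes) -> bool:
    """Arithmetic of the first filter: total length minus IP and TCP header lengths."""
    total_len = struct.unpack_from("!H", ip, 2)[0]   # ip[2:2]
    tcp = tcp_offset(ip)                              # (ip[0]&0xf)<<2
    tcp_hdr_len = (ip[tcp + 12] & 0xF0) >> 2          # (tcp[12]&0xf0)>>2
    return total_len - tcp - tcp_hdr_len != 0

def looks_like_post(ip: bytes) -> bool:
    """Second filter: destination port 80 and "POST" at tcp[20], tcp[24], ..., tcp[60]."""
    tcp = tcp_offset(ip)
    if struct.unpack_from("!H", ip, tcp + 2)[0] != 80:   # tcp[2:2] = 80
        return False
    for off in range(20, 64, 4):
        word = ip[tcp + off: tcp + off + 4]
        if len(word) == 4 and int.from_bytes(word, "big") == POST:
            return True
    return False

def make_packet(payload: bytes, dst_port: int = 80) -> bytes:
    """Hypothetical helper: minimal IPv4 + TCP headers (no options, zero checksums)."""
    ip_hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0, 64, 6, 0,
                         bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    tcp_hdr = struct.pack("!HHIIBBHHH", 12345, dst_port, 0, 0, 0x50, 0x18, 8192, 0, 0)
    return ip_hdr + tcp_hdr + payload
```

The second filter probes eleven offsets because TCP options can push the start of the payload anywhere between 20 and 60 bytes into the segment.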


Now that you’ve seen the benchmark results, I want to remark on the general design of the new backend. It’s actually a pretty straightforward compiler that looks like one you would write following Appel’s Tiger book. The new backend consumes the SSA basic blocks IR that’s used by the Lua backend. From there, an instruction selection pass converts the tree structures in the SSA IR into straight-line code with virtual instructions, often introducing additional variables (really virtual registers) and movs for intermediate values.
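As a rough illustration of what such an instruction selection pass does, here is a small Python sketch that flattens a nested expression tree into two-address virtual instructions. The tuple encoding, the `v0`, `v1`, ... register names and the `select` function are all assumptions made for this example; pflua's real pass is written in Lua and works on its own IR.

```python
import itertools

def select(expr, code, fresh):
    """Flatten a nested tuple expression into straight-line virtual instructions.

    Returns the operand (constant, variable, or virtual register) holding the value.
    """
    if not isinstance(expr, tuple):
        return expr                       # leaf: constant or variable name
    op, first, *rest = expr
    src = select(first, code, fresh)      # evaluate sub-expressions first
    args = [select(r, code, fresh) for r in rest]
    dst = f"v{next(fresh)}"               # fresh virtual register for this node
    code.append(("mov", dst, src))        # copy the first operand into dst
    code.append((op, dst, *args))         # then apply the operation in place
    return dst

code = []
result = select(("&", ("+", "len", 2), 0xF), code, itertools.count())
for instr in code:
    print(instr)
```

The `mov` emitted for every node is exactly the kind of intermediate copy that a later register allocation pass can often eliminate.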

Then a simple linear scan register allocator assigns real registers to the virtual registers, and also eliminates redundant mov instructions where possible. I’ve noticed that most pflang filters don’t produce intermediate forms with much register pressure, so spilling to memory is pretty rare. In fact, I’ve only been able to cause spilling with synthetic examples that I make up or from random testing outputs.

Finally, a code generation pass uses dynasm to produce real x64 instructions. Dynasm simplifies code generation quite a bit, since it lets you write a code generator as if you were just writing inline assembly in your Lua program.

For example, here’s a snippet showing how the code generator produces assembly for a virtual instruction that converts from network to host order:

elseif itype == "ntohl" then
   local reg = alloc[instr[2]]
   | bswap Rd(reg)

This is a single branch from a big conditional statement that dispatches on the type of virtual instruction. It first looks up the relevant register for the argument variable in the register allocation and then produces a bswap assembly instruction. The line that starts with a | is not normal Lua syntax; it’s consumed by the dynasm preprocessor and turned into normal Lua. As you can see, Lua expressions like reg can also be used in the assembly to control, for example, the register to use.

After preprocessing, the branch looks like this:

elseif itype == "ntohl" then
   local reg = alloc[instr[2]]
   --| bswap Rd(reg)
   dasm.put(Dst, 479, (reg))

(There is a very nice unofficial documentation page for dynasm by Peter Cawley if you want to learn more about it.)

There are some cases where using dynasm has been a bit frustrating. For example, when implementing register spilling, you would like to be able to switch between using register or memory arguments for dynasm code generation (like in the Rd(reg) bit above). Unfortunately, it’s not possible to dynamically choose whether to use a register or memory argument, because it’s fixed at the pre-processing stage of dynasm.

This means that the code generator has to separate register and memory cases for every kind of instruction that it dispatches on (and potentially more than just two cases if there are several arguments). This turns the code into spaghetti with quite a bit of boilerplate. A better alternative is to introduce additional instructions to mov memory values into registers and vice versa, and then uniformly work on register arguments otherwise. This introduces some unnecessary overhead, but since spilling is so rare in practice, I haven’t found it to be a problem.
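The "extra movs" approach can be sketched as a small rewriting pass. The Python below is only an illustration under assumptions: the tuple instruction format, the `load`/`store` pseudo-instructions, the scratch registers `r10`/`r11` and the `lower_spills` name are all made up for this example and do not come from pflua.

```python
def lower_spills(instrs, slots, scratch=("r10", "r11")):
    """Rewrite instructions so that every operand is a register.

    instrs: list of (op, dst, *srcs) tuples in two-address form (dst is read and written).
    slots:  dict mapping spilled variables to their stack slot offsets.
    """
    lowered = []
    for op, *operands in instrs:
        temps = iter(scratch)
        renamed = {}
        # Load each spilled operand into a scratch register first.
        for v in operands:
            if v in slots and v not in renamed:
                renamed[v] = next(temps)
                lowered.append(("load", renamed[v], slots[v]))
        # Emit the original instruction using only register operands.
        lowered.append((op, *[renamed.get(v, v) for v in operands]))
        # If the destination lives in memory, write the result back.
        dst = operands[0]
        if dst in renamed:
            lowered.append(("store", slots[dst], renamed[dst]))
    return lowered
```

After this pass the code generator needs only one register-register case per instruction kind, at the cost of a few extra moves whenever a spilled value is touched.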

Dynasm does let you write some really cute code though. For example, one of the tests that I wrote ensures that callee-save registers are used correctly by the native backend. This is a little tricky to test because you need to ensure that the compiled function doesn’t interfere with registers used in the surrounding context (in this case, that’s the LuaJIT VM).

But expressing that test context is not really possible with just plain old Lua. The solution is to apply some more dynasm, and set up a test context which uses callee-save registers and then checks to make sure its registers don’t get mutated by calling the filter. It ends up looking like this:

local function test(expr)
   local Dst =

   local ssa = convert_ssa(convert_anf(optimize(expand(parse(expr),
   local instr =
   local alloc = ra.allocate(instr)
   local f = compile(instr, alloc)
   local pkt = v4_pkts[1]

   | push rbx
   -- we want to make sure %rbx still contains this later
   | mov rbx, 0xdeadbeef
   -- args to 'f'
   | mov64 rdi, ffi.cast(ffi.typeof("uint64_t"), pkt.packet)
   | mov rsi, pkt.len
   -- call 'f'
   | mov64 rax, ffi.cast(ffi.typeof("uint64_t"), f)
   | call rax
   -- make sure it's still there
   | cmp rbx, 0xdeadbeef
   -- put a bool in return register
   | sete al
   | pop rbx
   | ret

   local mcode, size = Dst:build()
   table.insert(anchor, mcode)
   local f = ffi.cast(ffi.typeof("bool(*)()"), mcode)
   return f()
end

What this Lua function does is it takes a pflang filter expression and compiles it into a function f, and then returns a new function generated with dynasm that will call f with particular values in callee-save registers. The generated function only returns true if f didn’t stomp over its registers.

You can see the rest of the code that uses this test here. I think it’s a pretty cool test.

Using the new backend

If you’d like to try out the new Pflua, it’s now merged into the upstream Igalia repository here. To set it up, you’ll need to clone the repo recursively:

git clone --recursive

in order to pull in both LuaJIT and dynasm submodules. Then enter the directory and run make to build it.

To check out how the new backend compiles some filters, go into the tools folder and run something like this:

$ ../env pflua-compile --native 'ip'
7f4030932000  4883FE0E          cmp rsi, +0x0e
7f4030932004  7C0A              jl 0x7f4030932010
7f4030932006  0FB7770C          movzx esi, word [rdi+0xc]
7f403093200a  4883FE08          cmp rsi, +0x08
7f403093200e  7403              jz 0x7f4030932013
7f4030932010  B000              mov al, 0x0
7f4030932012  C3                ret
7f4030932013  B001              mov al, 0x1
7f4030932015  C3                ret

and you’ll see some assembly dumped to standard output like above. If you have any code that uses pflua’s API, you can pass an extra argument to pf.compile_filter like the following:

pf.compile_filter(filter_expr, { native = true })

in order to use native compilation. The plan is to integrate this new work into Snabb (normal pflua is already integrated upstream) in order to speed up network functions that rely on packet filtering. Bug reports and contributions welcome!
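As an aid to reading the assembly dump for the 'ip' filter shown earlier, here is a Python sketch of the same logic. It assumes, as in the callee-save test above, that rdi holds the packet pointer and rsi the packet length; the function name is made up for this example. The constant 8 is the IPv4 EtherType 0x0800 as it appears when its two network-order bytes are loaded as a little-endian word.

```python
def ip_filter(packet: bytes) -> bool:
    """Mirror of the dumped x64 code for the pflang filter 'ip'."""
    # cmp rsi, 0x0e / jl reject: the packet must hold a full 14-byte Ethernet header
    if len(packet) < 14:
        return False
    # movzx esi, word [rdi+0xc]: load the EtherType bytes as a little-endian word
    ethertype_le = int.from_bytes(packet[12:14], "little")
    # cmp rsi, 0x08 / jz accept: bytes 08 00 read little-endian give 0x0008
    return ethertype_le == 0x08
```

Comparing against the byte-swapped constant avoids emitting a byte swap instruction at run time.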

December 01, 2016 12:27 PM

November 19, 2016

Víctor Jáquez

GStreamer-VAAPI 1.10 (now 1.10.1)

A lot of things happened last month and I have neglected this report. Let me try to fix that.

First, the autumn GStreamer Hackfest and the GStreamer Conference 2016 in Berlin.

The hackfest’s venue was the famous C-Base, where we talked with friends and colleagues. Most of the discussions leaned towards the 1.10 release. Still, Julien Isorce and I talked about our approaches for DMABuf sharing with downstream; we agreed to explore how to implement an allocator based on GstDmaBufAllocator. We talked with Josep Torra about the performance he was getting with gstreamer-vaapi encoders (and we have already made good progress on this). We also talked with Nicolas Dufresne about v4l2 and kmssink. And there were many other informal chats along with beer and pizza.

GStreamer Hackfest at the C-Base (from @calvaris tweet)

Afterwards, in the GStreamer Conference, I talked a bit about the current status of GStreamer-VAAPI and what we can expect from release 1.10. Here’s the video that the folks from Ubicast recorded. You can see the slides here too.

I spent most of the conference with Scott, Sree and Josep tracing the bottlenecks in frame uploading to the VA Intel driver, and trying to answer the questions of the folks who kindly approached me. I hope my answers were of some help.

And finally, on the first of November, release 1.10 hit the streets!

As already mentioned in the release notes, these are the main new features you can find in GStreamer-VAAPI 1.10:

  • All decoders have been split, one plugin feature per codec. So far, the available ones, depending on the driver, are: vaapimpeg2dec, vaapih264dec, vaapih265dec, vaapivc1dec, vaapivp8dec, vaapivp9dec and vaapijpegdec (which already was split).
  • Improvements when mapping VA surfaces into memory, differentiating between negotiation caps and allocation caps, since the memory allocated for surfaces may be bigger than the one that is going to be mapped.
  • vaapih265enc gained support for constant bitrate mode (CBR).
  • Since several VA drivers are unmaintained, we decided to keep a white list of the VA drivers we actually test, which is mostly the i915 and, to some degree, gallium from the mesa project. Exporting the environment variable GST_VAAPI_ALL_DRIVERS disables the white list.
  • The plugin features are registered, at run-time, according to their support by the loaded VA driver. So only the decoders and encoders supported by the system are registered. Since the driver can change, some dependencies are tracked to invalidate the GStreamer registry and reload the plugin.
  • DMA-Buf importation from upstream has been improved, gaining performance.
  • vaapipostproc now can negotiate through caps the buffer transformations.
  • Decoders can now do reverse playback! But they only show I-frames, because the surface pool is smaller than what the GOP requires to show all the frames.
  • The upload of frames onto native GL textures has been optimized too, keeping a cache of the internal structures for the textures offered by the sink, which greatly improves performance with glimagesink.
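As a small illustration of the white list switch mentioned in the list above, this Python sketch prepares an environment with GST_VAAPI_ALL_DRIVERS set and, if the standard gst-inspect-1.0 tool is installed, lists the registered vaapi features. The helper names are made up, and the value "1" is an assumption: the notes above only say that the variable has to be exported.

```python
import os
import shutil
import subprocess

def vaapi_env(all_drivers: bool) -> dict:
    """Copy the current environment, optionally setting GST_VAAPI_ALL_DRIVERS."""
    env = dict(os.environ)
    if all_drivers:
        env["GST_VAAPI_ALL_DRIVERS"] = "1"  # assumed value, see the note above
    else:
        env.pop("GST_VAAPI_ALL_DRIVERS", None)
    return env

def list_vaapi_features(all_drivers: bool = False) -> str:
    """Return gst-inspect-1.0 output for the vaapi plugin, or "" if the tool is missing."""
    tool = shutil.which("gst-inspect-1.0")
    if tool is None:
        return ""
    result = subprocess.run([tool, "vaapi"], env=vaapi_env(all_drivers),
                            capture_output=True, text=True)
    return result.stdout
```

Comparing the output with and without the variable should show which features the white list hides on a given driver, bearing in mind that the GStreamer registry may need to be refreshed between runs.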

This is the short log summary since branch 1.8:

  1  Allen Zhang
 19  Hyunjun Ko
  1  Jagyum Koo
  2  Jan Schmidt
  7  Julien Isorce
  1  Matthew Waters
  1  Michael Olbrich
  1  Nicolas Dufresne
 13  Scott D Phillips
  9  Sebastian Dröge
 30  Sreerenj Balachandran
  1  Stefan Sauer
  1  Stephen
  1  Thiago Santos
  1  Tim-Philipp Müller
  1  Vineeth TM
154  Víctor Manuel Jáquez Leal

Thanks to Igalia, Intel Open Source Technology Center and many others for making GStreamer-VAAPI possible.

by vjaquez at November 19, 2016 11:31 AM

November 15, 2016

Iago Toral

OpenGL terrain renderer update: Motion Blur

As usual, it can be turned on and off at build-time and there is some configuration available as well to control how the effect works. Here are some screen-shots:

Motion Blur Off
Motion Blur On, intensity 12.5%
Motion Blur On, intensity 25%
Motion Blur On, intensity 50%

Motion blur is a technique used to make movement feel smoother than it actually is. It is targeted at hiding the fact that things don’t move in a continuous fashion but rather at fixed intervals dictated by the frame rate. For example, a fast-moving object in a scene can “leap” many pixels between consecutive frames even if we intend for it to have a smooth animation at a fixed speed. Quick camera movement produces the same effect on pretty much every object in the scene. Our eyes can notice these gaps in the animation, which is not great. Motion blur applies a slight blur to objects along the direction in which they move, which aims at filling the animation gaps produced by our discrete frames, tricking the brain into perceiving a smoother animation as a result.
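To give a feel for the idea, here is a toy Python sketch that blurs a single row of pixels along a horizontal velocity by averaging a few samples taken in the direction of motion. It is an assumption-laden illustration only: it is not the renderer’s actual implementation, and the function and parameter names are made up.

```python
def motion_blur_row(row, velocity, samples):
    """Average `samples` pixels along the motion direction for each pixel of a row.

    row:      list of pixel intensities
    velocity: how many pixels the content moved since the previous frame
    samples:  how many points to sample along that movement
    """
    n = len(row)
    blurred = []
    for x in range(n):
        acc = 0.0
        for i in range(samples):
            # Step along the velocity vector, clamping at the image borders.
            sx = min(max(x + round(velocity * i / samples), 0), n - 1)
            acc += row[sx]
        blurred.append(acc / samples)
    return blurred

print(motion_blur_row([0, 0, 0, 1, 0, 0], velocity=2, samples=2))
```

With zero velocity the row is returned unchanged, which matches the observation that the effect is only visible during motion.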

In my demo there are no moving objects other than the sky box or the shadows, which are relatively slow objects anyway. However, camera movement can make objects change screen-space positions quite fast (especially when we rotate the view point), and the motion blur effect helps us perceive a smoother animation in this case.

I will try to cover the actual implementation in some other post, but for now I’ll keep it short and leave it to the images above to showcase what the filter actually does at different configuration settings. Notice that the smoothing effect can only be perceived during motion, so still images are not the best way to showcase the result of the filter from the perspective of the viewer. They are, however, a good way to freeze the animation and see exactly how the filter modifies the original image to achieve the smoothing effect.

by Iago Toral at November 15, 2016 12:19 PM

November 14, 2016

Manuel Rego

Recap of the Web Engines Hackfest 2016

This is my personal summary of the Web Engines Hackfest 2016 that happened at Igalia headquarters (in A Coruña) during the last week of September.

The hackfest is a special event, because the target audience is implementors of the different web browser engines. The idea is to bring browser hackers together for a few days to discuss different topics and work together on some of them. This year we almost reached 40 participants, which is the largest number of people attending the hackfest since we started it 8 years ago. It was also really nice to have people from the different communities around the open web platform.

One of the breakout sessions of the Web Engines Hackfest 2016


Although the main focus of the event is the hacking time and the breakout sessions to discuss the different topics, we took advantage of having great developers around to arrange a few short talks. I’ll do a brief review of the talks from this year, which have already been published on a YouTube playlist (thanks to Chema and Juan for the amazing recording and video editing).

CSS Grid Layout

In my case, as usual lately, my focus was the CSS Grid Layout implementation in Blink and WebKit, which Igalia is doing as part of our collaboration with Bloomberg. The good thing was that Christian Biesinger from Google was present at the hackfest. He’s usually the Blink developer reviewing our patches, so it was really nice to have him around to discuss and clarify some topics.

My colleague Javi already wrote a blog post about the hackfest with lots of details about the Grid Layout work we did. Probably the most important bit is that we’re getting really close to shipping the feature. The spec is now a Candidate Recommendation (despite a few issues that still need to be clarified) and we’ve been focusing lately on finishing the last features and fixing interoperability issues with the Gecko implementation. We’ve also just sent the “Intent to ship” mail to blink-dev; if everything goes well, Grid Layout might be enabled by default in Chromium 57, which will be released around March 2017.


It’s worth mentioning the work performed by Behdad, Fred and Khaled adding support in HarfBuzz for the OpenType MATH table. Again, Fred wrote a detailed blog post about all this work some weeks ago.

We also had the chance to discuss with more Googlers the possibilities of bringing MathML back into Chromium. We showed the status of the experimental branch created by Fred and explained the improvements we’ve been making to the WebKit implementation. Now Gecko and WebKit both follow the implementation note and the tests have been upstreamed into the W3C Web Platform Tests repository. Let’s see how all this will evolve in the future.


Last, but not least, as one of the event organizers, I have to say thanks to everyone who attended the hackfest; without all of you the event wouldn’t make any sense. I also want to acknowledge the support from the hackfest sponsors Collabora, Igalia and Mozilla. And I’d also like to give kudos to Igalia for hosting and organizing the event one more year. Looking forward to the 2017 edition!

Web Engines Hackfest 2016 sponsors: Collabora, Igalia and Mozilla

Igalia 15th Anniversary & Summit

Just to close this blog post let me talk about some extra events that happened on the last week of September just after the Web Engines Hackfest.

Igalia was celebrating its 15th anniversary with several parties during the week. One of them was on the last night of the hackfest, with a cool live concert included. Of course hackfest attendees were invited to join us.

Igalia 15th Anniversary Party (picture by Chema Casanova)

Then, on the weekend, Igalia arranged a new summit. We usually do two per year and, in my humble opinion, they’re really important for Igalia as a whole. The flat structure is based on trust in our peers, so spending a few days per year all together is wonderful. It allows us to know each other better, while having a good time with our colleagues. And I’m sure they’re very useful for newcomers too, in order to help them to understand our company culture.

Igalia Fall Summit 2016 (picture by Alberto Garcia)

And that’s all for this post, let’s hope you didn’t get bored. Thanks for reading this far. 😊

November 14, 2016 11:00 PM

Antonio Gomes

Chromium, ozone, wayland and beyond

Over the past 2 months my fellow Igalian Frédéric Wang and I have been working on improving the Wayland support in Chromium. In this post I will detail some of the progress we’ve made and what our next steps are.


The more Wayland evolved as a mature alternative to the traditional X11-based environments, the more the Chromium community realized efforts were needed to run Chrome natively and flawlessly on such environments. In that context, the Ozone project began back in 2013 with the main goal of making it easier to port Chrome to different graphics systems.

Ozone consists of a set of C++ classes for abstracting different window systems on Linux. It provides abstraction for the construction of accelerated surfaces underlying the Aura UI framework, input device assignment and event handling.

Also in 2013, Intel presented an Ozone/Wayland backend to the Chromium community. Although the project progressed really well, after some time of active development it stalled: its most recent update is compatible with Chromium m49, from December 2015.

Leaping forward to 2016, a Wayland backend was added to Chromium upstream by Michael Forney. The implementation is not as feature complete as Intel’s (for example, drag and drop support is not implemented, and it only works with specific build configurations of Chrome), but it is definitely a good starting point.

In July, I experimented a bit with this existing Wayland backend in Chromium. In summary, I took the existing Wayland support upstream and “ported” the minimal patches from Intel’s repository, which was enough to get ContentShell running with Wayland. However, given how the Wayland backend in Chromium upstream is designed and implemented, the GPU process failed to start, and the HW rendering path was not functional.


Communication with Google

Our main goal was to make Chrome capable of running natively on a Wayland-based environment, with hardware accelerated rendering enabled through the GPU process. After the investigation phase, the natural follow-up step for us at Igalia was to contact Chrome hackers at Google, share our plans for Chromium/Wayland, and gather feedback. We reached out to @forney and @rjkroege, and once we heard back, we could understand why “porting” the missing bits from Intel’s original Ozone/Wayland implementation might not be the best choice in the long run.

Basically, the Ozone/Wayland backend present in Chromium upstream is designed in accordance with the original Mus+Ash Project’s architecture: the UI and GPU components would live in the same process. This way, the Wayland connection and other objects that belong to the UI component would be accessible to the GPU component, and accelerated HW rendering could be implemented without any IPC.

This differs from Intel’s original implementation, where the GPU process owns the Wayland connections and objects, and the legacy Chromium IPC system is used to pass Wayland data from the GPU process to the Browser process.

The latest big Chromium refactoring since the Content API modularization is the so-called Mus+Ash (read "mustash") effort. In an over-summarized definition, it consists of factoring specific parts of Chromium out of Chrome into "services", each featuring well-defined interfaces for communication via the Mojo IPC subsystem.

It was also pointed out to us that eventually the UI and GPU components will be split into different processes. At that point, the same approach as the one taken for the DRM/GBM Ozone backend was advised in order to implement communication between the UI and GPU components.

Igalia improvements

Given that “ash” is a ChromeOS-oriented UI component, it seems logical to say that the “mus+ash” project’s primary focus is on the ChromeOS platform. Well, with that in mind, we thought we just needed to: (1) make a ChromeOS/Ozone/Wayland build of Chrome; (2) run ‘chrome --mash --ozone-platform=wayland’; (3) voilà, all should work.


When Fred and I started on this project two months ago, the situation was:

  • Chrome/Ozone/Wayland configuration was only build-able on ChromeOS and failed to run by default.
  • The Ozone project documentation was out of date (e.g. it referred to GYP and the obsolete Intel Ozone/Wayland code).
  • The Wayland code path was not tested by any buildbots, and could break without being noticed at this time.

After various code clean-ups and build fixes, we could finally launch Chrome off of a ChromeOS/Mash/Ozone/X11 build.

With further investigation, we verified that ChromeOS/Mash/Ozone/Wayland worked in m52 (when hardware accelerated rendering support was added to the Wayland backend), m53 and m54 checkouts, but not in m55. We bisected commits between m54 and m55, identified the regressions, and fixed them in issues 2387063002 and 2389053003. Finally, chrome --mash --ozone-platform=wayland was running with hardware accelerated rendering enabled on “off-device” ChromeOS builds.
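For readers who have not bisected before, the search works like the following Python sketch: given an ordered list of commits whose first entry is known to be good and whose last entry is known to be bad, it halves the range until it finds the first bad commit. The function names and the `is_bad` callback (which would build and test a checkout) are assumptions for illustration; in practice a tool such as git bisect drives this loop.

```python
def first_bad_commit(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and that once a commit
    is bad every later commit is bad too.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid       # the regression is at mid or earlier
        else:
            good = mid      # the regression is after mid
    return commits[bad]
```

Each step halves the candidate range, so even the thousands of commits between two Chromium milestones need only a dozen or so builds. When several regressions overlap, as was the case here, the search has to be repeated after each fix.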



Linux desktop builds

Our next goal at this point of the collaboration was to make it possible to build and run chrome --mash off of a regular Linux desktop build, through the Ozone/Wayland backend.

After discussing it on the ozone-dev mailing list, we submitted and upstreamed various patches, converging to a point where a preliminary version of Chrome/mash/Ozone/Wayland was functional on a non-ChromeOS build, as shown below.


Note that this version of Chrome is stated above as “preliminary” because although it launches off of a regular desktop Linux build, it still runs with UI components from the ChromeOS UI environment, including a yellow bottom tray, and other items part of the so called “ash” Desktop environment.

Improving developers’ ecosystem

As mentioned, when we started on this collaboration, the Ozone project featured an out of date documentation, compilation and execution failures, and was mostly not integrated to Chromium’s continuous testing system (buildbots, etc). However, given that Ozone is the abstraction layer we would work on in order to get a Wayland backend up and running for Chrome, we found it sensible to improve this scenario.

Drawing on Igalia’s experience with community work, we made a lot of progress on this front, and the Ozone documentation is now up to date.

Also, after having fixed the recurring Ozone/Wayland build and execution failures, Fred and I realized the Ozone/Wayland codepath in Chromium needed to be exercised by a buildbot, so that its maintenance burden is reduced. Fred emailed ozone-dev about it and we started contributing to it, together with the Googler thomasanderson@. As a result, the Ozone/Wayland backend was made part of Chromium’s continuous integration testing system.

Also, Fred made a great post about the current Ozone/Wayland code in Chromium, its main classes and some experiments on adding Mojo capability to it. I consider it a must-read follow-up to this post.

Next steps

Today chrome --mash launches within an “ash” window, where the “ash” environment works as a shell for internal child windows, Chrome being one of them. Since “mash” is short for “mus+ash”, our next step is to “eliminate” the “ash” aspects from the execution path, resulting in a “Chrome Mus” launch.

This project is sponsored by Renesas Electronics …


… and has been performed by the great Igalian hacker Frédéric Wang and me on behalf of Igalia.


by agomes at November 14, 2016 04:42 PM

November 10, 2016

Xabier Rodríguez Calvar

Web Engines Hackfest 2016

From September 26th to 28th we celebrated the 2016 edition of the Web Engines Hackfest at the Igalia HQ. This year we broke all records and got participants from the three main companies behind the three biggest open source web engines, namely Mozilla, Google and Apple. Of course, it was not only them: we had some other companies and ourselves. I was actively involved in the organization and I think that not only did we not get any complaints, but people were comfortable and happy.

We had several talks (I included the slides and YouTube links):

We had lots and lots of interesting hacking and we also had several breakout sessions:

  • WebKitGTK+ / Epiphany
  • Servo
  • WPE / WebKit for Wayland
  • Layout Models (Grid, Flexbox)
  • WebRTC
  • JavaScript Engines
  • MathML
  • Graphics in WebKit

During the hackfest I worked with Enrique and Žan on reviewing our downstream implementation of Media Source Extensions (MSE) in order to land it as soon as possible, and I can proudly say that we already did (we didn’t finish at the hackfest but managed to do it afterwards). We broke the bots and pissed off Michael and Carlos, but we managed to deactivate it by default and continue working on it upstream.

Summing up, I think the hackfest was a success, and that is not only my point of view as part of the organization at Igalia but also the opinion of other people. I think we will continue as we were, or maybe grow a bit (no spoilers!).

Finally I would like to thank our gold sponsors Collabora and Igalia and our silver sponsor Mozilla.

by calvaris at November 10, 2016 08:59 AM

October 31, 2016

Manuel Rego

My experience at W3C TPAC 2016 in Lisbon

At the end of September I attended W3C TPAC 2016 in Lisbon together with my fellow Igalians Joanie and Juanjo. TPAC is where all the people working on the different W3C groups meet for a week to discuss tons of topics around the Web. It was my first time at a W3C event, so I had the chance to meet a lot of amazing people that I’ve been following on the internet for a long time.

Igalia booth

Like last year, Igalia was present at the conference, and we had a nice exhibitor booth just in front of the registration desk. Most of the time my colleague Juanjo was there explaining Igalia and the work we do to the people that came to the booth.

Igalia Booth at W3C TPAC 2016 (picture by Juan José Sánchez)

In the booth we were showcasing some videos of the latest standards implementations we’ve been doing upstream (like CSS Grid Layout, WebRTC, MathML, etc.). In addition, we were also showing a few demos of our WebKit port called WPE, which has been optimized to run on low-end devices like the Raspberry Pi.

It’s really great to have the chance to explain the work we do around browsers, and to see that quite a lot of people are interested in it. In this regard, Igalia is a particular kind of consultancy that can help companies develop standards upstream, contributing to both the implementations and the evolution of the specifications and the discussions inside the different standards bodies. Of course, don’t hesitate to contact us if you think we can help you achieve similar goals in this field.

CSS Working Group

During TPAC I was attending the CSS WG meetings as an observer. It’s been really nice to check that a lot of people there appreciate the work we do around CSS Grid Layout, as part of our collaboration with Bloomberg. Not only the implementation effort we’re leading on Blink and WebKit, but also the contributions we make to the spec itself.

New ideas for CSS Grid Layout

There was not a lot of discussion about Grid Layout during the meetings, just a few issues that I’ll explain later. However, Jen Simmons did a very cool presentation with a proposal for a new regions specification.

As you probably remember, one of the main concerns regarding the CSS Regions spec is the need for dummy divs that are used to flow the text into them. Jen was suggesting that we could use CSS Grids to solve that. Grids create boxes from CSS where you place elements from the DOM; the idea would be that if you can reference those boxes from CSS then the contents could flow there without the need for dummy nodes. This was linked with some other new ideas to improve CSS Grid Layout:

  • Apply backgrounds and borders to grid cells.
  • Skip cells during auto-placement.

All this is already possible nowadays using dummy div elements (of course if you use a browser with regions and grid support like Safari Technology Preview). However, the idea would be to be able to achieve the same thing without any empty item, referring to the boxes directly from CSS.

This of course needs further discussion and won’t be part of level 1 of the CSS Grid Layout spec, but it’d be really nice to get some of those things included in future versions. I discussed this with Jen, and some of them (like skipping cells for auto-placement) shouldn’t be hard to implement. Lastly, this reminded me of a discussion we had two years ago at the Web Engines Hackfest 2014 with Mihnea Ovidenie.

Last issues on CSS Grid Layout

After the meetings I had the chance to discuss some issues face to face with fantasai (one of the Grid Layout spec editors).

One of the topics was discussed during the CSS WG meeting; it was related to how to handle percentages inside Grid Layout when they need to be resolved in an intrinsic size situation. The question here was whether percentage tracks and percentage gaps should resolve the same way or differently, and the group agreed that they should work exactly the same. However, there is something else, as Firefox computes percentages for margins differently from the other browsers. I tried to explain all this in a detailed issue for the CSS WG. I really think we only have one option to do this, but it’ll take a while until we have a final resolution on this.

Another issue that I think is still open is about the minimum size of grid items. I believe the text on the spec still needs some tweaks, but, thanks to fantasai, we managed to clarify what the expected behavior should be in the common case. However, there are still some open questions and an ongoing discussion regarding images.

Finally, it’s worth mentioning that just after TPAC, the Grid Layout spec transitioned to Candidate Recommendation (CR), which means that it’s getting stable enough to finish the implementations and release it to the wild. Hopefully these open issues will be fixed pretty soon.


I also attended the first day of Houdini meetings. I was lucky, as they started with the Layout API (the one I was most interested in). It’s clear that Google is pushing a lot for this to happen; the whole new Layout NG project inside Blink seems quite related to the new Houdini APIs. It looks like a nice thing to have but, at first sight, it seems quite hard to get right, mainly due to the great complexity of layout on the web.

On the bright side the Painting API transitioned to CR. Blink has some prototype implementations already that can be used to create cool demos. It’s really amazing to see that the Houdini effort is already starting to produce some real results.


It’s always a pleasure to visit Portugal and Lisbon, both a wonderful country and city. One of the conference afternoons I had the chance to go for a walk from the congress center to the lovely Belém Tower. It was clear I couldn’t miss the chance to go there for such a big conference: you don’t always have TPAC so close to home.

Lisbon wall painting

Overall it was a really nice experience, everyone was very kind and supportive about Igalia and our work on the web platform. Let’s keep working hard to push the web forward!

October 31, 2016 11:00 PM

October 24, 2016

Frédéric Wang

Analysis of Ozone Wayland


In the past two months, I have been working with my colleague Antonio Gomes on a Chromium project supported by Renesas. The goal is to have Chrome running on Wayland with accelerated rendering happening in a separate GPU process. Intel has done a great job to develop a specific Wayland platform for Ozone but unfortunately the code only works for older releases of Chromium and is not fully upstreamed yet. In theory, easy out-of-tree platforms was one of the guiding principles of Ozone but in practice we had to fix many build and run time errors for Ozone, even though we were working upstream. More generally, it is very challenging to keep in sync with all the refactoring work happening upstream and some work is definitely required to align Intel’s code with Google’s goals.

To summarize the situation briefly, if we follow Intel’s approach then the currently upstreamed Wayland code only compiles for ChromeOS (i.e. target_os="chromeos" in your build config), not for a standard Linux build of Chrome. You can only have accelerated rendering at the price of removing the separation of UI/GPU processes (i.e. via the --in-process-gpu switch). If you keep the default behavior of having the UI and GPU components running in separate processes then accelerated rendering fails. A fallback to software rendering is then attempted but it in turn fails in

When Antonio joined Igalia, he did some experiments on Ozone/Wayland and wrote a quick workaround to make the software rendering fallback work when UI and GPU are in separate processes. He was also able to run a standard Linux build of Chrome on Wayland by upstreaming more code from Intel. He also noticed that in Google’s code, only GL drawing is happening in the GPU component (i.e. Wayland objects are owned by the UI, contrary to Intel’s approach), which is the reason why accelerated rendering fails in separate-process mode. After discussion with Google developers it however became clear that they would prefer a different approach that is consistent with two features they are working on:

  • Mus-ash, a project to separate the window management and shell functionality of ash from the chrome browser process.

  • Mojo, a new API for inter-process communication.

During the project, we were able to get chrome running on Wayland using the Mus code path, either on ChromeOS or on a standard Linux build. For that code path, GPU and UI components are currently running in the same process so, as discussed above, accelerated rendering works for Wayland. However, the plan for mus is to move these two components into separate processes and hence we need to adapt the Wayland code in order to allow communication between the GPU (doing GL drawings) and the UI (owning Wayland objects).

I’ll let Antonio describe more precisely on his blog the work we have been doing to get chrome --mash running on Wayland. In this blog post, I’m aligning with Google’s goal and hence I’m focusing on this Mus code path. I’m going to give a quick overview of the structure of Ozone and more specifically what is used by the Wayland platform to perform accelerated rendering.

Ozone Architecture

Ozone Platform

OzonePlatform is the main Ozone interface used to instantiate and initialize an Ozone platform (X11, Wayland, DRM/GBM…). It provides factory getters for helper classes (SurfaceFactoryOzone, PlatformWindow…).

The goal is to have OzonePlatform::InitializeForUI and OzonePlatform::InitializeForGPU called from different processes as well as two different Mojo connectors for the UI and GPU services respectively. However at the time of writing, the two components are only in different threads in the same process. There is also only one Mojo connector available: The one for the service:ui service, passed to OzonePlatform::InitializeForUI.

Two important implementations to consider in this project are OzonePlatformWayland (the one we work on) and OzonePlatformGbm (the one that seems the best maintained and tested upstream).

Native Display Delegate

NativeDisplayDelegate is used to perform display configuration. The DRM/GBM platform has its own DrmNativeDisplayDelegate implementation. For the Wayland platform, we do not actually need real display devices, so we just do as for X11 and use the FakeDisplayDelegate class. See Issue 2389053003.


AcceleratedWidget is just a platform-specific object representing a surface on which compositors can paint pixels.

PlatformWindow represents a single window in the underlying platform windowing system with the usual properties: it can be minimized, maximized, closed, put in fullscreen, etc.

PlatformWindowDelegate is just a delegate to which the PlatformWindow sends events such as OnBoundsChanged.

The implementations of PlatformWindow used for the Wayland and DRM/GBM platforms are respectively WaylandWindow and DrmWindow. At the moment, several features in WaylandWindow are not fully implemented but the minimal code to paint content into the window is present.

Surface and OpenGL

SurfaceFactoryOzone provides the surfaces used to paint on a window. The implementations for the Wayland and DRM/GBM platforms are respectively WaylandSurfaceFactory and GbmSurfaceFactory. There are two options:

  • Accelerated drawing (GL path). This is done via a GLOzone instance returned by SurfaceFactoryOzone::GetGLOzone.
  • Software Drawing (Skia). This is done via a SurfaceOzoneCanvas instance returned by SurfaceFactoryOzone::CreateCanvasForWidget.

In this project, we focus on accelerated rendering and on EGL, so we consider the GLOzoneEGL class. The following pure virtual functions must be implemented in a derived class:

  • GLOzoneEGL::LoadGLES2Bindings, performing the GL initialization. In general, the default works well.
  • GLOzoneEGL::CreateOffscreenGLSurface returning an offscreen GLSurface of the specified dimension. It seems that it is mostly needed to create a dummy zero-dimensional offscreen surface during initialization. Hence the generic SurfacelessEGL should work fine. See Issue 2387063002.
  • GLOzoneEGL::GetNativeDisplay returns the EGL display connection to use.
  • GLOzoneEGL::CreateViewGLSurface returns a new GLSurface associated to a gfx::AcceleratedWidget.

SurfaceFactoryOzone also provides functions to create NativePixmap, which represents a buffer that can be directly imported via GL for rendering, or exported via dma-buf fds. The DRM/GBM platform implements it and uses GbmPixmap. For now, such pixmap objects are not needed by the Wayland platform so the SurfaceFactoryOzone function members are not implemented.

The instances of GLSurface returned by CreateViewGLSurface for the Wayland and DRM/GBM platforms are respectively GLSurfaceWayland and GbmSurface. GLSurfaceWayland is just a gl::NativeViewGLSurfaceEGL associated to a window created by wl_egl_window_create. GbmSurface instead provides surface-like semantics by deriving from GbmSurfaceless, itself deriving from gl::SurfacelessEGL. Internally, a framebuffer is bound automatically for GL drawing in the GPU and the results are exported to the UI via pixmaps.

Wayland Platform

Wayland Connection

WaylandConnection is a class specific to the Wayland platform that helps to instantiate all the objects necessary to communicate with the Wayland display server.

It also manages a map from gfx::AcceleratedWidget instances to WaylandWindow instances that you can modify with the public GetWindow, AddWindow and RemoveWindow function members.
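As a rough illustration of the bookkeeping described above, here is a minimal sketch of a widget-to-window map. It is not Chromium code: ToyAcceleratedWidget, ToyWaylandWindow and ToyWindowRegistry are hypothetical stand-ins, and only the three function names (GetWindow, AddWindow, RemoveWindow) are taken from the text.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical handle type standing in for gfx::AcceleratedWidget.
using ToyAcceleratedWidget = std::uint32_t;

// Hypothetical stand-in for WaylandWindow; only carries window bounds.
struct ToyWaylandWindow {
  int width = 0;
  int height = 0;
};

// Toy model of the map kept by the connection. The registry does not own
// the windows; each window registers and unregisters itself.
class ToyWindowRegistry {
 public:
  void AddWindow(ToyAcceleratedWidget widget, ToyWaylandWindow* window) {
    assert(window != nullptr);
    windows_[widget] = window;
  }

  void RemoveWindow(ToyAcceleratedWidget widget) { windows_.erase(widget); }

  // Returns nullptr when no window is registered for the widget.
  ToyWaylandWindow* GetWindow(ToyAcceleratedWidget widget) const {
    auto it = windows_.find(widget);
    return it == windows_.end() ? nullptr : it->second;
  }

 private:
  std::map<ToyAcceleratedWidget, ToyWaylandWindow*> windows_;
};
```

A lookup of this kind is what lets GL code go from a widget handle back to the native surface it should draw on.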

Wayland Platform Initialization

  • OzonePlatformWayland::InitializeUI is called from the UI thread. It creates instances of WaylandConnection and WaylandSurfaceFactory as members of OzonePlatformWayland. The Mojo connection of the service:ui service is discarded.

  • OzonePlatformWayland::InitializeGPU is called in the same process but from a different thread. The WaylandConnection and WaylandSurfaceFactory members previously created are hence accessible from the GPU thread too. Nothing is done by this initialization function.
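The two initialization steps can be pictured with the following toy model. It is only a sketch under stated assumptions: every class prefixed with Toy is hypothetical, the signatures are simplified and do not match Chromium's, and the Mojo connector is left out entirely.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for WaylandConnection.
struct ToyWaylandConnection {};

// Hypothetical stand-in for WaylandSurfaceFactory; it keeps a raw pointer
// to the connection it was constructed with.
class ToyWaylandSurfaceFactory {
 public:
  explicit ToyWaylandSurfaceFactory(ToyWaylandConnection* connection)
      : connection_(connection) {}
  ToyWaylandConnection* connection() const { return connection_; }

 private:
  ToyWaylandConnection* connection_;
};

// Toy model of the platform object: the UI step creates both members,
// the GPU step does nothing and simply reuses them.
class ToyOzonePlatformWayland {
 public:
  void InitializeUI() {
    connection_ = std::make_unique<ToyWaylandConnection>();
    surface_factory_ =
        std::make_unique<ToyWaylandSurfaceFactory>(connection_.get());
  }

  void InitializeGPU() {}  // intentionally empty in this model

  ToyWaylandConnection* GetConnection() const { return connection_.get(); }

  ToyWaylandSurfaceFactory* GetSurfaceFactory() const {
    assert(surface_factory_ && "InitializeUI must be called first");
    return surface_factory_.get();
  }

 private:
  std::unique_ptr<ToyWaylandConnection> connection_;
  std::unique_ptr<ToyWaylandSurfaceFactory> surface_factory_;
};
```

The sketch makes the constraint visible: the surface factory reaches the connection through a plain pointer, which only works while both live in the same process.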

Wayland Window

A WaylandWindow is tied to a WaylandConnection and takes care of registering and unregistering itself to that WaylandConnection. It is merely a wrapper to native Wayland surfaces (wl_surface and xdg_surface) with additional window bounds. It also maintains communication with the PlatformWindowDelegate.

Wayland Surface and OpenGL

Currently, the constructor of WaylandSurfaceFactory receives the WaylandConnection. Because we want the UI process to hold the WaylandConnection and the GPU process to use WaylandSurfaceFactory for the GL rendering, the current setup will not work after mus is split. Instead, we should pass it the Mojo connector of the service:gpu service in order to indirectly communicate with the service:ui service (for testing purposes, we can for now just use the Mojo connector of the service:ui service passed to OzonePlatformWayland::InitializeUI).

The WaylandSurfaceFactory::CreateCanvasForWidget function and the WaylandCanvasSurface instance created require the WaylandConnection. However, they are used for software rendering so we can ignore them for now.

GLOzoneEGL::LoadGLES2Bindings and GLOzoneEGL::CreateOffscreenGLSurface do not seem fundamental here and they do not need any Wayland-specific code. Hence we can probably just keep the current default implementations.

At the moment GLOzoneEGLWayland::GetNativeDisplay just returns a native display provided by the WaylandConnection. It seems that we will not be able to do so when the WaylandConnection is no longer available on the GPU side. Probably we should just do like GLOzoneEGLGbm and return EGL_DEFAULT_DISPLAY.

The remaining GLOzoneEGLWayland::CreateViewGLSurface uses WaylandConnection::GetWindow to retrieve the WaylandWindow associated to AcceleratedWidget to draw on. This window provides a wl_surface and sizes that can be passed to wl_egl_window_create to create an egl_window. Then as said in the previous paragraph, a GLSurfaceWayland instance is created which is merely a gl::NativeViewGLSurfaceEGL associated to the egl_window.


In this blog post, an overview of the Ozone Architecture was provided, with focus on the Wayland platform. It also contains an analysis of how we could get accelerated rendering in a separate GPU process aligned with Google’s goals (Mus+ash and Mojo) and hence have it well-integrated with upstream code.

The main problem is that the GLOzoneEGLWayland code is very tied to the Wayland native objects (connection, surface, window…). We should instead only provide a Mojo connector to the GLOzoneEGLWayland class and hence to the GLSurfaceWayland class it constructs. Instances of WaylandWindow and WaylandConnection will only live on the UI side while the GL classes will live on the GPU side.

The GLSurfaceWayland should be rewritten to derive from SurfacelessEGL instead of NativeViewGLSurfaceEGL. We would then follow what is done in GbmSurface to provide some surface-like semantics but without the need to have a real egl_window object. Under the hood, the GL drawings will be performed on a framebuffer.

We should then find a way to export those framebuffers to the UI component via the Mojo connector. Again, following the DRM/GBM code with these NativePixmap objects seems the right option. At the end, the UI would convert the buffers into wl_buffers that can finally be attached to the wl_surface of the WaylandWindow.

During the past two months we have been maintaining the upstream Ozone code and made the Wayland platform work again. Try bots now build and run tests for the Wayland platform and we expect that such continuous testing will make the whole thing more robust. We have also been working on making Linux desktop work with the Mus+ash code path and started encouraging experiments of Mojo communication between the UI and GPU components of the Wayland platform.

Igalia Logo
Renesas Logo

It is really great to work with Antonio on this project and we are looking forward to continuing the collaboration on this with Google and Ozone developers. Last but not least, I would like to thank Renesas for supporting Igalia in this work to add Wayland support to chromium.

October 24, 2016 10:00 PM

October 20, 2016

Iago Toral

OpenGL terrain renderer update: Bloom and Cascaded Shadow Maps

Bloom Filter

The bloom filter makes bright objects glow and bleed through other objects positioned in between them and the camera. It is a common post-processing effect used all the time in video games and animated movies. The demo supports a couple of configuration options that control the intensity and behavior of the filter; here are some screenshots with different settings:

Bloom filter Off
Bloom filter On, default settings
Bloom filter On, intensity increased

I particularly like the glow effect that this brings to the specular reflections on the water surface, although to really appreciate that you need to run the demo and see it in motion.

Cascaded Shadow Maps

I should really write a post about basic shadow mapping before going into the details of Cascaded Shadow Maps, so for now I’ll just focus on the problem they try to solve.

One of the problems with shadow mapping is rendering high resolution shadows, especially for shadows that are rendered close to the camera. Generally, basic shadow mapping provides two ways in which we can improve the resolution of the shadows we render:

1. Increase the resolution of the shadow map textures. This one is obvious but comes at a high performance (and memory) hit.

2. Reduce the distance at which we can render shadows. But this is not ideal of course.

One compromise solution is to notice that, as usual with 3D computer graphics, it is far more important to render nearby objects in high quality than distant ones.

Cascaded Shadow Maps allow us to use different levels of detail for shadows that are rendered at different distances from the camera. Instead of having a single shadow map for all the shadows, we split the viewing frustum into slices and render shadows in each slice to a different shadow map.

There are two immediate benefits of this technique:

1. We have flexibility to define the resolution of the shadow maps for each level of the cascade, allowing us, for example, to increase the resolution of the levels closest to the camera and maybe reduce those that are further away.

2. Each level only records shadows in a slice of the viewing frustum, which increases shadow resolution even if we keep the same texture resolution we used with the original shadow map implementation for each shadow map level.
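To make the slicing concrete, here is a minimal sketch that computes where each slice of the viewing frustum ends. The post does not say how the demo picks its split distances, so the blend of uniform and logarithmic splits used here (a scheme commonly used for cascaded shadow maps) is an assumption, as are the function name ComputeCascadeSplits and its parameters.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns the far distance of each of `levels` frustum slices between
// `near_plane` and `far_plane`. `lambda` blends a uniform split (0.0) with
// a logarithmic split (1.0). This is an assumed scheme, not necessarily
// what the demo implements.
std::vector<float> ComputeCascadeSplits(float near_plane, float far_plane,
                                        int levels, float lambda) {
  assert(levels > 0);
  assert(near_plane > 0.0f && far_plane > near_plane);
  assert(lambda >= 0.0f && lambda <= 1.0f);

  std::vector<float> splits;
  splits.reserve(levels);
  for (int i = 1; i <= levels; ++i) {
    const float t = static_cast<float>(i) / static_cast<float>(levels);
    // Slices of equal depth.
    const float uniform = near_plane + (far_plane - near_plane) * t;
    // Slices whose depth grows geometrically with distance.
    const float logarithmic = near_plane * std::pow(far_plane / near_plane, t);
    splits.push_back(lambda * logarithmic + (1.0f - lambda) * uniform);
  }
  return splits;
}
```

With lambda closer to 1.0 the first slice becomes much thinner, so the shadow map for the nearest level covers less of the scene and yields sharper shadows near the camera.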

This approach also has some issues:

1. We need to render multiple shadow maps, which can be a serious performance hit depending on the resolutions of the shadow maps involved. This is why we usually lower the resolution of the shadow maps as distance from the camera increases.

2. As we move closer to or further from shadowed objects we can see the changes in shadow quality pop-in. Of course we can control this by avoiding drastic quality changes between consecutive levels in the cascade.

Here is an example that illustrates the second issue (in this case I have lowered the resolution of the 2nd and 3rd cascade levels to 50% and 25% respectively to make the effect more obvious). The screenshots show the rendering of the shadows at different distances. We can see how the shadows in the close-up shot are very sharp and as the distance increases they become blurrier due to the use of a lower resolution shadow map:

CSM level 0 (4096×4096)
CSM level 1 (2048×2048)
CSM level 2 (1024×1024)

The demo supports up to 4 shadow map levels although the default configuration is to use 3. The resolution of each level can be configured separately too; in the default configuration I lowered the shadow resolution of the second and third levels to 75% and 50% respectively. If we configure the demo to run on a single level (with 100% texture resolution), we are back to the original shadow map implementation, so it is easy to experiment with both techniques.
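The per-level percentages translate into texture sizes as in the short sketch below. The function name ShadowMapSizes is hypothetical, and the 4096 base size is borrowed from the screenshot captions above; whether the default configuration uses the same base size is an assumption.

```cpp
#include <cassert>
#include <vector>

// Returns the side length in texels of each cascade level's shadow map,
// given a base size and one scale factor per level in (0, 1].
std::vector<int> ShadowMapSizes(int base_size,
                                const std::vector<float>& scales) {
  assert(base_size > 0);
  std::vector<int> sizes;
  sizes.reserve(scales.size());
  for (float scale : scales) {
    assert(scale > 0.0f && scale <= 1.0f);
    sizes.push_back(static_cast<int>(base_size * scale));
  }
  return sizes;
}
```

Scales of 100%, 50% and 25% reproduce the 4096, 2048 and 1024 sizes from the captions, while 100%, 75% and 50% would give 4096, 3072 and 2048. Since the texel count grows with the square of the side length, halving a level's resolution cuts its memory use to a quarter.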

I intend to cover the details behind shadow mapping and the implementation of the bloom filter in more detail in a future post, so again, stay tuned for more!

by Iago Toral at October 20, 2016 10:32 AM

October 17, 2016

Asumu Takikawa

Lua Workshop 2016

Now that I work at Igalia on the networking team, I’ve started to write code in Lua (specifically with LuaJIT) in order to work with Snabb. Since I’m still a Lua newbie, I decided to attend this year’s Lua Workshop in San Francisco to learn what’s going on in the community.

The workshop was a lot of fun! Kudos to Mashape for hosting the event. The talks were easy to follow even for a Lua newbie like myself. I took notes on all of the talks, but I won’t go over all the details since the talks were recorded and you can just see for yourself. Instead, I wanted to go over some of my personal highlights.

The design of Lua

The day was kicked off by Roberto Ierusalimschy with a talk on the design of Lua. The talk was focused on how tradeoffs are an important part of designing programming languages. In particular, Roberto criticized the widespread “general-purpose” label for programming languages, since no PL is truly general-purpose. For example, a high-level language chock full of abstractions may have unavoidable performance overheads. A language that comes with the kitchen sink is much harder to port to tiny microcontrollers.

Given that, Lua makes explicit tradeoffs to excel along four particular axes: portability, simplicity, small size, and scripting. These design constraints make Lua especially suitable for applications such as embedding in TVs and video game engines. Apparently Lua is even embedded in the NetBSD kernel!

This led to an interesting discussion of the presence and lack of various features in today’s Lua. In particular, Roberto expressed some regret about adding the “three dots” varargs feature and also on the addition of goto in Lua 5.2.

As a Lisper, I was also really excited to hear that the Lua devs have been “dreaming” about adding macros into the language. I’m hoping that if they do add macros, they look at Matthew Flatt’s hygienic macros with sets of scopes that have subsequently been adopted into sweet.js. Although I imagine it will be difficult to scale it into a full-on phased macro system ala Racket because Lua’s modules live entirely at runtime (i.e., require is implemented as a function).

Lua history

Another fun talk on Lua proper was Daurnimator’s talk on the development history of Lua. The biggest shock for me from this talk was that Lua is still developed using RCS! Not even CVS or Subversion. It’s also a private repository that isn’t (yet) hosted for the public to see.

Of course, that makes it hard for outside people to contribute or even just learn from the code base. That’s where this talk comes in. Daurnimator has put in the work to port the Lua repo from RCS to git (via CVS import tools—apparently RCS and CVS share repo formats) in order to set up a public mirror.

My understanding is that the plan is to put up a public mirror that will host the full history. There was some murmuring at the workshop that Lua may switch over to git-based development in the future, though it sounded like the version on GitHub would remain as just a mirror.

You might be wondering where the git history starts, since apparently the initial versions of Lua weren’t even developed in source control. It turns out that the initial commit is recreated from a tarball that was found of one of the early releases from around 1993.

Finally, a fun fact from the talk is that despite the Lua developers’ insistence that “Lua” isn’t an acronym, there’s an early commit message that included the following: “Linguagem para Usuarios de Aplicacao”.

Applications of Lua

There were a number of libraries and applications of Lua that were presented at the workshop. One of the projects that I’d be most likely to use and most excited about was the Rosie Pattern Language presented by Jamie Jennings. The slides for the talk are here.

The basic motivation for the tool is that there are many DevOps or Data Science style tasks that require munging data from some raw data format, such as system logs. Many hackers just reach for regular expressions for this kind of task, but regexps often become a software engineering burden because they’re hard to understand and abstract. Not only that, many regexp engines are slow.

Rosie offers an alternative approach formulated as a language built over parser combinators. It comes with a big library of pre-defined patterns that make it easier to plug together existing solutions. There’s also a REPL for trying patterns interactively. The connection with Lua is that Lua is the implementation language and also the extension scripting language for Rosie.

Another cool talk was Justin Donaldson’s talk on creating a HaXe backend that emits Lua code. HaXe is a cross-platform language that compiles to a number of different target languages. It’s intended for use in areas like video games (Papers, Please! was written in it) and mobile apps. It seems like quite a neat language based on its list of features.

According to Justin there were some interesting challenges in constructing a Lua backend. Apparently accommodating Lua’s multiple value return facility was tricky, as well as dealing with the fact that the Lua VM has a hard limit of 200 local variables per function due to the maximum stack size.

Optimizing for LuaJIT

Yichun Zhang gave a talk on optimizing LuaJIT applications. One of the main takeaways for me from the talk was “use Flame Graphs for everything”. Brendan Gregg’s Flame Graphs are a visualization technique for stack traces, or generally any kind of cost that can be attributed in a stack-like manner. Examples include CPU profiling or I/O.

The I/O one was particularly interesting. Normally, profilers track the time that processes spend executing code. But in the case of I/O, you’re interested in off-CPU time in which processes are just waiting. Visualizing that in a flame graph gives you an idea of what processes are creating I/O bottlenecks in your application.
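To give an idea of what feeds such a graph, here is a minimal sketch of the aggregation step: sampled stacks are collapsed into one entry per unique stack, with the frames joined by semicolons and a count of how often that stack was seen. The names Stack and FoldStacks are hypothetical, and the sketch only covers the counting, not how the samples are collected or drawn.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One sampled stack trace, outermost frame first.
using Stack = std::vector<std::string>;

// Collapses samples into a map from "frame1;frame2;..." to the number of
// samples that had exactly that stack.
std::map<std::string, int> FoldStacks(const std::vector<Stack>& samples) {
  std::map<std::string, int> folded;
  for (const Stack& stack : samples) {
    assert(!stack.empty());
    std::string key;
    for (std::size_t i = 0; i < stack.size(); ++i) {
      if (i > 0) key += ';';
      key += stack[i];
    }
    ++folded[key];
  }
  return folded;
}
```

The same aggregation works whether a sample stands for time on the CPU or time spent waiting on I/O; only the way the samples are gathered changes.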

There were also some interesting points about feature uses that often come up in these flame graphs too. For example, creating functions in hot paths is obviously bad for performance, and it does pop up as lj_BC_FNEW calls in the graphs (see #50 in the slides linked in the tweet above). Another example is to avoid table creation and resizing if possible, such as by using table.clear to reuse an already allocated table. It was a fast-paced talk and I didn’t catch all the details, so I recommend you watch the recording whenever it comes out.

Putting on my Racket developer hat for a second, this talk made me wonder if it would be feasible to create flame graphs for Racket profiler outputs. I’ve made some progress on that, as this tweet shows:

I plan to release the tool for hooking Racket up to flame graphs soon.


Overall, I thought the Lua Workshop was great. Something that I enjoyed about the workshop is that there was a strong international presence. I met people from Australia, Brazil, China, Italy, Japan, the UK, and probably other places that I don’t remember. Unfortunately, while there were many countries represented, the gender diversity among the attendees wasn’t very high. That’s something that many language communities deal with, but I do hope that it gets better in the future.

Anyhow, happy Lua hacking!

October 17, 2016 12:19 PM

October 13, 2016

Iago Toral

OpenGL terrain renderer: rendering the terrain mesh

In this post I’ll discuss how I set up and render the terrain mesh in the OpenGL terrain rendering demo. Most of the relevant code for this is in the ter-terrain.cpp file.

Setting up a grid of vertices

Unless you know how to use a 3D modeling program properly, a reasonable way to create a decent mesh for a terrain consists of using a grid of vertices and elevating them according to a height map image. In order to create the grid we only need to decide how many rows and columns we want. This, in the end, determines the number of polygons and the resolution of the terrain.

We need to map these vertices to world coordinates too. We do that by defining a tile size, which is the distance between consecutive vertices in world units. Larger tile sizes increase the size of the terrain but lower the resolution by creating larger polygons.

The image below shows an 8×6 grid that defines 35 tiles. Each tile is rendered using 2 triangles:

8×6 terrain grid
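The relationship between grid size, tile size and tile count can be sketched as follows. This is not the demo's code: the function names are made up, and I am assuming a Y-up coordinate system with the grid laid out on the XZ plane starting at the origin.

```python
def build_grid(columns, rows, tile_size):
    """Return flat-terrain vertex positions (x, y, z) for a columns x rows grid."""
    return [(col * tile_size, 0.0, row * tile_size)
            for row in range(rows)
            for col in range(columns)]

def tile_count(columns, rows):
    """Tiles sit between vertices, so there is one fewer per axis."""
    return (columns - 1) * (rows - 1)

vertices = build_grid(8, 6, tile_size=0.5)
print(len(vertices))         # number of vertices in the 8x6 grid
print(tile_count(8, 6))      # number of tiles
print(tile_count(8, 6) * 2)  # number of triangles, 2 per tile
```

For the 8×6 grid this gives 48 vertices, 35 tiles and 70 triangles; doubling tile_size doubles the extent of the terrain without adding any polygons.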

The next step is to elevate these vertices so we don’t end up with a boring flat surface. We do this by sampling the height map image for each vertex in the grid. A height map is a gray scale image where the values of the pixels represent altitudes at different positions. The closer the color is to white, the more elevated it is.

Heightmap image

Adding more vertices to the grid increases the number of sampling points from the height map and reduces the sampling distances, leading to a smoother and more precise representation of the height map in the resulting terrain.

Sampling the heightmap to compute vertex heights

Of course, we still need to map the height map samples (gray scale colors) to altitudes in world units. In the demo I do this by normalizing the color values to [-1,+1] and then applying a scale factor to compute the altitude values in world space. By playing with the scaling factor we can make our terrain look more or less abrupt.

Altitude scale=6.0
Altitude scale=12.0

For reference, the height map sampling is implemented in ter_terrain_set_heights_from_texture().
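A minimal sketch of that mapping, assuming an 8-bit gray scale image stored as a 2D list and nearest-pixel sampling (the demo's actual function may sample differently; the names below are hypothetical):

```python
def altitude_from_gray(gray, scale):
    """Map an 8-bit gray value (0..255) to [-1, +1], then scale to world units."""
    normalized = (gray / 255.0) * 2.0 - 1.0
    return normalized * scale

def sample_heights(heightmap, columns, rows, scale):
    """Sample a 2D list of gray values at evenly spaced grid positions.

    Uses nearest-pixel sampling; columns and rows must both be at least 2.
    """
    img_h, img_w = len(heightmap), len(heightmap[0])
    heights = []
    for row in range(rows):
        py = round(row * (img_h - 1) / (rows - 1))
        line = []
        for col in range(columns):
            px = round(col * (img_w - 1) / (columns - 1))
            line.append(altitude_from_gray(heightmap[py][px], scale))
        heights.append(line)
    return heights

# Tiny 2x2 "image": black and white pixels map to -scale and +scale.
print(sample_heights([[0, 255], [255, 0]], 2, 2, scale=6.0))
```

With this formulation the scale factor is the maximum altitude in world units, so going from 6.0 to 12.0 doubles every height difference in the terrain.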

Creating the mesh

At this point we know the full position (x, y, z) in world coordinates of all the vertices in our grid. The next step is to build the actual triangle mesh that we will use to render the terrain and the normal vectors for each triangle. This process is described below and is implemented in the ter_terrain_build_mesh() function.

Computing normals

In order to get nice lighting on our terrain we need to compute normals for each vertex in the mesh. A simple way to achieve this would be to compute the normal for each face (triangle) and use that normal for each vertex in the triangle. This works, but it has 3 problems:

1. Every vertex of each triangle has the same exact normal, which leads to a rather flat result.

2. Adjacent triangles with different orientations showcase abrupt changes in the normal value, leading to significantly different lighting across the surfaces, which highlights the individual triangles in the mesh.

3. Because each vertex in the mesh can have a different normal value for each triangle it participates in, we need to replicate the vertices when we render, which is not optimal.

Alternatively, we can compute the normals for each vertex considering the heights of its neighboring vertices. This solves all the problems mentioned above and leads to much better results thanks to the interpolation of the normal vectors across the triangles, which leads to smooth lighting reflection transitions:

Flat normals
Smooth normals

The implementation for this is in the function calculate_normal(), which takes the column and row indices of the vertex in the grid and calculates the Y coordinate by sampling the heights of the 4 nearby vertices in the grid.
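One common way to derive a smooth per-vertex normal from neighboring heights is central differences. The sketch below illustrates that general technique and is not necessarily what calculate_normal() does; the function name, the Y-up convention and the clamping at the borders are my assumptions.

```python
import math

def vertex_normal(heights, col, row, tile_size):
    """Approximate the normal at a grid vertex from its 4 neighbours' heights.

    heights[row][col] holds altitudes; neighbours are clamped at the borders,
    which makes the result less accurate there.
    """
    rows, cols = len(heights), len(heights[0])
    left = heights[row][max(col - 1, 0)]
    right = heights[row][min(col + 1, cols - 1)]
    near = heights[max(row - 1, 0)][col]
    far = heights[min(row + 1, rows - 1)][col]
    # For a surface y = h(x, z) the normal is proportional to
    # (-dh/dx, 1, -dh/dz); central differences span 2 * tile_size.
    nx, ny, nz = left - right, 2.0 * tile_size, near - far
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)

flat = [[0.0] * 3 for _ in range(3)]
print(vertex_normal(flat, 1, 1, tile_size=1.0))  # points straight up
```

Because each vertex gets a single normal that depends only on the height field, neighboring triangles share it and the GPU interpolates between vertices, which is what produces the smooth transitions.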

Preparing the draw call

Now that we know the positions of all the vertices and their normal vectors we have all the information that we need to render the terrain. We still have to decide how exactly we want to render all the polygons.

The simplest way to render the terrain using a single draw call is to set up a vertex buffer with data for each triangle in the mesh (including position and normal information) and use GL_TRIANGLES for the primitive of the draw call. This, however, is not the best option from the point of view of performance.

Because the terrain will typically contain a large number of vertices and most of them participate in multiple triangles, we end up uploading a large amount of vertex data to the GPU and processing a lot of vertices in the draw call. The result is large memory requirements and suboptimal performance.

For reference, the terrain I used in the demo from my original post used a 251×251 grid. This grid represents 250×250 tiles, each one rendered as two triangles (6 vertices/tile), so we end up with 250x250x6=375,000 vertices. For each of these vertices we need to upload 24 bytes of vertex data with the position and normal, so we end up with a GPU buffer that is almost 9MB large.

One obvious way to reduce this is to render the terrain using triangle strips. The problem with this is that in theory, we can’t render the terrain with just one strip, we would need one strip (and so, one draw call) per tile column or one strip per tile row. Fortunately, we can use degenerate triangles to link separate strips for each column into a single draw call. With this we trim down the number of vertices to 126,000 and the size of the buffer to a bit below 3 MB. This alone produced a 15%-20% performance increase in the demo.

We can do better though. A lot of the vertices in the terrain mesh participate in various triangles across the large triangle strip in the draw call, so we can reduce memory requirements by using an index buffer to render the strip. If we do this, we trim things down to 63,000 vertices and ~1.5MB. This added another 4%-5% performance bonus over the original implementation.
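The vertex counts above can be reproduced with a small sketch that builds the index list for one strip per tile column, linked with degenerate triangles. The vertex ordering here is an assumption for illustration and not necessarily the layout used in the demo.

```python
def strip_indices(columns, rows):
    """Indices for one triangle strip per tile column, linked by degenerates."""
    indices = []
    for col in range(columns - 1):
        strip = []
        for row in range(rows):
            strip.append(row * columns + col)
            strip.append(row * columns + col + 1)
        if indices:
            # Repeat the last index of the previous strip and the first index
            # of this one: the resulting zero-area triangles join the strips.
            indices.append(indices[-1])
            indices.append(strip[0])
        indices.extend(strip)
    return indices

def visible_triangles(indices):
    """Count strip triangles that are not degenerate (three distinct indices)."""
    return sum(1 for i in range(len(indices) - 2)
               if len(set(indices[i:i + 3])) == 3)

BYTES_PER_VERTEX = 24  # position + normal, as stated above

columns = rows = 251
tiles = (columns - 1) * (rows - 1)
print("GL_TRIANGLES vertices:", tiles * 6)
print("strip vertices:", len(strip_indices(columns, rows)))
print("unique vertices:", columns * rows)
print("GL_TRIANGLES buffer bytes:", tiles * 6 * BYTES_PER_VERTEX)
```

For the 251×251 grid this yields 375,000 vertices for plain triangles, 125,998 for the linked strip and 63,001 unique vertices, which match the rounded figures quoted above. With an index buffer, only the unique vertices carry the 24 bytes of vertex data and the strip itself is stored as indices.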


So far we have been rendering the full mesh of the terrain in each frame and we do this by uploading the vertex data to the GPU just once (for example in the first frame). However, depending on where the camera is located and where it is looking, just a fraction of the terrain may be visible.

Although the GPU will discard all the geometry and fragments that fall outside the viewport, it still has to process each vertex in the vertex shader stage before it can clip non-visible triangles. Because the number of triangles in the terrain is large, this is suboptimal and to address this we want to do CPU-side clipping before we render.

Doing CPU-side clipping comes with some additional complexities though: it requires that we compute the visible region of the terrain and upload new vertex data to the GPU in each frame while preventing GPU stalls.

In the demo, we implement the clipping by computing a quad sub-region of the terrain that includes the visible area that we need to render. Once we know the sub-region that we want to render, we compute the new indices of the vertices that participate in the region so we can render it using a single triangle strip. Finally, we upload the new index data to the index buffer for use in the follow-up draw call.

Avoiding GPU stalls

Although all the above is correct, it actually leads, as described, to much worse performance in general. The reason for this is that our uploads of vertex data in each frame lead to frequent GPU stalls. This happens in two scenarios:

1. In the same frame, because we need to upload different vertex data for the rendering of the terrain for the shadow map and the scene (the shadow map renders the terrain from the point of view of the light, so the visible region of the terrain is different). This creates stalls because the rendering of the terrain for the shadow map might not have completed before we attempt to upload new data to the index buffer in order to render the terrain for the scene.

2. Between different frames. Because the GPU might not be completely done rendering the previous frame (and thus, still needs the index buffer data available) before we start preparing the next frame and attempt to upload new terrain index data for it.

In the case of the Intel Mesa driver, these GPU stalls can be easily identified by using the environment variable INTEL_DEBUG=perf. When using this, the driver will detect these situations and produce warnings informing about the stalls, the buffers affected and the regions of the buffers that generate the stall, such as:

Stalling on glBufferSubData(0, 503992) (492kb) to a busy (0-1007984)
buffer object.  Use glMapBufferRange() to avoid this.

The solution to this problem that I implemented (other than trying to put as much work as possible between read/write accesses to the index buffer) comes in two forms:

1. Circular buffers

In this case, we allocate a larger buffer than we need so that each subsequent upload of new index data happens in a separate sub-region of the allocated buffer. I set up the demo so that each circular buffer is large enough to hold the index data required for all updates of the index buffer happening in each frame (the shadow map and the scene).

2. Multi-buffering

We allocate more than one circular buffer. When we don’t have enough free space at the end of the current buffer to upload the new index buffer data, we upload it to a different circular buffer instead. When we run out of buffers we circle back to the first one (which at this point will hopefully be free to be re-used again).

So why not just use a single, very large circular buffer? Mostly because there are limits to the size of the buffers that the GPU may be able to handle correctly (or efficiently). Also, why not have many smaller independent buffers instead of circular buffers? That would work just fine, but using fewer, larger buffers reduces the number of objects we need to bind/unbind and is better to prevent memory fragmentation, so that’s a plus.
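The bookkeeping behind the two techniques can be sketched without any OpenGL calls. The class below only decides which buffer and which offset the next upload should use; the class name and the sizes are hypothetical, and it does not check whether the GPU has actually finished with a region before handing it out again.

```python
class MultiRingAllocator:
    """Hand out upload regions from several fixed-size circular buffers."""

    def __init__(self, buffer_count, buffer_size):
        self.buffer_count = buffer_count
        self.buffer_size = buffer_size
        self.current = 0   # buffer receiving uploads right now
        self.offset = 0    # first free byte in the current buffer

    def allocate(self, size):
        """Return (buffer_index, offset) of a region for `size` bytes."""
        if size > self.buffer_size:
            raise ValueError("upload larger than a single buffer")
        if self.offset + size > self.buffer_size:
            # Not enough room at the end: move to the next buffer, wrapping
            # around to the first one once all of them have been used.
            self.current = (self.current + 1) % self.buffer_count
            self.offset = 0
        region = (self.current, self.offset)
        self.offset += size
        return region

allocator = MultiRingAllocator(buffer_count=2, buffer_size=100)
for upload_size in (60, 60, 30, 60):
    print(allocator.allocate(upload_size))
```

Each returned pair would then be used as the target of the index data upload and as the base offset for the following draw call.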

Final touches

We are almost done. At this point we only need to add a texture to the terrain surface, add some slight fog effect for distant pixels to create a more realistic look, add a skybox (it is important to choose the color of the fog so it matches the color of the sky!) and tweak the lighting parameters to get a nice result:

Final rendering

I hope to cover some of these aspects in future posts, so stay tuned for more!

by Iago Toral at October 13, 2016 02:38 PM

October 12, 2016

Manuel Rego

Grid Layout Summertime

Summer season is over and, despite some nice holidays to rest, the Igalia team kept making progress on the CSS Grid Layout implementation in Chromium/Blink and Safari/WebKit as part of our ongoing collaboration with Bloomberg.

At the end of September the spec transitioned to Candidate Recommendation (CR); Rachel Andrew wrote a blog post explaining what that means. However, in the previous months there were a few changes here and there that required the implementations to be updated.

If you remember my presentation at the last BlinkOn, we were reviewing the status of the implementation. At that time there were a bunch of things marked as WIP or TODO, but now most of them have already been implemented, as I’ll explain in this post.

Auto repeat

My mate Sergio Villar already explained this feature thoroughly in a blog post so I won’t repeat it.

The new thing is that the auto-fit keyword implementation has already landed, so you can use it too. auto-fit allows you to collapse the tracks that don’t contain any item.

So now you can do something like this:

<div style="display: grid; width: 700px;
            grid-template-columns: repeat(auto-fit, 150px); grid-template-rows: 100px;">
    <div style="grid-column: 1 / span 2;">grid-column: 1 / span 2</div>
    <div style="grid-column: 4;">grid-column: 4</div>
</div>

auto-fit example

In the example, there’s room for 4 columns of 150px. However, as the third column doesn’t have any item, it’s collapsed to 0px, and you end up effectively having only 3 columns in your grid container. If you were using the auto-fill keyword instead, the third column wouldn’t be collapsed and would remain empty.
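The arithmetic behind that example can be sketched as follows. This is a deliberately simplified model (a definite container width, a single fixed-size repeated track and no gutters), not the full algorithm from the spec, and the function name is made up.

```python
def auto_fit_columns(container_width, track_size, occupied):
    """Return column widths for repeat(auto-fit, track_size) without gaps.

    `occupied` is the set of 1-based column numbers that contain items.
    """
    repetitions = max(1, int(container_width // track_size))
    return [track_size if column in occupied else 0
            for column in range(1, repetitions + 1)]

# First item spans columns 1-2, second item sits in column 4.
print(auto_fit_columns(700, 150, occupied={1, 2, 4}))
```

In this model auto-fill would produce the same number of repetitions but keep every column at 150px, whether or not it contains an item.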

Multiple tracks

One of the changes in the spec was adding the possibility to specify more than one track in some of the properties (grid-auto-columns and grid-auto-rows) and in the repeat() notation.

So for example you can now pass a track list to the grid-auto-columns property like:

<div style="display: grid;
            grid-auto-columns: 200px 50px; grid-template-rows: 100px;">
    <div style="grid-row: 1;">A</div>
    <div style="grid-row: 1;">B</div>
    <div style="grid-row: 1;">C</div>
    <div style="grid-row: 1;">D</div>
</div>

Multiple tracks on grid-auto-columns example

As you see the first and third columns will have a 200px width, and the second and fourth ones will have a 50px width.

On top of that, you can also use a track list on the repeat() notation:

<div style="display: grid;
            grid-template-columns: repeat(3, 200px 50px); grid-template-rows: 100px;">
    <div style="grid-row: 1;">A</div>
    <div style="grid-row: 1;">B</div>
    <div style="grid-row: 1;">C</div>
    <div style="grid-row: 1;">D</div>
    <div style="grid-row: 1;">E</div>
    <div style="grid-row: 1;">F</div>
</div>

Multiple tracks on repeat() notation example

This seems a small change, and it was not very hard to implement. However, this combined with the auto repeat feature required a few more changes than expected to make everything work properly.

fit-content() cap

The fit-content keyword has been updated and now you can pass an argument to it that is used as a maximum size. As an argument for the fit-content() function you can specify either a length or a percentage (that will be resolved like for percentage tracks).

So in Grid Layout now you can use fit-content(argument) and the track size will be clamped at the argument:

<div style="display: inline-grid;
            grid-template-columns: repeat(2, fit-content(200px)); grid-template-rows: 100px;">
    <div>an item</div>
    <div>a very long grid item</div>
</div>

fit-content() function example

As you can see, the first column uses fit-content so its size is adapted to the item inside. However, the second column would need more than 200px for the whole item to fit, so its size is clamped to 200px.

Percentage support for grid gutters

Grid gutters were added a while ago to the different implementations, but we were lacking support for percentage gaps.

We’ve been updating the implementation so now you can use percentages too (again they’re resolved like percentage tracks):

<div style="display: grid; width: 400px;
            grid-column-gap: 10%;
            grid-auto-columns: 1fr; grid-auto-rows: 100px;">
    <div style="grid-row: 1;">A</div>
    <div style="grid-row: 1;">B</div>
    <div style="grid-row: 1;">C</div>
    <div style="grid-row: 1;">D</div>
</div>

Percentage gaps example

The 10% column gap is resolved against the width of the grid container to 40px. Then the 1fr tracks take the available space, which in this case means 70px for each column.
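The same calculation, written out as a small sketch (the function name is made up, and it only covers this simple case of equal 1fr tracks in a container with a definite width):

```python
def resolve_columns(container_width, column_count, gap_percent):
    """Resolve a percentage gap, then share the rest among equal 1fr tracks."""
    gap = container_width * gap_percent / 100
    free_space = container_width - gap * (column_count - 1)
    return gap, free_space / column_count

gap, column = resolve_columns(400, 4, 10)
print(gap, column)
```

Four columns have three gaps between them, so 120px of the 400px go to gutters and the remaining 280px are split evenly.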

The percentage support for grid gaps is marked as at-risk in the spec; however, it has already been implemented in Gecko and Blink, so it probably won’t be removed from this level.

New syntax for grid shorthand

The grid shorthand syntax has been modified so now you can specify in just one property, the explicit grid in one axis and the implicit grid and the auto-placement algorithm mode in the other one.

For example, if you want a grid that has 2 columns of a given size and the rows will be created on demand, you could use the following syntax:

<div style="display: grid;
            grid: auto-flow 100px / 200px 100px;">
</div>

grid shorthand example

Orthogonal flows

Orthogonal flows refers to situations when the writing mode of the grid container uses a different direction than some of its items. For example, if you have a horizontal grid container and some vertical items inside it.

This has been a long task but right now you can use orthogonal flows on grid layout already. The basic support has been implemented including alignment of orthogonal items.

So now you can use vertical items inside a horizontal grid layout:

<div style="display: inline-grid;">
  <div style="grid-row: 1;">Test</div>
  <div style="grid-row: 1; writing-mode: vertical-lr;">Chrome</div>
  <div style="grid-row: 1; writing-mode: vertical-lr;">Firefox</div>
  <div style="grid-row: 1; writing-mode: vertical-lr;">Safari</div>
</div>

Orthogonal flows example

New normal value for alignment properties

This is a pretty complex issue from the specs and implementations point of view (as you can see in the epic review for this patch), but probably not very important for the end users because the default behavior on Grid Layout won’t be altered. Now there’s a new value normal for the alignment properties: justify-content, align-content, justify-items, align-items, justify-self and align-self.

The interesting thing regarding this new normal value is that it behaves differently depending on the layout model. In general, the behavior in the Grid Layout case is the same as when you use the stretch keyword.


This is just a high-level overview of the main things that happened during the summer on the Blink and WebKit implementations led by Igalia. Of course, I’m leaving out much more stuff, like lots of bug fixes and even performance optimizations (e.g. nested grids are now 350% faster than before).

Maybe for some of you some of these changes are really great; maybe for others they’re not very relevant. Anyway, the good news is that things keep moving forward, and the implementations are getting closer and closer to the specification. This means that we’re on the right path to have CSS Grid Layout enabled on most of the main browsers soon.

Igalia and Bloomberg working together to build a better web

Last, but not least, as usual we’d like to thank our friends at Bloomberg for all their amazing support which allows us to keep pushing Grid Layout.

October 12, 2016 10:00 PM

Andy Wingo

An incomplete history of language facilities for concurrency

I have lately been in the market for better concurrency facilities in Guile. I want to be able to write network servers and peers that can gracefully, elegantly, and efficiently handle many tens of thousands of clients and other connections, but without blowing the complexity budget. It's a hard nut to crack.

Part of the problem is implementation, but a large part is just figuring out what to do. I have often thought that modern musicians must be crushed under the weight of recorded music history, but it turns out in our humble field that's also the case; there are as many concurrency designs as languages, just about. In this regard, what follows is an incomplete, nuanced, somewhat opinionated history of concurrency facilities in programming languages, with an eye towards what I should "buy" for the Fibers library I have been tinkering on for Guile.

* * *

Modern machines have the raw capability to serve hundreds of thousands of simultaneous long-lived connections, but it’s often hard to manage this at the software level. Fibers tries to solve this problem in a nice way. Before discussing the approach taken in Fibers, it’s worth spending some time on history to see how we got here.

One of the most dominant patterns for concurrency these days is “callbacks”, notably in the Twisted library for Python and the Node.js run-time for JavaScript. The basic observation in the callback approach to concurrency is that the efficient way to handle tens of thousands of connections at once is with low-level operating system facilities like poll or epoll. You add all of the file descriptors that you are interested in to a “poll set” and then ask the operating system which ones are readable or writable, as appropriate. Once the operating system says “yes, file descriptor 7145 is readable”, you can do something with that socket; but what? With callbacks, the answer is “call a user-supplied closure”: a callback, representing the continuation of the computation on that socket.
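As a minimal sketch of that pattern, here is one turn of such an event loop using Python's standard selectors module, which wraps poll, epoll and similar facilities. The connected socket pair merely stands in for a client connection, and the callback name is made up.

```python
import selectors
import socket

selector = selectors.DefaultSelector()
received = []

def on_readable(sock):
    """Callback: the continuation of the computation on this socket."""
    data = sock.recv(1024)
    received.append(data)
    selector.unregister(sock)

# A connected socket pair stands in for a client connection.
client, server = socket.socketpair()
server.setblocking(False)
selector.register(server, selectors.EVENT_READ, data=on_readable)

client.sendall(b"hello")

# One turn of the event loop: ask the OS which descriptors are ready and
# call the closure registered for each of them.
for key, _events in selector.select(timeout=1):
    callback = key.data
    callback(key.fileobj)

client.close()
server.close()
selector.close()
print(received)
```

Everything that should happen after the read has to live inside on_readable (or in further callbacks it registers), which is the inversion of control flow discussed next.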

Building a network service with a callback-oriented concurrency system means breaking the program into little chunks that can run without blocking. Wherever a program could block, instead of just continuing the program, you register a callback. Unfortunately this requirement permeates the program, from top to bottom: you always pay the mental cost of inverting your program’s control flow by turning it into callbacks, and you always incur the run-time cost of closure creation, even when the particular I/O could proceed without blocking. It’s a somewhat galling requirement, given that this contortion is required of the programmer, but could be done by the compiler. We Schemers demand better abstractions than manual, obligatory continuation-passing-style conversion.

Callback-based systems also encourage unstructured concurrency, as in practice callbacks are not the only path for data and control flow in a system: usually there is mutable global state as well. Without strong patterns and conventions, callback-based systems often exhibit bugs caused by concurrent reads and writes to global state.

Some of the problems of callbacks can be mitigated by using “promises” or other library-level abstractions; if you’re a Haskell person, you can think of this as lifting all possibly-blocking operations into a monad. If you’re not a Haskeller, that’s cool, neither am I! But if your typey spidey senses are tingling, it’s for good reason: with promises, your whole program has to be transformed to return promises-for-values instead of values anywhere it would block.

An obvious solution to the control-flow problem of callbacks is to use threads. In the most generic sense, a thread is a language feature which denotes an independent computation. Threads are created by other threads, but fork off and run independently instead of returning to their caller. In a system with threads, there is implicitly a scheduler somewhere that multiplexes the threads so that when one suspends, another can run.

In practice, the concept of threads is often conflated with a particular implementation, kernel threads. Kernel threads are very low-level abstractions that are provided by the operating system. The nice thing about kernel threads is that they can use any CPU that the kernel knows about. That’s an important factor in today’s computing landscape, where Moore’s law seems to be giving us more cores instead of more gigahertz.

However, as a building block for a highly concurrent system, kernel threads have a few important problems.

One is that kernel threads simply aren’t designed to be allocated in huge numbers, and instead are more optimized to run in a one-per-CPU-core fashion. Their memory usage is relatively high for what should be a lightweight abstraction: some 10 kilobytes at least and often some megabytes, in the form of the thread’s stack. There are ongoing efforts to reduce this for some systems but we cannot expect wide deployment in the next 5 years, if ever. Even in the best case, a hundred thousand kernel threads will take at least a gigabyte of memory, which seems a bit excessive for book-keeping overhead.

Kernel threads can be a bit irritating to schedule, too: when one thread suspends, it’s for a reason, and it can be that user-space knows a good next thread that should run. However because kernel threads are scheduled in the kernel, it’s rarely possible for the kernel to make informed decisions. There are some “user-mode scheduling” facilities that are in development for some systems, but again only for some systems.

The other significant problem is that building non-crashy systems on top of kernel threads is hard to do, not to mention “correct” systems. It’s an embarrassing situation. For one thing, the low-level synchronization primitives that are typically provided with kernel threads, mutexes and condition variables, are not composable. Also, as with callback-oriented concurrency, one thread can silently corrupt another via unstructured mutation of shared state. It’s worse with kernel threads, though: a kernel thread can be interrupted at any point, not just at I/O. And though callback-oriented systems can theoretically operate on multiple CPUs at once, in practice they don’t. This restriction is sometimes touted as a benefit by proponents of callback-oriented systems, because in such a system, the callback invocations have a single, sequential order. With multiple CPUs, this is not the case, as multiple threads can run at the same time, in parallel.

Kernel threads can work. The Java virtual machine does at least manage to prevent low-level memory corruption and to do so with high performance, but still, even Java-based systems that aim for maximum concurrency avoid using a thread per connection because threads use too much memory.

In this context it’s no wonder that there’s a third strain of concurrency: shared-nothing message-passing systems like Erlang. Erlang isolates each thread (called processes in the Erlang world), giving each its own heap and “mailbox”. Processes can spawn other processes, and the concurrency primitive is message-passing. A process that tries to receive a message from an empty mailbox will “block”, from its perspective. In the meantime the system will run other processes. Message sends never block, oddly; instead, sending to a process with many messages pending makes it more likely that Erlang will pre-empt the sending process. It’s a strange tradeoff, but it makes sense when you realize that Erlang was designed for network transparency: the same message send/receive interface can be used to send messages to processes on remote machines as well.

No network is truly transparent, however. At the most basic level, the performance of network sends should be much slower than local sends. Whereas a message sent to a remote process has to be written out byte-by-byte over the network, there is no need to copy immutable data within the same address space. The complexity of a remote message send is O(n) in the size of the message, whereas a local immutable send is O(1). This suggests that hiding the different complexities behind one operator is the wrong thing to do. And indeed, given byte read and write operators over sockets, it’s possible to implement remote message send and receive as a process that serializes and parses messages between a channel and a byte sink or source. In this way we get cheap local channels, and network shims are under the programmer’s control. This is the approach that the Go language takes, and is the one we use in Fibers.

Structuring a concurrent program as separate threads that communicate over channels is an old idea that goes back to Tony Hoare’s work on “Communicating Sequential Processes” (CSP). CSP is an elegant tower of mathematical abstraction whose layers form a pattern language for building concurrent systems that you can still reason about. Interestingly, it does so without any concept of time at all, instead representing a thread’s behavior as a trace of instantaneous events. Threads themselves are like functions that unfold over the possible events to produce the actual event trace seen at run-time.

This view of events as instantaneous happenings extends to communication as well. In CSP, one communication between two threads is modelled as an instantaneous event, partitioning the traces of the two threads into “before” and “after” segments.

Practically speaking, this has ramifications in the Go language, which was heavily inspired by CSP. You might think that a channel is just an asynchronous queue that blocks when writing to a full queue, or when reading from an empty queue. That’s a bit closer to the Erlang conception of how things should work, though as we mentioned, Erlang simply slows down writes to full mailboxes rather than blocking them entirely. However, that’s not what Go and other systems in the CSP family do; sending a message on a channel will block until there is a receiver available, and vice versa. The threads are said to “rendezvous” at the event.
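To make the rendezvous behavior concrete, here is a small sketch of an unbuffered channel. It is built on kernel threads and semaphores purely to illustrate the semantics; it is not how Go or Fibers implement channels, and the class name is made up.

```python
import threading

class RendezvousChannel:
    """Unbuffered channel: send() returns only once a receiver took the value."""

    def __init__(self):
        self._send_turn = threading.Lock()          # one sender at a time
        self._item_ready = threading.Semaphore(0)   # signals a pending value
        self._item_taken = threading.Semaphore(0)   # signals it was received
        self._item = None

    def send(self, value):
        with self._send_turn:
            self._item = value
            self._item_ready.release()
            self._item_taken.acquire()   # block until a receiver shows up

    def recv(self):
        self._item_ready.acquire()       # block until a sender shows up
        value = self._item
        self._item_taken.release()
        return value

channel = RendezvousChannel()
worker = threading.Thread(target=lambda: [channel.send(n) for n in range(3)])
worker.start()
print([channel.recv() for _ in range(3)])
worker.join()
```

The value is never queued inside the channel: the sender holds on to it until a receiver arrives, which is the sense in which messages are owned by threads rather than by channels.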

Unbuffered channels have the interesting property that you can select between sending a message on channel a or channel b, and in the end only one message will be sent; nothing happens until there is a receiver ready to take the message. In this way messages are really owned by threads and never by the channels themselves. You can of course add buffering if you like, simply by making a thread that waits on either sends or receives on a channel, and which buffers sends and makes them available to receives. It’s also possible to add explicit support for buffered channels, as Go, core.async, and many other systems do, which can reduce the number of context switches as there is no explicit buffer thread.

Whether to buffer or not to buffer is a tricky choice. It’s possible to implement singly-buffered channels in a system like Erlang via an explicit send/acknowledge protocol, though it seems difficult to implement completely unbuffered channels. As we mentioned, it’s possible to add buffering to an unbuffered system by the introduction of explicit buffer threads. In the end though in Fibers we follow CSP’s lead so that we can implement the nice select behavior that we mentioned above.

As a final point, select is OK but is not a great language abstraction. Say you call a function and it returns some kind of asynchronous result which you then have to select on. It could return this result as a channel, and that would be fine: you can add that channel to the other channels in your select set and you are good. However, what if what the function does is receive a message on a channel, then do something with the message? In that case the function should return a channel, plus a continuation (as a closure or something). If select results in a message being received over that channel, then we call the continuation on the message. Fine. But, what if the function itself wanted to select over some channels? It could return multiple channels and continuations, but that becomes unwieldy.

What we need is an abstraction over asynchronous operations, and that is the main idea of a CSP-derived system called “Concurrent ML” (CML). Originally implemented as a library on top of Standard ML of New Jersey by John Reppy, CML provides this abstraction, which in Fibers is called an operation [1]. Calling send-operation on a channel returns an operation, which is just a value. Operations are like closures in a way; a closure wraps up code in its environment, which can be later called many times or not at all. Operations likewise can be performed [2] many times or not at all; performing an operation is like calling a function. The interesting part is that you can compose operations via the wrap-operation and choice-operation combinators. The former lets you bundle up an operation and a continuation. The latter lets you construct an operation that chooses over a number of operations. Calling perform-operation on a choice operation will perform one and only one of the choices. Performing an operation will call its wrap-operation continuation on the resulting values.

While it’s possible to implement Concurrent ML in terms of Go’s channels and baked-in select statement, it’s more expressive to do it the other way around, as that also lets us implement other operation types besides channel send and receive, for example timeouts and condition variables.
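
To make the combinators a bit more concrete, here is a toy model in Python. It is a sketch under loud assumptions, not how Fibers or CML work internally: the Mailbox class, make_base_operation and the (ready, value) polling protocol are all made up for illustration, and where a real implementation would suspend the caller until some alternative is ready, this one simply raises.

```python
from collections import deque


class Operation:
    """A plain value describing actions that may or may not be ready."""

    def __init__(self, alternatives):
        # Each alternative is a (try_fn, wrap_fn) pair. try_fn() returns
        # (True, value) if the action completed, else (False, None).
        self.alternatives = list(alternatives)


def make_base_operation(try_fn):
    """Hypothetical helper: one alternative with an identity continuation."""
    return Operation([(try_fn, lambda value: value)])


def wrap_operation(op, f):
    """Bundle an operation with a continuation f applied to its result."""
    return Operation([(try_fn, lambda value, wrap=wrap: f(wrap(value)))
                      for try_fn, wrap in op.alternatives])


def choice_operation(*ops):
    """Build an operation that completes one and only one of ops."""
    return Operation([alt for op in ops for alt in op.alternatives])


def perform_operation(op):
    """Complete the first ready alternative, then call its continuation."""
    for try_fn, wrap in op.alternatives:
        ready, value = try_fn()
        if ready:
            return wrap(value)
    # Assumption: a real implementation would block here instead of raising.
    raise RuntimeError("no alternative is ready")


class Mailbox:
    """Hypothetical buffered stand-in for a channel."""

    def __init__(self):
        self.items = deque()

    def put(self, item):
        self.items.append(item)

    def get_operation(self):
        def try_get():
            if self.items:
                return True, self.items.popleft()
            return False, None
        return make_base_operation(try_get)


requests, timeouts = Mailbox(), Mailbox()
op = choice_operation(
    wrap_operation(requests.get_operation(), lambda msg: ("request", msg)),
    wrap_operation(timeouts.get_operation(), lambda msg: ("timeout", msg)),
)
timeouts.put(30)
print(perform_operation(op))  # ('timeout', 30)
```

Note that op is just a value: a function can build it, return it, and let the caller fold it into a bigger choice_operation, which is exactly what a bare select statement makes awkward.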

¹ CML uses the term event, but I find this to be a confusing name. In this isolated article my terminology probably looks confusing, but in the context of the library I think it can be OK. The jury is out though.

² In CML, synchronized.

* * *

Well, that's my limited understanding of the crushing weight of history. Note that part of this article is now in the Fibers manual.

Thanks very much to Matthew Flatt, Matthias Felleisen, and Michael Sperber for pushing me towards CML. In the beginning I thought its benefits were small and complication large, but now I see it as being the reverse. Happy hacking :)

by Andy Wingo at October 12, 2016 01:45 PM

October 11, 2016

Javier Muñoz

Multipart Upload (Copy part) goes upstream in Ceph

The last Upload Part (Copy) patches went upstream in Ceph some days ago. This new feature is available in the master branch now, and it will ship with the first development checkpoint for Kraken.

In S3, this feature is used to copy/move data using an existing object as data source in the storage backend instead of downloading/uploading the object to achieve the same effect via the request body.

This extension, part of the Multipart Upload API, reduces the required bandwidth between the RGW cluster and the final user when copying/moving existing objects in specific use cases.

In this post I will introduce the feature, explain how the concept maps to Ceph, and show how it works under the hood.

Amazon S3 and the Multipart Upload API

In November 2010 AWS announced the Multipart Upload API to allow faster and more flexible uploads into Amazon S3.

Before the Multipart Upload API, large object uploads often ran into network issues on limited-bandwidth connections. If there was an error during the upload, you had to restart it. In some cases the network issues were not transient, and it was difficult (or maybe impossible) to upload those larger objects with low-end connections.

The other issue is related to the size of the object, and the time this information is available. Before the Multipart Upload API it was not possible to upload an object with unknown size. You had to wait to have the whole object in place before starting the upload.

In addition to these inconveniences there was also the inability to upload in parallel. The uploads had to be linear.

The next picture shows how things were working before the Multipart Upload API.

As you can see, the upload process with the usual Upload API runs straight through, but it may raise issues with larger objects in some cases.

The Multipart Upload API resolves these limitations by letting you upload a single object as a set of parts. After all the parts of your object are uploaded, Amazon S3 then presents the data as a single object.

With this feature you can create parallel uploads, pause and resume an object upload, and begin uploads before you know the total object size.

A typical multipart upload consists of four steps:

  1. Initiate the Multipart Upload
  2. Separate the object into multiple parts
  3. Upload the parts in any order, one at a time or in parallel
  4. Complete the upload

The official documentation also describes how the Multipart Upload API should interact with the Object Versioning or Bucket Lifecycle support in S3.

Another great overview of the Multipart Upload API is the official Amazon S3 Multipart Upload introduction by the S3 product managers.

Ceph and the Multipart Upload (Copy part) API

Ceph got the initial Multipart Upload API core support in June 2011, just 6-7 months after the API was publicly announced as part of Amazon S3. It shipped with Argonaut (v0.48), the first major 'stable' release of Ceph, in July 2012, although the feature went in around v0.29. Truly impressive timing.

The Multipart Upload API feature is a sensitive feature. It works in tandem with other parts of the code such as Object Versioning, ACL granting, AWS2/AWS4 auth and so on.

Some of these features were merged after the original Multipart Upload API code was upstreamed, so the Multipart Upload API may be considered one of the RGW S3 features requiring continuous evolution to keep up with the integration and growth of the new features going in.

Note that if you look at the Multipart Upload API description above and you need to copy/move an existing object, there is no step that lets you do it efficiently.

Before this API, RGW S3 users were copying/moving existing objects by downloading and uploading the objects again.

With this new feature upstream, the efficient way to copy/move an existing object becomes the 'Upload Part (Copy)' option. The 'Upload Part (Copy)' option works by specifying the source object through the 'x-amz-copy-source' and 'x-amz-copy-source-range' request headers.

This improvement allows the user to drive the multipart upload process (steps 1, 2, 3 and 4) while steps 2 and 3 handle data coming from the storage backend instead of the request body.
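
As a rough illustration of step 2 in the copy case, the following Python sketch splits a source object into byte ranges and builds the two headers for each 'Upload Part (Copy)' request. It only computes values and sends nothing over the network; the function names, bucket and key are hypothetical, and it is not the test script referred to in this post. The range syntax (bytes=first-last, inclusive) and the /bucket/key form of the source are the ones documented for S3, where every part except the last is also expected to be at least 5 MB.

```python
from urllib.parse import quote


def copy_part_ranges(object_size, part_size):
    """Split object_size bytes into (part_number, range_header_value) pairs."""
    if object_size <= 0 or part_size <= 0:
        raise ValueError("sizes must be positive")
    parts = []
    for index, first in enumerate(range(0, object_size, part_size)):
        last = min(first + part_size, object_size) - 1  # ranges are inclusive
        parts.append((index + 1, "bytes=%d-%d" % (first, last)))  # parts start at 1
    return parts


def copy_part_headers(source_bucket, source_key, byte_range):
    """Build the two request headers that select the source data."""
    return {
        # The source path is URL-encoded; "/" is left untouched by quote().
        "x-amz-copy-source": quote("/%s/%s" % (source_bucket, source_key)),
        "x-amz-copy-source-range": byte_range,
    }


MiB = 1024 * 1024
for number, byte_range in copy_part_ranges(12 * MiB, 5 * MiB):
    print(number, copy_part_headers("source-bucket", "big.iso", byte_range))
```

Each (part number, headers) pair corresponds to one request in step 3, and since the ranges are independent they can be issued in any order or in parallel.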

The next picture shows how the data reading happens in RGW S3 during the new copy process (steps 2 and 3 are affected).

Using the Multipart Upload (Copy part) API with Ceph

As you can imagine, the new code is a bandwidth saver in specific use cases where the final object needs to be, at least partially, composed of objects that already exist in the system.

Feel free to use this Python code to test the new API.



My work in Ceph is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ceph development team. Thanks Yehuda and Casey for all your support to go upstream!

by Javier at October 11, 2016 10:00 PM

October 09, 2016

Javier Fernández

Web Engines Hackfest 2016

Last week I attended the Web Engines Hackfest 2016, hosted by Igalia at the HQ premises in A Coruña. For those still unaware, it’s an unconference-like event focused on pure hacking and technical discussions about the main Web Engines supporting the Web Platform.


This year there was a very interesting group of hackers, representing most of the main web engines, like Mozilla’s Gecko and Servo, Google’s Blink, Apple’s WebKit and Igalia’s WebKitGTK+.

Hacking log

This year I was totally focused on hacking on the Blink web engine to solve some of the most complex issues of the CSS Grid Layout feature. I’d like to start by thanking Google for joining the hackfest and sending Christian Biesinger, with special thanks to him for taking such a long flight to attend. It was a pleasure to work with him on-site and have the opportunity to discuss these complex issues face to face.

Day 1

During the weeks before the hackfest I had been working on fixing bug 628565, reported several months ago but hitting me quite often recently. It’s a very ugly issue, since it shows unpredictable behavior in the Grid layout logic, so I decided to fix it once and for all. I managed to provide 2 different approaches, which can be analyzed in the following code review issues: Issue 2361373002 and Issue 2333583002. Basically, the root cause of both issues is the same: the grid container’s intrinsic size is not computed correctly when there are grid items with orthogonal flows.

There are 2 fundamental concepts that are key for understanding this problem:

  • Intrinsic size computation must be done before layout.
  • Orthogonal flow boxes need to be laid out in order to compute min-content contribution to their container’s intrinsic width.

So, we had a single bug to fix in order to solve two different problems: unpredictable grid layout logic and incorrect computation of the grid’s intrinsic size. The first problem is the one reported in bug 628565. I’ll try to describe it here briefly, but further details can be obtained from the bug report. Let’s consider the following code:

<style>
   body { overflow: hidden; }
</style>
<div style="width: fit-content; display: grid; grid-template-rows: 50px; border: 5px solid; font: 25px/1 Ahem;">
   <div style="writing-mode: vertical-lr; color: magenta; background: cyan;">XX X</div>
</div>

The following pictures illustrate the issue when loading the code above with Google Chrome 55.0.2859.0 (Official Build) dev (64-bit) Revision 3f63c614e8c4501b1bfa3f608e32a9d12618b0a0-refs/heads/master@{#418117} on Linux:

Chrome BEFORE resizing

Chrome AFTER resizing

Christian and I carefully analyzed Blink’s layout code and how it deals with orthogonal boxes. It’s worth mentioning that there are important issues with orthogonal flows in almost every web engine, but Blink has lately made some improvements in this regard. See how a very basic example works in the Blink, Gecko and WebKit engines:

<div style="float: left; border: 5px solid; font: 25px/1 Ahem;">
  <div style="writing-mode: vertical-lr; color: magenta; background: cyan;">XX X</div>
</div>


Blink has implemented a kind of pre-layout logic which is executed for any orthogonal flow box even before doing the actual layout, when the box’s intrinsic size computation takes place. This solution allows using more precise information about an orthogonal box’s height when computing its container’s intrinsic width; otherwise zero width would be assumed, as is the case in the WebKit and Gecko engines. However, this approach leads to incorrect behavior when using Grid Layout, as I’ll explain later.

My first approach to solving these two issues was to update the grid container’s intrinsic size after its orthogonal children were laid out. Hence, we could use the actual size of the children to compute their container’s intrinsic size. However, this violates the first of the two rules stated above: no layout should be done for intrinsic size computation.

Layout Breakout Session

During the evening we held the Layout Breakout Session, where we had a nice discussion about the future of Layout in the different web engines. Christian Biesinger, one of the members of Google’s Blink Layout team, talked about the new LayoutNG project: an experiment to implement a new Layout from scratch, cleaning up quite old code paths and solving some problems that were not possible to address with the current legacy code. This new LayoutNG idea is related to the new Layout API and the Houdini project, a new mechanism for defining new layout models without requiring native support inside the browser. We also had some discussions about the current state of the Flexible Box specification in Blink and how WebKit’s implementation is quite abandoned and unmaintained nowadays.

In addition, we discussed the current state of the CSS Grid Layout implementation in the different engines. The implementation is almost complete in most of the main engines. Thanks to the collaboration between Igalia and Bloomberg we can confirm that WebKit and Blink’s implementations are almost complete. We have been evaluating Mozilla’s Gecko implementation of Grid and we verified it’s in a similar status. We talked about the recent news from TPAC, which Manuel Rego attended, about the CSS Grid Layout specification becoming a Candidate Recommendation. For all these reasons, we agreed with Christian that it’d be good to send the Blink intent-to-ship request as soon as possible; in case it’s accepted, it could be enabled by default in the next Chrome release.

Day 2

The day started with a meeting with Christian to evaluate the different approaches we had implemented so far. We discarded the ones that require updating the intrinsic size. We also decided to avoid solving the issue during the pre-layout of orthogonal items. This approach would have had the least impact on performance for grid layout and would also have solved the incorrect intrinsic size issue; however, it would add penalties for cases not using grid at all.

Finally, Christian and I agreed to first solve the unpredictable behavior of the grid layout logic. We would skip the issue of incorrect intrinsic size, mainly because we think the Grid Layout specification is contradictory in this regard. For what it’s worth, I’ve created a new issue for the CSSWG in the W3C’s github. Even though we should wait for the issue to be clarified, we already have some possible approaches for getting what seems a more natural result for the grid’s intrinsic size. The following example could help to understand the problem:


Both test cases were loaded using the same Chrome version mentioned above. They clearly show that neither min-content nor max-content sizes are applied correctly to the grid container. The reason for this weird behavior is how content-sized tracks are handled by the Grid Tracks sizing algorithm when rendering grid items with orthogonal flow. From the latest draft specification:

Then, if the min-content contribution of any grid items have changed based on the row sizes calculated in step 2, steps 1 and 2 are repeated with the new min-content contribution and max-content contribution (once only).

This, together with the fact that the pre-layout of orthogonal boxes is performed before the tracks have been defined, causes the grid container’s intrinsic size to be computed incorrectly. This problem is explained in detail in the W3C’s github issue mentioned before. So if anybody is interested in additional details or, even better, willing to participate in the ongoing discussion, just follow the link above.

Day 3

In addition to the orthogonal flow issues, Baseline Alignment was the other hot topic of my work during the hackfest. Baseline Alignment is the only feature still missing to complete the implementation of the CSS Box Alignment specification for Grid Layout. There is a preliminary approach in this code review issue, but it’s still not complete. The problem is that, as stated in the Alignment spec, Baseline Alignment may affect the grid container’s intrinsic size:

When specified for align-self/justify-self, these values trigger baseline self-alignment, shifting the entire box within its container, which may affect the sizing of its container.

This fact implies that I should integrate Baseline offset computation inside the Grid Tracks sizing algorithm. Christian and I have been analyzing the sizing algorithm and we designed a possible approach, quite similar to what Flexbox implements in its layout logic.

Finally, we met again with Christian to discuss the Minimum Implied size for Grid issues. Manuel had the chance to discuss it with the Grid spec editors at TPAC, but it seems there are still unresolved issues. There is an ongoing discussion at W3C’s github about this problem; you can get the details in issues 283 and 523. Christian suggested that he could add some feedback to the discussion, so we can clarify it as soon as possible. This is an important issue that may affect browser interoperability.

The hackfest

The Web Engines Hackfest is an event for hackers of different web engines to share experiences and brainstorm about the future of the Web Platform. This year there were hackers representing most of the main web engines, including Google’s Blink, Mozilla’s Gecko and Servo, Apple’s WebKit and Igalia’s WebKitGTK+.


There were scheduled talks from hackers of each engine so everybody could get an idea of their current state and future plans. In addition, some breakout sessions were scheduled during the kick-off session driven by my colleague Martin Robinson. We encouraged everybody to propose new breakout sessions on the fly, every time an ongoing discussion needed a deeper debate or analysis.


I could attend most of the talks and breakout sessions, so I’ll now give my impressions of them. All the talks were recorded (the videos will be available soon) and slides are available in the wiki, so I recommend watching them if you haven’t already.


The first speaker in the schedule was Jack Moffitt (Mozilla) with “Servo: Today & Tomorrow”. He gave a great overview of the progress they have made on the new Servo rendering engine, emphasizing the multi-thread support and showing some performance metrics for different engine components: CSS parsing, scripting, layout, … He highlighted the technical advantages of using Rust in the development of this new engine.


During Day 1 I attended two breakout sessions. The first one, about the WebKitGTK+ engine, was mainly focused on the recently published roadmap and especially on the new Threaded Compositor being enabled by default. There was an interesting discussion about graphics support in WebKitGTK+ between the Collabora developers and Andrey Fedorov (Huawei).


The other breakout session I could attend during Day 1 was the one about the Servo engine. There was a nice discussion between Mozilla developers and Christian Biesinger about how both engines handle rendering in a different way. Servo hackers explained in more detail Servo’s threading model and its implications for layout.


Day 2 started with the awesome talk by Behdad Esfahbod (Google), “HarfBuzz, the First Ten Years”. He gave an overview of the evolution of the HarfBuzz library over the last years, including past experiences at the Web Engines Hackfest and at former events under its previous name, WebKitGTK+ Hackfest. Clearly, rendering fonts is a huge technical challenge. It was also awesome to have Behdad here to work closely with my colleague Fred Wang on the fonts support for MathML.


I attended the MathML breakout session, where the hot topic was Fred’s prototype implemented for the Blink engine. The result of this experiment was really promising, so we started to discuss its chances of getting back in Chrome, even behind an experimental flag. Both Christian and Behdad expressed doubts about that being possible nowadays. They suggested that it may be feasible to implement it through Houdini and Blink’s new Layout API. We have confirmed that the refactoring we have done for WebKit applies perfectly in Blink, since both share a substantial portion of the layout codebase. In addition, after our work on removing the Flexbox dependency and further code clean up, we can be confident in MathML’s independence in the layout logic. After some discussion, we agreed to continue our experiments in Blink, independently of its chances of being integrated in Blink in the future, since I don’t think the Houdini approach makes sense for MathML, at least for now.

Later on Youenn Fablet (Apple) gave a talk about the Fetch API, “Fetching Bytes and Words on the Web”. He described the implementation of this specification in WebKit and its relation with the Streams API. The development is still ongoing and there is still quite a lot of work to do.


I missed the talks of Day 3, so hopefully I’ll watch them as soon as the videos are available. The talk about WPE (aka WebKit For Wayland), by my colleague Žan Doberšek, is especially interesting. I spent the day trying to make the most of having Christian Biesinger at our premises. As I mentioned before, we discussed the Grid Layout’s Implied Minimum Size issue, one of the most complex problems we’ll have to solve in the short term.

by jfernandez at October 09, 2016 01:59 PM

October 06, 2016

Eduardo Lima Mitev

Example: Run a headless OpenGL (ES) compute shader via DRM render-nodes

It has been a long time indeed since my last entry here. But I have actually been quite busy on a new adventure: graphics driver development.

Two years ago I started contributing to Mesa, mostly to the Intel i965 backend, as a member of the Igalia Graphics Team. During this time I have been building my knowledge around GPU driver development, the different parts of the Linux graphics stack, its different projects, tools, and the developer communities around them. It is still the beginning of the journey, but I’m already very proud of the multiple contributions our team has made.

But today I want to start a series of articles discussing basic examples of how to do cool things with the Khronos APIs and the Linux graphics stack. It will be a collection of short and concise programs that achieve a concrete goal, pretty much what I wish had existed when I looked for them myself in the public domain.

If you, like me, are the kind of person that learns by doing and by growing existing examples, then I hope you will find this series interesting, and that it also encourages you to write your own examples.

Before finishing this introduction and getting into the matter, I want to state what I consider a good (minimal) example program, because I think it will help the reader understand what to expect from this and future examples, and how they differ from similar code found online. That said, not all examples in the series will be minimal.

A minimal example should:

  • provide as little boilerplate as possible. E.g., wrapping the example in a C++ class is unnecessary noise unless you are demoing OOP.
  • be contained in a single source code file if possible. Following the code flow across multiple files adds mental overhead.
  • have no dependencies, or as few as possible. The reader should not need to install anything that is not strictly necessary to try the example.
  • not favor function over readability. It doesn’t matter if it is not a complete or well-behaved program, if getting there adds stuff unrelated to the one thing being showcased.

Ok, now we are ready to move to the first example in this series: the simplest way to embed and run an OpenGL-ES compute shader in your C program on Linux, without the need for any GUI window or connection to the X (or Wayland) server. Actually, it isn’t even necessary to be running any window system at all. Also, the program doesn’t need any special privilege, other than access to the unprivileged DRI/DRM infrastructure. In modern distros, this access is typically granted by being a member of the video group.

The code is fairly short, so I’m inlining it below. It can also be found in my gpu-playground repository. See below for a quick explanation of its most interesting parts.

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES3/gl31.h>
#include <assert.h>
#include <fcntl.h>
#include <gbm.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* a dummy compute shader that does nothing */
#define COMPUTE_SHADER_SRC "          \
#version 310 es\n                                                       \
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;       \
void main(void) {                                                       \
   /* awesome compute code here */                                      \
}                                                                       \
"

int32_t
main (int32_t argc, char* argv[])
{
   bool res;

   int32_t fd = open ("/dev/dri/renderD128", O_RDWR);
   assert (fd > 0);

   struct gbm_device *gbm = gbm_create_device (fd);
   assert (gbm != NULL);

   /* setup EGL from the GBM device */
   EGLDisplay egl_dpy = eglGetPlatformDisplay (EGL_PLATFORM_GBM_MESA, gbm, NULL);
   assert (egl_dpy != NULL);

   res = eglInitialize (egl_dpy, NULL, NULL);
   assert (res);

   const char *egl_extension_st = eglQueryString (egl_dpy, EGL_EXTENSIONS);
   assert (strstr (egl_extension_st, "EGL_KHR_create_context") != NULL);
   assert (strstr (egl_extension_st, "EGL_KHR_surfaceless_context") != NULL);

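   /* ask for a config that can render with OpenGL ES 3.x */
   static const EGLint config_attribs[] = {
      EGL_RENDERABLE_TYPE, EGL_OPENGL_ES3_BIT_KHR,
      EGL_NONE
   };
   EGLConfig cfg;
   EGLint count;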

   res = eglChooseConfig (egl_dpy, config_attribs, &cfg, 1, &count);
   assert (res);

   res = eglBindAPI (EGL_OPENGL_ES_API);
   assert (res);

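   /* request an OpenGL ES 3.x context, not shared with any other context */
   static const EGLint attribs[] = {
      EGL_CONTEXT_CLIENT_VERSION, 3,
      EGL_NONE
   };
   EGLContext core_ctx = eglCreateContext (egl_dpy, cfg, EGL_NO_CONTEXT, attribs);
   assert (core_ctx != EGL_NO_CONTEXT);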

   res = eglMakeCurrent (egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, core_ctx);
   assert (res);

   /* setup a compute shader */
   GLuint compute_shader = glCreateShader (GL_COMPUTE_SHADER);
   assert (glGetError () == GL_NO_ERROR);

   const char *shader_source = COMPUTE_SHADER_SRC;
   glShaderSource (compute_shader, 1, &shader_source, NULL);
   assert (glGetError () == GL_NO_ERROR);

   glCompileShader (compute_shader);
   assert (glGetError () == GL_NO_ERROR);

   GLuint shader_program = glCreateProgram ();

   glAttachShader (shader_program, compute_shader);
   assert (glGetError () == GL_NO_ERROR);

   glLinkProgram (shader_program);
   assert (glGetError () == GL_NO_ERROR);

   glDeleteShader (compute_shader);

   glUseProgram (shader_program);
   assert (glGetError () == GL_NO_ERROR);

   /* dispatch computation */
   glDispatchCompute (1, 1, 1);
   assert (glGetError () == GL_NO_ERROR);

   printf ("Compute shader dispatched and finished successfully\n");

   /* free stuff */
   glDeleteProgram (shader_program);
   eglDestroyContext (egl_dpy, core_ctx);
   eglTerminate (egl_dpy);
   gbm_device_destroy (gbm);
   close (fd);

   return 0;
}

You have probably noticed that the program can be divided into 4 main parts:

  1. Creating a GBM device from a render-node
  2. Setting up a (surfaceless) EGL/OpenGL-ES context
  3. Creating a compute shader program
  4. Dispatching the compute shader

The first two parts are the most relevant to the purpose of this article, since they allow our program to run without requiring a window system. The rest is standard OpenGL code to set up and execute a compute shader, which is out of scope for this example.

Creating a GBM device from a render-node

What’s a render-node anyway? Render-nodes are the “new” DRM interface to access the unprivileged functions of a DRI-capable GPU. It is actually not new; this API was previously part of the single (privileged) interface exposed at /dev/dri/cardX. During the Linux kernel 3.x series, the DRM driver started exposing the unprivileged part of its user-space API via the render-node interface, as a separate device file (/dev/dri/renderDXX). If you want to know more about render-nodes, there is a section about it in the Linux Kernel documentation and also a brief explanation on Wikipedia.

   int32_t fd = open ("/dev/dri/renderD128", O_RDWR);

The first step is to open the render-node file for reading and writing. On my system, it is exposed as /dev/dri/renderD128. It may be different on other systems, and any serious code would want to detect and select the appropriate render-node file first. There can be more than one, one for each DRI-capable GPU in your system.
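
As a rough idea of what that detection could look like, here is a small sketch. It is written in Python only to keep it short (the same logic maps to opendir/readdir plus access in C), and it assumes that render nodes follow the renderD* naming under /dev/dri mentioned above. The function name find_render_nodes is made up for this example.

```python
import glob
import os


def find_render_nodes(dri_dir="/dev/dri"):
    """Return the render-node paths the current user can open read/write."""
    candidates = sorted(glob.glob(os.path.join(dri_dir, "renderD*")))
    # Keep only the nodes we have permission to open with O_RDWR.
    return [path for path in candidates
            if os.access(path, os.R_OK | os.W_OK)]


nodes = find_render_nodes()
print(nodes[0] if nodes else "no usable render node found")
```

Picking the first entry is the simplest policy; a program that cares about which GPU it runs on would need to inspect each node further before choosing.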

It is the render-node interface that ultimately allows us to use the GPU for computing only, from an unprivileged program. However, to be able to plug into this interface, we need to wrap it up with a Generic Buffer Manager (GBM) device, since this is the interface that Mesa (the EGL/OpenGL driver we are using) understands.

   struct gbm_device *gbm = gbm_create_device (fd);

That’s it. We now have a GBM device that is able to send commands to a GPU via its render-node interface. The next step is to set up an EGL and OpenGL context.

Setting up a (surfaceless) EGL/OpenGL-ES context

EGLDisplay egl_dpy = eglGetPlatformDisplay (EGL_PLATFORM_GBM_MESA, gbm, NULL);

This is the most interesting line of code in this section. We get access to an EGL display using the previously created GBM device as the platform-specific handle. Notice the EGL_PLATFORM_GBM_MESA identifier, which is the Mesa-specific enum for interfacing with a GBM device.

   const char *egl_extension_st = eglQueryString (egl_dpy, EGL_EXTENSIONS);
   assert (strstr (egl_extension_st, "EGL_KHR_surfaceless_context") != NULL);

Here we check that we got a valid EGL display that is able to create a surfaceless context.

   res = eglMakeCurrent (egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, core_ctx);

After setting up the EGL context and binding the OpenGL API we want (ES 3.1 in this case), we have the last line of code that requires attention. When activating the EGL/OpenGL-ES context, we want to make sure we specify EGL_NO_SURFACE, because that’s what we want, right?

And that’s it. We now have an OpenGL context and we can start making OpenGL calls normally. Setting up a compute shader and dispatching it should be uncontroversial, so we leave it out of this article for simplicity.

What next?

Ok, with around 100 lines of code we were able to dispatch a (useless) compute shader program. Here are some basic ideas on how to grow this example to do something useful:

  • Dynamically detect and select among the different render-node interfaces available.
  • Add some input and output buffers to get data in and out of the compute shader execution.
  • Do some cool processing of the input data in the shader, then use the result back in the C program.
  • Add proper error checking, obviously.

In future entries, I would like to explore also minimal examples on how to do this using the Vulkan and OpenCL APIs. And also implement some cool routine in the shader that does something interesting.


With this example I tried to demonstrate how easy it is to exploit the capabilities of modern GPUs for general purpose computing. Whether for off-screen image processing, computer vision, crypto, physics simulation, machine learning, etc., there are potentially many cases where off-loading work from the CPU to the GPU results in greater performance and/or energy efficiency.

Using a fairly modern Linux graphics stack, Mesa, and a DRI-capable GPU such as an Intel integrated graphics card, any C program or library can embed a compute shader today, and use the GPU as another computing unit.

Does your application have routines that could potentially be moved to a GLSL compute program? I would love to hear about it.

by elima at October 06, 2016 10:31 AM

October 05, 2016

Frédéric Wang

Fonts at the Web Engines Hackfest 2016

Last week I travelled to Galicia for one of the regular gatherings organized by Igalia. It was a great pleasure to meet again all the Igalians and friends. Moreover, this time was a bit special since we celebrated our 15th anniversary :-)

Igalia Summit October 2016
Photo by Alberto Garcia licensed under CC BY-SA 2.0.

I also attended the third edition of the Web Engines Hackfest, sponsored by Igalia, Collabora and Mozilla. This year, we had various participants from the Web Platform including folks from Apple, Collabora, Google, Igalia, Huawei, Mozilla and Red Hat. For my first hackfest as an Igalian, I invited some experts on fonts & math rendering to collaborate on OpenType MATH support in HarfBuzz and its use in math rendering engines. In this blog post, I am going to focus on the work I have done with Behdad Esfahbod and Khaled Hosny. I think it was again a great and productive hackfest and I am looking forward to attending the next edition!

OpenType MATH in HarfBuzz

Web Engines Hackfest, main room
Photo by @webengineshackfest licensed under CC BY-SA 2.0.

Behdad gave a talk with a nice overview of the work accomplished in HarfBuzz over ten years. One thing appearing recently in HarfBuzz is the need for APIs to parse OpenType tables on all platforms. As part of my job at Igalia, I had started to experiment with adding support for the MATH table some months ago and it was nice to have Behdad finally available to review, fix and improve commits.

When I talked to Mozilla employee Karl Tomlinson, it became apparent that the simple shaping API for stretchy operators proposed in my blog post would not cover all the special cases currently implemented in Gecko. Moreover, this shaping API is also very similar to another one existing in HarfBuzz for non-math scripts, so we would have to decide the best way to share the logic.

As a consequence, we decided for now to focus on providing an API to access all the data of the MATH table. After the Web Engines Hackfest, such a math API is now integrated into the development repository of HarfBuzz and will be available in version 1.3.3 :-)

MathML in Web Rendering Engines

Currently, several math rendering engines have their own code to parse the data of the OpenType MATH table. But many of them actually use HarfBuzz for normal text shaping and hence could just rely on the new math API for the math rendering too. Before the hackfest, Khaled had already tested my work-in-progress branch with libmathview and I had done the same for Igalia’s Chromium MathML branch.

MathML Fraction parameters test
MathML test for OpenType MATH Fraction parameters in Gecko, Blink and WebKit.

Once the new API landed in HarfBuzz, Khaled was also able to use it for the XeTeX typesetting system. I also started to experiment with this for Gecko and WebKit. This seems to work pretty well and we get consistent results for Gecko, Blink and WebKit! Some random thoughts:

  • The math data is exposed through a hb_font_t which contains the text size. This means that the various values are now directly resolved and returned as fixed-point numbers, which should allow us to avoid the rounding errors we may currently have in Gecko or WebKit when multiplying by float factors.
  • HarfBuzz has some magic to automatically handle invalid offsets and sizes that greatly simplifies the code, compared to what exists in Gecko and WebKit.
  • Contrary to Gecko’s implementation, HarfBuzz does not cache the latest result for glyph-specific data. Maybe we want to keep that?
  • The WebKit changes were tested on the GTK port, where HarfBuzz is enabled. Other ports may still need to use the existing parsing code from the WebKit tree. Perhaps Apple should consider adding support for the OpenType MATH table to CoreText?
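
As a rough illustration of the fixed-point remark in the first item above, here is a minimal sketch in Go of scaling a value from font design units with integer arithmetic only. It is not HarfBuzz’s code: the function name, the rounding rule and the use of 6 fractional bits for the scale are all assumptions made for the example.

```go
package main

import "fmt"

// scaleToFixed converts a value expressed in font design units into a
// fixed-point position. scale is the size of one em in fixed-point units
// and unitsPerEm comes from the font. Rounding is half away from zero
// (an assumption of this sketch).
func scaleToFixed(designUnits, scale, unitsPerEm int64) int64 {
	scaled := designUnits * scale
	half := unitsPerEm / 2
	if scaled < 0 {
		return (scaled - half) / unitsPerEm
	}
	return (scaled + half) / unitsPerEm
}

func main() {
	// Hypothetical setup: 16px text with 6 fractional bits, so one em is 16*64.
	const scale = 16 * 64
	// A made-up MATH constant of 1234 design units in a 1000 units-per-em font.
	fmt.Println(scaleToFixed(1234, scale, 1000))
}
```

Since every step is integer arithmetic, two engines applying the same rule to the same font get the same number, with no float factor involved.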

Brotli/WOFF2/OTS libraries

Web Engines Hackfest, main room
Photo by @webengineshackfest licensed under CC BY-SA 2.0.

We also updated the copies of WOFF2 and OTS libraries in WebKit and Gecko respectively. This improves compliance with one requirement of the WOFF2 specification and allows the corresponding conformance test to pass.

Gecko, WebKit and Chromium bundle their own copy of the Brotli, WOFF2 or OTS libraries in their source repositories. However:

  • We have to use more or less automated mechanisms to keep these bundled copies up-to-date. This is especially annoying for Brotli and WOFF2 since they are still in development and we must be sure to always integrate the latest security fixes. Also, we get compiler warnings or coding style errors that do not exist upstream and that must be disabled or patched until they are fixed upstream and imported again.

  • This is obviously not an optimal sharing of system libraries and may increase the size of binaries. Using shared libraries is what maintainers of Linux (or other FLOSS systems) generally ask for and this was raised during the WebKitGTK+ session. Similarly, we should really use the system Brotli/WOFF2 bundled in future releases of Apple’s operating systems.

There are several issues that make it hard for package maintainers to provide these libraries: no released binaries or release tags, no proper build system to generate shared libraries, use of git submodules to include one library’s source code into another, etc. Things have gotten a bit better for Brotli and I was able to tweak the CMake script to produce shared libraries. For WOFF2, issue 40 and issue 49 have been inactive but hopefully these will be addressed in the future…

October 05, 2016 10:00 PM

Iago Toral

Source code for the OpenGL terrain renderer demo

I have been quite busy with various things in the last few weeks, but I have finally found some time to clean up and upload the code of the OpenGL terrain renderer demo to GitHub.

Since this was intended as a programming exercise I have not tried to be very elegant or correct during the implementation, so expect things like error handling to be a bit rough around the edges, but otherwise I think the code should be easy enough to follow.

Notice that I have only tested this on Intel GPUs. I know it works on NVIDIA too (thanks to Samuel and Chema for testing this) but there are a couple of rendering artifacts there, specifically at the edges of the skybox and some “pillars” showing up in the distance sometimes, probably because I am rendering one too many “rows” of the terrain and I end up rendering garbage. I may fix these some day.

The code I uploaded to the repository includes a few new features too:

  • Model variants, which are basically color variations of the same model
  • A couple of additional models (a new tree and plant) and a different rock type
  • Collision detection, which makes navigating the terrain more pleasant

Here is a new screenshot:


In future posts I will talk a bit about some of the implementation details, so having the source code around will be useful. Enjoy!

by Iago Toral at October 05, 2016 12:25 PM

October 03, 2016

Xabier Rodríguez Calvar

WebRTC in WebKit/WPE

For some time I worked at Igalia to enable WebRTC on WebKitForWayland or WPE for the Raspberry Pi 2.

The goal was to have the WebKit WebRTC tests working for a demo. My fellow Igalian Alex was working on the platform itself in WebKit and assisting with some tuning for the Pi on WebKit but the main work needed to be done in OpenWebRTC.

My other fellow Igalian Phil had begun a branch to work on this that was halfway there with some workarounds. My first task was getting into combat/workaround mode and making OpenWebRTC work with compressed streams from gst-rpicamsrc. OpenWebRTC supported only raw video streams and the Raspberry Pi Cam module GStreamer element provides only H264 encoded ones. I moved some encoders and parsers around, made some caps modifications, removed some elements that didn’t work on the Pi and eventually made it work. You can see the result at:

To make this work by yourself you needed a custom branch of Buildroot where you could build with the proper plugins enabled, and you also had to select the appropriate branches in WPE and OpenWebRTC.

Unfortunately the work was far from being finished so I continued the effort to try to bring the architectural changes in OpenWebRTC up to production quality, and that meant doing some tasks step by step:

  • Rework the video orientation code: The workaround deactivated it as so far it was being done in GStreamer. In the case of rpicamsrc that can be done by the hardware itself so I cooked a GStreamer interface to enable rotation the same way it was done for the [gl]videoflip elements. The idea would be to deprecate the original ones and use the new interface. These landed both in videoflip and glvideoflip. Of course I also implemented it on gst-rpicamsrc here and here and eventually in OpenWebRTC sources.
  • Rework video flip: Once OpenWebRTC sources got orientation support, I could rework the flip both for local and remote feeds.
  • Add gl{down|up}load elements back: There were some issues with the gl elements to upload and download textures, which we had removed. I added them back.
  • Reworked bins linking: In OpenWebRTC there are some bins that are created to perform some tasks and, depending on the circumstances, some elements are added or not. I reworked the way those elements are linked so that we don’t have to take all the use cases into account to link them. Now this is easier as the elements are linked as they are added to the bin.
  • Reworked the renderer_disabled: As in the case for orientation, some elements such as gst-rpicamsrc are able to change color and balance so I added support for that to avoid having that done by GStreamer elements if not necessary. In this case the proper interfaces were already there in GStreamer.
  • Moved the decoding/parsing from the source to the renderer: Before our changes the source was parsing/decoding the remote feeds; local sources were not decoded, as only raw was supported. Our workarounds made the local sources decode too but this was not working for all cases. So why decode at the sources when GStreamer has caps and you can just chain all that to the renderers? So eventually I moved the parsing/decoding to the renderers. This required fixing all the caps negotiation from sources to renderers. Unfortunately I think I broke audio on the way, but surely nothing difficult to fix.

This is still a work in progress and now I am changing tasks and handing this back over to my fellow Igalian Phil, who I am sure will do an awesome job together with Alex.

And again, thanks to Igalia for letting me work on this and to Metrological that is sponsoring this work.

by calvaris at October 03, 2016 04:15 PM

September 29, 2016

Antonio Gomes

Making Chromium’s PDFium greater

One of the coolest things about working at Igalia is that we have the chance to work on real world problems from day 1, either by improving an existing solution or creating new ones, based on open source technologies.

During the past month, I worked on a specific aspect of Chromium’s PDF engine, PDFium: extending PDFium’s capability so that Annotations can be manipulated by Acrobat JS scripting.

In practice, the goal was to make the specific scenario described in bug 492 behave similarly in Chromium/PDFium, when compared to IE/Acrobat.

“Mission given, is mission accomplished”

The following Acrobat JS APIs were implemented:

Document::syncAnnotScan (only stubbed out)

… and Acrobat JS scripts like the snippet below can now be successfully executed by PDFium, making 200k+ PDF files behave in Chromium/PDFium similarly to IE/Acrobat with regard to the way Annotations are manipulated.

this.disclosed = true;
this.syncAnnotScan();
var u = this.URL;
var args = u.split('calcrtid=');
var tags;
if (args.length > 1) {
  tags = args[1].split('&');
  this.gotoNamedDest(tags[0]);
  var a = this.getAnnot(this.pageNum, tags[0]);
  if (a != null) {
    a.hidden = false;
    var rect = a.rect;
    if (rect != null) {
      this.scroll(rect[2] + 100, rect[3] + 100);
    }
  }
}

Also as part of this project, it was possible to fix two discrepancies in PDFium, again compared to Acrobat: 1) hidden annotations were shown; 2) PDFium’s positioning of Text Markup annotations.

About (2) specifically: in Acrobat, if an “/AP” clause is present as part of an “/Annot” definition, the coordinates used to draw annotations come from the “/Rect” values. On the other hand, when “/AP” is not defined, the annotation coordinates are taken from the “/QuadPoints” values. The divergence came from the fact that PDFium used “/Rect” regardless of the presence or absence of “/AP”. This CL fixed it.
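
The rule can be written down in a few lines. The sketch below is in Go and is not PDFium code: the Annot and Rect types are hypothetical, taking the bounding box of the “/QuadPoints” numbers is a simplification, and falling back to “/Rect” when “/QuadPoints” is missing is an assumption of the example.

```go
package main

import (
	"fmt"
	"math"
)

// Rect is a hypothetical axis-aligned box.
type Rect struct{ Left, Bottom, Right, Top float64 }

// Annot is a hypothetical, simplified view of an "/Annot" dictionary.
type Annot struct {
	HasAP      bool      // true when an "/AP" entry is present
	Rect       Rect      // the "/Rect" entry
	QuadPoints []float64 // the "/QuadPoints" entry as x, y pairs
}

// drawingRect returns "/Rect" when "/AP" is present, otherwise the
// bounding box of the "/QuadPoints" values.
func drawingRect(a Annot) Rect {
	if a.HasAP || len(a.QuadPoints) < 2 {
		return a.Rect // assumption: also used when there are no quad points
	}
	q := a.QuadPoints
	r := Rect{Left: q[0], Bottom: q[1], Right: q[0], Top: q[1]}
	for i := 0; i+1 < len(q); i += 2 {
		r.Left = math.Min(r.Left, q[i])
		r.Right = math.Max(r.Right, q[i])
		r.Bottom = math.Min(r.Bottom, q[i+1])
		r.Top = math.Max(r.Top, q[i+1])
	}
	return r
}

func main() {
	a := Annot{
		Rect:       Rect{0, 0, 10, 10},
		QuadPoints: []float64{1, 6, 5, 6, 1, 2, 5, 2},
	}
	fmt.Println(drawingRect(a)) // no "/AP": derived from "/QuadPoints"
	a.HasAP = true
	fmt.Println(drawingRect(a)) // with "/AP": "/Rect" is used as is
}
```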

To wrap up the work, I was also able to extend PDFium’s main testing tool (pdfium_test) to actually run Acrobat JS tests (follow up), as well as make lots of drive-by clean-ups, and change PDFium’s main client (Chromium) to deal better with its async nature.

The PDFium community

Throughout the project, interactions with the PDFium community were smooth:

  • The PDFium contribution flow pretty much follows Chromium’s: an issue tracker to report issues and Codereview to get the work reviewed.
  • Once approved, a CL is committed through a “Commit Queue” bot, which ensures a CL’s correctness by running the PDFium regression suite.
  • Owners are really responsive.


My work in PDFium is sponsored by Bloomberg. Thank you all!


by agomes at September 29, 2016 04:04 PM

September 27, 2016

Javier Muñoz

Attending ApacheCon and Apache Big Data Europe 2016

This year I will be attending my first ApacheCon and Apache Big Data here in Europe. Two major events related to open source technologies, techniques and best practices shaping the data ecosystem and cloud computing.

I will be in Seville (Spain) all week, November 14-18, under the sponsorship of my company Igalia. It will be great meeting with some fellows in the Apache Libcloud community, one of the open source projects where I contribute code upstream to support Ceph and custom solutions for cloud providers.

If you are interested in topics such as cloud, distributed systems, data, massive storage, scalability, security, devops, ML... or you would like to chat about some Apache project, feel free to approach me and talk about anything. You can also contact me via mail, linkedin, twitter, etc. See you there!

by Javier at September 27, 2016 10:00 PM

September 26, 2016

Michael Catanzaro

Epiphany Icon Refresh

We have a nice new app icon for Epiphany 3.24, thanks to Jakub Steiner (Update: and also Lapo Calamandrei):

Our new icon. Ignore the version numbers, it’s for 3.24.

Wow pretty!

The old icon was not actually specific to Epiphany, but was taken from the system, so it could be totally different depending on your icon theme. Here’s the icon currently used in GNOME, for comparison:

The old icon, for comparison

You can view the new icon in its full 512×512 glory by navigating to about:web:

It’s big (click for full size)

(The old GNOME icon was a mere 256×256.)

Thanks Jakub!

by Michael Catanzaro at September 26, 2016 02:00 PM

September 21, 2016

Andy Wingo

is go an acceptable cml?

Yesterday I tried to summarize the things I know about Concurrent ML, and I came to the tentative conclusion that Go (and any Go-like system) was an acceptable CML. Turns out I was both wrong and right.

you were wrong when you said everything's gonna be all right

I was wrong, in the sense that programming against the CML abstractions lets you do more things than programming against channels-and-goroutines. Thanks to Sam Tobin-Hochstadt for pointing this out. As an example, consider a little process that tries to receive a message off a channel, and times out otherwise:

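func withTimeout(ch chan int, timeout int) (result int) {
  // make() the channel: a send on a nil channel would block forever.
  timeoutChannel := make(chan int, 1)
  var msg int
  go func() {
    // Needs the "time" package; timeout is assumed to be in milliseconds.
    time.Sleep(time.Duration(timeout) * time.Millisecond)
    timeoutChannel <- 0
  }()
  select {
    case msg = <-ch: return msg
    case <-timeoutChannel: return 0
  }
}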

I think that's the first Go I've ever written. I don't even know if it's syntactically valid. Anyway, I think we see how it should work. We return the message from the channel, unless the timeout happens before.

But, what if the message is itself a composite message somehow? For example, say we have a transformer that reads a value from a channel and adds 1 to it:

func onePlus(in chan int) (result chan int) {
  out := make(chan int) // a nil channel would block forever
  go func () { out <- 1 + <-in }()
  return out
}

What if we do a withTimeout(onePlus(numbers), 0)? Assume the timeout fires first and that's the result that select chooses. There's still that onePlus goroutine out there trying to read from in and at some point probably it will succeed, but nobody will read its value. At that point the number just vanishes into the ether. Maybe that's OK in certain domains, but certainly not in general!
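
To make the lost message concrete, here is a small self-contained Go program along the lines of the two functions above. It is only a sketch: it uses time.After instead of a hand-written timer goroutine, treats the timeout as milliseconds, and the lostMessage helper is made up for the demonstration.

```go
package main

import (
	"fmt"
	"time"
)

// withTimeout returns the next message from ch, or 0 if the timeout
// (assumed to be in milliseconds) expires first.
func withTimeout(ch chan int, timeout int) int {
	select {
	case msg := <-ch:
		return msg
	case <-time.After(time.Duration(timeout) * time.Millisecond):
		return 0
	}
}

// onePlus returns a channel that yields one value read from in, plus 1.
func onePlus(in chan int) chan int {
	out := make(chan int)
	go func() { out <- 1 + <-in }()
	return out
}

// lostMessage reports what withTimeout returned and whether a later send
// on numbers was accepted even though nobody will ever see the result.
func lostMessage() (result int, sendAccepted bool) {
	numbers := make(chan int)
	// Nothing has been sent on numbers yet, so the timeout wins.
	result = withTimeout(onePlus(numbers), 0)
	// The onePlus goroutine is still waiting on numbers and takes the value...
	select {
	case numbers <- 41:
		sendAccepted = true
	case <-time.After(time.Second):
	}
	// ...but its output channel is unreachable: 42 is stuck in that goroutine.
	return result, sendAccepted
}

func main() {
	result, sendAccepted := lostMessage()
	fmt.Println("withTimeout returned:", result)
	fmt.Println("send on numbers accepted:", sendAccepted)
}
```

From the sender's point of view the message was delivered, yet no receiver ever observes it.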

What CML gives you is the ability to express an event (which is kinda like a possibility of sending or receiving a message on a channel) in such a way that we don't run into this situation. Specifically with the wrap combinator, we would make an event such that receiving on numbers would run a function on the received message and return that as the message value -- which is of course the same as what we have, except that in CML the select wouldn't actually read the message off unless it select'd that channel for input.

Of course in Go you could just rewrite your program, so that the select statement looks like this:

select {
  case msg = <-ch: return msg + 1;
  case msg = <-timeoutChannel: return 0;
}

But here we're operating at a lower level of abstraction; we were forced to intertwingle our concerns of adding 1 and our concerns of timeout. CML is more expressive than Go.

you were right when you said we're all just bricks in the wall

However! I was right in the sense that you can build a CML system on top of Go-like systems (though possibly not Go in particular). Thanks to Vesa Karvonen for this comment and the link to their proof-of-concept CML implementation in Clojure's core.async. I understand Vesa also has an implementation in F# as well.

Folks should read Vesa's code, after reading the Reppy papers of course; it's delightfully short and expressive. The basic idea is that event composition operators like choose and wrap build up data structures instead of doing things. The sync operation then grovels through those data structures to collect a list of channels to pass on to core.async's equivalent of select. When select returns, sync determines which event that chosen channel and message corresponds to, and proceeds to "activate" the event (and, as a side effect, possibly issue NACK messages to other channels).

Provided you can map from the chosen select channel/message back to the event (something that core.async can mostly do, with a caveat; see the code), you can build CML on top of channels and goroutines.
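
The idea can be sketched in Go itself. This is not Vesa's code, and it leaves out NACKs and send events; the Event, Recv, Wrap, Choose and Sync names are invented for the example, and reflect.Select stands in for a select over a list of channels that is only known at run time.

```go
package main

import (
	"fmt"
	"reflect"
)

// Event describes a possible receive; building one performs no communication.
type Event interface {
	// flatten collects the base receives reachable from this event, each
	// with the wrappers that apply to it (outermost first).
	flatten(ws []func(int) int, acc []leaf) []leaf
}

type leaf struct {
	ch chan int
	ws []func(int) int
}

type recvEvent struct{ ch chan int }
type wrapEvent struct {
	inner Event
	f     func(int) int
}
type chooseEvent struct{ alts []Event }

func Recv(ch chan int) Event              { return recvEvent{ch} }
func Wrap(e Event, f func(int) int) Event { return wrapEvent{e, f} }
func Choose(alts ...Event) Event          { return chooseEvent{alts} }

func (e recvEvent) flatten(ws []func(int) int, acc []leaf) []leaf {
	return append(acc, leaf{e.ch, ws})
}

func (e wrapEvent) flatten(ws []func(int) int, acc []leaf) []leaf {
	// Copy so that sibling alternatives never share a backing array.
	next := append(append([]func(int) int{}, ws...), e.f)
	return e.inner.flatten(next, acc)
}

func (e chooseEvent) flatten(ws []func(int) int, acc []leaf) []leaf {
	for _, alt := range e.alts {
		acc = alt.flatten(ws, acc)
	}
	return acc
}

// Sync walks the event, selects over the collected channels, and only then
// runs the wrappers of the one receive that was actually performed.
func Sync(e Event) int {
	leaves := e.flatten(nil, nil)
	cases := make([]reflect.SelectCase, len(leaves))
	for i, l := range leaves {
		cases[i] = reflect.SelectCase{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(l.ch)}
	}
	chosen, recv, _ := reflect.Select(cases)
	v := int(recv.Int())
	ws := leaves[chosen].ws
	for i := len(ws) - 1; i >= 0; i-- { // innermost wrapper runs first
		v = ws[i](v)
	}
	return v
}

func main() {
	numbers := make(chan int, 1)
	timeout := make(chan int, 1)
	ev := Choose(
		Wrap(Recv(numbers), func(n int) int { return n + 1 }),
		Wrap(Recv(timeout), func(int) int { return 0 }),
	)
	numbers <- 41
	fmt.Println(Sync(ev))
}
```

Because the "add 1" step lives in the event description rather than in a separate goroutine, no value is read from numbers unless that alternative is the one chosen.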

o/~ yeah you were wrong o/~

On the other hand! One advantage of CML is that its events are not limited to channel sends and receives. I understand that timeouts, thread joins, and maybe some other event types are first-class event kinds in many CML systems. Michael Sperber, current Scheme48 maintainer and functional programmer, tells me that simply wrapping events in channels+goroutines works but can incur a big performance overhead relative to supporting those event types natively, due to the need to make the new goroutine and channel and the scheduling costs. He quotes 10X as the overhead!

So although CML and Go appear to be inter-expressible, maybe a proper solution will base the simple channel send/receive interface on CML rather than the other way around.

Also, since these events are now second-class, it must be OK to lose these events, for the same reason that the naïve withTimeout could lose a message from numbers. This is the case for timeouts usually but maybe you have to think about this more, and possibly provide an infinite stream of the message. (Of course the wrapper goroutine would be collected if the channel becomes unreachable.)

you were right when you said this is the end

I've long wondered how contemporary musicians deal with the enormous, crushing weight of recorded music. I don't really pick any more but hoo am I feeling this now. I think for Guile, I will continue hacking on fibers in a separate library, and I think that things will remain that way for the next couple years and possibly more. We need more experience and more mistakes before blessing and supporting any particular formulation of highly concurrent programming. I will say though that I am delighted that we are able to actually do this experimentation on a library level and I look forward to seeing what works out :)

Thanks again to Vesa, Michael, and Sam for sharing their time and knowledge; all errors are of course mine. Happy hacking!

by Andy Wingo at September 21, 2016 09:29 PM

September 20, 2016

Andy Wingo

concurrent ml versus go

Peoples! Lately I've been navigating the guile-ship through waters unknown. This post is something of an echolocation to figure out where the hell this ship is and where it should go.

Concretely, I have been working on getting a nice lightweight concurrency system rolling for Guile. I'll write more about that later, but you can think of it as being modelled on Go, though built as a library. (I had previously described it as "Erlang-like", but that's just not accurate.)

Earlier this year at Curry On this topic was burning in my mind and of course when I saw the language-hacker fam there I had to bend their ears. My targets: Matthew Flatt, the amazing boundary-crossing engineer, hacker, teacher, researcher, and implementor of Racket, and Matthias Felleisen, the godfather of the PLT research family. I saw them sitting together and I thought, you know what, what can they have to say to each other? These people have been talking together for 30 years right? Surely they are actually waiting for some ignorant dude to saunter up to the PL genius bar, right?

So saunter I do, saying, "if someone says to you that they want to build a server that will handle 100K or so simultaneous connections on Racket, what abstraction do you tell them to use? Racket threads?" Apparently: yes. A definitive yes, in the case of Matthias, with a pointer to Robby Findler's paper on kill-safe abstractions; and still a yes from Matthew with the caveat that for the concrete level of concurrency that I described, you'd have to run tests. More fundamentally, I was advised to look at Concurrent ML (on which Racket's concurrency facilities were based), and told that CML was much better put together than many modern variants like Go.

This was very interesting and new to me. As y'all probably know, I don't have a formal background in programming languages, and although I've read a lot of literature, reading things only makes you aware of the growing dimension of the not-yet-read. Concurrent ML was even beyond my not-yet-read horizon.

So I went back and read a bunch of papers. Turns out Concurrent ML is like Lisp in that it has a tribe and a tightly-clutched history and a diaspora that reimplements it in whatever language they happen to be working in at the moment. Kinda cool, and, um... a bit hard to appreciate in the current-day context when the only good references are papers from 10 or 20 years ago.

However, after reading a bunch of John Reppy papers, here is my understanding of what Concurrent ML is. I welcome corrections; surely I am getting this wrong.

1. CML is like Go, composed of channels and goroutines. (Forgive the modern referent; I assume most folks know Go at this point.)

2. Unlike Go, in CML a channel is never buffered. To make a buffered channel in CML, you spawn a thread that manages a buffer between two channels.

3. Message send and receive operations in CML are built on a lower-level primitive called "events". (send ch x) is equivalent to (sync (send-event ch x)). It's like an event is the derivative of a message send with respect to time, or something.

4. Events can be combined and transformed using the choose and wrap combinators.

5. Doing a sync on an event created by choose allows a user to build select in "user-space", as a library. Cool stuff. So this is what events are for.

6. There are separate event type implementations for timeouts, channel send/recv blocking operations, file descriptor blocking operations, syscalls, thread joins, and the like. These are supported by the CML implementation.

7. The early implementations of Concurrent ML were concurrent but not parallel; they did not run multiple "goroutines" on separate CPU cores at the same time. It was only in like 2009 that people started to do CML in parallel. I do not know if this late parallelism has a practical impact on the viability of CML.
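
Point 2 translates directly to Go if you pretend channels are always unbuffered. A minimal sketch, where the buffered helper and its unbounded queue are assumptions of the example rather than anything taken from the CML papers:

```go
package main

import "fmt"

// buffered connects in to out through a FIFO queue owned by one goroutine.
// Both channels can be unbuffered; the goroutine is the buffer.
func buffered(in <-chan int, out chan<- int) {
	go func() {
		var queue []int
		for in != nil || len(queue) > 0 {
			// Only offer a send when something is queued: a select case
			// on a nil channel is never chosen.
			var sendCh chan<- int
			var head int
			if len(queue) > 0 {
				sendCh, head = out, queue[0]
			}
			select {
			case v, ok := <-in:
				if !ok {
					in = nil // input closed: stop receiving, drain the queue
					continue
				}
				queue = append(queue, v)
			case sendCh <- head:
				queue = queue[1:]
			}
		}
		close(out)
	}()
}

func main() {
	in, out := make(chan int), make(chan int)
	buffered(in, out)
	for i := 1; i <= 3; i++ {
		in <- i // does not wait for a reader on out
	}
	close(in)
	for v := range out {
		fmt.Println(v)
	}
}
```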

ok go

What is the relationship of CML to Go? Specifically, is CML more expressive than Go? (I assume the reverse is not the case, but that would also be an interesting result!)

There are a few languages that only allow you to select over message receives (not sends), but Go's select doesn't have this limitation, so that's not a differentiator.

Some people say that it's nice to have events as the common denominator, but I don't get this argument. If the only event under consideration is message send or receive over a channel, events + choose + sync is the same in expressive power as a built-in select, as far as I can see. If there are other events, then your runtime already has to support them either way, and something like (let ((ch (make-channel))) (spawn-fiber (lambda () (put-message ch exp))) (get-message ch)) should be sufficient for any runtime-supported event in exp, like sleeps or timeouts or thread joins or whatever.

To me it seems like Go has made the right choices here. I do not see the difference, and the reason I wrote all this is to be shown the error of my ways. Choosing channels, send, receive, and select as the primitives seems to have the same power as CML events.

Let this post be a pentagram on the floor, then, to summon the CML cognoscenti. Well-actuallies are very welcome; hit me up in the comments!

[edit: Sam Tobin-Hochstadt tells me I got it wrong and I believe him :) In the meantime while I work out how I was wrong, examples are welcome!]

by Andy Wingo at September 20, 2016 09:33 PM

Carlos García Campos

WebKitGTK+ 2.14

These six months have gone so fast and here we are again, excited about the new WebKitGTK+ stable release. This is a release with almost no new API, but with major internal changes that we hope will improve all the applications using WebKitGTK+.

The threaded compositor

This is the most important change introduced in WebKitGTK+ 2.14 and what kept us busy for most of this release cycle. The idea is simple: we still render everything in the web process, but the accelerated compositing (all the OpenGL calls) has been moved to a secondary thread, leaving the main thread free to run all other heavy tasks like layout, JavaScript, etc. The result is a smoother experience in general: since the main thread is no longer busy rendering frames, it can process JavaScript faster, improving responsiveness significantly. For all the details about the threaded compositor, read Yoon’s post here.

So, the idea is indeed simple, but the implementation required a lot of important changes in the whole graphics stack of WebKitGTK+.

  • Accelerated compositing always enabled: first of all, with the threaded compositor the accelerated mode is always enabled, so we no longer enter/exit the accelerated compositing mode when visiting pages depending on whether the contents require acceleration or not. This was the first challenge because there were several bugs related to accelerated compositing being always enabled, and even missing features like the web view background colors that didn’t work in accelerated mode.
  • Coordinated Graphics: it was introduced in WebKit when other ports switched to do the compositing in the UI process. We are still doing the compositing in the web process, but being in a different thread also needs coordination between the main thread and the compositing thread. We switched to use coordinated graphics too, but with some modifications for the threaded compositor case. This is the major change in the graphics stack compared to the previous one.
  • Adaptation to the new model: finally we had to adapt to the threaded model, mainly due to the fact that some tasks that were expected to be synchronous before became asynchronous, like resizing the web view.

This is a big change that we expect will drastically improve the performance of WebKitGTK+, especially in embedded systems with limited resources, but like all big changes it can also introduce new bugs or issues. Please, file a bug report if you notice any regression in your application. If you have any problem running WebKitGTK+ in your system or with your GPU drivers, please let us know. It’s still possible to disable the threaded compositor in two different ways. You can use the environment variable WEBKIT_DISABLE_COMPOSITING_MODE at runtime, but this will disable accelerated compositing support, so websites requiring acceleration might not work. To disable the threaded compositor and bring back the previous model you have to recompile WebKitGTK+ with the option ENABLE_THREADED_COMPOSITOR=OFF.


Wayland

WebKitGTK+ 2.14 is the first release that we can consider feature complete in Wayland. While previous versions worked in Wayland, there were two important features missing that made it quite annoying to use: accelerated compositing and clipboard support.

Accelerated compositing

More and more websites require acceleration to properly work and it’s now a requirement of the threaded compositor too. WebKitGTK+ has supported accelerated compositing for a long time, but the implementation was specific to X11. The main challenge is compositing in the web process and sending the results to the UI process to be rendered on the actual screen. In X11 we use an offscreen redirected XComposite window to render in the web process, sending the XPixmap ID to the UI process, which renders the offscreen window contents in the web view and uses the XDamage extension to track the repaints happening in the XWindow. In Wayland we use a nested compositor in the UI process that implements the Wayland surface interface and a private WebKitGTK+ protocol interface to associate surfaces in the UI process with the web pages in the web process. The web process connects to the nested Wayland compositor and creates a new surface for the web page that is used to render accelerated contents. On every swap buffers operation in the web process, the nested compositor in the UI process is automatically notified through the Wayland surface protocol, and new contents are rendered in the web view. The main difference compared to the X11 model is that Wayland uses EGL in both the web and UI processes, so what we have in the UI process in the end is not a bitmap but a GL texture that can be used to render the contents to the screen using the GPU directly. We use gdk_cairo_draw_from_gl() when available to do that, falling back to using glReadPixels() and a cairo image surface for older versions of GTK+. This can make a huge difference, especially on embedded devices, so we are considering using the nested Wayland compositor even on X11 in the future if possible.


Clipboard

The WebKitGTK+ clipboard implementation relies on GTK+, and there’s nothing X11 specific in there; however, the clipboard was read/written directly by the web processes. That doesn’t work in Wayland, even though we use GtkClipboard, because Wayland only allows clipboard operations between compositor clients, and web processes are not Wayland clients. This required moving the clipboard handling from the web process to the UI process. Clipboard handling is now centralized in the UI process and clipboard contents to be read/written are sent to the different WebKit processes using the internal IPC.

Memory pressure handler

The WebKit memory pressure handler is a monitor that watches the system memory (not only the memory used by the web engine processes) and tries to release memory under low memory conditions. This is quite an important feature in embedded devices with memory limitations. It has been supported in WebKitGTK+ for some time, but the implementation is based on cgroups and systemd, which are not available in all systems and require user configuration. So, in practice nobody was actually using the memory pressure handler. Watching system memory in Linux is a challenge, mainly because /proc/meminfo is not pollable, so you need manual polling. In WebKit, there’s a memory pressure handler in every secondary process (Web, Plugin and Network), so waking up every second to read /proc/meminfo from every web process would not be acceptable. This is not a problem when using cgroups, because the kernel interface provides a way to poll an EventFD to be notified when memory usage is critical.

WebKitGTK+ 2.14 has a new memory monitor, used only when cgroups/systemd is not available or configured, based on polling /proc/meminfo to ensure the memory pressure handler is always available. The monitor lives in the UI process, to ensure there’s only one process doing the polling, and uses a dynamic poll interval based on the last system memory usage to read and parse /proc/meminfo in a secondary thread. Once memory usage is critical, all the secondary processes are notified using an EventFD. Using EventFD for this monitor too is not only more efficient than using a pipe or sending an IPC message, but also allows us to keep almost the same implementation in the secondary processes, which monitor either the cgroups EventFD or the UI process one.
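
To give an idea of what such a monitor has to do, here is a minimal sketch in Go of the parsing and interval selection. It is not the WebKitGTK+ implementation: the function names, the use of MemAvailable to estimate usage, and the thresholds and intervals are all assumptions chosen for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// usedPercent parses text in the /proc/meminfo format and returns the
// share of memory in use, estimated as (MemTotal - MemAvailable) / MemTotal.
func usedPercent(meminfo string) (int, error) {
	fields := map[string]int64{}
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		parts := strings.Fields(sc.Text()) // e.g. ["MemTotal:", "8000000", "kB"]
		if len(parts) < 2 {
			continue
		}
		n, err := strconv.ParseInt(parts[1], 10, 64)
		if err != nil {
			continue
		}
		fields[strings.TrimSuffix(parts[0], ":")] = n
	}
	total, hasTotal := fields["MemTotal"]
	avail, hasAvail := fields["MemAvailable"]
	if !hasTotal || !hasAvail || total <= 0 {
		return 0, fmt.Errorf("MemTotal or MemAvailable missing")
	}
	return int((total - avail) * 100 / total), nil
}

// pollInterval polls more often as memory gets scarcer.
// The thresholds and durations are hypothetical.
func pollInterval(usedPct int) time.Duration {
	switch {
	case usedPct >= 90:
		return time.Second
	case usedPct >= 75:
		return 2 * time.Second
	default:
		return 5 * time.Second
	}
}

func main() {
	// A real monitor would read /proc/meminfo here; a sample keeps this portable.
	sample := "MemTotal:        8000000 kB\n" +
		"MemFree:          500000 kB\n" +
		"MemAvailable:    2000000 kB\n"
	pct, err := usedPercent(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("used:", pct, "next poll in:", pollInterval(pct))
}
```

Keeping the parsing separate from the polling loop makes it easy to run the loop in a single secondary thread of one process, as described above.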

Other improvements and bug fixes

Like in all other major releases there are a lot of other improvements, features and bug fixes. The most relevant ones in WebKitGTK+ 2.14 are:

  • The HTTP disk cache implements speculative revalidation of resources.
  • The media backend now supports video orientation.
  • Several bugs have been fixed in the media backend to prevent deadlocks when playing HLS videos.
  • The number of file descriptors that are kept open has been drastically reduced.
  • Poor performance with the modesetting Intel driver and DRI3 enabled has been fixed.

by carlos garcia campos at September 20, 2016 07:42 AM

September 19, 2016

Michael Catanzaro

Epiphany 3.22 (and a couple new stable releases too!)

It’s that time of year again! A new major release of Epiphany is out now, representing another six months of incremental progress. That’s a fancy way of saying that not too much has changed (so how did this blog post get so long?). It’s not for lack of development effort, though. There’s actually a lot of action in git master and on sidebranches right now, most of it thanks to my awesome Google Summer of Code students, Gabriel Ivascu and Iulian Radu. However, I decided that most of the exciting changes we’re working on would be deferred to Epiphany 3.24, to give them more time to mature and to ensure quality. And since this is a blog post about Epiphany 3.22, that means you’ll have to wait until next time if you want details about the return of the traditional address bar, the brand-new user interface for bookmarks, the new support for syncing data between Epiphany browsers on different computers with Firefox Sync, or Prism source code view, all features that are brewing for 3.24. This blog post also does not cover the cool new stuff in WebKitGTK+ 2.14, like new support for copy/paste and accelerated compositing in Wayland.

New stuff

So, what’s new in 3.22?

  • A new Paste and Go context menu option in the address bar, implemented by Iulian. It’s so simple, but it’s also the greatest thing ever. Why did nobody implement this earlier?
  • A new Duplicate Tab context menu option on tabs, implemented by Gabriel. It’s not something I use myself, but it seems some folks who use it in other browsers were disappointed it was missing in Epiphany.
  • A new keyboard shortcuts dialog is available in the app menu, implemented by Gabriel.

Gabriel also redesigned all the error pages. My favorite one is the new TLS error page, based on a mockup from Jakub Steiner:

Web app improvements

Pivoting to web apps, Daniel Aleksandersen turned his attention to the algorithm we use to pick a desktop icon for newly-created web apps. It was, to say the least, subpar; in Epiphany 3.20, it almost always fell back to using the website’s 16×16 favicon, which doesn’t look so great in a desktop environment where all app icons are expected to be at least 256×256. Epiphany 3.22 will try to pick better icons when websites make it possible. Read more on Daniel’s blog, which goes into detail on how to pick good web app icons.

Also new is support for system-installed web apps. Previously, Epiphany could only handle web apps installed in home directories, which meant it was impossible to package a web app in an RPM or Debian package. That limitation has now been removed. (Update: I had forgotten that limitation was actually removed for GNOME 3.20, but the web apps only worked when running in GNOME and not in other desktops, so it wasn’t really usable. That’s fixed now in 3.22.) This was needed to support packaging Fedora Developer Portal, but of course it can be used to package up any website. It’s probably only interesting to distributions that ship Epiphany by default, though. (Epiphany is installed by default in Fedora Workstation as it’s needed by GNOME Software to run web apps, it’s just hidden from the shell overview unless you “install” it.) At least one media outlet has amusingly reported this as Epiphany attempting to compete generally with Electron, something I did write in a commit message, but which is only true in the specific case where you need to just show a website with absolutely no changes in the GNOME desktop. So if you were expecting to see Visual Studio running in Epiphany: haha, no.

Shortcut woes

On another note, I’m pleased to announce that we managed to accidentally stomp on both shortcuts for opening the GTK+ inspector this cycle, by mapping Duplicate Tab to Ctrl+Shift+D, and by adding a new Ctrl+Shift+I shortcut to open the WebKit web inspector (in addition to F12). Go team! We caught the problem with Ctrl+Shift+D and removed the shortcut in time for the release, so at least you can still use that to open the GTK+ inspector, but I didn’t notice the issue with the web inspector until it was too late, and Ctrl+Shift+I will no longer work as expected in GTK+ apps. Suggestions welcome for whether we should leave the clashing Ctrl+Shift+I shortcut or get rid of it. I am leaning towards removing it, because we normally match Epiphany behavior with GTK+, and only match other browsers when it doesn’t conflict with GTK+. That’s called desktop integration, and it’s worked well for us so far. But a case can be made for matching other browsers, too.

Stable releases

On top of Epiphany 3.22, I’ve also rolled new stable releases 3.20.4 and 3.18.8. I don’t normally blog about stable releases since they only include bugfixes and are usually boring, so why are these worth mentioning here? Two reasons. First, one of the fixes in these releases is quite significant: I discovered that a few important features were broken when multiple tabs share the same web process behind the scenes (a somewhat unusual condition): the load anyway button on the unacceptable TLS certificate error page, password storage with GNOME keyring, removing pages from the new tab overview, and deleting web applications. It was one subtle bug that was to blame for breaking all of those features in this odd corner case, which finally explains some difficult-to-reproduce complaints we’d been getting, so it’s good to get that bug out of the way. Of course, that’s also fixed in Epiphany 3.22, but new stable releases ensure users don’t need a full distribution upgrade to pick up a simple bugfix.

Additionally, the new stable releases are compatible with WebKitGTK+ 2.14 (to be released later this week). The Epiphany 3.20.4 and 3.18.8 releases will intentionally no longer build with older versions of WebKitGTK+, as new WebKitGTK+ releases are important and all distributions must upgrade. But wait, if WebKitGTK+ is kept API and ABI stable in order to encourage distributions to release updates, then why is the new release incompatible with older versions of Epiphany? Well, in addition to the stable API, there’s also an unstable DOM API that changes willy-nilly without any soname bumps; we don’t normally notice when it changes, since it’s autogenerated from web IDL files. Sounds terrible, right? In practice, no application has (to my knowledge) ever been affected by an unstable DOM API break before now, but that has changed with WebKitGTK+ 2.14, and an Epiphany update is required. Most applications don’t have to worry about this, though; the unstable API is totally undocumented and not available unless you #define a macro to make it visible, so applications that use it know to expect breakage. But unannounced ABI changes without soname bumps are obviously a big problem for distributions, which is why we’re fixing this problem once and for all in WebKitGTK+ 2.16. Look out for a future blog post about that, probably from Carlos Garcia.

elementary OS

Lastly, I’m pleased to note that elementary OS Loki is out now. elementary is kinda (totally) competing with us GNOME folks, but it’s cool too, and the default browser has changed from Midori to Epiphany in this release due to unfixed security problems with Midori. They’ve shipped Epiphany 3.18.5, so if there are any elementary fans in the audience, it’s worth asking them to upgrade to 3.18.8. elementary does have some downstream patches to improve desktop integration with their OS — notably, they’ve jumped ahead of us in bringing back the traditional address bar — but desktop integration is kinda the whole point of Epiphany, so I can’t complain. Check it out! (But be sure to complain if they are not releasing WebKit security updates when advised to do so.)

by Michael Catanzaro at September 19, 2016 02:00 PM

September 18, 2016

Michael Catanzaro

A WebKit Update for Ubuntu

I’m pleased to learn that Ubuntu has just updated WebKitGTK+ from 2.10.9 to 2.12.5 in Ubuntu 16.04. To my knowledge, this is the first time Ubuntu has released a major WebKit update. It includes fixes for 16 security vulnerabilities detailed in WSA-2016-0004 and WSA-2016-0005.

This is really great. Of course, it would have been better if it didn’t take three and a half months to respond to WSA-2016-0004, and the week before WebKitGTK+ 2.12 becomes obsolete was not the greatest timing, but late security updates are much better than no security updates. It remains to be seen if Ubuntu will keep up with WebKit updates in the future, but I think I can tentatively stop complaining about Ubuntu for now. Debian is looking increasingly isolated in not offering WebKit security updates to its users.

Thanks, Ubuntu!

by Michael Catanzaro at September 18, 2016 04:34 AM

September 14, 2016

Manuel Rego

TPAC, Web Engines Hackfest & Igalia 15th anniversary on the horizon


Next week I’ll be in Lisbon attending TPAC. This is the annual conference organized by the W3C where all the different groups meet face to face during one week. It seems a huge event where you can meet lots of important people working on the web. Looking forward to being there. 😃

Due to our involvement in the implementation of the CSS Grid Layout specification in Chromium/Blink and Safari/WebKit, we’ve been interacting quite a lot with the CSS Working Group (CSS WG). Thus, I’ll be participating in their meetings during the event and also following the work around Houdini, with a close eye on the Layout API. BTW, thanks to the CSS WG chairs for letting me join them.

Like last year, Igalia will have a booth at the conference where you can chat with us about our involvement in the W3C, from editing specs to implementing web standards in the different browsers. My colleagues Joanie and Juanjo will be attending the event too, so don’t hesitate to ping any of us to talk about Igalia and our contributions to the open web platform.

Web Engines Hackfest

This year Igalia is again organizing and hosting a new edition of the Web Engines Hackfest. The event will take place during the last week of the month (26-28th September). It used to be in December, but we looked for a better date this year and it seems we’ve been successful, as there will be around 40 of us (more than ever). Thanks to everyone attending, we just hope you really enjoy the event.

As usual my main goal for the hackfest is related to CSS Grid Layout implementation. Probably it’d be a good moment to draft a plan for shipping it on Chromium as we’ll have Christian Biesinger around, who is usually reviewing most of our Grid Layout patches.

On top of that, and due to my involvement in the MathML refactoring we did in WebKit, I’ll be very interested in continuing the discussion about MathML and the next steps to make it a reality in the different web engines. Some font experts will be around, so we won’t miss the opportunity to try to improve OpenType MATH table support in HarfBuzz.

Apart from this, it’s worth highlighting the big number of Servo contributors we’ll have in the event, from Mozilla employees to external collaborators (like my colleague Martin). I’m eager to check firsthand the status of this engine and their future plans.

Last, but not least, hopefully during the hackfest we’ll find some time to discuss the upstreaming process of WebKit for Wayland, trying to turn it into an official WebKit port like WebKitGTK+.

And there’ll be more topics that will be discussed there like: accessibility, multimedia, V8, WebRTC, etc. We’re quite a lot of people, so surely we’ll have productive meetings on most of these things.

Igalia 15th Anniversary

This month Igalia is turning 15 years old! It’s amazing that a company with a completely different model (focused on free software and with a flat cooperative-like structure) has survived all these years. I’m very grateful to the people who founded a company with such wonderful values, and to all the people who have made it possible to reach this point in our history. Thanks for letting me be part of it! 😊

Igalia 15th Anniversary Logo

We’ll be celebrating the anniversary during the last week of September: we’ll have several parties that week and one of our summits at the weekend.

It’s awesome to look back in time and realize how many contributions we’ve made to a lot of different free software projects, from our first days inside the GNOME project, to our current work on the different browsers or graphics drivers, among others. Lots of programs you’re using every day have some code contributed by Igalia; from your computer to your phones, TVs, watches, etc.

Happy birthday Igalia, I hope we’ll have many more years of success in the free software world! 🎂

September 14, 2016 10:00 PM

August 30, 2016

Javier Muñoz

AWS4 chunked upload goes upstream in Ceph RGW S3

With AWS Signature Version 4 (AWS4) you have the option of uploading the payload in fixed or variable-size chunks.

This chunked upload option, also known as Transfer payload in multiple chunks or the STREAMING-AWS4-HMAC-SHA256-PAYLOAD feature in the Amazon S3 ecosystem, avoids having to read the payload twice (or buffer it in memory) to compute the signature on the client side.

AWS4 chunked upload support is now upstream code in Ceph. It will also ship with the next Jewel release.

Signing single vs multiple chunks payloads

In AWS Signature Version 4 using the HTTP Authorization header is the most common method of providing authentication information. There are two supported options in S3:

The first option, 'Transfer payload in a single chunk', is how Ceph RGW S3 authenticates requests under AWS4 right now. It supports signed and unsigned payloads.

In this case the major client-side drawback of signed payload uploads is the need to read the payload twice or buffer it in memory. For big files this approach may be inefficient, so you might prefer to upload data in chunks instead.

The second option, 'Transfer payload in multiple chunks (chunked upload)', is the new functionality being added upstream and backported to Jewel.

In this case you break up the payload into fixed or variable-size chunks. By uploading data in chunks, you avoid accessing the payload twice to calculate the signature on the client side.

Signing the chunks

Each chunk signature computation includes the signature of the previous chunk. The seed signature is generated using the request headers only:

Bear in mind the seed signature is used in the signature computation of the first chunk only. For each subsequent chunk, you create a chunk signature that includes the signature of the previous chunk.

The string to sign used in each subsequent chunk signature is:

Chaining the chunk signatures ensures that the chunks are sent in the correct order.
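
To make the chaining concrete, here is a minimal Python sketch of the chunk signature computation. It follows the string-to-sign layout that Amazon documents for STREAMING-AWS4-HMAC-SHA256-PAYLOAD uploads; it is not the Ceph RGW code. The function names are made up for the example, and the seed signature is treated as a given input rather than computed from the request headers.

```python
import hashlib
import hmac

# SHA-256 of the empty string, a fixed field in every chunk string to sign.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret_key, date, region, service="s3"):
    """Derive the AWS4 signing key (date is YYYYMMDD)."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def chunk_signature(key, timestamp, scope, previous_signature, chunk):
    """Sign one chunk; previous_signature links it to the chunk before it."""
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256-PAYLOAD",
        timestamp,               # e.g. 20130524T000000Z
        scope,                   # e.g. 20130524/us-east-1/s3/aws4_request
        previous_signature,
        EMPTY_SHA256,
        hashlib.sha256(chunk).hexdigest(),
    ])
    return hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_chunks(key, timestamp, scope, seed_signature, chunks):
    """Return one signature per chunk; the seed feeds the first chunk only."""
    signatures = []
    previous = seed_signature
    for chunk in chunks:
        previous = chunk_signature(key, timestamp, scope, previous, chunk)
        signatures.append(previous)
    return signatures
```

Because each signature depends on the previous one, reordering or dropping a chunk changes every signature that follows it, which is how the server can detect chunks arriving out of order.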

Uploading chunked payloads with the AWS SDK for Java

To test chunked uploads with Ceph RGW S3 you can use the AWS SDK for Java. It supports AWS4 chunked uploads. The latest version is available here.

Feel free to use this code based on the AWS SDK Java's UploadObjectSingleOperation example.



My work in Ceph is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ceph development team. Thank you all!

by Javier at August 30, 2016 10:00 PM

August 29, 2016

Iago Toral

OpenGL terrain renderer

Lately I have been working on a simple terrain OpenGL renderer demo, mostly to have a playground where I could try some techniques like shadow mapping and water rendering in a scenario with a non trivial amount of geometry, and I thought it would be interesting to write a bit about it.

But first, here is a video of the demo running on my old Intel IvyBridge GPU:

And some screenshots too:

OpenGL Terrain screenshot 1 OpenGL Terrain screenshot 2
OpenGL Terrain screenshot 3 OpenGL Terrain screenshot 4

Note that I did not create any of the textures or 3D models featured in the video.

With that out of the way, let’s dig into some of the technical aspects:

The terrain is built as a 251×251 grid of vertices elevated with a heightmap texture, so it contains roughly 63,000 vertices (251 × 251 = 63,001) and 125,000 triangles. It uses a single 512×512 texture to color the surface.
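
The vertex and triangle counts follow directly from the grid dimensions. The Python sketch below is only an illustration of that arithmetic and of one common way to triangulate a row-major vertex grid; the function names and the winding order are assumptions, not taken from the demo's code.

```python
def grid_counts(columns, rows):
    """Vertices and triangles in a regular grid split into two triangles per cell."""
    vertices = columns * rows
    triangles = 2 * (columns - 1) * (rows - 1)
    return vertices, triangles


def grid_triangles(columns, rows):
    """Yield index triples for a row-major vertex grid, two triangles per cell."""
    for r in range(rows - 1):
        for c in range(columns - 1):
            top_left = r * columns + c
            top_right = top_left + 1
            bottom_left = top_left + columns
            bottom_right = bottom_left + 1
            yield (top_left, bottom_left, top_right)
            yield (top_right, bottom_left, bottom_right)


print(grid_counts(251, 251))  # (63001, 125000)
```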

The water is rendered in 3 passes: refraction, reflection and the final rendering. Distortion is done via a dudv map and it also uses a normal map for lighting. From a geometry perspective it is also implemented as a grid of vertices with 750 triangles.

I wrote a simple OBJ file parser so I could load some basic 3D models for the trees, the rock and the plant models. The parser is limited, but sufficient to load vertex data and simple materials. This demo features 4 models with these specs:

  • Tree A: 280 triangles, 2 materials.
  • Tree B: 380 triangles, 2 materials.
  • Rock: 192 triangles, 1 material (textured)
  • Grass: 896 triangles (yes, really!), 1 material.

The scene renders 200 instances of Tree A, another 200 instances of Tree B, 50 instances of Rock and 150 instances of Grass, so 600 objects in total.

Object locations in the terrain are randomized at start-up, but the demo prevents trees and grass from being placed under water (except for maybe their base section only) because it would look very weird otherwise :). Rocks can be fully submerged though.

Rendered objects fade in and out smoothly via alpha blending (so there is no pop-in/pop-out effect as they reach clipping planes). This cannot be observed in the video because it uses a static camera but the demo supports moving the camera around in real-time using the keyboard.
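
One simple way to get this kind of fade is to derive the alpha value from the object's distance to the camera. The Python sketch below shows a linear fade near the far clipping distance; the function name, the parameters and the linear ramp are assumptions for illustration, since the post does not say how the demo computes it.

```python
def fade_alpha(distance, fade_start, far_plane):
    """Opacity in [0, 1]: fully opaque before fade_start, transparent at far_plane."""
    if distance <= fade_start:
        return 1.0
    if distance >= far_plane:
        return 0.0
    return (far_plane - distance) / (far_plane - fade_start)
```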

Lighting is implemented using the traditional Phong reflection model with a single directional light.

Shadows are implemented using a 4096×4096 shadow map and Percentage Closer Filtering with a 3×3 kernel, which, I read, is (or was?) a very common technique for shadow rendering, at least in the times of the PS3 and Xbox 360.
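
For readers unfamiliar with Percentage Closer Filtering, the idea is to compare the fragment's depth against a small neighbourhood of shadow map texels and average the results, which softens shadow edges. In the demo this would live in a fragment shader; the Python version below is only a CPU-side sketch of the technique, with a hypothetical depth bias and clamped edge sampling chosen for the example.

```python
def pcf_shadow(shadow_map, x, y, fragment_depth, bias=0.005):
    """Return the lit fraction in [0, 1] using a 3x3 PCF kernel.

    shadow_map is a list of rows of depths as seen from the light;
    (x, y) is the texel the fragment projects onto.
    """
    height = len(shadow_map)
    width = len(shadow_map[0])
    lit = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # Clamp to the map edges so border texels still get 9 samples.
            sx = min(max(x + dx, 0), width - 1)
            sy = min(max(y + dy, 0), height - 1)
            # The sample is lit if nothing closer to the light occludes it.
            if fragment_depth - bias <= shadow_map[sy][sx]:
                lit += 1
    return lit / 9.0
```

The returned value can then scale the diffuse and specular terms of the lighting computation, so fragments on a shadow boundary end up partially lit.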

The demo features dynamic directional lighting (that is, the sun light changes position every frame), which is rather taxing. The demo also supports static lighting, which is significantly less demanding.

There is also a slight haze that builds up progressively with the distance from the camera. This can be seen slightly in the video, but it is more obvious in some of the screenshots above.

The demo in the video was also configured to use 4-sample multisampling.

As for the rendering pipeline, it mostly has 4 stages:

  • Shadow map.
  • Water refraction.
  • Water reflection.
  • Final scene rendering.

A few notes on performance as well: the implementation supports a number of configurable parameters that affect the framerate: resolution, shadow rendering quality, clipping distances, multi-sampling, some aspects of the water rendering, N-buffering of dynamic VBO data, etc.

The video I show above runs at a locked 60fps at 800×600 but it uses relatively high quality shadows and dynamic lighting, which are very expensive. Lowering some of these settings (especially turning off dynamic lighting and multisampling and lowering shadow quality) yields framerates around 110fps-200fps. With these settings it can also do fullscreen 1600×900 with an unlocked framerate that varies in the range of 80fps-170fps.

That’s all on the IvyBridge GPU. I also tested this on an Intel Haswell GPU for significantly better results: 160fps-400fps with the “low” settings at 800×600 and roughly 80fps-200fps with the same settings used in the video.

So that’s it for today, I had a lot of fun coding this and I hope the post was interesting to some of you. If time permits I intend to write follow-up posts that go deeper into how I implemented the various elements of the demo and I’ll probably also write some more posts about the optimization process I followed. If you are interested in any of that, stay tuned for more.

by Iago Toral at August 29, 2016 11:50 AM

August 22, 2016

Xavier Castaño

Igalia will be at IBC 2016

Do you work in the media, entertainment, broadcasting or content industry? If so, please join us at IBC2016, 9-13 September in Amsterdam. The IBC Exhibition covers fifteen halls across the Amsterdam RAI and hosts over 1,600 exhibitors spanning the creation, management and delivery of electronic and media entertainment.
You will be able to meet us at our booth where we will be showcasing demos of our latest work in browsers. The demo of WebKit for Wayland, a new port of WebKit that is currently being upstreamed, on an IVI shell is especially worth checking out.

Igalia will be in Hall 5, booth 19 so don’t miss your chance to attend and visit us! Find us on the interactive floorplan.

by xavier castano at August 22, 2016 10:15 AM

August 18, 2016

Adrián Pérez

SnabbWall's Traffic Analyzer: L7Spy

Wow, two posts just a few days apart from each other! Today I am writing about L7Spy, the application from the SnabbWall suite used to analyze network traffic and detect which applications it belongs to. SnabbWall is being developed at Igalia with sponsorship from the NLnet Foundation. (Thanks!)


A Digression

There are two things related to L7Spy I want to mention beforehand:

nDPI 1.8 is Now Supported

The changes from the recently released ljndpi v0.3.3 have been merged into the SnabbWall repository, which means that now it also supports using version 1.8 of the nDPI library.

Following Upstream

Snabb follows a monthly release schedule. Initially, I thought it would be better to do all the development for SnabbWall starting from a given Snabb release, and incorporate the changes from upstream later on, after the first SnabbWall release is done. I then noticed how difficult it might end up being to rebase SnabbWall on top of the changes accumulated from the upstream repository, and switched to the following scheme:

  • The branch containing the most up-to-date development code is always in the snabbwall branch.
  • Whenever a new Snabb version is released, changes are merged from the upstream repository using its corresponding release tag.
  • When a SnabbWall release is done, it is tagged from the current contents of the snabbwall branch, which means it is automatically based on the latest Snabb release.
  • If a released version needs fixing, a maintenance branch snabbwall-vX.Y for it is created from the release tag. Patches have to be backported or purposefully made for this new branch.

This approach reduces the burden of keeping up with changes in Snabb, while still allowing bugfix releases of any SnabbWall version to be made when needed.


The L7Spy is a Snabb application, and as such it can be reused in your own application networks. It is made to be connected in a few different ways, which should allow for painless integration. The application has two bidirectional endpoints called south and north:

The L7Spy application.

Packets entering one of the ends are forwarded unmodified to the other end, in both directions, and they are scanned as they pass through the application.

The only mandatory connection is the receiving side of either south or north: this allows for using L7Spy as a “kitchen sink”, swallowing the scanned packets without sending them anywhere. As an example, this can be used to connect a RawSocket application to L7Spy, attach the RawSocket to an existing network interface, and do passive analysis of the traffic passing through that interface:

Passive analysis made easy

If you swap the RawSocket application with the one driving Intel NICs, and divert a copy of packets passing through your firewall towards the NIC, you would be doing essentially the same but at lightning speed and moving the load of packet analysis to a separate machine. Add a bit more of smarts on top, and this kind of setup would already look like a standalone Intrusion Detection System!

snabb wall

While implementing the L7Spy application, the snabb wall command has also come to life. Like all the other Snabb programs, the wall command is built into the snabb executable, and it has subcommands itself:

aperez@hikari ~/snabb/src % sudo ./snabb wall
[sudo] password for aperez:
  snabb wall <subcommand> <arguments...>
  snabb wall --help

Available subcommands:

  spy     Analyze traffic and report statistics

Use --help for per-command usage. Example:

  snabb wall spy --help

aperez@hikari ~/snabb/src %

For the time being, only the spy subcommand is provided: it exercises the L7Spy application by feeding into it packets captured from one of the supported sources. It can work in batch mode (the default), which is useful to analyze a static pcap file. The following example analyzes the full contents of the pcap file, reporting at the end all the traffic flows detected:

aperez@hikari ~/snabb/src % sudo ./snabb wall spy \
      pcap program/wall/tests/data/EmergeSync.cap
0xfffffffff19ba8be   18p -  RSYNC
0xffffffffc6f96d0f 4538p -  RSYNC
aperez@hikari ~/snabb/src %

Two flows are identified, which correspond to two different connections from the same client IPv4 address (using two different ports) to the same server address. Both flows are correctly identified as belonging to the rsync application. Notice how the application is properly detected even though the rsync daemon is running on port 26883, which is a non-standard one for this kind of service.

It is also possible to ingest traffic directly from a network interface in live mode. Apart from the drivers for Intel cards included with Snabb (selectable using the intel10g and intel1g sources) there are two hardware-independent sources for TAP devices (tap) and RAW sockets (raw). Let's use a RAW socket to sniff a copy of every packet which passes through a kernel-managed network interface, while opening a well known search engine in a browser using their https:// URL:

aperez@hikari ~/snabb/src % sudo ./snabb wall spy --live eth0
0x77becdcb    2p    -  SERVICE_GOOGLE:DNS
0x6e111c73    4p   -  SERVICE_GOOGLE:SSL

Note how, even though the connection is encrypted, our nDPI-based analyzer has been able to correlate a DNS query immediately followed by a connection to the address from the DNS response at port 443 as a typical pattern used to browse web pages, and it has gone the extra mile to tell us that it belongs to a Google-operated service.

There is one more feature in snabb wall spy. Passing --stats will print a line at the end (in normal mode), or every two seconds (in live mode), with a summary of the amount of data processed:

aperez@hikari ~/snabb/src % sudo ./snabb wall spy --live --stats eth0
=== 2016-08-17T18:05:54+0300 === 71866 Bytes, 99 packets, 35933.272 B/s, 49.500 PPS
=== 2016-08-17T18:19:35+0300 === 3105473 Bytes, 2884 packets, 1.599 B/s, 0.001 PPS
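
The rate figures in such a summary line are just the byte and packet counts divided by the elapsed time. The Python sketch below illustrates that arithmetic with a hypothetical `summarize` helper; it is not how snabb wall spy is implemented, and it only approximately reproduces the first line above when given an interval of two seconds.

```python
def summarize(byte_count, packet_count, elapsed_seconds):
    """Format a summary with throughput in bytes and packets per second."""
    bytes_per_second = byte_count / elapsed_seconds
    packets_per_second = packet_count / elapsed_seconds
    return "%d Bytes, %d packets, %.3f B/s, %.3f PPS" % (
        byte_count, packet_count, bytes_per_second, packets_per_second)


print(summarize(71866, 99, 2.0))
# 71866 Bytes, 99 packets, 35933.000 B/s, 49.500 PPS
```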

One last note: even though it is not shown in the examples, scanning and identifying IPv6 traffic is already supported.

Get that IPv6 flowing!

What Next?

Now that traffic analysis and flow identification are working nicely, they have to be used for actual filtering. I am now in the process of implementing the firewall component of SnabbWall (L7fw) as a Snabb application. Also, there will be a snabb wall filter subcommand, providing an off-the-shelf L7 filtering solution. Expect a couple more blog posts about them.

In the meantime, if you want to play with L7Spy, I will be updating its WIP documentation in the coming days. Happy hacking!

by aperez at August 18, 2016 08:51 AM

August 11, 2016

Xabier Rodríguez Calvar

Going to GUADEC 2016

I am at the airport getting ready to board, beginning my trip to Karlsruhe for GUADEC 2016. I’ll be there with my colleague Michael Catanzaro. Please talk to us if you are interested in browsers or Igalia.

by calvaris at August 11, 2016 07:30 AM