Planet Igalia

February 17, 2018

Michael Catanzaro

On Compiling WebKit (now twice as fast!)

Are you tired of waiting for ages to build large C++ projects like WebKit? Slow headers are generally the problem. Your C++ source code file #includes a few headers, all those headers #include more, and those headers #include more, and more, and more, and since it’s C++ a bunch of these headers contain lots of complex templates to slow down things even more. Not fun.

It turns out that much of the time spent building large C++ projects is effectively spent parsing the same headers again and again, over, and over, and over, and over, and over….

There are three possible solutions to this problem:

  • Shred your CPU and buy a new one that’s twice as fast.
  • Use C++ modules: import instead of #include. This will soon become the best solution, but it’s not standardized yet. For WebKit’s purposes, we can’t use it until it works the same in MSVCC, Clang, and three-year-old versions of GCC. So it’ll be quite a while before we’re able to take advantage of modules.
  • Use unified builds (sometimes called unity builds).

WebKit has adopted unified builds. This work was done by Keith Miller, from Apple. Thanks, Keith! (If you’ve built WebKit before, you’ll probably want to say that again: thanks, Keith!)

For a release build of WebKitGTK+, on my desktop, our build times used to look like this:

real 62m49.535s
user 407m56.558s
sys 62m17.166s

That was taken using WebKitGTK+ 2.17.90; build times with any 2.18 release would be similar. Now, with trunk (or WebKitGTK+ 2.20, which will be very similar), our build times look like this:

real 33m36.435s
user 214m9.971s
sys 29m55.811s

Twice as fast.

The approach is pretty simple: instead of telling the compiler to build the original C++ source code files that developers see, we instead tell the compiler to build unified source files that look like this:

// UnifiedSource1.cpp
#include "CSSValueKeywords.cpp"
#include "ColorData.cpp"
#include "HTMLElementFactory.cpp"
#include "HTMLEntityTable.cpp"
#include "JSANGLEInstancedArrays.cpp"
#include "JSAbortController.cpp"
#include "JSAbortSignal.cpp"
#include "JSAbstractWorker.cpp"

Since files are included only once per translation unit, we now have to parse the same headers only once for each unified source file, rather than for each individual original source file, and we get a dramatic build speedup. It’s pretty terrible, yet extremely effective.

Now, how many original C++ files should you #include in each unified source file? To get the fastest clean build time, you would want to #include all of your C++ source files in one, that way the compiler sees each header only once. (Meson can do this for you automatically!) But that causes two problems. First, you have to make sure none of the files throughout your entire codebase use conflicting variable names, since the static keyword and anonymous namespaces no longer work to restrict your definitions to a single file. That’s impractical in a large project like WebKit. Second, because there’s now only one file passed to the compiler, incremental builds now take as long as clean builds, which is not fun if you are a WebKit developer and actually need to make changes to it. Unifying more files together will always make incremental builds slower. After some experimentation, Apple determined that, for WebKit, the optimal number of files to include together is roughly eight. At this point, there’s not yet much negative impact on incremental builds, and past here there are diminishing returns in clean build improvement.

In WebKit’s implementation, the files to bundle together are computed automatically at build time using CMake black magic. Adding a new file to the build can change how the files are bundled together, potentially causing build errors in different files if there are symbol clashes. But this is usually easy to fix, because only files from the same directory are bundled together, so random unrelated files will never be built together. The bundles are always the same for everyone building the same version of WebKit, so you won’t see random build failures; only developers who are adding new files will ever have to deal with name conflicts.

To significantly reduce name conflicts, we now limit the scope of using statements. That is, stuff like this:

using namespace JavaScriptCore;
namespace WebCore {

Has been changed to this:

namespace WebCore {
using namespace JavaScriptCore;
// ...

Some files need to be excluded due to unsolvable name clashes. For example, files that include X11 headers, which contain lots of unnamespaced symbols that conflict with WebCore symbols, don’t really have any chance. But only a few files should need to be excluded, so this does not have much impact on build time. We’ve also opted to not unify most of the GLib API layer, so that we can continue to use conventional GObject names in our implementation, but again, the impact of not unifying a few files is minimal.

We still have some room for further performance improvement, because some significant parts of the build are still not unified, including most of the WebKit layer on top. But I suspect developers who have to regularly build WebKit will already be quite pleased.

by Michael Catanzaro at February 17, 2018 07:07 PM

February 16, 2018

Michael Catanzaro

On Python Shebangs

So, how do you write a shebang for a Python program? Let’s first set aside the python2/python3 issue and focus on whether to use env. Which of the following is correct?

#!/usr/bin/env python

The first option seems to work in all environments, but it is banned in popular distros like Fedora (and I believe also Debian, but I can’t find a reference for this). Using env in shebangs is dangerous because it can result in system packages using non-system versions of python. python is used in so many places throughout modern systems, it’s not hard to see how using #!/usr/bin/env in an important package could badly bork users’ operating systems if they install a custom version of python in /usr/local. Don’t do this.

The second option is broken too, because it doesn’t work in BSD environments. E.g. in FreeBSD, python is installed in /usr/local/bin. So FreeBSD contributors have been upstreaming patches to convert #!/usr/bin/python shebangs to #!/usr/bin/env python. Meanwhile, Fedora has begun automatically rewriting #!/usr/bin/env python to #!/usr/bin/python, but with a warning that this is temporary and that use of #!/usr/bin/env python will eventually become a fatal error causing package builds to fail.

So obviously there’s no way to write a shebang that will work for both major Linux distros and major BSDs. #!/usr/bin/env python seems to work today, but it’s subtly very dangerous. Lovely. I don’t even know what to recommend to upstream projects.

Next problem: python2 versus python3. By now, we should all be well-aware of PEP 394. PEP 394 says you should never write a shebang like this:

#!/usr/bin/env python

unless your python script is compatible with both python2 and python3, because you don’t know what version you’re getting. Your python script is almost certainly not compatible with both python2 and python3 (and if you think it is, it’s probably somehow broken, because I doubt you regularly test it with both). Instead, you should write the shebang like this:

#!/usr/bin/env python2
#!/usr/bin/env python3

This works as long as you only care about Linux and BSDs. It doesn’t work on macOS, which provides /usr/bin/python and /usr/bin/python2.7, but still no /usr/bin/python2 symlink, even though it’s now been six years since PEP 394. It’s hard to understate how frustrating this is.

So let’s say you are WebKit, and need to write a python script that will be truly cross-platform. How do you do it? WebKit’s scripts are only needed (a) during the build process or (b) by developers, so we get a pass on the first problem: using /usr/bin/env should be OK, because the scripts should never be installed as part of the OS. Using #!/usr/bin/env python — which is actually what we currently do — is unacceptable, because our scripts are python2 and that’s broken on Arch, and some of our developers use that. Using #!/usr/bin/env python2 would be dead on arrival, because that doesn’t work on macOS. Seems like the option that works for everyone is #!/usr/bin/env python2.7. Then we just have to hope that the Python community sticks to its promise to never release a python2.8 (which seems likely).


by Michael Catanzaro at February 16, 2018 08:21 PM

February 15, 2018

Diego Pino

The B4 network function

Some time ago I started a series of blog posts about IPv6 and network namespaces. The purpose of those posts was preparing the ground for covering a network function called B4 (Basic Bridging BroadBand).

The B4 network function is one of the main components of a lw4o6 architecture (RFC7596). The function runs within every CPEs (Customer’s Premises Equipment, essentially a home router) of a carrier’s network. This function takes care of two things: 1) NAPT the customer’s IPv4 traffic and 2) encapsulate it into IPv6. This is fundamental as lw4o6 proposes an IPv6-only network, which can still provide IPv4 services and connectivity. Besides lw4o6, the B4 function is also present in other architectures such as DS-Lite or MAP-E. In the case of lw4o6 the exact name of this function is lwB4. All these architectures rely on A+P mapping techniques and are managed by the Softwire WG.

The diagram below shows how a lw4o6 architecture works:

lw4o6 chart
lw4o6 chart

Packets arriving the CPE from the customer (IPv4) are shown in red. Packets leaving the CPE to the carrier’s network are shown in blue (IPv6). The counterpart of a lwB4 function is the lwAFTR function, deployed at one of the border-routers of the carrrier’s network.

In the article ‘Dive into lw4o6’ I reviewed in detail how a lw4o6 architecture works. Please check out the article if you want to learn more.

At Igalia we implemented a high-performant lwAFTR network function. This network function has been part of Snabb since at least 2015, and has kept evolving and getting merged back to Snabb through new releases. We kindly thank Deutsche Telekom for their support financing this project, as well as Juniper networks, who also helped improving the status of Snabb’s lwAFTR.

While we were developing the lwAFTR network function we heavily tested it through a wide range of tests: end-to-end tests, performance tests, soak tests, etc. However, in some occassions we got to diagnose potential bugs in real deployments. To do that, we needed the other major component of a lw4o6 architecture: the B4 network function.

OpenWRT, the Linux-based OS powering many home routers, features a MAP network function to help deploying MAP-E architectures. This function can also be used to implement a B4 for DS-Lite or lw4o6. With the invaluable help of my colleagues Adrián and Carlos López I managed to setup an OpenWRT on a virtual machine with B4 enabled. However, I was not completely satisfied with the solution.

That led me to explore another solution very much inspired by an excelent blog post from Marcel Wiget: Lightweight 4over6 B4 Client in Linux Namespace. In this post Marcel describes how to build a B4 network function using standard Linux commands.

Basically, a B4 function does 2 things:

  • NAT44, which is possible to do it with iptables.
  • IPv4-in-IPv6 tunneling, which is possible to do it iproute2.

In addition, Marcel’s B4 network function is isolated into its own network namespace. That’s less of a headache than installing and configuring a virtual machine.

On the other hand, my deployment had a extra twist compared to a standard lw4o6 deployment. The lwAFTR I was trying to reach was somewhere out on the Internet, not within my ISP’s network. To make things worse, ISP providers in Spain are not rolling out IPv6 yet so I needed to use an IPv6 tunnel broker, more precisely Hurricane Electric (In the article ‘IPv6 tunnel’ I described how to set up such tunnel).

Basically my deployment looked like this:

lwB4-lwAFTR over Internet
lwB4-lwAFTR over Internet

After scratching my head during several days I came up with the following script: I break it down below in pieces for better comprehension.

Warning: The script requires a Hurricane Electric tunnel up and running in order to work.

Our B4 would have the following provisioned data:

  • B4 IPv6: IPv6 address provided by Hurricane Electric.
  • B4 IPv4:
  • B4 port-range: 4096-8191

While the address of the AFTR is 2001:DB8::0001.

Given this settings our B4 is ready to match the following softwire in the lwAFTR’s binding table:

softwire {
    psid 1;
    b4-ipv6 IFHE (See below);
    br-address 2001:DB8::0001;
    port-set {
        psid-length 12;

In case of doubt about how softwires work, please check ‘Dive into lw4o6’.


Definition of several constants. IPHT and IPNS stand for IP host and IP namespace. Our script will create a network namespace which requires a veth pair to communicate the namespace with the host. IPHT is an ULA address for the host side, while IPNS is an ULA address for the network namespace side. Likewise, IFHT and IFNS are the interface names for host and namespace sides respectively.

IFHE is the interface of the Hurricane Electric IPv6-in-IPv4 tunnel. We will use the IPv6 address of this interface as IPv6 source address of the B4.


Softwire related constants, as described above.

# Reset everything
ip li del dev "${IFHT}" &>/dev/null
ip netns del "${NS}" &> /dev/null

Removes namespace and host-side interface if defined.

# Create a network namespace and enable loopback on it
ip netns add "${NS}"
ip netns exec "${NS}" ip li set dev lo up

# Create the veth pair and move one of the ends to the NS.
ip li add name "${IFHT}" type veth peer name "${IFNS}"
ip li set dev "${IFNS}" netns "${NS}"

# Configure interface ${IFHT} on the host
ip -6 addr add "${IPHT}/${CID}" dev "${IFHT}"
ip li set dev "${IFHT}" up

# Configure interface ${IFNS} on the network namespace.
ip netns exec "${NS}" ip -6 addr add "${IPNS}/${CID}" dev "${IFNS}"
ip netns exec "${NS}" ip li set dev "${IFNS}" up

The commands above set up the basics of the network namespace. A network namespace is created with two virtual-interface pairs (think of a patch cable). Each of the veth ends is assigned a private IPv6 address (ULA). One of the ends of the veth pair is moved into the network namespace while the other remains on the host side. In case of doubt, please check this other article I wrote about network namespaces.

# Create IPv4-in-IPv6 tunnel.
ip netns exec "${NS}" ip -6 tunnel add b4tun mode ipip6 local "${IPNS}" remote "${IPHT}" dev "${IFNS}"
ip netns exec "${NS}" ip addr add dev b4tun
ip netns exec "${NS}" ip link set dev b4tun up
# All IPv4 packets go through the tunnel.
ip netns exec "${NS}" ip route add default dev b4tun
# Make ${IFNS} the default gw.
ip netns exec "${NS}" ip -6 route add default dev "${IFNS}"

From the B4 we will send IPv4 packets that will get encapsulated into IPv6. These packets will eventually leave the host via the Hurricane Electric tunnel. What we do here is to create an IPv4-in-IPv6 tunnel (ipip6) called b4tun. The tunnel has two ends: IPNS and IPHT. All IPv4 traffic started from the network namespace gets routed through b4tun, so it gets encapsulated. If the traffic if IPv6 native traffic it doesn’t need to get encapsulated, thus it’s simply forwarded to IFNS.

# Adjust MTU size. 
ip netns exec "${NS}" ip li set mtu 1252 dev b4tun
ip netns exec "${NS}" ip li set mtu 1300 dev vpeer9

Since packets leaving the CPE get IPv6 encapsulated we need to make room for those extra bytes that will grow the packet size. Normally routing appliances have a default MTU size of 1500 bytes. That’s why we artificially reduce the MTU size of both b4tun and vpeer9 interfaces to a number lower than 1500. This technique is known as MSS (Maximum Segment Size) Clamping.

# NAT44.
ip netns exec "${NS}" iptables -t nat --flush
ip netns exec "${NS}" iptables -t nat -A POSTROUTING -p tcp  -o b4tun -j SNAT --to $IP:$PORTRANGE
ip netns exec "${NS}" iptables -t nat -A POSTROUTING -p udp  -o b4tun -j SNAT --to $IP:$PORTRANGE
ip netns exec "${NS}" iptables -t nat -A POSTROUTING -p icmp -o b4tun -j SNAT --to $IP:$PORTRANGE

Outgoing IPv4 packets leaving the B4 got their IPv4 source address and port sourced natted. The block of code above flushes the iptables’s NAT44 rules and creates new Source NAT rules for several protocols.

# Enable forwarding and IPv6 NAT
sysctl -w net.ipv6.conf.all.forwarding=1
ip6tables -t nat --flush
# Packets coming into the veth pair in the host side, change their destination address to AFTR.
ip6tables -t nat -A PREROUTING  -i "${IFHT}" -j DNAT --to-destination "${AFTR_IPV6}"
# Outgoing packets change their source address to HE Client address (B4 address).
ip6tables -t nat -A POSTROUTING -o "${IFHE}" -j MASQUERADE

Outgoing packets leaving our host need to get their source address masqueraded to the IPv6 address assigned to the interface of the Hurricane Electric tunnel point. Likewise anything that comes into the host should seem to arrive from the lwAFTR, when actually its origin address is the IPv6 address of the other end of the Hurricane Electric tunnel. To overcome this problem I applied a NAT66 on source address and destination. Could this be done in a different way skipping the controversial NAT66? I’m not sure. I think the veth pairs need to get assigned private addresses so the only way to get the packets routed through the Internet is with a NAT66.

# Get into NS.
ip netns exec ${NS} ${bash} --rcfile <(echo "PS1=\"${NS}> \"")

The last step gets us into the network namespace from which we will be able to run commands constrained into the environment created during the steps before.

I don’t know how much useful or reusable this script can be, but in hindsight coming up with this complex setting helped me learning several Linux networking tools. I think I could have never figured all this out without the help and support from my colleagues as well as the guidance from Marcel’s original script and blog post.

February 15, 2018 06:00 AM

February 07, 2018

Andy Wingo

design notes on inline caches in guile

Ahoy, programming-language tinkerfolk! Today's rambling missive chews the gnarly bones of "inline caches", in general but also with particular respect to the Guile implementation of Scheme. First, a little intro.

inline what?

Inline caches are a language implementation technique used to accelerate polymorphic dispatch. Let's dive in to that.

By implementation technique, I mean that the technique applies to the language compiler and runtime, rather than to the semantics of the language itself. The effects on the language do exist though in an indirect way, in the sense that inline caches can make some operations faster and therefore more common. Eventually inline caches can affect what users expect out of a language and what kinds of programs they write.

But I'm getting ahead of myself. Polymorphic dispatch literally means "choosing based on multiple forms". Let's say your language has immutable strings -- like Java, Python, or Javascript. Let's say your language also has operator overloading, and that it uses + to concatenate strings. Well at that point you have a problem -- while you can specify a terse semantics of some core set of operations on strings (win!), you can't choose one representation of strings that will work well for all cases (lose!). If the user has a workload where they regularly build up strings by concatenating them, you will want to store strings as trees of substrings. On the other hand if they want to access characterscodepoints by index, then you want an array. But if the codepoints are all below 256, maybe you should represent them as bytes to save space, whereas maybe instead as 4-byte codepoints otherwise? Or maybe even UTF-8 with a codepoint index side table.

The right representation (form) of a string depends on the myriad ways that the string might be used. The string-append operation is polymorphic, in the sense that the precise code for the operator depends on the representation of the operands -- despite the fact that the meaning of string-append is monomorphic!

Anyway, that's the problem. Before inline caches came along, there were two solutions: callouts and open-coding. Both were bad in similar ways. A callout is where the compiler generates a call to a generic runtime routine. The runtime routine will be able to handle all the myriad forms and combination of forms of the operands. This works fine but can be a bit slow, as all callouts for a given operator (e.g. string-append) dispatch to a single routine for the whole program, so they don't get to optimize for any particular call site.

One tempting thing for compiler writers to do is to effectively inline the string-append operation into each of its call sites. This is "open-coding" (in the terminology of the early Lisp implementations like MACLISP). The advantage here is that maybe the compiler knows something about one or more of the operands, so it can eliminate some cases, effectively performing some compile-time specialization. But this is a limited technique; one could argue that the whole point of polymorphism is to allow for generic operations on generic data, so you rarely have compile-time invariants that can allow you to specialize. Open-coding of polymorphic operations instead leads to code bloat, as the string-append operation is just so many copies of the same thing.

Inline caches emerged to solve this problem. They trace their lineage back to Smalltalk 80, gained in complexity and power with Self and finally reached mass consciousness through Javascript. These languages all share the characteristic of being dynamically typed and object-oriented. When a user evaluates a statement like x = y.z, the language implementation needs to figure out where y.z is actually located. This location depends on the representation of y, which is rarely known at compile-time.

However for any given reference y.z in the source code, there is a finite set of concrete representations of y that will actually flow to that call site at run-time. Inline caches allow the language implementation to specialize the y.z access for its particular call site. For example, at some point in the evaluation of a program, y may be seen to have representation R1 or R2. For R1, the z property may be stored at offset 3 within the object's storage, and for R2 it might be at offset 4. The inline cache is a bit of specialized code that compares the type of the object being accessed against R1 , in that case returning the value at offset 3, otherwise R2 and offset r4, and otherwise falling back to a generic routine. If this isn't clear to you, Vyacheslav Egorov write a fine article describing and implementing the object representation optimizations enabled by inline caches.

Inline caches also serve as input data to later stages of an adaptive compiler, allowing the compiler to selectively inline (open-code) only those cases that are appropriate to values actually seen at any given call site.

but how?

The classic formulation of inline caches from Self and early V8 actually patched the code being executed. An inline cache might be allocated at address 0xcabba9e5 and the code emitted for its call-site would be jmp 0xcabba9e5. If the inline cache ended up bottoming out to the generic routine, a new inline cache would be generated that added an implementation appropriate to the newly seen "form" of the operands and the call-site. Let's say that new IC (inline cache) would have the address 0x900db334. Early versions of V8 would actually patch the machine code at the call-site to be jmp 0x900db334 instead of jmp 0xcabba6e5.

Patching machine code has a number of disadvantages, though. It inherently target-specific: you will need different strategies to patch x86-64 and armv7 machine code. It's also expensive: you have to flush the instruction cache after the patch, which slows you down. That is, of course, if you are allowed to patch executable code; on many systems that's impossible. Writable machine code is a potential vulnerability if the system may be vulnerable to remote code execution.

Perhaps worst of all, though, patching machine code is not thread-safe. In the case of early Javascript, this perhaps wasn't so important; but as JS implementations gained parallel garbage collectors and JS-level parallelism via "service workers", this becomes less acceptable.

For all of these reasons, the modern take on inline caches is to implement them as a memory location that can be atomically modified. The call site is just jmp *loc, as if it were a virtual method call. Modern CPUs have "branch target buffers" that predict the target of these indirect branches with very high accuracy so that the indirect jump does not become a pipeline stall. (What does this mean in the face of the Spectre v2 vulnerabilities? Sadly, God only knows at this point. Saddest panda.)

cry, the beloved country

I am interested in ICs in the context of the Guile implementation of Scheme, but first I will make a digression. Scheme is a very monomorphic language. Yet, this monomorphism is entirely cultural. It is in no way essential. Lack of ICs in implementations has actually fed back and encouraged this monomorphism.

Let us take as an example the case of property access. If you have a pair in Scheme and you want its first field, you do (car x). But if you have a vector, you do (vector-ref x 0).

What's the reason for this nonuniformity? You could have a generic ref procedure, which when invoked as (ref x 0) would return the field in x associated with 0. Or (ref x 'foo) to return the foo property of x. It would be more orthogonal in some ways, and it's completely valid Scheme.

We don't write Scheme programs this way, though. From what I can tell, it's for two reasons: one good, and one bad.

The good reason is that saying vector-ref means more to the reader. You know more about the complexity of the operation and what side effects it might have. When you call ref, who knows? Using concrete primitives allows for better program analysis and understanding.

The bad reason is that Scheme implementations, Guile included, tend to compile (car x) to much better code than (ref x 0). Scheme implementations in practice aren't well-equipped for polymorphic data access. In fact it is standard Scheme practice to abuse the "macro" facility to manually inline code so that that certain performance-sensitive operations get inlined into a closed graph of monomorphic operators with no callouts. To the extent that this is true, Scheme programmers, Scheme programs, and the Scheme language as a whole are all victims of their implementations. JavaScript, for example, does not have this problem -- to a small extent, maybe, yes, performance tweaks and tuning are always a thing but JavaScript implementations' ability to burn away polymorphism and abstraction results in an entirely different character in JS programs versus Scheme programs.

it gets worse

On the most basic level, Scheme is the call-by-value lambda calculus. It's well-studied, well-understood, and eminently flexible. However the way that the syntax maps to the semantics hides a constrictive monomorphism: that the "callee" of a call refer to a lambda expression.

Concretely, in an expression like (a b), in which a is not a macro, a must evaluate to the result of a lambda expression. Perhaps by reference (e.g. (define a (lambda (x) x))), perhaps directly; but a lambda nonetheless. But what if a is actually a vector? At that point the Scheme language standard would declare that to be an error.

The semantics of Clojure, though, would allow for ((vector 'a 'b 'c) 1) to evaluate to b. Why not in Scheme? There are the same good and bad reasons as with ref. Usually, the concerns of the language implementation dominate, regardless of those of the users who generally want to write terse code. Of course in some cases the implementation concerns should dominate, but not always. Here, Scheme could be more flexible if it wanted to.

what have you done for me lately

Although inline caches are not a miracle cure for performance overheads of polymorphic dispatch, they are a tool in the box. But what, precisely, can they do, both in general and for Scheme?

To my mind, they have five uses. If you can think of more, please let me know in the comments.

Firstly, they have the classic named property access optimizations as in JavaScript. These apply less to Scheme, as we don't have generic property access. Perhaps this is a deficiency of Scheme, but it's not exactly low-hanging fruit. Perhaps this would be more interesting if Guile had more generic protocols such as Racket's iteration.

Next, there are the arithmetic operators: addition, multiplication, and so on. Scheme's arithmetic is indeed polymorphic; the addition operator + can add any number of complex numbers, with a distinction between exact and inexact values. On a representation level, Guile has fixnums (small exact integers, no heap allocation), bignums (arbitrary-precision heap-allocated exact integers), fractions (exact ratios between integers), flonums (heap-allocated double-precision floating point numbers), and compnums (inexact complex numbers, internally a pair of doubles). Also in Guile, arithmetic operators are a "primitive generics", meaning that they can be extended to operate on new types at runtime via GOOPS.

The usual situation though is that any particular instance of an addition operator only sees fixnums. In that case, it makes sense to only emit code for fixnums, instead of the product of all possible numeric representations. This is a clear application where inline caches can be interesting to Guile.

Third, there is a very specific case related to dynamic linking. Did you know that most programs compiled for GNU/Linux and related systems have inline caches in them? It's a bit weird but the "Procedure Linkage Table" (PLT) segment in ELF binaries on Linux systems is set up in a way that when e.g. is loaded, the dynamic linker usually doesn't eagerly resolve all of the external routines that uses. The first time that calls frobulate, it ends up calling a procedure that looks up the location of the frobulate procedure, then patches the binary code in the PLT so that the next time frobulate is called, it dispatches directly. To dynamic language people it's the weirdest thing in the world that the C/C++/everything-static universe has at its cold, cold heart a hash table and a dynamic dispatch system that it doesn't expose to any kind of user for instrumenting or introspection -- any user that's not a malware author, of course.

But I digress! Guile can use ICs to lazily resolve runtime routines used by compiled Scheme code. But perhaps this isn't optimal, as the set of primitive runtime calls that Guile will embed in its output is finite, and so resolving these routines eagerly would probably be sufficient. Guile could use ICs for inter-module references as well, and these should indeed be resolved lazily; but I don't know, perhaps the current strategy of using a call-site cache for inter-module references is sufficient.

Fourthly (are you counting?), there is a general case of the former: when you see a call (a b) and you don't know what a is. If you put an inline cache in the call, instead of having to emit checks that a is a heap object and a procedure and then emit an indirect call to the procedure's code, you might be able to emit simply a check that a is the same as x, the only callee you ever saw at that site, and in that case you can emit a direct branch to the function's code instead of an indirect branch.

Here I think the argument is less strong. Modern CPUs are already very good at indirect jumps and well-predicted branches. The value of a devirtualization pass in compilers is that it makes the side effects of a virtual method call concrete, allowing for more optimizations; avoiding indirect branches is good but not necessary. On the other hand, Guile does have polymorphic callees (generic functions), and call ICs could help there. Ideally though we would need to extend the language to allow generic functions to feed back to their inline cache handlers.

Finally, ICs could allow for cheap tracepoints and breakpoints. If at every breakable location you included a jmp *loc, and the initial value of *loc was the next instruction, then you could patch individual locations with code to run there. The patched code would be responsible for saving and restoring machine state around the instrumentation.

Honestly I struggle a lot with the idea of debugging native code. GDB does the least-overhead, most-generic thing, which is patching code directly; but it runs from a separate process, and in Guile we need in-process portable debugging. The debugging use case is a clear area where you want adaptive optimization, so that you can emit debugging ceremony from the hottest code, knowing that you can fall back on some earlier tier. Perhaps Guile should bite the bullet and go this way too.

implementation plan

In Guile, monomorphic as it is in most things, probably only arithmetic is worth the trouble of inline caches, at least in the short term.

Another question is how much to specialize the inline caches to their call site. On the extreme side, each call site could have a custom calling convention: if the first operand is in register A and the second is in register B and they are expected to be fixnums, and the result goes in register C, and the continuation is the code at L, well then you generate an inline cache that specializes to all of that. No need to shuffle operands or results, no need to save the continuation (return location) on the stack.

The opposite would be to call ICs as if their were normal procedures: shuffle arguments into fixed operand registers, push a stack frame, and when the IC returns, shuffle the result into place.

Honestly I am looking mostly to the simple solution. I am concerned about code and heap bloat if I specify to every last detail of a call site. Also maximum speed comes with an adaptive optimizer, and in that case simple lower tiers are best.

sanity check

To compare these impressions, I took a look at V8's current source code to see where they use ICs in practice. When I worked on V8, the compiler was entirely different -- there were two tiers, and both of them generated native code. Inline caches were everywhere, and they were gnarly; every architecture had its own implementation. Now in V8 there are two tiers, not the same as the old ones, and the lowest one is a bytecode interpreter.

As an adaptive optimizer, V8 doesn't need breakpoint ICs. It can always deoptimize back to the interpreter. In actual practice, to debug at a source location, V8 will patch the bytecode to insert a "DebugBreak" instruction, which has its own support in the interpreter. V8 also supports optimized compilation of this operation. So, no ICs needed here.

Likewise for generic type feedback, V8 records types as data rather than in the classic formulation of inline caches as in Self. I think WebKit's JavaScriptCore uses a similar strategy.

V8 does use inline caches for property access (loads and stores). Besides that there is an inline cache used in calls which is just used to record callee counts, and not used for direct call optimization.

Surprisingly, V8 doesn't even seem to use inline caches for arithmetic (any more?). Fair enough, I guess, given that JavaScript's numbers aren't very polymorphic, and even with a system with fixnums and heap floats like V8, floating-point numbers are rare in cold code.

The dynamic linking and relocation points don't apply to V8 either, as it doesn't receive binary code from the internet; it always starts from source.

twilight of the inline cache

There was a time when inline caches were recommended to solve all your VM problems, but it would seem now that their heyday is past.

ICs are still a win if you have named property access on objects whose shape you don't know at compile-time. But improvements in CPU branch target buffers mean that it's no longer imperative to use ICs to avoid indirect branches (modulo Spectre v2), and creating direct branches via code-patching has gotten more expensive and tricky on today's targets with concurrency and deep cache hierarchies.

Besides that, the type feedback component of inline caches seems to be taken over by explicit data-driven call-site caches, rather than executable inline caches, and the highest-throughput tiers of an adaptive optimizer burn away inline caches anyway. The pressure on an inline cache infrastructure now is towards simplicity and ease of type and call-count profiling, leaving the speed component to those higher tiers.

In Guile the bounded polymorphism on arithmetic combined with the need for ahead-of-time compilation means that ICs are probably a code size and execution time win, but it will take some engineering to prevent the calling convention overhead from dominating cost.

Time to experiment, then -- I'll let y'all know how it goes. Thoughts and feedback welcome from the compilerati. Until then, happy hacking :)

by Andy Wingo at February 07, 2018 03:14 PM

February 05, 2018

Andy Wingo

notes from the fosdem 2018 networking devroom

Greetings, internet!

I am on my way back from FOSDEM and thought I would share with yall some impressions from talks in the Networking devroom. I didn't get to go to all that many talks -- FOSDEM's hallway track is the hottest of them all -- but I did hit a select few. Thanks to Dave Neary at Red Hat for organizing the room.

Ray Kinsella -- Intel -- The path to data-plane micro-services

The day started with a drum-beating talk that was very light on technical information.

Essentially Ray was arguing for an evolution of network function virtualization -- that instead of running VNFs on bare metal as was done in the days of yore, that people started to run them in virtual machines, and now they run them in containers -- what's next? Ray is saying that "cloud-native VNFs" are the next step.

Cloud-native VNFs to move from "greedy" VNFs that take charge of the cores that are available to them, to some kind of resource sharing. "Maybe users value flexibility over performance", says Ray. It's the Care Bears approach to networking: (resource) sharing is caring.

In practice he proposed two ways that VNFs can map to cores and cards.

One was in-process sharing, which if I understood him properly was actually as nodes running within a VPP process. Basically in this case VPP or DPDK is the scheduler and multiplexes two or more network functions in one process.

The other was letting Linux schedule separate processes. In networking, we don't usually do it this way: we run network functions on dedicated cores on which nothing else runs. Ray was suggesting that perhaps network functions could be more like "normal" Linux services. Ray doesn't know if Linux scheduling will work in practice. Also it might mean allowing DPDK to work with 4K pages instead of the 2M hugepages it currently requires. This obviously has the potential for more latency hazards and would need some tighter engineering, and ultimately would have fewer guarantees than the "greedy" approach.

Interesting side things I noticed:

  • All the diagrams show Kubernetes managing CPU node allocation and interface assignment. I guess in marketing diagrams, Kubernetes has completely replaced OpenStack.

  • One slide showed guest VNFs differentiated between "virtual network functions" and "socket-based applications", the latter ones being the legacy services that use kernel APIs. It's a useful terminology difference.

  • The talk identifies user-space networking with DPDK (only!).

Finally, I note that Conway's law is obviously reflected in the performance overheads: because there are organizational isolations between dev teams, vendors, and users, there are big technical barriers between them too. The least-overhead forms of resource sharing are also those with the highest technical consistency and integration (nodes in a single VPP instance).

Magnus Karlsson -- Intel -- AF_XDP

This was a talk about getting good throughput from the NIC to userspace, but by using some kernel facilities. The idea is to get the kernel to set up the NIC and virtualize the transmit and receive ring buffers, but to let the NIC's DMA'd packets go directly to userspace.

The performance goal is 40Gbps for thousand-byte packets, or 25 Gbps for traffic with only the smallest packets (64 bytes). The fast path does "zero copy" on the packets if the hardware has the capability to steer the subset of traffic associated with the AF_XDP socket to that particular process.

The AF_XDP project builds on XDP, a newish thing where a little kind of bytecode can run on the kernel or possibly on the NIC. One of the bytecode commands (REDIRECT) causes packets to be forwarded to user-space instead of handled by the kernel's otherwise heavyweight networking stack. AF_XDP is the bridge between XDP on the kernel side and an interface to user-space using sockets (as opposed to e.g. AF_INET). The performance goal was to be within 10% or so of DPDK's raw user-space-only performance.

The benefits of AF_XDP over the current situation would be that you have just one device driver, in the kernel, rather than having to have one driver in the kernel (which you have to have anyway) and one in user-space (for speed). Also, with the kernel involved, there is a possibility for better isolation between different processes or containers, when compared with raw PCI access from user-space..

AF_XDP is what was previously known as AF_PACKET v4, and its numbers are looking somewhat OK. Though it's not upstream yet, it might be interesting to get a Snabb driver here.

I would note that kernel-userspace cooperation is a bit of a theme these days. There are other points of potential cooperation or common domain sharing, storage being an obvious one. However I heard more than once this weekend the kind of "I don't know, that area of the kernel has a different culture" sort of concern as that highlighted by Daniel Vetter in his recent LCA talk.

François-Frédéric Ozog -- Linaro -- Userland Network I/O

This talk is hard to summarize. Like the previous one, it's again about getting packets to userspace with some support from the kernel, but the speaker went really deep and I'm not quite sure what in the talk is new and what is known.

François-Frédéric is working on a new set of abstractions for relating the kernel and user-space. He works on OpenDataPlane (ODP), which is kinda like DPDK in some ways. ARM seems to be a big target for his work; that x86-64 is also a target goes without saying.

His problem statement was, how should we enable fast userland network I/O, without duplicating drivers?

François-Frédéric was a bit negative on AF_XDP because (he says) it is so focused on packets that it neglects other kinds of devices with similar needs, such as crypto accelerators. Apparently the challenge here is accelerating a single large IPsec tunnel -- because the cryptographic operations are serialized, you need good single-core performance, and making use of hardware accelerators seems necessary right now for even a single 10Gbps stream. (If you had many tunnels, you could parallelize, but that's not the case here.)

He was also a bit skeptical about standardizing on the "packet array I/O model" which AF_XDP and most NICS use. What he means here is that most current NICs move packets to and from main memory with the help of a "descriptor array" ring buffer that holds pointers to packets. A transmit array stores packets ready to transmit; a receive array stores maximum-sized packet buffers ready to be filled by the NIC. The packet data itself is somewhere else in memory; the descriptor only points to it. When a new packet is received, the NIC fills the corresponding packet buffer and then updates the "descriptor array" to point to the newly available packet. This requires at least two memory writes from the NIC to memory: at least one to write the packet data (one per 64 bytes of packet data), and one to update the DMA descriptor with the packet length and possible other metadata.

Although these writes go directly to cache, there's a limit to the number of DMA operations that can happen per second, and with 100Gbps cards, we can't afford to make one such transaction per packet.

François-Frédéric promoted an alternative I/O model for high-throughput use cases: the "tape I/O model", where packets are just written back-to-back in a uniform array of memory. Every so often a block of memory containing some number of packets is made available to user-space. This has the advantage of packing in more packets per memory block, as there's no wasted space between packets. This increases cache density and decreases DMA transaction count for transferring packet data, as we can use each 64-byte DMA write to its fullest. Additionally there's no side table of descriptors to update, saving a DMA write there.

Apparently the only cards currently capable of 100 Gbps traffic, the Chelsio and Netcope cards, use the "tape I/O model".

Incidentally, the DMA transfer limit isn't the only constraint. Something I hadn't fully appreciated before was memory write bandwidth. Before, I had thought that because the NIC would transfer in packet data directly to cache, that this wouldn't necessarily cause any write traffic to RAM. Apparently that's not the case. Later over drinks (thanks to Red Hat's networking group for organizing), François-Frédéric asserted that the DMA transfers would eventually use up DDR4 bandwidth as well.

A NIC-to-RAM DMA transaction will write one cache line (usually 64 bytes) to the socket's last-level cache. This write will evict whatever was there before. As far as I can tell, there are three cases of interest here. The best case is where the evicted cache line is from a previous DMA transfer to the same address. In that case it's modified in the cache and not yet flushed to main memory, and we can just update the cache instead of flushing to RAM. (Do I misunderstand the way caches work here? Do let me know.)

However if the evicted cache line is from some other address, we might have to flush to RAM if the cache line is dirty. That causes a memory write traffic. But if the cache line is clean, that means it was probably loaded as part of a memory read operation, and then that means we're evicting part of the network function's working set, which will later cause memory read traffic as the data gets loaded in again, and write traffic to flush out the DMA'd packet data cache line.

François-Frédéric simplified the whole thing to equate packet bandwidth with memory write bandwidth, that yes, the packet goes directly to cache but it is also written to RAM. I can't convince myself that that's the case for all packets, but I need to look more into this.

Of course the cache pressure and the memory traffic is worse if the packet data is less compact in memory; and worse still if there is any need to copy data. Ultimately, processing small packets at 100Gbps is still a huge challenge for user-space networking, and it's no wonder that there are only a couple devices on the market that can do it reliably, not that I've seen either of them operate first-hand :)

Talking with Snabb's Luke Gorrie later on, he thought that it could be that we can still stretch the packet array I/O model for a while, given that PCIe gen4 is coming soon, which will increase the DMA transaction rate. So that's a possibility to keep in mind.

At the same time, apparently there are some "coherent interconnects" coming too which will allow the NIC's memory to be mapped into the "normal" address space available to the CPU. In this model, instead of having the NIC transfer packets to the CPU, the NIC's memory will be directly addressable from the CPU, as if it were part of RAM. The latency to pull data in from the NIC to cache is expected to be slightly longer than a RAM access; for comparison, RAM access takes about 70 nanoseconds.

For a user-space networking workload, coherent interconnects don't change much. You still need to get the packet data into cache. True, you do avoid the writeback to main memory, as the packet is already in addressable memory before it's in cache. But, if it's possible to keep the packet on the NIC -- like maybe you are able to add some kind of inline classifier on the NIC that could directly shunt a packet towards an on-board IPSec accelerator -- in that case you could avoid a lot of memory transfer. That appears to be the driving factor for coherent interconnects.

At some point in François-Frédéric's talk, my brain just died. I didn't quite understand all the complexities that he was taking into account. Later, after he kindly took the time to dispell some more of my ignorance, I understand more of it, though not yet all :) The concrete "deliverable" of the talk was a model for kernel modules and user-space drivers that uses the paradigms he was promoting. It's a work in progress from Linaro's networking group, with some support from NIC vendors and CPU manufacturers.

Luke Gorrie and Asumu Takikawa -- SnabbCo and Igalia -- How to write your own NIC driver, and why

This talk had the most magnificent beginning: a sort of "repent now ye sinners" sermon from Luke Gorrie, a seasoned veteran of software networking. Luke started by describing the path of righteousness leading to "driver heaven", a world in which all vendors have publically accessible datasheets which parsimoniously describe what you need to get packets flowing. In this blessed land it's easy to write drivers, and for that reason there are many of them. Developers choose a driver based on their needs, or they write one themselves if their needs are quite specific.

But there is another path, says Luke, that of "driver hell": a world of wickedness and proprietary datasheets, where even when you buy the hardware, you can't program it unless you're buying a hundred thousand units, and even then you are smitten with the cursed non-disclosure agreements. In this inferno, only a vendor is practically empowered to write drivers, but their poor driver developers are only incentivized to get the driver out the door deployed on all nine architectural circles of driver hell. So they include some kind of circle-of-hell abstraction layer, resulting in a hundred thousand lines of code like a tangled frozen beard. We all saw the abyss and repented.

Luke described the process that led to Mellanox releasing the specification for its ConnectX line of cards, something that was warmly appreciated by the entire audience, users and driver developers included. Wonderful stuff.

My Igalia colleague Asumu Takikawa took the last half of the presentation, showing some code for the driver for the Intel i210, i350, and 82599 cards. For more on that, I recommend his recent blog post on user-space driver development. It was truly a ray of sunshine in dark, dark Brussels.

Ole Trøan -- Cisco -- Fast dataplanes with VPP

This talk was a delightful introduction to VPP, but without all of the marketing; the sort of talk that makes FOSDEM worthwhile. Usually at more commercial, vendory events, you can't really get close to the technical people unless you have a vendor relationship: they are surrounded by a phalanx of salesfolk. But in FOSDEM it is clear that we are all comrades out on the open source networking front.

The speaker expressed great personal pleasure on having being able to work on open source software; his relief was palpable. A nice moment.

He also had some kind words about Snabb, too, saying at one point that "of course you can do it on snabb as well -- Snabb and VPP are quite similar in their approach to life". He trolled the horrible complexity diagrams of many "NFV" stacks whose components reflect the org charts that produce them more than the needs of the network functions in question (service chaining anyone?).

He did get to drop some numbers as well, which I found interesting. One is that recently they have been working on carrier-grade NAT, aiming for 6 terabits per second. Those are pretty big boxes and I hope they are getting paid appropriately for that :) For context he said that for a 4-unit server, these days you can build one that does a little less than a terabit per second. I assume that's with ten dual-port 40Gbps cards, and I would guess to power that you'd need around 40 cores or so, split between two sockets.

Finally, he finished with a long example on lightweight 4-over-6. Incidentally this is the same network function my group at Igalia has been building in Snabb over the last couple years, so it was interesting to see the comparison. I enjoyed his commentary that although all of these technologies (carrier-grade NAT, MAP, lightweight 4-over-6) have the ostensible goal of keeping IPv4 running, in reality "we're day by day making IPv4 work worse", mainly by breaking the assumption that just because you get traffic from port P on IP M, doesn't mean you can send traffic to M from another port or another protocol and have it reach the target.

All of these technologies also have problems with IPv4 fragmentation. Getting it right is possible but expensive. Instead, Ole mentions that he and a cross-vendor cabal of dataplane people have a "dark RFC" in the works to deprecate IPv4 fragmentation entirely :)

OK that's it. If I get around to writing up the couple of interesting Java talks I went to (I know right?) I'll let yall know. Happy hacking!

by Andy Wingo at February 05, 2018 05:22 PM

February 01, 2018

Iago Toral

Intel Mesa driver for Linux is now OpenGL 4.6 conformant

Khronos has recently announced the conformance program for OpenGL 4.6 and I am very happy to say that Intel has submitted successful conformance applications for various of its GPU models for the Mesa Linux driver. For specifics on the conformant hardware you can check the list of conformant OpenGL products at the Khronos webstite.

Being conformant on day one, which the Intel Mesa Vulkan driver also obtained back in the day, is a significant achievement. Besides Intel Mesa, only NVIDIA managed to do this, which I think speaks of the amount of work and effort that one needs to put to achieve it. The days where Linux implementations lagged behind are long gone, we should all celebrate this and acknowledge the important efforts that companies like Intel have put into making this a reality.

Over the last 8-9 months or so, I have been working together with some of my Igalian friends to keep the Intel drivers (for both OpenGL and Vulkan) conformant, so I am very proud that we have reached this milestone. Kudos to all my work mates who have worked with me on this, to our friends at Intel, who have been providing reviews for our patches, feedback and additional driver fixes, and to many other members in the Mesa community who have contributed to make this possible in one way or another.

Of course, OpenGL 4.6 conformance requires that we have an implementation of GL_ARB_gl_spirv, which allows OpenGL applications to consume SPIR-V shaders. If you have been following Igalia’s work, you have already seen some of my colleagues sending patches for this over the last months, but the feature is not completely upstreamed yet. We are working hard on this, but the scope of the implementation that we want for upstream is rather ambitious, since it involves to (finally) have a full shader linker in NIR. Getting that to be as complete as the current GLSL linker and in a shape that is good enough for review and upstreaming is going to take some time, but it is surely a worthwhile effort that will pay off in the future, so please look forward to it and be patient with us as we upstream more of it in the coming months.

It is also important to remark that OpenGL 4.6 conformance doesn’t just validate new features in OpenGL 4.6, it is a full conformance program for OpenGL drivers that includes OpenGL 4.6 functionality, and as such, it is a super set of the OpenGL 4.5 conformance. The OpenGL 4.6 CTS does, in fact, incorporate a whole lot of bugfixes and expanded coverage for OpenGL features that were already present in OpenGL 4.5 and prior.

What is the conformance process and why is it important?

It is a well known issue with standards that different implementations are not always consistent. This can happen for a number of reasons. For example, implementations have bugs which can make something work on one platform but not on another (which will then require applications to implement work arounds). Another reason for this is that some times implementators just have different interpretations of the standard.

The Khronos conformance program is intended to ensure that products that implement Khronos standards (such as OpenGL or Vulkan drivers) do what they are supposed to do and they do it consistently across implementations from the same or different vendors. This is achieved by producing an extensive test suite, the Conformance Test Suite (or CTS for short), which aims to verify that the semantics of the standard are properly implemented by as many vendors as possible.

Why is CTS different to other test suites available?

One reason is that CTS development includes Khronos members who are involved in the definition of the API specifications. This means there are people well versed in the spec language who can review and provide feedback to test developers to ensure that the tests are good.

Another reason is that before new tests go in, it is required that there are at least a number of implementation (from different vendors) that pass them. This means that various vendors have implemented the related specifications and these different implementations agree on the expected result, which is usually a good sign that the tests and the implementations are good (although this is not always enough!).

How does CTS and the Khronos conformance process help API implementators and users?

First, it makes it so that existing and new functionality covered in the API specifications is tested before granting the conformance status. This means that implementations have to run all these tests and pass them, producing the same results as other implementations, so as far as the test coverage goes, the implementations are correct and consistent, which is the whole point of this process: it wont’ matter if you’re running your application on Intel, NVIDIA, AMD or a different GPU vendor, if your application is correct, it should run the
same no matter the driver you are running on.

Now, this doesn’t mean that your application will run smoothly on all conformant platforms out of the box. Application developers still need to be aware that certain aspects or features in the specifications are optional, or that different hardware implementations may have different limits for certain things. Writing software that can run on multiple platforms is always a challenge and some of that will always need to be addressed on the application side, but at least the conformance process attempts to ensure that for applications that do their part of the work, things will work as intended.

There is another interesting implication of conformance that has to do with correct API specification. Designing APIs that can work across hardware from different vendors is a challenging process. With the CTS, Khronos has an opportunity to validate the specifications against actual implementations. In other words, the CTS allows Khronos to verify that vendors can implement the specifications as intended and revisit the specification if they can’t before releasing them. This ensures that API specifications are reasonable and a good match for existing hardware implementations.

Another benefit of CTS is that vendors working on any API implementation will always need some level of testing to verify their code during development. Without CTS, they would have to write their own tests (which would be biased towards their own interpretations of the spec anyway), but with CTS, they can leave that to Khronos and focus on the implementation instead, cutting down development times and sharing testing code with other vendors.

What about Piglit or other testing frameworks?

CTS doesn’t make Piglit obsolete or redundant at all. While CTS coverage will improve over the years it is nearly impossible to have 100% coverage, so having other testing frameworks around that can provide extra coverage is always good.

My experience working on the Mesa drivers is that it is surprisingly easy to break stuff, specially on OpenGL which has a lot of legacy stuff in it. I have seen way too many instances of patches that seemed correct and in fact fixed actual problems only to later see Piglit, CTS and/or dEQP report regressions on existing tests. The (not so surprising) thing is that many times, the regressions were reported on just one of these testing frameworks, which means they all provide some level of coverage that is not included in the others.

It is for this reason that the continuous integration system for Mesa provided by Intel runs all of these testing frameworks (and then some others). You just never get enough testing. And even then, some regressions slip into releases despite all the testing!

Also, for the case of Piglit in particular, I have to say that it is very easy to write new tests, specially shader runner tests, which is always a bonus. Writing tests for CTS or dEQP, for example, requires more work in general.

So what have I been doing exactly?

For the last 9 months or so, I have been working on ensuring that the Intel Mesa drivers for both Vulkan and OpenGL are conformant. If you have followed any of my work in Mesa over the last year or so, you have probably already guessed this, since most of the patches I have been sending to Mesa reference the conformance tests they fix.

To be more thorough, my work included:

  1. Reviewing and testing patches submitted for inclusion in CTS that either fixed test bugs, extended coverage for existing features or added new tests for new API specifications. CTS is a fairly active project with numerous changes submitted for review pretty much every day, for OpenGL, OpenGL ES and Vulkan, so staying on top of things requires a significant dedication.
  • Ensuring that the Intel Mesa drivers passed all the CTS tests for both Vulkan and OpenGL. This requires to run the conformance tests, identifying test failures, identifying the cause for the failures and providing proper fixes. The fixes would go to CTS when the cause for the issue was a bogus test, to the driver, when it was a bug in our implementation or the fact that the driver was simply missing some functionality, or they could even go to the OpenGL or Vulkan specs, when the source of the problem was incomplete, ambiguous or incorrect spec language that was used to drive the test development. I have found instances of all these situations.

  • Where can I get the CTS code?

    Good news, it is open source and available at GitHub.

    This is a very important and welcomed change by Khronos. When I started helping Intel with OpenGL conformance, specifically for OpenGL 4.5, the CTS code was only available to specific Khronos members. Since then, Khronos has done a significant effort in working towards having an open source testing framework where anyone can contribute, so kudos to Khronos for doing this!

    Going open source not only leverages larger collaboration and further development of the CTS. It also puts in the hands of API users a handful of small test samples that people can use to learn how some of the new Vulkan and OpenGL APIs released to the public are to be used, which is always nice.

    What is next?

    As I said above, CTS development is always ongoing, there is always testing coverage to expand for existing features, bugfixes to provide for existing tests, more tests that need to be adapted or changed to match corrections in in the spec language, new extensions and APIs that need testing coverage, etc.

    And on the driver side, there are always new features to implement that come with their potential bugs that need to be fixed, occasional regressions that need to be addressed promptly, new bugs uncovered by new tests that need fixes, etc

    So the fun never really stops 🙂

    Final words

    In this post, besides bringing the good news to everyone, I hope that I have made a good case for why the Khronos CTS is important for the industry and why we should care about it. I hope that I also managed to give a sense for the enormous amount of work that goes into making all of this possible, both on the side of Khronos and the side of the driver developer teams. I think all this effort means better drivers for everyone and I hope that we all, as users, come to appreciate it for that.

    Finally, big thanks to Intel for sponsoring our work on Mesa and CTS, and also to Igalia, for having me work on this wonderful project.

    OpenGL® and the oval logo are trademarks or registered trademarks of Silicon Graphics, Inc. in the United States and/or other countries worldwide. Additional license details are available on the SGI website.

    by Iago Toral at February 01, 2018 11:34 AM

    January 24, 2018

    Michael Catanzaro

    Announcing Epiphany Technology Preview

    If you use macOS, the best way to use a recent development snapshot of WebKit is surely Safari Technology Preview. But until now, there’s been no good way to do so on Linux, short of running a development distribution like Fedora Rawhide.

    Enter Epiphany Technology Preview. This is a nightly build of Epiphany, on top of the latest development release of WebKitGTK+, running on the GNOME master Flatpak runtime. The target audience is anyone who wants to assist with Epiphany development by testing the latest code and reporting bugs, so I’ve added the download link to Epiphany’s development page.

    Since it uses Flatpak, there are no host dependencies asides from Flatpak itself, so it should work on any system that can run Flatpak. Thanks to the Flatpak sandbox, it’s far more secure than the version of Epiphany provided by your operating system. And of course, you enjoy automatic updates from GNOME Software or any software center that supports Flatpak.


    (P.S. If you want to use the latest stable version instead, with all the benefits provided by Flatpak, get that here.)

    by Michael Catanzaro at January 24, 2018 10:58 PM

    January 17, 2018

    Andy Wingo

    instruction explosion in guile

    Greetings, fellow Schemers and compiler nerds: I bring fresh nargery!

    instruction explosion

    A couple years ago I made a list of compiler tasks for Guile. Most of these are still open, but I've been chipping away at the one labeled "instruction explosion":

    Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do this, which would be a lose overall because you have more instruction dispatch.

    Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.

    Now that I'm getting close to finished I wanted to share some thoughts. Previous progress reports on the mailing list.

    a simple loop

    As an example, consider this loop that sums the 32-bit floats in a bytevector. I've annotated the code with lines and columns so that you can correspond different pieces to the assembly.

       0       8   12     19
    1| (use-modules (rnrs bytevectors))
    2| (define (f32v-sum bv)
    3|   (let lp ((n 0) (sum 0.0))
    4|     (if (< n (bytevector-length bv))
    5|         (lp (+ n 4)
    6|             (+ sum (bytevector-ieee-single-native-ref bv n)))
    7|          sum)))

    The assembly for the loop before instruction explosion went like this:

      17    (handle-interrupts)     at (unknown file):5:12
      18    (uadd/immediate 0 1 4)
      19    (bv-f32-ref 1 3 1)      at (unknown file):6:19
      20    (fadd 2 2 1)            at (unknown file):6:12
      21    (s64<? 0 4)             at (unknown file):4:8
      22    (jnl 8)                ;; -> L4
      23    (mov 1 0)               at (unknown file):5:8
      24    (j -7)                 ;; -> L1

    So, already Guile's compiler has hoisted the (bytevector-length bv) and unboxed the loop index n and accumulator sum. This work aims to simplify further by exploding bv-f32-ref.

    exploding the loop

    In practice, instruction explosion happens in CPS conversion, as we are converting the Scheme-like Tree-IL language down to the CPS soup language. When we see a Tree-Il primcall (a call to a known primitive), instead of lowering it to a corresponding CPS primcall, we inline a whole blob of code.

    In the concrete case of bv-f32-ref, we'd inline it with something like the following:

    (unless (and (heap-object? bv)
                 (eq? (heap-type-tag bv) %bytevector-tag))
      (error "not a bytevector" bv))
    (define len (word-ref bv 1))
    (define ptr (word-ref bv 2))
    (unless (and (<= 4 len)
                 (<= idx (- len 4)))
      (error "out of range" idx))
    (f32-ref ptr len)

    As you can see, there are four branches hidden in the bv-f32-ref: two to check that the object is a bytevector, and two to check that the index is within range. In this explanation we assume that the offset idx is already unboxed, but actually unboxing the index ends up being part of this work as well.

    One of the goals of instruction explosion was that by breaking the operation into a number of smaller, more orthogonal parts, native code generation would be easier, because the compiler would only have to know about those small bits. However without an optimizing compiler, it would be better to reify a call out to a specialized bv-f32-ref runtime routine instead of inlining all of this code -- probably whatever language you write your runtime routine in (C, rust, whatever) will do a better job optimizing than your compiler will.

    But with an optimizing compiler, there is the possibility of removing possibly everything but the f32-ref. Guile doesn't quite get there, but almost; here's the post-explosion optimized assembly of the inner loop of f32v-sum:

      27    (handle-interrupts)
      28    (tag-fixnum 1 2)
      29    (s64<? 2 4)             at (unknown file):4:8
      30    (jnl 15)               ;; -> L5
      31    (uadd/immediate 0 2 4)  at (unknown file):5:12
      32    (u64<? 2 7)             at (unknown file):6:19
      33    (jnl 5)                ;; -> L2
      34    (f32-ref 2 5 2)
      35    (fadd 3 3 2)            at (unknown file):6:12
      36    (mov 2 0)               at (unknown file):5:8
      37    (j -10)                ;; -> L1

    good things

    The first thing to note is that unlike the "before" code, there's no instruction in this loop that can throw an exception. Neat.

    Next, note that there's no type check on the bytevector; the peeled iteration preceding the loop already proved that the bytevector is a bytevector.

    And indeed there's no reference to the bytevector at all in the loop! The value being dereferenced in (f32-ref 2 5 2) is a raw pointer. (Read this instruction as, "sp[2] = *(float*)((byte*)sp[5] + (uptrdiff_t)sp[2])".) The compiler does something interesting; the f32-ref CPS primcall actually takes three arguments: the garbage-collected object protecting the pointer, the pointer itself, and the offset. The object itself doesn't appear in the residual code, but including it in the f32-ref primcall's inputs keeps it alive as long as the f32-ref itself is alive.

    bad things

    Then there are the limitations. Firstly, instruction 28 tags the u64 loop index as a fixnum, but never uses the result. Why is this here? Sadly it's because the value is used in the bailout at L2. Recall this pseudocode:

    (unless (and (<= 4 len)
                 (<= idx (- len 4)))
      (error "out of range" idx))

    Here the error ends up lowering to a throw CPS term that the compiler recognizes as a bailout and renders out-of-line; cool. But it uses idx as an argument, as a tagged SCM value. The compiler untags the loop index, but has to keep a tagged version around for the error cases.

    The right fix is probably some kind of allocation sinking pass that sinks the tag-fixnum to the bailouts. Oh well.

    Additionally, there are two tests in the loop. Are both necessary? Turns out, yes :( Imagine you have a bytevector of length 1025. The loop continues until the last ref at offset 1024, which is within bounds of the bytevector but there's one one byte available at that point, so we need to throw an exception at this point. The compiler did as good a job as we could expect it to do.

    is is worth it? where to now?

    On the one hand, instruction explosion is a step sideways. The code is more optimal, but it's more instructions. Because Guile currently has a bytecode VM, that means more total interpreter overhead. Testing on a 40-megabyte bytevector of 32-bit floats, the exploded f32v-sum completes in 115 milliseconds compared to around 97 for the earlier version.

    On the other hand, it is very easy to imagine how to compile these instructions to native code, either ahead-of-time or via a simple template JIT. You practically just have to look up the instructions in the corresponding ISA reference, is all. The result should perform quite well.

    I will probably take a whack at a simple template JIT first that does no register allocation, then ahead-of-time compilation with register allocation. Getting the AOT-compiled artifacts to dynamically link with runtime routines is a sufficient pain in my mind that I will put it off a bit until later. I also need to figure out a good strategy for truly polymorphic operations like general integer addition; probably involving inline caches.

    So that's where we're at :) Thanks for reading, and happy hacking in Guile in 2018!

    by Andy Wingo at January 17, 2018 10:30 AM

    January 15, 2018

    Asumu Takikawa

    Supporting both VMDq and RSS in Snabb

    In my previous blog post, I talked about the support libraries and the core structure of Snabb’s NIC drivers. In this post, I’ll talk about some of the driver improvements we made at Igalia over the last few months.

    (as in my previous post, this work was joint work with Nicola Larosa)


    Modern NICs are designed to take advantage of increasing parallelism in modern CPUs in order to scale to larger workloads.

    In particular, to scale to 100G workloads, it becomes necessary to work in parallel since a single off-the-shelf core cannot keep up. Even with 10G hardware, processing packets in parallel makes it easier for software to operate at line-rate because the time budget is quite tight.

    To get an idea of what the time budget is like, see these calculations. tl;dr is 67.2 ns/packet or about 201 cycles.

    To scale to multiple CPUs, NICs have a feature called receive-side scaling or RSS which distributes incoming packets to multiple receive queues. These queues can be serviced by separate cores.

    RSS and related features for Intel NICs are detailed more in an overview whitepaper

    RSS works by computing a hash in hardware over the packet to determine the flow it belongs to (this is similar to the hashing used in IPFIX, which I described in a previous blog post).

    RSS diagram
    A diagram showing how RSS directs packets

    The diagram above tries to illustrate this. When a packet arrives in the NIC, the hash is computed. Packets with the same hash (i.e., they’re in the same flow) are directed to a particular receive queue. Receive queues live in RAM as a ring buffer (shown as blue rings in the diagram) and packets are placed there via DMA by consulting registers on the NIC.

    All this means that network functions that depend on tracking flow-related state can usually still work in this parallel setup.

    As a side note, you might wonder (I did anyway!) what happens to fragmented packets whose flow membership may not be identifiable from a fragment. It turns out that on Intel NICs, the hash function will ignore the layer 3 flow information when a packet is fragmented. This means that on occasion a fragmented packet may end up on a different queue than a non-fragmented packet in the same flow. More on this problem here.

    Snabb’s two Intel drivers

    The existing driver used in most Snabb programs ( worked well and was mature but was missing support for RSS.

    An alternate driver (apps.intel_mp.intel_mp) made by Peter Bristow supported RSS, but wasn’t entirely compatible with the features provided by the main Intel driver. We worked on extending intel_mp to work as a more-or-less drop in replacement for intel_app.

    The incompatibility between the two drivers was caused mainly by lack of support for VMDq (Virtual Machine Device Queues) in intel_mp. This is another feature that allows for multiple queue operation on Intel NICs that is used to allow a NIC to present itself as multiple virtualized sets of queues. It’s often used to host VMs in a virtualized environment, but can also be used (as in Snabb) for serving logically separate apps.

    The basic idea is that queues may be assigned to separate pools assigned to a VM or app with its own particular MAC address. A host can use this to run logically separate network functions sharing a single NIC. As with RSS, services running on separate cores can service the queues in parallel.

    RSS diagram
    A diagram showing how VMDq affects queue selection

    As the diagram above shows, adding VMDq changes queue selection slightly from the RSS case above. An appropriate pool is selected based on criteria such as the MAC address (or VLAN tag, and so on) and then RSS may be used.

    BTW, VMDq is not the only virtualization feature on these NICs. There is also SR-IOV or “Single Root I/O Virtualization” which is designed to provide a virtualized NIC for every VM that directly uses the NIC hardware resources. My understanding is that Snabb doesn’t use it for now because we can implement more switching flexibility in software.

    The intel_app driver supports VMDq but not RSS and the opposite situation is true for intel_mp. It turns out that both features can be used simultaneously, in which case packets are first sorted by MAC address and then by flow hashing for RSS. Basically each VMDq pool has its own set of RSS queues.

    We implemented this support in the intel_mp driver and made the driver interface mostly compatible with intel_app so that only minimal modifications are necessary to switch over. In the process, we made bug-fixes and performance fixes in the driver to try to ensure that performance and reliability are comparable to using intel_app.

    The development process was made a lot easier due to the existence of the intel_app code that we could copy and follow in many cases.

    The tricky parts were making sure that the NIC state was set correctly when multiple processes were using the NIC. In particular, intel_app can rely on tracking VMDq state inside a single Lua process.

    For intel_mp, it is necessary to use locking and IPC (via shared memory) to coordinate between different Lua processes that are setting driver state. In particular, the driver needs to be careful to be aware of what resources (VMDq pool numbers, MAC address registers, etc.) are available for use.

    Current status

    The driver improvements are now merged upstream in intel_mp, which is now the default driver, and is available in the Snabb 2017.11 “Endive” release. It’s still possible to opt out and use the old driver in case there are any problems with using intel_mp. And of course we appreciate any bug reports or feedback.

    by Asumu Takikawa at January 15, 2018 04:24 PM

    January 12, 2018

    Diego Pino

    More practical Snabb

    Some time ago, in a Hacker News thread an user proposed the following use case for Snabb:

    I have a ChromeCast on my home network, but I want sandbox/log its traffic. I would want to write some logic to ignore video data, because that’s big. But I want to see the metadata and which servers it’s talking to. I want to see when it’s auto-updating itself with new binaries and record them.

    Is that a good use case for Snabb Switch, or is there is an easier way to accomplish what I want?

    I decided to take this request and implement it as a tutorial. Hopefully, the resulting tutorial can be a valuable piece of information highlighting some of Snabb’s strengths:

    • Fine-grained control of the data-plane.
    • Wide variety of solid libraries for protocol parsing.
    • Rapid prototyping.

    Limiting the project’s scope

    Before putting my hands down on this project, I broke it down into smaller pieces and checked how much of it is already supported in Snabb. To fully implement this project I’d need:

    1. To be able to discover Chromecast devices.
    2. Identify their network flows.
    3. Save the data to disk.

    Snabb provides libraries to identify network flows as well as capturing packets and filter them by content. That pretty much covers bullets 2) and 3). However, Snabb doesn’t provide any tool or library to fully support bullet 1). Thus, I’m going to limit the scope of this tutorial to that single feature: Discover Chromecast and similar devices in a local network.

    Multicast DNS

    A fast lookup on Chromecast’s Wikipedia article reveals Chromecast devices rely on a protocol called Multicast DNS (mDNS).

    Multicast DNS is standardized as RFC6762. The origin of the protocol goes back to Apple’s Rendezvous, later rebranded as Bonjour. Bonjour is in fact the origin of the more generic concept known as Zeroconf. Zeroconf’s goal is to automatically create usable TCP/IP computer networks when computers or network peripherals are interconnected. It is composed of three main elements:

    • Addressing: Self-Assigned Link-Local Addressing (RFC2462 and RFC3927). Automatically assigned addresses in the network space.
    • Naming: Multicast DNS (RFC6762). Host name resolution.
    • Browsing: DNS Service Discovery (RFC6763). The ability of discovering devices and services in a local network.

    Multicast DNS and DNS-SD are very similar and are often mixed up, although they are not strictly the same thing. The former is the description of how to do name resolution in a serverless DNS network, while DNS-SD, although a protocol as well, is an specific use of Multicast DNS.

    One of the nicest things of Multicast DNS is that it reuses many of the concepts of DNS. This allowed mDNS to spread quickly and gain fast adoption, since existing software only required mininimal change. What’s more, programmers didn’t need to learn new APIs or study a completely brand-new protocol.

    Today Multicast DNS is featured in a myriad of small devices, ranging from Google Chromecast to Amazon’s FireTV or Philips Hue lights, as well as software such as Apple’s Bonjour or Spotify.

    This tutorial is going to focus pretty much on mDNS/DNS-SD. Since Multicast DNS reuses many of the ideas of DNS, I am going to review DNS first. Feel free to skip the next section if you are already familiar with DNS.

    DNS basis

    The most common use case of DNS is resolving host names to IP addresses:

    $ dig -t A +short

    In the command above, flag ‘-t A’ means an Address record. There are actually many different types of DNS records. The most common ones are:

    • A (Address record). Used to map hostnames to IPv4 address.
    • AAAA (IPv6 address record). Used to map hostnames to IPv6 address.
    • PTR (Pointer record). Used for reverse DNS lookups, that means, IP addresses to hostnames.
    • SOA (Start of zone of authority). DNS can be seen as a distributed database which is organized in a hierarchical layout of subdomains. A DNS zone is a contiguous portion of the domain space for which a server is responsible of. The top-level DNS zone is known as the DNS root zone, which consists of 13 logical root name servers (although there are more than 13 instances) that contain the top-level domains, generic top-level domains (.com, .net, etc) and country code top-level domains. The command below prints out how the domain gets resolved (I trimmed down the output for the sake of clarity).
    $ dig @ +trace
    ; <<>> DiG 9.10.3-P4-Ubuntu <<>> @ +trace
    ; (1 server found)
    ;; global options: +cmd
    .                       181853  IN      NS
    .                       181853  IN      NS
    .                       181853  IN      NS
    .                       181853  IN      RRSIG   NS 8 0 518400 20180117170000 20180104160000 41824 ....
    ;; Received 525 bytes from in 48 ms
    com.                    172800  IN      NS
    com.                    172800  IN      NS
    com.                    172800  IN      NS
    com.                    86400   IN      RRSIG   DS 8 1 86400 20180118170000 20180105160000 41824 ...
    ;; Received 1174 bytes from in 44 ms             172800  IN      NS             172800  IN      NS             172800  IN      NS             172800  IN      NS
    ;; Received 664 bytes from in 44 ms         300     IN      A
    ;; Received 48 bytes from in 48 ms

    The domain name is split in parts. First the top-level domain is consulted which returns a list of name servers. The root server gets consulted to resolve the subdomain .com. That also returns a list of generic top-level domain name servers. Name server is picked and returns another list of name servers for Finally gets resolved by, that returns the A record containing the domain name IPv4 address.

    Using DNS is also possible to resolve an IP address to a domain name.

    $ dig -x +short

    In this case, the type record is PTR. The command above is equivalent to:

    $ dig -t PTR +short

    When using PTR records for reverse lookups, the target IPv4 addres has to be part of the domain This is an special domain registered under the top-level domain arpa and it’s used for reverse IPv4 lookup. Reverse lookup is the most common use of PTR records, but in fact PTR records are just pointers to a canonical name and other uses are possible as we will see later.


    • DNS helps solving a host name to an IP address. Other types of record resolution are possible.
    • DNS is a centralized protocol where DNS servers respond to DNS queries.
    • DNS names are grouped in zones or domains, forming a hierarchical structure. Each SOA is responsible of the name resolution within its area.

    DNS Service Discovery

    Unlike DNS, Multicast DNS doesn’t require a central server. Instead devices listen on port 5353 for DNS queries to a multicast address. In the case of IPv4, this destination address is In addition, the destination Ethernet address of a mDNS request must be 01:00:5E:00:00:FB.

    The Multicast DNS standard defines the domain name local as a pseudo-TLD (top-level domain) under which hosts and services can register. For instance, a laptop computer might answer to the name mylaptop.local (replace mylaptop for your actual laptop’s name).

    $ dig @ -p 5353 mylaptop.local. +short

    To discover all the services and devices in a local network, DNS-SD sends a PTR Multicast DNS request asking for the domain name `services._dns-sd._udp.local.

    $ dig @ -p 5353 -t PTR _services._dns-sd._udp.local

    The expected result should be a set of PTR records announcing their domain name. In my case the dig command doesn’t print out any PTR records, but using tcpdump I can check I’m in fact receiving mDNS responses:

    $ sudo tcpdump "port 5353" -t -qns 0 -e -i wlp3s0
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on wlp3s0, link-type EN10MB (Ethernet), capture size 262144 bytes
    44:85:00:4f:b8:fc > 01:00:5e:00:00:fb, IPv4, length 99: > UDP, length 57
    54:60:09:fc:d6:04 > 01:00:5e:00:00:fb, IPv4, length 82: > UDP, length 40
    54:60:09:fc:d6:04 > 01:00:5e:00:00:fb, IPv4, length 299: > UDP, length 257
    54:60:09:fc:d6:04 > 01:00:5e:00:00:fb, IPv4, length 119: > UDP, length 77
    f4:f5:d8:d3:de:dc > 01:00:5e:00:00:fb, IPv4, length 299: > UDP, length 257
    f4:f5:d8:d3:de:dc > 01:00:5e:00:00:fb, IPv4, length 186: > UDP, length 144

    Why dig doesn’t print out the PTR records is still a mystery to me. So instead of dig I used Avahi, the free software implementation of mDNS/DNS-SD, to browse the available devices:

    $ avahi-browse -a
    + wlp3s0 IPv4 dcad2b6c-7a21-10c310-568b-ad83b4a3ea1e          _googlezone._tcp     local
    + wlp3s0 IPv4 1ebe35f6-26f1-bc92-318c-9e35fdcbe11d          _googlezone._tcp     local
    + wlp3s0 IPv4 Google-Cast-Group-71010755f10ad16b10c231437a5e543d1dc3 _googlecast._tcp     local
    + wlp3s0 IPv4 Chromecast-Audio-fd7d2b9d29c92b24db10be10661010eebb9f _googlecast._tcp     local
    + wlp3s0 IPv4 Google-Home-d81d02e1e48a1f0b7d2cbac88f2df820  _googlecast._tcp     local
    + wlp3s0 IPv4 dcad2b6c7a2110c310-0                            _spotify-connect._tcp local

    Each row identifies a service instance name. The structure of a service instance name is the following:

    Service Instance Name = <Instance> . <Service> . <Domain>

    For example, consider the following record “_spotify-connect._tcp.local”:

    • Domain: local. The pseudo-TLD used by Multicast DNS.
    • Service: spotify-connect._tcp. The service names consists of a pair of DNS labels. The first label identifies what the service does (_spotify-connect is a service that allows an user to continue playing Spotify from a phone to a desktop computer, and viceversa). The second label identifies what protocol the service uses, in this case TCP.
    • Instance: dcad2b6c7a2110c310-0. An user friendly name that identifies the service.

    Besides a PTR record, an instance also replies with several additional DNS records that might be useful for the requester. These extra records are part of the PTR record and are embed in the DNS additional records field. These extra records are of 3 types:

    • SRV: Gives the target host and port where the service instance can be reached.
    • TXT: Gives additional information about this instance, in a structured form using key/value pairs.
    • A: IPv4 address of the reached instance.

    Snabb’s DNS-SD

    Now that we have a fair understanding of Multicast DNS and DNS-SD, we can start coding the app in Snabb. Like on the previous posts I decided not to past the code directly here, instead I’ve pushed the code to a remote branch and will comment on the most relevant parts. To checkout this repo do:

    $ git clone
    $ cd snabb
    $ git remote add dpino
    $ git checkout dns-sd


    • The app needs to send a DNS-SD packet through a network interface managed by the OS. I used Snabb’s RawSocket app to do that.
    • A DNSSD app emits one DNS-SD request every second. This is done in DNSSD’s pull method. There’s a helper class called mDNSQuery that is in charge of composing this request.
    • The DNSSD app receives responses on its push method. If the response is a Multicast DNS packet, it will print out all the contained DNS records in stdout.
    • A Multicast DNS packet is composed by a header and a body. The header contains control information such as number of queries, answers, additional records, etc. The body contains DNS records. If the mDNS packet is a response packet, these are the DNS records we would need to print out.
    • To help me handling Multicast DNS packets I created a MDNS helper class. Similarly, I added a DNS helper class that helps me parsing the necessary DNS records: PTR, SRV, TXT and A records.

    Here is Snabb’s dns-sd command in use:

    $ sudo ./snabb dnssd --interface wlp3s0
    PTR: (name: _services._dns-sd._udp.local; domain-name: _spotify-connect._tcp )
    SRV: (target: dcad2b6c7a2110c310-0)
    TXT: (CPath=/zc/0;VERSION=1.0;Stack=SP;)
    PTR: (name: _googlecast._tcp.local; domain-name: Chromecast-Audio-fd7d2b9d29c92b24db10be10661010eebb9f)
    SRV: (target: 1ebe35f6-26f1-bc92-318c-9e35fdcbe11d)
    TXT: (id=fd7d2b9d29c92b24db10be10661010eebb9f;cd=224708C2E61AED24676383796588FF7E;
    rm=8F2EE2757C6626CC;ve=05;md=Chromecast Audio;ic=/setup/icon.png;fn=Jukebox;

    Finally I’d like to share some trick or practices I used when coding the app:

    1) I started small by capturing a DNS-SD’s request emited from Avahi. Then I sent that very same packet from Snabb and checked the response was a Multicast DNS packet:

    $ avahi-browse -a
    $ sudo tcpdump -i wlp3s0 -w mdns.pcap

    Then open mdns.pcap with Wireshark, mark the request packet only and save it to disk. Then use od command to dump the packet as text:

    $ od -j 40 -A x -tx1 mdns_request.pcap
    000028 01 00 5e 00 00 fb 44 85 00 4f b8 fc 08 00 45 00
    000038 00 55 32 7c 00 00 01 11 8f 5a c0 a8 56 1e e0 00
    000048 00 fb e3 53 14 e9 00 41 89 9d 25 85 01 20 00 01
    000058 00 00 00 00 00 01 09 5f 73 65 72 76 69 63 65 73
    000068 07 5f 64 6e 73 2d 73 64 04 5f 75 64 70 05 6c 6f
    000078 63 61 6c 00 00 0c 00 01 00 00 29 10 00 00 00 00
    000088 00 00 00

    This dumped packet can be copied raw into Snabb such in MDNS’s selftest.

    NOTE: text2pcap command can also be a very convenient tool to convert a dumped packet in text format to a pcap file.

    2) Instead of sending requests on the wire to obtain responses, I saved a bunch of responses to a .pcap file and used the file as an input for the DNS parser. In fact the command supports a –pcap flag that can be used to print out DNS records.

    $ sudo ./snabb dnssd --pcap /home/dpino/avahi-browse.pcap
    Reading from file: /home/dpino/avahi-browse.pcap
    PTR: (name: _services._dns-sd._udp.local; domain-name: _spotify-connect._tcp)
    PTR: (name: ; domain-name: dcad2b6c7a2110c310-0)
    SRV: (target: dcad2b6c7a2110c310-0)
    TXT: (CPath=/zc/0;VERSION=1.0;Stack=SP;)

    3) When sending a packet to the wire, checkout the packet’s header checksum are correct. Wireshark has a mode to verify whether a packet’s header checksums are correct or not, which is very convenient. Snabb counts with protocol libraries to calculate a IP, TCP or UDP checksums. Check out how mDNSQuery does it.

    Last thoughts

    Implementing this tool has helped me to understand DNS better, specially the Multicast DNS/DNS-SD part. I never expected it could be so interesting.

    Going from an idea to a working prototype with Snabb is really fast. It’s one of the advantages of user-space networking and one of the things I enjoy the most. That said the resulting code has been bigger that I initially expected. I think that to avoid losing this work I will try to land the DNS and mDNS libraries into Snabb.

    This post puts an end to this series of practical Snabb posts. I hope you found them interesting as much as I enjoyed writing them. Luckily in the future these posts can be useful for anyone interested in user-space networking to try out Snabb.

    January 12, 2018 06:00 AM

    January 11, 2018

    Frédéric Wang

    Review of Igalia's Web Platform activities (H2 2017)

    Last september, I published a first blog post to let people know a bit more about Igalia’s activities around the Web platform, with a plan to repeat such a review each semester. The present blog post focuses on the activity of the second semester of 2017.


    As part of Igalia’s commitment to diversity and inclusion, we continue our effort to standardize and implement accessibility technologies. More specifically, Igalian Joanmarie Diggs continues to serve as chair of the W3C’s ARIA working group and as an editor of Accessible Rich Internet Applications (WAI-ARIA) 1.1, Core Accessibility API Mappings 1.1, Digital Publishing WAI-ARIA Module 1.0, Digital Publishing Accessibility API Mappings 1.0 all of which became W3C Recommandations in December! Work on versions 1.2 of ARIA and the Core AAM will begin in January. Stay tuned for the First Public Working Drafts.

    We also contributed patches to fix several issues in the ARIA implementations of WebKit and Gecko and implemented support for the new DPub ARIA roles. We expect to continue this collaboration with Apple and Mozilla next year as well as to resume more active maintenance of Orca, the screen reader used to access graphical desktop environments in GNU/Linux.

    Last but not least, progress continues on switching to Web Platform Tests for ARIA and “Accessibility API Mappings” tests. This task is challenging because, unlike other aspects of the Web Platform, testing accessibility mappings cannot be done by solely examining what is rendered by the user agent. Instead, an additional tool, an “Accessible Technology Test Adapter” (ATTA) must be also be run. ATTAs work in a similar fashion to assistive technologies such as screen readers, using the implemented platform accessibility API to query information about elements and reporting what it obtains back to WPT which in turn determines if a test passed or failed. As a result, the tests are currently officially manual while the platform ATTAs continue to be developed and refined. We hope to make sufficient progress during 2018 that ATTA integration into WPT can begin.


    This semester, we were glad to receive Bloomberg’s support again to pursue our activities around CSS. After a long commitment to CSS and a lot of feedback to Editors, several of our members finally joined the Working Group! Incidentally and as mentioned in a previous blog post, during the CSS Working Group face-to-face meeting in Paris we got the opportunity to answer Microsoft’s questions regarding The Story of CSS Grid, from Its Creators (see also the video). You might want to take a look at our own videos for CSS Grid Layout, regarding alignment and placement and easy design.

    On the development side, we maintained and fixed bugs in Grid Layout implementation for Blink and WebKit. We also implemented alignment of positioned items in Blink and WebKit. We have several improvements and bug fixes for editing/selection from Bloomberg’s downstream branch that we’ve already upstreamed or plan to upstream. Finally, it’s worth mentioning that the work done on display: contents by our former coding experience student Emilio Cobos was taken over and completed by antiik (for WebKit) and rune (for Blink) and is now enabled by default! We plan to pursue these developments next year and have various ideas. One of them is improving the way grids are stored in memory to allow huge grids (e.g. spreadsheet).

    Web Platform Predictability

    One of the area where we would like to increase our activity is Web Platform Predictability. This is obviously essential for our users but is also instrumental for a company like Igalia making developments on all the open source Javascript and Web engines, to ensure that our work is implemented consistently across all platforms. This semester, we were able to put more effort on this thanks to financial support from Bloomberg and Google AMP.

    We have implemented more frame sandboxing attributes WebKit to improve user safety and make control of sandboxed documents more flexible. We improved the sandboxed navigation browser context flag and implemented the new allow-popup-to-escape-sandbox, allow-top-navigation-without-user-activation, and allow-modals values for the sandbox attribute.

    Currently, HTML frame scrolling is not implemented in WebKit/iOS. As a consequence, one has to use the non-standard -webkit-overflow-scrolling: touch property on overflow nodes to emulate scrollable elements. In parallel to the progresses toward using more standard HTML frame scrolling we have also worked on annoying issues related to overflow nodes, including flickering/jittering of “position: fixed” nodes or broken Find UI or a regression causing content to disappear.

    Another important task as part of our CSS effort was to address compatibility issues between the different browsers. For example we fixed editing bugs related to HTML List items: WebKit’s Bug 174593/Chromium’s Issue 744936 and WebKit’s Bug 173148/Chromium’s Issue 731621. Inconsistencies in web engines regarding selection with floats have also been detected and we submitted the first patches for WebKit and Blink. Finally, we are currently improving line-breaking behavior in Blink and WebKit, which implies the implementation of new CSS values and properties defined in the last draft of the CSS Text 3 specification.

    We expect to continue this effort on Web Platform Predictability next year and we are discussing more ideas e.g. WebPackage or flexbox compatibility issues. For sure, Web Platform Tests are an important aspect to ensure cross-platform inter-operability and we would like to help improving synchronization with the conformance tests of browser repositories. This includes the accessibility tests mentioned above.


    Last November, we launched a fundraising Campaign to implement MathML in Chromium and presented it during Frankfurt Book Fair and TPAC. We have gotten very positive feedback so far with encouragement from people excited about this project. We strongly believe the native MathML implementation in the browsers will bring about a huge impact to STEM education across the globe and all the incumbent industries will benefit from the technology. As pointed out by Rick Byers, the web platform is a commons and we believe that a more collective commitment and contribution are essential for making this world a better place!

    While waiting for progress on Chromium’s side, we have provided minimal maintenance for MathML in WebKit:

    • We fixed all the debug ASSERTs reported on Bugzilla.
    • We did follow-up code clean up and refactoring.
    • We imported Web Platform tests in WebKit.
    • We performed review of MathML patches.

    Regarding the last point, we would like to thank Minsheng Liu, a new volunteer who has started to contribute patches to WebKit to fix issues with MathML operators. He is willing to continue to work on MathML development in 2018 as well so stay tuned for more improvements!


    During the second semester of 2017, we worked on the design, standardization and implementation of several JavaScript features thanks to sponsorship from Bloomberg and Mozilla.

    One of the new features we focused on recently is BigInt. We are working on an implementation of BigInt in SpiderMonkey, which is currently feature-complete but requires more optimization and cleanup. We wrote corresponding test262 conformance tests, which are mostly complete and upstreamed. Next semester, we intend to finish that work while our coding experience student Caio Lima continues work on a BigInt implementation on JSC, which has already started to land. Google also decided to implement that feature in V8 based on the specification we wrote. The BigInt specification that we wrote reached Stage 3 of TC39 standardization. We plan to keep working on these two BigInt implementations, making specification tweaks as needed, with an aim towards reaching Stage 4 at TC39 for the BigInt proposal in 2018.

    Igalia is also proposing class fields and private methods for JavaScript. Similarly to BigInt, we were able to move them to Stage 3 and we are working to move them to stage 4. Our plan is to write test262 tests for private methods and work on an implementation in a JavaScript engine early next year.

    Igalia implemented and shipped async iterators and generators in Chrome 63, providing a convenient syntax for exposing and using asynchronous data streams, e.g., HTML streams. Additionally, we shipped a major performance optimization for Promises and async functions in V8.

    We implemented and shipped two internationalization features in Chrome, Intl.PluralRules and Intl.NumberFormat.prototype.formatToParts. To push the specifications of internationalization features forwards, we have been editing various other internationalization-related specifications such as Intl.RelativeTimeFormat, Intl.Locale and Intl.ListFormat; we also convened and led the first of what will be a monthly meeting of internationalization experts to propose and refine further API details.

    Finally, Igalia has also been formalizing WebAssembly’s JavaScript API specification, which reached the W3C first public working draft stage, and plans to go on to improve testing of that specification as the next step once further editorial issues are fixed.


    Thanks to sponsorship from Mozilla we have continued our involvement in the Quantum Render project with the goal of using Servo’s WebRender in Firefox.

    Support from Metrological has also given us the opportunity to implement more web standards from some Linux ports of WebKit (GTK and WPE, including:

    • WebRTC
    • WebM
    • WebVR
    • Web Crypto
    • Web Driver
    • WebP animations support
    • HTML interactive form validation
    • MSE


    Thanks for reading and we look forward to more work on the web platform in 2018. Onwards and upwards!

    January 11, 2018 11:00 PM

    Gyuyoung Kim

    Share my experience to build Chromium with ICECC

    If you’re a Chromium developer, I guess that you’ve suffered from the long build time of Chromium like me. Recently, I’ve set up the icecc build environment for Chromium in the Igalia Korea office. Although there have been some instructions how to build Chromium with ICECC, I think that someone might feel they are a bit difficult. So, I’d like to share my experience how to set up the environment to build Chromium on the icecc with Clang on Linux in order to speed up the build of Chromium.

    P.S. Recently Google announced that they will open Goma (Google internal distributed build system) for everyone. Let’s see goma can make us not to use icecc anymore 😉!msg/chromium-dev/q7hSGr_JNzg/p44IkGhDDgAJ

    Prerequisites in your environment

    1. First, we should install the icecc on your all machines.
      sudo apt-get install icecc [icecc-monitor]
    2. To build Chromium using icecc on the Clang, we have to use some configurations when generating the gn files. To do it as easily as possible, I made a script. So you just download it and register it to $PATH. 
      1. Clone the script project
        $ git clone

        FYI, you can see the detailed configuration here –

      2. Add your chromium/src path to
        # Please set your path to ICECC_VERSION and CHROMIUM_SRC.
        export CHROMIUM_SRC=$HOME/chromium/src
        export ICECC_VERSION=$HOME/chromium/clang.tar.gz
      3. Register it to PATH environment in .bashrc
        export PATH=/path/to/ChromiumBuild:$PATH
      4. Create a clang toolchain from the patched Chromium version
        I added the process to “sync” argument of So please just execute the below command before starting compiling. Whenever you run it, clang.tar.gz will be updated every time with the latest Chromium version.
    $ sync


    1. Run an icecc scheduler on a master machine,
      sudo service icecc-scheduler start
    2. Then, run an icecc daemon on each slave machine
      sudo service iceccd start

      If you run icemon, you can monitor the build status on the icecc.

    3. Start building in chromium/src
      $ Debug|Release

      When you start building Chromium, you’ll see that the icecc works on the monitor!

    Build Time

    In my case, I’ve  been using 1 laptop and 2 desktops for Chromium build. The HW information is as below,

    • Laptop (Dell XPS 15″ 9560)
      1. CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
      2. RAM: 16G
      3. SSD
    • Desktop 1
      1. CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
      2. RAM: 16G
      3. SSD
    • Desktop 2
      1. CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
      2. RAM: 16G
      3. SSD

    I’ve measured how long time I’ve spent to build Chromium with the script in below cases,

    • Laptop
      • Build time on Release with the jumbo build: About 84 min
      • Build time on Release without the jumbo build: About 203 min
    • Laptop + Desktop 1 + Desktop 2
      • Build time on Release with the jumbo build: About 35 min
      • Build time on Release without the jumbo build: About 73 min

    But these builds haven’t applied any object caching by ccache yet. If ccache works next time, the build time will be reduced. Besides, the build time is depended on the count of build nodes and performance. So this time can differ from your environment.

    Troubleshooting (Ongoing)

    1. Undeclared identifier
      1. Error message
         error: use of undeclared identifier '__builtin_ia32_vpdpbusd512_mask'
         return (__m512i) __builtin_ia32_vpdpbusd512_mask ((__v16si) __S, ^
      2. Solution
        Please check if the path of ICECC_VERSION was set correctly.
    2. Loading error
      1. Error message
        usr/bin/clang: error while loading shared libraries: 
        failed to map segment from shared object
      2. Solution
        Not find a correct fix yet. Just restart the build for now.
    3. Out of Memory
      1. Error message
        LLVM ERROR: out of memory
      2. Solution
        After checking what node generated the OOM error, please add more RAM to the machine.


    1. WebKitGTK SpeedUpBuild:
    2. compiling-chromium-with-clang-and-icecc :
    3. Rune Lillesveen’s icecc-chromium project:

    by gyuyoung at January 11, 2018 12:37 AM

    January 10, 2018

    Manuel Rego

    "display: contents" is coming

    Yes, display: contents is enabled by default in Blink and WebKit and it will be probably shipped in Chrome 65 and Safari 11.1. These browsers will join Firefox that is shipping it since version 37, which makes Edge the only one missing the feature (you can vote for it!).

    Regarding this I’d like to highlight that the work to support it in Chromium was started by Emilio Cobos during his Igalia Coding Experience that took place from fall 2016 to summer 2017.

    You might (or not) remember a blog post from early 2016 where I was talking about the Igalia Coding Experience program and some ideas of tasks to be done as part of the Web Platform team. One of the them was display: contents which is finally happening.

    What is display: contents?

    This new value for the display property allows you to somehow remove an element from the box tree but still keep its contents. The proper definition from the spec:

    The element itself does not generate any boxes, but its children and pseudo-elements still generate boxes and text runs as normal. For the purposes of box generation and layout, the element must be treated as if it had been replaced in the element tree by its contents (including both its source-document children and its pseudo-elements, such as ::before and ::after pseudo-elements, which are generated before/after the element’s children as normal).

    A simple example will help to understand it properly:

    <div style="display: contents;
                background: magenta; border: solid thick black; padding: 20px;
                color: cyan; font: 30px/1 Monospace;">
      <span style="background: black;">foobar</span>

    display: contents makes that the div doesn’t generate any box, so its background, border and padding are not renderer. However the inherited properties like color and font have effect on the child (span element) as expected.

    For this example, the final result would be something like:

    <span style="background: black; color: cyan; font: 30px/1 Monospace;">foobar</span>







    Unsupported vs actual (in your browser) vs supported output for the previous example

    If you want more details Rachel Andrew has a nice blog post about this topic.

    CSS Grid Layout & display: contents

    As you could expect from a post from myself this is somehow related to CSS Grid Layout. 😎 display: contents can be used as a replacement of subgrids (which are not supported by any browser at this point) in some use cases. However subgrids are still needed for other scenarios.

    The canonical example for Grid Layout auto-placement is a simple form like:

      form   { display:     grid;   }
      label  { grid-column: 1;      }
      input  { grid-column: 2;      }
      button { grid-column: span 2; }
      <label>Name</label><input />
      <label>Mail</label><input />

    A simple form formatted with CSS Grid Layout A simple form formatted with CSS Grid Layout

    However this is not the typical HTML of a form, as you usually want to use a list inside, so people using screen readers will know how many fields they have to fill in your form beforehand. So the HTML looks more like this:

        <li><label>Name</label><input /></li>
        <li><label>Mail</label><input /></li>

    With display: contents you’ll be able to have the same layout than in the first case with a similar CSS:

    ul     { display: grid;       }
    li     { display: contents;   }
    label  { grid-column: 1;      }
    input  { grid-column: 2;      }
    button { grid-column: span 2; }

    So this is really nice, now when you convert your website to start using CSS Grid Layout, you would need less changes on your HTML and you won’t need to remove some HTML that is really useful, like the list in the previous example.

    Chromium implementation

    As I said in the introduction, Firefox shipped display: contents three years ago, however Chromium didn’t have any implementation for it. Igalia as CSS Grid Layout implementor was interested in having support for the feature as it’s a handy solution for several Grid Layout use cases.

    The proposal for the Igalia Coding Experience was the implementation of display: contents on Blink as the main task. Emilio did an awesome job and managed to implement most of it, reporting issues to CSS Working Group and other browsers as needed, and writing tests for the web-platform-tests repository to ensure interoperability between the implementations.

    Once the Coding Experience was over there were still a few missing things to be able to enable display: contents by default. Rune Lillesveen (Google and previously Opera) who was helping during the whole process with the reviews, finished the work and shipped it past week.

    WebKit implementation

    WebKit already had an initial support for display: contents that was only used internally by Shadow DOM implementation and not exposed to the end users, neither supported by the rest of the code.

    We reactivated the work there too, he didn’t have time to finish the whole thing but later Antti Koivisto (Apple) completed the work and enabled it by default on trunk by November 2017.


    Igalia is one of the top external contributors on the open web platform projects. This put us on a position that allows us to implement new features in the different open source projects, thanks to our community involvement and internal knowledge after several years of experience on the field. Regarding display: contents implementation, this feature probably wouldn’t be available today in Chromium and WebKit without Igalia’s support, in this particular case through a Coding Experience.

    We’re really happy about the results of the Coding Experience and we’re looking forward to repeat the success story in the future.

    Of course, all the merit goes to Emilio, who is an impressive engineer and did a great job during the Coding Experience. As part of this process he got committer privileges in both Chromium and WebKit projects. Kudos!

    Last, but not least, thanks to Antti and Rune for finishing the work and making display: contents available to WebKit and Chromium users.

    January 10, 2018 11:00 PM

    December 27, 2017

    Manuel Rego

    Web Engines Hackfest 2017

    One year more Igalia organized, hosted and sponsored a new edition of the Web Engines Hackfest. This is my usual post about the event focusing on the things I was working on during those days.


    This year I wanted to talk about this because being part of the organization in an event with 60 people is not that simple and take some time. I’m one of the members of the organization which ended up meaning that I was really busy during the 3 days of the event trying to make the life of all the attendees as easy as possible.

    Yeah, this year we were 60 people, the biggest number ever! Note that last year we were 40 people, so it’s clear the interest in the event is growing after each edition, which is really nice.

    For the first time we had conference badges, with so many faces it was going to be really hard to put names to all of them. In addition we had pronouns stickers and different lanyard color for the people that don’t want to appear in the multimedia material published during and after the event. I believe all these things worked very well and we’ll be repeating for sure in future editions.

    My Web Engines Hackfest 2017 conference badge My Web Engines Hackfest 2017 conference badge

    The survey after the event showed an awesome positive feedback, so we’re really glad that you have enjoined the hackfest. I was specially happy seeing how many parallel discussions were taking place all around the office in small groups of people.


    The main focus of the event is the different discussions that happen between people about many different topics. This together with the possibility to hack with some people from other browser engines and/or other companies make the hackfest an special event.

    On top of that, we arrange a few talks that are usually quite interesting. The talks from this year can be found on a YouTube playlist (thanks Juan for helping with the video editing). This year the talks covered the following topics: Web Platform Tests, zlib optimizations, Chromium on Wayland, BigInt, WebRTC and WebVR.

    Some pictures of the Web Engines Hackfest 2017 Some pictures of the Web Engines Hackfest 2017

    CSS Grid Layout

    During the hackfest I participated in several breakout sessions, like for example one about Web Platform Tests or another one about MathML. However as usual on the last years my main was related to CSS Grid Layout. In this case we took advantage to discuss several topics from which I’m going to highlight two:

    Chromium bug on input elements which only happens on Mac

    This is about Chromium bug #727076, where the input elements that are grid items get collapsed/shrunk when the user starts to enter some text. This was a tricky issue only reproducible on Mac platform that was hitting us for a while, so we needed to find a solution.

    We had some long discussions about this and finally my colleague Javi found the root cause of the problem and finally fixed it, kudos!

    This bug is a quite complex and tight to some Chromium specific implementation bits. The summary is that in Mac platform you can get requests about the intrinsic (preferred) width of the grid container at random times, which means that you’re not sure a full layout will be performed afterwards. Our code was not ready for that, as we were always expecting a full layout after asking for the intrinsic width.

    Percentage tracks and gutters

    This is a neverending topic that has been discussed in the CSS WG for a long time. There are 2 different things so let’s go one by one.

    First, how percentage tracks are resolved when the width/height of a grid container is indefinite. So far, this was not symmetric but was working like in regular blocks:

    • Percentage columns work like percentage widths: This means that they are treated as auto during intrinsic size computation and later resolved during layout.
    • Percentage rows work like percentage heights: In this case they are treated as auto and never resolved.

    However the CSS WG decided to change this and make both symmetric, so percentage rows will work like percentage columns. This hasn’t been implemented by any browser yet, and all of them have interoperability with the previous status. We’re not 100% sure about the complexity this could bring and before changing current behavior we’re going to gather some usage statistics to verify this won’t break a lot o content out there. We’d also love to get feedback from other implementors about this. More information about this topic can be found on CSS WG issue #1921.

    Now let’s talk about the second issue, how percentage gaps work. The whole discussion can be checked on CSS WG issue #509 that I started back in TPAC 2016. For this case there are no interoperability between browsers as Firefox has its own solution of back-computing percentage gaps, the rest of browsers have the same behavior in line with the percentage tracks resolution, but again it is not symmetric:

    • Percentage column gaps contribute zero to intrinsic width computation and are resolved as percent during layout.
    • Percentage row gaps are treated always as zero.

    The CSS WG resolved to modify this behavior to make them both symmetric, in this case choosing the row gaps behavior as reference. So browsers will need to change how column gaps work to avoid resolving the percentages during layout. We don’t know if we could detect this situation in Blink/WebKit without quite complex changes on the engine, and we’re waiting for feedback from other implementors on that regard.

    So I won’t say any of those topics are definitely closed yet, and it won’t be unexpected if some other changes happen in the future when the implementations try to catch up with the spec changes.


    To close this blog post let’s say thanks to everyone that come to our office and participated in the event, the Web Engines Hackfest won’t be possible without such a great bunch of people that decided to spend a few days working together on improving the status of the Web.

    Web Engines Hackfest 2017 sponsors: Collabora, Google, Igalia and Mozilla Web Engines Hackfest 2017 sponsors: Collabora, Google, Igalia and Mozilla

    Of course we cannot forget about the sponsors either: Collabora, Google, Igalia and Mozilla. Thank you all very much!

    And last, but not least, thanks to Igalia for organizing and hosting one year more this event. Looking forward to the new year and the 2018 edition!

    December 27, 2017 11:00 PM

    December 17, 2017

    Michael Catanzaro

    Epiphany Stable Flatpak Releases

    The latest stable version of Epiphany is now available on Flathub. Download it here. You should be able to double click the flatpakref to install it in GNOME Software, if you use any modern GNOME operating system not named Ubuntu. But, in my experience, GNOME Software is extremely buggy, and it often as not does not work for me. If you have trouble, you can use the command line:

    flatpak install --from

    This has actually been available for quite a while now, but I’ve delayed announcing it because some things needed to be fixed to work well under Flatpak. It’s good now.

    I’ve also added a download link to Epiphany’s webpage, so that you can actually, you know, download and install the software. That’s a useful thing to be able to do!


    The obvious benefit of Flatpak is that you get the latest stable version of Epiphany (currently 3.26.5) and WebKitGTK+ (currently 2.18.3), no matter which version is shipped in your operating system.

    The other major benefit of Flatpak is that the browser is protected by Flatpak’s top-class bubblewrap sandbox. This is, of course, a UI process sandbox, which is different from the sandboxing model used in other browsers, where individual browser tabs are sandboxed from each other. In theory, the bubblewrap sandbox should be harder to escape than the sandboxes used in other major browsers, because the attack surface is much smaller: other browsers are vulnerable to attack whenever IPC messages are sent between the web process and the UI process. Such vulnerabilities are mitigated by a UI process sandbox. The disadvantage of this approach is that tabs are not sandboxed from each other, as they would be with a web process sandbox, so it’s easier for a compromised tab to do bad things to your other tabs. I’m not sure which approach is better, but clearly either way is much better than having no sandbox at all. (I still hope to have a web process sandbox working for use when WebKit is used outside of Flatpak, but that’s not close to being ready yet.)


    Now, there are a couple of loose ends. We do not yet have desktop notifications working under Flatpak, and we also don’t block the screen from turning off when you’re watching fullscreen video, so you’ll have to wiggle your mouse every five minutes or so when you’re watching YouTube to keep the lights on. These should not be too hard to fix; I’ll try to get them both working soon. Also, drag and drop does not work. I’m not nearly brave enough to try fixing that, though, so you’ll just have to live without drag and drop if you use the Flatpak version.

    Also, unfortunately the stable GNOME runtimes do not receive regular updates. So while you get the latest version of Epiphany, most everything else will be older. This is not good. I try to make sure that WebKit gets updated, so you’ll have all the latest security updates there, but everything else is generally stuck at older versions. For example, the 3.26 runtime uses, for the most part, whatever software versions were current at the time of the 3.26.1 release, and any updates newer than that are just not included. That’s a shame, but the GNOME release team does not maintain GNOME’s Flatpak runtimes: we have three other other redundant places to store the same build information (JHBuild, GNOME Continuous, BuildStream) that we need to take care of, and adding yet another is not going to fly. Hopefully this situation will change soon, though, since we should be able to use BuildStream to replace the current JSON manifest that’s used to generate the Flatpak runtimes and keep everything up to date automatically. In the meantime, this is a problem to be aware of.

    by Michael Catanzaro at December 17, 2017 07:21 PM

    December 13, 2017

    Gyuyoung Kim

    What is navigator.registerProtocolHandler?

    Have you heard that navigator.registerProtocolHandler javascript API?

    The API is to give you the power to add your customized scheme(a.k.a. protocol) to your website or web application. If you register your own custom scheme, it can help you to avoid collisions with other DNS, or help other people use it correctly, and be able to claim the moral high ground if they use it incorrectly. In this post, I would like to introduce the feature with some examples, and what I’ve contributed to the feature.


    The registerProtocolHandler had been started discussing since 2006. Finally, it was introduced in HTML5 for the first time.

    This is a simple example of the use of navigator.registerProtocolHandler. Basically, the registerProtocolHandler takes three arguments,

    • scheme – The scheme that you want to handle. For example, mailto, tel, bitcoin, or irc.
    • url – A URL within your web application that can handle the specified scheme.
    • title – The title of the handler.
                                           "web search");
    <a href="web+search:igalia">Igalia</a>

    As this example, you can register a custom scheme with a URL that can be combined with the given URL(web+search:igalia)

    However, you need to keep in mind some limitations when you use it.

    1. URL should include %s placeholder.
    2. You can register own custom handler only when the URL of the custom handler is same the website origin. If not, the browser will generate a security error.
    3. There are pre-defined schemes in the specification. Except for these schemes, you’re only able to use “web+foo” style scheme. But, the length of “web+foo” is at least five characters including ‘web+’ prefix. If not, the browser will generate a security error.
      • bitcoin, geo, im, irc, ircs, magnet, mailto, mms, news, nntp, openpgp4fpr, sip, sms, smsto, ssh, tel, urn, webcal, wtai, xmpp.
    4. Schemes should not contain any colons. For example, mailto: will generate an error. So, you have to just use mailto.

    Status on major browsers

    Firefox (3+)

    When Firefox meets a webpage which includes navigator.registerProtocolHandler as below,

                                      "web search");

    It will show an alert bar just below the URL bar. Then, it will ask us if we want to add the URL to the Firefox handler list.

    After allowing to add it to the supported handler list, you can see it in the Applications section of the settings page (about:preferences#general).

    Now, the web+github custom scheme can be used in the site as below,

    <a href="web+github:wiki">Wiki</a>
    <a href="web+github:Pull requests">PR</a>

    Chromium (13+)

    Let’s take a look Chrome on the registerProtocolHandler. In Chrome, Chrome shows a new button inside the URL bar instead of the alert bar in Firefox.

    If you press the button, a dialog will be shown in order to ask if you allow the website to register the custom scheme.

    In the “Handlers” section of the settings (chrome://settings/handlers?search=protocol), you’re able to see that the custom handler was registered.

    FYI, other browsers based on Chromium (i.e. Opera, Yandex, Whale, etc.) have handled it similar to Chrome unless each vendor has own specific behavior.

    Safari (WebKit)

    Though WebKit has supported this feature in WebCore and WebProcess of WebKit2, no browsers based on WebKit have supported this feature with UI. Although I had tried to implement it in UIProcess of WebKit2 through the webkit-dev mailing list so that browsers based on WebKit can support it, Unfortunately, some of Apple engineers had doubts about this feature though there were some agreements to support it in WebKit. So I failed to implement it in UIProcess of WebKit2.

    My contributions in WebKit and Chromium

    I’ve mainly contributed to apply new changes in the specification, to fix bugs and improve test cases. First, I’d like to introduce the bug that I modified in Chromium recently.

    Let’s assume that URL has multiple placeholders(%s).

    <a href="test:duplicated_placeholders">this</a>

    According to the specification, only first “%s” placeholder should be substituted, not substitute all placeholders. But, Chrome has substituted all placeholders with the given URL as below, even though Firefox has only substituted the first placeholder.

    So I fixed the problem in [registerProtocolHandler] Only substitute the first “%s” placeholder. The latest Chromium substitutes only first placeholder like below,

    This is a whole list what I’ve contributed both WebKit and Chromium so far.

    1.  WebKit
    2. Chromium


    So far, we’ve looked at the navigator.registerProtcolHandler for your web application simply. The API can be useful if you want to make users use your web applications like a web-based email client or calendar.

    by gyuyoung at December 13, 2017 08:59 AM

    December 07, 2017

    Víctor Jáquez

    Enabling HuC for SKL/KBL in Debian/testing

    Recently, our friend Florent complained that it was impossible to set a constant bitrate when encoding H.264 using low-power profile with gstreamer-vaapi .

    Low-power (LP) profiles are VA-API entry points, available in Intel SkyLake-based procesor and succesors, which provide video encoding with low power consumption.

    Later on, Ullysses and Sree, pointed out that CBR in LP is ony possible if HuC is enabled in the kernel.

    HuC is a firmware, loaded by i915 kernel module, designed to offload some of the media functions from the CPU to GPU. One of these functions is bitrate control when encoding. HuC saves unnecessary CPU-GPU synchronization.

    In order to load HuC, it is required first to load GuC, another Intel’s firmware designed to perform graphics workload scheduling on the various graphics parallel engines.

    How we can install and configure these firmwares to enable CBR in low-power profile, among other things, in Debian/testing?

    Check i915 parameters

    First we shall confirm that our kernel and our i915 kernel module is capable to handle this functionality:

    $ sudo modinfo i915 | egrep -i "guc|huc|dmc"
    firmware:       i915/bxt_dmc_ver1_07.bin
    firmware:       i915/skl_dmc_ver1_26.bin
    firmware:       i915/kbl_dmc_ver1_01.bin
    firmware:       i915/kbl_guc_ver9_14.bin
    firmware:       i915/bxt_guc_ver8_7.bin
    firmware:       i915/skl_guc_ver6_1.bin
    firmware:       i915/kbl_huc_ver02_00_1810.bin
    firmware:       i915/bxt_huc_ver01_07_1398.bin
    firmware:       i915/skl_huc_ver01_07_1398.bin
    parm:           enable_guc_loading:Enable GuC firmware loading (-1=auto, 0=never [default], 1=if available, 2=required) (int)
    parm:           enable_guc_submission:Enable GuC submission (-1=auto, 0=never [default], 1=if available, 2=required) (int)
    parm:           guc_log_level:GuC firmware logging level (-1:disabled (default), 0-3:enabled) (int)
    parm:           guc_firmware_path:GuC firmware path to use instead of the default one (charp)
    parm:           huc_firmware_path:HuC firmware path to use instead of the default one (charp)

    Install firmware

    $ sudo apt install firmware-misc-nonfree

    UPDATE: In order to install this Debian package, you should have enabled the non-free apt repository in your sources list.

    Verify the firmware are installed:

    $ ls -1 /lib/firmware/i915/

    Update modprobe configuration

    Edit or create the configuration file /etc/modprobe.d/i915.con

    $ sudo vim /etc/modprobe.d/i915.conf
    $ cat /etc/modprobe.d/i915.conf
    options i915 enable_guc_loading=1 enable_guc_submission=1


    $ sudo systemctl reboot 


    Now it is possible to verify that the i915 module kernel loaded the firmware correctly by looking at the kenrel logs:

    $ journalctl -b -o short-monotonic -k | egrep -i "i915|dmr|dmc|guc|huc"
    [   10.303849] miau kernel: Setting dangerous option enable_guc_loading - tainting kernel
    [   10.303852] miau kernel: Setting dangerous option enable_guc_submission - tainting kernel
    [   10.336318] miau kernel: i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
    [   10.338664] miau kernel: i915 0000:00:02.0: firmware: direct-loading firmware i915/kbl_dmc_ver1_01.bin
    [   10.339635] miau kernel: [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_01.bin (v1.1)
    [   10.361811] miau kernel: i915 0000:00:02.0: firmware: direct-loading firmware i915/kbl_huc_ver02_00_1810.bin
    [   10.362422] miau kernel: i915 0000:00:02.0: firmware: direct-loading firmware i915/kbl_guc_ver9_14.bin
    [   10.393117] miau kernel: [drm] GuC submission enabled (firmware i915/kbl_guc_ver9_14.bin [version 9.14])
    [   10.410008] miau kernel: [drm] Initialized i915 1.6.0 20170619 for 0000:00:02.0 on minor 0
    [   10.559614] miau kernel: snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
    [   11.937413] miau kernel: i915 0000:00:02.0: fb0: inteldrmfb frame buffer device

    That means that HuC and GuC firmwares were loaded successfully.

    Now we can check the status of the modules using sysfs

    $ sudo cat /sys/kernel/debug/dri/0/i915_guc_load_status
    GuC firmware status:
            path: i915/kbl_guc_ver9_14.bin
            fetch: SUCCESS
            load: SUCCESS
            version wanted: 9.14
            version found: 9.14
            header: offset is 0; size = 128
            uCode: offset is 128; size = 142272
            RSA: offset is 142400; size = 256
    GuC status 0x800330ed:
            Bootrom status = 0x76
            uKernel status = 0x30
            MIA Core status = 0x3
    Scratch registers:
             0:     0xf0000000
             1:     0x0
             2:     0x0
             3:     0x5f5e100
             4:     0x600
             5:     0xd5fd3
             6:     0x0
             7:     0x8
             8:     0x3
             9:     0x74240
            10:     0x0
            11:     0x0
            12:     0x0
            13:     0x0
            14:     0x0
            15:     0x0
    $ sudo cat /sys/kernel/debug/dri/0/i915_huc_load_status
    HuC firmware status:
            path: i915/kbl_huc_ver02_00_1810.bin
            fetch: SUCCESS
            load: SUCCESS
            version wanted: 2.0
            version found: 2.0
            header: offset is 0; size = 128
            uCode: offset is 128; size = 218304
            RSA: offset is 218432; size = 256
    HuC status 0x00006080:

    Test GStremer

    $ gst-launch-1.0 videotestsrc num-buffers=1000 ! video/x-raw, format=NV12, width=1920, height=1080, framerate=\(fraction\)30/1 ! vaapih264enc bitrate=8000 keyframe-period=30 tune=low-power rate-control=cbr ! mp4mux ! filesink location=test.mp4
    Setting pipeline to PAUSED ...
    Pipeline is PREROLLING ...
    Got context from element 'vaapiencodeh264-0': gst.vaapi.Display=context, gst.vaapi.Display=(GstVaapiDisplay)"\(GstVaapiDisplayGLX\)\ vaapidisplayglx0";
    Pipeline is PREROLLED ...
    Setting pipeline to PLAYING ...
    New clock: GstSystemClock
    Got EOS from element "pipeline0".
    Execution ended after 0:00:11.620036001
    Setting pipeline to PAUSED ...
    Setting pipeline to READY ...
    Setting pipeline to NULL ...
    Freeing pipeline ...
    $ gst-discoverer-1.0 test.mp4 
    Analyzing file:///home/vjaquez/gst/master/intel-vaapi-driver/test.mp4
    Done discovering file:///home/vjaquez/test.mp4
      container: Quicktime
        video: H.264 (High Profile)
      Duration: 0:00:33.333333333
      Seekable: yes
      Live: no
          video codec: H.264 / AVC
          bitrate: 8084005
          encoder: VA-API H264 encoder
          datetime: 2017-12-07T14:29:23Z
          container format: ISO MP4/M4A

    Misison accomplished!


    by vjaquez at December 07, 2017 02:35 PM

    November 28, 2017

    Diego Pino

    Practical Snabb

    In a previous article I introduced Snabb, a toolkit for developing network functions. In this article I want to dive into some practical examples on how to use Snabb for network function programming.

    The elements of a network function

    A network function is any program that does something with traffic data. There’s a certain set of operations that can be done onto any packet. Operations such as reading, modifying (headers or payload), creating (new packets), dropping or forwarding. Any network function is a combination of these primitives. For instance, a NAT function consists of packet header modification and forwarding.

    Some of the built-in network functions featured in Snabb are:

    • lwAFTR (NAT, encap/decap): Implementation of the lwAFTR network function as specified in RFC7596. lwAFTR is a NAT between IPv6 and IPv4 address+port.
    • IPSEC (processing): encryption of packet payloads using AES instructions.
    • Snabbwall (filtering): a L7 firewall that relies on libnDPI for Deep-Packet Inspection. It also allows L3/L4 filtering using tcpdump alike expressions.

    Real-world scenarios

    The downside of by-passing the kernel and taking full control of a NIC is that the NIC cannot be used by any other program. That means the network function run by Snabb acts as a black-box. Some traffic comes in, gets transformed and it’s pushed out through the same NIC (or any other NIC controlled by the network function). The advantage is clear, outstanding performance.

    For this reason Snabb is mostly used to develop network functions that run within the ISP’s network, where traffic load is expected to be high. An ISP can spare one or several NICs to run a network function alone since the results pay off (lower hardware costs, custom network function development, good performance, etc).

    Snabb might seem like a less attractive tool in other scenarios. However, that doesn’t mean it cannot be used to program network functions that run in a personal computer or in a less demanding network. Snabb has interfaces to Tap, Raw socket and Unix socket programming, which allows to use Snabb as a program managed by the kernel. In fact, using some of these interfaces is the best way to start with Snabb if you don’t count with native hardware support.

    Building Snabb

    In this tutorial I’ll cover two examples to help me illustrate how to use Snabb. But before proceeding with the examples, we need to download and build Snabb.

    $ git clone
    $ cd snabb
    $ make

    Now we can run the snabb executable, which will print out a list of all the subprograms available:

    $ cd src/
    $ sudo ./snabb
    Usage: ./snabb <program> ...
    This snabb executable has the following programs built in:
    For detailed usage of any program run:
      snabb <program> --help
    If you rename (or copy or symlink) this executable with one of
    the names above then that program will be chosen automatically.

    Hello world!

    One of the simplest network functions to build is something that reads packets from a source, filters some of them and forwards the rest to an output. In this case I want to capture traffic from my browser (packets to HTTP or HTTPS). Here is how our hello world! program looks like:

    #!./snabb snsh
    local pcap = require("apps.pcap.pcap")
    local PcapFilter = require("apps.packet_filter.pcap_filter").PcapFilter
    local RawSocket = require("apps.socket.raw").RawSocket
    local args = main.parameters
    local iface = assert(args[1], "No listening interface")
    local fileout = args[2] or "output.pcap"
    local c =, "nic", RawSocket, iface), "filter", PcapFilter, {filter = "tcp dst port 80 or dst port 443"}), "writer", pcap.PcapWriter, fileout), "nic.tx -> filter.input"), "filter.output -> writer.input")

    Now save the script and run it:

    $ chmod +x http-filter.snabb 
    $ sudo ./http-filter.snabb wlp3s0

    While the script is running I open a few websites in my browser. Hopefully some packets will be captured onto output.pcap:

    $ sudo tcpdump -tr output.pcap
    IP sagan.50062 > Flags [P.], seq 0:926, ack 1, win 229, length 926: HTTP: GET / HTTP/1.1
    IP sagan.50062 > Flags [.], ack 189, win 237, length 0
    IP sagan.50062 > Flags [.], ack 368, win 245, length 0
    IP sagan.37346 > Flags [S], seq 370675941, win 29200, options [mss 1460,sackOK,TS val 1370741706 ecr 0,nop,wscale 7], length 0
    IP sagan.37346 > Flags [.], ack 2640726891, win 229, options [nop,nop,TS val 1370741710 ecr 2287287426], length 0
    IP sagan.37346 > Flags [P.], seq 0:439, ack 1, win 229, options [nop,nop,TS val 1370741729 ecr 2287287426], length 439: HTTP: POST / HTTP/1.1
    IP sagan.37346 > Flags [.], ack 789, win 251, options [nop,nop,TS val 1370741733 ecr 2287287449], length 0

    Some highlights in this script:

    • The shebang line (#./snabb snsh) refers to the Snabb’s shell (snsh), one of the many subprograms available in Snabb. It allows us to run Snabb scripts, that is Lua programs that have access to the Snabb environment (engine, apps, libraries, etc).
    • There’s a series of libraries that where not loaded: config, engine, main, etc. These libraries are part of the Snabb environment and are automatically loaded in every program.
    • The network function instantiates 3 apps: RawSocket, PcapFilter and PcapWriter, initializes them and pipes them together through links forming a graph. This graph is passed to the engine that executes it for 30 seconds.

    Martian packets

    Let’s continue with another example: a network function that manages a more complex set of rules to filter out traffic. Since there are more rules I will encapsulate the filtering logic into a custom app.

    The data we’re going to filter are martian packets. According to Wikipedia, a martian packet is “an IP packet seen on the public internet that contains a source or destination address that is reserved for special-use by Internet Assigned Numbers Authority (IANA)”. For instance, packets with RFC1918 addresses or multicast addresses seen on the public internet are martian packets.

    Unlike the previous example, I decided not to code this network function as an script, but as a program instead. The network function lives at src/program/martian. I’ve pushed the final code to a branch in my Snabb repository:

    $ git remote add dpino
    $ git fetch dpino
    $ git checkout -b dpino/martian-packets

    To run the app:

    $ sudo ./snabb martian program/martian/test/sample.pcap
    link report:
       3 sent on filter.output -> writer.input (loss rate: 0%)
       5 sent on reader.output -> filter.input (loss rate: 0%)

    The functions lets pass 3 out of 5 packets from sample.pcap.

    $ sudo tcpdump -qns 0 -t -e -r program/martian/test/sample.pcap
    reading from file program/martian/test/sample.pcap, link-type EN10MB (Ethernet)
    00:00:01:00:00:00 > fe:ff:20:00:01:00, IPv4, length 62: > tcp 0
    fe:ff:20:00:01:00 > 00:00:01:00:00:00, IPv4, length 62: > tcp 0
    00:00:01:00:00:00 > fe:ff:20:00:01:00, IPv4, length 54: > tcp 0
    90:e2:ba:94:2a:bc > 02:cf:69:15:81:01, IPv4, length 242: > ICMP echo reply, id 1024, seq 0, length 208
    90:e2:ba:94:2a:bc > 02:cf:69:15:81:01, IPv4, length 242: > ICMP echo reply, id 53, seq 0, length 208

    The last two packets are martian packets. They cannot occur in a public network since their source or destination addresses are private addresses.

    Some highlights about this network function:

    • Instead of a filtering app, I’ve coded my own filtering app, called MartianFiltering. This new app is the responsible for determining whether a packet is a martian packet or not. This operation has to be done in the push method of the app.
    • I’ve coded some utility functions to parse CIDR addresses (such as and to check whether an IP address belongs to a network. Instead I could have used Snabb’s filtering library that allows to filter packets using tcpdump like expressions. For instance, “net mask”.
    • The network function doesn’t use a network interface to read packets from, instead it reads packets out of a .pcap file.
    • Every Snabb program has a run function, that is the program’s entry point. A Snabb program or library can also add a selftest function, which is used to unit test the module ($ sudo ./snabb snsh -t program.martian). On the other hand, Snabb apps must implement a new method and optionally a push or pull method (or both, but at least one of them).

    Here’s the app’s graph:, "reader", pcap.PcapReader, filein), "filter", MartianFilter), "writer", pcap.PcapWriter, fileout), "reader.output -> filter.input"), "filter.output -> writer.input")

    And here is how MartianPacket:pull method looks like:

    function MartianFilter:push ()
       local input, output = assert(self.input.input), assert(self.output.output)
       while not link.empty(input) do
          local pkt = link.receive(input)
          local ip_hdr = ipv4:new_from_mem( + IPV4_OFFSET, IPV4_SIZE)
          if self:is_martian(ip_hdr:src()) or self:is_martian(ip_hdr:dst()) then
             link.transmit(output, pkt)

    As a rule of thumb, in every Snabb program there’s always one app only that feeds packets into the graph, in this case the PcapReader app. Such applications have to override the method pull. Apps that would like to manipulate packets will have a chance to do it in their push method.


    Snabb is a very useful tool for coding network functions that need to run at very high speed. For this reason, it’s usually deployed as part of an ISP network infrastructure. However, the toolkit is versatile enough to allow us code any type of application that has to manipulate network traffic.

    In this tutorial I introduced how to start using Snabb to code network functions. In a first example I showed how to download and build Snabb plus a very simple application that filters HTTP or HTTPS traffic from a network interface. On a second example, I introduced how to code a Snabb program and an app, MartianFiltering. This app exemplifies how to filter out packets based on a set of rules and forward or drop packets based on those conditions. Other more sophisticated network functions, such as firewalling, packet-rate limiting or DDoS prevention attack, behave in a similar manner.

    That’s all for now. I left out another example that consisted of sending and receiving Multicast DNS packets. Likely I’ll cover it in a followup article.

    November 28, 2017 06:00 AM

    November 24, 2017

    Víctor Jáquez

    Intel MediaSDK on Debian (testing)

    Everybody knows it: install Intel MediaSDK in GNU/Linux is a PITA. With CentOS or Yocto is less cumbersome, if you trust blindly on scripts ran as root.

    I don’t like CentOS, I feel it like if I were living in the past. I like Debian (testing, of course) and I also wanted to understand a little more about MediaSDK. And this is what I did to have Intel MediaSDK working in Debian/testing.

    First, I did a pristine installation of Debian testing with a netinst image in my NUC 6i5SYK, with a normal desktop user setup (Gnome3).

    The madness comes later.

    Intel’s identifies two types of MediaSDK installation: Gold and Generic. Gold is for CentOS, and Generic for the rest of distributions. Obviously, Generic means you’re on your own. For the purpose of this exercise I used as reference Generic Linux* Intel® Media Server Studio Installation.

    Let’s begin by grabbing the Intel® Media Server Studio – Community Edition. You will need to register yourself and accept the user agreement, because this is proprietary software.

    At the end, you should have a tarball named MediaServerStudioEssentials2017R3.tar.gz

    Extract the files for Generic instalation

    $ cd ~
    $ tar xvf MediaServerStudioEssentials2017R3.tar.gz
    $ cd MediaServerStudioEssentials2017R3
    $ tar xvf SDK2017Production16.5.2.tar.gz
    $ cd SDK2017Production16.5.2/Generic
    $ mkdir tmp
    $ tar -xvC tmp -f intel-linux-media_generic_16.5.2-64009_64bit.tar.gz


    Bad news: in order to get MediaSDK working you need to patch the mainlined kernel.

    Worse news: the available patches are only for the version 4.4 the kernel.

    Still, systemd works on 4.4, as far as I know, so it would not be a big problem.

    Grab building dependencies
    $ sudo apt install build-essential devscripts libncurses5-dev
    $ sudo apt build-dep linux

    Grab kernel source

    I like to use the sources from the git repository, since it would be possible to do some rebasing and blaming in the future.

    $ cd ~
    $ git clone
    $ git pull -v --tags
    $ git checkout -b 4.4 v4.4

    Extract MediaSDK patches

    $ cd ~/MediaServerStudioEssentials2017R3/SDK2017Production16.5.2/Generic/tmp/opt/intel/mediasdk/opensource/patches/kmd/4.4
    $ tar xvf intel-kernel-patches.tar.bz2
    $ cd intel-kernel-patches
    $ PATCHDIR=$(pwd)

    Patch the kernel

    cd ~/linux
    $ git am $PATCHDIR/*.patch

    The patches should apply with some warnings but no fatal errors (don’t worry, be happy).

    Still, there’s a problem with this old kernel: our recent compiler doesn’t build it as it is. Another patch is required:

    $ wget
    $ git am 0002-UBUNTU-SAUCE-no-up-disable-pie-when-gcc-has-it-enabl.patch

    TODO: Shall I need to modify the EXTRAVERSION string in kernel’s Makefile?

    Build and install the kernel

    Notice that we are using our current kernel configuration. That is error prone. I guess that is why I had to select NVM manually.

    $ cp /boot/config-4.12.0-1-amd64 ./.config
    $ make olddefconfig
    $ make nconfig # -- select NVM
    $ scripts/config --disable DEBUG_INFO
    $ make deb-pkg
    $ sudo dpkg -i linux-image-4.4.0+_4.4.0+-2_amd64.deb linux-headers-4.4.0+_4.4.0+-2_amd64.deb linux-firmware-image-4.4.0+_4.4.0+-2_amd64.deb

    Configure GRUB2 to boot Linux 4.4. by default

    This part was absolutely tricky for me. It took me a long time to figure out how to specify the kernel ID in the grubenv.

    $ sudo vi /etc/default/grub

    And change the line GRUB_DEFAULT=saved. By default it is set to 0. And update GRUB.

    $ sudo update-grub

    Now look for the ID of the installed kernel image in /etc/grub/grub.cfg and use it:

    $ sudo grub-set-default "gnulinux-4.4.0+-advanced-2c246bc6-65bb-48ea-9517-4081b016facc>gnulinux-4.4.0+-advanced-2c246bc6-65bb-48ea-9517-4081b016facc"

    Please note it is twice and separated by a >. Don’t ask me why.

    Copy MediaSDK firmware (and libraries too)

    I like to use rsync rather normal cp because there are the options like --dry-run and --itemize-changes to verify what I am doing.

    $ cd ~/MediaServerStudioEssentials2017R3/SDK2017Production16.5.2/Generic/tmp
    $ sudo rsync -av --itemize-changes ./lib /
    $ sudo rsync -av --itemize-changes ./opt/intel/common /opt/intel
    $ sudo rsync -av --itemize-changes ./opt/intel/mediasdk/{include,lib64,plugins} /opt/intel/mediasdk

    All these directories contain blobs that do the MediaSDK magic. They are dlopened by hard coded paths by mfx_dispatch, which will be explain later.

    In /lib lives the firmware (kernel blob).

    In /opt/intel/common… I have no idea what are those shared objects.

    In /opt/intel/mediasdk/include live header files for programming an compilation.

    In /opt/intel/mediasdk/lib64 live the driver for the modified libva (iHD) and other libraries.

    In /opt/intel/mediasdk/plugins live, well, plugins…

    In conclusion, all these bytes are darkness and mystery.


    $ sudo systemctl reboot

    The system should boot, automatically, in GNU/Linux 4.4.

    Please, log with Xorg, not in Wayland, since it is not supported, as far as I know.


    For compiling GStreamer I will use gst-uninstalled. Someone may say that I should use gst-build because is newer and faster, but I feel more comfortable doing the following kind of hacks with the old&good autotools.

    Basically this is a reproduction of Quick-start guide to gst-uninstalled for GStreamer 1.x.

    $ sudo apt build-dep gst-plugins-{base,good,bad}1.0
    $ wget -q -O - | sh

    I will modify the gst-uninstalled script, and keep it outside of the repository. For that I will use the systemd file-hierarchy spec for user’s executables.

    $ cd ~/gst
    $ mkdir -p ~/.local/bin
    $ mv master/gstreamer/scripts/gst-uninstalled ~/.local/bin
    $ ln -sf ~/.local/bin/gst-uninstalled ./gst-master

    Do not forget to edit your ~/.profile to add ~/.local/bin in the environment variable PATH.

    Patch ~/.local/bin/gst-uninstalled

    The modifications are to handle the three dependencies libraries that are required by MediaSDK: libdrm, libva and mfx_dispatch.

    diff --git a/scripts/gst-uninstalled b/scripts/gst-uninstalled
    index 81f83b6c4..d79f19abd 100755
    --- a/scripts/gst-uninstalled
    +++ b/scripts/gst-uninstalled
    @@ -122,7 +122,7 @@ GI_TYPELIB_PATH=$GST/gstreamer/gst:$GI_TYPELIB_PATH
     export LD_LIBRARY_PATH
     export GI_TYPELIB_PATH
     export PKG_CONFIG_PATH="\
    @@ -140,6 +140,9 @@ $GST_PREFIX/lib/pkgconfig\
     export GST_PLUGIN_PATH="\
    @@ -227,6 +230,16 @@ export GST_VALIDATE_APPS_DIR=$GST_VALIDATE_APPS_DIR:$GST/gst-editing-services/te
     export GST_VALIDATE_PLUGIN_PATH=$GST_VALIDATE_PLUGIN_PATH:$GST/gst-devtools/validate/plugins/
     export GIO_EXTRA_MODULES=$GST/prefix/lib/gio/modules:$GIO_EXTRA_MODULES
    +# MediaSDK
    +export LIBVA_DRIVERS_PATH=/opt/intel/mediasdk/lib64
    +export LD_LIBRARY_PATH="\

    Now, initialize the gst-uninstalled environment:

    $ cd ~/gst
    $ ./gst-master

    Grab libdrm from its repository and switch to the branch with the supported version by MediaSDK.

    $ cd ~/gst/master
    $ git clone git://
    $ cd drm
    $ git checkout -b intel libdrm-2.4.67

    Extract the distributed tarball in the cloned repository.

    $ tar -xv --strip-components=1 -C . -f ~/MediaServerStudioEssentials2017R3/SDK2017Production16.5.2/Generic/tmp/opt/intel/mediasdk/opensource/libdrm/2.4.67-64009/libdrm-2.4.67.tar.bz2

    Then we could check the big delta between upstream and the changes done by Intel for MediaSDK.

    Let’s put it in a commit for later rebases.

    $ git add -u
    $ git add .
    $ git commit -m "mediasdk changes"

    Get build dependencies and compile.

    $ sudo apt build-dep libdrm
    $ ./configure
    $ make -j8

    Since the pkgconfig files (*.pc) of libdrm are generated to work installed, it is needed to modify them in order to work uninstalled.

    $ prefix=${HOME}/gst/master/drm
    $ sed -i -e "s#^libdir=.*#libdir=${prefix}/.libs#" ${prefix}/*.pc
    $ sed -i -e "s#^includedir=.*#includedir=${prefix}#" ${prefix}/*.pc

    In order to C preprocessor could find the uninstalled libdrm header files we need to make them available in the expected path according to the pkgconfig file and right now they are not there. To fix that it is possible to create proper symbolic links.

    $ cd ~/gst/master/drm
    $ ln -s include/drm/ libdrm


    This modified a version of libva. These modifications messed a bit with the opensource version of libva, because Intel decided not to prefix the library, or some other strategy. In gstreamer-vaapi we had to blacklist VA-API version 0.99, because it is the version number, arbitrary set, of this modified version of libva for MediaSDK.

    Again, grab the original libva from repo and change the branch aiming to the divert point. It was difficult to find the divert commit id since even the libva version number was changed. Doing some archeology I guessed the branch point was in version 1.0.15, but I’m not sure.

    $ cd ~/gst/master
    $ git clone
    $ cd libva
    $ git checkout -b intel libva-1.0.15
    $ tar -xv --strip-components=1 -C . -f ~/MediaServerStudioEssentials2017R3/SDK2017Production16.5.2/Generic/tmp/opt/intel/mediasdk/opensource/libva/1.67.0.pre1-64009/libva-1.67.0.pre1.tar.bz2
    $ git add -u
    $ git add .
    $ git commit -m "mediasdk"

    Before compile, verify that Makefile is going to link against the uninstalled libdrm. You can do that by grepping for LIBDRM in Makefile.

    Get compilation dependencies and build.

    $ sudo apt build-dep libva
    $ ./configure
    $ make -j8

    Moidify the pkgconfig files for uninstalled

    $ prefix=${HOME}/gst/master/libva
    $ sed -i -e "s#^libdir=.*#libdir=${prefix}/va/.libs#" ${prefix}/pkgconfig/*.pc
    $ sed -i -e "s#^includedir=.*#includedir=${prefix}#" ${prefix}/pkgconfig/*.pc

    Fix header path with symbolic links

    $ cd ~/gst/master/libva/va
    $ ln -sf drm/va_drm.h


    This static library which must be linked with MediaSDK applications. In our case, to the GStreamer plugin.

    According to its documentation (included in the tarball):

    the dispatcher is a layer that lies between application and the SDK implementations. Upon initialization, the dispatcher locates the appropiate platform-specific SDK implementation. If there is none, it will select the software SDK implementation. The dispatcher will redirect subsequent function calls to the same functions in the selected SDK implementation.

    In the tarball there is the source of the mfx_dispatcher, but it only compiles with cmake. I have not worked with cmake on uninstalled setups, but we are lucky, there is a repository with autotools support:

    $ cd ~/gst/master
    $ git clone

    And compile. After running ./configure it is better to confirm, grepping the generated Makefie, that the uninstalled versions of libdrm and libva are going to be used.

    $ autoreconf  --install
    $ ./configure
    $ make -j8

    Finally, just as the other libraries, it is required to fix the pkgconfig files:d

    $ prefix=${HOME}/gst/master/mfx_dispatch
    $ sed -i -e "s#^libdir=.*#libdir=${prefix}/.libs#" ${prefix}/*.pc
    $ sed -i -e "s#^includedir=.*#includedir=${prefix}#" ${prefix}/*.pc

    Test it!

    At last we are in a position where it is possible to test if everything works as expected. For it we are going to run the pre-compiled version of vainfo bundled in the tarball.

    We will copy it to our uninstalled setup, thus we would running without specifing the path.

    $ sync -av /home/vjaquez/MediaServerStudioEssentials2017R3/SDK2017Production16.5.2/Generic/tmp/usr/bin/vainfo ./prefix/bin/
    $ vainfo
    libva info: VA-API version 0.99.0
    libva info: va_getDriverName() returns 0
    libva info: User requested driver 'iHD'
    libva info: Trying to open /opt/intel/mediasdk/lib64/
    libva info: Found init function __vaDriverInit_0_32
    libva info: va_openDriver() returns 0
    vainfo: VA-API version: 0.99 (libva 1.67.0.pre1)
    vainfo: Driver version:
    vainfo: Supported profile and entrypoints
          VAProfileH264ConstrainedBaseline: VAEntrypointVLD
          VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
          VAProfileH264ConstrainedBaseline: <unknown entrypoint>
          VAProfileH264ConstrainedBaseline: <unknown entrypoint>
          VAProfileH264Main               : VAEntrypointVLD
          VAProfileH264Main               : VAEntrypointEncSlice
          VAProfileH264Main               : <unknown entrypoint>
          VAProfileH264Main               : <unknown entrypoint>
          VAProfileH264High               : VAEntrypointVLD
          VAProfileH264High               : VAEntrypointEncSlice
          VAProfileH264High               : <unknown entrypoint>
          VAProfileH264High               : <unknown entrypoint>
          VAProfileMPEG2Simple            : VAEntrypointEncSlice
          VAProfileMPEG2Simple            : VAEntrypointVLD
          VAProfileMPEG2Main              : VAEntrypointEncSlice
          VAProfileMPEG2Main              : VAEntrypointVLD
          VAProfileVC1Advanced            : VAEntrypointVLD
          VAProfileVC1Main                : VAEntrypointVLD
          VAProfileVC1Simple              : VAEntrypointVLD
          VAProfileJPEGBaseline           : VAEntrypointVLD
          VAProfileJPEGBaseline           : VAEntrypointEncPicture
          VAProfileVP8Version0_3          : VAEntrypointEncSlice
          VAProfileVP8Version0_3          : VAEntrypointVLD
          VAProfileVP8Version0_3          : <unknown entrypoint>
          VAProfileHEVCMain               : VAEntrypointVLD
          VAProfileHEVCMain               : VAEntrypointEncSlice
          VAProfileVP9Profile0            : <unknown entrypoint>
          <unknown profile>               : VAEntrypointVideoProc
          VAProfileNone                   : VAEntrypointVideoProc
          VAProfileNone                   : <unknown entrypoint>

    It works!

    Compile GStreamer

    I normally make a copy of ~/gst/master/gstreamer/script/ in ~/.local/bin in order to modify it, like adding support for ccache, disabling gtkdoc and gobject-instrospections, increase the parallel tasks, etc. But that is out of the scope of this document.

    $ cd ~/gst/master/
    $ ./gstreamer/scripts/

    Everything should be built without issues and, at the end, we could test if the gst-msdk elements are available:

    $ gst-inspect-1.0 msdk
    Plugin Details:
      Name                     msdk
      Description              Intel Media SDK encoders
      Filename                 /home/vjaquez/gst/master/gst-plugins-bad/sys/msdk/.libs/
      License                  BSD
      Source module            gst-plugins-bad
      Source release date      2017-11-23 16:39 (UTC)
      Binary package           GStreamer Bad Plug-ins git
      Origin URL               Unknown package origin
      msdkh264dec: Intel MSDK H264 decoder
      msdkh264enc: Intel MSDK H264 encoder
      msdkh265dec: Intel MSDK H265 decoder
      msdkh265enc: Intel MSDK H265 encoder
      msdkmjpegdec: Intel MSDK MJPEG decoder
      msdkmjpegenc: Intel MSDK MJPEG encoder
      msdkmpeg2enc: Intel MSDK MPEG2 encoder
      msdkvp8dec: Intel MSDK VP8 decoder
      msdkvp8enc: Intel MSDK VP8 encoder
      9 features:
      +-- 9 elements


    Now, let’s run a simple pipeline. Please note that gst-msdk elements have rank zero, then they will not be autoplugged, it is necessary to craft the pipeline manually:

    $ gst-launch-1.0 filesrc location= ~/test.264 ! h264parse ! msdkh264dec ! videoconvert ! xvimagesink
    Setting pipeline to PAUSED ...
    Pipeline is PREROLLING ...
    libva info: VA-API version 0.99.0
    libva info: va_getDriverName() returns 0
    libva info: User requested driver 'iHD'
    libva info: Trying to open /opt/intel/mediasdk/lib64/
    libva info: Found init function __vaDriverInit_0_32
    libva info: va_openDriver() returns 0
    Redistribute latency...
    Pipeline is PREROLLED ...
    Setting pipeline to PLAYING ...
    New clock: GstSystemClock
    Got EOS from element "pipeline0".
    Execution ended after 0:00:02.502411331
    Setting pipeline to PAUSED ...
    Setting pipeline to READY ...
    Setting pipeline to NULL ...
    Freeing pipeline ...


    by vjaquez at November 24, 2017 09:57 AM

    November 22, 2017

    Asumu Takikawa

    Writing network drivers in a high-level language

    Another day, another post about Snabb. Today, I’ll start to explain some work I’ve been doing at Igalia for Deutsche Telekom on driver development. All the DT driver work I’ll be talking about was joint work with Nicola Larosa.

    When writing a networking program with Snabb, the program has to get some packets to crunch on from somewhere. Like everything else in Snabb, these packets come from an app.

    These source apps might be a synthetic packet generator or, for anything running on real hardware, a network driver that talks to a NIC (a network card). That network driver part is the subject of this blog post.

    These network drivers are written in LuaJIT like the rest of the system. This is maybe not that surprising if you know Snabb does kernel-bypass networking (like DPDK or other similar approaches), but it’s still quite remarkable! The vast majority of drivers that people are familiar with (graphics drivers, wifi drivers, or that obscure CueCat driver) are written in C.

    For the Igalia project, we worked on extending the existing Snabb drivers for Intel NICs with some extra features. I’ll talk more about the new work that we did specifically in a second blog post. For this post, I’ll introduce how we can even write a driver in Lua.

    (and to be clear, the existing Snabb drivers aren’t my work; they’re the work of some excellent Snabb hackers like Luke Gorrie and others)

    For the nitty-gritty details about how Snabb bypasses the kernel to let a LuaJIT program operate on the NIC, I recommend reading Luke Gorrie’s neat blog post about it. In this post, I’ll talk about what happens once user-space has a hold on the network card.

    Driver infrastructure

    When a driver starts up, it of course needs to initialize the hardware. The datasheet for the Intel 82599 NIC for example dedicates an entire chapter to this. A lot of the initialization process consists of poking at the appropriate configuration registers on the device, waiting for things to power up and tell you they’re ready, and so on.

    To actually poke at these registers, the driver uses memory-mapped I/O to the PCI device. The MMIO memory is, as far as LuaJIT and we are concerned, just a pointer to a big chunk of memory given to us by the hardware.pci library via the FFI.

    It’s up to us to interpret this returned uint32_t pointer in a useful way. Specifically, we know certain offsets into this memory are mapped to registers as specified in the datasheet.

    Since we’re living in a high-level language, we want to hide away the pointer arithmetic needed to access these registers. So Snabb has a little DSL in the lib.hardware.register library that takes text descriptions of registers like this:

    -- Name        Address / layout        Read-write status/description
    array_registers = [[
       RSSRK       0x5C80 +0x04*0..9       RW RSS Random Key

    and then lets you map them into a register table:

    my_regs = {}
    pci      = require 'lib.hardware.pci'
    register = require 'lib.hardware.register'
    -- get a pointer to MMIO for some PCI address
    base_ptr = pci.map_pci_memory_unlocked("02:00.01", 0)
    -- defines (an array of) registers in my_regs
    register.define_array(array_registers, my_regs, base_ptr)

    After defining these registers, you can use the my_regs table to access the registers like any other Lua data. For example, the “RSS Random Key” array of registers can be initialized with some random data like this:

    for i=0, 9 do

    This code looks like straightforward Lua code, but it’s poking at the NIC’s configuration registers. These registers are also often manipulated at the bit level, and there is some library support for that in the lib.bits.

    For example, here are some prose instructions to initialize a certain part of the NIC from the datasheet:

    Disable TC arbitrations while enabling the packet buffer free space monitor:
      — Tx Descriptor Plane Control and Status (RTTDCS), bits:
      TDPAC=0b, VMPAC=1b, TDRM=0b, BDPM=1b, BPBFSM=0b

    This is basically instructing the implementor to set some bits and clear some bits in the RTTDCS register, which can be translated into some code that looks like this:

    bits = require "lib.bits"
    -- clear these bits
    my_regs.RTTDCS:clr(bits { TDPAC=0, TDRM=4, BPBFSM=23 })
    -- set these bits
    my_regs.RTTDCS:set(bits { VMPAC=1, BDPM=22 })

    The bits function just takes a table of bit offsets to set (the table key strings only matter for documentation’s sake) and turns it into a number to use for setting a register. It’s possible to write these bit manipulations with just arithmetic operations as well, but it’s usually more verbose that way.

    Getting packets into the driver

    To build the actual driver, we use the handy infrastructure above to do the device initialization and configuration and then drive a main loop that accepts packets from the NIC and feeds them into the Snabb program (we will just consider the receive path in this post). The core structure of this main loop is simpler than you might expect.

    On a NIC like the Intel 82599, the packets are transferred into the host system’s memory via DMA into a receive descriptor ring. This is a circular buffer that keeps entries that contain a pointer to packet data and then some metadata.

    A typical descriptor entry looks like this:

    |             Address (to memory allocated by driver)                |
    |    VLAN tag    | Errors | Status |    Checksum    |    Length      |

    The driver allocates some DMA-friendly memory (via memory.dma_alloc from core.memory) for the descriptor ring and then sets the NIC registers (RDBAL & RDBAH) so that the NIC knows the physical address of the ring. There are some neat tricks in core.memory which make the virtual to physical address translation easy.

    In addition to this ring, a packet buffer is allocated for each entry in the ring and its (physical) address is stored in the first field of the entry (see diagram above).

    The NIC will then DMA packets into the buffer as they are received and filtered by the hardware.

    The descriptor ring has head/tail pointers (like a typical circular buffer) indicating where new packets arrive, and where the driver is reading off of. The driver mainly sets the tail pointer, indicating how far it has processed.

    A Snabb app can introduce new packets into a program by implementing the pull method. A driver’s pull method might have the following shape (based on the intel_app driver):

    local link = require ""
    -- pull method definition on Driver class
    function Driver:pull ()
       -- make sure the output link exists
       local l = self.output.tx
       if l == nil then return end
       -- sync the driver and HW on descriptor ring head/tail pointers
       -- pull a standard number of packets for a link
       for i = 1, engine.pull_npackets do
          -- check head/tail pointers to make sure packets are available
          if not self:can_receive() then break end
          -- take packet from descriptor ring, put into the Snabb output link
          link.transmit(l, self:receive())
       -- allocate new packet buffers for all the descriptors we processed
       -- we can't reuse the buffers since they are now owned by the next app

    Of course, the real work is done in the helper methods like sync_receive and receive. I won’t go over the implementation of those, but they mostly deal with manipulating the head and tail pointers of the descriptor ring appropriately while doing any allocation that is necessary to keep the ring set up.

    The takeaway I wanted to communicate from this skeleton is that using Lua makes for very clear and pleasant code that doesn’t get too bogged down in low-level details. That’s partly because Snabb’s core abstracts the complexity of using device registers and allocating DMA memory and things like that. That kind of abstraction is made a lot easier by LuaJIT and its FFI, so that the surface code looks like it’s just manipulating tables and making function calls.

    In the next blog post, I’ll talk about some specific improvements we made to Snabb’s drivers to make it more ready for use with multi-process Snabb apps.

    by Asumu Takikawa at November 22, 2017 10:45 AM

    November 16, 2017

    Alberto Garcia

    “Improving the performance of the qcow2 format” at KVM Forum 2017

    I was in Prague last month for the 2017 edition of the KVM Forum. There I gave a talk about some of the work that I’ve been doing this year to improve the qcow2 file format used by QEMU for storing disk images. The focus of my work is to make qcow2 faster and to reduce its memory requirements.

    The video of the talk is now available and you can get the slides here.

    The KVM Forum was co-located with the Open Source Summit and the Embedded Linux Conference Europe. Igalia was sponsoring both events one more year and I was also there together with some of my colleages. Juanjo Sánchez gave a talk about WPE, the WebKit port for embedded platforms that we released.

    The video of his talk is also available.

    by berto at November 16, 2017 10:16 AM

    November 14, 2017

    Michael Catanzaro

    Igalia is Hiring

    Igalia is hiring web browser developers. If you think you’re a good candidate for one of these jobs, you’ll want to fill out the online application accompanying one of the postings. We’d love to hear from you.

    We’re especially interested in hiring a browser graphics developer. We realize that not many graphics experts also have experience in web browser development, so it’s OK if you haven’t worked with web browsers before. Low-level Linux graphics experience is the more important qualification for this role.

    Igalia is not just a great place to work on cool technical projects like WebKit. It’s also a political and social project: an egalitarian, worker-owned cooperative where everyone has an equal vote in company decisions and receives equal pay. It’s been around for 16 years, so it’s also not a startup. You can work remotely from wherever you happen to be, or from our office in A Coruña, Spain. You won’t have a boss, but you will be expected to work well with your colleagues. It’s not the right fit for everyone, but there’s nowhere I’d rather be.

    by Michael Catanzaro at November 14, 2017 03:04 PM

    November 13, 2017

    Diego Pino

    Snabb explained in less than 10 minutes

    Last month I attended the 20th edition of GORE (the Spain’s Network Operator Group meeting) where I delivered an introductory talk about Snabb (Spanish). Slides of the talk are also available online (English).

    Taking advantage of this presentation I decided to write down an introductory article about Snabb. Something that could allow anyone to understand what’s Snabb easily.

    What is Snabb?

    Snabb is a toolkit for developing network functions in user-space. This definition refers to two keywords that are worth clarifying: network functions and user-space.

    What’s a network function?

    A network function is any program that does something on network traffic. What kind of things can be done on traffic? For instance: to read packets, modify their headers, create new packets, discard packets or forward them. Any network function is a combination of these basic operations. Here are some examples:

    • Filtering function (i.e. firewalling): read incoming packets, compare to table of rules and execute an action (forward or drop).
    • Traffic mapping (i.e. NAT): read incoming packets, modify headers and forward packet.
    • Encapsulation (i.e. VPN): read incoming packets, create a new packet, embed packet into new one and send it.

    What’s user-space networking?

    In the last few years, there has been a new trend for writing down network functions. This new trend consists of writing down the entire network function in user-space and do not leave any processing to the kernel.

    Traditionally when writing down network functions we use the abstractions provided by the OS. The goal of any OS is to create abstractions over hardware that programs can use. This happens at many levels. For instance, when dealing with a hard-drive we don’t need to think of heads, cylinders and sectors but use a higher level abstraction: the filesystem. Networking is another layer abstracted by the OS. As programmers, we don’t deal with the NIC directly, instead we work with sockets and have access to APIs to deal with the TCP/IP stack.

    However, the addition of higher level abstractions implicitly adds an overhead to the processing of our network function. The first disadvantage is that the function is split in two lands: user-space and kernel-space, and switching between both lands has a cost. And even if we move as much logic as possible to the kernel, there are inherit costs caused by the kernel’s networking layer.

    The need of skipping the kernel and program network functions entirely in user-space was triggered by the continuous improvement of hardware. Today is possible to buy a 10G NIC for less than 200 euros. Soon the idea of building high-performance network appliances out of commodity hardware seemed feasible. Someone could pick an Intel Xeon, fill in the available PCI slots with 10G NICs and expect to have the equivalent of a very expensive Cisco or Juniper router for a fraction of its cost.

    If we drive the hardware described above entirely with Linux we won’t be able to squeeze all its performance. Every packet hitting the NICs will have to go through the kernel’s networking layer and that has a cost caused by all the operations the kernel does onto packets before they’re available to manipulate by our program. To understand how much this is a problem, I need to introduce the concept of budget in a network function.

    Know your network function budget

    If we want to make the most of our hardware we generally would like to run our network function at line-rate speed, that means, the maximum speed of the NIC. How much time is that? In a 10G NIC, if we are receiving packets of an average size of 550-bytes at the maximum speed then we’re receiving a new packet every 440ns. That’s all the time we have available to run our network function on a packet.

    Usually the way a NIC works is by placing incoming packets in a queue or buffer. This buffer is actually a ring-buffer, that means there are two cursors pointing to the buffer, the Rx cursor and the Tx cursor. When a new packet arrives, the packet is written at the Rx position and the cursor gets updated. When a packet leaves the buffer, the packet is read at the Tx position and the cursor gets updated after read. Our network function fetches packets from the Tx cursor. If it’s too slow processing a packet, the Rx cursor will eventually overpass the TX cursor. When that happens there’s a packet drop (a packet was overwritten before it was consumed).

    Let’s go back to the 440ns number. How much time is that? Kernel hacker Jesper Brouer discusses this issue on his excellent talk “Network stack challenges at increasing speed” (I also recommend LWN’s summary of the talk: Improving Linux networking performance). Here’s the cost of some common operations: (cost varies depending on hardware but the order of magnitude is similar across different hardware settings)

    • Spinlock (Lock/Unlock): 16ns.
    • L2 cache hit: 4.3ns.
    • L3 cache hit: 7.9ns.
    • Cache miss: 32ns.

    Taking into account those numbers 440ns doesn’t seem like a lot of time. System calls cost is also prohibitive, which should be minimized as much as possible.

    Another important thing to notice is that the smaller the size of the packet, the smaller the budget. On a 10G NIC if we’re receiving packets of 64-byte on average, the smallest IPv4 packet size possible, that means we are receiving a new packet every 59ns. In this scenario two straight cache misses would eat the whole budget.

    In conclusion, at these NIC speeds the additional overhead the kernel networking layer adds is non trivial, but significantly big enough to affect the execution of our network function. Since our budget gets reduced packets are more likely to be dropped at higher speeds or at smaller packet sizes, limiting the overall performance of our network card.

    NOTE: This is a general picture of the issue of doing high-performance networking in the Linux kernel. The kernel hackers are not ignorant of these problems and have been working on ways to fix them in the last years. In that regard is worth mentioning the addition of XDP (eXpress Data Path), a kernel abstraction to execute network functions as closer to the hardware as possible. But that’s a subject for another post.

    By-passing the kernel

    User-space networking needs to by-pass the kernel’s networking layer so it can squeeze all the performance of the underlying hardware. There are several strategies to do that: user-space drivers, PF_RING, Netmap, etc (Cloudflare has an excellent article on kernel by-pass, commenting several of those strategies). Snabb chooses to handling the hardware directly, that means, to provide user-space drivers for the NICs it supports.

    Snabb offers support mostly for Intel cards (although some Solarflare and Mellanox models are also supported). Implementing a driver, either in kernel-space or user-space, is not an easy task. It’s fundamental to have access to the vendor’s datasheet (generally a very large document) to know how to initialize the NIC, how to read packets from it, how to transfer data, etc. Intel provides such datasheet. In fact, Intel started a few years ago a project with a similar goal: DPDK. DPDK is an open-source project that implements drivers in user-space. Although originally it only provided drivers for Intel NICs, as the adoption of the project increased, other vendors have started to add drivers for their hardware.

    Inside Snabb

    Snabb was started in 2012 by free software hacker Luke Gorrie. Snabb provides direct access to the high-performance NICs but in addition to that it also provides an environment for building and running network functions.

    Snabb is composed of several elements:

    • An Engine, that runs the network functions.
    • Libraries, that ease the development of network functions.
    • Apps, reusable software components that generally manipulate packets.
    • Programs, ready-to-use standalone network functions.

    A network function in Snabb is a combination of apps connected together by links. The Snabb’s engine is in charge of feeding the app graph with packets and give a chance to every app to execute.

    The engine processes the app graph in breadths. A breadth consists of two steps:

    • Inhale, puts packet into the graph.
    • Process, every app has a chance to receive packets and manipulate them.

    During the inhale phase the method pull of an app gets executed. Apps that implement such method act as packet generators within the app graph. Packets are placed at the app’s links. Generally there’s only one app of think kind for every graph.

    During the process phase the method push of an app gets executed. This gives a chance to every app to read packet from its incoming link, do something with them and likely place them out their outgoing link.

    Hands-on example

    Let’s build a network function that captures packets from a 10G NIC filters them using a packet-filtering expression and writes down the filtered packets to a pcap file. Such network function would look like this:

    Snabb basic filter
    Snabb basic filter

    In Snabb code the equivalent graph above could be coded like this:

    function run()
    	local c =
    	-- App definition.
    	config.add(c, "nic", Intel82599, {
    		pci = "0000:04:00.0"
    	config.add(c, "filter", PcapFilter, "src port 80")
    	config.add(c, "pcap", Pcap.PcapWriter, "output.pcap")
    	-- Link definition., "nic.tx        -> filter.input"), "filter.output -> pcap.input")

    A configuration is created describing the app graph of the network function. The configuration is passed down to Snabb which executes it for 10 seconds.

    When Snabb’s engine runs this network function it executes the pull method of each app to feed packets into the graph links, inhale step. During the process step, the method push of each app is executed so apps have a chance to fetch packets from their incoming links, do something with them and likely place them into their outgoing links.

    Here’s how the real implementation of PcapFilter.push method looks like:

    function PcapFilter:push ()
    	while not link.empty(self.input.rx) do
     		local p = link.receive(self.input.rx)
      		if self.accept_fn(, p.length) then
         		link.transmit(self.output.tx, p)

    A packet in Snabb is a really simple data structure. Basically, it consists of a length field and an array of bytes of fixed size.

    struct packet {
    	uint16_t length;
      	unsigned char data[10*1024];

    A link is a ring-buffer of packets.

    struct link {
    	struct packet *packets[1024];
      	// the next element to be read
      	int read;
      	// the next element to be written
      	int write;

    Every app has zero or many input links and zero or many output links. The number of links is created on runtime when the graph is defined. In the example above, the nic app has one outgoing link (nic.tx); the filter app has one incoming link (filter.rx) and one outgoing link (filter.tx); and the pcap app has one incoming link (pcap.input).

    It might be surprising that packets and links are defined in C code, instead of Lua. Snabb runs on top of LuaJIT, an ultra-fast virtual machine for executing Lua programs. LuaJIT implements an FFI (Foreign Function Interface) to interact with C data types and call C runtime functions or external libraries directly from Lua code. In Snabb most data structures are defined in C which allows to compact data more efficiently.

    local ether_header_t = ffi.typeof [[
    /* All values in network byte order.  */
    struct {
       uint8_t  dhost[6];
       uint8_t  shost[6];
       uint16_t type;
    } __attribute__((packed))

    Calling a C-runtime function is really easy too.

      void syslog(int priority, const char\* format, ...);
    ffi.C.syslog(2, "error:...");

    Wrapping up and last thoughts

    In this article I’ve covered the basics of Snabb. I showed how to use Snabb to build network functions and explained why Snabb is a very convenient toolkit to write such type of programs. Snabb runs very fast since it by-passes the kernel, which makes it very useful for high-performance networking. In addition, Snabb is written in the high-level language Lua which enormously simplifies the entry barrier to start writing network functions.

    However, there’s more things in Snabb I left out in this article. Snabb comes with a preset of programs ready to run. It also comes with a vast collection of apps and libraries which can help to speed up the construction of new network functions.

    You don’t need to own a Intel10G card to start using Snabb today. Snabb can be used over TAP interfaces. It won’t be highly performant but it’s the best way to start with Snabb.

    In a next article I plan to cover a more elaborated example of a network function using TAP interfaces.

    November 13, 2017 06:00 AM

    November 08, 2017

    Eleni Maria Stea

    Fosscomm 2017 [update: slides in English]

    FOSSCOMM (Free and Open Source Software Communities Meeting) is a Greek conference aiming at free-software and open-source enthusiasts, developers, and communities. The event is solely organized and ran by volunteers (usually university students, communities, Linux User Groups) and is taking place in a different city every year. The attendance is free and everyone is welcome to make a presentation or a workshop related to free and open source projects.

    I always try to attend this meeting when the dates and the place are convenient, as it is a great opportunity to meet old friends and hangout with geek people.

    This year’s Fosscomm2017 (website is in Greek) was held at the Harokopio University of Athens, during the weekend: 4-5th November 2017.

    I grabbed the opportunity to go and give a talk about Mesa 3D, a project where the Igalia’s graphics team makes several contributions and releases the last 5 years.

    My presentation was titled: “Hacking on Mesa 3D” and it was a short introductory talk about the OpenGL implementation, the OpenGL extension system, the GLSL compiler, the drivers and some of the development and debugging processes we use.

    Original slides in Greek:


    English translation:


    Video (in Greek1 because that was the conference language):

    To my surprise, my talk wasn’t the only one mentioning Igalia. 🙂 Dimitris Glynos, one of the Co-Founders of Census (and FOSSCOMM sponsor) gave the talk “FOSS is all we got: building a competitive IT skill set in Greece today” and mentioned us among other examples of companies that work successfully on open source projects.

    Among the other talks I’ve attended, I particularly liked the “Linux Metrics” workshop by Effie Mouzeli and Giorgos Kargiotakis (during FOSSCOMM day #1), that was aiming to teach users and developers how to use metrics tools to detect performance issues. It was so successful that they’ve been asked to re-run it the following day.

    Most of the other presentations and workshops, as well as the schedule, can be found here (for those who can understand greek): The FOSSCOMM organizers will soon edit the videos and upload them on a channel on YouTube.

    this is my post-FOSSCOMM cup collection

    I’d like to thank the people who attended the presentation, the FOSSCOMM2017 organization team who did such a great job on preparing and hosting the event and of course Igalia that is giving me the opportunity to work on cool graphics stuff.

    See you at FOSSCOMM 2018! 😉

    [1]: Subtitles coming soon.

    by hikiko at November 08, 2017 07:30 AM

    November 01, 2017

    Martin Robinson

    Small Things

    Even between two highly-developed western countries, there are a lot of cultural differences. After moving, I experienced the sort of culture shock that the Internet warns you about. Thankfully, the passage of time means that grumbling noon-time stomachs gradually give way to curiously peckish 2:30pm lunches. Instead of sitting in dread, willing your useless, polite American hands to flag a waiter, you manage to order a tiny beer using only your eyeballs. Big differences fade into the background so much that maybe you start to keep a list of them, just to avoid the feeling that you are forgetting some original piece of yourself.

    This new familiarity begins to expose the incredibly long tail of subtle differences that have been hanging out quietly in the background. Unnamed onomatopoeias have a completely different sound. People are making gestures with their hands while they speak, and those gestures actually mean something very clear. Your brain calmly catalogs these curiosities as they become too trivial to comment on.

    If you are like me, you stare at the street, the stoplights, and the sidewalks. Suddenly, the endless, small scale war being waged in the space between the double (and triple) parkers and the buildings becomes apparent. You see the rows of bollards silently holding back a tide of cars and delivery vans. Unspoken rules from your home country no longer apply here, after having taken them for granted for years

    I don’t want to blab on too long about mundane things, so I will just point to the example of curb cuts. In the US we use curb cuts to connect the roadway to private garages, driveways, and parking lots. Thousands of dollars are spent lovingly crafting each of these small cement altars to the passage of automobiles. The sidewalk itself kneels to the pavement, so that cars can smoothly and comfortably climb into pedestrian space. This is all, of course, at the expense of people walking and in wheelchairs who often have to travel across an uneven sidewalk and wait for cars as they appear and (hopefully) leave. A curb cut is a signal that at any moment a car may enter the sidewalk and that it has a right to be there.

    Spanish cities sometimes use little ramps instead. They look cheap and their metal surface is usually painted a bright and gaudy yellow. Their angle is decidedly steeper compared to modern curb cuts in the US, which means it is not easy to drive onto the sidewalk quickly or comfortably. Additionally, they look like they can also be added and removed cheaply and without modifying the sidewalk at all. Even more bizarrely, they are installed on the sacred roadway itself, so the sidewalk remains level for the all people who might happen to be using it. Sometimes, they even extend so far out into the roadway that parallel parking would be difficult or impossible, which prevents the space from becoming a private parking spot.

    These little ramps are a small detail of the city, but for me they send a clear message. They announce to cars that they are entering a segregated pedestrian space. This invitation is conditional on moving slowly and carefully and can be revoked at any time with a hydraulic wrench. Maybe they are common simply because they are a cheap leftover from a period when this was a poorer country. I have a feeling that as time goes on, they will slowly be replaced by compact curb cuts descending from nice, new sidewalks. Despite all this, I feel a little bit of sadness, because their economy and their imperfection made the sidewalk just that much nicer.

    November 01, 2017 04:00 AM

    October 30, 2017

    Víctor Jáquez

    GStreamer Conference 2017

    This year, the GStreamer Conference happened in Prague, along with the traditional autumn Hackfest.

    Prague is a beautiful city, though this year I couldn’t visit it as much as I wanted, since the Embedded Linux Conference Europe and the Open Source Summit also took place there, and Igalia, being a Linux Foundation sponsor, had a booth in the venue, where I talked about our work with WebKit, Snabb, and obviously, GStreamer.

    But, let’s back to the GStreamer Hackfest and Conference.

    One of the features that I like the most of the GStreamer project is its community, the people involved in it, by writing code, sharing their work with many others. They might appear a bit tough at beginning (or at least that looked to me) but in real they are all kind and talented persons. And I’m proud of consider myself part of this community. Nonetheless it has a diversity problem, as many other Open Source communities.

    GStreamer Conference 2017

    During the Hackfest, Hyunjun and I, met with Sree and talked about the plans for GStreamer-VAAPI, the new features in VA-API and libva and how we could map them to the GStreamer’s design. Also we talked about the future developments in the msdk elements, merged one year ago in gst-plugins-bad. Also, I talked a bit with Nicolas Dufresne regarding kmssink and DMABuf.

    In the Conference, which happened in the same venue as the hackfest, I talked wit the authors of gstreamer-media-SDK. They are really energetic.

    I delivered my usual talk about GStreamer-VAAPI. You can find the slides, as a web presentation, here. Also, as every year, our friends of Ubicast, recorded the talks, and made them available for streaming almost instantaneously:

    My colleague Enrique talked in the Conference about the Media Source Extensions (MSE) on WebKit, and Hyunjun shared his experience with VA-API on Rust.

    Also, in the conference venue, we showed a couple demos. One of them was a MinnowBoard running WPE, rendering videos from YouTube using gstreamer-vaapi to decode video.

    by vjaquez at October 30, 2017 04:24 PM