Planet Igalia

February 09, 2016

Víctor Jáquez

GStreamer VA-API under the umbrella of GStreamer

We have a new GStreamer VA-API release: 1.6.0!

“Wait a minute”, you might say, “wasn’t the last release 0.7?”, and you would be correct; but something big has happened: GStreamer VA-API is now part of the official GStreamer project!

And this means a couple of changes.

First of all, the official repository of the project has moved: it is now co-hosted along with the rest of the GStreamer components at freedesktop.org (fdo):

Anonymous Git repository: git://

Developer Git repository: ssh://

Web Git gateway:

Second, the bug tracking has also moved: since GStreamer VA-API is now a component of GStreamer, new bugs must be filed here:

And the bug list is here:

“What will happen with the old bugs?” you will ask. Well, we will move them as soon as we have reviewed them all.

The third change, as you may already have noticed, is the version scheme. We now follow the GStreamer version numbering to avoid confusion and to simplify our development. Hence, this release, 1.6.0, supports the current GStreamer stable version (1.6.3), and our current development branch is 1.7.x. Future releases will follow the GStreamer version scheme and dependencies.

“Sweet! But what’s new in this release?” The answer is: not much. Really. Most of the changes are related to integration with the upstream processes (autoconf setup, documentation, etc.). Perhaps the most remarkable change is the removal of the support libraries (libgstvaapi-*.so) used by the vaapi plugin: they are now compiled as a single static library and linked into the plugin. The custom parsers were also removed, and the plugin and element documentation got into better shape.

At the code level, we had to push a huge indentation commit in order to align with the GStreamer code style. This commit artificially kills the blame history, but it was our best option.

I ought to say that those were not the only changes at the code level: Michael Olbrich fixed a missing frame release in the Wayland backend, and Sree, as usual, fixed a bunch of hardcore stuff. I especially want to thank Tim-Philipp Müller for helping us along the upstreaming process, and obviously the Intel Open Source Technology Center for letting this happen.

Here’s the git shortlog summary since the last 0.7.0 release:

 1  Joel Holdsworth
 1  Michael Olbrich
 9  Sreerenj Balachandran
 4  Tim-Philipp Müller
42  Víctor Manuel Jáquez Leal

By the way, Igalia is hiring!

by vjaquez at February 09, 2016 12:32 PM

February 08, 2016

Andy Wingo

a lambda is not (necessarily) a closure


Greets, folks! Check it out: Guile had a whole track devoted to it at FOSDEM this year. OK, so it was only half a day, but there were like a dozen talks! And the room was full all morning! And -- get this -- I had nothing to do with its organization! I think we can credit the Guix project with the recent surge of interest in Guile; fully half the talks were from people excited about using Guix to solve their problems. Thanks very, very much to Pjotr Prins for organizing the lovely event.

I gave a talk on how the Guile 2.2 compiler and virtual machine could change the way people program. Happily, the video recording came out OK! Video below (or here if that doesn't work), and slides here.

The time was super-limited though and I wasn't able to go into the detail that I'd like. So, dear readers, here we are, with a deeper look on lambda representation in Guile.

a lambda is not (necessarily) a closure

What is this?

(lambda (a b) (+ a b))

If you answer, "it's a lambda expression", you're right! You're also right if you say it's a function -- I mean, lambda makes a function, right? There are lots of things that you could say that would be right, including silly things like "twenty-two characters set in an awkward typeface".

But if you said "it's a closure" -- well you're right in general I guess, like on a semantic what-does-it-mean level, but as far as how Guile represents this thing at run-time, hoo boy are there a number of possibilities, and a closure is just one of them. This article dives into the possibilities, with the goal being to help you update your mental model of "how much do things cost".

In Guile, a lambda expression can be one of the following things at run-time:

  1. Gone

  2. Inlined

  3. Contified

  4. Code pointer

  5. Closure

Let's look into these one-by-one.

lambda: gone

If Guile can prove that a lambda expression is never reached, it won't be present at run-time. The main way this happens is via partial evaluation, but later passes can do this too. In the most basic example, consider the lambda bound to f by this let expression.

(let ((f (lambda ()
           (launch-the-missiles!))))
  42)

Guile has an ,optimize command that can be run at the REPL to show the effect of partial evaluation on your code. These days it's a bit out of date in a way -- it can't show what CPS-based optimization will do to your code -- but for our purposes here it will transform the expression to the following code:

(let ((f (lambda ()
           (launch-the-missiles!))))
  42)
=> 42

So the lambda is gone, big whoop. The interesting thing though is that this happens concurrently with other things that partial evaluation does, so the lambda goes away in this expression too:

(let ((launch? #f)
      (f (lambda ()
           (launch-the-missiles!))))
  (if launch? (f) 'just-kidding))
=> 'just-kidding

lambda: inlined

The other trick that partial evaluation can do with lambda expressions is inlining. Taking the example above again, if we change launch? to #t, the branch folds the other way and the application (f) inlines:

(let ((launch? #t)
      (f (lambda ()
           (launch-the-missiles!))))
  (if launch? (f) 'just-kidding))
=> (let ((launch? #t)
         (f (lambda ()
              (launch-the-missiles!))))
     (if #t (f) 'just-kidding))
=> (let ((launch? #t)
         (f (lambda ()
              (launch-the-missiles!))))
     (f))
=> (let ((launch? #t)
         (f (lambda ()
              (launch-the-missiles!))))
     ((lambda () (launch-the-missiles!))))
=> (let ((launch? #t)
         (f (lambda ()
              (launch-the-missiles!))))
     (launch-the-missiles!))
=> (launch-the-missiles!)

Here again the lambda is gone, not because it was unreachable, but because it was inlined into its use. I showed some intermediate steps as well, just so you get a feel for how partial evaluation works. The inlining step is illustrated by the fourth transformation, where the lambda application goes away, replaced by its body.

Partial evaluation can also unroll many kinds of recursion:

(letrec ((lp (lambda (n)
               (if (zero? n)
                   n
                   (+ n (lp (1- n)))))))
  (lp 5))
=> 15

The partial evaluator in Guile 2.2 is more or less unchanged from the one in Guile 2.0, so you get these benefits on old Guile as well. Building a good intuition as to what the partial evaluator will do is important if you want to get the best performance out of Guile. Use the ,optimize command at the REPL to see the effects of partial evaluation on any given expression.
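For instance, a REPL session might look something along these lines; ,optimize prints the residual source, and the exact formatting depends on your Guile version:

scheme@(guile-user)> ,optimize (let ((f (lambda () 42)))
                                 (f))
42

Here the whole expression residualizes to the constant 42: the lambda, the binding, and the call are all gone before any code is emitted.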

lambda: contified

So, here we step into the unknown, in the sense that from here on out, these optimizations are new in Guile 2.2. Unfortunately, they can be hard to see as they aren't really representable in terms of source-to-source transformations over Scheme programs. Consider this program:

(define (count-down n)
  (define loop
    (lambda (n out)
      (let ((out (cons n out)))
        (if (zero? n)
            out
            (loop (1- n) out)))))
  (loop n '()))

It's a little loop that builds a list of integers. The lambda in this loop, bound to loop, will be contified into the body of count-down.

To see that this is the case, we have to use a new tool, ,disassemble (abbreviated ,x). This takes a procedure and prints its bytecode. It can be hard to understand, so I'm going to just point out some "shapes" of disassembly that you can recognize.

> ,x count-down
Disassembly of #<procedure count-down (n)> at #x9775a8:

  10    (cons 2 1 2)
  11    (br-if-u64-=-scm 0 1 #f 5) ;; -> L2
  14    (sub/immediate 1 1 1)
  15    (br -5)                    ;; -> L1

I've snipped the disassembly to the interesting part. The first thing to notice is that there's just one procedure here: only one time that ,x prints "Disassembly of ...". That means that the lambda was eliminated somehow, either because it was dead or inlined, as described above, or because it was contified. It wasn't dead; we can see that from looking at the ,optimize output, which doesn't significantly change the term. It wasn't inlined either; again, ,optimize can show you this, but consider that because partial evaluation can't determine when the loop would terminate, it won't find a point at which it can stop unrolling the loop. (In practice what happens though is that it tries, hits an effort or code growth limit, then aborts the inlining attempt.)

However, what we see in the disassembly is the body of the loop: we cons something onto a list (the cons), check if two numbers are equal (br-if-u64-=-scm), and if they are we jump out of the loop (to L2). Otherwise we subtract 1 from a number (sub/immediate) and loop (br to L1). That is the loop. So what happened?

Well, if inlining is copying, then contification is rewiring. Guile's compiler was able to see that although it couldn't inline the loop function, it could see all of loop's callers, and that loop always returned to the same "place". (Another way to say this is that loop is always called with the same continuation.) The compiler was then able to incorporate the body of loop into count-down, rewiring calls to loop to continue to loop's beginning, and rewriting returns from loop to proceed to the continuation of the loop call.
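You can picture the result as if the compiler had rewritten count-down to use a first-order named let, so that no loop procedure exists as a value at run-time. This is only a mental model; the actual transformation happens on the intermediate language, not on Scheme source:

;; conceptual result of contifying `loop' into `count-down'
(define (count-down n)
  (let lp ((n n) (out '()))
    (let ((out (cons n out)))
      (if (zero? n)
          out
          (lp (1- n) out)))))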

a digression on language

These words like "contification" and "continuation" might be unfamiliar to you, and I sympathize. If you know of a better explanation of contification, I welcome any links you might have. The name itself comes from a particular formulation of the intermediate language used in Guile, the so-called "CPS" language. In this language, you convert a program to make it so it never returns: instead, each sub-expression passes its values to its continuation via a tail call. Each continuation is expressed as a lambda expression. See this article for an intro to CPS and how it relates to things you might know like SSA.
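As a tiny sketch of the idea, a direct-style expression like (display (+ 1 2)) could be CPS-converted along these lines, where +& and display& are hypothetical CPS variants of + and display that take an extra continuation argument:

;; direct style: subexpressions return values
(display (+ 1 2))

;; CPS sketch: each subexpression passes its value to a continuation
(+& 1 2
    (lambda (sum)                ;; continuation receiving the sum
      (display& sum
                (lambda (result) ;; continuation of the whole expression
                  result))))

Nothing ever "returns" in the converted program; every intermediate value flows forward via a tail call to a continuation lambda.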

Transforming a program into CPS explodes it into a bunch of little lambdas: every subexpression gets its own. You would think this would be a step backwards, if your goal is to eliminate closures in some way. However it's possible to syntactically distinguish between lambda expressions which are only ever used as continuations and those that are used as values. Let's call the former kind of lambda a cont and the latter a function. A cont-lambda can be represented at run-time as a label -- indeed, the disassembly above shows this. It turns out that all lambda expressions introduced by the CPS transformation are conts. Conts form a first-order flow graph, and are basically the same as SSA basic blocks. If you're interested in this kind of thing, see Andrew Kennedy's great paper, Compiling with Continuations, Continued, and see also CPS soup for more on how this evolved in Guile 2.2.

I say all this to give you a vocabulary. Functions that are present in the source program start life as being treated as function-lambdas. Contification takes function-lambda values and turns them into cont-lambda labels, if it can. That's where the name "contification" comes from. For more on contification, see MLton's page on its contification pass, linking to the original paper that introduces the concept.

and we're back

Contification incorporates the body of a function into the flow graph of its caller. Unlike inlining, contification is always an optimization: it never causes code growth, and it enables other optimizations by exposing first-order control flow. (It's easier for the compiler to reason about first-order loops than it is to reason about control flow between higher-order functions.)

Contification is a reliable optimization. If a function's callers are always visible to the compiler, and the function is always called with the same continuation, it will be contified. These are two fairly simple conditions that you can cultivate your instincts to detect and construct.

Contification can also apply to mutually recursive functions, if as a group they are all always called with the same continuation. It's also an iterative process, in the sense that contifying one set of functions can expose enough first-order control flow that more contification opportunities become apparent.

It can take a while to get a feel for when this optimization applies. You have to have a feel for what a continuation is, and what it means for a function's callers to all be visible to the compiler. However, once you do internalize these conditions, contification is something you can expect Guile's compiler to do to your code.
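For instance, in a sketch like the following, even-n? and odd-n? only ever tail-call each other, and both are entered with parity's return continuation, so as a group they are candidates for contification:

(define (parity n)
  (define (even-n? n)
    (if (zero? n) 'even (odd-n? (1- n))))
  (define (odd-n? n)
    (if (zero? n) 'odd (even-n? (1- n))))
  (even-n? n))

All callers of both procedures are visible inside parity, and every call site returns to the same place: the caller of parity.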

lambda: code pointer

The next representation a lambda might have at run-time is as a code pointer. In this case, the function fails the conditions for contification, but we still avoid allocating a closure.

Here's a little example to illustrate the case.

(define (thing)
  (define (log what)
    (format #t "Very important log message: ~a\n" what))
  (log "ohai")
  (log "kittens")
  (log "donkeys"))

In this example, log is called with three different continuations, so it's not eligible for contification. Unfortunately, this example won't illustrate anything for us because the log function is so small that partial evaluation will succeed in inlining it. (You could determine this for yourself by using ,optimize.) So let's make it bigger, to fool the inliner:

(define (thing)
  (define (log what)
    (format #t "Very important log message: ~a\n" what)
    ;; If `log' is too short, it will be inlined.  Make it bigger.
    (format #t "Did I ever tell you about my chickens\n")
    (format #t "I was going to name one Donkey\n")
    (format #t "I always wanted a donkey\n")
    (format #t "In the end we called her Raveonette\n")
    (format #t "Donkey is not a great name for a chicken\n")
    (newline) (newline) (newline) (newline) (newline))
  (log "ohai")
  (log "kittens")
  (log "donkeys"))

Now if we disassemble it, we do get disassembly for two different functions:

,x thing
Disassembly of #<procedure thing ()> at #x97d704:

Disassembly of log at #x97d754:

So, good. We defeated the inliner. Let's look closer at the disassembly of the outer function.

,x thing
Disassembly of #<procedure thing ()> at #x97d704:
  12    (call-label 3 2 8)              ;; log at #x97d754

Here we see that instead of the generic call instruction, we have the specific call-label instruction which calls a procedure whose code is at a known offset from the calling function.

call-label is indeed a cheaper call than the full call instruction that has to check that the callee is actually a function and so on. But that's not the real optimization here. If all callers of a function are known -- and by this time, you're starting to catch the pattern, I think -- if all callers are known, then the procedure does not need to exist as a value at run-time.

This affords a number of optimization opportunities. Theoretically there are many -- all call sites can be specialized to the specific callee. The callee can have an optimized calling convention that doesn't have anything to do with the generic convention. Effect analysis can understand the side effects and dependencies of the callee in a more precise way. The compiler can consider unboxing some arguments and return values, if it finds that useful.

In Guile though, there's only one real optimization that we do, and that is related to free variables. Currently in Guile, all procedures have a uniform calling convention, in which the procedure being called (the callee) is itself passed as the zeroeth argument, and then the arguments follow on the stack. The function being called accesses its free variables through that zeroeth argument. If however there is no need for the procedure to be represented as a value, we are free to specialize that zeroeth argument.

So, consider a well-known procedure like log above. (By "well-known", we mean that all of log's callers are known.) Since log doesn't actually have any lexically bound free variables, we can just pass in anything as argument zero when invoking it. In practice we pass #f, because it happens to be an easy value to make.

(Why isn't format treated as a free variable in log? Because there is special support from the linker for lazily initializing the locations of variables imported from other modules or defined at the top level instead of within a lexical contour. In short: only variables that are (a) used within the lambda and (b) defined within a let or similar count towards a lambda's free variables.)

For a well-known procedure with only one free variable, we can pass in that free variable as the zeroeth argument. Internally to the function, we rewrite references to that free variable to reference argument 0 instead. This is a neat hack because we can have a lambda with a free variable but which results in no allocation at run-time.

Likewise if there are two free variables -- and this is starting to sound like Alice's restaurant, isn't it -- well we do have to pass in their values to the procedure, but we don't have to build an actual closure object with a tag and a code pointer and all. Pairs happen to be small and have some fast paths in Guile, so we use that. References to the free variables get internally rewritten to be car or cdr of argument 0.

For three or more free variables, we do the same, but with a vector.
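Putting these cases together, for a hypothetical well-known procedure f the call-site rewrites look roughly like this (ignoring that the inliner would likely just inline bodies this small):

;; 0 free variables: argument 0 is just #f
(f #f arg ...)

;; 1 free variable x: pass x itself as argument 0
(f x arg ...)              ;; references to x in f's body read argument 0

;; 2 free variables x and y: pass a pair
(f (cons x y) arg ...)     ;; x -> (car arg0), y -> (cdr arg0)

;; 3 or more free variables: pass a vector
(f (vector x y z) arg ...) ;; x -> (vector-ref arg0 0), and so on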

For a final trick, a set of mutually recursive procedures whose callers are all known can share the object that collects their free variables. We collect the union of the free variables of all of the procedures, and pack them into a specialized representation as above.

Note that for well-known procedures, all variables that are free in the lambda are also free in the caller; that's why the 1-free-variable substitution works. The lambda is bound in a scope that dominates its callers, but its free variables dominate the lambda so they dominate the callers too. For that reason in this case we could choose to do lambda lifting instead, with no penalty: instead of bundling up the free variables in a heap object, we could pass them as arguments. Dybvig claims this is not a great idea because it increases register pressure. That could be true, but I haven't seen the numbers. Anyway, we do the flat closure thing, so we pack the free vars into data.
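Lambda lifting, the alternative mentioned above, would instead turn the free variables into extra formal parameters. A hypothetical sketch:

;; flat-closure approach: x and y packed into data, passed as argument 0
(define (f x y)
  (define (g z) (+ x y z))  ;; free variables: x, y
  (g 1))

;; lambda-lifted alternative: x and y become ordinary arguments
(define (lifted-g x y z) (+ x y z))
(define (f x y) (lifted-g x y 1))

Both versions compute the same thing; they differ only in whether the free variables travel in a heap-allocated (or specialized) object or on the stack as arguments.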

All these ideas came pretty much straight from the great Optimizing Closures in O(0) Time by Andrew Keep et al.

lambda: closure

OK! So you have a lambda whose callers are not all visible to the compiler. You need to reify the procedure as a value. That reified procedure-as-value is a closure: an object with a tag, a code pointer, and an array of free variables.

Of course, if the procedure has no free variables, you just have the tag and the code pointer... and because Scheme is semantically squirrely when it comes to the result of (eqv? (lambda () 10) (lambda () 10)) (it's unspecified: lambda expressions don't have identity), we can statically allocate the closure in the binary, as a constant.

Otherwise we do allocate the heap object.
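The classic case is a lambda that escapes, for example by being returned to an unknown caller:

(define (make-counter)
  (let ((n 0))
    ;; This lambda escapes make-counter, so its callers are unknown:
    ;; it must be reified as a closure capturing the free variable n.
    (lambda ()
      (set! n (1+ n))
      n)))

Each call to make-counter must allocate a fresh closure, because each one carries its own n.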

Note however that if a group of mutually recursive procedures has just one entry that is not "well-known", then that procedure clique can share one closure object.

lambda: it's complicated

In summary, a lambda is an abstraction that has many concrete representations. Guile will choose the cheapest representation that it can. If you need to eke out even more performance from your program, having a good mental model of how the abstract maps to the concrete will help you know where to focus your efforts, and what changes might be helpful. Good luck, and happy hacking!

by Andy Wingo at February 08, 2016 10:12 AM

Claudio Saavedra

Mon 2016/Feb/08

About a year ago, Igalia was approached by the people working on printing-related technologies at HP to see whether we could give them a hand in their ongoing effort to improve the printing experience on the web. They had been working for a while on extensions for popular web browsers that would allow users, for example, to strip a web page of cruft and ads and format its relevant contents in a way that is pleasant to read in print. While these extensions worked fine, they were interested in exploring the possibility of adding this feature to popular browsers directly, so that users wouldn't need to install extensions to get an improved printing experience.

That's how Alex, Martin, and I spent a few months exploring the Chromium project and its printing architecture. Soon enough we found out that the Chromium developers had already been working on a feature that would allow pages to be stripped of cruft and presented in a sort of reader mode, at least in mobile versions of the browser. This is achieved through a module called dom distiller, which basically has the ability to traverse the DOM tree of a web page and return a clean DOM tree with only the important contents of the page. This module is based on the algorithms and heuristics of a project called boilerpipe, with some of it also coming from the now popular Readability. Our goal, then, was to integrate the DOM distiller with the modules in Chromium that take care of generating the document that is sent to both the print preview and the printing service, as well as making this feature available in the printing UI.

After a couple of months of work and thanks to the kind code reviews of the folks at Google, we got the feature landed in Chromium's repository. For a while, though, it remained hidden behind a runtime flag, as the Chromium team needed to make sure that things would work well enough in all fronts before making it available to all users. Fast-forward to last week, when I found out by chance that the runtime flag has been flipped and the Simplify page printing option has been available in Chromium and Chrome for a while now, and it has even reached the stable releases. The reader mode feature in Chromium seems to remain hidden behind a runtime flag, I think, which is interesting considering that this was the original motivation behind the dom distiller.

As a side note, it is worth mentioning that the collaboration with HP was pretty neat and it's a good example of the ways in which Igalia can help organizations to improve the web experience of users. From the standards that define the web to the browsers that people use in their everyday life, there are plenty of areas in which work needs to be done to make the web a more pleasant place, for web developers and users alike. If your organization relies on the web to reach its users, or to enable them to make use of your technologies, chances are that there are areas in which their experience can be improved and that's one of the things we love doing.

February 08, 2016 09:01 AM

February 04, 2016

Andy Wingo

guile compiler tasks

Hey! We released Guile 2.1.2, including the unboxing work, and we fixed the slow bootstrap problem by shipping pre-built bootstraps in tarballs. A pretty OK solution in my opinion; check it out!

future work

At this point I think I'm happy with Guile's compiler and VM, enough for now. There is a lot more work to do but it's a good point at which to release a stable series. There will probably be a number of additional pre-releases, but not any more significant compiler/VM work that must be done before a release.

However, I was talking with Guilers at FOSDEM last weekend and we realized that although we do a pretty good job at communicating the haps in compiler-land, we don't do a good job at sharing a roadmap or making it possible for other folks to join the hack. And indeed, it's been difficult to do so while things were changing so much: I had to get things right in my head before joining in the confusion of other people's heads.

In that spirit I'd like to share a list of improvements that it would be nice to make at some point. If you take one of these tasks, be my guest: find me on IRC (wingo on freenode) and let me know, and I'll help as I am able. You need to be somewhat independent; I'm not offering proper mentoring or anything, more like office hours: you come with the problem you are having, and I commiserate and give context/background/advice as I am able.

So with that out of the way, here's a huge list of stuff! Following this, more details on each one.

  1. stripping binaries

  2. full source in binaries

  3. cps in binaries

  4. linking multiple modules together

  5. linking a single executable

  6. instruction explosion

  7. elisp optimizations

  8. prompt removal

  9. basic register allocation

  10. optimal register allocation

  11. unboxed record fields

  12. textual CPS

  13. avoiding arity checks

  14. unboxed calls and returns

  15. module-level inlining

  16. cross-module inlining

As a bonus, in the end I'll give some notes on native compilation. But first, the hacks!

stripping binaries

Guile uses ELF as its object file format, and currently includes source location information as DWARF data. On space-constrained devices this might be too much. Your task: add a hack to the linker that can strip existing binaries. Read Ian Lance Taylor's linker articles for more background, if you don't know things about linkers yet.

full source in binaries

Wouldn't it be nice if the ELF files that Guile generates actually included the source as well as the line numbers? We could do that, in a separate strippable ELF section. This point is like the reverse of the previous point :)

cps in binaries

We could also include the CPS IR in ELF files too. This would enable some kinds of link-time optimization and cross-module inlining. You'd need to define a binary format for CPS, like LLVM bitcode or so. Neat stuff :)

linking multiple modules together

Currently in Guile, just about every module is a separate .go file. Loading a module will cause a few stat calls and some seeks and reads and all that. Wouldn't it be nice if you could link together all the .go files that were commonly used into one object? Again this is a linker hack, but it needs support from the run-time as well: when the run-time goes to load a file, it should first check in a registry if that file has been logically provided by some other file. We'd be able to de-duplicate constant data from various modules. However there is an initialization phase when loading a .go file which effectively performs all the relocations needed by constants that need a fix-up at load-time; see the ELF article I linked to above for more. For some uses, it would be OK to produce one relocation/initialization procedure. For others, if you expected to only load a fraction of the modules in a .go file, it would be a lose on startup time, so you would probably need to support lazy relocation when a module is first loaded.

Anyway, your task would be to write a linker hack that loads a bunch of .go files, finds the relocations in them, de-duplicates the constants, and writes out a combined .go file that includes a table of files contained in it. Good luck :) This hack would work great for Emacs, where it's effectively a form of unexec that doesn't actually rely on unexec.

linking a single executable

In the previous task, you could end up with the small guile binary that links to libguile (or your binary linking to libguile), and then a .go file containing all the modules you are interested in. It sure would be nice to be able to link those together into just one binary, or at least to link the .go into the Guile binary. If Guile itself is statically linked, you would have a statically linked application. If it's dynamically linked, it would remain dynamically linked. Again, a linker hack, but one that could provide a nicer way to distribute Guile binaries.

instruction explosion

Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do this, which would be a lose overall because you have more instruction dispatch.

Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.
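As a rough sketch of the idea (instruction names other than vector-ref and br-if-tc7 are illustrative, not the actual VM opcodes), exploding vector-ref might look like:

;; before: one high-level, checked instruction
(vector-ref dst v idx)

;; after: the checks become explicit control flow, and the access
;; itself becomes a primitive unchecked memory reference
(br-if-not-tc7 v tc7-vector -> throw)  ;; type check
(vector-length len v)                  ;; fetch the length
(br-if-u64-<= len idx -> throw)        ;; bounds check: throw if idx >= len
(unchecked-vector-ref dst v idx)       ;; raw memory reference

The point is that once the checks are ordinary branches, the optimizer can hoist or eliminate them like any other control flow, leaving only the raw access in the common case.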

elisp optimizations

Guile implements Emacs Lisp, and does so well. However it hasn't been the focus of a lot of optimization. Emacs has a lot of stuff going on on its side, and so do we, so we haven't managed to replace the Elisp interpreter in Emacs with one written in Guile, though Robin Templeton has brought us a long way forward. We need someone to do the integration work, but also to poke the compiler and make sure it's a clear win.

prompt removal

It's pretty natural to use delimited continuations when compiling some kind of construct that includes a break statement to Guile, whether that compiler is part of Elisp or just implemented as a Scheme macro. But, many instances of prompts can be contified, resulting in no overhead at run-time. Read up on contification and contify the hell out of some prompts!
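For example, a break in a compiled loop might expand to something like the following; call-with-prompt and abort-to-prompt are the real Guile delimited-continuation operators, but the expansion itself is just a sketch. If the compiler can see that the captured continuation k is never used, the prompt is a candidate for contification:

(call-with-prompt 'break
  (lambda ()
    (let lp ((i 0))
      (when (= i 10)
        (abort-to-prompt 'break))  ;; the compiled `break' statement
      (lp (1+ i))))
  (lambda (k) #f))                 ;; handler: just continue after the loop

Since the handler ignores k and every abort-to-prompt site is visible, the whole thing can in principle become a plain jump out of the loop.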

basic register allocation

Guile usually tries its best to be safe-for-space: only the data which might be used in the future of a program is kept alive, and the rest is available for garbage collection. Notably, this applies to function arguments, temporaries, and lexical variables: if a value is dead, the GC can collect it and re-use its space. However this isn't always what you want. Sometimes you might want all variables that are in scope to be available, for better debugging. Your task would be to implement a "slot allocator" (which is really register allocation) that keeps values alive in the parts of the programs that they dominate.

optimal register allocation

On the other hand, our slot allocator -- which is basically register allocation, but for stack slots -- isn't so great. It does OK but you can often end up shuffling values in a loop, which is the worst. Your task would be to implement a proper register allocator: puzzle-solving, graph-coloring, iterative coalescing, something that really tries to do a good job. Good luck!

unboxed record fields

Guile's "structs", on which records are implemented, support unboxed values, but these values are untyped, not really integrated with the record layer, and always boxed in the VM. Your task would be to design a language facility that allows us to declare records with typed fields, and to store unboxed values in those fields, and to cause access to their values to emit boxing/unboxing instructions around them. The optimizer will get rid of those boxing/unboxing instructions if it can. Good luck!

textual CPS

The CPS language is key to all compiler work in Guile, but it doesn't have a nice textual form like LLVM IR does. Design one, and implement a parser and an unparser!

avoiding arity checks

If you know the procedure you are calling, like if it's lexically visible, then if you are calling it with the right number of arguments you can skip past the argument check and instead do a call-label directly into the body. Would be pretty neat!

unboxed calls and returns

Likewise if a function's callers are all known, it might be able to unbox its arguments or return value, if that's a good idea. Tricky! You could start with a type inference pass or so, and maybe that could produce some good debugging feedback too.

module-level inlining

Guile currently doesn't inline anything that's not lexically visible. Unfortunately this restriction extends to top-level definitions in a module: they are treated as mutable and so never inlined/optimized/etc. Probably we need to change the semantics here such that a module can be compiled as a unit, and all values which are never mutated can be assumed to be constant. Probably you also want a knob to turn off this behavior, but really you can always re-compile and re-load a module as a whole if re-loading a function at run-time doesn't work because it was inlined. Anyway. Some semantic work here, but some peval work as well. Be careful!

cross-module inlining

Likewise Guile currently doesn't inline definitions from other modules. However for small functions this really hurts. Guile should probably serialize tree-il for small definitions in .go files, and allow peval to speculatively inline imported definitions. This is related to the previous point and has some semantic implications.

bobobobobobonus! native compilation

Thinking realistically, native compilation is the next step. We have the object file format, cool. We will need the ability to call out from machine code in .go files to run-time functions, so we need to enhance the linker, possibly even with things like PLT/GOT sections to avoid dirtying too many pages. We need to lower the CPS even further, to get closer to some kind of machine model, then go specific, with an assembler for each architecture. The priority in the beginning will be simplicity and minimal complexity; good codegen will come later. This is obviously the most attractive thing but it's also the most tricky, design-wise. I want to do at least part of this, so though you can't have it all, you are welcome to help :)

That's it for now. I'll amend the post with more things as and when I think of them. Comments welcome too, as always. Happy hacking!

by Andy Wingo at February 04, 2016 09:38 PM

Claudio Saavedra

Thu 2016/Feb/04

We've opened a few positions for developers in the fields of multimedia, networking, and compilers. I could say a lot about why working at Igalia is way different from working at your average tech company or start-up, but I think the way it's summarized in the announcements is pretty good. Have a look at them if you are curious and don't hesitate to apply!

February 04, 2016 12:53 PM

February 03, 2016

Michael Catanzaro

On Subresource Certificate Validation

Ryan Castellucci has a quick read on subresource certificate validation. It is accurate; I fixed this shortly after joining Igalia. (Update: This was actually in response to a bug report from him.) Run his test to see if your browser is vulnerable.

Epiphany, Xombrero, Opera Mini and Midori […] were loading subresources, such as scripts, from HTTPS servers without doing proper certificate validation. […] Unfortunately Xombrero and Midori are still vulnerable. Xombrero seems to be dead, and I’ve gotten no response from them. I’ve been in touch with Midori, but they say they don’t have the resources to fix it, since it would require rewriting large portions of the code base in order to be able to use the fixed webkit.

I reported this to the Midori developers in late 2014 (private bug). It’s hard to overstate how bad this is: it makes HTTPS completely worthless, because an attacker can silently modify JavaScript loaded via subresources.

This is actually a unique case in that it’s a security problem that was fixed only thanks to the great API break, which has otherwise been the cause of many security problems. Thanks to the API break, we were able to make the new API secure by default without breaking any existing applications. (But this does no good for applications unable to upgrade.)

(A note to folks who read Ryan’s post: most mainstream browsers do silently block invalid certificates, but Safari will warn instead. I’m not sure which behavior I prefer.)

by Michael Catanzaro at February 03, 2016 08:36 PM

February 01, 2016

Michael Catanzaro

On WebKit Security Updates

Linux distributions have a problem with WebKit security.

Major desktop browsers push automatic security updates directly to users on a regular basis, so most users don’t have to worry about security updates. But Linux users are dependent on their distributions to release updates. Apple fixed over 100 vulnerabilities in WebKit last year, so getting updates out to users is critical.

This is the story of how that process has gone wrong for WebKit.

Before we get started, a few disclaimers. I want to be crystal clear about these points:

  1. This post does not apply to WebKit as used in Apple products. Apple products receive regular security updates.
  2. WebKitGTK+ releases regular security updates upstream. It is safe to use so long as you apply the updates.
  3. The opinions expressed in this post are my own, not my employer’s, and not the WebKit project’s.

Browser Security in a Nutshell

Web engines are full of security vulnerabilities, like buffer overflows, null pointer dereferences, and use-after-frees. The details don’t matter; what’s important is that skilled attackers can turn these vulnerabilities into exploits, using carefully-crafted HTML to gain total control of your user account on your computer (or your phone). They can then install malware, read all the files in your home directory, use your computer in a botnet to attack websites, and do basically whatever they want with it.

If the web engine is sandboxed, then a second type of attack, called a sandbox escape, is needed. This makes it dramatically more difficult to exploit vulnerabilities. Chromium has a top-class Linux sandbox. WebKit does have a Linux sandbox, but it’s not any good, so it’s (rightly) disabled by default. Firefox does not have a sandbox due to major architectural limitations (which Mozilla is working on).

For this blog post, it’s enough to know that attackers use crafted input to exploit vulnerabilities to gain control of your computer. This is why it’s not a good idea to browse to dodgy web pages. It also explains how a malicious email can gain control of your computer. Modern email clients render HTML mail using web engines, so malicious emails exploit many of the same vulnerabilities that a malicious web page might. This is one reason why good email clients block all images by default: image rendering, like HTML rendering, is full of security vulnerabilities. (Another reason is that images hosted remotely can be used to determine when you read the email, violating your privacy.)

WebKit Ports

To understand WebKit security, you have to understand the concept of WebKit ports, because different ports handle security updates differently.

While most code in WebKit is cross-platform, there’s a large amount of platform-specific code as well, to improve the user and developer experience in different environments. Different “ports” run different platform-specific code. This is why two WebKit-based browsers, say, Safari and Epiphany (GNOME Web), can display the same page slightly differently: they’re using different WebKit ports.

Currently, the WebKit project consists of six different ports: one for Mac, one for iOS, two for Windows (Apple Windows and WinCairo), and two for Linux (WebKitGTK+ and WebKitEFL). There are some downstream ports as well; unlike the aforementioned ports, downstream ports are, well, downstream, and not part of the WebKit project. The only one that matters for Linux users is QtWebKit.

If you use Safari, you’re using the Mac or iOS port. These ports get frequent security updates from Apple to plug vulnerabilities, which users receive via regular updates.

Everything else is broken.

Since WebKit is not a system library on Windows, Windows applications must bundle WebKit, so each application using WebKit must be updated individually, and updates are completely dependent on the application developers. iTunes, which uses the Apple Windows port, does get regular updates from Apple, but beyond that, I suspect most applications never get any security updates. This is a predictable result, the natural consequence of environments that require bundling libraries.

(This explains why iOS developers are required to use the system WebKit rather than bundling their own: Apple knows that app developers will not provide security updates on their own, so this policy ensures every iOS application rendering HTML gets regular WebKit security updates. Even Firefox and Chrome on iOS are required to use the system WebKit; they’re hardly really Firefox or Chrome at all.)

The same scenario applies to the WinCairo port, except this port does not have releases or security updates. Whereas the Apple ports have stable branches with security updates, with WinCairo, companies take a snapshot of WebKit trunk, make their own changes, and ship products with that. Who’s using WinCairo? Probably lots of companies; the biggest one I’m aware of uses a WinCairo-based port in its AAA video games. It’s safe to assume few to no companies are handling security backports for their downstream WinCairo branches.

Now, on to the Linux ports. WebKitEFL is the WebKit port for the Enlightenment Foundation Libraries. It’s not going to be found in mainstream Linux distributions; it’s mostly used in embedded devices produced by one major vendor. If you know anything at all about the internet of things, you know these devices never get security updates, or if they do, the updates are superficial (updating only some vulnerable components and not others), or end a couple months after the product is purchased. WebKitEFL does not bother with pretense here: like WinCairo, it has never had security updates. And again, it’s safe to assume few to no companies are handling security backports for their downstream branches.

None of the above ports matter for most Linux users. The ports available on mainstream Linux distributions are QtWebKit and WebKitGTK+. Most of this post will focus on WebKitGTK+, since that’s the port I work on, and the port that matters most to most of the people reading this, but QtWebKit is widely-used and deserves some attention first.

It’s broken, too.


QtWebKit is the WebKit port used by Qt software, most notably KDE. Some cherry-picked examples of popular applications using QtWebKit are Amarok, Calligra, KDevelop, KMail, Kontact, KTorrent, Quassel, Rekonq, and Tomahawk. QtWebKit provides an excellent Qt API, so in the past it’s been the clear best web engine to use for Qt applications.

After Google forked WebKit, the QtWebKit developers announced they were switching to work on QtWebEngine, which is based on Chromium, instead. This quickly led to the removal of QtWebKit from the WebKit project. This was good for the developers of other WebKit ports, since lots of Qt-specific code was removed, but it was terrible for KDE and other QtWebKit users. QtWebKit is still maintained in Qt and is getting some backports, but from a quick check of their git repository it’s obvious that it’s not receiving many security updates. This is hardly unexpected; QtWebKit is now years behind upstream, so providing security updates would be very difficult. There’s not much hope left for QtWebKit; these applications have hundreds of known vulnerabilities that will never be fixed. Applications should port to QtWebEngine, but for many applications this may not be easy or even possible.

Update: As pointed out in the comments, there is some effort to update QtWebKit. I was aware of this and in retrospect should have mentioned this in the original version of this article, because it is relevant. Keep an eye out for this; I am not confident it will make its way into upstream Qt, but if it does, this problem could be solved.


WebKitGTK+ is the port used by GTK+ software. It’s most strongly associated with its flagship browser, Epiphany, but it’s also used in other places. Some of the more notable users include Anjuta, Banshee, Bijiben (GNOME Notes), Devhelp, Empathy, Evolution, Geany, Geary, GIMP, gitg, GNOME Builder, GNOME Documents, GNOME Initial Setup, GNOME Online Accounts, GnuCash, gThumb, Liferea, Midori, Rhythmbox, Shotwell, Sushi, and Yelp (GNOME Help). In short, it’s kind of important, not only for GNOME but also for Ubuntu and Elementary. Just as QtWebKit used to be the web engine of choice for Qt applications, WebKitGTK+ is the clear choice for GTK+ applications due to its nice GObject APIs.

Historically, WebKitGTK+ has not had security updates. Of course, we released updates with security fixes, but not with CVE identifiers, which is how software developers track security issues; as far as distributors are concerned, without a CVE identifier, there is no security issue, and so, with a few exceptions, distributions did not release our updates to users. For many applications, this is not so bad, but for high-risk applications like web browsers and email clients, it’s a huge problem.

So, we’re trying to improve. Early last year, my colleagues put together our first real security advisory with CVE identifiers; the hope was that this would encourage distributors to take our updates. This required data provided by Apple to WebKit security team members on which bugs correspond to which CVEs, allowing the correlation of Bugzilla IDs to Subversion revisions to determine in which WebKitGTK+ release an issue has been fixed. That data is critical, because without it, there’s no way to know if an issue has been fixed in a particular release or not. After we released this first advisory, Apple stopped providing the data; this was probably just a coincidence due to some unrelated internal changes at Apple, but it certainly threw a wrench in our plans for further security advisories.

This changed in November, when I had the pleasure of attending the WebKit Contributors Meeting at Apple’s headquarters, where I was finally able to meet many of the developers I had interacted with online. At the event, I gave a presentation on our predicament, and asked Apple to give us information on which Bugzilla bugs correspond to which CVEs. Apple kindly provided the necessary data a few weeks later.

During the Web Engines Hackfest, a yearly event that occurs at Igalia’s office in A Coruña, my colleagues used this data to put together WebKitGTK+ Security Advisory WSA-2015-0002, a list of over 130 vulnerabilities disclosed since the first advisory. (The Web Engines Hackfest was sponsored by Igalia, my employer, and by our friends at Collabora. I’m supposed to include their logos here to advertise how cool it is that they support the hackfest, but given all the doom and gloom in this post, I decided perhaps they would prefer not to have their logos attached to it.)

Note that 130 vulnerabilities is an overcount, as it includes some issues that are specific to the Apple ports. (In the future, we’ll try to filter these out.) Only one of the issues — a serious error in the networking backend shared by WebKitGTK+ and WebKitEFL — resided in platform-specific code; the rest of the issues affecting WebKitGTK+ were all cross-platform issues. This is probably partly because the trickiest code is cross-platform code, and partly because security researchers focus on Apple’s ports.

Anyway, we posted WSA-2015-0002 to the oss-security mailing list to make sure distributors would notice, crossed our fingers, and hoped that distributors would take the advisory seriously. That was one month ago.

Distribution Updates

There are basically three different approaches distributions can take to software updates. The first approach is to update to the latest stable upstream version as soon as, or shortly after, it’s released. This is the strategy employed by Arch Linux. Arch does not provide any security support per se; it’s not necessary, so long as upstream projects release real updates for security problems and not simply patches. Accordingly, Arch almost always has the latest version of WebKitGTK+.

The second main approach, used by Fedora, is to provide only stable release updates. This is more cautious, reflecting that big updates can break things, so they should only occur when upgrading to a new version of the operating system. For instance, Fedora 22 shipped with WebKitGTK+ 2.8, so it would release updates to new 2.8.x versions, but not to WebKitGTK+ 2.10.x versions.

The third approach, followed by most distributions, is to take version upgrades only rarely, or not at all. For smaller distributions this may be an issue of manpower, but for major distributions it’s a matter of avoiding regressions in stable releases. Holding back on version updates actually works well for most software. When security problems arise, distribution maintainers for major distributions backport fixes and release updates. The problem is that this is not feasible for web engines; due to the huge volume of vulnerabilities that need to be fixed, security issues can only practically be handled upstream.

So what’s happened since WSA-2015-0002 was released? Did it convince distributions to take WebKitGTK+ security seriously? Hardly. Fedora is the only distribution that has made any changes in response to WSA-2015-0002, and that’s because I’m one of the Fedora maintainers. (I’m pleased to announce that we have a 2.10.7 update headed to both Fedora 23 and Fedora 22 right now. In the future, we plan to release the latest stable version of WebKitGTK+ as an update to all supported versions of Fedora shortly after it’s released upstream.)


Ubuntu releases WebKitGTK+ updates somewhat inconsistently. For instance, Ubuntu 14.04 came with WebKitGTK+ 2.4.0. 2.4.8 is available via updates, but even though 2.4.9 was released upstream over eight months ago, it has not yet been released as an update for Ubuntu 14.04.

By comparison, Ubuntu 15.10 (the latest release) shipped with WebKitGTK+ 2.8.5, which has never been updated; it’s affected by about 40 vulnerabilities fixed in the latest upstream release. Ubuntu organizes its software into various repositories, and provides security support only to software in the main repository. This version of WebKitGTK+ is in Ubuntu’s “universe” repository, not in main, so it is excluded from security support. Ubuntu users might be surprised to learn that a large portion of Ubuntu software is in universe and therefore excluded from security support; this is in contrast to almost all other distributions, which typically provide security updates for all the software they ship.

I’m calling out Ubuntu here not because it is especially negligent, but simply because it is our biggest distributor. It’s not doing any worse than most of our other distributors.


Debian provides WebKit updates to users running unstable, and to testing except during freeze periods, but not to released versions of Debian. Debian is unique in that it has a formal policy on WebKit updates. Here it is, reproduced in full:

Debian 8 includes several browser engines which are affected by a steady stream of security vulnerabilities. The high rate of vulnerabilities and partial lack of upstream support in the form of long term branches make it very difficult to support these browsers with backported security fixes. Additionally, library interdependencies make it impossible to update to newer upstream releases. Therefore, browsers built upon the webkit, qtwebkit and khtml engines are included in Jessie, but not covered by security support. These browsers should not be used against untrusted websites.

For general web browser use we recommend Iceweasel or Chromium.

Chromium – while built upon the Webkit codebase – is a leaf package, which will be kept up-to-date by rebuilding the current Chromium releases for stable. Iceweasel and Icedove will also be kept up-to-date by rebuilding the current ESR releases for stable.

(Iceweasel and Icedove are Debian’s de-branded versions of Firefox and Thunderbird, the product of an old trademark spat with Mozilla.)

Debian is correct that we do not provide long term support branches, as it would be very difficult to backport security fixes. But it is not correct that “library interdependencies make it impossible to update to newer upstream releases.” This might have been true in the past, but for several years now, we have avoided requiring new versions of libraries whenever it would cause problems for distributions, and — with one big exception that I will discuss below — we ensure that each release maintains both API and ABI compatibility. (Distribution maintainers should feel free to get in touch if we accidentally introduce some compatibility issue for your distribution; if you’re having trouble taking our updates, we want to help. I recently worked with openSUSE to make sure WebKitGTK+ can still be compiled with GCC 4.8, for example.)

The risk in releasing updates is that WebKitGTK+ is not a leaf package: a bad update could break some application. This seems to me like a good reason for application maintainers to carefully test the updates, rather than a reason to withhold security updates from users, but it’s true there is some risk here. One possible solution would be to have two different WebKitGTK+ packages, say, webkitgtk-secure, which would receive updates and be used by high-risk software like web browsers and email clients, and a second webkitgtk-stable package that would not receive updates to reduce regression potential.

Recommended Distributions

We regularly receive bug reports from users with very old versions of WebKit, who trust their distributors to handle security for them and might not even realize they are running ancient, unsafe versions of WebKit. I strongly recommend using a distribution that releases WebKitGTK+ updates shortly after they’re released upstream. That is currently only Arch and Fedora. (You can also safely use WebKitGTK+ in Debian testing — except during its long freeze periods — and Debian unstable, and maybe also in openSUSE Tumbleweed, and (update) also in Gentoo testing. Just be aware that the stable releases of these distributions are currently not receiving our security updates.) I would like to add more distributions to this list, but I’m currently not aware of any more that qualify.

The Great API Break

So, if only distributions would ship the latest release of WebKitGTK+, then everything would be good, right? Nope, because of a large API change that occurred two and a half years ago, called WebKit2.

WebKit (an API layer within the WebKit project) and WebKit2 are two separate APIs around WebCore. WebCore is the portion of the WebKit project that Google forked into Blink; it’s too low-level to be used directly by applications, so it’s wrapped by the nicer WebKit and WebKit2 APIs. The difference between the WebKit and WebKit2 APIs is that WebKit2 splits work into multiple secondary processes. Aside from the UI process, an application will have one or many separate web processes (for the actual page rendering), possibly a separate network process, and possibly a database process for IndexedDB. This is good for security, because it allows the secondary processes to be sandboxed: the web process is the one that’s likely to be compromised first, so it should not have the ability to access the filesystem or the network. (Remember, though, that there is no Linux sandbox yet, so this is currently only a theoretical benefit.) The other main benefit is robustness. If a web site crashes the renderer, only a single web process crashes (corresponding to one tab in Epiphany), not the entire browser. UI process crashes are comparatively rare.

Intermission: Certificate Verification

Another advantage provided by the API change is the opportunity to handle HTTPS connections more securely. In the original WebKitGTK+ API, applications must handle certificate verification on their own. This was a serious mistake; predictably, applications performed no verification at all, or did so improperly. For instance, take this Shotwell bug which is not fixed in any released version of Shotwell, or this Banshee bug which is still open. Probably many more applications are affected, because I have not done a comprehensive check. The new API is secure by default; applications can ignore verification errors, but only if they go out of their way to do so.
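"Secure by default" here means the safe behavior requires no code at all, and the unsafe behavior requires deliberate opt-out. Python's ssl module follows the same principle, which makes for a compact illustration of the design the new WebKitGTK+ API adopted (this is Python, not the WebKitGTK+ API itself):

```python
import ssl

# The default context verifies both the certificate chain and the
# hostname -- an application that does nothing special is safe.
ctx = ssl.create_default_context()
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname is True

# Ignoring verification errors takes two explicit steps, in this order --
# the equivalent of an application "going out of its way" to be insecure.
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
```

Contrast this with the original WebKitGTK+ API, where the safe behavior was the one that required extra code, and so, predictably, nobody wrote it.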

Remember that even though WebKitGTK+ 2.4.9 was released upstream over eight months ago, Ubuntu 14.04 is still on 2.4.8? It’s worth mentioning that 2.4.9 contains the fix for that serious networking backend issue I mentioned earlier (CVE-2015-2330). The bug is that TLS certificate verification was not performed until an HTTP response was received from the server; it’s supposed to be performed before sending an HTTP request, to prevent secure cookies from leaking. This is a disaster, as attackers can easily use it to get your session cookie and then control your user account on most websites. (Credit to Ross Lagerwall for reporting that issue.) We reported this separately to oss-security due to its severity, but that was not enough to convince distributions to update. But most applications in Ubuntu 14.04, including Epiphany and Midori, would not even benefit from this fix, because the change only affects WebKit2; remember, there’s no certificate verification in the original WebKitGTK+ API. (Modern versions of Epiphany do use WebKit2, but not the old version included in Ubuntu 14.04.) Old versions of Epiphany and Midori load pages even if certificate verification fails; the verification result is only used to change the status of a security indicator, basically giving up your session cookies to attackers.
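The ordering bug is easy to state in code. This is a sketch with made-up names, not WebKit code: the point is only that certificate verification must complete before any application data, such as a request carrying secure cookies, is written.

```python
def https_request(sock, verify, send, recv):
    # The TLS handshake, including certificate verification, must finish
    # before the request is sent; CVE-2015-2330 effectively inverted this.
    if not verify(sock):
        raise ConnectionError("certificate verification failed")
    send(sock, b"GET / HTTP/1.1\r\nCookie: session=secret\r\n\r\n")
    return recv(sock)

# With stubbed-out I/O we can check that nothing is ever sent to a
# server whose certificate fails verification.
log = []
verify = lambda sock: (log.append("verify"), sock["trusted"])[1]
send = lambda sock, data: log.append("send")
recv = lambda sock: (log.append("recv"), b"HTTP/1.1 200 OK")[1]

try:
    https_request({"trusted": False}, verify, send, recv)
except ConnectionError:
    pass
assert log == ["verify"]  # the cookie never left the client
```

With the buggy ordering, the cookie-bearing request goes out before the attacker's forged certificate is rejected, which is exactly how the session cookie leaks.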

Removing WebKit1

WebKit2 has been around for Mac and iOS for longer, but the first stable release for WebKitGTK+ was the appropriately-versioned WebKitGTK+ 2.0, in March 2013. This release actually contained three different APIs: webkitgtk-1.0, webkitgtk-3.0, and webkit2gtk-3.0. webkitgtk-1.0 was the original API, used by GTK+ 2 applications. webkitgtk-3.0 was the same thing for GTK+ 3 applications, and webkit2gtk-3.0 was the new WebKit2 API, available only for GTK+ 3 applications.

Maybe it should have remained that way.

But, since the original API was a maintenance burden and not as stable or robust as WebKit2, it was deleted after the WebKitGTK+ 2.4 release in March 2014. Applications had had a full year to upgrade; surely that was long enough, right? The original WebKit API layer is still maintained for the Mac, iOS, and Windows ports, but the GTK+ API for it is long gone. WebKitGTK+ 2.6 (September 2014) was released with only one API, webkit2gtk-4.0, which was basically the same as webkit2gtk-3.0 except for a couple small fixes; most applications were able to upgrade by simply changing the version number. Since then, we have maintained API and ABI compatibility for webkit2gtk-4.0, and intend to do so indefinitely, hopefully until GTK+ 4.0.

A lot of good that does for applications using the API that was removed.

WebKit2 Adoption

While upgrading to the WebKit2 API will be easy for most applications (it took me ten minutes to upgrade GNOME Initial Setup), for many others it will be a significant challenge. Since rendering occurs out of process in WebKit2, the DOM API can only be accessed by means of a shared object injected into the web process. For applications that perform only a small amount of DOM manipulation, this is a minor inconvenience compared to the old API. For applications that use extensive DOM manipulation — the email clients Evolution and Geary, for instance — it’s not just an inconvenience, but a major undertaking to upgrade to the new API. Worse, some applications (including both Geary and Evolution) placed GTK+ widgets inside the web view; this is no longer possible, so such widgets need to be rewritten using HTML5. To say nothing of applications like GIMP and Geany that are stuck on GTK+ 2. They first have to upgrade to GTK+ 3 before they can consider upgrading to modern WebKitGTK+. GIMP is working on a GTK+ 3 port anyway (GIMP uses WebKitGTK+ for its help browser), but many applications like Geany (the IDE, not to be confused with Geary) are content to remain on GTK+ 2 forever. Such applications are out of luck.

As you might expect, most applications are still using the old API. How does this work if it was already deleted? Distributions maintain separate packages, one for old WebKitGTK+ 2.4, and one for modern WebKitGTK+. WebKitGTK+ 2.4 has not had any updates since last May, and the last real comprehensive security update was over one year ago. Since then, almost 130 vulnerabilities have been fixed in newer versions of WebKitGTK+. But since distributions continue to ship the old version, few applications are even thinking about upgrading. In the case of the email clients, the Evolution developers are hoping to upgrade later this year, but Geary is completely dead upstream and probably will never be upgraded. How comfortable are you with using an email client that has now had no security updates for a year?

(It’s possible there might be a further 2.4 release, because WebKitGTK+ 2.4 is incompatible with GTK+ 3.20, but maybe not, and if there is, it certainly will not include many security fixes.)

Fixing Things

How do we fix this? Well, for applications using modern WebKitGTK+, it’s a simple problem: distributions simply have to start taking our security updates.

For applications stuck on WebKitGTK+ 2.4, I see a few different options:

  1. We could attempt to provide security backports to WebKitGTK+ 2.4. This would be very time consuming and therefore very expensive, so count this out.
  2. We could resurrect the original webkitgtk-1.0 and webkitgtk-3.0 APIs. Again, this is not likely to happen; it would be a lot of work to restore them, and they were removed to reduce maintenance burden in the first place. (I can’t help but feel that removing them may have been a mistake, but my colleagues reasonably disagree.)
  3. Major distributions could remove the old WebKitGTK+ compatibility packages. That will force applications to upgrade, but many will not have the manpower to do so: good applications will be lost. This is probably the only realistic way to fix the security problem, but it’s a very unfortunate one. (But don’t forget about QtWebKit. QtWebKit is based on an even older version of WebKit than WebKitGTK+ 2.4. It doesn’t make much sense to allow one insecure version of WebKit but not another.)

Or, a far more likely possibility: we could do nothing, and keep using insecure software.

by Michael Catanzaro at February 01, 2016 10:40 PM

Xabier Rodríguez Calvar

Web Engines Hackfest according to me

And once again, in December we celebrated the hackfest. This year it took place December 7-9 at the Igalia premises, and the scope was much broader than WebKitGTK+, which is why it was renamed the Web Engines Hackfest. We wanted to gather people working on all open source web engines, and we succeeded: we had people working on WebKit, Chromium/Blink and Servo.

At the previous edition I worked with Youenn Fablet (from Canon) on the Streams API implementation in WebKit, and we spent our time on the same thing again. We have to say that things are much more mature now. During the hackfest we spent our time fixing the JavaScriptCore built-ins inside WebCore, and we advanced the automatic importation of the specification's web platform tests, which are based on our earlier test implementation. Since they are now managed there, it does not make sense to maintain them inside WebKit too; we just import them. I must say that our implementation is fairly complete, since we support the current version of the spec and have almost all tests passing, including ReadableStream, WritableStream and the built-in strategy classes. What is missing now is making Streams work together with other APIs, such as Media Source Extensions, Fetch or XMLHttpRequest.

There were some talks during the hackfest and we did not want to be less, so we had our own about Streams. You can enjoy it here:

You can see all hackfest talks in this YouTube playlist. The ones I liked most were the one by Michael Catanzaro about HTTP security, which is always interesting given the current clumsy political movements against cryptography, and the one by Dominik Röttsches about font rendering. It is really amazing what a browser has to do just to get some letters painted on the screen (and look good).

As usual, the environment was amazing and we had a great time, including the traditional Street Fighter match, where Gustavo found a worthy challenger in Changseok :)

Of course, I would like to thank Collabora and Igalia for sponsoring the event!

And by the way, quite shortly after that, I became a WebKit reviewer!

by calvaris at February 01, 2016 11:06 AM

January 31, 2016

Manuel Rego

Deep Dive into Grid Layout Placement

During the last months, as part of my work at Igalia, I’ve been focused on finishing the new/missing bits of Blink’s CSS Grid Layout implementation related to item placement. In summary, the task was mostly about two things:

  • Support implicit grid before explicit. So the grid can add tracks on demand not only on the growing direction (usually right/bottom) but also on the opposite one.

  • Fix unknown named grid lines resolution. This is the case when an item is placed in a line called “foo”, but no lines with that name exist in the grid.

These might seem quite simple tasks, but they implied quite a lot of changes in the underlying implementation. I ended up refactoring all the code to resolve grid positions in order to complete them. I even wrote a document explaining the whole plan and all the technical details, where you can find links to all the related patches.

Now that we’ve finished this, it’s time to explain how you can use it. Although my colleague Sergio already wrote about this in 2014, the specification has changed since that time, so I think it’s better to try to explain the whole thing from scratch. This post is a kind of summary, with examples, of the “Placing Grid Items” section of the CSS Grid Layout spec.

Grid lines

This is probably one of the most important concepts of the grid layout spec. Grid lines are the lines dividing a grid horizontally and vertically, and they’re numbered, starting at 1.

Let’s use a 3x2 grid as example to explain how this works:

<div style="display: grid;
            grid-template-columns: 300px 200px 100px;
            grid-template-rows: 100px 50px;">

Numbered grid lines example

Grid placement properties

In order to position the items inside the grid container, you need to use the grid placement properties. These properties are:

  • grid-column-start: Set the first vertical line for the item.
  • grid-column-end: Set the last vertical line for the item.
  • grid-row-start: Set the first horizontal line for the item.
  • grid-row-end: Set the last horizontal line for the item.

With these properties you define the area where the grid item will be placed. In order to do that, you use the line numbers.

The initial value for these properties is auto, which makes it possible for items to be placed automatically, looking for empty cells inside the grid. For more information about this, please review a previous post on the matter.
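For instance, a minimal sketch (hypothetical markup, reusing the 3x2 grid defined above): three items with all placement properties left as auto end up in the first three empty cells of the first row, in order.

```html
<div style="display: grid;
            grid-template-columns: 300px 200px 100px;
            grid-template-rows: 100px 50px;">
  <!-- No placement properties: auto places each item in the next empty cell -->
  <div>A</div> <!-- 1st column, 1st row -->
  <div>B</div> <!-- 2nd column, 1st row -->
  <div>C</div> <!-- 3rd column, 1st row -->
</div>
```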

On top of that, there’re some handy shorthands:

  • grid-column: Shorthand for grid-column-start and grid-column-end properties.
  • grid-row: Shorthand for grid-row-start and grid-row-end properties.
  • grid-area: Shorthand to set the 4 placement properties in just one declaration.
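For example (a hypothetical declaration; per the spec, grid-area takes line-based values in row-start / column-start / row-end / column-end order):

```html
<!-- Equivalent to grid-row: 1 / 2; grid-column: 2 / 4; -->
<div style="grid-area: 1 / 2 / 2 / 4;"></div>
```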

So, imagine that you add the following item in the previous grid:

<div style="grid-column-start: 2; grid-column-end: 4;
  grid-row-start: 1; grid-row-end: 2;"></div>

Probably easier to read if you use some shorthands:

<div style="grid-column: 2 / 4; grid-row: 1 / 2;"></div>

This means that the grid item will be placed taking the 2nd and 3rd columns in the first row.

Place item using line numbers example

Cell spanning

The previous item was spanning 2 columns (the 2nd and 3rd ones), referencing the lines. You could do the same using the span keyword, together with the number of cells you want to span.

So, you could place the item in the very same position using:

<div style="grid-column: 2 / span 2; grid-row: 1;"></div>

Place item using span example

Note that here you’re not setting the end line for the row. This means that grid-row-end takes a value of auto. In this case auto defaults to a span of one.

Negative line numbers

So far we’ve only seen positive numbers, but lines also have negative indexes. Negative numbers allow you to reference the lines starting from the end of the grid.

Following with the same example, you could place the item again in the same position using the negative indexes:

<div style="grid-column: -3 / -1; grid-row: -3 / -2;"></div>

Place item using negative line numbers example

This might be really useful in some situations. For example, if you want to be sure that the item is in the last column, independently of the number of tracks, you’ll just need to set: grid-column-end: -1;.

Named grid lines

Not only that, but you can also name the grid lines, so you don’t need to remember the specific numbers to reference them.

Let’s modify the definition of the grid, keeping the size of tracks but adding names to the lines:

<div style="display: grid;
            grid-template-columns: [start] 300px [main] 200px [aside] 100px [end];
            grid-template-rows: [top] 100px [middle] 50px [bottom];">

And again if we want to place the item in the same position, we just need to reference the name of the lines:

<div style="grid-column: main / end; grid-row: top / middle;"></div>

Place item using line names example

One line can have several names; you just need to set them in the definition: grid-template-rows: [top start] 100px [middle center] 50px [bottom end];.

Also, the names of the lines can be repeated. To reference them you have to use a number, which again can be positive or negative. Let’s use a different example to showcase this:

<div style="display: grid;
            grid-template-columns: [col] 200px [gap] 50px [col] 200px [gap] 50px [col] 200px [gap];">

And imagine that you place some items like this:

<div style="grid-column: col 2;"></div>
<div style="grid-column: col 1 / gap 2;"></div>
<div style="grid-column: col -1;"></div>

Place items with repeated named grid lines example

And of course, you can span to a named grid line:

<div style="grid-column: col 2 / span 2 gap;"></div>

Place item spanning to named grid line example

Grid areas

Better still, you can define grid areas and place items directly on them. You have to use the grid-template-areas property to give names to the different areas in your grid. Then you can use the grid-area shorthand directly to place the items.

Let’s use a bigger grid (5x4) to show an example:

<div style="display: grid;
            grid-template-columns: 100px 100px 100px 100px 100px;
            grid-template-rows: 100px 100px 100px 100px;
            grid-template-areas:
                'title  title  title  title  aside'
                'nav    main   main   main   aside'
                'nav    main   main   main   aside'
                'footer footer footer footer footer';">

And position one item in each of the areas:

<div style="grid-area: title;"></div>
<div style="grid-area: nav;"></div>
<div style="grid-area: main;"></div>
<div style="grid-area: aside;"></div>
<div style="grid-area: footer;"></div>

Place items inside grid areas example

Grid areas & Named grid lines

One interesting thing about areas and placement is that grid areas create implicit names for the grid lines surrounding them. These implicit names use the “-start” and “-end” suffixes. And you can reference those lines when placing an item, instead of using the whole area.

E.g. the “title” area from previous example creates 4 implicit names for the lines (2 in each axis):

  • Left line: “title-start
  • Right line: “title-end
  • Top line: “title-start
  • Bottom line: “title-end

Following with that example you could place an item using the implicit names:

<div style="grid-column: main-start / aside-end;
            grid-row: title-start / nav-end;"></div>

Place item using implicit names from grid areas example

And the same can happen the other way around. If you name lines using those suffixes, you’re actually creating implicit areas. So, we could just create the very same grid using named grid lines:

<div style="display: grid;
            grid-template-columns: [title-start nav-start footer-start] 100px [main-start nav-end] 100px 100px 100px [aside-start title-end main-end] 100px [aside-end footer-end];
            grid-template-rows: [title-start aside-start] 100px [nav-start main-start title-end] 100px 100px [footer-start nav-end main-end aside-end] 100px [footer-end];">

All the examples of items positioned in this section will be exactly in the same place with this new grid.

Implicit grid

With the grid definition properties (grid-template-columns, grid-template-rows and grid-template-areas) you determine the explicit number of tracks (columns and rows) in your grid. However, the grid spec allows you to place items outside of the explicit grid. In order to support that, implicit tracks are created automatically; the size of these tracks is controlled by the grid-auto-columns and grid-auto-rows properties. In the following examples I’ll use red color to highlight the implicit lines.

This time let’s use a simple 2x1 grid:

<div style="display: grid;
            grid-template-columns: 200px 200px;
            grid-auto-columns: 100px;">

And imagine that you place an item in the 5th column (grid-column: 5;). As the grid only has 2 columns, 3 implicit columns will be added in order to position the item.

Implicit grid example

Again, you can also create implicit tracks with items that span several cells. For example, if an item takes 3 columns starting on the 2nd one (grid-column: 2 / span 3):

Implicit grid with span example

Originally, the implicit tracks could only be added at the end. But now it’s possible to add implicit tracks before the explicit grid. For example, if you place an item using grid-column: -5; it’ll add 2 columns on the left and it’ll be placed in the -2nd column.

Implicit grid before explicit example

Implicit grid & Named grid lines

But this is not the only way to create implicit tracks; you can also create them if you use undefined named grid lines. This is more a way to handle mistakes in the user’s CSS than a feature itself, but maybe someone finds it useful. The idea is that all the lines in the implicit grid will take any random name you might need to position an item.

The basic example is placing items referencing a nonexistent line called “foo”. For example you will create 3 implicit columns (1 before and 2 after the explicit grid) with the following items:

<div style="grid-column: foo;"></div>
<div style="grid-column: 2 / span 2 foo;"></div>
<div style="grid-column: -1 foo;"></div>

Implicit grid using undefined named grid lines example

Note that the simplest example, grid-column: foo, is placed in the 4th column (adding an extra empty column just after the explicit grid). This is because the first line considered to be called “foo” is the first implicit line (line 4), so the last line of the explicit grid (line 3) is not included.

Also, the last item, grid-column: -1 foo, is placed on the -1th column (maybe you were not expecting that). This is because you start looking for a line named “foo” from the edge of the explicit grid. So, you ignore lines -1, -2 and -3 (as they’re not called “foo”) and consider line -4 (the first line of the implicit grid) to have that name.

This is a bit trickier if the line actually exists, as you have to count it too in order to place the item. It’s especially complex if you’re using span to a named grid line but there are not enough lines. In that case, only implicit lines in the search direction are considered to have that name.

Again, hopefully an example can help to understand this. Let’s add a name to the middle line in the previous example:

<div style="display: grid;
            grid-template-columns: 200px [middle] 200px;
            grid-auto-columns: 100px;">

And now, let’s place a few items referencing that “middle” line:

<div style="grid-column: 3 middle;"></div>
<div style="grid-column: span 2 middle / 5;"></div>
<div style="grid-column: -4 / span middle;"></div>

Implicit grid using more named grid lines than available example

The strange case here is grid-column: span 2 middle / 5;: as you can see, it goes from the -1th column to the 4th column (both included). The item ends at line 5, and it has to span 2 lines called “middle” to find the start. You could think that it should count line 4 and line 2, but, as explained before, you have to start counting lines from the edge of the explicit grid. So you actually count line 2, and then you have to consider the implicit lines on the left to find the start position (line -4).

Special cases

Grid placement properties have a few special situations that are resolved by the conflict handling section of the spec.

For example, if you place an item where the end line is before the start line, both lines are swapped. Thus, something like grid-column: 5 / 2; would become grid-column: 2 / 5;.

Another situation is the one in which you have span in both the start and end positions. The span for the end position is discarded. So, grid-column: span 2 / span 3; would become grid-column: span 2;, which will use the grid placement algorithm to find an empty area (of 2 columns in this particular example) in which to position itself.

The last one is the case where you only have a span to a named grid line. In that case, it’s replaced by span 1. E.g. grid-column: span foo; will become grid-column: span 1;.

Placement special cases example


If you have read this far, it seems you’re really interested in CSS Grid Layout. The main conclusion is that the specification is really flexible regarding how to place items on the grid. As you can see, there are quite a lot of different ways to position an item in the very same place. Probably each person will get used to a few of them and just forget about the rest.

IMHO the basic stuff (line indexes, spans, line names and areas) is quite straightforward, and the implicit grid is not that hard either. Then negative indexes bring some fun. However, the behavior of undefined named grid lines is really tricky (hopefully not something you should care about anyway). But it’s true that I’m biased, as I’ve been dealing with this for a long time.

Last, I thought it would be nice to finish this with a big example which uses most of the things described in this post. If you got it right and don’t get confused at all you’re mastering grid layout placement. Congrats! 😄

This is the definition of the grid container used in this example:

<div style="display: grid;
            grid-template-columns: [left] 200px [main-start] 100px [center] 100px 100px [main-end] 50px 50px [right];
            grid-template-rows: [top title-start] 50px [title-end main-start] 200px [main-end center] 150px 100px [bottom];
            grid-auto-columns: 50px;
            grid-auto-rows: 50px;">

And in the next picture you can see how different items will be placed.

Placement big example

You can try it live in our examples repository:


As mentioned in the introduction, Blink’s implementation now supports (Chrome 50+) all these different placement possibilities. Igalia has been working on this implementation and we’re porting it now to WebKit too.

On the other side, Gecko already supports it too, in this case developed by Mozilla.

Igalia and Bloomberg working together to build a better web

Finally, as usual I’d like to highlight one more time that all this work has been done as part of a collaboration between Igalia and Bloomberg. Big thanks for your support!

January 31, 2016 11:00 PM

January 21, 2016

Andy Wingo

talks i would like to give in 2016

Every year I feel like I'm trailing things in a way: I hear of an amazing conference with fab speakers, but only after the call for submissions has closed. Or I see an event with exactly the attendees I'd like to schmooze with, but I hadn't planned for it, and hey, maybe I could have even spoken there.

But it's a new year, so let's try some new things. Here's a few talks I would love to give this year.

building languages on luajit

Over the last year or two my colleagues and I have had good experiences compiling in, on, and under LuaJIT, and putting those results into production in high-speed routers. LuaJIT has some really interesting properties as a language substrate: it has a tracing JIT that can punch through abstractions, it has pretty great performance, and it has a couple of amazing escape hatches that let you reach down to the hardware in the form of the FFI and the DynASM assembly generator. There are some challenges too. I can tell you about them :)

try guile for your next project!

This would be a talk describing Guile, what it's like making programs with it, and the kind of performance you can expect out of it. If you're a practicing programmer who likes shipping small programs that work well, are fun to write, and run with pretty good performance, I think Guile can be a great option.

I don't get to do many Guile talks because hey, it's 20 years old, so we don't get the novelty effect. Still, I judge a programming language based on what you can do with it, and recent advances in the Guile implementation have expanded its scope significantly, allowing it to handle many problem sizes that it couldn't before. This talk will be a bit about the language, a bit about the implementation, and a bit about applications or problem domains.

compiling with persistent data structures

As part of Guile's recent compiler improvements, we switched to a somewhat novel intermediate language. It's continuation-passing-style, but based on persistent data structures. Programming with it is interesting and somewhat different than other intermediate languages, and so this would be a talk describing the language and what it's like to work in it. Definitely a talk for compiler people, by a compiler person :)

a high-performance networking with luajit talk

As I mentioned above, my colleagues and I at work have been building really interesting things based on LuaJIT. In particular, using the Snabb Switch networking toolkit has let us build an implementation of a "lightweight address family translation router" -- the internet-facing component of an IPv4-as-a-service architecture, built on an IPv6-only network. Our implementation flies.

It sounds a bit specialized, and it is, but this talk could go two ways.

One version of this talk could be for software people that aren't necessarily networking specialists, describing the domain and how with Snabb Switch, LuaJIT, compilers, and commodity x86 components, we are able to get results that compete well with offerings from traditional networking vendors. Building specialized routers and other network functions in software is an incredible opportunity for compiler folks.

The other version would be more for networking people. We'd explain the domain less and focus more on architecture and results, and look more ahead to challenges of 100Gb/s ports.

let me know!

I'll probably submit some of these to a few conferences, but if you run an event and would like me to come over and give one of these talks, I would be flattered :) Maybe that set of people is empty, but hey, it's worth a shot. Probably contact via the twitters has the most likelihood of response.

There are some things you need to make sure are covered before reaching out, of course. It probably doesn't need repeating in 2016, but make sure that you have a proper code of conduct, and that you'll be able to put in the time to train your event staff to create that safe space that your attendees need. Getting a diverse speaker line-up is important to me too; conferences full of white dudes like me are not only boring but also serve to perpetuate an industry full of white dudes. If you're reaching out, reach out to women and people of color too, and let me know that you're working on it. This old JSConf EU post has some ideas too. Godspeed, and happy planning!

by Andy Wingo at January 21, 2016 11:59 AM

January 19, 2016

Adrián Pérez

ljndpi: The SnabbWall sidekick

Howdy, and happy 2016! Last time we met here, I wrote about SnabbWall, a suite of Snabb Switch applications implementing the machinery needed for a Layer-7 firewall using deep packet inspection, and a firewall program itself. The work I am doing in SnabbWall is the result of ongoing collaboration between Igalia and the NLnet Foundation. We are very grateful to have their sponsorship.


We are treating development of SnabbWall like every other project, and among the periodic duties it is important to report progress. Development is happening 100% out in the open, so early on we decided to also make the status updates open—by blogging them.

(For those interested in following only related posts, I have set up an additional feed: RSS, Atom.)

nDPI Lua Binding: Check

The second milestone for SnabbWall was having a binding for the nDPI library using the handy LuaJIT FFI extension, and I am happy to announce that the binding, dubbed ljndpi, is fairly complete by now. As an added bonus, I have been careful to make the code independent from Snabb Switch, which made it possible to split its history out into a separate repository; it gets imported under lib/ljndpi/ in the SnabbWall repository using git-subtree, and is then built into the snabb executable. This is the same approach currently used in Snabb Switch to build the external dependencies (that is, ljsyscall, pflua, and LuaJIT).

Hannibal Smith, happy about the project results

Having a working binding for nDPI completes the second milestone of the SnabbWall roadmap.

Some Implementation Notes

Low-level Definitions

The ndpi.c submodule (source: ndpi/c.lua) contains only the definitions needed by the FFI extension to call nDPI functions, and takes care of loading the shared object. If you check the code, you may notice that some of the type names do not match exactly those used in the nDPI headers: as long as the byte sizes and their roles are used consistently, the FFI extension does not really care about the actual names of the types, so I have decided to use names which looked better in my eyes—the nDPI API is kinda terrible in that regard.

Protocol Bitmasks

Telling nDPI which protocols it should detect is done using a big bitmask, with one bit for each supported protocol.

The implementation of NDPI_PROTOCOL_BITMASK in C uses a bunch of macros to operate on a struct which wraps an array of integers, and they do not lend themselves to be easily wrapped using the FFI extension. Because of that, I decided to reimplement it in pure Lua. The size of the array of integers in the bitmask struct and their bit width may vary across different nDPI versions. In preparation for future changes (which should not happen often), the type is defined using the values of the NDPI_NUM_BITS and NDPI_BITS constants: copying their values from the nDPI headers is the only change needed.

As a bonus, I have thrown in support for chaining calls to all methods except :is_set(). This is super convenient:

local bits = ndpi.protocol_bitmask()
print(bits:set_all():del(32):is_set(32)) --> false
print(bits:reset():add(42):is_set(42))   --> true

Certain operations, like enabling detection of all the supported protocols, can be done with a one-liner:

local dm = ndpi.detection_module(1000)
dm:set_protocol_bitmask(ndpi.protocol_bitmask():set_all())

Protocol Names and Identifiers

For each supported protocol, nDPI has both a macro to give a name to its numerical identifier, and a table with data about the protocol. One of the pieces of data is a string with the name of the protocol, in a format suitable to be presented to the user. These strings are static constants on the C side of the world, but every time ndpi_get_proto_name() gets called, the plain array of 8-bit integers would need to be converted to a Lua string using ffi.string(), which in turn would create a copy. Of course one could cache the strings on the Lua side, but protocol identifiers and names do not change (as long as the nDPI version remains unchanged), so instead of caching the results, I have opted for writing a small script which parses the ndpi_protocol_ids.h file and generates the corresponding ndpi/protocol_ids.lua file:

-- Generated by ljndpi's tools/update-protocol-ids script
local T = {
  -- ...
  HTTP = 7,     -- protocol name mapped to its numeric identifier
  [7] = "HTTP", -- and the identifier mapped back to the name
  -- ...
}
return T

Note how the same table contains both numeric and string indexes. This is perfectly fine in Lua, and allows both mapping from a protocol name to its identifier, and from an identifier to a name. The strings are not as nice as the ones returned by ndpi_get_proto_name(), but on the upside they are guaranteed to always be uppercase and be valid identifiers.

Wrapper Module

Finally, the ndpi.wrap submodule (source: ndpi/wrap.lua) ties the previous pieces together to provide an idiomatic Lua interface to nDPI. To achieve this, it uses ffi.metatype() to associate metatables to the C types used by nDPI.

The and ndpi.flow types were a bit tricky because they are opaque: their implementation details are hidden, and nDPI only allows us to create values of those types, free them, and pass them to functions. So in C one does:

// Create and initialize an endpoint identifier:
size_t size = ndpi_detection_get_sizeof_ndpi_id_struct();
struct ndpi_id_struct *ndpi_id = malloc(size);
memset(ndpi_id, 0x00, size);

How does one do that without knowing the size in advance? Well, to begin with, the FFI extension will associate the metatable of the metatype to any pointer value of the C type, no matter how it has been created. Additionally, FFI metatypes have an additional __new metamethod which allows manually allocating memory or doing any other thing that may be needed, so I ended up with this (the version for ndpi.flow is the same, but using lib.ndpi_free_flow instead of for the __gc metamethod):

local id_struct_ptr_t = ffi.typeof("ndpi_id_t*")
local id_struct_size  = lib.ndpi_detection_get_sizeof_ndpi_id_struct()

local function id_new(ctype)
  local id = ffi.cast(id_struct_ptr_t, C.malloc(id_struct_size))
  ffi.fill(id, id_struct_size)
  return id
end

local id_type = ffi.metatype("ndpi_id_t", { __new = id_new, __gc = })

Note that the metatable does not have an __index entry, so the type does not have any methods attached to it. This effectively only allows creating values, passing them around, and letting the garbage collector free their memory—which is exactly what we want.

For the ndpi.detection_module type, the main difference is that there is a table associated to the __index entry of the metatype. It contains the functions which act on a struct ndpi_detection_module_struct*, some of them renamed to shorter (or just more logical) names, and some of them small wrappers which perform additional conversion of values between the Lua and C sides of the world.

Finally, the ndpi.wrap module also imports the bits from the other submodules (ndpi.c, ndpi.protocol_bitmask, and ndpi.protocol_ids). Since ndpi.wrap is what should be loaded when using a plain require("ndpi"), the last bit is making sure that ndpi.lua has:

-- Requiring "ndpi" returns the same as "ndpi.wrap"
return require("ndpi.wrap")

There is a part of the public nDPI API which I have purposely omitted from ljndpi: the ndpi_t**() functions used to implement a tree structure. These are provided (as a convenience) as a data structure which maps keys to pointers, and they are not really needed to interface with nDPI for packet inspection. Any program which needs a similar structure can use Lua tables. If needed, it is even easier to roll your own custom data structure in Lua than to wrap these functions.

Putting nDPI Into Use

The nDPI API is quite low-level, and so is the binding exposed by ljndpi. As an exercise to test the module and get my hands greasy in preparation for writing SnabbWall's L7 Spy, I reimplemented a pure-Lua version of the ndpiReader tool included with nDPI, which works only on pcap capture files. If you want to use ljndpi I would strongly encourage you to take a look at examples/readpcap in the source tree.

The basic idea goes this way:

  • Create a ndpi.detection_module() and enable detection of all supported protocols.

  • Open the input pcap capture (or prepare whatever mechanism you use to read one packet at a time).

  • Create a table where flow information is stored.

  • For each packet:

    1. Build a “flow identifier” (flowid) from information in the packet (more on this below).
    2. If the flow table does not have an entry for the flowid, create a new one. Make sure you also create an ndpi.flow object, and two objects (one for the source host, another for the destination).
    3. Call ndpi.detection_module:process_packet(), passing the ndpi.flow and objects corresponding to the flow. If this function returns anything other than ndpi.protocol.PROTOCOL_UNKNOWN, then the protocol has been identified (jackpot!).

So what's a “flow identifier” anyway? It can be any value (really!), as long as you can calculate it solely from data in the packets, and each identifier should be unique for each application (Layer-7!) passing data over the network between a pair of hosts. For IP-based protocols (TCP, UDP), you want to use at least the IP addresses and ports involved. Note that nDPI considers the direction of the data irrelevant for the flows (no matter in which direction traffic goes, it belongs to a certain application), but the direction of the source and destination of each inspected packet is relevant. That means that your flow identifier must be the same regardless of the direction, but the values passed as the two last arguments to :process_packet() have to indicate the direction of the packets.
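To make the idea more concrete, here is one possible way to build such a direction-independent identifier in Lua. This is only a sketch with hypothetical packet field names (src_ip, dst_ip, and so on), not code from ljndpi or readpcap:

```lua
-- Build a direction-independent flow identifier from an IP packet.
-- The packet accessors used here are hypothetical.
local function flow_id(pkt)
  local a = pkt.src_ip .. ":" .. pkt.src_port
  local b = pkt.dst_ip .. ":" .. pkt.dst_port
  -- Order the two endpoints so that A->B and B->A traffic
  -- produce exactly the same identifier.
  if a > b then a, b = b, a end
  return pkt.proto .. "|" .. a .. "|" .. b
end
```

Both directions of the same connection then map to the same flow table entry, while each inspected packet still supplies its own source/destination order to :process_packet().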

There are a number of improvements one can make on top of the basic method above, the most obvious one being using VLAN tags to identify flows (readpcap does this already), but I am leaving flow identification for a follow-up post. Dear readers: think about it as homework, and make sure to check the answers with me in the following weeks.

(Or, you know, just go ahead, clone ljndpi, and experiment!)

by aperez ( at January 19, 2016 06:29 PM

Andy Wingo

unboxing in guile

Happy snowy Tuesday, hackfolk! I know I said in my last dispatch that I'd write about Lua soon, but that article is still cooking. In the meantime, a note on Guile and unboxing.

on boxen, on blitzen

Boxing is a way for a programming language implementation to represent a value.

A boxed value is the combination of a value along with a tag providing some information about the value. Both the value and the tag take up some space. The value can be thought to be inside a "box" labelled with the tag and containing the value.

A value's tag can indicate whether the value's bits should be interpreted as an unsigned integer, as a double-precision floating-point number, as an array of words of a particular data type, and so on. A tag can also be used for other purposes, for example to indicate whether a value is a pointer or an "immediate" bit string.

Whether values in a programming language are boxed or not is an implementation consideration. In languages with powerful type systems, a compiler may be able to know the representation of all values in all parts of all programs, so that boxing is never needed. However, it's much easier to write a garbage collector if values have a somewhat uniform representation, with tag bits to tell the GC how to trace any pointers that might be contained in the object. Tags can also carry run-time type information needed by a dynamically typed language like Scheme or JavaScript, to allow for polymorphic predicates like number? or pair?.

Boxing all of the values in a program can incur significant overhead in space and in time. For example, one way to implement boxes is to allocate space for the tag and the value on the garbage-collected heap. A boxed value would then be referred to via a pointer to the corresponding heap allocation. However, most memory allocation systems align their heap allocations on word-sized boundaries, for example on 8-byte boundaries. That means that the low 3 bits of a heap allocation will always be zero. If you make a bit string whose low 3 bits are not zero, it cannot possibly be a valid pointer. In that case you can represent some types within the set of bit strings that cannot be valid pointers. These values are called "immediates", as opposed to "heap objects". In Guile, we have immediate representations for characters, booleans, some special values, and a subset of the integers. Alternately, a programming language implementation can represent values as double-precision floating point numbers, and shove pointers into the space of the NaN values. And for heap allocations, some systems can associate one tag with a whole page of values, minimizing per-value boxing overhead.

The goal of these optimizations is to avoid heap allocation for some kinds of boxes. While most language implementations have good garbage collectors that make allocation fairly cheap, the best way to minimize allocation cost is to refrain from it entirely.

In Guile's case, we currently use a combination of low-bit tagging for immediates, including fixnums (a subset of the integers), and tagged boxes on the heap for everything else, including floating-point numbers.
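You can peek at this representation from a Guile REPL with object-address, which returns the raw bits of a value. (The exact tag layout is an internal detail and may change between versions; this is just a sketch of what to expect.)

```scheme
;; Fixnums are immediates: the number is shifted left and a small
;; tag is placed in the low bits, so the "address" is not the number.
(object-address 42)    ; roughly (42 << 2) | tag, not 42 itself

;; Flonums are heap objects: this is a real, 8-byte-aligned pointer,
;; so its low 3 bits are zero.
(logand (object-address 3.14) 7)
```

Comparing the two addresses makes the immediate-versus-heap distinction tangible: the fixnum's bits encode the value directly, while the flonum's bits point at a freshly allocated box.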

Boxing floating-point numbers obviously incurs huge overhead on floating-point math. You have to consider that each intermediate value produced by a computation will result in the allocation of another 8 bytes for the value and 4 or 8 bytes for the tag. Given that Guile aligns allocations on 8-byte boundaries, the result is a 16-byte allocation in either case. Consider this loop to sum the doubles in a bytevector:

(use-modules (rnrs bytevectors))
(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))
        sum)))

Each trip through the loop is going to allocate not one but two heap floats: one to box the result of bytevector-ieee-double-native-ref (whew, what a mouthful), and one for the sum. If we have a bytevector of 10 million elements, that will be 320 megabytes of allocation. Guile can allocate short-lived 16-byte allocations at about 900 MB/s on my machine, so summing this vector is going to take at least 350ms, just for the allocation. Indeed, without unboxing I measure this loop at 580ms for a 10 million element vector:

> (define v (make-f64vector #e10e6 1.0))
> ,time (f64-sum v)
$1 = 1.0e7
;; 0.580114s real time, 0.764572s run time.  0.268305s spent in GC.

The run time is higher than the real time due to parallel marking. I think in this case, allocation has even higher overhead because it happens outside the bytecode interpreter. The add opcode has a fast path for small integers (fixnums), and if it needs to work on flonums it calls out to a C helper. That C helper doesn't have a pointer to the thread-local freelist so it has to go through a more expensive allocation path.

Anyway, in the time that Guile takes to fetch one f64 value from the vector and add it to the sum, the CPU ticked through some 150 cycles, so surely we can do better than this.

unboxen, unblitzen

Let's take a look again at the loop to see where the floating-point allocations are produced.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))
        sum)))

It turns out there's no reason for the loquaciously-named bytevector-ieee-double-native-ref to return a boxed number. It's a monomorphic function that is well-known to the Guile compiler and virtual machine, and it even has its own opcode. In Guile 2.0 and until just a couple months ago in Guile 2.2, this function did box its return value, but that was because the virtual machine had no facility for unboxed values of any kind.

To allow bytevector-ieee-double-native-ref to return an unboxed double value, the first item of business was then to support unboxed values in Guile's VM. Looking forward to unboxed doubles, we made a change such that all on-stack values are 64 bits wide, even on 32-bit systems. (For simplicity, all locals in Guile take up the same amount of space. For the same reason, fetching 32-bit floats also unbox to 64-bit doubles.)

We also made a change to Guile's "stack maps", which are data structures that tell the garbage collector which locals are live in a stack frame. There is a stack map recorded at every call in a procedure, to be used when an activation is pending on the stack. Stack maps are stored in a side table in a separate section of the compiled ELF library. Live values are traced by the garbage collector, and dead values are replaced by a special "undefined" singleton. The change we made was to be able to indicate that live values were boxed or not, and if they were unboxed, what type they were (e.g. unboxed double). Knowing the type of locals helps the debugger to print values correctly. Currently, all unboxed values are immediates, so the GC doesn't need to trace them, but it's conceivable that we could have unboxed pointers at some point. Anyway, instead of just storing one bit (live or dead) per local in the stack map, we store two, and reserve one of the bit patterns to indicate that the local is actually an f64 value.

But the changes weren't done then: since we had never had unboxed locals, there were quite a few debugging-related parts of the VM that assumed that we could access the first slot in an activation to see if it was a procedure. This dated from a time in Guile where slot 0 would always be the procedure being called, but the check is bogus ever since Guile 2.2 allowed local value slots corresponding to the closure or procedure arguments to be re-used for other values, if the closure or argument was dead. Another nail in the coffin of procedure-in-slot-0 was driven by closure optimizations, in which closures whose callees are all visible could specialize the representation of their closure in non-standard ways. It took a while, but unboxing f64 values flushed out these bogus uses of slot 0.

The next step was to add boxing and unboxing operations to the VM (f64->scm and scm->f64, respectively). Then we changed bytevector-ieee-double-native-ref to return an unboxed value and then immediately box it via f64->scm. Similarly for bytevector-ieee-double-native-set!, we unbox the value via scm->f64, potentially throwing a type error. Unfortunately our run-time type mismatch errors got worse; although the source location remains the same, scm->f64 doesn't include the reason for the unboxing. Oh well.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
              (+ sum boxed)))
        sum)))

When we lower Tree-IL to CPS, we insert the needed f64->scm and scm->f64 boxing and unboxing operations around bytevector accesses. Cool. At this point we have a system with unboxed f64 values, but which is slower than the original version because every f64 bytevector access involves two instructions instead of one, although the instructions themselves together did the same amount of work. However, telling the optimizer about these instructions could potentially eliminate some of them. Let's keep going and see where we get.

Let's attack the other source of boxes, the accumulation of the sum. We added some specialized instructions to the virtual machine to support arithmetic over unboxed values. Doing this is potentially a huge win, because not only do you avoid allocating a box for the result, you also avoid the type checks on the incoming values. So we add f64+, f64-, and so on.

Unboxing the + to f64+ is a tricky transformation, and relies on type analysis. Our assumption is that if type analysis indicates that we are in fact able to replace a generic arithmetic instruction with a combination of operand unboxing, unboxed arithmetic, and a boxing operation, then we should do it. Separating out the boxes and the monomorphic arithmetic opens the possibility to remove the resulting box, and possibly remove the unboxing of operands too. In this case, we run an optimization pass and end up with something like:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
              (f64->scm
               (f64+ (scm->f64 sum)
                     (scm->f64 boxed)))))
        sum)))

Scalar replacement via fabricated expressions will take the definition of boxed as (f64->scm f64) and fabricate a definition of f64 as (scm->f64 boxed), which propagates down to the f64+ so we get:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
              (f64->scm
               (f64+ (scm->f64 sum)
                     f64))))
        sum)))

Dead code elimination can now kill boxed, so we end up with:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i)))
              (f64->scm
               (f64+ (scm->f64 sum)
                     f64))))
        sum)))

Voilà, we removed one allocation. Yay!

As we can see from the residual code, we're still left with one f64->scm boxing operation. That expression is one of the definitions of sum, one of the loop variables. The other definition is 0.0, the starting value. So, after specializing arithmetic operations, we go through the set of multiply-defined variables ("phi" variables) and see what we can do to unbox them.

A phi variable can be unboxed if all of its definitions are unboxable. It's not always clear that you should unbox, though. For example, maybe you know via looking at the definitions for the value that it can be unboxed as an f64, but all of its uses are boxed. In that case it could be that you throw away the box when unboxing each definition, only to have to re-create them anew when using the variable. You end up allocating twice as much instead of not at all. It's a tricky situation. Currently we assume a variable with multiple definitions should only be unboxed if it has an unboxed use. The initial set of unboxed uses is the set of operands to scm->f64. We iterate this set to a fixed point: unboxing one phi variable could cause others to be unboxed as well. As a heuristic, we only require one unboxed use; it could be that there are other uses that are boxed, and we could indeed hit that pessimal double-allocation case. Oh well!

In this case, the intermediate result looks something like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (let ((f64 (bytevector-ieee-double-native-ref v i)))
                (scm->f64
                 (f64->scm
                  (f64+ (scm->f64 sum-box)
                        f64)))))
          sum-box))))

After the scalar replacement and dead code elimination passes, we end up with something more like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (f64+ sum
                    (bytevector-ieee-double-native-ref v i)))
          sum-box))))

Well this is looking pretty good. There's still a box though. Really we should sink this to the exit, but as it happens there's something else that accidentally works in our favor: loop peeling. By peeling the first loop iteration, we create a control-flow join at the loop exit that defines a phi variable. That phi variable is subject to the same optimization, sinking the box down to the join itself. So in reality the result looks like:

(define (f64-sum v)
  (let ((i 0)
        (sum (scm->f64 0.0))
        (len (bytevector-length v)))
     (if (< i len)
         (let ((i (+ i 8))
               (sum (f64+ sum
                          (bytevector-ieee-double-native-ref v i))))
           (let lp ((i i) (sum sum))
             (if (< i len)
                 (lp (+ i 8)
                     (f64+ sum (bytevector-ieee-double-native-ref v i)))
                 (f64->scm sum))))
         (f64->scm sum))))

As you can see, the peeling lifted the length computation up to the top too, which is a bonus. We should probably still implement allocation sinking, especially for loops for which peeling isn't an option, but the current status often works well. Running f64-sum on a 10-million-element packed double array goes down from 580ms to 99ms, or to some 25 or 30 CPU cycles per element, and of course no time in GC. Considering that this loop still has the overhead of bytecode interpretation and cache misses, I think we're doing A O K.


It used to be that using packed bytevectors of doubles was an easy way to make your program slower using types (thanks to Sam Tobin-Hochstadt for that quip). The reason is that although a packed vector of doubles uses less memory, every access to it has to allocate a new boxed number. Compare to "normal" vectors where sure, it uses more memory, but fetching an element fetches an already-boxed value. Now with the unboxing optimization, this situation is properly corrected... in most cases.

The major caveat is that for unboxing to work completely, each use of a potentially-unboxable value has to have an alternate implementation that can work on unboxed values. In our example above, the only use was f64+ (which internally is really called fadd), so we win. Writing an f64 to a bytevector can also be unboxed. Unfortunately, bytevectors and simple arithmetic are currently all of the unboxable operations. We'll implement more over time, but it's a current limitation.
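For instance, a loop that both reads and writes f64 values can stay entirely in unboxed arithmetic. Here is a sketch, assuming the optimization passes described above fire:

```scheme
(use-modules (rnrs bytevectors))

;; Scale every f64 in a bytevector in place. Both the ref and the
;; set! have unboxed variants, and the multiply can be specialized,
;; so ideally no flonum is allocated per element.
(define (f64-scale! v factor)
  (let ((len (bytevector-length v)))
    (let lp ((i 0))
      (when (< i len)
        (bytevector-ieee-double-native-set!
         v i (* factor (bytevector-ieee-double-native-ref v i)))
        (lp (+ i 8))))))
```

Whether the intermediate product actually stays unboxed depends on what the optimizer can prove about factor, per the caveats above.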

Another point is that we are leaning heavily on the optimizer to remove the boxes when it can. If there's a bug or a limitation in the optimizer, it could be the box stays around needlessly. It happens, hopefully less and less but it does happen. To be sure you get the advantages, you need to time the code and see if it's spending significant time in GC. If it is, then you need to disassemble your code to see where that's happening. It's not a very nice thing, currently. The Scheme-like representations I gave above were written by hand; the CPS intermediate language is much more verbose than that.

Another limitation is that function arguments and return values are always boxed. Of course, the compiler can inline and contify a lot of functions, but that means that to use abstraction, you need to build up a mental model of what the inliner is going to do.

Finally, it's not always obvious to the compiler what the type of a value is, and that necessarily limits unboxing. For example, if we had started off the loop by defining sum to be 0 instead of 0.0, the result of the loop as a whole could be either an exact integer or an inexact real. Of course, loop peeling mitigates this to an extent, unboxing sum within the loop after the first iteration, but it so happens that peeling also prevents the phi join at the loop exit from being unboxed, because the result from the peeled iteration is 0 and not 0.0. In the end, we are unable to remove the equivalent of sum-box, and so we still allocate once per iteration. Here is a clear case where we would indeed need allocation sinking.
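To make the pitfall concrete: the only difference between these two loops is the literal used to start the accumulator, yet by the analysis above the first keeps a per-iteration allocation while the second unboxes fully.

```scheme
;; sum starts as exact 0: the loop result may be an exact integer
;; (empty bytevector) or an inexact real, so the exit phi stays
;; boxed and we allocate once per iteration.
(define (f64-sum/slow v)
  (let lp ((i 0) (sum 0))
    (if (< i (bytevector-length v))
        (lp (+ i 8) (+ sum (bytevector-ieee-double-native-ref v i)))
        sum)))

;; sum starts as 0.0: always a real, so it unboxes as described above.
(define (f64-sum/fast v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8) (+ sum (bytevector-ieee-double-native-ref v i)))
        sum)))
```

The moral: when you want unboxed float loops, seed your accumulators with inexact literals.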

Also, consider that in other contexts the type of (+ x 1.0) might actually be complex instead of real, which means that depending on the type of x it might not be valid to unbox this addition. Proving that a number is not complex can be non-obvious. That's the second way that fetching a value from a packed vector of doubles or floats is useful: it's one of the rare times that you know that a number is real-valued.

on integer, on fixnum

That's all there is to say about floats. However, when doing some benchmarks of the floating-point unboxing, one user couldn't reproduce some of the results: they were seeing huge run-times on a microbenchmark that repeatedly summed the elements of a vector. It turned out that the reason was that they were on a 32-bit machine, and one of the loop variables used in the test was exceeding the fixnum range. Recall that fixnums are the subset of integers that fit in an immediate value, along with their tag. Guile's fixnum tag is 2 bits, and fixnums have a sign bit, so the most positive fixnum on a 32-bit machine is 2^29−1, or around 500 million. It sure is a shame not to be able to count up to #xFFFFFFFF without throwing an allocation party!
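You can check the fixnum range of your own build at the REPL; most-positive-fixnum is available in core Guile.

```scheme
;; 2^29 - 1 = 536870911 on a 32-bit build; far larger on 64-bit,
;; where big counters rarely overflow into heap-allocated bignums.
most-positive-fixnum

;; #t on a 32-bit build (so a uint32 counter allocates bignums),
;; #f on a 64-bit build.
(> (expt 2 32) most-positive-fixnum)
```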

So, we set about seeing if we could unbox integers as well in Guile. Guile's compiler has a lot more visibility as to when something is an integer, compared to real numbers. Anything used as an index into a vector or similar data structure must be an exact integer, and any query as to the length of a vector or a string or whatever is also an integer.

Note that knowing that a value is an exact integer is insufficient to unbox it: you have to also know that it is within the range of your unboxed integer data type. Here we take advantage of the fact that in Guile, type analysis also infers ranges. So, cool. Because the kinds of integers that can be used as indexes and lengths are all non-negative, our first unboxed integer type is u64, the unsigned 64-bit integers.

If Guile did native compilation, it would always be a win to unbox any integer operation, if only because you would avoid polymorphism or any other potential side exit. For bignums that are within the unboxable range, the considerations are similar to the floating-point case: allocation costs dominate, so unboxing is almost always a win, provided that you avoid double-boxing. Eliminating one allocation can pay for a lot of instruction dispatch.

For fixnums, though, things are not so clear. Immediate tagging is such a cheap way of boxing that in an interpreter, the extra instructions you introduce could outweigh any speedup from having faster operations.

In the end, I didn't do science and I decided to just go ahead and unbox if I could. We are headed towards native compilation, this is a necessary step along that path, and what the hell, it seemed like a good idea at the time.

Because there are so many more integers in a typical program than floating-point numbers, we had to provide unboxed integer variants of quite a number of operations. Of course we could unconditionally require unboxed arguments to vector-ref, string-length and so on, but in addition to making u64 variants of arithmetic, we also support bit operations like logand and such. Unlike the current status with floating point numbers, we can do test-and-branch over unboxed u64 comparisons, and we can compare u64 values to boxed SCM values.

In JavaScript, making sure an integer is unboxed is easy: you just do val | 0. The bit operation | truncates the value to a 32-bit two's-complement signed integer (thanks to Slava for the correction). In Guile though, we have arbitrary-precision bit operations, so although (logior val 0) would assert that val is an integer, it wouldn't necessarily mean that it's unboxable.

Instead, the Guile idiom for making sure you have an unboxed integer in a particular range should go like this:

(define-inlinable (check-uint-range x mask)
  (let ((x* (logand x mask)))
    (unless (= x x*)
      (error "out of range" x))
    x*))

A helper like this is useful to assert that an argument to a function is of a particular type, especially given that arguments to functions are always boxed and treated as being of unknown type. The logand asserts that the value is an integer, and the comparison asserts that it is within range.

For example, if we want to implement a function that does modular 8-bit addition, it can go like:

(define-inlinable (check-uint8 x)
  (check-uint-range x #xff))
(define-inlinable (truncate-uint8 x)
  (logand x #xff))
(define (uint8+ x y)
  (truncate-uint8 (+ (check-uint8 x) (check-uint8 y))))
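A quick sanity check at the REPL (the second call exercises the modular wraparound in truncate-uint8, and the third trips the range check):

```scheme
(uint8+ 200 40)   ; 240
(uint8+ 200 100)  ; 44, since (200 + 100) mod 256 = 44
(uint8+ 300 1)    ; signals "out of range": 300 exceeds #xff
```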

If we disassemble this function, we get something like:

Disassembly of #<procedure uint8+ (x y)> at #xa8d0f8:

   0    (assert-nargs-ee/locals 3 2)    ;; 5 slots (2 args)
   1    (scm->u64/truncate 4 3)
   2    (load-u64 1 0 255)
   5    (ulogand 4 4 1)
   6    (br-if-u64-=-scm 4 3 #f 17)     ;; -> L1
;; [elided code to throw an error if x is not in range]
  23    (scm->u64/truncate 3 2)
  24    (ulogand 3 3 1)
  25    (br-if-u64-=-scm 3 2 #f 18)     ;; -> L2
;; [elided code to throw an error if y is not in range]
  43    (uadd 4 4 3)
  44    (ulogand 4 4 1)
  45    (u64->scm 3 4)
  46    (return-values 2)               ;; 1 value

The scm->u64/truncate instructions unbox an integer, truncating it to the u64 range. They are used when we know that any additional bits won't be used, as in this case where we immediately do a logand of the unboxed value. All in all it's not a bad code sequence; there are two possible side exits for each argument (not an integer, signalled by the unboxing, and out of range, signalled by the explicit check), and no other run-time dispatch. For now I think we can be pretty happy with the code.

That's about it for integer unboxing. We also support unboxed signed 64-bit integers, mostly for use as operands or return values from bytevector-s8-ref and similar unboxed accessors on bytevectors. There are fewer operations that have s64 variants, though, compared to u64 variants.


Up until now in Guile, it could be that you might have to avoid Scheme if you needed to do some kinds of numeric computation. Unboxing floating-point and integer numbers makes it feasible to do more computation in Scheme instead of having to rely on inflexible C interfaces. At the same time, as a Scheme hacker I feel much more free knowing that I can work on 64-bit integers without necessarily allocating bignums. I expect this optimization to have a significant impact on the way I program, and what I program. We'll see where this goes, though. Until next time, happy hacking :)

by Andy Wingo at January 19, 2016 11:57 AM

January 11, 2016

Andy Wingo

the half strap: self-hosting and guile

or, "why does building guile take so friggin long"

Happy new year's, hackfolk! I don't know about y'all, but I'm feeling pretty good about 2016. Let's make some cool stuff!

Today's article is about Guile and how it builds itself. It's a Scheme implementation mostly written in Scheme, so how it would go about doing that isn't straightforward. And although the performance of Guile is pretty great these days, a user's first experience with it will probably be building it, which is a process that takes approximately forever. Seriously. On this newish laptop with an i7-5600U CPU and four cores it takes like 45 minutes. On older machines it can take even longer. What gives?

Well, fictional reader, it's a good question. I'm glad you asked! Before getting to the heart of the matter, I summarize a bit of background information.

and then nothing turned itself inside out

Guile is mostly written in Scheme. Some parts of it are written in C -- some runtime routines, some supporting libraries (the garbage collector, unicode support, arbitrary precision arithmetic), and the bytecode interpreter. The first phase when building Guile is to take the system's C compiler -- a program that takes C source code and produces native machine code -- and use it to build libguile, the part of Guile that is written in C.

The next phase is to compile the parts of Guile written in Scheme. Currently we compile to bytecode which is then interpreted by libguile, but this discussion would be the same if we compiled Scheme to native code instead of bytecode.

There's a wrinkle, though: the Scheme compiler -- the program that takes a Scheme program and produces bytecode -- is written in Scheme. When we built libguile, we could use the system's C compiler. But the system has no Scheme compiler, so how do we do?

The answer is that in addition to a Scheme compiler, Guile also includes a Scheme interpreter. We use the interpreter to load the Scheme compiler, and then use the compiler to produce bytecode from Scheme.

There's another wrinkle, though, and I bet you can guess what it is :) The Scheme interpreter is also written in Scheme. It used to be that Guile's Scheme interpreter was written in C, but that made it impossible to tail-call between compiled and interpreted code. So some six years ago, I rewrote the interpreter in Scheme.

As I mention in that article, Guile actually has two Scheme interpreters: the one in Scheme and one in C that is only used to compile the one in Scheme, and never used again. The bootstrap interpreter written in C avoids the problem with tail calls to compiled code because when it runs, there is no compiled code.

So in summary, Guile's build has the following general phases:

  1. The system C compiler builds libguile.

  2. The bootstrap C interpreter in libguile loads the Scheme compiler and builds eval.go from eval.scm. (Currently .go is the extension for compiled Guile code. The extension predates the Go language. Probably we switch to .so at some point, though.)

  3. The Scheme interpreter from eval.go loads the Scheme compiler and compiles the rest of the Scheme code in Guile, including the Scheme compiler itself.

In the last step, Guile compiles each file in its own process, allowing for good parallelization. This also means that as the build proceeds, the compiler itself starts running faster, because it can use the freshly built .go files instead of having to use the interpreter to load the source .scm files.

so what's slow?

Building libguile is not so slow; it takes about a minute on my laptop. Could be faster, but it's fine.

Building eval.go is slow, but at two and a half minutes it's bearable.

Building the rest of the Scheme code is horribly slow though, and for me takes around 40 or 50 minutes. What is going on?

The crucial difference between building libguile and building the .go files is that when we build libguile, we use the C compiler, which is itself a highly optimized program. When we build .go files, we use the Scheme compiler, which hasn't yet been compiled! Indeed if you rebuild all the Scheme code using a compiled Scheme compiler instead of an interpreted Scheme compiler, you can rebuild all of Guile in about 5 minutes. (Due to the way the Makefile dependencies work, the easiest way to do this if you have a built Guile is rm bootstrap/ice-9/eval.go && make -jN.)

The story is a bit complicated by parallelism, though. Usually if you do a make -j4, you will be able to build 4 things at the same time, taking advantage of 4 cores (if you have them). However Guile's Makefile rules are arranged in such a way that the initial eval.go compile is done serially, when nothing else is running. This is because the bootstrap interpreter written in C uses C stack space as temporary storage. It could be that when compiling bigger files, the C interpreter might run out of stack, and with C it's hard to detect exactly how much stack you have. Indeed, sometimes we get reports of strange bootstrap failures that end up being because Guile was built with -O0 and the compiler decided to use much more stack space than we usually see. We try to fix these, usually by raising the static stack limits that Guile's C interpreter imposes, but we certainly don't want a limitation in the bootstrap interpreter to affect the internal structure of the rest of Guile. The bootstrap interpreter's only job is to load the compiler and build eval.go, and it isn't tested in any other way.

So eval.go is built serially. After that, compilation can proceed in parallel, but it goes more slowly before speeding up. To explain that, I digress!

a digression on interpreters

When Scheme code is loaded into Guile from source, the process goes like this:

  1. Scheme code is loaded from disk or wherever as a stream of bytes.

  2. The reader parses that byte stream into S-expressions.

  3. The expander runs on the S-expressions, expanding macros and lowering Scheme code to an internal language called "Tree-IL".

Up to here, the pipeline is shared between the interpreter and the compiler. If you're compiling, Guile will take the Tree-IL, run the partial evaluator on it, lower to CPS, optimize that CPS, and then emit bytecode. The next time you load this file, Guile will just mmap in the .go file and skip all of the other steps. Compilation is great!
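These stages can be poked at from the REPL. A sketch (the #:to keyword to compile lets you stop the pipeline at an intermediate language):

```scheme
(use-modules (system base compile))

;; Shared front end: the reader turns bytes into an S-expression,
;; and the expander lowers it to Tree-IL.
(call-with-input-string "(+ 1 2)" read)  ; the S-expression (+ 1 2)
(compile '(+ 1 2) #:to 'tree-il)         ; stop after expansion

;; Full pipeline: Tree-IL -> CPS -> bytecode, then run it.
(compile '(+ 1 2))                       ; 3
```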

But if you are interpreting, a few more things happen:

  1. The memoizer does some analysis on the Tree-IL and turns variable references into two-dimensional (depth, offset) references on a chained environment. See the story time article for more; scroll down about halfway for the details. The goal is to do some light compilation on variable access so that the interpreter will have to do less work, and also prevent closures from hanging on to too much data; this is the "flat closure" optimization, for the interpreter.

  2. The interpreter "compiles" the code to a chain of closures. This is like the classic direct-threading optimization, but for a tree-based interpreter.

The closure-chaining strategy of the interpreter is almost exactly as described in SICP's analyze pass. I came up with it independently, but so did Jonathan Rees in 1982 and Marc Feeley in 1986, so I wasn't surprised when I found the prior work!

Back in 2009 when we switched to the eval-in-Scheme, we knew that it would result in a slower interpreter. This is because instead of the interpreter being compiled to native code, it was compiled to bytecode. Also, Guile's Scheme compiler wasn't as good then, so we knew that we were leaving optimizations on the floor. Still, the switch to an evaluator in Scheme enabled integration of the compiler, and we thought that the interpreter speed would improve with time. I just took a look and with this silly loop:

(let lp ((n 0)) (if (< n #e1e7) (lp (1+ n))))

Guile 1.8's interpreter written in C manages to run this in 1.1 seconds. Guile 2.0's interpreter written in Scheme and compiled to the old virtual machine does it in 16.4 seconds. Guile 2.1.1's interpreter, with the closure-chaining optimization, a couple of peephole optimizations in the interpreter, and compiled using the better compiler and VM from Guile 2.2, manages to finish in 2.4 seconds. So we are definitely getting better, and by the time we compile eval.scm to native code I have no doubt that we will be as good as the old C implementation. (Of course, when compiled to Guile 2.2's VM, the loop finishes in 55 milliseconds, but comparing a compiler and an interpreter is no fair.)

The upshot for bootstrap times is that once the interpreter is compiled, the build currently runs a little slower, because the compiled eval.go interpreter is a bit slower than the bootstrap interpreter in libguile.

bottom up, top down

Well. Clearly I wanted to share a thing with you about interpreters; thank you for following along :) The salient point is that Guile's interpreter is now pretty OK, though of course not as good as the compiler. Still, Guile 2.0 builds in 12 minutes, while Guile 2.2 builds in 40 or 50, and Guile 2.2 has a faster interpreter. What's the deal?

There are a few factors at play but I think the biggest is that Guile 2.2's compiler is simply much more sophisticated than Guile 2.0's compiler. Just loading it up at bootstrap-time takes longer than loading Guile 2.0's compiler, because there's more code using more macro abstractions than in Guile 2.0. The expander has to do more work, and the evaluator has to do more work. A compiler is a program that runs on programs, and interpreting a bigger program is going to be slower than interpreting a smaller program.

It's a somewhat paradoxical result: to make programs run faster, we needed a better compiler, but that better compiler is bigger, and so it bootstraps from source more slowly. Some of the improvements to generated code quality were driven by a desire to have the compiler run faster, but this only had the reverse effect on bootstrap time.

Unfortunately, Guile 2.2's compiler also runs slow when it's fully compiled: compiling one largeish module in Guile 2.2 compared to 2.0 takes 10.7 seconds instead of 1.9. (To reproduce, ,time (compile-file "module/ice-9/psyntax-pp.scm") from a Guile 2.0 or 2.2 REPL.) How can we explain this?

Understanding this question has taken me some time. If you do a normal profile of the code using statprof, you get something like this:

> ,profile (compile-file "module/ice-9/psyntax-pp.scm")
%     cumulative   self             
time   seconds     seconds  procedure
 12.41      1.61      1.61  language/cps/intmap.scm:393:0:intmap-ref
  6.35      1.05      0.82  vector-copy
  5.92     13.09      0.77  language/cps/intset.scm:467:5:visit-branch
  5.05      0.71      0.65  language/cps/intmap.scm:183:0:intmap-add!
  4.62      1.40      0.60  language/cps/intset.scm:381:2:visit-node
  3.61      0.93      0.47  language/cps/intset.scm:268:0:intset-add
  3.46      0.49      0.45  language/cps/intset.scm:203:0:intset-add!
  3.17      1.01      0.41  language/cps/intset.scm:269:2:adjoin
  3.03      1.46      0.39  language/cps/intmap.scm:246:2:adjoin

("Cumulative seconds" can be greater than the total number of seconds for functions that have multiple activations live on the stack.)

These results would seem to unequivocally indicate that the switch to persistent data structures in the new compiler is to blame. This is a somewhat disheartening realization; I love working with the new data structures. They let me write better code and think about bigger things.

Seeing that most of the time is spent in intmap and intset manipulations, I've tried off and on over the last few months to speed them up. I tried at one point replacing hot paths with C -- no speedup, so I threw it away. I tried adding an alternate intmap implementation that, for transient packed maps, would store the map as a single vector; no significant speedup, binned it. I implemented integer unboxing in the hopes that it would speed up the results; more about that in another missive. I stared long and hard at the generated code, looking for opportunities to improve it (and did make some small improvements). Even when writing this article, the results are such a shame that I put the article on hold for a couple weeks while I looked into potential improvements, and managed to squeak out another 10%.

In retrospect, getting no speedup out of C hot paths should have been a hint.

For many years, a flat statistical profile with cumulative/self timings like the one I show above has been my go-to performance diagnostic. Sometimes it does take a bit of machine sympathy to understand, though; when you want to know what's calling a hot function, usually you look farther down the list for functions that don't have much self time but whose cumulative time matches the function you're interested in. But this approach doesn't work for hot functions that are called from many, many places, as is the case with these fundamental data structure operations.

Indeed at one point I built a tool to visualize statistical stack samples, the idea being you often want to see how a program gets to its hot code. This tool was useful but its output could be a bit overwhelming. Sometimes you'd have to tell it to generate PDF instead of PNG files because the height of the image exceeded Cairo's internal limits. The tool also had too many moving pieces to maintain. Still, the core of the idea was a good one, and I incorporated the non-graphical parts of it into Guile proper, where they sat unused for a few years.

Fast-forward to now, where faced with this compiler performance problem, I needed some other tool to help me out. It turns out that in the 2.0 to 2.2 transition, I had to rewrite the profiler's internals anyway to deal with the new VM. The old VM could identify a frame's function by the value in local slot 0; the new one has to look it up from instruction pointer values. Because this lookup can be expensive, the new profiler just writes sampled instruction pointer addresses into an array for later offline analysis and eventual distillation into a flat profile. It turns out that this information is exactly what's needed to do a tree profile like I did in chartprof. I had to add cycle detection to prevent the graphs from being enormous, but cycle detection makes much more sense in a tree output than in a flat profile. The result, distilled a bit:
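The core of such a tree profile can be sketched like this (an illustrative toy in JavaScript, not Guile's statprof): stack samples, outermost frame first, are folded into a call tree, with a naive form of cycle detection that skips frames already seen in the current sample.

```javascript
// Toy tree profiler: fold stack samples into a call tree with
// inclusive sample counts per node.
function buildTree(samples) {
  const root = { name: "<root>", count: 0, children: new Map() };
  for (const stack of samples) {
    root.count++;
    let node = root;
    const seen = new Set(); // naive cycle detection: skip repeated frames
    for (const frame of stack) {
      if (seen.has(frame)) continue;
      seen.add(frame);
      if (!node.children.has(frame))
        node.children.set(frame, { name: frame, count: 0, children: new Map() });
      node = node.children.get(frame);
      node.count++;
    }
  }
  return root;
}

const tree = buildTree([
  ["main", "compile", "optimize"],
  ["main", "compile", "optimize"],
  ["main", "compile", "emit"],
]);
console.log(tree.children.get("main").children.get("compile").count); // 3
```

Dividing each node's count by the root's count gives the inclusive percentages shown in the tree output below.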

> ,profile (compile-file "module/ice-9/psyntax-pp.scm") #:display-style tree
100.0% read-and-compile at system/base/compile.scm:208:0
  99.4% compile at system/base/compile.scm:237:0
    99.4% compile-fold at system/base/compile.scm:177:0
      75.3% compile-bytecode at language/cps/compile-bytecode.scm:568:0
        73.8% lower-cps at language/cps/compile-bytecode.scm:556:0
          41.1% optimize-higher-order-cps at language/cps/optimize.scm:86:0
          29.9% optimize-first-order-cps at language/cps/optimize.scm:106:0
          1.5% convert-closures at language/cps/closure-conversion.scm:814:0
      20.5% emit-bytecode at language/cps/compile-bytecode.scm:547:0
        18.5% visit-branch at language/cps/intmap.scm:514:5
          18.5% #x7ff420853318 at language/cps/compile-bytecode.scm:49:15
            18.5% compile-function at language/cps/compile-bytecode.scm:83:0
              18.5% allocate-slots at language/cps/slot-allocation.scm:838:0
      3.6% compile-cps at language/tree-il/compile-cps.scm:1071:0
        2.5% optimize at language/tree-il/optimize.scm:31:0
        0.6% cps-convert/thunk at language/tree-il/compile-cps.scm:924:0
        0.4% fix-letrec at language/tree-il/fix-letrec.scm:213:0
  0.6% compile-fold at system/base/compile.scm:177:0
    0.6% save-module-excursion at ice-9/boot-9.scm:2607:0
      0.6% #x7ff420b95254 at language/scheme/compile-tree-il.scm:29:3

I've uploaded the full file here, for the curious Guile hacker.

So what does it mean? The high-order bit is that we spend some 70% of the time in the optimizer. Indeed, running the same benchmark but omitting optimizations gets a much more respectable time:

$ time meta/uninstalled-env \
  guild compile -O0 module/ice-9/psyntax-pp.scm -o /tmp/foo.go
wrote `/tmp/foo.go'

real	0m3.050s
user	0m3.404s
sys	0m0.060s

One of the results of this investigation was that we should first compile the compiler with -O0 (no optimizations), then compile the compiler with -O2 (with optimizations). This change made it into the 2.1.1 release a couple months ago.

We also spend around 18.5% of time in slot allocation -- deciding what local variable slots to allocate to CPS variables. This takes time because we do a precise live variable analysis over the CPS, which itself has one variable for every result value and a label for every program point. Then we do register allocation, but in a way that could probably be optimized better. Perhaps with -O0 we should use a different strategy to allocate slots: one which preserves the values of variables that are available but dead. This would actually be an easier allocation task. An additional 1.5% is spent actually assembling the bytecode.
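To give a flavor of the liveness part, here is a toy backward live-variable analysis over straight-line code (a hypothetical sketch with made-up instruction records, not Guile's actual CPS-based analysis):

```javascript
// Toy backward live-variable analysis over straight-line code.
// insns: [{ def: "x", use: ["a", "b"] }, ...] in program order.
function liveness(insns) {
  const live = new Set();
  const liveAfter = [];
  for (let i = insns.length - 1; i >= 0; i--) {
    liveAfter[i] = new Set(live);             // live just after instruction i
    live.delete(insns[i].def);                // the definition kills liveness
    insns[i].use.forEach((v) => live.add(v)); // uses make variables live
  }
  return liveAfter;
}

const result = liveness([
  { def: "t", use: ["a", "b"] }, // t = a + b
  { def: "u", use: ["t", "a"] }, // u = t * a
  { def: "r", use: ["u"] },      // r = u
]);
// "a" and "t" are live after the first instruction; "b" is already dead:
console.log([...result[0]].sort()); // ["a", "t"]
```

The real analysis runs over the full CPS control-flow graph and has to iterate to a fixed point across branches; straight-line code needs only a single backward pass.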

Interestingly, partial evaluation, CPS conversion, and a couple of other small optimizations together account for only 3.6% of time; and reading and syntax expansion account for only 0.6% of time. This is good news at least :)

up in the trees, down in the weeds

Looking at the top-down tree profile lets me see that the compiler is spending most of its time doing things that the Guile 2.0 compiler doesn't do: loop optimizations, good slot allocations, and so on. To an extent, then, it's to be expected that the Guile 2.2 compiler is slower. This also explains why the C fast-paths weren't so effective at improving performance: the per-operation costs were already pretty low, and adding C implementations wasn't enough of a speedup to matter. The problem was not that intmap-ref et al were slow; it was that code was calling them a lot.

Improving the optimizer has been a bit challenging, not least due to the many axes of "better". Guile's compiler ran faster before the switch to "CPS soup" and persistent data structures, but it produced code that ran slower because I wasn't able to write the optimizations that I would have liked. Likewise, Guile 2.0's compiler ran faster, because it did a worse job. But before switching to CPS soup, Guile's compiler also used more memory, because per-program-point and per-variable computations were unable to share space with each other.

I think the top-down profiler has given me a better point of view in this instance, as I can reason about what I'm doing on a structural level, which I wasn't able to understand from the flat profile. Still, it's possible to misunderstand the performance impact of leaf functions when they are spread all over a tree, and for that reason I think we probably need both kinds of profilers.

In the case of Guile's compiler I'm not sure that I'll change much at this point. We'll be able to switch to native compilation without a fundamental compiler rewrite. But spending most of the time in functions related to data structures still seems pretty wrong to me on some deep level -- what if the data structures were faster? What if I wrote the code in some other way that didn't need the data structures so much? It gnaws at me. It gnaws and gnaws.

the half strap

Unfortunately, while compiling Scheme to native code will probably speed up the compiler, it won't necessarily speed up the bootstrap. I think the compiler has some 800 KB of source code right now, and let's say that we're able to do native compilation with 1200 KB. So 50% more code, but probably the result is two to ten times faster on average: a win, in terms of compiler speed, when compiled. But for bootstrap time, because in the beginning of the bootstrap most of the compiler isn't compiled, it could well be a slowdown.

This is the disadvantage of bootstrapping from an interpreter -- the more compiler you write, the slower your strap.

Note that this is different from the case where you bootstrap from a compiled Scheme compiler. In our case we do a half-bootstrap, first building an interpreter in C, compiling the interpreter in Scheme, then bootstrapping off that.

It's a common trope in compiler development: the heroic, farsighted compiler hacker refuses to add optimizations unless they make the compiler bootstrap faster. Dybvig says as much in his "History of Chez Scheme" paper. Well, sure -- if you're willing to accept complete responsibility for bootstrapping. From my side, I'm terrified that I could introduce some error in a binary that could reproduce itself worm-like into all my work and make it impossible to change anything. You think I jest, but the Sanely Bootstrappable Common Lisp papers instilled fear in me. Want to change your tagging scheme? You can't! Want to experiment with language, to start programming using features from your own dialect? You can't! No, thank you. I value my sanity more than that.

Incidentally, this also answers a common question people have: can I use some existing Guile to compile a new Guile? The answer is tricky. You can if the two Guiles implement the same language and virtual machine. Guile-the-language is fairly stable. However, due to the way that the VM and the compiler are co-developed, some of the compiler is generated from data exported by libguile. If that information happens to be the same on your Guile, then yes, it's possible. Otherwise no. For this reason it's not something we describe, besides cross-compilers from the same version. Just half strap: it takes a while but it's fairly fool-proof.

and that's it!

Thanks for reading I guess. Good jobbies! Next time, some words on Lua. Until then, happy strapping!

by Andy Wingo at January 11, 2016 09:51 PM

December 31, 2015

Diego Pino

Architecture of the Web Inspector

In Igalia we have been contributing to the WebKit project for many years, starting with WebKitGTK+ and progressively reaching other areas such as WebCore, improving accessibility, implementing new APIs, tooling, fixing tons of bugs, etc. The Web Inspector is another area where we have contributed, although much more modestly. This is a post I have wanted to write for a long time. It's a brief tour through the Web Inspector: here I discuss its architecture, history and main features. Hopefully this information can be useful for other people who would like to start hacking on this important component of the WebKit project.

What’s the Web Inspector?

The Web Inspector is a tool that is part of WebKit and allows you to inspect web pages. I mean inspect in a broad sense. It doesn’t only allow you to debug your JavaScript code, but it also includes a rich set of tools, usually divided into panels, which provide very valuable information such as loading times of resources (files, images), CSS properties and DOM elements edition, visual identification of elements within a web page, and many things more. It basically answers the question what’s going on inside a web page?

The difference between the Web Inspector and other tools, such as Firefox's famous Web Developer extension, is that the Web Inspector is not an external plugin: it's part of WebKit. That means that every WebKit port features the Web Inspector. Nowadays, all major open-source browsers include their own Web Inspector-like tool. Chrome features its Developer Tools and Firefox has its own Developer Tools too.

Throughout the years

The Web Inspector was shipped in WebKit for the first time in January 2006. Since then it has gone through big and small changes.

The first big change came in June 2007. There was a big redesign, the network panel was included for the first time, syntax highlighting, better error reporting, etc.

One year later, in September 2008, the Inspector went through another big redesign and more panels were included (Elements, Resources, Scripts, Profile, Database).

November 2009 brought better editing of DOM elements, inline creation of CSS rules and selectors, a widget for color editing, JSON and CSS syntax highlighting, etc.

In June 2012 a brand-new Web Inspector was introduced, but only for Safari 6. Exactly one year later, Safari 6's Web Inspector was open-sourced. For some time the new and old versions of the Web Inspector lived together in the codebase. This new inspector brought a major visual redesign and new layout elements: toolbar, navigation bar, quick console, content, sidebar, etc. The panels were structured in: Resources, Timeline, Debugger, Styles, Layers and Node. The current Web Inspector is still based on this release.

Two months before Apple open-sourced the new Web Inspector, Google forked WebKit and created Blink. The Web Inspector is known as Developer Tools in Chrome. Since Blink was forked two months before the new Web Inspector release, Chrome's Developer Tools are actually based on the old Web Inspector code, although it has probably gone through many changes over the last two years. The bottom line is that WebKit's Web Inspector and Chrome's Developer Tools look in fact very similar.

A first glimpse

The Web Inspector source code lives at Source/WebInspectorUI. Most of it is composed of HTML, CSS and JavaScript although there are parts in C++ to bridge with WebCore and JavaScriptCore. According to OpenHub, WebKit’s source code is 20% JavaScript. That’s a big amount of code, although not all JavaScript code in WebKit is part of the Web Inspector. There’s JavaScript also in the Layout tests and JavaScriptCore.

The WebInspectorUI, which represents the frontend, is structured in several folders:

  • Base/: The main classes (Bootstrap.js, WebInspector.js, EventListener.js, Object.js, etc).
  • Configurations/: Xcode configuration files.
  • Controllers/: Catch events and call Model classes to perform business logic. Important files: XXXManager.js (TimelineManager.js, DebuggerManager.js, StorageManager.js, LogManager.js).
  • Images/: Images for icons, visual elements, etc. The images for GTK are different due to license restrictions.
  • Localizations/: Contains a file with localized strings in English.
  • Models/: Classes that perform business logic (KeyboardShortcut.js, LogObject.js, Timeline.js, Color.js, etc).
  • Protocol/: Observers that respond to notifications emitted by the backend.
  • Scripts/: Several scripts in Perl, Ruby and Python to perform automatic tasks (update CodeMirror, update resources, minimize CSS, etc).
  • Tools/PrettyPrinting/: External tool for pretty-printing source code in the Console tab.
  • UserInterface/External:
    • CodeMirror/ : Very powerful text editor implemented in JavaScript for the browser. Used for code editing (JavaScript, CSS) inside Web Inspector.
    • ESLint/ : Check JavaScript syntax.
    • Esprima/ : JavaScript parser, for code completion in the editor.
  • Versions/: Description of protocols for different iOS Versions.
  • View/: Classes for visual element objects.

Getting started

Think of the Web Inspector as a web application that lives inside WebKit. It's possible to modify one of its elements (HTML, CSS or JavaScript) and see the change reflected in the UI just by typing the following command:

$ Tools/Scripts/build-webkit --inspector-frontend

But this only works in non-CMake ports (Apple), and it has the inconvenience of not updating localized strings either. In other ports, such as WebKitGTK+, there's no --inspector-frontend flag, so it's necessary to build WebKit. As usual, only the changed files are rebuilt:

$ Tools/Scripts/build-webkit --gtk
   WebKit is now built (00m:18s).
   To run MiniBrowser with this newly-built code, use the
   "../Tools/Scripts/run-minibrowser" script.

Once it’s built, open MiniBrowser and the Inspector (Ctrl+I) to see your changes:

$ Tools/Scripts/run-minibrowser --gtk

There’s a permanent tag with open bugs for the Web Inspector and the URL, can be used to file out new bugs related to the Web Inspector.

How to debug?

Definitely, when starting to hack on a new project it's fundamental to be able to see what's going on inside. If you try to inspect a Web Inspector variable using console.log(var) nothing will be printed out. It's necessary to build WebKit in debug mode and enable the developerExtrasEnabled flag:

$ Tools/Scripts/build-webkit --gtk --debug

In the case of the WebKitGTK+ port, developerExtrasEnabled is always set to TRUE.

It’s also possible to do the same in release mode, just by removing an #ifdef block in WebKit2/UIProcess/gtk/WebInspectorProxyGtk.cpp:

#ifndef NDEBUG
    // Allow developers to inspect the Web Inspector in debug builds
    // without changing settings.

Now all console.log(var) messages will be printed out in the shell. With this setting enabled, it's also possible to open a new Web Inspector to inspect the Web Inspector. With the Web Inspector open, right-click on one of its elements and select Inspect Element, just like in a normal web page.

Remote debugging

The Web Inspector is a multi-tier application. It's divided into 3 layers: a frontend, a backend and a target. This division detaches the inspector from the inspected browser. In other words, it's possible to use the inspector to inspect a browser running in a remote device. This can be useful to debug an iPhone web application or a WebKitGTK+ based browser running in an embedded environment, such as the Raspberry Pi.

On the browser to be inspected, first define the WEB_INSPECTOR_SERVER variable:

$ WEB_INSPECTOR_SERVER=<ip:port> Tools/Scripts/run-minibrowser --gtk

On the client side, open WebKit:

$ Tools/Scripts/run-minibrowser --gtk

Go to the server URL and you will see a Web Inspector which is actually inspecting a remote browser.

Architecture of the Web Inspector

As mentioned before, the Web Inspector is a 3-tier application divided into several layers:

  • Frontend, also known as the client.
  • Target, also known as the debuggee.
  • Backend, also known as the server.

The frontend (WebInspectorUI/UserInterface/) is the Web Inspector's user interface. It's what the user sees and interacts with. It's implemented in HTML, CSS and JavaScript. Sometimes this frontend is referred to as the debug client, whereas the backend is referred to as the server.

The target is the program being debugged. In normal operation of the Web Inspector, the target program is the WebKit loaded in the browser. In remote debugging mode, the inspected target is a WebKit loaded in a remote machine.

The backend (JavaScriptCore/inspector/ and /WebCore/inspector) is what mediates between the target and the frontend. It gives access to the elements that live in JavaScriptCore and WebCore. For instance, in order to be able to debug JavaScript code within a web page, it’s necessary to have access to JavaScriptCore’s stackframe. In consequence, JavaScriptCore has to provide hooks to the Web Inspector so it can access the inspected properties. The same happens with DOM elements. Showing up a DOM element properties requires that WebCore provides this information to the inspector.


Let’s dive into the frontend first. If you grep for HTML code hoping to find the layout of the inspector elements, you’re not going to find any code actually. The reason why it’s because all the layout elements in the inspector are via DOM operations (createElement, appendChild,…). For instance, in UserInterface/Views/SidebarPanel.js:

WebInspector.SidebarPanel = class SidebarPanel extends WebInspector.Object
{
    constructor(identifier, displayName, element, role, label)
    {
        this._element = element || document.createElement("div");
        this._element.classList.add("panel", identifier);

        this._contentElement = document.createElement("div");
        this._contentElement.className = "content";
    }
}

Everything starts with Main.html (UserInterface/Main.html). This file loads all the elements that compose the inspector UI. First it loads several external components (CodeMirror, JSLint), then CSS files, and after that all the JavaScript files that form the inspector. These files can be classified into different types. Some are architectural elements (Base/XXX.js), others implement business logic operations (Model/XXX.js), others implement programmatic logic (Controllers/XXX.js) and others implement visual elements and widgets (View/XXX.js). Usually UI classes have an associated CSS file of the same name, for instance Views/SidebarPanel.js and Views/SidebarPanel.css.

The WebInspector namespace (UserInterface/Base/WebInspector.js) is the central element of the frontend. Everything is going to be accessed from there:

var WebInspector = {}; // Namespace

As the model, controller and view classes are processed they are going to hook themselves to the WebInspector namespace. Here’s the definition of Views/DOMDetailsSidebarPanel.js:

WebInspector.DOMDetailsSidebarPanel = class DOMDetailsSidebarPanel extends WebInspector.DetailsSidebarPanel
{
    constructor(identifier, displayName, singularDisplayName, element, dontCreateNavigationItem)
    {
        super(identifier, displayName, singularDisplayName, element, dontCreateNavigationItem);

        this.element.addEventListener("click", this._mouseWasClicked.bind(this), true);

        this._domNode = null;
    }
}

It’s a classical Model-View-Controller pattern. The view accesses the controller to execute business logic operations, which are implemented by models. On the same files there’s:

    if (this._domNode && this._domNode.ownerDocument) {
        var mainResource = WebInspector.frameResourceManager.resourceForURL(this._domNode.ownerDocument.documentURL);
        if (mainResource)
            var parentFrame = mainResource.parentFrame;
    }

    WebInspector.handlePossibleLinkClick(event, parentFrame);

Where frameResourceManager is an instance of Controllers/FrameResourceManager.js. Views can also access models, not usually to perform operations on them but to query them. That's the case of the inspect method in the same file:

   // Iterate over the objects to find a WebInspector.DOMNode to inspect.
   for (var i = 0; i < objects.length; ++i) {
      if (objects[i] instanceof WebInspector.DOMNode) {
         nodeToInspect = objects[i];
      }
   }
Where DOMNode is a model class (./Models/DOMNode.js).

One thing that characterizes the frontend code is that it tends to quickly adopt new JavaScript features implemented by JavaScriptCore. All the code is structured in classes, makes use of inheritance, has getter and setter methods and for-of loops, uses the Map and Set classes, etc. Any new ES2015 feature that lands in JavaScriptCore, if it's convenient and simplifies code, makes its way into the inspector. And it makes sense to do it, as it's guaranteed that the latest JavaScriptCore version is going to be there. It also makes the inspector a helpful codebase for understanding new ES2015 features.

WebKit Remote Debugging Protocol

Until this point, most of the frontend architecture has been covered. I mentioned earlier that another layer of the inspector is the backend. The backend is what mediates between the target program and the frontend. It consists of several C++ classes that expose properties of WebCore (WebCore/inspector) and JavaScriptCore (JavaScriptCore/inspector) to the inspector. But how is it possible that C++ classes and JavaScript classes can exchange information?

The answer to that is the WebKit Remote Debugging Protocol, a JSON-formatted protocol that enables communication between the frontend and the backend, and vice versa. This protocol is based on the JSON-RPC 2.0 specification. Currently there's an attempt, under the RemoteDebug initiative, to standardize all the remote debugging protocols that major browsers use. Remote debugging is a bidirectional protocol: clients send asynchronous requests to the server, and the server responds to these requests and/or generates notifications. The protocol is divided into a number of domains.

  • Console: Defines methods and events for interaction with the JavaScript console.
  • Debugger: Exposes JavaScript debugging functions; allows setting and removing breakpoints, stepping through execution, exploring stack traces, etc.
  • DOM: Exposes DOM read/write operations.
  • DOM Debugger: Allows setting breakpoints on particular DOM operations and events. JavaScript execution will stop on these operations as if there was a regular breakpoint set.
  • Network: Allows tracking network activities of the page; exposes information about HTTP and WebSocket requests and responses, their headers, bodies, raw timing, etc.
  • Page: Actions and events related to the inspected page.
  • Runtime: Exposes JavaScript runtime by means of remote evaluation and mirror objects.
  • Timeline: Provides its clients with instrumentation records that are generated during the page runtime.

Each domain defines a number of commands it implements and events it generates. For instance, when setting a breakpoint in the frontend’s console, the following message is sent:

    "id": 10, // <-- command sequence number generated by the caller
    "method": "Debugger.setBreakpointByUrl", // <-- protocol method
    "params": { // <-- named parameters map
        "lineNumber": 23,
        "url": ""

For this command, the backend will generate the following response:

    "id": 10, // <-- same id as in the command
    "result": { // <-- command result
        "breakpointId": "",
        "locations": [
                "lineNumber": 23,
                "columnNumber": 10

Frontend-to-backend communication: an example

Let’s use the clearMessages command and messagesCleared event defined in the Console domain (JavaScriptCore/inspector/protocol/Console.json) to illustrate how frontend-to-backend communication works:

{
    "domain": "Console",
    "description": "Console domain defines methods and events for interaction with the JavaScript console...",
    "commands": [
        {
            "name": "clearMessages",
            "description": "Clears console messages collected in the browser."
        }
    ],
    "events": [
        {
            "name": "messagesCleared",
            "description": "Issued when console is cleared. This happens either upon <code>clearMessages</code> command or after page navigation."
        }
    ]
}

In the frontend, the LogManager class (Controllers/LogManager.js) sends a clearMessages command through ConsoleAgent:

    this._clearMessagesRequested = true;

    ConsoleAgent.clearMessages();

Domain commands are implemented in the backend by agents, which are located at WebCore/inspector/agents/ and JavaScriptCore/inspector/agents/, depending on what backend information they need to access. Agents are accessible from the frontend through the Window object. In Main.js (UserInterface/Base/Main.js):

// Enable the Console Agent after creating the singleton managers.
if (window.ConsoleAgent)
    ConsoleAgent.enable();

The glue code that communicates the frontend with the backend is implemented by a set of dispatcher classes. These classes are generated automatically during the build process, out of the definition of the protocol domains. Here is an excerpt of JavaScriptCore/DerivedSources.make:

    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/codegen/ \
    $(JavaScriptCore)/inspector/scripts/ \
    $(JavaScriptCore_SCRIPTS_DIR)/ \

# Inspector Backend Dispatchers, Frontend Dispatchers, Type Builders
InspectorFrontendDispatchers.h : CombinedDomains.json $(INSPECTOR_GENERATOR_SCRIPTS)
    $(PYTHON) $(JavaScriptCore)/inspector/scripts/ --framework JavaScriptCore --outputDir . ./CombinedDomains.json

This is something common in WebKit, where there are many classes for which there is no code in the Source/ directory; they are generated automatically and placed at Release/DerivedSources/. InspectorBackendDispatchers.h (WebKitBuild/Release/DerivedSources/JavaScriptCore/inspector/InspectorBackendDispatchers.h) implements an interface for the domain commands, while InspectorFrontendDispatchers.h implements an interface for the domain notifications:

class JS_EXPORT_PRIVATE ConsoleBackendDispatcherHandler {
public:
    virtual void enable(ErrorString&) = 0;
    virtual void disable(ErrorString&) = 0;
    virtual void clearMessages(ErrorString&) = 0;
    virtual void setMonitoringXHREnabled(ErrorString&, bool in_enabled) = 0;
    virtual void addInspectedNode(ErrorString&, int in_nodeId) = 0;
protected:
    virtual ~ConsoleBackendDispatcherHandler();
};

The dispatching of a command is done by matching a command name with a method name and calling that method.

void ConsoleBackendDispatcher::dispatch(long callId, const String& method, Ref<InspectorObject>&& message)
{
    Ref<ConsoleBackendDispatcher> protect(*this);

    if (method == "clearMessages")
        clearMessages(callId, message);
    else
        m_backendDispatcher->reportProtocolError(&callId, BackendDispatcher::MethodNotFound, makeString('\'', "Console", '.', method, "' was not found"));
}

InspectorConsoleAgent.cpp (JavaScriptCore/inspector/agents/InspectorConsoleAgent.cpp) implements the clearMessages() method. Once it has finished it will send a notification back to the frontend through the frontendDispatcher class.

void InspectorConsoleAgent::clearMessages(ErrorString&)
{
    m_expiredConsoleMessageCount = 0;
    m_previousMessage = nullptr;

    if (m_frontendDispatcher && m_enabled)
        m_frontendDispatcher->messagesCleared();
}

Backend-to-frontend communication: response

The frontend dispatcher is the mechanism by which the backend can send information to the frontend. The frontend dispatcher implements the protocol notifications of a domain:

// DO NOT EDIT THIS FILE. It is automatically generated from 
// CombinedDomains.json by the script: Source/JavaScriptCore/inspector/
// scripts/

class JS_EXPORT_PRIVATE ConsoleFrontendDispatcher {
public:
    ConsoleFrontendDispatcher(FrontendChannel* frontendChannel)
        : m_frontendChannel(frontendChannel) { }
    void messageAdded(RefPtr<Inspector::Protocol::Console::ConsoleMessage> message);
    void messageRepeatCountUpdated(int count);
    void messagesCleared();
private:
    FrontendChannel* m_frontendChannel;
};

In order to react to backend notifications, the frontend needs to register observers of backend events. This registration happens in Main.js:

// Register observers for events from the InspectorBackend.
if (InspectorBackend.registerConsoleDispatcher)
    InspectorBackend.registerConsoleDispatcher(new WebInspector.ConsoleObserver);
if (InspectorBackend.registerInspectorDispatcher)
    InspectorBackend.registerInspectorDispatcher(new WebInspector.InspectorObserver);
if (InspectorBackend.registerPageDispatcher)
    InspectorBackend.registerPageDispatcher(new WebInspector.PageObserver);

The frontend class ConsoleObserver.js (UserInterface/Protocol/ConsoleObserver.js) will react to the messagesCleared event and trigger the corresponding frontend logic:

WebInspector.ConsoleObserver = class ConsoleObserver
{
    // Events defined by the "Console" domain.
    messagesCleared()
    {
        // ...react to the backend notification...
    }
};

Web Inspector strings are localized. Localized strings are stored in localizedStrings.js (Localizations/en.lproj/localizedStrings.js). All UI strings are wrapped with the WebInspector.UIString() method, so that they are displayed localized:

grep -Rn "WebInspector.UIString(\"Resource" 

The contents of localizedStrings.js are not created manually, but by running the script update-webkit-localizable-strings. This script parses all the strings marked for localization and updates localizedStrings.js.

Bear this in mind if you send a patch with a new or modified string.

Sending a patch

When sending a patch, the subject should be prefixed with Web Inspector:. Before the patch is sent, some style checkers are run (Tools/Scripts/check-webkit-style) to verify that the patch complies with the Web Inspector coding style. It’s also possible to run the script manually:

    Tools/Scripts/prepare-ChangeLog --bug xxxxx
    # edit WebCore/ChangeLog

As usual, modify the updated ChangeLogs and run webkit-patch upload to send your patch.

Community and resources

There’s a very convenient Wiki with valuable sources of information such as pointers to blog posts, how to debug, open bugs, etc. The Web Inspector also has its own IRC channel at freenode: #webkit-inspector. Most of the work in Web Inspector is carried out by Timothy Hatcher, Joseph Pecoraro and Brian Burg, with contributions from other WebKit hackers and collaborators. I can tell that patches and bug reports are very welcome and reviews go very fast.

I also want to mention this post from Brian Burg that discusses the Web Inspector architecture, and which I used as a basis for this post.


The Web Inspector is an important component of the WebKit project. It’s an application structured in 3 layers: a frontend, a backend and a target. This abstraction allows detaching the inspector from the target (inspected WebKit browser).

The frontend is a web application composed of HTML + CSS + JavaScript. It implements a Model-View-Controller pattern and makes heavy use of ECMAScript 2015 features. The backend exposes information about WebKit’s WebCore and JavaScriptCore elements.

Communication between frontend and backend is provided by several dispatchers that connect both parts through the WebKit Remote Debugging Protocol (a JSON-RPC 2.0 based protocol). The protocol defines several domains. Each domain defines commands and events, which are the messages the frontend and the backend exchange with each other. The actual implementation of backend commands is provided by agent classes that live on the WebCore and JavaScriptCore side. On the frontend side, several observer classes can listen and respond to backend notifications.
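To make the message shapes concrete, here is an illustrative sketch. "Console.clearMessages" and "Console.messagesCleared" are real domain messages from the protocol, but the surrounding envelope, the id value and the use of Python here are my own, following the JSON-RPC 2.0 style the protocol is based on:

```python
import json

# Hypothetical frontend-to-backend command; the envelope is illustrative.
command = {"id": 42, "method": "Console.clearMessages", "params": {}}

# Hypothetical backend-to-frontend event: notifications carry no "id",
# because no response is expected for them.
event = {"method": "Console.messagesCleared", "params": {}}

# The backend dispatcher splits "Domain.command" to route the request,
# just as ConsoleBackendDispatcher::dispatch() matches method names.
decoded = json.loads(json.dumps(command))
domain, command_name = decoded["method"].split(".")
print(domain, command_name)  # Console clearMessages
```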

It has been a rather long post. I hope it can serve as a starting point for anyone interested in understanding or hacking in this important component of the WebKit project.

December 31, 2015 10:00 PM

December 30, 2015

Adrián Pérez

Introducing SnabbWall

If you have been following the blogs of my colleagues Katerina, Diego, and Andy it should not be a surprise for you to know that in Igalia we have a small —and awesome!— team working with Snabb Switch to provide novel, mostly-software solutions for networking.

We are now starting the development of SnabbWall, an application-level (Layer-7) firewall suite for Snabb Switch, and yours truly happens to be the lead developer, so I figured this is a good moment to introduce the project, which is kindly sponsored by the NLnet Foundation.


The one-minute introduction to Snabb Switch

Snabb Switch (which I'm going to abbreviate as SnS, it's getting long to type it!) works by taking over network interfaces, bypassing the kernel drivers, and using its own. When used to implement a certain network functionality, what your (SnS) program “sees” is no more than streams of raw packets. Gone are the TCP/IP stack, BSD sockets, and all the other functionality provided by the operating system kernel to user space programs. Heck, you even need to access hardware directly! There is nothing in there but the hardware waiting to have streams of electrons blasted through it. But things go real fast without the overhead of a traditional network stack, so despite the void, it is a good place to massage packets and pass them along.

It is fortunate that empty spaces do not last long as such: SnS comes with a set of built-in applications, which can be combined in different ways to achieve our (networking-related) goals. For example, we could pick the bridge application, combine it with the driver for our network interfaces, and (bam!) we have a super fast Ethernet bridge. And even more.

Now is when we remember that SnS is a software component. That means that it is code running on a computer, and as such, it can be modified. Why not add packet filtering capabilities to the bridge we built, so that one of the ports only accepts certain kinds of traffic? We could write a small SnS application which only allows HTTP traffic to pass:

local ethernet = require("lib.protocol.ethernet")
local datagram = require("lib.protocol.datagram")
local link = require("")

local HttpOnly = {
  new = function (self) return setmetatable({}, self) end;
  push = function (self)
    local n = math.min(link.nreadable(self.input.input),
                       link.nwritable(self.output.output))
    for _ = 1, n do
      local p = link.receive(self.input.input)
      local d = datagram:new(p, ethernet)
      local ip4 = d:stack()[2]
      if ip4:sport() == 80 or ip4:dport() == 80 then
        link.transmit(self.output.output, p)
      end
    end
  end;
}

Well, that was not long. But you can see where this is going: if we are serious about making a full-fledged firewall, our cute little program is going to mutate into something hairy. We would like to have something like netfilter or, even better, something neater which allows specifying rules in a higher-level syntax like OpenBSD's Packet Filter, package it as a SnS application, and have it ready for reuse.

Dear readers: meet SnabbWall

SnabbWall is —will be, when completed— a set of SnS applications and programs. Applications are reusable in your own designs, like every other SnS application.

SnabbWall components

The L7 Spy application is capable of identifying protocol data flows (that is, it works at the application level, or Layer-7), but other than that packets just flow through it. The idea here is that sometimes it is interesting to just know which kind of traffic passes through the network, for example to gather statistics. If a packet is determined to belong to a certain protocol, ancillary metadata is attached to the packet. The way metadata is handled never modifies the packet itself, so applications which are not designed to handle it do not need to be modified.

On the other hand, the L7 Firewall application implements the actual logic of matching packets against a set of rules which determine what to do with each one of them. What is special about this application is that, on top of what other filtering solutions like pflua may offer, it also allows matching against the additional metadata generated by L7 Spy — if present.

Note that it is not at all necessary to use both applications in tandem: they can function independently, to allow others to mix-and-match them as desired. Yet, they are designed to work together, and SnabbWall also provides a standalone program (snabb wall) which implements a complete application-level firewall.

Deep Space 9

Inferring the protocol to which a packet belongs is a tough job: it requires inspecting the whole contents of each packet, including decoding of protocol fields and reading the data payloads. This is typically referred to as deep packet inspection (DPI), and if the term makes you feel uneasy it is not without reason: DPI is used by evil governments and faceless corporations to eavesdrop on us netizens, or by greedy Internet service providers to cap some kinds of traffic, thus attacking net neutrality. Which does not mean that DPI is bad per se: like any other tool, it can also be used to pursue noble purposes. Who wouldn't want their video calls to be smooth thanks to the network giving priority to real-time audio and video, after all? DPI is just one more tool, and as such it can be misused — so can a hammer. But implementing DPI algorithms is not among the goals of the project; if possible, an existing hammer had to be chosen.

There are Free Software solutions which already provide DPI functionality. After evaluating the candidates, it was decided that SnabbWall (in particular, the L7 Spy application) would use nDPI. It is a well-maintained, reliable piece of software, with a C API which lends itself to being wrapped using LuaJIT's FFI. It has a long history —for a piece of software— as it started life as OpenDPI, which itself was a trimmed-down version of a commercial product, and along its multiple lives it has been tweaked to the point that nDPI provides the best protocol/application detection of the pack.

Here is a commented shortlist of other implementations which were evaluated before settling on nDPI:

  • l7-filter is designed to be used along with the rest of the netfilter stack. That is, in kernel space and reusing components of the Linux kernel, which makes it unsuitable to use in user space.
  • Hippie, apart from being kernel-only, seems super abandoned: last commit in 2008, and having to use CVS to check out the code is guaranteed to give chills to anyone.
  • OpenDPI is no longer available. It is possible to find a copy of the latest version of the source code published under the LGPL, but nevertheless the project is abandoned and unmaintained.
  • nDPI itself is an improved fork of OpenDPI, actively maintained by the ntop developers. The inferred protocols and applications for inspected packets are consistently the best among the solutions based on Free Software.
  • libprotoident uses an interesting technique dubbed Lightweight Packet Inspection (LPI): instead of looking at the whole payload, it only checks the first four bytes of each packet. This implementation is very interesting because it ensures that payload data does not need to be read in full without sacrificing much on the side of detection reliability, which surely looks better in the eyes of those concerned about potential eavesdropping.
  • SPID also follows a different approach which does not read packet payloads in full, using statistical data instead. Unfortunately it is just a prototype.

(For an exhaustive discussion of some of the options outlined, I recommend reading through the “Independent Comparison of Popular DPI Tools for Traffic Classification” paper by Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros — which is available in PDF.)
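As a toy illustration of the LPI idea mentioned for libprotoident above (not its actual API or signature database — both the signature table and the matching rule here are made up for the example), classifying a flow can be as simple as comparing the first four payload bytes against known patterns:

```python
# Hypothetical 4-byte payload signatures. Real LPI databases are far
# larger and also combine both directions of a flow and payload sizes.
SIGNATURES = {
    b"GET ": "http",
    b"POST": "http",
    b"SSH-": "ssh",
    b"\x16\x03\x01\x00": "tls",
}

def classify(payload: bytes) -> str:
    # Only the first four bytes of the payload are ever inspected.
    return SIGNATURES.get(payload[:4], "unknown")

print(classify(b"GET /index.html HTTP/1.1\r\n"))  # http
print(classify(b"SSH-2.0-OpenSSH_7.4\r\n"))       # ssh
```

The appeal, as noted above, is that full payloads never need to be read, which matters to anyone concerned about eavesdropping.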


As mentioned, the project has just been started and so far I have been able to put up its website at, and work on the design of the system. There is also a roadmap which I plan to keep updated as the project progresses, though I expect that tech-savvy readers may prefer to delve into the code at GitHub.

At this very moment I am already starting to write the Lua binding for nDPI, using the handy LuaJIT FFI, which I am trying to keep independent from the rest of SnabbWall. Hopefully we will be able to release it as a separate component. Stay tuned!

by aperez ( at December 30, 2015 06:20 PM

December 21, 2015

Jacobo Aragunde

LibreOffice hackfest in Madrid

Earlier this month, I attended a LibreOffice hackfest held in Madrid. My goal for this hackfest was to work on the accessibility of the suite, in particular on the problem exposed in bugs 35652 and 91739, which together prevent the “say all” feature in ATs from working properly with LibreOffice.

The root of the problem is that LibreOffice only exposes a subset of the document through the accessibility APIs; in particular, only the visible parts of the document. This makes sense from the performance point of view, especially for huge documents, but poses a challenge for ATs.

To work around this behavior, LibreOffice provides some means to navigate to off-screen elements through the flows-to relation. The last paragraph in the visible area has a relation of this kind with the next paragraph even if the latter is not on screen. Unfortunately, the associated accessibility tree does not reflect a proper parent-child structure when these relations are used: objects which should be present are absent, because they are actually never added to the tree. Additionally, when the user changes the visible area of the document, the accessible tree is rebuilt to contain only the visible contents of the document, and it leaves orphaned objects left and right.

The proposal from my colleague Joanie, maintainer of Orca, is creating a new set of top-level children for the accessible document object to represent the pages in the document. These would be exposed through the accessibility APIs at any time, regardless of the visibility of said pages, and the actual contents would be lazily loaded when required by ATs. This should expose information to ATs properly in a tree structure and keep most of the document unloaded until really needed. Besides, the problems that appear when navigating flows-to relations or changing the visible area should be addressed to keep the accessible tree clean and up-to-date.

Sadly, I could make no significant advances in the implementation of the above solution. First, the notion of accessible pages doesn’t even exist in Writer, it should be implemented completely from scratch. Besides, the “only visible objects exist” logic is deeply rooted in the code and it would need much more work than three days of hackfest. Still, the analysis work has been very valuable to see where the problems are and pave the way to fix them.

Finally, I’ve also been helping a new contributor to land his first patches in LibreOffice. We started with a cleanup of ATK roles, one of the first tasks I touched when working on accessibility in LibreOffice, which was still ongoing and is now close to being finished. We also tried to debug a docx import problem regarding the flip property and instead found a regression in image flip for certain types of bitmaps. The work resulted in another patch for the latter problem which is already merged.

This has been a great occasion to hang out with the community, help newcomers and get some hacking done. I’m very thankful to Igalia for having sponsored my trip and stay in Madrid. See you at the next event!

Igalia & LibreOffice

by Jacobo Aragunde Pérez at December 21, 2015 12:23 PM

December 20, 2015

Manuel Rego

Layout Topics on the Web Engines Hackfest 2015

As usual Igalia hosted the Web Engines Hackfest 2015 in A Coruña two weeks ago. This is a special event where the main focus is hacking on the different engines of the open web platform. The developers discuss their plans and define tasks during the breakout sessions.

Despite the unconference format, as people liked them last year, we arranged a few talks during the hackfest on a wide range of topics. You can find the slides on the hackfest wiki and the recordings will be published online soon (stay tuned following @webhackfest on twitter).

Regarding my participation, I was mostly interested in layout topics. Of course you can think of CSS Grid Layout, but I also approached the work around the MathML implementation in WebKit that my mate Álex has been leading lately.

CSS Grid Layout

It’s been a while since I last wrote a post about grid layout. We’ve been working on lots of things, but lately I’ve been focused on supporting a new feature in Blink, which I usually call “implicit grid before explicit”.

Trying to explain it briefly, the implicit grid feature basically means that the grid can add tracks (columns or rows) on demand. So, if you place items outside of the grid, the implicit tracks will be created. Initially you could only add those implicit tracks after the explicit grid (on the right/bottom), but now the spec allows you to add them before (on the left/top).

Example of a 2x2 grid with implicit columns before and after the explicit grid

In order to support this, I’ve ended up refactoring all the placement code. I wrote a small document with the plan where you can get a more detailed explanation about this feature. During the hackfest I worked on the patch that adds initial support for this, which is right now in the review process and will hopefully land before the end of the year.

On the other hand, taking advantage of being together with Javi and Sergio for a few days, we discussed some of the hot topics in the current grid layout implementation: intrinsic sizes (min-content and max-content constraints are not fully supported in Blink and WebKit yet) and orthogonal flows.

Whiteboard discussion about grid track sizing algorithm and orthogonal flows

Last thing related to grid layout: João Oliveira was talking to me about how he could start contributing to it. My two ideas were: adding some support for grid layout to the Chrome DevTools, or helping with the W3C test suite.

He created his first W3C test and it has already been integrated in the suite (forgive me for the picky review 😇). Maybe he’ll keep contributing with more tests (or someone else will); I’ll be more than happy to help with reviews and to resolve any doubts regarding the grid layout spec. If you’re interested, don’t hesitate to contact me. Having a good test suite would be really nice to check the interoperability of the different implementations.


During the hackfest Álex and Fred explained the current status of MathML implementation in WebKit and the refactoring plans they have.

The main issue is that the MathML implementation in WebKit has been unmaintained for some time. In addition, trying to add new features on top of the current implementation is really complex, as it inherits flexbox logic which, ironically, removes a lot of flexibility to do the math layouts.

Álex and Fred during the MathML breakout session Álex and Fred during the MathML breakout session

The plan is to remove the dependency on flexbox, and there are already some patches doing it. This code will start to be upstreamed in the short term. The proposal has been shared with the rest of the WebKit community and people seem to agree with the plan.

This code simplification will give MathML its own renderers in charge of the different MathML layouts. This will hopefully be a step in the right direction towards getting MathML back into Blink.

Álex, Fred and Javi have been working really hard on this refactoring, and I was contributing with my tiny bit during the hackfest. I hope to find time next year and try to help to move this forward.


Last but not least, big thanks to each person involved on the hackfest, from people attending to everyone giving us a hand on the different topics. We really hope you enjoyed this edition. See you all in the next one! 😉

Web Engines Hackfest 2015 sponsors: Collabora and Igalia

Thanks again to the sponsors Collabora and Igalia for making possible such a great event.

December 20, 2015 11:00 PM

December 17, 2015

Alberto Garcia

Improving disk I/O performance in QEMU 2.5 with the qcow2 L2 cache

QEMU 2.5 has just been released, with a lot of new features. As with the previous release, we have also created a video changelog.

I plan to write a few blog posts explaining some of the things I have been working on. In this one I’m going to talk about how to control the size of the qcow2 L2 cache. But first, let’s see why that cache is useful.

The qcow2 file format

qcow2 is the main format for disk images used by QEMU. One of the features of this format is that its size grows on demand, and the disk space is only allocated when it is actually needed by the virtual machine.

A qcow2 file is organized in units of constant size called clusters. The virtual disk seen by the guest is also divided into guest clusters of the same size. QEMU defaults to 64KB clusters, but a different value can be specified when creating a new image:

qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G

In order to map the virtual disk as seen by the guest to the qcow2 image in the host, the qcow2 image contains a set of tables organized in a two-level structure. These are called the L1 and L2 tables.

There is one single L1 table per disk image. This table is small and is always kept in memory.

There can be many L2 tables, depending on how much space has been allocated in the image. Each table is one cluster in size. In order to read or write data to the virtual disk, QEMU needs to read its corresponding L2 table to find out where that data is located. Since reading the table for each I/O operation can be expensive, QEMU keeps a cache of L2 tables in memory to speed up disk access.
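To make the two-level lookup concrete, here is a sketch of the index arithmetic (a simplification that ignores the actual on-disk entry layout and flags, assuming the default 64 KB clusters with 8-byte L2 entries):

```python
CLUSTER_SIZE = 64 * 1024          # default qcow2 cluster size
L2_ENTRY_SIZE = 8                 # each L2 entry is 8 bytes
ENTRIES_PER_L2 = CLUSTER_SIZE // L2_ENTRY_SIZE  # 8192 entries per table

def locate(guest_offset):
    """Return (L1 index, L2 index, offset within cluster) for a guest offset."""
    cluster_index = guest_offset // CLUSTER_SIZE
    l1_index = cluster_index // ENTRIES_PER_L2
    l2_index = cluster_index % ENTRIES_PER_L2
    return l1_index, l2_index, guest_offset % CLUSTER_SIZE

# One L2 table covers 8192 * 64 KB = 512 MB of virtual disk, so offset
# 512 MB lands on the first entry of the second L2 table.
print(locate(512 * 1024 * 1024))  # (1, 0, 0)
```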

The L2 cache can have a dramatic impact on performance. As an example, here’s the number of I/O operations per second that I get with random read requests in a fully populated 20GB disk image:

L2 cache size Average IOPS
1 MB 5100
1.5 MB 7300
2 MB 12700
2.5 MB 63600

If you’re using an older version of QEMU you might have trouble getting the most out of the qcow2 cache because of this bug, so either upgrade to at least QEMU 2.3 or apply this patch.

(In addition to the L2 cache, QEMU also keeps a refcount cache. This is used for cluster allocation and internal snapshots, but I’m not covering it in this post. Please refer to the qcow2 documentation if you want to know more about refcount tables.)

Understanding how to choose the right cache size

In order to choose the cache size we need to know how it relates to the amount of allocated space.

The amount of virtual disk that can be mapped by the L2 cache (in bytes) is:

disk_size = l2_cache_size * cluster_size / 8

With the default values for cluster_size (64KB) that is

disk_size = l2_cache_size * 8192

So in order to have a cache that can cover n GB of disk space with the default cluster size we need

l2_cache_size = disk_size_GB * 131072

QEMU has a default L2 cache of 1MB (1048576 bytes) so using the formulas we’ve just seen we have 1048576 / 131072 = 8 GB of virtual disk covered by that cache. This means that if the size of your virtual disk is larger than 8 GB you can speed up disk access by increasing the size of the L2 cache. Otherwise you’ll be fine with the defaults.
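The formulas above can be checked with a few lines (this just mirrors the arithmetic of the post; the 131072 constant only holds for the default 64 KB cluster size):

```python
def l2_cache_for(disk_size_gb, cluster_size=64 * 1024):
    # Each 8-byte L2 entry maps one cluster, so:
    # disk_size = l2_cache_size * cluster_size / 8
    return disk_size_gb * 2**30 * 8 // cluster_size

print(l2_cache_for(8))   # 1048576: the default 1 MB cache covers 8 GB
print(l2_cache_for(20))  # 2621440 bytes needed for a 20 GB image
```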

How to configure the cache size

Cache sizes can be configured using the -drive option in the command-line, or the ‘blockdev-add‘ QMP command.

There are three options available, and all of them take bytes:

  • l2-cache-size: maximum size of the L2 table cache
  • refcount-cache-size: maximum size of the refcount block cache
  • cache-size: maximum size of both caches combined

There are two things that need to be taken into account:

  1. Both the L2 and refcount block caches must have a size that is a multiple of the cluster size.
  2. If you only set one of the options above, QEMU will automatically adjust the others so that the L2 cache is 4 times bigger than the refcount cache.

This means that these three options are equivalent:

-drive file=hd.qcow2,l2-cache-size=2097152
-drive file=hd.qcow2,refcount-cache-size=524288
-drive file=hd.qcow2,cache-size=2621440
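A quick sanity check that the three option sets above describe the same configuration, given the automatic 4:1 adjustment between the two caches (plain arithmetic, independent of QEMU itself):

```python
l2_cache_size = 2097152
refcount_cache_size = l2_cache_size // 4  # QEMU keeps L2 4x the refcount cache
cache_size = l2_cache_size + refcount_cache_size

print(refcount_cache_size)  # 524288
print(cache_size)           # 2621440
```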

Although I’m not covering the refcount cache here, it’s worth noting that it’s used much less often than the L2 cache, so it’s perfectly reasonable to keep it small:

-drive file=hd.qcow2,l2-cache-size=4194304,refcount-cache-size=262144

Reducing the memory usage

The problem with a large cache size is that it obviously needs more memory. QEMU has a separate L2 cache for each qcow2 file, so if you’re using many big images you might need a considerable amount of memory if you want to have a reasonably sized cache for each one. The problem gets worse if you add backing files and snapshots to the mix.

Consider this scenario:

Here, hd0 is a fully populated disk image, and hd1 a freshly created image as a result of a snapshot operation. Reading data from this virtual disk will fill up the L2 cache of hd0, because that’s where the actual data is read from. However hd0 itself is read-only, and if you write data to the virtual disk it will go to the active image, hd1, filling up its L2 cache as a result. At some point you’ll have in memory cache entries from hd0 that you won’t need anymore because all the data from those clusters is now retrieved from hd1.

Let’s now create a new live snapshot:

Now we have the same problem again. If we write data to the virtual disk it will go to hd2 and its L2 cache will start to fill up. At some point a significant amount of the data from the virtual disk will be in hd2, however the L2 caches of hd0 and hd1 will be full as a result of the previous operations, even if they’re no longer needed.

Imagine now a scenario with several virtual disks and a long chain of qcow2 images for each one of them. See the problem?

I wanted to improve this a bit, so I worked on a new setting that allows the user to reduce memory usage by removing cache entries that are not being used.

This new setting is available in QEMU 2.5, and is called ‘cache-clean-interval‘. It defines an interval (in seconds) after which all cache entries that haven’t been accessed are removed from memory.

This example removes all unused cache entries every 15 minutes:

-drive file=hd.qcow2,cache-clean-interval=900

If unset, this parameter defaults to 0, which disables the feature.

Further information

In this post I only intended to give a brief summary of the qcow2 L2 cache and how to tune it in order to increase the I/O performance, but it is by no means an exhaustive description of the disk format.

If you want to know more about the qcow2 format, here are a few links:


My work in QEMU is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the QEMU development team.

Enjoy QEMU 2.5!

by berto at December 17, 2015 03:39 PM

December 08, 2015

Víctor Jáquez

GStreamer VA-API 0.7.0

GStreamer VA-API 0.7.0 is here! As is usually said, “grab it while it is fresh”, especially the distributions.

In a previous blog post we explained in detail what GStreamer VA-API is. Also, last October, we talked about it at the GStreamer Conference 2015; you can glance at the slides here.

This release includes three major changes:

  • Added support for VP9 decoding.
  • Improved a lot the HEVC decoder.
  • Some fixes in the integration with OpenGL platforms.

Please note that VP9 decoding is only supported from Braswell and Skylake onwards, using Intel’s hybrid driver for libva, which is a subset of a VA-API backend that can be plugged into Intel’s VA-API backend (adding the --hybrid-codec parameter to the intel-driver’s configure script).

HEVC (a.k.a. H.265) decoding is more stable and has better performance. And, remember, it is only available on Skylake and Cherry View chipsets.

Now, OpenGL3 is handled correctly through GstGLTextureUploadMeta.

But there are many more fixes, improvements and enhancements, such as better handling of corrupted H264 streams, better GstContext handling, the enabling of vaapidecodebin (finally!), etc.

Here’s the git shortlog summary since the last 0.6.0 release:

 5  Gwenole Beauchesne
 1  Jan Schmidt
 2  Lim Siew Hoon
 1  Mark Nauwelaerts
 1  Olivier Crete
61  Sreerenj Balachandran
72  Víctor Manuel Jáquez Leal

by vjaquez at December 08, 2015 08:01 PM

November 15, 2015

Diego Pino

Reusing LuaJIT's disassembler

Last year at Igalia we started coding pflua, a high-performance packet filter which runs on top of LuaJIT. Pflua is now included in Snabb Switch as an external library, and it’s used to do all the packet filtering tasks that were initially done using libpcap. Pflua is capable of compiling pflang expressions to Lua functions, performing several optimizations along the way. One of the first tasks I did in pflua was obtaining the machine code that LuaJIT produces for a translated Lua function. In this post, I take a high-level look at LuaJIT’s disassembler, the piece of code that allows LuaJIT to print out the compiled machine code it produces.

LuaJIT 2.1, currently in beta stage, introduced two very useful tools: a statistical profiler and a code dumper. Both tools are actually Lua modules. They can be used either from source code or externally when calling LuaJIT. In the latter case, it’s necessary to add the parameters -jp=options and -jdump=options.

LuaJIT’s profiler can help us understand what code is hot, in other words, what parts of the program are consuming most of the CPU cycles. The dumper helps us understand what code is produced for every trace. It’s possible to peek at the resulting Lua bytecode, LuaJIT’s SSA IR (static single assignment intermediate representation, a form often used in compilers for procedural languages such as Java and C) and machine code, using different flags. For both the profiler and the code dumper, all the possible flags are best documented in the headers of their respective source code files: src/jit/p.lua and src/jit/dump.lua. In the particular case of the code dumper, the flag ‘m’ prints out the machine code for a trace:

$ luajit -jdump=m -e "local x = 1; for i=1,1e6 do x = x + 1 end; print(x)"
---- TRACE 1 start (command line):1
---- TRACE 1 mcode 81
0bcaffa3  mov dword [0x40c22410], 0x1
0bcaffae  movsd xmm0, [0x40bd7120]
0bcaffb7  cvttsd2si ebp, [rdx+0x8]
0bcaffbc  cmp dword [rdx+0x4], 0xfffeffff
0bcaffc3  jnb 0x0bca0010        ->0
0bcaffc9  movsd xmm7, [rdx]
0bcaffcd  addsd xmm7, xmm0
0bcaffd1  add ebp, +0x01
0bcaffd4  cmp ebp, 0x000f4240
0bcaffda  jg 0x0bca0014 ->1
0bcaffe0  addsd xmm7, xmm0
0bcaffe4  add ebp, +0x01
0bcaffe7  cmp ebp, 0x000f4240
0bcaffed  jle 0x0bcaffe0        ->LOOP
0bcaffef  jmp 0x0bca001c        ->3

LuaJIT uses its own disassembler to print out machine code. There’s one for every supported architecture: ARM, MIPS, PPC, x86 and x64. LuaJIT’s disassemblers are actually Lua modules, written in Lua (some of them as small as 500 LOC), and live at src/jit/dis_XXX.lua. Mike Pall comments in the header of dis_x86.lua that an external disassembler could have been used to do the actual disassembling, integrating its output with the dumper module, but that design would have been more fragile. So he decided to implement his own disassemblers.

Since they are coded as modules, it is possible to reuse them in other programs. Basically, each disassembler module exports three functions: create, disass and regname. The disass function creates a new context and disassembles a chunk of code starting at a given address. So it is possible to pass a section of a binary and get it decoded using the disass function.

In the example below, I’m going to use LuaJIT’s x86-64 disassembler to print out the .text section of a binary, which is the section that contains the compiled source code. I use the binary of the following hello-world program as input.

#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        printf("Usage: hello <name>\n");
        return 1;
    }
    printf("Hello %s!", argv[1]);
    return 0;
}
I need to know the offset and size of the .text section:

$ readelf --wide -S ./hello-world
Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [13] .text             PROGBITS        0000000000400520 000520 0001b2 00  AX  0   0 16
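
The same numbers can also be obtained programmatically. As a rough sketch (my own addition, not part of the original post), here is a minimal ELF64 little-endian section-header parser in Python; it builds a tiny synthetic ELF image in memory, with a .text entry matching the readelf output above, so the example is self-contained:

```python
import struct

def find_section(elf, name):
    """Return (offset, size) of the named section in an ELF64 LE image."""
    # e_shoff (offset 40), then e_shentsize, e_shnum, e_shstrndx (offset 58).
    e_shoff, = struct.unpack_from("<Q", elf, 40)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", elf, 58)
    def shdr(i):
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size, ...
        return struct.unpack_from("<IIQQQQIIQQ", elf, e_shoff + i * e_shentsize)
    # Section names live in the section-header string table.
    str_off = shdr(e_shstrndx)[4]
    for i in range(e_shnum):
        s = shdr(i)
        end = elf.index(b"\0", str_off + s[0])
        if elf[str_off + s[0]:end] == name:
            return s[4], s[5]            # sh_offset, sh_size
    raise KeyError(name)

# Build a minimal synthetic ELF64 image whose .text header matches the
# readelf output above (offset 0x520, size 0x1b2).
shstrtab = b"\0.text\0.shstrtab\0"
e_shoff = 64 + len(shstrtab)
ehdr = struct.pack("<16sHHIQQQIHHHHHH",
                   b"\x7fELF\x02\x01\x01" + b"\0" * 9,  # 64-bit, little-endian
                   2, 62, 1, 0, 0, e_shoff, 0, 64, 0, 0, 64, 3, 2)
def mk_shdr(name_off, offset, size):
    return struct.pack("<IIQQQQIIQQ", name_off, 1, 0, 0, offset, size, 0, 0, 1, 0)
shdrs = (mk_shdr(0, 0, 0) +             # null section
         mk_shdr(1, 0x520, 0x1b2) +     # .text
         mk_shdr(7, 64, len(shstrtab))) # .shstrtab
elf = ehdr + shstrtab + shdrs

offset, size = find_section(elf, b".text")
print(hex(offset), hex(size))  # 0x520 0x1b2
```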

Now all I have to do is to read that chunk of the binary and pass it to disass.

local disass = require("dis_x64")

local function readfile(name, offset, size)
   local f =, "rb")
   if not f then
      error(("Couldn't open file: %s"):format(name))
   end
   f:seek("set", offset)
   local content = f:read(size)
   f:close()
   return content
end

local function main()
   local filename, offset, size = unpack(arg)
   assert(filename, "Usage: dis.lua <filename> <offset> <size>")
   offset = assert(tonumber(offset), "No valid offset")
   size = assert(tonumber(size), "No valid size")
   local mcode = readfile(filename, offset, size)
   disass.disass(mcode)
end

main()

And this is what I get when I print out the first 10 lines of the .text section:

$ luajit ./dis.lua ./hello-world 0x520 0x1b2 | head -10
00000000  31ED              xor ebp, ebp
00000002  4989D1            mov r9, rdx
00000005  5E                pop rsi
00000006  4889E2            mov rdx, rsp
00000009  4883E4F0          and rsp, -0x10
0000000d  50                push rax
0000000e  54                push rsp
0000000f  49C7C0D0064000    mov r8, 0x004006d0
00000016  48C7C160064000    mov rcx, 0x00400660
0000001d  48C7C716064000    mov rdi, 0x00400616

To validate that the output is correct, I compared it with the output produced by ndisasm and objdump.

$ ndisasm -b 64 ./hello-world | grep -A 10 "400520"
00000520  31ED              xor ebp,ebp
00000522  4989D1            mov r9,rdx
00000525  5E                pop rsi
00000526  4889E2            mov rdx,rsp
00000529  4883E4F0          and rsp,byte -0x10
0000052D  50                push rax
0000052E  54                push rsp
0000052F  49C7C0D0064000    mov r8,0x4006d0
00000536  48C7C160064000    mov rcx,0x400660
0000053D  48C7C716064000    mov rdi,0x400616

It looks almost the same as LuaJIT’s disassembler, but that’s because LuaJIT’s disassembler follows the ndisasm format, as stated in the source code. Objdump produces slightly different but semantically equivalent output:

$ objdump -M intel -j .text -d hello-world

Disassembly of section .text:

0000000000400520 <_start>:
  400520:       31 ed                   xor    ebp,ebp
  400522:       49 89 d1                mov    r9,rdx
  400525:       5e                      pop    rsi
  400526:       48 89 e2                mov    rdx,rsp
  400529:       48 83 e4 f0             and    rsp,0xfffffffffffffff0
  40052d:       50                      push   rax
  40052e:       54                      push   rsp
  40052f:       49 c7 c0 d0 06 40 00    mov    r8,0x4006d0
  400536:       48 c7 c1 60 06 40 00    mov    rcx,0x400660
  40053d:       48 c7 c7 16 06 40 00    mov    rdi,0x400616

It is possible to do the same thing by instantiating a context via create and calling context:disass() to disassemble a chunk of machine code. This approach allows finer control of the output, as create is passed a callback that is invoked for each disassembled line. We could accumulate the disassembled lines in a variable or print them out to stdout.
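
As a sketch of that callback-based approach (my own addition, assuming mcode holds the machine-code bytes read as in the script above, and that create takes code, a start address and an output callback, as described in this post):

```lua
local disass = require("dis_x64")

-- Accumulate the disassembled lines instead of printing them out.
local lines = {}
local ctx = disass.create(mcode, 0, function(line)
   table.insert(lines, line)
end)
ctx:disass()
print(table.concat(lines))
```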

November 15, 2015 10:00 PM

November 11, 2015

Jacobo Aragunde

Updated LibreOffice workshop at A Coruña University

I’ve been invited to repeat the workshop about LibreOffice I conducted last year at the University of A Coruña. Contents are largely the same but I’ve done some updates, hence this short entry to share the new slides:

It was a great session, with many interesting questions from the students. Thanks to Juan José Sánchez for the invitation! I hope to be able to repeat next year :)

by Jacobo Aragunde Pérez at November 11, 2015 04:45 PM

November 09, 2015

Andy Wingo

embracing conway's law

Most of you have heard of "Conway's Law", the pithy observation that the structure of things that people build reflects the social structure of the people that build them. The extent to which there is coordination or cohesion in a system as a whole reflects the extent to which there is coordination or cohesion among the people that make the system. Interfaces between components made by different groups of people are the most fragile pieces. This division goes down to the inner life of programs, too; inside it's all just code, but when a program starts to interface with the outside world we start to see contracts, guarantees, types, documentation, fixed programming or binary interfaces, and indeed faults as well: how many bug reports end up in an accusation that team A was not using team B's API properly?

If you haven't heard of Conway's law before, well, welcome to the club. Inneresting, innit? And so thought I until now; a neat observation with explanatory power. But as aspiring engineers we should look at ways of using these laws to build systems that take advantage of their properties.

in praise of bundling

Most software projects depend on other projects. Using Conway's law, we can restate this to say that most people depend on things built by other people. The Chromium project, for example, depends on many different libraries produced by many different groups of people. But instead of requiring the user to install each of these dependencies, or even requiring the developer that works on Chrome to have them available when building Chrome, Chromium goes a step further and just includes its dependencies in its source repository. (The mechanism by which it does this isn't a direct inclusion, but since it specifies the version of all dependencies and hosts all code on Google-controlled servers, it might as well be.)

Downstream packagers like Fedora bemoan bundling, but they ignore the ways in which it can produce better software at lower cost.

One way bundling can improve software quality is by reducing the algorithmic complexity of product configurations, when expressed as a function of its code and of its dependencies. In Chromium, a project that bundles dependencies, the end product is guaranteed to work at all points in the development cycle because its dependency set is developed as a whole and thus uniquely specified. Any change to a dependency can be directly tested against the end product, and reverted if it causes regressions. This is only possible because dependencies have been pulled into the umbrella of "things the Chromium group is responsible for".

Some dependencies are automatically pulled into Chrome from their upstreams, like V8, and some aren't, like zlib. The difference is essentially social, not technical: the same organization controls V8 and Chrome and so can set the appropriate social expectations and even revert changes to upstream V8 as needed. Of course the goal of the project as a whole has technical components and technical considerations, but they can only be acted on to the extent they are socially reified: without a social organization of the zlib developers into the Chromium development team, Chromium has no business automatically importing zlib code, because the zlib developers aren't testing against Chromium when they make a release. Bundling zlib into Chromium lets the Chromium project buffer the technical artifacts of the zlib developers through the Chromium developers, thus transferring responsibility to Chromium developers as well.

Conway's law predicts that the interfaces between projects made by different groups of people are the gnarliest bits, and anyone that has ever had to maintain compatibility with a wide range of versions of upstream software has the scar tissue to prove it. The extent to which this pain is still present in Chromium is the extent to which Chromium, its dependencies, and the people that make them are not bound tightly enough. For example, making a change to V8 which results in a change to Blink unit tests is a three-step dance: first you commit a change to Blink giving Chromium a heads-up about new results being expected for the particular unit tests, then you commit your V8 change, then you commit a change to Blink marking the new test result as being the expected one. This process takes at least an hour of human interaction time, and about 4 hours of wall-clock time. This pain would go away if V8 were bundled directly into Chromium, as you could make the whole change at once.

forking considered fantastic

"Forking" sometimes gets a bad rap. Let's take the Chromium example again. Blink forked from WebKit a couple years ago, and things have been great in both projects since then. Before the split, the worst parts in WebKit were the abstraction layers that allowed Google and Apple to use the dependencies they wanted (V8 vs JSC, different process models, some other things). These abstraction layers were the reified software artifacts of the social boundaries between Google and Apple engineers. Now that the social division is gone, the gnarly abstractions are gone too. Neither group of people has to consider whether the other will be OK with any particular change. This eliminates a heavy cognitive burden and allows both projects to move faster.

As a pedestrian counter-example, Guile uses the libltdl library to abstract over the dynamic loaders of different operating systems. (Already you are probably detecting the Conway's law keywords: uses, library, abstract, different.) For years this library has done the wrong thing while trying to do the right thing, ignoring .dylib's but loading .so's on Mac (or vice versa, I can't remember), not being able to specify soversions for dependencies, throwing a stat party every time you load a library because it grovels around for completely vestigial .la files, et cetera. We sent some patches some time ago but the upstream project is completely unmaintained; the patches haven't been accepted, users build with whatever they have on their systems, and though we could try to take over upstream it's a huge asynchronous burden for something that should be simple. There is a whole zoo of concepts we don't need here and Guile would have done better to include libltdl into its source tree, or even to have forgone libltdl and just written our own thing.

Though there are costs to maintaining your own copy of what started as someone else's work, people who yammer on against forks usually fail to recognize their benefits. I think they don't realize that for a project to be technically cohesive, it needs to be socially cohesive as well; anything else is magical thinking.

not-invented-here-syndrome considered swell

Likewise there is an undercurrent of smarmy holier-than-thou moralism in some parts of the programming world. These armchair hackers want you to believe that you are a bad person if you write something new instead of building on what has already been written by someone else. This too is magical thinking that comes from believing in the fictional existence of a first-person plural, that there is one "we" of "humanity" that is making linear progress towards the singularity. Garbage. Conway's law tells you that things made by different people will have different paces, goals, constraints, and idiosyncrasies, and the impedance mismatch between you and them can be a real cost.

Sometimes these same armchair hackers will shake their heads and say "yeah, project Y had so much hubris and ignorance that they didn't want to bother understanding what X project does, and they went and implemented their own thing and made all their own mistakes." To which I say, so what? First of all, who are you to judge how other people spend their time? You're not in their shoes and it doesn't affect you, at least not in the way it affects them. An armchair hacker rarely understands the nature of value in an organization (commercial or no). People learn more when they write code than when they use it or even when they read it. When your product has a problem, where will you find the ability to fix it? Will you file a helpless bug report or will you be able to fix it directly? Assuming your software dependencies model some part of your domain, are you sure that their models are adequate for your purpose, with the minimum of useless abstraction? If the answer is "well, I'm sure they know what they're doing" then if your organization survives a few years you are certain to run into difficulties here.

One example. Some old-school Mozilla folks still gripe at Google having gone and created an entirely new JavaScript engine, back in 2008. This is incredibly naïve! Google derives immense value from having JS engine expertise in-house and not having to coordinate with anyone else. This control also gives them power to affect the kinds of JavaScript that gets written and what goes into the standard. They would not have this control if they decided to build on SpiderMonkey, and if they had built on SM, they would have forked by now.

As a much more minor, insignificant, first-person example, I am an OK compiler hacker now. I don't consider myself an expert but I do all right. I got here by making a bunch of mistakes in Guile's compiler. Of course it helps if you get up to speed using other projects like V8 or what-not, but building an organization's value via implementation shouldn't be discounted out-of-hand.

Another point is that when you build on someone else's work, especially if you plan on continuing to have a relationship with them, you are agreeing up-front to a communications tax. For programmers this cost is magnified by the degree to which asynchronous communication disrupts flow. This isn't to say that programmers can't or shouldn't communicate, of course, but it's a cost even in the best case, and a cost that can be avoided by building your own.

When you depend on a project made by a distinct group of people, you will also experience churn or lag drag, depending on whether the dependency changes faster or slower than your project. Depending on LLVM, for example, means devoting part of your team's resources to keeping up with the pace of LLVM development. On the other hand, depending on something more slow-moving can make it more difficult to work with upstream to ensure that the dependency actually suits your use case. Again, both of these drag costs are magnified by the asynchrony of communicating with people that probably don't share your goals.

Finally, for projects that aim to ship to end users, depending on people outside your organization exposes you to risk. When a security-sensitive bug is reported on some library that you use deep in your web stack, who is responsible for fixing it? If you are responsible for the security of a user-facing project, there are definite advantages for knowing who is on the hook for fixing your bug, and knowing that their priorities are your priorities. Though many free software people consider security to be an argument against bundling, I think the track record of consumer browsers like Chrome and Firefox is an argument in favor of giving power to the team that ships the product. (Of course browsers are terrifying security-sensitive piles of steaming C++! But that choice was made already. What I assert here is that they do well at getting security fixes out to users in a timely fashion.)

to use a thing, join its people

I'm not arguing that you as a software developer should never use code written by other people. That is silly and I would appreciate if commenters would refrain from this argument :)

Let's say you have looked at the costs and the benefits and you have decided to, say, build a browser on Chromium. Or re-use pieces of Chromium for your own ends. There are real costs to doing this, but those costs depend on your relationship with the people involved. To minimize your costs, you must somehow join the community of people that make your dependency. By joining yourself to the people that make your dependency, Conway's law predicts that the quality of your product as a whole will improve: there will be fewer abstraction layers as your needs are taken into account to a greater degree, your pace will align with the dependency's pace, and colleagues at Google will review for you because you are reviewing for them. In the case of Opera, for example, I know that they are deeply involved in Blink development, contributing significantly to important areas of the browser that are also used by Chromium. We at Igalia do this too; our most successful customers are those who are able to work the most closely with upstream.

On the other hand, if you don't become part of the community of people that makes something you depend on, don't be surprised when things break and you are left holding both pieces. How many times have you heard someone complain the "project A removed an API I was using"? Maybe upstream didn't know you were using it. Maybe they knew about it, but you were not a user group they cared about; to them, you had no skin in the game.

Foundations that govern software projects are an anti-pattern in many ways, but they are sometimes necessary, born from the need for mutually competing organizations to collaborate on a single project. Sometimes the answer for how to be able to depend on technical work from others is to codify your social relationship.

hi haters

One note before opening the comment flood: I know. You can't control everything. You can't be responsible for everything. One way out of the mess is just to give up, cross your fingers, and hope for the best. Sure. Fine. But know that there is no magical first-person-plural; Conway's law will apply to you and the things you build. Know what you're actually getting when you depend on other peoples' work, and know what you are paying for it. One way or another, pay for it you must.

by Andy Wingo at November 09, 2015 01:48 PM

November 03, 2015

Andy Wingo

two paths, one peak: a view from below on high-performance language implementations

Ohmigod it's November. Time flies amirite. Eck-setra. These are not actually my sentiments but sometimes I do feel like a sloth or a slow loris, grasping out at quarter-speed. Once I get a hold it's good times, but hoo boy. The tech world churns and throws up new languages and language implementations every year and how is it that in 2015, some 20 years after the project was started, Guile still doesn't do native compilation?

Though I've only been Guiling for the last 10 years or so, this article aims to plumb those depths; and more than being an apology or a splain I want to ponder the onward journey from the here and the now. I was going to write something like "looking out from this peak to the next higher peak" but besides being a cliché that's exactly what I don't mean to do. In Guile performance work I justify my slow loris grip by a mistrust of local maxima. I respect and appreciate the strategy of going for whatever gains you the most in the short term, especially in a commercial context, but with a long view maybe this approach is a near win but a long lose.

That's getting ahead of myself; let's get into this thing. We started byte-compiling Guile around 2008 or so. Guile is now near to native compilation. Where are we going with this thing?

short term: template jit

The obvious next thing to do for Guile would be to compile its bytecodes to machine code using a template JIT. This strategy just generates machine code for each bytecode instruction without regard to what comes before or after. It's dead simple. Guile's bytecode is quite well-suited to this technique, even, in the sense that an instruction doesn't correspond to much code. As Guile has a register-based VM, its instructions will also specialize well against their operands when compiled to native code. The only global state that needs to be carried around at runtime is the instruction pointer and the stack pointer, both of which you have already because of how modern processors work.
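
The core idea can be shown with a toy sketch (a hypothetical two-opcode bytecode with placeholder "machine code" bytes, nothing from Guile itself): compiling with a template JIT is essentially pasting one fixed template per instruction.

```python
# Toy template JIT: each opcode maps to a fixed machine-code template,
# and "compilation" is just concatenating templates, with no analysis of
# what comes before or after each instruction.
OP_ADD, OP_RET = 0, 1

TEMPLATES = {
    OP_ADD: bytes([0x01, 0x02, 0x03]),  # stand-in bytes, not real x86
    OP_RET: bytes([0xc3]),              # x86 'ret'
}

def template_jit(bytecode):
    """Emit machine code by pasting one template per bytecode instruction."""
    return b"".join(TEMPLATES[op] for op in bytecode)

mcode = template_jit([OP_ADD, OP_ADD, OP_RET])
print(len(mcode))  # 7
```

A real template JIT would additionally patch each template with its operands (registers, immediates), but the compile step stays a single linear pass over the bytecode, which is what makes it so simple and so fast.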

Incidentally I have long wondered why CPython doesn't have a template JIT. Spiritually I am much more in line with the PyPy project but if I were a CPython maintainer, I would use a template JIT on the bytecodes I already have. Using a template JIT preserves the semantics of bytecode, including debugging and introspection. CPython's bytecodes are at a higher level than Guile's though, with many implicit method/property lookups (at least the last time I looked at them), and so probably you would need to add inline caches as well; but no biggie. Why aren't the CPython people doing this? What is their long-term perf story anyway -- keep shovelling C into the extension furnace? Lose to PyPy?

In the case of Guile we are not yet grasping in this direction because we don't have (direct) competition from PyPy :) But also there are some problems with a template JIT. Once you internalize the short-term mentality of a template JIT you can get stuck optimizing bytecode, optimizing template JIT compilation, and building up a baroque structure that by its sheer mass may prevent you from ever building The Right Thing. You will have to consider how a bytecode-less compilation pipeline interacts with not only JITted code but also bytecode, because it's a lose to do a template JIT for code that is only executed once.

This sort of short-term thinking is what makes people also have to support on-stack replacement (OSR), also known as hot loop transfer. The basic idea is that code that executes often has to be JITted to go fast, but you can't JIT everything because that would be slow. So you wait to compile a function until it's been called a few times; fine. But with loops it could be that a function is called just once but a loop in the function executes many times. You need to be able to "tier up" to the template JIT from within a loop. This complexity is needed at the highest performance level, but if you choose to do a template JIT you're basically committing to implementing OSR early on.

Additionally the implementation of a template JIT compiler is usually a bunch of C or C++ code. It doesn't make sense to include a template JIT in a self-hosted system that compiles to bytecode, because it would be sad to have the JIT not be written in the source language (Guile Scheme in our case).

Finally in Scheme we have tail-call and delimited continuation considerations. Currently in Guile all calls happen in the Guile bytecode interpreter, which makes tail calls easy -- the machine frame stays the same and we just have to make a tail call on the Scheme frame. This is fine because we don't actually control the machine frame (the C frame) of the bytecode interpreter itself -- the C compiler just does whatever it does. But how to tail call between the bytecode interpreter and JIT-compiled code? You'd need to add a trampoline beneath both the C interpreter and any entry into compiled code that would trampoline to the other implementation, depending on how the callee "returns". And how would you capture stack slices with delimited continuations? It's possible (probably -- I don't know how to reinstate a delimited continuation with both native and interpreted frames), but something of a headache, and is it really necessary?
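
The trampoline idea itself is simple, even if wiring it into Guile's frames is not. Here is a toy sketch in Python (my own illustration, not Guile's mechanism): each step returns a description of the next call instead of making it, and a driver loop performs the "tail call", so the host stack stays at constant depth no matter how many tail calls happen.

```python
# Toy trampoline: a step either finishes ("return", value) or requests a
# tail call ("call", fn, args); the driver loop makes the call, so the
# host stack never grows.
def count(n, acc):
    if n == 0:
        return ("return", acc)
    return ("call", count, (n - 1, acc + n))  # tail call to ourselves

def trampoline(f, *args):
    result = ("call", f, args)
    while result[0] == "call":
        _, fn, args = result
        result = fn(*args)
    return result[1]

print(trampoline(count, 100000, 0))  # sum of 1..100000 = 5000050000
```

The same driver can dispatch between two different kinds of steps, say an interpreter step function and entry points into compiled code, which is what makes a tail call across the two implementations possible without growing the C stack.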

if you compile ahead-of-time anyway...

The funny thing about CPython is that, like Guile, it is actually an ahead-of-time compiler. While the short-term win would certainly be to add a template JIT, because the bytecode is produced the first time a script is run and cached thereafter, you might as well compile the bytecode to machine code ahead-of-time too and skip the time overhead of JIT compilation on every run. In a template JIT, the machine code is only a function of the bytecode (assuming the template JIT doesn't generate code that depends on the shape of the heap).

Compiling native code ahead of time also saves on memory usage, because you can use file-backed mappings that can be lazily paged in and shared between multiple processes, and when these pages are in cache that translates also to faster startup too.

But if you're compiling bytecode ahead of time to native code, what is the bytecode for?

(not) my beautiful house

At some point you reach a state where you have made logical short-term decisions all the way and you end up with vestigial organs of WTF in your language runtime. Bytecode, for example. A bytecode interpreter written in C. Object file formats for things you don't care about. Trampolines. It's time to back up and consider just what it is that we should be building.

The highest-performing language implementations will be able to compile together the regions of code in which a program spends most of its time. Ahead-of-time compilers can try to predict these regions, but you can't always know what the behavior of a program will be. A program's run-time depends on its inputs, and program inputs are late-bound.

Therefore these highest-performing systems will use some form of adaptive optimization to apply run-time JIT compilation power on whatever part of a program turns out to be hot. This is the peak performance architecture, and indeed in the climb to a performant language implementation, there is but one peak that I know of. The question becomes, how to get there? What path should I take, with the priorities I have and the resources available to me, which lets me climb the farthest up the hill while always leaving the way clear to the top?

guile's priorities

There are lots of options here, and instead of discussing the space as a whole I'll just frame the topic with some bullets. Here's what I want out of Guile:

  1. The project as a whole should be pleasing to hack on. As much of the system as possible should be written in Scheme, as little as possible in C or assembler, and dependencies on outside projects should be minimized.

  2. Guile users should be able to brag about startup speed to their colleagues. We are willing to trade away some peak throughput for faster startup, if need be.

  3. Debuggability is important -- a Guile hacker will always want to be able to get stack traces with actual arguments and local variable values, unless they stripped their compiled Guile binaries, which should be possible as well. But we are willing to give up some debuggability to improve performance and memory use. In the same way that a tail call replaces the current frame in its entirety, we're willing to lose values of dead variables in stack frames that are waiting on functions to return. We're also OK with other debuggability imprecisions if the performance gains are good enough. With macro expansion, Scheme hackers expect a compilation phase; spending time transforming a program via ahead-of-time compilation is acceptable.

Call it the Guile Implementor's Manifesto, or the manifesto of this implementor at least.

beaucoup bucks

Of course if you have megabucks and ace hackers, then you want to dial back on the compromises: excellent startup time but also source-level debugging! The user should be able to break on any source position: the compiler won't even fold 1 + 1 to 2. But to get decent performance you need to be able to tier up to an optimizing compiler soon, and soon in two senses: soon after starting the program, but also soon after starting your project. It's an intimidating thing to build when you are just starting on a language implementation. You need to be able to tier down too, at least for debugging and probably for other reasons too. This strategy goes in the right direction, performance-wise, but it's a steep ascent. You need experienced language implementors, and they are not cheap.

The usual strategy for this kind of implementation is to write it all in C++. The latency requirements are too strict to do otherwise. Once you start down this road, you never stop: your life as an implementor is that of a powerful, bitter C++ wizard.

The PyPy people have valiantly resisted this trend, writing their Python implementation in Python itself, but they concede to latency by compiling their "translated interpreter" into C, which then obviously can't itself be debugged as Python code. It's self-hosting, but staged into C. Ah well. Still, a most valiant, respectable effort.

This kind of language implementation usually has bytecode, as it's a convenient reification of the source semantics, but it doesn't have to. V8 is a good counterexample, at least currently: it treats JavaScript source code as the canonical representation of program semantics, relying on its ability to re-parse source text to an AST in the same way every time as needed. V8's first-tier implementation is actually a simple native code compiler, generated from an AST walk. But things are moving in the bytecode direction in the V8 world, reluctantly, so we should consider bytecode as the backbone of the beaucoup-bucks language implementation.

shoestring slim

If you are willing to relax on source-level debugging, as I am in Guile, you can simplify things substantially. You don't need bytecode, and you don't need a template JIT; in the case of Guile, probably the next step in Guile's implementation is to replace the bytecode compiler and interpreter with a simple native code compiler. We can start with the equivalent of a template JIT, but without the bytecode, and without having to think about the relationship between compiled and (bytecode-)interpreted code. (Guile still has a traditional tree-oriented interpreter, but it is actually written in Scheme; that is a story for another day.)

There's no need to stop at a simple compiler, of course. Guile's bytecode compiler is already fairly advanced, with interprocedural optimizations like closure optimization, partial evaluation, and contification, as well as the usual loop-invariant code motion, common subexpression elimination, scalar replacement, unboxing, and so on. Add register allocation and you can have quite a fine native compiler, and you might even beat the fabled Scheme compilers on the odd benchmark. They'll call you plucky: high praise.

There's a danger in this strategy though, and it's endemic in the Scheme world. Our compilers are often able to do heroic things, but only on the kinds of programs they can fully understand. We as Schemers bend ourselves to the will of our compilers, writing only the kinds of programs our compilers handle well. Sometimes we're scared to fold, preferring instead to inline the named-let iteration manually to make sure the compiler can do its job. We fx+ when we should +; we use tagged vectors when we should use proper data structures. This is déformation professionnelle, as the French would say. I gave a talk at last year's Scheme workshop on this topic. PyPy people largely don't have this problem, for example; their language implementation is able to see through abstractions at run-time to produce good code, but using adaptive optimization instead of ahead-of-time trickery.

So, an ahead-of-time compiler is perhaps a ridge, but it is not the peak. No amount of clever compilation will remove the need for an adaptive optimizer, and indeed too much cleverness will stunt the code of your users. The task becomes, how to progress from a decent AOT native compiler to a system with adaptive optimization?

Here, as far as I know, we have a research problem. In Guile we have mostly traced the paths of history, re-creating things that existed before. As Goethe said, quoted in the introduction to The Joy of Cooking: "That which thy forbears have bequeathed to thee, earn it anew if thou wouldst possess it." But finally we find here something new, or new-ish: I don't know of good examples of AOT compilers that later added adaptive optimization. Do you know of any, dear reader? I would be delighted to know.

In the absence of a blazed trail to the top, what I would like to do is to re-use the AOT compiler to do dynamic inlining. We might need to collect type feedback as well, though inlining is the more important optimization. I think we can serialize the compiler's intermediate representation into a special section in the ELF object files that Guile produces. A background thread or threads can monitor profiling information from main threads. If a JIT thread decides two functions should be inlined, it can deserialize compiler IR and run the standard AOT compiler. We'd need a bit of mutability in the main program in which to inject such an optimization; an inline cache would do. If we need type feedback, we can make inline caches do that job too.
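To make the shape of that idea concrete, here is a toy sketch in Python (not Guile, and every name in it is invented for illustration): a call site holds a patchable target and a counter standing in for profiling; once the site is hot, an "optimizer" swaps in a hand-inlined equivalent, the way a real system would re-run the AOT compiler over deserialized IR.

```python
# Toy sketch of the mutable call-site idea: an "inline cache" is just a
# patchable slot, and a profiler-driven optimizer can swap in a combined
# (inlined) version once the site gets hot. All names here are invented.

HOT_THRESHOLD = 1000

class CallSite:
    def __init__(self, target):
        self.target = target   # current callee; mutable, so it can be patched
        self.count = 0         # profiling counter, stands in for type feedback

    def call(self, *args):
        self.count += 1
        if self.count == HOT_THRESHOLD:
            self.target = optimize(self.target)  # "recompile" with inlining
        return self.target(*args)

def square(x):
    return x * x

def call_square_twice(x):
    return square(x) + square(x)

def optimize(fn):
    # A real system would deserialize compiler IR and re-run the AOT
    # compiler; here we just substitute a hand-"inlined" equivalent.
    if fn is call_square_twice:
        return lambda x: x * x + x * x
    return fn

site = CallSite(call_square_twice)
results = {site.call(3) for _ in range(2000)}  # same answer before and after patching
```

The point of the sketch is only that the main program needs one mutable indirection per optimization point; everything else can happen off to the side.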

All this is yet a ways off. The next step for Guile, after the 2.2 release, is a simple native compiler, then register allocation. Step by step.

but what about llvmmmmmmmmmmmmm

People always ask about LLVM. It is an excellent compiler backend. It's a bit big, and maybe you're OK with that, or maybe not; whatever. Using LLVM effectively depends on your ability to deal with churn and big projects. But if you can do that, swell, you have excellent code generation. But how does it help you get to the top? Here things are less clear. There are a few projects using LLVM effectively as a JIT compiler, but that is a very recent development. My hubris, desire for self-hosting, and lack of bandwidth for code churn make it so that I won't use LLVM myself, but I have no doubt that a strategy similar to the one I outline above could work well for LLVM. Serialize the bitcode into your object files, make it so that you can map all optimization points to labels in that bitcode, and you have the ability to do some basic dynamic inlining. Godspeed!


If you're interested, I gave a talk a year ago on the state of JavaScript implementations, and how they all ended up looking more or less the same. This common architecture was first introduced by Self; language implementations in this category include HotSpot and any of the JavaScript implementations.

Some notes on how PyPy produces interpreters from RPython.

and so I bid you good night

Guile's compiler has grown slowly, in tow of my ballooning awareness of ignorance and more slowly inflating experience. Perhaps we could have done the native code compilation thing earlier, but I am happy with our steady progress over the last five years or so. We had to scrap one bytecode VM and one or two compiler intermediate representations, and though that was painful I think we've done pretty well as far as higher-order optimizations go. If we had done native compilation earlier, I can't but think the inevitably wrong decisions we would have made on the back-end would have prevented us from having the courage to get the middle-end right. As it is, I see the way to the top, through the pass of ahead-of-time compilation and thence to a dynamic inliner. It will be some time before we get there, but that's what I signed up for :) Onward!

by Andy Wingo at November 03, 2015 11:47 PM

October 29, 2015

Andy Wingo

type folding in guile

A-hey hey hey, my peeps! Today's missive is about another optimization pass in Guile that we call "type folding". There's probably a more proper name for this, but for the moment we go with "type folding" as it's shorter than "abstract constant propagation, constant folding, and branch folding based on flow-sensitive type and range analysis".

on types

A word of warning to the type-system enthusiasts among my readers: here I'm using "type" in the dynamic-languages sense, to mean "a property about a value". For example, whether a value is a vector or a pair is a property of that value. I know that y'all use that word for other purposes, but there are other uses that do not falute so highly, and it's in the more pedestrian sense that I'm interested here.

To back up a bit: what are the sources of type information in dynamic languages? In Guile, there are three ways the compiler can learn about a value's type.

One source of type information is the compiler's knowledge of the result types of expressions in the language, especially constants and calls to the language's primitives. For example, in the Scheme definition (define y (vector-length z)), we know that y is a non-negative integer, and we probably also know a maximum value for y too, given that vectors have a maximum size.

Conditional branches with type predicates also provide type information. For example, consider this Scheme expression:

(lambda (x)
  (if (pair? x)
      (car x)
      (error "not a pair" x)))

Here we can say that at the point of the (car x) expression, x is definitely a pair. Conditional branches are interesting because they add a second dimension to type analysis. The question is no longer "what is the type of all variables", but "what is the type of all variables at all points in the program".

Finally, we have the effect of argument type checks in function calls. For example in the (define y (vector-length z)) definition, after (vector-length z) has been evaluated, we know that z is indeed a vector, because if it weren't, then the primitive call would raise an exception.

In summary, the information that we would like to have is what type each variable has at each program point (label). This information comes from where the variables are defined (the first source of type information), conditional branches and control-flow joins (the second source), and variable use sites that imply type checks (the third). It's a little gnarly but in essence it's a classic flow analysis. We treat the "type" of a variable as a set of possible types. A solution to the flow equations results in a set of types for each variable at each label. We use the intmap data structures to share space between the solution at different program points, resulting in an O(n log n) space complexity.
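As a toy illustration of the flow part, here is the narrowing at a type-predicate branch and the union at the join, in Python (invented names; nothing like Guile's real intmap-based solver):

```python
# Types are modeled as plain sets of tags; narrowing at a branch is set
# intersection/difference, and the join at a control-flow merge is union.
# This is only the lattice of the analysis, not Guile's implementation.

PAIR, VECTOR = "pair", "vector"

entry = {"x": {PAIR, VECTOR}}              # what we know about x on entry

# (if (pair? x) <then> <else>): each arm sees a narrowed type for x.
then_branch = {"x": entry["x"] & {PAIR}}   # test succeeded: x is a pair
else_branch = {"x": entry["x"] - {PAIR}}   # test failed: x is not a pair

# At the join after the if, we only know the union of the two arms.
after_join = {"x": then_branch["x"] | else_branch["x"]}
```

The real pass solves these equations to a fixed point over the whole control-flow graph, one type set per variable per label.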

In Guile we also solve for the range of values a variable may take, at the same time as solving for type. I started doing this as part of the Compost hack a couple years ago, where I needed to be able to prove that the operand to sqrt was non-negative in order to avoid sqrt producing complex numbers. Associating range with type turns out to generalize nicely to other data types which may be thought of as having a "magnitude" -- for example a successful (vector-ref v 3) implies that v is at least 4 elements long. Guile can propagate this information down the flow graph, or propagate it in the other way: if we know the vector was constructed as being 10 elements long, then a successful (vector-ref v n) can only mean that n is between 0 and 9.
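The vector-ref examples can be sketched the same way, as a hypothetical Python model with ranges as (lo, hi) pairs and None meaning "unbounded":

```python
# Toy range propagation: a successful (vector-ref v 3) proves len(v) >= 4,
# and a known 10-element vector bounds a successful index n to [0, 9].
# Ranges are inclusive (lo, hi) pairs; hi == None means unbounded above.

def narrow_vector_length(length_range, index):
    """After (vector-ref v index) succeeds, v has at least index+1 elements."""
    lo, hi = length_range
    return (max(lo, index + 1), hi)

def narrow_index(index_range, known_length):
    """If v is known to have known_length elements, a successful index
    must lie in [0, known_length - 1]."""
    lo, hi = index_range
    hi = known_length - 1 if hi is None else min(hi, known_length - 1)
    return (max(lo, 0), hi)
```

The first function is the "propagate down the flow graph" direction; the second is the reverse direction described above.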

what for the typing of the things

Guile's compiler uses type analysis in a few ways at a few different stages. One basic use is in dead code elimination (DCE). An expression can be eliminated from a program if its value is never used and if it causes no side effects. Guile models side effects (and memory dependencies between expressions) with effects analysis. I quote:

We model four kinds of effects: type checks (T), allocations (A), reads (R), and writes (W). Each of these effects is allocated to a bit. An expression can have any or none of these effects.

In an expression like (vector-ref v n), type analysis may compute that in fact v is indeed a vector and n is an integer, and also that n is within the range of valid indexes of v. In that case we can remove the type check (T) bit from the expression's effects, opening up the expression for DCE.

Getting back to the topic of this article, Guile's "type folding" pass uses type inference in three ways.

The first use of type information is if we determine that, at a given use site, a variable has precisely one type and one value. In that case we can do constant folding over that expression, replacing its uses with its value. For example, let's say we have the expression (define len (vector-length v)). If we know that v is a vector of length 5, we can replace any use of len with the constant, 5. As an implementation detail we actually keep the definition of len in place and let DCE remove it later. We can consider this to be abstract constant propagation: abstract in the sense that it folds over abstract values, represented just as type sets and ranges, and which materializes a concrete value only if it is able to do so. Since ranges propagate through operators as well, it can also be considered as abstract constant folding; the type inference operators act as constant folders.
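Sketching that in Python (a toy model, not Guile's representation): an abstract value is a type tag plus a range, and it materializes a constant only when the range is a single point.

```python
# Toy abstract constant propagation: an abstract value is a (type, lo, hi)
# triple, and it materializes a concrete constant only when lo == hi.

def vector_length_type(vec_value):
    # (vector-length v) yields an exact integer whose range equals the
    # range we know for the vector's length.
    typ, lo, hi = vec_value
    assert typ == "vector"
    return ("exact-integer", lo, hi)

def as_constant(abstract_value):
    typ, lo, hi = abstract_value
    return lo if lo == hi else None   # fold only when the range is a point

v = ("vector", 5, 5)                  # v known to be a vector of length 5
length = vector_length_type(v)
```

When the range is not a point, as_constant declines, and the abstract value still flows onward for other uses (branch folding, strength reduction).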

Another use of type information is in branches. If Guile sees (if (< n (vector-length v)) 1 2) and n and v have the right types and disjoint ranges, then we can fold the test and choose 1 or 2 depending on how the test folds.
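In the same toy range model (hypothetical Python, inclusive (lo, hi) ranges), the fold on (< n (vector-length v)) looks like:

```python
# Toy branch folding: (if (< n (vector-length v)) 1 2) folds when the
# ranges of n and of the vector length are disjoint.

def fold_less_than(n_range, len_range):
    n_lo, n_hi = n_range
    l_lo, l_hi = len_range
    if n_hi < l_lo:
        return True     # always (< n len): take the then-branch
    if n_lo >= l_hi:
        return False    # never (< n len): take the else-branch
    return None         # ranges overlap: leave the branch alone
```

Returning None models "cannot fold"; the compiler then keeps the test as-is.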

Finally type information can enable strength reduction. For example it's a common compiler trick to want to reduce (* n 16) to (ash n 4), but if n isn't an integer this isn't going to work. Likewise, (* n 0) can be 0, 0.0, 0.0+0.0i, something else, or an error, depending on the type of n and whether the * operator has been extended to apply over non-number types. Type folding uses type information to reduce the strength of operations like these, but only where it can prove that the transformation is valid.
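A minimal sketch of that guard in Python (an invented expression encoding, not Guile's IR):

```python
# Toy strength reduction: rewrite (* n 16) to (ash n 4) only when type
# analysis proves n is an exact integer; otherwise the transformation
# would change the result (e.g. 2.5 * 16) or mask an error.

def reduce_multiply(expr, types):
    op, x, k = expr
    if op == "*" and k == 16 and types.get(x) == {"exact-integer"}:
        return ("ash", x, 4)
    return expr
```

The key point is the proof obligation: the rewrite fires only when the inferred type set for the operand is exactly what the cheaper operation requires.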

So that's type folding! It's a pretty neat pass that does a few things at once. Code here, and code for the type inference itself here.

type-driven unboxing

Guile uses type information in one other way currently, and that is to determine when to unbox floating-point numbers. The current metric is that whenever an arithmetic operation will produce a floating-point number -- in Scheme parlance, an inexact real -- then that operation should be unboxed, if it has an unboxed counterpart. Unboxed operations on floating-point numbers are advantageous because they don't have to allocate space on the garbage-collected heap for their result. Since an unboxed operation like the fadd floating-point addition operator takes raw floating-point numbers as operands, it also will never cause a type check, unlike the polymorphic add instruction. Knowing that fadd has no effects lets the compiler do a better job at common subexpression elimination (CSE), dead code elimination, loop-invariant code motion, and so on.

To unbox an operation, its operands are unboxed, the operation itself is replaced with its unboxed counterpart, and the result is then boxed. This turns something like:

(+ a b)

into:

(f64->scm (fl+ (scm->f64 a) (scm->f64 b)))

You wouldn't think this would be an optimization, except that the CSE pass can eliminate many of these conversion pairs using its scalar elimination via fabricated expressions pass.
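The effect (though not the mechanism) of eliminating those pairs can be sketched as a peephole over a toy unary-expression tree in Python; the encoding is mine, not Guile's CSE:

```python
# Toy peephole on a tree IR: (scm->f64 (f64->scm x)) cancels to x, which is
# the net effect of CSE eliminating adjacent conversion pairs.

INVERSES = {"scm->f64": "f64->scm", "f64->scm": "scm->f64"}

def simplify(expr):
    if isinstance(expr, tuple):
        op, arg = expr[0], simplify(expr[1])
        if isinstance(arg, tuple) and INVERSES.get(op) == arg[0]:
            return arg[1]          # the two conversions cancel
        return (op, arg)
    return expr
```

In a chain of unboxed operations, every intermediate box/unbox pair cancels this way, leaving only the conversions at the boundary.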

A proper flow-sensitive type analysis is what enables sound, effective unboxing. After arithmetic operations have been unboxed, Guile then goes through and tries to unbox loop variables and other variables with more than one definition ("phi" variables, for the elect). It mostly succeeds at this. The results are great: summing a packed vector of 10 million 32-bit floating-point values goes down from 500ms to 130ms, on my machine, with no time spent in the garbage collector. Once we start doing native compilation we should be up to about 5e8 or 1e9 floats per second in this microbenchmark, which is totally respectable and about what gcc's -O0 performance gets.


This kind of type inference works great in tight loops, and since that's how many programs spend most of their time, that's great. Of course, this situation is also a product of programmers knowing that tight loops are how computers go the fastest, or at least of how compilers do the best on their code.

Where this approach to type inference breaks down is at function boundaries. There are no higher-order types and no higher-order reasoning, and indeed no function types at all! This is partially mitigated by earlier partial evaluation and contification passes, which open up space in which the type inferrer can work. Method JIT compilers share this weakness with Guile; tracing JIT runtimes like LuaJIT and PyPy largely do not.

up summing

So that's the thing! I was finally motivated to dust off this draft given the recent work on unboxing in a development branch of Guile. Happy hacking and we promise we'll actually make releases so that you can use it soon soon :)

by Andy Wingo at October 29, 2015 10:13 PM

October 28, 2015

Carlos Alberto López Pérez

Introducing the meta-webkit Yocto layer.

Lately, in my daily work at Igalia, I have been learning to use Yocto / OpenEmbedded to create custom distributions and products targeting different embedded hardware. One of the goals was to create a kiosk-like browser based on a modern Web engine: light (especially regarding RAM requirements) and fast.

WebKit fits these requirements perfectly, so I started checking the recipes available on the different layers of Yocto to build WebKit. Unfortunately, the available recipes for WebKitGTK+ were for ancient versions of the engine (no WebKit2 multi-process model). This has changed recently, as the WebKitGTK+ recipe available in the oe-core layer has been updated to a new version. In any case, we needed this for an older release of Yocto, so I ended up creating a new layer.

I have also added some recipes for WebKitForWayland and released the result as meta-webkit. The idea is to collect here recipes for building browsers and web engines based on WebKit. For the moment it includes recipes for building the WebKitGTK+ and WebKitForWayland ports.

WebKitGTK+ vs WebKitForWayland

First, a few words about the main differences between these two ports of WebKit, so you have a better understanding of where to use one or the other.

Let’s start by defining both engines:

  • WebKit for Wayland port pairs the WebKit engine with the Wayland display protocol, allowing embedders to create simple and performant systems based on Web platform technologies. It is designed with hardware acceleration in mind, relying on EGL, the Wayland EGL platform, and OpenGL ES.
  • WebKitGTK+ is a full-featured port of the WebKit rendering engine, suitable for projects requiring any kind of web integration, from hybrid HTML/CSS applications to full-fledged web browsers. It offers WebKit’s full functionality and is useful in a wide range of systems from desktop computers to embedded systems like phones, tablets, and televisions.

From the definitions you may already guess some differences. WebKitForWayland focuses on simple and performant Web-oriented systems, while WebKitGTK+ focuses on any kind of product (complex or simple) that requires full Web integration.

WebKitForWayland is what you need when you want a very fast and light HTML5/JS runtime, capable of squeezing the maximum performance out of your platform's hardware acceleration (Wayland EGL platform). But it is not suitable if you are thinking of building a general-purpose browser on top of it. It lacks some features that are not needed for a Web runtime but that are desirable for a more generic browser.

Here you can see a video of WebKitForWayland running on the Weston ivi-shell, with several instances of WebKitForWayland showing different demos (poster circle and several WebGL demos), running on an Intel NUC (Intel Atom E3815 – 1 core). The OS is a custom image built with Yocto 1.8 and this meta-webkit layer.

On the Weston terminal you can appreciate the low resource usage (CPU and RAM) despite several browser processes running the demanding demos at full speed. Link to the video on YouTube here.

In case you want to try it, you might want to build an image for your device with meta-webkit and Yocto (see below), or you can test the image that I built for the demo in the above video. You can flash it to a USB memory stick as follows, and boot it on any Intel x86_64 machine that has an Intel GPU:

# /dev/sdX is the device of your usb memory stick (for example /dev/sdc)
curl -s | xz -dc | dd bs=4k of=/dev/sdX

On the other hand, WebKitGTK+ allows you to build a custom browser on top of it. It has a rich and stable API and many goodies like support for plugins, an integrated web inspector, full toolkit support, etc. WebKitGTK+ also performs very well and consumes few resources; it can run on top of both Wayland and X11. But at the moment of writing this, the support for Wayland is still incomplete (we are working on it). So if possible we recommend that you base your product on the X11 backend. You could later migrate to Wayland without much effort once we complete the support for it.

Building a Yocto image with WebKitForWayland

The usual way to create an image with Yocto and WebKitForWayland is:

  • Set up the environment and source oe-init-build-env as usual.
  • Check out the branch of meta-webkit that matches your Yocto/OE version (for example: fido).
    Note that fido (1.8) is the oldest supported version. If you are using the fido branch you will also need to add the meta-ruby layer that is available on meta-openembedded.
  • Add the path to the meta-webkit layer in your conf/bblayers.conf file.
  • Append the following lines to the conf/local.conf file:
    DISTRO_FEATURES_append = " opengl wayland"
    IMAGE_INSTALL_append = " webkitforwayland"
  • Then build the target image, for example:
    bitbake core-image-weston
  • Then boot the image on the target device and run the webkitforwayland engine from a Weston terminal:
    WPELauncher

Building a Yocto image with WebKitGTK+

There are some things to take into account when building WebKitGTK+:

  • The package webkitgtk contains the shared libraries and the webkitgtk runtime.
  • The package webkitgtk-bin contains the MiniBrowser executable. This is a very basic browser built on top of the webkitgtk runtime, mainly used for testing purposes.
  • On the recipe there are several packageconfig options that you can tune. For example, for enabling WebGL support you can add the following to your conf/local.conf file:
    PACKAGECONFIG_pn-webkitgtk = "x11 webgl"

    Check the recipe source code to see all the available options.

  • The name of the recipe is the same as the one available in oe-core (master), so you should select which version of webkitgtk you want to build. For example, to build 2.10.3 add to conf/local.conf:
    PREFERRED_VERSION_webkitgtk = "2.10.3"

So, the usual way to create an image with Yocto and WebKitGTK+ is:

  • Set up the environment and source oe-init-build-env as usual.
  • Check out the branch of meta-webkit that matches your Yocto/OE version (for example: fido).
    Note that fido (1.8) is the oldest supported version. If you are using the fido branch you will also need to add the meta-ruby layer that is available on meta-openembedded.
  • Add the following lines to your conf/local.conf file (for building the X11 backend of WebKitGTK+):
    DISTRO_FEATURES_append = " opengl x11"
    IMAGE_INSTALL_append = " webkitgtk-bin"
  • Then build the X11 image:
    bitbake core-image-sato
  • Then boot the image on the target device and run the included test browser from an X terminal (or any other browser that you have developed on top of the WebKitGTK+ API).

Further info

If you are new to Yocto / OpenEmbedded, a good starting point is to check out the documentation.

If you have any issue or doubt with the meta-webkit layer, please let me know (you can mail me at or open an issue on github).

Finally, if you need help integrating a Web engine into your product, you can also hire us. As maintainers of the GTK+ and WebKit For Wayland ports of WebKit we have considerable experience creating, maintaining and optimizing ports of WebKit.

by clopez at October 28, 2015 01:23 PM

October 23, 2015

Víctor Jáquez

GStreamer Conference 2015

Last October 8th and 9th, the GStreamer Conference 2015 took place in Dublin. It was full of interesting talks and colleagues. I really enjoyed it.

I participated with our latest work in gstreamer-vaapi. You can glance at the slides of my talk.

GStreamer VA-API talk

By the way, if you are in Dublin, let me recommend you visit the Chester Beatty Library. There are a lot of antique, beautiful and rare books.

by vjaquez at October 23, 2015 10:50 AM

October 19, 2015

Jacobo Aragunde

GENIVI meeting at Seoul

Hello! I'm writing this update from Seoul!

The National Folk Museum of Korea, framed by the gardens surrounding Gyeongbokgung Palace

The purpose of this trip is to take part in the GENIVI 13th All Member Meeting. Igalia will present the contributions made to the GENIVI community in the browsers area. I have been coordinating the company's investment in the automotive field over the last year and have taken part in the development of several prototypes involving Chromium technologies; that is why I am here with Silvia Cho, Gwang Yoon Hwang and Michael Catanzaro.

Among other things, Igalia has contributed our knowledge to the discussion on the different technical choices to build a browser for the sector, providing options and analyzing the pros and cons of each one. We have been working to complete the implementation of Ozone-Wayland, the layer that enables running Chromium on Wayland: this task will be essential in the automotive sector, where next-generation platforms are being built on top of Wayland. Finally, we have put together the pieces that allow us to run a modern instance of Chromium on a Renesas R-Car M2 board, an embedded platform designed for this sector; a job that implied hacking the BSPs, the build recipes and even some wrapper libraries.

Three talks will be presented by Igalians in this event:

We will also have a booth at the conference showcase, where the public will be able to see our work in action. An R-Car board will be running Chromium on top of Wayland to display our work on Ozone-Wayland and porting. Last but not least, there will also be some demos featuring WebKit for Wayland, a fresh flavor of WebKit that focuses on high performance with a low footprint.

Chromium running on top of Weston on an embedded ARM device.

You can expect to see slides and maybe some more pictures here, as usual, later this week. Happy hacking!

EDIT: uploaded the slides of the three talks, linked above. Also added these pictures from twitter:

by Jacobo Aragunde Pérez at October 19, 2015 09:35 AM

October 15, 2015

Diego Pino

Multicore architectures and CPU affinity

Lately I have been working on Snabb Switch as part of the networking team we have in Igalia. Snabb Switch is a kernel-bypass software switch that talks directly to 10-Gbps network cards. This allows Snabb Switch to manipulate packets at a rate the Linux kernel is simply not capable of. Snabb Switch also provides a framework to develop network applications, and most of the code is Lua (LuaJIT). Snabb Switch rides on the latest innovations in hardware and is very innovative in many ways. All this has allowed me to catch up with many technologies. In this post I review the last 10 years of Intel microprocessor evolution.

One of the first attempts by Intel to parallelize processing was hyper-threading, a technology that debuted in Xeon in 2002 and later that year in Pentium 4. A single CPU with hyper-threading appears as two logical CPUs to the Operating System. Hyper-threading takes advantage of the superscalar architecture of modern CPUs, in which instruction processing is divided into several independent pipelines. By duplicating the number of registers in a CPU, hyper-threading can exploit the use of the different pipelines, so most of them are busy at a given time. However, other resources such as the cache are not actually duplicated but shared by the logical and the physical CPU. A CPU with hyper-threading enabled can provide a performance boost of 30%, or in the worst case no boost at all.

After this first attempt at bringing real concurrent processing, Intel CPUs started to incorporate multiple cores per socket. For some time hyper-threading technology was forgotten, as Intel modeled its new Core microarchitecture after the P6 microarchitecture (Pentium Pro, Pentium III and Celeron). However, with the release of the new Nehalem microarchitecture (Core i7) by the end of 2008, hyper-threading made a comeback. Since then, all Intel processors feature this technology. As hyper-threading adds a logical CPU for each physical core, my dual-core CPU appears as four cores to the OS.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2

Two cores per socket and two threads per core make a total of four CPUs.

What really encouraged Intel to switch to a multicore architecture was the inability to keep improving speed by increasing the number of transistors in a CPU. As Gordon Moore, co-founder of Intel, noticed in 1965, the number of transistors per square inch in a CPU doubles every year (Moore's Law), something that still applies today although the period has stretched to 2.5 years. However, while the number of transistors in the last decade has gone from 37.5 million to 904 million, CPU clock speed has barely doubled, going from 1.3 GHz to 2.8 GHz. In 2004, the heat build-up in CPUs caused Intel to abandon this model and start featuring multiple cores in one single processing unit.

The first multicore processors accessed memory through a shared bus; in other words, all cores shared a common memory. This design is known as UMA, or Uniform Memory Access. As the number of cores increased, contention issues appeared. Access to the bus became a bottleneck to scalability, preventing the addition of more cores. To solve this problem, CPU designers introduced a new type of memory layout known as Non-Uniform Memory Access, or NUMA. In a NUMA topology, cores are grouped into a single unit called a NUMA node. Each NUMA node has local memory assigned, which guarantees only a limited number of cores will try to access memory at a given time. The memory that is not local to a NUMA node is known as remote or foreign. Access to remote memory takes longer because the distance between processor and memory affects access time. That is the main reason why this architecture is called NUMA, and it is often referred to as a topology. Both UMA and NUMA are types of shared-memory architectures.

Architecture of a NUMA system. Source: Wikipedia

Unfortunately, my laptop only has one NUMA node:

$ lscpu | grep NUMA
NUMA node(s):          1
NUMA node0 CPU(s):     0-3

Or using the numactl command:

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 
cpubind: 0 
nodebind: 0 
membind: 0 

This shows that my CPU features only one NUMA node containing 4 cores; in other words, my laptop implements a UMA layout. The same command on a more powerful machine (Intel Xeon Processor E5-2697 v2, with 12 cores per socket) provides the following information:

$ lscpu
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

This is a much more advanced CPU. There are 2 sockets, each containing 12 cores, or 24 cores with hyper-threading. Cores are grouped into 2 NUMA nodes (node0 and node1). The lscpu command also provides information about node distances:

node distances:
node   0   1 
0:  10  21 
1:  21  10 

Besides cores, peripheral devices also compete for access to the data bus, since devices have direct access to memory through DMA (Direct Memory Access). In a NUMA layout, this means that devices, usually connected to a PCI port, have a NUMA node number assigned too. If a process is running on a core which heavily interacts with an I/O device belonging to a different NUMA node, performance degradation issues may appear. NUMA benefits considerably from the data-locality principle, so devices and processes operating on the same data should run within the same NUMA node.

$ cat /sys/bus/pci/devices/0000:00:01.0/numa_node
0

In the example above, the device with PCI address 0000:00:01.0 belongs to NUMA node0.
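The same lookup can be done from Python, guarded so it degrades when the sysfs path is absent (Linux-only; the PCI address is just the example above):

```python
# Sketch: read the NUMA node of a PCI device from sysfs. The kernel
# reports -1 when no node is assigned (e.g. on single-node systems).

from pathlib import Path

def pci_numa_node(address):
    node_file = Path("/sys/bus/pci/devices") / address / "numa_node"
    if not node_file.exists():
        return None           # not Linux, or no such device
    return int(node_file.read_text())

node = pci_numa_node("0000:00:01.0")
```

A placement tool could use this to pick a core from the same node as the NIC it drives.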

If data locality is so important in a NUMA architecture, it seems it would be very handy to be able to select which core a program runs on. Generally, when the OS executes a program it creates a process and assigns it to a core following a scheduling algorithm. The process will run for some time, until the kernel dispatcher assigns a new process to the core and the former process is put to sleep. When it wakes up, it may be reassigned to a different core. Usually if the process consumes a lot of CPU time the kernel won't reassign it to a different core, but it could happen.

CPU affinity lets us bind a process or thread to a core or group of cores. From the user perspective, commands such as taskset or numactl help us control the CPU affinity of a process:

$ taskset -c 0 snabb    # Run the program 'snabb' on core 0
$ taskset -c 0-3 snabb  # Run the program 'snabb' on cores 0,1,2,3

Likewise, with the command numactl:

$ numactl --cpunodebind=0 snabb    # Run the program 'snabb' on any of the cores of node0
$ numactl --physcpubind=0-3 snabb  # Run the program 'snabb' on cores 0,1,2,3

Besides setting CPU affinity, numactl allows finer control, making it possible to select different preferences for memory allocation (only local allowed, local preferred, etc.).

Summarizing, most modern CPUs, if not all, implement a multicore architecture on a NUMA topology. Arbitrary assignment of processes to a core may have an impact on performance, especially if a running program interacts with a PCI device in a different NUMA node. Tools such as taskset and numactl give us finer control over CPU affinity, making it possible to select which core or NUMA node will run a program.

Lastly, some recommended links to articles I read, or partially read, and that helped me write this post:

October 15, 2015 10:00 AM