Planet Igalia

November 15, 2015

Diego Pino

Reusing LuaJIT's disassembler

Last year at Igalia we started coding pflua, a high-performance packet filter which runs on top of LuaJIT. Pflua is now included in Snabb Switch as an external library and it’s used to do all the packet filtering tasks that were initially done using libpcap. Pflua compiles pflang expressions to Lua functions, performing several optimizations along the way. One of the first tasks I did in pflua was obtaining the machine code that LuaJIT produces for a translated Lua function. In this post, I take a high-level look at LuaJIT’s disassembler, the piece of code that allows LuaJIT to print out the compiled machine code it produces.
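
For context, here is roughly how pflua is used, as a minimal sketch assuming pflua’s compile_filter entry point (the packet variables below are hypothetical):

-- Compile a pflang expression down to a Lua predicate function.
local pf = require("pf")
local match_http = pf.compile_filter("tcp port 80")

-- 'pkt_ptr' and 'pkt_len' are hypothetical: an FFI pointer to the raw
-- packet bytes and its length, e.g. obtained from a pcap file.
-- print(match_http(pkt_ptr, pkt_len))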

LuaJIT 2.1, currently in beta stage, introduced two very useful tools: a statistical profiler and a code dumper. Both tools are actually Lua modules. They can be used either from source code or externally when calling LuaJIT; in the latter case, it’s necessary to add the parameters -jp=options and -jdump=options.
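
For instance, to enable them from Lua source rather than from the command line, something along these lines should work (a sketch based on the jit.p and jit.dump module interfaces; the option strings are only examples):

-- Start the statistical profiler and the trace dumper programmatically.
local profiler = require("jit.p")
local dump = require("jit.dump")

profiler.start("i1")   -- sample every 1 ms
dump.on("m")           -- dump machine code for every compiled trace

-- ... run the code to be inspected ...

dump.off()
profiler.stop()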

LuaJIT’s profiler can help us understand what code is hot, in other words, what parts of the program are consuming most of the CPU cycles. The dumper helps us understand what code is produced for every trace. It’s possible to peek at the resulting Lua bytecode, LuaJIT’s SSA IR (static single assignment IR, an intermediate representation often used in compilers for procedural languages such as Java and C) and machine code, using different flags. For both the profiler and the code dumper, all the possible flags are best documented in the headers of their respective source files: src/jit/p.lua and src/jit/dump.lua. In the particular case of the code dumper, the flag ‘m’ prints out machine code for a trace:

$ luajit -jdump=m -e "local x = 1; for i=1,1e6 do x = x + 1 end; print(x)"
---- TRACE 1 start (command line):1
---- TRACE 1 mcode 81
0bcaffa3  mov dword [0x40c22410], 0x1
0bcaffae  movsd xmm0, [0x40bd7120]
0bcaffb7  cvttsd2si ebp, [rdx+0x8]
0bcaffbc  cmp dword [rdx+0x4], 0xfffeffff
0bcaffc3  jnb 0x0bca0010        ->0
0bcaffc9  movsd xmm7, [rdx]
0bcaffcd  addsd xmm7, xmm0
0bcaffd1  add ebp, +0x01
0bcaffd4  cmp ebp, 0x000f4240
0bcaffda  jg 0x0bca0014 ->1
0bcaffe0  addsd xmm7, xmm0
0bcaffe4  add ebp, +0x01
0bcaffe7  cmp ebp, 0x000f4240
0bcaffed  jle 0x0bcaffe0        ->LOOP
0bcaffef  jmp 0x0bca001c        ->3

LuaJIT uses its own disassembler to print out machine code. There’s one for every architecture supported: ARM, MIPS, PPC, x86 and x64. LuaJIT’s disassemblers are actually Lua modules, written in Lua (some of them as small as 500 LOC), and live at src/jit/dis_XXX.lua. Mike Pall comments on the header of dis_x86.lua that an external disassembler could have been used to do the actual disassembling and later integrate its result with the dumper module, but that design would be more fragile. So he decided to implement his own disassemblers.

As they are coded as modules, it’s possible to reuse them in other programs. Basically, each disassembler module exports three functions: create, disass and regname. The disass function creates a new context and disassembles a chunk of code starting at a given address. So it’s possible to pass a section of a binary and get it decoded using the disass function.
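
As a quick taste, the one-shot disass helper can be fed a raw string of machine code bytes directly. A small sketch (the bytes are the first two instructions of the dump shown further down, and disass is assumed to print to stdout by default):

-- Decode a few raw x86-64 instruction bytes.
local disass = require("dis_x64")

local code = string.char(0x31, 0xED, 0x49, 0x89, 0xD1)  -- xor ebp, ebp; mov r9, rdx
disass.disass(code, 0)  -- second argument: start address of the code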

In the example below, I’m going to use LuaJIT’s x86-64 disassembler to print out the .text section of a binary, which is the section that contains the compiled source code. I use the binary of the following hello-world program as input.

#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        printf("Usage: hello <name>\n");
        return 1;  /* avoid dereferencing a missing argument */
    }
    printf("Hello %s!", argv[1]);
    return 0;
}
I need to know the offset and size of the .text section:

$ readelf --wide -S ./hello-world
Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [13] .text             PROGBITS        0000000000400520 000520 0001b2 00  AX  0   0 16

Now all I have to do is to read that chunk of the binary and pass it to disass.

local disass = require("dis_x64")

local function readfile(name, offset, size)
   local f = io.open(name, "rb")
   if not f then
      error(("Couldn't open file: %s"):format(name))
   end
   f:seek("set", offset)
   local content = f:read(size)
   f:close()
   return content
end

local function main()
   local filename, offset, size = unpack(arg)
   assert(filename, "No file name given")
   offset = assert(tonumber(offset), "No valid offset")
   size = assert(tonumber(size), "No valid size")
   local mcode = readfile(filename, offset, size)
   -- disassemble the machine code and print it out
   disass.disass(mcode)
end

main()

And this is what I get when I print out the first 10 lines of the .text section:

$ luajit ./dis.lua ./hello-world 0x520 0x1b2 | head -10
00000000  31ED              xor ebp, ebp
00000002  4989D1            mov r9, rdx
00000005  5E                pop rsi
00000006  4889E2            mov rdx, rsp
00000009  4883E4F0          and rsp, -0x10
0000000d  50                push rax
0000000e  54                push rsp
0000000f  49C7C0D0064000    mov r8, 0x004006d0
00000016  48C7C160064000    mov rcx, 0x00400660
0000001d  48C7C716064000    mov rdi, 0x00400616

To validate that the output is correct, I compared it to the output produced by ndisasm and objdump.

$ ndisasm -b 64 ./hello-world | grep -A 10 "400520"
00000520  31ED              xor ebp,ebp
00000522  4989D1            mov r9,rdx
00000525  5E                pop rsi
00000526  4889E2            mov rdx,rsp
00000529  4883E4F0          and rsp,byte -0x10
0000052D  50                push rax
0000052E  54                push rsp
0000052F  49C7C0D0064000    mov r8,0x4006d0
00000536  48C7C160064000    mov rcx,0x400660
0000053D  48C7C716064000    mov rdi,0x400616

It looks almost the same as the output of LuaJIT’s disassembler, but that’s because LuaJIT’s disassembler follows ndisasm’s format, as stated in the source code. Objdump produces a slightly different, but semantically equivalent, output:

$ objdump -M intel -j .text -d hello-world

Disassembly of section .text:

0000000000400520 <_start>:
  400520:       31 ed                   xor    ebp,ebp
  400522:       49 89 d1                mov    r9,rdx
  400525:       5e                      pop    rsi
  400526:       48 89 e2                mov    rdx,rsp
  400529:       48 83 e4 f0             and    rsp,0xfffffffffffffff0
  40052d:       50                      push   rax
  40052e:       54                      push   rsp
  40052f:       49 c7 c0 d0 06 40 00    mov    r8,0x4006d0
  400536:       48 c7 c1 60 06 40 00    mov    rcx,0x400660
  40053d:       48 c7 c7 16 06 40 00    mov    rdi,0x400616

It is possible to do the same thing by instantiating a context via create and calling context:disass() to disassemble a chunk of machine code. This approach allows finer control of the output, as create is passed a callback that gets invoked for each disassembled line. We could accumulate the disassembled lines in a variable or print them out to standard output, as in this example.
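
A minimal sketch of that approach, reusing the mcode string read in the previous example and assuming the context API described above (create takes the code, a start address and an output callback, and the returned context has a disass method):

-- Accumulate disassembled lines in a table instead of printing them directly.
local disass = require("dis_x64")

local lines = {}
local ctx = disass.create(mcode, 0, function(line)
   lines[#lines + 1] = line
end)
ctx:disass()  -- disassemble the whole chunk passed to create
io.write(table.concat(lines))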

November 15, 2015 10:00 PM

November 11, 2015

Jacobo Aragunde

Updated LibreOffice workshop at A Coruña University

I’ve been invited to repeat the workshop about LibreOffice I conducted last year at the University of A Coruña. Contents are largely the same but I’ve done some updates, hence this short entry to share the new slides:

It was a great session, with many interesting questions from the students. Thanks to Juan José Sánchez for the invitation! I hope to be able to repeat next year :)

by Jacobo Aragunde Pérez at November 11, 2015 04:45 PM

November 09, 2015

Andy Wingo

embracing conway's law

Most of you have heard of "Conway's Law", the pithy observation that the structure of things that people build reflects the social structure of the people that build them. The extent to which there is coordination or cohesion in a system as a whole reflects the extent to which there is coordination or cohesion among the people that make the system. Interfaces between components made by different groups of people are the most fragile pieces. This division goes down to the inner life of programs, too; inside it's all just code, but when a program starts to interface with the outside world we start to see contracts, guarantees, types, documentation, fixed programming or binary interfaces, and indeed faults as well: how many bug reports end up in an accusation that team A was not using team B's API properly?

If you haven't heard of Conway's law before, well, welcome to the club. Inneresting, innit? And so thought I until now; a neat observation with explanatory power. But as aspiring engineers we should look at ways of using these laws to build systems that take advantage of their properties.

in praise of bundling

Most software projects depend on other projects. Using Conway's law, we can restate this to say that most people depend on things built by other people. The Chromium project, for example, depends on many different libraries produced by many different groups of people. But instead of requiring the user to install each of these dependencies, or even requiring the developer that works on Chrome to have them available when building Chrome, Chromium goes a step further and just includes its dependencies in its source repository. (The mechanism by which it does this isn't a direct inclusion, but since it specifies the version of all dependencies and hosts all code on Google-controlled servers, it might as well be.)

Downstream packagers like Fedora bemoan bundling, but they ignore the ways in which it can produce better software at lower cost.

One way bundling can improve software quality is by reducing the algorithmic complexity of product configurations, when expressed as a function of its code and of its dependencies. In Chromium, a project that bundles dependencies, the end product is guaranteed to work at all points in the development cycle because its dependency set is developed as a whole and thus uniquely specified. Any change to a dependency can be directly tested against the end product, and reverted if it causes regressions. This is only possible because dependencies have been pulled into the umbrella of "things the Chromium group is responsible for".

Some dependencies are automatically pulled into Chrome from their upstreams, like V8, and some aren't, like zlib. The difference is essentially social, not technical: the same organization controls V8 and Chrome and so can set the appropriate social expectations and even revert changes to upstream V8 as needed. Of course the goal of the project as a whole has technical components and technical considerations, but they can only be acted on to the extent they are socially reified: without a social organization of the zlib developers into the Chromium development team, Chromium has no business automatically importing zlib code, because the zlib developers aren't testing against Chromium when they make a release. Bundling zlib into Chromium lets the Chromium project buffer the technical artifacts of the zlib developers through the Chromium developers, thus transferring responsibility to Chromium developers as well.

Conway's law predicts that the interfaces between projects made by different groups of people are the gnarliest bits, and anyone that has ever had to maintain compatibility with a wide range of versions of upstream software has the scar tissue to prove it. The extent to which this pain is still present in Chromium is the extent to which Chromium, its dependencies, and the people that make them are not bound tightly enough. For example, making a change to V8 which results in a change to Blink unit tests is a three-step dance: first you commit a change to Blink giving Chromium a heads-up about new results being expected for the particular unit tests, then you commit your V8 change, then you commit a change to Blink marking the new test result as being the expected one. This process takes at least an hour of human interaction time, and about 4 hours of wall-clock time. This pain would go away if V8 were bundled directly into Chromium, as you could make the whole change at once.

forking considered fantastic

"Forking" sometimes gets a bad rap. Let's take the Chromium example again. Blink forked from WebKit a couple years ago, and things have been great in both projects since then. Before the split, the worst parts in WebKit were the abstraction layers that allowed Google and Apple to use the dependencies they wanted (V8 vs JSC, different process models, some other things). These abstraction layers were the reified software artifacts of the social boundaries between Google and Apple engineers. Now that the social division is gone, the gnarly abstractions are gone too. Neither group of people has to consider whether the other will be OK with any particular change. This eliminates a heavy cognitive burden and allows both projects to move faster.

As a pedestrian counter-example, Guile uses the libltdl library to abstract over the dynamic loaders of different operating systems. (Already you are probably detecting the Conway's law keywords: uses, library, abstract, different.) For years this library has done the wrong thing while trying to do the right thing, ignoring .dylib's but loading .so's on Mac (or vice versa, I can't remember), not being able to specify soversions for dependencies, throwing a stat party every time you load a library because it grovels around for completely vestigial .la files, et cetera. We sent some patches some time ago but the upstream project is completely unmaintained; the patches haven't been accepted, users build with whatever they have on their systems, and though we could try to take over upstream it's a huge asynchronous burden for something that should be simple. There is a whole zoo of concepts we don't need here and Guile would have done better to include libltdl into its source tree, or even to have forgone libltdl and just written our own thing.

Though there are costs to maintaining your own copy of what started as someone else's work, people who yammer on against forks usually fail to recognize their benefits. I think they don't realize that for a project to be technically cohesive, it needs to be socially cohesive as well; anything else is magical thinking.

not-invented-here-syndrome considered swell

Likewise there is an undercurrent of smarmy holier-than-thou moralism in some parts of the programming world. These armchair hackers want you to believe that you are a bad person if you write something new instead of building on what has already been written by someone else. This too is magical thinking that comes from believing in the fictional existence of a first-person plural, that there is one "we" of "humanity" that is making linear progress towards the singularity. Garbage. Conway's law tells you that things made by different people will have different paces, goals, constraints, and idiosyncrasies, and the impedance mismatch between you and them can be a real cost.

Sometimes these same armchair hackers will shake their heads and say "yeah, project Y had so much hubris and ignorance that they didn't want to bother understanding what X project does, and they went and implemented their own thing and made all their own mistakes." To which I say, so what? First of all, who are you to judge how other people spend their time? You're not in their shoes and it doesn't affect you, at least not in the way it affects them. An armchair hacker rarely understands the nature of value in an organization (commercial or no). People learn more when they write code than when they use it or even when they read it. When your product has a problem, where will you find the ability to fix it? Will you file a helpless bug report or will you be able to fix it directly? Assuming your software dependencies model some part of your domain, are you sure that their models are adequate for your purpose, with the minimum of useless abstraction? If the answer is "well, I'm sure they know what they're doing" then if your organization survives a few years you are certain to run into difficulties here.

One example. Some old-school Mozilla folks still gripe at Google having gone and created an entirely new JavaScript engine, back in 2008. This is incredibly naïve! Google derives immense value from having JS engine expertise in-house and not having to coordinate with anyone else. This control also gives them power to affect the kinds of JavaScript that gets written and what goes into the standard. They would not have this control if they decided to build on SpiderMonkey, and if they had built on SM, they would have forked by now.

As a much more minor, insignificant, first-person example, I am an OK compiler hacker now. I don't consider myself an expert but I do all right. I got here by making a bunch of mistakes in Guile's compiler. Of course it helps if you get up to speed using other projects like V8 or what-not, but building an organization's value via implementation shouldn't be discounted out-of-hand.

Another point is that when you build on someone else's work, especially if you plan on continuing to have a relationship with them, you are agreeing up-front to a communications tax. For programmers this cost is magnified by the degree to which asynchronous communication disrupts flow. This isn't to say that programmers can't or shouldn't communicate, of course, but it's a cost even in the best case, and a cost that can be avoided by building your own.

When you depend on a project made by a distinct group of people, you will also experience churn or lag drag, depending on whether the dependency changes faster or slower than your project. Depending on LLVM, for example, means devoting part of your team's resources to keeping up with the pace of LLVM development. On the other hand, depending on something more slow-moving can make it more difficult to work with upstream to ensure that the dependency actually suits your use case. Again, both of these drag costs are magnified by the asynchrony of communicating with people that probably don't share your goals.

Finally, for projects that aim to ship to end users, depending on people outside your organization exposes you to risk. When a security-sensitive bug is reported on some library that you use deep in your web stack, who is responsible for fixing it? If you are responsible for the security of a user-facing project, there are definite advantages for knowing who is on the hook for fixing your bug, and knowing that their priorities are your priorities. Though many free software people consider security to be an argument against bundling, I think the track record of consumer browsers like Chrome and Firefox is an argument in favor of giving power to the team that ships the product. (Of course browsers are terrifying security-sensitive piles of steaming C++! But that choice was made already. What I assert here is that they do well at getting security fixes out to users in a timely fashion.)

to use a thing, join its people

I'm not arguing that you as a software developer should never use code written by other people. That is silly and I would appreciate if commenters would refrain from this argument :)

Let's say you have looked at the costs and the benefits and you have decided to, say, build a browser on Chromium. Or re-use pieces of Chromium for your own ends. There are real costs to doing this, but those costs depend on your relationship with the people involved. To minimize your costs, you must somehow join the community of people that make your dependency. By joining yourself to the people that make your dependency, Conway's law predicts that the quality of your product as a whole will improve: there will be fewer abstraction layers as your needs are taken into account to a greater degree, your pace will align with the dependency's pace, and colleagues at Google will review for you because you are reviewing for them. In the case of Opera, for example, I know that they are deeply involved in Blink development, contributing significantly to important areas of the browser that are also used by Chromium. We at Igalia do this too; our most successful customers are those who are able to work the most closely with upstream.

On the other hand, if you don't become part of the community of people that makes something you depend on, don't be surprised when things break and you are left holding both pieces. How many times have you heard someone complain that "project A removed an API I was using"? Maybe upstream didn't know you were using it. Maybe they knew about it, but you were not a user group they cared about; to them, you had no skin in the game.

Foundations that govern software projects are an anti-pattern in many ways, but they are sometimes necessary, born from the need for mutually competing organizations to collaborate on a single project. Sometimes the answer for how to be able to depend on technical work from others is to codify your social relationship.

hi haters

One note before opening the comment flood: I know. You can't control everything. You can't be responsible for everything. One way out of the mess is just to give up, cross your fingers, and hope for the best. Sure. Fine. But know that there is no magical first-person-plural; Conway's law will apply to you and the things you build. Know what you're actually getting when you depend on other peoples' work, and know what you are paying for it. One way or another, pay for it you must.

by Andy Wingo at November 09, 2015 01:48 PM

November 03, 2015

Andy Wingo

two paths, one peak: a view from below on high-performance language implementations

Ohmigod it's November. Time flies amirite. Eck-setra. These are not actually my sentiments but sometimes I do feel like a sloth or a slow loris, grasping out at quarter-speed. Once I get a hold it's good times, but hoo boy. The tech world churns and throws up new languages and language implementations every year and how is it that in 2015, some 20 years after the project was started, Guile still doesn't do native compilation?

Though I've only been Guiling for the last 10 years or so, this article aims to plumb those depths; and more than being an apology or a splain I want to ponder the onward journey from the here and the now. I was going to write something like "looking out from this peak to the next higher peak" but besides being a cliché that's exactly what I don't mean to do. In Guile performance work I justify my slow loris grip by a mistrust of local maxima. I respect and appreciate the strategy of going for whatever gains you the most in the short term, especially in a commercial context, but with a long view maybe this approach is a near win but a long lose.

That's getting ahead of myself; let's get into this thing. We started byte-compiling Guile around 2008 or so. Guile is now near to native compilation. Where are we going with this thing?

short term: template jit

The obvious next thing to do for Guile would be to compile its bytecodes to machine code using a template JIT. This strategy just generates machine code for each bytecode instruction without regard to what comes before or after. It's dead simple. Guile's bytecode is quite well-suited to this technique, even, in the sense that an instruction doesn't correspond to much code. As Guile has a register-based VM, its instructions will also specialize well against their operands when compiled to native code. The only global state that needs to be carried around at runtime is the instruction pointer and the stack pointer, both of which you have already because of how modern processors work.

Incidentally I have long wondered why CPython doesn't have a template JIT. Spiritually I am much more in line with the PyPy project but if I were a CPython maintainer, I would use a template JIT on the bytecodes I already have. Using a template JIT preserves the semantics of bytecode, including debugging and introspection. CPython's bytecodes are at a higher level than Guile's though, with many implicit method/property lookups (at least the last time I looked at them), and so probably you would need to add inline caches as well; but no biggie. Why aren't the CPython people doing this? What is their long-term perf story anyway -- keep shovelling C into the extension furnace? Lose to PyPy?

In the case of Guile we are not yet grasping in this direction because we don't have (direct) competition from PyPy :) But also there are some problems with a template JIT. Once you internalize the short-term mentality of a template JIT you can get stuck optimizing bytecode, optimizing template JIT compilation, and building up a baroque structure that by its sheer mass may prevent you from ever building The Right Thing. You will have to consider how a bytecode-less compilation pipeline interacts with not only JITted code but also bytecode, because it's a lose to do a template JIT for code that is only executed once.

This sort of short-term thinking is what makes people also have to support on-stack replacement (OSR), also known as hot loop transfer. The basic idea is that code that executes often has to be JITted to go fast, but you can't JIT everything because that would be slow. So you wait to compile a function until it's been called a few times; fine. But with loops it could be that a function is called just once but a loop in the function executes many times. You need to be able to "tier up" to the template JIT from within a loop. This complexity is needed at the highest performance level, but if you choose to do a template JIT you're basically committing to implementing OSR early on.

Additionally the implementation of a template JIT compiler is usually a bunch of C or C++ code. It doesn't make sense to include a template JIT in a self-hosted system that compiles to bytecode, because it would be sad to have the JIT not be written in the source language (Guile Scheme in our case).

Finally in Scheme we have tail-call and delimited continuation considerations. Currently in Guile all calls happen in the Guile bytecode interpreter, which makes tail calls easy -- the machine frame stays the same and we just have to make a tail call on the Scheme frame. This is fine because we don't actually control the machine frame (the C frame) of the bytecode interpreter itself -- the C compiler just does whatever it does. But how to tail call between the bytecode interpreter and JIT-compiled code? You'd need to add a trampoline beneath both the C interpreter and any entry into compiled code that would trampoline to the other implementation, depending on how the callee "returns". And how would you capture stack slices with delimited continuations? It's possible (probably -- I don't know how to reinstate a delimited continuation with both native and interpreted frames), but something of a headache, and is it really necessary?

if you compile ahead-of-time anyway...

The funny thing about CPython is that, like Guile, it is actually an ahead-of-time compiler. While the short-term win would certainly be to add a template JIT, because the bytecode is produced the first time a script is run and cached thereafter, you might as well compile the bytecode to machine code ahead-of-time too and skip the time overhead of JIT compilation on every run. In a template JIT, the machine code is only a function of the bytecode (assuming the template JIT doesn't generate code that depends on the shape of the heap).

Compiling native code ahead of time also saves on memory usage, because you can use file-backed mappings that can be lazily paged in and shared between multiple processes, and when these pages are in cache that also translates to faster startup.

But if you're compiling bytecode ahead of time to native code, what is the bytecode for?

(not) my beautiful house

At some point you reach a state where you have made logical short-term decisions all the way and you end up with vestigial organs of WTF in your language runtime. Bytecode, for example. A bytecode interpreter written in C. Object file formats for things you don't care about. Trampolines. It's time to back up and consider just what it is that we should be building.

The highest-performing language implementations will be able to compile together the regions of code in which a program spends most of its time. Ahead-of-time compilers can try to predict these regions, but you can't always know what the behavior of a program will be. A program's run-time depends on its inputs, and program inputs are late-bound.

Therefore these highest-performing systems will use some form of adaptive optimization to apply run-time JIT compilation power on whatever part of a program turns out to be hot. This is the peak performance architecture, and indeed in the climb to a performant language implementation, there is but one peak that I know of. The question becomes, how to get there? What path should I take, with the priorities I have and the resources available to me, which lets me climb the farthest up the hill while always leaving the way clear to the top?

guile's priorities

There are lots of options here, and instead of discussing the space as a whole I'll just frame the topic with some bullets. Here's what I want out of Guile:

  1. The project as a whole should be pleasing to hack on. As much of the system as possible should be written in Scheme, as little as possible in C or assembler, and dependencies on outside projects should be minimized.

  2. Guile users should be able to brag about startup speed to their colleagues. We are willing to trade away some peak throughput for faster startup, if need be.

  3. Debuggability is important -- a Guile hacker will always want to be able to get stack traces with actual arguments and local variable values, unless they stripped their compiled Guile binaries, which should be possible as well. But we are willing to give up some debuggability to improve performance and memory use. In the same way that a tail call replaces the current frame in its entirety, we're willing to lose values of dead variables in stack frames that are waiting on functions to return. We're also OK with other debuggability imprecisions if the performance gains are good enough. With macro expansion, Scheme hackers expect a compilation phase; spending time transforming a program via ahead-of-time compilation is acceptable.

Call it the Guile Implementor's Manifesto, or the manifesto of this implementor at least.

beaucoup bucks

Of course if you have megabucks and ace hackers, then you want to dial back on the compromises: excellent startup time but also source-level debugging! The user should be able to break on any source position: the compiler won't even fold 1 + 1 to 2. But to get decent performance you need to be able to tier up to an optimizing compiler soon, and soon in two senses: soon after starting the program, but also soon after starting your project. It's an intimidating thing to build when you are just starting on a language implementation. You need to be able to tier down too, at least for debugging and probably for other reasons too. This strategy goes in the right direction, performance-wise, but it's a steep ascent. You need experienced language implementors, and they are not cheap.

The usual strategy for this kind of implementation is to write it all in C++. The latency requirements are too strict to do otherwise. Once you start down this road, you never stop: your life as an implementor is that of a powerful, bitter C++ wizard.

The PyPy people have valiantly resisted this trend, writing their Python implementation in Python itself, but they concede to latency by compiling their "translated interpreter" into C, which then obviously can't itself be debugged as Python code. It's self-hosting, but staged into C. Ah well. Still, a most valiant, respectable effort.

This kind of language implementation usually has bytecode, as it's a convenient reification of the source semantics, but it doesn't have to. V8 is a good counterexample, at least currently: it treats JavaScript source code as the canonical representation of program semantics, relying on its ability to re-parse source text to an AST in the same way every time as needed. V8's first-tier implementation is actually a simple native code compiler, generated from an AST walk. But things are moving in the bytecode direction in the V8 world, reluctantly, so we should consider bytecode as the backbone of the beaucoup-bucks language implementation.

shoestring slim

If you are willing to relax on source-level debugging, as I am in Guile, you can simplify things substantially. You don't need bytecode, and you don't need a template JIT; in the case of Guile, probably the next step in Guile's implementation is to replace the bytecode compiler and interpreter with a simple native code compiler. We can start with the equivalent of a template JIT, but without the bytecode, and without having to think about the relationship between compiled and (bytecode-)interpreted code. (Guile still has a traditional tree-oriented interpreter, but it is actually written in Scheme; that is a story for another day.)

There's no need to stop at a simple compiler, of course. Guile's bytecode compiler is already fairly advanced, with interprocedural optimizations like closure optimization, partial evaluation, and contification, as well as the usual loop-invariant code motion, common subexpression elimination, scalar replacement, unboxing, and so on. Add register allocation and you can have quite a fine native compiler, and you might even beat the fabled Scheme compilers on the odd benchmark. They'll call you plucky: high praise.

There's a danger in this strategy though, and it's endemic in the Scheme world. Our compilers are often able to do heroic things, but only on the kinds of programs they can fully understand. We as Schemers bend ourselves to the will of our compilers, writing only the kinds of programs our compilers handle well. Sometimes we're scared to fold, preferring instead to inline the named-let iteration manually to make sure the compiler can do its job. We fx+ when we should +; we use tagged vectors when we should use proper data structures. This is déformation professionnelle, as the French would say. I gave a talk at last year's Scheme workshop on this topic. PyPy people largely don't have this problem, for example; their language implementation is able to see through abstractions at run-time to produce good code, but using adaptive optimization instead of ahead-of-time trickery.

So, an ahead-of-time compiler is perhaps a ridge, but it is not the peak. No amount of clever compilation will remove the need for an adaptive optimizer, and indeed too much cleverness will stunt the code of your users. The task becomes, how to progress from a decent AOT native compiler to a system with adaptive optimization?

Here, as far as I know, we have a research problem. In Guile we have mostly traced the paths of history, re-creating things that existed before. As Goethe said, quoted in the introduction to The Joy of Cooking: "That which thy forbears have bequeathed to thee, earn it anew if thou wouldst possess it." But finally we find here something new, or new-ish: I don't know of good examples of AOT compilers that later added adaptive optimization. Do you know of any, dear reader? I would be delighted to know.

In the absence of a blazed trail to the top, what I would like to do is to re-use the AOT compiler to do dynamic inlining. We might need to collect type feedback as well, though inlining is the more important optimization. I think we can serialize the compiler's intermediate representation into a special section in the ELF object files that Guile produces. A background thread or threads can monitor profiling information from main threads. If a JIT thread decides two functions should be inlined, it can deserialize compiler IR and run the standard AOT compiler. We'd need a bit of mutability in the main program in which to inject such an optimization; an inline cache would do. If we need type feedback, we can make inline caches do that job too.

All this is yet a ways off. The next step for Guile, after the 2.2 release, is a simple native compiler, then register allocation. Step by step.

but what about llvmmmmmmmmmmmmm

People always ask about LLVM. It is an excellent compiler backend. It's a bit big, and maybe you're OK with that, or maybe not; whatever. Using LLVM effectively depends on your ability to deal with churn and big projects. But if you can do that, swell, you have excellent code generation. But how does it help you get to the top? Here things are less clear. There are a few projects using LLVM effectively as a JIT compiler, but that is a very recent development. My hubris, desire for self-hosting, and lack of bandwidth for code churn makes it so that I won't use LLVM myself but I have no doubt that a similar strategy to that which I outline above could work well for LLVM. Serialize the bitcode into your object files, make it so that you can map all optimization points to labels in that bitcode, and you have the ability to do some basic dynamic inlining. Godspeed!


If you're interested, I gave a talk a year ago on the state of JavaScript implementations, and how they all ended up looking more or less the same. This common architecture was first introduced by Self; language implementations in this category include HotSpot and any of the JavaScript implementations.

Some notes on how PyPy produces interpreters from RPython.

and so I bid you good night

Guile's compiler has grown slowly, in tow of my ballooning awareness of ignorance and more slowly inflating experience. Perhaps we could have done the native code compilation thing earlier, but I am happy with our steady progress over the last five years or so. We had to scrap one bytecode VM and one or two compiler intermediate representations, and though that was painful I think we've done pretty well as far as higher-order optimizations go. If we had done native compilation earlier, I can't but think the inevitably wrong decisions we would have made on the back-end would have prevented us from having the courage to get the middle-end right. As it is, I see the way to the top, through the pass of ahead-of-time compilation and thence to a dynamic inliner. It will be some time before we get there, but that's what I signed up for :) Onward!

by Andy Wingo at November 03, 2015 11:47 PM

October 29, 2015

Andy Wingo

type folding in guile

A-hey hey hey, my peeps! Today's missive is about another optimization pass in Guile that we call "type folding". There's probably a more proper name for this, but for the moment we go with "type folding" as it's shorter than "abstract constant propagation, constant folding, and branch folding based on flow-sensitive type and range analysis".

on types

A word of warning to the type-system enthusiasts among my readers: here I'm using "type" in the dynamic-languages sense, to mean "a property about a value". For example, whether a value is a vector or a pair is a property of that value. I know that y'all use that word for other purposes, but there are other uses that do not falute so highly, and it's in the more pedestrian sense that I'm interested here.

To back up a bit: what are the sources of type information in dynamic languages? In Guile, there are three ways the compiler can learn about a value's type.

One source of type information is the compiler's knowledge of the result types of expressions in the language, especially constants and calls to the language's primitives. For example, in the Scheme definition (define y (vector-length z)), we know that y is a non-negative integer, and we probably also know a maximum value for y too, given that vectors have a maximum size.

Conditional branches with type predicates also provide type information. For example, consider this Scheme expression:

(lambda (x)
  (if (pair? x)
      (car x)
      (error "not a pair" x)))

Here we can say that at the point of the (car x) expression, x is definitely a pair. Conditional branches are interesting because they add a second dimension to type analysis. The question is no longer "what is the type of all variables", but "what is the type of all variables at all points in the program".

Finally, we have the effect of argument type checks in function calls. For example in the (define y (vector-length z)) definition, after (vector-length z) has been evaluated, we know that z is indeed a vector, because if it weren't, then the primitive call would raise an exception.

In summary, the information that we would like to have is what type each variable has at each program point (label). This information comes from where the variables are defined (the first source of type information), conditional branches and control-flow joins (the second source), and variable use sites that imply type checks (the third). It's a little gnarly but in essence it's a classic flow analysis. We treat the "type" of a variable as a set of possible types. A solution to the flow equations results in a set of types for each variable at each label. We use the intmap data structures to share space between the solution at different program points, resulting in an O(n log n) space complexity.

In Guile we also solve for the range of values a variable may take, at the same time as solving for type. I started doing this as part of the Compost hack a couple years ago, where I needed to be able to prove that the operand to sqrt was non-negative in order to avoid sqrt producing complex numbers. Associating range with type turns out to generalize nicely to other data types which may be thought of as having a "magnitude" -- for example a successful (vector-ref v 3) implies that v is at least 4 elements long. Guile can propagate this information down the flow graph, or propagate it in the other way: if we know the vector was constructed as being 10 elements long, then a successful (vector-ref v n) can only mean that n is between 0 and 9.

what for the typing of the things

Guile's compiler uses type analysis in a few ways at a few different stages. One basic use is in dead code elimination (DCE). An expression can be eliminated from a program if its value is never used and if it causes no side effects. Guile models side effects (and memory dependencies between expressions) with effects analysis. I quote:

We model four kinds of effects: type checks (T), allocations (A), reads (R), and writes (W). Each of these effects is allocated to a bit. An expression can have any or none of these effects.

In an expression like (vector-ref v n), type analysis may compute that in fact v is indeed a vector and n is an integer, and also that n is within the range of valid indexes of v. In that case we can remove the type check (T) bit from the expression's effects, opening up the expression for DCE.

Getting back to the topic of this article, Guile's "type folding" pass uses type inference in three ways.

The first use of type information is if we determine that, at a given use site, a variable has precisely one type and one value. In that case we can do constant folding over that expression, replacing its uses with its value. For example, let's say we have the expression (define len (vector-length v)). If we know that v is a vector of length 5, we can replace any use of len with the constant, 5. As an implementation detail we actually keep the definition of len in place and let DCE remove it later. We can consider this to be abstract constant propagation: abstract in the sense that it folds over abstract values, represented just as type sets and ranges, and materializes a concrete value only if it is able to do so. Since ranges propagate through operators as well, it can also be considered as abstract constant folding; the type inference operators act as constant folders.

Another use of type information is in branches. If Guile sees (if (< n (vector-length v)) 1 2) and n and v have the right types and disjoint ranges, then we can fold the test and choose 1 or 2 depending on how the test folds.

Finally type information can enable strength reduction. For example it's a common compiler trick to want to reduce (* n 16) to (ash n 4), but if n isn't an integer this isn't going to work. Likewise, (* n 0) can be 0, 0.0, 0.0+0.0i, something else, or an error, depending on the type of n and whether the * operator has been extended to apply over non-number types. Type folding uses type information to reduce the strength of operations like these, but only where it can prove that the transformation is valid.

So that's type folding! It's a pretty neat pass that does a few things at once. Code here, and code for the type inference itself here.

type-driven unboxing

Guile uses type information in one other way currently, and that is to determine when to unbox floating-point numbers. The current metric is that whenever an arithmetic operation will produce a floating-point number -- in Scheme parlance, an inexact real -- then that operation should be unboxed, if it has an unboxed counterpart. Unboxed operations on floating-point numbers are advantageous because they don't have to allocate space on the garbage-collected heap for their result. Since an unboxed operation like the fadd floating-point addition operator takes raw floating-point numbers as operands, it also will never cause a type check, unlike the polymorphic add instruction. Knowing that fadd has no effects lets the compiler do a better job at common subexpression elimination (CSE), dead code elimination, loop-invariant code motion, and so on.

To unbox an operation, its operands are unboxed, the operation itself is replaced with its unboxed counterpart, and the result is then boxed. This turns something like:

(+ a b)

into

(f64->scm (fl+ (scm->f64 a) (scm->f64 b)))

You wouldn't think this would be an optimization, except that the CSE pass can eliminate many of these conversion pairs using its scalar elimination via fabricated expressions pass.

A proper flow-sensitive type analysis is what enables sound, effective unboxing. After arithmetic operations have been unboxed, Guile then goes through and tries to unbox loop variables and other variables with more than one definition ("phi" variables, for the elect). It mostly succeeds at this. The results are great: summing a packed vector of 10 million 32-bit floating-point values goes down from 500ms to 130ms, on my machine, with no time spent in the garbage collector. Once we start doing native compilation we should be up to about 5e8 or 10e8 floats per second in this microbenchmark, which is totally respectable and about what gcc's -O0 performance gets.


This kind of type inference works great in tight loops, and since that's how many programs spend most of their time, that's great. Of course, this situation is also a product of programmers knowing that tight loops are how computers go the fastest, or at least of how compilers do the best on their code.

Where this approach to type inference breaks down is at function boundaries. There are no higher-order types and no higher-order reasoning, and indeed no function types at all! This is partially mitigated by earlier partial evaluation and contification passes, which open up space in which the type inferrer can work. Method JIT compilers share this weakness with Guile; tracing JIT runtimes like LuaJIT and PyPy largely do not.

up summing

So that's the thing! I was finally motivated to dust off this draft given the recent work on unboxing in a development branch of Guile. Happy hacking and we promise we'll actually make releases so that you can use it soon soon :)

by Andy Wingo at October 29, 2015 10:13 PM

October 28, 2015

Carlos Alberto López Pérez

Introducing the meta-webkit Yocto layer.

Lately, in my daily work at Igalia, I have been learning to use Yocto / OpenEmbedded to create custom distributions and products targeting different embedded hardware. One of the goals was to create a Kiosk-like browser that was based on a modern Web engine, light (specially regarding RAM requirements) and fast.

WebKit fits these requirements perfectly, so I started checking the recipes available in the different layers of Yocto to build WebKit. Unfortunately, the available recipes for WebKitGTK+ were for ancient versions of the engine (no WebKit2 multi-process model). This has changed recently, as the WebKitGTK+ recipe available in the oe-core layer has been updated to a new version. In any case, we needed this for an older release of Yocto, so I ended up creating a new layer.

I have also added some recipes for WebKitForWayland and released it as meta-webkit. The idea is to collect here recipes for building browsers and web engines based on WebKit. For the moment it includes recipes for building the WebKitGTK+ and WebKitForWayland ports.

WebKitGTK+ vs WebKitForWayland

First, a few words about the main differences between these two ports of WebKit, so you have a better understanding of where to use one or the other.

Let’s start by defining both engines:

  • WebKit for Wayland port pairs the WebKit engine with the Wayland display protocol, allowing embedders to create simple and performant systems based on Web platform technologies. It is designed with hardware acceleration in mind, relying on EGL, the Wayland EGL platform, and OpenGL ES.
  • WebKitGTK+ is a full-featured port of the WebKit rendering engine, suitable for projects requiring any kind of web integration, from hybrid HTML/CSS applications to full-fledged web browsers. It offers WebKit’s full functionality and is useful in a wide range of systems from desktop computers to embedded systems like phones, tablets, and televisions.

From the definitions you may already guess some differences. WebKitForWayland focuses on simple and performant Web-oriented systems, while WebKitGTK+ focuses on any kind of product (complex or simple) that requires full Web integration.

WebKitForWayland is what you need when you want a very fast and light HTML5/js runtime, capable of squeezing the maximum performance out of the hardware acceleration of your platform (Wayland EGL platform). But it is not suitable if you are thinking of building a general-purpose browser on top of it. It lacks some features that are not needed for a Web runtime but that are desirable for a more generic browser.

Here you can see a video of WebKitForWayland running on the weston ivi-shell, with several instances of WebKitForWayland showing different demos (poster circle and several webgl demos), running on the Intel NUC (Intel Atom E3815 – 1 core). The OS is a custom image built with Yocto 1.8 and this meta-webkit layer.

On the Weston terminal you can appreciate the low resource usage (CPU and RAM) despite having several browser processes running the demanding demos at full speed. Link to the video on YouTube here.

In case you want to try it, you might want to build an image for your device with meta-webkit and Yocto (see below), or you can test the image that I built for the demo in the above video. You can flash it to a USB memory stick as follows, and boot it on any Intel x86_64 machine that has an Intel GPU:

# /dev/sdX is the device of your usb memory stick (for example /dev/sdc)
curl -s | xz -dc | dd bs=4k of=/dev/sdX

On the other hand, WebKitGTK+ allows you to build a custom browser on top of it. It has a rich and stable API and many goodies like support for plugins, an integrated web inspector, full toolkit support, etc. WebKitGTK+ also performs very well and consumes few resources; it can run on top of both Wayland and X11. But at the moment of writing this, the support for Wayland is still incomplete (we are working on it). So if possible we recommend that you base your product on the X11 backend. You could later migrate to Wayland without much effort once we complete the support for it.

Building a Yocto image with WebKitForWayland

The usual way to create an image with Yocto and WebKitForWayland is:

  • Setup the environment and source oe-init-build-env as usual.
  • Checkout the branch of meta-webkit that matches your Yocto/OE version (for example: fido).
    Note that fido (1.8) is the oldest version supported. If you are using the fido branch you will also need to add the meta-ruby layer that is available on meta-openembedded.
  • Add the path to the meta-webkit layer in your conf/bblayers.conf file.
  • Append the following lines to the conf/local.conf file:
    DISTRO_FEATURES_append = " opengl wayland"
    IMAGE_INSTALL_append = " webkitforwayland"
  • Then build the target image, for example:
    bitbake core-image-weston
  • Then boot the image on the target device and run the webkitforwayland engine from a weston terminal:
    WPELauncher

Building a Yocto image with WebKitGTK+

There are some things to take into account when building WebKitGTK+:

  • The package webkitgtk contains the shared libraries and the webkitgtk runtime.
  • The package webkitgtk-bin contains the MiniBrowser executable. This is a very basic browser built on top of the webkitgtk runtime, mainly used for testing purposes.
  • On the recipe there are several packageconfig options that you can tune. For example, for enabling WebGL support you can add the following to your conf/local.conf file:
    PACKAGECONFIG_pn-webkitgtk = "x11 webgl"

    Check the recipe source code to see all the available options.

  • The name of the recipe is the same as the one available in oe-core (master), so you should select which version of webkitgtk you want to build. For example, to build 2.10.3 add to conf/local.conf:
    PREFERRED_VERSION_webkitgtk = "2.10.3"

So, the usual way to create an image with Yocto and WebKitGTK+ is:

  • Setup the environment and source oe-init-build-env as usual.
  • Checkout the branch of meta-webkit that matches your Yocto/OE version (for example: fido).
    Note that fido (1.8) is the oldest version supported. If you are using the fido branch you will also need to add the meta-ruby layer that is available on meta-openembedded.
  • Add the following lines to your conf/local.conf file (for building the X11 backend of WebKitGTK+):
    DISTRO_FEATURES_append = " opengl x11"
    IMAGE_INSTALL_append = " webkitgtk-bin"
  • Then build the X11 image:
    bitbake core-image-sato
  • Then boot the image on the target device and run the included test browser from an X terminal (or any other browser that you have developed on top of the WebKitGTK+ API).
  • Further info

    If you are new to Yocto / OpenEmbedded, a good starting point is to check out the documentation.

    If you have any issue or doubt with the meta-webkit layer, please let me know about that (you can mail me at or open an issue on github)

    Finally, if you need help integrating a Web engine into your product, you can also hire us. As maintainers of the GTK+ and WebKit For Wayland ports of WebKit we have considerable experience creating, maintaining and optimizing ports of WebKit.

by clopez at October 28, 2015 01:23 PM

October 23, 2015

Víctor Jáquez

GStreamer Conference 2015

Last October 8th and 9th, the GStreamer Conference 2015 took place in Dublin. It was full of interesting talks and colleagues. I really enjoyed it.

I participated with our latest work in gstreamer-vaapi. You can glance at the slides of my talk.

GStreamer VA-API talk

By the way, if you are in Dublin, let me recommend a visit to the Chester Beatty Library. There are a lot of antique, beautiful and rare books.

by vjaquez at October 23, 2015 10:50 AM

October 19, 2015

Jacobo Aragunde

GENIVI meeting at Seoul

Hello! I’m writing this update from Seoul!

The National Folk Museum of Korea, framed by the gardens surrounding Gyeongbokgung Palace

The purpose of this trip is to take part in the GENIVI 13th All Member Meeting. Igalia will present the contributions made to the GENIVI community in the browsers area. I have been coordinating the company’s investment in the automotive field over the last year and have taken part in the development of several prototypes involving Chromium technologies; that is the reason I am here with Silvia Cho, Gwang Yoon Hwang and Michael Catanzaro.

Among other things, Igalia has contributed our knowledge to the discussion on the different technical choices to build a browser for the sector, providing choices and analyzing the pros and cons of each one. We have been working to complete the implementation of Ozone-Wayland, the layer that enables running Chromium on Wayland: this task will be essential in the automotive sector, where next generation platforms are being built on top of Wayland. Finally, we have put together the pieces that allow running a modern instance of Chromium on a Renesas R-Car M2 board, an embedded platform designed for this sector; a job that implied hacking the BSPs, the build recipes and even some wrapper libraries.

    Three talks will be presented by Igalians in this event:

    We will also have a booth at the conference showcase, where the public will be able to see our work in action. An R-Car board will be running Chromium on top of Wayland to display our work on Ozone-Wayland and porting. Last but not least, there will also be some demos featuring WebKit for Wayland, a fresh flavor of WebKit that focuses on high performance with a low footprint.

    Chromium running on top of Weston on an embedded ARM device.

    You can expect to see slides and maybe some more pictures here, as usual, later this week. Happy hacking!

    EDIT: uploaded the slides of the three talks, linked above. Also added these pictures from twitter:

    by Jacobo Aragunde Pérez at October 19, 2015 09:35 AM

    October 15, 2015

    Diego Pino

    Multicore architectures and CPU affinity

    Lately I have been working on Snabb Switch as part of the networking team we have at Igalia. Snabb Switch is a kernel-bypass software switch that talks directly to 10-Gbps network cards. This allows Snabb Switch to manipulate packets at rates the Linux kernel is simply not capable of handling. Snabb Switch also provides a framework to develop network applications, and most of the code is Lua (LuaJIT). Snabb Switch rides on the latest innovations in hardware and is very innovative in many ways. All this has allowed me to catch up with many technologies. In this post I review the last 10 years of Intel microprocessor evolution.

    One of the first attempts by Intel to parallelize processing was hyper-threading, a technology that debuted in the Xeon in 2002 and later that year in the Pentium 4. A single CPU with hyper-threading appears as two logical CPUs to the operating system. Hyper-threading takes advantage of the superscalar architecture of modern CPUs, in which instruction processing is divided into several independent pipelines. By duplicating the number of registers in a CPU, hyper-threading can exploit the different pipelines, so that most of them are busy at any given time. However, other resources such as the cache are not actually duplicated, but shared by the two logical CPUs. A CPU with hyper-threading enabled can provide a performance boost of up to 30%, or in the worst case no boost at all.

    After this first attempt at bringing real concurrent processing, Intel CPUs started to incorporate multiple cores per socket. For some time hyper-threading technology was set aside, as Intel modeled its new Core microarchitecture after the P6 microarchitecture (Pentium Pro, Pentium III and Celeron). However, with the release of the new Nehalem microarchitecture (Core i7) at the end of 2008, hyper-threading made a comeback, and since then most Intel processors have featured this technology. As hyper-threading adds a logical CPU for each physical core, my dual-core CPU appears as four cores to the OS.

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                4
    On-line CPU(s) list:   0-3
    Thread(s) per core:    2
    Core(s) per socket:    2

    Two core(s) per socket and two thread(s) per core make a total of four CPU(s).
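
    A quick way to double-check which logical CPUs are hyper-threading siblings of the same physical core is to ask sysfs; a minimal sketch (the pairing shown is illustrative and varies between machines):

    $ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    0,2

    Logical CPUs listed together share one physical core, and therefore its execution resources and caches.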

    What really encouraged Intel to switch to a multicore architecture was the inability to keep improving speed by simply increasing the number of transistors in a CPU. As Gordon Moore, co-founder of Intel, noticed in 1965, the number of transistors per square inch in a CPU doubles every year (Moore's Law), something that still applies today although the period has stretched to 2.5 years. However, while the number of transistors in the last decade has gone from 37.5 million to 904 million, CPU clock speed has barely doubled, going from 1.3 GHz to 2.8 GHz. In 2004, the heat build-up in CPUs caused Intel to abandon this model and start featuring multiple cores in a single processing unit.

    The first multicore processors accessed memory through a shared bus; in other words, all cores shared a common memory. This design is known as UMA or Uniform Memory Access. As the number of cores increased, contention issues appeared. Access to the bus became a bottleneck to scalability, preventing the addition of more cores. To solve this problem, CPU designers introduced a new type of memory layout known as Non-Uniform Memory Access or NUMA. In a NUMA topology, cores are grouped into a single unit called a NUMA node. Each NUMA node has local memory assigned, which guarantees that only a limited number of cores will try to access memory at a given time. The memory that is not local to a NUMA node is known as remote or foreign. Access to remote memory takes longer, because the distance between processor and memory affects access time. That is the main reason why this architecture is called NUMA, and why it is often referred to as a topology. Both UMA and NUMA are types of shared-memory architectures.

    Architecture of a NUMA system

    Architecture of a NUMA system. Source: Wikipedia

    Unfortunately, my laptop only has one NUMA group:

    $ lscpu | grep NUMA
    NUMA node(s):          1
    NUMA node0 CPU(s):     0-3

    Or using the numactl command:

    $ numactl --show
    policy: default
    preferred node: current
    physcpubind: 0 1 2 3 
    cpubind: 0 
    nodebind: 0 
    membind: 0 

    This shows that my CPU features only one NUMA node containing 4 cores; in other words, my laptop implements a UMA layout. The same command on a more powerful machine (an Intel Xeon Processor E5-2697 v2, with 12 cores per socket) provides the following information:

    $ lscpu
    CPU(s):                48
    On-line CPU(s) list:   0-47
    Thread(s) per core:    2
    Core(s) per socket:    12
    Socket(s):             2
    NUMA node(s):          2
    NUMA node0 CPU(s):     0-11,24-35
    NUMA node1 CPU(s):     12-23,36-47

    This is a much more advanced CPU. There are 2 sockets, each containing 12 cores (24 logical CPUs with hyper-threading). Cores are grouped into 2 NUMA nodes (node0 and node1). The lscpu command also provides information about node distances:

    node distances:
    node   0   1 
    0:  10  21 
    1:  21  10 

    Besides cores, peripheral devices also compete for access to the data bus, since devices have direct access to memory through DMA (Direct Memory Access). In a NUMA layout, this means that devices, usually connected to a PCI port, have a NUMA node number assigned too. If a process is running on a core which heavily interacts with an I/O device belonging to a different NUMA node, performance degradation issues may appear. NUMA considerably benefits from the data locality principle, so devices and processes operating on the same data should run within the same NUMA node.

    $ cat /sys/bus/pci/devices/0000:00:01.0/numa_node
    0

    In the example above, the device with PCI address 0000:00:01.0 belongs to NUMA node0.

    If data locality is so important in a NUMA architecture, it would be very handy to be able to select on which core a program runs. Generally, when the OS executes a program it creates a process and assigns it to a core following a scheduling algorithm. The process will run for some time, until the kernel dispatcher assigns a new process to the core and the former process is put to sleep. When it wakes up, it may be reassigned to a different core. Usually, if the process consumes a lot of CPU time the kernel won't reassign it to a different core, but it can happen.

    CPU affinity lets us bind a process or thread to a core or group of cores. From the user's perspective, commands such as taskset or numactl help us control the CPU affinity of a process:

    $ taskset -c 0 snabb    # Run the program 'snabb' on core 0
    $ taskset -c 0-3 snabb  # Run the program 'snabb' on cores 0,1,2,3
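
    taskset can also query or change the affinity of a process that is already running; a small sketch (the PID is illustrative):

    $ taskset -cp 1234      # Show the cores process 1234 is allowed to run on
    $ taskset -cp 0-1 1234  # Restrict process 1234 to cores 0 and 1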

    Likewise, with the command numactl:

    $ numactl --cpunodebind=0 snabb    # Run the program 'snabb' on any of the cores of node0
    $ numactl --physcpubind=0-3 snabb  # Run the program 'snabb' on cores 0,1,2,3

    Besides setting CPU affinity, numactl allows finer control, making it possible to select different preferences for memory allocation (local only, local preferred, etc.).
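
    For instance, a few sketches of those memory policies ('snabb' stands in for any program; see the numactl manual page for the complete list of options):

    $ numactl --membind=0 snabb       # Allocate memory on node0 only
    $ numactl --preferred=0 snabb     # Prefer node0, fall back to other nodes
    $ numactl --interleave=0,1 snabb  # Spread allocations round-robin across both nodes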

    Summarizing: most modern CPUs, if not all, implement a multicore architecture on a NUMA topology. Arbitrary assignment of processes to cores may have an impact on performance, especially if a running program interacts with a PCI device in a different NUMA node. Tools such as taskset and numactl give us finer control over CPU affinity, making it possible to select which core or NUMA node will run a program.
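
    Putting it all together, here is a small sketch that pins a program to the NUMA node of the NIC it drives (the PCI address and program name are illustrative):

    $ node=$(cat /sys/bus/pci/devices/0000:00:01.0/numa_node)
    $ numactl --cpunodebind=$node --membind=$node snabb

    This keeps both the code and its memory allocations local to the node the device is attached to.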

    Lastly, some recommended links to articles that I read, at least partially, and that helped me write this post:

    October 15, 2015 10:00 AM

    October 06, 2015

    Manuel Rego

    Grid Layout Coast to Coast

    So here we are again, this time to announce that I’ll be presenting my talk CSS Grid Layout from the inside out at HTML5DevConf Autumn 2015 (19-20 October). I’m really excited about this new opportunity to show the world the wonders of the CSS Grid Layout specification, together with the “dark” stuff that happens behind the scenes. Thanks to the organization for counting me in. 😃

    My talk announced at the HTML5DevConf website

    During the talk I’ll give a short introduction to the grid layout syntax and features, including some of the most recent developments, like the much awaited grid gutters (which will hopefully land very soon). Igalia has been working on the CSS Grid Layout implementation in Blink and WebKit since 2013, as part of a collaboration with Bloomberg. Right now we’re closer than ever to shipping the feature in Chromium!

    Igalia and Bloomberg working together to build a better web

    This’ll be my second grid layout talk this year, after the one at CSSConf in New York. It’ll be my first time at HTML5DevConf and my first time in San Francisco too. It seems like a huge conference, full of great speakers and interesting talks. I really hope that people like my talk; in any case, I’m sure I’ll enjoy the conference and learn a lot about the latest cool stuff moving around the web platform. Please, feel free to ping me if you want to talk about grid, new layout models, the web and/or Igalia.

    Last but not least, my Igalian mate Martin Robinson, who is currently hacking on Servo, will be attending the conference too. Don’t hesitate to ask him any question about web browser engines; he knows them very well. 😉

    October 06, 2015 10:00 PM

    September 25, 2015

    Samuel Iglesias

    ARB_program_interface_query and shader_runner

    Lately I have been working on the ARB_shader_storage_buffer_object extension implementation in Mesa, along with Iago (who has already explained how we implement it).

    One of the features we implemented was adding support for buffer variables to the queries defined in ARB_program_interface_query. Those queries ended up being very useful to check another feature introduced by ARB_shader_storage_buffer_object: the new storage layout called std430.

    I have developed basic tests for piglit, but Jordan Justen and Ilia Mirkin mentioned that Ian Romanick had worked on comprehensive testing for uniform buffer objects, presented at XDC2014.

    Ian developed stochastic, search-based testing for uniform block layouts: it generates shader_runner tests using a Mako template that defines a series of uniform blocks with different layouts. shader_runner then verifies information such as the variable type, its offset inside the block, its array size (if any), and whether it is row major, through glGetActiveUniforms*() queries, using the “active uniform” command.

    That was the kind of testing that I was looking for, so I cloned his repository and developed a modified version for shader storage blocks and std430.

    During that process, I found that shader_runner was lacking support for the queries defined in the ARB_program_interface_query extension. Yesterday, the patches were pushed to the master branch of the piglit repository.

    If you want to use this new command, the format is the following:

    verify program_interface_query GL_INTERFACE_TYPE_ENUM var_name GL_PROPS_ENUM integer

    or, if we include the GL type enum:

    verify program_interface_query GL_INTERFACE_TYPE_ENUM var_name GL_PROPS_ENUM GL_TYPE_ENUM

    I have written an example to show how to use this command in a shader_runner test. This is just a snippet showing the usage of the new command:

    link success
    verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_TYPE GL_FLOAT_VEC4
    verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_TOP_LEVEL_ARRAY_SIZE 4
    verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_OFFSET 128
    verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[0].s1_3.fv3[0] GL_OFFSET 48
    verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[0].m34_1 GL_TYPE GL_FLOAT_MAT3x4
    verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].m34_1 GL_OFFSET 80
    verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].m34_1 GL_ARRAY_STRIDE 0
    verify program_interface_query GL_BUFFER_VARIABLE SSBO1.m43_1 GL_IS_ROW_MAJOR 1
    verify program_interface_query GL_PROGRAM_INPUT piglit_vertex GL_TYPE GL_FLOAT_VEC4
    verify program_interface_query GL_PROGRAM_OUTPUT piglit_fragcolor GL_TYPE GL_FLOAT_VEC4

    I hope you find them useful for your tests!


    by Samuel Iglesias at September 25, 2015 08:56 AM

    September 23, 2015

    Xabier Rodríguez Calvar

    WebKit Contributors Meeting 2015 (late, I know)

    After writing my last post I realized that I needed to write a bit more about what I had been doing at the WebKit Contributors Meeting.

    The first thing to say is that it happened in March at the Apple campus in Cupertino, and I attended as part of the Igalia gang.

    My goal when I went there was to discuss with Youenn Fablet the Streams API we are implementing, and to see how we could bootstrap the reviews and begin to get the code reviewed and landed efficiently. Youenn and I also gave a presentation (mainly him) about it. At that moment we got some comments and help from Benjamin Poulain, and nowadays we are also working with Darin Adler and Geoffrey Garen, so the work is ongoing.

    WebRTC was also a hot topic and we talked a bit about how to deal with promises, as they seem to be involved in the WebRTC standard as well. My Igalian partner Philippe was missed in this regard, as he is involved in the development of WebRTC in WebKit, but he unfortunately couldn’t make it for personal reasons.

    I also had an interesting talk with Jer Noble and Eric Carlson about Media Source and Encrypted Media Extensions. I told them about the several downstream implementations that we are or were working on, especially the WebKit4Wayland one, which we expect to begin to upstream soon. They commented that they still have doubts about the abstractions they made for them, and of course I promised to get back to them when we begin the job. Actually, I already discussed some issues with Quique, another fellow Igalian.

    Among the other interesting discussions, I found the migration of the Mac port to CMake very necessary. Actually, I am now experiencing the “benefits” (read: pain) of using Xcode to add files, especially the generated ones, to the compilation. I hope that Alex succeeds with the task and that soon we will have a common build system for all main ports.

    by calvaris at September 23, 2015 04:36 PM

    Joanmarie Diggs

    New in Orca 3.18: Firefox Support Rewrite and MathML

    For years I had been threatening to nuke Orca’s support for Firefox (it’s the only way to be sure) and start over. This release cycle I finally made good on my threat. :)

    The funny thing is that what prompted me to do so was not Firefox; it was the need to implement an Orca-controlled caret for WebKitGtk content. Why an Orca-controlled caret for WebKitGtk content is needed — and why the resulting support is not yet being used for WebKitGtk content — is a rant topic for another day. But with that work largely done, and Orca’s support for Firefox in desperate need of a rewrite, the time was right.

    The improvements include:

    1. Addition of speech support for MathML. (Nemeth is a high priority for me, but just getting something working had to happen first.)
    2. Significant performance improvements: The lags Orca’s Firefox support had become (in)famous for should be gone — at least the ones caused by Orca, which were quite a few. I have also added some defensive handling for the accessible-event floods Orca is sometimes on the receiving end of (e.g. due to unusually large numbers of DOM changes).
    3. Support for Google Docs applications: Just enable Google Docs’ accessibility support and Orca’s sticky focus mode, and you should be good to go.
    4. Improved support for other rich text editors like Etherpad.
    5. Fixes for Orca getting stuck in or skipping over content in browse mode.
    6. Fixes for Orca getting tripped up by auto-reloading/refreshing content.
    7. Fixes for several bugs in presentation of elements in object mode. (By the way, yes, Orca can present one object per line “like JAWS does.”)
    8. Improved accuracy in Orca’s label inference support (because there are still authors who fail to use label elements and/or ARIA to make their form fields accessible).
    9. Fixes for several bugs related to using structural navigation, fast forward, and rewind in SayAll. (Note: These “experimental” SayAll features still need to be enabled via your Orca customizations because GNOME’s GUI and string freezes snuck up on me. But once you’ve enabled them, they should work more reliably than before.)
    10. Improved presentation of dynamically-added content, such as items added to the bottom of a search results page in real time.
    11. Did I mention MathML and lag elimination? :)
    12. Updated documentation (yes, finally!) to reflect the addition of browse mode, focus mode, sticky focus mode, layout mode, and the like.

    While I still have more improvements to make and bugs to fix, I’m quite pleased with how Orca 3.18.0 works with Firefox. I hope you will be too.

    To give Orca 3.18.0 a try, clone it from GNOME’s git repository or check with your distro. Questions? Please ask on the Orca mailing list.

    by jd at September 23, 2015 02:37 AM

    September 21, 2015

    Carlos García Campos

    WebKitGTK+ 2.10

    HTTP Disk Cache

    WebKitGTK+ already had an HTTP disk cache implementation, simply using SoupCache, but Apple introduced a new cross-platform implementation to WebKit (just a few bits needed a platform specific implementation), so we decided to switch to it. This new cache has a lot of advantages over the SoupCache approach:

    • It’s fully integrated in the WebKit loading process, sharing some logic with the memory cache too.
    • It’s more efficient in terms of speed (the cache is in the Network Process, but only the file descriptor is sent to the Web Process, which mmaps the file) and disk usage (resource body and headers are stored in separate files on disk, using hard links for the body so that different resources with exactly the same contents are only stored once).
    • It’s also more robust thanks to the lack of an index. The synchronization between the index and the actual contents has always been a headache in SoupCache, with many resources leaked on disk, resources cached twice, etc.

    The new disk cache is only used by the Network Process, so in case of using the shared secondary process model the SoupCache will still be used in the Web Process.

    New inspector UI

    The Web Inspector UI has been redesigned; you can see some of the differences in this screenshot:


    For more details, see this post in the Safari blog.


    IndexedDB support

    This was one of the few regressions we still had compared to WebKit1. When we switched to WebKit2 we lost IndexedDB support, but it’s now back in 2.10. It uses its own new process, the DatabaseProcess, to perform all database operations.


    Performance

    WebKitGTK+ 2.8 improved the overall performance thanks to the use of the bmalloc memory allocator. In 2.10 the overall performance has improved again, this time thanks to a new implementation of the locking primitives. All uses of mutex/condition have been replaced by a new implementation. You can see more details in the email Filip sent to webkit-dev or in the very detailed commit messages.

    Screen Saver inhibitor

    It’s more and more common to use the web browser to watch large videos in fullscreen mode, and quite annoying when the screen saver decides to “save” your screen every x minutes during the whole video. WebKitGTK+ 2.10 uses the ScreenSaver DBus service to inhibit the screen saver while a video is playing in fullscreen mode.

    Font matching for strong aliases

    WebKit’s font matching algorithm has improved, and now allows replacing fonts with metric-compatible equivalents. For example, sites that specify Arial will now get Liberation Sans, rather than your system’s default sans font (usually DejaVu). This makes text appear better on many pages, since some fonts require more space than others. The new algorithm is based on code from Skia that we expect will be used by Chrome in the future.

    Improve image quality when using newer versions of cairo/pixman

    The poor downscaling quality of cairo/pixman is a well known issue that was finally fixed in Cairo 1.14; however, we were not taking advantage of it in WebKit even when using a recent enough version of Cairo. The reason is that we were using the CAIRO_FILTER_BILINEAR filter, which was not affected by the Cairo changes. So we switched to CAIRO_FILTER_GOOD, which uses the BILINEAR filter in previous versions of Cairo (keeping backwards compatibility) and a box filter for downscaling in newer versions. This drastically improves the image quality of downscaled images, with a minimal impact on performance.

    New API

    Editor API

    The lack of editor capabilities from the API point of view was blocking the migration to WebKit2 for some applications like Evolution. In 2.10 we have started to add the required API to ensure not only that the migration is possible for any application using a WebView in editable mode, but also that it will be more convenient to use.

    So, for example, to monitor the state of the editor associated with a WebView, 2.10 provides a new class, WebKitEditorState, which for now allows monitoring the typing attributes. With WebKit1 you had to connect to the selection-changed signal and use the DOM bindings API to manually query the typing attributes. This is quite useful for updating the state of the editing buttons in an editor toolbar, for example: you just need to connect to WebKitEditorState::notify::typing-attributes and update the UI accordingly. For now typing attributes are the only thing you can monitor from the UI process API, but we will add more information when needed, like the current cursor position, for example.

    Having WebKitEditorState doesn’t mean we don’t need a selection-changed signal that we can monitor to query the DOM ourselves. But since in WebKit2 the DOM lives in the Web Process, the selection-changed signal has been added to the Web Extensions API. A new class, WebKitWebEditor, has been added to represent the web editor associated with a WebKitWebPage, and it can be obtained with webkit_web_page_get_editor(). This new class is the one providing the selection-changed signal, so you can connect to the signal and use the DOM API the same way it was done in WebKit1.

    Some of the editor commands require an argument; for example, the command to insert an image requires the image source URL. But both the WebKit1 and WebKit2 APIs only provided methods to run editor commands without any argument. This means that, once again, to implement something like insert-image or insert-link, you had to use the DOM bindings to create and insert the new elements in the correct place. WebKitGTK+ 2.10 provides webkit_web_view_execute_editing_command_with_argument() to make this a lot more convenient.

    You can test all these features using the new editor mode of MiniBrowser; simply run it with the -e command line option and no arguments.


    Website data

    When browsing the web, websites are allowed to store data on the client side. It could be a cache, like the HTTP disk cache, or data required by web features like offline applications, local storage, IndexedDB, WebSQL, etc. All that data is currently stored in different directories, and not all of them could previously be configured by the user. The new WebKitWebsiteDataManager class in 2.10 allows you to configure all those directories, either using a common base cache/data directory or providing a specific directory for every kind of data stored. It’s not mandatory to use it, though; the default values are compatible with the ones previously used.

    This gives the user more control over the browsing data stored on the client side, but in future versions we plan to add support for actually handling the data, so that you will be able to query and delete the data stored by a particular security domain.

    Web Processes limit

    WebKitGTK+ currently supports two process models: the shared single secondary process and multiple secondary processes. When using the latter, a new web process is created for every new web view. When a lot of web views are created at the same time, the resources required to spawn all those processes could be too much on some systems. To improve that a bit, 2.10 adds webkit_web_context_set_web_process_count_limit(), to set the maximum number of web processes that can be created at the same time.

    This new API can also be used to implement a slightly different version of the shared single process model. By using the multiple secondary process model with a limit of 1 web process, you still have a single shared web process, but using the multi-process mechanism, which means networking will happen in the Network Process, among other things. So, if you use the shared secondary process model in your application, unless your application only loads local resources, we recommend switching to the multiple process model and using the limit, to benefit from all the Network Process features, like the new disk cache, for example. Epiphany already does this for the secondary process model and web apps.

    Missing media plugins installation permission request

    When you try to play media, and the media backend doesn’t find the plugins/codecs required to play it, the missing plugin installation mechanism starts the package installer to allow the user to find and install the required plugins/codecs. This used to happen in the Web Process and without any way for the user to avoid it. WebKitGTK+ 2.10 provides a new WebKitPermissionRequest implementation that allows the user to block the request and prevent the installer from being invoked.

    by carlos garcia campos at September 21, 2015 11:52 AM

    September 18, 2015

    Alejandro Piñeiro

    Optimizing shader assembly instruction on Mesa using shader-db (II)

    In my previous post I mentioned that I have been working on optimizing the shader instruction count for specific shaders, guided by shader-db, and showed one specific example. In this post I will show another one, slightly more complex in its triaging and solution.

    Some of the shaders with a worse instruction count can be found at shader-db/shaders/dolphin. Again I analyzed one of them in order to get the simplest possible shader with the same issue:

       #version 130
       in vec2 myData;
       void main()
       {
          gl_Position = vec4(myData, 3.0, 4.0);
       }

    Some comments:

    • It also happens with uniforms (so you can replace “in vec2” with “uniform vec2”)
    • It doesn’t happen if you use the input directly; you need this kind of “input and const” combination.

    So, as in my previous post, I executed the compilation using the INTEL_DEBUG=optimizer option. In the case of IR I got the following files:

    • VS-0001-00-start
    • VS-0001-01-01-opt_reduce_swizzle
    • VS-0001-01-04-opt_copy_propagation
    • VS-0001-01-07-opt_register_coalesce
    • VS-0001-02-02-dead_code_eliminate
    • VS-0001-02-07-opt_register_coalesce

    This being the desired outcome (i.e. the content of VS-0001-02-02-dead_code_eliminate):

    0: mov m3.z:F, 3.000000F
    1: mov m3.w:F, 4.000000F
    2: mov m3.xy:F, attr17.xyyy:F
    3: mov m2:D, 0D
    4: vs_urb_write (null):UD

    Unsurprisingly it is mostly movs. Unlike the shader I mentioned in my previous post, where in both cases the same optimizations were applied, in this case the NIR path doesn’t apply the last optimization (the second register coalesce). So this time I will focus on the starting point and the state just after the dead code eliminate pass.

    So on IR, the starting point (VS-0001-00-start) is:

    0: mov vgrf2.0.x:F, 3.000000F
    1: mov vgrf2.0.y:F, 4.000000F
    2: mov vgrf1.0.zw:F, vgrf2.xxxy:F
    3: mov vgrf1.0.xy:F, attr17.xyxx:F
    4: mov vgrf0.0:F, vgrf1.xyzw:F
    5: mov m2:D, 0D
    6: mov m3:F, vgrf0.xyzw:F
    7: vs_urb_write (null):UD

    and the state after the dead code eliminate is the following one:

    0: mov vgrf1.0.z:F, 3.000000F
    1: mov vgrf1.0.w:F, 4.000000F
    2: mov vgrf1.0.xy:F, attr17.xyyy:F
    3: mov m2:D, 0D
    4: mov m3:F, vgrf1.xyzw:F
    5: vs_urb_write (null):UD

    On NIR, the starting point is:

    0: mov vgrf2.0.x:F, 3.000000F
    1: mov vgrf2.0.y:F, 4.000000F
    2: mov vgrf0.0.xy:F, attr17.xyyy:F
    3: mov vgrf1.0.xy:D, vgrf0.xyzw:D
    4: mov vgrf1.0.zw:D, vgrf2.xxxy:D
    5: mov m2:D, 0D
    6: mov m3:F, vgrf1.xyzw:F
    7: vs_urb_write (null):UD

    and the state after the dead code eliminate is the following one:

    0: mov vgrf2.0.x:F, 3.000000F
    1: mov vgrf2.0.y:F, 4.000000F
    2: mov m3.xy:D, attr17.xyyy:D
    3: mov m3.zw:D, vgrf2.xxxy:D
    4: mov m2:D, 0D
    5: vs_urb_write (null):UD

    The first difference we can see is that although the instructions are basically the same at the starting point, the order is not the same. In fact, if we check the different intermediate steps (I will not show them here to avoid making the post too long), although the optimizations are the same, how and what gets optimized is somewhat different. One could conclude that the problem is this order, but if we take a look at the final step of the NIR assembly shader, there isn’t anything clearly indicating that the shader can’t be simplified. Specifically, instruction #3 could go away if instructions #0 and #1 wrote directly to m3 instead of vgrf2, which is what the IR path does. So it seems that the problem is in the register coalesce optimization.

    As I mentioned, there is a slight order difference between NIR and IR. That means that in the NIR case, between instruction #3 and #0/#1 there is another instruction, which is in a different place in IR. So my first thought was that the optimization was only checking against the immediately previous instruction. Once I started to look at the code it showed that I was wrong: for each instruction, there is a loop checking all the previous instructions. What I noticed is that in that loop, all the checks that rejected a previous instruction were breaks. So I initially thought that perhaps one of those breaks should in fact be a continue. This seemed to be confirmed when I did the quick hack of replacing them all with continues. It was proved wrong as soon as I saw all the piglit regressions I had in hand. So after that I did the proper thing and debugged it properly. Using gdb, the condition that was stopping the optimization from checking previous instructions was the following one:

    /* If somebody else writes our destination here, we can't coalesce
     * before that.
     */
    if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written))
       break;

    Probably the code is hard to understand out of context, but the comment is clear. We can coalesce two instructions when the previous one writes to a register we are reading in the current instruction. But obviously, that can’t be done if there is an instruction in the middle that writes to the same register. And that is exactly what is happening here: if you look at the final state of the NIR path, we want to coalesce instruction #3 with instructions #1 and #0, but instruction #2 is writing to m3 too.

    So, is that it? Not exactly. Remember that IR was able to simplify this, and it can’t be only because the order was different. If you take a deeper look at those instructions, there are some x, y, z, w suffixes after the register names. Those indicate which channels the instructions write to. As I mentioned in my previous post, this work is about providing a NIR to vec4 pass, so those registers are vectors. Instruction #3 can be read as “move the content of components x and y of register vgrf2 to components z and w of register m3”. And instruction #2 can be read as “move the content of components x and y of register attr17 to components x and y of register m3”. So although we are writing to the same destination, we are writing to different components, meaning that it would be safe to do the coalescing. We just need to be sure that there isn’t any component overlap between the current instruction and the previous one we are checking against. Fortunately the registers already record which channels they write to, in a variable called “writemask”. So we only need to change that code to the following:

    /* If somebody else writes the same channels of our destination here,
     * we can't coalesce before that.
     */
    if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written) &&
        (inst->dst.writemask & scan_inst->dst.writemask) != 0) {
       break;
    }

    The patch with this change was sent to the mesa list (here), approved and pushed to master.

    Final words

    So again, this was a problem where writing the fix was easier than finding it. But in any case, it showed a significant improvement. Using the shader-db tool to compare before and after the patch:

    total instructions in shared programs: 1781593 -> 1734957 (-2.62%)
    instructions in affected programs:     1238390 -> 1191754 (-3.77%)
    helped:                                12782
    HURT:                                  0

    by infapi00 at September 18, 2015 08:40 AM

    September 14, 2015

    Alejandro Piñeiro

    Optimizing shader assembly instruction on Mesa using shader-db

    Lately I have been working on Mesa. Specifically, I have been working with my fellow Igalians Eduardo Lima and Antía Puentes to provide a NIR to vec4 pass for the i965 backend. I will not go too much into the details, but in summary, NIR is a new intermediate representation for Mesa, intermediate in the sense that it sits between the OpenGL GLSL language used for shaders and the final GPU machine instructions for each specific Mesa backend. NIR is intended to replace the previous GLSL IR, and in some places it already has. If you are interested in the details, take a look at the NIR announcement and the NIR documentation page.

    Although the bug is still open, Mesa master already has the functionality for this pass, and in fact it is now the default. This new NIR pass provides the same functionality as the one available with the old GLSL IR pass (from now on, just IR). This was properly tested with piglit. But although the total instruction count has in general improved, we get a worse instruction count when compiling some specific known shaders with NIR. So the next step is to improve this. It is an ongoing effort, like these patches from Jason Ekstrand, but I would like to share some of my experience so far.

    In order to guide this work, we have been using shader-db. shader-db is a shader database, with an executable to compile those shaders and a tool to compare two executions of that compilation. Usually it is used to verify that the optimization you are implementing really improves the instruction count, or even to justify your change; several Mesa commits include before and after shader-db statistics. But in this case, we have been using it as a guide for what we could improve. We compile all the shaders using IR and using NIR (via the environment variable INTEL_USE_NIR), and check in which shaders there is an instruction count regression.
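
    A typical comparison session looks something like this (a sketch; the shader paths are illustrative, and report.py is the comparison tool shipped with shader-db):

    $ ./run shaders > before.txt                # compile everything with the IR path
    $ INTEL_USE_NIR=1 ./run shaders > after.txt # same shaders, now with the NIR path
    $ ./report.py before.txt after.txt          # summarize helped/hurt programs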

    Case 1: subtraction needs an extra mov.

    Ok, so one of the shaders with a worse instruction count is humus-celshading/4.shader_test. After some analysis of the problem, I got a simpler shader with the same issue:

    in vec4 inData;
    void main() {
        gl_Position = gl_Vertex - inData;
    }

    This simple shader needs one extra instruction using NIR. So yes, a simple subtraction is getting worse. FWIW, this is the desired final shader assembly:

    0: add m3:F, attr0.xyzw:F, -attr18.xyzw:F
    1: mov m2:D, 0D
    2: vs_urb_write (null):UD

    Note that there isn’t a dedicated subtraction instruction in the assembly; it is represented by negating the second parameter and using an add (this seems like Captain Obvious information here, but it will be relevant later).

    So at this point one option would be to start looking at the backend (remember, i965) code for vec4, specifically the optimizations, and check if we see something. Those optimizations are called from brw_vec4.cpp. They are in general common to any compiler: dead code elimination, copy propagation, register coalescing, etc. And usually they are executed several times in several passes, and some of them are simplifications to be used by other optimizations (for example, if your copy propagation pass works, then it is common that your dead code elimination pass will remove an instruction). So with all those optimizations and passes, how do you find the problem? Although it is a good idea to read the code of those optimizations to know how they work, it is usually not enough to know where the problem is. So this is again a debugging problem, and as usual, you want to know what is happening step by step.

    For this I executed the compilation again, with the following environment variables:

    INTEL_USE_NIR=1 INTEL_DEBUG=optimizer ./run subtraction.shader_test

    This option prints out the shader assembly at each optimization pass (when applied) to a file. So for example, I get the following files for both cases:

    • VS-0001-00-start
    • VS-0001-01-04-opt_copy_propagation
    • VS-0001-01-07-opt_register_coalesce
    • VS-0001-02-02-dead_code_eliminate

    So, in order to get the final shader assembly, a copy propagation, a register coalesce and a dead code eliminate pass were executed. BTW, I found that environment variable while looking at the code. It is not listed on the Mesa envvar page, something I assume is a bug.

    So I started to look at the differences between the steps. Taking into account that in both cases the same optimizations were executed, and in the same order, I started looking for differences between one and the other at each step. And I found one difference in the copy propagation.

    So let’s see the starting point using IR:

    0: mov vgrf2.0:F, -attr18.xyzw:F
    1: add vgrf0.0:F, attr0.xyzw:F, vgrf2.xyzw:F
    2: mov m2:D, 0D
    3: mov m3:F, vgrf0.xyzw:F
    4: vs_urb_write (null):UD

    And the outcome of the copy propagation:

    0: mov vgrf2.0:F, -attr18.xyzw:F
    1: add vgrf0.0:F, attr0.xyzw:F, -attr18.xyzw:F
    2: mov m2:D, 0D
    3: mov m3:F, vgrf0.xyzw:F
    4: vs_urb_write (null):UD

    And the starting point using NIR:

    0: mov vgrf0.0:UD, attr0.xyzw:UD
    1: mov vgrf1.0:UD, attr18.xyzw:UD
    2: add vgrf2.0:F, vgrf0.xyzw:F, -vgrf1.xyzw:F
    3: mov m2:D, 0D
    4: mov m3:F, vgrf2.xyzw:F
    5: vs_urb_write (null):UD

    And the outcome of the copy propagation:

    0: mov vgrf0.0:UD, attr0.xyzw:UD
    1: mov vgrf1.0:UD, attr18.xyzw:UD
    2: add vgrf2.0:F, attr0.xyzw:F, -vgrf1.xyzw:F
    3: mov m2:D, 0D
    4: mov m3:F, vgrf2.xyzw:F
    5: vs_urb_write (null):UD

    Although it is true that the starting point for NIR already has one extra instruction compared with IR, that extra one gets optimized in the following steps. What caught my attention was the difference between what happens with instruction #1 in the IR case and the equivalent instruction #2 in the NIR case (the add). In the IR case, copy propagation is able to propagate attr18 from the previous instruction, so it is easy to see that this can be simplified in the following optimization steps. But that doesn’t happen in the NIR case: in NIR, instruction #2 after the copy propagation remains the same.

    So I started to take a look at the implementation of the copy propagation optimization code (here). Without going into details, this pass analyses each instruction, comparing it with the previous ones in order to know if it can do a copy propagation. So I looked into why, for that specific instruction, the pass concludes that it can’t be done. At this point you could use gdb, but I used some extra printfs (sometimes they are useful too). And I found the check that rejected that instruction:

    bool has_source_modifiers = value.negate || value.abs;
    if (has_source_modifiers && value.type != inst->src[arg].type)
        return false;

    That means that if the source of the previous instruction you are checking against is negated (or has an abs), and the types are different, you can’t do the propagation. This makes sense, because negation is different for different types. If we go back to the shader assembly output, we find that it is true that the types (those F, D and UD just after the registers) are different between the IR and the NIR case. Why didn’t we worry before? Why wasn’t this failing on any piglit test? Well, if you look more carefully, the instructions that have different types are the movs. In both cases, the type is correct in the add instruction. And in a mov, the type is somewhat irrelevant: you are just moving raw data from one place to the other. It matters for ALU operations. But in any case, it is true that the type is wrong on those registers (compared with the original GLSL code), and as we are seeing, it is causing problems in the optimization passes. So, next step: check where those types are filled in.

    Searching a little in the code, and using gdb this time, this is done in the function nir_setup_uniforms at brw_vec4_nir.cpp, while creating a source register variable. But curiously it is using the type that came from NIR:

    src_reg src = src_reg(ATTR, var->data.location + i, var->type);

    and printing out the content of var->type with gdb properly shows the type used in the GLSL code. If we go deeper into the src_reg constructor:

    src_reg::src_reg(register_file file, int reg, const glsl_type *type)
    {
        this->file = file;
        this->reg = reg;
        if (type && (type->is_scalar() || type->is_vector() || type->is_matrix()))
            this->swizzle = brw_swizzle_for_size(type->vector_elements);
        else
            this->swizzle = BRW_SWIZZLE_XYZW;
    }

    we see that the type is only used to fill the swizzle. But if we compare this with the equivalent code for a destination register:

    dst_reg::dst_reg(register_file file, int reg, const glsl_type *type,
                     unsigned writemask)
    {
        this->file = file;
        this->reg = reg;
        this->type = brw_type_for_base_type(type);
        this->writemask = writemask;
    }

    we see that dst_reg is also filling in the type internally, something src_reg is not doing. So at this point the patch is straightforward: just fill in src_reg->type using the constructor’s type parameter. The patch was approved and is already on master.

    Coming up next

    In the end, I didn’t need to improve any of the optimization passes at all, as the bug was elsewhere, but the debugging steps for this kind of work are still the same. In fact it was the usual bug that is harder to find (for simplicity I summarized the triaging) than to solve. In the next blog post, I will explain how I worked on another instruction count regression, somewhat more complex, that did need a change in one of the optimization passes.


    by infapi00 at September 14, 2015 11:58 AM

    September 11, 2015

    Jacobo Aragunde

    An update on LibreOffice for Android

    Summer time is for hacking! I booked some time in the last weeks to carry on with the work on the Android port we started earlier this year at Igalia, merging the pending bits in master, completing features and adding some polish. Now that everything is upstream, let me share it with you!

    Merge ownCloud document provider

    I had used the ownCloud Android library to speed up the implementation of the ownCloud document provider. Unfortunately, it included some libraries as dependencies in binary form, and that was not acceptable for integrating the library into LibreOffice.

    My solution was to create a GitHub fork of the library and replace the binary dependencies with the corresponding sources. Now a tarball containing the sources in that repository is downloaded and compiled by the LibreOffice build system when building for the Android target.

    Implement save to cloud

    While I was working on the Document Providers infrastructure, Miklos and Tomaž were implementing document editing in the LibreOffice port. Now that there is some basic support to edit and save documents, I had to enable the save operation in the Document Providers framework, and in particular in the ownCloud test implementation.

    The ownCloud save operation itself was easy, but knowing whether the local save operation had finished or not proved to be a bit more challenging. The LibreOfficeKit API does not yet provide a way to set a callback on the save operation, so we can’t check its status and do something after it has finished. As a workaround, I periodically check the modification date field of the file to know if LibreOffice core is already done with it.

    This is meant to be a temporary solution, but at least we can provide some feedback on the save operation now.

    Feedback on save

    A hamburger in my LibreOffice

    Without any indication, the drawer containing the list of document providers was completely hidden from users; only a handful of people I showed the app to, besides myself, were aware of it. For that reason, I enabled the application home button, the one on the left of the action bar, to become a drawer indicator, the famous “hamburger” icon found in many Android apps.

    It took me some time to figure out the way to put everything together so the button changes its icon and behavior according to the context. When the user is on the root folder, the “hamburger” will indicate the presence of a drawer menu with the additional locations implemented by the available document providers. When browsing some directory, it will become a back arrow to browse the parent directory. Finally, when the drawer is open it will also become an arrow to close the drawer.

    Drawer indicator

    Smart behavior for system back key

    Tapping the system back key when browsing some directory and seeing the app close was a bit frustrating, because most users expect it to go back to the previous location, the parent directory. I’ve overridden the default behavior to make it smarter and context-aware. When browsing some directory, it will open the parent directory; when the drawer is open, it will close it; and when browsing the root directory, it will ask for an additional tap as confirmation to exit the application.

    Exit confirmation

    All these patches are in master branch and will be part of the next major release. Impatient people will also find them in the daily builds from tomorrow onward.

    by Jacobo Aragunde Pérez at September 11, 2015 06:44 PM

    August 15, 2015

    Adrián Pérez

    Locked out of SSH? Renew all the keys!

    Long time, no post! Well, here I am with something completely unrelated to programming. Instead, this is something that has been bugging me for a while, but that I had been postponing because on the surface it looked a bit like a deep rabbit hole.

    The Issue

    For a while, every time I tried to connect to certain servers via SSH (or Mosh, make sure to check it out!), I was getting something like this:

    ~ % ssh
    Host key fingerprint is ~~REDACTED~~
    Permission denied (publickey).
    ~ %

    Having plenty of things pending to do, plus laziness, made me ignore this for a while, because most of the machines I connect to via SSH have password-based authentication as a fallback. But today I had to use a machine that is configured to accept only key-based authentication, so I had to bite the bullet. As usual, I re-tried connecting with the very same command, adding -v to get some debugging output. Notice this message:

    debug1: Skipping ssh-dss key /home/aperez/.ssh/id_dsa for not in PubkeyAcceptedKeyTypes

    Could this be the root cause? Searching the Internet did not yield anything interesting — nobody checks result pages after the second one, ever. Then I remembered that after the infamous Heartbleed bug, several initiatives with the goal of making secure stuff more secure —and hopefully bug-free— were started. While LibreSSL gets the honorable mention of having the cutest logo, other teams have not been behind in “de-crufting” their code. I started to fear that the latest release of OpenSSH might no longer have support for one of:

    • The host key used by the server.
    • The host key type used by my laptop.
    • The identity key type.

    It was time to learn about what changed in the recent past.

    Exhibit “A”

    Looking at the OpenSSH 7.0 release notes, there is the culprit:

     * Support for ssh-dss, ssh-dss-cert-* host and user keys is disabled
       by default at run-time. These may be re-enabled using the
       instructions at

    So it turns out that support for ssh-dss keys like the one I was trying to use is still available, but disabled by default. The mentioned URL contains information on how to use legacy features, and in this case the support for ssh-dss keys can be re-enabled using the PubkeyAcceptedKeyTypes option, either temporarily for a single ssh (or scp) invocation:

    ~ % ssh -oPubkeyAcceptedKeyTypes=+ssh-dss

    or permanently, by adding a snippet to the ~/.ssh/config file:

         PubkeyAcceptedKeyTypes +ssh-dss

    The Solution

    I suspect it won't be long before support for DSA keys is disabled at compile time, and for good reasons, so this seemed as good a moment as any to make a new SSH key, and propagate it to the hosts where I was using the old one.


    The question is: which type of key should I generate as of 2015? The current default of ssh-keygen is to use RSA keys of 2048 bits, which seems reasonable given that it is technically possible to break 1024 bit keys. Applying a bit of paranoia, I decided to use a 4096 bit key instead, to make it future-proof as well.

    But hey, we live in dangerous days, and one may want to stay away from RSA keys altogether. The ECDSA alternative uses the standard NIST curves, which are not that good, and because we know for a fact that the NSA has been tampering around with them, we may want to take a different approach and go for an Ed25519 key instead. Ed25519 keys, apart from having Daniel J. Bernstein on the design team, are not vulnerable to poor random number generation. There is only one catch: support for Ed25519 is kind of new (it was introduced in OpenSSH 6.5), so if you need to connect to machines running older versions you may still want to create a 4096 bit RSA key for them, ensuring that the Ed25519 key is used by default and the RSA one only when needed. More on that later; for now let's create the new keys with:

    ~ % ssh-keygen -t ed25519


    ~ % ssh-keygen -t rsa -b 4096
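
    As a quick sanity check, you can print the fingerprint of a freshly created key; for example:

    ~ % ssh-keygen -lf ~/.ssh/id_ed25519.pub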

    Then, append the new public key to the relevant machines, changing id_ed25519 to id_rsa for the ones which do not support Ed25519; note how the option to enable ssh-dss keys is used to temporarily allow the old key, in order to append the new one (which somehow did not work with ssh-copy-id):

    % ssh -oPubkeyAcceptedKeyTypes=+ssh-dss \
        'cat >> ~/.ssh/authorized_keys' < ~/.ssh/

    Last but not least, let's make sure that the RSA key is only ever used when needed, with a couple of tweaks in the ~/.ssh/config file (the host name below is a placeholder for the machines that still need the RSA key):

    PubkeyAcceptedKeyTypes ssh-ed25519,ssh-ed25519-cert-v01@openssh.com

    Host legacy-host
        PubkeyAcceptedKeyTypes +ssh-rsa

    Note that it is not possible to remove one item from the PubkeyAcceptedKeyTypes list with -ssh-rsa: we need to specify the complete list of key types that OpenSSH will be allowed to use. For reference, the ssh_config manual page lists the default value, and ssh -Q key lists the key types built into a particular OpenSSH installation. In my case, I have decided to use my new Ed25519 key as the primary one, allowing only this key type by default, and using the RSA key for selected hosts.
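
    For reference, a quick sketch of listing what a given installation supports (output trimmed; it varies with the OpenSSH version):

    ~ % ssh -Q key
    ssh-ed25519
    ssh-rsa
    ssh-dss
    ...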

    While we are at it, it may be interesting to switch from the OpenSSH server to TinySSH on the servers under our control. Despite being small, it supports Ed25519, and it uses NaCl (DJB's crypto library) instead of OpenSSL. Also, it is possibly the simplest service one can set up over CurveCP at the moment. But this post is already long enough, so let's move on to the finale.


    My main desktop environment is GNOME, which is very nice and by default includes a convenient SSH agent as part of its Keyring daemon. But every now and then I just use the OpenBox window manager, or even a plain Linux virtual terminal with a tmux session in it, where there are no GNOME components running. And I do miss its SSH agent. After looking around a bit, I have installed Envoy, which, instead of implementing its own agent, ensures that one (and only one) instance of the OpenSSH ssh-agent runs for each user, across all logged-in sessions, and without needing changes to your shell startup files when using the included PAM module.

    Update (2015-08-16): Another reason to use Envoy, as pointed out by Óscar Amor, is that the GNOME Keyring daemon can't handle Ed25519 keys.

    Happy SSH'ing!

    by aperez ( at August 15, 2015 11:52 PM

    August 14, 2015

    Alberto Garcia

    I/O limits for disk groups in QEMU 2.4

    QEMU 2.4.0 has just been released, and among many other things it comes with some of the stuff I have been working on lately. In this blog post I am going to talk about disk I/O limits and the new feature to group several disks together.

    Disk I/O limits

    Disk I/O limits allow us to control the amount of I/O that a guest can perform. This is useful for example if we have several VMs in the same host and we want to reduce the impact they have on each other if the disk usage is very high.

    The I/O limits can be set using the QMP command block_set_io_throttle, or on the command line using the throttling.* options of the -drive parameter (shown in brackets in the lists below). Both the throughput and the number of I/O operations can be limited. For more fine-grained control, the limits of each one of them can be set on read operations, write operations, or the combination of both:

    • bps (throttling.bps-total): Total throughput limit (in bytes/second).
    • bps_rd (throttling.bps-read): Read throughput limit.
    • bps_wr (throttling.bps-write): Write throughput limit.
    • iops (throttling.iops-total): Total I/O operations per second.
    • iops_rd (throttling.iops-read): Read I/O operations per second.
    • iops_wr (throttling.iops-write): Write I/O operations per second.


    For example, to limit write throughput to 50 MB/s and total operations to 6000 IOPS:

    -drive if=virtio,file=hd1.qcow2,throttling.bps-write=52428800,throttling.iops-total=6000
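
    The same limits can also be changed on a running guest via QMP; here is a sketch using the qmp-shell helper script from the QEMU source tree (the device name is illustrative, and this command requires all six bps/iops arguments):

    (QEMU) block_set_io_throttle device=virtio0 bps=0 bps_rd=0 bps_wr=52428800 iops=6000 iops_rd=0 iops_wr=0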

    In addition to that, it is also possible to configure the maximum burst size, which defines a pool of I/O that the guest can perform without being limited:

    • bps_max (throttling.bps-total-max): Total maximum (in bytes).
    • bps_rd_max (throttling.bps-read-max): Read maximum.
    • bps_wr_max (throttling.bps-write-max): Write maximum.
    • iops_max (throttling.iops-total-max): Total maximum of I/O operations.
    • iops_rd_max (throttling.iops-read-max): Read I/O operations.
    • iops_wr_max (throttling.iops-write-max): Write I/O operations.

    One additional parameter named iops_size allows us to deal with the case where big I/O operations can be used to bypass the limits we have set. In this case, if a particular I/O operation is bigger than iops_size then it is counted several times when it comes to calculating the I/O limits. So a 128KB request will be counted as 4 requests if iops_size is 32KB.

    • iops_size (throttling.iops-size): Size of an I/O request (in bytes).
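
    For example, to combine an IOPS limit with 32KB accounting units (again just an illustrative sketch reusing the option names above):

    -drive if=virtio,file=hd1.qcow2,throttling.iops-total=6000,throttling.iops-size=32768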

    Group throttling

    All of these parameters I’ve just described operate on individual disk drives and have been available for a while. Since QEMU 2.4 however, it is also possible to have several drives share the same limits. This is configured using the new group parameter.

    The way it works is that each disk with I/O limits is a member of a throttle group, and the limits apply to the combined I/O of all group members using a round-robin algorithm. The way to put several disks together is just to use the group parameter with all of them using the same group name. Once the group is set, there’s no need to pass the parameter to block_set_io_throttle anymore unless we want to move the drive to a different group. Since the I/O limits apply to all group members, it is enough to use block_set_io_throttle on just one of them.

    Here’s an example of how to set groups using the command line:

    -drive if=virtio,file=hd1.qcow2,throttling.iops-total=6000,throttling.group=foo
    -drive if=virtio,file=hd2.qcow2,throttling.iops-total=6000,throttling.group=foo
    -drive if=virtio,file=hd3.qcow2,throttling.iops-total=3000,throttling.group=bar
    -drive if=virtio,file=hd4.qcow2,throttling.iops-total=6000,throttling.group=foo
    -drive if=virtio,file=hd5.qcow2,throttling.iops-total=3000,throttling.group=bar

    In this example, hd1, hd2 and hd4 are all members of a group named foo with a combined IOPS limit of 6000, and hd3 and hd5 are members of bar. hd6 is left alone (technically it is part of a 1-member group).
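
    The same can be done at runtime through QMP. As an illustrative sketch only (it assumes the qmp-shell helper from the QEMU source tree and a drive named virtio0; block_set_io_throttle takes the six basic bps/iops arguments, so the unused ones are set to 0 here), something along these lines would put a drive into group foo with a 6000 IOPS limit:

    (QEMU) block_set_io_throttle device=virtio0 bps=0 bps_rd=0 bps_wr=0 iops=6000 iops_rd=0 iops_wr=0 group=foo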

    Next steps

    I am currently working on providing more I/O statistics for disk drives, including latencies and average queue depth on a user-defined interval. The code is almost ready. Next week I will be in Seattle for the KVM Forum where I will hopefully be able to finish the remaining bits.

    I will also attend LinuxCon North America. Igalia is sponsoring the event and we have a booth there. Come if you want to talk to us or see our latest demos with WebKit for Wayland.

    See you in Seattle!

    by berto at August 14, 2015 10:22 AM

    August 04, 2015

    Andy Wingo

    developing v8 with guix

    a guided descent into hell

    It all started off so simply. My primary development machine is a desktop computer that I never turn off. I suspend it when I leave work, and then resume it when I come back. It's always where I left it, as it should be.

    I rarely update this machine because it works well enough for me, and anyway my focus isn't the machine, it's the things I do on it. Mostly I work on V8. The setup is so boring that I certainly didn't imagine myself writing an article about it today, but circumstances have forced my hand.

    This machine runs Debian. It used to run the testing distribution, but somehow in the past I needed something that wasn't in testing so it runs unstable. I've been using Debian for some 16 years now, though not continuously, so although running unstable can be risky, usually it isn't, and I've unborked it enough times that I felt pretty comfortable.

    Perhaps you see where this is going!

    I went to install something, I can't even remember what it was now, and the downloads failed because I hadn't updated in a while. So I update, install the thing, and all is well. Except my instant messaging isn't working any more because there are a few moving parts (empathy / telepathy / mission control / gabble / dbus / whatwhat), and the install must have pulled in something that broke one of them. No biggie, this happens. Might as well go ahead and update the rest of the system while I'm at it and get a reboot to make sure I'm not running old software.

    Most Debian users know that you probably shouldn't do a dist-upgrade from an old system -- you upgrade and then you dist-upgrade. Or perhaps this isn't even true, it's tribal lore to avoid getting eaten by the wild beasts of bork that roam around the village walls at night. Anyway that's what I did -- an upgrade, let it chunk for a while, then a dist-upgrade, check the list to make sure it didn't decide to remove one of my kidneys to satisfy the priorities of the bearded demon that lives inside apt-get, OK, let it go, all is well, reboot. Swell.

    Or not! The computer restarts to a blank screen. Ha ha ha you have been bitten by a bork-beast! Switch to a terminal and try to see what's going on with GDM. It's gone! Ha ha ha! Your organs are being masticated as we speak! How does that feel! Try to figure out which package is causing it, happily with another computer that actually works. Surely this will be fixed in some update coming soon. Oh it's something that's going to take a few weeks!!!! Ninth level, end of the line, all passengers off!

    my gods

    I know how we got here, I love Debian, but it is just unacceptable and revolting that software development in 2015 is exposed to an upgrade process which (1) can break your system (2) by default and (3) can't be rolled back. The last one is the killer: who would design software this way? If you make a system like this in 2015 I'd say you're committing malpractice.

    Well yesterday I resolved that this would be the last time this happens to me. Of course I could just develop in a virtual machine, and save and restore around upgrades, but that's kinda trash. Or I could use btrfs and be able to rewind changes to the file system, but then it would rewind everything, not just the system state.

    Fortunately there is a better option in the form of functional package managers, like Nix and Guix. Instead of upgrading your system by mutating /usr, Nix and Guix store all files in a content-addressed store (/nix/store and /gnu/store, respectively). A user accesses the store via a "profile", which is a forest of symlinks into the store.

    For example, on my machine with a NixOS system installation, I have:

    $ which ls
    /run/current-system/sw/bin/ls
    $ ls -l /run/current-system/sw/bin/ls
    lrwxrwxrwx 1 root nixbld 65 Jan  1  1970
      /run/current-system/sw/bin/ls ->
      /nix/store/wc472nw0kyw0iwgl6352ii5czxd97js2-coreutils-8.23/bin/ls
    $ ldd /nix/store/wc472nw0kyw0iwgl6352ii5czxd97js2-coreutils-8.23/bin/ls
      linux-vdso.so.1 (0x00007fff5d3c4000)
      libacl.so.1 => /nix/store/c2p56z920h4mxw12pjw053sqfhhh0l0y-acl-2.2.52/lib/libacl.so.1 (0x00007fce99d5d000)
      libc.so.6 => /nix/store/la5imi1602jxhpds9675n2n2d0683lbq-glibc-2.20/lib/libc.so.6 (0x00007fce999c0000)
      libattr.so.1 => /nix/store/jd3gggw5bs3a6sbjnwhjapcqr8g78f5c-attr-2.4.47/lib/libattr.so.1 (0x00007fce997bc000)
      /nix/store/la5imi1602jxhpds9675n2n2d0683lbq-glibc-2.20/lib/ld-linux-x86-64.so.2 (0x00007fce99f65000)

    Content-addressed linkage means that files in the store are never mutated: they will never be overwritten by a software upgrade. Never. Never will I again gaze in horror at the frozen beardcicles of a Debian system in the throes of "oops I just deleted all your programs, like that time a few months ago, wasn't that cool, it's really cold down here, how do you like my frozen facial tresses and also the horns".

    At the same time, I don't have to give up upgrades. Paradoxically, immutable software facilitates change and gives me the freedom to upgrade my system without anxiety and lost work.

    nix and guix

    So, there's Nix and there's Guix. Both are great. I'll get to comparing them, but first a digression on the ways they can be installed.

    Both Nix and Guix can be installed either as the operating system of your computer, or just as a user-space package manager. I would actually recommend that people start with the latter way of working, and move on to the OS once they feel comfortable. The fundamental observation here is that because /nix/store doesn't depend on or conflict with /usr, you can run Nix or Guix as a user on a (e.g.) Debian system with no problems. You can have a forest of symlinks in ~/.guix-profile/bin that links to nifty things you've installed in the store and that's cool, you don't have to tell Debian.

    and now look at me

    In my case I wanted to also have the system managed by Nix or Guix. GuixSD, the name of the Guix OS install, isn't appropriate for me yet because it doesn't do GNOME. I am used to GNOME and don't care to change, so I installed NixOS instead. It works fine. There have been some irritations -- for example it just took me 30 minutes to figure out how to install dict, with a local wordnet dictionary server -- but mostly it has the packages I need. Again, I don't recommend starting with the OS install though.

    GuixSD, the OS installation of Guix, is a bit harder even than NixOS. It has fewer packages, though what it does have tends to be more up-to-date than Nix. There are two big things about GuixSD though. One is that it aims to be fully free, including avoiding non-free firmware. Because they build deterministic build products from source, Nix and Guix can offer completely reproducible builds, which is swell for software reliability. Many reliability people also care a lot about software freedom and although Nix does support software freedom very well, it also includes options to turn on the Flash plugin, for example, and of course includes the Linux kernel with all of the firmware. Well GuixSD eschews non-free firmware, and uses the Linux-Libre kernel. For myself I have a local build on another machine that uses the stock Linux kernel with firmware for my Intel wireless device, and I was really discouraged from even sharing the existence of this hack. I guess it makes sense, it takes a world to make software freedom, but that particular part is not my fight.

    The other thing about Guix is that it's really GNU-focused. This is great but also affects the product in some negative ways. They use "dmd" as an init system, for example, which is kinda like systemd but not. One consequence of this is that GuixSD doesn't have an implementation of the org.freedesktop.login1 seat management interface, which these days is implemented by part of systemd, which in turn precludes a bunch of other things GNOME-related. At one point I started working on a fork of systemd that pulled logind out to a separate project, which makes sense to me for distros that want seat management but not systemd, but TBH I have no horse in the systemd race and in fact systemd works well for me. But, a system with elogind would also work well for me. Anyway, the upshot is that unless you care a lot about the distro itself or are willing to adapt to e.g. Xfce or Xmonad or something, NixOS is a more pragmatic choice.

    i'm on a horse

    I actually like Guix's tools better than Nix's, and not just because they are written in Guile. Guix also has all the tools I need for software development, so I prefer it and ended up installing it as a user-space package manager on this NixOS system. Sounds bizarre but it actually works pretty well.

    So, the point of this article is to be a little guide of how to build V8 with Guix. Here we go!

    up and running with guix

    First, check the manual. It's great and well-written and answers many questions and in fact includes all of this.

    Now, I assume you're on an x86-64 Linux system, so we're going to use the awesome binary installation mechanism. Check it out: because everything in /gnu/store is linked directly to each other, all you have to do is to copy a reified /gnu/store onto a working system, then copy a sqlite thing into /var, and you've installed Guix. Sweet, eh? And actually you can take a running system and clone it onto other systems in that way, and Guix even provides a tool to generate such a tarball for you. Neat stuff.

    cd /tmp
    tar xf guix-binary-0.8.3.x86_64-linux.tar.xz
    mv var/guix /var/ && mv gnu /

    This Guix installation has a built-in profile for the root user, so let's go ahead and add a link from ~root to the store.

    ln -sf /var/guix/profiles/per-user/root/guix-profile \
           ~root/.guix-profile

    Since we're root, we can add the bin/ part of the Guix profile to our environment.

    export PATH="$HOME/.guix-profile/bin:$HOME/.guix-profile/sbin:$PATH"

    Perhaps we add that line to our ~root/.bash_profile. Anyway, now we have Guix. Or rather, we almost have Guix -- we need to start the daemon that actually manages the store. Create some users:

    groupadd --system guixbuild
    for i in `seq -w 1 10`; do
      useradd -g guixbuild -G guixbuild           \
              -d /var/empty -s `which nologin`    \
              -c "Guix build user $i" --system    \

    And now run the daemon:

    guix-daemon --build-users-group=guixbuild

    If your host distro uses systemd, there's a unit that you can drop into the systemd folder. See the manual.
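
    For instance, something along these lines should work (a sketch only; the exact location of the unit file depends on your Guix version, so adjust the path to wherever your copy ships guix-daemon.service):

    # Hypothetical path; check where your Guix installation provides the unit.
    cp ~root/.guix-profile/lib/systemd/system/guix-daemon.service /etc/systemd/system/
    systemctl enable guix-daemon && systemctl start guix-daemon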

    A few more things. One, usually when you go to install something, you'll want to fetch a pre-built copy of that software if it's available. Although Guix is fundamentally a build-from-source distro, Guix also runs a continuous builder service to make sure that binaries are available, if you trust the machine building the binaries of course. To do that, we tell the daemon to trust the public signing key of the GNU build farm (hydra.gnu.org), which ships in the root profile:

    guix archive --authorize < ~root/.guix-profile/share/guix/hydra.gnu.org.pub

    as a user

    OK now we have Guix installed. Running Guix commands will install things into the store as needed, and populate the forest of symlinks in the current user's $HOME/.guix-profile. So probably what you want to do is to run, as your user:

    /var/guix/profiles/per-user/root/guix-profile/bin/guix \
      package --install guix

    This will make Guix available in your own user's profile. From here you can begin to install software; for example, if you run

    guix package --install emacs

    You'll then have an emacs in ~/.guix-profile/bin/emacs which you can run. Pretty cool stuff.

    back on the horse

    So what does it mean for software development? Well, when I develop software, I usually want to know exactly what the inputs are, and to not have inputs to the build process that I don't control, and not have my build depend on unrelated software upgrades on my system. That's what Guix provides for me. For example, when I develop V8, I just need a few things. In fact I need these things:

    ;; Save as ~/src/profiles/v8.scm
    (use-package-modules gcc llvm base python version-control less ccache)

    (packages->manifest
     (list clang
           ;; ...the remaining packages pulled in from the modules above...
           (list gcc-4.9 "lib")))

    This set of Guix packages is what it took for me to set up a V8 development environment. I can make a development environment containing only these packages and no others by saving the above file as v8.scm and then sourcing this script:

    ~/.guix-profile/bin/guix package -p ~/src/profiles/v8 -m ~/src/profiles/v8.scm
    eval `~/.guix-profile/bin/guix package -p ~/src/profiles/v8 --search-paths`
    export GYP_DEFINES='linux_use_bundled_gold=0 linux_use_gold_flags=0 linux_use_bundled_binutils=0'
    export CXX='ccache clang++'
    export CC='ccache clang'
    export LD_LIBRARY_PATH=$HOME/src/profiles/v8/lib

    Let's take this one line at a time. The first line takes my manifest -- the set of packages that collectively form my build environment -- and arranges to populate a symlink forest at ~/src/profiles/v8.

    $ ls -l ~/src/profiles/v8/
    total 44
    dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 bin
    dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 etc
    dr-xr-xr-x  4 root guixbuild  4096 Jan  1  1970 include
    dr-xr-xr-x  2 root guixbuild 12288 Jan  1  1970 lib
    dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 libexec
    -r--r--r--  2 root guixbuild  4138 Jan  1  1970 manifest
    lrwxrwxrwx 12 root guixbuild    59 Jan  1  1970 sbin -> /gnu/store/1g78hxc8vn7q7x9wq3iswxqd8lbpfnwj-glibc-2.21/sbin
    dr-xr-xr-x  6 root guixbuild  4096 Jan  1  1970 share
    lrwxrwxrwx 12 root guixbuild    58 Jan  1  1970 var -> /gnu/store/1g78hxc8vn7q7x9wq3iswxqd8lbpfnwj-glibc-2.21/var
    lrwxrwxrwx 12 root guixbuild    82 Jan  1  1970 x86_64-unknown-linux-gnu -> /gnu/store/wq6q6ahqs9rr0chp97h461yj8w9ympvm-binutils-2.25/x86_64-unknown-linux-gnu

    So that's totally scrolling off the right for you, that's the thing about Nix and Guix names. What it means is that I have a tree of software, and most directories contain a union of links from various packages. It so happens that sbin though just has links from glibc, so it links directly into the store. Anyway. The next line in my script arranges to point my shell into that environment.

    $ guix package -p ~/src/profiles/v8 --search-paths
    export PATH="/home/wingo/src/profiles/v8/bin:/home/wingo/src/profiles/v8/sbin"
    export CPATH="/home/wingo/src/profiles/v8/include"
    export LIBRARY_PATH="/home/wingo/src/profiles/v8/lib"
    export LOCPATH="/home/wingo/src/profiles/v8/lib/locale"
    export PYTHONPATH="/home/wingo/src/profiles/v8/lib/python2.7/site-packages"

    Having sourced this into my environment, my shell's ls for example now points into my new profile:

    $ which ls
    /home/wingo/src/profiles/v8/bin/ls

    Neat. Next we have some V8 defines. On x86_64 on Linux, v8 wants to use some binutils things that it bundles itself, but oddly enough for months under Debian I was seeing spurious intermittent segfaults while linking with their bundled gold linker binary. I don't want to use their idea of what a linker is anyway, so I set some defines to make v8's build tool use Guix's linker. (Incidentally, figuring out what those defines were took spelunking through makefiles, to gyp files, to the source of gyp itself, to the source of the standard shlex Python module to figure out what delimiters shlex.split actually splits on... yaaarrggh!)

    Then some defines to use ccache, then a strange thing: what's up with that LD_LIBRARY_PATH?

    Well. I'm not sure. However the normal thing for dynamic linking under Linux is that you end up with binaries that are just linked against e.g. libc.so.6, wherever the system will find libc.so.6. That's not what we want in Guix -- we want to link against a specific version of every dependency, not just any old version. Guix's builders normally do this when building software for Guix, but somehow in this case I haven't managed to make that happen, so the binaries that are built as part of the build process can end up not specifying the path of the libraries they are linked to. I don't know whether this is an issue with v8's build system, that it doesn't want to work well with Nix / Guix, or if it's something else. Anyway I hack around it by assuming that whatever's in my artisanally assembled symlink forest ("profile") is the right thing, so I set it as the search path for the dynamic linker. Suggestions welcome here.

    And from here... well it just works! I've gained the ability to precisely specify a reproducible build environment for the software I am working on, which is entirely separated from the set of software that I have installed on my system, which I can reproduce precisely with a script, and yet which is still part of my system -- I'm not isolated from it by container or VM boundaries (though I can be; see NixOps for more in that direction).

    OK I lied a little bit. I had to apply this patch to V8:

    $ git diff
    diff --git a/build/standalone.gypi b/build/standalone.gypi
    index 2bdd39d..941b9d7 100644
    --- a/build/standalone.gypi
    +++ b/build/standalone.gypi
    @@ -98,7 +98,7 @@
             ['OS=="win"', {
               'gomadir': 'c:\\goma\\goma-win',
             }, {
    -          'gomadir': '<!(/bin/echo -n ${HOME}/goma)',
    +          'gomadir': '<!(/usr/bin/env echo -n ${HOME}/goma)',
             ['host_arch!="ppc" and host_arch!="ppc64" and host_arch!="ppc64le"', {
               'host_clang%': '1',

    See? Because my system is NixOS, there is no /bin/echo. It does helpfully install a /usr/bin/env though, which other shell invocations in this build script use, so I use that instead. I mention this as an example of what works and what workarounds there are.

    dpkg --purgatory

    So now I have NixOS as my OS, and I mostly use Guix for software development. This is a new setup and we'll see how it works in practice.

    Installing NixOS on top of Debian was a bit irritating. I ended up making a bootable USB installation image, then installing over to my Debian partition, happy in the idea that it wouldn't conflict with my system. But in that I forgot about /etc and /var and all that. So I copied /etc to /etc-debian, just as a backup, and NixOS appeared to install fine. However it wouldn't boot, and that's because some systemd state from my old /etc which was still in place conflicted with... something? In the end I redid the install, moving my old /usr, /etc and such directories to backup names and letting NixOS have control. That worked fine.

    I have GuixSD on a laptop but I really don't recommend it right now -- not unless you have time and are willing to hack on it. But that's OK, install NixOS and you'll be happy on the system side, and if you want Guix you can install it as a user.

    Comments and corrections welcome, and happy hacking!

    by Andy Wingo at August 04, 2015 04:23 PM

    July 28, 2015

    Andy Wingo

    loop optimizations in guile

    Sup peeps. So, after the slog to update Guile's intermediate language, I wanted to land some new optimizations before moving on to the next thing. For years I've been meaning to do some loop optimizations, and I was finally able to land a few of them.

    loop peeling

    For a long time I have wanted to do "loop peeling". Loop peeling means peeling off the first iteration of a loop. If you have a source program that looks like this:

    while foo:
      bar()
      baz()

    Loop peeling turns it into this:

    if foo:
      bar()
      baz()
      while foo:
        bar()
        baz()

    You wouldn't think that this is actually an optimization, would you? Well on its own, it's not. But if you combine it with common subexpression elimination, then it means that the loop body is now dominated by all effects and all loop-invariant expressions that must be evaluated for the expression to loop.

    In dynamic languages, this is most useful when one source expression expands to a number of low-level steps. So for example if your language runtime implements top-level variable references in three parts, one that gets a reference to a mutable box, a second that checks whether the box has a value, and a third that unboxes it, then we would have:

    if foo:
      bar_location = lookup("bar")
      bar_value = dereference(bar_location)
      if bar_value is null: throw NotFound("bar")
      bar_value()
      baz_location = lookup("baz")
      baz_value = dereference(baz_location)
      if baz_value is null: throw NotFound("baz")
      baz_value()
      while foo:
        bar_value = dereference(bar_location)
        bar_value()
        baz_value = dereference(baz_location)
        baz_value()

    The result is that we have hoisted the lookups and null checks out of the loop (if a box can never transition from full back to empty). It's a really powerful transformation that can even hoist things that traditional loop-invariant code motion can't, but more on that later.

    Now, the problem with loop peeling is that usually values will escape your loop. For example:

    while foo:
      x = qux()
      if x then return x

    In this little example, there is a value x, and the return x statement is actually not in the loop. It's syntactically in the loop, but the underlying representation that the compiler uses looks more like this:

    function qux(k):
      label loop_header():
        fetch(foo) -> loop_test
      label loop_test(foo_value):
        if foo_value then -> body else -> exit
      label body():
        fetch(x) -> have_x
      label have_x(x_value):
        if x_value then -> return_x else -> loop_header
      label return_x():
        values(x) -> k
      label exit():

    This is the "CPS soup" I described in my last post. Only the loop_header, loop_test, body and have_x labels are in the loop; notably, the return is outside the loop. Point being, if we peel off the first iteration, then there are two possible values for x that we would return:

    if foo:
      x1 = qux()
      if x1 then return x1
      while foo:
        x2 = qux()
        if x2 then return x2

    I have them marked as x1 and x2. But I've also duplicated the return x terms, which is not what we want. We want to peel off the first iteration, which will cause code growth equal to the size of the loop body, but we don't want to have to duplicate everything that's after the loop. What we have to do is re-introduce a join point that defines x:

    if foo:
      x1 = qux()
      if x1 then join(x1)
      while foo:
        x2 = qux()
        if x2 then join(x2)
    label join(x)
      return x

    Here I'm playing fast and loose with notation because the real terms are too gnarly. What I'm trying to get across is that for each value that flows out of a loop, you need a join point. That's fine, it's a bit more involved, but what if your loop exits to two different points, but one value is live in both of them? A value can only be defined in one place, in CPS or SSA. You could re-place a whole tree of phi variables, in SSA parlance, with join blocks and such, but it's just too hard.

    However we can still get the benefits of peeling in most cases if we restrict ourselves to loops that exit to only one continuation. In that case the live variable set is the intersection of all variables defined in the loop that are live at the exit points. Easy enough, and that's what we have in Guile now. Peeling causes some code growth but the loops are smaller so it should still be a win. Check out the source, if that's your thing.

    loop-invariant code motion

    Usually when people are interested in moving code out of loops they talk about loop-invariant code motion, or LICM. Contrary to what you might think, LICM is complementary to peeling: some things that peeling+CSE can hoist are not hoistable by LICM, and vice versa.

    Unlike peeling, LICM does not cause code growth. Instead, for each expression in a loop, LICM tries to hoist it out of the loop if it can. An expression can be hoisted if all of these conditions are true:

    1. It doesn't cause the creation of an observably new object. In Scheme, the definition of "observable" is quite subtle, so in practice in Guile we don't hoist expressions that can cause any allocation. We could use alias analysis to improve this.

    2. The expression cannot throw an exception, or the expression is always evaluated for every loop iteration.

    3. The expression makes no writes to memory, or if it writes to memory, other expressions in the loop cannot possibly read from that memory. We use effects analysis for this.

    4. The expression makes no reads from memory, or if it reads from memory, no other expression in the loop can clobber those reads. Again, effects analysis.

    5. The expression uses only loop-invariant variables.

    This definition is inductive, so once an expression is hoisted, the values it defines are then considered loop-invariant, so you might be able to hoist a whole chain of values.

    Compared to loop peeling, this has the gnarly aspect of having to explicitly reason about loop invariance and manually move code, which is a pain. (Really LICM would be better named "artisanal code motion".) However it causes no code growth, which is a plus, though like peeling it can increase register pressure. But the big difference is that LICM can hoist effect-free expressions that aren't always executed. Consider:

    while foo:
      x = qux() ? "hi" : "ho"

    Here for some reason it could be faster to cache "hi" or "ho" in registers, which is what LICM allows:

    hi, ho = "hi", "ho"
    while foo:
      x = qux() ? hi : ho

    On the other hand, LICM alone can't hoist the if baz is null checks in this example from above:

    while foo:
      bar()
      baz()

    The issue is that the call to bar() might not return, so the error that might be thrown if baz is null shouldn't be observed until bar is called. In general we can't hoist anything that might throw an exception past some non-hoisted code that might throw an exception. This specific situation happens in Guile but there are similar ones in any language, I think.

    More formally, LICM will hoist effectful but loop-invariant expressions that postdominate the loop header, whereas peeling hoists those expressions that dominate all back-edges. I think? We'll go with that. Again, the source.

    loop inversion

    Loop inversion is a little hack to improve code generation, and again it's a little counterintuitive. If you have this loop:

    while n < x:
      x = x + 1

    Loop inversion turns it into:

    if n < x:
      do:
        x = x + 1
      while n < x

    The goal is that instead of generating code that looks like this:

    header:
      test n, x;
      branch-if-greater-than-or-equal done;
      x = x + 1
      goto header

    You make something that looks like this:

      test n, x;
      branch-if-greater-than-or-equal done;
    header:
      x = x + 1
      test n, x;
      branch-if-less-than header;

    The upshot is that the loop body now contains one branch instead of two. It's mostly helpful for tight loops.

    It turns out that you can express this transformation on CPS (or SSA, or whatever), but that like loop peeling the extra branch introduces an extra join point in your program. If your loop exits to more than one label, then we have the same problems as loop peeling. For this reason Guile restricts loop inversion (which it calls "loop rotation" at the moment; I should probably fix that) to loops with only one exit continuation.

    Loop inversion has some other caveats, but probably the biggest one is that in Guile it doesn't actually guarantee that each back-edge is a conditional branch. The reason is that usually a loop has some associated loop variables, and it could be that you need to reshuffle those variables when you jump back to the top. Mostly Guile's compiler manages to avoid shuffling, allowing inversion to have the right effect, but it's not guaranteed. Fixing this is not straightforward, since the shuffling of values is associated with the predecessor of the loop header and not the loop header itself. If instead we reshuffled before the header, that might work, but each back-edge might have a different shuffling to make... anyway. In practice inversion seems to work out fine; I haven't yet seen a case where it doesn't work. Source code here.

    loop identification

    One final note: what is a loop anyway? Turns out this is a somewhat hard problem, especially once you start trying to identify nested loops. Guile currently does the simple thing and just computes strongly-connected components in a function's flow-graph, and says that a loop is a non-trivial SCC with a single predecessor. That won't tease apart loop nests but oh wells! I spent a lot of time last year or maybe two years ago with that "Loop identification via D-J graphs" paper but in the end simple is best, at least for making incremental steps.

    Okeysmokes, until next time, loop on!

    by Andy Wingo at July 28, 2015 08:10 AM

    July 27, 2015

    Andy Wingo

    cps soup

    Hello internets! This blog goes out to my long-time readers who have followed my saga hacking on Guile's compiler. For the rest of you, a little history, then the new thing.

    In the olden days, Guile had no compiler, just an interpreter written in C. Around 8 years ago now, we ported Guile to compile to bytecode. That bytecode is what is currently deployed as Guile 2.0. For many reasons we wanted to upgrade our compiler and virtual machine for Guile 2.2, and the result of that was a new continuation-passing-style compiler for Guile. Check that link for all the backstory.

    So, I was going to finish documenting this intermediate language about 5 months ago, in preparation for making the first Guile 2.2 prereleases. But something about it made me really unhappy. You can catch some foreshadowing of this in my article from last August on common subexpression elimination; I'll just quote a paragraph here:

    In essence, the scope tree doesn't necessarily reflect the dominator tree, so not all transformations you might like to make are syntactically valid. In Guile 2.2's CSE pass, we work around the issue by concurrently rewriting the scope tree to reflect the dominator tree. It's something I am seeing more and more and it gives me some pause as to the suitability of CPS as an intermediate language.

    This is exactly the same concern that Matthew Fluet and Stephen Weeks had back in 2003:

    Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.

    As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y (i.e. without putting y in the scope of x)

    [MLton-devel] CPS vs SSA

    To be honest I think all this talk about dominators is distracting. Dominators are but a lightweight flow analysis, and I usually find myself using full-on flow analysis to compute the set of optimizations that I can do on a piece of code. In fact the only use I had for dominators in the nested CPS language was to rewrite scope trees! The salient part of Weeks' observation is that nested scope trees are the problem, not that dominators are the solution.

    So, after literally years of hemming and hawing about this, I finally decided to remove nested scope trees from Guile's CPS intermediate language. Instead, a function is now a collection of labelled continuations, with one distinguished entry continuation. There is no more $letk term to nest continuations in each other. A program is now represented as a "soup" -- basically a map from labels to continuation bodies, again with a distinguished entry. As an example, consider this expression:

      return add(x, 1)

    If we rewrote it in continuation-passing style, we'd give the function a name for its "tail continuation", ktail, and annotate each expression with its continuation:

    function(ktail, x):
      add(x, 1) -> ktail

    Here the -> ktail means that the add expression passes its values to the continuation ktail.

    With nested CPS, it could look like:

    function(ktail, x):
      letk have_one(one): add(x, one) -> ktail
        load_constant(1) -> have_one

    Here the label have_one is in a scope, as is the value one. With "CPS soup", though, it looks more like this:

    function(ktail, x):
      label have_one(one): add(x, one) -> ktail
      label main(x): load_constant(1) -> have_one

    It's a subtle change, but it took a few months to make so it's worth pointing out what's going on. The difference is that there is no scope tree for labels or variables any more. A variable can be used at a label if it flows to the label, in a flow analysis sense. Indeed, determining the set of variables that can be used at a label requires flow analysis; that's what Weeks was getting at in his 2003 mail about the advantages of SSA, which are really the advantages of an intermediate language without nested scope trees.

    The question arises, though, now that we've decided on CPS soup, how should we represent a program as a value? We've gone from a nested term to a graph term, and we need to find a way to represent it somehow that facilitates looking up labels by name, and facilitates tree rewrites.

    In Guile's IR, labels and variables are both integers, so happily enough, we have such a data structure: Clojure-style maps specialized for integer keys.

    Friends, if there has been one realization or revolution for me in the last year, it has been Clojure-style data structures. Here's why. In compilers, I often have to build up some kind of analysis, then use that analysis to transform data. Often I need to keep the old term around while I build a new one, but it would be nice to share state between old and new terms. With a nested tree, if a leaf changed you'd have to rebuild all surrounding terms, which is gnarly. But with Clojure-style data structures, more and more I find myself computing in terms of values: build up this value, transform this map to that set, fold over this map -- and yes, you can fold over Guile's intmaps -- and so on. By providing an expressive data structure for which I can control performance characteristics by using transients if needed, these data structures make my programs more about data and less about gnarly machinery.

    As a concrete example, take the old contification pass in Guile: I didn't have the mental capacity to understand all the moving parts in such a way that I could compute an optimal contification from the beginning; instead I had to iterate to a fixed point, as Kennedy did in his "Compiling with Continuations, Continued" paper. With the new CPS soup language and with Clojure-style data structures, I could actually fit more of the algorithm into my head, with the result that Guile now contifies optimally while avoiding the fixed-point transformation. Also, the old pass used hash tables to represent the analysis, which I found incredibly confusing to reason about -- I totally buy Rich Hickey's argument that place-oriented programming is the source of many evils in programs, and hash tables are nothing if not a place party. Using functional maps let me solve harder problems because they are easier for me to reason about.

    Contification isn't an isolated case, either. For example, we are able to do the complete set of optimizations from the "Optimizing closures in O(0) time" paper, including closure sharing, which I think makes Guile unique besides Chez Scheme. I wasn't capable of doing it on the old representation because it was just too hard for me to think about, because my data structures weren't right.

    This new "CPS soup" language is still a first-order CPS language in that each term specifies its continuation, and that variable names appear in the continuation of a definition, not the definition itself. This effectively makes every variable a phi variable, in the sense of SSA, and you have to do some work to get to a variable's definition. It could be that still this isn't the right number of names; consider this function:

    function foo(k, x):
      label have_y(y) bar(y) -> k
      label y_is_two() load_constant(2) -> have_y
      label y_is_one() load_constant(1) -> have_y
      label main(x) if x -> y_is_one else -> y_is_two

    Here there is no distinguished name for the value load_constant(1) versus load_constant(2): both are possible values for y. If we ended up giving them names, we'd have to reintroduce actual phi variables for the joins, which would basically complete the transformation to SSA. Until now though I haven't wanted those names, so perhaps I can put this off. On the other hand, every term has a label, which simplifies many things compared to having to contain terms in basic blocks, as is usually done in SSA. Yet another chapter in CPS is SSA is CPS is SSA, it seems.

    Welp, that's all the nerdery for right now. Talk at yall later!

    by Andy Wingo at July 27, 2015 02:43 PM

    July 25, 2015

    Michael Catanzaro

    Useful DuckDuckGo bangs

    DuckDuckGo bangs are just shortcuts to redirect your search to another search engine. My personal favorites:

    • !gnomebugs — Runs a search on GNOME Bugzilla. Especially useful followed by a bug number. For example, search for ‘!gnomebugs 100000’ and see what you get.
    • !wkb — Same thing for WebKit Bugzilla.
    • !w — Searches Wikipedia.

    There are 6388 more, but those are the three I can remember. If you work on GNOME or WebKit, these are super convenient.

    by Michael Catanzaro at July 25, 2015 01:47 AM

    July 23, 2015

    Xabier Rodríguez Calvar

    ReadableStream almost ready

    Hello dear readers! Long time no see! You might think that I have been lazy, and when it comes to blog posting I was, but I was coding like mad.

    The first remarkable thing is that I attended the WebKit Contributors Meeting that happened in March at the Apple campus in Cupertino, as part of the Igalia gang. There we discussed, of course, the Streams API, its state and different implementation possibilities. Another very interesting point, which would make me very happy, would be the movement of the Mac port to CMake.

    In a previous post I already introduced the concepts of the Streams API and some of its possible use cases, so I’ll save you that part now. The news is that ReadableStream has its basic functionality complete. And what does that mean? It means that you can create a ReadableStream by providing the constructor with the underlying source and the strategy objects, read from it with its reader, and all the internal mechanisms of backpressure and so on will work according to the spec. Yay!

    Nevertheless, there’s still quite some work to do to complete the implementation of the Streams API, like the implementation of byte streams, writable and transform streams, piping operations and built-in strategies (which is what I am working on right now). I don’t know yet when the Streams API will be activated by default in the next builds of Safari, WebKitGTK+ or WebKit for Wayland, but we’ll make it at some point!

    The code has already gone through a lot of changes because we were still figuring out which architecture was best, and Youenn did an awesome job refactoring some things and providing support for promises in the bindings to make the implementation of ReadableStream more straightforward and less “custom”.

    The implementation could still undergo quite some important changes because, as part of my work implementing the strategies, some reviewers raised concerns about having the Streams API implemented inside WebCore in terms of IDL interfaces. I already have a proof of concept of CountQueuingStrategy and ByteLengthQueuingStrategy implemented inside JavaScriptCore, even a case where we use built-in JavaScript functions, which might help to keep closer to the spec if we can just include JavaScript code directly. We’ll see how we end up!

    Last but not least, I would like to thank Igalia for sponsoring me to attend the WebKit Contributors Meeting in Cupertino, and also Adenilson for being so nice and taking us to very nice places for dinner and drinks that we wouldn’t be able to find ourselves (I owe you, I promise to return the favor at the Web Engines Hackfest). It was also really nice to have the opportunity of quickly visiting New York City for some hours because of the long connection there, which usually would be a PITA but was very enjoyable this time.

    by calvaris at July 23, 2015 04:17 PM