Planet Igalia

May 02, 2016

Diego Pino

Network namespaces: IPv6 connectivity

In the last post I introduced network namespaces and showed a practical example on how to share IPv4 connectivity between a network namespace and a host. Before that post, I also wrote a short tutorial on how to set up an IPv6 tunnel using Hurricane Electric broker service. This kind of service allow us to get into the IPv6 realm using an IPv4 connection.

In this article I continue exploring network namespaces. Taking advantage of the work done in the aforementioned posts, I explain in this post how to share IPv6 connectivity between a host and a network namespace.

Let’s assume we already have a SIT tunnel (IPv6-in-IPv4 tunnel) enabled in our host and we’re able to ping an external IPv6 address. If you haven’t, I encourage you to check out set up an IPv6 tunnel.

$ ping6
PING 56 data bytes
64 bytes from icmp_seq=1 ttl=57 time=93.2 ms

I need to write an script which will create the network namespace and set it up accordingly. I call that script ns-ipv6. If the script works correctly, I should be able to ping an external IPv6 host from the namespace. Such script looks like this:

 1 #!/bin/bash
 3 set -x
 5 if [[ $EUID -ne 0 ]]; then
 6    echo "You must run this script as root."
 7    exit 1
 8 fi
10 VETH1_IPV6=fd00::1
11 VPEER1_IPV6=fd00::2
13 # Clean up.
14 ip netns del ns-ipv6 &>/dev/null
15 ip li del veth1 &> /dev/null
17 # Create network namespace.
18 ip netns add ns-ipv6
20 # Create veth pair.
21 ip li add name veth1 type veth peer name vpeer1
23 # Setup veth1 (host).
24 ip -6 addr add ${VETH1_IPV6}/64 dev veth1
25 ip li set dev veth1 up
27 # Setup vpeer1 (network namespace).
28 ip li set dev vpeer1 netns ns-ipv6
29 ip netns exec ns-ipv6 ip li set dev lo up
30 ip netns exec ns-ipv6 ip -6 addr add ${VPEER1_IPV6}/64 dev vpeer1
31 ip netns exec ns-ipv6 ip li set vpeer1 up
33 # Make vpeer1 default gw.
34 ip netns exec ns-ipv6 ip -6 route add default dev vpeer1 via ${VETH1_IPV6}
36 # NAT
37 sysctl -w net.ipv6.conf.all.forwarding=1
38 ip6tables -t nat --flush
39 ip6tables -t nat -A POSTROUTING -o he-ipv6 -j MASQUERADE
41 # Get into ns-ipv6.
42 ip netns exec ns-ipv6 /bin/bash --rcfile <(echo "PS1=\"ns-ipv6> \"")

It actually works:

$ sudo ./
ns-ipv6> ping6 -c 1 2a00:1450:4003:801::200e
PING 2a00:1450:4003:801::200e(2a00:1450:4003:801::200e) 56 data bytes
64 bytes from 2a00:1450:4003:801::200e: icmp_seq=1 ttl=54 time=83.7 ms

Let’s take a deeper look on how it works.


The script creates a veth pair to communicate the network namespace with the host. Each virtual interface is assigned an IPv6 address in the ‘fd00::0/64’ network space (Lines 21 and 27). This type of address is known as ULA or Unique Local Address. ULAs are the IPv6 counterpart of IPv4 private addresses.

Before continuing, a brief reminder on how IPv6 addresses work:

An IPv6 address is a 128-bit value represented as 8 blocks of 16-bit (8 x 16-bit = 128-bit). Blocks are separated by a colon (‘:’). Unlike IPv4 addresses, block values are written in hexadecimal. Since each block is a 16-bit value, they can be written in hexadecimal as 4-digit numbers. Leading zeros in each block can be ommitted. On the same hand, when several consecutive block values are zero they can be ommitted too. In that case two colons (‘::’) are written instead, meaning everything in between is nil. For instance, the address ‘fd00::1’ is the short form of the much longer address ‘fd00:0000:0000:0000:0000:0000:0000:0001’.

RFC 4193 (section 3) describes Unique Local Addresses format as:

| 7 bits |1|  40 bits   |  16 bits  |          64 bits           |
| Prefix |L| Global ID  | Subnet ID |        Interface ID        |

- Prefix is always fc00::/7.
- L:
  * set to 1, for local assigned addresses (block _fd00::/8_).
  * set to 0, not defined yet (block _fc00::/8_).
- Global ID: 40-bit global identifier used to create a globally unique prefix.
- Subnet ID: 16-bit Subnet ID is an identifier of a subnet within the site.
- Interface ID: 64-bit Interface ID.

The RFC reserves the IPv6 address block ‘fc00::/7’ for ULAs. It divides this address in two subnetworks: ‘fc00::/8’ and ‘fd00::/8’. The use of the ‘fc00::/8’ block has not been defined yet, while the ‘fd00:/8’ block is used for IPv6 local assigned addresses (private addresses).

The address ‘fd63:b1f4:7268:d970::1’ is an example of a valid ULA. It starts by the ‘fd’ prefix followed by an unique Global ID (‘63:b1f4:7268’) and a Subnet ID (‘d970’), leaving 64 bits for the Interface ID (‘::1’). I recommend the page Private IPv6 address range to obtain valid random ULAs.

ULAs are not routable in the global Internet. They are meant to be used inside local networks and that’s precisely the reason why they exist.

NAT on IPv6

Lines 34-36 activate IPv6 forwarding and IP Masquerade on the source address. However, this solution is not optimal.

The Hurricane Electric tunnel broker service lends us a ‘::0/64’ block, with 2^64 - 2 valid hosts. NAT, Network Address Translation, grants a host in a private network external connectivity via a proxy that lends the private host its address. This is the most common use case of NAT, known as Source NAT. Besides IP addresses, NAT translates port numbers too and that’s why it’s sometimes referred as NAPT (Network Address and Port Translation). NAT has been an important technology for optimizing the use of IPv4 address space, although it has its costs too.

The original goal of IPv6 was solving the problem of IP address exhaustation. Mechanisms such as NAT are not needed because the IPv6 address space is so big that every host could have an unique address, reachable from another end of the network. Actually, IPv6 brings back the original point-to-point design of the IPv4 Internet, before private addresses and NAT were introduced.

So let’s try to get rid of NAT66 (IPv6-to-IPv6 translation) by:

  • Using global IPv6 addresses.
  • Removing MASQUERADING.

The new script is available as a gist here: There’s some tricky bits that are worth explaining:

First thing, is to replace the ULAs by IPv6 addresses which belong to /64 block leased by Hurricane Electric:


When setting up the interfaces, the host side should add a more restricted routing rule for the other end of the veth pair. The reason is that all addresses belong to the same network. If from the host side a packet needs to get to the network namespace side, it would be routed through the IPv6 tunnel unless there’s a more restricted rule.

# Setup veth1 (host).
ip -6 addr add ${VETH1_IPV6} dev veth1
ip -6 route add ${VPEER1_IPV6}/128 dev veth1

Lastly, NAT66 can be removed but IP forwarding is still necessary as the host acts as a router.

# Enable IPv6 forwarding.
systctl net.ipv6.conf.all.forwarding=1

When a packet arrives from the network namespace into the host, the destination address of the packet doesn’t match any of the interfaces of the host. If IP forwarding were disabled, the packet will simply be dropped. However, since IP forwarding is enabled, non-delivered packets get forwarded through the host’s default gateway reaching their destination, hopefully.

After these changes, the script still works:

ns-ipv6-no-nat> ping6 2a00:1450:4004:801::200e
PING 2a00:1450:4004:801::200e(2a00:1450:4004:801::200e) 56 data bytes
64 bytes from 2a00:1450:4004:801::200e: icmp_seq=1 ttl=56 time=86.0 ms

DNS resolution

In the line above I’m pinging an IPv6 address directly (this address is actually What happens if I try to ping a host name instead?

ns-ipv6> ping6
unknown host

When I ping from the network namespace, the /etc/resolv.conf file is queried to obtain a DNS nameserver address.



This address is an IPv4 address, but the network namespace has IPv6 connectivity only. It cannot reach any host in the IPv4 realm. However, DNS resolution works in the host since the host has either IPv6 and IPv4 connectivity. It is necessary to add a DNS server with an IPv6 address. Luckily, Google has DNS servers available in the IPv6 realm too.

nameserver 2001:4860:4860::8888

Now I should be able to ping from the network namespace:

ns-ipv6> ping6 -c 1
PING 56 data bytes
64 bytes from icmp_seq=1 ttl=57 time=85.7 ms

--- ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 85.702/85.702/85.702/0.000 ms

Wrapping up

After all these changes we end up with a script that:

  • Uses Hurricane Electric’s IPv6 network addresses, instead of ULAs.
  • Doesn’t do NAT66 to provide external IPv6 connectivity to the network namespace.

It has been a lot of fun writing out this post, it helped me to understand many things better. I definitely encourage everyone interested to run some of the scripts above and try out IPv6, if you haven’t yet. The network namespace part is not fundamental but it makes it more interesting.

Lastly, I’d like to thank my colleague Carlos López for his unvaluable help as well as the StackOverflow community which helped me to figure out the script that gets rid of NAT66.

May 02, 2016 10:00 AM

April 26, 2016

Adrián Pérez

Identifying Layer-7 packet flows in SnabbWall

Spring is here already, the snow has melted a while ago, and it looks like a good time to write a bit about network traffic flows, as promised in my previous post about ljndpi. Why so? Well, looking at network traffic and grouping it into logical streams between two endpoints is something that needs to be done for SnabbWall, a suite of Snabb applications which implement a Layer-7 analyzer and firewall which Igalia is developing with sponsorship from the NLnet Foundation.


(For those interested in following only my Snabb-related posts, there are separate feeds you can use: RSS, Atom.)

Going With the Flow

Any sequence of related network packets between two hosts can be a network traffic flow. But not quite so: the exact definition may vary, depending on the level at which we are working. For example, an ISP may want to consider all packets between the pair of hosts —regardless of their contents— as part of the same flow in order to account for transferred data in metered connections, but for SnabbWall we want “application-level” traffic flows. That is: all packets generated (or received) by the same application should be classified into the same flow.

But that can get tricky, because even if we looked only at TCP traffic from one application, it is not possible just map a single connection to one flow. Take FTP for example: in active mode it uses a control connection, plus an additional data connection, and both should be considered part of the same flow because both belong to the same application. On the other side of the spectrum are web browsers like the one you are probably using to read this article: it will load the HTML using one connection, and then other related content (CSS, JavaScript, images) needed to display the web page.

In SnabbWall, the assignment of packets to flows is done based on the following fields from the packet:

  • 802.1Q VLAN tag.
  • Source and destination IP addresses.
  • Source and destination port numbers.

The VLAN tag is there to force classifying packets with the same source and destination but in different logical networks in separate packet flows. As for port numbers, in practice these are only extracted from packets when the upper layer protocol is UDP or TCP. There are other protocols which use port numbers, but they are deliberately left out (for now) because either nDPI does not support them, or they are not widely adopted (SCTP comes to mind).

Some Implementation Details

Determining the flow to which packets belong is an important task which is performed for each single packet scanned. Even before packet contents are inspected, they have to be classified.

Handling flows is split in two in the SnabbWall packet scanner: a generic implementation inspects packets to extract the fields above (VLAN tag, addresses, ports) and calculates a unique flow key from them, while backend-specific code inspects the contents of the packet and identifies the application for a flow of packets. Once the generic part of the code has calculated a key, it can be used to keep tables which associate additional data to each flow. While SnabbWall has only one backend which at the moment which uses nDPI, this split makes it easier to add others in the future.

For efficiency —both in terms of memory and CPU usage— flow keys are represented using a C struct. The following snippet shows the one for IPv4 packets, with a similar one where the address fields are 16 bytes wide being used for IPv6 packets:

ffi.cdef [[
   struct swall_flow_key_ipv4 {
      uint16_t vlan_id;
      uint8_t  __pad;
      uint8_t  ip_proto;
      uint8_t  lo_addr[4];
      uint8_t  hi_addr[4];
      uint16_t lo_port;
      uint16_t hi_port;
   } __attribute__((packed));

local flow_key_ipv4 = ffi.metatype("struct swall_flow_key_ipv4", {
   __index = {
      hash = make_cdata_hash_function(ffi.sizeof("struct swall_flow_key_ipv4")),

The struct is laid out with an extra byte of padding, to ensure that its size is a multiple of 4. Why so? The hash function (borrowed from the lib.ctable module) used for flow keys works on inputs with sizes multiple of 4 bytes because calculations are done in a word-by-word basis. In Lua the hash value for userdata values is their memory address, which makes them all different to each other: defining our own hashing function allows using the hash values as keys into tables, instead of the flow key itself. Let's see how this works with the following snippet, which counts per-flow packets:

local flows = {}
while not ended do
   local key = key_from_packet(read_packet())
   if flows[key:hash()] then
      flows[key:hash()].num_packets = flows[key:hash()].num_packets + 1
      flows[key:hash()] = { key = key, num_packets = 1 }

If we used the keys themselves instead of key:hash() for indexing the flows table, this wouldn't work because the userdata for the new key is created for each packet being processed, which means that keys with the same content created for different packets would have different hash values (their address in memory). On the other hand, the :hash() method always returns the same value keys with the same contents.

Highs and Lows

You may be wondering why our flow key struct has its members named lo_addr, hi_addr, lo_port and hi_port. It turns out that in packets which belong to the same application travel between two hosts in both directions. Let's consider the following:

  • Host A, with address
  • Host B, with address
  • A web browser from A connects (using randomly assigned port 10205) to host B, which has an HTTP server running in port 80.

The sequence of packets observed will go like this:

# Source IP Destination IP Source Port Destination Port
1 10205 80
2 80 10205
3 10205 80

If the flow key fields would be src_addr, dst_addr and so on, the first and second packets would be classified in separate flows — but they belong in the same one! This is sidestepped by sorting the addresses and ports of each packet when calculating its flow key. For the example connection above, all packets involved have as the “low IP address” (lo_addr), as the “high IP address” (hi_addr), 80 as the “low port” (lo_port), and 10205 as the “high port” (hi_port) — effectively classifying all the packets into the same flow.

This translates into some minor annoyance in the nDPI scanner backend because nDPI expects us to pass a pair of identifiers for the source and destination hosts for each packet inspected. Not a big deal, though.

Flow(er Power)

Something we have to do for IPv6 packets is traversing the chain of extension headers to get to the upper-layer protocol and extract port numbers from TCP and UDP packets. There can be any number of extension headers, and while in practice they should never be lots, this makes the amount of work needed to derive a flow key from a packet is not constant.

The good news is that RFC 6437 specifies using the 20-bit flow label field of the fixed IPv6 header in a way that, combined with the source and destination addresses, they uniquely identify the flow of the packet. This is all rainbows and ponies, but in practice the current behaviour would still be needed: the specification considers that an all-zeroes value indicates “packets that have not been labeled”. Which means that it is still needed to use the source and destination ports as fallback. What is even worse: while forbidden by the specification, flow labels can mutate while packets are en-route without any means of verifying that the change was made. Also, it is allowed to assign a new flow label to an unlabeled packet when packets are being forwarded. Nevertheless, using the flow label may be interesting to be used instead of the port numbers when the upper layer protocol is neither TCP nor UDP. Due to the limited usefulness, using IPv6 flow labels remains unimplemented for now, but I have not discarded adding support later on.

Something Else

Alongside with the packet scanner, I have implemented the L7Spy application, and the snabb wall command during this phase of the SnabbWall project. Expect another post soon about them!

by aperez ( at April 26, 2016 08:34 PM

April 16, 2016

Frédéric Wang

OpenType MATH in HarfBuzz


  • Work is in progress to add OpenType MATH support in HarfBuzz and will be instrumental for many math rendering engines relying on that library, including browsers.

  • For stretchy operators, an efficient way to determine the required number of glyphs and their overlaps has been implemented and is described here.

In the context of Igalia browser team effort to implement MathML support using TeX rules and OpenType features, I have started implementation of OpenType MATH support in HarfBuzz. This table from the OpenType standard is made of three subtables:

  • The MathConstants table, which contains layout constants. For example, the thickness of the fraction bar of ab\frac{a}{b}.

  • The MathGlyphInfo table, which contains glyph properties. For instance, the italic correction indicating how slanted an integral is e.g. to properly place the subscript in ∫D\displaystyle\displaystyle\int_{D}.

  • The MathVariants table, which provides larger size variants for a base glyph or data to build a glyph assembly. For example, either a larger parenthesis or a assembly of U+239B, U+239C, U+239D to write something like:


Code to parse this table was added to Gecko and WebKit two years ago. The existing code to build glyph assembly in these Web engines was adapted to use the MathVariants data instead of only private tables. However, as we will see below the MathVariants data to build glyph assembly is more general, with arbitrary number of glyphs or with additional constraints on glyph overlaps. Also there are various fallback mechanisms for old fonts and other bugs that I think we could get rid of when we move to OpenType MATH fonts only.

In order to add MathML support in Blink, it is very easy to import the OpenType MATH parsing code from WebKit. However, after discussions with some Google developers, it seems that the best option is to directly add support for this table in HarfBuzz. Since this library is used by Gecko, by WebKit (at least the GTK port) and by many other applications such as Servo, XeTeX or LibreOffice it make senses to share the implementation to improve math rendering everywhere.

The idea for HarfBuzz is to add an API to

  1. 1.

    Expose data from the MathConstants and MathGlyphInfo.

  2. 2.

    Shape stretchy operators to some target size with the help of the MathVariants.

It is then up to a higher-level math rendering engine (e.g. TeX or MathML rendering engines) to beautifully display mathematical formulas using this API. The design choice for exposing MathConstants and MathGlyphInfo is almost obvious from the reading of the MATH table specification. The choice for the shaping API is a bit more complex and discussions is still in progress. For example because we want to accept stretching after glyph-level mirroring (e.g. to draw RTL clockwise integrals) we should accept any glyph and not just an input Unicode strings as it is the case for other HarfBuzz shaping functions. This shaping also depends on a stretching direction (horizontal/vertical) or on a target size (and Gecko even currently has various ways to approximate that target size). Finally, we should also have a way to expose italic correction for a glyph assembly or to approximate preferred width for Web rendering engines.

As I mentioned at the beginning, the data and algorithm to build glyph assembly is the most complex part of the OpenType MATH and deserves a special interest. The idea is that you have a list of n≥1n\geq 1 glyphs available to build the assembly. For each 0≤i≤n-10\leq i\leq n-1, the glyph gig_{i} has advance aia_{i} in the stretch direction. Each gig_{i} has straight connector part at its start (of length sis_{i}) and at its end (of length eie_{i}) so that we can align the glyphs on the stretch axis and glue them together. Also, some of the glyphs are “extenders” which means that they can be repeated 0, 1 or more times to make the assembly as large as possible. Finally, the end/start connectors of consecutive glyphs must overlap by at least a fixed value omino_{\mathrm{min}} to avoid gaps at some resolutions but of course without exceeding the length of the corresponding connectors. This gives some flexibility to adjust the size of the assembly and get closer to the target size tt.










Figure 0.1: Two adjacent glyphs in an assembly

To ensure that the width/height is distributed equally and the symmetry of the shape is preserved, the MATH table specification suggests the following iterative algorithm to determine the number of extenders and the connector overlaps to reach a minimal target size tt:

  1. 1.

    Assemble all parts by overlapping connectors by maximum amount, and removing all extenders. This gives the smallest possible result.

  2. 2.

    Determine how much extra width/height can be distributed into all connections between neighboring parts. If that is enough to achieve the size goal, extend each connection equally by changing overlaps of connectors to finish the job.

  3. 3.

    If all connections have been extended to minimum overlap and further growth is needed, add one of each extender, and repeat the process from the first step.

We note that at each step, each extender is repeated the same number of times r≥0r\geq 0. So if IExtI_{\mathrm{Ext}} (respectively INonExtI_{\mathrm{NonExt}}) is the set of indices 0≤i≤n-10\leq i\leq n-1 such that gig_{i} is an extender (respectively is not an extender) we have ri=rr_{i}=r (respectively ri=1r_{i}=1). The size we can reach at step rr is at most the one obtained with the minimal connector overlap omino_{\mathrm{min}} that is

∑i=0N-1(∑j=1riai-omin)+omin=(∑i∈INonExtai-omin)+(∑i∈IExtr⁢(ai-omin))+omin\sum_{i=0}^{N-1}\left(\sum_{j=1}^{r_{i}}{a_{i}-o_{\mathrm{min}}}\right)+o_{% \mathrm{min}}=\left(\sum_{i\in I_{\mathrm{NonExt}}}{a_{i}-o_{\mathrm{min}}}% \right)+\left(\sum_{i\in I_{\mathrm{Ext}}}r{(a_{i}-o_{\mathrm{min}})}\right)+o% _{\mathrm{min}}

We let NExt=|IExt|N_{\mathrm{Ext}}={|I_{\mathrm{Ext}}|} and NNonExt=|INonExt|N_{\mathrm{NonExt}}={|I_{\mathrm{NonExt}}|} be the number of extenders and non-extenders. We also let SExt=∑i∈IExtaiS_{\mathrm{Ext}}=\sum_{i\in I_{\mathrm{Ext}}}a_{i} and SNonExt=∑i∈INonExtaiS_{\mathrm{NonExt}}=\sum_{i\in I_{\mathrm{NonExt}}}a_{i} be the sum of advances for extenders and non-extenders. If we want the advance of the glyph assembly to reach the minimal size tt then

  SNonExt-omin⁢(NNonExt-1)+r⁢(SExt-omin⁢NExt)≥t{S_{\mathrm{NonExt}}-o_{\mathrm{min}}\left(N_{\mathrm{NonExt}}-1\right)}+{r% \left(S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}\right)}\geq t  

We can assume 0" class="ltx_Math" display="inline" id="p12.m1">SExt-omin⁢NExt>0S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}>0 or otherwise we would have the extreme case where the overlap takes at least the full advance of each extender. Then we obtain

  r≥rmin=max⁡(0,⌈t-SNonExt+omin⁢(NNonExt-1)SExt-omin⁢NExt⌉)r\geq r_{\mathrm{min}}=\max\left(0,\left\lceil\frac{t-{S_{\mathrm{NonExt}}+o_{% \mathrm{min}}\left(N_{\mathrm{NonExt}}-1\right)}}{S_{\mathrm{Ext}}-o_{\mathrm{% min}}N_{\mathrm{Ext}}}\right\rceil\right)  

This provides a first simplification of the algorithm sketched in the MATH table specification: Directly start iteration at step rminr_{\mathrm{min}}. Note that at each step we start at possibly different maximum overlaps and decrease all of them by a same value. It is not clear what to do when one of the overlap reaches omino_{\mathrm{min}} while others can still be decreased. However, the sketched algorithm says all the connectors should reach minimum overlap before the next increment of rr, which means the target size will indeed be reached at step rminr_{\mathrm{min}}.

One possible interpretation is to stop overlap decreasing for the adjacent connectors that reached minimum overlap and to continue uniform decreasing for the others until all the connectors reach minimum overlap. In that case we may lose equal distribution or symmetry. In practice, this should probably not matter much. So we propose instead the dual option which should behave more or less the same in most cases: Start with all overlaps set to omino_{\mathrm{min}} and increase them evenly to reach a same value oo. By the same reasoning as above we want the inequality

  SNonExt-o⁢(NNonExt-1)+rmin⁢(SExt-o⁢NExt)≥t{S_{\mathrm{NonExt}}-o\left(N_{\mathrm{NonExt}}-1\right)}+{r_{\mathrm{min}}% \left(S_{\mathrm{Ext}}-oN_{\mathrm{Ext}}\right)}\geq t  

which can be rewritten

  SNonExt+rmin⁢SExt-o⁢(NNonExt+rmin⁢NExt-1)≥tS_{\mathrm{NonExt}}+r_{\mathrm{min}}S_{\mathrm{Ext}}-{o\left(N_{\mathrm{NonExt% }}+{r_{\mathrm{min}}N_{\mathrm{Ext}}}-1\right)}\geq t  

We note that N=NNonExt+rmin⁢NExtN=N_{\mathrm{NonExt}}+{r_{\mathrm{min}}N_{\mathrm{Ext}}} is just the exact number of glyphs used in the assembly. If there is only a single glyph, then the overlap value is irrelevant so we can assume NNonExt+r⁢NExt-1=N-1≥1N_{\mathrm{NonExt}}+{rN_{\mathrm{Ext}}}-1=N-1\geq 1. This provides the greatest theorical value for the overlap oo:

  omin≤o≤omaxtheorical=SNonExt+rmin⁢SExt-tNNonExt+rmin⁢NExt-1o_{\mathrm{min}}\leq o\leq o_{\mathrm{max}}^{\mathrm{theorical}}=\frac{S_{% \mathrm{NonExt}}+r_{\mathrm{min}}S_{\mathrm{Ext}}-t}{N_{\mathrm{NonExt}}+{r_{% \mathrm{min}}N_{\mathrm{Ext}}}-1}  

Of course, we also have to take into account the limit imposed by the start and end connector lengths. So omaxo_{\mathrm{max}} must also be at most min⁡(ei,si+1)\min{(e_{i},s_{i+1})} for 0≤i≤n-20\leq i\leq n-2. But if rmin≥2r_{\mathrm{min}}\geq 2 then extender copies are connected and so omaxo_{\mathrm{max}} must also be at most min⁡(ei,si)\min{(e_{i},s_{i})} for i∈IExti\in I_{\mathrm{Ext}}. To summarize, omaxo_{\mathrm{max}} is the minimum of omaxtheoricalo_{\mathrm{max}}^{\mathrm{theorical}}, of eie_{i} for 0≤i≤n-20\leq i\leq n-2, of sis_{i} 1≤i≤n-11\leq i\leq n-1 and possibly of e0e_{0} (if 0∈IExt0\in I_{\mathrm{Ext}}) and of of sn-1s_{n-1} (if n-1∈IExt{n-1}\in I_{\mathrm{Ext}}).

With the algorithm described above NExtN_{\mathrm{Ext}}, NNonExtN_{\mathrm{NonExt}}, SExtS_{\mathrm{Ext}}, SNonExtS_{\mathrm{NonExt}} and rminr_{\mathrm{min}} and omaxo_{\mathrm{max}} can all be obtained using simple loops on the glyphs gig_{i} and so the complexity is O⁢(n)O(n). In practice nn is small: For existing fonts, assemblies are made of at most three non-extenders and two extenders that is n≤5n\leq 5 (incidentally, Gecko and WebKit do not currently support larger values of nn). This means that all the operations described above can be considered to have constant complexity. This is much better than a naive implementation of the iterative algorithm sketched in the OpenType MATH table specification which seems to require at worst

  ∑r=0rmin-1NNonExt+r⁢NExt=NNonExt⁢rmin+rmin⁢(rmin-1)2⁢NExt=O⁢(n×rmin2)\sum_{r=0}^{r_{\mathrm{min}}-1}{N_{\mathrm{NonExt}}+rN_{\mathrm{Ext}}}=N_{% \mathrm{NonExt}}r_{\mathrm{min}}+\frac{r_{\mathrm{min}}\left(r_{\mathrm{min}}-% 1\right)}{2}N_{\mathrm{Ext}}={O(n\times r_{\mathrm{min}}^{2})}  

and at least Ω⁢(rmin)\Omega(r_{\mathrm{min}}).

One of issue is that the number of extender repetitions rminr_{\mathrm{min}} and the number of glyphs in the assembly NN can become arbitrary large since the target size tt can take large values e.g. if one writes \underbrace{\hspace{65535em}} in LaTeX. The improvement proposed here does not solve that issue since setting the coordinates of each glyph in the assembly and painting them require Θ⁢(N)\Theta(N) operations as well as (in the case of HarfBuzz) a glyph buffer of size NN. However, such large stretchy operators do not happen in real-life mathematical formulas. Hence to avoid possible hangs in Web engines a solution is to impose a maximum limit NmaxN_{\mathrm{max}} for the number of glyph in the assembly so that the complexity is limited by the size of the DOM tree. Currently, the proposal for HarfBuzz is Nmax=128N_{\mathrm{max}}=128. This means that if each assembly glyph is 1em large you won’t be able to draw stretchy operators of size more than 128em, which sounds a quite reasonable bound. With the above proposal, rminr_{\mathrm{min}} and so NN can be determined very quickly and the cases N≥NmaxN\geq N_{\mathrm{max}} rejected, so that we avoid losing time with such edge cases…

Finally, because in our proposal we use the same overlap oo everywhere an alternative for HarfBuzz would be to set the output buffer size to nn (i.e. ignore r-1r-1 copies of each extender and only keep the first one). This will leave gaps that the client can fix by repeating extenders as long as oo is also provided. Then HarfBuzz math shaping can be done with a complexity in time and space of just O⁢(n)O(n) and it will be up to the client to optimize or limit the painting of extenders for large values of NN…

by fredw at April 16, 2016 09:23 PM

April 10, 2016

Diego Pino

Network namespaces

Namespaces and cgroups are two of the main kernel technologies most of the new trend on software containerization (think Docker) rides on. To put it simple, cgroups are a metering and limiting mechanism, they control how much of a system resource (CPU, memory) you can use. On the other hand, namespaces limit what you can see. Thanks to namespaces processes have their own view of the system’s resources.

The Linux kernel provides 6 types of namespaces: pid, net, mnt, uts, ipc and user. For instance, a process inside a pid namespace only sees processes in the same namespace. Thanks to the mnt namespace, it’s possible to attach a process to its own filesystem (like chroot). In this article I focus only in network namespaces.

If you have grasped the concept of namespaces you may have at this point an intuitive idea of what a network namespace might offer. Network namespaces provide a brand-new network stack for all the processes within the namespace. That includes network interfaces, routing tables and iptables rules.

Network namespaces

From the system’s point of view, when creating a new process via clone() syscall, passing the flag CLONE_NEWNET will create a brand-new network namespace into the new process. From the user perspective, we simply use the tool ip (package is iproute2) to create a new persistent network namespace:

$ ip netns add ns1

This command will create a new network namespace called ns1. When the namespace is created, the ip command adds a bind mount point for it under /var/run/netns. This allows the namespace to persist even if there’s no process attached to it. To list the namespaces available in the system:

$ ls /var/run/netns

Or via ip:

$ ip netns

As previously said, a network namespace contains its own network resources: interfaces, routing tables, etc. Let’s add a loopback interface to ns1:

1 $ ip netns exec ns1 ip link set dev lo up
2 $ ip netns exec ns1 ping
3 PING ( 56(84) bytes of data.
4 64 bytes from icmp_seq=1 ttl=64 time=0.115 ms
  • Line 1 brings up the loopback interface inside the network namespace ns1.
  • Line 2 executes the command ping inside the network namespace.

An alternative syntax to bring up the loopback interface could be:

$ ip netns exec ns1 ifconfig lo up

However, I tend to use the command ip as it has become the preferred networking tool in Linux, obsoleting the old but more familiar commands ifconfig, route, etc. Notice that ip requires root privileges, so run it as root or prepend sudo.

A network namespace has its own routing table too:

$ ip netns exec ns1 ip route show

Which at this point returns nothing as we haven’t add any routing table rule yet. Generally speaking, any command run within a network namespace is prepend by the prologue:

$ ip netns exec <network-namespace>

A practical example

One of the consequences of network namespaces is that only one interface could be assigned to a namespace at a time. If the root namespace owns eth0, which provides access to the external world, only programs within the root namespace could reach the Internet. The solution is to communicate a namespace with the root namespace via a veth pair. A veth pair works like a patch cable, connecting two sides. It consists of two virtual interfaces, one of them is assigned to the root network namespace, while the other lives within a network namespace. Setting up their IP addresses and routing rules accordingly, plus enabling NAT in the host side, will be enough to provide Internet access to the network namespace.

Additionally, I feel like I need to make a clarification at this point. I’ve read in several articles about network namespaces that physical device interfaces can only live in the root namespace. At least that’s not the case with my current kernel (Linux 3.13). I can assign eth0 to a namespace other than the root and when setting it up properly have Internet access from the namespace. However, the limitation of one interface living only in one single namespace at a time still applies, and that’s a reason powerful enough to need connecting network namespace via a veth pair.

To start with, let’s create a new network namespace called ns1.

# Remove namespace if it exists.
ip netns del ns1 &>/dev/null

# Create namespace
ip netns add ns1

Next, create a veth pair. Interface v-eth1 will remain inside the root network namespace, while its peer, v-peer1, will be moved to the ns1 namespace.

# Create veth link.
ip link add v-eth1 type veth peer name v-peer1

# Add peer-1 to NS.
ip link set v-peer1 netns ns1

Next, setup IPv4 addresses for both interfaces and bring them up.

# Setup IP address of v-eth1.
ip addr add dev v-eth1
ip link set v-eth1 up

# Setup IP address of v-peer1.
ip netns exec ns1 ip addr add dev v-peer1
ip netns exec ns1 ip link set v-peer1 up
ip netns exec ns1 ip link set lo up

Additionally I brought up the loopback interface inside ns1.

Now it’s necessary we make all external traffic leaving ns1 to go through v-eth1.

ip netns exec ns1 ip route add default via

However this won’t be enough. As with any host sharing Its internet connection, it’s necessary to enable IPv4 forwarding in the host and enable masquerading.

# Share internet access between host and NS.

# Enable IP-forwarding.
echo 1 > /proc/sys/net/ipv4/ip_forward

# Flush forward rules, policy DROP by default.
iptables -P FORWARD DROP
iptables -F FORWARD

# Flush nat rules.
iptables -t nat -F

# Enable masquerading of
iptables -t nat -A POSTROUTING -s -o eth0 -j MASQUERADE

# Allow forwarding between eth0 and v-eth1.
iptables -A FORWARD -i eth0 -o v-eth1 -j ACCEPT
iptables -A FORWARD -o eth0 -i v-eth1 -j ACCEPT

If everything went fine, it would be possible to ping an external host from ns1.

$ ip netns exec ns1 ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=50 time=48.5 ms
64 bytes from icmp_seq=2 ttl=50 time=50.8 ms

This is how the routing table inside ns1 would look like after the setup:

$ ip netns exec ns1 ip route sh
default via dev v-peer1 dev v-peer1  proto kernel  scope link  src 

Prepending the ip netns exec prologue for every command to run from the namespace might be a bit tedious. Once the most basic features inside the namespace are setup, a more interesting possibility is to run a bash shell and attach it to the network namespace:

$ ip netns exec ns1 /bin/bash --rcfile <(echo "PS1=\"namespace ns1> \"")
namespace ns1> ping
PING ( 56(84) bytes of data.
64 bytes from ( icmp_seq=1 ttl=58 time=17.6 ms

Type exit to leave end the bash process and leave the network namespace.


Network namespaces, as well as other containerization technologies provided by the Linux kernel, are a lightweight mechanism for resource isolation. Processes attached to a network namespace see their own network stack, while not interfering with the rest of the system’s network stack.

Network namespaces are easy to use too. A similar network-level isolation could have been set up using a VM. However, that seems a much more expensive solution in terms of system resources and time investment to build up such environment. If you only need process isolation at the networking level, network namespaces are definitively something to consider.

The full script is available as a GitHub gist at:

More readings

April 10, 2016 09:30 PM

April 06, 2016

Víctor Jáquez

gstreamer-vaapi 1.8: the codec split

On march 23th GStreamer 1.8 was released, along with all its bundled modules, and, of course, one of those modules is gstreamer-vaapi.

Let us talk about this gstreamer-vaapi release, since there are several sweets!

First thing to notice is that the encoders have been renamed. Before they followed the pattern vaapiencode_{codec}, now they follow the pattern vaapi{codec}enc. The purpose of this change is twofold: to fix the plugins gtk-docs and to keep the usual element names in GStreamer. The conversion table is this:

Old New
vaapiencode_h264 vaapih264enc
vaapiencode_h265 vaapih265enc
vaapiencode_mpeg2 vaapimpeg2enc
vaapiencode_jpeg vaapijpegenc
vaapiencode_vp8 vaapivp8enc

But those were not the only name changes, we also have split the vaapidecode. Now we have a vaapijpegdec, which only decodes JPEG images, while keeping the old vaapidecode for video decoding. Also, vaapijpegdec was demoted to a marginal rank, because there are some problems in the Intel VA driver (which is the only one which supports JPEG decoding right now).

Note that in future releases, all the decoders will be split by codec, just as we did the JPEG decoder now; but first, we need to modify vaapidecodebin to choose a decoder in run-time based on the negotiated caps.

Please, update your scripts and code accordingly.

There are a ton of enhancements and optimizations too. Let me enumerate some of them: Vineeth TM fixed several memory leaks, and some compilations issues; Tim enabled vaapisink to send unhandled keyboard or mouse events to the application, making the usage of apps like gst-play-1.0 or apps based on GstPlayer be more natural; Thiago fixed the h264/h265 parsers, meanwhile Sree fixed the vp9 and the h265 ones too; Scott also fixed the h265 parser; et cetera. As may you see, H265/HEVC parser has been very active lately, it is the new thing!

I have to thank Sebastian Dröge, he did all the release work and also fixed a couple compilation issues.

This is the short log summary since 1.6:

 2  Lim Siew Hoon
 1  Scott D Phillips
 8  Sebastian Dröge
 3  Sreerenj Balachandran
 5  Thiago Santos
 1  Tim-Philipp Müller
 8  Vineeth TM
16  Víctor Manuel Jáquez Leal

by vjaquez at April 06, 2016 10:47 AM

March 31, 2016

Jacobo Aragunde

A new PhpReport for 2016

It’s been three years without any new release of our venerable time tracking tool, PhpReport. It doesn’t mean the project has been still during all this time; despite the slower development pace, you will find more than 80 patches in the repository since the last release, which account for 45 fixes or small features in the project.

It’s time to gather them all in a new release, PhpReport 2.16. These are the highlights:

CSV export

csv export button

All reports have now an option to export data to CSV, so they can be imported into other software. This was made with spreadsheets in mind, and I can confirm it works perfectly with our dear LibreOffice Calc.

Quick-access buttons

Quick access date form

Most reports got quick-access buttons added for the most common time periods: current and previous week, current month and year, etc.

Smarter “copy from yesterday”

The “copy from date” feature allows to copy tasks from a certain date into the current one. Its main use case is to copy everything from your previous work day, because you probably have been doing the same work, more or less during the same timetable… You can conveniently copy the tasks and just add some tweaks. The default date to copy from used to be the previous day, but you don’t usually want to copy your tasks from Sunday to Monday! Now it defaults to the previous date you have worked.


Addressed several slightly annoying behaviors to make the application more enjoyable to use. For example, when you create a project, the grid will scroll automatically to the newly added project so you can keep editing the project attributes. A similar thing happens when you have some row selected in the accumulated hours report and you load a new set of dates: the previously selected row will be kept selected and visible. Opening the project details from a report now pops up a new window so your report remains untouched when you go back to it. Inexplicably small grids now use all the screen space, and we increased the consistency among the different edition screens.

Moved to GitHub

The project sources, releases and the information previously available in the project website has been moved to GitHub, so any potential contributors will find a familiar environment if they happen to be interested in PhpReport.


I’ve bumped the version number from 2.1 straight to 2.16. The intention is to make one release per year, to guarantee some predictability regarding how and when the changes in the repository will arrive to users. You can expect PhpReport 2.17 releasing in March next year; it may look like a long time but I think it’s reasonable for a project with a low level of activity.

Looking for help!

Igalia has recently announced open positions in our Coding Experience program, and one of them is aimed to work in PhpReport. We have many ideas to improve this tool to make it more productive and powerful. Check the conditions of the program in the website if you are interested!

by Jacobo Aragunde Pérez at March 31, 2016 07:39 PM

Michael Catanzaro

Positive progress on WebKitGTK+ security updates

I previously reported that, although WebKitGTK+ releases regular upstream security updates, most Linux distributions are not taking the updates. At the time, only Arch Linux and Fedora were reliably releasing our security updates. So I’m quite pleased that openSUSE recently released a WebKitGTK+ security update, and then Mageia did too. Gentoo currently has an update in the works. It remains to be seen if these distros regularly follow up on updates (expect a follow-up post on this in a few months), but, optimistically, you now have several independent distros to choose from to get an updated version WebKitGTK+, plus any distros that regularly receive updates directly from these distros.

Unfortunately, not all is well yet. It’s still not safe to use WebKitGTK+ on the latest releases of Debian or Ubuntu, or on derivatives like Linux Mint, elementary OS, or Raspbian. (Raspbian is notable because it uses an ancient, insecure version of Epiphany as its default web browser, and Raspberry Pis are kind of popular.)

And of course, no distribution has been able to get rid of old, insecure WebKitGTK+ 2.4 compatibility packages, so many applications on distributions that do provide security updates for modern WebKitGTK+ will still be insecure. (Don’t be fooled by the recent WebKitGTK+ 2.4.10 update; it contains only a few security fixes that were easy to backport, and was spurred by the need to add GTK+ 3.20 compatibility. It is still not safe to use.) Nor have distributions managed to remove QtWebKit, which is also old and insecure. You still need to check individual applications to see if they are running safe versions of WebKit.

But at least there are now several distros providing WebKitGTK+ security updates. That’s good.

Special thanks to Apple and to my colleagues at Igalia for their work on the security advisories that motivate these updates.

by Michael Catanzaro at March 31, 2016 03:00 AM

Epiphany 3.20

So, what’s new in Epiphany 3.20?

First off: overlay scrollbars. Because web sites have the ability to style their scrollbars (which you’ve probably noticed on Google sites), WebKit embedders cannot use a normal GtkScrolledWindow to display content; instead, WebKit has to paint the scrollbars itself. Hence, when overlay scrollbars appeared in GTK+ 3.16, WebKit applications were left out. Carlos García Campos spent some time to work on this, and the result speaks for itself (if you fullscreen this video to see it properly):

Overlay scrollbars did not actually require any changes in Epiphany itself — all applications using an up-to-date version of WebKit will immediately benefit — but I mention it here as it’s one of the most noticeable changes. Read about other WebKit improvements, like the new Faster Than Light FTL/B3 JavaScript compilation tier, on Carlos’s blog.

Next up, there is a new downloads manager, also by Carlos García Campos. This replaces the old downloads bar that used to appear at the bottom of the screen:

Screenshot of the new downloads manager in Epiphany 3.20.

I flipped the switch in Epiphany to enable WebGL:

If you watched that video in fullscreen, you might have noticed that page is marked as insecure, even though it doesn’t use HTTPS. Like most browsers, we used to have several confusing security states. Pages with mixed content received a security warning that all users ignored, but pages with no security at all received no such warning. That’s pretty dumb, which is why Firefox and Chrome have been talking about changing this for a year or so now. I went ahead and implemented it. We now have exactly two security states: secure and insecure. If your page loads any content not over HTTPS, it will be marked as insecure. The vast majority of pages will be displayed as insecure, but it’s no less than such sites deserve. I’m not concerned at all about “warning fatigue,” because users are not generally expected to take any action on seeing these warnings. In the future, we will take this further, and use the insecure indicator for sites that use SHA-1 certificates.

Moving on. By popular request, I exposed the previously-hidden setting to disable session restore in the preferences dialog, as “Remember previous tabs on startup:”

Screenshot of the preferences dialog, with the new

Meanwhile, Carlos worked in both WebKit and Epiphany to greatly improve session restoration. Previously, Epiphany would save the URLs of the pages loaded in each tab, and when started it would load each URL in a new tab, but you wouldn’t have any history for those tabs, for example, and the state of the tab would otherwise be lost. Carlos worked on serializing the WebKit session state and exposing it in the WebKitGTK+ API, allowing us to restore full back/forward history for each tab, plus details like your scroll position on each tab. Thanks to Carlos, we also now make use of this functionality when reopening closed tabs, so your reopened tab will have a full back/forward list of history, and also when opening new tabs, so the new tab will inherit the history of the tab it was opened from (a feature that we had in the past, but lost when we switched to WebKit2).

Interestingly, we found the session restoration was at first too good: it would restore the page really exactly as you last viewed it, without refreshing the content at all. This means that if, for example, you were viewing a page in Bugzilla, then when starting the browser, you would miss any new comments from the last time you loaded the page until you refresh the page manually. This is actually the current behavior in Safari; it’s desirable on iOS to make the browser launch instantly, but questionable for desktop Safari. Carlos decided to always refresh the page content when restoring the session for WebKitGTK+.

Last, and perhaps least, there’s a new empty state displayed for new users, developed by Lorenzo Tilve and polished up by me, so that we don’t greet new users with a completely empty overview (where your most-visited sites are normally displayed):

Empty State

That, plus a bundle of the usual bugfixes, significant code cleanups, and internal architectual improvements (e.g. I converted the communication between the UI process and the web process extension to use private D-Bus connections instead of the session bus). The best things have not changed: it still starts up about 5-20 times faster than Firefox in my unscientific testing; I expect you’ll find similar results.


by Michael Catanzaro at March 31, 2016 01:54 AM

March 24, 2016

Andy Wingo

a simple (local) solution to the pay gap

International Working Women's Day was earlier this month, a day that reminds the world how far it has yet to go to achieve just treatment of women in the workplace. Obviously there are many fronts on which to fight to dismantle patriarchy, and also cissexism, and also transphobia, and also racism, and sometimes it gets a bit overwhelming just to think of a world where people treat each other right.

Against this backdrop, it's surprising that some policies are rarely mentioned by people working on social change. This article is about one of them -- a simple local change that can eliminate the pay gap across all axes of unfair privilege.


OK here it is: just pay everyone in a company the same hourly wage.

That's it!

on simple, on easy

But, you say, that's impossible!

Rich Hickey has this famous talk where he describes one thing as simple and the other as easy. In his narrative, simple is good but hard, and easy is bad but, you know, easy. I enjoy this talk because it's easy (hah!) to just call one thing simple and the other easy and it's codewords for good and bad, and you come across as having the facile prestidigitatory wisdom of a Malcolm Gladwell.

As far as simple, the substance of equal pay is as simple as it gets. And as far as practical implementation goes, it only needs buy-in from one person: your boss could do it tomorrow.

But, you say, a real business would never do this! This is getting closer to the real issues, but not there yet. There are plenty of instances of real businesses that do this. Incidentally, mine is one of them! I do not intend this to be an advertisement for my company, but I have to mention this early because society does its best to implant inside our brains the ideas that certain ideas are possible and certain others are not.

But, you say, this would be terrible for business! Here I think we are almost there. There's a question underneath, if we can manage to phrase it in a more scientific way -- I mean, the ideal sense in which science is a practice of humankind in which we use our limited powers to seek truth, with hypotheses but without prejudice. It might sound a bit pompous to invoke capital-S Science here, but I think few conversations of this kind try to honestly even consider existence proofs in the form of already-existing histories (like the company I work for), much less an unbiased study of the implications of modelling the future on those histories.

Let's assume that you and I want to work for justice, and in this more perfect world, men and women and nonbinary people will have equal pay for equal work, as will all people that lie on all axes of privilege that currently operate in society. If you are with me up to here: great. If not, we don't share a premise so it's not much use to go farther. You can probably skip to the next article in your reading list.

So, then, the questions: first of all, would a flat equal wage within a company actually help people in marginalized groups? What changes would happen to a company if it enacted a flat wage tomorrow? What are its limitations? How could this change come about?

would it help?

Let's take the most basic question first. How would this measure affect people in marginalized groups?

Let us assume that salaries are distributed inversely: the higher salaries are made by fewer people. A lower salary corresponds to more people. So firstly, we are in a situation where the median salary is less than the mean: that if we switched to pay everyone the mean, then most people would see an increase in their salary.

Assuming that marginalized people were evenly placed in a company, that would mean that most would benefit. But we know that is not the case: "marginalized" is the operative term. People are categorized at a lower point than their abilities; people's climb of the organizational hierarchy (and to higher salaries) is hindered by harassment, by undervalued diversity work, and by external structural factors, like institutionalized racism or the burden of having to go through a gender transition. So probably, even if a company touts equal pay within job classifications, the job classifications themselves unfairly put marginalized people lower than white dudes like me. So, proportionally marginalized people would benefit from an equal wage more than most.

Already this plan is looking pretty good: more money going to marginalized people is a necessary step to bootstrap a more just world.

All that said, many (but not most) people from marginalized groups will earn more than the mean. What for them? Some will decide that paying for a more just company as a whole is worth a salary reduction. (Incidentally, this applies to everyone: everyone has their price for justice. It might be 0.1%, it might be 5%, it might be 50%.)

Some, though, will decide it is not worth paying. They will go work elsewhere, probably for even more money (changing jobs being the best general way to advance your salary). I don't blame marginalized folks for getting all they can: more power to them.

From what I can tell, things are looking especially good for marginalized people under a local equal-wage initiative. Not perfect, not in all cases, but generally better.

won't someone think of the dudes

I don't believe in value as a zero-sum proposition: there are many ways in which a more fair world could be more productive, too. But in the short term, a balance sheet must balance. Salary increases in the bottom will come from salary decreases from the top, and the dudebro is top in tech.

We should first note that many and possibly most white men will see their wages increase under a flat-wage scheme, as most people earn below the mean.

Secondly, some men will be willing to pay for justice in the form of equal pay for equal work. An eloquent sales pitch as to what they are buying will help.

Some men would like to pay but have other obligations that a "mean" salary just can't even. Welp, there are lots of jobs out there. We'll give you a glowing recommendation :)

Finally there will be dudes that are fine with the pay gap. Maybe they have some sort of techno-libertarian justification? Oh well! They will find other jobs. As someone who cares about justice, you don't really want to work with these people anyway. Call it "bad culture fit", and treat it as a great policy to improve the composition of your organization.

an aside: what are we here for anyway?

A frequent objection to workplace change comes in the form of a pandering explanation of what companies are for, that corporations are legally obligated to always proceed along the the most profitable path.

I always find it extraordinarily ignorant to hear this parroted by people in tech: it's literally part of the CS canon to learn about the limitations of hill-climbing as an optimization strategy. But on the other hand, I do understand; the power of just-so neoliberal narrative is immense, filling your mind with pat explanations, cooling off your brain into a poorly annealed solid mass.

The funny thing about corporate determinism that it's not even true. Folks who say this have rarely run companies, otherwise they should know better. Loads of corporate decisions are made with a most tenuous link on profitability, and some that probably even go against the profit interest. It's always easy to go in a known-profitable direction, but that doesn't mean it's the only way to go, nor that all the profitable directions are known.

Sometimes this question is framed in the language of "what MyDesignCo really cares about is good design; we're worried about how this measure might affect our output". I respect this question more, because it's more materialist (you can actually answer the question!), but I disagree with the premise. I don't think any company really cares about the product in a significant way. Take the design company as an example. What do you want on your tombstone: "She made good advertisements"??? Don't get me wrong, I like my craft, and I enjoy practicing it with my colleagues. But if on my tombstone they wrote "He worked for justice", and also if there were a heaven, I would be p OK with that. What I'm saying is, you start a company, you have an initial idea, you pivot, whatever, it doesn't matter in the end. What matters is you relationship with life on the planet, and that is the criteria you should use to evaluate what you do.

Beyond all that -- it's amazing how much wrong you can wrap up in a snarky hacker news one-liner -- beyond all that, the concern begs the question by assuming that a flat-wage arrangement is less profitable. People will mention any down-side they can but never an up-side.

possible flat-wage up-sides from a corporate perspective

With that in mind, let's consider some ways that a flat wage can actually improve the commercial fate of a company.

A company with a flat wage already has a marketing point that they can use to attract people that care about this sort of thing. It can make your company stand out from the crowd and attract good people.

The people you attract will know you're doing the flat-wage thing, and so will be predisposed to want to work together. This can increase productivity. It also eliminates some material sources of conflict between different roles in an organization. You would still need "human resources" people but they would need to spend less time on mitigating the natural money-based conflicts that exist in other organizations.

Another positive side relates to the ability of the company to make collective sacrifices. For example a company that is going through harder times can collectively decide not to raise wages or even to lower them, rather than fire people. Obviously this outcome depends on the degree to which people feel responsible for the organization, which is incomplete without a feeling of collective self-management as in a cooperative, but even in a hierarchical organization these effects can be felt.

Incidentally a feeling of "investment" in the organization is another plus. When you work in a company in which compensation depends on random factors that you can't see, you always wonder if you're being cheated out of your true value. If everyone is being paid the same you know that everyone's interest in improving company revenue is aligned with their own salary interest -- you can't gain by screwing someone else over.

limitations of a flat wage at improving justice

All that said, paying all workers/partners/employees the same hourly wage is not a panacea for justice. It won't dismantle patriarchy overnight. It won't stop domestic violence, and it won't stop the cops from killing people of color. It won't stop microagressions or harassment in the workplace, and in some ways if there are feelings of resentment, it could even exacerbate them. It won't arrest attrition of marginalized people from the tech industry, and it won't fix hiring. Enacting the policy in a company won't fix the industry as a whole, even if all companies enacted it, as you would still have different wages at different companies. It won't fix the situation outside of the tech industry; a particularly egregious example being that in almost all places, cleaning staff are hired via subcontracts and not as employees. And finally, it won't resolve class conflict at work: the owner still owns. There are still pressures on the owner to keep the whole balance sheet secret, even if the human resources side of things is transparent.

All that said, these are mainly ways in which an equal wage policy is incomplete. A step in the right direction, on a justice level, but incomplete. In practice though the objections you get will be less related to justice and more commercial in nature. Let's take a look at some of them.

commercial challenges to a flat wage

Having everyone paid the same makes it extraordinarily difficult to hire people that are used to being paid on commission, like sales people. Sales people drive Rolexes and wear Mercedes. It is very, very tough to hire good sales people on salary. At my work we have had some limited success hiring, and some success growing technical folks into sales roles, but this compensation package will hinder your efforts to build and/or keep your sales team.

On the other hand, having the same compensation between sales and engineering does eliminate some of the usual sales-vs-product conflicts of interest.

Another point it that if you institute a flat-wage policy, you will expect to lose some fraction of your highly-skilled workers, as many of these are more highly paid. There are again some mitigations but it's still a reality. Perhaps more perniciously, you will have greater difficulties hiring senior people: you literally can't get into a bidding war with a competitor over a potential hire.

On the flip side, a flat salary can make it difficult to hire more junior positions. There are many theories here but I think that a company is healthy when it has a mix of experiences, that senior folks and junior folks bring different things to the table. But if your flat wage is higher than the standard junior wage, then your potential junior hires are now competing against more senior people -- internally it will be hard to keep a balance between different experiences.

Indeed junior workers that you already have are now competing at their wage level with potential hires that might be more qualified in some way. An unscrupulous management could fire those junior staff members and replace them with more senior candidates. An equal wage policy does not solve internal class conflicts; you need to have equal ownership and some form of workplace democracy for that.

You could sort people into pay grades, but in many ways this would formalize injustice. Marginalized people are by definition not equally distributed across pay grades.

Having a flat wage also removes a standard form of motivation, that your wage is always rising as you get older. It could be that after 5 years in a job, maybe your wages went up because the company's revenues went up, but they're still the same as a new hire's -- how do you feel about that? It's a tough question. I think an ever-rising wage has a lot of negative aspects, including decreasing the employability of older workers, but it's deeply rooted in tech culture at least.

Another point is motivation of people within the same cadre. Some people are motivated by bonuses, by performing relatively well compared to their peers. This wouldn't be an option in an organization with a purely flat wage. Does it matter? I do not know.

work with me tho

As the prophet Pratchett said, "against one perfect moment, the centuries beat in vain". There are some definite advantages to a flat wage within a company: it's concrete, it can be immediately enacted, it solves some immediate problems in a local way. Its commercial impact is unclear, but the force of narrative can bowl over many concerns in that department: what's important is to do the right thing. Everybody knows that!

As far as implementation, I see three-and-a-half ways this could happen in a company.

The first is that equal pay could be a founding principle of the company. This was mostly the case in the company I work for (and operate, and co-own equally with the other 40 or so partners). I wasn't a founder of the company, and the precise set of principles and policies has changed over the 15 years of the company's life, but it's more obvious for this arrangement to continue from a beginning than to change from the normal pay situation.

The second is, the change could come from the top down. Some CEOs get random brain waves and this happens. In this case, the change is super-easy to make: you proclaim the thing and it's done. As a person who has had to deal with cash-flow and payroll and balance sheets, I can tell you that this considerably simplifies HR from a management perspective.

The third is via collective action. This only works if workers are able to organize and can be convinced to be interested in justice in this specific way. In some companies, a worker's body might simply be able to negotiate this with management -- e.g., we try it out for 6 months and see. In most others you'd probably need to unionize and strike.

Finally, if this practice were more wider-spread in a sector, it could be that it just becomes "best practice" in some way -- that company management could be shamed into doing it, or it could just be the way things are done.


Many of these points are probably best enacted in the context of a worker-owned cooperative, where you can do away with the worker-owner conflict at the same time. But still, they are worth thinking of in a broader context, and worth evaluating in the degree to which they work for (or against) justice in the workplace. But enough blathering from me today :) Happy hacking!

by Andy Wingo at March 24, 2016 09:49 PM

March 22, 2016

Carlos García Campos

WebKitGTK+ 2.12

We did it again, the Igalia WebKit team is pleased to announce a new stable release of WebKitGTK+, with a bunch of bugs fixed, some new API bits and many other improvements. I’m going to talk here about some of the most important changes, but as usual you have more information in the NEWS file.


FTL JIT is a JavaScriptCore optimizing compiler that was developed using LLVM to do low-level optimizations. It’s been used by the Mac port since 2014 but we hadn’t been able to use it because it required some patches for LLVM to work on x86-64 that were not included in any official LLVM release, and there were also some crashes that only happened in Linux. At the beginning of this release cycle we already had LLVM 3.7 with all the required patches and the crashes had been fixed as well, so we finally enabled FTL for the GTK+ port. But in the middle of the release cycle Apple surprised us announcing that they had the new FTL B3 backend ready. B3 replaces LLVM and it’s entirely developed inside WebKit, so it doesn’t require any external dependency. JavaScriptCore developers quickly managed to make B3 work on Linux based ports and we decided to switch to B3 as soon as possible to avoid making a new release with LLVM to remove it in the next one. I’m not going to enter into the technical details of FTL and B3, because they are very well documented and it’s probably too boring for most of the people, the key point is that it improves the overall JavaScript performance in terms of speed.

Persistent GLib main loop sources

Another performance improvement introduced in WebKitGTK+ 2.12 has to do with main loop sources. WebKitGTK+ makes an extensive use the GLib main loop, it has its own RunLoop abstraction on top of GLib main loop that is used by all secondary processes and most of the secondary threads as well, scheduling main loop sources to send tasks between threads. JavaScript timers, animations, multimedia, the garbage collector, and many other features are based on scheduling main loop sources. In most of the cases we are actually scheduling the same callback all the time, but creating and destroying the GSource each time. We realized that creating and destroying main loop sources caused an overhead with an important impact in the performance. In WebKitGTK+ 2.12 all main loop sources were replaced by persistent sources, which are normal GSources that are never destroyed (unless they are not going to be scheduled anymore). We simply use the GSource ready time to make them active/inactive when we want to schedule/stop them.

Overlay scrollbars

GNOME designers have requested us to implement overlay scrollbars since they were introduced in GTK+, because WebKitGTK+ based applications didn’t look consistent with all other GTK+ applications. Since WebKit2, the web view is no longer a GtkScrollable, but it’s scrollable by itself using native scrollbars appearance or the one defined in the CSS. This means we have our own scrollbars implementation that we try to render as close as possible to the native ones, and that’s why it took us so long to find the time to implement overlay scrollbars. But WebKitGTK+ 2.12 finally implements them and are, of course, enabled by default. There’s no API to disable them, but we honor the GTK_OVERLAY_SCROLLING environment variable, so they can be disabled at runtime.

But the appearance was not the only thing that made our scrollbars inconsistent with the rest of the GTK+ applications, we also had a different behavior regarding the actions performed for mouse buttons, and some other bugs that are all fixed in 2.12.

The NetworkProcess is now mandatory

The network process was introduced in WebKitGTK+ since version 2.4 to be able to use multiple web processes. We had two different paths for loading resources depending on the process model being used. When using the shared secondary process model, resources were loaded by the web process directly, while when using the multiple web process model, the web processes sent the requests to the network process for being loaded. The maintenance of this two different paths was not easy, with some bugs happening only when using one model or the other, and also the network process gained features like the disk cache that were not available in the web process. In WebKitGTK+ 2.12 the non network process path has been removed, and the shared single process model has become the multiple web process model with a limit of 1. In practice it means that a single web process is still used, but the network happens in the network process.

NPAPI plugins in Wayland

I read it in many bug reports and mailing lists that NPAPI plugins will not be supported in wayland, so things like will not work. That’s not entirely true. NPAPI plugins can be windowed or windowless. Windowed plugins are those that use their own native window for rendering and handling events, implemented in X11 based systems using XEmbed protocol. Since Wayland doesn’t support XEmbed and doesn’t provide an alternative either, it’s true that windowed plugins will not be supported in Wayland. Windowless plugins don’t require any native window, they use the browser window for rendering and events are handled by the browser as well, using X11 drawable and X events in X11 based systems. So, it’s also true that windowless plugins having a UI will not be supported by Wayland either. However, not all windowless plugins have a UI, and there’s nothing X11 specific in the rest of the NPAPI plugins API, so there’s no reason why those can’t work in Wayland. And that’s exactly the case of, for example. In WebKitGTK+ 2.12 the X11 implementation of NPAPI plugins has been factored out, leaving the rest of the API implementation common and available to any window system used. That made it possible to support windowless NPAPI plugins with no UI in Wayland, and any other non X11 system, of course.


And as usual we have completed our API with some new additions:


by carlos garcia campos at March 22, 2016 10:36 AM

March 14, 2016

Javier Muñoz

Requester Pays Bucket goes upstream in Ceph

The last Requester Pays Buckets patches went upstream in Ceph some days ago. This new feature is available in the master branch now, and it will be part of the next Ceph Jewel release.

In S3, this feature is used to configure buckets in such a way that the user who request the contents will pay transfer fee.

Along this post I will introduce the feature in order to know how this concept maps to Ceph and how it works under the hood.

Understanding the feature

The Requester Pays Buckets feature originates in the Amazon S3 storage. It is part of the Amazon business model related to the Cloud storage.

In S3, the bucket owners pay for all Amazon S3 storage and data transfer costs associated with their buckets. This approach makes sense to cover the use cases where the bucket owners use the service to host/consume the content and/or they want to share the content with some authenticated users.

On the other hand, a relevant number of use cases use S3 to share a huge amount of content requiring some kind of option to balance the costs among the different content consumers in that bucket. This option is known as 'Requester Pays Buckets'

Bear in mind this feature becomes critical when the content is fairly popular and the bucket owner have many requests. This is the typical use case among global content distributors. In this case, the transfer fees may become a major issue.

Mapping the feature to Ceph

As mentioned, this feature comes from the S3 storage where it is used to balance costs related to data transfer in buckets. When enabled the requester pays for the data transfer and the request although the bucket owner still pays for the data storage.

The S3 algorithm also charges the bucket owner for the request under the following conditions:

  • The requester doesn't tag the request as 'Requester Pays Bucket'
  • The request authentication fails
  • The request is anonymous
  • The request is a SOAP request

Ceph does not implement any billing and account management service oriented to charge users so the feature can not be ported as it is.

In this point we made the decision to implement the mechanisms behind of this feature but keeping out the billing policies of Ceph. This way you can find the proper support to reproduce the original Amazon S3 billing behaviour although you will be free to wrap this support with different and more flexible billing policies if needed.

To keep the compatibility with the tools in the S3 ecosystem, the S3 interface of this feature is in place. The Boto library was used to test the proper behaviour.

In the backend, the usage logging was extended to accommodate the new feature. Now the usage logging records are not displayed by the bucket owner. They are listed by the bucket user where this user may be the owner or not.

This change required a new way to index the records in the usage logging although it doesn't break the compatibility with the previous Ceph versions. Bear in mind the old records are displayed in the new format.

There are two notable differences between the S3 and the RGW S3 algorithms. The S3 algorithm charges the bucket owner for the request if the requester doesn't tag the request as 'Requester Pays Bucket' or the request authentication fails. In the case of RGW S3 both cases are logged under the requester instead of the owner.


The Ceph RGW S3 REST API was extended to support the following use cases:

Ceph RGW S3 REST API implements the same behaviour and semantics as S3. It is needed to support the S3 tooling ecosystem in a transparent way.

Some examples with Python and Boto

You can use the following Python scripts to set, retrieve and download objects using the Requester Pays Buckets feature in Ceph. Those examples require Boto.

The usage log

The usage log is the place where the Requester Pays Bucket information is aggregated. There are three fields related with the feature ('user', 'owner' and 'payer').

The 'user' (or 'requester') is the client credential accessing the bucket content.

The 'owner' is the client credential creating the bucket.

The 'payer' is the client credential paying the data transfer. If this field doesn't exist the 'owner' is the client credential paying the data transfer.

The new virtual error buckets

One virtual error bucket is a new abstraction to log the usage on non existent resources (404 Not Found). All virtual error buckets have the same name ('-').

Having a look in the 'ops' and 'successful_ops' fields under the virtual error buckets, you will see the second one is always zero.

Each user has its own virtual error bucket to collect 404 errors. Ceph will add a virtual error bucket with the first 404 error available. The virtual error buckets live in the usage logging only.

Wrap up

With this new feature in place, Ceph implements the required support to know in detail who is accessing the RGW S3 buckets (owners vs authenticated users)

The feature brings in new ways to understand and track the bucket content in massive use cases where the costs may be assigned to different users.

The new usage logging also contains more detailed information to be used in regular reports to customers ('owner vs payer' categories, ops on virtual error buckets, etc)


My work in Ceph is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ceph development team. Thank you guys!

by Javier at March 14, 2016 11:00 PM

March 13, 2016

Michael Catanzaro

Do you trust this application?

Much of the software you use is riddled with security vulnerabilities. Anyone who reads Matthew Garrett knows that most proprietary software is a lost cause. Some Linux advocates claim that free software is more secure than proprietary software, but it’s an open secret that tons of popular desktop Linux applications have many known, unfixed vulnerabilities. I rarely see anybody discuss this, as if it’s taboo, but it’s been obvious to me for a long time.

Usually vulnerabilities go unreported simply because nobody cares to look. Here’s an easy game: pick any application that makes HTTP connections — anything stuck on an old version of WebKit is a good place to start — and look for the following basic vulnerabilities:

  • Failure to use TLS when required (GNOME Music, GNOME Weather; note these are the only apps I mention here that do not use WebKit). This means the application has no security.
  • Failure to perform TLS certificate verification (Shotwell and Pantheon Photos). This means the application has no security against active attackers.
  • Failure to perform TLS certificate verification on subresources (Midori and XombreroLiferea). As sites usually send JavaScript in subresources, this means active attackers can get total control of the page by changing the script, without being detected (update: provided JavaScript is enabled). (Regrettably, Epiphany prior to 3.14.0 was also affected by this issue.)
  • Failure to perform TLS certificate verification before sending HTTP headers (private Midori bugBanshee). This leaks secure cookies, usually allowing attackers full access to your user account on a website. It also leaks the page you’re visiting, which HTTPS is supposed to keep private. (Update: Regrettably, Epiphany prior to 3.14.0 was affected by this issue. Also, the WebKit 2 API in WebKitGTK+ prior to 2.6.6, CVE-2015-2330.)

Except where noted, the latest release of all of the applications listed above are still vulnerable at the time of this writing, even though almost all of these bugs were reported long ago. With the exception of Shotwell, nobody has fixed any of these issues. Perhaps nobody working on the project cares to fix it, or perhaps nobody working on the project has the time or expertise to fix it, or perhaps nobody is working on the project anymore at all. This is all common in free software.

In the case of Shotwell, the issue has been fixed in git, but it might never be released because nobody works on Shotwell anymore. I informed distributors of the Shotwell vulnerability three months ago via the GNOME distributor list, our official mechanism for communicating with distributions, and advised them to update to a git snapshot. Most distributions ignored it. This is completely typical; to my knowledge, the stable releases of all Linux distributions except Fedora are still vulnerable.

If you want to play the above game, it should be very easy for you to add to my list by checking only popular desktop software. A good place to start would be to check if Liferea or Xombrero (supposedly a security-focused browser) perform TLS certificate verification before sending HTTP headers, or if Banshee performs verification on subresources, on the principle that vulnerable applications probably have other related vulnerabilities. (I did not bother to check.)

On a related note, many applications use insecure dependencies. Tons of popular GTK+ applications are stuck on an old, deprecated version of WebKitGTK+, for example. Many popular KDE applications use QtWebKit, which is old and deprecated. These deprecated versions of WebKit suffer from well over 100 remote code execution vulnerabilities fixed upstream that will probably never be backported. (100 is a lowball estimate; I would be unsurprised if the real number for QtWebKit was much, much higher.)

I do not claim that proprietary software is generally more secure than free software, because that is absolutely not true. Proprietary software vendors, including big name corporations that you might think would know better, are still churning out consumer products based on QtWebKit, for example. (This is unethical, but most proprietary software vendors do not care about security.) Not that it matters too much, as proprietary software vendors rarely provide comprehensive security updates anyway. (If your Android phone still gets updates, guess what: they’re superficial.) A few prominent proprietary software vendors really do care about security and do good work to keep their users safe, but they are rare exceptions, not the rule.

It’s a shame we’re not able to do better with free software.

by Michael Catanzaro at March 13, 2016 01:39 AM

March 12, 2016

Michael Catanzaro

Do you trust this website?

TLS certificate validation errors are much less common on today’s Internet than they used to be, but you can still expect to run into them from time to time. Thanks to a decade of poor user interface decisions by web browsers (only very recently fixed in major browsers), users do not understand TLS and think it’s OK to bypass certificate warnings if they trust the site in question.

This is completely backwards. You should only bypass the warning if you do not trust the site.

The TLS certificate does not exist to state that the site is somehow trustworthy. It exists only to state that the site is the site you think it is: to ensure there is no man in the middle (MITM) attacker. If you are visiting and get a certificate validation error, that means that even though your browser is displaying the URL, there’s zero reason to believe you’re really visiting rather than an attack site. Your browser can tell the difference, and it’s warning you. (More often, the site is just broken, or “misconfigured” if you want to be generous, but you and your browser have no way to know that.)

If you do not trust the site in question (e.g. you do not have any user account on the site), then there is not actually any harm in bypassing the warning. You don’t trust the site, so you do not care if a MITM is changing the page, recording your passwords, sending fake data to the site in your name, or whatever else.

But if you do trust the site, this error is cause to freak out and not continue, because it gives you have strong reason to believe there is a MITM attacker. Once you click continue, you should assume the MITM has total control over your interaction with the trusted website.

I will pick on Midori for an example of how bad design can confuse users:

The button label reads
The button label reads “Trust this website,” but it should read “I do not trust this website.”

As you can see from the label, Midori has this very wrong. Users are misled into continuing if they trust the website: the very situation in which it is unsafe to continue.

Firefox and Chrome handle this much better nowadays, but not perfectly. Firefox says “Your connection is not secure” while Chrome says “Your connection is not private.” It would be better to say: “This doesn’t look like the real”

by Michael Catanzaro at March 12, 2016 10:55 PM

February 29, 2016

Javier Muñoz

AWS Signature Version 4 goes upstream in Ceph

The first stable AWS4 implementation in Ceph went upstream some days ago. It is now available in the master branch and it will ship with the next Ceph release Jewel as planned.

I will use this blog post to talk about this new feature shipping in Ceph Jewel and the current effort by Outscale and Igalia to raise the level of compatibility between the Ceph RGW S3 and Amazon S3 interfaces.

In detail, I will describe the signing process in AWS4, how it works in Ceph RGW, the current coverage and the next steps in the pipeline around this authentication algorithm.

S3 request authentication algorithms

If you are not familiar with request authentication, regions, endpoints, credential scopes, etc. in Amazon S3 you could want to read one of my last posts about this stuff. It offers a simple and quick overview of this stuff while introducing the concepts and terms I will use in this blog post. A more long and low-level technical reading is available in the AWS documentation too. I will use this last one to drive/compare the implementations in an reasonable level for everybody.

Amazon S3 provides storage through web services interfaces (REST, SOAP and BitTorrent). By the way, Ceph RGW implements a compatible S3 REST interface in order to be interoperable with the Amazon S3 REST ecosystem (tools, libraries, third-party services and so on).

This S3 REST interface works over the Hypertext Transfer Protocol (HTTP) with the same HTTP verbs (GET, POST, PUT, DELETE, etc) that web browsers use to retrieve web pages and to send data to remote servers.

There are two kind of Amazon S3 RESTful interactions: authenticated and anonymous. The way to implement request authentication is signing these requests or interactions using an authentication algorithm. In the Amazon signing process' public specification there are two authentication algorithms currently in use: AWS2 and AWS4.

AWS2 and AWS4 in Ceph

The new Signature Version 4 (AWS4) is the current AWS signing protocol. It improves the previous Signature Version 2 (AWS2) significantly. Take into consideration these algorithm strenghts of AWS4 over AWS2

  • To sign a message, the signer use a signing key that is derived from her secret access key rather than using the secret access key itself
  • The signer derives the signing key from the credential scope, which means that she doesn't need to include the key itself in the request
  • Each signing task requires to use the credential scope

The benefits of using AWS4 in Ceph are clear:

  • Verification of the identify of the requester via access key ID and secret access key
  • Request tampering prevention while the request is in transit
  • Replay attacks protection within 15 minutes of the timestamp in the request

Authentication methods

The signing process can express authentication information by using one of the following methods:

  • HTTP Authorization header. The most common method of authenticating. The signature calculations vary depending on the method you choose to transfer the request payload; 'transfer payload in a single chunk' vs 'transfer payload in multiple chunks (chunked upload)'
  • Query string parameters. It uses a query string to express a request entirely in an URL together with the authorization information. This type of URL is also known as a presigned URL.

The current Ceph AWS4 implementation supports all authentication methods but transfering payload in multiple chunks (chunked upload). It is in the pipeline though.

Lacking chunked upload does not impact the Ceph RGW performance. The server side always use a streaming-hash approach to compute the signature.

Computing a Signature

The idea behind of computing a signature is using a cryptographic hash function over the request, and then use the hash value, some other values from the request, and a secret access key to create a signed hash. That is the signature.

Depending on the kind of authentication method used and the concrete request the algorithm requires different inputs. As one example to illustrate one of the authentication paths we can explore the required steps to craft a signature in the HTTP Authorization header case.

As you can see it computes a canonical request, a string to sign and a signing key as part of the process.

The final signature is the result of hashing the signing key and the string to sign. The keyed-hash message authentication code used along the signature computation is HMAC-SHA256

Default configuration in Ceph Jewel

Ceph Jewel is planned to ship with AWS2 and AWS4 enabled by default. You will not need to configure any extra switch to authenticate with AWS2 or AWS4.

Region constraints

In Amazon S3 the region enforces the allowed authentication algorithms.

In the case of Ceph RGW the code doesn't implement any kind of constraint related to the region names.

The next steps in the pipeline

The chunked upload feature to transfer the payload in multiple chunks is part of the pipeline definitely.

Some kind of integration with zones/regions to provide 'signature binding' could make sense too. It would help to enforce auth policies and so on.


My work in Ceph is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ceph development team. Thank you guys!

by Javier at February 29, 2016 11:00 PM

February 26, 2016

Xabier Rodríguez Calvar

Über latest Media Source Extensions improvements in WebKit with GStreamer

In this post I am going to talk about the implementation of the Media Source Extensions (known as MSE) in the WebKit ports that use GStreamer. These ports are WebKitGTK+, WebKitEFL and WebKitForWayland, though only the latter has the latest work-in-progress implementation. Of course we hope to upstream WebKitForWayland soon and with it, this backend for MSE and the one for EME.

My colleague Enrique at Igalia wrote a post about this about a week ago. I recommend you read it before continuing with mine to understand the general picture and the some of the issues that I managed to fix on that implementation. Come on, go and read it, I’ll wait.

One of the challenges here is something a bit unnatural in the GStreamer world. We have to process the stream information and then make some metadata available to the JavaScript app before playing instead of just pushing everything to a playing pipeline and being happy. For this we created the AppendPipeline, which processes the data and extracts that information and keeps it under control for the playback later.

The idea of the our AppendPipeline is to put a data stream into it and get it processed at the other side. It has an appsrc, a demuxer (qtdemux currently

) and an appsink to pick up the processed data. Something tricky of the spec is that when you append data into the SourceBuffer, that operation has to block it and prevent with errors any other append operation while the current is ongoing, and when it finishes, signal it. Our main issue with this is that the the appends can contain any amount of data from headers and buffers to only headers or just partial headers. Basically, the information can be partial.

First I’ll present again Enrique’s AppendPipeline internal state diagram:

First let me explain the easiest case, which is headers and buffers being appended. As soon as the process is triggered, we move from Not started to Ongoing, then as the headers are processed we get the pads at the demuxer and begin to receive buffers, which makes us move to Sampling. Then we have to detect that the operation has ended and move to Last sample and then again to Not started. If we have received only headers we will not move to Sampling cause we will not receive any buffers but we still have to detect this situation and be able to move to Data starve and then again to Not started.

Our first approach was using two different timeouts, one to detect that we should move from Ongoing to Data starve if we did not receive any buffer and another to move from Sampling to Last sample if we stopped receiving buffers. This solution worked but it was a bit racy and we tried to find a less error prone solution.

We tried then to use custom downstream events injected from the source and at the moment they were received at the sink we could move from Sampling to Last sample or if only headers were injected, the pads were created and we could move from Ongoing to Data starve. It took some time and several iterations to fine tune this but we managed to solve almost all cases but one, which was receiving only partial headers and no buffers.

If the demuxer received partial headers and no buffers it stalled and we were not receiving any pads or any event at the output so we could not tell when the append operation had ended. Tim-Philipp gave me the idea of using the need-data signal on the source that would be fired when the demuxer ran out of useful data. I realized then that the events were not needed anymore and that we could handle all with that signal.

The need-signal is fired sometimes when the pipeline is linked and also when the the demuxer finishes processing data, regardless the stream contains partial headers, complete headers or headers and buffers. It works perfectly once we are able to disregard that first signal we receive sometimes. To solve that we just ensure that at least one buffer left the appsrc with a pad probe so if we receive the signal before any buffer was detected at the probe, it shall be disregarded to consider that the append has finished. Otherwise, if we have seen already a buffer at the probe we can consider already than any need-data signal means that the processing has ended and we can tell the JavaScript app that the append process has ended.

Both need-data signal and probe information come in GStreamer internal threads so we could use mutexes to overcome any race conditions. We thought though that deferring the operations to the main thread through the pipeline bus was a better idea that would create less issues with race conditions or deadlocks.

To finish I prefer to give some good news about performance. We use mainly the YouTube conformance tests to ensure our implementation works and I can proudly say that these changes reduced the time of execution in half!

That’s all folks!

by calvaris at February 26, 2016 12:30 PM

February 25, 2016

Javier Muñoz

Ceph, a free unified distributed storage system

Over the last few months I have been working in Ceph, a free unified distributed storage system, in order to implement some missing features in RADOS gateway, help some customers with Ceph clusters in production and fixing bugs.

This effort is part of my daily work here in Igalia working in upstream projects. As you could know, Igalia works in the Cloud arena providing services on development, deployment and orchestration around interesting open projects.

Together with Ceph (storage) we are also working upstream in Qemu (compute) and Snabb (networking). All these projects are in the core to create private and public clouds with Open Source.

My goal with this first post is introducing Ceph in a simple and easy way to understand this marvelous piece of software. I will cover the design and main innovations in Ceph together with its architecture, major use cases and relationship with OpenStack (a well-known free and open-source software platform for cloud computing).

Understanding Ceph

Ceph is an object storage based free software storage platform that stores data on a single distributed computer cluster. I would say this definition catches the essence of Ceph perfectly. It is also the foundation to understand its innovations, the architecture and the performance/scalability factors in Ceph.

Let's start with the object storage. The object storage is a storage architecture that manages data as objects, as opposed to other storage architectures like file systems which manage data as a file hierarchy and block storage which manages data as blocks within sectors and tracks. Each object typically includes the data, a variable amount of metadata, and a globally unique identifier.

On top of this object storage, Ceph provides a block interface (RBD), an object interface (RGW) and a filesystem interface (CephFS).

If we add a smart cluster approach in the previous design we will have a reliable object storage service that can scales to many thousands of devices. This reliable object storage service is known as RADOS (Reliable Autonomic Distributed Object Storage) in the current Ceph implementation.

But what is a 'smart cluster approach' here? At the petabyte and exabyte scale, systems are necessarily dynamic. They are built incrementally, they grow and contract with the deployment of new storage and decommissioning of old devices, devices fail and recover on a continous basis, and large amounts of data are created and destroyed. RADOS takes care of a consistent view of the data distribution and consistent read and write access to data objects.

RADOS also provides storage nodes with complete knowledge of the distribution of data in the systems, devices can act semi-autonomously using peer-to-peer like protocols to self-manage data replication, participate in failure detection and respond to device failures and the resulting changes in the distribution of data by replicating or migrating data objects.

If we consider the minimal configuration together with the basic components needed to set up a RADOS system, we will have a set of object storage daemons (OSDs) and a small group of monitors (MONs) reponsible for managing OSD cluster membership.

In Ceph this OSD cluster membership requires a cluster map. This cluster map specifies cluster membership, device state and the mapping of data objects to devices. The data distribution is specified first by mapping objects to placemente groups (PGs) and then mapping each PG onto a set of devices. The algorithm taking care of these steps is known as CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data)

With this information in mind we may consider two major innovation in Ceph RADOS:

  • The CRUSH algorithm. The way how Ceph clients and Ceph OSD daemons compute information (hashing function) about object location instead of having to depend on a central lookup table
  • Smart daemons. The Ceph's OSD daemons and Ceph clients are cluster aware. This enables OSDs interact directly with other OSDs and MONs. Ceph clients interacts with OSDs directly.

Both items add significant intelligence in the solution to avoid bottlenecks and, at the same time, pursue hyperscale at the petabyte and exabyte scale.

In this point we should have enough information to understand the raw Ceph architecture. Let's have a look in the usual block diagram for Ceph:

  • RGW. A web services gateway for object storage, compatible with S3 and Swift
  • RBD. A reliable, fully distributed block device with cloud platform integration
  • CEPHFS. A distributed file system with POSIX semantics and scale-out metatadata management
  • LIBRADOS. A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
  • RADOS. A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

Mapping out the major components involved under the hood and their interactions makes it still possible getting a more detailed version of this architecture:

The OpenStack basics

Although this is an introduction post in Ceph I will describe OpenStack and its relationship with Ceph briefly. It will be useful later.

Ceph may be used alone but some of its most interesting use cases take place as part of OpenStack. A quick overview on OpenStack will be useful to understand how the OpenStack and Ceph components work together to provide reliable and scalable storage.

The current stable release for OpenStack is 'Liberty' and it includes 17 components (compute, image services, object store, etc). All those components have well-known code names (Nova, Glance, Swift, etc)

The next picture catches a very high level abstraction for OpenStack:

As you can see, Glance (VM image manager) and Cinder (block storage) are two core services in the solution.

We mentioned the previous picture shows a simple view of OpenStack. A more accurate diagram together with the relationships among the services is available in the next picture for 'Folsom', a previous release (2012)

While OpenStack evolves and include new services, this 'Folsom' picture should be good enough to introduce the services related to storage and the level of complexity of OpenStack.

So the storage services in place are Swift (object store service), Glance (image service) and Cinder (block storage service).

Those services work in tandem to cover the general and specific requirements for storage in OpenStack.

Using Ceph in OpenStack

The main integration points between OpenStack and Ceph are the object and block device interfaces.

The RADOS gateway (RGW) and the RADOS block device (RBD) interfaces are used to provide the required storage to 5 services (Keystone, Swift, Cinder, Glance and Nova)

It is worth mentioning the compute service (Nova) interfaces the RBD layer via a hypervisor. An open source hypervisor working like a charm with Ceph is Qemu/KVM. It uses librbd and librados.

Other component to mention in the stack is libvirt. OpenStack uses libvirt to configure Qemu/KVM properly.

Ceph RBD dominates the choice for Cinder drivers currently, as stated in the sixth public survey of OpenStack users (page 31)

The physical deployment of Ceph and OpenStack

Setting up and operating a reliable and scalable storage cluster is always demanding. It requires a careful planning along many different aspects. Some of these critical decisions are related to the cluster capacity (RAM, disks, number of nodes, use profiles, etc)

Although it is always possible going with your own custom configuration some hardware providers offer several standard configurations.

As a random and arbitrary example, we can have a look in the HPE Helion portfolio. This set of solutions is a mix of open-source software and integrated systems for enterprise cloud computing.

The next picture shows the physical space required and how it compares to the different logical components in the architecture.

The new and old use cases

The production of data is expanding at an astonishing pace. Two major drivers in this rapid growth of global data are the analog-to-digital switch (software is everywhere) and the rapid increase in data generation by individuals and companies.

The new use cases related to storage nowadays are radically different of the previous ones a few years ago. These new use cases are all about storing and retrieving unstructured data like photos, videos and social media in massive scale. All this stuff requires real-time analitycs and reporting together with efficient processing.

To get these requirements together, some companies are extending/migrating their current datacenters to support software-defined approaches. As consecuence, those new datacenters leverage virtualization concepts such as abstraction, pooling, and automation to all of the data center’s resources and services to achieve IT as a service. In this vision all elements of the infrastructure (compute, storage, networking and security) are virtualized and delivered as a service.

In this context, we can identify some new and well-known use cases along the next 5 different categories. The original classification is used by the RedHat Storage team. Take into consideration I am merging Cloud infrastructure and Virtualzation here.

  • Big data analytics. Storing, integrating, and analyzing data at petabyte scale
  • Cloud infrastructure and Virtualization. Virtual machine storage and storage for tenant applications (Swift/S3 API)
  • Rich media. Massive scalability and cost containment (scaling out with commodity hardware)
  • File sync and share. Secure mobility, collaboration and the need for anytime, anywhere access to files
  • Archival data. Agile, scalable, cost-effective and flexible unified storage (objects, blocks and file systems)

Ceph is used to support all these use cases in production with great results.

Pushing Ceph to the limit

Some folks in the CERN IT Department are pushing Ceph to the limit. They use Ceph as part of an OpenStack deployment and I have to say the numbers are great.

The solution is a large distributed OpenStack infrastructure with around 10,000 VMs and 100,000 CPU cores (1000 Cinder volumes and 1500 Glance images). The Cloud is predominantly used for physics data analysis but they also reported on a long tail of conventional IT services and user-managed application VMs.

If you want to know more on this Ceph cluster operated by CERN, I would recommend to watch this video at Vancouver Summit 2015.

In brief, and beyond of the great insights shared along the talk, the current Ceph version scales out to 10 PB. In that scale it just works. Over that threshold, it requires extra configuration adjustments.


I told you! This piece of software is marvelous!

I plan to add new blog entries to cover some of the new features implemented in the previous months. They are upstream code now so you will be able to enjoy them in Jewel!

If you are looking for some kind of support related to development, design, deployment, etc. in Ceph or you would love to see some new feature in the next releases. Feel free to contact me!

by Javier at February 25, 2016 11:00 PM

February 24, 2016

Manuel Rego

Igalia Coding Experience on Web Engines

In Igalia we’re looking for people to join the Igalia Coding Experience program. Basically we’re opening positions for internships in different fields related to some of our teams like multimedia, compilers, networking or web platform. The main purpose is to give students and recent graduates an initial taste of coding in industry. Where you could work together with several Igalia hackers on different free software projects.

I’m part of the web platform team where we work on different tasks related to the core of several web engines. Apart from our work on CSS Grid Layout in Blink and WebKit, that you probably know if you follow my blog, Igalia has been working on other topics like:

In our team we’re looking for a student willing to help on some of this topics. Probably, the final work might be somehow related to CSS Grid Layout where we’ve a bunch of peripheral tasks that would be really useful. Some ideas off the top of my head:

This is not meant to be an exhaustive list, but just some examples so you can realize the type of tasks you’ll be doing. Of course, depending on the profile of the selected person we’ll choose the task that fits better.

If you’re interested in this internship or any other from the rest of the teams, you can find all the details and conditions in our website. We’re a company spread all around the globe, with igalians in different countries and timezones (from Seoul to San Francisco). And, of course, these internships are remote friendly.

On top of that, Igalia is hiring too, just in case you already have some experience and are looking for a job. Again you can find all the information at

Last but not least, Igalia welcomes everyone and encourages applications from members of underrepresented groups in the free software community. We’re aiming to keep a diverse environment in our company.

February 24, 2016 11:00 PM

February 19, 2016

Michael Catanzaro

WebKitGTK+ Gets Security Updates

My recent blog post On WebKit Security Updates has attracted some not-unexpected attention. Since I knew poorly-chosen words could harm the image of the WebKit project, I prefaced that blog post with a disclaimer which I hoped few would miss:

WebKitGTK+ releases regular security updates upstream. It is safe to use so long as you apply the updates.

We have a stable branch that receives only bug fixes for six months, from which we release regular updates including security fixes. This is is comparable to industry standards (consider nine months of support for a Firefox ESR, or less than two months of support for a Chromium release). It is hardly WebKit’s fault that most distributions regularly release security updates for Firefox and Chromium, but not for WebKit.

I reject the notion that we should provide a branch with security fixes and no other bug fixes. Withholding bug fixes is unfair to users, and nobody expects Firefox or Chromium to do this. This feels like a double standard to me.

I also reject the notion that WebKit is too risky to update because it is not a leaf package. I provided a solution to this (carrying separate -stable and -secure packages) in my previous blog post for distributions that are very concerned about unexpected regressions. I don’t think it’s necessary, but it is not exactly rocket science.

I strongly disagree with conclusions that you should stop using WebKit wholesale. You should, however, verify that the version of WebKit offered by your distribution and used in your application is secure. That means (a) ensuring your distribution provides the most-recent stable release (currently 2.10.6 or 2.10.7 are both fine), and (b) ensuring your application is using that release rather than 2.4.x, which will also be packaged by your distribution and is used by most applications. For web browsers, check the Internet for well-known security flaws.

If your distribution is not providing a safe version of WebKit, consider switching to one that does and applying pressure on distributions that irresponsibly ship insecure versions of WebKit. I call on Ubuntu, Debian, openSUSE, and other distributions to follow the lead of Fedora and Arch Linux in providing stable WebKit updates to all users, not just testing branch users.

by Michael Catanzaro at February 19, 2016 09:04 PM

February 18, 2016

Enrique Ocaña

Improving Media Source Extensions on WebKit ports based on GStreamer

During 2014 I started to become interested on how GStreamer was used in WebKit to play media content and how it was related to Media Source Extensions (MSE). Along 2015, my company Igalia strenghtened its cooperation with Metrological to enhance the multimedia support in their customized version of WebKitForWayland, the web platform they use for their products for the set-top box market. This was an opportunity to do really interesting things in the multimedia field on a really nice hardware platform: Raspberry Pi.

What are Media Source Extensions?

Normal URL playback in the <video> tag works by configuring the platform player (GStreamer in our case) with a source HTTP URL, so it behaves much like any other external player, downloading the content and showing it in a window. Special cases such as Dynamic Adaptive Streaming over HTTP (DASH) are automatically handled by the player, which becomes more complex. At the same time, the JavaScript code in the webpage has no way to know what’s happening with the quality changes in the stream.

The MSE specification lets the authors move the responsibility to the JavaScript side in that kind of scenarios. A Blob object (Blob URL) can be configured to get its data from a MediaSource object. The MediaSource object can instantiate SourceBuffer objects. Video and Audio elements in the webpage can be configured with those Blob URLs. With this setup, JavaScript can manually feed binary data to the player by appending it to the SourceBuffer objects. The data is buffered and the playback time ranges generated by the data are accessible to JavaScript. The web page (and not the player) has now the control on the data being buffered, its quality, codec and procedence.  Now it’s even possible to synthesize the media data programmatically if needed, opening the door to media editors and media effects coded in JavaScript.


MSE is being adopted by the main content broadcasters on the Internet. It’s required by YouTube for its dedicated interface for TV-like devices and they even have an MSE conformance test suite that hardware manufacturers wanting to get certified for that platform must pass.

MSE architecture in WebKit

WebKit is a multiplatform framework with an end user API layer (WebKit2), an internal layer common to all platforms (WebCore) and particular implementations for each platform (GObject + GStreamer, in our case). Google and Apple have done a great work bringing MSE to WebKit. They have led the effort to implement the common WebCore abstractions needed to support MSE, such as MediaSource, SourceBuffer, MediaPlayer and the integration with HTMLMediaElement (video tag). They have also provided generic platform interfaces (MediaPlayerPrivateInterface, MediaSourcePrivate, SourceBufferPrivate) a working platform implementation for Mac OS X and a mock platform for testing.


The main contributions to the platform implementation for ports using GStreamer for media playback were done by Stephane Jadaud and Sebastian Dröge on bugs #99065 (initial implementation with hardcoded SourceBuffers for audio and video), #139441 (multiple SourceBuffers) and #140078 (support for tracks, more containers and encoding formats). This last patch hasn’t still been merged in trunk, but I used it as the starting point of the work to be done.

GStreamer, unlike other media frameworks, is strongly based on the concept of pipeline: the data traverses a series of linked elements (sources, demuxers, decoders, sinks) which process it in stages. At a given point in time, different pieces of data are in the pipeline at the same time in varying degrees of processing stages. In the case of MSE, a special WebKitMediaSrc GStreamer element is used as the data source in the pipeline and also serves as interface with the upper MSE layer, acting as client of MediaSource and SourceBuffer. WebKitMediaSrc is spawned by GstPlayBin (a container which manages everything automatically inside) when an MSE SourceBuffer is added to the MediaSource. The MediaSource is linked with the MediaPlayer, which has MediaPlayerPrivateGStreamer as private platform implementation. In the design we were using at that time, WebKitMediaSrc was responsible for demuxing the data appended on each SourceBuffer into several streams (I’ve never seen more than one stream per SourceBuffer, though) and for reporting the statistics and the samples themselves to the upper layer according to the MSE specs. To do that, the WebKitMediaSrc encapsulated an appsrc, a demuxer and a parser per source. The remaining pipeline elements after WebKitMediaSrc were in charge of decoding and playback.

Processing appends with GStreamer

The MSE implementation in Chromium uses a chunk demuxer to parse (demux) the data appended to the SourceBuffers. It keeps the parsing state and provides a self-contained way to perform the demuxing. Reusing that Chromium code would have been the easiest solution. However, GStreamer is a powerful media framework and we strongly believe that the demuxing stage can be done using GStreamer as part of the pipeline.

Because of the way GStreamer works, it’s easy to know when an element outputs new data but there’s no easy way to know when it has finished processing its input without discontinuing the flow with with End Of Stream (EOS) and effectively resetting the element. One simple approach that works is to use timeouts. If the demuxer doesn’t produce any output after a given time, we consider that the append has produced all the MediaSamples it could and therefore has finished. Two different timeouts were used: one to detect when appends that produce no samples have finished (noDataToDecodeTimeout) and another to detect when no more samples are coming (lastSampleToDecodeTimeout). The former needs to be longer than the latter.

Another technical challenge was to perform append processing when the pipeline isn’t playing. While playback doesn’t start, the pipeline just prerolls (is filled with the available data until the first frame can be rendered on the screen) and then pauses there until the continuous playback can start. However, the MSE spec expects the appended data to be completely processed and delivered to the upper MSE layer first, and then it’s up to JavaScript to decide if the playback on screen must start or not. The solution was to add intermediate queue elements with a very big capacity to force a preroll stage long enough for the probes in the demuxer source (output) pads to “see” all the samples pass beyond the demuxer. This was how the pipeline looked like at that time (see also the full dump):


While focusing on making the YouTube 2015 tests pass on our Raspberry Pi 1, we realized that the generated buffered ranges had strange micro-holes (eg: [0, 4.9998]; [5.0003, 10.0]) and that was confusing the tests. Definitely, there were differences of interpretation between ChunkDemuxer and qtdemux, but this is a minor problem which can be solved by adding some extra time ranges that fill the holes. All these changes got the append feature in good shape and the we could start watching videos more or less reliably on YouTube TV for the first time.

Basic seek support

Let’s focus on some real use case for a moment. The JavaScript code can be appending video data in the [20, 25] range, audio data in the [30, 35] range (because the [20, 30] range was appended before) and we’re still playing the [0, 5] range. Our previous design let the media buffers leave the demuxer and enter in the decoder without control. This worked nice for sequential playback, but was not compatible with non-linear playback (seeks). Feeding the decoder with video data for [0, 5] plus [20, 25] causes a big pause (while the timeline traverses [5, 20]) followed by a bunch of decoding errors (the decoder needs sequential data to work).

One possible improvement to support non-linear playback is to implement buffer stealing and buffer reinjecting at the demuxer output, so the buffers never go past that point without control. A probe steals the buffers, encapsulates them inside MediaSamples, pumps them to the upper MSE layer for storage and range reporting, and finally drops them at the GStreamer level. The buffers can be later reinjected by the enqueueSample() method when JavaScript decides to start the playback in the target position. The flushAndEnqueueNonDisplayingSamples() method reinjects auxiliary samples from before the target position just to help keeping the decoder sane and with the right internal state when the useful samples are inserted. You can see the dropping and reinjection points in the updated diagram:


The synchronization issues of managing several independent timelines at once must also be had into account. Each of the ongoing append and playback operations happen in their own timeline, but the pipeline is designed to be configured for a common playback segment. The playback state (READY, PAUSED, PLAYING), the flushes needed by the seek operation and the prerolls also affect all the pipeline elements. This problem can be minimized by manipulating the segments by hand to accomodate the different timings and by getting the help of very large queues to sustain the processing in the demuxer, even when the pipeline is still in pause. These changes can solve the issues and get the “47. Seek” test working, but YouTube TV is more demanding and requires a more structured design.

Divide and conquer

At this point we decided to simplify MediaPlayerPrivateGStreamer and refactor all the MSE logic into a new subclass called MediaPlayerPrivateGStreamerMSE. After that, the unified pipeline was split into N append pipelines (one per SourceBuffer) and one playback pipeline. This change solved the synchronization issues and splitted a complex problem into two simpler ones. The AppendPipeline class, visible only to the MSE private player, is in charge of managing all the append logic. There’s one instance for each of the N append pipelines.

Each append pipeline is created by hand. It contains an appsrc (to feed data into it), a typefinder, a qtdemuxer, optionally a decoder (in case we want to suport Encrypted Media Extensions too), and an appsink (to pick parsed data). In my willing to simplify, I removed the support for all formats except ISO MP4, the only one really needed for YouTube. The other containers could be reintroduced in the future.


The playback pipeline is what remains of the old unified pipeline, but simpler. It’s still based on playbin, and the main difference is that the WebKitMediaSrc is now simpler. It consists of N sources (one per SourceBuffer) composed by an appsrc (to feed buffered samples), a parser block and the src pads. Uridecodebin is in charge of instantiating it, like before. The PlaybackPipeline class was created to take care of some of the management logic.


The AppendPipeline class manages the callback forwarding between threads, using asserts to strongly enforce the access to WebCore MSE classes from the main thread. AtomicString and all the classes inheriting from RefCounted (instead of ThreadSafeRefCounted) can’t be safely managed from different threads. This includes most of the classes used in the MSE implementation. However, the demuxer probes and other callbacks sometimes happen in the streaming thread of the corresponding element, not in the main thread, so that’s why call forwarding must be done.

AppendPipeline also uses an internal state machine to manage the different stages of the append operation and all the actions relevant for each stage (starting/stopping the timeouts, process the samples, finish the appends and manage SourceBuffer aborts).


Seek support for the real world

With this new design, the use case of a typical seek works like this (very simplified):

  1. The video may be being currently played at some position (buffered, of course).
  2. The JavaScript code appends data for the new target position to each of the video/audio SourceBuffers. Each AppendPipeline processes the data and JavaScript is aware of the new buffered ranges.
  3. JavaScript seeks to the new position. This ends up calling the seek() and doSeek() methods.
  4. MediaPlayerPrivateGStreamerMSE instructs WebKitMediaSrc to stop accepting more samples until further notice and to prepare the seek (reset the seek-data and need-data counters). The player private performs the real GStreamer seek in the playback pipeline and leaves the rest of the seek pending for when WebKitMediaSrc is ready.
  5. The GStreamer seek causes some changes in the pipeline and eventually all the appsrc in WebKitMediaSrc emit the seek-data and need-data events. Then WebKitMediaSrc notifies the player private that it’s ready to accept samples for the target position and needs data. MediaSource is notified here to seek and this triggers the enqueuing of the new data (non displaying samples and visible ones).
  6. The pending seek at player private level which was pending from step 4 continues, giving permission to WebKitMediaSrc to accept samples again.
  7. Seek is completed. The samples enqueued in step 5 flow now through the playback pipeline and the user can see the video from the target position.

That was just the typical case, but more complex scenarios are also supported. This includes multiple seeks (pressing the forward/backward button several times), seeks to buffered areas (the easiest ones) and to unbuffered areas (where the seek sequence needs to wait until the data for the target area is appended and buffered).

Close cooperation from qtdemux is also required in order to get accurate presentation timestamps (PTS) for the processed media. We detected a special case when appending data much forward in the media stream during a seek. Qtdemux kept generating sequential presentation timestamps, completely ignoring the TFDT atom, which tells where the timestamps of the new data block must start. I had to add a new “always-honor-tfdt” attribute to qtdemux to solve that problem.

With all these changes the YouTube 2015 and 2016 tests are green for us and YouTube TV is completely functional on a Raspberry Pi 2.

Upstreaming the code during Web Engines Hackfest 2015

All this work is currently in the Metrological WebKitForWayland repository, but it could be a great upstream contribution. Last December I was invited to the Web Engines Hackfest 2015, an event hosted in Igalia premises in A Coruña (Spain). I attended with the intention of starting the upstreaming process of our MSE implementation for GStreamer, so other ports such as WebKitGTK+ and WebKitEFL could also benefit from it. Thanks a lot to our sponsors for making it possible.

At the end of the hackfest I managed to have something that builds in a private branch. I’m currently devoting some time to work on the regressions in the YouTube 2016 tests, clean unrelated EME stuff and adapt the code to the style guidelines. Eventually, I’m going to submit the patch for review on bugzilla. There are some topics that I’d like to discuss with other engineers as part of this process, such as the interpretation of the spec regarding how the ReadyState is computed.

In parallel to the upstreaming process, our plans for the future include getting rid of the append timeouts by finding a better alternative, improving append performance and testing seek even more thoroughly with other real use cases. In the long term we should add support for appendStream() and increase the set of supported media containers and codecs at least to webm and vp8.

Let’s keep hacking!

by eocanha at February 18, 2016 08:10 PM

February 17, 2016

Manuel Rego

CSS Grid Layout from the inside out (HTML5DevConf 2015)

As I announced in a previous blog post I was attending HTML5DevConf past October, where I gave a talk about CSS Grid Layout. The video of my talk is now online, so I thought it would be a good moment to remember the days in sunny San Francisco.

My Talk

First of all, thanks to the organization for giving me the chance to speak about CSS Grid Layout during the conference.

My talk was in one of the rooms of the Contemporary Jewish Museum. The room was full and people got really excited about CSS Grid Layout. This is not a big surprise as I have the same feeling every time someone speaks about it in any event. Thanks for attending my talk and for the nice feedback. 😊

During the talk we reviewed the main features provided by Grid Layout spec with live coding examples. Then I also explained the work that the browser has to do to render a grid. And finally we checked the current status on the different browser engines.

The slides have been available on my blog since the conference’s date, but now the video has been published in YouTube. If you missed it, you’ve the opportunity to watch it now. I hope you enjoy it!

Video of my talk: CSS Grid Layout from the inside out

Conference Highlights

HTML5DevConf is a huge conference, lots of people around, several talks at the same time in different venues, etc.

The JavaScript language, as expected, was one of the hot topics. There were some nice talks explaining the new stuff from ECMAScript 6 (ES6), and even talking about some stuff from ES7. For example, they were explaining things like arrow functions that were implemented in Chromium/V8 by some of my colleagues at Igalia.

Jamund Ferguson talking about arrow functions during his talk Jamund Ferguson talking about arrow functions during his talk

However, I’d like to highlight the talk by Parashuram Narasimhan called Measuring web Perf ? Lets write an app for that !!. During the talk he explained a tool called browser-perf which automates performance measurements in web pages. For example, you can test scrolling on a page and get the FPS and other performance parameters from DevOps tools. The cool thing is that you can very easily create a bot to monitor this and check if you’re suffering any regression. If you’re interested you can watch the video too.

San Francisco

Traveling to California is a long trip, but it really paid off. The weather was perfect and I had time for some sightseeing (as you can see in my flickr album). San Francisco is a lovely city and, despite of the hills, is a neat place for biking and walking. On top of that, my fellow igalian Martin Robinson was really kind and showed me some nice spots in the city. Big thanks Martin!

Sun and waves at Ocean Beach Sun and waves at Ocean Beach

After spending my whole life in front of the Atlantic Ocean, I couldn’t resist it so I went to Ocean Beach in order to try (and taste) the Pacific Ocean for the first time.


Overall, it was an incredible experience in a wonderful city, I fell in love with San Francisco. 😍

Regarding the conference itself, it was nice to make an update of my presentation at CSS Conf 2015. In addition it was useful to follow the cutting edge stuff around the web platform.

Hopefully I’ll have the possibility to keep showing you all the awesome features provided by Grid Layout in some other conferences during 2016. Fingers crossed!

Igalia and Bloomberg working together to build a better web Igalia and Bloomberg working together to build a better web

Lastly, as you probably know all the work around CSS Grid Layout in Blink and WebKit is the result of a collaboration between Bloomberg and Igalia. Thanks folks for your support.

February 17, 2016 11:00 PM

February 11, 2016

Manuel Rego

Subgrids thinking out loud

Subgrids has become a hot topic on the past weeks, mostly due to an article by Eric Meyer called “Subgrids Considered Essential”, that shows some examples where subgrids can be really useful. Which is somehow based on a previous post by fantasai on the same topic. Also Rachel Andrew has talked about subgrids on her blog and her last ebook about Grid Layout (which is quite handy to get an introduction to grid syntax and main features).

As you probably know, Igalia has been working on the CSS Grid Layout implementation on Chromium/Blink and Safari/WebKit for a while. As implementors we thought it would be nice to share our feedback on this topic, so this week the team working on Grid Layout (Sergio, Javi and myself) arranged a meeting to review subgrids feature. These are our initial thoughts about it and a first draft proposal about a possible implementation.

State of art

Subgrids feature allows us to create nested grids that share lines with the grid container. We just define the size of the tracks on the main grid container, and the subgrid use those sizes to layout its items.

In a grid, only direct children are grid items and can be positioned on the grid. We can mark one of those items as subgrid, and then the subgrid’s children will be like regular items for the main grid. This is pretty hard to explain with words so we’ll follow up with some examples in the next section.

The feature has been marked as “at-risk” in the specification since January 2014 Which means that it could be moved to the next level (or version) of the spec.

Current situation is that none of the grid implementations (nor any web engine, neither any polyfill) support subgrids yet. Mostly because it’s a pretty complex feature and implementors have been focused on other parts of the spec.

Use cases

In order to understand what subgrids are, let’s explain a few use cases. The canonical example for subgrids feature is a regular form defined using a list:


Ideally we’d like to define a grid with the labels on the first column and the inputs on the second one. If we define the <ul> element as a grid container, we’ll need to define the <li> as subgrids to be able to place the labels and inputs as desired. Without subgrids we’d need to remove the list (probably replacing <ul> by <div> and removing <li> tags) flattening our HTML and losing the semantic structure, which is really bad from the accessibility point of view.

With subgrids support we could use the following CSS:

ul {
    display: grid;
    grid-template-columns: auto 1fr;

li {
    display: grid;
    grid: subgrid;

label {
    grid-column: 1;

input {
    grid-column: 2;

Example of form using subgrid Example of form using subgrid

For this particular purpose, we could be using display: contents; (which is only supported in Firefox so far) instead of subgrids. Just applying display: contents; to the <li> elements we’ll have the very same result.

On the other hand, another of the subgrids examples that we can find out there is a catalog of shop items. Imagine that those products are composed by title, picture and description.

The HTML of each item would be something like this:

<div class="product">
    <img src="product.jpg" />
    <p>Long description...</p>

And imagine that we use the following CSS:

body {
    display: grid;
    grid-template-columns: 1fr 1fr 1fr;

.product {
    display: grid;
    grid: subgrid;

h1 {
    grid-row: 2;

img {
    grid-row: 1;

p {
    grid-row: 3;

The goal is that each product appears in a subgrid of 3 rows, that way the titles and descriptions are horizontally aligned. For that purpose we’re defining a main grid of 3 columns, and a subgrid for each product with the 3 rows.

Example of catalog using subgrid Example of catalog using subgrid

As we can see even if the title has 2 lines, or independently of the lines in the description, the items are properly aligned.

This cannot be achieved with display: contents;, because of it’ll add all the products in a single row (actually 3 rows in the grid). Because we’re setting the row explicitly to the <h1> (row 2), <img> (row 1) and <p> (row 3) elements. Without subgrids, those rows cannot be automatically translated to rows 5, 4 and 6, when the products have filled the 3 columns defined on the main grid.

Main issues

This and next section are very technical and should be discussed with the CSS Working Group (CSS WG), but we thought a blog post would be better to show examples than a simple mail to www-style.

We’ve been reading the spec and found some potential issues thinking about how to implement subgrids. These are some of them in no specific order.

Margin, border and padding

Probably the most important issue is the problem with margin, border and padding on the subgrid. And how they affect to the grid track sizing algorithm.

Let’s try to use a simple example to explain it, imagine a 3x2 grid where we’ve a subgrid taking the first row completely.

<div style="display: grid; grid-template-columns: auto auto auto; grid-template-rows: auto auto;">
    <div style="display: grid; grid: subgrid; grid-column: 1 / 4; grid-row: 1 / 2;">
      <div style="grid-column: 1;">Item 1</div>
      <div style="grid-column: 2;">Item 2</div>
      <div style="grid-column: 3;">Item 3</div>

Imagine that each item’s width is 100px, so the columns will be 100px width each one.

However, if we set for example the padding on the subgrid (e.g. padding: 0px 50px;), we’ll have to do some extra operations to calculate the size of the tracks:

  • First column, as it’s on the left edge of the subgrid we’ll be adding 50px to the size computation. Resulting on a 150px track.
  • Second column is not affected by the padding, so it keeps being 100px.
  • Third column, is in a similar situation than the first one because it’s on the right edge of the subgrid. So the final size is 150px.

Example of subgrid with padding Example of subgrid with padding

The problem here is that the grid track sizing algorithm works directly with the grid items. So it’d need to know if the item is in one of the edges of the subgrid in order to use the margin, border and/or padding to compute the size of the track.

On a first sight this might seem doable, but if we think on items spanning several tracks and nested subgrids things become really complex. For example, imagine that we’ve a nested subgrid that takes 2nd and 3rd columns on the first row with padding: 0px 25px;.

Example of nested subgrids with padding Example of nested subgrids with padding

For each column of the main grid container, we’d need to take into account the different paddings of the subgrids. So for each item we’d need to know not only if it’s on the edge of the subgrid but also check the ancestors.

On a related note, having scrollbars on the subgrids have similar issues related to track sizing.

Implicit tracks on the subgrid

The issue here is what happens when we get rid of explicit tracks on the subgrid. According to the spec, the items in this situation should use implicit tracks of the subgrid, and not take tracks from the main grid. Which means that they’re not aligned with the main grid (they don’t share lines with it).

Let’s use an example to illustrate this issue:

<div style="display: grid; grid-template-columns: 100px 100px 100px; grid-template-rows: 100px 100px;">
    <div style="display: grid; grid: subgrid; grid-column: 1 / 3; grid-row: 2 / 3;">
      <div style="grid-column: 1;">si1</div>
      <div style="grid-column: 2;">si2</div>
      <div style="grid-column: 3;">si3</div>
      <div style="grid-column: 4;">si4</div>

Example of subgrid with implicit tracks Example of subgrid with implicit tracks

As we can see the subgrid items si3 and si4 are placed in implicit tracks of the subgrid. Those tracks has no relation with the main grid container, literally from the spec: “they effectively extend in a third dimension”. We cannot think in a use case for this particular topic, and it seems really weird as the main goal of the subgrid is to share the lines with the main grid container.

Besides, imagine a different example of an auto-sized column, and some items from a subgrid in that column that span in and out the explicit subgrid. It’d be really complex from the grid track sizing algorithm point of view, to determine the size of that auto-sized column of the main grid container.

Automatic grid span

In the previous examples we were defining subgrids explicitly setting both the grid position (the initial column and row for the subgrid) and the grid span (how many cells the subgrid takes in each direction).

However, we can define a subgrid just setting the grid position with an automatic grid span. According to the spec, the implicit tracks on the subgrid will determine the span.

Again hopefully one example can help to get the idea:

<div style="display: grid; grid-template-columns: 100px 100px 100px; grid-template-rows: 100px 100px;">
    <div style="display: grid; grid: subgrid; grid-column: 2; grid-row: 1;">
      <div style="grid-column: 1;">si1</div>
      <div style="grid-column: 2;">si2</div>
      <div style="grid-column: 3;">si3</div>

The subgrid’s grid position is set to the 2nd column and the 1st row. Then the first 2 items (si1 and si2) are placed in the tracks from the main grid (2nd and 3rd column respectively). But the last item (si3) get rid of tracks, so it’s added on an extra implicit track on the subgrid unrelated to the main grid container. Because the subgrid doesn’t create implicit tracks on the main grid,

Example of subgrid with automatic grid span Example of subgrid with automatic grid span

This seems quite complex to manage, specially if the subgrid cannot create implicit tracks on the main grid container. Also, if we combine it with auto-placement (so we don’t set the grid position), finding the right position and span might be tricky.

On top of that, imagine that one of the items in the subgrid is positioned using grid-column: -1;. As the subgrid doesn’t have a grid span yet, probably it’d be placed on the first column of the subgrid, when we’d be expecting it to be on the last one.

Subgrid only in one axis

The subgrid keyword can be set in both properties defining the grid (grid-template-columns and grid-template-rows) or only in one of them. If we set it only in one of them, we’re creating a subgrid only in one direction.

Let’s see one example:

<div style="display: grid; grid-template-columns: 150px 150px; grid-template-rows: 150px 150px;">
    <div style="display: grid; grid-template-columns: subgrid; grid-column: 1 / 3; grid-row: 1 / 2;">
      <div style="grid-column: 1;">si1</div>
      <div style="grid-column: 2;">si2</div>
      <div style="grid-column: 1;">si3</div>
      <div style="grid-column: 2;">si4</div>

In this case we’re defining a subgrid only for the columns direction, so in order to calculate the size of the items in the 1st and 2nd columns it use the track sizes defined in the main grid container.

However, for the rows we’re not setting anything, so it uses auto-sized rows. This means that the rows will behave like in a regular nested grid, completely independent of the parent grid container.

Example of subgrid only in one axis Example of subgrid only in one axis

Probably this is feasible to implement, but again it seems a weird feature, we might need some good use cases to decide about it.

Descendants navigation

Another issue with subgrids is that we can nest them as much as we want. So, in order to determine the size of a track in the grid container, we need to traverse not only the direct children, but also all the descendants. This can have a huge performance impact if nested subgrids are abused.

Regarding this point, probably we’ve to simply live with it and navigate all the descendants being aware of the performance penalty.

Grid gaps

Probably grid gaps properties should be ignored in the subgrids. We want the items to be aligned with the main grid container, so having different gaps seems really strange.

This is not in the spec, but we guess that the reason is that grid gaps were added after subgrids, so that part of the spec was not updated.


There are other things that we didn’t explore in detail. For example:

  • Named grid lines from the main grid container used in the subgrids.
  • Positioned items in the subgrids.
  • Subgrids with orthogonal flows.

And we’re sure there are more things that we’re missing at this moment.

Draft proposal

As we can see there are a bunch of potential issues, of course they need a deeper analysis and further discussions, but the complexity of the whole feature seems not small at all.

For this reason, we might want to think in a simpler proposal. François Remy has already sent one to the CSS WG mailing list.

We’ve been thinking in another idea, trying to fulfill the use cases described previously in this post, and avoid the issues as much as possible. This is a pre-alpha proposal, so take it with a grain of salt.

The ideas would be to have a limited support of subgrids:

  • Make the subgrid pretty similar to display: contents;, so we cannot set margin, border or padding on it. That will simplify the calculations needed to determine the size of the tracks and the items.

    Main difference with display: contents; is that the subgrid will allow to translate positions depending on the final placement on the main grid container. Like in the use case of the shop catalog.

  • Remove implicit tracks on the subgrid, so we need to always set a grid span (how many columns and rows the subgrid takes) beforehand.

    Items trying to be placed on the implicit grid of the subgrid, would end up on the first cell of the subgrid. For items spanning outside the explicit grid, the span would be truncated at the end edge of the subgrid.

  • It seems clear that items in the subgrids need to be positioned before the items in the main grid container. And once all the positions are resolved, we could try to run the grid track sizing algorithm as usual.

  • Probably on an initial approach only allowing subgrids in both axis might be enough.

With this proposal the use cases explained in the first section would be covered, we’d need to be more specific on the CSS, but they’ll work.

We’d lack the chance to set CSS properties on the subgrid itself, but we could add an extra item on the subgrid that spans the whole subrid and paint a border, background or whatever is needed. Not an ideal solution anyway.


This post is mostly a braindump from the Igalia team working on CSS Grid Layout, after one day discussing about this part of the spec. We just want to share it with the rest of the web community, the CSS WG and anyone interested on the topic. In order to help to move forward the decision regarding what do to with subgrids in the CSS Grid Layout specification.

Lots of questions about subgrids are still hanging in the air:

  • Defer them to level 2?
  • Simplify them and keep them on level 1?
  • Keep them as they are and wait for subgrids before shipping CSS Grid Layout?

We think there aren’t good answers to these questions right now, hopefully we all together will find the right solution and find a final solution that makes happy all the involved parties. 😄

Last but not least, there’s going to be a CSS Grid Layout Workshop in New York City by the end of the month organized by one of the spec editors (fantasai). My colleague Sergio, who is the maintainer of Grid Layout code in Blink and WebKit will be participating in this event. Hopefully this and other topics will be clarified there, and the specification could move to Last Call Working Draft (LCWD) towards Candidate Recommendation (CR)!

Igalia and Bloomberg working together to build a better web Igalia and Bloomberg working together to build a better web

Finally, we’d like to thank again Bloomberg for supporting Igalia in the development of CSS Grid Layout.


February 11, 2016 11:00 PM

February 09, 2016

Víctor Jáquez

GStreamer VA-API under the umbrella of GStreamer

We have a new GStreamer VA-API release: 1.6.0!

Wait a minute“, you might say, “weren’t the last release 0.7?“, and you will be correct; but something big has happened: GStreamer VA-API is now part of the official GStreamer project!

And this means a couple changes.

First of all, the official repository of the project has been moved, now it’s co-hosted along with the rest of the GStreamer components, in (fdo):

Anonymous Git repository: git://

Developer Git repository: ssh://

Web Git gateway:

Second, the bug tracking has also moved, since now GStreamer VA-API is now a component of GStreamer, the new bugs must be filled here:

And the bug list is here:

What will happen with the old bugs?” you will ask. Well, we will move them as soon as we have reviewed all them.

The third change, as you already noticed, is the version scheme. Now we are following the GStreamer version numbering to avoid confusion and to simplify our development. Hence, this release, 1.6.0, supports the current GStreamer stable version (1.6.3), and our current development branch is the 1.7.x. The future releases will follow the GStreamer version schemes and dependencies.

Sweet! but, what’s new in this release?“. The answer is, not much. Really. Most of the changes are related with the coupling with the upstream processes (autoconf setup, documentation, etc.). Perhaps the most remarkable thing is the removal of the support libraries (libgstvaapi-*.so) used by the vaapi plugin: now they are compiled as one static library and linked to the GStreamer’s plugin. Also, the custom parsers were removed, and the plugin and elements documentation got better shape.

At code level, we had to push a huge indentation commit, in order to align with the GStreamer code style. This commit artificially kills the blame history, but it was our better option.

I ought to say that those were not the only changes at code level, Michael Olbrich fixed a missing frame release in the Wayland backend. And Sree, as usual, fixed a bunch of hardcore stuff. But specially I want to thank Tim-Philipp Müller, for helping us along the upstreaming process. And obviously to the Intel Open Source Technology Center, for let this happen.

Here’s the git’s short log summary since the last 0.7.0 release:

 1  Joel Holdsworth
 1  Michael Olbrich
 9  Sreerenj Balachandran
 4  Tim-Philipp Müller
42  Víctor Manuel Jáquez Leal

By the way, Igalia is hiring!

by vjaquez at February 09, 2016 12:32 PM

February 08, 2016

Andy Wingo

a lambda is not (necessarily) a closure


Greets, folks! Check it out: Guile had a whole track devoted to it at FOSDEM this year. OK, so it was only half a day, but there were like a dozen talks! And the room was full all the morning! And -- get this -- I had nothing to do with its organization! I think we can credit the Guix project with the recent surge of interest in Guile; fully half the talks were from people excited about using Guix to solve their problems. Thanks very, very much to Pjotr Prins for organizing the lovely event.

I gave a talk on how the Guile 2.2 compiler and virtual machine could change the way people program. Happily, the video recording came out OK! Video below (or here if that doesn't work), and slides here.

The time was super-limited though and I wasn't able to go into the detail that I'd like. So, dear readers, here we are, with a deeper look on lambda representation in Guile.

a lambda is not (necessarily) a closure

What is this?

(lambda (a b) (+ a b))

If you answer, "it's a lambda expression", you're right! You're also right if you say it's a function -- I mean, lambda makes a function, right? There are lots of things that you could say that would be right, including silly things like "twenty-two characters set in an awkward typeface".

But if you said "it's a closure" -- well you're right in general I guess, like on a semantic what-does-it-mean level, but as far as how Guile represents this thing at run-time, hoo boy are there a number of possibilities, and a closure is just one of them. This article dives into the possibilities, with the goal being to help you update your mental model of "how much do things cost".

In Guile, a lambda expression can be one of the following things at run-time:

  1. Gone

  2. Inlined

  3. Contified

  4. Code pointer

  5. Closure

Let's look into these one-by-one.

lambda: gone

If Guile can prove that a lambda expression is never reached, it won't be present at run-time. The main way this happens is via partial evaluation, but later passes can do this too. In the most basic example, consider the lambda bound to f by this let expression.

(let ((f (lambda ()

Guile has an ,optimize command that can be run at the REPL to show the effect of partial evaluation on your code. These days it's a bit out of date in a way -- it can't show what CPS-based optimization will do to your code -- but for our purposes here it will transform the expression to the following code:

(let ((f (lambda ()
=> 42

So the lambda is gone, big whoop. The interesting thing though is that this happens concurrently with other things that partial evaluation does, so the lambda goes away in this expression too:

(let ((launch? #f)
      (f (lambda ()
  (if launch? (f) 'just-kidding))
=> 'just-kidding

lambda: inlined

The other trick that partial evaluation can do with lambda expressions is inlining. Re-taking the example above, if we change launch? to #t, the branch folds the other way and the application (f) inlines:

(let ((launch? #t)
      (f (lambda ()
  (if launch? (f) 'just-kidding))
=> (let ((launch? #t)
         (f (lambda ()
     (if #t (f) 'just-kidding))
=> (let ((launch? #t)
         (f (lambda ()
=> (let ((launch? #t)
         (f (lambda ()
     ((lambda () (launch-the-missiles!))))
=> (let ((launch? #t)
         (f (lambda ()
=> (launch-the-missiles!)

Here again the lambda is gone, but not because it was unreachable, but because it was inlined into its use. I showed some intermediate steps as well, just so you get a feel about how partial evaluation works. The inlining step is illustrated by the fourth transformation, where the lambda application went away, replaced by its body.

Partial evaluation can also unroll many kinds of recursion:

(letrec ((lp (lambda (n)
               (if (zero? n)
                   (+ n (lp (1- n)))))))
  (lp 5))
=> 15

The partial evaluator in Guile 2.2 is more or less unchanged from the one in Guile 2.0, so you get these benefits on old Guile as well. Building a good intuition as to what the partial evaluator will do is important if you want to get the best performance out of Guile. Use the ,optimize command at the REPL to see the effects of partial evaluation on any given expression.

lambda: contified

So, here we step into the unknown, in the sense that from here on out, these optimizations are new in Guile 2.2. Unfortunately, they can be hard to see as they aren't really representable in terms of source-to-source transformations over Scheme programs. Consider this program:

(define (count-down n)
  (define loop
    (lambda (n out)
      (let ((out (cons n out)))
        (if (zero? n)
            (loop (1- n) out)))))
  (loop n '()))

It's a little loop that builds a list of integers. The lambda in this loop, bound to loop, will be contified into the body of count-down.

To see that this is the case, we have to use a new tool, ,disassemble (abbreviated ,x). This takes a procedure and prints its bytecode. It can be hard to understand, so I'm going to just point out some "shapes" of disassembly that you can recognize.

> ,x count-down
Disassembly of #<procedure count-down (n)> at #x9775a8:

  10    (cons 2 1 2)
  11    (br-if-u64-=-scm 0 1 #f 5) ;; -> L2
  14    (sub/immediate 1 1 1)
  15    (br -5)                    ;; -> L1

I've snipped the disassembly to the interesting part. The first thing to notice is that there's just one procedure here: only one time that ,x prints "Disassembly of ...". That means that the lambda was eliminated somehow, either because it was dead or inlined, as described above, or because it was contified. It wasn't dead; we can see that from looking at the ,optimize output, which doesn't significantly change the term. It wasn't inlined either; again, ,optimize can show you this, but consider that because partial evaluation can't determine when the loop would terminate, it won't find a point at which it can stop unrolling the loop. (In practice what happens though is that it tries, hits an effort or code growth limit, then aborts the inlining attempt.)

However, what we see in the disassembly is the body of the loop: we cons something onto a list (the cons), check if a two numbers are equal (br-if-u64-=-scm), and if they are we jump out of the loop (L2). Otherwise we subtract 1 from a number (sub/immediate) and loop (br to L1). That is the loop. So what happened?

Well, if inlining is copying, then contification is rewiring. Guile's compiler was able to see that although it couldn't inline the loop function, it could see all of loop's callers, and that loop always returned to the same "place". (Another way to say this is that loop is always called with the same continuation.) The compiler was then able to incorporate the body of loop into count-down, rewiring calls to loop to continue to loop's beginning, and rewriting returns from loop to proceed to the continuation of the loop call.

a digression on language

These words like "contification" and "continuation" might be unfamiliar to you, and I sympathize. If you know of a better explanation of contification, I welcome any links you might have. The name itself comes from a particular formulation of the intermediate language used in Guile, the so-called "CPS" language. In this language, you convert a program to make it so it never returns: instead, each sub-expression passes its values to its continuation via a tail call. Each continuation is expressed as a lambda expression. See this article for an intro to CPS and how it relates to things you might know like SSA.

Transforming a program into CPS explodes it into a bunch of little lambdas: every subexpression gets its own. You would think this would be a step backwards, if your goal is to eliminate closures in some way. However it's possible to syntactically distinguish between lambda expressions which are only ever used as continuations and those that are used as values. Let's call the former kind of lambda a cont and the latter a function. A cont-lambda can be represented at run-time as a label -- indeed, the disassembly above shows this. It turns out that all lambda expressions introduced by the CPS transformation are conts. Conts form a first-order flow graph, and are basically the same as SSA basic blocks. If you're interested in this kind of thing, see Andrew Kennedy's great paper, Compiling with Continuations, Continued, and see also CPS soup for more on how this evolved in Guile 2.2.

I say all this to give you a vocabulary. Functions that are present in the source program start life as being treated as function-lambdas. Contification takes function-lambda values and turns then into cont-lambda labels, if it can. That's where the name "contification" comes from. For more on contification, see MLton's page on its contification pass, linking to the original paper that introduces the concept.

and we're back

Contification incorporates the body of a function into the flow graph of its caller. Unlike inlining, contification is always an optimization: it never causes code growth, and it enables other optimizations by exposing first-order control flow. (It's easier for the compiler to reason about first-order loops than it is to reason about control flow between higher-order functions.)

Contification is a reliable optimization. If a function's callers are always visible to the compiler, and the function is always called with the same continuation, it will be contified. These are two fairly simple conditions that you can cultivate your instincts to detect and construct.

Contification can also apply to mutually recursive functions, if as a group they are all always called with the same continuation. It's also an iterative process, in the sense that contifying one set of functions can expose enough first-order control flow that more contification opportunities become apparent.

It can take a while to get a feel for when this optimization applies. You have to have a feel for what a continuation is, and what it means for a function's callers to all be visible to the compiler. However, once you do internalize these conditions, contification is something you can expect Guile's compiler to do to your code.

lambda: code pointer

The next representation a lambda might have at run-time is as a code pointer. In this case, the function fails the conditions for contification, but we still avoid allocating a closure.

Here's a little example to illustrate the case.

(define (thing)
  (define (log what)
    (format #t "Very important log message: ~a\n" what))
  (log "ohai")
  (log "kittens")
  (log "donkeys"))

In this example, log is called with three different continuations, so it's not eligible for contification. Unfortunately, this example won't illustrate anything for us because the log function is so small that partial evaluation will succeed in inlining it. (You could determine this for yourself by using ,optimize.) So let's make it bigger, to fool the inliner:

(define (thing)
  (define (log what)
    (format #t "Very important log message: ~a\n" what)
    ;; If `log' is too short, it will be inlined.  Make it bigger.
    (format #t "Did I ever tell you about my chickens\n")
    (format #t "I was going to name one Donkey\n")
    (format #t "I always wanted a donkey\n")
    (format #t "In the end we called her Raveonette\n")
    (format #t "Donkey is not a great name for a chicken\n")
    (newline) (newline) (newline) (newline) (newline))
  (log "ohai")
  (log "kittens")
  (log "donkeys"))

Now if we disassembly it, we do get disassembly for two different functions:

,x thing
Disassembly of #<procedure thing ()> at #x97d704:

Disassembly of log at #x97d754:

So, good. We defeated the inliner. Let's look closer at the disassembly of the outer function.

,x thing
Disassembly of #<procedure thing ()> at #x97d704:
  12    (call-label 3 2 8)              ;; log at #x97d754

Here we see that instead of the generic call instruction, we have the specific call-label instruction which calls a procedure whose code is at a known offset from the calling function.

call-label is indeed a cheaper call than the full call instruction that has to check that the callee is actually a function and so on. But that's not the real optimization here. If all callers of a function are known -- and by this time, you're starting to catch the pattern, I think -- if all callers are known, then the procedure does not need to exist as a value at run-time.

This affords a number of optimization opportunities. Theoretically there are many -- all call sites can be specialized to the specific callee. The callee can have an optimized calling convention that doesn't have anything to do with the generic convention. Effect analysis can understand the side effects and dependencies of the callee in a more precise way. The compiler can consider unboxing some arguments and return values, if it finds that useful.

In Guile though, there's only one real optimization that we do, and that is related to free variables. Currently in Guile, all procedures have a uniform calling convention, in which the procedure being called (the callee) is itself passed as the zeroeth argument, and then the arguments follow on the stack. The function being called accesses its free variables through that zeroeth argument. If however there is no need for the procedure to be represented as a value, we are free to specialize that zeroeth argument.

So, consider a well-known procedure like log above. (By "well-known", we mean that all of log's callers are known.) Since log doesn't actually have any lexically bound free variables, we can just pass in anything as argument zero when invoking it. In practice we pass #f, because it happens to be an easy value to make.

(Why isn't format treated as a free variable in log? Because there is special support from the linker for lazily initializing the locations of variables imported from other modules or defined at the top level instead of within a lexical contour. In short: only variables that are (a) used within the lambda and (b) defined within a let or similar count towards a lambda's free variables.)

For a well-known procedure with only one free variable, we can pass in that free variable as the zeroeth argument. Internally to the function, we rewrite references to that free variable to reference argument 0 instead. This is a neat hack because we can have a lambda with a free variable but which results in no allocation at run-time.

Likewise if there are two free variables -- and this is starting to sound like Alice's restaurant, isn't it -- well we do have to pass in their values to the procedure, but we don't have to build an actual closure object with a tag and a code pointer and all. Pairs happen to be small and have some fast paths in Guile, so we use that. References to the free variables get internally rewritten to be car or cdr of argument 0.

For three or more free variables, we do the same, but with a vector.

For a final trick, a set of mutually recursive procedures whose callers are all known can share the object that collects their free variables. We collect the union of the free variables of all of the procedures, and pack them into a specialized representation as above.

Note that for well-known procedures, all variables that are free in the lambda are also free in the caller; that's why the 1-free-variable substitution works. The lambda is bound in a scope that dominates its callers, but its free variables dominate the lambda so they dominate the callers too. For that reason in this case we could choose to do lambda lifting instead, with no penalty: instead of bundling up the free variables in a heap object, we could pass them as arguments. Dybvig claims this is not a great idea because it increases register pressure. That could be true, but I haven't seen the numbers. Anyway, we do the flat closure thing, so we pack the free vars into data.

All these ideas came pretty much straight from the great Optimizing Closures in O(0) Time by Andrew Keep et al.

lambda: closure

OK! So you have a lambda whose callees are not all visible to the compiler. You need to reify the procedure as a value. That reified procedure-as-value is a closure: an object with a tag, a code pointer, and an array of free variables.

Of course, if the procedure has no free variables, you just have the tag and the code pointer... and because Scheme is semantically squirrely when it comes to the result of (eqv? (lambda () 10) (lambda () 10)) (it's unspecified: lambda expressions don't have identity), we can statically allocate the closure in the binary, as a constant.

Otherwise we do allocate the heap object.

Note however that if a group of mutually recursive procedures has just one entry that is not "well-known", then that procedure clique can share one closure object.

lambda: it's complicated

In summary, a lambda is an abstraction that has many concrete representations. Guile will choose the cheapest representation that it can. If you need to eke out even more performance from your program, having a good mental model of how the abstract maps to the concrete will help you know where to focus your efforts, and what changes might be helpful. Good luck, and happy hacking!

by Andy Wingo at February 08, 2016 10:12 AM

Claudio Saavedra

Mon 2016/Feb/08

About a year ago, Igalia was approached by the people working on printing-related technologies in HP to see whether we could give them a hand in their ongoing effort to improve the printing experience in the web. They had been working for a while in extensions for popular web browsers that would allow users, for example, to distill a web page from cruft and ads and format its relevant contents in a way that would be pleasant to read in print. While these extensions were working fine, they were interested in exploring the possibility of adding this feature to popular browsers, so that users wouldn't need to be bothered with installing extensions to have an improved printing experience.

That's how Alex, Martin, and me spent a few months exploring the Chromium project and its printing architecture. Soon enough we found out that the Chromium developers had been working already on a feature that would allow pages to be removed from cruft and presented in a sort of reader mode, at least in mobile versions of the browser. This is achieved through a module called dom distiller, which basically has the ability to traverse the DOM tree of a web page and return a clean DOM tree with only the important contents of the page. This module is based on the algorithms and heuristics in a project called boilerpipe with some of it also coming from the now popular Readability. Our goal, then, was to integrate the DOM distiller with the modules in Chromium that take care of generating the document that is then sent to both the print preview and the printing service, as well as making this feature available in the printing UI.

After a couple of months of work and thanks to the kind code reviews of the folks at Google, we got the feature landed in Chromium's repository. For a while, though, it remained hidden behind a runtime flag, as the Chromium team needed to make sure that things would work well enough in all fronts before making it available to all users. Fast-forward to last week, when I found out by chance that the runtime flag has been flipped and the Simplify page printing option has been available in Chromium and Chrome for a while now, and it has even reached the stable releases. The reader mode feature in Chromium seems to remain hidden behind a runtime flag, I think, which is interesting considering that this was the original motivation behind the dom distiller.

As a side note, it is worth mentioning that the collaboration with HP was pretty neat and it's a good example of the ways in which Igalia can help organizations to improve the web experience of users. From the standards that define the web to the browsers that people use in their everyday life, there are plenty of areas in which work needs to be done to make the web a more pleasant place, for web developers and users alike. If your organization relies on the web to reach its users, or to enable them to make use of your technologies, chances are that there are areas in which their experience can be improved and that's one of the things we love doing.

February 08, 2016 09:01 AM

February 04, 2016

Andy Wingo

guile compiler tasks

Hey! We released Guile 2.1.2, including the unboxing work, and we fixed the slow bootstrap problem by shipping pre-built bootstraps in tarballs. A pretty OK solution in my opinion; check it out!

future work

At this point I think I'm happy with Guile's compiler and VM, enough for now. There is a lot more work to do but it's a good point at which to release a stable series. There will probably be a number of additional pre-releases, but not any more significant compiler/VM work that must be done before a release.

However, I was talking with Guilers at FOSDEM last weekend and we realized that although we do a pretty good job at communicating the haps in compiler-land, we don't do a good job at sharing a roadmap or making it possible for other folks to join the hack. And indeed, it's been difficult to do so while things were changing so much: I had to get things right in my head before joining in the confusion of other people's heads.

In that spirit I'd like to share a list of improvements that it would be nice to make at some point. If you take one of these tasks, be my guest: find me on IRC (wingo on freenode) and let me know, and I'll help as I am able. You need to be somewhat independent; I'm not offering a proper mentoring or anything, more like office hours or something, where you come with the problem you are having and I commiserate and give context/background/advice as I am able.

So with that out of the way, here's a huge list of stuff! Following this, more details on each one.

  1. stripping binaries

  2. full source in binaries

  3. cps in in binaries

  4. linking multiple modules together

  5. linking a single executable

  6. instruction explosion

  7. elisp optimizations

  8. prompt removal

  9. basic register allocation

  10. optimal register allocation

  11. unboxed record fields

  12. textual CPS

  13. avoiding arity checks

  14. unboxed calls and returns

  15. module-level inlining

  16. cross-module inlining

As a bonus, in the end I'll give some notes on native compilation. But first, the hacks!

stripping binaries

Guile uses ELF as its object file format, and currently includes source location information as DWARF data. On space-constrained devices this might be too much. Your task: add a hack to the linker that can strip existing binaries. Read Ian Lance Taylor's linker articles for more background, if you don't know things about linkers yet.

full source in binaries

Wouldn't it be nice if the ELF files that Guile generates actually included the source as well as the line numbers? We could do that, in a separate strippable ELF section. This point is like the reverse of the previous point :)

cps in in binaries

We could also include the CPS IR in ELF files too. This would enable some kinds of link-time optimization and cross-module inlining. You'd need to define a binary format for CPS, like LLVM bitcode or so. Neat stuff :)

linking multiple modules together

Currently in Guile, just about every module is a separate .go file. Loading a module will cause a few stat calls and some seeks and reads and all that. Wouldn't it be nice if you could link together all the .go files that were commonly used into one object? Again this is a linker hack, but it needs support from the run-time as well: when the run-time goes to load a file, it should first check in a registry if that file has been logically provided by some other file. We'd be able to de-duplicate constant data from various modules. However there is an initialization phase when loading a .go file which effectively performs all the relocations needed by constants that need a fix-up at load-time; see the ELF article I linked to above for more. For some uses, it would be OK to produce one relocation/initialization procedure. For others, if you expected to only load a fraction of the modules in a .go file, it would be a lose on startup time,
so you would probably need to support lazy relocation when a module is first loaded.

Anyway, your task would be to write a linker hack that loads a bunch of .go files, finds the relocations in them, de-duplicates the constants, and writes out a combined .go file that includes a table of files contained in it. Good luck :) This hack would work great for Emacs, where it's effectively a form of unexec that doesn't actually rely on unexec.

linking a single executable

In the previous task, you could end up with the small guile binary that links to libguile (or your binary linking to libguile), and then a .go file containing all the modules you are interestd in. It sure would be nice to be able to link those together into just one binary, or at least to link the .go into the Guile binary. If the Guile is statically linked itself, you would have a statically linked application. If it's dynamically linked, it would remain dynamically linked. Again, a linker hack, but one that could provide a nicer way to distribute Guile binaries.

instruction explosion

Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do
this, which would be a lose overall because you have more instruction dispatch.

Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.

elisp optimizations

Guile implements Emacs Lisp, and does so well. However it hasn't been the focus of a lot of optimization. Emacs has a lot of stuff going on on its side, and so have we, so we haven't managed to replace the Elisp interpreter in Emacs with one written in Guile, though Robin Templeton has brought us a long way forward. We need someone to do both the integration work but also to poke the compiler and make sure it's a clear win.

prompt removal

It's pretty natural to use delimited continuations when compiling some kind of construct that includes a break statement to Guile, whether that compiler is part of Elisp or just implemented as a Scheme macro. But, many instances of prompts can be contified, resulting in no overhead at run-time. Read up on contification and contify the hell out of some prompts!

basic register allocation

Guile usually tries its best to be safe-for-space: only the data which might be used in the future of a program is kept alive, and the rest is available for garbage collection. Notably, this applies to function arguments, temporaries, and lexical variables: if a value is dead, the GC can collect it and re-use its space. However this isn't always what you want. Sometimes you might want to have all variables that are in scope to be available, for better debugging. Your task would be to implement a "slot allocator" (which is really register allocation) that keeps values alive in the parts of the programs that they dominate.

optimal register allocation

On the other hand, our slot allocator -- which is basically register allocation, but for stack slots -- isn't so great. It does OK but you can often end up shuffling values in a loop, which is the worst. Your task would be to implement a proper register allocator: puzzle-solving, graph-coloring, iterative coalescing, something that really tries to do a good job. Good luck!

unboxed record fields

Guile's "structs", on which records are implemented, support unboxed values, but these values are untyped, not really integrated with the record layer, and always boxed in the VM. Your task would be to design a language facility that allows us to declare records with typed fields, and to store unboxed values in those fields, and to cause access to their values to emit boxing/unboxing instructions around them. The optimizer will get rid of those boxing/unboxing instructions if it can. Good luck!

textual CPS

The CPS language is key to all compiler work in Guile, but it doesn't have a nice textual form like LLVM IR does. Design one, and implement a parser and an unparser!

avoiding arity checks

If you know the procedure you are calling, like if it's lexically visible, then if you are calling it with the right number of arguments you can skip past the argument check and instead do a call-label directly into the body. Would be pretty neat!

unboxed calls and returns

Likewise if a function's callers are all known, it might be able to unbox its arguments or return value, if that's a good idea. Tricky! You could start with a type inference pass or so, and maybe that could produce some good debugging feedback too.

module-level inlining

Guile currently doesn't inline anything that's not lexically visible. Unfortunately this restriction extends to top-level definitions in a module: they are treated as mutable and so never inlined/optimized/etc. Probably we need to change the semantics here such that a module can be compiled as a unit, and all values which are never mutated can be assumed to be constant. Probably you also want a knob to turn off this behavior, but really you can always re-compile and re-load a module as a whole if re-loading a function at run-time doesn't work because it was inlined. Anyway. Some semantic work here, but some peval work as well. Be careful!

cross-module inlining

Likewise Guile currently doesn't inline definitions from other modules. However for small functions this really hurts. Guile should probably serialize tree-il for small definitions in .go files, and allow peval to speculatively inline imported definitions. This is related to the previous point and has some semantic implications.

bobobobobobonus! native compilation

Thinking realistically, native compilation is the next step. We have the object file format, cool. We will need the ability to call out from machine code in .go files to run-time functions, so we need to enhance the linker, possibly even with things like PLT/GOT sections to avoid dirtying too many pages. We need to lower the CPS even further, to get closer to some kind of machine model, then go specific, with an assembler for each architecture. The priority in the beginning will be simplicity and minimal complexity; good codegen will come later. This is obviously the most attractive thing but it's also the most tricky, design-wise. I want to do at least part of this, so though you can't have it all, you are welcome to help :)

That's it for now. I'll amend the post with more things as and when I think of them. Comments welcome too, as always. Happy hacking!

by Andy Wingo at February 04, 2016 09:38 PM

Claudio Saavedra

Thu 2016/Feb/04

We've opened a few positions for developers in the fields of multimedia, networking, and compilers. I could say a lot about why working in Igalia is way different to working on your average tech-company or start-up, but I think the way it's summarized in the announcements is pretty good. Have a look at them if you are curious and don't hesitate to apply!

February 04, 2016 12:53 PM

February 03, 2016

Michael Catanzaro

On Subresource Certificate Validation

Ryan Castellucci has a quick read on subresource certificate validation. It is accurate; I fixed this shortly after joining Igalia. (Update: This was actually in response to a bug report from him.) Run his test to see if your browser is vulnerable.

Epiphany, Xombrero, Opera Mini and Midori […] were loading subresources, such as scripts, from HTTPS servers without doing proper certificate validation. […] Unfortunately Xombrero and Midori are still vulnerable. Xombrero seems to be dead, and I’ve gotten no response from them. I’ve been in touch with Midori, but they say they don’t have the resources to fix it, since it would require rewriting large portions of the code base in order to be able to use the fixed webkit.

I reported this to the Midori developers in late 2014 (private bug). It’s hard to understate how bad this is: it makes HTTPS completely worthless, because an attacker can silently modify JavaScript loaded via subresources.

This is actually a unique case in that it’s a security problem that was fixed only thanks to the great API break, which has otherwise been the cause of many security problems. Thanks to the API break, we were able to make the new API secure by default without breaking any existing applications. (But this does no good for applications unable to upgrade.)

(A note to folks who read Ryan’s post: most mainstream browsers do silently block invalid certificates, but Safari will warn instead. I’m not sure which behavior I prefer.)

by Michael Catanzaro at February 03, 2016 08:36 PM

February 01, 2016

Michael Catanzaro

On WebKit Security Updates

Linux distributions have a problem with WebKit security.

Major desktop browsers push automatic security updates directly to users on a regular basis, so most users don’t have to worry about security updates. But Linux users are dependent on their distributions to release updates. Apple fixed over 100 vulnerabilities in WebKit last year, so getting updates out to users is critical.

This is the story of how that process has gone wrong for WebKit.

Before we get started, a few disclaimers. I want to be crystal clear about these points:

  1. This post does not apply to WebKit as used in Apple products. Apple products receive regular security updates.
  2. WebKitGTK+ releases regular security updates upstream. It is safe to use so long as you apply the updates.
  3. The opinions expressed in this post are my own, not my employer’s, and not the WebKit project’s.

Browser Security in a Nutshell

Web engines are full of security vulnerabilities, like buffer overflows, null pointer dereferences, and use-after-frees. The details don’t matter; what’s important is that skilled attackers can turn these vulnerabilities into exploits, using carefully-crafted HTML to gain total control of your user account on your computer (or your phone). They can then install malware, read all the files in your home directory, use your computer in a botnet to attack websites, and do basically whatever they want with it.

If the web engine is sandboxed, then a second type of attack, called a sandbox escape, is needed. This makes it dramatically more difficult to exploit vulnerabilities. Chromium has a top-class Linux sandbox. WebKit does have a Linux sandbox, but it’s not any good, so it’s (rightly) disabled by default. Firefox does not have a sandbox due to major architectural limitations (which Mozilla is working on).

For this blog post, it’s enough to know that attackers use crafted input to exploit vulnerabilities to gain control of your computer. This is why it’s not a good idea to browse to dodgy web pages. It also explains how a malicious email can gain control of your computer. Modern email clients render HTML mail using web engines, so malicious emails exploit many of the same vulnerabilities that a malicious web page might. This is one reason why good email clients block all images by default: image rendering, like HTML rendering, is full of security vulnerabilities. (Another reason is that images hosted remotely can be used to determine when you read the email, violating your privacy.)

WebKit Ports

To understand WebKit security, you have to understand the concept of WebKit ports, because different ports handle security updates differently.

While most code in WebKit is cross-platform, there’s a large amount of platform-specific code as well, to improve the user and developer experience in different environments. Different “ports” run different platform-specific code. This is why two WebKit-based browsers, say, Safari and Epiphany (GNOME Web), can display the same page slightly differently: they’re using different WebKit ports.

Currently, the WebKit project consists of six different ports: one for Mac, one for iOS, two for Windows (Apple Windows and WinCairo), and two for Linux (WebKitGTK+ and WebKitEFL). There are some downstream ports as well; unlike the aforementioned ports, downstream ports are, well, downstream, and not part of the WebKit project. The only one that matters for Linux users is QtWebKit.

If you use Safari, you’re using the Mac or iOS port. These ports get frequent security updates from Apple to plug vulnerabilities, which users receive via regular updates.

Everything else is broken.

Since WebKit is not a system library on Windows, Windows applications must bundle WebKit, so each application using WebKit must be updated individually, and updates are completely dependent on the application developers. iTunes, which uses the Apple Windows port, does get regular updates from Apple, but beyond that, I suspect most applications never get any security updates. This is a predictable result, the natural consequence of environments that require bundling libraries.

(This explains why iOS developers are required to use the system WebKit rather than bundling their own: Apple knows that app developers will not provide security updates on their own, so this policy ensures every iOS application rendering HTML gets regular WebKit security updates. Even Firefox and Chrome on iOS are required to use the system WebKit; they’re hardly really Firefox or Chrome at all.)

The same scenario applies to the WinCairo port, except this port does not have releases or security updates. Whereas the Apple ports have stable branches with security updates, with WinCairo, companies take a snapshot of WebKit trunk, make their own changes, and ship products with that. Who’s using WinCairo? Probably lots of companies; the biggest one I’m aware of uses a WinCairo-based port in its AAA video games. It’s safe to assume few to no companies are handling security backports for their downstream WinCairo branches.

Now, on to the Linux ports. WebKitEFL is the WebKit port for the Enlightenment Foundation Libraries. It’s not going to be found in mainstream Linux distributions; it’s mostly used in embedded devices produced by one major vendor. If you know anything at all about the internet of things, you know these devices never get security updates, or if they do, the updates are superficial (updating only some vulnerable components and not others), or end a couple months after the product is purchased. WebKitEFL does not bother with pretense here: like WinCairo, it has never had security updates. And again, it’s safe to assume few to no companies are handling security backports for their downstream branches.

None of the above ports matter for most Linux users. The ports available on mainstream Linux distributions are QtWebKit and WebKitGTK+. Most of this blog will focus on WebKitGTK+, since that’s the port I work on, and the port that matters most to most of the people who are reading this blog, but QtWebKit is widely-used and deserves some attention first.

It’s broken, too.


QtWebKit is the WebKit port used by Qt software, most notably KDE. Some cherry-picked examples of popular applications using QtWebKit are Amarok, Calligra, KDevelop, KMail, Kontact, KTorrent, Quassel, Rekonq, and Tomahawk. QtWebKit provides an excellent Qt API, so in the past it’s been the clear best web engine to use for Qt applications.

After Google forked WebKit, the QtWebKit developers announced they were switching to work on QtWebEngine, which is based on Chromium, instead. This quickly led to the removal of QtWebKit from the WebKit project. This was good for the developers of other WebKit ports, since lots of Qt-specific code was removed, but it was terrible for KDE and other QtWebKit users. QtWebKit is still maintained in Qt and is getting some backports, but from a quick check of their git repository it’s obvious that it’s not receiving many security updates. This is hardly unexpected; QtWebKit is now years behind upstream, so providing security updates would be very difficult. There’s not much hope left for QtWebKit; these applications have hundreds of known vulnerabilities that will never be fixed. Applications should port to QtWebEngine, but for many applications this may not be easy or even possible.

Update: As pointed out in the comments, there is some effort to update QtWebKit. I was aware of this and in retrospect should have mentioned this in the original version of this article, because it is relevant. Keep an eye out for this; I am not confident it will make its way into upstream Qt, but if it does, this problem could be solved.


WebKitGTK+ is the port used by GTK+ software. It’s most strongly associated with its flagship browser, Epiphany, but it’s also used in other places. Some of the more notable users include Anjuta, Banshee, Bijiben (GNOME Notes), Devhelp, Empathy, Evolution, Geany, Geary, GIMP, gitg, GNOME Builder, GNOME Documents, GNOME Initial Setup, GNOME Online Accounts, GnuCash, gThumb, Liferea, Midori, Rhythmbox, Shotwell, Sushi, and Yelp (GNOME Help). In short, it’s kind of important, not only for GNOME but also for Ubuntu and Elementary. Just as QtWebKit used to be the web engine for choice for Qt applications, WebKitGTK+ is the clear choice for GTK+ applications due to its nice GObject APIs.

Historically, WebKitGTK+ has not had security updates. Of course, we released updates with security fixes, but not with CVE identifiers, which is how software developers track security issues; as far as distributors are concerned, without a CVE identifier, there is no security issue, and so, with a few exceptions, distributions did not release our updates to users. For many applications, this is not so bad, but for high-risk applications like web browsers and email clients, it’s a huge problem.

So, we’re trying to improve. Early last year, my colleagues put together our first real security advisory with CVE identifiers; the hope was that this would encourage distributors to take our updates. This required data provided by Apple to WebKit security team members on which bugs correspond to which CVEs, allowing the correlation of Bugzilla IDs to Subversion revisions to determine in which WebKitGTK+ release an issue has been fixed. That data is critical, because without it, there’s no way to know if an issue has been fixed in a particular release or not. After we released this first advisory, Apple stopped providing the data; this was probably just a coincidence due to some unrelated internal changes at Apple, but it certainly threw a wrench in our plans for further security advisories.

This changed in November, when I had the pleasure of attending the WebKit Contributors Meeting at Apple’s headquarters, where I was finally able meet many of the developers I had interacted with online. At the event, I gave a presentation on our predicament, and asked Apple to give us information on which Bugzilla bugs correspond to which CVEs. Apple kindly provided the necessary data a few weeks later.

During the Web Engines Hackfest, a yearly event that occurs at Igalia’s office in A Coruña, my colleagues used this data to put together WebKitGTK+ Security Advisory WSA-2015-0002, a list of over 130 vulnerabilities disclosed since the first advisory. (The Web Engines Hackfest was sponsored by Igalia, my employer, and by our friends at Collabora. I’m supposed to include their logos here to advertise how cool it is that they support the hackfest, but given all the doom and gloom in this post, I decided perhaps they would perhaps prefer not to have their logos attached to it.)

Note that 130 vulnerabilities is an overcount, as it includes some issues that are specific to the Apple ports. (In the future, we’ll try to filter these out.) Only one of the issues — a serious error in the networking backend shared by WebKitGTK+ and WebKitEFL — resided in platform-specific code; the rest of the issues affecting WebKitGTK+ were all cross-platform issues. This is probably partly because the trickiest code is cross-platform code, and partly because security researchers focus on Apple’s ports.

Anyway, we posted WSA-2015-0002 to the oss-security mailing list to make sure distributors would notice, crossed our fingers, and hoped that distributors would take the advisory seriously. That was one month ago.

Distribution Updates

There are basically three different approaches distributions can take to software updates. The first approach is to update to the latest stable upstream version as soon as, or shortly after, it’s released. This is the strategy employed by Arch Linux. Arch does not provide any security support per se; it’s not necessary, so long as upstream projects release real updates for security problems and not simply patches. Accordingly, Arch almost always has the latest version of WebKitGTK+.

The second main approach, used by Fedora, is to provide only stable release updates. This is more cautious, reflecting that big updates can break things, so they should only occur when upgrading to a new version of the operating system. For instance, Fedora 22 shipped with WebKitGTK+ 2.8, so it would release updates to new 2.8.x versions, but not to WebKitGTK+ 2.10.x versions.

The third approach, followed by most distributions, is to take version upgrades only rarely, or not at all. For smaller distributions this may be an issue of manpower, but for major distributions it’s a matter of avoiding regressions in stable releases. Holding back on version updates actually works well for most software. When security problems arise, distribution maintainers for major distributions backport fixes and release updates. The problem is that this not feasible for web engines; due to the huge volume of vulnerabilities that need fixed, security issues can only practically be handled upstream.

So what’s happened since WSA-2015-0002 was released? Did it convince distributions to take WebKitGTK+ security seriously? Hardly. Fedora is the only distribution that has made any changes in response to WSA-2015-0002, and that’s because I’m one of the Fedora maintainers. (I’m pleased to announce that we have a 2.10.7 update headed to both Fedora 23 and Fedora 22 right now. In the future, we plan to release the latest stable version of WebKitGTK+ as an update to all supported versions of Fedora shortly after it’s released upstream.)


Ubuntu releases WebKitGTK+ updates somewhat inconsistently. For instance, Ubuntu 14.04 came with WebKitGTK+ 2.4.0. 2.4.8 is available via updates, but even though 2.4.9 was released upstream over eight months ago, it has not yet been released as an update for Ubuntu 14.04.

By comparison, Ubuntu 15.10 (the latest release) shipped with WebKitGTK+ 2.8.5, which has never been updated; it’s affected by about 40 vulnerabilities fixed in the latest upstream release. Ubuntu organizes its software into various repositories, and provides security support only to software in the main repository. This version of WebKitGTK+ is in Ubuntu’s “universe” repository, not in main, so it is excluded from security support. Ubuntu users might be surprised to learn that a large portion of Ubuntu software is in universe and therefore excluded from security support; this is in contrast to almost all other distributions, which typically provide security updates for all the software they ship.

I’m calling out Ubuntu here not because it is specially-negligent, but simply because it is our biggest distributor. It’s not doing any worse than most of our other distributors.


Debian provides WebKit updates to users running unstable, and to testing except during freeze periods, but not to released version of Debian. Debian is unique in that it has a formal policy on WebKit updates. Here it is, reproduced in full:

Debian 8 includes several browser engines which are affected by a steady stream of security vulnerabilities. The high rate of vulnerabilities and partial lack of upstream support in the form of long term branches make it very difficult to support these browsers with backported security fixes. Additionally, library interdependencies make it impossible to update to newer upstream releases. Therefore, browsers built upon the webkit, qtwebkit and khtml engines are included in Jessie, but not covered by security support. These browsers should not be used against untrusted websites.

For general web browser use we recommend Iceweasel or Chromium.

Chromium – while built upon the Webkit codebase – is a leaf package, which will be kept up-to-date by rebuilding the current Chromium releases for stable. Iceweasel and Icedove will also be kept up-to-date by rebuilding the current ESR releases for stable.

(Iceweasel and Icedove are Debian’s de-branded versions of Firefox and Thunderbird, the product of an old trademark spat with Mozilla.)

Debian is correct that we do not provide long term support branches, as it would be very difficult to backport security fixes. But it is not correct that “library interdependencies make it impossible to update to newer upstream releases.” This might have been true in the past, but for several years now, we have avoided requiring new versions of libraries whenever it would cause problems for distributions, and — with one big exception that I will discuss below — we ensure that each release maintains both API and ABI compatibility. (Distribution maintainers should feel free to get in touch if we accidentally introduce some compatibility issue for your distribution; if you’re having trouble taking our updates, we want to help. I recently worked with openSUSE to make sure WebKitGTK+ can still be compiled with GCC 4.8, for example.)

The risk in releasing updates is that WebKitGTK+ is not a leaf package: a bad update could break some application. This seems to me like a good reason for application maintainers to carefully test the updates, rather than a reason to withhold security updates from users, but it’s true there is some risk here. One possible solution would be to have two different WebKitGTK+ packages, say, webkitgtk-secure, which would receive updates and be used by high-risk software like web browsers and email clients, and a second webkitgtk-stable package that would not receive updates to reduce regression potential.

Recommended Distributions

We regularly receive bug reports from users with very old versions of WebKit, who trust their distributors to handle security for them and might not even realize they are running ancient, unsafe versions of WebKit. I strongly recommend using a distribution that releases WebKitGTK+ updates shortly after they’re released upstream. That is currently only Arch and Fedora. (You can also safely use WebKitGTK+ in Debian testing — except during its long freeze periods — and Debian unstable, and maybe also in openSUSE Tumbleweed, and (update) also in Gentoo testing. Just be aware that the stable releases of these distributions are currently not receiving our security updates.) I would like to add more distributions to this list, but I’m currently not aware of any more that qualify.

The Great API Break

So, if only distributions would ship the latest release of WebKitGTK+, then everything would be good, right? Nope, because of a large API change that occurred two and a half years ago, called WebKit2.

WebKit (an API layer within the WebKit project) and WebKit2 are two separate APIs around WebCore. WebCore is the portion of the WebKit project that Google forked into Blink; it’s too low-level to be used directly by applications, so it’s wrapped by the nicer WebKit and WebKit2 APIs. The difference between the WebKit and WebKit2 APIs is that WebKit2 splits work into multiple secondary processes. Asides from the UI process, an application will have one or many separate web processes (for the actual page rendering), possibly a separate network process, and possibly a database process for IndexedDB. This is good for security, because it allows the secondary processes to be sandboxed: the web process is the one that’s likely to be compromised first, so it should not have the ability to access the filesystem or the network. (Remember, though, that there is no Linux sandbox yet, so this is currently only a theoretical benefit.) The other main benefit is robustness. If a web site crashes the renderer, only a single web process crashes (corresponding to one tab in Epiphany), not the entire browser. UI process crashes are comparatively rare.

Intermission: Certificate Verification

Another advantage provided by the API change is the opportunity to handle HTTPS connections more securely. In the original WebKitGTK+ API, applications must handle certificate verification on their own. This was a serious mistake; predictably, applications performed no verification at all, or did so improperly. For instance, take this Shotwell bug which is not fixed in any released version of Shotwell, or this Banshee bug which is still open. Probably many more applications are affected, because I have not done a comprehensive check. The new API is secure by default; applications can ignore verification errors, but only if they go out of their way to do so.

Remember that even though WebKitGTK+ 2.4.9 was released upstream over eight months ago, Ubuntu 14.04 is still on 2.4.8? It’s worth mentioning that 2.4.9 contains the fix for that serious networking backend issue I mentioned earlier (CVE-2015-2330). The bug is that TLS certificate verification was not performed until an HTTP response was received from the server; it’s supposed to be performed before sending an HTTP request, to prevent secure cookies from leaking. This is a disaster, as attackers can easily use it to get your session cookie and then control your user account on most websites. (Credit to Ross Lagerwall for reporting that issue.) We reported this separately to oss-security due to its severity, but that was not enough to convince distributions to update. But most applications in Ubuntu 14.04, including Epiphany and Midori, would not even benefit from this fix, because the change only affects WebKit2; remember, there’s no certificate verification in the original WebKitGTK+ API. (Modern versions of Epiphany do use WebKit2, but not the old version included in Ubuntu 14.04.) Old versions of Epiphany and Midori load pages even if certificate verification fails; the verification result is only used to change the status of a security indicator, basically giving up your session cookies to attackers.

Removing WebKit1

WebKit2 has been around for Mac and iOS for longer, but the first stable release for WebKitGTK+ was the appropriately-versioned WebKitGTK+ 2.0, in March 2013. This release actually contained three different APIs: webkitgtk-1.0, webkitgtk-3.0, and webkit2gtk-3.0. webkitgtk-1.0 was the original API, used by GTK+ 2 applications. webkitgtk-3.0 was the same thing for GTK+ 3 applications, and webkit2gtk-3.0 was the new WebKit2 API, available only for GTK+ 3 applications.

Maybe it should have remained that way.

But, since the original API was a maintenance burden and not as stable or robust as WebKit2, it was deleted after the WebKitGTK+ 2.4 release in March 2014. Applications had had a full year to upgrade; surely that was long enough, right? The original WebKit API layer is still maintained for the Mac, iOS, and Windows ports, but the GTK+ API for it is long gone. WebKitGTK+ 2.6 (September 2014) was released with only one API, webkit2gtk-4.0, which was basically the same as webkit2gtk-3.0 except for a couple small fixes; most applications were able to upgrade by simply changing the version number. Since then, we have maintained API and ABI compatibility for webkit2gtk-4.0, and intend to do so indefinitely, hopefully until GTK+ 4.0.

A lot of good that does for applications using the API that was removed.

WebKit2 Adoption

While upgrading to the WebKit2 API will be easy for most applications (it took me ten minutes to upgrade GNOME Initial Setup), for many others it will be a significant challenge. Since rendering occurs out of process in WebKit2, the DOM API can only be accessed by means of a shared object injected into the web process. For applications that perform only a small amount of DOM manipulation, this is a minor inconvenience compared to the old API. For applications that use extensive DOM manipulation — the email clients Evolution and Geary, for instance — it’s not just an inconvenience, but a major undertaking to upgrade to the new API. Worse, some applications (including both Geary and Evolution) placed GTK+ widgets inside the web view; this is no longer possible, so such widgets need to be rewritten using HTML5. Say nothing of applications like GIMP and Geany that are stuck on GTK+ 2. They first have to upgrade to GTK+ 3 before they can consider upgrading to modern WebKitGTK+. GIMP is working on a GTK+ 3 port anyway (GIMP uses WebKitGTK+ for its help browser), but many applications like Geany (the IDE, not to be confused with Geary) are content to remain on GTK+ 2 forever. Such applications are out of luck.

As you might expect, most applications are still using the old API. How does this work if it was already deleted? Distributions maintain separate packages, one for old WebKitGTK+ 2.4, and one for modern WebKitGTK+. WebKitGTK+ 2.4 has not had any updates since last May, and the last real comprehensive security update was over one year ago. Since then, almost 130 vulnerabilities have been fixed in newer versions of WebKitGTK+. But since distributions continue to ship the old version, few applications are even thinking about upgrading. In the case of the email clients, the Evolution developers are hoping to upgrade later this year, but Geary is completely dead upstream and probably will never be upgraded. How comfortable are you with using an email client that has now had no security updates for a year?

(It’s possible there might be a further 2.4 release, because WebKitGTK+ 2.4 is incompatible with GTK+ 3.20, but maybe not, and if there is, it certainly will not include many security fixes.)

Fixing Things

How do we fix this? Well, for applications using modern WebKitGTK+, it’s a simple problem: distributions simply have to start taking our security updates.

For applications stuck on WebKitGTK+ 2.4, I see a few different options:

  1. We could attempt to provide security backports to WebKitGTK+ 2.4. This would be very time consuming and therefore very expensive, so count this out.
  2. We could resurrect the original webkitgtk-1.0 and webkitgtk-3.0 APIs. Again, this is not likely to happen; it would be a lot of work to restore them, and they were removed to reduce maintenance burden in the first place. (I can’t help but feel that removing them may have been a mistake, but my colleagues reasonably disagree.)
  3. Major distributions could remove the old WebKitGTK+ compatibility packages. That will force applications to upgrade, but many will not have the manpower to do so: good applications will be lost. This is probably the only realistic way to fix the security problem, but it’s a very unfortunate one. (But don’t forget about QtWebKit. QtWebKit is based on an even older version of WebKit than WebKitGTK+ 2.4. It doesn’t make much sense to allow one insecure version of WebKit but not another.)

Or, a far more likely possibility: we could do nothing, and keep using insecure software.

by Michael Catanzaro at February 01, 2016 10:40 PM

Xabier Rodríguez Calvar

Web Engines Hackfest according to me

And once again, in December we celebrated the hackfest. This year happened between Dec 7-9 at the Igalia premises and the scope was much broader than WebKitGTK+, that’s why it was renamed as Web Engines Hackfest. We wanted to gather people working on all open source web engines and we succeeded as we had people working on WebKit, Chromium/Blink and Servo.

The edition before this I was working with Youenn Fablet (from Canon) on the Streams API implementation in WebKit and we spent our time on the same thing again. We have to say that things are much more mature now. During the hackfest we spent our time in fixing the JavaScriptCore built-ins inside WebCore and we advanced on the automatic importation of the specification web platform tests, which are based on our prior test implementation. Since now they are managed there, it does not make sense to maintain them inside WebKit too, we just import them. I must say that our implementation is fairly complete since we support the current version of the spec and have almost all tests passing, including ReadableStream, WritableStream and the built-in strategy classes. What is missing now is making Streams work together with other APIs, such as Media Source Extensions, Fetch or XMLHttpRequest.

There were some talks during the hackfest and we did not want to be less, so we had our own about Streams. You can enjoy it here:

You can see all hackfest talks in this YouTube playlist. The ones I liked most were the ones by Michael Catanzaro about HTTP security, which is always interesting given the current clumsy political movements against cryptography and the one by Dominik Röttsches about font rendering. It is really amazing what a browser has to do just to get some letters painted on the screen (and look good).

As usual, the environment was amazing and we had a great time, including the traditional Street Fighter‘s match, where Gustavo found a worthy challenger in Changseok :)

Of course, I would like to thank Collabora and Igalia for sponsoring the event!

And by the way, quite shortly after that, I became a WebKit reviewer!

by calvaris at February 01, 2016 11:06 AM

January 31, 2016

Manuel Rego

Deep Dive into Grid Layout Placement

During the last months as part of my work in Igalia I’ve been focused on finishing the new/missing bits of the CSS Grid Layout Blink’s implementation related to items placement. In summary the task was mostly related to 2 things:

  • Support implicit grid before explicit. So the grid can add tracks on demand not only on the growing direction (usually right/bottom) but also on the opposite one.

  • Fix unknown named grid lines resolution. This is the case when an item is placed in a line called “foo”, but no lines with that name exist in the grid.

These might seem quite simple tasks, but they implied quite a lot of changes in the underlying implementation. I ended up refactoring all the code to resolve grid positions in order to complete them. I even wrote a document explaining the whole plan and all the technical details, where you can find links to all the related patches.

Now that we’ve finished this, it’s time to explain how you can use it. Although my colleague Sergio already wrote about this in 2014, the specification has changed since that time, so I think it’s better to try to explain the whole thing from scratch. This post is a kind of summary with examples of Placing Grid Items” section of the CSS Grid Layout spec.

Grid lines

This is probably one of the most important concepts of the grid layout spec. The grid lines are the ones dividing horizontally and vertically a grid. And they’re actually numbered, starting at 1.

Let’s use a 3x2 grid as example to explain how this works:

<div style="display: grid;
            grid-template-columns: 300px 200px 100px;
            grid-template-rows: 100px 50px;">

Numbered grid lines example Numbered grid lines example

Grid placement properties

In order to position the items inside the grid container, you need to use the grid placement properties. These properties are:

  • grid-column-start: Set the first vertical line for the item.
  • grid-column-end: Set the last vertical line for the item.
  • grid-row-start: Set the first horizontal line for the item.
  • grid-row-end: Set the last horizontal line for the item.

With these properties you define the area where the grid item will be placed. In order to do that, you use the line numbers.

The initial value for these properties is auto, which makes possible that items are automatically placed looking for empty cells inside the grid. For more information about this, please review a previous post on the matter.

On top of that, there’re some handy shorthands:

  • grid-column: Shorthand for grid-column-start and grid-column-end properties.
  • grid-row: Shorthand for grid-row-start and grid-row-end properties.
  • grid-area: Shorthand to set the 4 placement properties in just one declaration.

So, imagine that you add the following item in the previous grid:

<div style="grid-column-start: 2; grid-column-end: 4;
  grid-row-start: 1; grid-row-end 2;">

Probably easier to read if you use some shorthands:

<div style="grid-column: 2 / 4; grid-row: 1 / 2;"></div>

This means that the grid item will be placed taking the 2nd and 3rd columns in the first row.

Place item using line numbers example Place item using line numbers example

Cell spanning

Previous item was spanning 2 columns (2nd and 3rd ones) referencing the lines. You could do the same using the span keyword, together with the number of cells you want to span.

So, you could place the item in the very same position using:

<div style="grid-column: 2 / span 2; grid-row: 1;"></div>

Place item using span example Place item using span example

Note that here you’re not setting the end line for the row. This means that grid-row-end takes a value of auto. In this case auto defaults to a span of one.

Negative line numbers

So far we’ve only seen positive numbers, but lines have also negative indexes. Negative numbers allow you to reference the lines starting from the end of the grid.

Following with the same example, you could place the item again in the same position using the negative indexes:

<div style="grid-column: -3 / -1; grid-row: -3 / -2;"></div>

Place item using negative line numbers example Place item using negative line numbers example

This might be really useful in some situations. For example, if you want to be sure that the item is in the last column, independently of the number of tracks, you’ll just need to set: grid-column-end: -1;.

Named grid lines

Not only that, but you can also name the grid lines, so you don’t need to remember the specific number to reference to them.

Let’s modify the definition of the grid, keeping the size of tracks but adding names to the lines:

<div style="display: grid;
            grid-template-columns: [start] 300px [main] 200px [aside] 100px [end];
            grid-template-rows: [top] 100px [middle] 50px [bottom];">

And again if we want to place the item in the same position, we just need to reference the name of the lines:

<div style="grid-column: main / end; grid-row: top / middle;"></div>

Place item using line names example Place item using line names example

One line can have several names, you just need to set them in the definition: grid-template-rows: [top start] 100px [middle center] 50px [bottom end];.

Also the names of the lines can be repeated. To reference them you’ve to use a number that can be again positive or negative. Let’s use a different example to showcase this:

<div style="display: grid;
            grid-template-columns: [col] 200px [gap] 50px [col] 200px [gap] 50px [col] 200px [gap];">

And imagine that you place some items like this:

<div style="grid-column: col 2;"></div>
<div style="grid-column: col 1 / gap 2;"></div>
<div style="grid-column: col -1;"></div>

Place items with repeated named grid lines example Place items with repeated named grid lines example

And of course, you can span to a named grid line:

<div style="grid-column: col 2 / span 2 gap;"></div>

Place item spanning to named grid line example Place item spanning to named grid line example

Grid areas

Better still, you can define grid areas and place items directly on them. You have to use the grid-template-areas property to put names to the different areas in your grid. And you could use the grid-area shorthand directly to place the items.

Let’s use a bigger grid (5x4) to show an example:

<div style="display: grid;
            grid-template-columns: 100px 100px 100px 100px 100px;
            grid-template-rows: 100px 100px 100px 100px;
                'title  title  title  title  aside'
                'nav    main   main   main   aside'
                'nav    main   main   main   aside'
                'footer footer footer footer footer';">

And position one item in each of the areas:

<div style="grid-area: title;"></div>
<div style="grid-area: nav;"></div>
<div style="grid-area: main;"></div>
<div style="grid-area: aside;"></div>
<div style="grid-area: footer;"></div>

Place items inside grid areas example Place items inside grid areas example

Grid areas & Named grid lines

One interesting thing about areas and placement is that grid areas create implicit names for the grid lines surrounding them. These implicit names use the “-start” and “-end” suffixes. And you can reference those lines when placing an item, instead of using the whole area.

E.g. the “title” area from previous example creates 4 implicit names for the lines (2 in each axis):

  • Left line: “title-start
  • Right line: “title-end
  • Top line: “title-start
  • Bottom line: “title-end

Following with that example you could place an item using the implicit names:

<div style="grid-column: main-start / aside-end;
            grid-row: title-start / nav-end;"></div>

Place item using implicit names from grid areas example Place item using implicit names from grid area

And the same can happen the other way around. If you name lines using those suffixes, you’re actually creating implicit areas. So, we could just create the very same grid using named grid lines:

<div style="display: grid;
            grid-template-columns: [title-start nav-start footer-start] 100px [main-start nav-end] 100px 100px 100px [aside-start title-end main-end] 100px [aside-end footer-end];
            grid-template-rows: [title-start aside-start] 100px [nav-start main-start title-end] 100px 100px [footer-start nav-end main-end aside-end] 100px [footer-end];">

All the examples of items positioned in this section will be exactly in the same place with this new grid.

Implicit grid

With the grid definition properties (grid-template-columns, grid-template-rows and grid-template-areas) you determine the explicit number of tracks (columns and rows) in your grid. However, grid spec allows you to place items outside of the explicit grid. In order to support that, implicit tracks are created automatically, the size of these tracks is controlled by grid-auto-columns and grid-auto-rows properties. In the following examples I’ll use red color to highlight the implicit lines.

This time let’s use a simple 2x1 grid:

<div style="display: grid;
            grid-template-columns: 200px 200px;
            grid-auto-columns: 100px;">

And imagine that you place an item in the 5th column (grid-column: 5;). As the grid only has 2 columns, 3 implicit columns will be added in order to position the item.

Implicit grid example Implicit grid example

Again you can also create implicit tracks with items that span several cells. For example, if an item takes 3 columns starting on the 2nd one (grid-column: 2 / span 3);

Implicit grid with span example Implicit grid with span example

Originally, the implicit tracks could only be added at the end. But now it’s possible to add implicit tracks before the explicit grid. For example, if you place an item using grid-column: -5; it’ll add 2 columns on the left and it’ll be placed in the -2nd column.

Implicit grid before explicit example Implicit grid before explicit example

Implicit grid & Named grid lines

But this is not the only way to create implicit tracks, you can also create them if you use undefined named grid lines. This is more a way to show mistakes on the user CSS than a feature itself, but maybe someone finds it useful. The idea is that all the lines in the implicit grid will take any random name you might need to position an item.

The basic example is placing items referencing a nonexistent line called “foo”. For example you will create 3 implicit columns (1 before and 2 after the explicit grid) with the following items:

<div style="grid-column: foo;"></div>
<div style="grid-column: 2 / span 2 foo;"></div>
<div style="grid-column: -1 foo;"></div>

Implicit grid using undefined named grid lines example Implicit grid using undefined named grid lines example

Note that the simplest example grid-column: foo is being placed in the 4th column (adding an extra empty column just after the explicit grid). This is because first line that is considered to be called “foo” is the first implicit line (line 4), so last line of the grid (line 3) is not included.

Also, the last item grid-column: -1 foo is placed on the -1th column (maybe you was not expecting that). This is because of you start looking for a line named “foo” from the edge of the explicit grid. So, you ignore lines -1, -2 and -3 (as they’re not called “foo”) and consider line -4 (first line on the implicit grid) to have that name.

This is a bit trickier if the line actually exists, as you’ve to count it too in order to place the item. Specially it’s complex if you’re using span to a named grid line, but there’re not enough lines. In that case only implicit lines in the search direction are considered to have that name.

Again, hopefully an example can help to understand this. Let’s add a name to the middle line in the previous example:

<div style="display: grid;
            grid-template-columns: 200px [middle] 200px;
            grid-auto-columns: 100px;">

And now, let’s use place a few items referencing that “middle” line:

<div style="grid-column: 3 middle;"></div>
<div style="grid-column: span 2 middle / 5;"></div>
<div style="grid-column: -4 / span middle;"></div>

Implicit grid using more named grid lines than available example Implicit grid using more named grid lines than available example

The strange case here is grid-column: span 2 middle / 5;, as you can see it takes from -1th column to 4th column (both included). The item ends at line 5, and it has to span 2 lines called “middle” to find the start. You could think that it should count line 4 and line 2, but, as explained before, you have to start counting lines from the edge of the explicit grid. So you actually count line 2 and then you’ve to consider the implicit lines on the left to find the start position (line -4).

Special cases

Grid placement properties have a few special situations that are resolved by the conflict handling section of the spec.

For example, if you place an item where the end line is before than the start line, both lines are swapped. Thus, something like grid-column: 5 / 2; would become grid-column: 2 / 5;.

Another situation is the one in which you have span in both the start and end positions. The span for the end position is discarded. So, grid-column: span 2 / span 3; would become grid-column: span 2;. Which will use the grid placement algorithm to find an empty area (of 2 columns in this particular example) to position itself.

Last one is the case when you only have a span to named grid line. In that case, it’s replaced by span 1. E.g. grid-column: span foo; will become grid-column: span 1;.

Placement special cases example Placement special cases example


If you have read that far it seems you’re really interested on CSS Grid Layout. The main conclusion is that the specification is really flexible regarding how to place items on the grid. As you can see there’re quite a lot of different ways to position an item in the very same place. Probably each person will get used to a few of them and just forget about the rest.

IMHO the basic stuff (line indexes, spans, line names and areas) is quite straightforward. And the implicit grid is not that hard either. Then negative indexes bring some fun. However undefined named grid lines behavior is really tricky (hopefully not something you should care about anyway). But it’s true that I’m biased as I’ve been dealing with this for a long time.

Last, I thought it would be nice to finish this with a big example which uses most of the things described in this post. If you got it right and don’t get confused at all you’re mastering grid layout placement. Congrats! 😄

This is the definition of the grid container used in this example:

<div style="display: grid;
            grid-template-columns: [left] 200px [main-start] 100px [center] 100px 100px [main-end] 50px 50px [right];
            grid-template-rows: [top title-start] 50px [title-end main-start] 200px [main-end center] 150px 100px [bottom];
            grid-auto-columns: 50px;
            grid-auto-rows: 50px;">

And in the next picture you can see how different items will be placed.

Placement big example Placement big example

You can try it live in our examples repository:


As commented on the introduction Blink’s implementation should support now (Chrome 50+) all these different placement possibilities. Igalia has been working on this implementation and we’re porting it now to WebKit too.

On the other side, Gecko has already support for it too, in this case developed by Mozilla.

Igalia and Bloomberg working together to build a better web Igalia and Bloomberg working together to build a better web

Finally, as usual I’d like to highlight one more time that all this work has been done as part of a collaboration between Igalia and Bloomberg. Big thanks for your support!


January 31, 2016 11:00 PM

January 21, 2016

Andy Wingo

talks i would like to give in 2016

Every year I feel like I'm trailing things in a way: I hear of an amazing conference with fab speakers, but only after the call for submissions had closed. Or I see an event with exactly the attendees I'd like to schmooze with, but I hadn't planned for it, and hey, maybe I could have even spoke there.

But it's a new year, so let's try some new things. Here's a few talks I would love to give this year.

building languages on luajit

Over the last year or two my colleagues and I have had good experiences compiling in, on, and under LuaJIT, and putting those results into production in high-speed routers. LuaJIT has some really interesting properties as a language substrate: it has a tracing JIT that can punch through abstractions, it has pretty great performance, and it has a couple of amazing escape hatches that let you reach down to the hardware in the form of the FFI and the DynASM assembly generator. There are some challenges too. I can tell you about them :)

try guile for your next project!

This would be a talk describing Guile, what it's like making programs with it, and the kind of performance you can expect out of it. If you're a practicing programmer who likes shipping small programs that work well, are fun to write, and run with pretty good performance, I think Guile can be a great option.

I don't get to do many Guile talks because hey, it's 20 years old, so we don't get the novelty effect. Still, I judge a programming language based on what you can do with it, and recent advances in the Guile implementation have expanded its scope significantly, allowing it to handle many problem sizes that it couldn't before. This talk will be a bit about the language, a bit about the implementation, and a bit about applications or problem domains.

compiling with persistent data structures

As part of Guile's recent compiler improvements, we switched to a somewhat novel intermediate language. It's continuation-passing-style, but based on persistent data structures. Programming with it is interesting and somewhat different than other intermediate languages, and so this would be a talk describing the language and what it's like to work in it. Definitely a talk for compiler people, by a compiler person :)

a high-performance networking with luajit talk

As I mentioned above, my colleagues and I at work have been building really interesting things based on LuaJIT. In particular, using the Snabb Switch networking toolkit has let us build an implementation of a "lightweight address family translation router" -- the internet-facing component of an IPv4-as-a-service architecture, built on an IPv6-only network. Our implementation flies.

It sounds a bit specialized, and it is, but this talk could go two ways.

One version of this talk could be for software people that aren't necessarily networking specialists, describing the domain and how with Snabb Switch, LuaJIT, compilers, and commodity x86 components, we are able to get results that compete well with offerings from traditional networking vendors. Building specialized routers and other network functions in software is an incredible opportunity for compiler folks.

The other version would be more for networking people. We'd explain the domain less and focus more on architecture and results, and look more ahead to challenges of 100Gb/s ports.

let me know!

I'll probably submit some of these to a few conferences, but if you run an event and would like me to come over and give one of these talks, I would be flattered :) Maybe that set of people is empty, but hey, it's worth a shot. Probably contact via the twitters has the most likelihood of response.

There are some things you need to make sure are covered before reaching out, of course. It probably doesn't need repeating in 2016, but make sure that you have a proper code of conduct, and that that you'll be able to put in the time to train your event staff to create that safe space that your attendees need. Getting a diverse speaker line-up is important to me too; conferences full of white dudes like me are not only boring but also serve to perpetuate an industry full of white dudes. If you're reaching out, reach out to women and people of color too, and let me know that you're working on it. This old JSConf EU post has some ideas too. Godspeed, and happy planning!

by Andy Wingo at January 21, 2016 11:59 AM

January 19, 2016

Adrián Pérez

ljndpi: The SnabbWall sidekick

Howdy, and happy 2016! Last time we met here, I wrote about SnabbWall, a suite of Snabb Switch applications implementing the machinery needed for a Layer-7 firewall using deep packet inspection, and a firewall program itself. The work I am doing in SnabbWall is the result of ongoing collaboration between Igalia and the NLnet Foundation. We are very grateful to have their sponsorship.


We are treating development of SnabbWall like every other project, and among the periodic duties it is important to report progress. Development is happening 100% out in the open, so early on we decided to do also make the status updates open—by blogging them.

(For those interested in following only related posts, I have set up an additional feed: RSS, Atom.)

nDPI Lua Binding: Check

The second milestone for SnabbWall was having a binding using the handy LuaJIT FFI extension for the nDPI library, and I am happy to announce that the binding, dubbed ljndpi, is fairly complete by now. As an added bonus, I have been careful to make the code independet from Snabb Switch, which made it possible to split out the history into a separate repository, which gets imported under lib/ljndpi/ in the SnabbWall repository using git-subtree, and then built into the snabb executable. This is the same approach currently used in Snabb Switch to build the external dependencies (that is ljsyscall, pflua, and LuaJIT).

Hannibal Smith, happy about the project results

Having a working binding for nDPI completes the second milestone of the SnabbWall roadmap.

Some Implementation Notes

Low-level Definitions

The ndpi.c submodule (source: ndpi/c.lua) contains only the definitions needed by the FFI extension to call nDPI functions, and takes care of loading the shared object. If you check the code, you may notice that some of the type names do not match exactly those used in the nDPI headers: as long as the byte size and their role is used consistently, the FFI extension does not really care about the actual names of the types, so I have decided to use names which looked better in my eyes—the nDPI API is kinda terrible in that regard.

Protocol Bitmasks

Telling nDPI which protocols it should detect is done using a big bitmask, with one bit for each supported protocol.

The implementation of NDPI_PROTOCOL_BITMASK in C uses a bunch of macros to operate on a struct which wraps an array of integers, and they do not lend themselves to be easily wrapped using the FFI extension. Because of that, I decided to reimplement it in pure Lua. The size of the array of integers in the bitmask struct and their bit width may vary across different nDPI versions. In preparation for future changes —which should not happen often—, the type is defined using the values of the NDPI_NUM_BITS and NDPI_BITS constants: copying their values from the nDPI headers is the only change needed.

As a bonus, I have thrown in support for chaining calls to all methods except :is_set(). This is super convenient:

local bits = ndpi.protocol_bitmask()
print(bits:set_all():del(32):is_set(32)) --> false
print(bits:reset():add(42):is_set(42))   --> true

Certain operations, like enabling detection of all the supported protocols can be done with a one-liner:

local dm = ndpi.detection_module(1000)

Protocol Names and Identifiers

For each supported protocol, nDPI has both a macro to give a name to its numerical identifier, and a table with data about the protocol. One of the pieces of data is a string with the name of the protocol, in a format suitable to be presented to the user. These strings are const static in the C side of the world, but every time ndpi_get_proto_name() gets called it would be needed to convert the plain array of 8-bit integers to a Lua string using ffi.string(), which in turn would create a copy. Of course one could cache the strings in the Lua side, but protocol identifiers and names do not change (as long as the nDPI version remains unchanged), so instead of caching the results, I have opted for writing a small script which parses the ndpi_protocol_ids.h file and generates the corresponding ndpi/protocol_ids.lua file:

-- Generated by ljdnpi's tools/update-protocol-ids script
local T = {
  -- ...
  -- ...
return T

Note how the same table contains both numeric and string indexes. This is perfectly fine in Lua, and allows both mapping from a protocol name to its identifier, and from an identifier to a name. The strings are not as nice is the ones returned by ndpi_get_proto_name(), but on the upside they are guaranteed to always be uppercase and be valid identifiers.

Wrapper Module

Finally, the ndpi.wrap submodule (source: ndpi/wrap.lua) ties the previous pieces together to provide an idiomatic Lua interface to nDPI. To achieve this, it uses ffi.metatype() to associate metatables to the C types used by nDPI.

The and ndpi.flow types were a bit tricky because they are opaque: their implementation details are hidden, and nDPI only allows us to create values of those types, free them, and pass them to functions. So in C one does:

// Create and initialize an endpoint identifier:
const size = ndpi_detection_get_sizeof_ndpi_id_struct();
struct ndpi_id_struct *ndpi_id = malloc(size);
memset(ndpi_id, 0x00, size);

How does one do that without knowing in advance the size? Well, to begin with, the FFI extension will associate the metatable of the metatype to any pointer value of the C type, no matter how it has been created. Additionally, FFI metatypes have an additional __new metamethod which allows to manually allocate memory or doing any other thing that may be needed, so I ended up with this (the version for ndpi.flow is the same, but using lib.ndpi_free_flow instead of for the __gc metamethod):

local id_struct_ptr_t = ffi.typeof("ndpi_id_t*")
local id_struct_size  = lib.ndpi_detection_get_sizeof_ndpi_id_struct()

local function id_new(ctype)
  local id = ffi.cast(id_struct_ptr_t, C.malloc(id_struct_size))
  ffi.fill(id, id_struct_size)
  return id

local id_type = ffi.metatype("ndpi_id_t", { __new = id_new, __gc = })

Note that the metatable does not have an __index entry, so the type does not have any methods attached to it. This effectively only allows creating values, passing them around, and letting the garbage collector free their memory—which is exactly what we want.

For the ndpi.detection_module type, the main difference is that there is a table associated to the __index entry of the metatype. It contains the functions which act on a struct ndpi_detection_module_struct*, some of them renamed to shorter (or just more logical) names, and some of them are small wrappers which perform additional conversion of values between the Lua and C sides of the world.

Finally, the ndpi.wrap module also imports the bits from the other submodules (ndpi.c, ndpi.protocol_bitmask, and ndpi.protocol_ids). Provided that ndpi.wrap is what should be loaded when using a plain require("ndpi"), the one last bit is making sure that ndpi.lua has:

-- Requiring "ndpi" returns the same as "ndpi.wrap"
return require("ndpi.wrap")

There is a part of the public nDPI API which I have purposedly omitted from ljndpi: the ndpi_t**() functions used to implement a tree structure. These are used to provide (as a convenience) a data structure which maps keys to pointers, and it is not really needed to interface with nDPI for packet inspection. Any program which needs a similar structure can use Lua tables. If needed, it is even easier to roll your own custom data structure in Lua than to wrap these functions.

Putting nDPI Into Use

The nDPI API is quite low-level, and so is the binding exposed by ljndpi. As an exercise to test the module and get my hands greasy in preparation for writing SnabbWall's L7 Spy, I reimplemented a pure-Lua version of the ndpiReader tool included with nDPI, which works only on pcap capture files. If you want to use ljndpi I would strongly encourage you to take a look examples/readpcap in the source tree.

The basic idea goes this way:

  • Create a ndpi.detection_module() and enable detection of all supported protocols.

  • Open the input pcap capture (or prepare whatever mechanism you use to read one packet at a time).

  • Create a table where flow information is stored.

  • For each packet:

    1. Build a “flow identifier” (flowid) from information in the packet (more on this below).
    2. If the flow table does not have an entry for the flowid, create a new one. Make sure you create also a ndpi.flow object, and two objects (one for the source host, another for the destination).
    3. Call ndpi.detection_module:process_packet(), passing the ndpi.flow and objects corresponding to the flow. If this function returns anything else than ndpi.protocol.PROTOCOL_UNKNOWN, then the protocol has been identified (jackpot!).

So what's a “flow identifier” anyway? It can be any value (really!), as long as you can calculate it solely from data in the packets, and each identifier should be unique for each application (Layer-7!) passing data over the network between a pair of hosts. For IP-based protocols (TCP, UDP), you want to use at least the IP addresses and ports involved. Note that nDPI considers the direction of the data irrelevant for the flows —no matter in which direction traffic goes, it belongs to a certain application—, but the direction of the source and destination of each inspected packet is relevant. That means that your flow identifier must be the same regardless of the direction, but the values passed as the two last arguments to :process_packet() have to indicate the direction of the packets.

There are a number of improvements one can make on top of the basic method above, being the most obvious one using VLAN tags to identify flows (readpcap does this already), but I am leaving flow identification for a follow-up post. Dear readers: think about it as homework, and make sure to check the answers with me in the following weeks.

(Or, you know, just go ahead, clone ljndpi, and experiment!)

by aperez ( at January 19, 2016 06:29 PM

Andy Wingo

unboxing in guile

Happy snowy Tuesday, hackfolk! I know I said in my last dispatch that I'd write about Lua soon, but that article is still cooking. In the meantime, a note on Guile and unboxing.

on boxen, on blitzen

Boxing is a way for a programming language implementation to represent a value.

A boxed value is the combination of a value along with a tag providing some information about the value. Both the value and the tag take up some space. The value can be thought to be inside a "box" labelled with the tag and containing the value.

A value's tag can indicate whether the value's bits should be interpreted as an unsigned integer, as a double-precision floating-point number, as an array of words of a particular data type, and so on. A tag can also be used for other purposes, for example to indicate whether a value is a pointer or an "immediate" bit string.

Whether values in a programming language are boxed or not is an implementation consideration. It can be the case that in languages with powerful type systems that a compiler can know what the representation of all values are in all parts of all programs, and so boxing is never needed. However, it's much easier to write a garbage collector if values have a somewhat uniform representation, with tag bits to tell the GC how to trace any pointers that might be contained in the object. Tags can also carry run-time type information needed by a dynamically typed language like Scheme or JavaScript, to allow for polymorphic predicates like number? or pair?.

Boxing all of the values in a program can incur significant overhead in space and in time. For example, one way to implement boxes is to allocate space for the tag and the value on the garbage-collected heap. A boxed value would then be referred to via a pointer to the corresponding heap allocation. However, most memory allocation systems align their heap allocations on word-sized boundaries, for example on 8-byte boundaries. That means that the low 3 bits of a heap allocation will always be zero. If you make a bit string whose low 3 bits are not zero, it cannot possibly be a valid pointer. In that case you can represent some types within the set of bit strings that cannot be valid pointers. These values are called "immediates", as opposed to "heap objects". In Guile, we have immediate representations for characters, booleans, some special values, and a subset of the integers. Alternately, a programming language implementation can represent values as double-precision floating point numbers, and shove pointers into the space of the NaN values. And for heap allocations, some systems can associate one tag with a whole page of values, minimizing per-value boxing overhead.

The goal of these optimizations is to avoid heap allocation for some kinds of boxes. While most language implementations have good garbage collectors that make allocation fairly cheap, the best way to minimize allocation cost is to refrain from it entirely.

In Guile's case, we currently use a combination of low-bit tagging for immediates, including fixnums (a subset of the integers), and tagged boxes on the heap for everything else, including floating-point numbers.

Boxing floating-point numbers obviously incurs huge overhead on floating-point math. You have to consider that each intermediate value produced by a computation will result in the allocation of another 8 bytes for the value and 4 or 8 bytes for the tag. Given that Guile aligns allocations on 8-byte boundaries, the result is a 16-byte allocation in either case. Consider this loop to sum the doubles in a bytevector:

(use-modules (rnrs bytevectors))
(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))

Each trip through the loop is going to allocate not one but two heap floats: one to box the result of bytevector-ieee-double-native-ref (whew, what a mouthful), and one for the sum. If we have a bytevector of 10 million elements, that will be 320 megabytes of allocation. Guile can allocate short-lived 16-byte allocations at about 900 MB/s on my machine, so summing this vector is going to take at least 350ms, just for the allocation. Indeed, without unboxing I measure this loop at 580ms for a 10 million element vector:

> (define v (make-f64vector #e10e6 1.0))
> ,time (f64-sum v)
$1 = 1.0e7
;; 0.580114s real time, 0.764572s run time.  0.268305s spent in GC.

The run time is higher than the real time due to parallel marking. I think in this case, allocation has even higher overhead because it happens outside the bytecode interpreter. The add opcode has a fast path for small integers (fixnums), and if it needs to work on flonums it calls out to a C helper. That C helper doesn't have a pointer to the thread-local freelist so it has to go through a more expensive allocation path.

Anyway, in the time that Guile takes to fetch one f64 value from the vector and add it to the sum, the CPU ticked through some 150 cycles, so surely we can do better than this.

unboxen, unblitzen

Let's take a look again at the loop to see where the floating-point allocations are produced.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))

It turns out there's no reason for the loquatiously-named bytevector-ieee-double-native-ref to return a boxed number. It's a monomorphic function that is well-known to the Guile compiler and virtual machine, and it even has its own opcode. In Guile 2.0 and until just a couple months ago in Guile 2.2, this function did box its return value, but that was because the virtual machine had no facility for unboxed values of any kind.

To allow bytevector-ieee-double-native-ref to return an unboxed double value, the first item of business was then to support unboxed values in Guile's VM. Looking forward to unboxed doubles, we made a change such that all on-stack values are 64 bits wide, even on 32-bit systems. (For simplicity, all locals in Guile take up the same amount of space. For the same reason, fetching 32-bit floats also unbox to 64-bit doubles.)

We also made a change to Guile's "stack maps", which are data structures that tell the garbage collector which locals are live in a stack frame. There is a stack map recorded at every call in a procedure, to be used when an activation is pending on the stack. Stack maps are stored in a side table in a separate section of the compiled ELF library. Live values are traced by the garbage collector, and dead values are replaced by a special "undefined" singleton. The change we made was to be able to indicate that live values were boxed or not, and if they were unboxed, what type they were (e.g. unboxed double). Knowing the type of locals helps the debugger to print values correctly. Currently, all unboxed values are immediates, so the GC doesn't need to trace them, but it's conceivable that we could have unboxed pointers at some point. Anyway, instead of just storing one bit (live or dead) per local in the stack map, we store two, and reserve one of the bit patterns to indicate that
the local is actually an f64 value.

But the changes weren't done then: since we had never had unboxed locals, there were quite a few debugging-related parts of the VM that assumed that we could access the first slot in an activation to see if it was a procedure. This dated from a time in Guile where slot 0 would always be the procedure being called, but the check is bogus ever since Guile 2.2 allowed local value slots corresponding to the closure or procedure arguments to be re-used for other values, if the closure or argument was dead. Another nail in the coffin of procedure-in-slot-0 was driven by closure optimizations, in which closures whose callees are all visible could specialize the representation of their closure in non-standard ways. It took a while, but unboxing f64 values flushed out these bogus uses of slot 0.

The next step was to add boxing and unboxing operations to the VM (f64->scm and scm->f64, respectively). Then we changed bytevector-ieee-double-native-ref to return an unboxed value and then immediately box it via f64->scm. Similarly for bytevector-ieee-double-native-set!, we unbox the value via scm->f64, potentially throwing a type error. Unfortunately our run-time type mismatch errors got worse; although the source location remains the same, scm->f64 doesn't include the reason for the unboxing. Oh well.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
              (+ sum boxed))

When we lower Tree-IL to CPS, we insert the needed f64->scm and scm->f64 boxing and unboxing operations around bytevector accesses. Cool. At this point we have a system with unboxed f64 values, but which is slower than the original version because every f64 bytevector access involves two instructions instead of one, although the instructions themselves together did the same amount of work. However, telling the optimizer about these instructions could potentially eliminate some of them. Let's keep going and see where we get.

Let's attack the other source of boxes, the accumulation of the sum. We added some specialized instuctions to the virtual machine to support arithmetic over unboxed values. Doing this is potentially a huge win, because not only do you avoid allocating a box for the result, you also avoid the type checks on the incoming values. So we add f64+, f64-, and so on.

Unboxing the + to f64+ is a tricky transformation, and relies on type analysis. Our assumption is that if type analysis indicates that we are in fact able to replace a generic arithmetic instruction with a combination of operand unboxing, unboxed arithmetic, and a boxing operation, then we should do it. Separating out the boxes and the monomorphic arithmetic opens the possibility to remove the resulting box, and possibly remove the unboxing of operands too. In this case, we run an optimization pass and end up with something like:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
               (f64+ (scm->f64 sum)
                     (scm->f64 boxed)))))

Scalar replacement via fabricated expressions will take the definition of boxed as (f64->scm f64) and fabricate a definition of f64 as (scm->f64 boxed), which propagates down to the f64+ so we get:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
               (f64+ (scm->f64 sum)

Dead code elimination can now kill boxed, so we end up with:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i)))
               (f64+ (scm->f64 sum)

Voilà, we removed one allocation. Yay!

As we can see from the residual code, we're still left with one f64->scm boxing operation. That expression is one of the definitions of sum, one of the loop variables. The other definition is 0.0, the starting value. So, after specializing arithmetic operations, we go through the set of multiply-defined variables ("phi" variables) and see what we can do to unbox them.

A phi variable can be unboxed if all of its definitions are unboxable. It's not always clear that you should unbox, though. For example, maybe you know via looking at the definitions for the value that it can be unboxed as an f64, but all of its uses are boxed. In that case it could be that you throw away the box when unboxing each definition, only to have to re-create them anew when using the variable. You end up allocating twice as much instead of not at all. It's a tricky situation. Currently we assume a variable with multiple definitions should only be unboxed if it has an unboxed use. The initial set of unboxed uses is the set of operands to scm->f64. We iterate this set to a fixed point: unboxing one phi variable could cause others to be unbox as well. As a heuristic, we only require one unboxed use; it could be there are other uses that are boxed, and we could indeed hit that pessimal double-allocation case. Oh well!

In this case, the intermediate result looks something like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (let ((f64 (bytevector-ieee-double-native-ref v i)))
                  (f64+ (scm->f64 sum-box)

After the scalar replacement and dead code elimination passes, we end up with something more like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (f64+ sum
                    (bytevector-ieee-double-native-ref v i)))

Well this is looking pretty good. There's still a box though. Really we should sink this to the exit, but as it happens there's something else that accidentally works in our favor: loop peeling. By peeling the first loop iteration, we create a control-flow join at the loop exit that defines a phi variable. That phi variable is subject to the same optimization, sinking the box down to the join itself. So in reality the result looks like:

(define (f64-sum v)
  (let ((i 0)
        (sum (scm->f64 0.0))
        (len (bytevector-length v)))
     (if (< i len)
         (let ((i (+ i 8))
               (sum (f64+ sum
                          (bytevector-ieee-double-native-ref v i))))
           (let lp ((i i) (sum sum))
             (if (< i len)
                 (lp (+ i 8)
                     (f64+ sum (bytevector-ieee-double-native-ref v i)))

As you can see, the peeling lifted the length computation up to the top too, which is a bonus. We should probably still implement allocation sinking, especially for loops for which peeling isn't an option, but the current status often works well. Running f64-sum on a 10-million-element packed double array goes down from 580ms to 99ms, or to some 25 or 30 CPU cycles per element, and of course no time in GC. Considering that this loop still has the overhead of bytecode interpretation and cache misses, I think we're doing A O K.


It used to be that using packed bytevectors of doubles was an easy way to make your program slower using types (thanks to Sam Tobin-Hochstadt for that quip). The reason is that although a packed vector of doubles uses less memory, every access to it has to allocate a new boxed number. Compare to "normal" vectors where sure, it uses more memory, but fetching an element fetches an already-boxed value. Now with the unboxing optimization, this situation is properly corrected... in most cases.

The major caveat is that for unboxing to work completely, each use of a potentially-unboxable value has to have an alternate implementation that can work on unboxed values. In our example above, the only use was f64+ (which internally is really called fadd), so we win. Writing an f64 to a bytevector can also be unboxed. Unfortunately, bytevectors and simple arithmetic are currently all of the unboxable operations. We'll implement more over time, but it's a current limitation.

Another point is that we are leaning heavily on the optimizer to remove the boxes when it can. If there's a bug or a limitation in the optimizer, it could be the box stays around needlessly. It happens, hopefully less and less but it does happen. To be sure you get the advantages, you need to time the code and see if it's spending significant time in GC. If it is, then you need to disassemble your code to see where that's happening. It's not a very nice thing, currently. The Scheme-like representations I gave above were written by hand; the CPS intermediate language is much more verbose than that.

Another limitation is that function arguments and return values are always boxed. Of course, the compiler can inline and contify a lot of functions, but that means that to use abstraction, you need to build up a mental model of what the inliner is going to do.

Finally, it's not always obvious to the compiler what the type of a value is, and that necessarily limits unboxing. For example, if we had started off the loop by defining sum to be 0 instead of 0.0, the result of the loop as a whole could be either an exact integer or an inexact real. Of course, loop peeling mitigates this to an extent, unboxing sum within the loop after the first iteration, but it so happens that peeling also prevents the phi join at the loop exit from being unboxed, because the result from the peeled iteration is 0 and not 0.0. In the end, we are unable to remove the equivalent of sum-box, and so we still allocate once per iteration. Here is a clear case where we would indeed need allocation sinking.

Also, consider that in other contexts the type of (+ x 1.0) might actually be complex instead of real, which means that depending on the type of x it might not be valid to unbox this addition. Proving that a number is not complex can be non-obvious. That's the second way that fetching a value from a packed vector of doubles or floats is useful: it's one of the rare times that you know that a number is real-valued.

on integer, on fixnum

That's all there is to say about floats. However, when doing some benchmarks of the floating-point unboxing, one user couldn't reproduce some of the results: they were seeing huge run-times for on a microbenchmark that repeatedly summed the elements of a vector. It turned out that the reason was that they were on a 32-bit machine, and one of the loop variables used in the test was exceeding the fixnum range. Recall that fixnums are the subset of integers that fit in an immediate value, along with their tag. Guile's fixnum tag is 2 bits, and fixnums have a sign bit, so the most positive fixnum on a 32-bit machine is 229—1, or around 500 million. It sure is a shame not to be able to count up to #xFFFFFFFF without throwing an allocation party!

So, we set about seeing if we could unbox integers as well in Guile. Guile's compiler has a lot more visibility as to when something is an integer, compared to real numbers. Anything used as an index into a vector or similar data structure must be an exact integer, and any query as to the length of a vector or a string or whatever is also an integer.

Note that knowing that a value is an exact integer is insufficient to unbox it: you have to also know that it is within the range of your unboxed integer data type. Here we take advantage of the fact that in Guile, type analysis also infers ranges. So, cool. Because the kinds of integers that can be used as indexes and lengths are all non-negative, our first unboxed integer type is u64, the unsigned 64-bit integers.

If Guile did native compilation, it would always be a win to unbox any integer operation, if only because you would avoid polymorphism or any other potential side exit. For bignums that are within the unboxable range, the considerations are similar to the floating-point case: allocation costs dominate, so unboxing is almost always a win, provided that you avoid double-boxing. Eliminating one allocation can pay off a lot of instruction dispatch.

For fixnums, though, things are not so clear. Immediate tagging is such a cheap way of boxing that in an interpreter, the extra instructions you introduce could outweigh any speedup from having faster operations.

In the end, I didn't do science and I decided to just go ahead and unbox if I could. We are headed towards native compilation, this is a necessary step along that path, and what the hell, it seemed like a good idea at the time.

Because there are so many more integers in a typical program than floating-point numbers, we had to provide unboxed integer variants of quite a number of operations. Of course we could unconditionally require unboxed arguments to vector-ref, string-length and so on, but in addition to making u64 variants of arithmetic, we also support bit operations like logand and such. Unlike the current status with floating point numbers, we can do test-and-branch over unboxed u64 comparisons, and we can compare u64 values to boxed SCM values.

In JavaScript, making sure an integer is unboxed is easy: you just do val | 0. The bit operation | truncates the value to a uint32 32-bit two's-complement signed integer (thanks to Slava for the correction). In Guile though, we have arbitrary-precision bit operations, so although (logior val 0) would assert that val is an integer, it wouldn't necessarily mean that it's unboxable.

Instead, the Guile idiom for making sure you have an unboxed integer in a particular range should go like this:

(define-inlinable (check-uint-range x mask)
  (let ((x* (logand x mask)))
    (unless (= x x*)
      (error "out of range" x))

A helper like this is useful to assert that an argument to a function is of a particular type, especially given that arguments to functions are always boxed and treated as being of unknown type. The logand asserts that the value is an integer, and the comparison asserts that it is within range.

For example, if we want to implement a function that does modular 8-bit addition, it can go like:

(define-inlinable (check-uint8 x)
  (check-uint-range x #xff))
(define-inlinable (truncate-uint8 x)
  (logand x #xff))
(define (uint8+ x y)
  (truncate-uint8 (+ (check-uint8 x) (check-uint8 y))))

If we disassemble this function, we get something like:

Disassembly of #<procedure uint8+ (x y)> at #xa8d0f8:

   0    (assert-nargs-ee/locals 3 2)    ;; 5 slots (2 args)
   1    (scm->u64/truncate 4 3)
   2    (load-u64 1 0 255)
   5    (ulogand 4 4 1)
   6    (br-if-u64-=-scm 4 3 #f 17)     ;; -> L1
;; [elided code to throw an error if x is not in range]
  23    (scm->u64/truncate 3 2)
  24    (ulogand 3 3 1)
  25    (br-if-u64-=-scm 3 2 #f 18)     ;; -> L2
;; [elided code to throw an error if y is not in range]
  43    (uadd 4 4 3)
  44    (ulogand 4 4 1)
  45    (u64->scm 3 4)
  46    (return-values 2)               ;; 1 value

The scm->u64/truncate instructions unbox an integer, but truncating it to the u64 range. They are used when we know that any additional bits won't be used, as in this case where we immediately do a logand of the unboxed value. All in all it's not a bad code sequence; there are two possible side exits for each argument (not an integer signalled by the unboxing, and out of range signalled by the explicit check), and no other run-time dispatch. For now I think we can be pretty happy with the code.

That's about it for integer unboxing. We also support unboxed signed 64-bit integers, mostly for use as operands or return values from bytevector-s8-ref and similar unboxed accessors on bytevectors. There are fewer operations that have s64 variants, though, compared to u64 variants.


Up until now in Guile, it could be that you might have to avoid Scheme if you needed to do some kinds of numeric computation. Unboxing floating-point and integer numbers makes it feasible to do more computation in Scheme instead of having to rely in inflexible C interfaces. At the same time, as a Scheme hacker I feel much more free knowing that I can work on 64-bit integers without necessarily allocating bignums. I expect this optimization to have a significant impact on the way I program, and what I program. We'll see where this goes, though. Until next time, happy hacking :)

by Andy Wingo at January 19, 2016 11:57 AM

January 11, 2016

Andy Wingo

the half strap: self-hosting and guile

or, "why does building guile take so friggin long"

Happy new year's, hackfolk! I don't know about y'all, but I'm feeling pretty good about 2016. Let's make some cool stuff!

Today's article is about Guile and how it builds itself. It's a Scheme implementation mostly written in Scheme, so how it would go about doing that isn't straightforward. And although the performance of Guile is pretty great these days, a user's first experience with it will probably be building it, which is a process that takes approximately forever. Seriously. On this newish laptop with an i7-5600U CPU and four cores it takes like 45 minutes. On older machines it can take even longer. What gives?

Well, fictional reader, it's a good question. I'm glad you asked! Before getting to the heart of the matter, I summarize a bit of background information.

and then nothing turned itself inside out

Guile is mostly written in Scheme. Some parts of it are written in C -- some runtime routines, some supporting libraries (the garbage collector, unicode support, arbitrary precision arithmetic), and the bytecode interpreter. The first phase when building Guile is to take the system's C compiler -- a program that takes C source code and produces native machine code -- and use it to build libguile, the part of Guile that is written in C.

The next phase is to compile the parts of Guile written in Scheme. Currently we compile to bytecode which is then interpreted by libguile, but this discussion would be the same if we compiled Scheme to native code instead of bytecode.

There's a wrinkle, though: the Scheme compiler -- the program that takes a Scheme program and produces bytecode -- is written in Scheme. When we built libguile, we could use the system's C compiler. But the system has no Scheme compiler, so how do we do?

The answer is that in addition to a Scheme compiler, Guile also includes a Scheme interpreter. We use the interpreter to load the Scheme compiler, and then use the compiler to produce bytecode from Scheme.

There's another wrinkle, though, and I bet you can guess what it is :) The Scheme interpreter is also written in Scheme. It used to be that Guile's Scheme interpreter was written in C, but that made it impossible to tail-call between compiled and interpreted code. So some six years ago, I rewrote the interpreter in Scheme.

As I mention in that article, Guile actually has two Scheme interpreters: the one in Scheme and one in C that is only used to compile the one in Scheme, and never used again. The bootstrap interpreter written in C avoids the problem with tail calls to compiled code because when it runs, there is no compiled code.

So in summary, Guile's build has the following general phases:

  1. The system C compiler builds libguile.

  2. The bootstrap C interpreter in libguile loads the Scheme compiler and builds eval.go from eval.scm. (Currently .go is the extension for compiled Guile code. The extension predates the Go language. Probably we switch to .so at some point, though.)

  3. The Scheme interpreter from eval.go loads the Scheme compiler and compiles the rest of the Scheme code in Guile, including the Scheme compiler itself.

In the last step, Guile compiles each file in its own process, allowing for good parallelization. This also means that as the compiler builds, the compiler itself starts running faster because it can use the freshly built .go files instead having to use the interpreter to load the source .scm files.

so what's slow?

Building libguile is not so slow; it takes about a minute on my laptop. Could be faster, but it's fine.

Building eval.go is slow, but at two and half minutes it's bearable.

Building the rest of the Scheme code is horribly slow though, and for me takes around 40 or 50 minutes. What is going on?

The crucial difference between building libguile and building the .go files is that when we build libguile, we use the C compiler, which is itself a highly optimized program. When we build .go files, we use the Scheme compiler, which hasn't yet been compiled! Indeed if you rebuild all the Scheme code using a compiled Scheme compiler instead of an interpreted Scheme compiler, you can rebuild all of Guile in about 5 minutes. (Due to the way the Makefile dependencies work, the easiest way to do this if you have a built Guile is rm bootstrap/ice-9/eval.go && make -jN.)

The story is a bit complicated by parallelism, though. Usually if you do a make -j4, you will be able to build 4 things at the same time, taking advantage of 4 cores (if you have them). However Guile's Makefile rules are arranged in such a way that the initial eval.go compile is done serially, when nothing else is running. This is because the bootstrap interpreter written in C uses C stack space as temporary storage. It could be that when compiling bigger files, the C interpreter might run out of stack, and with C it's hard to detect exactly how much stack you have. Indeed, sometimes we get reports of strange bootstrap failures that end up being because Guile was built with -O0 and the compiler decided to use much more stack space than we usually see. We try to fix these, usually by raising the static stack limits that Guile's C interpreter imposes, but we certainly don't want a limitation in the bootstrap interpreter to affect the internal structure of the rest of Guile. The
bootstrap interpreter's only job is to load the compiler and build eval.go, and isn't tested in any other way.

So eval.go is build serially. After that, compilation can proceed in parallel, but goes more slowly before speeding up. To explain that, I digress!

a digression on interpreters

When Scheme code is loaded into Guile from source, the process goes like this:

  1. Scheme code is loaded from disk or wherever as a stream of bytes.

  2. The reader parses that byte stream into S-expressions.

  3. The expander runs on the S-expressions, expanding macros and lowering Scheme code to an internal language called "Tree-IL".

Up to here, the pipeline is shared between the interpreter and the compiler. If you're compiling, Guile will take the Tree-IL, run the partial evaluator on it, lower to CPS, optimize that CPS, and then emit bytecode. The next time you load this file, Guile will just mmap in the .go file and skip all of the other steps. Compilation is great!

But if you are interpreting, a few more things happen:

  1. The memoizer does some analysis on the Tree-IL and turns variable references into two-dimensional (depth, offset) references on a chained environment. See the story time article for more; scroll down about halfway for the details. The goal is to do some light compilation on variable access so that the interpreter will have to do less work, and also prevent closures from hanging on to too much data; this is the "flat closure" optimization, for the interpreter.

  2. The interpreter "compiles" the code to a chain of closures. This is like the classic direct-threading optimization, but for a tree-based interpreter.

The closure-chaining strategy of the interpreter is almost exactly as in described in SICP's analyze pass. I came up with it independently, but so did Jonathan Rees in 1982 and Marc Feeley in 1986, so I wasn't surprised when I found the prior work!

Back in 2009 when we switched to the eval-in-Scheme, we knew that it would result in a slower interpreter. This is because instead of the interpreter being compiled to native code, it was compiled to bytecode. Also, Guile's Scheme compiler wasn't as good then, so we knew that we were leaving optimizations on the floor. Still, the switch to an evaluator in Scheme enabled integration of the compiler, and we thought that the interpreter speed would improve with time. I just took a look and with this silly loop:

(let lp ((n 0)) (if (< n #e1e7) (lp (1+ n))))

Guile 1.8's interpreter written in C manages to run this in 1.1 seconds. Guile 2.0's interpreter written in Scheme and compiled to the old virtual machine does it in 16.4 seconds. Guile 2.1.1's interpreter, with the closure-chaining optimization, a couple of peephole optimizations in the interpreter, and compiled using the better compiler and VM from Guile 2.2, manages to finish in 2.4 seconds. So we are definitely getting better, and by the time we compile eval.scm to native code I have no doubt that we will be as good as the old C implementation. (Of course, when compiled to Guile 2.2's VM, the loop finishes in 55 milliseconds, but comparing a compiler and an interpreter is no fair.)

The up-shot for bootstrap times is that once the interpreter is compiled, the build currently runs a little slower, because the compiled eval.go interpreter is a bit slower than the bootstrap interpreter in libguile.

bottom up, top down

Well. Clearly I wanted to share a thing with you about interpreters; thank you for following along :) The salient point is that Guile's interpreter is now pretty OK, though of course not as good as the compiler. Still, Guile 2.0 builds in 12 minutes, while Guile 2.2 builds in 40 or 50, and Guile 2.2 has a faster interpreter. What's the deal?

There are a few factors at play but I think the biggest is that Guile 2.2's compiler is simply much more sophisticated than Guile 2.0's compiler. Just loading it up at bootstrap-time takes longer than loading Guile 2.0's compiler, because there's more code using more macro abstractions than in Guile 2.0. The expander has to do more work, and the evaluator has to do more work. A compiler is a program that runs on programs, and interpreting a bigger program is going to be slower than interpreting a smaller program.

It's a somewhat paradoxical result: to make programs run faster, we needed a better compiler, but that better compiler is bigger, and so it bootstraps from source more slowly. Some of the improvements to generated code quality were driven by a desire to have the compiler run faster, but this only had the reverse effect on bootstrap time.

Unfortunately, Guile 2.2's compiler also runs slow when it's fully compiled: compiling one largeish module in Guile 2.2 compared to 2.0 takes 10.7 seconds instead of 1.9. (To reproduce, ,time (compile-file "module/ice-9/psyntax-pp.scm") from a Guile 2.0 or 2.2 REPL.) How can we explain this?

Understanding this question has taken me some time. If you do a normal profile of the code using statprof, you get something like this:

> ,profile (compile-file "module/ice-9/psyntax-pp.scm")
%     cumulative   self             
time   seconds     seconds  procedure
 12.41      1.61      1.61  language/cps/intmap.scm:393:0:intmap-ref
  6.35      1.05      0.82  vector-copy
  5.92     13.09      0.77  language/cps/intset.scm:467:5:visit-branch
  5.05      0.71      0.65  language/cps/intmap.scm:183:0:intmap-add!
  4.62      1.40      0.60  language/cps/intset.scm:381:2:visit-node
  3.61      0.93      0.47  language/cps/intset.scm:268:0:intset-add
  3.46      0.49      0.45  language/cps/intset.scm:203:0:intset-add!
  3.17      1.01      0.41  language/cps/intset.scm:269:2:adjoin
  3.03      1.46      0.39  language/cps/intmap.scm:246:2:adjoin

("Cumulative seconds" can be greater than the total number of seconds for functions that have multiple activations live on the stack.)

These results would seem to unequivocally indicate that the switch to persistent data structures in the new compiler is to blame. This is a somewhat disheartening realization; I love working with the new data structures. They let me write better code and think about bigger things.

Seeing that most of the time is spent in intmap and intset manipulations, I've tried off and on over the last few months to speed them up. I tried at one point replacing hot paths with C -- no speedup, so I threw it away. I tried adding an alternate intmap implementation that, for transient packed maps, would store the map as a single vector; no significant speedup, binned it. I implemented integer unboxing in the hopes that it would speed up the results; more about that in another missive. I stared long and hard at the generated code, looking for opportunities to improve it (and did make some small improvements). Even when writing this article, the results are such a shame that I put the article on hold for a couple weeks while I looked into potential improvements, and managed to squeak out another 10%.

In retrospect, getting no speedup out of C hot paths should have been a hint.

For many years, a flat statistical profile with cumulative/self timings like the one I show above has been my go-to performance diagnostic. Sometimes it does take a bit of machine sympathy to understand, though; when you want to know what's calling a hot function, usually you look farther down the list for functions that don't have much self time but whose cumulative time matches the function you're interested in. But this approach doesn't work for hot functions that are called from many, many places, as is the case with these fundamental data structure operations.

Indeed at one point I built a tool to visualize statistical stack samples, the idea being you often want to see how a program gets to its hot code. This tool was useful but its output could be a bit overwhelming. Sometimes you'd have to tell it to generate PDF instead of PNG files because the height of the image exceeded Cairo's internal limits. The tool also had too many moving pieces to maintain. Still, the core of the idea was a good one, and I incorporated the non-graphical parts of it into Guile proper, where they sat unused for a few years.

Fast-forward to now, where faced with this compiler performance problem, I needed some other tool to help me out. It turns out that in the 2.0 to 2.2 transition, I had to rewrite the profiler's internals anyway to deal with the new VM. The old VM could identify a frame's function by the value in local slot 0; the new one has to look up from instruction pointer values. Because this lookup can be expensive, the new profiler just writes sampled instruction pointer addresses into an array for later offline analysis, eventual distilling to a flat profile. It turns out that this information is exactly what's needed to do a tree profile like I did in chartprof. I had to add cycle detection to prevent the graphs from being enormous, but cycle detection makes much more sense in a tree output than in a flat profile. The result, distilled a bit:

> ,profile (compile-file "module/ice-9/psyntax-pp.scm") #:display-style tree
100.0% read-and-compile at system/base/compile.scm:208:0
  99.4% compile at system/base/compile.scm:237:0
    99.4% compile-fold at system/base/compile.scm:177:0
      75.3% compile-bytecode at language/cps/compile-bytecode.scm:568:0
        73.8% lower-cps at language/cps/compile-bytecode.scm:556:0
          41.1% optimize-higher-order-cps at language/cps/optimize.scm:86:0
          29.9% optimize-first-order-cps at language/cps/optimize.scm:106:0
          1.5% convert-closures at language/cps/closure-conversion.scm:814:0
      20.5% emit-bytecode at language/cps/compile-bytecode.scm:547:0
        18.5% visit-branch at language/cps/intmap.scm:514:5
          18.5% #x7ff420853318 at language/cps/compile-bytecode.scm:49:15
            18.5% compile-function at language/cps/compile-bytecode.scm:83:0
              18.5% allocate-slots at language/cps/slot-allocation.scm:838:0
      3.6% compile-cps at language/tree-il/compile-cps.scm:1071:0
        2.5% optimize at language/tree-il/optimize.scm:31:0
        0.6% cps-convert/thunk at language/tree-il/compile-cps.scm:924:0
        0.4% fix-letrec at language/tree-il/fix-letrec.scm:213:0
  0.6% compile-fold at system/base/compile.scm:177:0
    0.6% save-module-excursion at ice-9/boot-9.scm:2607:0
      0.6% #x7ff420b95254 at language/scheme/compile-tree-il.scm:29:3

I've uploaded the full file here, for the curious Guile hacker.

So what does it mean? The high-order bit is that we spend some 70% of the time in the optimizer. Indeed, running the same benchmark but omitting optimizations gets a much more respectable time:

$ time meta/uninstalled-env \
  guild compile -O0 module/ice-9/psyntax-pp.scm -o /tmp/foo.go
wrote `/tmp/foo.go'

real	0m3.050s
user	0m3.404s
sys	0m0.060s

One of the results of this investigation was that we should first compile the compiler with -O0 (no optimizations), then compile the compiler with -O2 (with optimizations). This change made it into the 2.1.1 release a couple months ago.

We also spend around 18.5% of time in slot allocation -- deciding what local variable slots to allocate to CPS variables. This takes time because we do a precise live variable analysis over the CPS, which itself has one variable for every result value and a label for every program point. Then we do register allocation, but in a way that could probably be optimized better. Perhaps with -O0 we should use a different strategy to allocate slots: one which preserves the values of variables that are available but dead. This would actually be an easier allocation task. An additional 1.5% is spent actually assembling the bytecode.

Interestingly, partial evaluation, CPS conversion, and a couple of other small optimizations together account for only 3.6% of time; and reading and syntax expansion account for only 0.6% of time. This is good news at least :)

up in the trees, down in the weeds

Looking at the top-down tree profile lets me see that the compiler is spending most of its time doing things that the Guile 2.0 compiler doesn't do: loop optimizations, good slot allocations, and so on. To an extent, then, it's to be expected that the Guile 2.2 compiler is slower. This also explains why the C fast-paths weren't so effective at improving performance: the per-operation costs for the already pretty low and adding C implementations wasn't enough of a speedup to matter. The problem was not that intmap-ref et al were slow, it was that code was calling them a lot.

Improving the optimizer has been a bit challenging, not least due to the many axes of "better". Guile's compiler ran faster before the switch to "CPS soup" and persistent data structures, but it produced code that ran slower because I wasn't able to write the optimizations that I would have liked. Likewise, Guile 2.0's compiler ran faster, because it did a worse job. But before switching to CPS soup, Guile's compiler also used more memory, because per-program-point and per-variable computations were unable to share space with each other.

I think the top-down profiler has given me a better point of view in this instance, as I can reason about what I'm doing on a structural level, which I wasn't able to understand from the flat profile. Still, it's possible to misunderstand the performance impact of leaf functions when they are spread all over a tree, and for that reason I think we probably need both kinds of profilers.

In the case of Guile's compiler I'm not sure that I'll change much at this point. We'll be able to switch to native compilation without a fundamental compiler rewrite. But spending most of the time in functions related to data structures still seems pretty wrong to me on some deep level -- what if the data structures were faster? What if I wrote the code in some other way that didn't need the data structures so much? It gnaws at me. It gnaws and gnaws.

the half strap

Unfortunately, while compiling Scheme to native code will probably speed up the compiler, it won't necessarily speed up the bootstrap. I think the compiler has some 800 KB of source code right now, and let's say that we're able to do native compilation with 1200 KB. So 50% more code, but probably the result is two to ten times faster on average: a win, in terms of compiler speed, when compiled. But for bootstrap time, because in the beginning of the bootstrap most of the compiler isn't compiled, it could well be a slowdown.

This is the disadvantage of bootstrapping from an interpreter -- the more compiler you write, the slower your strap.

Note that this is different from the case where you bootstrap from a compiled Scheme compiler. In our case we do a half-bootstrap, first building an interpreter in C, compiling the interpreter in Scheme, then bootstrapping off that.

It's a common trope in compiler development where the heroic, farsighted compiler hacker refuses to add optimizations unless they make the compiler bootstrap faster. Dybvig says as much in his "History of Chez Scheme" paper. Well, sure -- if you're willing to accept complete responsibility for bootstrapping. From my side, I'm terrified that I could introduce some error in a binary that could reproduce itself worm-like into all my work and it make it impossible to change anything. You think I jest, but the Sanely Bootstrappable Common Lisp papers instilled me with fear. Want to change your tagging scheme? You can't! Want to experiment with language, start programming using features from your own dialect? You can't! No, thank you. I value my sanity more than that.

Incidentally, this also answers a common question people have: can I use some existing Guile to compile a new Guile? The answer is tricky. You can if the two Guiles implement the same language and virtual machine. Guile-the-language is fairly stable. However, due to the way that the VM and the compiler are co-developed, some of the compiler is generated from data exported by libguile. If that information happens to be the same on your Guile, then yes, it's possible. Otherwise no. For this reason it's not something we describe, besides cross-compilers from the same version. Just half strap: it takes a while but it's fairly fool-proof.

and that's it!

Thanks for reading I guess. Good jobbies! Next time, some words on Lua. Until then, happy strapping!

by Andy Wingo at January 11, 2016 09:51 PM