Planet Igalia

February 21, 2017

Asumu Takikawa

Optimizing Snabbwall

In my previous blog post, I wrote about the work I’ve been doing on Snabbwall, sponsored by the NLNet Foundation. The next milestone in the project was to write some user documentation (this is now done) and to do some benchmarking.

After some initial benchmarking, I found that Snabbwall wasn’t performing as well as it could. One of the impressive things about Snabb is that well-engineered apps can achieve line-rate performance on 10gbps NICs. That means the LuaJIT program is processing packets at 10gbps; if your packets are about 40 bytes (the minimum size of an IPv6 packet), that leaves only around 30 nanoseconds per packet.
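
To spell out the arithmetic behind that figure:

40 bytes/packet × 8 bits/byte = 320 bits/packet
320 bits ÷ (10 × 10⁹ bits/s) = 32 ns/packet

(This ignores Ethernet framing overhead, so the actual per-packet time budget is a bit larger.)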

On the other hand, Snabbwall was clocking about 1gbps or less. This was based on measurements from a simple benchmarking script that uses the packetblaster program to fire a ton of packets at a NIC connected to an instance of Snabbwall. The benchmark output looked like this:

Firewall on 02:00.1 (CPU 3). Packetblaster on 82:00.1.
BENCH (BITTORRENT.pcap, 1 iters, 10 secs)
bytes: 1,396,392,179 packets: 1,736,085 bps: 1,090,257,569.1429
BENCH (rtmp_sample.cap, 1 iters, 10 secs)
bytes: 490,248,129 packets: 1,510,824 bps: 381,488,072.1031

(the bps numbers give the bits per second processed for the run)

For Snabbwall, we hadn’t actually set a goal of processing packets at line-rate. And in any case, the performance of the system is limited by the processing speed of nDPI, which handles the actual deep-packet inspection work. But 1gbps is pretty far from line-rate, so I spent a few days on finding some low-hanging performance fruit.

Profiling and exploring traces

Most of the performance issues that I found were pinpointed by using the very helpful LuaJIT profiler and debugging tools. For debugging Snabb performance issues in particular, you can use an invocation like the following:

   ## dump verbose trace information
   $ ./snabb snsh -jv -p program.to.run -f <flags> args

The -jv option provides verbose profiling output that shows an overview of the tracing process. In particular, it shows when the trace recorder has to abort.

(see this page for details on LuaJIT’s control commands)

In case you’re not familiar with how tracing JITs like LuaJIT work, the basic idea is that the VM will run in an interpreted mode by default, and record traces through the program as it executes instructions.

(BTW, if you’re wondering what a trace is, it is “a linear sequence of instructions with a single entry point and one or more exit points”)

Once the VM finds a hot (i.e., frequently executed) trace that it is capable of compiling and is also worth compiling, the VM compiles the trace and runs the result.

If the compiler can’t handle some aspect of the trace, however, it will abort and return to the interpreter. If this happens in hot code, you can get severely degraded performance.

This was what was happening in Snabbwall. Here’s an excerpt from the trace info for a Snabbwall run:

[TRACE  83 util.lua:27 -> 72]
[TRACE --- util.lua:58 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE --- (78/1) scanner.lua:110 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE --- (78/1) scanner.lua:110 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE --- (78/1) scanner.lua:110 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE --- (78/1) scanner.lua:110 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE  84 (78/1) scanner.lua:110 -- fallback to interpreter]

The source code documentation in the LuaJIT implementation explains what the notation means. What’s important for our purposes is that the lines marked with --- and no trace number show trace aborts, where the compiler gave up.

As the comments note, trace aborts are not always a problem because the speed of the interpreter may be sufficient, presumably even more so if the code is not that warm.

In our case, however, these trace aborts are happening in the middle of the packet scanning code in scanner.lua, which is part of the core loop of the firewall. That’s a bad sign.

It turns out that the unsupported C type conversion error occurs in some cases when a cdata (the type for LuaJIT’s FFI objects) allocation is unsupported. You can see the code that’s throwing this error in the LuaJIT implementation here.

The specific line in Snabbwall that is causing the trace to abort in cdata allocation is this one:

key = flow_key_ipv4()

which is allocating a new instance of an FFI data type. The call occurs in a function which is called repeatedly in the scanning loop, so it triggers the allocation issue each time. The data type it’s trying to allocate is this one:

struct swall_flow_key_ipv4 {
   uint16_t vlan_id;
   uint8_t  __pad;
   uint8_t  ip_proto;
   uint8_t  lo_addr[4];
   uint8_t  hi_addr[4];
   uint16_t lo_port;
   uint16_t hi_port;
} __attribute__((packed));

Reading the LuaJIT internals a bit reveals that the issue is that an allocation of a struct which has an array field is unsupported in JIT-mode.

To test this hypothesis, here’s a small Lua script you can try that just allocates a struct with a single array field:

ffi = require("ffi")

ffi.cdef[[
  struct foo {
    uint8_t a[4];
  };
]]

for i=1, 1000 do
  local foo = ffi.new("struct foo")
end

Running this with the -jv option yields output like this:

$ luajit -jv cdata-test.lua 
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]

which is the same error we saw earlier. For Snabbwall, we can work around this by allocating the swall_flow_key_ipv4 data structure just once in the module. On each loop iteration, we then re-write the fields of that single instance instead of allocating a new one.

This might sound iffy, but as long as the lifetime of this flow key data structure is controlled, it should be ok. In particular, the documented API for Snabbwall doesn’t even expose this data structure, so we can ensure that an old reference is never read after the fields get overwritten.
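
Here’s a rough sketch of that pattern (the flow_key_for function and its arguments are illustrative, not Snabbwall’s actual code):

local ffi = require("ffi")

ffi.cdef[[
struct swall_flow_key_ipv4 {
   uint16_t vlan_id;
   uint8_t  __pad;
   uint8_t  ip_proto;
   uint8_t  lo_addr[4];
   uint8_t  hi_addr[4];
   uint16_t lo_port;
   uint16_t hi_port;
} __attribute__((packed));
]]

-- allocated once, at module load time; only the allocation is an
-- NYI operation, reading and writing the fields compiles fine
local key = ffi.new("struct swall_flow_key_ipv4")

local function flow_key_for(vlan_id, ip_proto, lo_port, hi_port)
   -- overwrite the fields of the shared instance on every call
   -- instead of allocating a fresh struct inside the hot loop
   key.vlan_id  = vlan_id
   key.ip_proto = ip_proto
   key.lo_port  = lo_port
   key.hi_port  = hi_port
   return key
end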

Using some dynasm

Once I optimized the flow key allocation, I saw another trace abort in Snabbwall that was trickier to work around. The relevant trace info is this excerpt here:

[TRACE  78 (71/3) scanner.lua:110 -> 72]
[TRACE --- (77/1) util.lua:34 -- NYI: unsupported C function type at wrap.lua:64]
[TRACE --- (77/1) util.lua:34 -- NYI: unsupported C function type at wrap.lua:64]
[TRACE --- (77/1) util.lua:34 -- NYI: unsupported C function type at wrap.lua:64]
[TRACE --- (77/1) util.lua:34 -- NYI: unsupported C function type at wrap.lua:64]
[TRACE  79 link.lua:45 return]
[TRACE  80 (77/1) util.lua:34 -- fallback to interpreter]

For this case, it wasn’t necessary to go read the LuaJIT source code to figure out exactly what was going on (though I suspect the error comes from this line). The module wrap.lua in the nDPI FFI library uses two C functions with the following signatures:

typedef struct { uint16_t master_protocol, protocol; } ndpi_protocol_t;

ndpi_protocol_t ndpi_detection_process_packet (ndpi_detection_module_t *detection_module,
                                               ndpi_flow_t *flow,
                                               const uint8_t *packet,
                                               unsigned short packetlen,
                                               uint64_t current_tick,
                                               ndpi_id_t *src,
                                               ndpi_id_t *dst);

ndpi_protocol_t ndpi_guess_undetected_protocol (ndpi_detection_module_t *detection_module,
                                                uint8_t protocol,
                                                uint32_t src_host, uint16_t src_port,
                                                uint32_t dst_host, uint16_t dst_port);

Note that both functions return a struct by value. If you read the FFI semantics page for LuaJIT closely, you’ll see that calls to “C functions with aggregates passed or returned by value” are described as having “suboptimal performance” because they’re not compiled.

This is a little tricky to work around without writing some C code. At the C level, it’s easy to write a wrapper that returns the struct data by reference through an extra pointer argument, avoiding the by-value return. Then wrap.lua can allocate its own protocol struct and pass that into the wrapper instead. That’s actually the first thing I did in order to test whether this approach improves the performance (spoiler: it did).
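
To illustrate the idea, the Lua side of that experiment could look roughly like this (the _byref wrapper name and its declaration are hypothetical; the actual binding lives in wrap.lua):

local ffi = require("ffi")

ffi.cdef[[
typedef struct { uint16_t master_protocol, protocol; } ndpi_protocol_t;

/* hypothetical C wrapper: same arguments as the original function,
   plus a result pointer in place of the by-value return */
void ndpi_detection_process_packet_byref (void *detection_module,
                                          void *flow,
                                          const uint8_t *packet,
                                          unsigned short packetlen,
                                          uint64_t current_tick,
                                          void *src, void *dst,
                                          ndpi_protocol_t *result);
]]

-- the result struct is allocated once, outside the hot loop
local proto = ffi.new("ndpi_protocol_t")

-- inside the scanning loop the call then becomes something like
-- (lib being the loaded wrapper library):
--   lib.ndpi_detection_process_packet_byref(dm, flow, pkt, len,
--                                           tick, src, dst, proto)
-- and proto.protocol holds the detected protocol afterwards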

But using a C wrapper complicates the build process for Snabbwall and introduces some issues with linking. It turns out that dynasm, which came up in a previous blog post, can help us out.

Specifically, instead of using a C wrapper, we can just write what the C wrapper code would do in dynamically-generated assembly code. Generating the code once at run-time lets us avoid any build/linking issues and it’s just as fast.

The downside is of course that it’s harder to write and debug. I’m not really a seasoned x64 assembly hacker, so it took me a while to grok the ABI docs in order to put it all together.

Here’s the wrapper code for the ndpi_detection_process_packet function:

local function gen(Dst)
   -- pass the first stack argument onto the original function
   | mov rax, [rsp+8]
   | push rax

   -- call the original function, do stack cleanup
   | mov64 rax, orig_f
   | call rax
   | add rsp, 8

   -- at this point, rax and rdx have struct
   -- fields in them, which we want to write into
   -- the struct pointer (2nd stack arg)
   | mov rcx, [rsp+16]
   | mov [rcx], rax
   | mov [rcx+4], rdx

   | ret
end

The code is basically just doing some function call plumbing following the x64 System V ABI.

What’s going on is that the original C function has 7 arguments and our assembly wrapper is supposed to have 8 (an additional pointer that we’ll write through). On x64, six integer arguments are passed through registers so the remaining two get passed on the stack.

That means we don’t need to modify any registers in this wrapper (since we will immediately call the original function), but we do need to re-push the first argument onto the stack to prepare for the call.

We can then call the function and afterwards increment the stack pointer, since the ABI requires the caller to clean up the stack.

The rest of the code just writes the two returned struct fields from the registers through the pointer on the stack to the struct contained in Lua-land.
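
Once the machine code has been generated, the Lua side can treat it like any other C function. A minimal sketch, assuming mcode points at the assembled wrapper and ndpi_protocol_t is declared as before:

-- cast the generated code to an FFI function pointer; note the extra
-- trailing result-pointer argument compared to the original function
local wrap_t = ffi.typeof([[void (*)(void *, void *, const uint8_t *,
                                     unsigned short, uint64_t,
                                     void *, void *, ndpi_protocol_t *)]])
local process_packet = ffi.cast(wrap_t, mcode)

-- this call avoids the aggregate-by-value NYI, so traces
-- containing it can be compiled
process_packet(dm, flow, pkt, len, tick, src, dst, proto)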

Speedup

With the changes I described in the post, the performance of Snabbwall on the benchmark improved quite a bit. Here are the numbers after implementing the three optimizations mentioned above:

Firewall on 02:00.1 (CPU 3). Packetblaster on 82:00.1.
BENCH (BITTORRENT.pcap, 1 iters, 10 secs)
bytes: 5,827,504,913 packets: 7,247,814 bps: 4,539,941,765.8323
BENCH (rtmp_sample.cap, 1 iters, 10 secs)
bytes: 6,099,115,315 packets: 18,793,998 bps: 4,745,353,670.4701

We’re still not at line-rate, but at this point the profiler attributes a large portion of the time (23% + 20%) to the two C functions that do the actual packet inspection work:

24%  ipv4_addr_cmp
23%  process_packet
20%  guess_undetected_protocol
 9%  extract_packet_info
 4%  scan_packet
 3%  hash

(there might be some optimization potential in ipv4_addr_cmp, which is in the Lua code)

In doing this optimization work, I was happy to find that the LuaJIT performance tools were very helpful, though I do think there might be an opportunity to put a more solution-oriented interface on them. For example, an optimization coach for LuaJIT could be interesting and useful.

February 21, 2017 02:22 PM

February 10, 2017

Carlos García Campos

Accelerated compositing in WebKitGTK+ 2.14.4

The WebKitGTK+ 2.14 release was very exciting for us: it finally introduced the threaded compositor to drastically improve accelerated compositing performance. However, the threaded compositor required accelerated compositing to be always enabled, even for non-accelerated contents. Unfortunately, this caused different kinds of problems for several people, and proved that we are not ready to render everything with OpenGL yet. The most relevant problems reported were:

  • Memory usage increase: OpenGL contexts use a lot of memory, and since we have the compositor in the web process, we have at least one OpenGL context in every web process. The threaded compositor uses the coordinated graphics model, which also requires more memory than the simple mode we previously used. People who use a lot of tabs in epiphany quickly noticed that the amount of memory required was a lot more.
  • Startup and resize slowness: The threaded compositor makes everything smooth and performs quite well, except at startup or when the view is resized. At startup we need to create the OpenGL context, which is quite slow by itself, but we also need to create the compositing thread, so things are expected to be slower. Resizing the viewport is the only threaded compositor task that needs to be done synchronously, to ensure that everything is in sync: the web view in the UI process, the OpenGL viewport and the backing store surface. This means we need to wait until the threaded compositor has updated to the new size.
  • Rendering issues: some people reported rendering artifacts or even nothing rendered at all. In most of the cases these were not issues in WebKit itself, but in the graphics driver or library. It’s quite difficult for a general purpose web engine to support and deal with all possible GPUs, drivers and libraries. Chromium has a huge list of hardware exceptions to disable some OpenGL extensions or even hardware acceleration entirely.

Because of these issues people started to use different workarounds. Some people, and even applications like evolution, started to use the WEBKIT_DISABLE_COMPOSITING_MODE environment variable, which was never meant for users, but for developers. Other people just started to build their own WebKitGTK+ with the threaded compositor disabled. We didn’t remove the build option because we anticipated that some people using old hardware might have problems. However, it’s a code path that is not tested at all and will certainly be removed in 2.18.

All these issues are not really specific to the threaded compositor, but to the fact that it forced the accelerated compositing mode to be always enabled, using OpenGL unconditionally. It looked like a good idea: entering/leaving accelerated compositing mode was a source of bugs in the past, and all other WebKit ports have accelerated compositing mode forced too. Other ports use UI-side compositing, though, or target very specific hardware, so the memory problems and the driver issues are not a problem for them. The requirement to force the accelerated compositing mode came from the switch to coordinated graphics: as I said, other ports using coordinated graphics have accelerated compositing mode always enabled, so they didn’t care about the case of it being disabled.

There are a lot of long-term things we can do to improve all these issues, like moving the compositor to the UI (or a dedicated GPU) process to have a single GL context, implementing tab suspension, etc., but we really wanted to fix or at least improve the situation for 2.14 users. Switching back to using accelerated compositing mode on demand is something we could do in the stable branch, and it improves things to a level at least comparable to what we had before 2.14, but with the threaded compositor. Making it happen was a matter of fixing a lot of bugs, and the result is this 2.14.4 release. Of course, this will be the default in 2.16 too, where we have also added API to set a hardware acceleration policy.

We recommend all 2.14 users to upgrade to 2.14.4 and stop using the WEBKIT_DISABLE_COMPOSITING_MODE environment variable or building with the threaded compositor disabled. The new API in 2.16 will allow setting a policy for every web view, so if you still need to disable or force hardware acceleration, please use the API instead of WEBKIT_DISABLE_COMPOSITING_MODE and WEBKIT_FORCE_COMPOSITING_MODE.

We really hope this new release and the upcoming 2.16 will work much better for everybody.

by carlos garcia campos at February 10, 2017 05:18 PM

February 08, 2017

Alberto Garcia

QEMU and the qcow2 metadata checks

When choosing a disk image format for your virtual machine, one of the factors to take into consideration is its I/O performance. In this post I’ll talk a bit about the internals of qcow2 and about one of the aspects that can affect its performance under QEMU: its consistency checks.

As you probably know, qcow2 is QEMU’s native file format. The first thing that I’d like to highlight is that this format is perfectly fine in most cases and its I/O performance is comparable to that of a raw file. When it isn’t, chances are that this is due to an insufficiently large L2 cache. In one of my previous blog posts I wrote about the qcow2 L2 cache and how to tune it, so if your virtual disk is too slow, you should go there first.

I also recommend Max Reitz and Kevin Wolf’s qcow2: why (not)? talk from KVM Forum 2015, where they talk about a lot of internal details and show some performance tests.

qcow2 clusters: data and metadata

A qcow2 file is organized into units of constant size called clusters. The cluster size defaults to 64KB, but a different value can be set when creating a new image:

qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G

Clusters can contain either data or metadata. A qcow2 file grows dynamically and only allocates space when it is actually needed, so apart from the header there’s no fixed location for any of the data and metadata clusters: they can appear mixed anywhere in the file.

Here’s an example of what it looks like internally:

In this example we can see the most important types of clusters that a qcow2 file can have:

  • Header: this one contains basic information such as the virtual size of the image, the version number, and pointers to where the rest of the metadata is located, among other things.
  • Data clusters: the data that the virtual machine sees.
  • L1 and L2 tables: a two-level structure that maps the virtual disk that the guest can see to the actual location of the data clusters in the qcow2 file.
  • Refcount table and blocks: a two-level structure with a reference count for each data cluster. Internal snapshots use this: a cluster with a reference count >= 2 means that it’s used by other snapshots, and therefore any modifications require a copy-on-write operation.

Metadata overlap checks

In order to detect corruption when writing to qcow2 images QEMU (since v1.7) performs several sanity checks. They verify that QEMU does not try to overwrite sections of the file that are already being used for metadata. If this happens, the image is marked as corrupted and further access is prevented.

Although in most cases these checks are innocuous, under certain scenarios they can have a negative impact on disk write performance. This depends a lot on the case, and I want to insist that in most scenarios it doesn’t have any effect. When it does, the general rule is that you’ll have more chances of noticing it if the storage backend is very fast or if the qcow2 image is very large.

In these cases, and if I/O performance is critical for you, you might want to consider tweaking the images a bit or disabling some of these checks, so let’s take a look at them. There are currently eight different checks. They’re named after the metadata sections that they check, and can be divided into the following categories:

  1. Checks that run in constant time. These are equally fast for all kinds of images and I don’t think they’re worth disabling.
    • main-header
    • active-l1
    • refcount-table
    • snapshot-table
  2. Checks that run in variable time but don’t need to read anything from disk.
    • refcount-block
    • active-l2
    • inactive-l1
  3. Checks that need to read data from disk. There is just one check here and it’s only needed if there are internal snapshots.
    • inactive-l2

By default all tests are enabled except for the last one (inactive-l2), because it needs to read data from disk.

Disabling the overlap checks

Tests can be disabled or enabled from the command line using the following syntax:

-drive file=hd.qcow2,overlap-check.inactive-l2=on
-drive file=hd.qcow2,overlap-check.snapshot-table=off

It’s also possible to select the group of checks that you want to enable using the following syntax:

-drive file=hd.qcow2,overlap-check.template=none
-drive file=hd.qcow2,overlap-check.template=constant
-drive file=hd.qcow2,overlap-check.template=cached
-drive file=hd.qcow2,overlap-check.template=all

Here, none means that no tests are enabled, constant enables all tests from group 1, cached enables all tests from groups 1 and 2, and all enables all of them.

As I explained in the previous section, if you’re worried about I/O performance then the checks that are probably worth evaluating are refcount-block, active-l2 and inactive-l1. I’m not counting inactive-l2 because it’s off by default. Let’s look at the other three:

  • inactive-l1: This is a variable length check because it depends on the number of internal snapshots in the qcow2 image. However its performance impact is likely to be negligible in all cases so I don’t think it’s worth bothering with.
  • active-l2: This check depends on the virtual size of the image, and on the percentage that has already been allocated. This check might have some impact if the image is very large (several hundred GBs or more). In that case one way to deal with it is to create an image with a larger cluster size. This also has the nice side effect of reducing the amount of memory needed for the L2 cache (see the note after this list).
  • refcount-block: This check depends on the actual size of the qcow2 file and is independent of its virtual size. This check is relatively expensive even for small images, so if you notice performance problems chances are that they are due to this one. The good news is that we have been working on optimizing it, so if it’s slowing down your VMs the problem might go away completely in QEMU 2.9.
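
To put the cluster-size advice in numbers: each L2 entry takes 8 bytes and maps one cluster, so the amount of L2 metadata needed to cover a whole disk is

l2_metadata_size = disk_size × 8 / cluster_size

For example, a 4 TB image needs 512 MB of L2 tables with the default 64 KB clusters, but only 256 MB with 128 KB clusters: doubling the cluster size halves the L2 metadata (and therefore the cache needed to hold it).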

Conclusion

The qcow2 consistency checks are useful to detect data corruption, but they can affect write performance.

If you’re unsure and you want to check it quickly, open an image with overlap-check.template=none and see for yourself, but remember again that this will only affect write operations. To obtain more reliable results you should also open the image with cache=none in order to perform direct I/O and bypass the page cache. I’ve seen performance increases of 50% and more, but whether you’ll see them depends a lot on your setup. In many cases you won’t notice any difference.
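
For example, a quick (hypothetical) test invocation combining both options might look like this:

qemu-system-x86_64 -drive file=hd.qcow2,overlap-check.template=all,cache=none [...]
qemu-system-x86_64 -drive file=hd.qcow2,overlap-check.template=none,cache=none [...]

Running the same write-heavy workload under both configurations shows how much of the difference is attributable to the checks.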

I hope this post was useful to learn a bit more about the qcow2 format. There are other things that can help QEMU perform better, and I’ll probably come back to them in future posts, so stay tuned!

Acknowledgments

My work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the rest of the QEMU development team.

by berto at February 08, 2017 08:52 AM

Michael Catanzaro

An Update on WebKit Security Updates

One year ago, I wrote a blog post about WebKit security updates that attracted a fair amount of attention at the time. For a full understanding of the situation, you really have to read the whole thing, but the most important point was that, while WebKitGTK+ — one of the two WebKit ports present in Linux distributions — was regularly releasing upstream security updates, most Linux distributions were ignoring the updates, leaving users vulnerable to various security bugs, mainly of the remote code execution variety. At the time of that blog post, only Arch Linux and Fedora were regularly releasing WebKitGTK+ updates, and Fedora had only very recently begun doing so comprehensively.

Progress report!

So how have things changed in the past year? The best way to see this is to look at the versions of WebKitGTK+ in currently-supported distributions. The latest version of WebKitGTK+ is 2.14.3, which fixes 13 known security issues present in 2.14.2. Do users of the most popular Linux operating systems have the fixes?

  • Fedora users are good. Both Fedora 24 and Fedora 25 have the latest version, 2.14.3.
  • If you use Arch, you know you always have the latest stuff.
  • Ubuntu users rejoice: 2.14.3 updates have been released to users of both Ubuntu 16.04 and 16.10. I’m very  pleased that Ubuntu has decided to take my advice and make an exception to its usual stable release update policy to ensure its users have a secure version of WebKit. I can’t give Ubuntu an A grade here because the updates tend to lag behind upstream by several months, but slow updates are much better than no updates, so this is undoubtedly a huge improvement. (Anyway, it’s hardly a bad idea to be cautious when releasing a big update with high regression potential, as is unfortunately the case with even stable WebKit updates.) But if you use the still-supported Ubuntu 14.04 or 12.04, be aware that these versions of Ubuntu cannot ever update WebKit, as it would require a switch to WebKit2, a major API change.
  • Debian does not update WebKit as a matter of policy. The latest release, Debian 8.7, is still shipping WebKitGTK+ 2.6.2. I count 184 known vulnerabilities affecting it, though that’s an overcount as we did not exclude some Mac-specific security issues from the 2015 security advisories. (Shipping ancient WebKit is not just a security problem, but a user experience problem too. Actually attempting to browse the web with WebKitGTK+ 2.6.2 is quite painful due to bugs that were fixed years ago, so please don’t try to pretend it’s “stable.”) Note that a secure version of WebKitGTK+ is available for those in the know via the backports repository, but this does no good for users who trust Debian to provide them with security updates by default without requiring difficult configuration. Debian testing users also currently have the latest 2.14.3, but you will need to switch to Debian unstable to get security updates for the foreseeable future, as testing is about to freeze.
  • For openSUSE users, only Tumbleweed has the latest version of WebKit. The current stable release, Leap 42.2, ships with WebKitGTK+ 2.12.5, which is coincidentally affected by exactly 42 known vulnerabilities. (I swear I am not making this up.) The previous stable release, Leap 42.1, originally released with WebKitGTK+ 2.8.5 and later updated to 2.10.7, but never past that. It is affected by 65 known vulnerabilities. (Note: I have to disclose that I told openSUSE I’d try to help out with that update, but never actually did. Sorry!) openSUSE has it a bit harder than other distros because it has decided to use SUSE Linux Enterprise as the source for its GCC package, meaning it’s stuck on GCC 4.8 for the foreseeable future, while WebKit requires GCC 4.9. Still, this is only a build-time requirement; it’s not as if it would be impossible to build with Clang instead, or a custom version of GCC. I would expect WebKit updates to be provided to both currently-supported Leap releases.
  • Gentoo has the latest version of WebKitGTK+, but only in testing. The latest version marked stable is 2.12.5, so this is a serious problem if you’re following Gentoo’s stable channel.
  • Mageia has been updating WebKit and released a couple security advisories for Mageia 5, but it seems to be stuck on 2.12.4, which is disappointing, especially since 2.12.5 is a fairly small update. The problem here does not seem to be lack of upstream release monitoring, but rather lack of manpower to prepare the updates, which is a typical problem for small distros.
  • The enterprise distros from Red Hat, Oracle, and SUSE do not provide any WebKit security updates. They suffer from the same problem as Ubuntu’s old LTS releases: the WebKit2 API change  makes updating impossible. See my previous blog post if you want to learn more about that. (SUSE actually does have WebKitGTK+ 2.12.5 as well, but… yeah, 42.)

So results are clearly mixed. Some distros are clearly doing well, and others are struggling, and Debian is Debian. Still, the situation on the whole seems to be much better than it was one year ago. Most importantly, Ubuntu’s decision to start updating WebKitGTK+ means the vast majority of Linux users are now receiving updates. Thanks Ubuntu!

To arrive at the above vulnerability totals, I just counted up the CVEs listed in WebKitGTK+ Security Advisories, so please do double-check my counting if you want. The upstream security advisories themselves are worth mentioning, as we have only been releasing these for two years now, and the first year was pretty rough when we lost our original security contact at Apple shortly after releasing the first advisory: you can see there were only two advisories in all of 2015, and the second one was huge as a result of that. But 2016 seems to have gone decently well. WebKitGTK+ has normally been releasing most security fixes even before Apple does, though the actual advisories and a few remaining fixes normally lag behind Apple by roughly a month or so. Big thanks to my colleagues at Igalia who handle this work.

Challenges ahead

There are still some pretty big problems remaining!

First of all, the distributions that still aren’t releasing regular WebKit updates should start doing so.

Next, we have to do something about QtWebKit, the other big WebKit port for Linux, which stopped receiving security updates in 2013 after the Qt developers decided to abandon the project. The good news is that Konstantin Tokarev has been working on a QtWebKit fork based on WebKitGTK+ 2.12, which is almost (but not quite yet) ready for use in distributions. I hope we are able to switch to use his project as the new upstream for QtWebKit in Fedora 26, and I’d encourage other distros to follow along. WebKitGTK+ 2.12 does still suffer from those 42 vulnerabilities, but this will be a big improvement nevertheless and an important stepping stone for a subsequent release based on the latest version of WebKitGTK+. (Yes, QtWebKit will be a downstream of WebKitGTK+. No, it will not use GTK+. It will work out fine!)

It’s also time to get rid of the old WebKitGTK+ 2.4 (“WebKit1”), which all distributions currently parallel-install alongside modern WebKitGTK+ (“WebKit2”). It’s very unfortunate that a large number of applications still depend on WebKitGTK+ 2.4 — I count 41 such packages in Fedora — but this old version of WebKit is affected by over 200 known vulnerabilities and really has to go sooner rather than later. We’ve agreed to remove WebKitGTK+ 2.4 and its dependencies from Fedora rawhide right after Fedora 26 is branched next month, so they will no longer be present in Fedora 27 (targeted for release in November). That’s bad for you if you use any of the affected applications, but fortunately most of the remaining unported applications are not very important or well-known; the most notable ones that are unlikely to be ported in time are GnuCash (which won’t make our deadline) and Empathy (which is ported in git master, but is not currently in a  releasable state; help wanted!). I encourage other distributions to follow our lead here in setting a deadline for removal. The alternative is to leave WebKitGTK+ 2.4 around until no more applications are using it. Distros that opt for this approach should be prepared to be stuck with it for the next 10 years or so, as the remaining applications are realistically not likely to be ported so long as zombie WebKitGTK+ 2.4 remains available.

These are surmountable problems, but they require action by downstream distributions. No doubt some distributions will be more successful than others, but hopefully many distributions will be able to fix these problems in 2017. We shall see!

by Michael Catanzaro at February 08, 2017 06:32 AM

On Epiphany Security Updates and Stable Branches

One of the advantages of maintaining a web browser based on WebKit, like Epiphany, is that the vast majority of complexity is contained within WebKit. Epiphany itself doesn’t have any code for HTML parsing or rendering, multimedia playback, or JavaScript execution, or anything else that’s actually related to displaying web pages: all of the hard stuff is handled by WebKit. That means almost all of the security problems exist in WebKit’s code and not Epiphany’s code. While WebKit has been affected by over 200 CVEs in the past two years, and those issues do affect Epiphany, I believe nobody has reported a security issue in Epiphany’s code during that time. I’m sure a large part of that is simply because only the bad guys are looking, but the attack surface really is much, much smaller than that of WebKit. To my knowledge, the last time we fixed a security issue that affected a stable version of Epiphany was 2014.

Well, that streak has unfortunately ended; you need to make sure to update to Epiphany 3.22.6, 3.20.7, or 3.18.11 as soon as possible (or Epiphany 3.23.5 if you’re testing our unstable series). If your distribution is not already preparing an update, insist that it do so. I’m not planning to discuss the embarrassing issue here — you can check the bug report if you’re interested — but rather to explain why I made new releases on three different branches. That’s quite unlike how we handle WebKitGTK+ updates! Distributions must always update to the very latest version of WebKitGTK+, as it is not practical to backport dozens of WebKit security fixes to older versions of WebKit. This is rarely a problem, because WebKitGTK+ has a strict policy to dictate when it’s acceptable to require new versions of runtime dependencies, designed to ensure roughly three years of WebKit updates without the need to upgrade any of its dependencies. But new major versions of Epiphany are usually incompatible with older releases of system libraries like GTK+, so it’s not practical or expected for distributions to update to new major versions.

My current working policy is to support three stable branches at once: the latest stable release (currently Epiphany 3.22), the previous stable release (currently Epiphany 3.20), and an LTS branch defined by whatever’s currently in Ubuntu LTS and elementary OS (currently Epiphany 3.18). It was nice of elementary OS to make Epiphany its default web browser, and I would hardly want to make it difficult for its users to receive updates.

Three branches can be annoying at times, and it’s a lot more than is typical for a GNOME application, but a web browser is not a typical application. For better or for worse, the majority of our users are going to be stuck on Epiphany 3.18 for a long time, and it would be a shame to leave them completely without updates. That said, the 3.18 and 3.20 branches are very stable and only getting bugfixes and occasional releases for the most serious issues. In contrast, I try to backport all significant bugfixes to the 3.22 branch and do a new release every month or thereabouts.

So that’s why I just released another update for Epiphany 3.18, which was originally released in September 2015. Compare this to the long-term support policies of Chrome (which supports only the latest version of the browser, and only for six weeks) or Firefox (which provides nine months of support for an ESR release), and I think we compare quite favorably. (A stable WebKit series like 2.14 is only supported for six months, but that’s comparable to Firefox.) Not bad?

by Michael Catanzaro at February 08, 2017 05:56 AM

January 27, 2017

Asumu Takikawa

Snabbwall's Firewall App: L7Fw

Recently, I’ve been helping out with the development of the Snabbwall project at Igalia. Snabbwall is a project to develop a layer 7 firewall using the Snabb framework.

The project is graciously sponsored by the NLNet Foundation.

A layer 7 firewall, for those just tuning in, is a firewall that can track and filter packets based on their state/contents at the application-level (layer 7 in the OSI model). Comparatively, a more traditional firewall (think iptables-based firewalls like UFW) typically makes decisions at layer 3 or 4. For example, such firewalls may not be able to pinpoint and block arbitrary Bittorrent traffic since those packets may be hard to distinguish from other TCP traffic.

(see my colleague Adrián’s blog post introducing Snabbwall for more background)

Most of the guts of Snabbwall were already built in 2016: a libndpi FFI binding for LuaJIT and a Snabb scanner app (L7Spy) that uses the FFI binding for deep packet inspection.

The remaining development milestone was to build a small app (that we call L7Fw) that would consult the packet scanner, apply some firewall rules, and take care of management duties such as logging.

The design of the firewall suite assumes that the Snabb app network will be set up so that an L7Spy instance will scan packets before L7Fw sees them. The scanner communicates packet flow information to the firewall, consisting of a protocol name (e.g., HTTP or BITTORRENT) and how many packets are included in the flow.

Note that the set of detectable protocols will depend on the packet scanner that is used. L7Fw itself is agnostic about what kind of scanner is used by L7Spy (in practice only one is implemented so far—an nDPI scanner).

The point of the firewall app is to be able to make forwarding decisions based on this scanner information. For example, a user may wish to express a rule like “drop the packet if at least 5 packets are detected in a Bittorrent flow, and only if the destination IP is outside of this subnet”.

To accommodate such rules, the app lets the user write the firewall rules using pflang (which I’ve conveniently already written about a fair amount on this blog). Specifically, the firewall rules can be written using the pfmatch DSL augmented with some firewall-specific information.

The example Bittorrent rule as a pfmatch expression is illustrated in the following firewall policy table:

{ -- pfmatch rule to block bittorrent traffic
  BITTORRENT = [[match {
                  flow_count >= 5 and not dst net 192.168.1.0 => drop
                  otherwise => accept
               }]],
  -- catch-all rule
  default = "accept" }

The table is indexed by the name of detected protocols. The mapped values are firewall policy strings, such as "accept" to forward the packet to the next app.

When a pfmatch expression is given, as in the first rule above, it is compiled into a matching function (that can call accept, reject, and drop methods) and applied to the relevant packets.
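
Conceptually, the BITTORRENT rule above compiles into a dispatch function shaped like the sketch below (illustrative only; pfmatch’s generated code and helper names differ):

-- `self` is the firewall app, which supplies the handler methods;
-- `dst_in_net` stands in for the compiled `dst net` test
local function match_bittorrent(self, data, length)
   if self.flow_count >= 5 and not dst_in_net(data, "192.168.1.0") then
      return self:drop(data, length)
   else
      return self:accept(data, length)
   end
end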

The L7Fw app is intended for use in custom Snabb deployments in which you may install the firewall in front of other Snabb apps (either off the shelf or your own bespoke ones) as needed.

For testing purposes, however, Snabbwall comes with a snabb wall filter program that lets you test out firewall rules to see how they will work. Like the snabb wall spy program, it can take input from a pcap file or a network interface.

For example, if we put the rules above in a file firewall-rules.lua, we can invoke the firewall on some test data from the Snabbwall repo like this:

$ sudo ./snabb wall filter -p -4 192.168.1.1 -m 12:34:56:78:9a:bc\
    -f firewall-rules.lua pcap program/wall/tests/data/BITTORRENT.pcap

(assuming your $PWD is the src directory of the Snabb repo)

which will print out a report like the following:

apps report:
l7fw
Accepted packets: 0 (0%)
Rejected packets: 0 (0%)
Dropped packets:  53 (100%)

If you run it on a non-bittorrent data file like program/wall/tests/data/http.cap you’ll see different results:

apps report:
l7fw
Accepted packets: 43 (100%)
Rejected packets: 0 (0%)
Dropped packets:  0 (0%)

Supplying the -l option will log blocked packets to the system log in a format similar to other Linux firewalls:

snabb[27740]: [Snabbwall DROP] PROTOCOL=BITTORRENT MAC=00:03:ff:3e:d0:dc SRC=10.10.10.22 DST=10.10.10.23

This could be useful if you are deploying your own firewall with L7Fw and want to monitor it.

Implementation

The app implementation itself is pretty straightforward. The most complicated code in the app is the method that constructs responses for "reject" rules. A reject rule (as in the corresponding iptables target) will drop a packet like the "drop" rule but will also send an error packet back to the sender. The error response is either an ICMP packet or a TCP RST packet, depending on the original input packet.

That means there’s some fiddly code that has to deal with constructing various packets with the correct fields. Luckily, Snabb comes with some libraries for constructing protocol headers which simplifies the code considerably.

It’s still easy to mess it up though. Before I wrote my unit test suite (I really should have followed the function design recipe and written my tests first), some of the ICMPv6 packets had an ethernet frame that indicated that they were IPv4 packets instead. I only caught the bug because my unit tests failed.

Future plans

With this milestone, the Snabbwall project is close to wrapping up the work funded by the NLNet folks. The next step is to write a user guide with details on how to set up your own firewall solution.

If you’d like to try the firewall out, the docs for the Snabb app are available on the Snabbwall website. The code is hosted on Github here.

January 27, 2017 03:32 PM

January 22, 2017

Frédéric Wang

Mus Window System

Update 2017/02/01: The day I published this blog post, the old mus client library was removed. Some stack traces were taken during the analysis Antonio & I made in fall 2016, but they are now obsolete. I’m updating the blog in order to try and clarify things and fix possible inaccuracies.

TL; DR

Igalia has recently been working on making Chromium Desktop run natively on Ozone/Wayland thanks to support from Renesas. This is however still experimental and there is a lot of UI work to do. In order to facilitate discussion at BlinkOn7, more specifically about how browser windows are managed, we started an analysis of the Chromium window system. An overview is provided in this blog post.

Introduction

Antonio described how we were able to get upstream Chromium running on Linux/Ozone/Wayland last year. However for that purpose the browser windows were really embedded in the “ash” environment (with additional widgets) which was itself drawn in an Ozone/Wayland window. This configuration is obviously not ideal but it at least allowed us to do some initial experiments and to present demos of Chromium running on R-Car M3 at CES 2017. If you are interested you can check our ces-demos-2017 and meta-browser repositories on GitHub. Our next goal is to have all the browser windows handled as native Ozone/Wayland windows.

In a previous blog post, I gave an overview of the architecture of Ozone. In particular, I described classes used to handle Ozone windows for the X11 or Wayland platforms. However, this was only a small part of the picture and to understand things globally one really has to take into account the classes for the Aura window system and Mus window client & server.

Window-related Classes

Class Hierarchy

A large number of C++ classes and types are involved in the Chromium window system. The following diagram provides an overview of the class hierarchy. It can roughly be divided into four parts:

  • In blue, the native Ozone Windows and their host class in Aura.
  • In red and orange, the Aura Window Tree.
  • In purple, the Mus Window Tree (client-side). Update 2017/02/01: These classes have been removed.
  • In green and brown, the Mus Window Tree (server-side) and its associated factories.

Windowing System

I used the following convention which is more or less based on UML:

  • Rectangles represent classes/interfaces while ellipses represent enums, types, defines etc. You can click them to access the corresponding C++ source file.
  • Inheritance is indicated by a white arrow to the base class from the derived class. For implementation of Mojo interfaces, dashed white arrows are used.
  • Other associations are represented by an arrow with an open or diamond head. They mean that the origin of the arrow has a member involving one or more instances of the pointed-to class; you can hover over the arrow head to see the name of that class member.

Native Ozone Windows (blue)

ui::PlatformWindow is an abstract class representing a single window in the underlying platform windowing system. Examples of implementations include ui::X11WindowOzone or ui::WaylandWindow.

ui::PlatformWindow can be stored in a ui::ws::PlatformDisplayDefault, which implements the ui::PlatformWindowDelegate interface too. This in turn allows platform windows to be used as ui::ws::Display. ui::ws::Display also has an associated delegate class.

aura::WindowTreeHost is a generic class that hosts an embedded aura::Window root and bridges between the native window and that embedded root window. One can register instances of aura::WindowTreeHostObserver as observers. There are various implementations of aura::WindowTreeHost but we only consider aura::WindowTreeHostPlatform here.

Native ui::PlatformWindow windows are stored in an instance of aura::WindowTreeHostPlatform, which also holds the gfx::AcceleratedWidget to paint on. aura::WindowTreeHostPlatform implements the ui::PlatformWindowDelegate so that it can listen for events from the underlying ui::PlatformWindow.

aura::WindowTreeHostMus is a derived class of aura::WindowTreeHostPlatform that has additional logic to make it usable in mus. One can listen for changes from aura::WindowTreeHostMus by implementing the aura::WindowTreeHostMusDelegate interface.

Aura Windows (orange)

aura::Window represents windows in Aura. One can communicate with instances of that class using an aura::WindowDelegate. It is also possible to listen for events by registering a list of aura::WindowObserver.

aura::WindowPort defines an interface to enable aura::Window to be used either with mus via aura::WindowMus or with the classical environment via aura::WindowPortLocal. Here, we only consider the former.

WindowPortMus holds a reference to an aura::WindowTreeClient. All the changes to the former are forwarded to the latter so that we are sure they are propagated to the server. Conversely, changes received by aura::WindowTreeClient are sent back to WindowPortMus. However, aura::WindowTreeClient uses an intermediary aura::WindowMus class for that purpose, to avoid the changes being submitted to the server again.

Aura Window Tree Client (red)

aura::WindowTreeClient is an implementation of the ui::mojom::WindowManager and ui::mojom::WindowTreeClient Mojo interfaces for Aura. As in previous classes, we can associate to it a aura::WindowTreeClientDelegate or register a list of aura::WindowTreeClientObserver.

aura::WindowTreeClient also implements the aura::WindowManagerClient interface and has an associated aura::WindowManagerDelegate. It has a set of root aura::WindowPortMus which serve as intermediaries to the actual aura::Window instances.

In order to use an Aura Window Tree Client, you first create it using a Mojo connector and an aura::WindowTreeClientDelegate, as well as an optional aura::WindowManagerDelegate and an optional Mojo interface request for ui::mojom::WindowTreeClient. After that you need to call ConnectViaWindowTreeFactory or ConnectAsWindowManager to connect to the Window server and create the corresponding ui::mojom::WindowTree connection. Obviously, AddObserver can be used to add observers.

Finally, the CreateWindowPortForTopLevel function can be used to request the server to create a new top level window and associate a WindowPortMus to it.

Mus Window Tree Client (purple)

Update 2017/02/01: These classes have been removed.

ui::WindowTreeClient is the equivalent of aura::WindowTreeClient in mus. It implements the ui::mojom::WindowManager and ui::mojom::WindowTreeClient Mojo interfaces. It derives from ui::WindowManagerClient, has a list of ui::WindowTreeClientObserver and is associated to ui::WindowManagerDelegate and ui::WindowTreeClientDelegate.

Regarding windows, they are simpler than in Aura since we only have a single ui::Window class. As in Aura, we can register ui::WindowObserver. We can also add ui::Window as root windows or embedded windows of ui::WindowTreeClient.

In order to use a Mus Window Tree Client, you first create it using a ui::WindowTreeClientDelegate as well as optional ui::WindowManagerDelegate and Mojo interface request for ui::mojom::WindowTreeClient. After that you need to call ConnectViaWindowTreeFactory or ConnectAsWindowManager to connect to the Window server and create the corresponding ui::mojom::WindowTree connection. Obviously, AddObserver can be used to add observers.

Finally, the NewTopLevelWindow and NewWindow functions can be used to create new windows. This will cause the corresponding functions to be called on the connected ui::mojom::WindowTree.

Mus Window Tree Server (green and brown)

We just saw the Aura and Mus window tree clients. The ui::mojom::WindowTree Mojo interface is used for the window tree server and is implemented by ui::ws::WindowTree. In order to create window trees, we can use either the ui::ws::WindowManagerWindowTreeFactory or the ui::ws::WindowTreeFactory, which are available through the corresponding ui::mojom::WindowManagerWindowTreeFactory and ui::mojom::WindowTreeFactory Mojo interfaces.

ui::ws::WindowTreeHostFactory implements the ui::mojom::WindowTreeHostFactory Mojo interface. It allows creating ui::mojom::WindowTreeHost instances, more precisely ui::ws::Display objects implementing that interface. Under the hood, a ui::ws::Display created that way actually has an associated ws::DisplayBinding that creates a ui::ws::WindowTree. Of course, the display also has a native window associated to it.

Mojo Window Tree APIs (pink)

Update 2017/02/01: The Mus client has been removed. This section only applies to the Aura client.

As we have just seen in previous sections the Aura and Mus window systems are very similar and in particular implement the same Mojo ui::mojom::WindowManager and ui::mojom::WindowTreeClient interfaces. In particular:

  • The ui::mojom::WindowManager::WmCreateTopLevelWindow function can be used to create a new top level window. This results in a new windows (aura::Window or ui::Window) being added to the set of embedded windows.

  • The ui::mojom::WindowManager::WmNewDisplayAdded function is invoked when a new display is added. This results in a new window being added to the set of window roots (aura::WindowMus or ui::Window).

  • The ui::mojom::WindowTreeClient::OnEmbed function is invoked when Embed is called on the connected ui::mojom::WindowTree. The window parameter will be set as the window root (aura::WindowMus or ui::Window). Note that we assume that the set of roots is empty when such an embedding happens.

Note that for the Aura Window Tree Client, the creation of new aura::WindowMus window roots implies the creation of the associated aura::WindowTreeHostMus and aura::WindowPortMus. The creation of the new aura::Window window by aura::WmCreateTopLevelWindow is left to the corresponding implementation in aura::WindowManagerDelegate.

On the server side, the ui::ws::WindowTree implements the ui::mojom::WindowTree mojo interface. In particular:

  • The ui::mojom::WindowTree::Embed function can be used to embed a WindowTreeClient at a given window. This will result of OnEmbed being called on the window tree client.

  • The ui::mojom::WindowTree::NewWindow function can be used to create a new window. It will call OnChangeCompleted on the window tree client.

  • The ui::mojom::WindowTree::NewTopLevelWindow function can be used to request the window manager to create a new top level window. It will call OnTopLevelCreated (or OnChangeCompleted in case of failure) on the window tree client.

Analysis of Window Creation

Chrome/Mash

In our project, we are interested in how Chrome/Mash windows are created. When the chrome browser process is launched, it takes the regular content/browser startup path. That means in chrome/app/chrome_main.cc it does not take the ::MainMash path, but instead content::ContentMain. This is the call stack of Chrome window creation within the mash shell:

#1 0x7f7104e9fd02 ui::WindowTreeClient::NewTopLevelWindow()
#2 0x5647d0ed9067 BrowserFrameMus::BrowserFrameMus()
#3 0x5647d09c4a47 NativeBrowserFrameFactory::Create()
#4 0x5647d096bbc0 BrowserFrame::InitBrowserFrame()
#5 0x5647d0877b8a BrowserWindow::CreateBrowserWindow()
#6 0x5647d07d5a48 Browser::Browser()
...

Update 2017/02/01: This has changed since the time when the stack trace was taken. The BrowserFrameMus constructor now creates a views:DesktopNativeWidgetAura and thus an aura::Window.

The outermost window created when one executes chrome --mash comes about as follows. When the UI service starts (see Service::OnStart in services/ui/service.cc), an instance of WindowServer is created. This creation triggers the creation of a GpuServiceProxy instance. See the related snippets below:

void Service::OnStart() {
  (...)
  // Gpu must be running before the PlatformScreen can be initialized.
  window_server_.reset(new ws::WindowServer(this));
  (..)
}

WindowServer::WindowServer(WindowServerDelegate* delegate)
: delegate_(delegate),
  (..)
  gpu_proxy_(new GpuServiceProxy(this)),
  (..)
{ }

GpuServiceProxy then starts off the GPU service initialization. Once the GPU connection is established, the following callback is called: GpuServiceProxy::OnInternalGpuChannelEstablished. This calls back to WindowServer::OnGpuChannelEstablished, which calls Service::StartDisplayInit. The latter schedules a call to PlatformScreenStub::FixedSizeScreenConfiguration. This finally triggers the creation of an Ozone top level window via a ui::ws::Display:

#0 base::debug::StackTrace::StackTrace()
#1 ui::(anonymous namespace)::OzonePlatformX11::CreatePlatformWindow()
#2 ui::ws::PlatformDisplayDefault::Init()
#3 ui::ws::Display::Init()
#4 ui::ws::DisplayManager::OnDisplayAdded()
#5 ui::ws::PlatformScreenStub::FixedSizeScreenConfiguration()
(..)

On the client-side, the addition of a new display will cause aura::WindowTreeClient::WmNewDisplayAdded to be called and will eventually lead to the creation of an aura::WindowTreeHostMus with the corresponding display id. As shown on the graph, this class derives from aura::WindowTreeHost and hence instantiates an aura::Window.

Mus demo

The Chromium source code contains a small mus demo that is also useful for experiments. The demo is implemented via a class with the following inheritance:

class MusDemo : public service_manager::Service,
                public aura::WindowTreeClientDelegate,
                public aura::WindowManagerDelegate

i.e. it can be started as a Mus service and handles some events from aura::WindowTreeClient and aura::WindowManager.

When the demo service is launched, it creates a new Aura environment and an aura::WindowTreeClient; it makes the Mojo request to create a ui::WindowTree and then performs the connection:

#0 aura::WindowTreeClient::ConnectAsWindowManager()
#1 ui::demo::MusDemo::OnStart()
#2 service_manager::ServiceContext::OnStart()

Similarly to Chrome/Mash, a new display and its associated native window are created by the UI service:

#0 ui::ws::WindowTree::AddRootForWindowManager()
#1 ui::ws::Display::CreateWindowManagerDisplayRootFromFactory()
#2 ui::ws::Display::InitWindowManagerDisplayRoots()
#3 ui::ws::PlatformDisplayDefault::OnAcceleratedWidgetAvailable()
#4 ui::X11WindowBase::Create()
#5 ui::(anonymous namespace)::OzonePlatformX11::CreatePlatformWindow()
#6 ui::ws::PlatformDisplayDefault::Init()
#7 ui::ws::Display::Init()
#8 ui::ws::DisplayManager::OnDisplayAdded()

The Aura Window Tree Client listens for the changes and eventually aura::WindowTreeClient::WmNewDisplayAdded is called. It then informs its MusDemo delegate so that the display is added to the screen display list:

#0 DisplayList::AddDisplay
#1 ui::demo::MusDemo::OnWmWillCreateDisplay
#2 aura::WindowTreeClient::WmNewDisplayAddedImpl
#3 aura::WindowTreeClient::WmNewDisplayAdded

…and the animated bitmap window is added to the Window Tree:

#0  ui::demo::MusDemo::OnWmNewDisplay
#1  aura::WindowTreeClient::WmNewDisplayAddedImpl
#2  aura::WindowTreeClient::WmNewDisplayAdded

Update 2017/02/01: Again, aura::WindowTreeClient::WmNewDisplayAdded will lead to the creation of an aura Window.

Conclusion

Currently, both Chrome/Mash and the Mus demo create a ui::ws::Display containing a native Ozone window. The Mus demo uses the Aura WindowTreeClient to track the creation of that display and pass it to the MusDemo class. The actual chrome browser window is created inside the native Mash window using a Mus WindowTreeClient. Again, this is not what we want, since each chrome browser should have its own native window. It will be interesting to first experiment with multiple native windows in the Mus demo.

Update 2017/02/01: aura::WindowTreeClient is now always used to create Aura Windows, both for external windows (corresponding to a display & a native window) and for internal windows. However, we still want all chrome browsers to be external windows.

Igalia Logo
Renesas Logo

It seems that the Ozone/Mash work has been done mostly with ChromeOS in mind and will probably need some adjustments to make things work on desktop Linux. We are very willing to discuss this and the other steps for Desktop Chrome Wayland/Ozone in detail with Google engineers. Antonio will attend BlinkOn7 next week and will be very happy to talk with you, so do not hesitate to get in touch!

January 22, 2017 11:00 PM

January 13, 2017

Asumu Takikawa

Making a Snabb app with pfmatch

My first task when I joined Igalia was experimenting with building networking apps with pfmatch and the Snabb framework. I mentioned Snabb in my last blog post, but in case you’re just tuning in, Snabb is a framework built on LuaJIT that lets you assemble networking programs as a combination of small apps—each implementing a particular network function—in a high-level scripting environment.

Meanwhile, pfmatch is a simple DSL that’s built on pflua, the subject of my last blog post. The idea of pfmatch is to use pflang filters to implement a pattern-matching facility for use in Snabb apps.

While pfmatch has been a part of pflua for a while, the networking team hadn’t really tried using it for a real app. So as a starter project, I looked into implementing a small and self-contained app using pfmatch that could potentially be useful. After looking through a bunch of networking literature, I settled on building an app for suppressing potentially malicious traffic, such as portscans, from outside your network.

A portscan detector is a good fit for a Snabb app because it can be built as a self-contained network function that pushes good packets through to other apps and drops the bad ones.

There are many online algorithms for solving this problem, many of them based on Jung et al’s well-known TRW (threshold random walk) work. A related algorithm by Weaver et al in “Very Fast Containment of Scanning Worms” from Usenix Security 2004 came with easy-to-understand pseudo-code for implementing the algorithm, so I went with that version.

I won’t describe the detailed workings of the TRW algorithm or its variants, because the papers I linked to explain the ideas better than I could. I’ll just sketch the basic idea here.

The family of TRW-like algorithms essentially try to quickly block traffic from hosts that will not be “useful” in the future. To a first approximation, this means traffic that will not lead to successful connections (the algorithm assumes the presence of an oracle for determining connection success). A successful connection “drives a random walk upwards” and vice versa for a failed one—when the walk reaches a certain threshold the host is assumed to be malicious. This works on the assumption that non-malicious hosts have a higher probability for connection success.

The fast algorithm from the Weaver paper uses the basic TRW algorithm but tracks the connection data using approximate caches in order to reduce memory use and improve access speed. (there are a few other differences from TRW that are detailed in section 5.1 of the paper)

Implementation

I’ll use the previously mentioned scan suppression app as an example to walk through some of the interesting points in building a Snabb app.

(Note: this isn’t intended to be a tutorial on how to build a Snabb app, so you may want to look at Snabb’s Getting Started guide for more details)

Let’s start with the data definitions that the app uses. The algorithm in the Usenix paper is designed to be fast by using data structures optimized for space. Take a look at Figure 1 from the paper:

The figure shows two cache tables that are used in the algorithm. The thing to notice is that one of the caches uses 8-bit records and the other uses 64 bits per record. It’d be useful to ensure that the data structures we use in the Lua code match this without additional overhead.

To do that, we can use the FFI to declare the data structures using C declarations like this:

ffi.cdef[[
  typedef struct {
    uint8_t in_to_out : 1;
    uint8_t out_to_in : 1;
    uint8_t age : 6;
  } conn_cache_line_t;

  typedef struct {
    uint16_t tag1;
    uint16_t count1;
    uint16_t tag2;
    uint16_t count2;
    uint16_t tag3;
    uint16_t count3;
    uint16_t tag4;
    uint16_t count4;
  } addr_cache_line_t;
]]

This declaration uses bit fields for the connection cache record structs to ensure that we fit the data in a byte. It’s pretty nice to be able to specify FFI data types like this in LuaJIT, even when we’re not specifically binding to FFI functions. The rest of the data structure implementation follows the paper straightforwardly.
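
As a concrete illustration, here is a minimal sketch (not the app’s actual code) of how these FFI types might be allocated and indexed; the table size and the hash input are made up for the example:

local ffi = require("ffi")
local bit = require("bit")

-- hypothetical number of entries; the real app follows the paper's sizing
local CONN_CACHE_SIZE = 1024 * 1024

-- a flat array of the 1-byte connection records declared above
local conn_cache = ffi.new("conn_cache_line_t[?]", CONN_CACHE_SIZE)

-- look up a record by a hash of the connection's addresses and ports;
-- indexing an FFI array of structs yields a mutable reference
local function conn_entry(hash)
   return conn_cache[bit.band(hash, CONN_CACHE_SIZE - 1)]
end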

For the actual app code, it’s helpful to know what makes up a Snabb app. At its simplest, a Snabb app is a Lua class that has a new method. But usually it also has either a pull or push method to either pull packets into the program or push packets through, depending on the network function it implements.

The Snabb framework calls these methods on every breath of its core loop, shuffling packets around as necessary.
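
To make that concrete before diving into the scanner, here is a minimal sketch of an app’s shape, using only the core.link operations that appear later in this post (the app and port names are made up). It just forwards every packet from its input port to its output port:

local link = require("core.link")

Forwarder = {}

function Forwarder.new(conf)
   -- this trivial app needs no state
   return setmetatable({}, {__index = Forwarder})
end

function Forwarder:push()
   local i, o = self.input.input, self.output.output
   -- drain the input link, passing every packet through unchanged
   while not link.empty(i) do
      link.transmit(o, link.receive(i))
   end
end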

So to build the scanner, we need an app with a push method that takes packets (from somewhere; the precise input will depend on how the app is used), drops them if they’re believed to be malicious, and transmits them onwards otherwise. The push method will call out to a function implementing the algorithm in the Weaver et al paper.

(Note: the Snabb Reference Manual explains what methods an app can implement along with many other details about the framework)

The main components of the algorithm are shown in Figure 2 of the Usenix paper:

These three cases distinguish whether a packet comes from inside/outside the network in question and whether it triggers the blocking threshold. For scan suppression purposes, the “inside” might be a specific subnet that you control and “outside” all other (potentially malicious) hosts. Alternatively, inside and outside may be flipped if you’re trying to contain scanning worms.

These three cases are encoded in two methods, inside and outside, in the Scanner class for the app.

The push method eventually calls these methods, but first it has to dispatch to determine which of these three cases we’re going to run. Thus the push method looks like this:

function Scanner:push()
  -- some sanity checks and variable binding for convenience
  local i = assert(self.input.input, "input port not found")
  local o = assert(self.output.output, "output port not found")

  -- in the actual app there's some code here that does some
  -- maintenance work on a timer, but the details don't matter here

  -- wait until we get a packet to process
  while not link.empty(i) do
    -- this part is a little different from what's in the Git repo just
    -- to simplify the code for this blog post
    local pkt = link.receive(i)
    self:matcher(pkt.data, pkt.length)
  end
end

The most important line is self:matcher(pkt.data, pkt.length), which calls a pfmatch function that’s installed in the constructor for Scanner. The code that constructs the dispatcher looks like this:

self.matcher = pfm.compile([[
  match {
    ip and src net $inside_net and not dst net $inside_net => {
        ip proto tcp => inside($src_addr_off, $dst_addr_off, $tcp_port_off)
        otherwise => inside($src_addr_off, $dst_addr_off)
      }
    ip and not src net $inside_net and dst net $inside_net => {
        ip proto tcp => outside($src_addr_off, $dst_addr_off, $tcp_port_off)
        otherwise => outside($src_addr_off, $dst_addr_off)
      }
  }]],

In this code, each of the clauses in the match expression (which is actually a Lua string between the [[ and ]]—so not quite as seamless as a Lisp DSL) has a pflang filter on the left-hand side and some actions on the right-hand side. The outermost filters in this case check whether packets are coming from some configured network or from outside.

(Note: this makes the simplifying assumption that you can determine whether a packet comes from inside/outside the network with a src net ... filter)

The actions are either nested dispatches, or a call to some method in self. The specific nested filters in this example are checking whether the packet is TCP or not (the algorithm cares about the port if it’s TCP, but not otherwise).

The $-prefixed expressions like $inside_net are substitutions that let you do a tiny bit of abstraction over patterns and actions (they’re just strings that are substituted into the filter or action).
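
For illustration only, the substitution table might look something like the following sketch. The values are made up: the offsets assume an IPv4 packet in an Ethernet frame with a 20-byte IP header, and the exact way such a table is passed to pfm.compile is my assumption rather than something shown here:

-- hypothetical values for the $-substitutions used above
local substitutions = {
   inside_net   = "192.168.1.0/24", -- the subnet treated as "inside"
   src_addr_off = 26,               -- 14-byte Ethernet header + 12 (IPv4 source address)
   dst_addr_off = 30,               -- 14-byte Ethernet header + 16 (IPv4 destination address)
   tcp_port_off = 36,               -- 14 + assumed 20-byte IP header + 2 (TCP destination port)
}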

Though it’s possible to write this dispatching without pfmatch, I think it looks clearer written this way than with conditionals and helper functions to match on the packet fields.

The inside and outside methods themselves are almost straightforward transliterations of the pseudo-code. For example, here is the inside method that handles the first case on the left:

function Scanner:inside(data, len, off_src, off_dst, off_port)
  -- do a lookup in the connection cache
  local cache_entry, src_ip, dst_ip, port =
    self:extract(data, off_src, off_dst, off_port)

  -- do a lookup in the address cache
  local count = self:lookup_count(dst_ip)

  if cache_entry.in_to_out ~= 1 then
    if cache_entry.out_to_in == 1 then
      -- previously a "miss" but now a "hit"
      self:set_count(dst_ip, count - 2)
    end
    cache_entry.in_to_out = 1
  end

  cache_entry.age = 0

  -- forward packet to next app
  link.transmit(self.output.output, self.pkt)
end

As you can see, the structure of this method is identical to the conditional in the figure. The outside case is a little more complicated:

function Scanner:outside(data, len, off_src, off_dst, off_port)
  local cache_entry, src_ip, dst_ip, port =
    self:extract(data, off_src, off_dst, off_port)

  local count = self:lookup_count(src_ip)

  -- here we dispatch between the two block threshold cases that you can see
  -- in Figure 2
  if count < block_threshold then
    if cache_entry.out_to_in ~= 1 then
      if cache_entry.in_to_out == 1 then
        self:set_count(src_ip, count - 1)
        cache_entry.out_to_in = 1
      elseif hygiene_matcher(self.pkt.data, self.pkt.length) then
        if debug then
          print("blocked packet due to hygiene check")
        end
        return packet.free(self.pkt)
      else
        -- a potential "miss"
        self:set_count(src_ip, count + 1)
        cache_entry.out_to_in = 1
      end
      cache_entry.in_to_out = 1
    end

    -- if not dropped ...
    cache_entry.age = 0
    link.transmit(self.output.output, self.pkt)
  -- i.e., above block threshold
  else
    if cache_entry.in_to_out == 1 then
      if syn_or_udp_matcher(self.pkt.data, self.pkt.length) then
        if debug then
          print("blocked initial SYN/UDP packet for blocked host")
        end
        return packet.free(self.pkt)
      elseif cache_entry.out_to_in ~= 1 then
        -- a "hit"
        self:set_count(src_ip, count - 1)
        cache_entry.out_to_in = 1
      end
      -- internal or old
      cache_entry.age = 0
      link.transmit(self.output.output, self.pkt)
    else
      if debug then
        print(string.format("blocked packet from %s on port %d",
                            format_ip(src_ip), port))
      end
      return packet.free(self.pkt)
    end
  end
end

Here we’ve merged the two cases of being above or below the threshold into a single method with a big conditional. Despite the length of the method, the structure still matches the pseudo-code very closely. If you can figure out what the helper functions do, the code follows pretty directly from the algorithm’s pseudo-code.

Reflection

Though I set out to build an app using pfmatch, it turned out that the amount of code devoted to matching is fairly small relative to the whole app. That said, I do think it makes the dispatch code cleaner than it would look without it.

On the other hand, I’m used to more seamless Lispy DSLs via macros (see Racket) so I do wish that Lua came with macros (though from what I heard at the Lua Workshop it may not be a total pipe dream).

In the full code, there are also a couple of places where I didn’t use pfmatch but did compile some pflua filters for use in the inside/outside methods. In these cases, it was clearer to dispatch on a boolean with a conditional than use a match clause.

I think we could improve some things in pfmatch to make it more useful. For example, using substitutions is really necessary to make the code look clean, but it’s kinda hacky. It would’ve been nicer if the pattern language supported data deconstruction/variable binding in addition to data dispatch, so that this could be done less awkwardly. It’s not obvious to me how that would integrate into pflang though.

Perhaps instead the pfmatch language could come with more built-in substitutions that encode the structure of packets, so that expressions in both pflang and the actions can refer to them by name instead of ip[12:4].

Overall, I’m also impressed that it’s so easy to write a network function in Lua following the structure of some code from a paper. This is partly to the credit of the authors, who wrote a clear paper, but it also shows how nice it is to write networking code in a high-level language.

The full app code is available here.

January 13, 2017 02:00 PM

January 11, 2017

Juan A. Suárez

New Year, New Blog!

A new year has come! And with the new year, the usual new resolutions: be a better person, do more exercise, blog more, … :smile:

More than two years without blogging. A lot of time. So let’s start with that last resolution.

But I...

January 11, 2017 11:00 PM

Jacobo Aragunde

GENIVI-fying Chromium

In the last weeks, I’ve been working to integrate Chromium in the GENIVI Development Platform (GDP).

GENIVI provides a platform tailored for the needs of the In-Vehicle Infotainment (IVI) industry. It’s interesting to highlight their early adoption of Wayland for the graphic stack and the development of the Wayland IVI extension to fulfil the specific needs of IVI window manager implementations.

In the GDP application model, apps are under the control of systemd, which takes care of starting and stopping them. Through the ivi-application protocol, applications register their surfaces to be managed by the HMI central controller. This HMI controller is in charge of manipulating the surfaces belonging to applications for presentation to the user, using the ivi-controller protocol. In the end, the Wayland compositor composes the surfaces on screen.

The first step was to build Chromium for the target platform. GDP can be built with Yocto, and the meta-browser layer provides a set of recipes to build Chromium, among other browsers. Support for Chromium on Wayland is provided by the Ozone-Wayland project, developed mainly by Intel. We had already used this combination of software to run Chromium on embedded platforms, and made contributions to it.

To achieve the integration with the GDP application model, we need to put some files in place so the browser can be controlled by systemd and the HMI controller has some knowledge about its existence. We extend the chromium-wayland recipe with a .bbappend file added to the GDP recipes. With everything in place, and one additional dependency from meta-openembedded, we are able to add Chromium to a GDP image and build.

Chromium on GDP

If you want to reproduce the build yourself before everything is integrated, you can already do it:

  • Get the GENIVI development platform and follow instructions to set GDP master up for your board.
  • Get the meta-browser layer and add it to your conf/bblayers.conf. I currently recommend using the branch ease-chromium-wayland-build from this fork, as it contains some cleanup patches that haven’t been integrated yet. EDIT: the patches have been merged into upstream meta-browser, so there’s no need to use the fork now!
  • You should already have the meta-openembedded submodule, so add meta-openembedded/meta-gnome to your conf/bblayers.conf.
  • Get the integration patch and add it into the meta-genivi-dev submodule. You may add this fork as a new remote and change to the chromium-integration branch, or you could just wget the patch and apply it locally. EDIT: the patch has been amended several times, links were updated.
  • Add chromium-wayland to your IMAGE_INSTALL_append in conf/local.conf. Now you may bitbake your image.

Next steps include integrating my patches into mainline development of their respective projects, rebasing meta-browser to use the latest possible version of Chromium, trying more platforms (currently working on a Minnowboard) and general fine-tuning by fixing issues like the weird proportions of the browser window when running on GDP.

This work is performed by Igalia and sponsored by GENIVI through the Challenge Grant Program. Thank you!

GENIVI logo

by Jacobo Aragunde Pérez at January 11, 2017 06:00 PM

Philippe Normand

WebKitGTK+ and HTML5 fullscreen video

HTML5 video is really nice and all, but one annoying thing is the lack of a fullscreen video specification; it is currently up to the User-Agent to allow fullscreen video display.

WebKit allows a fullscreen button in the media controls, and Safari can, on user demand only, switch a video to fullscreen and display nice controls over the video. There’s also a WebKit-specific DOM API that allows custom media controls to make use of that feature as well, as long as it is initiated by a user gesture. Vimeo’s HTML5 player uses that feature, for instance.

Thanks to the efforts made by Igalia to improve the WebKit GTK+ port and its GStreamer media-player, I have been able to implement support for this fullscreen video display feature. So any application using WebKitGTK+ can now make use of it :)

The way it is done in this first implementation is that a new fullscreen GTK+ window is created, and our GStreamer media-player inserts an autovideosink in the pipeline and overlays the video in the window. Some simple controls are supported; the UI is actually similar to Totem’s. One improvement we could make in the future would be to allow applications to override or customize that simple controls UI.

The nice thing about this is that we of course use the GstXOverlay interface, which is implemented by the Linux, Windows and Mac OS video sinks :) So when other WebKit ports start to use our GStreamer player implementation, it will be fairly easy to port the feature and allow more platforms to support fullscreen video display :)

Benjamin also suggested that we could reuse the cairo surface of our WebKit videosink and paint it in a fullscreen GdkWindow. But this should be done only when his Cairo/Pixman patches to enable hardware acceleration land.

So there is room for improvement, but I believe this is a nice first step for fullscreen HTML5 video consumption in Epiphany and other WebKitGTK+-based browsers :) Thanks a lot to Gustavo, Sebastian Dröge and Martin for the code reviews.

by Philippe Normand at January 11, 2017 02:42 PM

January 09, 2017

Samuel Iglesias

Vulkan's Float64 capability support on ANV driver landed!

Since the beginning of last year, our team at Igalia has been involved in enabling the ARB_gpu_shader_fp64 extension on different Intel generations: first Broadwell and later platforms, then Haswell, and now IvyBridge (still under review); so working on supporting Vulkan’s Float64 capability in Mesa was natural.

January 09, 2017 10:33 AM

January 08, 2017

Manuel Rego

Coloring the insertion caret

Over the last few weeks Igalia has been working on adding support for the caret-color property to Blink. This post is a summary of the different tasks done to achieve it.

First steps

Basically I started by investigating the state of the art a little. caret-color is defined in the spec called CSS Basic User Interface Module Level 3; it’s a Candidate Recommendation, so browsers should be happy to implement it. However the property is marked as at-risk due to the lack of implementations (I talked to the spec editor, Florian Rivoal, to confirm it).

Update: caret-color is no longer at-risk thanks to Chromium and Firefox implementations.

Then the first thing was to send the Intent to Implement and Ship: CSS UI: caret-color property mail to blink-dev. The feedback was positive, as it’s a small change without a big impact, and the intent was approved.

Initial support

My colleague Claudio Saavedra had already implemented caret-color support for Bloomberg in their downstream port a few years ago. And they have been using it successfully on the terminals. In addition, as you can see in this mail by Lea Verou on the CSS Working Group (CSS WG) mailing list, there’s a clear author demand for more control over the caret color.

Basically the task was adding a new CSS property to the Blink parser and change the color of the insertion caret to use it. The latter was basically changing a line and using caret-color instead of color to determine it.

Adding the CSS property was not hard, but there’s something special about caret-color, as it’s the first color property that accepts auto as a value. For now, auto is implemented as currentColor, but in the future it can be tweaked to “ensure good visibility and contrast with the surrounding content”. For that reason the patch needed some extra work on top of what a regular color would require.

caret-color example

Making caret-color animatable

The next step was making caret-color animatable. Again this would be fairly easy if it was a regular color, but due to the auto value it needed some special changes.

The important thing here is that auto is not interpolable, so if you do something like this:

  .foo {
    color: initial;
    caret-color: auto;
    transition: all 1s;
  }
  .foo:hover {
    color: green;
    caret-color: lime;
  }

The color of the text will interpolate from black to green, but the color of the insertion caret will do a discrete jump from black to lime. This might be seen as a weird behavior, but it was discussed and verified with the CSS WG.

Anyway the patch has landed and you can now enjoy your rainbow caret! 🌈

caret-color animation example

The resolved value

Another thing that was clarified with the CSS WG is the resolved value of the caret-color property.

In CSS you can have a specified value, computed value, used value and resolved value. This is not the post to explain all that, so I’ll jump directly into the details of what’s relevant to the caret-color implementation.

For color properties we’re used to having as the resolved value (the one returned by getComputedStyle()) the used value, a numeric color in a format like rgb(X, X, X). For example, in the following snippet you would get rgb(0, 255, 0) as the resolved value in the console log:

  <div id="test" style="color: lime; background-color: currentColor;"></div>
  <script>
    console.log(window.getComputedStyle(document.getElementById("test")).backgroundColor);
  </script>

However, that was not in the specs, and initially caret-color tests were checking that the resolved value was auto or currentColor. This has been modified and now it’ll behave like the rest of the color properties.

W3C Test Suite

Of course we need tests for all this stuff, but Florian had already created a few tests related to caret-color in the csswg-test repository. So I used them as a base and contributed a few more to check everything needed regarding the implementation.

The good news is that there are now 21 tests checking caret-color upstream in the W3C repository, which can be used by the rest of the browsers to verify interoperability.

This test suite is being used in Blink to verify that the current implementation is spec-compliant. Right now only 1 test is failing, due to a known bug regarding animations.

Conclusions

caret-color support is already available in Chrome Canary, and it’ll hit stable in Chrome 57. Please start playing with it and let me know if you find any issues.

The good news is that, thanks to all the discussions inside the CSS WG, the Firefox implementation got reactivated and caret-color is now supported there too! It’ll be included in Firefox 53. Also, Mozilla folks have written the documentation at MDN.

Regarding other browsers: in WebKit the bug has been reported and the Blink patches could probably be ported; for Edge I added a new feature request, so vote for it if you’re interested.

Acknowledgements

First I’d like to thank Bloomberg for sponsoring, once again, Igalia to do this work. It’s really nice to keep adding valuable features to the web.

This is the kind of thing where Igalia can be the perfect partner for your company: features that might be interesting for your organization but are not on the roadmap or among the priorities of the browser vendors. We can help discuss the feature with the standards body and push to get it implemented in the open source browsers. Of course, not everything will be accepted upstream, but in some situations (like the caret-color property), it’s possible to make it happen for real.

If you’d like to do this kind of work at Igalia, don’t miss the opportunity to join us. We’ve recently opened a position for a Chromium developer; take a look at it and send us your application if you’re interested.

Igalia and Bloomberg working together to build a better web

Also thanks to Florian Rivoal for replying so fast to all my questions and doubts regarding caret-color and the spec. He was very kind in quickly reviewing all my tests for the W3C csswg-test repo.

Finally thanks to the reviewers (especially Alan Cutter and Timothy Loh) and to the other people commenting on the different issues, for all your patience with me. 😄

January 08, 2017 11:00 PM

January 06, 2017

Iago Toral

GL_ARB_gpu_shader_fp64 / OpenGL 4.0 lands for Intel/Haswell. More gen7 support coming soon!

2017 starts with good news for Intel Haswell users: it has been a long time coming, but we have finally landed GL_ARB_gpu_shader_fp64 for this platform. Thanks to Matt Turner for reviewing the huge patch series!

Maybe you are not particularly excited about GL_ARB_gpu_shader_fp64, but that does not mean this is not an exciting milestone for you if you have a Haswell GPU (or even IvyBridge, read below): this extension was the last missing piece needed to finally expose OpenGL 4.0 on Haswell!

If you want to give it a try but you don’t want to build the driver from the latest Mesa sources, don’t worry: the feature freeze for the Mesa 13.1 release is planned for just a few days from now, with the release itself expected in early February, so you won’t have to wait too long for an official release.

But that is not all: now that we have landed Fp64, we can also send the implementation of GL_ARB_vertex_attrib_64bit for review. This could be a very exciting milestone, since I believe this is the only thing missing for Haswell to have all the extensions required for OpenGL 4.5!

You might be wondering about IvyBridge too, and 2017 also starts with good news for IvyBridge users. Landing Fp64 for Haswell allowed us to send for review the IvyBridge patches we had queued up for GL_ARB_gpu_shader_fp64, which will bring IvyBridge up to OpenGL 4.0. But again, that is not all: once we land Fp64 we should also be able to send the patches for GL_ARB_vertex_attrib_64bit and get IvyBridge up to OpenGL 4.2, so look forward to this in the near future!

We have been working hard on Fp64 and Va64 during a good part of 2016, first for Broadwell and later platforms, and then for Haswell and IvyBridge; it has been a lot of work, so it is exciting to see all of this reaching the last stages and making its way into the hands of final users.

All this has only been possible thanks to Intel’s sponsoring and the great support and insight that our friends there have provided throughout the development and review processes, so big thanks to all of them and also to the team at Igalia that has been involved in the development with me.

by Iago Toral at January 06, 2017 07:49 AM

December 21, 2016

Frédéric Wang

¡Igalia is hiring!

If you read this blog, you probably know that I joined Igalia early this year, where I have been involved in projects related to free software and web engines. You may however not be aware that Igalia has a flat & cooperative structure where all decisions (projects, events, recruitment, company agreements, etc.) are voted on by the members of an assembly. In my opinion such an organization makes for better decisions and fewer frustrations, compared to more traditional hierarchical organizations.

After several months on staff, I finally applied to become an assembly member and my application was approved in November! Hence I attended my first assembly last week, where I got access to all the internal information and was also able to vote… In particular, we approved the opening of two new job positions. If you are interested in state-of-the-art free software projects and if you are willing to join a company with great human values, you should definitely consider applying!

December 21, 2016 11:00 PM

December 19, 2016

Miguel A. Gómez

WPE: Web Platform for Embedded

WPE is a new WebKit port optimized for embedded platforms that can support a variety of display protocols like Wayland, X11 or other native implementations. It is the evolution of the port formerly known as WebKitForWayland, and it was born as part of a collaboration between Metrological and Igalia as an effort to have a WebKit port running efficiently on STBs.

QtWebKit has been unmaintained upstream since they decided to switch to Blink, hence relying on a dead port for the future of STBs is a no-go. Meanwhile, WebKitGTK+ has been maintained and alive upstream, which made it perfect as a basis for developing this new port, removing the GTK+ dependency and trying Wayland as a replacement for the X server. WebKitForWayland was born!

During a second iteration, we were able to make the Wayland dependency optional, and we changed the port to use platform-specific libraries to implement the window drawing and management. This is very handy for those platforms where Wayland is not available. Because of this, the port was renamed to reflect that Wayland is just one of the several backends supported: welcome WPE!

WPE has been designed with simplicity and performance in mind. Hence, we developed just a fullscreen browser, with no tabs but with multimedia support, as small (both in memory usage and disk space) and light as possible.

Current repositories

We are now in the process of moving from the WebKitForWayland repositories to what will be the final WPE ones. This is why this paragraph is about “current repositories”, and why the names include WebKitForWayland instead of WPE. This will change at some point; expect a new post with the details when it happens. For now, just bear in mind that where it says WebKitForWayland it really refers to WPE.

  • Obviously, we use the main WebKit repository git://git.webkit.org/WebKit.git as our source for the WebKit implementation.
  • Then there are some repositories at GitHub to host the specific WPE bits. These repositories include the dependencies needed to build WPE, together with the modifications we made to WebKit for this new port. This is the main WPE repository, and it can easily be built for the desktop and run inside a Wayland compositor. The build and run instructions can be checked here. The mission of these repositories is to be the WPE reference repository, containing the differences needed from upstream WebKit that are common to all the possible downstream implementations. Every release cycle, the changes in upstream WebKit are merged into this repository to keep it updated.
  • And finally we have the Metrological repositories. As in the previous case, we added the dependencies we needed together with the WebKit code. This third repository’s mission is to hold the Metrological-specific changes to both WPE and its dependencies, and it is also updated from the main WPE repository each release cycle. This version of WPE is meant to be used inside Metrological’s buildroot configuration, which is able to build images for the several target platforms they use. These platforms include all the versions of the Raspberry Pi boards, which are the ones we use as reference platforms (especially the Raspberry Pi 2), together with several industry-specific boards from chip vendors such as Broadcom and Intel.

Building

As I mentioned before, building and running WPE from the main repository is easy and the instructions can be found here.

Building an image for a Raspberry Pi is quite easy as well, just a bit more time-consuming because of the cross-compiling and the extra dependencies. There are currently a couple of configs in Metrological’s buildroot that can be used and that don’t depend on Metrological-specific packages. Here are the commands you need to run in order to test it:

    • Clone the buildroot repository:
      git clone https://github.com/Metrological/buildroot-wpe.git
    • Select the buildroot configuration you want. Currently you can use raspberrypi_wpe_defconfig to build for the RPi1 and raspberrypi2_wpe_defconfig to build for the RPi2. This example builds for the RPi2; it can be changed to the RPi1 just by changing this command to use the appropriate config. The rest of the commands are the same for both cases.
      make raspberrypi2_wpe_defconfig
    • Build the config
      make
    • And then go for a coffee because the build will take a while.
    • Once the build is finished you need to deploy the result to the RPi’s card (SD for the RPi1, microSD for the RPi2). This card must have 2 partitions:
      • boot: fat32 file system, with around 100MB of space.
      • root: ext4 file system with around 500MB of space.
    • Mount the SD card partitions on your system and deploy the build result (stored in output/images) to them. The deploy commands assume that the boot partition was mounted on /media/boot and the root partition on /media/rootfs:
      cp -R output/images/rpi-firmware/* /media/boot/
      cp output/images/zImage /media/boot/
      cp output/images/*.dtb /media/boot/
      tar -xvpsf output/images/rootfs.tar -C /media/rootfs/
    • Remove the card from your system and plug it into the RPi. Once booted, ssh into it and the browser can easily be launched:
      wpe http://www.igalia.com

Features

I’m planning to write a dedicated post to talk about technical details of the project where I’ll cover this more in deep but, briefly, these are some features that can be found in WPE:

    • support for the common HTML5 features: positioning, CSS, CSS3D, etc
    • hardware accelerated media playback
    • hardware accelerated canvas
    • WebGL
    • MSE
    • MathML
    • Forms
    • Web Animations
    • XMLHttpRequest
    • and many other features supported by WebKit. If you are interested in the complete list, feel free to browse to http://html5test.com/ and check it yourself!

Screenshot

Current adoption status

We are proud to see that, thanks to Igalia’s effort together with Metrological, WPE has been selected to replace QtWebKit inside the RDK stack, and that it has also been adopted by some big cable operators like Comcast (surpassing other options like Chromium, Opera, etc.). Also, several other STB manufacturers have shown interest in putting WPE on their boards, which will lead to new supported platforms and more people contributing to the project.

This is really great news for WPE, and we hope to build an awesome community around the project (both companies and individuals) to collaborate on making the engine even better!!

Future developments

Of course, periodically merging upstream changes while adding new functionality and supported platforms to the engine is a very important part of what we are planning to do with WPE. Both Igalia and Metrological have a lot of ideas for future work: finishing WebRTC and EME support, improving the graphics pipelines, adding new APIs, improving security, etc.

But besides that, there’s also a very important piece of refactoring being performed: upstreaming the code to the main WebKit repository as a new port. Basically this means that the main WPE repository will be removed at some point, and its content will be integrated into WebKit. Together with this, we are putting the pieces in place to have a continuous build and testing system, as the rest of the WebKit ports have, to ensure that the code always builds and that the layout tests pass properly. This will greatly improve the quality and robustness of WPE.

So, when we are done with those changes, the repository structure will be:

  • The WebKit main repository, with most of the code integrated there
  • Clients/users of WPE will have their own repository with their specific code, and they will merge main repository’s changes directly. This is the case of the Metrological repository.
  • A new, third repository will store WPE’s rendering backends. This code cannot be upstreamed to the WebKit repository, as in many cases the license won’t allow it. So only a generic backend will be upstreamed to WebKit, while the rest of the backends will be stored here (or in other client-specific repositories).

by magomez at December 19, 2016 02:39 PM

December 15, 2016

Javier Muñoz

Ceph RGW AWS4 presigned URLs working with the Minio Cloud client

Some fellows are using the Minio Client (mc) as their primary client-side tool to work with S3 cloud storage and filesystems. As you may know, mc works with the AWS v4 signature API and provides a modern, Apache 2.0-licensed alternative to UNIX commands (ls, cat, cp, diff, etc.).

If you are using mc on the client side and Ceph RGW S3 on the server side, you could be experiencing some issues with AWS4 presigned URLs and the error code '403 Forbidden'.

To resolve this issue you need to set a new configuration parameter to 'false' in the RGW S3 configuration file:

rgw s3 auth aws4 force boto2 compat = false

With this configuration in place, RGW S3 will be able to properly handle mc and other client-side tools experiencing the same issue. This configuration option is already available upstream.

By the way, if you are interested in knowing the origin of this issue, you can have a look at this old boto2 bug.

While computing the signature, a buggy boto2 version crafts the host using the port number twice (e.g. 'rgw.example:8000:8000'), while a proper implementation (mc, etc.) uses it only once ('rgw.example:8000'). The result is two different signatures computed for the same URL.

Amazon S3 will accept both signatures as valid.

In the case of RGW S3, with the new configuration option set to 'false', RGW S3 will compute a second signature for presigned URLs if the first signature computation does not match. The AWS4 presigned URL will be considered valid if either of the two signatures matches.
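
Conceptually, the fallback works like the following sketch; the real logic lives in RGW's C++ code, and the function and field names here are hypothetical:

-- sign(req, host) stands in for the AWS4 signature computation
local function verify_presigned(req, client_sig, sign)
   local host_once  = req.host .. ":" .. req.port   -- spec-compliant form
   local host_twice = host_once .. ":" .. req.port  -- buggy boto2's doubled port
   return client_sig == sign(req, host_once)
       or client_sig == sign(req, host_twice)
end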

Enjoy!

Acknowledgments

My work in Ceph is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ceph development team. Thanks Pritha, Matt Benjamin and Yehuda for all your support to go upstream!

by Javier at December 15, 2016 11:00 PM

Claudio Saavedra

Thu 2016/Dec/15

Igalia is hiring. We're currently interested in Multimedia and Chromium developers. Check the announcements for details on the positions and our company.

December 15, 2016 05:13 PM

December 09, 2016

Frédéric Wang

STIX Two in Gecko and WebKit

On the 1st of December, the STIX Fonts project announced the release of STIX 2. If you never heard about this project, it is described as follows:

The mission of the Scientific and Technical Information Exchange (STIX) font creation project is the preparation of a comprehensive set of fonts that serve the scientific and engineering community in the process from manuscript creation through final publication, both in electronic and print formats.

This sounds like a very exciting goal, but the way it has been pursued has made the STIX project infamous for its numerous delays, its poor or confusing packaging, its delivery of math fonts with too many bugs to be usable, its lack of openness & communication, and its bad handling of third-party feedback & contributions…

Because of these laborious travels towards unsatisfactory releases, some snarky people claim that the project was actually named after Styx (Στύξ) the river from Greek mythology that one has to cross to enter the Underworld. Or that the story of the project is summarized by Baudelaire’s verses from L’Irrémédiable:

Une Idée, une Forme, un Être
Parti de l’azur et tombé
Dans un Styx bourbeux et plombé
Où nul œil du Ciel ne pénètre ;

More seriously, the good news is that the STIX Consortium finally released text fonts with a beautiful design and a companion math font that is usable in math rendering engines such as Word Processors, LaTeX and Web Engines. Indeed, WebKit and Gecko have supported OpenType-based MathML layout for more than three years (with recent improvements by Igalia) and STIX Two now has correct OpenType data and metrics!

Of course, the STIX Consortium did not address all the technical or organizational issues that have made its reputation but I count on Khaled Hosny to maintain his more open XITS fork with enhancements that have been ignored for STIX Two (e.g. Arabic and RTL features) or with fixes of already reported bugs.

As Jacques Distler wrote in a recent blog post, OS vendors should ideally bundle the STIX Two fonts in their default installation. For now, users can download and install the OTF fonts themselves. Note however that the STIX Two archive contains WOFF and WOFF2 fonts that page authors can use as web fonts.

I just landed patches in Gecko and WebKit so that future releases will try and find STIX Two on your system for MathML rendering. However, you can already do the proper font configuration via the preference menu of your browser:

  • For Gecko-based applications (e.g. Firefox, Seamonkey or Thunderbird), go to the font preference and select STIX Two Math as the “font for mathematics”.
  • For WebKit-based applications (e.g. Epiphany or Safari) add the following rule to your user stylesheet: math { font-family: "STIX Two Math"; }.

Finally, here is a screenshot of MathML formulas rendered by Firefox 49 using STIX Two:

Screenshot of MathML formulas rendered by Firefox using STIX 2

And the same page rendered by Epiphany 3.22.3:

Screenshot of MathML formulas rendered by Epiphany using STIX 2

December 09, 2016 11:00 PM

December 02, 2016

Frédéric Wang

Chromium on R-Car M3 & AGL/Wayland

As my fellow Igalian Antonio Gomes explained on his blog, we have recently been working on making the master branch of Chromium run on desktop Linux using the upstream Ozone/Wayland backend. This effort is supported by Renesas, and they were interested in relying on these recent developments to showcase Chromium running on the latest generation of their R-Car systems-on-chip for automotive applications. Ideally, they were willing to see this happen on the Automotive Grade Linux distribution.

Igalia Logo
Renesas Logo

Luckily, support for R-Car M3 was already available in the development branch of AGL, and the AGL instructions for R-Car M2 actually worked pretty well for M3 too, mutatis mutandis. There is also an OpenEmbedded/Yocto BSP layer for Chromium, but unfortunately Wayland is only supported up to m48 (using the Ozone backend from Intel’s fork) and X11 only up to m52. Hence we created a (for-now-private) fork of meta-browser that:

  • Uses the latest development version of Chromium, in particular the Linux/Ozone/Mash/Wayland support Igalia has been working on recently.
  • Relies on the new GN build system rather than the deprecated GYP one.
  • Supports the ARM64 architecture instead of just ARM32.
  • Installs more resources such as Mojo service manifests.
  • Applies more patches (e.g. build fixes or the workaround for issue 2485673002).

Renesas also provided us with some R-Car M3 boards. I received my package this week and hence was able to start playing with it on Wednesday. Again, the AGL instructions for R-Car M2 apply well to M3, and the elinux page is very helpful for understanding the various parts of the board. After starting AGL on the board and opening a Weston terminal (using mouse & keyboard as in the video, or using ssh), I was able to run chromium via the following command:

/usr/bin/chromium/chrome --mash --ozone-platform=wayland \
                         --user-data-dir=/tmp/user-data-dir \
                         --no-sandbox

The first line is the usual command currently used to execute Ozone/Mash/Wayland. The second line works around the fact that the AGL demo runs as root. The third line disables the sandbox, because the deprecated setuid sandbox does not work with Ozone and the newer user namespaces sandbox is not supported by the AGL kernel.

The following video gives an overview of the hardware preparation as well as some basic tests with bouncyballs.org and webengineshackfest.org:

You can see that Chromium runs relatively smoothly, but we will do more testing to confirm these initial experiments. We also still hit the general issues with Linux/Ozone/Mash (internal vs external windows, no fullscreen, keyboard restricted to the US layout, unwanted ChromeOS widgets, etc.), but we expect to continue collaborating with Google to fix these bugs!

December 02, 2016 11:00 PM

Jacobo Aragunde

Sowing free software across Galicia

I am writing these lines in Galician to talk about the free-software-related activities I have recently taken part in locally, as part of Igalia’s mission to promote free software.

To begin with, I had the honor of being invited, for the third year in a row, to give a seminar as part of the Information Systems Design course in the Master’s in Computer Engineering taught at the University of A Coruña. The goal was to connect the course contents with the reality of a software project developed in community, and to that end we analyzed the LibreOffice project and its community. We went over its long history, studied the composition and workings of its community, and took a look at the application’s architecture and the techniques and tools used for QA.

In addition, last week I visited Monforte de Lemos to talk about Igalia’s business project and how it is possible to do business with free software technologies. The invitation came from IES A Pinguela, a school that offers several vocational training programs related to ICT. It is always a pleasure to present Igalia’s project and to highlight not only the technical aspects but also the human ones, such as our democratic and egalitarian organizational model or our strong principles of social responsibility. On the technical side, we analyzed different business models and their relation to free software, along with practical cases of real companies and communities.

Talk at IES A Pinguela

The contents of both talks are below, in hybrid PDF format (they can be edited with LibreOffice):

by Jacobo Aragunde Pérez at December 02, 2016 08:55 AM

December 01, 2016

Asumu Takikawa

Pflua to assembly via dynasm

Today’s post is about the compiler implementation work I’ve been doing at Igalia for the past month and a half or so. The networking team at Igalia maintains the pflua library for fast packet filtering in Lua. The library implements the pflang language, which is what we call the filtering language that’s built into tools like tcpdump and libpcap.

The main use case for pflua is packet filtering integrated into programs written using Snabb, a high-speed networking framework implemented with LuaJIT.

(for more info about pflua, check out Andy Wingo’s blog posts about it and also Kat Barone-Adesi’s blog posts)

The big selling point of pflua is that it’s a very high speed implementation of pflang filtering (see here). Pflua gets its speed by compiling pflang expressions cleverly (via ANF and SSA) to Lua code that’s executed by LuaJIT, which makes code very fast via tracing JIT compilation.

Compiling to Lua lets pflua take advantage of all the optimizations of LuaJIT, which is great when compiling simpler filters (such as the “accept all” filter) or with filters that reject packets quickly. With more complex filters and filters that accept packets, the performance can be more variable.

For example, the following chart from the pflua benchmark suite shows a situation in which benchmarks are paired off, with one filter accepting most packets and the second rejecting most. The Lua performance (light orange, furthest bar on the right) is better in the rejecting case:

Meanwhile, for filters that mostly accept packets, the Linux BPF implementations (blue & purple) are faster. As Andy explains in his original blog post on these benchmarks, this poor performance can be attributed to the execution of side traces that jump back to the original trace’s loop header, thus triggering unnecessary checks.

That brings me to what I’ve been working on. Instead of compiling down to Lua and relying on LuaJIT’s optimizations, I have been experimenting with compiling pflang filters directly to x64 assembly via pflua’s intermediate representation. The compilation is still implemented in Lua, and its code generation pass utilizes dynasm, a dynamic assembler that augments C/Lua with a preprocessor-based assembly language (it was originally developed for use with C in the internals of LuaJIT). More precisely, the compiler uses the Lua port of dynasm.

The hope is that by compiling to x64 asm directly, we can get better performance out of cases where compiling to Lua does poorly. To spoil the ending, let me show you the same benchmark graph from earlier but with a newer Pflua and the new native backend (in dark orange on the far right):

As you can see, the performance of the native backend is almost uniformly better than the other implementations on these particular filters. The last case is the only one in which the default Lua backend (light orange, 2nd from right) does better.

In particular, the native backend does better than the Linux BPF cases even when the filter tends to accept packets. That said, I don’t want to give the impression that the native backend is always better than the Lua backend. For example, the Lua backend does better in the following examples:

Here the filters are simpler (such as the empty “accept all” filter) and execute fewer checks. In these cases, the Lua backend is faster, perhaps because LuaJIT does a good job of inlining away the checks or the function entirely. The native backend is still competitive though.

Using filters that are more complex appears to tip things favorably towards the dynasm backend. For example, here are the results on two more complex filters that target HTTP packets:

The two filters used for the graph above are the following:

tcp port 80 and
  (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)

tcp[2:2] = 80 and
  (tcp[20:4] = 1347375956 or
   tcp[24:4] = 1347375956 or
   tcp[28:4] = 1347375956 or
   tcp[32:4] = 1347375956 or
   tcp[36:4] = 1347375956 or
   tcp[40:4] = 1347375956 or
   tcp[44:4] = 1347375956 or
   tcp[48:4] = 1347375956 or
   tcp[52:4] = 1347375956 or
   tcp[56:4] = 1347375956 or
   tcp[60:4] = 1347375956)

The first is from the pcap-filter language manpage and targets HTTP data packets. The other is from a Stackexchange answer and looks for HTTP POST packets. As is pretty obvious from the graph, the native backend is winning by a large margin for both.

Implementation

Now that you’ve seen the benchmark results, I want to remark on the general design of the new backend. It’s actually a pretty straightforward compiler that looks like one you would write following Appel’s Tiger book. The new backend consumes the SSA basic blocks IR that’s used by the Lua backend. From there, an instruction selection pass converts the tree structures in the SSA IR into straight-line code with virtual instructions, often introducing additional variables (really virtual registers) and movs for intermediate values.

Then a simple linear scan register allocator assigns real registers to the virtual registers, and also eliminates redundant mov instructions where possible. I’ve noticed that most pflang filters don’t produce intermediate forms with much register pressure, so spilling to memory is pretty rare. In fact, I’ve only been able to cause spilling with synthetic examples that I make up or from random testing outputs.
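
To give a flavor of that pass, here is a simplified linear scan sketch; it is not pflua’s actual allocator (for one, it spills the current interval rather than the furthest-ending one, and the redundant-mov elimination is omitted). Each variable carries a live interval with a start and finish position:

-- intervals: list of {name=..., start=..., finish=...}
-- registers: list of available register names
local function linear_scan(intervals, registers)
   table.sort(intervals, function(a, b) return a.start < b.start end)
   local free = {}
   for i = #registers, 1, -1 do free[#free + 1] = registers[i] end
   local active, alloc = {}, {}
   for _, iv in ipairs(intervals) do
      -- expire active intervals that ended before this one starts,
      -- returning their registers to the free pool
      for j = #active, 1, -1 do
         if active[j].finish < iv.start then
            free[#free + 1] = alloc[active[j].name]
            table.remove(active, j)
         end
      end
      if #free > 0 then
         alloc[iv.name] = table.remove(free)  -- grab a free register
         active[#active + 1] = iv             -- track it until it expires
      else
         alloc[iv.name] = "spill"             -- nothing free: spill to memory
      end
   end
   return alloc
end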

Finally, a code generation pass uses dynasm to produce real x64 instructions. Dynasm simplifies code generation quite a bit, since it lets you write a code generator as if you were just writing inline assembly in your Lua program.

For example, here’s a snippet showing how the code generator produces assembly for a virtual instruction that converts from network to host order:

elseif itype == "ntohl" then
   local reg = alloc[instr[2]]
   | bswap Rd(reg)

This is a single branch from a big conditional statement that dispatches on the type of virtual instruction. It first looks up the relevant register for the argument variable in the register allocation and then produces a bswap assembly instruction. The line that starts with a | is not normal Lua syntax; it’s consumed by the dynasm preprocessor and turned into normal Lua. As you can see, Lua expressions like reg can also be used in the assembly to control, for example, the register to use.

After preprocessing, the branch looks like this:

elseif itype == "ntohl" then
   local reg = alloc[instr[2]]
   --| bswap Rd(reg)
   dasm.put(Dst, 479, (reg))

(there is a very nice unofficial documentation page for dynasm by Peter Crawley if you want to learn more about it)

There are some cases where using dynasm has been a bit frustrating. For example, when implementing register spilling, you would like to be able to switch between register and memory arguments for dynasm code generation (like in the Rd(reg) bit above). Unfortunately, it’s not possible to choose dynamically whether to use a register or a memory argument, because that is fixed at dynasm’s preprocessing stage.

This means that the code generator has to separate the register and memory cases for every kind of instruction that it dispatches on (and potentially more than just two cases if there are several arguments). This turns the code into spaghetti with quite a bit of boilerplate. A better alternative is to introduce additional instructions that mov memory values into registers and vice versa, and then work uniformly on register arguments otherwise. This introduces some unnecessary overhead, but since spilling is so rare, I haven’t found it to be a problem in practice.
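
As a sketch of that alternative, under an assumed IR shape (instructions as tables with an op name and an args list; this is not pflua’s real representation), a pre-pass can rewrite any use of a spilled variable into a load through a reserved scratch register:

local SCRATCH = "scratch"  -- stands for a reserved scratch register

local function insert_spill_loads(instrs, alloc)
   local out = {}
   for _, ins in ipairs(instrs) do
      for i, arg in ipairs(ins.args) do
         if alloc[arg] == "spill" then
            -- load the spilled value from its memory slot into the
            -- scratch register, then let the instruction use the register
            out[#out + 1] = { op = "load-spill", args = { SCRATCH, arg } }
            ins.args[i] = SCRATCH
         end
      end
      out[#out + 1] = ins
   end
   return out
end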

Dynasm does let you write some really cute code though. For example, one of the tests that I wrote ensures that callee-save registers are used correctly by the native backend. This is a little tricky to test, because you need to ensure that the compiled function doesn’t interfere with registers used in the surrounding context (in this case, that’s the LuaJIT VM).

But expressing that test context is not really possible with just plain old Lua. The solution is to apply some more dynasm, and set up a test context which uses callee-save registers and then checks to make sure its registers don’t get mutated by calling the filter. It ends up looking like this:

local function test(expr)
   local Dst = dasm.new(actions)

   local ssa = convert_ssa(convert_anf(optimize(expand(parse(expr),
                           "EN10MB"))))
   local instr = sel.select(ssa)
   local alloc = ra.allocate(instr)
   local f = compile(instr, alloc)
   local pkt = v4_pkts[1]

   | push rbx
   -- we want to make sure %rbx still contains this later
   | mov rbx, 0xdeadbeef
   -- args to 'f'
   | mov64 rdi, ffi.cast(ffi.typeof("uint64_t"), pkt.packet)
   | mov rsi, pkt.len
   -- call 'f'
   | mov64 rax, ffi.cast(ffi.typeof("uint64_t"), f)
   | call rax
   -- make sure it's still there
   | cmp rbx, 0xdeadbeef
   -- put a bool in return register
   | sete al
   | pop rbx
   | ret

   local mcode, size = Dst:build()
   table.insert(anchor, mcode)
   local f = ffi.cast(ffi.typeof("bool(*)()"), mcode)
   return f()
end

This Lua function takes a pflang filter expression, compiles it into a function f, and then returns a new function generated with dynasm that calls f with particular values in callee-save registers. The generated function returns true only if f didn’t stomp on those registers.

You can see the rest of the code that uses this test here. I think it’s a pretty cool test.

Using the new backend

If you’d like to try out the new Pflua, it’s now merged into the upstream Igalia repository here. To set it up, you’ll need to clone the repo recursively:

git clone --recursive https://github.com/Igalia/pflua.git

in order to pull in both the LuaJIT and dynasm submodules. Then enter the directory and run make to build it.

To check out how the new backend compiles some filters, go into the tools folder and run something like this:

$ ../env pflua-compile --native 'ip'
7f4030932000  4883FE0E          cmp rsi, +0x0e
7f4030932004  7C0A              jl 0x7f4030932010
7f4030932006  0FB7770C          movzx esi, word [rdi+0xc]
7f403093200a  4883FE08          cmp rsi, +0x08
7f403093200e  7403              jz 0x7f4030932013
7f4030932010  B000              mov al, 0x0
7f4030932012  C3                ret
7f4030932013  B001              mov al, 0x1
7f4030932015  C3                ret

and you’ll see some assembly dumped to standard output, like above. If you have any code that uses pflua’s API, you can pass an extra argument to pf.compile_filter, like the following:

pf.compile_filter(filter_expr, { native = true })

in order to use native compilation. The plan is to integrate this new work into Snabb (plain pflua is already integrated upstream) in order to speed up network functions that rely on packet filtering. Bug reports and contributions welcome!
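
To close, here is a minimal usage sketch (the all-zero packet buffer is made up for illustration, and this assumes pflua’s default Ethernet link type; real code would get packets from a pcap file or a NIC):

local ffi = require("ffi")
local pf = require("pf")

-- compile with the new native backend
local filter = pf.compile_filter("ip", { native = true })

-- compiled filters are called with a uint8_t* pointer and a length
local buf = ffi.new("uint8_t[?]", 64) -- a dummy all-zero packet
print(filter(ffi.cast("uint8_t *", buf), 64)) -- false: not an IPv4 packet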

December 01, 2016 12:27 PM

November 19, 2016

Víctor Jáquez

GStreamer-VAAPI 1.10 (now 1.10.1)

A lot of things have happened in the last month, and I have neglected this report. Let me try to fix that.

First, the autumn GStreamer Hackfest and the GStreamer Conference 2016 in Berlin.

The hackfest’s venue was the famous C-Base, where we talked with friends and colleagues. Most of the discussions leaned towards the 1.10 release. Still, Julien Isorce and I talked about our approaches for DMABuf sharing with downstream; we agreed to explore how to implement an allocator based on GstDmaBufAllocator. We talked with Josep Torra about the performance he was getting with the gstreamer-vaapi encoders (and we have already made good progress on this). We also talked with Nicolas Dufresne about v4l2 and kmssink, and had many other informal chats along with beer and pizza.

GStreamer Hackfest at the C-Base (from @calvaris’ tweet)

Afterwards, at the GStreamer Conference, I talked a bit about the current status of GStreamer-VAAPI and what we can expect from release 1.10. Here’s the video that the folks from Ubicast recorded. You can see the slides here too.

I spent most of the conference with Scott, Sree and Josep tracing the bottlenecks in frame uploading to the Intel VA driver, and trying to answer the questions of the folks who kindly approached me. I hope my answers were helpful.

And finally, on the first of November, release 1.10 hit the streets!

As already mentioned in the release notes, these are the main new features you can find in GStreamer-VAAPI 1.10:

  • All decoders have been split, one plugin feature per codec. So far, the available ones, depending on the driver, are: vaapimpeg2dec, vaapih264dec, vaapih265dec, vaapivc1dec, vaapivp8dec, vaapivp9dec and vaapijpegdec (which already was split).
  • Improvements when mapping VA surfaces into memory, differentiating between negotiation caps and allocation caps, since the memory allocated for surfaces may be bigger than the part that is actually going to be mapped.
  • vaapih265enc gained support for constant bitrate (CBR) mode.
  • Since several VA drivers are unmaintained, we decided to keep a white list of the VA drivers we actually test, which is mostly i915 and, to some degree, gallium from the Mesa project. Exporting the environment variable GST_VAAPI_ALL_DRIVERS disables the white list.
  • Plugin features are registered at run time, according to whether the loaded VA driver supports them, so only the decoders and encoders supported by the system are registered. Since the driver can change, some dependencies are tracked in order to invalidate the GStreamer registry and reload the plugin.
  • DMA-Buf import from upstream has been improved, gaining performance.
  • vaapipostproc can now negotiate buffer transformations through caps.
  • Decoders can now do reverse playback! But they only show I-frames, because the surface pool is smaller than what the GOP requires to show all the frames.
  • The upload of frames onto native GL textures has been optimized too, by keeping a cache of the internal structures for the textures offered by the sink, which improves performance a lot with glimagesink.

This is the short log summary since branch 1.8:

  1  Allen Zhang
 19  Hyunjun Ko
  1  Jagyum Koo
  2  Jan Schmidt
  7  Julien Isorce
  1  Matthew Waters
  1  Michael Olbrich
  1  Nicolas Dufresne
 13  Scott D Phillips
  9  Sebastian Dröge
 30  Sreerenj Balachandran
  1  Stefan Sauer
  1  Stephen
  1  Thiago Santos
  1  Tim-Philipp Müller
  1  Vineeth TM
154  Víctor Manuel Jáquez Leal

Thanks to Igalia, Intel Open Source Technology Center and many others for making GStreamer-VAAPI possible.

by vjaquez at November 19, 2016 11:31 AM

November 15, 2016

Iago Toral

OpenGL terrain renderer update: Motion Blur

As usual, the new motion blur effect can be turned on and off at build time, and there is some configuration available as well to control how the effect works. Here are some screenshots:

Screenshots (two per setting): Motion Blur Off; Motion Blur On, intensity 12.5%; Motion Blur On, intensity 25%; Motion Blur On, intensity 50%.

Motion blur is a technique used to make movement feel smoother than it actually is; it is targeted at hiding the fact that things don’t move in a continuous fashion, but rather at fixed intervals dictated by the frame rate. For example, a fast moving object in a scene can “leap” many pixels between consecutive frames, even if we intend for it to have a smooth animation at a fixed speed. Quick camera movement produces the same effect on pretty much every object in the scene. Our eyes can notice these gaps in the animation, which is not great. Motion blur applies a slight blur to objects along the direction in which they move, which aims at filling the animation gaps produced by our discrete frames, tricking the brain into perceiving a smoother animation as a result.

In my demo there are no moving objects other than the skybox and the shadows, which are relatively slow anyway; however, camera movement can make objects change screen-space position quite fast (especially when we rotate the viewpoint), and the motion blur effect helps us perceive a smoother animation in that case.

I will try to cover the actual implementation in some other post, but for now I’ll keep it short and let the images above showcase what the filter actually does at different configuration settings. Notice that the smoothing effect can only be perceived during motion, so still images are not the best way to showcase the result of the filter from the viewer’s perspective; however, still images are a good way to freeze the animation and see exactly how the filter modifies the original image to achieve the smoothing effect.

by Iago Toral at November 15, 2016 12:19 PM

November 14, 2016

Manuel Rego

Recap of the Web Engines Hackfest 2016

This is my personal summary of the Web Engines Hackfest 2016 that happened at Igalia headquarters (in A Coruña) during the last week of September.

The hackfest is a special event, because its target audience is the implementors of the different web browser engines. The idea is to bring browser hackers together for a few days to discuss different topics and work together on some of them. This year we almost reached 40 participants, the largest number of people attending the hackfest since we started it 8 years ago. It was also really nice to have people from the different communities around the open web platform.

One of the breakout sessions of the Web Engines Hackfest 2016

Talks

Although the main focus of the event is the hacking time and the breakout sessions to discuss the different topics, we took advantage of having great developers around to arrange a few short talks. I’ll do a brief review of this year’s talks, which have already been published on a YouTube playlist (thanks to Chema and Juan for the amazing recording and video editing).

CSS Grid Layout

In my case, as usual lately, my focus was the CSS Grid Layout implementation on Blink and WebKit, which Igalia is doing as part of our collaboration with Bloomberg. The good thing was that Christian Biesinger from Google was present at the hackfest; he’s usually the Blink developer reviewing our patches, so it was really nice to have him around to discuss and clarify some topics.

My colleague Javi already wrote a blog post about the hackfest with lots of details about the Grid Layout work we did. Probably the most important bit is that we’re getting really close to shipping the feature. The spec is now a Candidate Recommendation (despite a few issues that still need to be clarified), and we’ve been focusing lately on finishing the last features and fixing interoperability issues with the Gecko implementation. We’ve also just sent the “Intent to ship” mail to blink-dev; if everything goes fine, Grid Layout might be enabled by default in Chromium 57, which will be released around March 2017.

MathML

It’s worth mentioning the work performed by Behdad, Fred and Khaled adding support in HarfBuzz for the OpenType MATH table. Again, Fred wrote a detailed blog post about all this work some weeks ago.

We also had the chance to discuss with more Googlers the possibilities of bringing MathML back into Chromium. We showed the status of the experimental branch created by Fred and explained the improvements we’ve been making to the WebKit implementation. Now Gecko and WebKit both follow the implementation note, and the tests have been upstreamed into the W3C Web Platform Tests repository. Let’s see how all this evolves in the future.

Acknowledgements

Last, but not least, as one of the event organizers, I have to say thanks to everyone attending the hackfest; without all of you the event wouldn’t make any sense. I also want to acknowledge the support of the hackfest sponsors Collabora, Igalia and Mozilla. And I’d like to give kudos to Igalia for hosting and organizing the event one more year. Looking forward to the 2017 edition!

Web Engines Hackfest 2016 sponsors: Collabora, Igalia and Mozilla

Igalia 15th Anniversary & Summit

Just to close this blog post let me talk about some extra events that happened on the last week of September just after the Web Engines Hackfest.

Igalia was celebrating its 15th anniversary with several parties during the week; one of them was on the last day of the hackfest, at night, with a cool live concert included. Of course, hackfest attendees were invited to join us.

Igalia 15th Anniversary Party (picture by Chema Casanova)

Then, over the weekend, Igalia arranged a new summit. We usually do two per year and, in my humble opinion, they’re really important for Igalia as a whole. The flat structure is based on trust in our peers, so spending a few days a year all together is wonderful. It allows us to know each other better, while having a good time with our colleagues. And I’m sure they’re very useful for newcomers too, helping them understand our company culture.

Igalia Fall Summit 2016 (picture by Alberto Garcia)

And that’s all for this post; I hope you didn’t get bored. Thanks for reading this far. 😊

November 14, 2016 11:00 PM

Antonio Gomes

Chromium, ozone, wayland and beyond

Over the past two months, my fellow Igalian Frédéric Wang and I have been working on improving the Wayland support in Chromium. In this post I will detail some of the progress we’ve made and what our next steps are.

Background

The more Wayland evolved into a mature alternative to the traditional X11-based environments, the more the Chromium community realized efforts were needed to run Chrome natively and flawlessly on such environments. In that context, the Ozone project began back in 2013, with the main goal of making it easier to port Chrome to different graphics systems.

Ozone consists of a set of C++ classes for abstracting different window systems on Linux. It provides abstractions for constructing the accelerated surfaces underlying the Aura UI framework, for assigning input devices, and for event handling.

Also in 2013, Intel presented an Ozone/Wayland backend to the Chromium community. Although the project progressed really well, after some time of active development it stalled: its most recent update is compatible with Chromium m49, from December 2015.

Leaping forward to 2016, a Wayland backend was added to Chromium upstream by Michael Forney. The implementation is not as feature complete as Intel’s (for example, drag and drop support is not implemented, and it only works with specific build configurations of Chrome), but it is definitely a good starting point.

In July, I experimented a bit with this existing Wayland backend in Chromium. In summary, I took the existing Wayland support upstream and “ported” the minimal patches from Intel’s 01.org repository, managing to get ContentShell running on Wayland. However, given how the Wayland backend in Chromium upstream is designed and implemented, the GPU process failed to start, and the HW rendering path was not functional.


Communication with Google

Our main goal was to make Chrome capable of running natively on a Wayland-based environment, with hardware accelerated rendering enabled through the GPU process. After the investigation phase, the natural follow-up step for us at Igalia was to contact the Chrome hackers at Google, share our plans for Chromium/Wayland, and gather feedback. We reached out to @forney and @rjkroege, and once we heard back, we could understand why “porting” the missing bits from Intel’s original Ozone/Wayland implementation would not be the best choice in the long run.

Basically, the Ozone/Wayland backend present in Chromium upstream is designed in accordance with the original Mus+Ash project’s architecture: the UI and GPU components live in the same process. This way, the Wayland connection and other objects, which belong to the UI component, are accessible to the GPU component, and accelerated HW rendering can be implemented without any IPC.

This differs from Intel’s original implementation, where the GPU process owns the Wayland connections and objects, and the legacy Chromium IPC system is used to pass Wayland data from the GPU process to the Browser process.

The latest big Chromium refactor, since the Content API modularization, is the so-called Mus+Ash (read “mustash”) effort. In an over-summarized definition, it consists of factoring specific parts of Chromium out of Chrome into “services”, each featuring well-defined interfaces for communication via the Mojo IPC subsystem.

It was also pointed out to us that the UI and GPU components will eventually be split into different processes. At that point, the same approach as the one taken for the DRM/GBM Ozone backend was advised in order to implement communication between the UI and GPU components.

Igalia improvements

Given that “ash” is a ChromeOS-oriented UI component, it seems logical to say that the “mus+ash” project’s primary focus is on the ChromeOS platform. Well, with that in mind, we thought we just needed to: (1) make a ChromeOS/Ozone/Wayland build of Chrome; (2) run ‘chrome --mash --ozone-platform=wayland’; (3) voilà, all should work.

WRONG!

When Fred and I started on this project two months ago, the situation was:

  • The Chrome/Ozone/Wayland configuration was only buildable on ChromeOS, and failed to run by default.
  • The Ozone project documentation was out of date (e.g. it referred to GYP and the obsolete Intel Ozone/Wayland code).
  • The Wayland code path was not tested by any buildbots, and could break without anyone noticing.

After various code clean-ups and build fixes, we could finally launch Chrome off of a ChromeOS/Mash/Ozone/X11 build.

With further investigation, we verified that ChromeOS/Mash/Ozone/Wayland worked in m52 (when hardware accelerated rendering support was added to the Wayland backend), m53 and m54 checkouts, but not in m55. We bisected commits between m54 and m55, identified the regressions, and fixed them in issues 2387063002 and 2389053003. Finally, chrome --mash --ozone-platform=wayland was running with hardware accelerated rendering enabled on “off-device” ChromeOS builds.


Linux desktop builds

Our next goal at this point of the collaboration was to make it possible to build and run chrome --mash off of a regular Linux desktop build, through the Ozone/Wayland backend.

After discussing it on the ozone-dev mailing list, we submitted and upstreamed various patches, converging on a point where a preliminary version of Chrome/Mash/Ozone/Wayland was functional on a non-ChromeOS build, as shown below.


Note that this version of Chrome is described above as “preliminary” because, although it launches off of a regular desktop Linux build, it still runs with UI components from the ChromeOS UI environment, including a yellow bottom tray and other items that are part of the so-called “ash” desktop environment.

Improving developers’ ecosystem

As mentioned, when we started on this collaboration, the Ozone project featured out-of-date documentation, suffered from compilation and execution failures, and was mostly not integrated into Chromium’s continuous testing system (buildbots, etc). Given that Ozone is the abstraction layer we would work on in order to get a Wayland backend up and running for Chrome, we found it sensible to improve this situation.

By making use of Igalia’s excellence in community work, many accomplishments were made on this front, and the Ozone documentation is now up to date.

Also, after having fixed the recurring Ozone/Wayland build and execution failures, Fred and I realized that the Ozone/Wayland code path in Chromium needed to be exercised by a buildbot, so as to reduce its maintenance burden. Fred emailed ozone-dev about it, and we started contributing to this effort together with the Googler thomasanderson@. As a result, the Ozone/Wayland backend was made part of Chromium’s continuous integration testing system.

Also, Fred wrote a great post about the current Ozone/Wayland code in Chromium, its main classes, and some experimentation done on adding Mojo capability to it. I consider it a must-read follow-up to this post.

Next steps

Today chrome --mash launches within an “ash” window, where the “ash” environment works as a shell for child internal windows, Chrome being one of them. Since “mash” is short for “mus+ash”, our next step is to “eliminate” the “ash” aspects from the execution path, resulting in a “Chrome Mus” launch.

This project is sponsored by Renesas Electronics …


… and has been performed by the great Igalian hacker Frédéric Wang and me, on behalf of Igalia.


by agomes at November 14, 2016 04:42 PM

November 10, 2016

Xabier Rodríguez Calvar

Web Engines Hackfest 2016

From September 26th to 28th we celebrated the 2016 edition of the Web Engines Hackfest at the Igalia HQ. This year we broke all records and got participants from the three main companies behind the three biggest open source web engines, namely Mozilla, Google and Apple. Of course, it was not only them; we had some other companies there, and ourselves. I was an active part of the organization, and I think that not only did we not get any complaints, but people were comfortable and happy.

We had several talks (I included the slides and YouTube links):

We had lots and lots of interesting hacking and we also had several breakout sessions:

  • WebKitGTK+ / Epiphany
  • Servo
  • WPE / WebKit for Wayland
  • Layout Models (Grid, Flexbox)
  • WebRTC
  • JavaScript Engines
  • MathML
  • Graphics in WebKit

What I did during the hackfest was work with Enrique and Žan to advance the review of our downstream implementation of Media Source Extensions (MSE), in order to land it as soon as possible, and I can proudly say that we already did (we didn’t finish at the hackfest, but managed to do it afterwards). We broke the bots and pissed off Michael and Carlos, but we managed to deactivate it by default and continue working on it upstream.

So, summing up: from my point of view (and not only because I was part of the organization at Igalia, but also based on other people’s opinions), the hackfest was a success, and I think we will continue as we were, or maybe grow a bit (no spoilers!).

Finally I would like to thank our gold sponsors Collabora and Igalia and our silver sponsor Mozilla.

by calvaris at November 10, 2016 08:59 AM

October 31, 2016

Manuel Rego

My experience at W3C TPAC 2016 in Lisbon

At the end of September I attended W3C TPAC 2016 in Lisbon, together with my Igalian fellows Joanie and Juanjo. TPAC is where all the people working in the different W3C groups meet for a week to discuss tons of topics around the Web. It was my first time at a W3C event, so I had the chance to meet a lot of amazing people whom I’ve been following on the internet for a long time.

Igalia booth

Like last year, Igalia was present at the conference, and we had a nice exhibitor booth just in front of the registration desk. Most of the time my colleague Juanjo was there, explaining Igalia and the work we do to the people who came by the booth.

Igalia Booth at W3C TPAC 2016 (picture by Juan José Sánchez)

In the booth we were showcasing some videos of the latest standards implementations we’ve been doing upstream (like CSS Grid Layout, WebRTC, MathML, etc.). In addition, we were also showing a few demos of our WebKit port called WPE, which has been optimized to run on low-end devices like the Raspberry Pi.

It’s really great to have the chance to explain the work we do around browsers, and to see that there are quite a lot of people interested in it. In this regard, Igalia is a peculiar consultancy that can help companies develop standards upstream, contributing both to the implementations and to the evolution and discussion of the specifications inside the different standards bodies. Of course, don’t hesitate to contact us if you think we can help you achieve similar goals in this field.

CSS Working Group

During TPAC I attended the CSS WG meetings as an observer. It’s been really nice to see that a lot of people there appreciate the work we do around CSS Grid Layout as part of our collaboration with Bloomberg: not only the implementation effort we’re leading on Blink and WebKit, but also the contributions we make to the spec itself.

New ideas for CSS Grid Layout

There was not a lot of discussion about Grid Layout during the meetings, just a few issues that I’ll explain later. However, Jen Simmons did a very cool presentation with a proposal for a new regions specification.

As you probably remember, one of the main concerns regarding the CSS Regions spec is the need for dummy divs, which are used to flow text into. Jen was suggesting that we could use CSS Grids to solve that. Grids create boxes from CSS in which you place elements from the DOM; the idea would be that if you could reference those boxes from CSS, then content could flow into them without the need for dummy nodes. This was linked to some other new ideas to improve CSS Grid Layout:

  • Apply backgrounds and borders to grid cells.
  • Skip cells during auto-placement.

All this is already possible nowadays using dummy div elements (provided, of course, that you use a browser with regions and grid support, like Safari Technology Preview). However, the idea would be to achieve the same thing without any empty items, by referring to the cells directly from CSS.

This of course needs further discussion and won’t be part of level 1 of the CSS Grid Layout spec, but it’d be really nice to get some of those things included in future versions. I discussed this with Jen, and some of them (like skipping cells for auto-placement) shouldn’t be hard to implement. Lastly, this reminded me of a discussion we had with Mihnea Ovidenie two years ago at the Web Engines Hackfest 2014.

Last issues on CSS Grid Layout

After the meetings I had the chance to discuss some issues one-on-one with fantasai (one of the Grid Layout spec editors).

One of the topics, discussed during the CSS WG meeting, was how to handle percentages in Grid Layout when they need to be resolved in an intrinsic-size situation. The question here was whether percentage tracks and percentage gaps should resolve in the same way or differently; the group agreed that they should work exactly the same. However, there is something else: Firefox computes percentages for margins differently from the other browsers. I tried to explain all this in a detailed issue for the CSS WG. I really think we only have one option here, but it’ll take a while until we have a final resolution.

Another issue that I think is still open is about the minimum size of grid items. I believe the text in the spec still needs some tweaks but, thanks to fantasai, we managed to clarify what the expected behavior should be in the common case. However, there are still some open questions and an ongoing discussion regarding images.

Finally, it’s worth mentioning that just after TPAC, the Grid Layout spec transitioned to Candidate Recommendation (CR), which means that it’s getting stable enough to finish the implementations and release them into the wild. Hopefully these open issues will be fixed pretty soon.

Houdini

I also attended the first day of the Houdini meetings. I was lucky, as they started with the Layout API (the one I was most interested in). It’s clear that Google is pushing a lot for this to happen; the whole new Layout NG project inside Blink seems quite related to the new Houdini APIs. It looks like a nice thing to have but, at first sight, it seems quite hard to get right, mainly due to the big complexity of layout on the web.

On the bright side, the Painting API transitioned to CR. Blink already has some prototype implementations that can be used to create cool demos. It’s really amazing to see the Houdini effort already starting to produce some real results.

Other

It’s always a pleasure to visit Portugal and Lisbon: a wonderful country and a wonderful city. On one of the conference afternoons I had the chance to go for a walk from the congress center to the lovely Belém Tower. It was clear I couldn’t miss the chance to see it while attending such a big conference; you don’t always have TPAC so close to home.

Lisbon wall painting

Overall it was a really nice experience; everyone was very kind and supportive of Igalia and our work on the web platform. Let’s keep working hard to push the web forward!

October 31, 2016 11:00 PM