Planet Igalia

July 10, 2018

Pablo Saavedra


Many times during these last months I thought about updating my blog with a new post. Unfortunately, for one reason or another, I always found an excuse not to. Well, I think that time is over, because I have finally found something useful and worth the time spent writing about it.

– That is OK but … what are you talking about?
– Be patient, Pablo. If you didn’t skip the headline of the post, you probably already know what I’m talking about :-).

Yes, I’m talking about how to set up a Mac Pro in an icecc cluster based on Linux hosts, taking advantage of them to get more CPU power and build heavy software projects, like Chromium, faster. The idea behind this is to distribute the computational work over Linux nodes (far cheaper than any Mac), which serve the cross-compilation tasks requested by the Mac host.

I’ve been working as a sysadmin at Igalia for the last couple of years. One of my duties here is to support and improve the developers’ build infrastructure. Recently we’ve faced long build times for heavy software projects like Chromium. In this context, one of the main issues I had to solve was how to build Chromium for MacOS in a reasonable time without spending a lot of money on expensive, bleeding-edge Apple hardware just to get CPU power.

This is what this post is about: how to configure a Mac Pro to use a Linux-based icecc cluster to speed up builds through cross-compilation. For simplicity, the explanation focuses on the case of a single Linux host acting as icecc node and a single MacOS host requesting compile jobs but, in any case, you can extrapolate the instructions provided here to as many nodes as you need.

So let’s go with the explanation but, first of all, a summary for those who want to go directly to the minimal and essential information …


On the Linux host:

# Configure the iceccd
$ sudo apt install icecc
$ sudo systemctl enable icecc-scheduler
$ edit /etc/icecc/icecc.conf
$ sudo systemctl restart icecc

# Generate the clang cross-compiling toolchain
$ sudo apt install build-essential icecc
$ git clone ~/depot_tools
$ export PATH=$PATH:~/depot_tools
$ git clone ~/chromium_clang_darwin_on_linux
$ cd ~/chromium_clang_darwin_on_linux
$ export CLANG_REVISION=332838  # or CLANG_REVISION=$(./get-chromium-clang-revision)
$ ./icecc-create-darwin-env
# copy the clang_darwin_on_linux_332838.tar.gz to your MacOS host

On the Mac:

# Configure the iceccd
$ git clone ~/icecream-mac/
$ sudo ~/icecream-mac/
$ launchctl load /Library/LaunchDaemons/org.icecream.iceccd.plist
$ launchctl start /Library/LaunchDaemons/org.icecream.iceccd.plist

# Set the ICECC env vars
$ export ICECC_VERSION=x86_64:~/clang_darwin_on_linux_332838.tar.gz
$ export PATH=~/icecream-mac/bin/icecc/:$PATH

# Get the depot_tools
$ cd ~
$ git clone
$ export PATH=$PATH:~/depot_tools

# Download and build the Chromium sources
$ cd chromium && fetch chromium && cd src
$ gn gen out/Default --args='cc_wrapper="icecc" \
  treat_warnings_as_errors=false \
  clang_use_chrome_plugins=false \
  use_debug_fission=false \
  linux_use_bundled_binutils=false \
  use_lld=false'
$ ninja -j 32 -C out/Default chrome

… and now the detailed explanation

Installation and setup of icecream on Linux hosts

The installation of icecream on a Debian-based Linux host is pretty simple. The latest version of icecc (1.1) has been available in Debian testing and sid for a while, so all you have to do is install it from the APT repositories. For stretch, there is a backport available in a publicly available repository:

sudo apt install icecc

The second important part of an icecc cluster is the icecc-scheduler. This daemon is in charge of routing the requests from the icecc nodes that need CPUs for compiling to the nodes of the cluster that are allowed to run remote build jobs.

In this setup we will activate the scheduler on the Linux node. The key here is that only one scheduler should be up at a time in the same network, to avoid errors in the cluster.

sudo systemctl enable icecc-scheduler

Once the scheduler is configured and up, it is time to add icecc hosts to the cluster. We will start by adding the Linux host, with the following settings:

  • The IP of the icecc scheduler is
  • The Linux host is allowed to run remote jobs
  • The Linux host is allowed to run up to 32 concurrent jobs (this is an arbitrary decision and can be adjusted for each particular host)
    # edit /etc/icecc/icecc.conf

We will need to restart the service to apply those changes:

sudo systemctl restart icecc

Installation and setup of icecream on MacOS hosts

The next step is to install and configure the icecc service on our Mac. The easy way to get icecc on a Mac is the icecream-mac project from darktears. We will do the installation assuming the following:

  • The local user account in Mac is psaavedra
  • The IP of the icecc scheduler is
  • The Mac is not allowed to accept remote jobs
  • We don’t want to use the Mac as a worker.

To get the icecream-mac software we will clone the project from GitHub:

git clone /Users/psaavedra/icecream-mac/
sudo /Users/psaavedra/icecream-mac/

We will edit the /Library/LaunchDaemons/org.icecream.iceccd.plist daemon definition slightly, as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "">
<plist version="1.0">

Note that we are setting 2 workers on the Mac. Those workers are needed to run local tasks on the client host, such as linking. We will reload the service with this configuration:

launchctl load /Library/LaunchDaemons/org.icecream.iceccd.plist
launchctl start /Library/LaunchDaemons/org.icecream.iceccd.plist

Getting the cross-compilation toolchain for the icecream-mac

We already have the icecc cluster configured but, before starting to build Chromium on MacOS using icecc, there is still something left to do. We still need a cross-compiling clang for Darwin that runs on Linux and, to avoid incompatibilities between versions, it must be based on the very same clang revision as the Chromium code you are going to compile.

You can check and get the cross-compilation clang revision that you need as follows:

cd src
CLANG_REVISION=$(cat tools/clang/scripts/ | grep CLANG_REVISION | head -n 1 | cut -d "'" -f 2)

To simplify this step, I made some scripts that make it easy to generate this clang cross-compilation toolchain. On a Linux host:

  • Install build depends:
    sudo apt install build-essential icecc
  • Get the Chromium project depot tools
    git clone ~/depot_tools  
    export PATH=$PATH:~/depot_tools
  • Download the psaavedra’s scripts (yes, my scripts):
    git clone ~/chromium_clang_darwin_on_linux
    cd ~/chromium_clang_darwin_on_linux
  • You can use the get-chromium-clang-revision script to get the latest clang revision used in Chromium master:
  • and then, to build the cross-compiled toolchain:

    This script encapsulates the download, configuration and build of clang.

  • A clang_darwin_on_linux_999999.tar.gz file will be generated.

Setup the icecc environment variables

Once you have /Users/psaavedra/clang_darwin_on_linux_332838.tar.gz on your MacOS host, you are ready to set the icecc environment variables.

export ICECC_VERSION=x86_64:/Users/psaavedra/clang_darwin_on_linux_332838.tar.gz

The first variable enables the usage of the remote clang for C++. The second one establishes the toolchain that the x86_64 (Linux) nodes use to build the code sent from the Mac.

Finally, remember to add the icecc binaries to the $PATH:

export PATH=/Users/psaavedra/icecream-mac/bin/icecc/:$PATH

… and building Chromium, at last

At this point, it’s time to build Chromium using the icecc cluster and the cross-compilation clang toolchain created previously. These steps follow the official Chromium build procedure, adapted only to set up the icecc wrapper.

Ensure depot_tools is in the path:

cd ~
git clone
export PATH=$PATH:~/depot_tools
ninja --version
# 1.8.2

Get the code:

git config --global core.precomposeUnicode true
mkdir chromium
cd chromium
fetch chromium

Configure the build:

cd src
gn gen out/Default --args='cc_wrapper="icecc" treat_warnings_as_errors=false clang_use_chrome_plugins=false linux_use_bundled_binutils=false use_jumbo_build=true'
# or with ccache
export CCACHE_PREFIX=icecc
gn gen out/Default --args='cc_wrapper="ccache" treat_warnings_as_errors=false clang_use_chrome_plugins=false linux_use_bundled_binutils=false use_jumbo_build=true'

And build, at last:

ninja -j 32 -C out/Default chrome

icemon lets you monitor the icecc cluster graphically. Run it remotely from your Linux host if you don’t want to install it on MacOS:

ssh -X user@yourlinuxbox icemon

With icemon you should see how each build task is distributed across the icecc cluster.

by Pablo Saavedra at July 10, 2018 08:15 AM

July 09, 2018

Frédéric Wang

Review of Igalia's Web Platform activities (H1 2018)

This is the semiyearly report to let people know a bit more about Igalia’s activities around the Web Platform, focusing on the activity of the first semester of year 2018.



Igalia has proposed and developed the specification for BigInt, enabling math on arbitrary-sized integers in JavaScript. Igalia has been developing implementations in SpiderMonkey and JSC, where core patches have landed. Chrome and Node.js shipped implementations of BigInt, and the proposal is at Stage 3 in TC39.

Igalia is also continuing to develop several features for JavaScript classes, including class fields. We developed a prototype implementation of class fields in JSC. We have maintained Stage 3 in TC39 for our specification of class features, including static variants.

We also participated in WebAssembly (now at First Public Working Draft) and in internationalization features such as Intl.RelativeTimeFormat (currently at Stage 3).

Finally, we have written more tests for JS language features, performed maintenance and optimization and participated in other spec discussions at TC39. Among performance optimizations, we have contributed a significant Promise performance optimization to V8.


Igalia has continued the standardization effort at the W3C. We are pleased to announce that the following milestones have been reached:

A new charter for the ARIA WG as well as drafts for ARIA 1.2 and Core Accessibility API Mappings 1.2 are in preparation and are expected to be published this summer.

On the development side, we implemented new ARIA features and fixed several bugs in WebKit and Gecko. We have refined platform-specific tools that are needed to automate accessibility Web Platform Tests (examine the accessibility tree, obtain information about accessible objects, listen for accessibility events, etc) and hope we will be able to integrate them in Web Platform Tests. Finally we continued maintenance of the Orca screen reader, in particular fixing some accessibility-event-flood issues in Caja and Nautilus that had significant impact on Orca users.

Web Platform Predictability

Thanks to support from Bloomberg, we were able to improve interoperability for various Editing/Selection use cases, for example when using backspace to delete text content just after a table (W3C issue) or when deleting a list item inside a content cell.

We were also pleased to continue our collaboration with the AMP project. They provide us with a list of bugs and enhancement requests (mostly for the WebKit iOS port) with concrete use cases and repro cases. We check the status and plans in WebKit, do debugging/analysis and of course actually submit patches to address the issues. That’s not always easy (e.g. when it touches proprietary code or requires finding specific reviewers) but at least we make discussions move forward. The topics are very diverse: the MessageChannel API, CSSOM View, CSS transitions, CSS animations, iOS frame scrolling, custom elements, navigating special links and many others.

In general, our projects are always a good opportunity to write new Web Platform Tests, import them into WebKit/Chromium/Mozilla or improve the testing infrastructure. We have been able to work on tests for several of the specifications we work on.


Thanks to support from Bloomberg we’ve been pursuing our activities around CSS:

We also got more involved in the CSS Working Group, in particular participating in the face-to-face meeting in Berlin, and we will attend the TPAC meeting in October.


We have also continued improving the web platform implementation of some Linux ports of WebKit (namely GTK and WPE). A lot of this work was possible thanks to the financial support of Metrological.

Other activities

Preparation of Web Engines Hackfest 2018

Igalia has been organizing and hosting the Web Engines Hackfest since 2009, a three-day event where Web Platform developers can meet, discuss and work together. We are still working on the list of invitees, sponsors and talks but you can already save the date: it will happen from the 1st to the 3rd of October in A Coruña!

New Igalians

This semester, new developers have joined Igalia to pursue the Web platform effort:

  • Rob Buis, a Dutch developer currently living in Germany. He is a well-known member of the Chromium community and is currently helping on the web platform implementation in WebKit.

  • Qiuyi Zhang (Joyee), based in China, is a prominent member of the Node.js community who is now also assisting our compilers team on V8 developments.

  • Dominik Infuer, an Austrian specialist in compilers and programming language implementation who is currently helping on our JSC effort.

Coding Experience Programs

Two students started a coding experience program a few weeks ago:

  • Oriol Brufau, a recent graduate in math from Spain who has been an active collaborator of the CSS Working Group and a contributor to the Mozilla project. He is working on the CSS Logical Properties and Values specification, implementing it in Chromium.

  • Darshan Kadu, a computer science student from India, who contributed to GIMP and Blender. He is working on Web Platform Tests with focus on WebKit’s infrastructure and the GTK & WPE ports in particular.

Additionally, Caio Lima is continuing his coding experience in Igalia and is among other things working on implementing BigInt in JSC.


Thank you for reading this blog post and we look forward to more work on the web platform this semester!

July 09, 2018 10:00 PM

June 26, 2018

Gyuyoung Kim

How to develop Chromium with Visual Studio Code on Linux?

How have you been developing Chromium? I have often been asked what the best tool for Chromium development is. I guess Chromium developers have usually been using vim, emacs, cscope, Sublime Text, Eclipse, etc., with GDB or console logs for debugging. On Windows, developers have used Visual Studio. Although Visual Studio offers powerful features for developing C/C++ programs, unfortunately it can’t be used by developers on other platforms. Recently, however, I noticed that Visual Studio Code can also support Chromium development, with a nice editor, powerful debugging tools and a lot of extensions. It even works on Linux and Mac because it is based on Electron. Nowadays I’m developing Chromium with VS Code, and I feel it is one of the nicest tools for the job. So I’d like to share my experience of developing Chromium with Visual Studio Code on Ubuntu.

Default settings for Chromium

There is already an article about how to set up the environment to build Chromium in VS Code. So to start, you need to prepare the environment settings. This section just lists the key points of those instructions.

  1. Download VS Code
  2. Launch VS Code in chromium/src
    $ code .
  3. Install useful extensions
    1. Python – Linting, intellisense, code formatting, refactoring, debugging, snippets.
    2. c/c++ for visual studio code – Code formatting, debugging, Intellisense.
    3.  Toggle Header/Source Toggles between .cc and .h with F4. The C/C++ extension supports this as well through Alt+O but sometimes chooses the wrong file when there are multiple files in the workspace that have the same name.
    4. you-complete-me YouCompleteMe code completion for VS Code. It works fairly well in Chromium. To install You-Complete-Me, enter these commands in a terminal:
      $ git clone ~/.ycmd 
      $ cd ~/.ycmd 
      $ git submodule update --init --recursive 
      $ ./ --clang-completer
    5. Rewrap –  Wrap lines at 80 characters with Alt+Q.
  4. Setup for Chromium
  5. Key mapping
    * Ctrl+P: opens a search box to find and open a file.
    * F1 or Ctrl+Shift+P: opens a search box to find a command
      (e.g. Tasks: Run Task).
    * Ctrl+K, Ctrl+S: opens the key bindings editor.
    * Ctrl+`: toggles the built-in terminal.
    * Ctrl+Shift+M: toggles the problems view (linter warnings, 
      compile errors and warnings). You'll switch a lot between
      terminal and problem view during compilation.
    * Alt+O: switches between the source/header file.
    * Ctrl+G: jumps to a line.
    * F12: jumps to the definition of the symbol at the cursor
      (also available on right-click context menu).
    * Shift+F12 or F1: CodeSearchReferences, Return shows all
      references of the symbol at the cursor.
    * F1: CodeSearchOpen, Return opens the current file in
      Code Search.
    * Ctrl+D: selects the word at the cursor. Pressing it multiple
      times multi-selects the next occurrences, so typing in one
      types in all of them, and Ctrl+U deselects the last occurrence.
    * Ctrl+K+Z: enters Zen Mode, a fullscreen editing mode with
      nothing but the current editor visible.
    * Ctrl+X: without anything selected cuts the current line.
      Ctrl+V pastes the line.
  6. (Optional) Color setting
    • Press Ctrl+Shift+P, color, Enter to pick a color scheme for the editor

Additional settings after the default settings.

  1. Set workspaceRoot in .bashrc (some extensions need it).
    export workspaceRoot=$HOME/chromium/src
  2. Add new tasks to tasks.json in order to build Chromium by using ICECC.
    1. I wrote about how to build Chromium with ICECC in a previous post. Please set it up first.
      1. Share my experience to build Chromium with ICECC and Jumbo
          "taskName": "1-build_chrome_debug_icecc",
          "command": " Debug",
          "isShellCommand": true,
          "isTestCommand": true,
          "problemMatcher": [
              "owner": "cpp",
              "fileLocation": [
              "pattern": {
                "regexp": "^../../(.*):(\\d+):(\\d+):\\s+(warning|\\w*\\s?error):\\s+(.*)$",
                "file": 1,
                "line": 2,
                "column": 3,
                "severity": 4,
                "message": 5
              "owner": "cpp",
              "fileLocation": [
              "pattern": {
                "regexp": "^../../(.*?):(.*):\\s+(warning|\\w*\\s?error):\\s+(.*)$",
                "file": 1,
                "severity": 3,
                "message": 4
  3. Update “Chrome Debug” configuration in launch.json
      "name": "Chrome Debug",
      "type": "cppdbg",
      "request": "launch",
      "targetArchitecture": "x64",
      "environment": [
        {"name":"workspaceRoot", "value":"${HOME}/chromium/src"}
      "program": "${workspaceRoot}/out/Debug/chrome",
      "args": ["--single-process"],  // The debugger only can work with the single process mode for now.
      "preLaunchTask": "1-build_chrome_debug_icecc",
      "stopAtEntry": false,
      "cwd": "${workspaceRoot}/out/Debug",
      "externalConsole": true,

Screenshot after the settings

You will see this window after finishing all the settings.

Build Chromium with ICECC in VS Code

  1. Run Task (F1 or Ctrl+Shift+P)
  2. Select 1-build_chrome_debug_icecc. VS Code will show an integrated terminal when building Chromium, as below. On the ICECC monitor, you can see that VS Code builds Chromium by using ICECC.


Start debugging in VS Code

After completing the build, it is time to start debugging Chromium.

  1. Set a breakpoint
    1. Press F9 or just click to the left of the line number
  2. Launch debug
    1. Press F5
  3. Screen captures when debugger stopped at a breakpoint
    1. Overview
    2. Editor
    3. Call stack
    4. Variables
    5. Watch
    6. Breakpoints


In the multi-process model, VS Code can’t debug child processes (e.g. the renderer process) yet. The C/C++ extension project site suggests adding the command below to launch.json, but it didn’t work for Chromium when I tried.

"setupCommands": [
    { "text": "-gdb-set follow-fork-mode child" }
]


  1. Chromium VS Code setup:

by gyuyoung at June 26, 2018 12:55 AM

June 14, 2018

Diego Pino

Fast checksum computation

An Internet packet generally includes two checksums: a TCP/UDP checksum and an IP checksum. In both cases, the checksum value is calculated using the same algorithm. For instance, IP header checksum is computed as follows:

  • Set the packet’s IP header checksum to zero.
  • Fetch the IP header octets as 16-bit words and calculate the accumulated sum.
  • In case there’s an overflow while adding, add the carry bit to the total sum.
  • Once the sum is done, set the checksum to the one’s complement of the accumulated sum.

Here is an implementation of this algorithm in Lua:

local ffi = require("ffi")
local bit = require("bit")

local function checksum_lua (data, size)
   local function r16 (data)
      return ffi.cast("uint16_t*", data)[0]
   end
   local csum = 0
   local i = size
   -- Accumulated sum.
   while i > 1 do
      local word = r16(data + (size - i))
      csum = csum + word
      i = i - 2
   end
   -- Handle odd sizes.
   if i == 1 then
      csum = csum + data[size-1]
   end
   -- Add accumulated carry.
   while true do
      local carry = bit.rshift(csum, 16)
      if carry == 0 then break end
      csum = bit.band(csum, 0xffff) + carry
   end
   -- One's complement.
   return bit.band(bit.bnot(csum), 0xffff)
end

The IP header checksum is calculated only over the IP header octets. However, the TCP checksum is calculated over the TCP header and the packet’s payload, plus an extra header called the pseudo-header.

A pseudo-header is a 12-byte data structure composed of:

  • IP source and destination addresses (8 bytes).
  • Protocol (TCP=0x6 or UDP=0x11), preceded by a zero byte (2 bytes).
  • TCP length, calculated as the IP header’s total length minus the IP header size (2 bytes).
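
The layout above can be sketched in C as follows. This is only an illustration under a few assumptions: the function names (sum16, fold_and_complement, l4_checksum_ipv4) are hypothetical, addresses are IPv4, and octets are combined explicitly as big-endian 16-bit words so the result does not depend on the host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Add `len` bytes, read as big-endian 16-bit words, to a running sum. */
static uint32_t sum16(const uint8_t *data, size_t len, uint32_t acc)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        acc += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)                      /* Odd trailing byte, padded with zero. */
        acc += (uint32_t)(data[i] << 8);
    return acc;
}

/* Fold the accumulated carries back in and return the one's complement. */
static uint16_t fold_and_complement(uint32_t acc)
{
    while (acc >> 16)
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)~acc;
}

/* Checksum of a TCP or UDP segment over IPv4, pseudo-header included.
 * The checksum field inside `segment` must be zero when computing. */
uint16_t l4_checksum_ipv4(const uint8_t src[4], const uint8_t dst[4],
                          uint8_t protocol,
                          const uint8_t *segment, uint16_t segment_len)
{
    uint8_t pseudo[12];
    for (int i = 0; i < 4; i++) {
        pseudo[i] = src[i];           /* Source address. */
        pseudo[4 + i] = dst[i];       /* Destination address. */
    }
    pseudo[8] = 0;                    /* Zero byte. */
    pseudo[9] = protocol;             /* 6 for TCP, 17 for UDP. */
    pseudo[10] = (uint8_t)(segment_len >> 8);
    pseudo[11] = (uint8_t)(segment_len & 0xff);

    uint32_t acc = sum16(pseudo, sizeof pseudo, 0);
    acc = sum16(segment, segment_len, acc);
    return fold_and_complement(acc);
}
```

The pseudo-header is never sent on the wire; it is only built temporarily to feed the sum.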

You may wonder what the purpose of the pseudo-header is. David P Reed, often considered the father of UDP, provides a great explanation in this thread: Purpose of pseudo header in TCP checksum. Basically, the original goal of the pseudo-header was to take IP addresses into account as part of the TCP checksum, since they’re relevant fields in an end-to-end communication. Back in those days, the original plan for secure TCP communications was to leave source and destination addresses in the clear but encrypt the rest of the TCP fields. That would prevent man-in-the-middle attacks. However NAT, which is essentially a man-in-the-middle, came along and threw this original plan away. In summary, the pseudo-header exists today for legacy reasons.

Lastly, it is worth mentioning that the UDP checksum is optional in IPv4, so you might often see it set to zero. However, the field is mandatory in IPv6.

Verifying a packet’s checksum

Verifying a packet’s checksum is easy. On receiving a packet, the receiver sums all the relevant octets, including the checksum field. If the packet is correct, the one’s complement of that sum must be zero, since adding a number to its one’s complement always yields all ones (0xffff), which is a representation of zero in one’s complement arithmetic.
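
A minimal sketch of that check in C, assuming hypothetical helper names (ones_complement_sum, checksum_is_valid) and reading the octets as big-endian 16-bit words:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Folded one's complement sum of `len` bytes read as big-endian 16-bit words. */
static uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint32_t acc = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        acc += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)                      /* Odd trailing byte, padded with zero. */
        acc += (uint32_t)(data[i] << 8);
    while (acc >> 16)                 /* Add the carries back in. */
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)acc;
}

/* The data is valid when the sum over all its octets, checksum field
 * included, is 0xffff, that is, when its one's complement is zero. */
bool checksum_is_valid(const uint8_t *data, size_t len)
{
    return (uint16_t)~ones_complement_sum(data, len) == 0;
}
```

For an IP header, `data` is the header itself; for TCP or UDP the pseudo-header octets have to be included in the sum as well.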

From a developer’s perspective there are several tools for verifying the correctness of a packet’s checksum. Perhaps my preferred tool is Wireshark, which features an option to check the validity of TCP checksums (Edit->Preferences->Protocols[TCP], then mark Validate the TCP checksum if possible). When this option is enabled, packets with a wrong checksum are highlighted with a black background.

Bad checksums

Seeing packets with wrong checksums is common when capturing packets with tcpdump and opening them in Wireshark. The reason the checksums are not correct is that TCP checksumming is generally offloaded to the NIC, since it’s a relatively expensive operation (nowadays, NICs come with specialized hardware to do that operation fast). Since tcpdump captures outgoing packets before they hit the NIC, the checksum value hasn’t been calculated yet and likely contains garbage. It’s possible to check whether checksum offloading is enabled in a NIC by typing:

$ ethtool --show-offload <nic> | grep checksumming
rx-checksumming: on
tx-checksumming: on

Another option for verifying checksum values is using tshark:

$ tshark -r packets.pcap -V -o tcp.check_checksum:TRUE | grep -c "Error/Checksum"

Lastly, in case you’d like to fix wrong checksums in a pcap file, it is possible to do that with tcprewrite:

$ tcprewrite -C -i packets.pcap -o packets-fixed.pcap
$ tshark -r packets-fixed.pcap -V -o tcp.check_checksum:TRUE | grep -c "Error/Checksum"

Fast checksum computation

Since TCP checksum computation involves a large chunk of data, improving its performance is important. There are, in fact, several RFCs dedicated exclusively to this topic. RFC 1071 (Computing the Internet Checksum) includes a detailed explanation of the algorithm and also explores different techniques for speeding up checksumming. In addition, it features reference implementations for several hardware architectures such as Motorola 68020, Cray and IBM 370.

Perhaps the fastest way to recompute a checksum of a modified packet is to incrementally update the checksum as the packet gets modified. Take for instance the case of NAT which modifies origin and destination ports and addresses. Those operations affect both the TCP and IP checksums. In the case of the IP checksum, if the source address gets modified we can recompute the new IP checksum as:

new_checksum = ~(~checksum + (-pkt.source_ip) + new_source_ip);

Or more generically using the following formula:

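HC' = one-complement(one-complement(HC) + (-m) + m')

Here HC is the old checksum, HC' the new one, m the old value of the modified 16-bit field and m' its new value; -m denotes the one’s complement of m.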

This technique is covered in RFC 1071 and further polished over two other RFCs: RFC 1141 and RFC 1624 (Incremental Updating of the Internet Checksum).
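
As a small sketch, the incremental update in the form given by RFC 1624 (equation 3, HC' = ~(~HC + ~m + m')) could look like this in C. The function name checksum_update16 is hypothetical, and all values are taken as 16-bit words in the same byte order as the checksum.

```c
#include <stdint.h>

/* Incremental update following RFC 1624, equation 3:
 * HC' = ~(~HC + ~m + m'), with additions in one's complement arithmetic. */
uint16_t checksum_update16(uint16_t old_checksum,
                           uint16_t old_word, uint16_t new_word)
{
    uint32_t acc = (uint16_t)~old_checksum;
    acc += (uint16_t)~old_word;
    acc += new_word;
    while (acc >> 16)                 /* Add the carries back in. */
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)~acc;
}
```

A 32-bit field such as an IPv4 address can be handled by calling the function twice, once per 16-bit half. For example, decrementing the TTL of a header whose TTL/protocol word is 0x4011 and whose checksum is 0xb861 gives checksum_update16(0xb861, 0x4011, 0x3f11), which is 0xb961.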

If we decide to recompute the checksum, there are several techniques to do it fast. In its canonical form, the algorithm says octets are summed as 16-bit words. If there’s a carry after an addition, the carry should be added to the accumulated sum. The truth is that it’s not necessary to add octets as 16-bit words. Due to the associative property of addition, it is possible to do parallel addition using larger word sizes such as 32-bit or 64-bit words. In those cases the variable that stores the accumulated sum has to be bigger too. Once the sum is computed, a final step folds the sum into a 16-bit word (adding the carry, if any).

Here’s an implementation in C using 32-bit words:

#include <stdint.h>
#include <arpa/inet.h>  /* ntohs */

uint16_t checksum (uint8_t* data, uint16_t len)
{
    uint64_t sum = 0;
    uint32_t* p = (uint32_t*) data;
    uint16_t i = 0;
    while (len >= 4) {
        sum = sum + p[i++];
        len -= 4;
    }
    data += i * 4;  // Skip the bytes already summed as 32-bit words.
    if (len >= 2) {
        sum = sum + *(uint16_t*) data;
        data += 2;
        len -= 2;
    }
    if (len == 1) {
        sum += *data;  // Last odd byte (little-endian host assumed).
    }
    // Fold sum into 16-bit word.
    while (sum>>16) {
        sum = (sum & 0xffff) + (sum>>16);
    }
    return ntohs((uint16_t)~sum);
}

Using larger word sizes increases speed as it reduces the total number of operations. What about using this technique on 64-bit integers? It would definitely be possible, but it requires handling the carry in the loop body. In the algorithm above, 32-bit words are summed into a 64-bit word. Carry, if any, is stored in the higher part of sum, which later gets summed in the folding step.
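
To illustrate what handling the carry in the loop body means, here is a sketch in portable C that sums 64-bit words and adds the carry back whenever the accumulator wraps. The name checksum64 is hypothetical, and the result is returned in the byte order of the data in memory, so it should be stored with memcpy rather than byte-swapped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Internet checksum summing 64-bit words. The returned value is in memory
 * byte order: copying it into the packet with memcpy gives the right bytes. */
uint16_t checksum64(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;

    while (len >= 8) {
        uint64_t word;
        memcpy(&word, data, sizeof word);   /* Safe unaligned load. */
        sum += word;
        if (sum < word)                     /* Wrapped past 2^64: */
            sum++;                          /* add the carry back in. */
        data += 8;
        len -= 8;
    }

    /* Fewer than 8 bytes left: pad them with zeros into one last word. */
    if (len > 0) {
        uint64_t word = 0;
        memcpy(&word, data, len);
        sum += word;
        if (sum < word)
            sum++;
    }

    /* Fold 64 -> 32 -> 16 bits, adding the carries back each time. */
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The comparison `sum < word` is how plain C detects the carry that an add-with-carry instruction would give for free.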

Using SIMD instructions should allow us to sum larger sizes of data in parallel. For instance, using AVX2’s VPADD (vector-packed addition) instruction it should be possible to sum 16x16-bit words in parallel. The issue here once again is handling the possible carry generated into the accumulated sum. So instead of a 16x16-bit vector, an 8x32-bit vector is used. From a functional point of view this is equivalent to summing using 128-bit words.

Snabb features implementations of checksum computation, both generic and using SIMD instructions. In the latter case there are versions for the SSE2 and AVX2 instruction sets. Snabb’s philosophy is to do everything in software and rely as little as possible on offloaded NIC functions. Thus checksum computation is something done in code. Snabb’s implementation using AVX2 instructions is available at src/arch/avx2.c (Luke pushed a very interesting implementation in machine code as well. See PR#899).

Going back to RFC 1071, many of the reference implementations do additions in the main loop taking into account the carry bit. For instance, in the Motorola 68020 implementation that is done using the ADDXL instruction. In X86 there’s an equivalent add-with-carry instruction (ADC). Basically this instruction performs a sum of two operands plus the carry-flag.

Another technique described in RFC 1071, and also used in the reference implementations, is loop unrolling. Instead of summing one word per loop, we could sum 2, 4 or 8 words instead. A loop that sums 64-bit words in strides of 8 means actually avoiding loops for packet sizes lower than 512 bytes. Unrolling a loop requires adding waterfall code after the loop to handle the edge-cases that control the bounds of the loop.
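
A sketch of loop unrolling with its waterfall code, in C. This is not Snabb’s implementation: the names load32 and checksum_unrolled are hypothetical, 32-bit words are summed into a 64-bit accumulator so carries pile up in its upper half, and the result is again in memory byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned-safe 32-bit load in memory byte order. */
static uint64_t load32(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Internet checksum with the main loop unrolled four times.
 * Intended for packet-sized inputs; result is in memory byte order. */
uint16_t checksum_unrolled(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;

    /* Main loop: four 32-bit words (16 bytes) per iteration. */
    while (len >= 16) {
        sum += load32(data);
        sum += load32(data + 4);
        sum += load32(data + 8);
        sum += load32(data + 12);
        data += 16;
        len -= 16;
    }
    /* Waterfall: handle the remaining 8, 4 and 1 to 3 bytes in turn. */
    if (len >= 8) {
        sum += load32(data);
        sum += load32(data + 4);
        data += 8;
        len -= 8;
    }
    if (len >= 4) {
        sum += load32(data);
        data += 4;
        len -= 4;
    }
    if (len > 0) {
        uint32_t w = 0;
        memcpy(&w, data, len);        /* Zero-padded tail. */
        sum += w;
    }

    /* Fold into 16 bits, adding the carries back in. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Each waterfall step runs at most once, so the loop condition is checked only once per 16 bytes.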

As an exercise to teach myself more DynASM and X86-64 assembly, I decided to rewrite the generic checksum algorithm and see if performance improved. The first implementation followed the canonical algorithm, summing words as 16-bit values. Performance was much better than the generic Lua implementation posted at the beginning of this article, but it wasn’t better than Snabb’s C implementation, which does loop unrolling.

After this initially disappointing result, I decided to apply some of the optimization techniques discussed above. Summing octets as 32-bit words definitely improved performance. The advantage of writing the algorithm in assembly is that I could make use of the ADC instruction. That allowed me to use 64-bit words. Performance improved once again. Finally I tried out several degrees of loop unrolling. With a loop unrolled in strides of 4, the algorithm proved to be better than the SSE2 algorithm for several packet sizes: 64 bytes, 570 bytes and 1520 bytes. It doesn’t beat the AVX2 implementation in the large packet case, but it showed better performance for small and medium sizes.

And here’s the final implementation:

; Prologue.
push rbp
mov rbp, rsp

; Accumulative sum.
xor rax, rax                ; Clear out rax. Stores accumulated sum.
xor r9, r9                  ; Clear out r9. Stores value of array.
xor r8, r8                  ; Clear out r8. Stores array index.
mov rcx, rsi                ; Rsi (2nd argument; size). Assign rsi to rcx.
1:
cmp rcx, 32                 ; If fewer than 32 bytes remain.
jl >2                       ; Jump to branch '2'.
add rax, [rdi + r8]         ; Sum acc with qword[0].
adc rax, [rdi + r8 + 8]     ; Sum with carry qword[1].
adc rax, [rdi + r8 + 16]    ; Sum with carry qword[2].
adc rax, [rdi + r8 + 24]    ; Sum with carry qword[3].
adc rax, 0                  ; Sum carry-bit into acc.
sub rcx, 32                 ; Decrease remaining size by 32.
add r8, 32                  ; Skip four qwords.
jmp <1                      ; Go to beginning of loop.
2:
cmp rcx, 16                 ; If fewer than 16 bytes remain.
jl >3                       ; Jump to branch '3'.
add rax, [rdi + r8]         ; Sum acc with qword[0].
adc rax, [rdi + r8 + 8]     ; Sum with carry qword[1].
adc rax, 0                  ; Sum carry-bit into acc.
sub rcx, 16                 ; Decrease remaining size by 16.
add r8, 16                  ; Skip two qwords.
3:
cmp rcx, 8                  ; If fewer than 8 bytes remain.
jl >4                       ; Jump to branch '4'.
add rax, [rdi + r8]         ; Sum acc with qword[0].
adc rax, 0                  ; Sum carry-bit into acc.
sub rcx, 8                  ; Decrease remaining size by 8.
add r8, 8                   ; Next 64-bit.
4:
cmp rcx, 4                  ; If fewer than 4 bytes remain.
jl >5                       ; Jump to branch '5'.
mov r9d, dword [rdi + r8]   ; Fetch 32-bit from data + r8 into r9d.
add rax, r9                 ; Sum acc with r9. Accumulate carry.
sub rcx, 4                  ; Decrease remaining size by 4.
add r8, 4                   ; Next 32-bit.
5:
cmp rcx, 2                  ; If fewer than 2 bytes remain.
jl >6                       ; Jump to branch '6'.
movzx r9, word [rdi + r8]   ; Fetch 16-bit from data + r8 into r9.
add rax, r9                 ; Sum acc with r9. Accumulate carry.
sub rcx, 2                  ; Decrease remaining size by 2.
add r8, 2                   ; Next 16-bit.
6:
cmp rcx, 1                  ; If no bytes remain.
jl >7                       ; Jump to branch '7'.
movzx r9, byte [rdi + r8]   ; Fetch 8-bit from data + r8 into r9.
add rax, r9                 ; Sum acc with r9. Accumulate carry.
7:

; Fold 64-bit into 16-bit.
mov r9, rax                 ; Assign acc to r9.
shr r9, 32                  ; Shift r9 32-bit. Stores higher part of acc.
and rax, 0x00000000ffffffff ; Clear out higher-part of rax. Stores lower part of acc.
add eax, r9d                ; 32-bit sum of acc and r9.
adc eax, 0                  ; Sum carry to acc.
mov r9d, eax                ; Repeat for 16-bit.
shr r9d, 16
and eax, 0x0000ffff
add ax, r9w
adc ax, 0

; One's complement.
not rax                     ; One-complement of rax.
and rax, 0xffff             ; Clear out higher part of rax.

; Epilogue.
mov rsp, rbp
pop rbp

; Return.
ret
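
For readers who want to check the logic without assembling anything, here is a minimal Python sketch of the same idea: sum wide little-endian words with end-around carry, handle the tail in ever smaller pieces, fold down to 16 bits and complement. The function names are made up for this sketch, and it is an illustration of the technique rather than Snabb's code. It also assumes that the caller wants the result in network byte order, so it swaps the two bytes at the end. A canonical 16-bit implementation is included for comparison.

```python
MASK64 = (1 << 64) - 1


def fold16(acc):
    """Fold a wide sum into 16 bits, adding the carries back in."""
    while acc >> 16:
        acc = (acc & 0xFFFF) + (acc >> 16)
    return acc


def checksum_reference(data):
    """Canonical algorithm: sum big-endian 16-bit words, then complement."""
    data = bytes(data)
    if len(data) % 2:
        data += b"\x00"  # pad odd-sized input with a zero byte
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    return ~fold16(total) & 0xFFFF


def checksum_wide(data):
    """Sum little-endian 64-bit words with end-around carry, then the tail."""
    data = bytes(data)
    acc = 0
    pos = 0
    for width in (8, 4, 2, 1):
        while len(data) - pos >= width:
            acc += int.from_bytes(data[pos:pos + width], "little")
            acc = (acc & MASK64) + (acc >> 64)  # same role as "adc rax, 0"
            pos += width
    host_order = ~fold16(acc) & 0xFFFF
    # The words were read in little-endian order, so swap the two bytes
    # to get the value in network byte order.
    return ((host_order & 0xFF) << 8) | (host_order >> 8)


packet = bytes.fromhex("0001f203f4f5f6f7")  # example bytes from RFC 1071
print(hex(checksum_reference(packet)))  # 0x220d
print(hex(checksum_wide(packet)))       # 0x220d
```

Both functions agree because one's complement addition does not care about word width or byte order until the final fold, which is exactly the property the assembly version exploits.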

Benchmark results for several data sizes:

Data size: 44 bytes

Algorithm   Time per csum   Time per byte
Generic     87.77 ns        1.99 ns
SSE2        86.06 ns        1.96 ns
AVX2        83.20 ns        1.89 ns
New         52.10 ns        1.18 ns

Data size: 550 bytes

Algorithm   Time per csum   Time per byte
Generic     1058.04 ns      1.92 ns
SSE2        510.40 ns       0.93 ns
AVX2        318.42 ns       0.58 ns
New         270.79 ns       0.49 ns

Data size: 1500 bytes

Algorithm   Time per csum   Time per byte
Generic     2910.10 ns      1.94 ns
SSE2        991.04 ns       0.66 ns
AVX2        664.98 ns       0.44 ns
New         743.88 ns       0.50 ns

All in all, it has been a fun exercise. I have learned quite a lot about the Internet checksum algorithm. I also learned how loop unrolling can help improve performance in a dramatic way (more than I initially expected). I also found it very interesting how changing the context of a problem, in this case the target programming language, forces you to think about the problem in a different way, while also enabling optimizations that were not possible before.

June 14, 2018 06:00 AM

June 13, 2018

Jacobo Aragunde

Chromium official/release builds and icecc

You may already be using icecc to compile your Chromium, either by following some instructions like the ones published by my colleague Gyuyoung or using the popular icecc-chromium set of scripts. In those cases, you will probably run into some trouble if you try to generate an official build with that configuration.

First, let me recap what an “official build” means in Chromium. You may know that build optimization in Chromium depends on two flags:

  • is_debug
    Debug build. Enabling official builds automatically sets is_debug to false.
  • is_official_build
    Set to enable the official build level of optimization. This has nothing
    to do with branding, but enables an additional level of optimization above
    release (!is_debug). This might be better expressed as a tri-state
    (debug, release, official) but for historical reasons there are two
    separate flags.

    The GN documentation is pretty verbose about this. To sum up, to get full binary optimization you should enable is_official_build, which will also disable is_debug in the background. This is what other projects would call a release build.

    Back to the main topic, I was running an official build distributed via icecc and stumbled on some compilation problems:

    clang: error: no such file or directory: /usr/lib/clang/7.0.0/share/cfi_blacklist.txt
    clang: error: no such file or directory: ../../tools/cfi/blacklist.txt
    clang: error: no such file or directory: /path/to/src/chrome/android/profiles/

    These didn’t happen when the icecc build was disabled, so I was certain I had found some limitations in the distributed compiler. The icecc-chromium set of scripts was already disabling a number of clang cleanup/sanitize tools, so I decided to take the same approach. First, I checked the GN args that could be related to these errors and identified two:

    • is_cfi
      Current value (from the default) = true
      From //build/config/sanitizers/sanitizers.gni:53

      Compile with Control Flow Integrity to protect virtual calls and casts.

      TODO(pcc): Remove this flag if/when CFI is enabled in all official builds.

    • clang_use_default_sample_profile
      Current value (from the default) = true
      From //build/config/compiler/

      Some configurations have default sample profiles. If this is true and
      clang_sample_profile_path is empty, we’ll fall back to the default.

      We currently only have default profiles for Chromium in-tree, so we disable
      this by default for all downstream projects, since these profiles are likely
      nonsensical for said projects.

    These two args were enabled. I just disabled them and got rid of the compilation flags that were causing trouble: -fprofile-sample-use=/path/to/src/chrome/android/profiles/ -fsanitize=cfi-vcall -fsanitize-blacklist=../../tools/cfi/blacklist.txt. I’ve learned that support for -fsanitize-blacklist is available in upstream icecc, but most distros don’t package it yet, so it’s safer to disable that.

    To sum up, if you are using icecc and you want to run an official build, you have to add a couple more GN args:

    clang_use_default_sample_profile = false
    is_cfi = false
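
If you keep your build configuration in scripts, the same settings can be generated rather than typed by hand. The following is only a sketch: render_gn_args is a hypothetical helper, not part of GN or Chromium, and it handles booleans only. The resulting text is what would go into the args.gn file of your output directory (for example out/Official, a name picked just for illustration).

```python
def render_gn_args(args):
    """Render a dict of GN args as the text of an args.gn file."""
    lines = []
    for name, value in args.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # GN spells booleans in lowercase
        lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"


# Official build that can be distributed via icecc, as described above.
official_icecc_args = {
    "is_official_build": True,
    "clang_use_default_sample_profile": False,
    "is_cfi": False,
}

print(render_gn_args(official_icecc_args), end="")
```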

    by Jacobo Aragunde Pérez at June 13, 2018 08:06 AM

    June 04, 2018

    Michael Catanzaro

    Security vulnerability in Epiphany Technology Preview

    If you use Epiphany Technology Preview, please update immediately and ensure you have revision 3.29.2-26 or newer. We discovered and resolved a vulnerability that allowed websites to access internal Epiphany features and thereby exfiltrate passwords from the password manager. We apologize for this oversight.

    The unstable Epiphany 3.29.2 release is the only affected release. Epiphany 3.29.1 is not affected. Stable releases, including Epiphany 3.28, are also not affected.

    There is no reason to believe that the issue was discovered or exploited by any attackers, but you might wish to change your passwords if you are concerned.

    by Michael Catanzaro at June 04, 2018 11:00 PM

    May 29, 2018

    Maksim Sisov

    Chromium with Ozone/Wayland: BlinkOn9, dmabuf and more refactorings…

    It has been quite a while since we last blogged about our Chromium Ozone/Wayland effort, and there is a lot of news right now. Igalia participated in the BlinkOn9 conference, gave a talk about the Ozone/Wayland support, and had many discussions on how to continue upstreaming desktop integration related patches for the Mus service.

    Even though we had been able to run Chromium natively with the Wayland backend, and had upstreamed a lot of Ozone/Wayland related patches, some disagreements had to be resolved before proceeding with upstreaming. To be precise, the future of the Mus service was unclear, and it was decided to abandon it in favor of a platform desktop integration directly into Aura without Mus. Thanks to a good Ozone design, we were able to quickly redesign our solution and upstream more patches to make it possible to run Chromium Ozone/Wayland from the ToT. Of course, some functionality patches are still missing, but the effort is going on steadily and we expect to have all the patches upstreamed in the following months. A lot of work still has to be done.

    Another point I would like to mention is our UI/GPU split effort, which aims to make Chromium Ozone/Wayland run with a separate gpu process. For that, a proper communication channel must be established between the browser process, where the Wayland connection is established, and the gpu process. We tried a nested compositor approach, but decided to abandon it in favor of a dmabuf based approach, which will also allow us to have gpu native memory buffers, rasterization and, perhaps, zero-copy features enabled on some platforms. The idea behind the dmabuf approach is to reuse some of the Ozone/Drm codebase and, by utilizing drm render nodes, create shared GBM buffers whose file descriptors are then shared with the browser process, where Wayland creates a dmabuf based wl_buffer and attaches it to wayland windows. At this stage, we already have a working PoC, and Ozone/Drm refactorings are being upstreamed now. It will also support presentation feedback if that protocol is available on a system with a Wayland compositor.

    Last but not least, we would like to clarify about the accelerated media decode in Chromium Ozone/Wayland mentioned in
    To make it clear, we are not currently working on V4L2 ourselves; rather, the patches have just been merged by an external contributor to simplify compiling our Chromium solution on ARM based boards, especially the Renesas M3 R-Car board.

    by msisov at May 29, 2018 08:51 AM

    May 27, 2018

    Michael Catanzaro

    Thoughts on Flatpak after four months of Epiphany Technology Preview

    It’s been four months since I announced Epiphany Technology Preview — which I’ve been using as my main browser ever since — and five months since I announced the availability of a stable channel via Flatpak. For the most part, it’s been a good experience. Having the latest upstream development code for everything is wonderful and makes testing very easy. Any user can painlessly download and install either the latest stable version or the bleeding-edge development version on any Linux system, regardless of host dependencies, either via a couple clicks in GNOME Software or one command in the terminal. GNOME Software keeps it updated, so I always have a recent version. Thanks to this, I’m often noticing problems shortly after they’re introduced, rather than six months later, as was so often the case for me in the past. Plus, other developers can no longer complain that there’s a problem with my local environment when I report a bug they can’t reproduce, because Epiphany Technology Preview is a canonical distribution environment, a ground truth of sorts.

    There have been some rough patches where Epiphany Technology Preview was not working properly — sometimes for several days — due to various breaking changes, and the long time required to get a successful SDK build when it’s failing. For example, multimedia playback was broken for all of last week, due to changes in how the runtime is built. H.264 video is still broken, since the relevant Flatpak extension is only compatible with the 3.28 runtime, not with master. Opening files was broken for a while due to what turned out to be a bug in mutter that was causing the OpenURI portal to crash. I just today found another bug where closing a portal while visiting Slack triggered a gnome-shell crash. For the most part, these sorts of problems are expected by testers of unstable nightly software, though I’m concerned about the portal bugs because these affect stable users too. Anyway, these are just bugs, and all software has bugs: they get fixed, nothing special.

    So my impression of Flatpak is still largely positive. Flatpak does not magically make our software work properly in all host environments, but it hugely reduces the number of things that can go wrong on the host system. In recent years, I’ve seen users badly break Epiphany in various ways, e.g. by installing custom mimeinfo or replacing the network backend. With Flatpak, either of these would require an incredible amount of dedicated effort. Without a doubt, Flatpak distribution is more robust to user error. Another advantage is that we get the latest versions of OS dependencies, like GStreamer, libsoup, and glib-networking, so we can avoid the many bugs in these components that have been fixed in the years since our users’ LTS distros froze the package versions. I appreciate the desire of LTS distros to provide stability for users, but at the same time, I’m not impressed when users report issues with the browser that we fixed two years ago in one dependency or another. Flatpak is an excellent compromise solution to this problem: the LTS distro retains an LTS core, but specific applications can use newer dependencies from the Flatpak runtime.

    But there is one huge downside to using Flatpak: we lose crash reports. It’s at best very difficult — and often totally impossible — to investigate crashes when using Flatpak, and that’s frankly more important than any of the gains I mention above. For example, today Epiphany Technology Preview is crashing pretty much constantly. It’s surely a bug in WebKit, but that’s all I can figure out. The way to get a backtrace from a crashing app in flatpak is to use coredumpctl to manually dump the core dump to disk, then launch a bash shell in the flatpak environment and manually load it up in gdb. The process is manual, old-fashioned, primitive, and too frustrating for me by a lot, so I wrote a little pyexpect script to automate this process for Epiphany, thinking I could eventually generalize it into a tool that would be useful for other developers. It’s a horrible hack, but it worked pretty well the day I tested it. I haven’t seen it work since. Debuginfo seems to be constantly broken, so I only see a bunch of ???s in my backtraces, and how are we supposed to figure out how to debug that? So I have no way to debug or fix the WebKit bug, because I can’t get a backtrace. The broken, inconsistent, or otherwise-unreliable debuginfo is probably just some bug that will be fixed eventually (and which I half suspect may be related to our recent freedesktop SDK upgrade. Update: Alex has debugged the debuginfo problem and it looks like that’s on track to be solved), but even once it is, we’re back to square one: it’s still too much effort to get the backtrace, relative to developing on the host system, and that’s a hard problem to solve. It requires tools that do not exist, and for which we have no plans to create, or even any idea of how to create them.

    This isn’t working. I need to be able to effortlessly get a backtrace out of my application, with no or little more effort than running coredumpctl gdb as I would without Flatpak in the way. Every time I see Epiphany or WebKit crash, knowing I can do nothing to debug or investigate, I’m very sorely tempted to switch back to using Fedora’s Epiphany, or good old JHBuild. (I can’t promote BuildStream here, because BuildStream has the same problem.)

    So the developer experience is just not good, but set that aside: the main benefits of Flatpak are for users, not developers, after all. Now, what if users run into a crash, how can they report the bug? Crash reports are next to useless without a backtrace, and wise developers refuse to look at crash reports until a quality backtrace has been posted. So first we need to fix the developer experience to work properly, but even then, it’s not enough: we need an automatic crash reporter, along the lines of ABRT or apport, to make reporting crashes realistically-achievable for users, as it already is for distro-packaged apps. But this is a much harder problem to solve. Such a tool will require integration with coredumpctl, and I have not the faintest clue how we could go about making coredumpctl support container environments. Yet without this, we’re asking application developers to give up their most valuable data — crash reports — in order to use Flatpak.

    Eventually, if we don’t solve crash reporting, Epiphany’s experiment with Flatpak will have to come to an end, because that’s more important to me than the (admittedly-tremendous) benefits of Flatpak. I’m still hopeful that the ingenuity of the Flatpak community will find some solutions. We’ll see how this goes.

    by Michael Catanzaro at May 27, 2018 11:39 PM

    May 21, 2018

    Andy Wingo

    correct or inotify: pick one

    Let's say you decide that you'd like to see what some other processes on your system are doing to a subtree of the file system. You don't want to have to change how those processes work -- you just want to see what files those processes create and delete.

    One approach would be to just scan the file-system tree periodically, enumerating its contents. But when the file system tree is large and the change rate is low, that's not an optimal thing to do.
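
To make the periodic-scan idea concrete, here is a minimal sketch in Python: take a snapshot of every path under a root, take another one later, and diff the two. The names snapshot and diff_snapshots are made up for illustration. Note what it cannot see: a file that was created and deleted between two scans never shows up at all, and every scan costs a full walk of the tree.

```python
import os


def snapshot(root):
    """Return the set of every path currently found under root."""
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.add(os.path.join(dirpath, name))
    return paths


def diff_snapshots(before, after):
    """Return (created, deleted) paths between two snapshots."""
    return sorted(after - before), sorted(before - after)
```

A monitor built this way would call snapshot on a timer and report the result of diff_snapshots against the previous snapshot.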

    Fortunately, Linux provides an API to allow a process to receive notifications on file-system change events, called inotify. So you open up the inotify(7) manual page, and are greeted with this:

    With careful programming, an application can use inotify to efficiently monitor and cache the state of a set of filesystem objects. However, robust applications should allow for the fact that bugs in the monitoring logic or races of the kind described below may leave the cache inconsistent with the filesystem state. It is probably wise to do some consistency checking, and rebuild the cache when inconsistencies are detected.

    It's not exactly reassuring is it? I mean, "you had one job" and all.

    Reading down a bit farther, I thought that with some "careful programming", I could get by. After a day of trying, I am now certain that it is impossible to build a correct recursive directory monitor with inotify, and I am not even sure that "good enough" solutions exist.

    pitfall the first: buffer overflow

    Fundamentally, inotify races the monitoring process with all other processes on the system. Events are delivered to the monitoring process via a fixed-size buffer that can overflow, and the monitoring process provides no back-pressure on the system's rate of filesystem modifications. With inotify, you have to be ready to lose events.

    This I think is probably the easiest limitation to work around. The kernel can let you know when the buffer overflows, and you can tweak the buffer size. Still, it's a first indication that perfect is not possible.

    pitfall the second: now you see it, now you don't

    This one is the real kicker. Say you get an event that says that a file "frenemies.txt" has been created in the directory "/contacts/". You go to open the file -- but is it still there? By the time you get around to looking for it, it could have been deleted, or renamed, or maybe even created again or replaced! This is a TOCTTOU race, built-in to the inotify API. It is literally impossible to use inotify without this class of error.

    The canonical solution to this kind of issue in the kernel is to use file descriptors instead. Instead of or possibly in addition to getting a name with the file change event, you get a descriptor to a (possibly-unlinked) open file, which you would then be responsible for closing. But that's not what inotify does. Oh well!
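
In practice this means that any code acting on an event has to treat the name as a hint and nothing more. A minimal sketch in Python of what the consumer ends up writing, where read_if_still_there is a hypothetical helper and not part of any inotify binding:

```python
def read_if_still_there(path):
    """Try to read a file named by a change event; return None if it is gone."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        # The file was deleted or renamed between the event and the open.
        return None
```

Even when the open succeeds, nothing guarantees that the bytes you read belong to the file that triggered the event; it may have been replaced in the meantime.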

    pitfall the third: race conditions between inotify instances

    When you inotify a directory, you get change notifications for just that directory. If you want to get change notifications for subdirectories, you need to open more inotify instances and poll on them all. However now you have N² problems: as poll and the like return an unordered set of readable file descriptors, each with their own ordering, you no longer have access to a linear order in which changes occurred.

    It is impossible to build a recursive directory watcher that definitively says "ok, first /contacts/frenemies.txt was created, then /contacts was renamed to /peeps, ..." because you have no ordering between the different watches. You don't know that there was ever even a time that /contacts/frenemies.txt was an accessible file name; it could have been only ever openable as /peeps/frenemies.txt.

    Of course, this is the most basic ordering problem. If you are building a monitoring tool that actually wants to open files -- good luck bubster! It literally cannot be correct. (It might work well enough, of course.)


    As far as I am aware, inotify came out to address the needs of desktop search tools like the belated Beagle (11/10 good pupper just trying to get his pup on). Especially in the days of spinning metal, grovelling over the whole hard-drive was a real non-starter, especially if the search database should be up-to-date.

    But after looking into inotify, I start to see why someone at Google said that desktop search was in some ways harder than web search -- I mean we all struggle to find files on our own machines, even now, 15 years after the whole dnotify/inotify thing started. Part of it is that, given the choice between supporting reliable, fool-proof file system indexes on the one hand, and overclocking the IOPS benchmarks on the other, the kernel gave us inotify. I understand it, but inotify still sucks.

    I dunno about you all but whenever I've had to document such an egregious uncorrectable failure mode as any of the ones in the inotify manual, I have rewritten the software instead. In that spirit, I hope that some day we shall send inotify to the pet cemetery, to rest in peace beside Beagle.

    by Andy Wingo at May 21, 2018 02:29 PM

    May 16, 2018

    Andy Wingo

    lightweight concurrency in lua

    Hello, all! Today I'd like to share some work I have done recently as part of the Snabb user-space networking toolkit. Snabb is mainly about high-performance packet processing, but it also needs to communicate with management-oriented parts of network infrastructure. These communication needs are performed by a dedicated manager process, but that process has many things to do, and can't afford to make blocking operations.

    Snabb is written in Lua, which doesn't have built-in facilities for concurrency. What we'd like is to have fibers. Fortunately, Lua's coroutines are powerful enough to implement fibers. Let's do that!

    fibers in lua

    First we need a scheduling facility. Here's the smallest possible scheduler: simply a queue of tasks and a function to run those tasks.

    local task_queue = {}
    function schedule_task(thunk)
       table.insert(task_queue, thunk)
    end
    function run_tasks()
       local queue = task_queue
       task_queue = {}
       for _,thunk in ipairs(queue) do thunk() end
    end

    For our purposes, a task is just a function that will be called with no arguments.

    Now let's build fibers. This is easier than you might think!

    local current_fiber = false
    function spawn_fiber(fn)
       local fiber = coroutine.create(fn)
       schedule_task(function () resume_fiber(fiber) end)
    end
    function resume_fiber(fiber, ...)
       current_fiber = fiber
       local ok, err = coroutine.resume(fiber, ...)
       current_fiber = nil
       if not ok then
          print('Error while running fiber: '..tostring(err))
       end
    end
    function suspend_current_fiber(block, ...)
       -- The block function should arrange to reschedule
       -- the fiber when it becomes runnable.
       block(current_fiber, ...)
       return coroutine.yield()
    end

    Here, a fiber is simply a coroutine underneath. Suspending a fiber suspends the coroutine. Resuming a fiber runs the coroutine. If you're unfamiliar with coroutines, or coroutines in Lua, maybe have a look at the lua-users wiki page on the topic.

    The difference between a fibers facility and just coroutines is that with fibers, you have a scheduler as well. Very much like Scheme's call-with-prompt, coroutines are one of those powerful language building blocks that should rarely be used directly; concurrent programming needs more structure than what Lua offers.

    If you're following along, it's probably worth it here to think how you would implement yield based on these functions. A yield implementation should yield control to the scheduler, and resume the fiber on the next scheduler turn. The answer is here.


    Once you have fibers and a scheduler, you have concurrency, which means that if you're not careful, you have a mess. Here I think the Go language got the essence of the idea exactly right: Do not communicate by sharing memory; instead, share memory by communicating.

    Even though Lua doesn't support multiple machine threads running concurrently, concurrency between fibers can still be fraught with bugs. Tony Hoare's Communicating Sequential Processes showed that we can avoid a class of these bugs by treating communication as a first-class concept.

    Happily, the Concurrent ML project showed that it's possible to build these first-class communication facilities as a library, provided the language you are working in has threads of some kind, and fibers are enough. Last year I built a Concurrent ML library for Guile Scheme, and when in Snabb we had a similar need, I ported that code over to Lua. As it's a new take on the problem in a different language, I think I've been able to simplify things even more.

    So let's take a crack at implementing Concurrent ML in Lua. In CML, the fundamental primitive for communication is the operation. An operation represents the potential for communication. For example, if you have a channel, it would have methods to return "get operations" and "put operations" on that channel. Actually receiving or sending a message on a channel occurs by performing those operations. One operation can be performed many times, or not at all.

    Compared to a system like Go, for example, there are two main advantages of CML. The first is that CML allows non-deterministic choice between a number of potential operations in a generic way. For example, you can construct an operation that, when performed, will either get on one channel or wait for a condition variable to be signalled, whichever comes first. In Go, you can only select between operations on channels.

    The other interesting part of CML is that operations are built from a uniform protocol, and so users can implement new kinds of operations. Compare again to Go where all you have are channels, and nothing else.

    The CML operation protocol consists of three related functions: try, which attempts to directly complete an operation in a non-blocking way; block, which is called after a fiber has suspended, and which arranges to resume the fiber when the operation completes; and wrap, which is called on the result of a successfully performed operation.

    In Lua, we can call this an implementation of an operation, and create it like this:

    function new_op_impl(try, block, wrap)
       return { try=try, block=block, wrap=wrap }
    end

    Now let's go ahead and write the guts of CML: the operation implementation. We'll represent an operation as a Lua object with two methods. The perform method will attempt to perform the operation, and return the resulting value. If the operation can complete immediately, the call to perform will return directly. Otherwise, perform will suspend the current fiber and arrange to continue only when the operation completes.

    The wrap method "decorates" an operation, returning a new operation that, if and when it completes, will "wrap" the result of the completed operation with a function, by applying the function to the result. It's useful to distinguish the sub-operations of a non-deterministic choice from each other.

    Here our new_op function will take an array of operation implementations and return an operation that, when performed, will synchronize on the first available operation. As you can see, it already has the equivalent of Go's select built in.

    function new_op(impls)
       local op = { impls=impls }
       function op.perform()
          for _,impl in ipairs(impls) do
             local success, val = impl.try()
             if success then return impl.wrap(val) end
          end
          local function block(fiber)
             local suspension = new_suspension(fiber)
             for _,impl in ipairs(impls) do
                impl.block(suspension, impl.wrap)
             end
          end
          local wrap, val = suspend_current_fiber(block)
          return wrap(val)
       end
       function op.wrap(f)
          local wrapped = {}
          for _, impl in ipairs(impls) do
             local function wrap(val)
                return f(impl.wrap(val))
             end
             local impl = new_op_impl(impl.try, impl.block, wrap)
             table.insert(wrapped, impl)
          end
          return new_op(wrapped)
       end
       return op
    end

    There's only one thing missing there, which is new_suspension. When you go to suspend a fiber because none of the operations that it's trying to do can complete directly (i.e. all of the try functions of its impls returned false), at that point the corresponding block functions will publish the fact that the fiber is waiting. However the fiber only waits until the first operation is ready; subsequent operations becoming ready should be ignored. The suspension is the object that manages this state.

    function new_suspension(fiber)
       local waiting = true
       local suspension = {}
       function suspension.waiting() return waiting end
       function suspension.complete(wrap, val)
          waiting = false
          local function resume()
             resume_fiber(fiber, wrap, val)
          end
          -- Resume the fiber on the next scheduler turn.
          schedule_task(resume)
       end
       return suspension
    end

    As you can see, the suspension's complete method is also the bit that actually arranges to resume a suspended fiber.

    Finally, just to round out the implementation, here's a function implementing non-deterministic choice from among a number of sub-operations:

    function choice(...)
       local impls = {}
       for _, op in ipairs({...}) do
          for _, impl in ipairs(op.impls) do
             table.insert(impls, impl)
          end
       end
       return new_op(impls)
    end

    on cml

    OK, I'm sure this seems a bit abstract at this point. Let's implement something concrete in terms of these primitives: channels.

    Channels expose two similar but different kinds of operations: put operations, which try to send a value, and get operations, which try to receive a value. If there's a sender already waiting to send when we go to perform a get_op, the operation continues directly, and we resume the sender; otherwise the receiver publishes its suspension to a queue. The put_op case is similar.

    Finally we add some synchronous put and get convenience methods, in terms of their corresponding CML operations.

    function new_channel()
       local ch = {}
       -- Queues of suspended fibers waiting to get or put values
       -- via this channel.
       local getq, putq = {}, {}
       local function default_wrap(val) return val end
       local function is_empty(q) return #q == 0 end
       local function peek_front(q) return q[1] end
       local function pop_front(q) return table.remove(q, 1) end
       local function push_back(q, x) q[#q+1] = x end
       -- Since a suspension could complete in multiple ways
       -- because of non-deterministic choice, it could be that
       -- suspensions on a channel's putq or getq are already
       -- completed.  This helper removes already-completed
       -- suspensions.
       local function remove_stale_entries(q)
          local i = 1
          while i <= #q do
             if q[i].suspension.waiting() then
                i = i + 1
             else
                table.remove(q, i)
             end
          end
       end
       -- Make an operation that if and when it completes will
       -- rendezvous with a receiver fiber to send VAL over the
       -- channel.  Result of performing operation is nil.
       function ch.put_op(val)
          local function try()
             remove_stale_entries(getq)
             if is_empty(getq) then
                return false, nil
             else
                local remote = pop_front(getq)
                remote.suspension.complete(remote.wrap, val)
                return true, nil
             end
          end
          local function block(suspension, wrap)
             remove_stale_entries(putq)
             push_back(putq, {suspension=suspension, wrap=wrap, val=val})
          end
          return new_op({new_op_impl(try, block, default_wrap)})
       end
       -- Make an operation that if and when it completes will
       -- rendezvous with a sender fiber to receive one value from
       -- the channel.  Result is the value received.
       function ch.get_op()
          local function try()
             remove_stale_entries(putq)
             if is_empty(putq) then
                return false, nil
             else
                local remote = pop_front(putq)
                -- Resume the waiting sender; a put yields no value.
                remote.suspension.complete(remote.wrap)
                return true, remote.val
             end
          end
          local function block(suspension, wrap)
             remove_stale_entries(getq)
             push_back(getq, {suspension=suspension, wrap=wrap})
          end
          return new_op({new_op_impl(try, block, default_wrap)})
       end
       function ch.put(val) return ch.put_op(val).perform() end
       function ch.get()    return ch.get_op().perform()    end
       return ch
    end

    a wee example

    You might be wondering what it's like to program with channels in Lua, so here's a little example that shows a prime sieve based on channels. It's not a great example of concurrency in that it's not an inherently concurrent problem, but it's cute to show computations in terms of infinite streams.

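    function prime_sieve(count)
       -- Forward every number from RX that is not a multiple of P.
       local function sieve(p, rx)
          local tx = new_channel()
          spawn_fiber(function ()
             while true do
                local n = rx.get()
                if n % p ~= 0 then tx.put(n) end
             end
          end)
          return tx
       end
       -- An infinite stream of integers, starting at N.
       local function integers_from(n)
          local tx = new_channel()
          spawn_fiber(function ()
             while true do
                tx.put(n)
                n = n + 1
             end
          end)
          return tx
       end
       -- An infinite stream of primes: each prime found adds a new
       -- sieve stage in front of the remaining integers.
       local function primes()
          local tx = new_channel()
          spawn_fiber(function ()
             local rx = integers_from(2)
             while true do
                local p = rx.get()
                tx.put(p)
                rx = sieve(p, rx)
             end
          end)
          return tx
       end
       local done = false
       spawn_fiber(function ()
          local rx = primes()
          for i=1,count do print(rx.get()) end
          done = true
       end)
       while not done do run_tasks() end
    end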

    Here you also see an example of running the scheduler in the last line.

    where next?

    Let's put this into perspective: in a couple hundred lines of code, we've gone from minimal Lua to a language with lightweight multitasking, extensible CML-based operations, and CSP-style channels; truly a delight.

    There are a number of possible ways to extend this code. One of them is to implement true multithreading, if the language you are working in supports that. In that case there are some small protocol modifications to take into account; see the notes on the Guile CML implementation and especially the Manticore Parallel CML project.

    The implementation above is pleasantly small, but it could be faster with the choice of more specialized data structures. I think interested readers probably see a number of opportunities there.

    In a library, you might want to avoid the global task_queue and implement nested or multiple independent schedulers, and of course in a parallel situation you'll want core-local schedulers as well.

    The implementation above has no notion of time. What we did in the Snabb implementation of fibers was to implement a timer wheel, inspired by Juho Snellman's Ratas, and then add that timer wheel as a task source to Snabb's scheduler. In Snabb, every time the equivalent of run_tasks() is called, a scheduler asks its sources to schedule additional tasks. The timer wheel implementation schedules expired timers. It's straightforward to build CML timeout operations in terms of timers.
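
    To make the "task source" idea a little more concrete, here is a rough sketch in Python rather than Lua. It is not the Snabb code: it swaps the timer wheel for a plain binary heap to stay short, and the names TimerSource, add and schedule_expired are made up for illustration. The only assumption about the scheduler is that it offers some function for enqueueing a task.

```python
import heapq
from typing import Callable, List, Tuple


class TimerSource:
    """Hypothetical task source that hands expired timers to a scheduler."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker so that callables are never compared

    def add(self, deadline: float, fn: Callable[[], None]) -> None:
        """Register fn to become runnable once `deadline` has passed."""
        heapq.heappush(self._heap, (deadline, self._seq, fn))
        self._seq += 1

    def pending(self) -> int:
        return len(self._heap)

    def schedule_expired(self, now: float, schedule_task) -> None:
        """Move every timer whose deadline is <= now onto the task queue."""
        while self._heap and self._heap[0][0] <= now:
            _, _, fn = heapq.heappop(self._heap)
            schedule_task(fn)


# Stand-in for the scheduler: a list of thunks, run in order.
task_queue = []
log = []
timers = TimerSource()
timers.add(1.0, lambda: log.append("a"))
timers.add(5.0, lambda: log.append("b"))

# What a scheduler might do at the start of each turn.
timers.schedule_expired(now=2.0, schedule_task=task_queue.append)
for task in task_queue:
    task()
```

    A timeout operation would then be a matter of adding a timer whose function completes a suspension.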

    Additionally, your system probably has other external sources of communication, such as sockets. The trick to integrating sockets into fibers is to suspend the current fiber whenever an operation on a file descriptor would block, and arrange to resume it when the operation can proceed. Here's the implementation in Snabb.

    The only difficult bit with getting nice nonblocking socket support is that you need to be able to suspend the calling thread when you see the EWOULDBLOCK condition, and for coroutines that is often only possible if you implemented the buffered I/O yourself. In Snabb that's what we did: we implemented a compatible replacement for Lua's built-in streams, in Lua. That lets us handle EWOULDBLOCK conditions in a flexible manner. Integrating epoll as a task source also lets us sleep when there are no runnable tasks.

    Likewise in the Snabb context, we are also working on a TCP implementation. In that case you want to structure TCP endpoints as fibers, and arrange to suspend and resume them as appropriate, while also allowing timeouts. I think the scheduler and CML patterns are going to allow us to do that without much trouble. (Of course, the TCP implementation will give us lots of trouble!)

    Additionally your system might want to communicate with fibers from other threads. It's entirely possible to implement CML on top of pthreads, and it's entirely possible as well to support communication between pthreads and fibers. If this is interesting to you, see Guile's implementation.

    When I talked about fibers in an earlier article, I built them in terms of delimited continuations. Delimited continuations are fun and more expressive than coroutines, but it turns out that for fibers, all you need is the expressive power of coroutines -- multi-shot continuations aren't useful. Also I think the presentation might be more straightforward. So if all your language has is coroutines, that's still good enough.

    There are many more kinds of standard CML operations; implementing those is also another next step. In particular, I have found semaphores and condition variables to be quite useful. Also, standard CML supports "guards", invoked when an operation is performed, and "nacks", invoked when an operation is definitively not performed because a choice selected some other operation. These can be layered on top; see the Parallel CML paper for notes on "primitive CML".

    Also, the choice operator above is left-biased: it will prefer earlier impls over later ones. You might want to not always start with the first impl in the list.
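
    One possible way to remove the bias, sketched in Python rather than Lua: start trying impls at a random offset and wrap around. The function name try_impls and the convention that each impl is a callable returning a (success, value) pair are assumptions made for this sketch, not part of the code above.

```python
import random
from typing import Callable, List, Optional, Tuple

TryFn = Callable[[], Tuple[bool, object]]


def try_impls(impls: List[TryFn], start: Optional[int] = None):
    """Try each impl once, beginning at `start` (random by default)."""
    if not impls:
        return False, None
    if start is None:
        start = random.randrange(len(impls))
    for i in range(len(impls)):
        # Wrap around so that every impl still gets exactly one attempt.
        success, value = impls[(start + i) % len(impls)]()
        if success:
            return True, value
    return False, None


# Two impls that are both ready: which one wins depends on the offset.
ready = [lambda: (True, "a"), lambda: (True, "b")]
first = try_impls(ready, start=0)
second = try_impls(ready, start=1)
```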

    The scheduler shown above is the simplest thing I could come up with. You may want to experiment with other scheduling algorithms, e.g. capability-based scheduling, or kill-safe abstractions. Do it!

    Or, it could be you already have a scheduler, like some kind of main loop that's already there. Cool, you can use it directly -- all that fibers needs is some way to schedule functions to run.


    In summary, I think Concurrent ML should be better-known. Its simplicity and expressivity make it a valuable part of any concurrent system. Already in Snabb it helped us solve some longstanding gnarly issues by making the right solutions expressible.

    As Adam Solove says, Concurrent ML is great, but it has a branding problem. Its ideas haven't penetrated the industrial concurrent programming world to the extent that they should. This article is another attempt to try to get the word out. Thanks to Adam for the observation that CML is really a protocol; I'm sure the concepts could be made even more clear, but at least this is a step forward.

    All the code in this article is up on a gitlab snippet along with instructions for running the example program from the command line. Give it a go, and happy hacking with CML!

    by Andy Wingo at May 16, 2018 03:17 PM

    Jessica Tallon

    IPFIX Performance


    IPFIX is a standard for recording “flows”: sets of packets that share the same source IP, destination IP, source port, destination port and protocol. It has a few uses, but probably the most familiar is tracking usage for billing. The recording is done by a “probe”, which takes in the traffic and builds a “flow table”, usually some form of hash map that stores each flow along with additional data such as how many bytes and packets have been transmitted. These flows are then exported by an “exporter”, usually just another process outside the performance-critical “dataplane” (the probe), to a collector, which can store the information and display it in a useful and pretty manner such as a web UI. IPFIX usually runs on a copy of the traffic so that it doesn’t hinder the traffic itself, which means we shouldn’t concern ourselves too much with what comes out at the end.
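
    To make the flow table idea concrete, here is a minimal sketch in Python. It is only an illustration of the concept, not how the Snabb or VPP probes are written; the names FlowKey, FlowRecord and FlowTable are made up for the example, and a real probe would also record things like start and end times.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple


class FlowKey(NamedTuple):
    """The five fields that identify a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


@dataclass
class FlowRecord:
    """Per-flow counters kept by the probe."""
    packets: int = 0
    octets: int = 0


class FlowTable:
    def __init__(self) -> None:
        self.flows: Dict[FlowKey, FlowRecord] = {}

    def observe(self, key: FlowKey, length: int) -> None:
        """Account for one packet of `length` bytes belonging to `key`."""
        record = self.flows.setdefault(key, FlowRecord())
        record.packets += 1
        record.octets += length


table = FlowTable()
web = FlowKey("10.0.0.1", "10.0.0.2", 12345, 80, 6)    # TCP
dns = FlowKey("10.0.0.3", "10.0.0.2", 5555, 53, 17)    # UDP
table.observe(web, 64)
table.observe(web, 1500)
table.observe(dns, 64)
```

    An exporter would periodically walk table.flows and ship the records to the collector.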

    So, why am I telling you about IPFIX? My colleagues Asumu and Andy have already implemented this in Snabb, and it already exists as a plugin in VPP. What it does is not too tricky and thus it makes for a great example to try to re-implement as a method of learning VPP. We’re able to test performance and see how things work while hacking on something familiar. There are definitely more features it could have and that the stock VPP one probably does implement. The Snabb implementation and our VPP plugin are fairly bare-bones, but they do record some basic information and export it.


    We’re going to use a server with two network cards connected in a pair. This allows us to attach a Snabb app we wrote to one end, which will send packets, and have those packets received by VPP on the other card. The network card works by having a ring buffer with a read pointer and a write pointer. You advance the read pointer to access the next packets, and the network card advances the write pointer when writing new packets it receives. The card keeps a counter of how often it finds itself writing over unread packets. You can see some of the counters below, including the rx missed counter, which shows how many packets we’re dropping by being too slow. This is what it looks like when you ask VPP to show those counters:

    TenGigabitEthernet2/0/0            1     up   TenGigabitEthernet2/0/0
      Ethernet address 00:11:22:33:44:55
      Intel 82599
        carrier up full duplex speed 10000 mtu 9216 
        rx queues 1, rx desc 1024, tx queues 2, tx desc 1024
        cpu socket 0
        rx frames ok                                     2398280
        rx bytes ok                                    143896800
        rx missed                                         252816
        extended stats:
          rx good packets                                2398280
          rx good bytes                                143896800
          rx q0packets                                   2398280
          rx q0bytes                                   143896800
          rx size 64 packets                             2658721
          rx total packets                               2658694
          rx total bytes                               159520296
          rx l3 l4 xsum error                            2658762
          rx priority0 dropped                            253924
    local0                             0    down  local0
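
    The ring buffer and the rx missed counter described above can be modelled roughly as follows. This is a toy model in Python, not driver code: the class and method names are invented, and it simplifies the hardware by dropping the incoming packet and bumping the counter whenever the ring is full of unread packets.

```python
class RxRing:
    """Toy receive ring: the card writes, the driver reads."""

    def __init__(self, size: int) -> None:
        self.slots = [None] * size
        self.size = size
        self.read = 0      # next slot the driver will read
        self.write = 0     # next slot the card will fill
        self.count = 0     # packets currently waiting to be read
        self.missed = 0    # packets lost because the driver was too slow

    def card_receive(self, packet) -> bool:
        """Called when a packet arrives; returns False if it was missed."""
        if self.count == self.size:
            self.missed += 1
            return False
        self.slots[self.write] = packet
        self.write = (self.write + 1) % self.size
        self.count += 1
        return True

    def driver_poll(self):
        """Return the next unread packet, or None if the ring is empty."""
        if self.count == 0:
            return None
        packet = self.slots[self.read]
        self.read = (self.read + 1) % self.size
        self.count -= 1
        return packet


ring = RxRing(4)
for n in range(6):          # six packets arrive before we read any
    ring.card_receive(n)
first = ring.driver_poll()
```

    With a ring of four slots, the last two of the six packets are counted as missed, which is the situation the test harness is looking for.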

    We need to find the no-drop rate (NDR), which is the highest speed at which we can keep up with the load. When the card shows that we’re dropping packets, we know we’re going too fast and need to back off. We find this speed by performing a binary search: we bisect the speed range and test whether we’re dropping packets. If we are, we need to go lower and bisect the lower range; otherwise we bisect the higher range. This is an example of our Snabb app performing the binary search, looking for the NDR:

    Applying 5.000000 Gbps of load.
        TX 7440477 packets (7.440477 MPPS), 446428620 bytes (5.000001 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 7440477 packets lost (100.000000%)
    Applying 7.500000 Gbps of load.
        TX 11160650 packets (11.160650 MPPS), 669639000 bytes (7.499957 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 11160650 packets lost (100.000000%)
    Applying 6.250000 Gbps of load.
        TX 9300576 packets (9.300576 MPPS), 558034560 bytes (6.249987 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 9300576 packets lost (100.000000%)
    Applying 5.625000 Gbps of load.
        TX 8370519 packets (8.370519 MPPS), 502231140 bytes (5.624989 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8370519 packets lost (100.000000%)
    Applying 5.938000 Gbps of load.
        TX 8836291 packets (8.836291 MPPS), 530177460 bytes (5.937988 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8836291 packets lost (100.000000%)
    Applying 5.782000 Gbps of load.
        TX 8604158 packets (8.604158 MPPS), 516249480 bytes (5.781994 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8604158 packets lost (100.000000%)
    Applying 5.860000 Gbps of load.
        TX 8720228 packets (8.720228 MPPS), 523213680 bytes (5.859993 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8720228 packets lost (100.000000%)
    Applying 5.899000 Gbps of load.
        TX 8778268 packets (8.778268 MPPS), 526696080 bytes (5.898996 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8778268 packets lost (100.000000%)
    Applying 5.919000 Gbps of load.
        TX 8808026 packets (8.808026 MPPS), 528481560 bytes (5.918993 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8808026 packets lost (100.000000%)
    Applying 5.909000 Gbps of load.
        TX 8793130 packets (8.793130 MPPS), 527587800 bytes (5.908983 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8793130 packets lost (100.000000%)
    Applying 5.904000 Gbps of load.
        TX 8785707 packets (8.785707 MPPS), 527142420 bytes (5.903995 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8785707 packets lost (100.000000%)
    Applying 5.907000 Gbps of load.
        TX 8790150 packets (8.790150 MPPS), 527409000 bytes (5.906981 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8790150 packets lost (100.000000%)
    Applying 5.906000 Gbps of load.
        TX 8788687 packets (8.788687 MPPS), 527321220 bytes (5.905998 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8788687 packets lost (100.000000%)
    Applying 5.905000 Gbps of load.
        TX 8787182 packets (8.787182 MPPS), 527230920 bytes (5.904986 Gbps)
        RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
        Loss: 0 ingress drop + 8787182 packets lost (100.000000%)

    Each step of the output shows a specific load being applied; whether that step counts as a success or a failure depends on whether the rx missed counter from VPP shows dropped packets. If we apply a load and it fails, we attempt it three times. The output of the failed retries has been removed for brevity.
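
    The search itself can be sketched like this. It is a simplified Python illustration, not the Snabb app: the function find_ndr, its parameters and the fake device under test are all assumptions for the example, with drops_at standing in for "apply this load and check the rx missed counter".

```python
from typing import Callable


def find_ndr(drops_at: Callable[[float], bool],
             line_rate_gbps: float = 10.0,
             precision_gbps: float = 0.001,
             attempts: int = 3) -> float:
    """Return the highest load (in Gbps) that was seen to pass without drops."""
    low, high = 0.0, line_rate_gbps
    best = 0.0
    while high - low > precision_gbps:
        load = (low + high) / 2
        # A load only counts as failed if every attempt drops packets;
        # all() stops retrying as soon as one attempt succeeds.
        failed = all(drops_at(load) for _ in range(attempts))
        if failed:
            high = load          # too fast: search the lower half
        else:
            best = low = load    # kept up: search the upper half
    return best


# Fake device under test that starts dropping above 5.905 Gbps.
ndr = find_ndr(lambda load: load > 5.905)
```

    The first load tried is 5 Gbps, half of the 10 Gbps line rate, and each later step halves the remaining range.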

    The sample data is test traffic that includes a lot of flows. We have 6 different sample data sets, and we test each one three times so that we can average the results. The sample data sets range from the smallest possible Ethernet frame size, 64 bytes, to the largest standard Ethernet frame size, 1522 bytes.

    Reaching peak performance

    The initial results without any performance tweaking on our VPP plugin are:


    The first thing that grabbed my interest was that we weren’t getting to full speed with larger packets, something that shouldn’t be too tricky. As it turns out we had added some ELOG statements, a light-weight method of logging in VPP. These are pretty useful for getting a good look inside your function so that you can figure out what happens and when. Turns out they’re not as lightweight as I’d have liked. Removing those allowed us to get better performance across the board, including reaching line-rate (10 Gigabit per second) for frame sizes of 200 and over.

    Then my attention turned to the 64 and 100 byte frame sizes. I was wondering how long we were actually taking on average for each packet and how it compared to the stock VPP plugin. It turned out that we were taking around 131 nanoseconds where the stock one takes around 67 nanoseconds. That’s a big difference, and for those who read my first networking post, you’ll remember I mentioned that you have around 67.2 nanoseconds per packet. We were definitely exceeding our budget. Taking a closer look, we were spending around 103 ns just looking up in our flow table whether a record exists or whether it’s a new flow.
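
    For reference, the 67.2 nanosecond figure falls out of simple arithmetic for 64-byte frames on a 10 Gigabit link. The short Python sketch below works it through; the constant and function names are just for illustration, and the 20 bytes of per-frame overhead are the standard Ethernet preamble, start-of-frame delimiter and inter-frame gap.

```python
LINE_RATE_BPS = 10_000_000_000   # 10 Gigabit Ethernet
WIRE_OVERHEAD_BYTES = 20         # 7 preamble + 1 start-of-frame + 12 inter-frame gap


def packets_per_second(frame_bytes: int) -> float:
    """Maximum frame rate on the wire for a given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return LINE_RATE_BPS / bits_on_wire


def budget_ns(frame_bytes: int) -> float:
    """Time available per packet, in nanoseconds, at line rate."""
    return 1e9 / packets_per_second(frame_bytes)


pps_64 = packets_per_second(64)   # about 14.88 million packets per second
ns_64 = budget_ns(64)             # about 67.2 ns
```

    Seen the other way around, at roughly 131 ns per packet we could handle at most about 7.6 million packets per second, around half of what line rate demands for the smallest frames.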

    I decided to look at what the stock plugin does. VPP has a lot of different data types built in, and they had decided to go down a different route than us. They’re using a vector where the hash of the flow key is the index and the value is an index into a pool. VPP pools are fast blocks of memory for fixed-size objects: you add things, get back an index, and use that index to fetch them. We’re using a bihash, which is a bounded-index hash map. It’s pretty handy, as the key is simply the flow key and the value backing it is the flow entry, containing for example the number of packets and bytes and the start and end times for the flow. We had a similar problem in Snabb, where our hashing algorithm was slow for the large key. I thought maybe the same thing was occurring, but in actuality the stock VPP plugin was using the same xxhash algorithm that bihashes use by default, and it was getting good performance. Not that, then.

    The bihash works by having a vector of buckets, the number of which is specified when you initialise it. When you add an item, it selects the bucket based on the hash. Each bucket maintains its own cache, able to fit a specific number of bihash entries (both key and value). The size of the cache is based on a constant, defined for bihashes of various key and value sizes. The more buckets you have, the fewer keys per bucket and thus the more cache slots per bihash entry. The entries are stored on the heap in pages which the bucket manages. When an item isn’t in the bucket’s cache, the bucket performs a linear search over its allocated pages until the item is found. Here’s what the bucket data structure looks like:

    typedef struct
    {
      union
      {
        struct
        {
          u32 offset;
          u8 linear_search;
          u8 log2_pages;
          u16 cache_lru;
        };
        u64 as_u64;
      };
      BVT (clib_bihash_kv) cache[BIHASH_KVP_CACHE_SIZE];
    } BVT (clib_bihash_bucket);

    We were only defining 1024 buckets, which meant that we were missing the cache a lot and falling back to slow linear searches. Increasing this value to 32768, we start getting much better performance:

    Now we seem to be outperforming both the stock VPP plugin and Snabb. The improvements over Snabb could be due to a number of things; however, VPP does something rather interesting which may give it the edge. In VPP there is a loop which can handle two packets at once. It’s identical to the single-packet loop, but apparently operating on two packets at a time provides performance improvements. At a talk at FOSDEM they claimed it helped, and in some parts of VPP they have loops which handle 4 packets per iteration. I think that in the future, exploring why we’re seeing the VPP plugin perform better than Snabb would be an interesting endeavour.

    by tsyesika at May 16, 2018 09:55 AM

    May 07, 2018

    Iago Toral

    Intel Mesa Vulkan driver now supports shaderInt16

    The Vulkan specification includes a number of optional features that drivers may or may not support, as described in chapter 30.1 Features. Application developers can query the driver for supported features via vkGetPhysicalDeviceFeatures() and then activate the subset they need in the pEnabledFeatures field of the VkDeviceCreateInfo structure passed at device creation time.

    In the last few weeks I have been spending some time, together with my colleague Chema, adding support for one of these features in Anvil, the Intel Vulkan driver in Mesa, called shaderInt16, which we landed in Mesa master last week. This is an optional feature available since Vulkan 1.0. From the spec:

    shaderInt16 specifies whether 16-bit integers (signed and unsigned) are supported in shader code. If this feature is not enabled, 16-bit integer types must not be used in shader code. This also specifies whether shader modules can declare the Int16 capability.

    It is probably relevant to highlight that this Vulkan capability also requires the SPIR-V Int16 capability, which basically means that the driver’s SPIR-V compiler backend can actually digest SPIR-V shaders that declare and use 16-bit integers, and which is really the core of the functionality exposed by the Vulkan feature.

    Ideally, shaderInt16 would increase the overall throughput of integer operations in shaders, leading to better performance when you don’t need a full 32-bit range. It may also provide better overall register usage since you need less register space to store your integer data during shader execution. It is important to remark, however, that not all hardware platforms (Intel or otherwise) may have native support for all possible types of 16-bit operations, and thus, some of them might still need to run in 32-bit (which requires injecting type conversion instructions in the shader code). For Intel platforms, this is the case for operations associated with integer division.

    From the point of view of the driver, this is the first time that we generally exercise lower bit-size data types in the driver compiler backend, so if you find any bugs in the implementation, please file bug reports in bugzilla!

    Speaking of shaderInt16, I think it is worth mentioning its interactions with other Vulkan functionality that we implemented in the past: the Vulkan 1.0 VK_KHR_16bit_storage extension (which has been promoted to core in Vulkan 1.1). From the spec:

    The VK_KHR_16bit_storage extension allows use of 16-bit types in shader input and output interfaces, and push constant blocks. This extension introduces several new optional features which map to SPIR-V capabilities and allow access to 16-bit data in Block-decorated objects in the Uniform and the StorageBuffer storage classes, and objects in the PushConstant storage class. This extension allows 16-bit variables to be declared and used as user-defined shader inputs and outputs but does not change location assignment and component assignment rules.

    While the shaderInt16 capability provides the means to operate with 16-bit integers inside a shader, the VK_KHR_16bit_storage extension provides developers with the means to also feed shaders with 16-bit integer (and also floating point) input data, such as Uniform/Storage Buffer Objects or Push Constants, from the application side. It also gives linked shader stages in a graphics pipeline the opportunity to consume 16-bit shader inputs and produce 16-bit shader outputs.

    VK_KHR_16bit_storage and shaderInt16 should be seen as two parts of a whole, each one addressing one part of a larger problem. VK_KHR_16bit_storage can help reduce memory bandwidth for Uniform and Storage Buffer data accessed from the shaders and / or optimize Push Constant space, of which there are only a few bytes available, making it a precious shader resource. But without shaderInt16, shaders that are fed 16-bit input data are still required to convert this data to 32-bit internally for operation (and then back again to 16-bit for output if needed). Likewise, shaders that use shaderInt16 without VK_KHR_16bit_storage can only operate with 16-bit data that is generated inside the shader, which largely limits its usage. Both together, however, give you the complete functionality.


    We are very happy to continue expanding the feature set supported in Anvil and we look forward to seeing application developers making good use of shaderInt16 in Vulkan to improve shader performance. As noted above, this is the first time that we fully enable the shader compiler backend to do general purpose operations on lower bit-size data types and there might be things that we can still improve or optimize. If you hit any issues with the implementation, please contact us and / or file bug reports so we can continue to improve the implementation.

    by Iago Toral at May 07, 2018 08:39 AM

    April 30, 2018

    Samuel Iglesias

    Ubucon Europe 2018 experience and talk slides!

    Last weekend I attended Ubucon Europe 2018 in the city center of Gijón, Spain, at the Antiguo Instituto. This 3-day conference was full of talks, sometimes 3 or even 4 at the same time! The covered topics went from Ubuntu LoCo teams to Industrial Hardware… there was at least one talk interesting for all attendees! Please check the schedule if you want to know what happened at Ubucon ;-)

    Ubucon Europe 2018

    I would like to say thank you to the whole organization team. They did a great job taking care of both the attendees and the speakers, and they organized a very nice traditional espicha that introduced good cider and Asturian gastronomy to people coming from all over the world. The conference was amazing and I am very pleased to have participated in this event. I recommend that you attend next year… lots of fun and open-source!

    Regarding my talk “Introduction to Mesa, an open-source graphics API”, there were 30 people attending it. It was great to see that so many people were interested in a brief introduction to how Mesa works, how the community collaborates and how anyone can become a contributor to Mesa, even as a non-developer. I hope I can see them contributing to Mesa in the near future!

    You can find the slides of my talk here.

    Ubucon Europe 2018 talk

    April 30, 2018 12:00 PM

    April 25, 2018

    Frédéric Wang

    April 17, 2018

    Víctor Jáquez

    How to setup a gst-build environment with Intel’s VA-API stack

    gst-build is far superior to the gst-uninstalled scripts for developing GStreamer, mainly because of its use of meson and ninja. Nonetheless, integrating external dependencies is not as easy as in gst-uninstalled.

    This guide aims to show how to integrate GStreamer-VAAPI dependencies, in this case with the Intel VA-API driver.


    For now we will need meson master, since this patch is required. The pull request is already merged, but it has not been released yet.


    clone gst-build repository

    $ git clone git://
    $ cd gst-build

    apply custom patch for gst-build

    The patch will add the repositories for libva and intel-vaapi-driver.

    $ wget
    $ git am 0001-add-libva-and-intel-vaapi-driver-as-subprojects.patch



    Running this command, all dependency repositories will be cloned, symbolic links created, and the build directory configured.

    $ meson build

    apply custom patches for libva, intel-vaapi-driver and gstreamer-vaapi


    This patch is required since the uninstalled paths of the header files don’t match the ones in the “include” directives.

    $ cd libva
    $ wget
    $ git am 0001-build-add-headers-for-uninstalled-setup.patch
    $ cd -



    The patch handles libva dependency as a subproject.

    $ cd intel-vaapi-driver
    $ wget
    $ git am 0001-meson-support-libva-as-subproject.patch
    $ cd -



    Note to myself: this patch must be split and merged in upstream.

    $ cd gstreamer-vaapi
    $ wget
    $ git am 0001-build-meson-libva-gst-uninstall-friendly.patch
    $ cd -

    0001-build-meson-libva-gst-uninstall-friendly.patch updated: 2018/04/24


    $ ninja -C build

    And wait a couple minutes.

    run uninstalled environment for testing

    $ ninja -C build uninstalled
    [gst-master] $ gst-inspect-1.0 vaapi
    Plugin Details:
      Name                     vaapi
      Description              VA-API based elements
      Filename                 /opt/gst/gst-build/build/subprojects/gstreamer-vaapi/gst/vaapi/
      License                  LGPL
      Source module            gstreamer-vaapi
      Binary package           gstreamer-vaapi
      Origin URL     
      vaapih264enc: VA-API H264 encoder
      vaapimpeg2enc: VA-API MPEG-2 encoder
      vaapisink: VA-API sink
      vaapidecodebin: VA-API Decode Bin
      vaapipostproc: VA-API video postprocessing
      vaapivc1dec: VA-API VC1 decoder
      vaapih264dec: VA-API H264 decoder
      vaapimpeg2dec: VA-API MPEG2 decoder
      vaapijpegdec: VA-API JPEG decoder
      9 features:
      +-- 9 elements

    by vjaquez at April 17, 2018 02:43 PM

    Iago Toral

    Frame analysis of a rendering of the Sponza model

    For some time now I have been working on a personal project to render the well known Sponza model provided by Crytek using Vulkan. Here is a picture of the current (still a work-in-progress) result:

    Sponza rendering

    This screenshot was captured on my Intel Kabylake laptop, running on the Intel Mesa Vulkan driver (Anvil).

    The following list includes the main features implemented in the demo:

    • Depth pre-pass
    • Forward and deferred rendering paths
    • Anisotropic filtering
    • Shadow mapping with Percentage-Closer Filtering
    • Bump mapping
    • Screen Space Ambient Occlusion (only on the deferred path)
    • Screen Space Reflections (only on the deferred path)
    • Tone mapping
    • Anti-aliasing (FXAA)

    I have been thinking about writing a post about this for some time, but given that there are multiple features involved I wasn’t sure how to scope it. Eventually I decided to write a “frame analysis” post where I describe, step by step, all the render passes involved in the production of the single frame capture shown at the top of the post. I have always enjoyed reading this kind of article, so I figured it would be fun to write one myself, and I hope others find it informative, if not entertaining.

    To avoid making the post too dense I won’t go into too much detail while describing each render pass, so don’t expect me to go into the nitty-gritty of how I implemented Screen Space Ambient Occlusion, for example. Instead I intend to give a high-level overview of how the various features implemented in the demo work together to create the final result. I will provide screenshots so that readers can appreciate the outputs of each step and verify how detail and quality build up over time as we include more features in the pipeline. Those who are more interested in the programming details of particular features can always have a look at the Vulkan source code (link available at the bottom of the article), look for specific tutorials available on the Internet or wait for me to write feature-specific posts (I don’t make any promises though!).

    If you’re interested in going through with this then grab a cup of coffee and get ready, it is going to be a long ride!

    Step 0: Culling

    This is the only step in this discussion that runs on the CPU, and while optional from the point of view of the result (it doesn’t affect the actual result of the rendering), it is relevant from a performance point of view. Prior to rendering anything, in every frame, we usually want to cull meshes that are not visible to the camera. This can greatly help performance, even on a relatively simple scene such as this. This is of course more noticeable when the camera is looking in a direction in which a significant amount of geometry is not visible to it, but in general, there are always parts of the scene that are not visible to the camera, so culling is usually going to give you a performance bonus.

    In large, complex scenes with tons of objects we probably want to use more sophisticated culling methods such as Quadtrees, but in this case, since the number of meshes is not too high (the Sponza model is slightly shy of 400 meshes), we just go through all of them and cull them individually against the camera’s frustum, which determines the area of the 3D space that is visible to the camera.

    The way culling works is simple: for each mesh we compute an axis-aligned bounding box and we test that box for intersection with the camera’s frustum. If we can determine that the box never intersects, then the mesh enclosed within it is not visible and we flag it as such. Later on, at rendering time (or rather, at command recording time, since the demo has been written in Vulkan) we just skip the meshes that have been flagged.
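
    The box-versus-frustum test described above can be sketched as follows. This is a minimal Python illustration, not the demo's Vulkan/C code: the `AABB` class, the plane representation (`a, b, c, d` with normals pointing into the frustum) and the box-shaped `EXAMPLE_PLANES` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AABB:
    min_corner: tuple  # (x, y, z)
    max_corner: tuple  # (x, y, z)


def outside_plane(box, plane):
    """True if the whole box lies on the negative (outer) side of the plane."""
    a, b, c, d = plane
    # Pick the corner of the box that lies furthest along the plane normal.
    # If even that corner is behind the plane, the whole box is.
    px = box.max_corner[0] if a >= 0 else box.min_corner[0]
    py = box.max_corner[1] if b >= 0 else box.min_corner[1]
    pz = box.max_corner[2] if c >= 0 else box.min_corner[2]
    return a * px + b * py + c * pz + d < 0


def is_visible(box, frustum_planes):
    # The box is culled as soon as one plane rejects it completely.
    return not any(outside_plane(box, plane) for plane in frustum_planes)


def cull(meshes, frustum_planes):
    """Return the names of the meshes that survive culling."""
    return [name for name, box in meshes.items() if is_visible(box, frustum_planes)]


# Hypothetical box-shaped "frustum" in view-space (camera looking down -Z):
# -1 <= x <= 1, -1 <= y <= 1, -100 <= z <= -0.1
EXAMPLE_PLANES = [
    (1, 0, 0, 1), (-1, 0, 0, 1),
    (0, 1, 0, 1), (0, -1, 0, 1),
    (0, 0, -1, -0.1), (0, 0, 1, 100),
]
```

    Note that the test is conservative: a box that is only rejected by a combination of planes, but not by any single one, is still reported as visible.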

    The algorithm is not perfect, since it is possible that an axis-aligned bounding box for a particular mesh is visible to the camera and yet no part of the mesh itself is visible, but this should not affect a lot of meshes, and trying to improve it would incur additional checks that could undermine the efficiency of the process anyway.

    Since in this particular demo we only have static geometry we only need to run the culling pass when the camera moves around, since otherwise the list of visible meshes doesn’t change. If dynamic geometry were present, we would need to at least cull dynamic geometry on every frame even if the camera stayed static, since dynamic elements may step in (or out of) the viewing frustum at any moment.

    Step 1: Depth pre-pass

    This is an optional stage, but it can help performance significantly in many cases. The idea is the following: our GPU performance is usually going to be limited by the fragment shader, especially as we target higher resolutions. In this context, without a depth pre-pass, we are very likely going to execute the fragment shader for fragments that will not end up on the screen because they are occluded by fragments produced by other geometry in the scene that will be rasterized to the same XY screen-space coordinates but with a smaller Z coordinate (closer to the camera). This wastes precious GPU resources.

    One way to improve the situation is to sort our geometry by distance from the camera and render front to back. With this we can get fragments that are rasterized from background geometry quickly discarded by early depth tests before the fragment shader runs for them. Unfortunately, although this will certainly help (assuming we can spare the extra CPU work to keep our geometry sorted for every frame), it won’t eliminate all the instances of the problem in the general case.

    Also, sometimes things are more complicated, as the shading cost of different pieces of geometry can be very different and we should also take this into account. For example, we can have a very large piece of geometry for which some pixels are very close to the camera while some others are very far away and that has a very expensive shader. If our renderer is doing front-to-back rendering without any other considerations it will likely render this geometry early (since parts of it are very close to the camera), which means that it will shade all or most of its very expensive fragments. However, if the renderer accounts for the relative cost of the shader execution it would probably postpone rendering it as much as possible, so by the time it actually renders it, it takes advantage of early fragment depth tests to avoid as many of its expensive fragment shader executions as possible.

    Using a depth-prepass ensures that we only run our fragment shader for visible fragments, and only those, no matter the situation. The downside is that we have to execute a separate rendering pass where we render our geometry to the depth buffer so that we can identify the visible fragments. This pass is usually very fast though, since we don’t even need a fragment shader and we are only writing to a depth texture. The exception to this rule is geometry that has opacity information, such as opacity textures, in which case we need to run a cheap fragment shader to identify transparent pixels and discard them so they don’t hit the depth buffer. In the Sponza model we need to do that for the flowers or the vines on the columns for example.

    Depth pre-pass output

    The picture shows the output of the depth pre-pass. Darker colors mean smaller distance from the camera. That’s why the picture gets brighter as we move further away.

    Now, the remaining passes will be able to use this information to limit their shading to fragments that, for a given XY screen-space position, match exactly the Z value stored in the depth buffer, effectively selecting only the fragments that will be visible in the screen. We do this by configuring the depth test to do an EQUAL test instead of the usual LESS test, which is what we use in the depth-prepass.
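
    To make the LESS/EQUAL idea concrete, here is a small CPU simulation in Python. It is only a sketch under simplifying assumptions: fragments are hypothetical `(x, y, z, color)` tuples, the depth buffer is a list of lists cleared to 1.0, and "running the fragment shader" is just counting or collecting fragments.

```python
def depth_prepass(fragments, width, height):
    """Depth-only pass using a LESS test; no shading happens here."""
    depth = [[1.0] * width for _ in range(height)]
    for x, y, z, _color in fragments:
        if z < depth[y][x]:
            depth[y][x] = z
    return depth


def shade_with_equal_test(fragments, depth):
    """Later passes: only fragments whose depth EQUALs the stored one are shaded."""
    shaded = []
    for x, y, z, color in fragments:
        if z == depth[y][x]:
            shaded.append((x, y, color))
    return shaded


def count_shading_without_prepass(fragments, width, height):
    """Single pass with a LESS test: every fragment that passes gets shaded."""
    depth = [[1.0] * width for _ in range(height)]
    invocations = 0
    for x, y, z, _color in fragments:
        if z < depth[y][x]:
            depth[y][x] = z
            invocations += 1
    return invocations
```

    With three fragments landing on the same pixel and submitted back to front, the single-pass version shades all three, while the pre-pass version shades only the closest one.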

    In this particular demo, running on my Intel GPU, the depth pre-pass is by far the cheapest of all the GPU passes and it definitely pays off in terms of overall performance output.

    Step 2: Shadow map

    In this demo we have a single source of light produced by a directional light that simulates the sun. You can probably guess the direction of the light by checking out the picture at the top of this post and looking at the direction of the projected shadows.

    I already covered how shadow mapping works in a previous series of posts, so if you’re interested in the programming details I encourage you to read that. Anyway, the basic idea is that we want to capture the scene from the point of view of the light source (to be more precise, we want to capture the objects in the scene that can potentially produce shadows that are visible to our camera).

    With that information, we will be able to inform our lighting pass so it can tell if a particular fragment is in the shadows (not visible from our light’s perspective) or in the light (visible from our light’s perspective) and shade it accordingly.

    From a technical point of view, recording a shadow map is exactly the same as the depth-prepass: we basically do a depth-only rendering and capture the result in a depth texture. The main differences here are that we need to render from the point of view of the light instead of our camera’s and that this being a directional light, we need to use an orthographic projection and adjust it properly so we capture all relevant shadow casters around the camera.
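
    As an illustration of the orthographic projection mentioned above, here is a minimal Python sketch. The conventions are assumptions for the example, not necessarily the demo's: the light looks down its own -Z axis, depth is mapped to the [0, 1] range, and the Vulkan Y flip is ignored. `fit_orthographic` is a hypothetical helper that adjusts the projection to a set of light-space points (for instance, the corners of the region around the camera that can cast visible shadows).

```python
def orthographic(left, right, bottom, top, near, far):
    """Row-major 4x4 orthographic matrix: x, y -> [-1, 1], depth -> [0, 1]."""
    return [
        [2.0 / (right - left), 0.0, 0.0, -(right + left) / (right - left)],
        [0.0, 2.0 / (top - bottom), 0.0, -(top + bottom) / (top - bottom)],
        [0.0, 0.0, -1.0 / (far - near), -near / (far - near)],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform(matrix, point):
    """Multiply the matrix by the point (x, y, z, 1)."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(matrix[r][c] * v[c] for c in range(4)) for r in range(4))


def fit_orthographic(points_light_space):
    """Build the tightest projection that contains all the given points."""
    xs, ys, zs = zip(*points_light_space)
    # The light looks down -Z, so the nearest point has the largest z.
    return orthographic(min(xs), max(xs), min(ys), max(ys), -max(zs), -min(zs))
```

    Unlike a perspective projection there is no division by depth, which matches the parallel rays of a directional light.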

    Shadow map

    In the image above we can see the shadow map generated for this frame. Again, the brighter the color, the further away the fragment is from the light source. The bright white area outside the atrium building represents the part of the scene that is empty and thus ends with the maximum depth, which is what we use to clear the shadow map before rendering to it.

    In this case, we are using a 4096×4096 texture to store the shadow map image, much larger than our rendering target. This is because shadow mapping from directional lights needs a lot of precision to produce good results, otherwise we end up with very pixelated / blocky shadows, more artifacts and even missing shadows for small geometry. To illustrate this better here is the same rendering of the Sponza model from the top of this post, but using a 1024×1024 shadow map (floor reflections are disabled, but that is irrelevant to shadow mapping):

    Sponza rendering with 1024×1024 shadow map

    You can see how in the 1024×1024 version there are some missing shadows for the vines on the columns and generally blurrier shadows (when not also slightly distorted) everywhere else.
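
    A quick back-of-the-envelope calculation shows why resolution matters so much here. The Python sketch below assumes, purely for illustration, that the shadow map covers a 40-unit wide region of the scene and that each texel is a 32-bit depth value; neither number comes from the demo.

```python
def shadow_texel_size(covered_extent, resolution):
    """World-space width covered by a single shadow map texel."""
    return covered_extent / resolution


def shadow_map_bytes(resolution, bytes_per_texel=4):
    """Memory used by a square shadow map (32-bit depth assumed by default)."""
    return resolution * resolution * bytes_per_texel


if __name__ == "__main__":
    extent = 40.0  # hypothetical extent covered by the light's projection
    for res in (1024, 4096):
        size = shadow_texel_size(extent, res)
        mib = shadow_map_bytes(res) / (1024 * 1024)
        print(f"{res}x{res}: texel covers {size:.4f} units, uses {mib:.0f} MiB")
```

    Going from 1024 to 4096 makes each texel four times smaller along each axis, at the cost of sixteen times the memory.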

    Step 3: GBuffer

    In deferred rendering we capture various attributes of the fragments produced by rasterizing our geometry and write them to separate textures that we will use to inform the lighting pass later on (and possibly other passes).

    What we do here is to render our geometry normally, like we did in our depth-prepass, but this time, as we explained before, we configure the depth test to only pass fragments that match the contents of the depth-buffer that we produced in the depth-prepass, so we only process fragments that we know will be visible on the screen.

    Deferred rendering uses multiple render targets to capture each of these attributes to a different texture for each rasterized fragment that passes the depth test. In this particular demo our GBuffer captures:

    1. Normal vector
    2. Diffuse color
    3. Specular color
    4. Position of the fragment from the point of view of the light (for shadow mapping)

    It is important to be very careful when defining what we store in the GBuffer: since we are rendering to multiple screen-sized textures, this pass has serious bandwidth requirements and therefore, we should use texture formats that give us the range and precision we need with the smallest pixel size requirements and avoid storing information that we can get or compute efficiently through other means. This is particularly relevant for integrated GPUs that don’t have dedicated video memory (such as my Intel GPU).
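
    To get a feeling for the numbers involved, the following Python sketch adds up the per-pixel cost of a GBuffer layout. The formats and the 1920x1080 resolution are hypothetical choices for the example and are not taken from the demo.

```python
# Hypothetical layout: attachment name -> bytes per pixel.
HYPOTHETICAL_GBUFFER = {
    "normal": 8,                  # e.g. a 4 x 16-bit float format
    "diffuse": 4,                 # e.g. a 4 x 8-bit format
    "specular": 4,                # e.g. a 4 x 8-bit format
    "light_space_position": 16,   # e.g. a 4 x 32-bit float format
}


def gbuffer_bytes_per_pixel(layout):
    return sum(layout.values())


def gbuffer_bytes_per_frame(layout, width, height):
    """Bytes written by the GBuffer pass if every pixel is written once."""
    return gbuffer_bytes_per_pixel(layout) * width * height


if __name__ == "__main__":
    total = gbuffer_bytes_per_frame(HYPOTHETICAL_GBUFFER, 1920, 1080)
    print(f"{total / (1024 * 1024):.1f} MiB written per frame")
```

    With this layout, one extra 4 x 32-bit float attachment would add another 16 bytes per pixel, which is why it pays to avoid storing anything that can be computed cheaply.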

    In the demo, I do lighting in view-space (that is, the coordinate space that takes the camera as its origin), so I need to work with positions and vectors in this coordinate space. One of the parameters we need for lighting is surface normals, which are conveniently stored in the GBuffer, but we will also need to know the view-space position of the fragments in the screen. To avoid storing the latter in the GBuffer we take advantage of the fact that we can reconstruct the view-space position of any fragment on the screen from its depth (which is stored in the depth buffer we rendered during the depth-prepass) and the camera’s projection matrix. I might cover the process in more detail in another post. For now, what is important to remember is that we don’t need to worry about storing fragment positions in the GBuffer and that saves us some bandwidth, helping performance.
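
    The reconstruction can be sketched in a few lines of Python. This is one possible formulation under assumed conventions (a symmetric perspective projection, camera looking down -Z, depth in the [0, 1] range, Vulkan Y flip ignored); the function names are made up for the example and the demo may do this differently.

```python
import math


def projection_params(fovy_deg, aspect, near, far):
    """Non-trivial terms of a perspective projection with depth in [0, 1]."""
    fy = 1.0 / math.tan(math.radians(fovy_deg) / 2.0)
    fx = fy / aspect
    a = far / (near - far)
    b = near * far / (near - far)
    return fx, fy, a, b


def project(view_pos, params):
    """View-space position -> (ndc_x, ndc_y, depth)."""
    fx, fy, a, b = params
    x, y, z = view_pos
    w = -z  # the camera looks down -Z, so w is the distance along the view axis
    return (fx * x / w, fy * y / w, (a * z + b) / w)


def reconstruct_view_pos(ndc_x, ndc_y, depth, params):
    """Invert project(): recover the view-space position from the depth."""
    fx, fy, a, b = params
    z = -b / (a + depth)  # solve depth = (a * z + b) / -z for z
    w = -z
    return (ndc_x * w / fx, ndc_y * w / fy, z)
```

    In a shader, `ndc_x` and `ndc_y` would come from the fragment's screen coordinates and `depth` from sampling the depth buffer.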

    Let’s have a look at the various GBuffer textures we produce in this stage:

    Normal vectors

    GBuffer normal texture

    Here we see the normalized normal vectors for each fragment in view-space. This means they are expressed in a coordinate space in which our camera is at the origin and the positive Z direction is opposite to the camera’s view vector. Therefore, we see that surfaces pointing to the right of our camera are red (positive X), those pointing up are green (positive Y) and those pointing opposite to the camera’s view direction are blue (positive Z).

    It should be mentioned that some of these surfaces use normal maps for bump mapping. These normal maps are textures that provide per-fragment normal information instead of the usual vertex normals that come with the polygon meshes. This means that instead of computing per-fragment normals as a simple interpolation of the per-vertex normals across the polygon faces, which gives us a rather flat result, we use a texture to adjust the normal for each fragment in the surface, which enables the lighting pass to render more nuanced surfaces that seem to have a lot more volume and detail than they would have otherwise.
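
    As a rough illustration of how a normal map texel adjusts the interpolated normal, here is a Python sketch. It assumes the common tangent-space encoding, where each color channel in [0, 1] maps to a vector component in [-1, 1], and that a tangent and bitangent are available for the surface; the source does not say which encoding the demo uses.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def perturb_normal(normal, tangent, bitangent, texel_rgb):
    """Combine the interpolated normal with a tangent-space normal map texel."""
    # Decode the texel from the [0, 1] color range to a [-1, 1] vector.
    tx, ty, tz = (2.0 * c - 1.0 for c in texel_rgb)
    # Express the decoded vector in the same space as the surface basis.
    perturbed = tuple(
        tx * t + ty * b + tz * n
        for t, b, n in zip(tangent, bitangent, normal)
    )
    return normalize(perturbed)
```

    A texel of (0.5, 0.5, 1.0) decodes to (0, 0, 1) and leaves the interpolated normal unchanged, which is why flat areas of such normal maps look light blue.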

    For comparison, here is the GBuffer normal texture without bump mapping enabled. The difference in surface detail should be obvious. Just look at the lion figure at the far end or the columns and you will immediately notice the additional detail added with bump mapping to the surface descriptions:

    GBuffer normal texture (bump mapping disabled)

    To make the impact of the bump mapping more obvious, here is a different shot of the final rendering focusing on the columns of the upper floor of the atrium, with and without bump mapping:

    Bump mapping enabled
    Bump mapping disabled

    All the extra detail in the columns is the sole result of the bump mapping technique.

    Diffuse color

    GBuffer diffuse texture

    Here we have the diffuse color of each fragment in the scene. This is basically what our scene would look like if we didn’t implement a lighting pass that considers how the light source interacts with the scene.

    Naturally, we will use this information in the lighting pass to modulate the color output based on the light interaction with each fragment.

    Specular color

    GBuffer specular texture

    This is similar to the diffuse texture, but here we are storing the color (and strength) used to compute specular reflections.

    Similarly to normal textures, we use specular maps to obtain per-fragment specular colors and intensities. This allows us to simulate combinations of more complex materials in the same mesh by specifying different specular properties for each fragment.

    For example, if we look at the cloths that hang from the upper floor of the atrium, we see that they are mostly black, meaning that they barely produce any specular reflection, as is to be expected from textile materials. However, we also see that these same cloths have an embroidery that has specular reflection (showing up as a light gray color), which means these details in the texture have stronger specular reflections than their surrounding textile material:

    Specular reflection on cloth embroidery

    The image shows visible specular reflections in the yellow embroidery decorations of the cloth (on the bottom-left) that are not present in the textile segment (the blue region of the cloth).

    Fragment positions from Light

    GBuffer light-space position texture

    Finally, we store fragment positions in the coordinate space of the light source so we can implement shadows in the lighting pass. This image may be less intuitive to interpret, since it is encoding space positions from the point of view of the sun rather than physical properties of the fragments. We will need to retrieve this information for each fragment during the lighting pass so that we can tell, together with the shadow map, which fragments are visible from the light source (and therefore are directly lit by the sun) and which are not (and therefore are in the shadows). Again, you can find more detail on how that process works, step by step and including Vulkan source code, in my series of posts on that topic.
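
    The comparison against the shadow map can be sketched as follows in Python. The sketch assumes the light-space position has already been remapped so that `u` and `v` are in [0, 1] and `depth` is comparable with the values stored in the shadow map; the `bias` parameter is a common trick to avoid self-shadowing artifacts and its value here is arbitrary.

```python
def in_shadow(shadow_map, light_space_pos, bias=0.005):
    """True if something closer to the light covers this fragment."""
    u, v, depth = light_space_pos
    size = len(shadow_map)
    # Nearest-texel lookup, clamped to the edges of the map.
    tx = min(max(int(u * size), 0), size - 1)
    ty = min(max(int(v * size), 0), size - 1)
    closest_to_light = shadow_map[ty][tx]
    return depth - bias > closest_to_light
```

    A fragment whose depth matches the stored value is the one the light "sees" at that texel, so it is lit; anything further away along the same light ray is in shadow.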

    Step 4: Screen Space Ambient Occlusion

    With the information stored in the GBuffer we can now also run a screen-space ambient occlusion pass that we will use to improve our lighting pass later on.

    The idea here is that, as I discussed in my lighting and shadows series, the Phong lighting model simplifies ambient lighting by making it constant across the scene. As a consequence, areas that are not directly lit by a light source look rather flat, as we can see in this image:

    SSAO disabled

    Screen-space Ambient Occlusion is a technique that gathers information about the amount of ambient light occlusion produced by nearby geometry as a way to better estimate the ambient light term of the lighting equations. We can then use that information in our lighting pass to modulate ambient light accordingly, which can greatly improve the sense of depth and volume in the scene, especially in areas that are not directly lit:

    SSAO enabled

    Comparing the images above should illustrate the benefits of the SSAO technique. For example, look at the folds in the blue curtains on the right side of the images: without SSAO, we barely see them because the lighting is too flat across all the pixels in the curtain. Similarly, thanks to SSAO we can create shadowed areas from ambient light alone, as we can see behind the cloths that hang from the upper floor of the atrium or behind the vines on the columns.

    To produce this result, the output of the SSAO pass is a texture with ambient light intensity information that looks like this (after some blur post-processing to eliminate noise artifacts):

    SSAO output texture

    In that image, white tones represent strong light intensity and black tones represent low light intensity produced by occlusion from nearby geometry. In our lighting pass we will source from this texture to obtain per-fragment ambient occlusion information and modulate the ambient term accordingly, bringing the additional volume showcased in the image above to the final rendering.

    Step 6: Lighting pass

    Finally, we get to the lighting pass. Most of what we showcased above was preparation work for this.

    The lighting pass mostly goes as I described in my lighting and shadows series, only that since we are doing deferred rendering we get our per-fragment lighting inputs by reading from the GBuffer textures instead of getting them from the vertex shader.

    Basically, the process involves retrieving diffuse, ambient and specular color information from the GBuffer and using it as input for the lighting equations to produce the final color for each fragment. We also sample from the shadow map to decide which pixels are in the shadows, in which case we remove their diffuse and specular components, making them darker and producing shadows in the image as a result.

    We also use the SSAO output to improve the ambient light term as described before, multiplying the ambient term of each fragment by the SSAO value we computed for it, reducing the strength of the ambient light for pixels that are surrounded by nearby geometry.

    The lighting pass is also where we put bump mapping to use. Bump mapping provides more detailed information about surface normals, which the lighting pass uses to simulate more complex lighting interactions with mesh surfaces, producing significantly enhanced results, as I showcased earlier in this post.
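
    Putting the pieces of this section together, a per-fragment lighting computation could look roughly like the Python sketch below. It is a simplified Phong-style formulation in view-space with a white light; the parameter names, the `shininess` exponent and the way the terms are combined are assumptions for illustration rather than the demo's actual shader.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def shade_fragment(diffuse, specular, normal, view_pos, light_dir,
                   ambient_strength, ssao, shadowed, shininess=32.0):
    """Combine GBuffer inputs, shadow test result and SSAO into a color."""
    n = normalize(normal)
    to_light = normalize(tuple(-c for c in light_dir))
    # In view-space the camera sits at the origin.
    to_camera = normalize(tuple(-c for c in view_pos))

    # SSAO scales down the ambient term for occluded fragments.
    ambient = ambient_strength * ssao

    diffuse_term = 0.0
    specular_term = 0.0
    if not shadowed:  # fragments in shadow keep only the ambient term
        n_dot_l = dot(n, to_light)
        if n_dot_l > 0.0:
            diffuse_term = n_dot_l
            reflected = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, to_light))
            specular_term = max(dot(reflected, to_camera), 0.0) ** shininess

    return tuple(
        (ambient + diffuse_term) * d + specular_term * s
        for d, s in zip(diffuse, specular)
    )
```

    Note that nothing clamps the result, so brightly lit fragments can produce components larger than 1.0.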

    After combining all this information, the lighting pass produces an output like this. Compare it with the GBuffer diffuse texture to see all the stuff that this pass is putting together:

    Lighting pass output

    Step 7: Tone mapping

    After the lighting pass we run a number of post-processing passes, of which tone mapping is the first one. The idea behind tone mapping is this: normally, shader color outputs are limited to the range [0, 1], which puts a hard cap on our lighting calculations. Specifically, it means that when our light contributions to a particular pixel go beyond 1.0 in any color component, they get clamped, which can distort the resulting color in unrealistic ways, especially when this happens during intermediate lighting calculations (since the deviation from the physically correct color is then used as input to more computations, which then build on that error).

    To work around this we do our lighting calculations in High Dynamic Range (HDR) which allows us to produce color values with components larger than 1.0, and then we run a tone mapping pass to re-map the result to the [0, 1] range when we are done with the lighting calculations and we are ready for display.

    The nice thing about tone mapping is that it gives the developer control over how that mapping happens, allowing us to decide if we are interested in preserving more detail in the darker or brighter areas of the scene.
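
    Two well-known tone mapping operators illustrate the kind of control involved. The Python sketch below shows the Reinhard operator and a simple exposure-based operator; the post does not say which operator the demo uses, so take these only as examples.

```python
import math


def reinhard_tone_map(hdr_color):
    """Reinhard operator: maps [0, inf) to [0, 1) per component."""
    return tuple(c / (1.0 + c) for c in hdr_color)


def exposure_tone_map(hdr_color, exposure=1.0):
    """Exposure operator: higher exposure favors detail in dark areas."""
    return tuple(1.0 - math.exp(-c * exposure) for c in hdr_color)


def clamp_color(hdr_color):
    """What happens without tone mapping: everything above 1.0 is lost."""
    return tuple(min(c, 1.0) for c in hdr_color)
```

    Two HDR colors such as (3.0, 1.0, 0.0) and (5.0, 1.0, 0.0) clamp to the same output, while both operators keep them distinct.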

    In this particular demo, I used HDR rendering to ramp up the intensity of the sun light beyond what I could have represented otherwise. Without tone mapping this would lead to unrealistic lighting in areas with strong light reflections, since it would exceed the 1.0 per-color-component cap and lead to pure white colors as a result, losing the color detail from the original textures. This effect can be observed in the following pictures if you look at the lit area of the floor. Notice how the tone-mapped picture better retains the detail of the floor texture while in the non tone-mapped version the floor seems to be over-exposed to light and large parts of it just become white as a result (shadow mapping has been disabled to better showcase the effects of tone-mapping on the floor):

    Tone mapping disabled
    Tone mapping enabled

    Step 8: Screen Space Reflections (SSR)

    The material used to render the floor is reflective, which means that we can see the reflections of the surrounding environment on it.

    There are various ways to capture reflections, each with their own set of pros and cons. When I implemented my OpenGL terrain rendering demo I implemented water reflections using “Planar Reflections”, which produce very accurate results at the expense of having to re-render the scene with the camera facing in the same direction as the reflection. Although this can be done at a lower resolution, it is still quite expensive and cumbersome to set up (for example, you would need to run an additional culling pass), and you also need to do this for each planar surface you want to apply reflections on, so it doesn’t scale very well. In this demo, although it is not visible in the reference screenshot, I am capturing reflections from the floor sections of both stories of the atrium, so the Planar Reflections approach might have required me to render twice when fragments of both sections are visible (admittedly, not very often, but not impossible with the free camera).

    So in this particular case I decided to experiment with a different technique that has become quite popular, despite its many shortcomings, because it is a lot faster: Screen Space Reflections.

    Like all screen-space techniques, this one uses information already present on the screen to capture the reflection information, so we don’t have to render again from a different perspective. This leads to a number of limitations that can produce fairly visible artifacts, especially when there is dynamic geometry involved. Nevertheless, in my particular case I don’t have any dynamic geometry, at least not yet, so while the artifacts are there they are not quite as distracting. I won’t go into the details of the artifacts introduced with SSR here, but for those interested, here is a good discussion.

    I should mention that my take on this is fairly basic and doesn’t implement relevant features such as the Hierarchical Z Buffer optimization (HZB) discussed here.

    The technique has 3 steps: capturing reflections, applying roughness material properties and alpha blending:

    Capturing reflections

    I only implemented support for SSR in the deferred path, since like in the case of SSAO (and more generally all screen-space algorithms), deferred rendering is the best match since we are already capturing screen-space information in the GBuffer.

    The first stage requires a means to identify the fragments that need reflection information: in our case, the floor fragments. What I did for this is to capture the reflectiveness of the material of each fragment in the screen during the GBuffer pass. This is a single floating-point component (in the 0-1 range). A value of 0 means that the material is not reflective and the SSR pass will just ignore it. A value of 1 means that the fragment is 100% reflective, so its color value will be solely the reflection color. Values in between allow us to control the strength of the reflection for each fragment with a reflective material in the scene.

    One small note on the GBuffer storage: because this is a single floating-point value, we don’t necessarily need an extra attachment in the GBuffer (which would have some performance penalty), instead we can just put this in the alpha component of the diffuse color, since we were not using it (the Intel Mesa driver doesn’t support rendering to RGB textures yet, so since we are limited to RGBA we might as well put it to good use).

    Besides capturing which fragments are reflective, we can also store another piece of information relevant to the reflection computations: the material’s roughness. This is another scalar value indicating how much blurring we want to apply to the resulting reflection: smooth metal-like surfaces can have very sharp reflections, but with rougher materials that do not have smooth surfaces we may want the reflections to look a bit blurry, to better represent these imperfections.

    Besides the reflection and roughness information, to capture screen-space reflections we will need access to the output of the previous pass (tone mapping) from which we will retrieve the color information of our reflection points, the normals that we stored in the GBuffer (to compute reflection directions for each fragment in the floor sections) and the depth buffer (from the depth-prepass), so we can check for reflection collisions.

    The technique goes like this: for each fragment that is reflective, we compute the direction of the reflection using its normal (from the GBuffer) and the view vector (from the camera and the fragment position). Once we have this direction, we ray march from the fragment position in the direction of the reflection. For each point we generate, we take the screen-space X and Y coordinates and use them to retrieve the Z-buffer depth for that pixel in the scene. If the depth buffer value is smaller than our sample’s, it means that we have moved past foreground geometry and we stop the process. If we got to this point, then we can do a binary search to pin-point the exact location where the collision with the foreground geometry happens, which will give us the screen-space X and Y coordinates of the reflection point. Once we have that we only need to sample the original scene (the output from the tone mapping pass) at that location to retrieve the reflection color.
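
    The march-then-refine loop described above can be sketched in Python as follows. To keep the example short it makes a strong simplifying assumption: the ray is marched directly in screen space, with `origin` and `direction` given as (pixel x, pixel y, depth) and a depth buffer holding comparable depth values, whereas a real implementation would typically march in view-space and project each point. The function names and the step counts are made up for the example.

```python
def reflect(incident, normal):
    """Reflect the incident vector around a unit-length normal."""
    d = sum(i * n for i, n in zip(incident, normal))
    return tuple(i - 2 * d * n for i, n in zip(incident, normal))


def sample_depth(depth_buffer, x, y):
    """Depth stored at pixel (x, y), or None outside of the screen."""
    xi, yi = int(x), int(y)
    if 0 <= yi < len(depth_buffer) and 0 <= xi < len(depth_buffer[0]):
        return depth_buffer[yi][xi]
    return None


def ray_march(depth_buffer, origin, direction, step=1.0, max_steps=64, refine_steps=8):
    """Return the pixel (x, y) hit by the reflection ray, or None if it misses."""
    previous = origin
    for i in range(1, max_steps + 1):
        point = tuple(o + d * step * i for o, d in zip(origin, direction))
        scene_depth = sample_depth(depth_buffer, point[0], point[1])
        if scene_depth is None:
            return None  # the ray left the screen: no information available
        if scene_depth < point[2]:
            # We moved past foreground geometry: binary search between the
            # last point in front of it and the first point behind it.
            lo, hi = previous, point
            for _ in range(refine_steps):
                mid = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
                mid_depth = sample_depth(depth_buffer, mid[0], mid[1])
                if mid_depth is not None and mid_depth < mid[2]:
                    hi = mid
                else:
                    lo = mid
            return int(hi[0]), int(hi[1])
        previous = point
    return None
```

    The returned pixel is where the color buffer would be sampled to obtain the reflection color. A `None` result corresponds to the fragments that end up without reflection data.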

    As discussed earlier, the technique has numerous caveats, which we need to address in one way or another and maybe adapt to the characteristics of different scenes so we can obtain the best results in each case.

    The output of this pass is a color texture where we store the reflection colors for each fragment that has a reflective material:

    Reflection texture

    Naturally, the image above only shows reflection data for the pixels in the floor, since those are the only ones with a reflective material attached. It is immediately obvious that some pixels lack reflection color though; this is due to the various limitations of the screen-space technique that are discussed in the blog post I linked above.

    Because the reflections will be alpha-blended with the original image, we use the reflectiveness that we stored in the GBuffer as the base for the alpha component of the reflection color as well (there are other aspects that can contribute to the alpha component too, but I won’t go into that here), so the image above, although not visible in the screenshot, has a valid alpha channel.

    Considering material roughness

    Once we have captured the reflection image, the next step is to apply the material roughness settings. We can accomplish this with a simple box filter based on the roughness of each fragment: the larger the roughness, the larger the box filter we apply and the blurrier the reflection we get as a result. Because we store roughness for each fragment in the GBuffer, we can have multiple reflective materials with different roughness settings if we want. In this case, we just have one material for the floor though.
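
    A roughness-driven box filter along those lines might look like this in Python. The sketch works on a single-channel image for brevity, and the mapping from roughness to filter radius (`max_radius`) is an assumption of the example.

```python
def box_blur_by_roughness(image, roughness, max_radius=2):
    """Blur each pixel with a box whose size grows with its roughness."""
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Roughness 0 keeps the pixel sharp; 1 uses the largest box.
            radius = round(roughness[y][x] * max_radius)
            total = 0.0
            count = 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[ny][nx]
                    count += 1
            output[y][x] = total / count
    return output
```

    Because the radius is looked up per pixel, materials with different roughness values can coexist in the same reflection image.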

    Alpha blending

    Finally, we use alpha blending to incorporate the reflections into the original image (the output from the tone mapping pass) and produce the final rendering:

    SSR output

    Step 9: Anti-aliasing (FXAA)

    So far we have been neglecting anti-aliasing. Because we are doing deferred rendering Multi-Sample Anti-Aliasing (MSAA) is not an option: MSAA happens at rasterization time, which in a deferred renderer occurs before our lighting pass (specifically, when we generate the GBuffer), so it cannot account for the important effects that the lighting pass has on the resulting image, and therefore, on the eventual aliasing that we need to correct. This is why deferred renderers usually do anti-aliasing via post-processing.

    In this demo I have implemented a well-known anti-aliasing post-processing pass known as Fast Approximate Anti Aliasing (FXAA). The technique attempts to identify strong contrast across neighboring pixels in the image to identify edges and then smooth them out using linear filtering. Here is the final result which matches the one I included as reference at the top of this post:

    Anti-aliased output

    The image above shows the results of the anti-aliasing pass. Compare that with the output of the SSR pass. You can see how this pass has effectively removed the jaggies observed in the cloths hanging from the upper floor for example.

    Unlike MSAA, which acts on geometry edges only, FXAA works on all pixels, so it can also smooth out edges produced by shaders or textures. Whether that is something we want to do or not may depend on the scene. Here we can see this happening on the foreground column on the left, where some of the imperfections of the stone are slightly smoothed out by the FXAA pass.
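
    The core idea of detecting contrast between neighboring pixels and smoothing where it is high can be sketched in Python as below. This is a heavily simplified illustration, not the actual FXAA algorithm: it skips the edge direction search and sub-pixel blending that FXAA performs, and the luma weights, the `threshold` and the 50% blend factor are arbitrary choices for the example.

```python
def luma(rgb):
    """Perceived brightness of a color, used to measure contrast."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def smooth_high_contrast(image, threshold=0.125):
    """Blend pixels with their neighbors where local luma contrast is high."""
    height, width = len(image), len(image[0])
    lum = [[luma(pixel) for pixel in row] for row in image]
    output = [list(row) for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbors = [(x, y - 1), (x, y + 1), (x - 1, y), (x + 1, y)]
            lumas = [lum[y][x]] + [lum[ny][nx] for nx, ny in neighbors]
            if max(lumas) - min(lumas) < threshold:
                continue  # low contrast: not an edge, leave the pixel alone
            average = tuple(
                sum(image[ny][nx][c] for nx, ny in neighbors) / 4.0
                for c in range(3)
            )
            output[y][x] = tuple(
                0.5 * image[y][x][c] + 0.5 * average[c] for c in range(3)
            )
    return output
```

    Since the filter only looks at final pixel colors, it reacts to any high-contrast edge, whether it comes from geometry, a texture or a shader.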

    Conclusions and source code

    So that’s all, congratulations if you managed to read this far! In the past I have found articles that did frame analysis like this quite interesting so it’s been fun writing one myself and I only hope that this was interesting to someone else.

    This demo has been implemented in Vulkan and includes a number of configurable parameters that can be used to tweak performance and quality. The work-in-progress source code is available here, but beware that I have only tested this on Intel, since that is the only hardware I have available, so you may find issues if you run this on other GPUs. If that happens, let me know in the comments and I might be able to provide fixes at some point.

    by Iago Toral at April 17, 2018 12:45 PM

    April 16, 2018

    Jacobo Aragunde

    Updated Chromium on the GENIVI platform

    I’ve devoted some of my time at Igalia to get a newer version of Chromium running on the GENIVI Development Platform (GDP).

    Since the last update, there has been news regarding Wayland support in Chromium. My colleagues Antonio, Maksim and Frédéric have worked on a new Wayland backend following modern Chromium architecture. You can find more information in their own blogs and talks. I’m linking the most recent talk, from FOSDEM 2018.

    Everyone can already try the new, Igalia-developed backend on their embedded devices using the meta-browser layer. I built it along with the GDP but discovered that it cannot run as it is, due to the lack of ivi-shell hooks in the new Chromium backend. This is going to be fixed in the mid-term, so I decided not to spend a lot of time researching this and chose a different solution for the current GDP release.

    The LG SVL team recently announced the release of an updated Ozone Wayland backend for Chromium, based on the legacy implementation provided by Intel, as a part of the webOS OSE project. This is an update on the backend we were already running on the GDP, so it looked like a good idea to reuse their work.

    I added the meta-lgsvl-browser layer to the GDP, which provides recipes for several Chromium flavors: chromium-lge is the one that builds upon the legacy Wayland backend and currently provides Chromium version 64.

    The chromium-lge browser worked out-of-the-box on Raspberry Pi, but I faced some trouble with the other supported platforms. In the case of ARM64 platforms, we were finding a “relocation overflow” problem. This is something that my colleagues had already detected when trying the new Wayland backend on the R-Car gen. 3 platform, and it can be fixed by enabling compiler optimization flags for binary size.

    In the case of Intel platforms, compilation failed due to a build-system assertion. It looks like Clang’s Control Flow Integrity feature is enabled by default on x64 Linux builds, but we aren’t using the Clang compiler. The solution is simply to disable this feature, like the upstream meta-browser project was already doing.

    The ongoing work is shared in this pull request. I hope to be able to make it for the next GDP release!

    Finally, this week my colleague Xavi is taking part in the GENIVI All Member Meeting. If you are interested in browsers, make sure you attend his talk, “Wayland Support in Open Source Browsers“, and visit our booth during the Member Showcase! (EDIT: check out the slides in our slideshare page!)

    by Jacobo Aragunde Pérez at April 16, 2018 11:04 AM

    April 15, 2018

    Manuel Rego

    CSSWG F2F Berlin 2018

    Last week I was in Berlin for the CSS Working Group (CSSWG) face-to-face meeting representing Igalia, member of the CSSWG since last year. Igalia has been working on the open web platform for many years, where we help our customers with the implementation of different standards on the open source web engines. Inside the CSSWG we play the implementors role, providing valuable feedback around the specifications we’re working on.

    It was really nice to meet all the folks from the CSSWG there; it’s amazing to be together with such a brilliant group of people in the same room. And it’s lovely to see how easy it is to talk with any of them. You all rock!

    CSSWG F2F Berlin 2018 by Rossen Atanassov

    This is a brief post about my highlights from there, of course totally subjective and focused on the topics I’m most interested in.

    CSS Grid Layout

    We were discussing two issues of the current specification related to the track sizing algorithm and its behavior in particular cases. Some changes will be added in the specification to try to improve them and we’ll need to update the implementations accordingly.

    On top of that, we discussed Level 2 of the spec. It’s already defined that this next level will include the following features:

    • The awaited subgrids feature: There was the possibility of allowing subgrids in both axes (dual-axis) or only in one of them (per-axis). Note that the per-axis approach covers the dual-axis one if you define the subgrid in both axes.

      There are clear use cases for the per-axis approach, but the main doubt was about how hard it’d be to implement. Mats Palmgren from Mozilla posted a comment on the issue explaining that he has just created a prototype for the feature following the per-axis idea, so the CSSWG resolved to remove the dual-axis one from the spec.

    • And aspect-ratio controlled gutters: Regarding this topic, the CSSWG decided to add a new ar unit. We didn’t discuss anything more, but we need to decide what we’ll do in situations where there isn’t enough free space to fulfill the requested aspect-ratio: should we ignore it or overflow in that case?

      Talking to Rachel Andrew about the issue, she was not completely sure what the preferred option would be from the authors’ point of view. I’ve just added some examples to the issue so we can discuss them there and gather more feedback; please share your thoughts.


    This was a discussion I wanted to have with the CSSWG people in order to understand better the current situation and possible next steps for the CSSWG test suites.

    Just to add some context, the CSSWG test suites are now part of the web-platform-tests (WPT) repository. This repository is being used by most browser vendors to share tests, including tests for new CSS features. For example, at Igalia we’re currently using WPT test suites in all our developments.

    The CSSWG uses the CSS Test Harness tool, which has a build system that adds some special requirements for the test suites. One of them forces us to duplicate some files in the repository, which is not nice at all.

    Several people in the CSSWG still rely on this tool mainly for two things:

    • Run manual tests and store their results: Some CSS features like media queries or scrolling are hard to automate when writing tests, so several specs have manual tests. WebDriver can probably help to automate this kind of test, though maybe not all of them.
    • Extract status reports: To verify that a spec fulfills the CR exit criteria, the current tooling has some nice reports; it also provides info about the test coverage of the spec.

    So we cannot get rid of the CSS Test Harness system at this point. We discussed possible solutions but none of them were really clear. Also note that the lack of funding for this kind of work makes it harder to move things forward.

    I still believe the way to go would be to improve the WPT Dashboard so it can support the two features listed above. If that happens, maybe the specific CSS Test Harness stuff won’t be needed anymore, the weird requirements for people working on the test suites will be gone, and there would be a single tool for all the tests from the different working groups.

    As a side note, the WPT Dashboard needs some infrastructure improvements. For example, Microsoft was not happy that the Ahem font (which is used a lot in CSS test suites) is still not installed on the Windows virtual machines that extract test results for it.

    Floats, floats, floats

    People are using floats to simulate CSS Shapes on browsers that don’t support them yet. As a result, some special cases related to floats happen more frequently, and it’s hard to decide what the best thing to do is in those cases.

    The CSSWG was discussing what would be the best solution when the non-floated content doesn’t fit in the space left by the floated elements. The problem is quite complex to explain, but imagine the following picture where you have several floated elements.

    An example of float layout

    In this example there are a few floated elements restricting the area where the content can be painted. If the browser needs to find a place for a BFC (like a table), it needs to decide where to place it while avoiding overlap with any other floats.

    There was a long discussion, and it seems the best choice would be that the browser tests all the options and if there’s no overlapping then puts the table there (basically Option 1 in the linked illustration). Still there are concerns about performance, so there’s still more work to be done here. As a result of this discussion a new CSS Floats specification will be created to describe the expected behavior in this kind of scenarios.

    Monica Dinculescu created a really cool demo to explain how float layout works, with the help of Ian Kilpatrick who knows it pretty well as he has been dealing with lots of corner cases while working in LayoutNG.

    TYPO Labs

    The members of the CSSWG were invited to the co-located TYPO Labs event. I attended on Friday when Elika (fantasai), Myles and Rossen gave a talk. It was nice to see that CSS Grid Layout was mentioned in the first talk of the day as a useful tool for typographers. Variable fonts and Virtual Reality were clearly hot topics in several talks.

    Elika (fantasai), Rossen and Myles in the CSSWG talk at TYPO Labs

    It’s funny that the last time I was in Berlin was 10 years ago for a conference related to TYPO3, totally unrelated but with a similar name. 😄


    Some pictures of Berlin

    And that’s mostly all that I can remember now; I’m sure I’m missing many other important things. It was a fantastic week and I even found some time to walk around Berlin, as the weather was really pleasant.

    April 15, 2018 10:00 PM

    April 10, 2018

    Samuel Iglesias

    Going to Ubucon Europe 2018!

    Next Ubucon Europe is going to be in the beautiful city of Gijón, Spain, at the Antiguo Instituto from April 27th to 29th 2018.

    Ubucon is a conference for users and developers of Ubuntu, one of the most popular GNU/Linux distributions in the world. The conference is full of talks covering very different topics, and optional activities for attendees like the traditional espicha.

    I will attend this conference as a speaker with my talk “Introduction to Mesa, an open-source graphics API”. In this talk, I will give a brief introduction to Mesa, how it works and how to contribute to it. My talk will be on Sunday 29th April at 11:25am. Don’t miss it!

    If you plan to attend the conference and you want to talk about open-source graphics drivers, the Linux graphics stack, OpenGL/Vulkan or something similar, please drop me a line on Twitter.

    Ubucon Europe 2018

    April 10, 2018 06:00 AM

    April 05, 2018

    Javier Muñoz

    Ceph Day in Santiago de Compostela

    Our first Ceph meetup in Santiago de Compostela took place this week. The event was organized by AMTEGA in collaboration with Red Hat, Supermicro, Colabora, Cumulus Networks, Dinahosting and Igalia.

    My talk Upstream consultancy and Ceph RadosGW/S3 covered the context and value of the upstream contributions in the Ceph project, along with some examples of consulting and technical work that we carried out in Igalia ending with new features and improvements in the project.

    If you could not attend and are interested in the topics we talked about, you can read more about the event here.

    Some of the topics on the agenda that we discussed were hardware performance, reference architectures, open web-scale networking, operations, new use cases or storage products/services.

    Outside of the agenda we also discussed the technical impact of BlueStore, the new storage backend in Luminous, the low-risk approaches to update clusters in production or how small and medium-sized companies are gradually adopting software-defined storage.

    Although the agenda was really tight, we were able to include some practical references for all those people who had their first contact with Ceph at the event. In this regard, Red Hat commented on the availability of the Ceph Storage Test lab on AWS and I shared a guided technical recipe to deploy the Ceph RGW / S3 demo container locally.

    The latter allowed some of the attendees to follow the technical section of the talks in real time and experiment with configurations, protocols and tools throughout the event.

    Finally, thanks to all the people who participated in the organization and actively collaborated to make the event possible. See you at the next one!

    Update 2018/05/14

    The slides are now available through the Ceph community:


    by Javier at April 05, 2018 10:00 PM

    April 03, 2018

    Manuel Rego

    Getting rid of "grid-" prefix on CSS Grid Layout gutter properties

    Early this year I was working on unprefixing the CSS Grid Layout gutter properties. The properties were originally named grid-column-gap and grid-row-gap, together with the grid-gap shorthand. The CSS Working Group (CSSWG) decided to remove the grid- prefix from these properties last summer, so they could be extended to be used in other layout models like Flexbox.

    I was not planning to write a blog post about this, but the task ended up becoming something more than just renaming the properties, so this post describes what it took to implement this. Also people got quite excited about the possibility of animating grid gutters when I announced that this was ready on Twitter.

    The task

    So the theory seems pretty simple: we currently have 3 properties with the grid- prefix and we want to remove it:

    • grid-column-gap becomes column-gap,
    • grid-row-gap becomes row-gap and
    • grid-gap becomes gap.

    But column-gap is already an existing property, defined by the Multicolumn spec, which has been around for a long time. So we cannot just create a new property; we have to make it work for Grid Layout too, and be sure that the syntax is equivalent.

    Animatable properties

    When I started to test Multicol column-gap I realized it was animatable, however our implementations (Blink and WebKit) of the Grid Layout gutter properties were not. We’d need to make our properties animatable if we want to remove the prefixes.

    Moreover, I found a bug in Multicol column-gap animation: its default computed value is normal, so it shouldn’t be possible to animate it. This was fixed quickly by Morten Stenshorne from Google.

    Making the properties animatable is not complex at all; both Blink and WebKit have everything ready to make this task easy for properties like the gutter ones that represent lengths. So I decided to do this as part of the unprefixing patch, rather than as a separate change.

    CSS Grid Layout gutters animation example (check it live)


    But there was something else: the Grid gutter properties accept percentage values, whereas column-gap did not have that support yet. So I added percentage support to column-gap for multicolumn, as a preliminary patch for the unprefixing one.

    There have been long discussions in the CSSWG about how to resolve percentages on gutter properties. The spec has recently changed so these properties should be resolved to zero for content-based containers. However, my patch does not implement that, as we don’t believe there’s an easy way to support something like that in most of the web engines, and Blink and WebKit are no exceptions. Our patch follows what Microsoft Edge does in these cases, and resolves the percentage gaps like it does for percentage widths or heights. The Firefox implementation that has just landed this week does the same.

    CSS Multi-column percentage column-gap example (check it live)

    I guess we’ll still have some extra discussions about this topic in the CSSWG, but percentages themselves deserve their own blog post.


    Once all the previous problems got solved, I landed the patches related to unprefixing the gutter properties in both Blink and WebKit. So you can use the unprefixed versions starting with Chrome 66.0.3341.0 and Safari Technology Preview 50.

    <div style="display: grid; grid: 100px 50px / 300px 200px;
                column-gap: 25px; row-gap: 10px;">
      <div>Item 1</div>
      <div>Item 2</div>
      <div>Item 3</div>
      <div>Item 4</div>
    </div>

    A simple Grid Layout example using the unprefixed gutter properties

    Note that, as specified in the spec, the previous prefixed properties are still valid and will be kept as aliases to avoid breaking existing content.

    Also it’s important to notice that now the gap shorthand applies to Multicol containers too, as it sets the value of column-gap longhand (together with row-gap which would be ignored by Multicol).

    <div style="column-count: 2; gap: 100px;">
      <div>First column</div>
      <div>Second column</div>
    </div>

    Multicolumn example using gap property

    Web Platform Tests

    As usual in our latest developments, we have been using the web-platform-tests repository for all the tests related to this work. As a result, we now have 16 new tests that verify the support of these properties, including tests for animations too.

    Running those tests on the different browsers, I realized there was an inconsistency between the css-align and css-multicol specifications. Both specs define the column-gap property, but the computed value was different. I raised a CSSWG issue that has been recently solved, so that the computed value for column-gap: normal should still be normal. This means the property won’t be animatable from normal to other values, as explained before.

    This is the summary of the status of these tests in the main browser engines:

    • Blink and WebKit: They pass all the tests and follow the latest CSSWG resolution.
    • Edge: Unprefixed properties are available since version 41. Percentage support is interoperable with Blink and WebKit. The computed value of column-gap: normal is not normal there, so this needs to get updated.
    • Firefox: It doesn’t have support for the unprefixed properties yet, however the default computed value is normal like in Blink and WebKit. But Multicol column-gap percentage support has just been added. Note that there are already patches on review for this issue, so hopefully they’ll be merged in the coming days.


    The task is completed and everything should be settled at this point. You can start using these unprefixed properties, and it seems that Firefox will join the rest of the browsers by adding this support very soon.

    Igalia and Bloomberg working together to build a better web

    Last, but not least, this is again part of the ongoing collaboration between Igalia and Bloomberg. I don’t mind repeating myself over and over, but it’s really worth highlighting the support from Bloomberg in the CSS Grid Layout development. They have been showing the world that an external company can directly influence the new specifications from the standards bodies and their implementations by the browser vendors. Thank you very much!

    Finally and just as a heads-up, I’ll be in Berlin next week for the CSSWG F2F meeting. I’m sure we’ll have interesting conversations about CSS Grid Layout and many other topics there.

    April 03, 2018 10:00 PM

    Neil Roberts

    VkRunner – a shader test tool for Vulkan

    As part of my work in the graphics team at Igalia, I’ve been helping with the effort to enable GL_ARB_gl_spirv for Intel’s i965 driver in Mesa. Most of the functionality of the extension is already working, so in order to get more test coverage and discover unknown missing features we have been working to automatically convert some Piglit GLSL tests to use SPIR-V. Most of the GLSL tests use a tool internal to Piglit called shader_runner. This tool greatly simplifies the work needed to create a test by allowing it to be specified as a simple text file that just contains the required shaders and a list of commands to execute using them. The commands are very high-level, such as draw rect to submit a rectangle or probe rgba to check for a specific colour of a pixel in the framebuffer.

    A number of times during this work I’ve encountered problems that look like they are general problems with Mesa’s SPIR-V compiler and aren’t specific to the GL_ARB_gl_spirv work. I wanted a way to easily confirm this but writing a minimal test case from scratch in Vulkan seemed like quite a lot of hassle. Therefore I decided it would be nice to have a tool like shader_runner that works with Vulkan. I made a start on implementing this and called it VkRunner.


    Here is an example shader test file for VkRunner:

    [vertex shader]
    #version 430
    layout(location = 0) in vec3 pos;
    void main()
    {
            gl_Position = vec4(pos.xy * 2.0 - 1.0, 0.0, 1.0);
    }
    [fragment shader]
    #version 430
    layout(location = 0) out vec4 color_out;
    layout(std140, push_constant) uniform block {
            vec4 color_in;
    };
    void main()
    {
            color_out = color_in;
    }
    [vertex data]
    0.25 0.0775 0.0
    0.145 0.3875 0.0
    0.25 0.3175 0.0
    0.25 0.0775 0.0
    0.355 0.3875 0.0
    0.25 0.3175 0.0
    0.0775 0.195 0.0
    0.4225 0.195 0.0
    0.25 0.3175 0.0
    [test]
    # Clear the framebuffer to green
    clear color 0.0 0.6 0.0 1.0
    # White rectangle in the topleft corner
    uniform vec4 0 1.0 1.0 1.0 1.0
    draw rect 0 0 0.5 0.5
    # Green star in the topleft
    uniform vec4 0 0.0 0.6 0.0 1.0
    draw arrays TRIANGLE_LIST 0 9
    # Verify a rectangle colour
    relative probe rect rgb (0.5, 0.0, 0.5, 1.0) (0.0, 0.6, 0.0)

    If this is run through VkRunner it will convert the shaders to SPIR-V on the fly by piping them through glslangValidator, create a pipeline and a framebuffer and then run the test commands. The framebuffer is just an offscreen buffer so VkRunner doesn’t use any window system extensions. However you can still see the result by passing the -i image.ppm option, which causes it to write a PPM image of the final rendering.
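As a rough sketch of that workflow, the wrapper below checks that both tools are on the PATH before rendering a test to a PPM file. It assumes the binary is installed as `vkrunner` and that it accepts the script path after its options; the function name `render_shader_test` and the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: render a VkRunner test script to a PPM image.
# Assumes the tool is installed as "vkrunner" and that glslangValidator,
# which VkRunner pipes the GLSL through, is also on the PATH.
render_shader_test() {
    script=$1
    image=${2:-image.ppm}

    for tool in vkrunner glslangValidator; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            return 127
        fi
    done

    # -i writes the final framebuffer contents as a PPM image
    vkrunner -i "$image" "$script"
}
```

It could then be called as `render_shader_test star.shader_test star.ppm`, where both file names are placeholders.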

    The format is meant to be as close to shader_runner as possible but there are a few important differences. Vulkan can’t have uniforms in the default buffer and there is no way to access them by name. Instead you can use a push constant buffer and refer to the individual uniform using its byte offset in the buffer. VkRunner doesn’t yet support UBOs. The same goes for vertex attributes, which are now specified using an explicit location rather than by name. In the vertex data section you can use names from VkFormat to specify the format with maximum flexibility, but it can also accept Piglit-style names for compatibility.

    Bonus features

    VkRunner supports some extra features that shader_runner does not. Firstly you can specify a format for the framebuffer in the optional [require] section. For example you can make it use a floating-point framebuffer to be able to accurately probe results from functions.

    [require]
    framebuffer R32G32B32A32_SFLOAT
    [vertex shader passthrough]
    [fragment shader]
    #version 430
    layout(location = 0) out vec4 color;
    void main()
    {
            color = vec4(atan(0.0, -1.0),
                         42.0,
                         length(vec2(1.0, 1.0)),
                         fma(2.0, 3.0, 1.0));
    }
    [test]
    draw rect -1 -1 2 2
    probe all rgba 3.141592653589793 42.0 1.4142135623730951 7.0

    If you want to use SPIR-V instead of GLSL in the source, you can specify the shader in a [fragment shader spirv] section. This is useful to test corner cases of the driver with tweaked shaders that glslang wouldn’t generate. You can get a base for the SPIR-V source by passing the -d option to VkRunner to get the disassembly of the generated SPIR-V. There is an example of this in the repo.


    Although it can already be used for a range of tests, VkRunner still has a lot of missing features compared to shader_runner. It does not support textures, UBOs or SSBOs. Take a look at the README for the complete documentation.

    As I have been creating tests for problems that I encounter with Mesa I’ve been adding them to a branch called tests in the Github repo. I get the impression that Vulkan is somewhat under-tested compared to GL so it might be interesting to use these tests as a base to make a more complete Vulkan test suite. It could also potentially be merged into Piglit.

    UPDATE: Patches to merge VkRunner into Piglit have been posted to the mailing list.

    by nroberts at April 03, 2018 05:10 PM

    Hyunjun Ko

    Improvements for GStreamer Intel-MSDK plugins

    Last November I had a chance to dive into the Intel Media SDK plugins in gst-plugins-bad. It was a very good chance for me to learn how GStreamer elements handle special memory with their own infrastructure, such as GstBufferPool and GstMemory. In this post I’m going to talk about the improvements to the GStreamer Intel MSDK plugins in release 1.14, which is what I’ve been working on since last November.

    First of all, for those not familiar with Media SDK, I’m going to briefly explain what it is. Media SDK (aka MSDK) is the cross-platform API to access Intel’s hardware-accelerated video encoder and decoder functions on Windows and Linux. You can get more information about MSDK here and here.

    But on Linux, so far, it’s hard to set up an environment in which the MSDK driver works. If you want to set up a development environment for MSDK and see what works on Linux, you should follow the steps described in this page. But I would recommend that you refer to Victor’s post, which explains this slightly annoying process very well.

    Additionally, the driver on Linux supports only Skylake as far as I know, which is very disappointing for users of other chipsets. I (and probably you) hope we can work on it without any specific patch or driver (since it is open source!) and without depending on a particular chipset. As far as I know, Intel has been considering how to solve this issue, so we’ll see what happens in the future.

    Back to GStreamer: the GStreamer plugins using MSDK landed in 2016. At that time they worked fine with basic features for playback and encoding, but there were several things to be improved, especially regarding performance.

    Eventually, in the middle of last March, GStreamer 1.14 was released, including improvements to the MSDK plugins, which is what I want to talk about in this post.

    So let me begin now.

    Supports bufferpool and starts using video memory.

    This is a key feature that improves the performance.

    In MSDK, there are two types of memory supported in the driver. One is “System Memory” and the other is “Video Memory”. (There is one more type of memory, called “Opaque Memory”, but I won’t talk about it in this post.)

    System memory is memory allocated in user space, which is the normal case. The plugins already use system memory, which is quite simple to use but not recommended, since its performance is not good enough.

    Video memory is memory used by the hardware acceleration device, also known as the GPU, to hold frames and other types of video data.

    For applications to use video memory, something platform-specific has to be implemented. On Linux, for example, we can use VA-API to handle video memory through MSDK. That is included in the 1.14 release, which means we still need to implement something specific for Windows in the same way.

    To use video memory, I needed to implement GstMSDK(System/Video)Memory to generalize how to access and map the memory the GStreamer way. GstMSDKBufferPool can allocate this type of memory, can be proposed to upstream elements, and can be used in each MSDK element itself. There were lots of problems and arguments during this work, since the design of the MSDK APIs and the GStreamer infrastructure don’t match perfectly.

    You can see the discussion and patches in this issue.

    In addition, when using video memory on Linux, we can use DMABuf by acquiring an fd handle via VA-API at allocation time. Recently this has been done, only for DMABuf export, in this issue, though it’s not included in the 1.14 release.

    Sharing context/session

    For better resource utilization, the MSDK session needs to be shared among the MSDK plugins in a pipeline. An MSDK session maintains the context for the use of any of the decode, encode and convert (VPP) functions. Yes, it’s just like a handle to the driver. One session can run one instance each of decode, encode and convert (VPP).

    So let’s think about a usual transcoding pipeline. It should look like this: src - decoder - converter - encoder - sink. In this case we should use the same MSDK session to decode, convert and encode. Otherwise we would have to copy the data from upstream to work with a different session, because sessions cannot share data, which would make performance much worse.
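To make that pipeline shape concrete, here is a hedged sketch that prints a gst-launch-1.0 pipeline description following the src - decoder - encoder - sink layout. The element choice (an MP4 file with H.264 video, transcoded to H.265 in a Matroska container) and the helper name `msdk_transcode_pipeline` are assumptions for illustration. No explicit converter element is listed, and whether the elements end up sharing one session depends on the plugin version in use.

```shell
#!/bin/sh
# Hypothetical helper: print a transcoding pipeline description for
# gst-launch-1.0 that uses the MSDK decoder and encoder elements.
msdk_transcode_pipeline() {
    input=$1
    output=$2
    # printf reuses the '%s' format for each argument, joining the pieces
    printf '%s' \
        "filesrc location=$input ! qtdemux ! h264parse ! msdkh264dec" \
        " ! msdkh265enc ! h265parse ! matroskamux" \
        " ! filesink location=$output"
    printf '\n'
}
```

A possible invocation is `gst-launch-1.0 $(msdk_transcode_pipeline in.mp4 out.mkv)`. The unquoted substitution is intentional so the shell splits the description into arguments, which also means the file names must not contain spaces.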

    Also, there’s one more thing: MSDK supports joining sessions. If an application wants (or has) to use multiple sessions, it can join them to share data, and we need to support that at the GStreamer level as well.

    All of this can be achieved with GstContext, which provides a way of sharing context not only between elements but also with the application. You can see the patch in this issue, the same one as for the MSDK bufferpool.

    Adds vp8 and mpeg2 decoder.

    Sree has added mpeg2/vc1 decoder and I have added vp8 decoder.

    Supports a number of algorithms and tuning options in encoder

    Encoders now expose a number of rate control algorithms, and more encoder tuning options have been added, such as trellis-quantization (h264), slice size control (h264), B-pyramid prediction (h264), MB-level bitrate control, frame partitioning and adaptive I/B frame insertion. The encoder now also handles force-key-unit events and can insert frame-packing SEIs for side-by-side and top-bottom stereoscopic 3D video.

    All of this has been done by Sree and you can see the details in this issue.

    Capability of encoder’s sinkpad is improved by using VPP.

    MSDK encoders used to accept only NV12 raw data, since the MSDK encoder supports only the NV12 format. But other formats can be handled too if we convert them to NV12 by using VPP, which is also supported in the MSDK driver. This has been done by slomo and I fixed a bug related to it. See this bug for more details.

    You can find all of the patches for the MSDK plugins here.

    As I said at the top of this post, all the MSDK pieces should be open first and should support at least some of the major Intel chipsets, even if not all. For now, the one thing I can say is that the GStreamer MSDK plugins are being improved continuously. We will see what happens in the near future.

    Finally I want to say that Sree and Victor helped me a lot as reviewers and advisers with this work. I really appreciate it.

    Thanks for reading!

    April 03, 2018 03:00 PM

    March 28, 2018

    Víctor Jáquez

    GStreamer VA-API Troubleshooting

    GStreamer VA-API is not a trivial piece of software. Even though, in my opinion, it is a bit over-engineered, the complexity lies in its layered architecture: the user must work out in which layer the failure occurs.

    So, bear in mind this architecture:

    libva architecture

    And the point of failure could be anywhere.


    libva is a library designed to load another library called the driver or back-end. This driver is responsible for talking to the kernel, windowing platform, memory handling library, or any other piece of software or hardware that will actually do the video processing.

    There are many drivers in the wild. As it is an API aimed at stateless video processing, and the industry is moving towards that way of processing video, more drivers are expected to appear in the future.

    Nonetheless, not all the drivers have the same level of maturity, and some of them are abandonware. For this reason we decided in GStreamer VA-API, some time ago, to add a white list of functional drivers: basically, those developed by Mesa3D and this one from Intel™. If you wish to disable that white list, you can do it by setting an environment variable:

    $ export GST_VAAPI_ALL_DRIVERS=1

    Remember, if you set it, you are on your own, since we do not trust the maturity of that driver yet.

    Internal libva↔driver version

    Thus, there is an internal API between libva and the driver, and it is versioned, meaning that the internal API version of the installed libva library must match the internal API exposed by the driver. One reason libva may fail to initialize a driver is that these internal API versions do not match.
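One way to check this by hand is to look at the init symbol a driver exports, since its name encodes the internal API version (the vainfo log further down shows libva finding `__vaDriverInit_1_1`). The sketch below assumes `nm` from binutils is available; the helper name `va_init_versions` and the driver path in the usage line are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: read `nm -D` output on stdin and print the internal
# API version (MAJOR.MINOR) encoded in each __vaDriverInit_* symbol name.
va_init_versions() {
    sed -n 's/.*__vaDriverInit_\([0-9][0-9]*\)_\([0-9][0-9]*\).*/\1.\2/p'
}
```

For instance, `nm -D --defined-only /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so | va_init_versions` prints the version or versions that driver was built for, which can be compared with the VA-API version reported by libva.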

    Drivers path and driver name

    By default there is a path where libva looks for drivers to load. That path is defined at compilation time. Following Debian’s file-system hierarchy standard (FHS), distributions should set it to /usr/lib/x86_64-linux-gnu/dri/. But the user can control this path with an environment variable:

    $ export LIBVA_DRIVERS_PATH=${HOME}/src/intel-vaapi-driver/src/.libs

    The driver path, as a directory, might contain several drivers. libva will try to guess the correct one by querying the instantiated VA display (which could be either KMS/DRM, Wayland, Android or X11). If the user instantiates a VA display different from the one in their running environment, the guess will be wrong and the library loading will fail.

    However, there is a way for the user to set the driver’s name too. Again, by setting an environment variable:

    $ export LIBVA_DRIVER_NAME=iHD

    With this setting, libva will try to load the iHD driver (a new and experimental open source driver from Intel™, targeted for MediaSDK; do not use it yet with GStreamer VAAPI).
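Putting the two variables together, libva builds the file name it tries to open from the search path and the driver name, following the `<name>_drv_video.so` naming used by VA-API drivers. The sketch below only mirrors that composition so the candidates can be inspected; the helper name `va_driver_candidates` is made up, and the default path is the Debian one mentioned above. It assumes the search path may list several directories separated by colons.

```shell
#!/bin/sh
# Hypothetical helper: print the driver files that would be tried for a
# given driver name and a colon-separated list of search directories.
va_driver_candidates() {
    name=$1
    paths=${2:-/usr/lib/x86_64-linux-gnu/dri}

    old_ifs=$IFS
    IFS=:                      # split the search path on colons
    for dir in $paths; do
        printf '%s/%s_drv_video.so\n' "$dir" "$name"
    done
    IFS=$old_ifs
}
```

Running `va_driver_candidates "${LIBVA_DRIVER_NAME:-i965}" "$LIBVA_DRIVERS_PATH"` and then checking each printed file with `ls -l` is a quick way to spot a typo in either variable.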


    vainfo is the diagnostic tool for VA-API. In a few words, it will iterate over a list of VA displays, in a trial-and-error strategy, and try to initialize VA. In case of success, vainfo will report the driver signature, and it will query the driver for the available profiles and entry-points.
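Since vainfo reports each capability as a `profile : entrypoint` pair, its output is easy to filter. As a small sketch, the function below keeps only the profiles that expose the `VAEntrypointVLD` entry point, which is the one used for decoding; the helper name `va_decode_profiles` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read vainfo output on stdin and print the profiles
# that list VAEntrypointVLD (the decoding entry point).
va_decode_profiles() {
    awk -F: '$2 ~ /^[ \t]*VAEntrypointVLD[ \t]*$/ {
        gsub(/[ \t]/, "", $1)   # strip the column padding around the name
        print $1
    }'
}
```

Typical usage would be `vainfo 2>&1 | va_decode_profiles`, which leaves one profile name per line.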

    For example, my skylake board for development will report

    $ vainfo
    error: can't connect to X server!
    libva info: VA-API version 1.1.0
    libva info: va_getDriverName() returns 0
    libva info: Trying to open /home/vjaquez/gst/master/intel-vaapi-driver/src/.libs/
    libva info: Found init function __vaDriverInit_1_1
    libva info: va_openDriver() returns 0
    vainfo: VA-API version: 1.1 (libva 2.1.1.pre1)
    vainfo: Driver version: Intel i965 driver for Intel(R) Skylake - 2.1.1.pre1 (2.1.0-41-g99c3748)
    vainfo: Supported profile and entrypoints
          VAProfileMPEG2Simple            : VAEntrypointVLD
          VAProfileMPEG2Simple            : VAEntrypointEncSlice
          VAProfileMPEG2Main              : VAEntrypointVLD
          VAProfileMPEG2Main              : VAEntrypointEncSlice
          VAProfileH264ConstrainedBaseline: VAEntrypointVLD
          VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
          VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
          VAProfileH264ConstrainedBaseline: VAEntrypointFEI
          VAProfileH264ConstrainedBaseline: VAEntrypointStats
          VAProfileH264Main               : VAEntrypointVLD
          VAProfileH264Main               : VAEntrypointEncSlice
          VAProfileH264Main               : VAEntrypointEncSliceLP
          VAProfileH264Main               : VAEntrypointFEI
          VAProfileH264Main               : VAEntrypointStats
          VAProfileH264High               : VAEntrypointVLD
          VAProfileH264High               : VAEntrypointEncSlice
          VAProfileH264High               : VAEntrypointEncSliceLP
          VAProfileH264High               : VAEntrypointFEI
          VAProfileH264High               : VAEntrypointStats
          VAProfileH264MultiviewHigh      : VAEntrypointVLD
          VAProfileH264MultiviewHigh      : VAEntrypointEncSlice
          VAProfileH264StereoHigh         : VAEntrypointVLD
          VAProfileH264StereoHigh         : VAEntrypointEncSlice
          VAProfileVC1Simple              : VAEntrypointVLD
          VAProfileVC1Main                : VAEntrypointVLD
          VAProfileVC1Advanced            : VAEntrypointVLD
          VAProfileNone                   : VAEntrypointVideoProc
          VAProfileJPEGBaseline           : VAEntrypointVLD
          VAProfileJPEGBaseline           : VAEntrypointEncPicture
          VAProfileVP8Version0_3          : VAEntrypointVLD
          VAProfileVP8Version0_3          : VAEntrypointEncSlice
          VAProfileHEVCMain               : VAEntrypointVLD
          VAProfileHEVCMain               : VAEntrypointEncSlice

    And my AMD board with stable packages replies:

    $ vainfo
    libva info: VA-API version 0.40.0
    libva info: va_getDriverName() returns 0
    libva info: Trying to open /usr/lib64/dri/
    libva info: Found init function __vaDriverInit_0_40
    libva info: va_openDriver() returns 0
    vainfo: VA-API version: 0.40 (libva )
    vainfo: Driver version: mesa gallium vaapi
    vainfo: Supported profile and entrypoints
          VAProfileMPEG2Simple            : VAEntrypointVLD
          VAProfileMPEG2Main              : VAEntrypointVLD
          VAProfileVC1Simple              : VAEntrypointVLD
          VAProfileVC1Main                : VAEntrypointVLD
          VAProfileVC1Advanced            : VAEntrypointVLD
          VAProfileH264ConstrainedBaseline: VAEntrypointVLD
          VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
          VAProfileH264Main               : VAEntrypointVLD
          VAProfileH264Main               : VAEntrypointEncSlice
          VAProfileH264High               : VAEntrypointVLD
          VAProfileH264High               : VAEntrypointEncSlice
          VAProfileNone                   : VAEntrypointVideoProc

    Does this mean that VA-API processes video? No. It means that there is a usable VA display which could open a driver correctly and that libva can extract symbols from it.

    I would like to mention another tool, not official, but I like it a lot, since it extracts almost all of the VA information available in the driver: vadumpcaps.c, written by Mark Thompson.

    GStreamer VA-API registration

    When GStreamer is launched, normally it will register all the available plugins and plugin features (elements, device providers, etc.). All that data is cached and kept until the cache file is deleted or the cache is invalidated by some event.

    At registration time, GStreamer VA-API will instantiate a DRM-based VA display, which works with no need of a real display (in other words, headless), and will query the driver for the profile and entry-point tuples, in order to register only the available elements (encoders, decoders, sink, post-processor). If the DRM VA display fails, a list of VA displays will be tried.

    In the case that libva could not load any driver, or the driver is not in the white-list, GStreamer VA-API will not register any element. Otherwise gst-inspect-1.0 will show the registered elements:

    $ gst-inspect-1.0 vaapi
    Plugin Details:
      Name                     vaapi
      Description              VA-API based elements
      Filename                 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/
      Version                  1.12.4
      License                  LGPL
      Source module            gstreamer-vaapi
      Source release date      2017-12-07
      Binary package           gstreamer-vaapi
      Origin URL     
      vaapijpegdec: VA-API JPEG decoder
      vaapimpeg2dec: VA-API MPEG2 decoder
      vaapih264dec: VA-API H264 decoder
      vaapivc1dec: VA-API VC1 decoder
      vaapivp8dec: VA-API VP8 decoder
      vaapih265dec: VA-API H265 decoder
      vaapipostproc: VA-API video postprocessing
      vaapidecodebin: VA-API Decode Bin
      vaapisink: VA-API sink
      vaapimpeg2enc: VA-API MPEG-2 encoder
      vaapih265enc: VA-API H265 encoder
      vaapijpegenc: VA-API JPEG encoder
      vaapih264enc: VA-API H264 encoder
      13 features:
      +-- 13 elements

    Besides the normal behavior, GStreamer VA-API will also invalidate GStreamer’s cache at every boot, or when any of the mentioned environment variables change.


    A simple task list to review when GStreamer VA-API is not working at all is this:

    #. Check your LIBVA_* environment variables
    #. Verify that vainfo returns sensible information
    #. Invalidate GStreamer’s cache (or just delete the file)
    #. Check the output of gst-inspect-1.0 vaapi
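
    The first item of this list can also be automated. Below is a minimal C sketch that prints the three environment variables mentioned in this post and counts how many are set. The function name report_va_environment is an assumption made for this example; it is not part of libva or GStreamer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* The environment variables mentioned in this post. */
static const char *const va_env_vars[] = {
  "LIBVA_DRIVERS_PATH",
  "LIBVA_DRIVER_NAME",
  "GST_VAAPI_ALL_DRIVERS",
};

#define N_VA_ENV_VARS (sizeof va_env_vars / sizeof va_env_vars[0])

/* Hypothetical helper: print every variable with its value (or
 * "<unset>") and return how many of them are currently set. */
size_t
report_va_environment (FILE *out)
{
  size_t n_set = 0;

  assert (out != NULL);
  for (size_t i = 0; i < N_VA_ENV_VARS; i++)
    {
      const char *value = getenv (va_env_vars[i]);

      fprintf (out, "%s=%s\n", va_env_vars[i], value ? value : "<unset>");
      if (value != NULL)
        n_set++;
    }
  return n_set;
}
```

    Calling report_va_environment (stderr) before starting a pipeline gives a quick record of the settings that influence driver loading.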

    And, if you decide to file a bug in bugzilla, please do not forget to attach the output of vainfo and the logs if the developer asks for them.

    by vjaquez at March 28, 2018 06:11 PM

    March 27, 2018

    Víctor Jáquez

    GStreamer VA-API 1.14: what’s new?

    As you may already know, there is a new release of GStreamer, 1.14. In this blog post we will talk about the new features and improvements of the GStreamer VA-API module, though you have a more comprehensive list of changes in the release notes.

    Most of the topics in this blog post are already mentioned in the release notes, but here they are explained in a bit more detail.

    DMABuf usage

    We have improved DMA-buf’s usage, mostly at downstream.

    In the case of upstream, we just got rid of a nasty hack which detected when to instantiate and use a buffer pool in the sink pad with a dma-buf based allocator. This functionality had already been broken for a while, and that code was the wrong way to enable it. The sharing of a dma-buf based buffer pool with upstream is going to be re-enabled after bug 792034 is merged.

    For downstream, we have added the handling of the memory:DMABuf caps feature. The purpose of this caps feature is to negotiate media when the buffers cannot be mapped onto user space, because of digital rights or platform restrictions.

    For example, currently intel-vaapi-driver doesn’t allow the mapping of the dma-buf descriptors it produces. But, as we cannot know in advance whether a back-end produces map-able dma-buf descriptors or not, gstreamer-vaapi creates a dummy buffer when the allocator is instantiated and tries to map it. If the mapping fails, the memory:DMABuf caps feature is negotiated; otherwise, normal video caps are used.
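
    The probing logic can be sketched in a few lines of C. This is only a model of the decision, not the gstreamer-vaapi code: the callback type, the enum and the two stand-in back-ends are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: tries to map a buffer into user space and
 * reports whether that worked. */
typedef bool (*try_map_fn) (void *buffer);

typedef enum
{
  CAPS_FEATURE_SYSTEM_MEMORY,   /* normal video caps */
  CAPS_FEATURE_DMABUF           /* memory:DMABuf caps feature */
} caps_feature_t;

/* Probe once with a dummy buffer and pick the caps feature: a failed
 * mapping selects memory:DMABuf, a successful one keeps normal caps. */
caps_feature_t
choose_caps_feature (try_map_fn try_map, void *dummy_buffer)
{
  assert (try_map != NULL);
  return try_map (dummy_buffer) ? CAPS_FEATURE_SYSTEM_MEMORY
                                : CAPS_FEATURE_DMABUF;
}

/* Two stand-in back-ends for illustration only. */
bool
mappable_backend (void *buffer)
{
  (void) buffer;
  return true;
}

bool
unmappable_backend (void *buffer)
{
  (void) buffer;
  return false;
}
```

    The point of probing with a dummy buffer is that the decision is taken once, at allocator creation, instead of on every frame.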

    VA-API usage

    First of all, GStreamer VA-API now supports libva-2.0, which means VA-API 1.0. We had to guard some deprecated symbols and the new ones. Nowadays most distributions have upgraded to libva-2.0.

    We have improved the initialization of the VA display internal structure (GstVaapiDisplay). Previously, if an X-based display was instantiated, it immediately tried to grab the screen resolution. Obviously, this broke the usage of headless systems. Now we delay the screen resolution check until the VA display truly requires that information.

    New APIs were added to VA, particularly for log handling. Now it is possible to redirect the log messages into a callback. Thus, we use it to redirect VA-API messages into the GStreamer log mechanisms, uncluttering the console’s output.

    Also, we have blacklisted, in autoconf and meson, libva version 0.99.0, because that version is used internally by the closed-source version of Intel MediaSDK, which is incompatible with official libva. By the way, there is a new open-source version of MediaSDK, but we will talk about it in a future blog post.

    Application VA Display sharing

    Normally, the object GstVaapiDisplay is shared among the pipeline through the GstContext mechanism. But this class is defined internally and has not been exposed to users since release 1.6. This posed a problem when an application wanted to create its own VA Display and share it with an embedded pipeline. The solution is a new context application message, defined as a GstStructure with two fields: va-display with the application’s vaDisplay, and x11-display with the application’s X11 native display. In the future, a Wayland native handle will be processed too. Please note that this context message is only processed by vaapisink.

    One precondition for this solution was the removal of the VA display cache mechanism, a long-standing request from users, which, of course, we did.

    Interoperability with appsink and similar

    A hardware accelerated driver, such as the Intel one, may have custom offsets and strides for specific memory regions. We use GstVideoMeta to set these custom values. The problem comes when downstream does not handle this meta, as is the case with appsink. Then the user expects the “normal” values for those variables, but with GStreamer VA-API and a hardware based driver the values differ, so when the user displays the frame, it is shown corrupted.

    In order to fix this, we have to make a memory copy, from our custom VA-API images to allocated system memory. Of course there is a big CPU penalty, but it is better than delivering wrong video frames. If the user wants better performance, then they should seek a different approach.
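
    To make the cost of that copy concrete, here is a minimal C sketch of copying one image plane with a custom offset and stride into a tightly packed buffer. It is a generic illustration, not the code used in GStreamer VA-API, and the function name copy_plane_packed is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one image plane whose first row begins at `src_offset` and whose
 * rows start `src_stride` bytes apart into a tightly packed destination,
 * where rows are exactly `row_bytes` apart. */
void
copy_plane_packed (uint8_t *dst, const uint8_t *src, size_t src_offset,
                   size_t src_stride, size_t row_bytes, size_t rows)
{
  assert (dst != NULL && src != NULL);
  assert (src_stride >= row_bytes);

  for (size_t y = 0; y < rows; y++)
    memcpy (dst + y * row_bytes, src + src_offset + y * src_stride,
            row_bytes);
}
```

    Every row needs its own memcpy because the padding bytes between rows must be skipped, which is why the penalty grows with the frame size.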

    Resurrection of GstGLUploadTextureMeta for EGL renders

    I know, GstGLUploadTextureMeta must die, right? I am convinced of it. But the Clutter video sink uses it, and it has a vast number of users, so we still have to support it.

    In the last release we removed the support for EGL/Wayland at the last minute because we found a terrible bug just before the release. GLX support has always been there.

    Thanks to Daniel van Vugt’s efforts, we resurrected the support for that meta in EGL. Though I expect the replacement of Clutter sink with glimagesink someday, soon.

    vaapisink demoted in Wayland

    vaapisink was demoted to marginal rank on Wayland because COGL cannot display YUV surfaces.

    This means, by default, vaapisink won’t be auto-plugged when playing in Wayland.

    The reason is that Mutter (aka GNOME) cannot display the frames processed by vaapisink in Wayland. Nonetheless, please note that in Weston, it works just fine.


    We have improved a little bit upstream renegotiation: if the new stream is compatible with the previous one, there is no need to reset the internal parser, with the exception of changes in codec-data.

    low-latency property in H.264

    A new property has been added, only to the H.264 decoder: low-latency. Its purpose is to handle live streams that do not conform to the H.264 specification (sadly there are many in the wild) and need the spec implementation to be tweaked. This property forces the frames in the decoded picture buffer to be pushed as soon as possible.

    base-only property in H.264

    This is the result of the Google Summer of Code 2017, by Orestis Floros. When this property is enabled, all the MVC (Multiview Video Coding) or SVC (Scalable Video Coding) frames are dropped. This is useful if you want to reduce the processing time or if your VA-API driver does not support those kinds of streams.


    In this release we have put a lot of effort into encoders.

    Processing Regions of Interest

    It is possible, for certain back-ends and profiles (for example, H.264 and H.265 encoders with the Intel driver), to specify a set of regions of interest per frame, with a delta-qp per region. This means that we ask for more quality in those regions.

    In order to process regions of interest, upstream must add a list of GstVideoRegionOfInterestMeta to the video frame. The encoder then traverses this list and requests those regions if the VA-API profile, in the driver, supports them.

    The common use-case for this feature is when you want higher definition in regions with faces or text messages in the picture, for example.

    New encoding properties

    • quality-level: For all the available encoders. This is a number between 1 and 8, where a lower number means higher quality (and slower processing).

    • aud: This is for the H.264 encoder only and it is available for certain drivers and platforms. When it is enabled, an AU delimiter is inserted for each encoded frame. This is useful for network streaming, and more particularly for Apple clients.

    • mbbrc: For H.264 only. Controls (auto/on/off) the macro-block bit-rate.

    • temporal-levels: For H.264 only. It specifies the number of temporal levels to include in the hierarchical frame prediction.

    • prediction-type: For H.264 only. It selects the reference picture selection mode.

      The frames are encoded as different layers. A frame in a particular layer will use pictures in a lower or the same layer as references. This means the decoder can drop frames in an upper layer but still decode lower layer frames.

      • hierarchical-p: P frames, except in the top layer, are reference frames. Base layer frames are I or B.

      • hierarchical-b: B frames, except in the topmost layer, are reference frames. All the base layer frames are I or P.

    • refs: Added for H.265 (it was already supported for H.264). It specifies the number of reference pictures.

    • qp-ip and qp-ib: For H.264 and H.265 encoders. They handle the QP (quantization parameter) difference between the I and P frames, and the I and B frames respectively.

    Set media profile via downstream caps

    H.264 and H.265 encoders now can configure the desired media profile through the downstream caps.


    Many thanks to all the contributors and bug reporters.

         1  Daniel van Vugt
        46  Hyunjun Ko
         1  Jan Schmidt
         3  Julien Isorce
         1  Matt Staples
         2  Matteo Valdina
         2  Matthew Waters
         1  Michael Tretter
         4  Nicolas Dufresne
         9  Orestis Floros
         1  Philippe Normand
         4  Sebastian Dröge
        24  Sreerenj Balachandran
         1  Thibault Saunier
        13  Tim-Philipp Müller
         1  Tomas Rataj
         2  U. Artie Eoff
         1  VaL Doroshchuk
       172  Víctor Manuel Jáquez Leal
         3  XuGuangxin
         2  Yi A Wang

    by vjaquez at March 27, 2018 10:52 AM

    March 23, 2018

    Asumu Takikawa

    How to develop VPP plugins

    Recently, my teammate Jessica wrote an excellent intro blog post about VPP. VPP is an open source user-space networking framework that we’re looking into at Igalia. I highly recommend reading Jessica’s post before this post to get acquainted with general VPP ideas.

    In this blog post, I wanted to follow up on that blog post and talk a bit about some of the details of plugin construction in VPP and document some of the internals that took me some time to understand.

    First, I’ll start off by talking about how to architect the flow of packets in and out of a plugin.

    How and where do you add nodes in the graph?

    VPP is based on a graph architecture, in which vectors of packets are transferred between nodes in a graph.

    Here’s an illustration from a VPP presentation of what the graph might look like:

    VPP graph nodes

    (source: VPP overview from DevBoot)

    One thing that took me a while to understand, however, was the mechanics of how nodes in the graph are actually hooked together. I’ll try to explain some of that, starting with the types of nodes that are available.

    Some of the nodes in VPP produce data from a network driver (perhaps backed by DPDK) or some other source for consumption. Other nodes are responsible for manipulating incoming data in the graph and possibly producing some output.

    The former are called “input” nodes and are called on each main loop iteration. The latter are called “internal” nodes. The type of node is specified using the vlib_node_type_t type when declaring a node using VLIB_REGISTER_NODE.

    There’s also another type of node that is useful, which is the “process” node. This exposes a thread-like behavior, in which the node’s callback routine can be suspended and reanimated based on events or a timer. This is useful for sending periodic notifications or polling some data that’s managed by another node.

    Since input nodes are the start of the graph, they are responsible for generating packets from some source like a NIC or pcap file and injecting them into the rest of the graph.

    For internal nodes, the packets have to come from somewhere else. There seem to be many ways to specify how an internal node gets its packets.

    For example, it’s possible to tell VPP to direct all packets with a specific ethertype or IP protocol to a node that you write (see this wiki page for some of these options).

    Another method, that’s convenient for use in external plugins, is using “feature arcs”. VPP comes with abstract concepts called “features” (which are distinct from actual nodes) that essentially form an overlay over the concrete graph of nodes.

    These features can be enabled or disabled on a per-interface basis.

    You can hook your own node into these feature arcs by using VNET_FEATURE_INIT:

    /**
     * @brief Hook the sample plugin into the VPP graph hierarchy.
     */
    VNET_FEATURE_INIT (sample, static) =
    {
      .arc_name = "device-input",
      .node_name = "sample",
      .runs_before = VNET_FEATURES ("ethernet-input"),
    };

    This code from the sample plugin is setting the sample node to run on the device-input arc. For a given arc like this, the VPP dispatcher will run the code for the start node on the arc and then run all of the features that are registered on that arc. This ordering is determined by the .runs_before clauses in those features.

    Feature arcs themselves are defined via macro:

    VNET_FEATURE_ARC_INIT (device_input, static) =
    {
      .arc_name  = "device-input",
      .start_nodes = VNET_FEATURES ("device-input"),
      .arc_index_ptr = &feature_main.device_input_feature_arc_index,
    };

    As you can see, arcs have a start node. Inside the start node there is typically some code that checks if features are present on the node and, if the check succeeds, will designate the next node as one of those feature nodes. For example, this snippet from the DPDK plugin:

    /* Do we have any driver RX features configured on the interface? */
    vnet_feature_start_device_input_x4 (xd->vlib_sw_if_index,
                        &next0, &next1, &next2, &next3,
                        b0, b1, b2, b3);

    From this snippet, you can see how the process of getting data through feature arcs is kicked off. In the next section, you’ll see how internal nodes in the middle of a feature arc can continue the chain.

    BTW: the methods of hooking up your node into the graph I’ve described here aren’t exhaustive. For example, there’s also something called DPOs (“data path object”) that I haven’t covered at all. Maybe a topic for a future blog post (once I understand them!).

    Specifying the next node

    I went over a few methods for specifying how packets get to a node, but you may also wonder how packets get directed out of a node. There are several ways to set that up too.

    When declaring a VPP node using VLIB_REGISTER_NODE, you can provide the .next_nodes argument (along with .n_next_nodes) to specify an indexed list of next nodes in the graph. Then you can use the next node indices in the node callback to direct packets to one of the next nodes.

    The sample plugin sets it up like this:

    VLIB_REGISTER_NODE (sample_node) = {
      /* ... elided for example ... */
      .n_next_nodes = SAMPLE_N_NEXT,
      /* edit / add dispositions here */
      .next_nodes = {
            [SAMPLE_NEXT_INTERFACE_OUTPUT] = "interface-output",
      },
    };

    This declaration sets up the only next node as interface-output, which is the node that selects some hardware interface to transmit on.

    Alternatively, it’s possible to programmatically fetch a next node index by name using vlib_get_node_by_name:

    ip4_lookup_node = vlib_get_node_by_name (vm, (u8 *) "ip4-lookup");
    ip4_lookup_node_index = ip4_lookup_node->index;

    This excerpt is from the VPP IPFIX code that sends report packets periodically. This kind of approach seems more common in process nodes.
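
    Both approaches, a static indexed table and a lookup by name, can be modelled without VPP in plain C. The sketch below is a toy: every toy_ identifier is invented for this example, and the second disposition ("error-drop") is only there so the table has more than one entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Next-node indices for a toy node (names are hypothetical). */
typedef enum
{
  TOY_NEXT_INTERFACE_OUTPUT,
  TOY_NEXT_DROP,
  TOY_N_NEXT
} toy_next_t;

/* Index -> node name table, playing the role of .next_nodes. */
static const char *const toy_next_nodes[TOY_N_NEXT] = {
  [TOY_NEXT_INTERFACE_OUTPUT] = "interface-output",
  [TOY_NEXT_DROP] = "error-drop",
};

/* Toy node callback: pick the next index for one packet. */
toy_next_t
toy_node_fn (size_t packet_len)
{
  return packet_len == 0 ? TOY_NEXT_DROP : TOY_NEXT_INTERFACE_OUTPUT;
}

/* Lookup by name, loosely analogous to resolving a node at run time.
 * Returns TOY_N_NEXT when the name is unknown. */
toy_next_t
toy_next_by_name (const char *name)
{
  assert (name != NULL);
  for (int i = 0; i < TOY_N_NEXT; i++)
    if (strcmp (toy_next_nodes[i], name) == 0)
      return (toy_next_t) i;
  return TOY_N_NEXT;
}
```

    The static table is cheap at dispatch time because the callback only hands back a small integer, while the lookup by name pays a string comparison and is therefore better done once, outside the per-packet path.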

    When using feature arcs, a common approach is to use vnet_feature_next to select a next node based on the feature mechanism:

    /* Setup packet for next IP feature */
    vnet_feature_next(vnet_buffer(b0)->sw_if_index[VLIB_RX], &next0, b0);
    vnet_feature_next(vnet_buffer(b1)->sw_if_index[VLIB_RX], &next1, b1);

    In this example, the next0 and next1 variables are being set by vnet_feature_next based on the appropriate next features in the feature arc.

    (vnet_feature_next is a convenience function that uses vnet_get_config_data under the hood as described in this mailing list post if you want to know more about the mechanics)

    Though even when using feature arcs, it doesn’t seem mandatory to use vnet_feature_next. The sample plugin does not, for example.

    Buffer metadata

    In addition to packets, nodes can also pass some metadata around to each other. Specifically, there is some “opaque” space reserved in the vlib_buffer_t structure that stores the packet data:

    typedef struct {
      /* ... elided ... */
      u32 opaque[10]; /**< Opaque data used by sub-graphs for their own purposes.
                        See .../vnet/vnet/buffer.h */
      /* ... */
    } vlib_buffer_t;

    (source link)

    For networking purposes, the vnet library provides a data type vnet_buffer_opaque_t that is stored in the opaque field. This contains data useful to networking layers like Ethernet, IP, TCP, and so on.

    The vnet data is unioned (that is, data for one layer may be overlaid on data for another) and it’s expected that graph nodes make sure not to stomp on data that needs to be set in their next nodes. You can see how the data intersects in this slide from DevBoot.

    One of the first fields stored in the opaque data is sw_if_index, which is set by the driver nodes initially. The field stores a 2-element array of input and output interface indices for a buffer.

    One of the things that puzzled me for a bit was precisely how the sw_if_index field, which is very commonly read and set in VPP plugins, is used. Especially the TX half of the pair, which is more commonly manipulated. It looks like the field is often set to instruct VPP to choose a transmit interface for a packet.

    For example, in the sample plugin code the index is set in the following way:

    sw_if_index0 = vnet_buffer(b0)->sw_if_index[VLIB_RX];
    /* Send pkt back out the RX interface */
    vnet_buffer(b0)->sw_if_index[VLIB_TX] = sw_if_index0;

    What this is doing is getting the interface index where the packet was received and then setting the output index to be the same. This tells VPP to send off the packet to the same interface it was received from.

    (This is explained in one of the VPP videos.)

    For a process node that’s sending off new packets, you might set the indices like this:

    vnet_buffer(b0)->sw_if_index[VLIB_RX] = 0;
    vnet_buffer(b0)->sw_if_index[VLIB_TX] = ~0;

    The receive index is set to 0, which means the local0 interface that is always configured. The use of ~0 means “don’t know yet”. The effect of that is (see the docs for ip4-lookup) that a lookup is done in the FIB attached to the receive interface, which is the local0 one in this case.

    Some code inside VPP instead sets the transmit index to some user configurable fib_index to let the user specify a destination (often defaulting to ~0 if the user specified nothing):

    vnet_buffer(b0)->sw_if_index[VLIB_RX] = 0;
    vnet_buffer(b0)->sw_if_index[VLIB_TX] = fib_index;
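
    The rule described in the last few paragraphs can be summarized in a toy C function. This is my own model of the behavior as described here, not code from VPP: the function name, the TOY_RX/TOY_TX constants and the interface-to-FIB table passed as a parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_RX = 0, TOY_TX = 1 };

/* Toy model: if the TX slot holds ~0 ("don't know yet"), use the FIB
 * attached to the RX interface; otherwise treat the TX slot as an
 * explicit FIB index. `fib_of_interface` maps an interface index to
 * the FIB attached to it (hypothetical table). */
uint32_t
toy_select_fib_index (const uint32_t sw_if_index[2],
                      const uint32_t *fib_of_interface,
                      size_t n_interfaces)
{
  assert (sw_if_index != NULL && fib_of_interface != NULL);

  if (sw_if_index[TOY_TX] != ~(uint32_t) 0)
    return sw_if_index[TOY_TX];

  assert (sw_if_index[TOY_RX] < n_interfaces);
  return fib_of_interface[sw_if_index[TOY_RX]];
}
```

    For a buffer with RX set to 0 and TX set to ~0, this picks the FIB of interface 0, matching the local0 case above.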

    VPP libraries and other resources

    Finally, I want to briefly talk about some additional resources for learning about VPP details.

    There are numerous useful libraries that come with VPP, such as data structures (e.g., vectors and hash-tables), formatting, and logging. These are contained in source directories like vlib and vppinfra.

    I found these slides on VPP libraries and its core infrastructure quite helpful. It’s from a workshop that was held in Paris. The slides give an overview of various programming components provided in VPP, and then you can look up the detailed APIs in the VPP docs.

    There are also a number of other useful pages on the VPP wiki. If you prefer learning from videos, there are some good in-depth video tutorials on VPP from the official team too.

    Since I’m a VPP newbie myself, you may want to carefully look at the source code and make sure what I’m saying is accurate. Please do let me know if you spot any inaccuracies in the post.

    by Asumu Takikawa at March 23, 2018 02:10 AM

    March 21, 2018

    José Dapena

    Updated Chromium Legacy Wayland Support


    The future Ozone Wayland backend is still not ready for shipping, so we are announcing the release of an updated Ozone Wayland backend for Chromium, based on the implementation provided by Intel. It is rebased on top of the latest stable Chromium release and you can find it in my team’s GitHub. We hope you will appreciate it.

    Official Chromium on Linux desktop nowadays

    Linux desktop is progressively migrating to use Wayland as the display server. It is the default option in Fedora, Ubuntu ~~and, more importantly, the next Ubuntu Long Term Support release will ship Gnome Shell Wayland display server by default~~ (P.S. since this post was originally written, Ubuntu has delayed the Wayland adoption for LTS).

    As of now, the Chromium browser for the Linux desktop is based on X11. This means it natively interacts with an X server and with its XDG extensions for displaying the contents and receiving user events. But, as said, the next generation of Linux desktops will use Wayland display servers instead of X11. How does Chromium work there? By using the XWayland server, a full X11 server built on top of the Wayland protocol. OK, but that has an impact on performance. Chromium needs to communicate with and paint to X11-provided buffers, and then those buffers need to be shared with the Wayland display server. The user events also need to be proxied from the Wayland display server through the XWayland server and the X11 protocol. This requires more resources (more memory, CPU, and GPU) and adds more latency to the communication.


    Chromium officially supports several platforms (Windows, Android, Linux desktop, iOS), and it provides abstractions for porting it to other platforms.

    The set of abstractions is named Ozone (more info here). It allows implementing one or more platform components with the hooks for properly integrating with a platform that is in the set of officially supported targets. Among other things it provides abstractions for:
    * Obtaining accelerated surfaces.
    * Creating and obtaining windows to paint the contents.
    * Interacting with the desktop cursor.
    * Receiving user events.
    * Interacting with the window manager.

    Chromium and Wayland (2014-2016)

    Even if Wayland was not used on Linux desktop, a bunch of embedded devices have been using Wayland for their display server for quite some time. LG has been shipping a full Wayland experience on the webOS TV products.

    In the last 4 years, Intel has been providing an implementation of Ozone abstractions for Wayland. It was amazing work that allowed running the Chromium browser on top of a Wayland compositor. This backend has been the de facto standard for running the Chromium browser on all these Wayland-enabled embedded devices.

    But the development of this implementation has mostly stopped around Chromium 49 (though rebases on top of Chromium 51 and 53 have been provided).

    Chromium and Wayland (2018+)

    Since the end of 2016, Igalia has been involved in several initiatives to allow Chromium to run natively in Wayland. Even if this work is based on the original Ozone Wayland backend by Intel, it is mostly a rewrite and adaptation to the future graphics architecture in Chromium (Viz and Mus).

    This is being developed in the Igalia GitHub, downstream, though it is expected to be landed upstream progressively. Hopefully, at some point in 2018, this new backend will be fully ready for shipping products with it. But we are still not there. ~~Some major missing parts are Wayland TextInput protocol and content shell support~~ (P.S. since this was written, both TextInput and content shell support are working now!).

    More information on these posts from the authors:
    * June 2016: Understanding Chromium’s runtime ozone platform selection (by Antonio Gomes).
    * October 2016: Analysis of Ozone Wayland (by Frédéric Wang).
    * November 2016: Chromium, ozone, wayland and beyond (by Antonio Gomes).
    * December 2016: Chromium on R-Car M3 & AGL/Wayland (by Frédéric Wang).
    * February 2017: Mus Window System (by Frédéric Wang).
    * May 2017: Chromium Mus/Ozone update (H1/2017): wayland, x11 (by Antonio Gomes).
    * June 2017: Running Chromium m60 on R-Car M3 board & AGL/Wayland (by Maksim Sisov).

    Releasing legacy Ozone Wayland backend (2017-2018)

    OK, so the new Wayland backend is still not ready in some cases, and the old one is unmaintained. For that reason, LG is announcing the release of an updated legacy Ozone Wayland backend. It is essentially the original Intel backend, but ported to current Chromium stable.

    Why? Because we want to provide a migration path to the future Ozone Wayland backend. And because we want to share this effort with other developers, willing to run Chromium in Wayland immediately, or that are still using the old backend and cannot immediately migrate to the new one.

    WARNING If you are starting development for a product that is going to happen in 1-2 years… Very likely your best option is already migrating now to the new Ozone Wayland backend (and help with the missing bits). We will stop maintaining it ourselves once new Ozone Wayland backend lands upstream and covers all our needs.

    What does this port include?
    * Rebased on top of Chromium m60, m61, m62 and m63.
    * Ported to GN.
    * It already includes some changes to adapt to the new Ozone Wayland refactors.

    It is hosted at

    Enjoy it!

    Originally published at webOS Open Source Edition Blog and licensed under Creative Commons Attribution 4.0.

    by José Dapena Paz at March 21, 2018 07:47 AM

    March 20, 2018

    Iago Toral

    Improving shader performance with Vulkan’s specialization constants

    For some time now I have been working on and off on a personal project with no other purpose than toying a bit with Vulkan and some rendering and shading techniques. Although I’ll probably write about that at some point, in this post I want to focus on Vulkan’s specialization constants and how they can provide a very visible performance boost when they are used properly, as I had the chance to verify while working on this project.

    The concept behind specialization constants is very simple: they allow applications to set the value of a shader constant at run-time. At first sight, this might not look like much, but it can have very important implications for certain shaders. To showcase this, let’s take the following snippet from a fragment shader as a case study:

    layout(push_constant) uniform pcb {
       int num_samples;
    } PCB;
    const int MAX_SAMPLES = 64;
    layout(set = 0, binding = 0) uniform SamplesUBO {
       vec3 samples[MAX_SAMPLES];
    } S;
    void main()
    {
       for(int i = 0; i < PCB.num_samples; ++i) {
          vec3 sample_i = S.samples[i];
          // ... (per-sample work elided)
       }
    }

    That snippet is taken from a Screen Space Ambient Occlusion shader that I implemented in my project. SSAO is a popular technique used in a lot of games, so it represents a real-world scenario. As we can see, the process involves a set of vector samples, passed to the shader as a UBO, that are processed for each fragment in a loop. We have made the maximum number of samples that the shader can use large enough to accommodate a high-quality scenario, but the actual number of samples used in a particular execution is taken from a push constant uniform, so the application can choose the quality / performance balance it wants.

    While the code snippet may look trivial enough, let’s see how it interacts with the shader compiler:

    The first obvious issue with this implementation is that it prevents loop unrolling, because the actual number of samples to use is unknown at shader compile time. At most, the compiler could guess that it can’t be more than 64, but that number of iterations would still be too large for Mesa to unroll the loop in any case. If the application is configured to use only 24 or 32 samples (the value of our push constant uniform at run-time), that number of iterations would be small enough for Mesa to unroll the loop, if only it were known at shader compile time. In that scenario we lose the optimization just because we are using a push constant uniform instead of a constant for the sake of flexibility.

    The second issue, which might be less immediately obvious and yet is the most significant one, is that if the shader compiler can tell that the samples array is small enough, it can promote the UBO array to a push constant. This means that each access to S.samples[i] turns from an expensive memory fetch into a direct register access for each sample. To put this in perspective, if we are rendering to a full HD target using 24 samples per fragment, we would be saving ourselves 1920x1080x24 memory reads per frame, for a very visible performance gain. But again, we would be losing this optimization because we decided to use a push constant uniform.
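To make the 1920x1080x24 figure concrete, here is a small back-of-the-envelope sketch. It assumes exactly one shaded fragment per pixel (no overdraw, no multisampling) and that every fragment runs the full sample loop; the function name is only illustrative.

```python
def ubo_reads_per_frame(width: int, height: int, samples: int) -> int:
    """Count the S.samples[i] fetches if every fragment runs the full loop.

    Assumption: one fragment per pixel, so fragments == width * height.
    """
    return width * height * samples


full_hd_24 = ubo_reads_per_frame(1920, 1080, 24)
print(f"{full_hd_24:,} reads per frame")  # 49,766,400 reads per frame
```

That is close to 50 million fetches per frame that could become register accesses once the array is small enough to be promoted.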

    Vulkan’s specialization constants allow us to get back these performance optimizations without sacrificing the flexibility we implemented in the shader. To do this, the API provides mechanisms to specify the values of the constants at run-time, but before the shader is compiled.

    Continuing with the shader snippet we showed above, here is how it can be rewritten to take advantage of specialization constants:

    layout (constant_id = 0) const int NUM_SAMPLES = 64;
    layout(std140, set = 0, binding = 0) uniform SamplesUBO {
       vec3 samples[NUM_SAMPLES];
    } S;
    void main()
    {
       for(int i = 0; i < NUM_SAMPLES; ++i) {
          vec3 sample_i = S.samples[i];
          // ... (per-sample work elided)
       }
    }

    We are now informing the shader that we have a specialization constant NUM_SAMPLES, which represents the actual number of samples to use. By default (if the application doesn’t say otherwise), the specialization constant’s value is 64. However, now that we have a specialization constant in place, we can have the application set its value at run-time, like this:

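    VkSpecializationMapEntry entry = { 0, 0, sizeof(int32_t) };  /* constantID, offset, size */
    VkSpecializationInfo spec_info = {
       1, &entry, sizeof(int32_t), &config.ssao.num_samples      /* mapEntryCount, pMapEntries, dataSize, pData */
    };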

    The application code above sets up specialization constant information for shader consumption at run-time. This is done via an array of VkSpecializationMapEntry entries, each one determining where to fetch the value for a specialization constant declared in the shader whose default value we want to override. In our case, we have a single specialization constant (with id 0), and we take its value (of integer type) from offset 0 of a buffer. Since there is only one constant, our buffer is just the address of the variable holding the constant’s value (config.ssao.num_samples). When we create the Vulkan pipeline, we pass this specialization information using the pSpecializationInfo field of VkPipelineShaderStageCreateInfo. At that point, the driver will override the default value of the specialization constant with the value provided here before the shader code is optimized and native GPU code is generated, which allows the driver compiler backend to generate optimal code.
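The lookup that the map entries describe can be modelled in a few lines. This is only a sketch of the idea, not driver code: MapEntry mirrors the three fields of VkSpecializationMapEntry, while resolve_constants, the restriction to 4-byte signed integers and the little-endian byte order are assumptions made for illustration.

```python
import struct
from typing import Dict, Iterable, NamedTuple


class MapEntry(NamedTuple):
    """Mirrors the three fields of VkSpecializationMapEntry."""
    constant_id: int
    offset: int
    size: int


def resolve_constants(entries: Iterable[MapEntry], data: bytes,
                      defaults: Dict[int, int]) -> Dict[int, int]:
    """Return {constant_id: value} after applying the overrides found in data.

    Simplification: only 4-byte signed integers, little-endian byte order.
    """
    resolved = dict(defaults)
    for entry in entries:
        if entry.constant_id not in resolved:
            continue  # the shader declares no constant with this id
        if entry.size != 4 or entry.offset + entry.size > len(data):
            raise ValueError(f"bad map entry: {entry}")
        resolved[entry.constant_id] = struct.unpack_from("<i", data, entry.offset)[0]
    return resolved


# NUM_SAMPLES has constant_id 0 and defaults to 64 in the shader.
shader_defaults = {0: 64}
# Stands in for the int32_t stored at &config.ssao.num_samples.
data = struct.pack("<i", 24)

overridden = resolve_constants([MapEntry(0, 0, 4)], data, shader_defaults)
untouched = resolve_constants([], b"", shader_defaults)
print(overridden, untouched)  # {0: 24} {0: 64}
```

With no map entries the shader keeps its default of 64; with the single entry, the value read from offset 0 replaces it before the shader is optimized.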

    It is important to remark that specialization takes place when we create the pipeline, since that is the only moment at which Vulkan drivers compile shaders. This makes specialization constants particularly useful when we know the value we want to use ahead of starting the rendering loop, for example when we are applying quality settings to shaders. However, if the value of the constant changes frequently, specialization constants are not useful, since they require expensive shader re-compiles every time we want to change their value, and we want to avoid that as much as possible in our rendering loop. Nevertheless, it is possible to compile the same shader with different constant values in different pipelines, so even if a value changes often, so long as we have a finite number of combinations, we can generate optimized pipelines for each one ahead of the start of the rendering loop and just swap pipelines as needed while rendering.
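The "one pipeline per value, built ahead of time" idea can be sketched as a small cache. Everything here is hypothetical scaffolding: fake_build_pipeline stands in for the expensive Vulkan pipeline creation with a given NUM_SAMPLES, and the set of quality levels is made up.

```python
from typing import Callable, Dict, Iterable, List


class PipelineCache:
    """Builds one pipeline per constant value up front, then only looks them up."""

    def __init__(self, build: Callable[[int], str], values: Iterable[int]):
        # All the expensive work happens here, before the rendering loop starts.
        self._pipelines: Dict[int, str] = {v: build(v) for v in values}

    def get(self, value: int) -> str:
        # Cheap dictionary lookup, safe to call every frame.
        return self._pipelines[value]


build_calls: List[int] = []


def fake_build_pipeline(num_samples: int) -> str:
    """Hypothetical stand-in for creating a pipeline specialized for num_samples."""
    build_calls.append(num_samples)
    return f"ssao-pipeline-{num_samples}"


cache = PipelineCache(fake_build_pipeline, [8, 24, 32, 64])

# Simulated rendering loop: the quality setting changes, nothing is rebuilt.
used = [cache.get(quality) for quality in (24, 64, 24)]
print(used)
print(build_calls)
```

The trade-off is memory and start-up time for every combination you pre-build, which is why this only works when the set of values is finite and small.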


    Specialization constants are a straightforward yet powerful way to gain control over how shader compilers optimize your code. In my particular pet project, applying specialization constants in a small number of shaders allowed me to benefit from loop unrolling and, most importantly, UBO promotion to push constants in the SSAO pass, obtaining performance improvements that ranged from 10% up to 20% depending on the configuration.

    Finally, although the above covered specialization constants from the point of view of Vulkan, this is really a feature of the SPIR-V language, so it is also available in OpenGL with the GL_ARB_gl_spirv extension, which is core since OpenGL 4.6.

    by Iago Toral at March 20, 2018 04:45 PM

    March 19, 2018

    Philippe Normand

    GStreamer’s playbin3 overview for application developers

    Multimedia applications based on GStreamer usually handle playback with the playbin element. I recently added support for playbin3 in WebKit. This post aims to document the changes needed on application side to support this new generation flavour of playbin.

    So, first off, why is it named playbin3 anyway? The GStreamer 0.10.x series had a playbin element but a first rewrite (playbin2) made it obsolete in the GStreamer 1.x series. So playbin2 was renamed to playbin. That’s why a second rewrite is nicknamed playbin3, I suppose :)

    Why should you care about playbin3? Playbin3 (and the elements it’s using internally: parsebin, decodebin3, uridecodebin3 among others) is the result of a deep re-design of playbin2 (along with decodebin2 and uridecodebin) to better support:

    • gapless playback
    • audio cross-fading support (not yet implemented)
    • adaptive streaming
    • reduced CPU, memory and I/O resource usage
    • faster stream switching and full control over the stream selection process

    This work was carried out mostly by Edward Hervey, who presented it in detail at 3 GStreamer conferences. If you want to learn more about this and the internals of playbin3, make sure to watch his awesome presentations at the 2015 gst-conf, 2016 gst-conf and 2017 gst-conf.

    Playbin3 was added in GStreamer 1.10. It is still considered experimental but in my experience it already works very well. Just keep in mind you should use at least the latest GStreamer 1.12 (or even the upcoming 1.14) release before reporting any issue in Bugzilla. Playbin3 is not a drop-in replacement for playbin: both elements share only a subset of GObject properties and signals. However, if you don’t want to modify your application source code just yet, it’s very easy to try playbin3 anyway:

    $ USE_PLAYBIN3=1 my-playbin-based-app

    Setting the USE_PLAYBIN3 environment variable enables a code path inside the GStreamer playback plugin which swaps the playbin element for the playbin3 element. This trick gives the laziest among us a glimpse of the playbin3 element :) The problem is that, depending on your use of playbin, you might get runtime warnings. Here’s an example with the Totem player:
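The swap described above boils down to an environment check. The sketch below only illustrates that idea and is not the plugin's actual code: playback_element_name is a hypothetical helper, and treating any non-empty value of USE_PLAYBIN3 as "enabled" is an assumption.

```python
import os
from typing import Mapping


def playback_element_name(environ: Mapping[str, str] = os.environ) -> str:
    """Pick the element to instantiate, honouring USE_PLAYBIN3.

    Assumption: any non-empty value of the variable enables playbin3.
    """
    return "playbin3" if environ.get("USE_PLAYBIN3") else "playbin"


print(playback_element_name({"USE_PLAYBIN3": "1"}))  # playbin3
print(playback_element_name({}))                     # playbin
```

An application that creates its pipeline by element name could use the same pattern to offer an opt-in while it is still being ported.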

    $ USE_PLAYBIN3=1 totem ~/Videos/Agent327.mp4
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'video-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'audio-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'text-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'video-tags-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'audio-tags-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'text-tags-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    sys:1: Warning: g_object_get_is_valid_property: object class 'GstPlayBin3' has no property named 'n-audio'
    sys:1: Warning: g_object_get_is_valid_property: object class 'GstPlayBin3' has no property named 'n-text'
    sys:1: Warning: ../../../../gobject/gsignal.c:3492: signal name 'get-video-pad' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'

    As mentioned previously, playbin and playbin3 don’t share the same set of GObject properties and signals, so some changes in your application are required in order to use playbin3.

    If your application is based on the GstPlayer library then you should set the GST_PLAYER_USE_PLAYBIN3 environment variable. GstPlayer already handles both playbin and playbin3, so no changes are needed in your application if you use GstPlayer!

    Ok, so what if your application relies directly on playbin? Some changes are needed! If you previously used playbin stream selection properties and signals, you will now need to handle the GstStream and GstStreamCollection APIs. Playbin3 will emit a stream collection message on the bus. This is very nice because the collection includes information (metadata!) about the streams (or tracks) the media asset contains. In playbin this was handled with a bunch of signals (audio-tags-changed, audio-changed, etc), properties (n-audio, n-video, etc) and action signals (get-audio-tags, get-audio-pad, etc). The new GstStream API provides a centralized and non-playbin-specific access point for all this information. To select streams with playbin3 you now need to send a select_streams event so that the demuxer knows exactly which streams should be exposed to downstream elements. That means potentially improved performance! Once playbin3 has completed the stream selection it will emit a streams selected message. The application should handle this message and potentially update its internal state about the selected streams. This is also the best moment to update your UI regarding the selected streams (like audio track language, video track dimensions, etc).
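The application-side decision in that flow (collection in, list of stream ids out) can be sketched without GStreamer at all. In a real application the collection would come from the stream collection message (for example via gst_message_parse_stream_collection) and the resulting ids would be passed to gst_event_new_select_streams. The Stream tuple, choose_stream_ids and the selection policy below are hypothetical and only model that step.

```python
from typing import List, NamedTuple, Optional, Sequence


class Stream(NamedTuple):
    """Simplified stand-in for the metadata a GstStream exposes."""
    stream_id: str
    kind: str                      # "audio", "video" or "text"
    language: Optional[str] = None


def choose_stream_ids(collection: Sequence[Stream], audio_language: str) -> List[str]:
    """Pick the first video stream plus the audio stream in the wanted language.

    Falls back to the first audio stream when no language matches.
    """
    chosen: List[str] = []
    videos = [s for s in collection if s.kind == "video"]
    audios = [s for s in collection if s.kind == "audio"]
    if videos:
        chosen.append(videos[0].stream_id)
    if audios:
        match = next((s for s in audios if s.language == audio_language), audios[0])
        chosen.append(match.stream_id)
    return chosen


collection = [
    Stream("v0", "video"),
    Stream("a-en", "audio", "en"),
    Stream("a-fr", "audio", "fr"),
    Stream("t-en", "text", "en"),
]

print(choose_stream_ids(collection, "fr"))  # ['v0', 'a-fr']
print(choose_stream_ids(collection, "de"))  # ['v0', 'a-en']
```

Streams left out of the list (here the subtitle track and the other audio track) are the ones the demuxer no longer needs to expose downstream, which is where the potential performance gain comes from.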

    Another small difference between playbin and playbin3 is about the source element setup. In playbin there is a source read-only GObject property and a source-setup GObject signal. In playbin3 only the latter is available, so your application should rely on source-setup instead of the notify::source GObject signal.

    The gst-play-1.0 playback utility program already supports playbin3 so it provides a good source of inspiration if you consider porting your application to playbin3. As mentioned at the beginning of this post, WebKit also now supports playbin3, however it needs to be enabled at build time using the CMake -DUSE_GSTREAMER_PLAYBIN3=ON option. This feature is not part of the WebKitGTK+ 2.20 series but should be shipped in 2.22. As a final note I wanted to acknowledge my favorite worker-owned coop Igalia for allowing me to work on this WebKit feature and also our friends over at Centricular for all the quality work on playbin3.

    by Philippe Normand at March 19, 2018 07:13 AM

    March 18, 2018

    Philippe Normand

    Moving to Pelican

    Time for a change! Almost 10 years ago I was starting to hack on a Blog engine with two friends, it was called Alinea and it powered this website for a long time. Back then hacking on your own Blog engine was the pre-requirement to host your blog :) But nowadays people just use Wordpress or similar platforms, if they still have a blog at all. Alinea fell into oblivion as I didn’t have time and motivation to maintain it.

    Moving to Pelican was quite easy: since I had been writing content in ReST on the previous blog, I only had to pull data from the database and hack a bit :)

    Now that this website looks almost modern I’ll hopefully start to blog again, expect at least news about the WebKit work I still enjoy doing at Igalia.

    by Philippe Normand at March 18, 2018 09:18 AM

    The GNOME-Shell Gajim extension maintenance

    Back in January 2011 I wrote a GNOME-Shell extension allowing Gajim users to carry on with their chats using the Empathy infrastructure and UI present in the Shell. For some time the extension was also part of the official gnome-shell-extensions module and then I had to move it to Github as a standalone extension. Sadly I stopped using Gajim a few years ago and my interest in maintaining this extension has decreased quite a lot.

    I don’t know if this extension is actively used by anyone beyond the few bugs reported in Github, so this is a call for help. If anyone still uses this extension and wants it supported in future versions of GNOME-Shell, please send me a mail so I can transfer ownership of the Github repository and see what I can do for the page as well.

    (Huh, also. Hi blogosphere again! My last post was in 2014 it seems :))

    by Philippe Normand at March 18, 2018 09:18 AM

    Web Engines Hackfest 2014

    Last week I attended the Web Engines Hackfest. The event was sponsored by Igalia (also hosting the event), Adobe and Collabora.

    As usual I spent most of the time working on the WebKitGTK+ GStreamer backend. Sebastian Dröge kindly joined and helped out quite a bit, so make sure to read his post about the event!

    We first worked on the WebAudio GStreamer backend, Sebastian cleaned up various parts of the code, including the playback pipeline and the source element we use to bridge the WebCore AudioBus with the playback pipeline. On my side I finished the AudioSourceProvider patch that was abandoned for a few months (years) in Bugzilla. It’s an interesting feature to have so that web apps can use the WebAudio API with raw audio coming from Media elements.

    I also hacked on GstGL support for video rendering. It’s quite interesting to be able to share the GL context of WebKit with GStreamer! The patch is not ready yet for landing but thanks to the reviews from Sebastian, Mathew Waters and Julien Isorce I’ll improve it and hopefully commit it soon in WebKit ToT.

    Sebastian also worked on Media Source Extensions support. We had a very basic, non-working, backend that required… a rewrite, basically :) I hope we will have this reworked backend soon in trunk. Sebastian already has it working on Youtube!

    The event was interesting in general, with discussions about rendering engines, rendering and JavaScript.

    by Philippe Normand at March 18, 2018 09:18 AM