Planet Igalia

April 17, 2018

Víctor Jáquez

How to setup a gst-build environment with Intel’s VA-API stack

gst-build is far superior to the gst-uninstalled scripts for developing GStreamer, mainly because of its use of Meson and Ninja. Nonetheless, integrating external dependencies is not as easy as it is with gst-uninstalled.

This guide aims to show how to integrate GStreamer-VAAPI dependencies, in this case with the Intel VA-API driver.

Dependencies

For now we will need Meson master, since this patch is required. The pull request is already merged but has not been released yet.

Installation

clone gst-build repository

$ git clone git://anongit.freedesktop.org/gstreamer/gst-build
$ cd gst-build

apply custom patch for gst-build

The patch will add the repositories for libva and intel-vaapi-driver.

$ wget https://people.igalia.com/vjaquez/gst-build-vaapi/0001-add-libva-and-intel-vaapi-driver-as-subprojects.patch
$ git am 0001-add-libva-and-intel-vaapi-driver-as-subprojects.patch

0001-add-libva-and-intel-vaapi-driver-as-subprojects.patch

configure

Running this command will clone all the dependency repositories, create the symbolic links and configure the build directory.

$ meson build

apply custom patches for libva, intel-vaapi-driver and gstreamer-vaapi

libva

This patch is required because the uninstalled paths of the header files don’t match the ones used in the include directives.

$ cd libva
$ wget https://people.igalia.com/vjaquez/gst-build-vaapi/0001-build-add-headers-for-uninstalled-setup.patch
$ git am 0001-build-add-headers-for-uninstalled-setup.patch
$ cd -

0001-build-add-headers-for-uninstalled-setup.patch

intel-vaapi-driver

The patch handles the libva dependency as a subproject.

$ cd intel-vaapi-driver
$ wget https://people.igalia.com/vjaquez/gst-build-vaapi/0001-meson-support-libva-as-subproject.patch
$ git am 0001-meson-support-libva-as-subproject.patch
$ cd -

0001-meson-support-libva-as-subproject.patch

gstreamer-vaapi

Note to self: this patch must be split up and merged upstream.

$ cd gstreamer-vaapi
$ wget https://people.igalia.com/vjaquez/gst-build-vaapi/0001-build-meson-libva-gst-uninstall-friendly.patch
$ git am 0001-build-meson-libva-gst-uninstall-friendly.patch
$ cd -

0001-build-meson-libva-gst-uninstall-friendly.patch updated: 2018/04/24

build

$ ninja -C build

And wait a couple minutes.

run uninstalled environment for testing

$ ninja -C build uninstalled
[gst-master] $ gst-inspect-1.0 vaapi
Plugin Details:
  Name                     vaapi
  Description              VA-API based elements
  Filename                 /opt/gst/gst-build/build/subprojects/gstreamer-vaapi/gst/vaapi/libgstvaapi.so
  Version                  1.15.0.1
  License                  LGPL
  Source module            gstreamer-vaapi
  Binary package           gstreamer-vaapi
  Origin URL               http://bugzilla.gnome.org/enter_bug.cgi?product=GStreamer

  vaapih264enc: VA-API H264 encoder
  vaapimpeg2enc: VA-API MPEG-2 encoder
  vaapisink: VA-API sink
  vaapidecodebin: VA-API Decode Bin
  vaapipostproc: VA-API video postprocessing
  vaapivc1dec: VA-API VC1 decoder
  vaapih264dec: VA-API H264 decoder
  vaapimpeg2dec: VA-API MPEG2 decoder
  vaapijpegdec: VA-API JPEG decoder

  9 features:
  +-- 9 elements

by vjaquez at April 17, 2018 02:43 PM

Iago Toral

Frame analysis of a rendering of the Sponza model

For some time now I have been working on a personal project to render the well known Sponza model provided by Crytek using Vulkan. Here is a picture of the current (still a work-in-progress) result:


Sponza rendering

This screenshot was captured on my Intel Kabylake laptop, running on the Intel Mesa Vulkan driver (Anvil).

The following list includes the main features implemented in the demo:

  • Depth pre-pass
  • Forward and deferred rendering paths
  • Anisotropic filtering
  • Shadow mapping with Percentage-Closer Filtering
  • Bump mapping
  • Screen Space Ambient Occlusion (only on the deferred path)
  • Screen Space Reflections (only on the deferred path)
  • Tone mapping
  • Anti-aliasing (FXAA)

I have been thinking about writing a post about this for some time, but given that there are multiple features involved I wasn’t sure how to scope it. Eventually I decided to write a “frame analysis” post where I describe, step by step, all the render passes involved in the production of the single frame capture shown at the top of the post. I have always enjoyed reading this kind of article, so I figured it would be fun to write one myself, and I hope others find it informative, if not entertaining.

To avoid making the post too dense I won’t go into too much detail while describing each render pass, so don’t expect me to go into the nitty-gritty of how I implemented Screen Space Ambient Occlusion, for example. Instead I intend to give a high-level overview of how the various features implemented in the demo work together to create the final result. I will provide screenshots so that readers can appreciate the outputs of each step and verify how detail and quality build up over time as we include more features in the pipeline. Those who are more interested in the programming details of particular features can always have a look at the Vulkan source code (link available at the bottom of the article), look for specific tutorials available on the Internet or wait for me to write feature-specific posts (I don’t make any promises though!).

If you’re interested in going through with this, then grab a cup of coffee and get ready, it is going to be a long ride!

Step 0: Culling

This is the only step in this discussion that runs on the CPU, and while optional from the point of view of the result (it doesn’t affect the actual result of the rendering), it is relevant from a performance point of view. Prior to rendering anything, in every frame, we usually want to cull meshes that are not visible to the camera. This can greatly help performance, even on a relatively simple scene such as this. This is of course more noticeable when the camera is looking in a direction in which a significant amount of geometry is not visible to it, but in general, there are always parts of the scene that are not visible to the camera, so culling is usually going to give you a performance bonus.

In large, complex scenes with tons of objects we probably want to use more sophisticated culling methods such as quadtrees, but in this case, since the number of meshes is not too high (the Sponza model is slightly shy of 400 meshes), we just go through all of them and cull them individually against the camera’s frustum, which determines the area of the 3D space that is visible to the camera.

The way culling works is simple: for each mesh we compute an axis-aligned bounding box and we test that box for intersection with the camera’s frustum. If we can determine that the box never intersects, then the mesh enclosed within it is not visible and we flag it as such. Later on, at rendering time (or rather, at command recording time, since the demo has been written in Vulkan) we just skip the meshes that have been flagged.
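
For illustration, here is the core of that test written in GLSL-style vector math, for consistency with the rest of the snippets in this post (in the demo this runs on the CPU in the application code, and the function and parameter names here are made up, not the demo’s actual interface):

// Plane-vs-AABB test for one frustum plane (plane.xyz is the inward-facing
// normal, plane.w the distance term). The box is given by its min/max corners.
// If even the most positive corner along the plane normal is behind the
// plane, the whole box is outside of it.
bool box_outside_plane(vec4 plane, vec3 box_min, vec3 box_max)
{
    vec3 p;
    p.x = plane.x > 0.0 ? box_max.x : box_min.x;
    p.y = plane.y > 0.0 ? box_max.y : box_min.y;
    p.z = plane.z > 0.0 ? box_max.z : box_min.z;
    return dot(plane.xyz, p) + plane.w < 0.0;
}

// A mesh is flagged as not visible only if its bounding box is fully outside
// at least one of the 6 frustum planes.
bool is_culled(vec4 frustum_planes[6], vec3 box_min, vec3 box_max)
{
    for (int i = 0; i < 6; i++) {
        if (box_outside_plane(frustum_planes[i], box_min, box_max))
            return true;
    }
    return false;
}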

The algorithm is not perfect, since it is possible that an axis-aligned bounding box for a particular mesh is visible to the camera and yet no part of the mesh itself is visible, but this should not affect many meshes and trying to improve it would incur additional checks that could undermine the efficiency of the process anyway.

Since this particular demo only has static geometry, we only need to run the culling pass when the camera moves around, as otherwise the list of visible meshes doesn’t change. If dynamic geometry were present, we would need to at least cull the dynamic geometry on every frame even if the camera stayed static, since dynamic elements may step into (or out of) the viewing frustum at any moment.

Step 1: Depth pre-pass

This is an optional stage, but it can help performance significantly in many cases. The idea is the following: our GPU performance is usually going to be limited by the fragment shader, and especially so as we target higher resolutions. In this context, without a depth pre-pass, we are very likely going to execute the fragment shader for fragments that will not end up on the screen because they are occluded by fragments produced by other geometry in the scene that will be rasterized to the same XY screen-space coordinates but with a smaller Z coordinate (closer to the camera). This wastes precious GPU resources.

One way to improve the situation is to sort our geometry by distance from the camera and render front to back. With this we can get fragments that are rasterized from background geometry quickly discarded by early depth tests before the fragment shader runs for them. Unfortunately, although this will certainly help (assuming we can spare the extra CPU work to keep our geometry sorted for every frame), it won’t eliminate all the instances of the problem in the general case.

Also, sometimes things are more complicated, as the shading cost of different pieces of geometry can be very different, and we should take this into account as well. For example, we can have a very large piece of geometry for which some pixels are very close to the camera while some others are very far away, and that has a very expensive shader. If our renderer is doing front-to-back rendering without any other considerations it will likely render this geometry early (since parts of it are very close to the camera), which means that it will shade all or most of its very expensive fragments. However, if the renderer accounts for the relative cost of the shader execution it would probably postpone rendering it as much as possible, so by the time it actually renders it, it takes advantage of early fragment depth tests to avoid as many of its expensive fragment shader executions as possible.

Using a depth-prepass ensures that we only run our fragment shader for visible fragments, and only those, no matter the situation. The downside is that we have to execute a separate rendering pass where we render our geometry to the depth buffer so that we can identify the visible fragments. This pass is usually very fast though, since we don’t even need a fragment shader and we are only writing to a depth texture. The exception to this rule is geometry that has opacity information, such as opacity textures, in which case we need to run a cheap fragment shader to identify transparent pixels and discard them so they don’t hit the depth buffer. In the Sponza model we need to do that for the flowers or the vines on the columns for example.
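
For reference, the fragment shader for such alpha-tested geometry in the depth pre-pass can be as simple as this sketch (the sampler binding and the 0.5 threshold are illustrative, not the demo’s actual values):

#version 450

// Depth pre-pass fragment shader for alpha-tested geometry: we only need it
// to discard transparent texels so they never reach the depth buffer.
layout(set = 0, binding = 0) uniform sampler2D u_opacity;
layout(location = 0) in vec2 in_uv;

void main()
{
    if (texture(u_opacity, in_uv).r < 0.5)
        discard;
    // No color output: this pass only writes depth.
}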

Depth pre-pass output

The picture shows the output of the depth pre-pass. Darker colors mean smaller distance from the camera. That’s why the picture gets brighter as we move further away.

Now, the remaining passes will be able to use this information to limit their shading to fragments that, for a given XY screen-space position, match exactly the Z value stored in the depth buffer, effectively selecting only the fragments that will be visible on the screen. We do this by configuring the depth test to do an EQUAL test instead of the usual LESS test, which is what we use in the depth pre-pass.

In this particular demo, running on my Intel GPU, the depth pre-pass is by far the cheapest of all the GPU passes and it definitely pays off in terms of overall performance output.

Step 2: Shadow map

In this demo we have a single light source: a directional light that simulates the sun. You can probably guess the direction of the light by checking out the picture at the top of this post and looking at the direction of the projected shadows.

I already covered how shadow mapping works in a previous series of posts, so if you’re interested in the programming details I encourage you to read that. Anyway, the basic idea is that we want to capture the scene from the point of view of the light source (to be more precise, we want to capture the objects in the scene that can potentially produce shadows that are visible to our camera).

With that information, we will be able to inform our lighting pass so it can tell if a particular fragment is in the shadows (not visible from our light’s perspective) or in the light (visible from our light’s perspective) and shade it accordingly.

From a technical point of view, recording a shadow map is exactly the same as the depth-prepass: we basically do a depth-only rendering and capture the result in a depth texture. The main differences here are that we need to render from the point of view of the light instead of our camera’s and that this being a directional light, we need to use an orthographic projection and adjust it properly so we capture all relevant shadow casters around the camera.
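
As an illustration, the vertex shader for this pass can be as small as the following sketch; the push constant layout and names are just an example, not the demo’s actual interface:

#version 450

// Depth-only pass from the light's point of view: the only real difference
// with the camera depth pre-pass is the view-projection matrix, which here
// is an orthographic projection built around the sun direction.
layout(push_constant) uniform pc {
    mat4 u_light_mvp;   // ortho projection * light view * model
};

layout(location = 0) in vec3 in_pos;

void main()
{
    gl_Position = u_light_mvp * vec4(in_pos, 1.0);
}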

Shadow map

In the image above we can see the shadow map generated for this frame. Again, the brighter the color, the further away the fragment is from the light source. The bright white area outside the atrium building represents the part of the scene that is empty and thus ends up with the maximum depth, which is what we use to clear the shadow map before rendering to it.

In this case, we are using a 4096×4096 texture to store the shadow map image, much larger than our rendering target. This is because shadow mapping from directional lights needs a lot of precision to produce good results, otherwise we end up with very pixelated / blocky shadows, more artifacts and even missing shadows for small geometry. To illustrate this better here is the same rendering of the Sponza model from the top of this post, but using a 1024×1024 shadow map (floor reflections are disabled, but that is irrelevant to shadow mapping):

Sponza rendering with 1024×1024 shadow map

You can see how in the 1024×1024 version there are some missing shadows for the vines on the columns and generally blurrier shadows (when not also slightly distorted) everywhere else.

Step 3: GBuffer

In deferred rendering we capture various attributes of the fragments produced by rasterizing our geometry and write them to separate textures that we will use to inform the lighting pass later on (and possibly other passes).

What we do here is render our geometry normally, like we did in the depth pre-pass, but this time, as explained before, we configure the depth test to only pass fragments that match the contents of the depth buffer that we produced in the depth pre-pass, so we only process fragments that we know will be visible on the screen.

Deferred rendering uses multiple render targets to capture each of these attributes to a different texture for each rasterized fragment that passes the depth test. In this particular demo our GBuffer captures:

  1. Normal vector
  2. Diffuse color
  3. Specular color
  4. Position of the fragment from the point of view of the light (for shadow mapping)

It is important to be very careful when defining what we store in the GBuffer: since we are rendering to multiple screen-sized textures, this pass has serious bandwidth requirements and therefore, we should use texture formats that give us the range and precision we need with the smallest pixel size requirements and avoid storing information that we can get or compute efficiently through other means. This is particularly relevant for integrated GPUs that don’t have dedicated video memory (such as my Intel GPU).

In the demo, I do lighting in view-space (that is, a coordinate space that takes the camera as its origin), so I need to work with positions and vectors in this coordinate space. One of the parameters we need for lighting is surface normals, which are conveniently stored in the GBuffer, but we will also need to know the view-space position of the fragments on the screen. To avoid storing the latter in the GBuffer we take advantage of the fact that we can reconstruct the view-space position of any fragment on the screen from its depth (which is stored in the depth buffer we rendered during the depth pre-pass) and the camera’s projection matrix. I might cover the process in more detail in another post; for now, what is important to remember is that we don’t need to worry about storing fragment positions in the GBuffer, and that saves us some bandwidth, helping performance.
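
For illustration, the reconstruction can be done with a small helper like this (a sketch; the parameter names are made up):

// Reconstruct the view-space position of a fragment from its depth buffer
// value and the inverse of the camera projection matrix.
vec3 view_pos_from_depth(vec2 uv, float depth, mat4 inv_proj)
{
    // Back to normalized device coordinates (Vulkan depth is already 0..1).
    vec4 ndc = vec4(uv * 2.0 - 1.0, depth, 1.0);
    vec4 view = inv_proj * ndc;
    return view.xyz / view.w;   // undo the perspective divide
}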

Let’s have a look at the various GBuffer textures we produce in this stage:

Normal vectors

GBuffer normal texture

Here we see the normalized normal vectors for each fragment in view-space. This means they are expressed in a coordinate space in which our camera is at the origin and the positive Z direction is opposite to the camera’s view vector. Therefore, we see that surfaces pointing to the right of our camera are red (positive X), those pointing up are green (positive Y) and those pointing opposite to the camera’s view direction are blue (positive Z).

It should be mentioned that some of these surfaces use normal maps for bump mapping. These normal maps are textures that provide per-fragment normal information instead of the usual vertex normals that come with the polygon meshes. This means that instead of computing per-fragment normals as a simple interpolation of the per-vertex normals across the polygon faces, which gives us a rather flat result, we use a texture to adjust the normal for each fragment in the surface, which enables the lighting pass to render more nuanced surfaces that seem to have a lot more volume and detail than they would have otherwise.
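
To give an idea of what that looks like in a shader, here is a rough sketch of the per-fragment normal fetch; the sampler binding, the names and the assumption that a tangent-space basis is interpolated from the vertex data are all illustrative:

// Per-fragment normal from a normal map: the tangent-space normal stored in
// the texture is transformed by the interpolated TBN basis into the space
// used for lighting (view space in this demo).
layout(set = 1, binding = 2) uniform sampler2D u_normal_map;

vec3 bumped_normal(vec2 uv, vec3 T, vec3 B, vec3 N)
{
    vec3 tn = texture(u_normal_map, uv).xyz * 2.0 - 1.0;  // unpack from [0,1]
    return normalize(mat3(normalize(T), normalize(B), normalize(N)) * tn);
}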

For comparison, here is the GBuffer normal texture without bump mapping enabled. The difference in surface detail should be obvious. Just look at the lion figure at the far end or the columns and you will immediately notice the additional detail added by bump mapping to the surface descriptions:

GBuffer normal texture (bump mapping disabled)

To make the impact of the bump mapping more obvious, here is a different shot of the final rendering focusing on the columns of the upper floor of the atrium, with and without bump mapping:

Bump mapping enabled
Bump mapping disabled

All the extra detail in the columns is the sole result of the bump mapping technique.

Diffuse color

GBuffer diffuse texture

Here we have the diffuse color of each fragment in the scene. This is basically how our scene would look if we didn’t implement a lighting pass that considers how the light source interacts with the scene.

Naturally, we will use this information in the lighting pass to modulate the color output based on the light interaction with each fragment.

Specular color

GBuffer specular texture

This is similar to the diffuse texture, but here we are storing the color (and strength) used to compute specular reflections.

Similarly to normal textures, we use specular maps to obtain per-fragment specular colors and intensities. This allows us to simulate combinations of more complex materials in the same mesh by specifying different specular properties for each fragment.

For example, if we look at the cloths that hang from the upper floor of the atrium, we see that they are mostly black, meaning that they barely produce any specular reflection, as is to be expected from textile materials. However, we also see that these same cloths have embroidery that does have specular reflection (showing up as a light gray color), which means these details in the texture have stronger specular reflections than their surrounding textile material:

Specular reflection on cloth embroidery

The image shows visible specular reflections in the yellow embroidery decorations of the cloth (on the bottom-left) that are not present in the textile segment (the blue region of the cloth).

Fragment positions from Light

GBuffer light-space position texture

Finally, we store fragment positions in the coordinate space of the light source so we can implement shadows in the lighting pass. This image may be less intuitive to interpret, since it is encoding space positions from the point of view of the sun rather than physical properties of the fragments. We will need to retrieve this information for each fragment during the lighting pass so that, together with the shadow map, we can tell which fragments are visible from the light source (and therefore are directly lit by the sun) and which are not (and therefore are in the shadows). Again, there is more detail on how that process works, step by step and including Vulkan source code, in my series of posts on that topic.

Step 4: Screen Space Ambient Occlusion

With the information stored in the GBuffer we can now also run a screen-space ambient occlusion pass that we will use to improve our lighting pass later on.

The idea here is that, as I discussed in my lighting and shadows series, the Phong lighting model simplifies ambient lighting by making it constant across the scene. As a consequence of this, lighting in areas that are not directly lit by a light source looks rather flat, as we can see in this image:

SSAO disabled

Screen Space Ambient Occlusion is a technique that gathers information about the amount of ambient light occlusion produced by nearby geometry as a way to better estimate the ambient light term of the lighting equations. We can then use that information in our lighting pass to modulate ambient light accordingly, which can greatly improve the sense of depth and volume in the scene, especially in areas that are not directly lit:

SSAO enabled

Comparing the images above should illustrate the benefits of the SSAO technique. For example, look at the folds in the blue curtains on the right side of the images: without SSAO, we barely see them because the lighting is too flat across all the pixels in the curtain. Similarly, thanks to SSAO we can create shadowed areas from ambient light alone, as we can see behind the cloths that hang from the upper floor of the atrium or behind the vines on the columns.

To produce this result, the output of the SSAO pass is a texture with ambient light intensity information that looks like this (after some blur post-processing to eliminate noise artifacts):

SSAO output texture

In that image, white tones represent strong light intensity and black tones represent low light intensity produced by occlusion from nearby geometry. In our lighting pass we will source from this texture to obtain per-fragment ambient occlusion information and modulate the ambient term accordingly, bringing the additional volume showcased in the image above to the final rendering.

Step 5: Lighting pass

Finally, we get to the lighting pass. Most of what we showcased above was preparation work for this.

The lighting pass mostly goes as I described in my lighting and shadows series, only that since we are doing deferred rendering we get our per-fragment lighting inputs by reading from the GBuffer textures instead of getting them from the vertex shader.

Basically, the process involves retrieving diffuse, ambient and specular color information from the GBuffer and using it as input for the lighting equations to produce the final color for each fragment. We also sample from the shadow map to decide which pixels are in the shadows, in which case we remove their diffuse and specular components, making them darker and producing shadows in the image as a result.

We also use the SSAO output to improve the ambient light term as described before, multiplying the ambient term of each fragment by the SSAO value we computed for it, reducing the strength of the ambient light for pixels that are surrounded by nearby geometry.

The lighting pass is also where we put bump mapping to use. Bump mapping provides more detailed information about surface normals, which the lighting pass uses to simulate more complex lighting interactions with mesh surfaces, producing significantly enhanced results, as I showcased earlier in this post.
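Putting the pieces described above together, the core of a deferred lighting shader could look roughly like this. This is a simplified sketch: the bindings, uniform names and the exact lighting model are illustrative rather than the demo’s actual code, and PCF is omitted from the shadow test for brevity.

#version 450

layout(set = 0, binding = 0) uniform sampler2D u_gbuffer_normal;
layout(set = 0, binding = 1) uniform sampler2D u_gbuffer_diffuse;
layout(set = 0, binding = 2) uniform sampler2D u_gbuffer_specular;
layout(set = 0, binding = 3) uniform sampler2D u_gbuffer_light_pos;
layout(set = 0, binding = 4) uniform sampler2D u_depth;       // depth pre-pass
layout(set = 0, binding = 5) uniform sampler2D u_shadow_map;
layout(set = 0, binding = 6) uniform sampler2D u_ssao;

layout(set = 0, binding = 7) uniform Params {
    mat4  u_inv_proj;     // to reconstruct view-space positions from depth
    vec3  u_sun_dir;      // sun direction, in view space
    vec3  u_sun_color;
    vec3  u_ambient;
    float u_shininess;
    float u_shadow_bias;
};

layout(location = 0) in vec2 in_uv;
layout(location = 0) out vec4 out_color;

void main()
{
    vec3 N           = texture(u_gbuffer_normal, in_uv).xyz;
    vec3 albedo      = texture(u_gbuffer_diffuse, in_uv).rgb;
    vec3 spec_color  = texture(u_gbuffer_specular, in_uv).rgb;
    vec4 light_pos   = texture(u_gbuffer_light_pos, in_uv);
    float ssao       = texture(u_ssao, in_uv).r;

    // View-space position reconstructed from the depth buffer, as explained
    // in the GBuffer section.
    float depth = texture(u_depth, in_uv).r;
    vec4 ndc = vec4(in_uv * 2.0 - 1.0, depth, 1.0);
    vec4 vp = u_inv_proj * ndc;
    vec3 view_pos = vp.xyz / vp.w;

    // Shadow test: is the fragment further from the light than the closest
    // depth recorded in the shadow map at that position?
    vec3 shadow_uv = light_pos.xyz / light_pos.w;
    shadow_uv.xy = shadow_uv.xy * 0.5 + 0.5;      // NDC -> texture coords
    float occluder = texture(u_shadow_map, shadow_uv.xy).r;
    float shadow = shadow_uv.z - u_shadow_bias <= occluder ? 1.0 : 0.0;

    // Phong-style terms, with the ambient term modulated by SSAO.
    vec3 L = normalize(-u_sun_dir);
    vec3 V = normalize(-view_pos);
    vec3 R = reflect(-L, N);

    vec3 ambient  = u_ambient * albedo * ssao;
    vec3 diffuse  = max(dot(N, L), 0.0) * albedo * u_sun_color;
    vec3 specular = pow(max(dot(R, V), 0.0), u_shininess) * spec_color * u_sun_color;

    out_color = vec4(ambient + shadow * (diffuse + specular), 1.0);
}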

After combining all this information, the lighting pass produces an output like this. Compare it with the GBuffer diffuse texture to see all the stuff that this pass is putting together:

Lighting pass output

Step 6: Tone mapping

After the lighting pass we run a number of post-processing passes, of which tone mapping is the first one. The idea behind tone mapping is this: normally, shader color outputs are limited to the range [0, 1], which puts a hard cap on our lighting calculations. Specifically, it means that when our light contributions to a particular pixel go beyond 1.0 in any color component, they get clamped, which can distort the resulting color in unrealistic ways, especially when this happens during intermediate lighting calculations (since the deviation from the physically correct color is then used as input to more computations, which then build on that error).

To work around this we do our lighting calculations in High Dynamic Range (HDR) which allows us to produce color values with components larger than 1.0, and then we run a tone mapping pass to re-map the result to the [0, 1] range when we are done with the lighting calculations and we are ready for display.

The nice thing about tone mapping is that it gives the developer control over how that mapping happens, allowing us to decide if we are interested in preserving more detail in the darker or brighter areas of the scene.
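
Just to illustrate the idea, here is one possible operator, a simple exposure-based curve; I am not claiming this is the exact curve used in the demo:

// Exposure-based tone mapping: maps the unbounded HDR color into [0, 1],
// with the exposure parameter controlling how much detail is preserved in
// the bright vs. dark areas of the scene.
vec3 tone_map(vec3 hdr_color, float exposure)
{
    return vec3(1.0) - exp(-hdr_color * exposure);
}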

In this particular demo, I used HDR rendering to ramp up the intensity of the sun light beyond what I could have represented otherwise. Without tone mapping this would lead to unrealistic lighting in areas with strong light reflections, since it would exceed the 1.0 per-color-component cap and lead to pure white colors as a result, losing the color detail from the original textures. This effect can be observed in the following pictures if you look at the lit area of the floor. Notice how the tone-mapped picture better retains the detail of the floor texture while in the non tone-mapped version the floor seems to be over-exposed to light and large parts of it just become white as a result (shadow mapping has been disabled to better showcase the effects of tone mapping on the floor):

Tone mapping disabled
Tone mapping enabled

Step 7: Screen Space Reflections (SSR)

The material used to render the floor is reflective, which means that we can see the reflections of the surrounding environment on it.

There are various ways to capture reflections, each with its own set of pros and cons. When I implemented my OpenGL terrain rendering demo I implemented water reflections using “Planar Reflections”, which produce very accurate results at the expense of having to re-render the scene with the camera facing in the same direction as the reflection. Although this can be done at a lower resolution, it is still quite expensive and cumbersome to set up (for example, you would need to run an additional culling pass), and you also need to consider that this has to be done for each planar surface you want to apply reflections to, so it doesn’t scale very well. In this demo, although it is not visible in the reference screenshot, I am capturing reflections from the floor sections of both stories of the atrium, so the Planar Reflections approach might have required me to render twice when fragments of both sections are visible (admittedly, not very often, but not impossible with the free camera).

So in this particular case I decided to experiment with a different technique that has become quite popular, despite its many shortcomings, because it is a lot faster: Screen Space Reflections.

Like all screen-space techniques, this one uses information already present on the screen to capture the reflection information, so we don’t have to render the scene again from a different perspective. This leads to a number of limitations that can produce fairly visible artifacts, especially when there is dynamic geometry involved. Nevertheless, in my particular case I don’t have any dynamic geometry, at least not yet, so while the artifacts are there they are not quite as distracting. I won’t go into the details of the artifacts introduced with SSR here, but for those interested, here is a good discussion.

I should mention that my take on this is fairly basic and doesn’t implement relevant features such as the Hierarchical Z Buffer optimization (HZB) discussed here.

The technique has 3 steps: capturing reflections, applying roughness material properties and alpha blending:

Capturing reflections

I only implemented support for SSR in the deferred path because, as with SSAO (and more generally all screen-space algorithms), deferred rendering is the best match, since we are already capturing screen-space information in the GBuffer.

The first stage requires a means to identify the fragments that need reflection information; in our case, the floor fragments. What I did was capture the reflectiveness of the material of each fragment on the screen during the GBuffer pass. This is a single floating-point component (in the 0-1 range). A value of 0 means that the material is not reflective and the SSR pass will just ignore it. A value of 1 means that the fragment is 100% reflective, so its color value will be solely the reflection color. Values in between allow us to control the strength of the reflection for each fragment with a reflective material in the scene.

One small note on the GBuffer storage: because this is a single floating-point value, we don’t necessarily need an extra attachment in the GBuffer (which would have some performance penalty), instead we can just put this in the alpha component of the diffuse color, since we were not using it (the Intel Mesa driver doesn’t support rendering to RGB textures yet, so since we are limited to RGBA we might as well put it to good use).

Besides capturing which fragments are reflective, we can also store another piece of information relevant to the reflection computations: the material’s roughness. This is another scalar value indicating how much blurring we want to apply to the resulting reflection: smooth metal-like surfaces can have very sharp reflections, but for rougher materials with less smooth surfaces we may want the reflections to look a bit blurry, to better represent the imperfections.

Besides the reflection and roughness information, to capture screen-space reflections we will need access to the output of the previous pass (tone mapping) from which we will retrieve the color information of our reflection points, the normals that we stored in the GBuffer (to compute reflection directions for each fragment in the floor sections) and the depth buffer (from the depth-prepass), so we can check for reflection collisions.

The technique goes like this: for each fragment that is reflective, we compute the direction of the reflection using its normal (from the GBuffer) and the view vector (from the camera and the fragment position). Once we have this direction, we execute a ray marching from the fragment position, in the direction of the reflection. For each point we generate, we take the screen-space X and Y coordinates and use them to retrieve the Z-buffer depth for that pixel in the scene. If the depth buffer value is smaller than our sample’s it means that we have moved past foreground geometry and we stop the process. If we got to this point, then we can do a binary search to pin-point the exact location where the collision with the foreground geometry happens, which will give us the screen-space X and Y coordinates of the reflection point. Once we have that we only need to sample the original scene (the output from the tone mapping pass) at that location to retrieve the reflection color.
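
Here is a rough sketch of that ray march. The helper names, bindings and constants are made up for illustration, and the binary-search refinement step is omitted for brevity:

// Screen-space reflection trace: step through view space along the
// reflection direction, project each sample to screen space and compare its
// depth against the depth buffer until we move past foreground geometry.
layout(set = 0, binding = 0) uniform sampler2D u_depth;   // depth pre-pass
layout(set = 0, binding = 1) uniform sampler2D u_scene;   // tone-mapped scene

layout(set = 0, binding = 2) uniform SSRParams {
    mat4  u_proj;        // camera projection matrix
    float u_step;        // ray march step size, in view-space units
    int   u_max_steps;
};

vec2 view_to_uv(vec3 view_pos)
{
    vec4 clip = u_proj * vec4(view_pos, 1.0);
    return (clip.xy / clip.w) * 0.5 + 0.5;
}

vec4 trace_reflection(vec3 frag_pos, vec3 N)
{
    // Reflection direction from the view vector and the GBuffer normal.
    vec3 dir = normalize(reflect(normalize(frag_pos), N));
    vec3 p = frag_pos;

    for (int i = 0; i < u_max_steps; i++) {
        p += dir * u_step;
        vec2 uv = view_to_uv(p);
        if (any(lessThan(uv, vec2(0.0))) || any(greaterThan(uv, vec2(1.0))))
            break;                        // ray left the screen: no reflection

        vec4 clip = u_proj * vec4(p, 1.0);
        float sample_depth = clip.z / clip.w;
        float scene_depth = texture(u_depth, uv).r;
        if (scene_depth < sample_depth)   // moved past foreground geometry: hit
            return vec4(texture(u_scene, uv).rgb, 1.0);
    }
    return vec4(0.0);                     // no hit: no reflection color
}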

As discussed earlier, the technique has numerous caveats, which we need to address in one way or another and maybe adapt to the characteristics of different scenes so we can obtain the best results in each case.

The output of this pass is a color texture where we store the reflection colors for each fragment that has a reflective material:

Reflection texture

Naturally, the image above only shows reflection data for the pixels in the floor, since those are the only ones with a reflective material attached. It is immediately obvious that some pixels lack reflection color though; this is due to the various limitations of the screen-space technique that are discussed in the blog post I linked above.

Because the reflections will be alpha-blended with the original image, we use the reflectiveness that we stored in the GBuffer as the base for the alpha component of the reflection color as well (there are other aspects that can contribute to the alpha component too, but I won’t go into that here), so the image above, although not visible in the screenshot, has a valid alpha channel.

Considering material roughness

Once we have captured the reflection image, the next step is to apply the material roughness settings. We can accomplish this with a simple box filter based on the roughness of each fragment: the larger the roughness, the larger the box filter we apply and the blurrier the reflection we get as a result. Because we store roughness for each fragment in the GBuffer, we can have multiple reflective materials with different roughness settings if we want. In this case, we just have one material for the floor though.
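
A minimal sketch of such a roughness-driven box filter could look like this (names, bindings and the radius scaling are illustrative):

// Roughness-driven blur of the reflection texture: a simple box filter whose
// radius grows with the material roughness stored in the GBuffer.
layout(set = 0, binding = 0) uniform sampler2D u_reflection;

vec4 blur_reflection(vec2 uv, float roughness)
{
    int radius = int(roughness * 4.0);        // rougher -> larger box filter
    vec2 texel = 1.0 / vec2(textureSize(u_reflection, 0));

    vec4 sum = vec4(0.0);
    int count = 0;
    for (int x = -radius; x <= radius; x++) {
        for (int y = -radius; y <= radius; y++) {
            sum += texture(u_reflection, uv + vec2(x, y) * texel);
            count++;
        }
    }
    return sum / float(count);
}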

Alpha blending

Finally, we use alpha blending to composite the reflections onto the original image (the output from the tone mapping pass) and incorporate them into the final rendering:

SSR output

Step 8: Anti-aliasing (FXAA)

So far we have been neglecting anti-aliasing. Because we are doing deferred rendering, Multi-Sample Anti-Aliasing (MSAA) is not an option: MSAA happens at rasterization time, which in a deferred renderer occurs before our lighting pass (specifically, when we generate the GBuffer), so it cannot account for the important effects that the lighting pass has on the resulting image, and therefore, on the eventual aliasing that we need to correct. This is why deferred renderers usually do anti-aliasing via post-processing.

In this demo I have implemented a well-known anti-aliasing post-processing pass known as Fast Approximate Anti-Aliasing (FXAA). The technique attempts to identify strong contrast across neighboring pixels in the image to identify edges and then smooth them out using linear filtering. Here is the final result, which matches the one I included as reference at the top of this post:

Anti-aliased output

The image above shows the results of the anti-aliasing pass. Compare that with the output of the SSR pass. You can see how this pass has effectively removed the jaggies observed in the cloths hanging from the upper floor for example.

Unlike MSAA, which acts on geometry edges only, FXAA works on all pixels, so it can also smooth out edges produced by shaders or textures. Whether that is something we want to do or not may depend on the scene. Here we can see this happening on the foreground column on the left, where some of the imperfections of the stone are slightly smoothed out by the FXAA pass.
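
To give an idea of how the pass decides which pixels to touch, here is a rough sketch of the luminance-contrast test at the heart of FXAA (illustrative only; the real filter does quite a bit more work to smooth out the detected edges):

// Edge detection for FXAA: compute the perceptual luminance of the pixel and
// its neighbours and compare the local contrast against a threshold to
// decide whether the pixel needs smoothing.
layout(set = 0, binding = 0) uniform sampler2D u_input;

float luma(vec3 c)
{
    return dot(c, vec3(0.299, 0.587, 0.114));
}

bool needs_aa(vec2 uv, vec2 texel, float threshold)
{
    float l_c = luma(texture(u_input, uv).rgb);
    float l_n = luma(texture(u_input, uv + vec2(0.0, texel.y)).rgb);
    float l_s = luma(texture(u_input, uv - vec2(0.0, texel.y)).rgb);
    float l_e = luma(texture(u_input, uv + vec2(texel.x, 0.0)).rgb);
    float l_w = luma(texture(u_input, uv - vec2(texel.x, 0.0)).rgb);

    float l_max = max(l_c, max(max(l_n, l_s), max(l_e, l_w)));
    float l_min = min(l_c, min(min(l_n, l_s), min(l_e, l_w)));
    return (l_max - l_min) > threshold;   // strong local contrast: an edge
}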

Conclusions and source code

So that’s all; congratulations if you managed to read this far! In the past I have found articles that do this kind of frame analysis quite interesting, so it’s been fun writing one myself, and I only hope that this was interesting to someone else.

This demo has been implemented in Vulkan and includes a number of configurable parameters that can be used to tweak performance and quality. The work-in-progress source code is available here, but beware that I have only tested this on Intel, since that is the only hardware I have available, so you may find issues if you run this on other GPUs. If that happens, let me know in the comments and I might be able to provide fixes at some point.

by Iago Toral at April 17, 2018 12:45 PM

April 16, 2018

Jacobo Aragunde

Updated Chromium on the GENIVI platform

I’ve devoted some of my time at Igalia to get a newer version of Chromium running on the GENIVI Development Platform (GDP).

Since the last update, there has been news regarding Wayland support in Chromium. My colleagues Antonio, Maksim and Frédéric have worked on a new Wayland backend following the modern Chromium architecture. You can find more information in their own blogs and talks. I’m linking the most recent talk, from FOSDEM 2018.

Everyone can already try the new, Igalia-developed backend on their embedded devices using the meta-browser layer. I built it along with the GDP but discovered that it cannot run as it is, due to the lack of ivi-shell hooks in the new Chromium backend. This is going to be fixed in the mid-term, so I decided not to spend a lot of time researching this and chose a different solution for the current GDP release.

The LG SVL team recently announced the release of an updated Ozone Wayland backend for Chromium, based on the legacy implementation provided by Intel, as a part of the webOS OSE project. This is an update on the backend we were already running on the GDP, so it looked like a good idea to reuse their work.

I added the meta-lgsvl-browser layer to the GDP, which provides recipes for several Chromium flavors: chromium-lge is the one that builds upon the legacy Wayland backend and currently provides Chromium version 64.

The chromium-lge browser worked out-of-the-box on Raspberry Pi, but I faced some trouble with the other supported platforms. In the case of ARM64 platforms, we were finding a “relocation overflow” problem. This is something that my colleagues had already detected when trying the new Wayland backend on the R-Car gen. 3 platform, and it can be fixed by enabling compiler optimization flags for binary size.

In the case of Intel platforms, compilation failed due to a build-system assertion. It looks like Clang’s Control Flow Integrity feature is enabled by default on x64 Linux builds, but we aren’t using the Clang compiler. The solution consists simply of disabling this feature, like the upstream meta-browser project was already doing.

The ongoing work is shared in this pull request. I hope to be able to make it for the next GDP release!

Finally, this week my colleague Xavi is taking part in the GENIVI All Member Meeting. If you are interested in browsers, make sure you attend his talk, “Wayland Support in Open Source Browsers”, and visit our booth during the Member Showcase!

by Jacobo Aragunde Pérez at April 16, 2018 11:04 AM

April 15, 2018

Manuel Rego

CSSWG F2F Berlin 2018

Last week I was in Berlin for the CSS Working Group (CSSWG) face-to-face meeting representing Igalia, a member of the CSSWG since last year. Igalia has been working on the open web platform for many years, helping our customers with the implementation of different standards in the open source web engines. Inside the CSSWG we play the implementors’ role, providing valuable feedback around the specifications we’re working on.

It was really nice to meet all the folks from the CSSWG there; it’s amazing to be together with such a brilliant group of people in the same room. And it’s lovely to see how easy it is to talk with any of them, you all rock!

CSSWG F2F Berlin 2018 by Rossen Atanassov CSSWG F2F Berlin 2018 by Rossen Atanassov

This is a brief post about my highlights from the meeting, of course totally subjective and focused on the topics I’m most interested in.

CSS Grid Layout

We were discussing two issues of the current specification related to the track sizing algorithm and its behavior in particular cases. Some changes will be added in the specification to try to improve them and we’ll need to update the implementations accordingly.

On top of that, we discussed Level 2 of the spec. It’s already defined that this next level will include the following features:

  • The awaited subgrids feature: There was the possibility of allowing subgrids in both axes (dual-axis) or only in one of them (per-axis); note that the per-axis approach covers the dual-axis case if you define the subgrid in both axes.

    There are clear use cases for the per-axis approach, but the main doubt was about how hard it’d be to implement. Mats Palmgren from Mozilla posted a comment on the issue explaining that he has just created a prototype for the feature following the per-axis idea, so the CSSWG resolved to remove the dual-axis one from the spec.

  • And aspect-ratio controlled gutters: Regarding this topic, the CSSWG decided to add a new ar unit. We didn’t discuss anything more, but we need to decide what we’ll do in situations where there’s not enough free space to fulfill the requested aspect ratio: should we ignore it or overflow in that case?

    When I talked to Rachel Andrew about the issue, she was not completely sure which option authors would prefer. I’ve just added some examples to the issue so we can discuss them there and gather more feedback, please share your thoughts.

Tests

This was a discussion I wanted to have with the CSSWG people in order to understand better the current situation and possible next steps for the CSSWG test suites.

Just to add some context, the CSSWG test suites are now part of the web-platform-tests (WPT) repository. This repository is being used by most browser vendors to share tests, including tests for new CSS features. For example, at Igalia we’re currently using WPT test suites in all our developments.

The CSSWG uses the CSS Test Harness tool, which has a build system that adds some special requirements for the test suites. One of them forces us to duplicate some files in the repository, which is not nice at all.

Several people in the CSSWG still rely on this tool mainly for two things:

  • Run manual tests and store their results: Some CSS features like media queries or scrolling are hard to automate when writing tests, so several specs have manual tests. WebDriver can probably help to automate this kind of test, though maybe not all of them.
  • Extract status reports: To verify that a spec fulfills the CR exit criteria, the current tooling has some nice reports; it also provides info about the test coverage of the spec.

So we cannot get rid of the CSS Test Harness system at this point. We discussed possible solutions, but none of them were really clear; also note that the lack of funding for this kind of work makes it harder to move things forward.

I still believe the way to go would be to improve the WPT Dashboard (wpt.fyi) so it can support the 2 features listed above. If that’s the case, maybe the specific CSS Test Harness stuff won’t be needed anymore; the weird requirements for people working on the test suites would be gone, and there would be a single tool for all the tests from the different working groups.

As a side note, wpt.fyi needs some infrastructure improvements; for example, Microsoft was not happy that the Ahem font (which is used a lot in the CSS test suites) is still not installed on the Windows virtual machines that extract test results for wpt.fyi.

Floats, floats, floats

People are using floats to simulate CSS Shapes on browsers that don’t have support yet. That causes some special cases related to floats to come up more frequently, and it’s hard to decide what the best thing to do about them is.

The CSSWG was discussing what would be the best solution when the non-floated content doesn’t fit in the space left by the floated elements. The problem is quite complex to explain, but imagine the following picture where you have several floated elements.

An example of float layout An example of float layout

In this example there are a few floated elements restricting the area where the content can be painted; if the browser needs to find a place to add a BFC (like a table), it needs to decide where to place it while avoiding overlapping any other floats.

There was a long discussion, and it seems the best choice would be for the browser to test all the options and, if there’s no overlapping, put the table there (basically Option 1 in the linked illustration). However, there are concerns about performance, so there’s still more work to be done here. As a result of this discussion a new CSS Floats specification will be created to describe the expected behavior in this kind of scenario.

Monica Dinculescu created a really cool demo to explain how float layout works, with the help of Ian Kilpatrick, who knows it pretty well as he has been dealing with lots of corner cases while working on LayoutNG.

TYPO Labs

The members of the CSSWG were invited to the co-located TYPO Labs event. I attended on Friday, when Elika (fantasai), Myles and Rossen gave a talk. It was nice to see CSS Grid Layout mentioned in the first talk of the day as a useful tool for typographers. Variable fonts and Virtual Reality were clearly hot topics in several talks.

Elika (fantasai), Myles and Rossen in the CSSWG talk at TYPO Labs Elika (fantasai), Rossen and Myles in the CSSWG talk at TYPO Labs

It’s funny that the last time I was in Berlin was 10 years ago for a conference related to TYPO3, totally unrelated but with a similar name. 😄

Other

Some pictures of Berlin Some pictures of Berlin

And that’s mostly all that I can remember now; I’m sure I’m missing many other important things. It was a fantastic week and I even found some time for walking around Berlin, as the weather was really pleasant.

April 15, 2018 10:00 PM

April 10, 2018

Samuel Iglesias

Going to Ubucon Europe 2018!

Next Ubucon Europe is going to be in the beautiful city of Gijón, Spain, at the Antiguo Instituto from April 27th to 29th 2018.

Ubucon is a conference for users and developers of Ubuntu, one of the most popular GNU/Linux distributions in the world. The conference is full of talks covering very different topics, and optional activities for attendees like the traditional espicha.

I will attend this conference as a speaker, with my talk “Introduction to Mesa, an open-source graphics API”. In this talk, I will give a brief introduction to Mesa, how it works and how to contribute to it. My talk will be on Sunday 29th April at 11:25am. Don’t miss it!

If you plan to attend the conference and you want to talk about open-source graphics drivers, the Linux graphics stack, OpenGL/Vulkan or similar topics, please drop me a line on Twitter.

Ubucon Europe 2018

April 10, 2018 06:00 AM

April 03, 2018

Manuel Rego

Getting rid of "grid-" prefix on CSS Grid Layout gutter properties

Early this year I was working on unprefixing the CSS Grid Layout gutter properties. The properties were originally named grid-column-gap and grid-row-gap, together with the grid-gap shorthand. The CSS Working Group (CSSWG) decided to remove the grid- prefix from these properties last summer, so they could be extended to be used in other layout models like Flexbox.

I was not planning to write a blog post about this, but the task ended up becoming something more than just renaming the properties, so this post describes what it took to implement this. Also people got quite excited about the possibility of animating grid gutters when I announced that this was ready on Twitter.

The task

So the theory seems pretty simple: we currently have 3 properties with the grid- prefix and we want to remove it:

  • grid-column-gap becomes column-gap,
  • grid-row-gap becomes row-gap and
  • grid-gap becomes gap.

But column-gap is already an existing property, defined by the Multicolumn spec, which has been around for a long time. So we cannot just create a new property; we have to make the existing one also work for Grid Layout, and be sure that the syntax is equivalent.

Animatable properties

When I started to test Multicol’s column-gap I realized it was animatable; however, our implementations (Blink and WebKit) of the Grid Layout gutter properties were not. We’d need to make our properties animatable if we wanted to remove the prefixes.

On top of that, I found a bug in the Multicol column-gap animation: as its default computed value is normal, it shouldn’t be possible to animate it. This was fixed quickly by Morten Stenshorne from Google.

Making the properties animatable is not complex at all; both Blink and WebKit have everything ready to make this task easy for properties like the gutter ones that represent lengths. So I decided to do this as part of the unprefixing patch, instead of as something separate.

CSS Grid Layout gutters animation example (check it live)

Percentages

But there was something else: the Grid gutter properties accept percentage values, while column-gap didn’t have that support yet. So I added percentage support to column-gap for multicolumn, as a preliminary patch for the unprefixing one.

There have been long discussions in the CSSWG about how to resolve percentages on the gutter properties. The spec has recently changed so that these properties should resolve to zero for content-based containers. However, my patch does not implement that, as we don’t believe there’s an easy way to support something like that in most of the web engines, and Blink and WebKit are no exceptions. Our patch follows what Microsoft Edge does in these cases, and resolves percentage gaps the same way it does percentage widths or heights. The Firefox implementation that has just landed this week does the same.

CSS Multi-column percentage column-gap example (check it live)

I guess we’ll still have some extra discussions about this topic in the CSSWG, but percentages themselves deserve their own blog post.

Implementation

Once all the previous problems got solved, I landed the patches related to unprefixing the gutter properties in both Blink and WebKit. So you can use the unprefixed version since Chrome 66.0.3341.0 and Safari Technology Preview 50.

<div style="display: grid; grid: 100px 50px / 300px 200px;
            column-gap: 25px; row-gap: 10px;">
  <div>Item 1</div>
  <div>Item 2</div>
  <div>Item 3</div>
  <div>Item 4</div>
</div>

A simple Grid Layout example using the unprefixed gutter properties A simple Grid Layout example using the unprefixed gutter properties

Note that, as specified in the spec, the previous prefixed properties are still valid and will be kept as aliases to avoid breaking existing content.

Also it’s important to notice that the gap shorthand now applies to Multicol containers too, as it sets the value of the column-gap longhand (together with row-gap, which is ignored by Multicol).

<div style="column-count: 2; gap: 100px;">
  <div>First column</div>
  <div>Second column</div>
</div>

Multicolumn example using gap property Multicolumn example using gap property

Web Platform Tests

As usual in our recent developments, we have been using the web-platform-tests repository for all the tests related to this work. As a result of this work we now have 16 new tests that verify the support of these properties, including tests for the animation stuff too.

Running those tests on the different browsers, I realized there was an inconsistency between the css-align and css-multicol specifications. Both specs define the column-gap property, but the computed value was different. I raised a CSSWG issue that has been recently solved, so that the computed value for column-gap: normal should still be normal. This means that the property won’t be animatable from normal to other values, as explained before.

This is the summary of the status of these tests in the main browser engines:

  • Blink and WebKit: They pass all the tests and follow last CSSWG resolution.
  • Edge: Unprefixed properties are available since version 41. Percentage support is interoperable with Blink and WebKit. The computed value of column-gap: normal is not normal there, so this needs to get updated.
  • Firefox: It doesn’t have support for the unprefixed properties yet; however, the default computed value is normal, like in Blink and WebKit, and Multicol column-gap percentage support has just been added. Note that there are already patches under review for this issue, so hopefully they’ll be merged in the coming days.

Conclusions

The task is completed and everything should be settled at this point. You can start using these unprefixed properties, and it seems that Firefox will join the rest of the browsers by adding this support very soon.

Igalia and Bloomberg working together to build a better web Igalia and Bloomberg working together to build a better web

Last, but not least, this is again part of the ongoing collaboration between Igalia and Bloomberg. I don’t mind repeating myself over and over, but it’s really worth highlighting Bloomberg’s support of the CSS Grid Layout development. They have been showing the world that an external company can directly influence new specifications from the standards bodies and implementations by the browser vendors. Thank you very much!

Finally and just as a heads-up, I’ll be in Berlin next week for the CSSWG F2F meeting. I’m sure we’ll have interesting conversations about CSS Grid Layout and many other topics there.

April 03, 2018 10:00 PM

Neil Roberts

VkRunner – a shader test tool for Vulkan

As part of my work in the graphics team at Igalia, I’ve been helping with the effort to enable GL_ARB_gl_spirv for Intel’s i965 driver in Mesa. Most of the functionality of the extension is already working, so in order to get more test coverage and discover unknown missing features we have been working to automatically convert some Piglit GLSL tests to use SPIR-V. Most of the GLSL tests use a tool internal to Piglit called shader_runner. This tool largely simplifies the work needed to create a test by allowing it to be specified as a simple text file that just contains the required shaders and a list of commands to execute using them. The commands are very high-level, such as draw rect to submit a rectangle or probe rgba to check for a specific colour of a pixel in the framebuffer.

A number of times during this work I’ve encountered problems that look like they are general problems with Mesa’s SPIR-V compiler and aren’t specific to the GL_ARB_gl_spirv work. I wanted a way to easily confirm this but writing a minimal test case from scratch in Vulkan seemed like quite a lot of hassle. Therefore I decided it would be nice to have a tool like shader_runner that works with Vulkan. I made a start on implementing this and called it VkRunner.

Example

Here is an example shader test file for VkRunner:

[vertex shader]
#version 430

layout(location = 0) in vec3 pos;

void
main()
{
        gl_Position = vec4(pos.xy * 2.0 - 1.0, 0.0, 1.0);
}

[fragment shader]
#version 430

layout(location = 0) out vec4 color_out;

layout(std140, push_constant) uniform block {
        vec4 color_in;
};

void
main()
{
        color_out = color_in;
}

[vertex data]
0/r32g32b32_sfloat

0.25 0.0775 0.0
0.145 0.3875 0.0
0.25 0.3175 0.0

0.25 0.0775 0.0
0.355 0.3875 0.0
0.25 0.3175 0.0

0.0775 0.195 0.0
0.4225 0.195 0.0
0.25 0.3175 0.0

[test]
# Clear the framebuffer to green
clear color 0.0 0.6 0.0 1.0
clear

# White rectangle in the topleft corner
uniform vec4 0 1.0 1.0 1.0 1.0
draw rect 0 0 0.5 0.5

# Green star in the topleft
uniform vec4 0 0.0 0.6 0.0 1.0
draw arrays TRIANGLE_LIST 0 9

# Verify a rectangle colour
relative probe rect rgb (0.5, 0.0, 0.5, 1.0) (0.0, 0.6, 0.0)

If this is run through VkRunner it will convert the shaders to SPIR-V on the fly by piping them through glslangValidator, create a pipeline and a framebuffer and then run the test commands. The framebuffer is just an offscreen buffer so VkRunner doesn’t use any window system extensions. However you can still see the result by passing the -i image.ppm option, which causes it to write a PPM image of the final rendering.

The format is meant to be as close to shader_runner as possible but there are a few important differences. Vulkan can’t have uniforms in the default buffer and there is no way to access them by name. Instead you can use a push constant buffer and refer to the individual uniform using its byte offset in the buffer. VkRunner doesn’t yet support UBOs. The same goes for vertex attributes, which are now specified using an explicit location rather than by name. In the vertex data section you can use names from VkFormat to specify the format with maximum flexibility, but it can also accept Piglit-style names for compatibility.

Bonus features

VkRunner supports some extra features that shader_runner does not. Firstly you can specify a format for the framebuffer in the optional [require] section. For example you can make it use a floating-point framebuffer to be able to accurately probe results from functions.

[require]
framebuffer R32G32B32A32_SFLOAT

[vertex shader passthrough]

[fragment shader]
#version 430

layout(location = 0) out vec4 color;

void
main()
{
        color = vec4(atan(0.0, -1.0),
                     42.0,
                     length(vec2(1.0, 1.0)),
                     fma(2.0, 3.0, 1.0));
}

[test]
clear
draw rect -1 -1 2 2
probe all rgba 3.141592653589793 42.0 1.4142135623730951 7.0

If you want to use SPIR-V instead of GLSL in the source, you can specify the shader in a [fragment shader spirv] section. This is useful to test corner cases of the driver with tweaked shaders that glslang wouldn’t generate. You can get a base for the SPIR-V source by passing the -d option to VkRunner to get the disassembly of the generated SPIR-V. There is an example of this in the repo.

Status

Although it can already be used for a range of tests, VkRunner still has a lot of missing features compared to shader_runner. It does not yet support textures, UBOs or SSBOs. Take a look at the README for the complete documentation.

As I have been creating tests for problems that I encounter with Mesa I’ve been adding them to a branch called tests in the Github repo. I get the impression that Vulkan is somewhat under-tested compared to GL so it might be interesting to use these tests as a base to make a more complete Vulkan test suite. It could also potentially be merged into Piglit.

UPDATE: Patches to merge VkRunner into Piglit have been posted to the mailing list.

by nroberts at April 03, 2018 05:10 PM

Hyunjun Ko

Improvements for GStreamer Intel-MSDK plugins

Last November I had a chance to dive into the Intel Media SDK plugins in gst-plugins-bad. It was a very good chance for me to learn how GStreamer elements handle special memory with their own infrastructure, like GstBufferPool and GstMemory. In this post I’m going to talk about the improvements to the GStreamer Intel MSDK plugins in release 1.14, which is what I’ve been working on since last November.

First of all, for those not familiar with Media SDK, I’m going to explain briefly what it is. Media SDK (aka MSDK) is the cross-platform API to access Intel’s hardware-accelerated video encoder and decoder functions on Windows and Linux. You can get more information about MSDK here and here.

But on Linux, so far, it’s hard to set up an environment to get the MSDK driver working. If you want to set up a development environment for MSDK and see it working on Linux, you should follow the steps described on this page. But I would recommend you refer to Víctor’s post, which explains this somewhat annoying setup very well.

Additionally, the driver on Linux supports only Skylake as far as I know, which is very disappointing for users of other chipsets. I (and probably you) hope we can work on it without any specific patch/driver (since it is open source!) and without depending on a particular chipset. As far as I know, Intel has been considering solving this issue, so we’ll see what happens in the future.

Back to GStreamer: the plugins using MSDK landed in 2016. At that time they were working fine with basic features for playback and encoding, but there were several things to be improved, especially regarding performance.

Eventually, in the middle of last March, GStreamer 1.14 was released, including improvements to the MSDK plugins, which is what I want to talk about in this post.

So let me begin now.

Supports bufferpool and starts using video memory.

This is a key feature that improves the performance.

In MSDK, there are two types of memory supported in the driver. One is “System Memory” and another is “Video Memory”. (There is one more type of memory, called “Opaque Memory”, but I won’t talk about it in this post.)

System memory is memory allocated in user space, which is the normal case. System memory was already being used in the plugins; it is quite simple to use but not recommended since the performance is not good enough.

Video memory is memory used by the hardware acceleration device, also known as the GPU, to hold frames and other types of video data.

For applications to use video memory, something platform-specific has to be implemented. On Linux, for example, we can use VA-API to handle video memory through MSDK. That’s included in the 1.14 release, which means something similar still needs to be implemented for Windows.

To support video memory, I needed to implement GstMSDK(System/Video)Memory to generalize how to access and map the memory in the GStreamer way. GstMSDKBufferPool can allocate this type of memory, can be proposed to upstream and can be used in each MSDK element itself. There were lots of problems and discussions during this work since the design of the MSDK APIs and the GStreamer infrastructure don’t match perfectly.

You can see discussion and patches here in this issue.
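
To illustrate how such a pool fits the usual allocation negotiation, here is a minimal sketch of proposing a pool to upstream in an ALLOCATION query. It uses the generic gst_video_buffer_pool_new() as a stand-in for the MSDK pool constructor, so it only shows the shape of the code, not the actual msdk implementation.

#include <gst/gst.h>
#include <gst/video/video.h>

/* A sketch of proposing a bufferpool to upstream during the ALLOCATION
 * query. gst_video_buffer_pool_new() stands in for the MSDK pool
 * constructor used by the real plugin. */
static void
propose_pool (GstQuery * query, GstCaps * caps, guint size)
{
  GstBufferPool *pool = gst_video_buffer_pool_new ();
  GstStructure *config = gst_buffer_pool_get_config (pool);

  gst_buffer_pool_config_set_params (config, caps, size, 0, 0);
  gst_buffer_pool_config_add_option (config, GST_BUFFER_POOL_OPTION_VIDEO_META);
  gst_buffer_pool_set_config (pool, config);

  gst_query_add_allocation_pool (query, pool, size, 0, 0);
  gst_object_unref (pool);
}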

In addition, when using video memory on Linux, we can use DMABuf by acquiring an fd handle via VA-API at allocation time. Recently this has been done, only for DMABuf export, in this issue, though it’s not included in the 1.14 release.

Sharing context/session

For better resource utilization, the MSDK session needs to be shared between the MSDK plugins in a pipeline. An MSDK session maintains the context for the use of any of the decode, encode and convert (VPP) functions. Yes, it’s just like a handle to the driver. One session can run exactly one of decode, encode or convert (VPP).

So let’s think about a usual transcoding pipeline. It would look like this: src - decoder - converter - encoder - sink. In this case we should use the same MSDK session to decode, convert and encode. Otherwise we would have to copy the data from upstream to work with a different session, because sessions cannot share data, which would be much worse.

Also, there’s one more thing: MSDK supports joining sessions. If an application wants (or has) to use multiple sessions, it can join them to share data, and we need to support that at the GStreamer level as well.

All of this can be achieved with GstContext, which provides a way of sharing not only between elements. You can see the patch in this issue, the same as the MSDK bufferpool issue.
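
As a rough illustration of the mechanism (not the actual msdk code), this is how an application could hand an already-created context to a whole pipeline; the “gst.msdk.Context” type name and the “msdk-session” field are assumptions made up for the example.

#include <gst/gst.h>

/* A sketch of sharing a context with a pipeline from the application side.
 * The "gst.msdk.Context" name and "msdk-session" field are assumptions for
 * illustration, not the actual msdk definitions. */
static void
share_msdk_context (GstElement * pipeline, gpointer session_handle)
{
  GstContext *context = gst_context_new ("gst.msdk.Context", TRUE);
  GstStructure *s = gst_context_writable_structure (context);

  gst_structure_set (s, "msdk-session", G_TYPE_POINTER, session_handle, NULL);
  gst_element_set_context (pipeline, context);
  gst_context_unref (context);
}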

Adds vp8 and mpeg2 decoder.

Sree has added the mpeg2/vc1 decoders and I have added the vp8 decoder.

Supports a number of algorithms and tuning options in encoder

Encoders now expose a number of rate control algorithms, and more encoder tuning options like trellis quantization (h264), slice size control (h264), B-pyramid prediction (h264), MB-level bitrate control, frame partitioning and adaptive I/B frame insertion were added. The encoder now also handles force-key-unit events and can insert frame-packing SEIs for side-by-side and top-bottom stereoscopic 3D video.

All of this has been done by Sree and you can see the details in this issue

The capability of the encoder’s sinkpad is improved by using VPP.

MSDK encoders used to accept only NV12 raw data, since the MSDK encoder supports only the NV12 format. But other formats can be handled too if we convert them to NV12 using VPP, which is also supported by the MSDK driver. This has been done by slomo and I fixed a bug related to it. See this bug for more details.

You can find all of patches for MSDK plugins here.

As I said at the top of this post, the whole MSDK stack should be opened up first and should support at least some of the major Intel chipsets, even if not all of them. For now, the one thing that I can say is that the GStreamer MSDK plugins are being improved continuously. We’ll see what happens in the near future.

Finally, I want to say that Sree and Víctor helped me a lot as reviewers and advisers with this work. I really appreciate it.

Thanks for reading!

April 03, 2018 03:00 PM

March 28, 2018

Víctor Jáquez

GStreamer VA-API Troubleshooting

GStreamer VA-API is not a trivial piece of software. Even though, in my opinion, it is a bit over-engineered, the complexity lies in its layered architecture: the user must troubleshoot in which layer the failure is.

So, bear in mind this architecture:

libva architecture

And the point of failure could be anywhere.

Drivers

libva is a library designed to load another library called the driver or back-end. This driver is responsible for talking to the kernel, the windowing platform, the memory handling library, or any other piece of software or hardware that will actually do the video processing.

There are many drivers in the wild. As it is an API aimed at stateless video processing, and the industry is moving towards that way of processing video, more drivers are expected to appear in the future.

Nonetheless, not all the drivers have the same level of maturity, and some of them are abandonware. For this reason, some time ago we decided in GStreamer VA-API to add a white list of functional drivers: basically, those developed by Mesa3D and this one from Intel™. If you wish to disable that white list, you can do so by setting an environment variable:

$ export GST_VAAPI_ALL_DRIVERS=1

Remember, if you set it, you are on your own, since we do not trust the maturity of those drivers yet.

Internal libva↔driver version

There is an internal API between libva and the driver, and it is versioned, meaning that the internal API version of the installed libva library must match the internal API version exposed by the driver. One of the reasons libva may fail to initialize a driver is that the internal API versions do not match.

Drivers path and driver name

By default there is a path where libva looks for drivers to load. That path is defined at compilation time. Following Debian’s file-system hierarchy standard (FHS) it should be set by distributions in /usr/lib/x86_64-linux-gnu/dri/. But the user can control this path with an environment variable:

$ export LIBVA_DRIVERS_PATH=${HOME}/src/intel-vaapi-driver/src/.libs

The driver path, as a directory, might contain several drivers. libva will try to guess the correct one by querying the instantiated VA display (which could be either KMS/DRM, Wayland, Android or X11). If the user instantiates a VA display different from the running environment, the guess will be erroneous and the library loading will fail.

However, there is a way for the user to set the driver’s name too, again by setting an environment variable:

$ export LIBVA_DRIVER_NAME=iHD

With this setting, libva will try to load iHD_drv_video.so (a new and experimental open source driver from Intel™, targeted for MediaSDK —do not use it yet with GStreamer VAAPI—).

vainfo

vainfo is the diagnostic tool for VA-API. In a couple of words, it will iterate over a list of VA displays, in a trial-and-error strategy, and try to initialize VA. In case of success, vainfo will report the driver signature, and it will query the driver for the available profiles and entry points.

For example, my Skylake development board reports:

$ vainfo
error: can't connect to X server!
libva info: VA-API version 1.1.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /home/vjaquez/gst/master/intel-vaapi-driver/src/.libs/i965_drv_video.so
libva info: Found init function __vaDriverInit_1_1
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.1 (libva 2.1.1.pre1)
vainfo: Driver version: Intel i965 driver for Intel(R) Skylake - 2.1.1.pre1 (2.1.0-41-g99c3748)
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Simple            : VAEntrypointEncSlice
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointStats
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointStats
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointStats
      VAProfileH264MultiviewHigh      : VAEntrypointVLD
      VAProfileH264MultiviewHigh      : VAEntrypointEncSlice
      VAProfileH264StereoHigh         : VAEntrypointVLD
      VAProfileH264StereoHigh         : VAEntrypointEncSlice
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileVP8Version0_3          : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice

And my AMD board with stable packages replies:

$ vainfo
libva info: VA-API version 0.40.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_0_40
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.40 (libva )
vainfo: Driver version: mesa gallium vaapi
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileNone                   : VAEntrypointVideoProc

Does this mean that VA-API processes video? No. It means that there is a usable VA display which could open a driver correctly, and that libva can extract symbols from it.

I would like to mention another tool, not official, but one I like a lot, since it extracts almost all of the VA information available in the driver: vadumpcaps.c, written by Mark Thompson.
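
If you prefer checking the driver initialization programmatically, this is roughly what vainfo does under the hood for the DRM display. A minimal sketch (the render node path may differ on your system):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <va/va.h>
#include <va/va_drm.h>

/* A minimal sketch of what vainfo does for the DRM display: open a render
 * node, get a VA display and try to initialize the driver. */
int
main (void)
{
  int fd = open ("/dev/dri/renderD128", O_RDWR); /* path may differ */
  VADisplay display = vaGetDisplayDRM (fd);
  int major, minor;

  if (vaInitialize (display, &major, &minor) != VA_STATUS_SUCCESS) {
    fprintf (stderr, "failed to initialize the VA driver\n");
    return 1;
  }

  printf ("VA-API version: %d.%d (%s)\n", major, minor,
      vaQueryVendorString (display));

  vaTerminate (display);
  close (fd);
  return 0;
}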

GStreamer VA-API registration

When GStreamer is launched, normally it will register all the available plugins and plugin features (elements, device providers, etc.). All that data is cached and kept until the cache file is deleted or the cache is invalidated by some event.

At registration time, GStreamer VA-API will instantiate a DRM-based VA display, which works without the need of a real display (in other words, headless), and will query the driver for the profile and entry-point tuples, in order to register only the available elements (encoders, decoders, sink, post-processor). If the DRM VA display fails, a list of VA displays will be tried.

If libva could not load any driver, or the driver is not in the white list, GStreamer VA-API will not register any elements. Otherwise gst-inspect-1.0 will show the registered elements:

$ gst-inspect-1.0 vaapi
Plugin Details:
  Name                     vaapi
  Description              VA-API based elements
  Filename                 /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstvaapi.so
  Version                  1.12.4
  License                  LGPL
  Source module            gstreamer-vaapi
  Source release date      2017-12-07
  Binary package           gstreamer-vaapi
  Origin URL               http://bugzilla.gnome.org/enter_bug.cgi?product=GStreamer

  vaapijpegdec: VA-API JPEG decoder
  vaapimpeg2dec: VA-API MPEG2 decoder
  vaapih264dec: VA-API H264 decoder
  vaapivc1dec: VA-API VC1 decoder
  vaapivp8dec: VA-API VP8 decoder
  vaapih265dec: VA-API H265 decoder
  vaapipostproc: VA-API video postprocessing
  vaapidecodebin: VA-API Decode Bin
  vaapisink: VA-API sink
  vaapimpeg2enc: VA-API MPEG-2 encoder
  vaapih265enc: VA-API H265 encoder
  vaapijpegenc: VA-API JPEG encoder
  vaapih264enc: VA-API H264 encoder

  13 features:
  +-- 13 elements

Besides the normal behavior, GStreamer VA-API will also invalidate GStreamer’s cache at every boot, or when any of the mentioned environment variables changes.

Conclusion

A simple task list to review when GStreamer VA-API is not working at all is this:

1. Check your LIBVA_* environment variables
2. Verify that vainfo returns sensible information
3. Invalidate GStreamer’s cache (or just delete the file)
4. Check the output of gst-inspect-1.0 vaapi

And, if you decide to file a bug in bugzilla, please do not forget to attach the output of vainfo and the logs if the developer asks for them.

by vjaquez at March 28, 2018 06:11 PM

March 27, 2018

Víctor Jáquez

GStreamer VA-API 1.14: what’s new?

As you may already know, there is a new release of GStreamer, 1.14. In this blog post we will talk about the new features and improvements of GStreamer VA-API module, though you have a more comprehensive list of changes in the release notes.

Most of the topics explained along this blog post are already mentioned in the release notes, but a bit more detailed.

DMABuf usage

We have improved DMA-buf usage, mostly on the downstream side.

In the case of upstream, we just got rid of a nasty hack which detected when to instantiate and use a buffer pool on the sink pad with a dma-buf based allocator. This functionality had already been broken for a while, and that code was the wrong way to enable it. The sharing of a dma-buf based buffer pool with upstream is going to be re-enabled after bug 792034 is merged.

For downstream, we have added handling of the memory:DMABuf caps feature. The purpose of this caps feature is to negotiate media when the buffers are not mappable into user space, because of digital rights or platform restrictions.

For example, currently intel-vaapi-driver doesn’t allow mapping the dma-buf descriptors it produces. But, as we cannot know whether a back-end produces mappable dma-buf descriptors or not, gstreamer-vaapi, when the allocator is instantiated, creates a dummy buffer and tries to map it; if that fails, the memory:DMABuf caps feature is negotiated, otherwise normal video caps are used.

VA-API usage

First of all, GStreamer VA-API now has support for libva 2.0, which means VA-API 1.0. We had to guard some deprecated symbols as well as the new ones. Nowadays most distributions have upgraded to libva 2.0.

We have improved the initialization of the VA display internal structure (GstVaapiDisplay). Previously, if an X-based display was instantiated, it immediately tried to grab the screen resolution. Obviously, this broke usage on headless systems. We now delay the screen resolution check until the VA display truly requires that information.

New APIs were added to VA, particularly for log handling. Now it is possible to redirect the log messages to a callback. Thus, we use it to redirect VA-API messages to the GStreamer log mechanisms, uncluttering the console output.
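
For reference, this is roughly how such a redirection looks with the libva 2.x callback API; a simplified sketch, not the exact gstreamer-vaapi code:

#include <gst/gst.h>
#include <va/va.h>

/* A simplified sketch of redirecting libva messages to the GStreamer log,
 * not the exact gstreamer-vaapi implementation. */
static void
va_error_cb (gpointer user_context, const gchar * message)
{
  GST_ERROR ("libva: %s", message);
}

static void
va_info_cb (gpointer user_context, const gchar * message)
{
  GST_INFO ("libva: %s", message);
}

static void
setup_va_logging (VADisplay display)
{
  vaSetErrorCallback (display, va_error_cb, NULL);
  vaSetInfoCallback (display, va_info_cb, NULL);
}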

Also, we have blacklisted, in autoconf and meson, libva version 0.99.0, because that version is used internally by the closed-source version of Intel MediaSDK, which is incompatible with official libva. By the way, there is a new open-source version of MediaSDK, but we will talk about it in a future blog post.

Application VA Display sharing

Normally, the GstVaapiDisplay object is shared along the pipeline through the GstContext mechanism. But this class is defined internally and has not been exposed to users since release 1.6. This posed a problem when an application wanted to create its own VA display and share it with an embedded pipeline. The solution is a new application context message: gst.vaapi.app.Display, defined as a GstStructure with two fields: va-display, with the application’s vaDisplay, and x11-display, with the application’s native X11 display. In the future, a Wayland native handle will be processed too. Please note that this context message is only processed by vaapisink.
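
From the application side, it would look roughly like this. A sketch based on the description above; the exact field types vaapisink expects should be double-checked against the gstreamer-vaapi sources:

#include <gst/gst.h>

/* A sketch of sharing an application-created VA display with vaapisink,
 * based on the gst.vaapi.app.Display structure described above. The exact
 * field types should be verified against the gstreamer-vaapi sources. */
static void
share_va_display (GstElement * vaapisink, gpointer va_display,
    gpointer x11_display)
{
  GstContext *context = gst_context_new ("gst.vaapi.app.Display", FALSE);
  GstStructure *s = gst_context_writable_structure (context);

  gst_structure_set (s,
      "va-display", G_TYPE_POINTER, va_display,
      "x11-display", G_TYPE_POINTER, x11_display, NULL);

  gst_element_set_context (vaapisink, context);
  gst_context_unref (context);
}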

One precondition for this solution was the removal of the VA display cache mechanism, a long-standing request from users, which, of course, we did.

Interoperability with appsink and similar

A hardware-accelerated driver, such as the Intel one, may have custom offsets and strides for specific memory regions. We use GstVideoMeta to set these custom values. The problem comes when downstream does not handle this meta, for example appsink. Then the user expects the “normal” values for those variables, but in the case of GStreamer VA-API with a hardware-based driver, when the user displays the frame, it is shown corrupted.

In order to fix this, we have to make a memory copy from our custom VA-API images to allocated system memory. Of course there is a big CPU penalty, but it is better than delivering wrong video frames. If the user wants better performance, they should look for a different approach.

Resurrection of GstGLUploadTextureMeta for EGL renders

I know, GstGLUploadTextureMeta must die, right? I am convinced of it. But the Clutter video sink uses it, and it has a vast number of users, so we still have to support it.

In the last release we removed the support for EGL/Wayland at the last minute because we found a terrible bug just before the release. GLX support has always been there.

With Daniel van Vugt’s efforts, we resurrected the support for that meta in EGL, though I expect the replacement of the Clutter sink with glimagesink someday soon.

vaapisink demoted in Wayland

vaapisink was demoted to marginal rank on Wayland because COGL cannot display YUV surfaces.

This means, by default, vaapisink won’t be auto-plugged when playing in Wayland.

The reason is that Mutter (aka GNOME) cannot display the frames processed by vaapisink in Wayland. Nonetheless, please note that in Weston it works just fine.

Decoders

We have improved upstream renegotiation a little bit: if the new stream is compatible with the previous one, there is no need to reset the internal parser, except for changes in codec-data.

low-latency property in H.264

A new property has been added to the H.264 decoder only: low-latency. It is intended for live streams that do not conform to the H.264 specification (sadly there are many in the wild) and need us to tweak the spec implementation. This property forces the frames in the decoded picture buffer to be pushed as soon as possible.

base-only property in H.264

This is the result of the Google Summer of Code 2017, by Orestis Floros. When this property is enabled, all the MVC (Multiview Video Coding) or SVC (Scalable Video Coding) frames are dropped. This is useful if you want to reduce the processing time or if your VA-API driver does not support those kind of streams.

Encoders

In this release we have put a lot of effort in encoders.

Processing Regions of Interest

It is possible, for certain back-ends and profiles (for example, the H.264 and H.265 encoders with the Intel driver), to specify a set of regions of interest per frame, with a delta-qp per region. This means that we ask for more quality in those regions.

In order to process regions of interest, upstream must add a list of GstVideoRegionOfInterestMeta to the video frame. This list is then traversed by the encoder, which honors the requests if the VA-API profile, in the driver, supports it.

The common use case for this feature is when you want higher definition in regions with faces or text messages in the picture, for example.
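
For example, an upstream element or a pad probe in the application could attach the meta like this. A sketch where the “roi/vaapi” structure name and the “delta-qp” field are assumptions used for illustration; check the gstreamer-vaapi documentation for the exact names it expects:

#include <gst/gst.h>
#include <gst/video/video.h>

/* A sketch of tagging a region of interest on a video buffer. The
 * "roi/vaapi" structure name and "delta-qp" field are assumptions for
 * illustration; check gstreamer-vaapi for the exact names it expects. */
static void
add_face_roi (GstBuffer * buffer, guint x, guint y, guint w, guint h)
{
  GstVideoRegionOfInterestMeta *meta =
      gst_buffer_add_video_region_of_interest_meta (buffer, "face", x, y, w, h);

  gst_video_region_of_interest_meta_add_param (meta,
      gst_structure_new ("roi/vaapi", "delta-qp", G_TYPE_INT, -10, NULL));
}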

New encoding properties

  • quality-level: For all the available encoders. This is a number between 1 and 8, where a lower number means higher quality (and slower processing).
  • aud: This is for H.264 encoder only and it is available for certain drivers and platforms. When it is enabled, an AU delimiter is inserted for each encoded frame. This is useful for network streaming, and more particularly for Apple clients.

  • mbbrc: For H.264 only. Controls (auto/on/off) the macro-block bit-rate.

  • temporal-levels: For H.264 only. It specifies the number of temporal levels to include in the hierarchical frame prediction.

  • prediction-type: For H.264 only. It selects the reference picture selection mode.

    The frames are encoded as different layers. A frame in a particular layer will use pictures in the same or a lower layer as references. This means the decoder can drop frames in an upper layer and still decode lower layer frames.

    • hierarchical-p: P frames, except in the top layer, are reference frames. Base layer frames are I or B.
    • hierarchical-b: B frames, except in the topmost layer, are reference frames. All the base layer frames are I or P.

  • refs: Added for H.265 (it was already supported for H.264). It specifies the number of reference pictures.

  • qp-ip and qp-ib: For H.264 and H.265 encoders. They handle the QP (quantization parameter) difference between the I and P frames, and the I and B frames, respectively.

  • Set media profile via downstream caps

    H.264 and H.265 encoders now can configure the desired media profile through the downstream caps.

    Contributors

    Many thanks to all the contributors and bug reporters.

         1  Daniel van Vugt
        46  Hyunjun Ko
         1  Jan Schmidt
         3  Julien Isorce
         1  Matt Staples
         2  Matteo Valdina
         2  Matthew Waters
         1  Michael Tretter
         4  Nicolas Dufresne
         9  Orestis Floros
         1  Philippe Normand
         4  Sebastian Dröge
        24  Sreerenj Balachandran
         1  Thibault Saunier
        13  Tim-Philipp Müller
         1  Tomas Rataj
         2  U. Artie Eoff
         1  VaL Doroshchuk
       172  Víctor Manuel Jáquez Leal
         3  XuGuangxin
         2  Yi A Wang
    

    by vjaquez at March 27, 2018 10:52 AM

    March 23, 2018

    Asumu Takikawa

    How to develop VPP plugins

    Recently, my teammate Jessica wrote an excellent intro blog post about VPP. VPP is an open source user-space networking framework that we’re looking into at Igalia. I highly recommend reading Jessica’s post before this one to get acquainted with general VPP ideas.

    In this blog post, I wanted to follow up on that blog post and talk a bit about some of the details of plugin construction in VPP and document some of the internals that took me some time to understand.

    First, I’ll start off by talking about how to architect the flow of packets in and out of a plugin.

    How and where do you add nodes in the graph?

    VPP is based on a graph architecture, in which vectors of packets are transferred between nodes in a graph.

    Here’s an illustration from a VPP presentation of what the graph might look like:

    VPP graph nodes

    (source: VPP overview from fd.io DevBoot)

    One thing that took me a while to understand, however, was the mechanics of how nodes in the graph are actually hooked together. I’ll try to explain some of that, starting with the types of nodes that are available.

    Some of the nodes in VPP produce data from a network driver (perhaps backed by DPDK) or some other source for consumption. Other nodes are responsible for manipulating incoming data in the graph and possibly producing some output.

    The former are called “input” nodes and are called on each main loop iteration. The latter are called “internal” nodes. The type of node is specified using the vlib_node_type_t type when declaring a node using VLIB_REGISTER_NODE.

    There’s also another type of node that is useful, which is the “process” node. This exposes a thread-like behavior, in which the node’s callback routine can be suspended and reanimated based on events or a timer. This is useful for sending periodic notifications or polling some data that’s managed by another node.
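
    To make that concrete, here is a minimal sketch of what a process node can look like, based on the patterns used in upstream VPP (the helper names are worth double-checking against your tree):

    #include <vlib/vlib.h>

    /* A minimal sketch of a process node that wakes up every 10 seconds
     * (or when an event is signalled), following common upstream patterns. */
    static uword
    my_periodic_process (vlib_main_t * vm, vlib_node_runtime_t * rt,
                         vlib_frame_t * f)
    {
      uword *event_data = 0;

      while (1)
        {
          /* Suspend until an event arrives or the timer expires. */
          vlib_process_wait_for_event_or_clock (vm, 10.0);
          vlib_process_get_events (vm, &event_data);
          vec_reset_length (event_data);

          /* ... poll some state or send periodic packets here ... */
        }
      return 0;
    }

    VLIB_REGISTER_NODE (my_periodic_process_node) = {
      .function = my_periodic_process,
      .type = VLIB_NODE_TYPE_PROCESS,
      .name = "my-periodic-process",
    };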

    Since input nodes are the start of the graph, they are responsible for generating packets from some source like a NIC or pcap file and injecting them into the rest of the graph.

    For internal nodes, the packets have to come from somewhere else. There seem to be many ways to specify how an internal node gets its packets.

    For example, it’s possible to tell VPP to direct all packets with a specific ethertype or IP protocol to a node that you write (see this wiki page for some of these options).

    Another method, that’s convenient for use in external plugins, is using “feature arcs”. VPP comes with abstract concepts called “features” (which are distinct from actual nodes) that essentially form an overlay over the concrete graph of nodes.

    These features can be enabled or disabled on a per-interface basis.

    You can hook your own node into these feature arcs by using VNET_FEATURE_INIT:

    /**
     * @brief Hook the sample plugin into the VPP graph hierarchy.
     */
    VNET_FEATURE_INIT (sample, static) = 
    {
      .arc_name = "device-input",
      .node_name = "sample",
      .runs_before = VNET_FEATURES ("ethernet-input"),
    };
    

    This code from the sample plugin is setting the sample node to run on the device-input arc. For a given arc like this, the VPP dispatcher will run the code for the start node on the arc and then run all of the features that are registered on that arc. This ordering is determined by the .runs_before clauses in those features.

    Feature arcs themselves are defined via macro:

    VNET_FEATURE_ARC_INIT (device_input, static) =
    {
      .arc_name  = "device-input",
      .start_nodes = VNET_FEATURES ("device-input"),
      .arc_index_ptr = &feature_main.device_input_feature_arc_index,
    };
    

    As you can see, arcs have a start node. Inside the start node there is typically some code that checks if features are present on the node and, if the check succeeds, will designate the next node as one of those feature nodes. For example, this snippet from the DPDK plugin:

    /* Do we have any driver RX features configured on the interface? */
    vnet_feature_start_device_input_x4 (xd->vlib_sw_if_index,
                        &next0, &next1, &next2, &next3,
                        b0, b1, b2, b3);
    

    From this snippet, you can see how the process of getting data through feature arcs is kicked off. In the next section, you’ll see how internal nodes in the middle of a feature arc can continue the chain.

    BTW: the methods of hooking up your node into the graph I’ve described here aren’t exhaustive. For example, there’s also something called DPOs (“data path object”) that I haven’t covered at all. Maybe a topic for a future blog post (once I understand them!).

    Specifying the next node

    I went over a few methods for specifying how packets get to a node, but you may also wonder how packets get directed out of a node. There are also several ways to set that up too.

    When declaring a VPP node using VLIB_REGISTER_NODE, you can provide the .next_nodes argument (along with .n_next_nodes) to specify an indexed list of next nodes in the graph. Then you can use the next node indices in the node callback to direct packets to one of the next nodes.

    The sample plugin sets it up like this:

    VLIB_REGISTER_NODE (sample_node) = {
      /* ... elided for example ... */
    
      .n_next_nodes = SAMPLE_N_NEXT,
    
      /* edit / add dispositions here */
      .next_nodes = {
            [SAMPLE_NEXT_INTERFACE_OUTPUT] = "interface-output",
      },
    };
    

    This declaration sets up the only next node as interface-output, which is the node that selects some hardware interface to transmit on.

    Alternatively, it’s possible to programmatically fetch a next node index by name using vlib_get_node_by_name:

    ip4_lookup_node = vlib_get_node_by_name (vm, (u8 *) "ip4-lookup");
    ip4_lookup_node_index = ip4_lookup_node->index;
    

    This excerpt is from the VPP IPFIX code that sends report packets periodically. This kind of approach seems more common in process nodes.

    When using feature arcs, a common approach is to use vnet_feature_next to select a next node based on the feature mechanism:

    /* Setup packet for next IP feature */
    vnet_feature_next(vnet_buffer(b0)->sw_if_index[VLIB_RX], &next0, b0);
    vnet_feature_next(vnet_buffer(b1)->sw_if_index[VLIB_RX], &next1, b1);
    

    In this example, the next0 and next1 variables are being set by vnet_feature_next based on the appropriate next features in the feature arc.

    (vnet_feature_next is a convenience function that uses vnet_get_config_data under the hood as described in this mailing list post if you want to know more about the mechanics)

    Though even when using feature arcs, it doesn’t seem mandatory to use vnet_feature_next. The sample plugin does not, for example.

    Buffer metadata

    In addition to packets, nodes can also pass some metadata around to each other. Specifically, there is some “opaque” space reserved in the vlib_buffer_t structure that stores the packet data:

    typedef struct{
    
    /* ... elided ... */
    
    u32 opaque[10]; /**< Opaque data used by sub-graphs for their own purposes.
                      See .../vnet/vnet/buffer.h */
    
    /* ... */
    
    } vlib_buffer_t;
    

    (source link)

    For networking purposes, the vnet library provides a data type vnet_buffer_opaque_t that is stored in the opaque field. This contains data useful to networking layers like Ethernet, IP, TCP, and so on.

    The vnet data is unioned (that is, data for one layer may be overlaid on data for another) and it’s expected that graph nodes make sure not to stomp on data that needs to be set in their next nodes. You can see how the data intersects in this slide from an FD.io DevBoot.

    One of the first fields stored in the opaque data is sw_if_index, which is set by the driver nodes initially. The field stores a 2-element array of input and output interface indices for a buffer.

    One of the things that puzzled me for a bit was precisely how the sw_if_index field, which is very commonly read and set in VPP plugins, is used. Especially the TX half of the pair, which is more commonly manipulated. It looks like the field is often set to instruct VPP to choose a transmit interface for a packet.

    For example, in the sample plugin code the index is set in the following way:

    sw_if_index0 = vnet_buffer(b0)->sw_if_index[VLIB_RX];
    
    /* Send pkt back out the RX interface */
    vnet_buffer(b0)->sw_if_index[VLIB_TX] = sw_if_index0;
    

    What this is doing is getting the interface index where the packet was received and then setting the output index to be the same. This tells VPP to send off the packet to the same interface it was received from.

    (This is explained in one of the VPP videos.)

    For a process node that’s sending off new packets, you might set the indices like this:

    vnet_buffer(b0)->sw_if_index[VLIB_RX] = 0;
    vnet_buffer(b0)->sw_if_index[VLIB_TX] = ~0;
    

    The receive index is set to 0, which means the local0 interface that is always configured. The use of ~0 means “don’t know yet”. The effect of that is (see the docs for ip4-lookup) that a lookup is done in the FIB attached to the receive interface, which is the local0 one in this case.

    Some code inside VPP instead sets the transmit index to some user configurable fib_index to let the user specify a destination (often defaulting to ~0 if the user specified nothing):

    vnet_buffer(b0)->sw_if_index[VLIB_RX] = 0;
    vnet_buffer(b0)->sw_if_index[VLIB_TX] = fib_index;
    

    VPP libraries and other resources

    Finally, I want to briefly talk about some additional resources for learning about VPP details.

    There are numerous useful libraries that come with VPP, such as data structures (e.g., vectors and hash-tables), formatting, and logging. These are contained in source directories like vlib and vppinfra.

    I found these slides on VPP libraries and its core infrastructure quite helpful. It’s from a workshop that was held in Paris. The slides give an overview of various programming components provided in VPP, and then you can look up the detailed APIs in the VPP docs.

    There are also a number of other useful pages on the VPP wiki. If you prefer learning from videos, there are some good in-depth video tutorials on VPP from the official team too.

    Since I’m a VPP newbie myself, you may want to carefully look at the source code and make sure what I’m saying is accurate. Please do let me know if you spot any inaccuracies in the post.

    by Asumu Takikawa at March 23, 2018 02:10 AM

    March 21, 2018

    José Dapena

    Updated Chromium Legacy Wayland Support

    Introduction

    The future Ozone Wayland backend is still not ready for shipping, so we are announcing the release of an updated Ozone Wayland backend for Chromium, based on the implementation provided by Intel. It is rebased on top of the latest stable Chromium release and you can find it in my team’s GitHub. Hope you will appreciate it.

    Official Chromium on Linux desktop nowadays

    Linux desktop is progressively migrating to use Wayland as the display server. It is the default option in Fedora, Ubuntu ~~and, more importantly, the next Ubuntu Long Term Support release will ship Gnome Shell Wayland display server by default~~ (P.S. since this post was originally written, Ubuntu has delayed the Wayland adoption for LTS).

    As it is now, Chromium browser support for the Linux desktop is based on X11. This means it will natively interact with an X server and with its XDG extensions for displaying the contents and receiving user events. But, as said, the next generation of the Linux desktop will be using Wayland display servers instead of X11. How does that work? Using the XWayland server, a full X11 server built on top of the Wayland protocol. Ok, but that has an impact on performance. Chromium needs to communicate with and paint to X11-provided buffers, and then those buffers need to be shared with the Wayland display server. And the user events need to be proxied from the Wayland display server through the XWayland server and the X11 protocol. It requires more resources: more memory, CPU, and GPU. And it adds more latency to the communication.

    Ozone

    Chromium supports officially several platforms (Windows, Android, Linux desktop, iOS). But it provides abstractions for porting it to other platforms.

    The set of abstractions is named Ozone (more info here). It allows implementing one or more platform components with the hooks for properly integrating with a platform that is in the set of officially supported targets. Among other things it provides abstractions for:
    * Obtaining accelerated surfaces.
    * Creating and obtaining windows to paint the contents.
    * Interacting with the desktop cursor.
    * Receiving user events.
    * Interacting with the window manager.

    Chromium and Wayland (2014-2016)

    Even if Wayland was not used on Linux desktop, a bunch of embedded devices have been using Wayland for their display server for quite some time. LG has been shipping a full Wayland experience on the webOS TV products.

    In the last 4 years, Intel has been providing an implementation of the Ozone abstractions for Wayland. It was amazing work that allowed running the Chromium browser on top of a Wayland compositor. This backend has been the de facto standard for running the Chromium browser on all these Wayland-enabled embedded devices.

    But the development of this implementation has mostly stopped around Chromium 49 (though rebases on top of Chromium 51 and 53 have been provided).

    Chromium and Wayland (2018+)

    Since the end of 2016, Igalia has been involved on several initiatives to allow Chromium to run natively in Wayland. Even if this work is based on the original Ozone Wayland backend by Intel, it is mostly a rewrite and adaptation to the future graphics architecture in Chromium (Viz and Mus).

    This is being developed in the Igalia GitHub, downstream, though it is expected to be landed upstream progressively. Hopefully, at some point in 2018, this new backend will be fully ready for shipping products with it. But we are still not there. ~~Some major missing parts are Wayland TextInput protocol and content shell support~~ (P.S. since this was written, both TextInput and content shell support are working now!).

    More information on these posts from the authors:
    * June 2016: Understanding Chromium’s runtime ozone platform selection (by Antonio Gomes).
    * October 2016: Analysis of Ozone Wayland (by Frédéric Wang).
    * November 2016: Chromium, ozone, wayland and beyond (by Antonio Gomes).
    * December 2016: Chromium on R-Car M3 & AGL/Wayland (by Frédéric Wang).
    * February 2017: Mus Window System (by Frédéric Wang).
    * May 2017: Chromium Mus/Ozone update (H1/2017): wayland, x11 (by Antonio Gomes).
    * June 2017: Running Chromium m60 on R-Car M3 board & AGL/Wayland (by Maksim Sisov).

    Releasing legacy Ozone Wayland backend (2017-2018)

    Ok, so the new Wayland backend is still not ready in some cases, and the old one is unmaintained. For that reason, LG is announcing the release of an updated legacy Ozone Wayland backend. It is essentially the original Intel backend, but ported to the current Chromium stable.

    Why? Because we want to provide a migration path to the future Ozone Wayland backend. And because we want to share this effort with other developers willing to run Chromium on Wayland immediately, or who are still using the old backend and cannot immediately migrate to the new one.

    WARNING: If you are starting development for a product that is going to ship in 1-2 years… very likely your best option is to migrate now to the new Ozone Wayland backend (and help with the missing bits). We will stop maintaining this legacy backend ourselves once the new Ozone Wayland backend lands upstream and covers all our needs.

    What does this port include?
    * Rebased on top of Chromium m60, m61, m62 and m63.
    * Ported to GN.
    * It already includes some changes to adapt to the new Ozone Wayland refactors.

    It is hosted at https://github.com/lgsvl/chromium-src.

    Enjoy it!

    Originally published at webOS Open Source Edition Blog. and licensed under Creative Commons Attribution 4.0.

    by José Dapena Paz at March 21, 2018 07:47 AM

    March 20, 2018

    Iago Toral

    Improving shader performance with Vulkan’s specialization constants

    For some time now I have been working on and off on a personal project with no other purpose than toying a bit with Vulkan and some rendering and shading techniques. Although I’ll probably write about that at some point, in this post I want to focus on Vulkan’s specialization constants and how they can provide a very visible performance boost when they are used properly, as I had the chance to verify while working on this project.

    The concept behind specialization constants is very simple: they allow applications to set the value of a shader constant at run-time. At first sight, this might not look like much, but it can have very important implications for certain shaders. To showcase this, let’s take the following snippet from a fragment shader as a case study:

    layout(push_constant) uniform pcb {
       int num_samples;
    } PCB;
    
    const int MAX_SAMPLES = 64;
    layout(set = 0, binding = 0) uniform SamplesUBO {
       vec3 samples[MAX_SAMPLES];
    } S;
    
    void main()
    {
       ...
       for(int i = 0; i < PCB.num_samples; ++i) {
          vec3 sample_i = S.samples[i];
          ...
       }
       ...
    }
    

    That is a snippet taken from a Screen Space Ambient Occlusion shader that I implemented in my project, a popular technique used in a lot of games, so it represents a real-world scenario. As we can see, the process involves a set of vector samples passed to the shader as a UBO that are processed for each fragment in a loop. We have made the maximum number of samples that the shader can use large enough to accommodate a high-quality scenario, but the actual number of samples used in a particular execution will be taken from a push constant uniform, so the application has the option to choose the quality / performance balance it wants to use.

    While the code snippet may look trivial enough, let’s see how it interacts with the shader compiler:

    The first obvious issue we find with this implementation is that it is preventing loop unrolling to happen because the actual number of samples to use is unknown at shader compile time. At most, the compiler could guess that it can’t be more than 64, but that number of iterations would still be too large for Mesa to unroll the loop in any case. If the application is configured to only use 24 or 32 samples (the value of our push constant uniform at run-time) then that number of iterations would be small enough that Mesa would unroll the loop if that number was known at shader compile time, so in that scenario we would be losing the optimization just because we are using a push constant uniform instead of a constant for the sake of flexibility.

    The second issue, which might be less immediately obvious and yet is the most significant one, is the fact that if the shader compiler can tell that the size of the samples array is small enough, then it can promote the UBO array to a push constant. This means that each access to S.samples[i] turns from an expensive memory fetch to a direct register access for each sample. To put this in perspective, if we are rendering to a full HD target using 24 samples per fragment, it means that we would be saving ourselves from doing 1920x1080x24 memory reads per frame for a very visible performance gain. But again, we would be losing this optimization because we decided to use a push constant uniform.

    Vulkan’s specialization constants allow us to get back these performance optimizations without sacrificing the flexibility we implemented in the shader. To do this, the API provides mechanisms to specify the values of the constants at run-time, but before the shader is compiled.

    Continuing with the shader snippet we showed above, here is how it can be rewritten to take advantage of specialization constants:

    layout (constant_id = 0) const int NUM_SAMPLES = 64;
    layout(std140, set = 0, binding = 0) uniform SamplesUBO {
       vec3 samples[NUM_SAMPLES];
    } S;
    
    void main()
    {
       ...
       for(int i = 0; i < NUM_SAMPLES; ++i) {
          vec3 sample_i = S.samples[i];
          ...
       }
       ...
    }
    

    We are now informing the shader that we have a specialization constant NUM_SAMPLES, which represents the actual number of samples to use. By default (if the application doesn’t say otherwise), the specialization constant’s value is 64. However, now that we have a specialization constant in place, we can have the application set its value at run-time, like this:

    VkSpecializationMapEntry entry = { 0, 0, sizeof(int32_t) };
    VkSpecializationInfo spec_info = {
       1,
       &entry,
       sizeof(uint32_t),
       &config.ssao.num_samples
    };
    ...
    

    The application code above sets up specialization constant information for shader consumption at run-time. This is done via an array of VkSpecializationMapEntry entries, each one determining where to fetch the constant value to use for each specialization constant declared in the shader for which we want to override its default value. In our case, we have a single specialization constant (with id 0), and we are taking its value (of integer type) from offset 0 of a buffer; since we only have one specialization constant, that buffer is just the address of the variable holding the constant’s value (config.ssao.num_samples). When we create the Vulkan pipeline, we pass this specialization information using the pSpecializationInfo field of VkPipelineShaderStageCreateInfo. At that point, the driver will override the default value of the specialization constant with the value provided here before the shader code is optimized and native GPU code is generated, which allows the driver compiler backend to generate optimal code.
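
    For completeness, this is roughly how that structure is hooked into the pipeline creation; a sketch assuming fs_module holds the fragment shader’s VkShaderModule:

    /* A sketch of passing the specialization info when creating the pipeline;
     * fs_module is assumed to be the fragment shader's VkShaderModule. */
    VkPipelineShaderStageCreateInfo fs_stage = {
       .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
       .stage = VK_SHADER_STAGE_FRAGMENT_BIT,
       .module = fs_module,
       .pName = "main",
       .pSpecializationInfo = &spec_info,
    };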

    It is important to remark that specialization takes place when we create the pipeline, since that is the only moment at which Vulkan drivers compile shaders. This makes specialization constants particularly useful when we know the value we want to use ahead of starting the rendering loop, for example when we are applying quality settings to shaders. However, if the value of the constant changes frequently, specialization constants are not useful, since they require expensive shader re-compiles every time we want to change their value, and we want to avoid that as much as possible in our rendering loop. Nevertheless, it is possible to compile the same shader with different constant values in different pipelines, so even if a value changes often, so long as we have a finite number of combinations, we can generate optimized pipelines for each one ahead of the start of the rendering loop and just swap pipelines as needed while rendering.

    Conclusions

    Specialization constants are a straightforward yet powerful way to gain control over how shader compilers optimize your code. In my particular pet project, applying specialization constants in a small number of shaders allowed me to benefit from loop unrolling and, most importantly, UBO promotion to push constants in the SSAO pass, obtaining performance improvements that ranged from 10% up to 20% depending on the configuration.

    Finally, although the above covered specialization constants from the point of view of Vulkan, this is really a feature of the SPIR-V language, so it is also available in OpenGL with the GL_ARB_gl_spirv extension, which is core since OpenGL 4.6.

    by Iago Toral at March 20, 2018 04:45 PM

    March 19, 2018

    Philippe Normand

    GStreamer’s playbin3 overview for application developers

    Multimedia applications based on GStreamer usually handle playback with the playbin element. I recently added support for playbin3 in WebKit. This post aims to document the changes needed on application side to support this new generation flavour of playbin.

    So, first off, why is it named playbin3 anyway? The GStreamer 0.10.x series had a playbin element, but a first rewrite (playbin2) made it obsolete in the GStreamer 1.x series. So playbin2 was renamed to playbin. That’s why a second rewrite is nicknamed playbin3, I suppose :)

    Why should you care about playbin3? Playbin3 (and the elements it’s using internally: parsebin, decodebin3, uridecodebin3 among others) is the result of a deep re-design of playbin2 (along with decodebin2 and uridecodebin) to better support:

    • gapless playback
    • audio cross-fading support (not yet implemented)
    • adaptive streaming
    • reduced CPU, memory and I/O resource usage
    • faster stream switching and full control over the stream selection process

    This work was carried out mostly by Edward Hervey; he presented his work in detail at 3 GStreamer conferences. If you want to learn more about this and the internals of playbin3, make sure to watch his awesome presentations at the 2015 gst-conf, 2016 gst-conf and 2017 gst-conf.

    Playbin3 was added in GStreamer 1.10. It is still considered experimental but in my experience it already works very well. Just keep in mind you should use at least the latest GStreamer 1.12 (or even the upcoming 1.14) release before reporting any issue in Bugzilla. Playbin3 is not a drop-in replacement for playbin; both elements share only a subset of GObject properties and signals. However, if you don’t want to modify your application source code just yet, it’s very easy to try playbin3 anyway:

    $ USE_PLAYBIN3=1 my-playbin-based-app
    

    Setting the USE_PLAYBIN3 environment variable enables a code path inside the GStreamer playback plugin which swaps the playbin element for the playbin3 element. This trick provides a glimpse of the playbin3 element for the laziest people :) The problem is that, depending on your use of playbin, you might get runtime warnings; here’s an example with the Totem player:

    $ USE_PLAYBIN3=1 totem ~/Videos/Agent327.mp4
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'video-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'audio-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'text-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'video-tags-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'audio-tags-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    
    (totem:22617): GLib-GObject-WARNING **: ../../../../gobject/gsignal.c:2523: signal 'text-tags-changed' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    sys:1: Warning: g_object_get_is_valid_property: object class 'GstPlayBin3' has no property named 'n-audio'
    sys:1: Warning: g_object_get_is_valid_property: object class 'GstPlayBin3' has no property named 'n-text'
    sys:1: Warning: ../../../../gobject/gsignal.c:3492: signal name 'get-video-pad' is invalid for instance '0x556db67f3170' of type 'GstPlayBin3'
    

    As mentioned previously, playbin and playbin3 don’t share the same set of GObject properties and signals, so some changes in your application are required in order to use playbin3.

    If your application is based on the GstPlayer library then you should set the GST_PLAYER_USE_PLAYBIN3 environment variable. GstPlayer already handles both playbin and playbin3, so no changes needed in your application if you use GstPlayer!

    Ok, so what if your application relies directly on playbin? Some changes are needed! If you previously used playbin stream selection properties and signals, you will now need to handle the GstStream and GstStreamCollection APIs. Playbin3 will emit a stream collection message on the bus; this is very nice because the collection includes information (metadata!) about the streams (or tracks) the media asset contains. In playbin this was handled with a bunch of signals (audio-tags-changed, audio-changed, etc), properties (n-audio, n-video, etc) and action signals (get-audio-tags, get-audio-pad, etc). The new GstStream API provides a centralized and non-playbin-specific access point for all this information. To select streams with playbin3 you now need to send a select-streams event so that the demuxer can know exactly which streams should be exposed to downstream elements. That means potentially improved performance! Once playbin3 has completed the stream selection it will emit a streams-selected message; the application should handle this message and potentially update its internal state about the selected streams. This is also the best moment to update your UI regarding the selected streams (like audio track language, video track dimensions, etc).
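
    To give an idea of what this looks like in code, here is a hedged sketch of a bus callback that parses the collection and selects all the audio and video streams; a real application would apply its own selection logic:

    #include <gst/gst.h>

    /* A sketch of handling the stream collection message and sending a
     * select-streams event for all audio and video streams. */
    static void
    on_stream_collection (GstBus * bus, GstMessage * msg, GstElement * playbin3)
    {
      GstStreamCollection *collection = NULL;
      GList *selected = NULL;
      guint i;

      gst_message_parse_stream_collection (msg, &collection);

      for (i = 0; i < gst_stream_collection_get_size (collection); i++) {
        GstStream *stream = gst_stream_collection_get_stream (collection, i);
        GstStreamType type = gst_stream_get_stream_type (stream);

        if (type & (GST_STREAM_TYPE_AUDIO | GST_STREAM_TYPE_VIDEO))
          selected = g_list_append (selected,
              (gchar *) gst_stream_get_stream_id (stream));
      }

      gst_element_send_event (playbin3,
          gst_event_new_select_streams (selected));

      g_list_free (selected);
      gst_object_unref (collection);
    }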

    Another small difference between playbin and playbin3 is about the source element setup. In playbin there is a source read-only GObject property and a source-setup GObject signal. In playbin3 only the latter is available, so your application should rely on source-setup instead of the notify::source GObject signal.

    The gst-play-1.0 playback utility program already supports playbin3 so it provides a good source of inspiration if you consider porting your application to playbin3. As mentioned at the beginning of this post, WebKit also now supports playbin3, however it needs to be enabled at build time using the CMake -DUSE_GSTREAMER_PLAYBIN3=ON option. This feature is not part of the WebKitGTK+ 2.20 series but should be shipped in 2.22. As a final note I wanted to acknowledge my favorite worker-owned coop Igalia for allowing me to work on this WebKit feature and also our friends over at Centricular for all the quality work on playbin3.

    by Philippe Normand at March 19, 2018 07:13 AM

    March 18, 2018

    Philippe Normand

    Moving to Pelican

    Time for a change! Almost 10 years ago I started hacking on a blog engine with two friends; it was called Alinea and it powered this website for a long time. Back then hacking on your own blog engine was a prerequisite for hosting your blog :) But nowadays people just use Wordpress or similar platforms, if they still have a blog at all. Alinea fell into oblivion as I didn’t have the time or motivation to maintain it.

    Moving to Pelican was quite easy. Since I had been writing content in ReST on the previous blog, I only had to pull the data from the database and hack pelican_import.py a bit :)

    Now that this website looks almost modern I’ll hopefully start to blog again, expect at least news about the WebKit work I still enjoy doing at Igalia.

    by Philippe Normand at March 18, 2018 09:18 AM

    The GNOME-Shell Gajim extension maintenance

    Back in January 2011 I wrote a GNOME-Shell extension allowing Gajim users to carry on with their chats using the Empathy infrastructure and UI present in the Shell. For some time the extension was also part of the official gnome-shell-extensions module and then I had to move it to Github as a standalone extension. Sadly I stopped using Gajim a few years ago and my interest in maintaining this extension has decreased quite a lot.

    I don’t know if this extension is actively used by anyone beyond the few bugs reported in Github, so this is a call for help. If anyone still uses this extension and wants it supported in future versions of GNOME-Shell, please send me a mail so I can transfer ownership of the Github repository and see what I can do for the extensions.gnome.org page as well.

    (Huh, also. Hi blogosphere again! My last post was in 2014 it seems :))

    by Philippe Normand at March 18, 2018 09:18 AM

    Web Engines Hackfest 2014

    Last week I attended the Web Engines Hackfest. The event was sponsored by Igalia (also hosting the event), Adobe and Collabora.

    As usual I spent most of the time working on the WebKitGTK+ GStreamer backend, and Sebastian Dröge kindly joined and helped out quite a bit; make sure to read his post about the event!

    We first worked on the WebAudio GStreamer backend, Sebastian cleaned up various parts of the code, including the playback pipeline and the source element we use to bridge the WebCore AudioBus with the playback pipeline. On my side I finished the AudioSourceProvider patch that was abandoned for a few months (years) in Bugzilla. It’s an interesting feature to have so that web apps can use the WebAudio API with raw audio coming from Media elements.

    I also hacked on GstGL support for video rendering. It’s quite interesting to be able to share the GL context of WebKit with GStreamer! The patch is not ready yet for landing, but thanks to the reviews from Sebastian, Matthew Waters and Julien Isorce I’ll improve it and hopefully commit it soon to WebKit ToT.

    Sebastian also worked on Media Source Extensions support. We had a very basic, non-working, backend that required… a rewrite, basically :) I hope we will have this reworked backend soon in trunk. Sebastian already has it working on Youtube!

    The event was interesting in general, with discussions about rendering engines, rendering and JavaScript.

    by Philippe Normand at March 18, 2018 09:18 AM

    March 12, 2018

    Jessica Tallon

    An excursion in VPP

    Last blog post, I wrote about Snabb, a blazingly fast userspace networking framework. This blog post stands on its own; however, I see networking from a Snabb mindset and there will be some comparisons between it and VPP, another userspace networking framework and the subject of this blog post. I recommend folks read that one first.

    The What and Why of VPP

    VPP is another framework that helps you write userspace networking applications using a kernel bypass, similar to Snabb. It came out of Cisco and was open-sourced a few years ago. Since then it’s made a name for itself, becoming quite popular for writing fast networking applications. As part of my work at Igalia I spent a few months with my colleague Asumu investigating it and learning how to write plugins for it, and seeing how it compares to Snabb, so that we can learn from it in the Snabb world as well as develop with it directly.

    The outlook of VPP is quite different from Snabb’s. The first thing is that it’s all in C, quite a difference when you’ve spent the last two years writing Lua. C is a pretty standard language for these sorts of things; however, it’s a lot less flexible than more modern languages (such as Lua), and I sometimes question whether its current popularity is still deserved. When you start VPP for the first time it configures itself as a router. All the additional functionality you want to provide is done via plugins you write and compile out of tree, and then load into VPP. The plugins then hook themselves into a graph of nodes, usually somewhere in the IP stack.

    Another difference, and in my opinion one of the most compelling arguments for VPP, is that you can use DPDK, a layer written by Intel (one of the larger network card makers) that provides a lot of the drivers. DPDK adds a node at the start called dpdk-input which feeds the rest of the graph with packets. As you might imagine with a full router, VPP also has its own routing table, populated by performing ARP or NDP requests (to resolve the addresses of the next hops). It also provides ICMP facilities to, for example, respond to ping requests.

    First thing to do is install it from the packages they provide. Once you’ve done that you can start it with your init system or directly with the vpp command. You then access it via the command-line interface VPP provides. This is already somewhat of a divergence from Snabb, where you compile the tree with the programs and apps you want to use present, and then run the main Snabb binary from the command line directly. The VPP shell is actually rather neat: it lets you query and configure most aspects of the system. To give you a taste of it, this is configuring the network card and enabling the IPFIX plugin which comes with VPP:

    vpp# set interface ip address TenGigabitEthernet2/0/0 10.10.1.2/24
    vpp# set interface state TenGigabitEthernet2/0/0 up
    vpp# flowprobe params record l3 active 120 passive 300
    vpp# flowprobe feature add-del TenGigabitEthernet2/0/0 ip4
    vpp# show int
    Name                        Idx    State    Counter    Count
    TenGigabitEthernet2/0/0     1      up
    local0                      0      down     drops      1
    

    Let’s get cracking

    I don’t intend this to be a tutorial; it’s more to give you a taste of what it’s like to work with VPP. Despite this, I hope you do find it useful if you’re hoping to hack on VPP yourself. Apologies if the blog seems a bit code-heavy.

    Whilst above I told you how the outlook was different, I want to reiterate. Snabb, for me, is like Lego. It has a bunch of building blocks (apps: ARP, NDP, firewalls, ICMP responders, etc.) that you put together in order to make exactly what you want, out of the bits you want. You only have to use the blocks you decide you want. To reuse some code from my previous blog post:

    module(..., package.seeall)
    
    local pcap = require("apps.pcap.pcap")
    local Intel82599 = require("apps.intel_mp.intel_mp").Intel82599
    
    function run(parameters)
       -- Lua arrays are indexed from 1 not 0.
       local pci_address = parameters[1]
       local output_pcap = parameters[2]
    
       -- Configure the "apps" we're going to use in the graph.
       -- The "config" object is injected in when we're loaded by snabb.
       local c = config.new()
       config.app(c, "nic", Intel82599, {pciaddr = pci_address})
       config.app(c, "pcap", pcap.PcapWriter, output_pcap)
    
       -- Link up the apps into a graph. 
       config.link(c, "nic.output -> pcap.input")
       -- Tell the snabb engine our configuration we've just made.
       engine.configure(c)
       -- Let's start the apps for 10 seconds!
       engine.main({duration=10, report = {showlinks=true}})
    end

    This doesn’t have an IP stack; it grabs a bunch of packets from the Intel network card and dumps them into a Pcap (packet capture) file. If you want more functionality like ARP or ICMP responding, you add those apps. I think both are valid approaches, though personally I have a strong preference for the way Snabb works. If you do need a fully set-up router then of course it’s much easier with VPP: you just start it and you’re ready to go. But a lot of the time, I don’t.

    Getting to know VPP is, I think, one of its biggest drawbacks. There are the parts I’ve mentioned above about it being a complex router, which can complicate things, and it being in C, which feels terse to someone who’s not worked much in it. There are also those parts where I feel that the information needed to get off the ground is a bit hard to come by, something many projects struggle with, VPP being no exception. There is some information spread between the wiki, documentation, YouTube channel and dev mailing list. Those are useful to look at and go through.

    To get started writing your plugin there are a few parts you’ll need:

    • A graph node which takes in packets, does something and passes them on to the next node.
    • An interface to your plugin on the VPP command line so you can enable and disable it
    • Functionality to trace packets which come through (more on this later)

    We started by using the sample plugin as a base. I think it works well: it has all of the above and it’s pretty basic, so it wasn’t too difficult to get up to speed with. Let’s look at some of the landmarks of the VPP plugin developer landscape:

    The node

    VLIB_REGISTER_NODE (sample_node) = {
      .function = sample_node_fn,
      .name = "sample",
      .vector_size = sizeof (u32),
      .format_trace = format_sample_trace,
      .type = VLIB_NODE_TYPE_INTERNAL,
      
      .n_errors = ARRAY_LEN(sample_error_strings),
      .error_strings = sample_error_strings,
    
      .n_next_nodes = SAMPLE_N_NEXT,
    
      /* edit / add dispositions here */
      .next_nodes = {
            [SAMPLE_NEXT_INTERFACE_OUTPUT] = "interface-output",
      },
    };
    

    This macro registers your plugin and provides some metadata for it, such as the node function and its name, the trace function (again, more on this later), the next nodes and the error strings. The other thing with the node is obviously the node function itself. Here is a shortened version of it with lots of comments to explain what’s going on:

    always_inline uword
    sample_node_fn (vlib_main_t * vm,
                    vlib_node_runtime_t * node,
                    vlib_frame_t * frame,
                    u8 is_ipv6)
    {
      u32 n_left_from, * from, * to_next;
      sample_next_t next_index;
      
      /* Grab a pointer to the first vector in the frame */
      from = vlib_frame_vector_args (frame);
      n_left_from = frame->n_vectors;
      next_index = node->cached_next_index;
      
      /* Iterate over the vectors */
      while (n_left_from > 0)
        {
          u32 n_left_to_next;
          
          vlib_get_next_frame (vm, node, next_index,
                               to_next, n_left_to_next);
          
          /* There are usually two loops which look almost identical: one takes
           * two packets at a time (for additional speed) and the other loop does
           * the *exact* same thing just for a single packet. There are also
           * apparently some plugins which define one for 4 packets at once too.
           *
           * The advice given is to write (or read) the single packet loop first
           * and then worry about the multiple packet loops later. I've removed
           * the body of the 2 packet loop to shorten the code; it just does what
           * the single one does, so you're not missing much.
           */
          while (n_left_from >= 4 && n_left_to_next >= 2)
            {
            }
          
          while (n_left_from > 0 && n_left_to_next > 0)
            {
              u32 bi0;
              vlib_buffer_t * b0;
              u32 next0 = SAMPLE_NEXT_INTERFACE_OUTPUT;
              u32 sw_if_index0;
              u8 tmp0[6];
              ethernet_header_t *en0;
    
              /* speculatively enqueue b0 to the current next frame */
              bi0 = from[0];
              to_next[0] = bi0;
              from += 1;
              to_next += 1;
              n_left_from -= 1;
              n_left_to_next -= 1;
          
              /* Get the reference to the buffer */
              b0 = vlib_get_buffer (vm, bi0);
              en0 = vlib_buffer_get_current (b0); /* This is the ethernet header */
          
              /* This is where you do whatever you'd like to with your packet */
              /* ... */
          
              /* Get the software index for the hardware */
              sw_if_index0 = vnet_buffer(b0)->sw_if_index[VLIB_RX];
          
              /* Send pkt back out the RX interface */
              vnet_buffer(b0)->sw_if_index[VLIB_TX] = sw_if_index0;
          
              /* Do we want to trace (used for debugging) */
              if (PREDICT_FALSE((node->flags & VLIB_NODE_FLAG_TRACE) 
                             && (b0->flags & VLIB_BUFFER_IS_TRACED))) {
                  sample_trace_t *t = 
                  vlib_add_trace (vm, node, b0, sizeof (*t));
          }
          
          /* verify speculative enqueue, maybe switch current next frame */
          vlib_validate_buffer_enqueue_x1 (vm, node, next_index,
                                           to_next, n_left_to_next,
                                           bi0, next0);
          
            }
          
          vlib_put_next_frame (vm, node, next_index, n_left_to_next);
        }
      
      return frame->n_vectors;
    }
    

    I’ve tried to pare down the code to the important things as much as I can. I personally am not a huge fan of the code duplication which occurs due to the multiple loops for different numbers of packets in the vector; I think it makes the code a bit messy. It definitely goes against DRY (Don’t Repeat Yourself). I’ve not seen any statistics on the improvements nor had time to look into it myself yet, but I’ll take it as true that it’s worth it and go with it :-). The code definitely has more boilerplate than Lua; I think that’s the nature of C, unfortunately.

    Finally you need to hook the node into the graph. This is done with yet another macro: you choose the node and the arc in the graph, and it’ll put the node into the graph for you:

    VNET_FEATURE_INIT (sample_node, static) =
    {
      .arc_name = "ip4-unicast",
      .node_name = "sample",
      .runs_before = VNET_FEATURES ("ip4-lookup"),
    };

    In this case we’re having it run before ip4-lookup.

    Tracing

    Tracing is really useful when debugging, both as an operator and as a developer. You can enable a trace on a specific node for a certain number of packets. The trace shows you which nodes they go through, and usually each node provides extra information in the trace. Here’s an example of such a trace:

    Packet 1
    
    00:01:23:899371: dpdk-input
      TenGigabitEthernet2/0/0 rx queue 0
      buffer 0x1d3be2: current data 14, length 46, free-list 0, clone-count 0, totlen-nifb 0, trace 0x0
                       l4-cksum-computed l4-cksum-correct l2-hdr-offset 0 l3-hdr-offset 14 
      PKT MBUF: port 0, nb_segs 1, pkt_len 60
        buf_len 2176, data_len 60, ol_flags 0x88, data_off 128, phys_addr 0x5b8ef900
        packet_type 0x211 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
        Packet Offload Flags
          PKT_RX_L4_CKSUM_BAD (0x0008) L4 cksum of RX pkt. is not OK
          PKT_RX_IP_CKSUM_GOOD (0x0080) IP cksum of RX pkt. is valid
        Packet Types
          RTE_PTYPE_L2_ETHER (0x0001) Ethernet packet
          RTE_PTYPE_L3_IPV4 (0x0010) IPv4 packet without extension headers
          RTE_PTYPE_L4_UDP (0x0200) UDP packet
      IP4: 02:02:02:02:02:02 -> 90:e2:ba:a9:84:1c
      UDP: 8.8.8.8 -> 193.5.1.176
        tos 0x00, ttl 15, length 46, checksum 0xd8fa
        fragment id 0x0000
      UDP: 12345 -> 6144
        length 26, checksum 0x0000
    00:01:23:899404: ip4-input-no-checksum
      UDP: 8.8.8.8 -> 193.5.1.176
        tos 0x00, ttl 15, length 46, checksum 0xd8fa
        fragment id 0x0000
      UDP: 12345 -> 6144
        length 26, checksum 0x0000
    00:01:23:899430: ip4-lookup
      fib 0 dpo-idx 13 flow hash: 0x00000000
      UDP: 8.8.8.8 -> 193.5.1.176
        tos 0x00, ttl 15, length 46, checksum 0xd8fa
        fragment id 0x0000
      UDP: 12345 -> 6144
        length 26, checksum 0x0000
    00:01:23:899434: ip4-load-balance
      fib 0 dpo-idx 13 flow hash: 0x00000000
      UDP: 8.8.8.8 -> 193.5.1.176
        tos 0x00, ttl 15, length 46, checksum 0xd8fa
        fragment id 0x0000
      UDP: 12345 -> 6144
        length 26, checksum 0x0000
    00:01:23:899437: ip4-arp
        UDP: 8.8.8.8 -> 193.5.1.176
          tos 0x00, ttl 15, length 46, checksum 0xd8fa
          fragment id 0x0000
        UDP: 12345 -> 6144
          length 26, checksum 0x0000
    00:01:23:899441: error-drop
      ip4-arp: address overflow drops

    The format is a timestamp with the node name, plus some extra information the plugin chooses to display. It lets me easily see that this packet came in through dpdk-input and went through a few ip4 nodes until ip4-arp (which presumably sent out an ARP packet), and then it gets black-holed (because it doesn’t know where to send it). This is invaluable information when you want to see what’s going on; I can only imagine it’s great for operators too, to debug their own setup / config.

    VPP has a pair of useful functions called format and unformat. They work a bit like printf and scanf; however, they can also take any struct or datatype that has a format function and display (or parse) it. This means that to display, for example, an IPv4 address, it’s just a matter of calling out to format with the provided format_ip4_address. There is a whole slew of these which come with VPP, and it’s trivial to write your own for your own data structures. The other thing to note when writing these tracing functions is that you need to remember to provide the data in your node to the trace function. It’s parsed after the fact so as not to hinder performance. The struct we’re going to give to the trace looks like this:

    typedef struct {
      u32 next_index;
      u32 sw_if_index;
      u8 new_src_mac[6];
      u8 new_dst_mac[6];
    } sample_trace_t;

    The call to actually trace is:

    sample_trace_t *t = vlib_add_trace (vm, node, b0, sizeof (*t));
    /* These variables are defined in the node block posted above */
    t->sw_if_index = sw_if_index0;
    t->next_index = next0;
    clib_memcpy (t->new_src_mac, en0->src_address,
                 sizeof (t->new_src_mac));
    clib_memcpy (t->new_dst_mac, en0->dst_address,
                 sizeof (t->new_dst_mac));

    Finally the thing we’ve been waiting for, the trace itself:

    /* VPP actually comes with a format_mac_address, this is here to show you
     * what a format functions look like
     */
    static u8 *
    format_mac_address (u8 * s, va_list * args)
    {
      u8 *a = va_arg (*args, u8 *);
      return format (s, "%02x:%02x:%02x:%02x:%02x:%02x",
             a[0], a[1], a[2], a[3], a[4], a[5]);
    }
    
    static u8 * format_sample_trace (u8 * s, va_list * args)
    {
      CLIB_UNUSED (vlib_main_t * vm) = va_arg (*args, vlib_main_t *);
      CLIB_UNUSED (vlib_node_t * node) = va_arg (*args, vlib_node_t *);
      sample_trace_t * t = va_arg (*args, sample_trace_t *);
      
      s = format (s, "SAMPLE: sw_if_index %d, next index %d\n",
                  t->sw_if_index, t->next_index);
      s = format (s, "  new src %U -> new dst %U",
                  format_mac_address, t->new_src_mac, 
                  format_mac_address, t->new_dst_mac);
    
      return s;
    }

    Not bad, I don’t think. Whilst Snabb does make it very easy and has a lot of the same offerings to format and parse things, having for example ipv4:ntop, I appreciate how easy VPP makes doing things like this in C. Format and unformat are quite nice to work with. I love the tracing! Snabb doesn’t have a comparable tracing mechanism; it displays a link report, but nothing like the trace.

    CLI Interface

    We have our node, we can trace packets going through it and display some useful info specific to the plugin. We now want our plugin to actually be used. To do that we need to tell VPP that we’re open for business and to send packets our way. Snabb doesn’t have a CLI similar to VPP’s, partially because it works differently and there is less need for it. However, I think the CLI comes into its own when you want to display lots of information and easily configure things on the fly. VPP also has a shell command, vppctl, that you can use to execute CLI commands, allowing for scripting.

    In VPP you use a macro to define a command, similar to the node definition above. This example registers the command name (“sample macswap” in our case), a bit of help text and the handler function itself. Here’s what it could look like:

    static clib_error_t *
    macswap_enable_disable_command_fn (vlib_main_t * vm,
                                       unformat_input_t * input,
                                       vlib_cli_command_t * cmd)
    {
      sample_main_t * sm = &sample_main;
      u32 sw_if_index = ~0;
      int enable_disable = 1;
    
      /* Parse the command */
      while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT) {
        if (unformat (input, "disable"))
          enable_disable = 0;
        else if (unformat (input, "%U", unformat_vnet_sw_interface,
                           sm->vnet_main, &sw_if_index));
        else
          break;
      }
    
      /* Display an error if we weren't provided with the interface name */
      if (sw_if_index == ~0)
        return clib_error_return (0, "Please specify an interface...");
      
      /* This is what you call out to, in order to enable or disable the plugin */
      vnet_feature_enable_disable ("device-input", "sample",
                                   sw_if_index, enable_disable, 0, 0);
      return 0;
    }
    
    VLIB_CLI_COMMAND (sr_content_command, static) = {
        .path = "sample macswap",
        .short_help = 
        "sample macswap <interface-name> [disable]",
        .function = macswap_enable_disable_command_fn,
    };

    It’s quite simple. You see if the user is enabling or disabling the plugin (by checking for the presence of “disable” in their command), grab the interface name and then tell VPP to enable or disable your plugin on said interface.

    Conclusion

    VPP is powerful and fast. It shares a lot of similarities with Snabb; however, they take very different approaches. I feel like the big differences come down largely to personal preferences: do you prefer C or Lua? Do you need an IP router? Are your network cards supported by Snabb, or do you need to write support for them yourself or use DPDK?

    Overall, during my excursion in VPP, I’ve enjoyed seeing the other side, as it were. I do, however, think I prefer working with Snabb. I find it faster to get up to speed with, because things are simpler for simple projects. Lua also lets me have the power of C through its FFI, without subjecting me to the shortcomings of C. I look forward to hacking on both, and maybe even seeing some of the ideas from the VPP world come into the Snabb world and the other way round too!

    by tsyesika at March 12, 2018 03:29 PM

    Iago Toral

    Intel Mesa driver for Linux is now Vulkan 1.1 conformant

    It was only a few weeks ago when I posted that the Intel Mesa driver had successfully passed the Khronos OpenGL 4.6 conformance on day one, and now I am very proud that we can announce the same for the Intel Mesa Vulkan 1.1 driver, the new Vulkan API version announced by the Khronos Group last week. Big thanks to Intel for making Linux a first-class citizen for graphics APIs, and especially to Jason Ekstrand, who did most of the Vulkan 1.1 enablement in the driver.

    At Igalia we are very proud of being a part of this: on the driver side, we have contributed the implementation of VK_KHR_16bit_storage, numerous bugfixes for issues raised by the Khronos Conformance Test Suite (CTS) and code reviews for some of the new Vulkan 1.1 features developed by Intel. On the CTS side, we have worked with other Khronos members in reviewing and testing additions to the test suite, identifying and providing fixes for issues in the tests as well as developing new tests.

    Finally, I’d like to highlight the strong industry adoption of Vulkan: as stated in the Khronos press release, various other hardware vendors have already implemented conformant Vulkan 1.1 drivers, we are also seeing major 3D engines adopting and supporting Vulkan and AAA games that have already shipped with Vulkan-powered graphics. There is no doubt that this is only the beginning and that we will be seeing a lot more of Vulkan in the coming years, so look forward to it!

    Vulkan and the Vulkan logo are registered trademarks of the Khronos Group Inc.

    by Iago Toral at March 12, 2018 10:12 AM

    March 04, 2018

    Eleni Maria Stea

    A short OpenGL / SPIRV example.

    Igalia has been working for a while on bringing SPIR-V to Mesa’s OpenGL. Alejandro Piñeiro has already given a talk on the status of the ARB_gl_spirv extension development, which was very well received at FOSDEM 2018. Anyone interested in technical information can watch the video recording here: https://youtu.be/wXr8-C51qeU.

    So far, I haven’t been involved with the extension development myself, but as I am looking forward to working on it in the future, I decided to familiarize myself with the use of SPIR-V in OpenGL. In this post, I will try to demonstrate with a short OpenGL example how to use this extension, after I briefly explain what SPIR-V is and why it is important to bring it to OpenGL.

    So, what is SPIR-V and why is it useful?

    SPIR-V is an intermediate language for defining shaders. With the use of an external compiler, shaders written in any shading language (for example GLSL or HLSL) can be converted to SPIR-V (see more here: https://www.khronos.org/opengl/wiki/SPIR-V, here: https://www.khronos.org/registry/spir-v/ and here: https://www.khronos.org/spir/). The obvious advantages for OpenGL programs are speed, less complexity and portability:

    With SPIR-V, the graphics program and the driver can avoid the overhead of parsing, compiling and linking the shaders. Also, it is easy to re-use shaders written in different shading languages. For example an OpenGL program for Linux can use an HLSL shader that was originally written for a Vulkan program for Windows, by loading its SPIR-V representation. The only requirement is that the OpenGL implementation and the driver support the ARB_gl_spirv extension.

    OpenGL – SPIR-V example

    So, here’s an example OpenGL program that loads SPIR-V shaders by making use of the OpenGL ARB_gl_spirv extension: https://github.com/hikiko/gl4

    The example makes use of a vertex and a pixel shader written in GLSL 450.  Some notes on it:

    1- I had to write the shaders in a way compatible with the SPIR-V extension:

    First of all, I had to forget about the traditional attribute locations, varyings and uniform locations. Each shader had to contain all the necessary information for linking. This means that I had to specify the input and output attributes/varyings of each shader stage and their locations/binding points inside the GLSL program. I’ve done this using the layout qualifier. I also placed the uniforms in Uniform Buffer Objects (UBOs), as standalone uniforms are not supported when using SPIR-V [1]. I couldn’t use UniformBlockBinding because, again, it is not supported when using SPIR-V shaders (see ARB_gl_spirv : issues : 24 to learn why).

    Below you can see a comparison of the traditional GLSL shaders I would use with older GLSL versions and the GLSL 450 shaders I used for this example (for each shader, the traditional version is shown first, followed by the GLSL 450 version):

    Vertex shader:

    uniform mat4 mvpmat, mvmat, projmat;
    uniform mat3 normmat;
    uniform vec3 light_pos;
    attribute vec4 attr_vertex;
    attribute vec3 attr_normal;
    attribute vec2 attr_texcoord;
    varying vec3 vpos, norm, ldir;
    varying vec2 texcoord;
    void main()
    {
       gl_Position = mvpmat *
                            attr_vertex;
       vpos = (mvmat * attr_vertex).xyz;
       norm = normmat * attr_normal;
       texcoord = attr_texcoord
                   * vec2(2.0, 1.0);
    }

    And this is the GLSL 450 version:

    #version 450
    layout(std140, binding = 0) uniform matrix_state {
       mat4 vmat;
       mat4 projmat;
       mat4 mvmat;
       mat4 mvpmat;
       vec3 light_pos;
    } matrix
    layout(location = 0) in vec4 attr_vertex;
    layout(location = 1) in vec3 attr_normal;
    layout(location = 2) in vec2 attr_texcoord;
    layout(location = 3) out vec3 vpos;
    layout(location = 4) out vec3 norm;
    layout(location = 5) out vec3 ldir;
    layout(location = 6) out vec2 texcoord;
    void main()
    {
       gl_Position = matrix.mvpmat * attr_vertex;
       vpos = (matrix.mvmat * attr_vertex).xyz;
       norm = mat3(matrix.mvmat) * attr_normal;
       texcoord = attr_texcoord * vec2(2.0, 1.0);
       ldir = matrix.light_pos - vpos;
    }

    Pixel shader:

    uniform sampler2D tex;
    varying vec3 vpos, norm, ldir;
    varying vec2 texcoord;
    void main()
    {
        vec4 texel = texture2D(tex, texcoord);
        vec3 vdir = -normalize(vpos);
        vec3 n = normalize(norm);
        vec3 l = normalize(ldir);
        vec3 h = normalize(vdir + ldir);
        float ndotl = max(dot(n, l), 0.0);
        float ndoth = max(dot(n, h), 0.0);
        vec3 diffuse = texel.rgb * ndotl;
        vec3 specular = vec3(1.0, 1.0, 1.0) * pow(ndoth, 50.0);
        gl_FragColor.rgb = diffuse + specular;
        gl_FragColor.a = texel.a;
    }

    And this is the GLSL 450 version:

    #version 450
    layout(binding = 0) uniform sampler2D tex;
    layout(location = 3) in vec3 vpos;
    layout(location = 4) in vec3 norm;
    layout(location = 5) in vec3 ldir;
    layout(location = 6) in vec2 texcoord;
    layout(location = 0) out vec4 color;
    void main()
    {
        vec4 texel = texture(tex, texcoord);
        vec3 vdir = -normalize(vpos);
        vec3 n = normalize(norm);
        vec3 l = normalize(ldir);
        vec3 h = normalize(vdir + ldir);
        float ndotl = max(dot(n, l), 0.0);
        float ndoth = max(dot(n, h), 0.0);
        vec3 diffuse = texel.rgb * ndotl;
        vec3 specular = vec3(1.0, 1.0, 1.0) * pow(ndoth, 50.0);
        color.rgb = diffuse + specular;
        color.a = texel.a;
    }

    As you can see, in the modern GLSL version I’ve set the location for every attribute and varying, as well as the binding number for the uniform buffer objects and the opaque types of the fragment shader (sampler2D), inside the shader. Also, the output varyings of the vertex stage (out variables) use the same locations as the equivalent input varyings of the fragment stage (in variables). There are also some minor changes in the language syntax: I can’t make use of the deprecated gl_FragColor output and texture2D function anymore.

    2- GLSL to SPIR-V HOWTO:

    First of all we need a GLSL to SPIR-V compiler.
    On Linux, we can use Khronos’s glslangValidator by checking out the code from this repository:
    https://github.com/KhronosGroup/glslang and installing it locally. Then, we can do something like:

    glslangValidator -G -V -S vertex.glsl -o spirv/vertex.spv

    for each stage (glslangValidator -h for more options). Note that -G is used to compile the shaders targeting the OpenGL platform. It should be avoided when the shaders will be ported to other platforms.

    I found it easier to add some rules to the project’s Makefile (https://github.com/hikiko/gl4/blob/master/Makefile) to compile the shaders automatically.

    3- Loading, specializing and using the SPIR-V shaders

    To use the SPIR-V shaders  an OpenGL program must:

    1. load the SPIR-V shaders
    2. specialize them
    3. create the shader program

    Loading the shaders is quite simple: we only have to specify the SPIR-V content and its size by calling:

    glShaderBinary(1, &sdr, GL_SHADER_BINARY_FORMAT_SPIR_V_ARB, buf, fsz);

    where sdr is our shader, buf is a buffer that contains the SPIR-V we loaded from the file and fsz is the content’s size (the file size). Check out the load_shader function here: https://github.com/hikiko/gl4/blob/master/main.c (the case when SPIRV is defined).

    We can then specialize the shaders using the function glSpecializeShaderARB, which allows us to set the shader’s entry point (the function from which execution begins), the number of specialization constants and the constants that will be used by this function. In our example the execution starts from main and we don’t use any specialization constants, therefore we set "main", 0, 0 for each shader. Note that I load the glSpecializeShaderARB function at runtime because the linker couldn’t find it; you might not need to do this in your program.
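
    Putting both calls together, a load function could look roughly like the sketch below. This is my own untested outline, not the exact load_shader from the repository; it assumes the SPIR-V blob and its size were already read from the .spv file, and that GL_SHADER_BINARY_FORMAT_SPIR_V_ARB and glSpecializeShaderARB are available through your GL headers or loader:

    /* Sketch: create a shader object from a SPIR-V blob and specialize it. */
    GLuint load_spirv_shader(GLenum stage, const void *spirv, GLsizei size)
    {
        GLuint sdr = glCreateShader(stage);

        /* hand the SPIR-V binary to the driver */
        glShaderBinary(1, &sdr, GL_SHADER_BINARY_FORMAT_SPIR_V_ARB, spirv, size);

        /* entry point "main", no specialization constants */
        glSpecializeShaderARB(sdr, "main", 0, 0, 0);

        return sdr;
    }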

    Before we create the shader program it’s generally useful to perform some error checking:

    	glGetShaderiv(sdr, GL_COMPILE_STATUS, &status);
    	if(status) {
    		printf("successfully compiled shader: %s\n", fname);
    	} else {
    		printf("failed to compile shader: %s\n", fname);
    	}
    
    	glGetShaderiv(sdr, GL_INFO_LOG_LENGTH, &loglen);
    	if(loglen > 0 && (buf = malloc(loglen + 1))) {
    		glGetShaderInfoLog(sdr, loglen, 0, buf);
    		buf[loglen] = 0;
    		printf("%s\n", buf);
    		free(buf);
    	}

    In the code snippet above, I used the driver’s compiler (as we would use it if we had just compiled the GLSL code) to validate the shader’s SPIR-V representation.

    Then, I created and used the shader program as usual (see load_program of main.c). And that’s it.
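
    For completeness, the “usual” program creation step could be sketched as follows (again an untested outline rather than the exact load_program from main.c):

    /* Sketch: attach the two SPIR-V shader objects and link the program. */
    GLuint create_program(GLuint vsdr, GLuint psdr)
    {
        GLint status;
        GLuint prog = glCreateProgram();

        glAttachShader(prog, vsdr);
        glAttachShader(prog, psdr);
        glLinkProgram(prog);

        glGetProgramiv(prog, GL_LINK_STATUS, &status);
        if(!status) {
            /* link failed: a real program would print the info log here */
            glDeleteProgram(prog);
            return 0;
        }
        return prog;
    }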

    To run the example program you can clone it from here: https://github.com/hikiko/gl4 and, assuming that you have installed the glslangValidator mentioned before, you can run:

    make
    ./test

    inside the gl4/ directory.

    [1]: Standalone uniforms with explicit locations can also be accepted, but since this feature is not supported on other platforms (like Vulkan), shaders that use it won’t be portable.

    by hikiko at March 04, 2018 08:20 PM

    February 17, 2018

    Michael Catanzaro

    On Compiling WebKit (now twice as fast!)

    Are you tired of waiting for ages to build large C++ projects like WebKit? Slow headers are generally the problem. Your C++ source code file #includes a few headers, all those headers #include more, and those headers #include more, and more, and more, and since it’s C++ a bunch of these headers contain lots of complex templates to slow down things even more. Not fun.

    It turns out that much of the time spent building large C++ projects is effectively spent parsing the same headers again and again, over, and over, and over, and over, and over….

    There are three possible solutions to this problem:

    • Shred your CPU and buy a new one that’s twice as fast.
    • Use C++ modules: import instead of #include. This will soon become the best solution, but it’s not standardized yet. For WebKit’s purposes, we can’t use it until it works the same in MSVC, Clang, and three-year-old versions of GCC. So it’ll be quite a while before we’re able to take advantage of modules.
    • Use unified builds (sometimes called unity builds).

    WebKit has adopted unified builds. This work was done by Keith Miller, from Apple. Thanks, Keith! (If you’ve built WebKit before, you’ll probably want to say that again: thanks, Keith!)

    For a release build of WebKitGTK+, on my desktop, our build times used to look like this:

    real 62m49.535s
    user 407m56.558s
    sys 62m17.166s

    That was taken using WebKitGTK+ 2.17.90; build times with any 2.18 release would be similar. Now, with trunk (or WebKitGTK+ 2.20, which will be very similar), our build times look like this:

    real 33m36.435s
    user 214m9.971s
    sys 29m55.811s

    Twice as fast.

    The approach is pretty simple: instead of telling the compiler to build the original C++ source code files that developers see, we instead tell the compiler to build unified source files that look like this:

    // UnifiedSource1.cpp
    #include "CSSValueKeywords.cpp"
    #include "ColorData.cpp"
    #include "HTMLElementFactory.cpp"
    #include "HTMLEntityTable.cpp"
    #include "JSANGLEInstancedArrays.cpp"
    #include "JSAbortController.cpp"
    #include "JSAbortSignal.cpp"
    #include "JSAbstractWorker.cpp"

    Since files are included only once per translation unit, we now have to parse the same headers only once for each unified source file, rather than for each individual original source file, and we get a dramatic build speedup. It’s pretty terrible, yet extremely effective.

    Now, how many original C++ files should you #include in each unified source file? To get the fastest clean build time, you would want to #include all of your C++ source files in one, that way the compiler sees each header only once. (Meson can do this for you automatically!) But that causes two problems. First, you have to make sure none of the files throughout your entire codebase use conflicting variable names, since the static keyword and anonymous namespaces no longer work to restrict your definitions to a single file. That’s impractical in a large project like WebKit. Second, because there’s now only one file passed to the compiler, incremental builds now take as long as clean builds, which is not fun if you are a WebKit developer and actually need to make changes to it. Unifying more files together will always make incremental builds slower. After some experimentation, Apple determined that, for WebKit, the optimal number of files to include together is roughly eight. At this point, there’s not yet much negative impact on incremental builds, and past here there are diminishing returns in clean build improvement.

    In WebKit’s implementation, the files to bundle together are computed automatically at build time using CMake black magic. Adding a new file to the build can change how the files are bundled together, potentially causing build errors in different files if there are symbol clashes. But this is usually easy to fix, because only files from the same directory are bundled together, so random unrelated files will never be built together. The bundles are always the same for everyone building the same version of WebKit, so you won’t see random build failures; only developers who are adding new files will ever have to deal with name conflicts.

    To significantly reduce name conflicts, we now limit the scope of using statements. That is, stuff like this:

    using namespace JavaScriptCore;
    namespace WebCore {
    //...
    }

    Has been changed to this:

    namespace WebCore {
    using namespace JavaScriptCore;
    // ...
    }

    Some files need to be excluded due to unsolvable name clashes. For example, files that include X11 headers, which contain lots of unnamespaced symbols that conflict with WebCore symbols, don’t really have any chance. But only a few files should need to be excluded, so this does not have much impact on build time. We’ve also opted to not unify most of the GLib API layer, so that we can continue to use conventional GObject names in our implementation, but again, the impact of not unifying a few files is minimal.

    We still have some room for further performance improvement, because some significant parts of the build are still not unified, including most of the WebKit layer on top. But I suspect developers who have to regularly build WebKit will already be quite pleased.

    by Michael Catanzaro at February 17, 2018 07:07 PM

    February 16, 2018

    Michael Catanzaro

    On Python Shebangs

    So, how do you write a shebang for a Python program? Let’s first set aside the python2/python3 issue and focus on whether to use env. Which of the following is correct?

    #!/usr/bin/env python
    #!/usr/bin/python

    The first option seems to work in all environments, but it is banned in popular distros like Fedora (and I believe also Debian, but I can’t find a reference for this). Using env in shebangs is dangerous because it can result in system packages using non-system versions of python. python is used in so many places throughout modern systems, it’s not hard to see how using #!/usr/bin/env in an important package could badly bork users’ operating systems if they install a custom version of python in /usr/local. Don’t do this.

    The second option is broken too, because it doesn’t work in BSD environments. E.g. in FreeBSD, python is installed in /usr/local/bin. So FreeBSD contributors have been upstreaming patches to convert #!/usr/bin/python shebangs to #!/usr/bin/env python. Meanwhile, Fedora has begun automatically rewriting #!/usr/bin/env python to #!/usr/bin/python, but with a warning that this is temporary and that use of #!/usr/bin/env python will eventually become a fatal error causing package builds to fail.

    So obviously there’s no way to write a shebang that will work for both major Linux distros and major BSDs. #!/usr/bin/env python seems to work today, but it’s subtly very dangerous. Lovely. I don’t even know what to recommend to upstream projects.

    Next problem: python2 versus python3. By now, we should all be well-aware of PEP 394. PEP 394 says you should never write a shebang like this:

    #!/usr/bin/env python
    #!/usr/bin/python

    unless your python script is compatible with both python2 and python3, because you don’t know what version you’re getting. Your python script is almost certainly not compatible with both python2 and python3 (and if you think it is, it’s probably somehow broken, because I doubt you regularly test it with both). Instead, you should write the shebang like this:

    #!/usr/bin/env python2
    #!/usr/bin/python2
    #!/usr/bin/env python3
    #!/usr/bin/python3

    This works as long as you only care about Linux and BSDs. It doesn’t work on macOS, which provides /usr/bin/python and /usr/bin/python2.7, but still no /usr/bin/python2 symlink, even though it’s now been six years since PEP 394. It’s hard to overstate how frustrating this is.

    So let’s say you are WebKit, and need to write a python script that will be truly cross-platform. How do you do it? WebKit’s scripts are only needed (a) during the build process or (b) by developers, so we get a pass on the first problem: using /usr/bin/env should be OK, because the scripts should never be installed as part of the OS. Using #!/usr/bin/env python — which is actually what we currently do — is unacceptable, because our scripts are python2 and that’s broken on Arch, and some of our developers use that. Using #!/usr/bin/env python2 would be dead on arrival, because that doesn’t work on macOS. Seems like the option that works for everyone is #!/usr/bin/env python2.7. Then we just have to hope that the Python community sticks to its promise to never release a python2.8 (which seems likely).

    …yay?

    by Michael Catanzaro at February 16, 2018 08:21 PM

    February 15, 2018

    Diego Pino

    The B4 network function

    Some time ago I started a series of blog posts about IPv6 and network namespaces. The purpose of those posts was preparing the ground for covering a network function called B4 (Basic Bridging BroadBand).

    The B4 network function is one of the main components of a lw4o6 architecture (RFC7596). The function runs within every CPE (Customer Premises Equipment, essentially a home router) of a carrier’s network. This function takes care of two things: 1) NAPTing the customer’s IPv4 traffic and 2) encapsulating it into IPv6. This is fundamental, as lw4o6 proposes an IPv6-only network which can still provide IPv4 services and connectivity. Besides lw4o6, the B4 function is also present in other architectures such as DS-Lite or MAP-E. In the case of lw4o6 the exact name of this function is lwB4. All these architectures rely on A+P mapping techniques and are managed by the Softwire WG.

    The diagram below shows how a lw4o6 architecture works:

    lw4o6 chart

    Packets arriving at the CPE from the customer (IPv4) are shown in red. Packets leaving the CPE towards the carrier’s network (IPv6) are shown in blue. The counterpart of the lwB4 function is the lwAFTR function, deployed at one of the border routers of the carrier’s network.

    In the article ‘Dive into lw4o6’ I reviewed in detail how a lw4o6 architecture works. Please check out the article if you want to learn more.

    At Igalia we implemented a high-performance lwAFTR network function. This network function has been part of Snabb since at least 2015, and has kept evolving and getting merged back into Snabb through new releases. We kindly thank Deutsche Telekom for their support financing this project, as well as Juniper Networks, who also helped improve the status of Snabb’s lwAFTR.

    While we were developing the lwAFTR network function we heavily tested it through a wide range of tests: end-to-end tests, performance tests, soak tests, etc. However, on some occasions we had to diagnose potential bugs in real deployments. To do that, we needed the other major component of a lw4o6 architecture: the B4 network function.

    OpenWRT, the Linux-based OS powering many home routers, features a MAP network function to help deploy MAP-E architectures. This function can also be used to implement a B4 for DS-Lite or lw4o6. With the invaluable help of my colleagues Adrián and Carlos López I managed to set up OpenWRT on a virtual machine with B4 enabled. However, I was not completely satisfied with the solution.

    That led me to explore another solution, very much inspired by an excellent blog post from Marcel Wiget: Lightweight 4over6 B4 Client in Linux Namespace. In this post Marcel describes how to build a B4 network function using standard Linux commands.

    Basically, a B4 function does 2 things:

    • NAT44, which is possible to do with iptables.
    • IPv4-in-IPv6 tunneling, which is possible to do with iproute2.

    In addition, Marcel’s B4 network function is isolated into its own network namespace. That’s less of a headache than installing and configuring a virtual machine.

    On the other hand, my deployment had an extra twist compared to a standard lw4o6 deployment. The lwAFTR I was trying to reach was somewhere out on the Internet, not within my ISP’s network. To make things worse, ISPs in Spain are not rolling out IPv6 yet, so I needed to use an IPv6 tunnel broker, more precisely Hurricane Electric (in the article ‘IPv6 tunnel’ I described how to set up such a tunnel).

    Basically my deployment looked like this:

    lwB4-lwAFTR over Internet

    After scratching my head for several days I came up with the following script: b4-to-aftr-over-inet.sh. I break it down in pieces below for better comprehension.

    Warning: The script requires a Hurricane Electric tunnel up and running in order to work.

    Our B4 would have the following provisioned data:

    • B4 IPv6: IPv6 address provided by Hurricane Electric.
    • B4 IPv4: 192.0.2.1
    • B4 port-range: 4096-8191

    While the address of the AFTR is 2001:DB8::0001.

    Given this settings our B4 is ready to match the following softwire in the lwAFTR’s binding table:

    softwire {
        ipv4 192.0.2.1;
        psid 1;
        b4-ipv6 IFHE (See below);
        br-address 2001:DB8::0001;
        port-set {
            psid-length 12;
        }
    }
    

    In case of doubt about how softwires work, please check ‘Dive into lw4o6’.

    IPHT="fd24:f64b:aca9:e498::1"
    IPNS="fd24:f64b:aca9:e498::2"
    CID=64
    IFHT="veth9"
    IFNS="vpeer9"
    IFHE="sit1"
    NS="ns-b4"
    

    Definition of several constants. IPHT and IPNS stand for IP host and IP namespace. Our script will create a network namespace, which requires a veth pair to communicate between the namespace and the host. IPHT is an ULA address for the host side, while IPNS is an ULA address for the network namespace side. Likewise, IFHT and IFNS are the interface names for the host and namespace sides respectively.

    IFHE is the interface of the Hurricane Electric IPv6-in-IPv4 tunnel. We will use the IPv6 address of this interface as IPv6 source address of the B4.

    AFTR_IPV6="2001:DB8::0001"
    IP="192.0.2.1"
    PORTRANGE="4096-8191"
    

    Softwire related constants, as described above.

    # Reset everything
    ip li del dev "${IFHT}" &>/dev/null
    ip netns del "${NS}" &> /dev/null
    

    Removes namespace and host-side interface if defined.

    # Create a network namespace and enable loopback on it
    ip netns add "${NS}"
    ip netns exec "${NS}" ip li set dev lo up
    
    # Create the veth pair and move one of the ends to the NS.
    ip li add name "${IFHT}" type veth peer name "${IFNS}"
    ip li set dev "${IFNS}" netns "${NS}"
    
    # Configure interface ${IFHT} on the host
    ip -6 addr add "${IPHT}/${CID}" dev "${IFHT}"
    ip li set dev "${IFHT}" up
    
    # Configure interface ${IFNS} on the network namespace.
    ip netns exec "${NS}" ip -6 addr add "${IPNS}/${CID}" dev "${IFNS}"
    ip netns exec "${NS}" ip li set dev "${IFNS}" up
    

    The commands above set up the basics of the network namespace. A network namespace is created together with a virtual-interface (veth) pair (think of a patch cable). Each of the veth ends is assigned a private IPv6 address (ULA). One of the ends of the veth pair is moved into the network namespace while the other remains on the host side. In case of doubt, please check this other article I wrote about network namespaces.

    # Create IPv4-in-IPv6 tunnel.
    ip netns exec "${NS}" ip -6 tunnel add b4tun mode ipip6 local "${IPNS}" remote "${IPHT}" dev "${IFNS}"
    ip netns exec "${NS}" ip addr add 10.0.0.1 dev b4tun
    ip netns exec "${NS}" ip link set dev b4tun up
    # All IPv4 packets go through the tunnel.
    ip netns exec "${NS}" ip route add default dev b4tun
    # Make ${IFNS} the default gw.
    ip netns exec "${NS}" ip -6 route add default dev "${IFNS}"
    

    From the B4 we will send IPv4 packets that will get encapsulated into IPv6. These packets will eventually leave the host via the Hurricane Electric tunnel. What we do here is create an IPv4-in-IPv6 tunnel (ipip6) called b4tun. The tunnel has two ends: IPNS and IPHT. All IPv4 traffic started from the network namespace gets routed through b4tun, so it gets encapsulated. If the traffic is native IPv6 traffic it doesn’t need to get encapsulated, thus it’s simply forwarded to IFNS.

    # Adjust MTU size. 
    ip netns exec "${NS}" ip li set mtu 1252 dev b4tun
    ip netns exec "${NS}" ip li set mtu 1300 dev vpeer9
    

    Since packets leaving the CPE get IPv6 encapsulated we need to make room for those extra bytes that will grow the packet size. Normally routing appliances have a default MTU size of 1500 bytes. That’s why we artificially reduce the MTU size of both b4tun and vpeer9 interfaces to a number lower than 1500. This technique is known as MSS (Maximum Segment Size) Clamping.

    # NAT44.
    ip netns exec "${NS}" iptables -t nat --flush
    ip netns exec "${NS}" iptables -t nat -A POSTROUTING -p tcp  -o b4tun -j SNAT --to $IP:$PORTRANGE
    ip netns exec "${NS}" iptables -t nat -A POSTROUTING -p udp  -o b4tun -j SNAT --to $IP:$PORTRANGE
    ip netns exec "${NS}" iptables -t nat -A POSTROUTING -p icmp -o b4tun -j SNAT --to $IP:$PORTRANGE
    

    Outgoing IPv4 packets leaving the B4 get their IPv4 source address and port source-NATted. The block of code above flushes the iptables NAT44 rules and creates new source NAT rules for several protocols.

    # Enable forwarding and IPv6 NAT
    sysctl -w net.ipv6.conf.all.forwarding=1
    ip6tables -t nat --flush
    # Packets coming into the veth pair in the host side, change their destination address to AFTR.
    ip6tables -t nat -A PREROUTING  -i "${IFHT}" -j DNAT --to-destination "${AFTR_IPV6}"
    # Outgoing packets change their source address to HE Client address (B4 address).
    ip6tables -t nat -A POSTROUTING -o "${IFHE}" -j MASQUERADE
    

    Outgoing packets leaving our host need to get their source address masqueraded to the IPv6 address assigned to the interface of the Hurricane Electric tunnel endpoint. Likewise, anything that comes into the host should seem to arrive from the lwAFTR, when actually its origin address is the IPv6 address of the other end of the Hurricane Electric tunnel. To overcome this problem I applied a NAT66 on the source and destination addresses. Could this be done in a different way, skipping the controversial NAT66? I’m not sure. I think the veth pairs need to get assigned private addresses, so the only way to get the packets routed through the Internet is with a NAT66.

    # Get into NS.
    bash=/run/current-system/sw/bin/bash
    ip netns exec ${NS} ${bash} --rcfile <(echo "PS1=\"${NS}> \"")
    

    The last step gets us into the network namespace, from which we will be able to run commands constrained to the environment created in the previous steps.

    I don’t know how useful or reusable this script can be, but in hindsight coming up with this complex setup helped me learn several Linux networking tools. I think I could have never figured all this out without the help and support of my colleagues, as well as the guidance of Marcel’s original script and blog post.

    February 15, 2018 06:00 AM

    February 07, 2018

    Andy Wingo

    design notes on inline caches in guile

    Ahoy, programming-language tinkerfolk! Today's rambling missive chews the gnarly bones of "inline caches", in general but also with particular respect to the Guile implementation of Scheme. First, a little intro.

    inline what?

    Inline caches are a language implementation technique used to accelerate polymorphic dispatch. Let's dive in to that.

    By implementation technique, I mean that the technique applies to the language compiler and runtime, rather than to the semantics of the language itself. The effects on the language do exist though in an indirect way, in the sense that inline caches can make some operations faster and therefore more common. Eventually inline caches can affect what users expect out of a language and what kinds of programs they write.

    But I'm getting ahead of myself. Polymorphic dispatch literally means "choosing based on multiple forms". Let's say your language has immutable strings -- like Java, Python, or Javascript. Let's say your language also has operator overloading, and that it uses + to concatenate strings. Well at that point you have a problem -- while you can specify a terse semantics of some core set of operations on strings (win!), you can't choose one representation of strings that will work well for all cases (lose!). If the user has a workload where they regularly build up strings by concatenating them, you will want to store strings as trees of substrings. On the other hand if they want to access codepoints by index, then you want an array. But if the codepoints are all below 256, maybe you should represent them as bytes to save space, whereas maybe instead as 4-byte codepoints otherwise? Or maybe even UTF-8 with a codepoint index side table.

    The right representation (form) of a string depends on the myriad ways that the string might be used. The string-append operation is polymorphic, in the sense that the precise code for the operator depends on the representation of the operands -- despite the fact that the meaning of string-append is monomorphic!

    Anyway, that's the problem. Before inline caches came along, there were two solutions: callouts and open-coding. Both were bad in similar ways. A callout is where the compiler generates a call to a generic runtime routine. The runtime routine will be able to handle all the myriad forms and combination of forms of the operands. This works fine but can be a bit slow, as all callouts for a given operator (e.g. string-append) dispatch to a single routine for the whole program, so they don't get to optimize for any particular call site.

    One tempting thing for compiler writers to do is to effectively inline the string-append operation into each of its call sites. This is "open-coding" (in the terminology of the early Lisp implementations like MACLISP). The advantage here is that maybe the compiler knows something about one or more of the operands, so it can eliminate some cases, effectively performing some compile-time specialization. But this is a limited technique; one could argue that the whole point of polymorphism is to allow for generic operations on generic data, so you rarely have compile-time invariants that can allow you to specialize. Open-coding of polymorphic operations instead leads to code bloat, as the string-append operation is just so many copies of the same thing.

    Inline caches emerged to solve this problem. They trace their lineage back to Smalltalk 80, gained in complexity and power with Self and finally reached mass consciousness through Javascript. These languages all share the characteristic of being dynamically typed and object-oriented. When a user evaluates a statement like x = y.z, the language implementation needs to figure out where y.z is actually located. This location depends on the representation of y, which is rarely known at compile-time.

    However for any given reference y.z in the source code, there is a finite set of concrete representations of y that will actually flow to that call site at run-time. Inline caches allow the language implementation to specialize the y.z access for its particular call site. For example, at some point in the evaluation of a program, y may be seen to have representation R1 or R2. For R1, the z property may be stored at offset 3 within the object's storage, and for R2 it might be at offset 4. The inline cache is a bit of specialized code that compares the type of the object being accessed against R1, in that case returning the value at offset 3, otherwise comparing against R2 and returning the value at offset 4, and otherwise falling back to a generic routine. If this isn't clear to you, Vyacheslav Egorov wrote a fine article describing and implementing the object representation optimizations enabled by inline caches.
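
    To make that shape concrete, here is a rough sketch of such a stub in Scheme. This is purely illustrative: representation-of, slot-ref, generic-property-ref and the %R1/%R2 constants are made-up names standing for the two representations seen so far, and a real inline cache would be generated bytecode or machine code rather than a closure.

    ;; Illustrative sketch of an inline cache stub for the y.z access.
    (define (y.z-inline-cache y)
      (cond
       ((eq? (representation-of y) %R1) (slot-ref y 3)) ; z at offset 3 for R1
       ((eq? (representation-of y) %R2) (slot-ref y 4)) ; z at offset 4 for R2
       (else (generic-property-ref y 'z))))             ; miss: fall back, maybe extend the cache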

    Inline caches also serve as input data to later stages of an adaptive compiler, allowing the compiler to selectively inline (open-code) only those cases that are appropriate to values actually seen at any given call site.

    but how?

    The classic formulation of inline caches from Self and early V8 actually patched the code being executed. An inline cache might be allocated at address 0xcabba9e5 and the code emitted for its call-site would be jmp 0xcabba9e5. If the inline cache ended up bottoming out to the generic routine, a new inline cache would be generated that added an implementation appropriate to the newly seen "form" of the operands and the call-site. Let's say that new IC (inline cache) would have the address 0x900db334. Early versions of V8 would actually patch the machine code at the call-site to be jmp 0x900db334 instead of jmp 0xcabba9e5.

    Patching machine code has a number of disadvantages, though. It is inherently target-specific: you will need different strategies to patch x86-64 and armv7 machine code. It's also expensive: you have to flush the instruction cache after the patch, which slows you down. That is, of course, if you are allowed to patch executable code at all; on many systems that's impossible. Writable machine code is also a potential vulnerability, as it can make the system more vulnerable to remote code execution.

    Perhaps worst of all, though, patching machine code is not thread-safe. In the case of early Javascript, this perhaps wasn't so important; but as JS implementations gained parallel garbage collectors and JS-level parallelism via "service workers", this becomes less acceptable.

    For all of these reasons, the modern take on inline caches is to implement them as a memory location that can be atomically modified. The call site is just jmp *loc, as if it were a virtual method call. Modern CPUs have "branch target buffers" that predict the target of these indirect branches with very high accuracy so that the indirect jump does not become a pipeline stall. (What does this mean in the face of the Spectre v2 vulnerabilities? Sadly, God only knows at this point. Saddest panda.)
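
    In Scheme terms, you can picture loc as a one-word cell that the call site dereferences on every call; installing a better-specialized handler is then a single store into that cell. A sketch only, reusing the hypothetical names from above -- the VM's actual representation is of course lower-level than this:

    ;; The cell, initially pointing at the generic routine.
    (define loc (lambda (y) (generic-property-ref y 'z)))

    ;; The call site: "jmp *loc" in Scheme clothing.
    (define (cached-y.z y)
      (loc y))

    ;; Later, installing a specialized stub is one atomic store.
    (set! loc y.z-inline-cache)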

    cry, the beloved country

    I am interested in ICs in the context of the Guile implementation of Scheme, but first I will make a digression. Scheme is a very monomorphic language. Yet, this monomorphism is entirely cultural. It is in no way essential. Lack of ICs in implementations has actually fed back and encouraged this monomorphism.

    Let us take as an example the case of property access. If you have a pair in Scheme and you want its first field, you do (car x). But if you have a vector, you do (vector-ref x 0).

    What's the reason for this nonuniformity? You could have a generic ref procedure, which when invoked as (ref x 0) would return the field in x associated with 0. Or (ref x 'foo) to return the foo property of x. It would be more orthogonal in some ways, and it's completely valid Scheme.
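
    Such a ref is easy enough to write as a library procedure; a minimal sketch in portable Scheme might look like the following (which types it should cover and how it would be extended are open design questions, so treat this as illustration only):

    ;; A generic accessor, dispatching on the representation of x.
    (define (ref x key)
      (cond
       ((vector? x) (vector-ref x key))
       ((string? x) (string-ref x key))
       ((pair? x)   (list-ref x key))
       (else (error "don't know how to ref" x key))))

    Note that every call to this ref pays for the whole dispatch chain; collapsing that per-call-site cost is exactly what an inline cache is for.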

    We don't write Scheme programs this way, though. From what I can tell, it's for two reasons: one good, and one bad.

    The good reason is that saying vector-ref means more to the reader. You know more about the complexity of the operation and what side effects it might have. When you call ref, who knows? Using concrete primitives allows for better program analysis and understanding.

    The bad reason is that Scheme implementations, Guile included, tend to compile (car x) to much better code than (ref x 0). Scheme implementations in practice aren't well-equipped for polymorphic data access. In fact it is standard Scheme practice to abuse the "macro" facility to manually inline code so that certain performance-sensitive operations get inlined into a closed graph of monomorphic operators with no callouts. To the extent that this is true, Scheme programmers, Scheme programs, and the Scheme language as a whole are all victims of their implementations. JavaScript, for example, does not have this problem -- to a small extent, maybe, yes, performance tweaks and tuning are always a thing, but JavaScript implementations' ability to burn away polymorphism and abstraction results in an entirely different character in JS programs versus Scheme programs.

    it gets worse

    On the most basic level, Scheme is the call-by-value lambda calculus. It's well-studied, well-understood, and eminently flexible. However the way that the syntax maps to the semantics hides a constrictive monomorphism: that the "callee" of a call refer to a lambda expression.

    Concretely, in an expression like (a b), in which a is not a macro, a must evaluate to the result of a lambda expression. Perhaps by reference (e.g. (define a (lambda (x) x))), perhaps directly; but a lambda nonetheless. But what if a is actually a vector? At that point the Scheme language standard would declare that to be an error.

    The semantics of Clojure, though, would allow for ((vector 'a 'b 'c) 1) to evaluate to b. Why not in Scheme? There are the same good and bad reasons as with ref. Usually, the concerns of the language implementation dominate, regardless of those of the users who generally want to write terse code. Of course in some cases the implementation concerns should dominate, but not always. Here, Scheme could be more flexible if it wanted to.
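
    For intuition, the Clojure-ish behaviour can be emulated in plain Scheme by routing calls through a dispatcher -- purely as illustration; the interesting question is whether an implementation with ICs could make such polymorphic application cheap enough to be the default, not whether you'd write this wrapper by hand:

    ;; Illustrative polymorphic application.
    (define (call callee . args)
      (cond
       ((procedure? callee) (apply callee args))
       ((vector? callee)    (vector-ref callee (car args)))
       (else (error "not applicable" callee))))

    (call (vector 'a 'b 'c) 1)   ; => b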

    what have you done for me lately

    Although inline caches are not a miracle cure for performance overheads of polymorphic dispatch, they are a tool in the box. But what, precisely, can they do, both in general and for Scheme?

    To my mind, they have five uses. If you can think of more, please let me know in the comments.

    Firstly, they have the classic named property access optimizations as in JavaScript. These apply less to Scheme, as we don't have generic property access. Perhaps this is a deficiency of Scheme, but it's not exactly low-hanging fruit. Perhaps this would be more interesting if Guile had more generic protocols such as Racket's iteration.

    Next, there are the arithmetic operators: addition, multiplication, and so on. Scheme's arithmetic is indeed polymorphic; the addition operator + can add any number of complex numbers, with a distinction between exact and inexact values. On a representation level, Guile has fixnums (small exact integers, no heap allocation), bignums (arbitrary-precision heap-allocated exact integers), fractions (exact ratios between integers), flonums (heap-allocated double-precision floating point numbers), and compnums (inexact complex numbers, internally a pair of doubles). Also in Guile, arithmetic operators are "primitive generics", meaning that they can be extended to operate on new types at runtime via GOOPS.

    The usual situation though is that any particular instance of an addition operator only sees fixnums. In that case, it makes sense to only emit code for fixnums, instead of the product of all possible numeric representations. This is a clear application where inline caches can be interesting to Guile.
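
    The logic such a fixnum-only addition IC would encode might look roughly like this. A hand-written sketch: fixnum? stands in for the VM's tag check, generic-add is a made-up name for the full polymorphic routine, and in practice the VM would emit this as bytecode or machine code and widen it if other representations ever show up at the call site.

    ;; Sketch of a call-site handler for (+ a b) that has only seen fixnums.
    (define (add-ic a b)
      (if (and (fixnum? a) (fixnum? b))
          (let ((sum (+ a b)))       ; real code: untagged add with an overflow check
            (if (fixnum? sum)
                sum
                (generic-add a b)))  ; overflowed into bignum territory: bail out
          (generic-add a b)))        ; saw a non-fixnum: bail out to the generic +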

    Third, there is a very specific case related to dynamic linking. Did you know that most programs compiled for GNU/Linux and related systems have inline caches in them? It's a bit weird but the "Procedure Linkage Table" (PLT) segment in ELF binaries on Linux systems is set up in a way that when e.g. libfoo.so is loaded, the dynamic linker usually doesn't eagerly resolve all of the external routines that libfoo.so uses. The first time that libfoo.so calls frobulate, it ends up calling a procedure that looks up the location of the frobulate procedure, then patches the binary code in the PLT so that the next time frobulate is called, it dispatches directly. To dynamic language people it's the weirdest thing in the world that the C/C++/everything-static universe has at its cold, cold heart a hash table and a dynamic dispatch system that it doesn't expose to any kind of user for instrumenting or introspection -- any user that's not a malware author, of course.

    But I digress! Guile can use ICs to lazily resolve runtime routines used by compiled Scheme code. But perhaps this isn't optimal, as the set of primitive runtime calls that Guile will embed in its output is finite, and so resolving these routines eagerly would probably be sufficient. Guile could use ICs for inter-module references as well, and these should indeed be resolved lazily; but I don't know, perhaps the current strategy of using a call-site cache for inter-module references is sufficient.

    Fourthly (are you counting?), there is a general case of the former: when you see a call (a b) and you don't know what a is. If you put an inline cache in the call, instead of having to emit checks that a is a heap object and a procedure and then emit an indirect call to the procedure's code, you might be able to emit simply a check that a is the same as x, the only callee you ever saw at that site, and in that case you can emit a direct branch to the function's code instead of an indirect branch.

    Here I think the argument is less strong. Modern CPUs are already very good at indirect jumps and well-predicted branches. The value of a devirtualization pass in compilers is that it makes the side effects of a virtual method call concrete, allowing for more optimizations; avoiding indirect branches is good but not necessary. On the other hand, Guile does have polymorphic callees (generic functions), and call ICs could help there. Ideally though we would need to extend the language to allow generic functions to feed back to their inline cache handlers.

    Finally, ICs could allow for cheap tracepoints and breakpoints. If at every breakable location you included a jmp *loc, and the initial value of *loc was the next instruction, then you could patch individual locations with code to run there. The patched code would be responsible for saving and restoring machine state around the instrumentation.

    Honestly I struggle a lot with the idea of debugging native code. GDB does the least-overhead, most-generic thing, which is patching code directly; but it runs from a separate process, and in Guile we need in-process portable debugging. The debugging use case is a clear area where you want adaptive optimization, so that you can omit debugging ceremony from the hottest code, knowing that you can fall back on some earlier tier. Perhaps Guile should bite the bullet and go this way too.

    implementation plan

    In Guile, monomorphic as it is in most things, probably only arithmetic is worth the trouble of inline caches, at least in the short term.

    Another question is how much to specialize the inline caches to their call site. On the extreme side, each call site could have a custom calling convention: if the first operand is in register A and the second is in register B and they are expected to be fixnums, and the result goes in register C, and the continuation is the code at L, well then you generate an inline cache that specializes to all of that. No need to shuffle operands or results, no need to save the continuation (return location) on the stack.

    The opposite would be to call ICs as if they were normal procedures: shuffle arguments into fixed operand registers, push a stack frame, and when the IC returns, shuffle the result into place.

    Honestly I am leaning mostly towards the simple solution. I am concerned about code and heap bloat if I specialize to every last detail of a call site. Also, maximum speed comes with an adaptive optimizer, and in that case simple lower tiers are best.

    sanity check

    To check these impressions, I took a look at V8's current source code to see where it uses ICs in practice. When I worked on V8, the compiler was entirely different -- there were two tiers, and both of them generated native code. Inline caches were everywhere, and they were gnarly; every architecture had its own implementation. Now in V8 there are two tiers, not the same as the old ones, and the lowest one is a bytecode interpreter.

    As an adaptive optimizer, V8 doesn't need breakpoint ICs. It can always deoptimize back to the interpreter. In actual practice, to debug at a source location, V8 will patch the bytecode to insert a "DebugBreak" instruction, which has its own support in the interpreter. V8 also supports optimized compilation of this operation. So, no ICs needed here.

    Likewise for generic type feedback, V8 records types as data rather than in the classic formulation of inline caches as in Self. I think WebKit's JavaScriptCore uses a similar strategy.

    V8 does use inline caches for property access (loads and stores). Besides that there is an inline cache used in calls which is just used to record callee counts, and not used for direct call optimization.

    Surprisingly, V8 doesn't even seem to use inline caches for arithmetic (any more?). Fair enough, I guess, given that JavaScript's numbers aren't very polymorphic, and even with a system with fixnums and heap floats like V8, floating-point numbers are rare in cold code.

    The dynamic linking and relocation points don't apply to V8 either, as it doesn't receive binary code from the internet; it always starts from source.

    twilight of the inline cache

    There was a time when inline caches were recommended to solve all your VM problems, but it would seem now that their heyday is past.

    ICs are still a win if you have named property access on objects whose shape you don't know at compile-time. But improvements in CPU branch target buffers mean that it's no longer imperative to use ICs to avoid indirect branches (modulo Spectre v2), and creating direct branches via code-patching has gotten more expensive and tricky on today's targets with concurrency and deep cache hierarchies.

    Besides that, the type feedback component of inline caches seems to be taken over by explicit data-driven call-site caches, rather than executable inline caches, and the highest-throughput tiers of an adaptive optimizer burn away inline caches anyway. The pressure on an inline cache infrastructure now is towards simplicity and ease of type and call-count profiling, leaving the speed component to those higher tiers.

    In Guile the bounded polymorphism on arithmetic combined with the need for ahead-of-time compilation means that ICs are probably a code size and execution time win, but it will take some engineering to prevent the calling convention overhead from dominating cost.

    Time to experiment, then -- I'll let y'all know how it goes. Thoughts and feedback welcome from the compilerati. Until then, happy hacking :)

    by Andy Wingo at February 07, 2018 03:14 PM

    February 05, 2018

    Andy Wingo

    notes from the fosdem 2018 networking devroom

    Greetings, internet!

    I am on my way back from FOSDEM and thought I would share with yall some impressions from talks in the Networking devroom. I didn't get to go to all that many talks -- FOSDEM's hallway track is the hottest of them all -- but I did hit a select few. Thanks to Dave Neary at Red Hat for organizing the room.

    Ray Kinsella -- Intel -- The path to data-plane micro-services

    The day started with a drum-beating talk that was very light on technical information.

    Essentially Ray was arguing for an evolution of network function virtualization -- instead of running VNFs on bare metal as was done in the days of yore, people started to run them in virtual machines, and now they run them in containers -- so what's next? Ray is saying that "cloud-native VNFs" are the next step.

    Cloud-native VNFs would move from "greedy" VNFs that take charge of whatever cores are available to them, to some kind of resource sharing. "Maybe users value flexibility over performance", says Ray. It's the Care Bears approach to networking: (resource) sharing is caring.

    In practice he proposed two ways that VNFs can map to cores and cards.

    One was in-process sharing, which if I understood him properly was actually running network functions as nodes within a VPP process. Basically in this case VPP or DPDK is the scheduler and multiplexes two or more network functions in one process.

    The other was letting Linux schedule separate processes. In networking, we don't usually do it this way: we run network functions on dedicated cores on which nothing else runs. Ray was suggesting that perhaps network functions could be more like "normal" Linux services. Ray doesn't know if Linux scheduling will work in practice. Also it might mean allowing DPDK to work with 4K pages instead of the 2M hugepages it currently requires. This obviously has the potential for more latency hazards and would need some tighter engineering, and ultimately would have fewer guarantees than the "greedy" approach.

    Interesting side things I noticed:

    • All the diagrams show Kubernetes managing CPU node allocation and interface assignment. I guess in marketing diagrams, Kubernetes has completely replaced OpenStack.

    • One slide showed guest VNFs differentiated between "virtual network functions" and "socket-based applications", the latter ones being the legacy services that use kernel APIs. It's a useful terminology difference.

    • The talk identifies user-space networking with DPDK (only!).

    Finally, I note that Conway's law is obviously reflected in the performance overheads: because there are organizational isolations between dev teams, vendors, and users, there are big technical barriers between them too. The least-overhead forms of resource sharing are also those with the highest technical consistency and integration (nodes in a single VPP instance).

    Magnus Karlsson -- Intel -- AF_XDP

    This was a talk about getting good throughput from the NIC to userspace, but by using some kernel facilities. The idea is to get the kernel to set up the NIC and virtualize the transmit and receive ring buffers, but to let the NIC's DMA'd packets go directly to userspace.

    The performance goal is 40Gbps for thousand-byte packets, or 25 Gbps for traffic with only the smallest packets (64 bytes). The fast path does "zero copy" on the packets if the hardware has the capability to steer the subset of traffic associated with the AF_XDP socket to that particular process.

    The AF_XDP project builds on XDP, a newish thing where a little kind of bytecode can run on the kernel or possibly on the NIC. One of the bytecode commands (REDIRECT) causes packets to be forwarded to user-space instead of handled by the kernel's otherwise heavyweight networking stack. AF_XDP is the bridge between XDP on the kernel side and an interface to user-space using sockets (as opposed to e.g. AF_INET). The performance goal was to be within 10% or so of DPDK's raw user-space-only performance.

    The benefits of AF_XDP over the current situation would be that you have just one device driver, in the kernel, rather than having to have one driver in the kernel (which you have to have anyway) and one in user-space (for speed). Also, with the kernel involved, there is a possibility for better isolation between different processes or containers, when compared with raw PCI access from user-space.

    AF_XDP is what was previously known as AF_PACKET v4, and its numbers are looking somewhat OK. Though it's not upstream yet, it might be interesting to get a Snabb driver here.

    I would note that kernel-userspace cooperation is a bit of a theme these days. There are other points of potential cooperation or common domain sharing, storage being an obvious one. However I heard more than once this weekend the kind of "I don't know, that area of the kernel has a different culture" sort of concern as that highlighted by Daniel Vetter in his recent LCA talk.

    François-Frédéric Ozog -- Linaro -- Userland Network I/O

    This talk is hard to summarize. Like the previous one, it's again about getting packets to userspace with some support from the kernel, but the speaker went really deep and I'm not quite sure what in the talk is new and what is known.

    François-Frédéric is working on a new set of abstractions for relating the kernel and user-space. He works on OpenDataPlane (ODP), which is kinda like DPDK in some ways. ARM seems to be a big target for his work; that x86-64 is also a target goes without saying.

    His problem statement was, how should we enable fast userland network I/O, without duplicating drivers?

    François-Frédéric was a bit negative on AF_XDP because (he says) it is so focused on packets that it neglects other kinds of devices with similar needs, such as crypto accelerators. Apparently the challenge here is accelerating a single large IPsec tunnel -- because the cryptographic operations are serialized, you need good single-core performance, and making use of hardware accelerators seems necessary right now for even a single 10Gbps stream. (If you had many tunnels, you could parallelize, but that's not the case here.)

    He was also a bit skeptical about standardizing on the "packet array I/O model" which AF_XDP and most NICs use. What he means here is that most current NICs move packets to and from main memory with the help of a "descriptor array" ring buffer that holds pointers to packets. A transmit array stores packets ready to transmit; a receive array stores maximum-sized packet buffers ready to be filled by the NIC. The packet data itself is somewhere else in memory; the descriptor only points to it. When a new packet is received, the NIC fills the corresponding packet buffer and then updates the "descriptor array" to point to the newly available packet. This requires at least two memory writes from the NIC to memory: at least one to write the packet data (one per 64 bytes of packet data), and one to update the DMA descriptor with the packet length and possible other metadata.

    Although these writes go directly to cache, there's a limit to the number of DMA operations that can happen per second, and with 100Gbps cards, we can't afford to make one such transaction per packet.

    François-Frédéric promoted an alternative I/O model for high-throughput use cases: the "tape I/O model", where packets are just written back-to-back in a uniform array of memory. Every so often a block of memory containing some number of packets is made available to user-space. This has the advantage of packing in more packets per memory block, as there's no wasted space between packets. This increases cache density and decreases DMA transaction count for transferring packet data, as we can use each 64-byte DMA write to its fullest. Additionally there's no side table of descriptors to update, saving a DMA write there.

    Apparently the only cards currently capable of 100 Gbps traffic, the Chelsio and Netcope cards, use the "tape I/O model".

    Incidentally, the DMA transfer limit isn't the only constraint. Something I hadn't fully appreciated before was memory write bandwidth. Before, I had thought that because the NIC would transfer in packet data directly to cache, that this wouldn't necessarily cause any write traffic to RAM. Apparently that's not the case. Later over drinks (thanks to Red Hat's networking group for organizing), François-Frédéric asserted that the DMA transfers would eventually use up DDR4 bandwidth as well.

    A NIC-to-RAM DMA transaction will write one cache line (usually 64 bytes) to the socket's last-level cache. This write will evict whatever was there before. As far as I can tell, there are three cases of interest here. The best case is where the evicted cache line is from a previous DMA transfer to the same address. In that case it's modified in the cache and not yet flushed to main memory, and we can just update the cache instead of flushing to RAM. (Do I misunderstand the way caches work here? Do let me know.)

    However if the evicted cache line is from some other address, we might have to flush to RAM if the cache line is dirty. That causes memory write traffic. But if the cache line is clean, that means it was probably loaded as part of a memory read operation, and then that means we're evicting part of the network function's working set, which will later cause memory read traffic as the data gets loaded in again, and write traffic to flush out the DMA'd packet data cache line.

    François-Frédéric simplified the whole thing to equate packet bandwidth with memory write bandwidth, that yes, the packet goes directly to cache but it is also written to RAM. I can't convince myself that that's the case for all packets, but I need to look more into this.

    Of course the cache pressure and the memory traffic is worse if the packet data is less compact in memory; and worse still if there is any need to copy data. Ultimately, processing small packets at 100Gbps is still a huge challenge for user-space networking, and it's no wonder that there are only a couple devices on the market that can do it reliably, not that I've seen either of them operate first-hand :)

    Talking with Snabb's Luke Gorrie later on, he thought that it could be that we can still stretch the packet array I/O model for a while, given that PCIe gen4 is coming soon, which will increase the DMA transaction rate. So that's a possibility to keep in mind.

    At the same time, apparently there are some "coherent interconnects" coming too which will allow the NIC's memory to be mapped into the "normal" address space available to the CPU. In this model, instead of having the NIC transfer packets to the CPU, the NIC's memory will be directly addressable from the CPU, as if it were part of RAM. The latency to pull data in from the NIC to cache is expected to be slightly longer than a RAM access; for comparison, RAM access takes about 70 nanoseconds.

    For a user-space networking workload, coherent interconnects don't change much. You still need to get the packet data into cache. True, you do avoid the writeback to main memory, as the packet is already in addressable memory before it's in cache. But, if it's possible to keep the packet on the NIC -- like maybe you are able to add some kind of inline classifier on the NIC that could directly shunt a packet towards an on-board IPSec accelerator -- in that case you could avoid a lot of memory transfer. That appears to be the driving factor for coherent interconnects.

    At some point in François-Frédéric's talk, my brain just died. I didn't quite understand all the complexities that he was taking into account. Later, after he kindly took the time to dispel some more of my ignorance, I understand more of it, though not yet all :) The concrete "deliverable" of the talk was a model for kernel modules and user-space drivers that uses the paradigms he was promoting. It's a work in progress from Linaro's networking group, with some support from NIC vendors and CPU manufacturers.

    Luke Gorrie and Asumu Takikawa -- SnabbCo and Igalia -- How to write your own NIC driver, and why

    This talk had the most magnificent beginning: a sort of "repent now ye sinners" sermon from Luke Gorrie, a seasoned veteran of software networking. Luke started by describing the path of righteousness leading to "driver heaven", a world in which all vendors have publicly accessible datasheets which parsimoniously describe what you need to get packets flowing. In this blessed land it's easy to write drivers, and for that reason there are many of them. Developers choose a driver based on their needs, or they write one themselves if their needs are quite specific.

    But there is another path, says Luke, that of "driver hell": a world of wickedness and proprietary datasheets, where even when you buy the hardware, you can't program it unless you're buying a hundred thousand units, and even then you are smitten with the cursed non-disclosure agreements. In this inferno, only a vendor is practically empowered to write drivers, but their poor driver developers are only incentivized to get the driver out the door deployed on all nine architectural circles of driver hell. So they include some kind of circle-of-hell abstraction layer, resulting in a hundred thousand lines of code like a tangled frozen beard. We all saw the abyss and repented.

    Luke described the process that led to Mellanox releasing the specification for its ConnectX line of cards, something that was warmly appreciated by the entire audience, users and driver developers included. Wonderful stuff.

    My Igalia colleague Asumu Takikawa took the last half of the presentation, showing some code for the driver for the Intel i210, i350, and 82599 cards. For more on that, I recommend his recent blog post on user-space driver development. It was truly a ray of sunshine in dark, dark Brussels.

    Ole Trøan -- Cisco -- Fast dataplanes with VPP

    This talk was a delightful introduction to VPP, but without all of the marketing; the sort of talk that makes FOSDEM worthwhile. Usually at more commercial, vendory events, you can't really get close to the technical people unless you have a vendor relationship: they are surrounded by a phalanx of salesfolk. But in FOSDEM it is clear that we are all comrades out on the open source networking front.

    The speaker expressed great personal pleasure at having been able to work on open source software; his relief was palpable. A nice moment.

    He also had some kind words about Snabb, too, saying at one point that "of course you can do it on snabb as well -- Snabb and VPP are quite similar in their approach to life". He trolled the horrible complexity diagrams of many "NFV" stacks whose components reflect the org charts that produce them more than the needs of the network functions in question (service chaining anyone?).

    He did get to drop some numbers as well, which I found interesting. One is that recently they have been working on carrier-grade NAT, aiming for 6 terabits per second. Those are pretty big boxes and I hope they are getting paid appropriately for that :) For context he said that for a 4-unit server, these days you can build one that does a little less than a terabit per second. I assume that's with ten dual-port 40Gbps cards, and I would guess to power that you'd need around 40 cores or so, split between two sockets.

    Finally, he finished with a long example on lightweight 4-over-6. Incidentally this is the same network function my group at Igalia has been building in Snabb over the last couple years, so it was interesting to see the comparison. I enjoyed his commentary that although all of these technologies (carrier-grade NAT, MAP, lightweight 4-over-6) have the ostensible goal of keeping IPv4 running, in reality "we're day by day making IPv4 work worse", mainly by breaking the assumption that if you get traffic from port P on IP M, you can send traffic back to M from another port or another protocol and have it reach the target.

    All of these technologies also have problems with IPv4 fragmentation. Getting it right is possible but expensive. Instead, Ole mentions that he and a cross-vendor cabal of dataplane people have a "dark RFC" in the works to deprecate IPv4 fragmentation entirely :)

    OK that's it. If I get around to writing up the couple of interesting Java talks I went to (I know right?) I'll let yall know. Happy hacking!

    by Andy Wingo at February 05, 2018 05:22 PM

    February 01, 2018

    Iago Toral

    Intel Mesa driver for Linux is now OpenGL 4.6 conformant

    Khronos has recently announced the conformance program for OpenGL 4.6 and I am very happy to say that Intel has submitted successful conformance applications for several of its GPU models for the Mesa Linux driver. For specifics on the conformant hardware you can check the list of conformant OpenGL products at the Khronos website.

    Being conformant on day one, which the Intel Mesa Vulkan driver also obtained back in the day, is a significant achievement. Besides Intel Mesa, only NVIDIA managed to do this, which I think speaks to the amount of work and effort that one needs to put in to achieve it. The days where Linux implementations lagged behind are long gone; we should all celebrate this and acknowledge the important efforts that companies like Intel have put into making this a reality.

    Over the last 8-9 months or so, I have been working together with some of my Igalian friends to keep the Intel drivers (for both OpenGL and Vulkan) conformant, so I am very proud that we have reached this milestone. Kudos to all my work mates who have worked with me on this, to our friends at Intel, who have been providing reviews for our patches, feedback and additional driver fixes, and to many other members in the Mesa community who have contributed to make this possible in one way or another.

    Of course, OpenGL 4.6 conformance requires that we have an implementation of GL_ARB_gl_spirv, which allows OpenGL applications to consume SPIR-V shaders. If you have been following Igalia’s work, you have already seen some of my colleagues sending patches for this over the last months, but the feature is not completely upstreamed yet. We are working hard on this, but the scope of the implementation that we want for upstream is rather ambitious, since it involves (finally) having a full shader linker in NIR. Getting that to be as complete as the current GLSL linker and in a shape that is good enough for review and upstreaming is going to take some time, but it is surely a worthwhile effort that will pay off in the future, so please look forward to it and be patient with us as we upstream more of it in the coming months.

    It is also important to remark that OpenGL 4.6 conformance doesn’t just validate new features in OpenGL 4.6, it is a full conformance program for OpenGL drivers that includes OpenGL 4.6 functionality, and as such, it is a super set of the OpenGL 4.5 conformance. The OpenGL 4.6 CTS does, in fact, incorporate a whole lot of bugfixes and expanded coverage for OpenGL features that were already present in OpenGL 4.5 and prior.

    What is the conformance process and why is it important?

    It is a well known issue with standards that different implementations are not always consistent. This can happen for a number of reasons. For example, implementations have bugs which can make something work on one platform but not on another (which will then require applications to implement workarounds). Another reason for this is that sometimes implementers just have different interpretations of the standard.

    The Khronos conformance program is intended to ensure that products that implement Khronos standards (such as OpenGL or Vulkan drivers) do what they are supposed to do and they do it consistently across implementations from the same or different vendors. This is achieved by producing an extensive test suite, the Conformance Test Suite (or CTS for short), which aims to verify that the semantics of the standard are properly implemented by as many vendors as possible.

    Why is CTS different to other test suites available?

    One reason is that CTS development includes Khronos members who are involved in the definition of the API specifications. This means there are people well versed in the spec language who can review and provide feedback to test developers to ensure that the tests are good.

    Another reason is that before new tests go in, it is required that there are at least a number of implementations (from different vendors) that pass them. This means that various vendors have implemented the related specifications and these different implementations agree on the expected result, which is usually a good sign that the tests and the implementations are good (although this is not always enough!).

    How do CTS and the Khronos conformance process help API implementers and users?

    First, it makes it so that existing and new functionality covered in the API specifications is tested before granting the conformance status. This means that implementations have to run all these tests and pass them, producing the same results as other implementations, so as far as the test coverage goes, the implementations are correct and consistent, which is the whole point of this process: it won’t matter whether you’re running your application on Intel, NVIDIA, AMD or a different GPU vendor; if your application is correct, it should run the same no matter the driver you are running on.

    Now, this doesn’t mean that your application will run smoothly on all conformant platforms out of the box. Application developers still need to be aware that certain aspects or features in the specifications are optional, or that different hardware implementations may have different limits for certain things. Writing software that can run on multiple platforms is always a challenge and some of that will always need to be addressed on the application side, but at least the conformance process attempts to ensure that for applications that do their part of the work, things will work as intended.

    There is another interesting implication of conformance that has to do with correct API specification. Designing APIs that can work across hardware from different vendors is a challenging process. With the CTS, Khronos has an opportunity to validate the specifications against actual implementations. In other words, the CTS allows Khronos to verify that vendors can implement the specifications as intended and revisit the specification if they can’t before releasing them. This ensures that API specifications are reasonable and a good match for existing hardware implementations.

    Another benefit of CTS is that vendors working on any API implementation will always need some level of testing to verify their code during development. Without CTS, they would have to write their own tests (which would be biased towards their own interpretations of the spec anyway), but with CTS, they can leave that to Khronos and focus on the implementation instead, cutting down development times and sharing testing code with other vendors.

    What about Piglit or other testing frameworks?

    CTS doesn’t make Piglit obsolete or redundant at all. While CTS coverage will improve over the years it is nearly impossible to have 100% coverage, so having other testing frameworks around that can provide extra coverage is always good.

    My experience working on the Mesa drivers is that it is surprisingly easy to break stuff, especially on OpenGL, which has a lot of legacy stuff in it. I have seen way too many instances of patches that seemed correct and in fact fixed actual problems only to later see Piglit, CTS and/or dEQP report regressions on existing tests. The (not so surprising) thing is that many times, the regressions were reported on just one of these testing frameworks, which means they all provide some level of coverage that is not included in the others.

    It is for this reason that the continuous integration system for Mesa provided by Intel runs all of these testing frameworks (and then some others). You just never get enough testing. And even then, some regressions slip into releases despite all the testing!

    Also, for the case of Piglit in particular, I have to say that it is very easy to write new tests, especially shader runner tests, which is always a bonus. Writing tests for CTS or dEQP, for example, requires more work in general.

    So what have I been doing exactly?

    For the last 9 months or so, I have been working on ensuring that the Intel Mesa drivers for both Vulkan and OpenGL are conformant. If you have followed any of my work in Mesa over the last year or so, you have probably already guessed this, since most of the patches I have been sending to Mesa reference the conformance tests they fix.

    To be more thorough, my work included:

    1. Reviewing and testing patches submitted for inclusion in CTS that either fixed test bugs, extended coverage for existing features or added new tests for new API specifications. CTS is a fairly active project with numerous changes submitted for review pretty much every day, for OpenGL, OpenGL ES and Vulkan, so staying on top of things requires significant dedication.

    2. Ensuring that the Intel Mesa drivers passed all the CTS tests for both Vulkan and OpenGL. This requires running the conformance tests, identifying test failures, identifying the cause of the failures, and providing proper fixes. The fixes would go to CTS when the cause of the issue was a bogus test; to the driver, when it was a bug in our implementation or the driver was simply missing some functionality; or even to the OpenGL or Vulkan specs, when the source of the problem was incomplete, ambiguous or incorrect spec language that was used to drive the test development. I have found instances of all these situations.

    Where can I get the CTS code?

    Good news, it is open source and available at GitHub.

    This is a very important and welcomed change by Khronos. When I started helping Intel with OpenGL conformance, specifically for OpenGL 4.5, the CTS code was only available to specific Khronos members. Since then, Khronos has done a significant effort in working towards having an open source testing framework where anyone can contribute, so kudos to Khronos for doing this!

    Going open source not only enables broader collaboration and further development of the CTS. It also puts in the hands of API users a handful of small test samples that people can use to learn how some of the new Vulkan and OpenGL APIs released to the public are meant to be used, which is always nice.

    What is next?

    As I said above, CTS development is always ongoing: there is always testing coverage to expand for existing features, bugfixes to provide for existing tests, more tests that need to be adapted or changed to match corrections in the spec language, new extensions and APIs that need testing coverage, etc.

    And on the driver side, there are always new features to implement that come with their potential bugs that need to be fixed, occasional regressions that need to be addressed promptly, new bugs uncovered by new tests that need fixes, etc.

    So the fun never really stops 🙂

    Final words

    In this post, besides bringing the good news to everyone, I hope that I have made a good case for why the Khronos CTS is important for the industry and why we should care about it. I hope that I also managed to give a sense for the enormous amount of work that goes into making all of this possible, both on the side of Khronos and the side of the driver developer teams. I think all this effort means better drivers for everyone and I hope that we all, as users, come to appreciate it for that.

    Finally, big thanks to Intel for sponsoring our work on Mesa and CTS, and also to Igalia, for having me work on this wonderful project.

    OpenGL® and the oval logo are trademarks or registered trademarks of Silicon Graphics, Inc. in the United States and/or other countries worldwide. Additional license details are available on the SGI website.

    by Iago Toral at February 01, 2018 11:34 AM

    January 25, 2018

    Jessica Tallon

    Hacking on User Space Networking

    As most of you know, I’ve been working at Igalia close to two years now. If you asked me what I do I’d probably say I hack on “networky stuff”, which is a bit opaque. I thought I’d take the time to tell you about the kind of things I work on.

    A bit of background

    Consider a data centre full of big expensive boxes that have some high speed network cards performing some specific function (e.g. firewalls, VoIP, etc.). These are all well and good but they cost a fortune and are just what they look like: black boxes. If you’d like to modify the functionality beyond what the vendor provides, you’d probably be out of luck. If the market changes and you need to provide other functionality, you’re stuck having to buy yet another big expensive box that only performs the one function, just like the old one.

    You might be thinking, just buy regular server hardware and write some program to deal with those packets. Easy, right? Well… not so fast. Your run of the mill linux application written against the kernel is just too slow as it’s not designed to do these zippy networky things. It’s designed to be a general purpose operating system.

    For those curious about the kernel’s speeds Cloudflare has a great post demonstrating this.

    Enter user space networking

    Instead of just relying on the kernel, you can tell linux to leave the network card alone. Linux allows you to poke around directly over PCI, so you can write a driver for the card outside the kernel. The driver will be part of your program allowing you to skip all the kernel faffing and get down to crunching those packets.

    Great, right? Job done? Well, kind of…

    I’m going to mainly talk about snabb for the rest of this blog post. It’s not the only game in town but I prefer it and it’s what I have the most experience with. Snabb is a framework which implements a bunch of drivers and other neat tools in Luajit (a lightning fast implementation of Lua). Imagine you’re a program using snabb, having all these packets coming at you at breakneck speed from your 10Gb card. You have about 67.2 nanoseconds per packet (my thanks to this helpful person for doing the work for me) to deal with it and move on to the next packet. That’s not much time, right? I’ll try to put it into context; here are some things you may wish to do whilst dealing with the packet:

    L2 cache hit       4.3 ns
    L3 cache hit       7.9 ns
    Cache miss         32 ns
    Linux syscall[1]   87.77 ns or 41.85 ns

    [1] – If CONFIG_AUDITSYSCALL is enabled it’s the higher one, if not it’s the lower one.
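
    As an aside, the 67.2 ns figure presumably comes from the worst case of minimum-sized frames: a 64-byte Ethernet frame is accompanied on the wire by an 8-byte preamble and a 12-byte inter-frame gap, so each packet occupies 84 bytes of line time.

    84 bytes * 8          = 672 bits on the wire per packet
    672 bits / 10 Gbit/s  = 67.2 ns per packet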

    You can imagine that if you’re going full pelt, you don’t really have time to do much at all. Two cache misses and that’s your budget, one syscall and you’ve blown way past the budget. The solution to all this is to deal with a bunch of packets at once. Doing this allows you to go to RAM or whatever once for the whole set rather than doing it for each packet.

    Dags att köra Snabb(t) (Swedish for "time to run Snabb, fast")

    My colleague Diego has recently written a great set of blogs which are more in-depth than this, so check those out if you’re interested. Snabb works by having a graph of “apps”; each app is usually responsible for doing one thing, for example: being a network driver, reading/writing pcap files, filtering, and so forth. The apps are then connected together to form a graph.

    Lets see a real example of a very simple, but potentially useful app. We’re going to take packets in from an Intel 82599 10 Gigabit card and throw them into a pcap file (packet capture file). We’ll take the “output” of the driver app and connect it to the input of the pcap writing app. Here’s the code:

    module(..., package.seeall)
    
    local pcap = require("apps.pcap.pcap")
    local Intel82599 = require("apps.intel_mp.intel_mp").Intel82599
    
    
    function run(parameters)
       -- Lua arrays are indexed from 1 not 0.
       local pci_address = parameters[1]
       local output_pcap = parameters[2]
    
       -- Configure the "apps" we're going to use in the graph.
       -- The "config" object is injected in when we're loaded by snabb.
       local c = config.new()
       config.app(c, "nic", Intel82599, {pciaddr = pci_address})
       config.app(c, "pcap", pcap.PcapWriter, output_pcap)
    
       -- Link up the apps into a graph.
       config.link(c, "nic.output -> pcap.input")
    
       -- Tell the snabb engine our configuration we've just made.
       engine.configure(c)
    
       -- Lets start the apps for 10 seconds!
       engine.main({duration=10, report = {showlinks=true}})
    end
    

    Awesome! 🙏 

    Just to give you a taste of what is included in snabb out of the box (by no means an exhaustive list):

    • pcap reading / writing
    • PF filtering (like the openBSD firewall, more about speed and implementation from my colleagues Asumu and Andy)
    • Rate limiting
    • ARP and NDP
    • YANG (network configuration language for specifying, validating and creating parsers for configuration files)

    …and the list continues. It’s also pretty simple to create your own: just define a Lua object which has "pull" (input) and "push" (output) methods and do something with the packet in the middle. I mainly work on the LwAFTR, which provides a solution to IPv4 exhaustion: the IPv4 packet is wrapped up in an IPv6 packet at the user’s end and unwrapped at the ISP, so the ISP can give users a port range of an IPv4 address rather than the whole thing.

    by tsyesika at January 25, 2018 08:15 AM

    January 24, 2018

    Michael Catanzaro

    Announcing Epiphany Technology Preview

    If you use macOS, the best way to use a recent development snapshot of WebKit is surely Safari Technology Preview. But until now, there’s been no good way to do so on Linux, short of running a development distribution like Fedora Rawhide.

    Enter Epiphany Technology Preview. This is a nightly build of Epiphany, on top of the latest development release of WebKitGTK+, running on the GNOME master Flatpak runtime. The target audience is anyone who wants to assist with Epiphany development by testing the latest code and reporting bugs, so I’ve added the download link to Epiphany’s development page.

    Since it uses Flatpak, there are no host dependencies aside from Flatpak itself, so it should work on any system that can run Flatpak. Thanks to the Flatpak sandbox, it’s far more secure than the version of Epiphany provided by your operating system. And of course, you enjoy automatic updates from GNOME Software or any software center that supports Flatpak.

    Enjoy!

    (P.S. If you want to use the latest stable version instead, with all the benefits provided by Flatpak, get that here.)

    by Michael Catanzaro at January 24, 2018 10:58 PM

    January 17, 2018

    Andy Wingo

    instruction explosion in guile

    Greetings, fellow Schemers and compiler nerds: I bring fresh nargery!

    instruction explosion

    A couple years ago I made a list of compiler tasks for Guile. Most of these are still open, but I've been chipping away at the one labeled "instruction explosion":

    Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do this, which would be a lose overall because you have more instruction dispatch.

    Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.

    Now that I'm getting close to finished I wanted to share some thoughts. Previous progress reports on the mailing list.

    a simple loop

    As an example, consider this loop that sums the 32-bit floats in a bytevector. I've annotated the code with lines and columns so that you can correspond different pieces to the assembly.

       0       8   12     19
     +-v-------v---v------v-
     |
    1| (use-modules (rnrs bytevectors))
    2| (define (f32v-sum bv)
    3|   (let lp ((n 0) (sum 0.0))
    4|     (if (< n (bytevector-length bv))
    5|         (lp (+ n 4)
    6|             (+ sum (bytevector-ieee-single-native-ref bv n)))
    7|          sum)))
    

    The assembly for the loop before instruction explosion went like this:

    L1:
      17    (handle-interrupts)     at (unknown file):5:12
      18    (uadd/immediate 0 1 4)
      19    (bv-f32-ref 1 3 1)      at (unknown file):6:19
      20    (fadd 2 2 1)            at (unknown file):6:12
      21    (s64<? 0 4)             at (unknown file):4:8
      22    (jnl 8)                ;; -> L4
      23    (mov 1 0)               at (unknown file):5:8
      24    (j -7)                 ;; -> L1
    

    So, already Guile's compiler has hoisted the (bytevector-length bv) and unboxed the loop index n and accumulator sum. This work aims to simplify further by exploding bv-f32-ref.

    exploding the loop

    In practice, instruction explosion happens in CPS conversion, as we are converting the Scheme-like Tree-IL language down to the CPS soup language. When we see a Tree-IL primcall (a call to a known primitive), instead of lowering it to a corresponding CPS primcall, we inline a whole blob of code.

    In the concrete case of bv-f32-ref, we'd inline it with something like the following:

    (unless (and (heap-object? bv)
                 (eq? (heap-type-tag bv) %bytevector-tag))
      (error "not a bytevector" bv))
    (define len (word-ref bv 1))
    (define ptr (word-ref bv 2))
    (unless (and (<= 4 len)
                 (<= idx (- len 4)))
      (error "out of range" idx))
    (f32-ref ptr idx)
    

    As you can see, there are four branches hidden in the bv-f32-ref: two to check that the object is a bytevector, and two to check that the index is within range. In this explanation we assume that the offset idx is already unboxed, but actually unboxing the index ends up being part of this work as well.

    One of the goals of instruction explosion was that by breaking the operation into a number of smaller, more orthogonal parts, native code generation would be easier, because the compiler would only have to know about those small bits. However without an optimizing compiler, it would be better to reify a call out to a specialized bv-f32-ref runtime routine instead of inlining all of this code -- probably whatever language you write your runtime routine in (C, rust, whatever) will do a better job optimizing than your compiler will.

    But with an optimizing compiler, there is the possibility of removing possibly everything but the f32-ref. Guile doesn't quite get there, but almost; here's the post-explosion optimized assembly of the inner loop of f32v-sum:

    L1:
      27    (handle-interrupts)
      28    (tag-fixnum 1 2)
      29    (s64<? 2 4)             at (unknown file):4:8
      30    (jnl 15)               ;; -> L5
      31    (uadd/immediate 0 2 4)  at (unknown file):5:12
      32    (u64<? 2 7)             at (unknown file):6:19
      33    (jnl 5)                ;; -> L2
      34    (f32-ref 2 5 2)
      35    (fadd 3 3 2)            at (unknown file):6:12
      36    (mov 2 0)               at (unknown file):5:8
      37    (j -10)                ;; -> L1
    

    good things

    The first thing to note is that unlike the "before" code, there's no instruction in this loop that can throw an exception. Neat.

    Next, note that there's no type check on the bytevector; the peeled iteration preceding the loop already proved that the bytevector is a bytevector.

    And indeed there's no reference to the bytevector at all in the loop! The value being dereferenced in (f32-ref 2 5 2) is a raw pointer. (Read this instruction as, "sp[2] = *(float*)((byte*)sp[5] + (uptrdiff_t)sp[2])".) The compiler does something interesting; the f32-ref CPS primcall actually takes three arguments: the garbage-collected object protecting the pointer, the pointer itself, and the offset. The object itself doesn't appear in the residual code, but including it in the f32-ref primcall's inputs keeps it alive as long as the f32-ref itself is alive.

    bad things

    Then there are the limitations. Firstly, instruction 28 tags the u64 loop index as a fixnum, but never uses the result. Why is this here? Sadly it's because the value is used in the bailout at L2. Recall this pseudocode:

    (unless (and (<= 4 len)
                 (<= idx (- len 4)))
      (error "out of range" idx))
    

    Here the error ends up lowering to a throw CPS term that the compiler recognizes as a bailout and renders out-of-line; cool. But it uses idx as an argument, as a tagged SCM value. The compiler untags the loop index, but has to keep a tagged version around for the error cases.

    The right fix is probably some kind of allocation sinking pass that sinks the tag-fixnum to the bailouts. Oh well.

Additionally, there are two tests in the loop. Are both necessary? Turns out, yes :( Imagine you have a bytevector of length 1025. The loop continues until the last ref at offset 1024, which is within bounds of the bytevector, but only one byte is available at that offset, so we need to throw an exception. The compiler did as good a job as we could expect it to do.

is it worth it? where to now?

    On the one hand, instruction explosion is a step sideways. The code is more optimal, but it's more instructions. Because Guile currently has a bytecode VM, that means more total interpreter overhead. Testing on a 40-megabyte bytevector of 32-bit floats, the exploded f32v-sum completes in 115 milliseconds compared to around 97 for the earlier version.

    On the other hand, it is very easy to imagine how to compile these instructions to native code, either ahead-of-time or via a simple template JIT. You practically just have to look up the instructions in the corresponding ISA reference, is all. The result should perform quite well.

    I will probably take a whack at a simple template JIT first that does no register allocation, then ahead-of-time compilation with register allocation. Getting the AOT-compiled artifacts to dynamically link with runtime routines is a sufficient pain in my mind that I will put it off a bit until later. I also need to figure out a good strategy for truly polymorphic operations like general integer addition; probably involving inline caches.

    So that's where we're at :) Thanks for reading, and happy hacking in Guile in 2018!

    by Andy Wingo at January 17, 2018 10:30 AM

    January 16, 2018

    Asumu Takikawa

    Supporting both VMDq and RSS in Snabb

    In my previous blog post, I talked about the support libraries and the core structure of Snabb’s NIC drivers. In this post, I’ll talk about some of the driver improvements we made at Igalia over the last few months.

    (as in my previous post, this work was joint work with Nicola Larosa)

    Background

    Modern NICs are designed to take advantage of increasing parallelism in modern CPUs in order to scale to larger workloads.

    In particular, to scale to 100G workloads, it becomes necessary to work in parallel since a single off-the-shelf core cannot keep up. Even with 10G hardware, processing packets in parallel makes it easier for software to operate at line-rate because the time budget is quite tight.

    To get an idea of what the time budget is like, see these calculations. tl;dr is 67.2 ns/packet or about 201 cycles.
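
The arithmetic behind those numbers is short enough to spell out. Here is a worked version in Lua; the 3 GHz clock used for the cycle count is my own assumption:

-- 10 Gbit/s with minimum-size Ethernet frames: each packet occupies
-- 64 bytes of frame plus 7 bytes preamble, 1 byte start-of-frame
-- delimiter and 12 bytes inter-frame gap on the wire.
local line_rate_bps   = 10e9
local bits_per_packet = (64 + 7 + 1 + 12) * 8            -- 672 bits
local packets_per_sec = line_rate_bps / bits_per_packet  -- ~14.88 Mpps
local ns_per_packet   = 1e9 / packets_per_sec            -- ~67.2 ns
local cycles_budget   = ns_per_packet * 3.0              -- ~201 cycles at an assumed 3 GHz
print(ns_per_packet, cycles_budget)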

    To scale to multiple CPUs, NICs have a feature called receive-side scaling or RSS which distributes incoming packets to multiple receive queues. These queues can be serviced by separate cores.

RSS and related features for Intel NICs are detailed further in an overview whitepaper.

    RSS works by computing a hash in hardware over the packet to determine the flow it belongs to (this is similar to the hashing used in IPFIX, which I described in a previous blog post).

    RSS diagram
    A diagram showing how RSS directs packets

    The diagram above tries to illustrate this. When a packet arrives in the NIC, the hash is computed. Packets with the same hash (i.e., they’re in the same flow) are directed to a particular receive queue. Receive queues live in RAM as a ring buffer (shown as blue rings in the diagram) and packets are placed there via DMA by consulting registers on the NIC.
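
As a toy illustration of that last step (this is not driver code, and the hardware uses a Toeplitz hash rather than anything shown here), queue selection boils down to using the flow hash as an index into a redirection table whose entries name receive queues:

-- Redirection table: each entry names a receive queue. The size and
-- queue numbers here are made up; the NIC keeps this table in registers.
local reta = {0, 1, 2, 3, 0, 1, 2, 3}

-- flow_hash is computed by the NIC over the packet's flow fields;
-- equal hashes (same flow) always land in the same queue.
local function rss_queue (flow_hash)
   return reta[(flow_hash % #reta) + 1]   -- +1 because Lua tables are 1-indexed
end

print(rss_queue(0x1234), rss_queue(0x1234))   -- same flow -> same queue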

    All this means that network functions that depend on tracking flow-related state can usually still work in this parallel setup.

    As a side note, you might wonder (I did anyway!) what happens to fragmented packets whose flow membership may not be identifiable from a fragment. It turns out that on Intel NICs, the hash function will ignore the layer 3 flow information when a packet is fragmented. This means that on occasion a fragmented packet may end up on a different queue than a non-fragmented packet in the same flow. More on this problem here.

    Snabb’s two Intel drivers

    The existing driver used in most Snabb programs (apps.intel.intel_app) worked well and was mature but was missing support for RSS.

An alternate driver (apps.intel_mp.intel_mp) made by Peter Bristow supported RSS, but wasn't entirely compatible with the features provided by the main Intel driver. We worked on extending intel_mp to work as a more-or-less drop-in replacement for intel_app.

The incompatibility between the two drivers was caused mainly by the lack of support for VMDq (Virtual Machine Device Queues) in intel_mp. This is another multiple-queue feature on Intel NICs, used to let a NIC present itself as multiple virtualized sets of queues. It's often used to host VMs in a virtualized environment, but can also be used (as in Snabb) for serving logically separate apps.

The basic idea is that queues may be grouped into separate pools, each assigned to a VM or app with its own MAC address. A host can use this to run logically separate network functions sharing a single NIC. As with RSS, services running on separate cores can service the queues in parallel.

VMDq diagram
    A diagram showing how VMDq affects queue selection

    As the diagram above shows, adding VMDq changes queue selection slightly from the RSS case above. An appropriate pool is selected based on criteria such as the MAC address (or VLAN tag, and so on) and then RSS may be used.

    BTW, VMDq is not the only virtualization feature on these NICs. There is also SR-IOV or “Single Root I/O Virtualization” which is designed to provide a virtualized NIC for every VM that directly uses the NIC hardware resources. My understanding is that Snabb doesn’t use it for now because we can implement more switching flexibility in software.

    The intel_app driver supports VMDq but not RSS and the opposite situation is true for intel_mp. It turns out that both features can be used simultaneously, in which case packets are first sorted by MAC address and then by flow hashing for RSS. Basically each VMDq pool has its own set of RSS queues.

We implemented this support in the intel_mp driver and made the driver interface mostly compatible with intel_app, so that only minimal modifications are necessary to switch over. In the process, we made bug fixes and performance improvements in the driver to ensure that performance and reliability are comparable to intel_app.

    The development process was made a lot easier due to the existence of the intel_app code that we could copy and follow in many cases.

    The tricky parts were making sure that the NIC state was set correctly when multiple processes were using the NIC. In particular, intel_app can rely on tracking VMDq state inside a single Lua process.

    For intel_mp, it is necessary to use locking and IPC (via shared memory) to coordinate between different Lua processes that are setting driver state. In particular, the driver needs to be careful to be aware of what resources (VMDq pool numbers, MAC address registers, etc.) are available for use.
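
To give a feel for the kind of bookkeeping involved, here is a schematic sketch; the shared struct and the 64-pool limit (as on 82599-class NICs) are stand-ins for illustration, not the actual intel_mp data structures:

local ffi = require("ffi")

-- Hypothetical shared bookkeeping mapped by every Snabb process driving
-- the same NIC; the real driver keeps equivalent state in shared memory
-- and guards it with an inter-process lock (both elided here).
ffi.cdef[[
   struct nic_shared_state {
      uint8_t pool_in_use[64];   /* one flag per VMDq pool number */
   };
]]

-- Claim the first free VMDq pool. This must run while holding the
-- driver's lock so two processes cannot pick the same pool.
local function claim_vmdq_pool (state)
   for pool = 0, 63 do
      if state.pool_in_use[pool] == 0 then
         state.pool_in_use[pool] = 1
         return pool
      end
   end
   return nil   -- every pool is already taken
end

-- Usage sketch: local state = ffi.new("struct nic_shared_state")
--               local pool  = claim_vmdq_pool(state)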

    Current status

The driver improvements are merged upstream in intel_mp, which is now the default driver and is available in the Snabb 2017.11 “Endive” release. It's still possible to opt out and use the old driver in case there are any problems with intel_mp. And of course we appreciate any bug reports or feedback.

    by Asumu Takikawa at January 16, 2018 12:24 AM

    January 12, 2018

    Diego Pino

    More practical Snabb

Some time ago, in a Hacker News thread a user proposed the following use case for Snabb:

    I have a ChromeCast on my home network, but I want sandbox/log its traffic. I would want to write some logic to ignore video data, because that’s big. But I want to see the metadata and which servers it’s talking to. I want to see when it’s auto-updating itself with new binaries and record them.

    Is that a good use case for Snabb Switch, or is there is an easier way to accomplish what I want?

    I decided to take this request and implement it as a tutorial. Hopefully, the resulting tutorial can be a valuable piece of information highlighting some of Snabb’s strengths:

    • Fine-grained control of the data-plane.
    • Wide variety of solid libraries for protocol parsing.
    • Rapid prototyping.

    Limiting the project’s scope

Before diving into this project, I broke it down into smaller pieces and checked how much of it was already supported in Snabb. To fully implement this project I'd need:

1. Discover Chromecast devices.
2. Identify their network flows.
3. Save the data to disk.

Snabb provides libraries to identify network flows as well as to capture packets and filter them by content. That pretty much covers items 2) and 3). However, Snabb doesn't provide any tool or library to fully support item 1). Thus, I'm going to limit the scope of this tutorial to that single feature: discovering Chromecast and similar devices in a local network.

    Multicast DNS

A quick look at Chromecast's Wikipedia article reveals that Chromecast devices rely on a protocol called Multicast DNS (mDNS).

    Multicast DNS is standardized as RFC6762. The origin of the protocol goes back to Apple’s Rendezvous, later rebranded as Bonjour. Bonjour is in fact the origin of the more generic concept known as Zeroconf. Zeroconf’s goal is to automatically create usable TCP/IP computer networks when computers or network peripherals are interconnected. It is composed of three main elements:

    • Addressing: Self-Assigned Link-Local Addressing (RFC2462 and RFC3927). Automatically assigned addresses in the 169.254.0.0/16 network space.
    • Naming: Multicast DNS (RFC6762). Host name resolution.
    • Browsing: DNS Service Discovery (RFC6763). The ability of discovering devices and services in a local network.

Multicast DNS and DNS-SD are very similar and are often mixed up, although they are not strictly the same thing. The former describes how to do name resolution in a serverless DNS network, while DNS-SD, although a protocol as well, is a specific use of Multicast DNS.

One of the nicest things about Multicast DNS is that it reuses many of the concepts of DNS. This allowed mDNS to spread quickly and gain fast adoption, since existing software only required minimal changes. What's more, programmers didn't need to learn new APIs or study a completely brand-new protocol.

    Today Multicast DNS is featured in a myriad of small devices, ranging from Google Chromecast to Amazon’s FireTV or Philips Hue lights, as well as software such as Apple’s Bonjour or Spotify.

    This tutorial is going to focus pretty much on mDNS/DNS-SD. Since Multicast DNS reuses many of the ideas of DNS, I am going to review DNS first. Feel free to skip the next section if you are already familiar with DNS.

    DNS basis

    The most common use case of DNS is resolving host names to IP addresses:

    $ dig igalia.com -t A +short
    91.117.99.155

    In the command above, flag ‘-t A’ means an Address record. There are actually many different types of DNS records. The most common ones are:

• A (Address record). Used to map hostnames to IPv4 addresses.
• AAAA (IPv6 address record). Used to map hostnames to IPv6 addresses.
• PTR (Pointer record). Used for reverse DNS lookups, that is, resolving IP addresses to hostnames.
• SOA (Start of authority). DNS can be seen as a distributed database organized in a hierarchical layout of subdomains. A DNS zone is a contiguous portion of the domain space for which a server is responsible. The top-level DNS zone is known as the DNS root zone, which consists of 13 logical root name servers (although there are more than 13 instances) that contain the top-level domains, generic top-level domains (.com, .net, etc.) and country code top-level domains. The command below prints out how the domain www.google.com gets resolved (I trimmed down the output for the sake of clarity).
    $ dig @8.8.8.8 www.google.com +trace
    
    ; <<>> DiG 9.10.3-P4-Ubuntu <<>> @8.8.8.8 www.google.com +trace
    ; (1 server found)
    ;; global options: +cmd
    .                       181853  IN      NS      k.root-servers.net.
    .                       181853  IN      NS      g.root-servers.net.
    .                       181853  IN      NS      j.root-servers.net.
    .                       181853  IN      RRSIG   NS 8 0 518400 20180117170000 20180104160000 41824 ....
    ;; Received 525 bytes from 8.8.8.8#53(8.8.8.8) in 48 ms
    
    com.                    172800  IN      NS      j.gtld-servers.net.
    com.                    172800  IN      NS      k.gtld-servers.net.
    com.                    172800  IN      NS      l.gtld-servers.net.
    com.                    86400   IN      RRSIG   DS 8 1 86400 20180118170000 20180105160000 41824 ...
    ;; Received 1174 bytes from 199.7.83.42#53(l.root-servers.net) in 44 ms
    
    google.com.             172800  IN      NS      ns2.google.com.
    google.com.             172800  IN      NS      ns1.google.com.
    google.com.             172800  IN      NS      ns3.google.com.
    google.com.             172800  IN      NS      ns4.google.com.
    
    ;; Received 664 bytes from 192.26.92.30#53(c.gtld-servers.net) in 44 ms
    
    www.google.com.         300     IN      A       216.58.201.132
    ;; Received 48 bytes from 216.239.32.10#53(ns1.google.com) in 48 ms

The domain name is resolved in parts. First the root zone is consulted, which returns a list of root name servers. The root server l.root-servers.net is then consulted to resolve .com, which returns a list of generic top-level domain name servers. Name server c.gtld-servers.net is picked and returns another list of name servers for google.com. Finally www.google.com gets resolved by ns1.google.com, which returns the A record containing the domain name's IPv4 address.

Using DNS it is also possible to resolve an IP address to a domain name.

    $ dig -x 8.8.4.4 +short
    google-public-dns-b.google.com.

In this case, the record type is PTR. The command above is equivalent to:

    $ dig 4.4.8.8.in-addr.arpa -t PTR +short
    google-public-dns-b.google.com.

When using PTR records for reverse lookups, the target IPv4 address has to be part of the domain in-addr.arpa. This is a special domain registered under the top-level domain arpa and it's used for reverse IPv4 lookups. Reverse lookup is the most common use of PTR records, but in fact PTR records are just pointers to a canonical name and other uses are possible, as we will see later.

    Summarizing:

• DNS resolves a host name to an IP address. Other types of record resolution are possible.
    • DNS is a centralized protocol where DNS servers respond to DNS queries.
• DNS names are grouped in zones or domains, forming a hierarchical structure. Each zone's authoritative server is responsible for name resolution within its area.

    DNS Service Discovery

Unlike DNS, Multicast DNS doesn't require a central server. Instead, devices listen on port 5353 for DNS queries sent to a multicast address. In the case of IPv4, this destination address is 224.0.0.251. In addition, the destination Ethernet address of an mDNS request must be 01:00:5E:00:00:FB.

The Multicast DNS standard defines the domain name local as a pseudo-TLD (top-level domain) under which hosts and services can register. For instance, a laptop computer might answer to the name mylaptop.local (replace mylaptop with your actual laptop's name).

    $ dig @224.0.0.251 -p 5353 mylaptop.local. +short
    192.168.0.13

To discover all the services and devices in a local network, DNS-SD sends a PTR Multicast DNS request asking for the domain name _services._dns-sd._udp.local.

    $ dig @224.0.0.251 -p 5353 -t PTR _services._dns-sd._udp.local

    The expected result should be a set of PTR records announcing their domain name. In my case the dig command doesn’t print out any PTR records, but using tcpdump I can check I’m in fact receiving mDNS responses:

    $ sudo tcpdump "port 5353" -t -qns 0 -e -i wlp3s0
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on wlp3s0, link-type EN10MB (Ethernet), capture size 262144 bytes
    44:85:00:4f:b8:fc > 01:00:5e:00:00:fb, IPv4, length 99: 192.168.86.30.58722 > 224.0.0.251.5353: UDP, length 57
    54:60:09:fc:d6:04 > 01:00:5e:00:00:fb, IPv4, length 82: 192.168.86.57.5353 > 224.0.0.251.5353: UDP, length 40
    54:60:09:fc:d6:04 > 01:00:5e:00:00:fb, IPv4, length 299: 192.168.86.57.5353 > 224.0.0.251.5353: UDP, length 257
    54:60:09:fc:d6:04 > 01:00:5e:00:00:fb, IPv4, length 119: 192.168.86.57.5353 > 224.0.0.251.5353: UDP, length 77
    f4:f5:d8:d3:de:dc > 01:00:5e:00:00:fb, IPv4, length 299: 192.168.86.61.5353 > 224.0.0.251.5353: UDP, length 257
    f4:f5:d8:d3:de:dc > 01:00:5e:00:00:fb, IPv4, length 186: 192.168.86.61.5353 > 224.0.0.251.5353: UDP, length 144

    Why dig doesn’t print out the PTR records is still a mystery to me. So instead of dig I used Avahi, the free software implementation of mDNS/DNS-SD, to browse the available devices:

    $ avahi-browse -a
    + wlp3s0 IPv4 dcad2b6c-7a21-10c310-568b-ad83b4a3ea1e          _googlezone._tcp     local
    + wlp3s0 IPv4 1ebe35f6-26f1-bc92-318c-9e35fdcbe11d          _googlezone._tcp     local
    + wlp3s0 IPv4 Google-Cast-Group-71010755f10ad16b10c231437a5e543d1dc3 _googlecast._tcp     local
    + wlp3s0 IPv4 Chromecast-Audio-fd7d2b9d29c92b24db10be10661010eebb9f _googlecast._tcp     local
    + wlp3s0 IPv4 Google-Home-d81d02e1e48a1f0b7d2cbac88f2df820  _googlecast._tcp     local
    + wlp3s0 IPv4 dcad2b6c7a2110c310-0                            _spotify-connect._tcp local

    Each row identifies a service instance name. The structure of a service instance name is the following:

    Service Instance Name = <Instance> . <Service> . <Domain>
    

    For example, consider the following record “_spotify-connect._tcp.local”:

    • Domain: local. The pseudo-TLD used by Multicast DNS.
• Service: spotify-connect._tcp. The service name consists of a pair of DNS labels. The first label identifies what the service does (_spotify-connect is a service that allows a user to move Spotify playback from a phone to a desktop computer, and vice versa). The second label identifies what protocol the service uses, in this case TCP.
• Instance: dcad2b6c7a2110c310-0. A user-friendly name that identifies the service.

Besides a PTR record, an instance also replies with several additional DNS records that might be useful for the requester. These extra records accompany the PTR record, embedded in the DNS additional records field, and are of 3 types (a small parsing sketch follows the list):

    • SRV: Gives the target host and port where the service instance can be reached.
    • TXT: Gives additional information about this instance, in a structured form using key/value pairs.
    • A: IPv4 address of the reached instance.
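
To make the wire format concrete, here is a minimal LuaJIT FFI sketch of the fixed fields that follow the (variable-length, possibly compressed) name in every DNS resource record; the struct and function names are my own and assume a little-endian host, so they do not necessarily match the DNS helper class in the branch:

local ffi = require("ffi")
local bit = require("bit")

-- Fixed portion of a DNS resource record, located right after the
-- encoded name. All fields are big-endian on the wire (RFC 1035).
ffi.cdef[[
   struct dns_rr_fixed {
      uint16_t type;      /* 1 = A, 12 = PTR, 16 = TXT, 33 = SRV */
      uint16_t class;     /* usually 1 (IN); mDNS reuses the top bit as a cache-flush flag */
      uint32_t ttl;
      uint16_t rdlength;  /* number of bytes of record data that follow */
   } __attribute__((packed));
]]

-- Byte-swap a 16-bit value read on a little-endian host.
local function ntohs (v)
   return bit.bor(bit.lshift(bit.band(v, 0xff), 8), bit.rshift(v, 8))
end

-- ptr points just past the record's name within the DNS payload.
local function parse_rr_fixed (ptr)
   local rr = ffi.cast("struct dns_rr_fixed *", ptr)
   return ntohs(rr.type), ntohs(rr.rdlength)
end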

    Snabb’s DNS-SD

Now that we have a fair understanding of Multicast DNS and DNS-SD, we can start coding the app in Snabb. Like in the previous posts, I decided not to paste the code directly here; instead I've pushed the code to a remote branch and will comment on the most relevant parts. To check out this repo, do:

$ git clone https://github.com/snabbco/snabb.git
$ cd snabb
$ git remote add dpino https://github.com/dpino/snabb.git
$ git fetch dpino
$ git checkout dns-sd
    

    Highlights:

    • The app needs to send a DNS-SD packet through a network interface managed by the OS. I used Snabb’s RawSocket app to do that.
• The DNSSD app emits one DNS-SD request every second. This is done in DNSSD's pull method. There's a helper class called mDNSQuery that is in charge of composing this request.
• The DNSSD app receives responses in its push method. If a response is a Multicast DNS packet, it prints out all the contained DNS records to stdout (a minimal app skeleton is sketched after this list).
• A Multicast DNS packet is composed of a header and a body. The header contains control information such as the number of queries, answers, additional records, etc. The body contains DNS records. If the mDNS packet is a response packet, these are the DNS records we need to print out.
• To help me handle Multicast DNS packets I created an MDNS helper class. Similarly, I added a DNS helper class that helps me parse the necessary DNS records: PTR, SRV, TXT and A records.
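
For orientation, here is a minimal sketch of the shape such a Snabb app takes; the link names (tx/rx), the query field and the parsing hooks are placeholders of mine, so check the dns-sd branch for the real implementation:

local link   = require("core.link")
local packet = require("core.packet")

DNSSD = {}
DNSSD.__index = DNSSD

function DNSSD.new (conf)
   -- conf.query would hold a pre-built mDNS query frame (placeholder name)
   return setmetatable({query = conf.query}, DNSSD)
end

-- pull: inject packets into the app network; the real app throttles
-- this to one request per second.
function DNSSD:pull ()
   link.transmit(self.output.tx, packet.from_string(self.query))
end

-- push: consume packets coming back from the RawSocket app and print
-- any mDNS records found (parsing elided).
function DNSSD:push ()
   local rx = self.input.rx
   while not link.empty(rx) do
      local p = link.receive(rx)
      -- if is_mdns_response(p) then print_records(p) end   -- placeholders
      packet.free(p)
   end
end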

    Here is Snabb’s dns-sd command in use:

    $ sudo ./snabb dnssd --interface wlp3s0
    PTR: (name: _services._dns-sd._udp.local; domain-name: _spotify-connect._tcp )
    SRV: (target: dcad2b6c7a2110c310-0)
    TXT: (CPath=/zc/0;VERSION=1.0;Stack=SP;)
    Address: 192.168.86.55
    PTR: (name: _googlecast._tcp.local; domain-name: Chromecast-Audio-fd7d2b9d29c92b24db10be10661010eebb9f)
    SRV: (target: 1ebe35f6-26f1-bc92-318c-9e35fdcbe11d)
    TXT: (id=fd7d2b9d29c92b24db10be10661010eebb9f;cd=224708C2E61AED24676383796588FF7E;
    rm=8F2EE2757C6626CC;ve=05;md=Chromecast Audio;ic=/setup/icon.png;fn=Jukebox;
    ca=2052;st=0;bs=FA8FCA9E3FC2;nf=1;rs=;)
    Address: 192.168.86.57

Finally, I'd like to share some tricks and practices I used when coding the app:

1) I started small by capturing a DNS-SD request emitted by Avahi. Then I sent that very same packet from Snabb and checked that the response was a Multicast DNS packet:

    $ avahi-browse -a
    $ sudo tcpdump -i wlp3s0 -w mdns.pcap

Then open mdns.pcap with Wireshark, mark the request packet only and save it to disk. Next, use the od command to dump the packet as text:

    $ od -j 40 -A x -tx1 mdns_request.pcap
    000028 01 00 5e 00 00 fb 44 85 00 4f b8 fc 08 00 45 00
    000038 00 55 32 7c 00 00 01 11 8f 5a c0 a8 56 1e e0 00
    000048 00 fb e3 53 14 e9 00 41 89 9d 25 85 01 20 00 01
    000058 00 00 00 00 00 01 09 5f 73 65 72 76 69 63 65 73
    000068 07 5f 64 6e 73 2d 73 64 04 5f 75 64 70 05 6c 6f
    000078 63 61 6c 00 00 0c 00 01 00 00 29 10 00 00 00 00
    000088 00 00 00

This dumped packet can be copied raw into Snabb, as in MDNS's selftest (a small sketch of the idea follows).
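
Here is a sketch of that idea, assuming LuaJIT's \xXX string escapes and core.packet's from_string helper; only the first line of the od dump above is shown, the remaining bytes are elided:

local packet = require("core.packet")

-- First 16 bytes copied from the od dump above; the rest of the frame
-- is elided here for brevity.
local mdns_request =
   "\x01\x00\x5e\x00\x00\xfb\x44\x85\x00\x4f\xb8\xfc\x08\x00\x45\x00"
   -- .. remaining bytes of the dumped frame ..

-- In a selftest the canned frame can then be fed straight to the parser:
local p = packet.from_string(mdns_request)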

NOTE: the text2pcap command can also be a very convenient tool to convert a dumped packet in text format to a pcap file.

2) Instead of sending requests on the wire to obtain responses, I saved a bunch of responses to a .pcap file and used the file as input for the DNS parser. In fact the command supports a --pcap flag that can be used to print out DNS records.

    $ sudo ./snabb dnssd --pcap /home/dpino/avahi-browse.pcap
    Reading from file: /home/dpino/avahi-browse.pcap
    PTR: (name: _services._dns-sd._udp.local; domain-name: _spotify-connect._tcp)
    PTR: (name: ; domain-name: dcad2b6c7a2110c310-0)
    SRV: (target: dcad2b6c7a2110c310-0)
    TXT: (CPath=/zc/0;VERSION=1.0;Stack=SP;)
    Address: 192.168.86.55
...

3) When sending a packet to the wire, check that the packet's header checksums are correct. Wireshark has a mode to verify whether a packet's header checksums are correct or not, which is very convenient. Snabb provides protocol libraries to calculate IP, TCP or UDP checksums (a toy version of the Internet checksum is sketched below). Check out how mDNSQuery does it.
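
For reference, here is a toy Lua version of the Internet checksum (RFC 1071) that those IP and UDP checksums are based on; real Snabb code works over FFI pointers and is considerably faster, so this only shows the algorithm:

local bit = require("bit")
local band, bnot, rshift = bit.band, bit.bnot, bit.rshift

-- One's-complement sum of 16-bit big-endian words, as used by the IPv4
-- header, UDP and TCP checksums. Pass the checksum field as zero when
-- computing the value to store in the packet.
local function internet_checksum (data)
   local sum = 0
   for i = 1, #data - 1, 2 do
      sum = sum + data:byte(i) * 256 + data:byte(i + 1)
   end
   if #data % 2 == 1 then                  -- odd trailing byte, padded with zero
      sum = sum + data:byte(#data) * 256
   end
   while sum > 0xffff do                   -- fold the carries back in
      sum = band(sum, 0xffff) + rshift(sum, 16)
   end
   return band(bnot(sum), 0xffff)
end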

    Last thoughts

Implementing this tool has helped me understand DNS better, especially the Multicast DNS/DNS-SD part. I never expected it could be so interesting.

Going from an idea to a working prototype with Snabb is really fast. It's one of the advantages of user-space networking and one of the things I enjoy the most. That said, the resulting code has been bigger than I initially expected. To avoid losing this work, I will try to land the DNS and mDNS libraries into Snabb.

This post puts an end to this series of practical Snabb posts. I hope you found them as interesting as I enjoyed writing them. Hopefully these posts will be useful for anyone interested in user-space networking who wants to try out Snabb.

    January 12, 2018 06:00 AM

    January 11, 2018

    Frédéric Wang

    Review of Igalia's Web Platform activities (H2 2017)

Last September, I published a first blog post to let people know a bit more about Igalia's activities around the Web platform, with a plan to repeat such a review each semester. The present blog post focuses on the activity of the second semester of 2017.

    Accessibility

As part of Igalia's commitment to diversity and inclusion, we continue our effort to standardize and implement accessibility technologies. More specifically, Igalian Joanmarie Diggs continues to serve as chair of the W3C's ARIA working group and as an editor of Accessible Rich Internet Applications (WAI-ARIA) 1.1, Core Accessibility API Mappings 1.1, Digital Publishing WAI-ARIA Module 1.0 and Digital Publishing Accessibility API Mappings 1.0, all of which became W3C Recommendations in December! Work on versions 1.2 of ARIA and the Core AAM will begin in January. Stay tuned for the First Public Working Drafts.

    We also contributed patches to fix several issues in the ARIA implementations of WebKit and Gecko and implemented support for the new DPub ARIA roles. We expect to continue this collaboration with Apple and Mozilla next year as well as to resume more active maintenance of Orca, the screen reader used to access graphical desktop environments in GNU/Linux.

Last but not least, progress continues on switching to Web Platform Tests for ARIA and “Accessibility API Mappings” tests. This task is challenging because, unlike other aspects of the Web Platform, testing accessibility mappings cannot be done by solely examining what is rendered by the user agent. Instead, an additional tool, an “Accessible Technology Test Adapter” (ATTA), must also be run. ATTAs work in a similar fashion to assistive technologies such as screen readers, using the implemented platform accessibility API to query information about elements and reporting what it obtains back to WPT, which in turn determines if a test passed or failed. As a result, the tests are currently officially manual while the platform ATTAs continue to be developed and refined. We hope to make sufficient progress during 2018 that ATTA integration into WPT can begin.

    CSS

    This semester, we were glad to receive Bloomberg’s support again to pursue our activities around CSS. After a long commitment to CSS and a lot of feedback to Editors, several of our members finally joined the Working Group! Incidentally and as mentioned in a previous blog post, during the CSS Working Group face-to-face meeting in Paris we got the opportunity to answer Microsoft’s questions regarding The Story of CSS Grid, from Its Creators (see also the video). You might want to take a look at our own videos for CSS Grid Layout, regarding alignment and placement and easy design.

    On the development side, we maintained and fixed bugs in Grid Layout implementation for Blink and WebKit. We also implemented alignment of positioned items in Blink and WebKit. We have several improvements and bug fixes for editing/selection from Bloomberg’s downstream branch that we’ve already upstreamed or plan to upstream. Finally, it’s worth mentioning that the work done on display: contents by our former coding experience student Emilio Cobos was taken over and completed by antiik (for WebKit) and rune (for Blink) and is now enabled by default! We plan to pursue these developments next year and have various ideas. One of them is improving the way grids are stored in memory to allow huge grids (e.g. spreadsheet).

    Web Platform Predictability

One of the areas where we would like to increase our activity is Web Platform Predictability. This is obviously essential for our users but is also instrumental for a company like Igalia making developments on all the open source JavaScript and Web engines, to ensure that our work is implemented consistently across all platforms. This semester, we were able to put more effort on this thanks to financial support from Bloomberg and Google AMP.

We have implemented more frame sandboxing attributes in WebKit to improve user safety and make control of sandboxed documents more flexible. We improved the sandboxed navigation browser context flag and implemented the new allow-popup-to-escape-sandbox, allow-top-navigation-without-user-activation, and allow-modals values for the sandbox attribute.

Currently, HTML frame scrolling is not implemented in WebKit/iOS. As a consequence, one has to use the non-standard -webkit-overflow-scrolling: touch property on overflow nodes to emulate scrollable elements. In parallel to the progress toward using more standard HTML frame scrolling, we have also worked on annoying issues related to overflow nodes, including flickering/jittering of “position: fixed” nodes, a broken Find UI, and a regression causing content to disappear.

    Another important task as part of our CSS effort was to address compatibility issues between the different browsers. For example we fixed editing bugs related to HTML List items: WebKit’s Bug 174593/Chromium’s Issue 744936 and WebKit’s Bug 173148/Chromium’s Issue 731621. Inconsistencies in web engines regarding selection with floats have also been detected and we submitted the first patches for WebKit and Blink. Finally, we are currently improving line-breaking behavior in Blink and WebKit, which implies the implementation of new CSS values and properties defined in the last draft of the CSS Text 3 specification.

We expect to continue this effort on Web Platform Predictability next year and we are discussing more ideas, e.g. WebPackage or flexbox compatibility issues. For sure, Web Platform Tests are an important aspect to ensure cross-platform interoperability and we would like to help improve synchronization with the conformance tests of browser repositories. This includes the accessibility tests mentioned above.

    MathML

Last November, we launched a fundraising campaign to implement MathML in Chromium and presented it during the Frankfurt Book Fair and TPAC. We have gotten very positive feedback so far, with encouragement from people excited about this project. We strongly believe a native MathML implementation in browsers will have a huge impact on STEM education across the globe, and all the incumbent industries will benefit from the technology. As pointed out by Rick Byers, the web platform is a commons and we believe that a more collective commitment and contribution are essential for making this world a better place!

    While waiting for progress on Chromium’s side, we have provided minimal maintenance for MathML in WebKit:

    • We fixed all the debug ASSERTs reported on Bugzilla.
    • We did follow-up code clean up and refactoring.
    • We imported Web Platform tests in WebKit.
    • We performed review of MathML patches.

    Regarding the last point, we would like to thank Minsheng Liu, a new volunteer who has started to contribute patches to WebKit to fix issues with MathML operators. He is willing to continue to work on MathML development in 2018 as well so stay tuned for more improvements!

    Javascript

    During the second semester of 2017, we worked on the design, standardization and implementation of several JavaScript features thanks to sponsorship from Bloomberg and Mozilla.

    One of the new features we focused on recently is BigInt. We are working on an implementation of BigInt in SpiderMonkey, which is currently feature-complete but requires more optimization and cleanup. We wrote corresponding test262 conformance tests, which are mostly complete and upstreamed. Next semester, we intend to finish that work while our coding experience student Caio Lima continues work on a BigInt implementation on JSC, which has already started to land. Google also decided to implement that feature in V8 based on the specification we wrote. The BigInt specification that we wrote reached Stage 3 of TC39 standardization. We plan to keep working on these two BigInt implementations, making specification tweaks as needed, with an aim towards reaching Stage 4 at TC39 for the BigInt proposal in 2018.

    Igalia is also proposing class fields and private methods for JavaScript. Similarly to BigInt, we were able to move them to Stage 3 and we are working to move them to stage 4. Our plan is to write test262 tests for private methods and work on an implementation in a JavaScript engine early next year.

    Igalia implemented and shipped async iterators and generators in Chrome 63, providing a convenient syntax for exposing and using asynchronous data streams, e.g., HTML streams. Additionally, we shipped a major performance optimization for Promises and async functions in V8.

    We implemented and shipped two internationalization features in Chrome, Intl.PluralRules and Intl.NumberFormat.prototype.formatToParts. To push the specifications of internationalization features forwards, we have been editing various other internationalization-related specifications such as Intl.RelativeTimeFormat, Intl.Locale and Intl.ListFormat; we also convened and led the first of what will be a monthly meeting of internationalization experts to propose and refine further API details.

    Finally, Igalia has also been formalizing WebAssembly’s JavaScript API specification, which reached the W3C first public working draft stage, and plans to go on to improve testing of that specification as the next step once further editorial issues are fixed.

    Miscellaneous

    Thanks to sponsorship from Mozilla we have continued our involvement in the Quantum Render project with the goal of using Servo’s WebRender in Firefox.

Support from Metrological has also given us the opportunity to implement more web standards in some Linux ports of WebKit (GTK and WPE), including:

    • WebRTC
    • WebM
    • WebVR
    • Web Crypto
    • Web Driver
    • WebP animations support
    • HTML interactive form validation
    • MSE

    Conclusion

    Thanks for reading and we look forward to more work on the web platform in 2018. Onwards and upwards!

    January 11, 2018 11:00 PM

    Gyuyoung Kim

    Share my experience to build Chromium with ICECC and Jumbo

If you're a Chromium developer, I guess that you've suffered from the long build time of Chromium like me. Recently, I've set up an icecc build environment for Chromium in the Igalia Korea office. Although there have been some instructions on how to build Chromium with ICECC, I think they might feel a bit difficult to follow. So, I'd like to share my experience of setting up an environment to build Chromium with icecc and Clang on Linux, in order to speed up the build of Chromium.

P.S. Recently Google announced that they will open Goma (Google's internal distributed build system) to everyone. Let's see whether Goma will make icecc unnecessary 😉
    https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!msg/chromium-dev/q7hSGr_JNzg/p44IkGhDDgAJ

    Prerequisites in your environment

1. First, we should install icecc on all the machines.
      sudo apt-get install icecc ccache [icecc-monitor]
2. To build Chromium using icecc with Clang, we have to use some configurations when generating the gn files. To make this as easy as possible, I made a script. Just download it and add it to $PATH.
      1. Clone the script project
        $ git clone https://github.com/Gyuyoung/ChromiumBuild.git

        FYI, you can see the detailed configuration here – https://github.com/Gyuyoung/ChromiumBuild/blob/master/buildChromiumICECC.sh

      2. Add your chromium/src path to buildChromiumICECC.sh
        # Please set your path to ICECC_VERSION and CHROMIUM_SRC.
        export CHROMIUM_SRC=$HOME/chromium/src
        export ICECC_VERSION=$HOME/chromium/clang.tar.gz
      3. Register it to PATH environment in .bashrc
        export PATH=/path/to/ChromiumBuild:$PATH
      4. Create a clang toolchain from the patched Chromium version
I added this step to the “sync” argument of buildChromiumICECC.sh, so just execute the command below before starting to compile. Whenever you run it, clang.tar.gz will be updated with the latest Chromium version.
    $ buildChromiumICECC.sh sync

    Build

    1. Run an icecc scheduler on a master machine,
      sudo service icecc-scheduler start
    2. Then, run an icecc daemon on each slave machine
      sudo service iceccd start

      If you run icemon, you can monitor the build status on the icecc.

    3. Start building in chromium/src
      $ buildChromiumICECC.sh Debug|Release

      When you start building Chromium, you’ll see that the icecc works on the monitor!

    Build Time

In my case, I've been using 1 laptop and 2 desktops for Chromium builds. The HW information is as below:

    • Laptop (Dell XPS 15″ 9560)
      1. CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
      2. RAM: 16G
      3. SSD
    • Desktop 1
      1. CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
      2. RAM: 16G
      3. SSD
    • Desktop 2
      1. CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
      2. RAM: 16G
      3. SSD

I've measured how long it takes to build Chromium with the script in the cases below:

    • Laptop
      • Build time on Release with the jumbo build: About 84 min
      • Build time on Release without the jumbo build: About 203 min
    • Laptop + Desktop 1 + Desktop 2
      • Build time on Release with the jumbo build: About 35 min
      • Build time on Release without the jumbo build: About 73 min

These builds haven't used any object caching via ccache yet; once ccache kicks in, the build time will be reduced further. Besides, the build time depends on the number and performance of the build nodes, so these figures can differ from your environment.

    Troubleshooting (Ongoing)

    1. Undeclared identifier
      1. Error message
        /home/gyuyoung/chromium/src/third_party/llvm-build/Release+Asserts
         /lib/clang/6.0.0/include/avx512vnniintrin.h:38:20:
         error: use of undeclared identifier '__builtin_ia32_vpdpbusd512_mask'
         return (__m512i) __builtin_ia32_vpdpbusd512_mask ((__v16si) __S, ^
      2. Solution
        Please check if the path of ICECC_VERSION was set correctly.
    2. Loading error libtinfo.so.5
      1. Error message
        usr/bin/clang: error while loading shared libraries: libtinfo.so.5: 
        failed to map segment from shared object
      2. Solution
No correct fix found yet; just restart the build for now.
    3. Out of Memory
      1. Error message
        LLVM ERROR: out of memory
      2. Solution
        • Add more physical RAM to the machine.
        • Alternatively, we can try to increase the space of swap.
          • For example, create a 4G swap file
            $ size="4G" && file_swap=/swapfile_$size.img && sudo touch $file_swap && sudo fallocate -l $size /$file_swap && sudo mkswap /$file_swap && sudo swapon -p 20 /$file_swap
            

          • Make the swap file permanent
# in your /etc/fstab file
            /swapfile    none    swap    sw,pri=10      0       0
            /swapfile_4G.img     none    swap    sw,pri=20      0       0
          • Check swap situation after reboot
            $ sudo swapon  -s
            Filename       Type     Size        Used    Priority
            /swapfile      file     262140      0       10
            /swapfile_4G.img       file     4194300     0       20

    Reference

    1. WebKitGTK SpeedUpBuild: https://trac.webkit.org/wiki/WebKitGTK/SpeedUpBuild
    2. compiling-chromium-with-clang-and-icecc : http://mkollaro.github.io/2015/05/08/compiling-chromium-with-clang-and-icecc/
    3. Rune Lillesveen’s icecc-chromium project:
      https://github.com/lilles/icecc-chromium
    4. How to increase swap space?
      https://askubuntu.com/questions/178712/how-to-increase-swap-space

    by gyuyoung at January 11, 2018 12:37 AM

    January 10, 2018

    Manuel Rego

    "display: contents" is coming

Yes, display: contents is enabled by default in Blink and WebKit and it will probably be shipped in Chrome 65 and Safari 11.1. These browsers will join Firefox, which has been shipping it since version 37, making Edge the only one missing the feature (you can vote for it!).

    Regarding this I’d like to highlight that the work to support it in Chromium was started by Emilio Cobos during his Igalia Coding Experience that took place from fall 2016 to summer 2017.

You might (or might not) remember a blog post from early 2016 where I was talking about the Igalia Coding Experience program and some ideas of tasks to be done as part of the Web Platform team. One of them was display: contents, which is finally happening.

    What is display: contents?

    This new value for the display property allows you to somehow remove an element from the box tree but still keep its contents. The proper definition from the spec:

    The element itself does not generate any boxes, but its children and pseudo-elements still generate boxes and text runs as normal. For the purposes of box generation and layout, the element must be treated as if it had been replaced in the element tree by its contents (including both its source-document children and its pseudo-elements, such as ::before and ::after pseudo-elements, which are generated before/after the element’s children as normal).

    A simple example will help to understand it properly:

    <div style="display: contents;
                background: magenta; border: solid thick black; padding: 20px;
                color: cyan; font: 30px/1 Monospace;">
      <span style="background: black;">foobar</span>
    </div>

display: contents makes the div not generate any box, so its background, border and padding are not rendered. However, inherited properties like color and font have effect on the child (the span element) as expected.

    For this example, the final result would be something like:

    <span style="background: black; color: cyan; font: 30px/1 Monospace;">foobar</span>

Unsupported vs actual (in your browser) vs supported output for the previous example

    If you want more details Rachel Andrew has a nice blog post about this topic.

    CSS Grid Layout & display: contents

As you could expect from a post from myself this is somehow related to CSS Grid Layout. 😎 display: contents can be used as a replacement for subgrids (which are not supported by any browser at this point) in some use cases. However, subgrids are still needed for other scenarios.

    The canonical example for Grid Layout auto-placement is a simple form like:

    <style>
      form   { display:     grid;   }
      label  { grid-column: 1;      }
      input  { grid-column: 2;      }
      button { grid-column: span 2; }
    </style>
    <form>
      <label>Name</label><input />
      <label>Mail</label><input />
      <button>Send</button>
    </form>

A simple form formatted with CSS Grid Layout

    However this is not the typical HTML of a form, as you usually want to use a list inside, so people using screen readers will know how many fields they have to fill in your form beforehand. So the HTML looks more like this:

    <form>
      <ul>
        <li><label>Name</label><input /></li>
        <li><label>Mail</label><input /></li>
        <li><button>Send</button></li>
      </ul>
    </form>

With display: contents you'll be able to have the same layout as in the first case with similar CSS:

    ul     { display: grid;       }
    li     { display: contents;   }
    label  { grid-column: 1;      }
    input  { grid-column: 2;      }
    button { grid-column: span 2; }

So this is really nice: now when you convert your website to start using CSS Grid Layout, you will need fewer changes to your HTML, and you won't need to remove markup that is really useful, like the list in the previous example.

    Chromium implementation

    As I said in the introduction, Firefox shipped display: contents three years ago, however Chromium didn’t have any implementation for it. Igalia as CSS Grid Layout implementor was interested in having support for the feature as it’s a handy solution for several Grid Layout use cases.

    The proposal for the Igalia Coding Experience was the implementation of display: contents on Blink as the main task. Emilio did an awesome job and managed to implement most of it, reporting issues to CSS Working Group and other browsers as needed, and writing tests for the web-platform-tests repository to ensure interoperability between the implementations.

Once the Coding Experience was over there were still a few missing things to be able to enable display: contents by default. Rune Lillesveen (Google, previously Opera), who was helping during the whole process with the reviews, finished the work and shipped it last week.

    WebKit implementation

    WebKit already had an initial support for display: contents that was only used internally by Shadow DOM implementation and not exposed to the end users, neither supported by the rest of the code.

We reactivated the work there too; Emilio didn't have time to finish the whole thing, but later Antti Koivisto (Apple) completed the work and enabled it by default on trunk in November 2017.

    Conclusions

Igalia is one of the top external contributors on the open web platform projects. This puts us in a position that allows us to implement new features in the different open source projects, thanks to our community involvement and internal knowledge after several years of experience in the field. Regarding the display: contents implementation, this feature probably wouldn't be available today in Chromium and WebKit without Igalia's support, in this particular case through a Coding Experience.

    We’re really happy about the results of the Coding Experience and we’re looking forward to repeat the success story in the future.

    Of course, all the merit goes to Emilio, who is an impressive engineer and did a great job during the Coding Experience. As part of this process he got committer privileges in both Chromium and WebKit projects. Kudos!

    Last, but not least, thanks to Antti and Rune for finishing the work and making display: contents available to WebKit and Chromium users.

    January 10, 2018 11:00 PM

    December 27, 2017

    Manuel Rego

    Web Engines Hackfest 2017

    One year more Igalia organized, hosted and sponsored a new edition of the Web Engines Hackfest. This is my usual post about the event focusing on the things I was working on during those days.

    Organization

This year I wanted to talk about this because being part of the organization of an event with 60 people is not that simple and takes some time. I'm one of the members of the organization, which meant that I was really busy during the 3 days of the event trying to make the life of all the attendees as easy as possible.

    Yeah, this year we were 60 people, the biggest number ever! Note that last year we were 40 people, so it’s clear the interest in the event is growing after each edition, which is really nice.

For the first time we had conference badges; with so many faces it was going to be really hard to put names to all of them. In addition we had pronoun stickers and a different lanyard color for people who didn't want to appear in the multimedia material published during and after the event. I believe all these things worked very well and we'll be repeating them for sure in future editions.

My Web Engines Hackfest 2017 conference badge

The survey after the event showed very positive feedback, so we're really glad that you enjoyed the hackfest. I was especially happy seeing how many parallel discussions were taking place all around the office in small groups of people.

    Talks

The main focus of the event is the different discussions that happen between people about many different topics. This, together with the possibility to hack with people from other browser engines and/or other companies, makes the hackfest a special event.

    On top of that, we arrange a few talks that are usually quite interesting. The talks from this year can be found on a YouTube playlist (thanks Juan for helping with the video editing). This year the talks covered the following topics: Web Platform Tests, zlib optimizations, Chromium on Wayland, BigInt, WebRTC and WebVR.

Some pictures of the Web Engines Hackfest 2017

    CSS Grid Layout

During the hackfest I participated in several breakout sessions, like for example one about Web Platform Tests or another one about MathML. However, as usual in the last few years, my main focus was related to CSS Grid Layout. In this case we took the chance to discuss several topics, of which I'm going to highlight two:

    Chromium bug on input elements which only happens on Mac

This is about Chromium bug #727076, where input elements that are grid items get collapsed/shrunk when the user starts to enter some text. This was a tricky issue, only reproducible on the Mac platform, that had been hitting us for a while, so we needed to find a solution.

We had some long discussions about this; in the end my colleague Javi found the root cause of the problem and finally fixed it, kudos!

This bug is quite complex and tied to some Chromium-specific implementation bits. The summary is that on the Mac platform you can get requests for the intrinsic (preferred) width of the grid container at random times, which means that you're not sure a full layout will be performed afterwards. Our code was not ready for that, as we were always expecting a full layout after asking for the intrinsic width.

    Percentage tracks and gutters

    This is a neverending topic that has been discussed in the CSS WG for a long time. There are 2 different things so let’s go one by one.

    First, how percentage tracks are resolved when the width/height of a grid container is indefinite. So far, this was not symmetric but was working like in regular blocks:

    • Percentage columns work like percentage widths: This means that they are treated as auto during intrinsic size computation and later resolved during layout.
    • Percentage rows work like percentage heights: In this case they are treated as auto and never resolved.

However the CSS WG decided to change this and make both symmetric, so percentage rows will work like percentage columns. This hasn't been implemented by any browser yet, and all of them are interoperable with the previous status. We're not 100% sure about the complexity this could bring and, before changing the current behavior, we're going to gather some usage statistics to verify this won't break a lot of content out there. We'd also love to get feedback from other implementors about this. More information about this topic can be found on CSS WG issue #1921.

Now let's talk about the second issue, how percentage gaps work. The whole discussion can be checked on CSS WG issue #509, which I started back at TPAC 2016. For this case there is no interoperability between browsers: Firefox has its own solution of back-computing percentage gaps, while the rest of the browsers share the same behavior, in line with the percentage track resolution, but again it is not symmetric:

    • Percentage column gaps contribute zero to intrinsic width computation and are resolved as percent during layout.
    • Percentage row gaps are treated always as zero.

    The CSS WG resolved to modify this behavior to make them both symmetric, in this case choosing the row gaps behavior as reference. So browsers will need to change how column gaps work to avoid resolving the percentages during layout. We don’t know if we could detect this situation in Blink/WebKit without quite complex changes on the engine, and we’re waiting for feedback from other implementors on that regard.

    So I won’t say any of those topics are definitely closed yet, and it won’t be unexpected if some other changes happen in the future when the implementations try to catch up with the spec changes.

    Thanks

To close this blog post, let's say thanks to everyone that came to our office and participated in the event; the Web Engines Hackfest wouldn't be possible without such a great bunch of people who decided to spend a few days working together on improving the status of the Web.

Web Engines Hackfest 2017 sponsors: Collabora, Google, Igalia and Mozilla

    Of course we cannot forget about the sponsors either: Collabora, Google, Igalia and Mozilla. Thank you all very much!

    And last, but not least, thanks to Igalia for organizing and hosting one year more this event. Looking forward to the new year and the 2018 edition!

    December 27, 2017 11:00 PM