Planet Igalia

October 06, 2015

Manuel Rego

Grid Layout Coast to Coast

So here we are again, this time to announce that I’ll be presenting my talk CSS Grid Layout from the inside out at HTML5DevConf Autumn 2015 (19-20 October). I’m really excited about this new opportunity to show the world the wonders of the CSS Grid Layout specification, together with the “dark” stuff that happens behind the scenes. Thanks to the organization for counting me in. 😃

My talk announced at the HTML5DevConf website

During the talk I’ll give a short introduction to the grid layout syntax and features, including some of the most recent developments, like the much-awaited grid gutters (which will hopefully land very soon). Igalia has been working on the CSS Grid Layout implementation in Blink and WebKit since 2013 as part of a collaboration with Bloomberg. Right now we’re closer than ever to shipping the feature in Chromium!

Igalia and Bloomberg working together to build a better web

This’ll be my second grid layout talk this year, after the one at CSSConf in New York. It’ll be my first time at HTML5DevConf, and my first time in San Francisco too. It seems like a huge conference, full of great speakers and interesting talks. I really hope that people like my talk; in any case, I’m sure I’ll enjoy the conference and learn a lot about the latest cool stuff moving around the web platform. Please feel free to ping me if you want to talk about grid, new layout models, the web and/or Igalia.

Last but not least, my Igalian mate Martin Robinson, who is currently hacking on Servo, will be attending the conference too. Don’t hesitate to ask him any question about web browser engines; he knows them very well. 😉

October 06, 2015 10:00 PM

September 25, 2015

Samuel Iglesias

ARB_program_interface_query and shader_runner

Lately I have worked on the ARB_shader_storage_buffer_object extension implementation in Mesa, along with Iago (who has already explained how we implement it).

One of the features we implemented was adding support for buffer variables to the queries defined in ARB_program_interface_query. Those queries ended up being very useful to check another feature introduced by ARB_shader_storage_buffer_object: the new storage layout called std430.

I have developed basic tests for piglit, but Jordan Justen and Ilia Mirkin mentioned that Ian Romanick had worked on comprehensive testing for uniform buffer objects, presented at XDC 2014.

Ian developed stochastic, search-based testing for uniform block layouts, which generates shader_runner tests from a Mako template that defines a series of uniform blocks with different layouts. shader_runner then verifies information obtained through the glGetActiveUniforms*() queries, such as the variable type, its offset inside the block, its array size (if any), and whether it is row major, using the “active uniform” command.

That was the kind of testing that I was looking for, so I cloned his repository and developed a modified version for shader storage blocks and std430.

During that process, I found that shader_runner was lacking support for the queries defined in the ARB_program_interface_query extension. Yesterday, the patches adding it were pushed to the master branch of the piglit repository.

If you want to use this new command, the format is the following:

verify program_interface_query GL_INTERFACE_TYPE_ENUM var_name GL_PROPS_ENUM integer

or, if we include the GL type enum:

verify program_interface_query GL_INTERFACE_TYPE_ENUM var_name GL_PROPS_ENUM GL_TYPE_ENUM

I have written an example to show how to use this command in a shader_runner test. This is just a snippet showing the usage of the new command:

link success

verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_TYPE GL_FLOAT_VEC4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_TOP_LEVEL_ARRAY_SIZE 4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_OFFSET 128
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[0].s1_3.fv3[0] GL_OFFSET 48
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[0].m34_1 GL_TYPE GL_FLOAT_MAT3x4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].m34_1 GL_OFFSET 80
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].m34_1 GL_ARRAY_STRIDE 0
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.m43_1 GL_IS_ROW_MAJOR 1
verify program_interface_query GL_PROGRAM_INPUT piglit_vertex GL_TYPE GL_FLOAT_VEC4
verify program_interface_query GL_PROGRAM_OUTPUT piglit_fragcolor GL_TYPE GL_FLOAT_VEC4

I hope you find them useful for your tests!


by Samuel Iglesias at September 25, 2015 08:56 AM

September 23, 2015

Xabier Rodríguez Calvar

WebKit Contributors Meeting 2015 (late, I know)

After writing my last post I realized that I needed to write a bit more about what I had been doing at the WebKit Contributors Meeting.

First thing to say is that it happened in March at the Apple campus in Cupertino and I attended as part of the Igalia gang.

My goal when I went there was to discuss with Youenn Fablet the Streams API we are implementing, and see how we could bootstrap the reviews and begin to get the code reviewed and landed efficiently. Youenn and I also gave a presentation (mainly him) about it. At that moment we got some comments and help from Benjamin Poulain, and nowadays we are also working with Darin Adler and Geoffrey Garen, so the work is ongoing.

WebRTC was also a hot topic and we talked a bit about how to deal with promises, as they seem to be involved in the WebRTC standard as well. My Igalian partner Philippe was missed in this regard, as he is involved in the development of WebRTC in WebKit, but he unfortunately couldn’t make it for personal reasons.

I also had an interesting talk with Jer Noble and Eric Carlson about Media Source and Encrypted Media Extensions. I told them about the several downstream implementations that we are or were working on, especially the WebKit4Wayland one that we expect to begin to upstream soon. They commented that they still have doubts about the abstractions they made for them, and of course I promised to get back to them when we begin with the job. Actually, I already discussed some issues with Quique, another fellow Igalian.

Among the other interesting discussions, I found the migration of the Mac port to CMake very necessary. Actually, I am now experiencing the pain (ahem, benefits) of using Xcode to add files, especially the generated ones, to the compilation. I hope that Alex succeeds with the task and that soon we have a common build system for all the main ports.

by calvaris at September 23, 2015 04:36 PM

Joanmarie Diggs

New in Orca 3.18: Firefox Support Rewrite and MathML

For years I had been threatening to nuke Orca’s support for Firefox (it’s the only way to be sure) and start over. This release cycle I finally made good on my threat. :)

The funny thing is that what prompted me to do so was not Firefox; it was the need to implement an Orca-controlled caret for WebKitGtk content. Why an Orca-controlled caret for WebKitGtk content is needed — and why the resulting support is not yet being used for WebKitGtk content — is a rant topic for another day. But with that work largely done, and Orca’s support for Firefox in desperate need of a rewrite, the time was right.

The improvements include:

  1. Addition of speech support for MathML. (Nemeth is a high priority for me, but just getting something working had to happen first.)
  2. Significant performance improvements: The lags Orca’s Firefox support had become (in)famous for should be gone — at least the ones caused by Orca, which were quite a few. I have also added some defensive handling for the accessible-event floods Orca is sometimes on the receiving end of (e.g. due to unusually large numbers of DOM changes).
  3. Support for Google Docs applications: Just enable Google Docs’ accessibility support and Orca’s sticky focus mode, and you should be good to go.
  4. Improved support for other rich text editors like Etherpad.
  5. Fixes for Orca getting stuck in or skipping over content in browse mode.
  6. Fixes for Orca getting tripped up by auto-reloading/refreshing content.
  7. Fixes for several bugs in presentation of elements in object mode. (By the way, yes, Orca can present one object per line “like JAWS does.”)
  8. Improved accuracy in Orca’s label inference support (because there are still authors who fail to use label elements and/or ARIA to make their form fields accessible).
  9. Fixes for several bugs related to using structural navigation, fast forward, and rewind in SayAll. (Note: These “experimental” SayAll features still need to be enabled via your Orca customizations because GNOME’s GUI and string freezes snuck up on me. But once you’ve enabled them, they should work more reliably than before.)
  10. Improved presentation of dynamically-added content, such as items added to the bottom of a search results page in real time.
  11. Did I mention MathML and lag elimination? :)
  12. Updated documentation (yes, finally!) to reflect the addition of browse mode, focus mode, sticky focus mode, layout mode, and the like.

While I still have more improvements to make and bugs to fix, I’m quite pleased with how Orca 3.18.0 works with Firefox. I hope you will be too.

To give Orca 3.18.0 a try, clone it from GNOME git or check with your distro. Questions? Please ask on the Orca mailing list.

by jd at September 23, 2015 02:37 AM

September 21, 2015

Carlos García Campos

WebKitGTK+ 2.10

HTTP Disk Cache

WebKitGTK+ already had an HTTP disk cache implementation, simply using SoupCache, but Apple introduced a new cross-platform implementation to WebKit (just a few bits needed a platform-specific implementation), so we decided to switch to it. This new cache has a lot of advantages over the SoupCache approach:

  • It’s fully integrated in the WebKit loading process, sharing some logic with the memory cache too.
  • It’s more efficient in terms of speed (the cache is in the Network Process, but only the file descriptor is sent to the Web Process, which mmaps the file) and disk usage (resource body and headers are stored in separate files on disk, using hard links for the body so that different resources with exactly the same contents are only stored once).
  • It’s also more robust thanks to the lack of an index. The synchronization between the index and the actual contents has always been a headache in SoupCache, with many resources leaked on disk, resources that are cached twice, etc.

The new disk cache is only used by the Network Process, so when using the shared secondary process model SoupCache will still be used in the Web Process.

New inspector UI

The Web Inspector UI has been redesigned; you can see some of the differences in this screenshot:


For more details, see this post on the Safari blog.


IndexedDB

This was one of the few regressions we still had compared to WebKit1. When we switched to WebKit2 we lost IndexedDB support, but it’s now back in 2.10. It uses its own new process, the DatabaseProcess, to perform all database operations.


Performance

WebKitGTK+ 2.8 improved the overall performance thanks to the use of the bmalloc memory allocator. In 2.10 the overall performance has improved again, this time thanks to a new implementation of the locking primitives. All uses of mutex/condition have been replaced by a new implementation. You can see more details in the email Filip sent to webkit-dev or in the very detailed commit messages.

Screen Saver inhibitor

It’s more and more common to use the web browser to watch long videos in fullscreen mode, and quite annoying when the screen saver decides to “save” your screen every few minutes during the video. WebKitGTK+ 2.10 uses the ScreenSaver D-Bus service to inhibit the screen saver while a video is playing in fullscreen mode.

Font matching for strong aliases

WebKit’s font matching algorithm has improved, and now allows replacing fonts with metric-compatible equivalents. For example, sites that specify Arial will now get Liberation Sans, rather than your system’s default sans font (usually DejaVu). This makes text appear better on many pages, since some fonts require more space than others. The new algorithm is based on code from Skia that we expect will be used by Chrome in the future.

Improve image quality when using newer versions of cairo/pixman

The poor downscaling quality of cairo/pixman is a well-known issue that was finally fixed in Cairo 1.14; however, we were not taking advantage of it in WebKit even when using a recent enough version of Cairo. The reason is that we were using the CAIRO_FILTER_BILINEAR filter, which was not affected by the Cairo changes. So we switched to CAIRO_FILTER_GOOD, which will use the BILINEAR filter in previous versions of Cairo (keeping backwards compatibility) and a box filter for downscaling in newer versions. This drastically improves the quality of downscaled images with minimal impact on performance.
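As a rough illustration (not the actual WebKit patch), this is what selecting the filter looks like with the plain cairo API; the function and variable names are just for the example:

#include <cairo.h>

/* Draw a surface scaled down, letting cairo pick the downscaling method.
 * With cairo >= 1.14, CAIRO_FILTER_GOOD uses a box filter for downscaling;
 * older versions fall back to the bilinear filter. */
static void
draw_scaled (cairo_t *cr, cairo_surface_t *source, double scale)
{
    cairo_save (cr);
    cairo_scale (cr, scale, scale);
    cairo_set_source_surface (cr, source, 0, 0);
    /* This is the switch described above: GOOD instead of BILINEAR. */
    cairo_pattern_set_filter (cairo_get_source (cr), CAIRO_FILTER_GOOD);
    cairo_paint (cr);
    cairo_restore (cr);
}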


Editor API

The lack of editor capabilities at the API level was blocking the migration to WebKit2 for some applications like Evolution. In 2.10 we have started to add the required API to ensure not only that the migration is possible for any application using a WebView in editable mode, but also that it will be more convenient to use.

So, for example, to monitor the state of the editor associated with a WebView, 2.10 provides a new class, WebKitEditorState, which for now allows you to monitor the typing attributes. With WebKit1 you had to connect to the selection-changed signal and use the DOM bindings API to manually query the typing attributes. This is quite useful for updating the state of the editing buttons in the editor toolbar, for example. You just need to connect to WebKitEditorState::notify::typing-attributes and update the UI accordingly. For now typing attributes are the only thing you can monitor from the UI process API, but we will add more information when needed, like the current cursor position, for example.
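A minimal sketch of that usage (the callback and function names are made up for the example):

#include <webkit2/webkit2.h>

/* Hypothetical callback: update the toolbar buttons whenever the
 * typing attributes at the current selection change. */
static void
typing_attributes_changed (WebKitEditorState *state,
                           GParamSpec        *pspec,
                           gpointer           user_data)
{
    guint attributes = webkit_editor_state_get_typing_attributes (state);
    gboolean bold = attributes & WEBKIT_EDITOR_TYPING_ATTRIBUTE_BOLD;
    gboolean italic = attributes & WEBKIT_EDITOR_TYPING_ATTRIBUTE_ITALIC;
    /* ... update the bold/italic toggle buttons accordingly ... */
}

static void
monitor_editor_state (WebKitWebView *web_view)
{
    WebKitEditorState *state = webkit_web_view_get_editor_state (web_view);
    g_signal_connect (state, "notify::typing-attributes",
                      G_CALLBACK (typing_attributes_changed), NULL);
}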

Having WebKitEditorState doesn’t mean we don’t need a selection-changed signal that we can use to query the DOM ourselves. But since in WebKit2 the DOM lives in the Web Process, the selection-changed signal has been added to the Web Extensions API. A new class, WebKitWebEditor, has been added to represent the web editor associated with a WebKitWebPage, and it can be obtained with webkit_web_page_get_editor(). And it is this new class that provides the selection-changed signal. So you can connect to the signal and use the DOM API the same way it was done in WebKit1.

Some of the editor commands require an argument; for example, the command to insert an image requires the image source URL. But both the WebKit1 and WebKit2 APIs only provided methods to run editor commands without any argument. This means that, once again, to implement something like insert-image or insert-link, you had to use the DOM bindings to create and insert the new elements in the correct place. WebKitGTK+ 2.10 provides webkit_web_view_execute_editing_command_with_argument() to make this a lot more convenient.
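For example, something along these lines should insert an image at the cursor position (the URI is, of course, just a placeholder):

/* Insert an image at the current cursor position. */
webkit_web_view_execute_editing_command_with_argument (web_view,
    WEBKIT_EDITING_COMMAND_INSERT_IMAGE, "file:///path/to/image.png");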

You can test all these features using the new editor mode of MiniBrowser; simply run it with the -e command-line option and no arguments.


Website data

When browsing the web, websites are allowed to store data on the client side. It could be a cache, like the HTTP disk cache, or data required by web features like offline applications, local storage, IndexedDB, WebSQL, etc. All that data is currently stored in different directories, and not all of them can be configured by the user. The new WebKitWebsiteDataManager class in 2.10 allows you to configure all those directories, either using a common base cache/data directory or providing a specific directory for every kind of data stored. It’s not mandatory to use it, though; the default values are compatible with the ones previously used.
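For example, a web context using application-specific directories could be created roughly like this (the paths are placeholders):

/* Keep all website data under application-specific directories. */
WebKitWebsiteDataManager *manager =
    webkit_website_data_manager_new ("base-data-directory", "/path/to/app/data",
                                     "base-cache-directory", "/path/to/app/cache",
                                     NULL);
WebKitWebContext *context =
    webkit_web_context_new_with_website_data_manager (manager);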

This gives the user more control over the browsing data stored on the client side, but in future versions we plan to add support for actually handling the data, so that you will be able to query and delete the data stored by a particular security domain.

Web Processes limit

WebKitGTK+ currently supports two process models: the single shared secondary process, and multiple secondary processes. When using the latter, a new web process is created for every new web view. When a lot of web views are created at the same time, the resources required to create all those processes could be too much on some systems. To improve that a bit, 2.10 adds webkit_web_context_set_web_process_count_limit() to set the maximum number of web processes that can be created at the same time.

This new API can also be used to implement a slightly different version of the single shared process model. By using the multiple secondary process model with a limit of 1 web process, you still have a single shared web process, but using the multi-process mechanism, which means networking will happen in the Network Process, among other things. So, if you use the shared secondary process model in your application, unless your application only loads local resources, we recommend you switch to the multiple process model and use the limit to benefit from all the Network Process features, like the new disk cache. Epiphany already does this for the secondary process model and web apps.
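A sketch of that setup:

/* Multi-process mechanism, but capped to a single shared web process. */
WebKitWebContext *context = webkit_web_context_get_default ();
webkit_web_context_set_process_model (context,
    WEBKIT_PROCESS_MODEL_MULTIPLE_SECONDARY_PROCESSES);
webkit_web_context_set_web_process_count_limit (context, 1);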

Missing media plugins installation permission request

When you try to play media and the media backend doesn’t find the plugins/codecs required to play it, the missing-plugin installation mechanism starts the package installer to allow the user to find and install them. This used to happen in the Web Process, without any way for the user to prevent it. WebKitGTK+ 2.10 provides a new WebKitPermissionRequest implementation that allows the user to block the request and prevent the installer from being invoked.
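Handling it follows the usual permission-request pattern; a sketch, assuming the new request type is the one for missing media plugins:

/* Hypothetical handler: block the media plugin installer. */
static gboolean
permission_requested (WebKitWebView           *web_view,
                      WebKitPermissionRequest *request,
                      gpointer                 user_data)
{
    if (WEBKIT_IS_INSTALL_MISSING_MEDIA_PLUGINS_PERMISSION_REQUEST (request)) {
        webkit_permission_request_deny (request);
        return TRUE; /* Handled: the installer won't be invoked. */
    }
    return FALSE; /* Anything else: default behavior. */
}

/* ... connected with:
 * g_signal_connect (web_view, "permission-request",
 *                   G_CALLBACK (permission_requested), NULL); */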

by carlos garcia campos at September 21, 2015 11:52 AM

September 18, 2015

Alejandro Piñeiro

Optimizing shader assembly instruction on Mesa using shader-db (II)

In my previous post I mentioned that I have been working on optimizing the shader instruction count for specific shaders, guided by shader-db, and showed one specific example. In this post I will show another one, slightly more complex in the triaging and the solution.

Some of the shaders with a worse instruction count can be found at shader-db/shaders/dolphin. Again I analyzed them in order to get the simplest possible shader with the same issue:

   #version 130

   in vec2 myData;

   void main()
   {
      gl_Position = vec4(myData, 3.0, 4.0);
   }

Some comments:

  • It also happens with uniforms (so you can replace “in vec2” with “uniform vec2”)
  • It doesn’t happen if you use the input directly. You need this kind of “input plus constant” combination.

So, as in my previous post, I executed the compilation using the INTEL_DEBUG=optimizer option. In the case of IR I got the following files:

  • VS-0001-00-start
  • VS-0001-01-01-opt_reduce_swizzle
  • VS-0001-01-04-opt_copy_propagation
  • VS-0001-01-07-opt_register_coalesce
  • VS-0001-02-02-dead_code_eliminate
  • VS-0001-02-07-opt_register_coalesce

This is the desired outcome (the content of VS-0001-02-02-dead_code_eliminate):

0: mov m3.z:F, 3.000000F
1: mov m3.w:F, 4.000000F
2: mov m3.xy:F, attr17.xyyy:F
3: mov m2:D, 0D
4: vs_urb_write (null):UD

Unsurprisingly it is mostly movs. In contrast to the shader I mentioned in my previous post, where in both cases the same optimizations were applied, in this case the NIR path doesn’t apply the last optimization (the second register coalesce). So this time I will focus on the starting point and the state just after the dead code eliminate pass.

So in IR, the starting point (VS-0001-00-start) is:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov vgrf1.0.zw:F, vgrf2.xxxy:F
3: mov vgrf1.0.xy:F, attr17.xyxx:F
4: mov vgrf0.0:F, vgrf1.xyzw:F
5: mov m2:D, 0D
6: mov m3:F, vgrf0.xyzw:F
7: vs_urb_write (null):UD

and the state after the dead code eliminate is the following one:

0: mov vgrf1.0.z:F, 3.000000F
1: mov vgrf1.0.w:F, 4.000000F
2: mov vgrf1.0.xy:F, attr17.xyyy:F
3: mov m2:D, 0D
4: mov m3:F, vgrf1.xyzw:F
5: vs_urb_write (null):UD

In NIR, the starting point is:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov vgrf0.0.xy:F, attr17.xyyy:F
3: mov vgrf1.0.xy:D, vgrf0.xyzw:D
4: mov vgrf1.0.zw:D, vgrf2.xxxy:D
5: mov m2:D, 0D
6: mov m3:F, vgrf1.xyzw:F
7: vs_urb_write (null):UD

and the state after the dead code eliminate is the following one:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov m3.xy:D, attr17.xyyy:D
3: mov m3.zw:D, vgrf2.xxxy:D
4: mov m2:D, 0D
5: vs_urb_write (null):UD

The first difference we can see is that although the instructions are basically the same at the starting point, the order is not the same. In fact, if we check the different intermediate steps (I will not show them here to avoid making the post too long), although the optimizations are the same, how and what gets optimized is somewhat different. One could conclude that the problem is this order, but if we take a look at the final step of the NIR assembly shader, there isn’t anything clearly indicating that the shader can’t be simplified. Specifically, instruction #3 could go away if instructions #0 and #1 wrote directly to m3 instead of vgrf2, which is what the IR path does. So it seems that the problem is in the register coalesce optimization.

As I mentioned, there is a slight order difference between NIR and IR. This means that in the NIR case, between instruction #3 and instructions #0/#1 there is another instruction, which is in a different place in IR. So my first thought was that the optimization was only checking against the immediately previous instruction. Once I started to look at the code, it turned out I was wrong: for each instruction, there is a loop checking all the previous instructions. What I noticed is that in that loop, every check that rejected a previous instruction was a break. So I initially thought that perhaps one of those breaks should in fact be a continue. This seemed to be confirmed when I did the quick hack of replacing them all with continues; it proved wrong as soon as I saw all the piglit regressions I had in hand. So after that I did the proper thing and debugged it properly. Using gdb, the condition that was stopping the optimization from checking previous instructions was the following one:

/* If somebody else writes our destination here, we can't coalesce
 * before that.
 */
if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written))
   break;

Probably the code is hard to understand out of context, but the comment is clear. We can coalesce two instructions when the previous one writes to a register we are reading in the current instruction. But obviously, that can’t be done if there is an instruction in the middle that writes to the same register. And that is exactly what is happening here. If you look at the final state of the NIR path, we want to coalesce instruction #3 with instructions #1 and #0, but instruction #2 is writing to m3 too.

So, is that all? Not exactly. Remember that IR was able to simplify this, and it can’t be only because the order was different. If you take a deeper look at those instructions, there are some x, y, z, w suffixes after the register names. Those indicate which channels the instructions are writing to. As I mentioned in my previous post, this work is about providing a NIR to vec4 pass, so those registers are vectors. Instruction #3 can be read as “move the content of components x and y of register vgrf2 to components z and w of register m3”. And instruction #2 can be read as “move the content of components x and y of register attr17 to components x and y of register m3”. So although we are writing to the same destination, we are writing to different components, meaning that it would be safe to do the coalescing. We just need to be sure that there isn’t any channel overlap between the current instruction and the previous one we are checking against. Fortunately the registers already record which channels they write to, in a variable called writemask. So we only need to change that code to the following:

/* If somebody else writes the same channels of our destination here,
 * we can't coalesce before that.
 */
if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written) &&
    (inst->dst.writemask & scan_inst->dst.writemask) != 0) {
   break;
}

The patch with this change was sent to the mesa list (here), approved and pushed to master.

Final words

So again, a problem where the solution was easier to write than to find. But in any case, it showed a significant improvement. Using the shader-db tool to compare before and after the patch:

total instructions in shared programs: 1781593 -> 1734957 (-2.62%)
instructions in affected programs:     1238390 -> 1191754 (-3.77%)
helped:                                12782
HURT:                                  0

by infapi00 at September 18, 2015 08:40 AM

September 14, 2015

Alejandro Piñeiro

Optimizing shader assembly instruction on Mesa using shader-db

Lately I have been working on Mesa. Specifically, I have been working with my fellow Igalians Eduardo Lima and Antía Puentes to provide a NIR to vec4 pass to the i965 backend. I will not go too much into the details, but in summary, NIR is a new intermediate representation for Mesa: intermediate as in being in the middle between the OpenGL GLSL language used for shaders and the final GPU machine instructions for each specific Mesa backend. NIR is intended to replace the previous GLSL IR, and in some places it already has. If you are interested in the details, take a look at the NIR announcement and the NIR documentation page.

Although the bug is still open, Mesa master already has the functionality for this pass, and in fact it is now the default. This new NIR pass provides the same functionality as the one available with the old GLSL IR pass (from now on, just IR). This was properly tested with piglit. But although the total instruction count has in general improved, we get a worse instruction count when compiling some specific known shaders with NIR. So the next step is to improve this. This is an ongoing effort, like these patches from Jason Ekstrand, but I would like to share some of my experience so far.

In order to guide this work, we have been using shader-db. shader-db is a shader database, with an executable to compile those shaders, and a tool to compare two executions of that compilation. Usually it is used to verify that the optimization you are implementing really improves the instruction count, or even to justify your change; several Mesa commits include the before and after shader-db statistics. But in this case, we have been using it as a guide to what we could improve. We compile all the shaders using IR and using NIR (via the environment variable INTEL_USE_NIR), and check in which shaders there is an instruction count regression.

Case 1: subtraction needs an extra mov.

Ok, so one of the shaders with a worse instruction count is humus-celshading/4.shader_test. After some analysis of what the problem was, I got a simpler shader with the same issue:

in vec4 inData;

void main() {
    gl_Position = gl_Vertex - inData;
}

This simple shader needs one extra instruction when using NIR. So yes, a simple subtraction is getting worse. FWIW, this is the desired final shader assembly:

0: add m3:F, attr0.xyzw:F, -attr18.xyzw:F
1: mov m2:D, 0D
2: vs_urb_write (null):UD

Note that there isn’t an assembly subtraction instruction; it is represented as negating the second parameter and using an add (this seems like Captain Obvious information here, but it will be relevant later).

So at this point one option would be to start looking at the backend (remember, i965) code for vec4, specifically the optimizations, and check if we see something. Those optimizations are called from brw_vec4.cpp. They are in general common to any compiler: dead code elimination, copy propagation, register coalescing, etc. They are usually executed several times in several passes, and some of them are simplifications that enable other optimizations (for example, if your copy propagation pass works, then it is common that your dead code elimination pass will be able to remove an instruction afterwards). With all those optimizations and passes, how do you find the problem? Although it is a good idea to read the code of those optimizations to know how they work, that is usually not enough to know where the problem is. So this is again a debugging problem, and as usual, you want to know what is happening step by step.

For this I executed the compilation again, with the following environment variables:

INTEL_USE_NIR=1 INTEL_DEBUG=optimizer ./run subtraction.shader_test

This option prints out to a file the shader assembly compiled at each optimization pass (if applied). So for example, I get the following files for both cases:

  • VS-0001-00-start
  • VS-0001-01-04-opt_copy_propagation
  • VS-0001-01-07-opt_register_coalesce
  • VS-0001-02-02-dead_code_eliminate

So in order to get the final shader assembly, a copy propagation, a register coalesce, and a dead code eliminate were executed. BTW, I found that environment variable while looking at the code. It is not listed on the Mesa envvar page, which I assume is a bug.

So I started to look at the differences between the steps. Taking into account that in both cases the same optimizations were executed, and in the same order, I started looking for differences between one and the other at each step. And I found one difference in the copy propagation.

So let’s see the starting point using IR:

0: mov vgrf2.0:F, -attr18.xyzw:F
1: add vgrf0.0:F, attr0.xyzw:F, vgrf2.xyzw:F
2: mov m2:D, 0D
3: mov m3:F, vgrf0.xyzw:F
4: vs_urb_write (null):UD

And the outcome of the copy propagation:

0: mov vgrf2.0:F, -attr18.xyzw:F
1: add vgrf0.0:F, attr0.xyzw:F, -attr18.xyzw:F
2: mov m2:D, 0D
3: mov m3:F, vgrf0.xyzw:F
4: vs_urb_write (null):UD

And the starting point using NIR:

0: mov vgrf0.0:UD, attr0.xyzw:UD
1: mov vgrf1.0:UD, attr18.xyzw:UD
2: add vgrf2.0:F, vgrf0.xyzw:F, -vgrf1.xyzw:F
3: mov m2:D, 0D
4: mov m3:F, vgrf2.xyzw:F
5: vs_urb_write (null):UD

And the outcome of the copy propagation:

0: mov vgrf0.0:UD, attr0.xyzw:UD
1: mov vgrf1.0:UD, attr18.xyzw:UD
2: add vgrf2.0:F, attr0.xyzw:F, -vgrf1.xyzw:F
3: mov m2:D, 0D
4: mov m3:F, vgrf2.xyzw:F
5: vs_urb_write (null):UD

Although it is true that the starting point for NIR already has one extra instruction compared with IR, that extra one gets optimized away in the following steps. What caught my attention was the difference between what happens with instruction #1 in the IR case and the equivalent instruction #2 in the NIR case (the add). In the IR case, copy propagation is able to propagate attr18 from the previous instruction, so it is easy to see that this can be simplified in the following optimization steps. But that doesn’t happen in the NIR case: on NIR, instruction #2 remains the same after the copy propagation.

So I started to take a look at the implementation of the copy propagation optimization code (here). Without entering into details, this pass analyses each instruction, comparing it with the previous ones in order to know if it can do a copy propagation. So I looked at why, for that specific instruction, the pass concludes that it can’t be done. At this point you could use gdb, but I used some extra printfs (sometimes they are useful too). So I found the check that rejected that instruction:

bool has_source_modifiers = value.negate || value.abs;


if (has_source_modifiers && value.type != inst->src[arg].type)
    return false;

That means that if the source of the previous instruction you are checking against is negated (or has an abs), and the types are different, you can’t do the propagation. This makes sense, because negation works differently on different types. If we go back to the shader assembly output, we find that it is true that the types (those F, D and UD just after the registers) are different between the IR and the NIR cases. Why didn’t we worry about this before? Why wasn’t this failing on any piglit test? Well, because if you look more carefully, the instructions that had different types are the movs. In both cases, the type is correct on the add instruction. And in a mov, the type is somewhat irrelevant: you are just moving raw data from one place to the other; the type matters on ALU operations. But in any case, it is true that the type is wrong on those registers (compared with the original GLSL code), and as we are seeing, it is causing problems in the optimization passes. So, next step: check where those types are filled in.

Searching a little in the code, and using gdb this time, this is done in the function nir_setup_uniforms at brw_vec4_nir.cpp, while creating a source register variable. Curiously, it is using the type that came from NIR:

src_reg src = src_reg(ATTR, var->data.location + i, var->type);

and printing out the content of var->type with gdb properly shows the type used in the GLSL code. If we go deeper, into the src_reg constructor:

src_reg::src_reg(register_file file, int reg, const glsl_type *type)
{
    this->file = file;
    this->reg = reg;
    if (type && (type->is_scalar() || type->is_vector() || type->is_matrix()))
        this->swizzle = brw_swizzle_for_size(type->vector_elements);
    else
        this->swizzle = BRW_SWIZZLE_XYZW;
}

We see that the type is only used to fill the swizzle. But if we compare this to the equivalent code for a destination register:

dst_reg::dst_reg(register_file file, int reg, const glsl_type *type,
                 unsigned writemask)
{
    this->file = file;
    this->reg = reg;
    this->type = brw_type_for_base_type(type);
    this->writemask = writemask;
}

dst_reg is also filling in the type internally, something src_reg is not doing. So at this point the patch is straightforward: just fill in src_reg::type using the constructor’s type parameter. The patch was approved and is already on master.
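A minimal sketch of the change (the actual patch in Mesa master may differ in details):

src_reg::src_reg(register_file file, int reg, const glsl_type *type)
{
    this->file = file;
    this->reg = reg;
    if (type && (type->is_scalar() || type->is_vector() || type->is_matrix())) {
        this->swizzle = brw_swizzle_for_size(type->vector_elements);
        /* The fix: record the real type, as dst_reg already does. */
        this->type = brw_type_for_base_type(type);
    } else {
        this->swizzle = BRW_SWIZZLE_XYZW;
    }
}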

Coming up next

In the end, I didn’t need to improve any of the optimization passes at all, as the bug was elsewhere, but the debugging steps for this kind of work are still the same. In fact it was the usual bug that was harder to find (for simplicity I summarized the triaging) than to solve. In the next blog post, I will explain how I worked on another instruction count regression, somewhat more complex, that did need a change in one of the optimization passes.


by infapi00 at September 14, 2015 11:58 AM

September 11, 2015

Jacobo Aragunde

An update on LibreOffice for Android

Summer time is for hacking! I booked some time in the last few weeks to carry on with the work on the Android port we started earlier this year at Igalia, merging the pending bits in master, completing features and adding some polish. Now that everything is upstream, let me share it with you!

Merge ownCloud document provider

I had used the ownCloud Android library to speed up the implementation of the ownCloud document provider. Unfortunately, it included some dependencies in binary form, which was not acceptable for integrating the library in LibreOffice.

My solution was creating a GitHub fork of the library and replacing the binary dependencies with the corresponding sources. Now a tarball containing the sources in that repository is downloaded and compiled by the LibreOffice build system when building for the Android target.

Implement save to cloud

While I was working on the Document Providers infrastructure, Miklos and Tomaž were implementing document editing in the LibreOffice port. Now that there is some basic support to edit and save documents, I had to enable the save operation in the Document Providers framework, and in particular in the ownCloud test implementation.

The ownCloud save operation itself was easy, but knowing whether the local save operation had finished or not proved to be a bit more challenging. The LibreOfficeKit API does not yet provide a way to set a callback on the save operation so we can check its status and do something after it has finished. As a workaround, I periodically check the modification date field of the file to know if LibreOffice core is already done with it.
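A sketch of that polling workaround in Java (the class and callback are made up for illustration; this is not the actual LibreOffice code):

import android.os.Handler;
import java.io.File;

class SaveWatcher {
    private static final long POLL_INTERVAL_MS = 500;
    private final Handler handler = new Handler();

    /* Poll the file until its modification date changes, which means
     * LibreOffice core has finished writing it. */
    void watch(final File file, final long previousModified, final Runnable onSaved) {
        handler.postDelayed(new Runnable() {
            @Override public void run() {
                if (file.lastModified() != previousModified) {
                    onSaved.run();
                } else {
                    watch(file, previousModified, onSaved);
                }
            }
        }, POLL_INTERVAL_MS);
    }
}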

This is meant to be a temporary solution, but at least we can provide some feedback on the save operation now.

Feedback on save

A hamburger in my LibreOffice

Without any indication, the drawer containing the list of document providers was completely hidden from users; only a handful of people I showed the app to, besides myself, were aware of it. For that reason, I enabled the application home button, the one on the left of the action bar, to become a drawer indicator, the famous “hamburger” icon found in many Android apps.

It took me some time to figure out the way to put everything together so the button changes its icon and behavior according to the context. When the user is on the root folder, the “hamburger” will indicate the presence of a drawer menu with the additional locations implemented by the available document providers. When browsing some directory, it will become a back arrow to browse the parent directory. Finally, when the drawer is open it will also become an arrow to close the drawer.

Drawer indicator

Smart behavior for system back key

Tapping the system back key when browsing some directory and seeing the app close was a bit frustrating, because most users expect it to go back to the previous location, the parent directory. I’ve overridden the default behavior to make it smarter and context-aware. When browsing some directory, it will open the parent directory; when the drawer is open, it will close it; and when browsing the root directory, it will ask for an additional tap as confirmation to exit the application.

Exit confirmation

All these patches are in master branch and will be part of the next major release. Impatient people will also find them in the daily builds from tomorrow onward.

by Jacobo Aragunde Pérez at September 11, 2015 06:44 PM

August 15, 2015

Adrián Pérez

Locked out of SSH? Renew all the keys!

Long time, no post! Well, here I am with something completely unrelated to programming. Instead, this is something that has been bugging me for a while, but I had been postponing it because from the surface it looked a bit like a deep rabbit hole.

The Issue

For a while, every time I tried to connect to certain servers via SSH (or Mosh, make sure to check it out!), I was getting something like this:

~ % ssh
Host key fingerprint is ~~REDACTED~~
Permission denied (publickey).
~ %

Having plenty of things pending to do, plus laziness, made me ignore this for a while, because most of the machines I connect to via SSH have password-based authentication as a fallback. But today I had to use a machine that is configured to accept only key-based authentication, so I had to bite the bullet. As usual, I re-tried connecting with the very same command, adding -v to get some debugging output. Notice this message:

debug1: Skipping ssh-dss key /home/aperez/.ssh/id_dsa for not in PubkeyAcceptedKeyTypes

Could this be the root cause? Searching on the Internet did not yield anything interesting — nobody checks result pages after the second one, ever. Then I remembered that after the infamous Heartbleed bug, several initiatives were started with the goal of making secure stuff more secure — and hopefully bug-free. While LibreSSL gets the honorable mention of having the cutest logo, other teams have not been behind in “de-crufting” their code. I started to fear that the latest release of OpenSSH might no longer support one of:

  • The host key used by the server.
  • The host key type used by my laptop.
  • The identity key type.

It was time to learn about what changed in the recent past.

Exhibit “A”

Looking at the OpenSSH 7.0 release notes, there is the culprit:

 * Support for ssh-dss, ssh-dss-cert-* host and user keys is disabled
   by default at run-time. These may be re-enabled using the
   instructions at

So it turns out that support for ssh-dss keys like the one I was trying to use is still available, but disabled by default. The mentioned URL contains information on how to use legacy features, and in this case the support for ssh-dss keys can be re-enabled using the PubkeyAcceptedKeyTypes option, either temporarily for a single ssh (or scp) invocation:

~ % ssh -oPubkeyAcceptedKeyTypes=+ssh-dss

or permanently, adding a snippet to the ~/.ssh/config file (with the host pattern matching the machines that need the legacy key):

Host legacy.example.com
     PubkeyAcceptedKeyTypes +ssh-dss

The Solution

I suspect it won't be long before support for DSA keys is disabled at compile time, and for good reasons, so this seemed a moment as good as any to make a new SSH key, and propagate it to the hosts where I was using the old one.

Indeed, I should.

The question is: which type of key should I generate as of 2015? The current default of ssh-keygen is RSA keys of 2048 bits, which seems reasonable given that it is technically possible to break 1024 bit keys. Applying a bit of paranoia, I decided to use a 4096 bit key instead, to make it future-proof as well.

But hey, we live in dangerous days, and one may want to stay away from RSA keys. The usual alternative, ECDSA, uses the standard NIST curves, which are not that good, and because we know for a fact that the NSA has been tampering around with them, we may want to take a different approach and go for an Ed25519 key instead. Ed25519 keys, apart from having Daniel J. Bernstein in the design team, are not vulnerable to poor random number generation. There is only one catch: support for Ed25519 is kind of new, so if you need to connect to machines running OpenSSH older than 6.5 you may still want to create a 4096 bit RSA key for them, ensuring that the Ed25519 key is used by default and the RSA one only when needed. More on that later; for now, let’s create the new keys with:

~ % ssh-keygen -t ed25519


~ % ssh-keygen -t rsa -b 4096

Then, append the key to the relevant machines, changing id_ed25519 to id_rsa for the ones which do not support Ed25519; note how the option to enable ssh-dss keys is used to temporarily allow using the old key to be able to append the new one (which somehow did not work with ssh-copy-id):

% ssh -oPubkeyAcceptedKeyTypes=+ssh-dss \
    'cat >> ~/.ssh/authorized_keys' < ~/.ssh/id_ed25519.pub

Last but not least, let's make sure that the RSA key is only ever used when needed, with a couple of tweaks in the ~/.ssh/config file: a global default that allows only Ed25519, plus a Host stanza (the host name below is a placeholder for the machines that still need RSA) that re-enables it:

PubkeyAcceptedKeyTypes ssh-ed25519

Host legacy.example.com
     PubkeyAcceptedKeyTypes +ssh-rsa

Note that it is not possible to remove one item from the PubkeyAcceptedKeyTypes list with -ssh-rsa: we need to specify a complete list of the key types that OpenSSH will be allowed to use. For reference, the ssh_config manual page lists the default value, and ssh -Q key lists the key types built into a particular OpenSSH installation. In my case, I have decided to use my new Ed25519 key as the primary one, allowing only this key type by default, and using the RSA key for selected hosts.

While we are at it, it may be interesting to switch from the OpenSSH server to TinySSH on the servers under our control. Despite being small, it supports Ed25519, and it uses NaCl (DJB's crypto library) instead of OpenSSL. Also, it is possibly the simplest service one can set up over CurveCP at the moment. But this post is already long enough, so let's move on to the finale.


My main desktop environment is GNOME, which is very nice and by default includes a convenient SSH agent as part of its Keyring daemon. But every now and then I just use the Openbox window manager, or even a plain Linux virtual terminal with a tmux session in it, where there are no GNOME components running. And I do miss its SSH agent. After looking around a bit, I installed Envoy, which, instead of implementing its own agent, ensures that one (and only one) instance of the OpenSSH ssh-agent runs for each user, across all logged-in sessions, and without needing changes to your shell startup files when using the included PAM module.

Update (2015-08-16): Another reason to use Envoy, as pointed out by Óscar Amor, is that the GNOME Keyring daemon can't handle Ed25519 keys.

Happy SSH'ing!

by aperez at August 15, 2015 11:52 PM

August 14, 2015

Alberto Garcia

I/O limits for disk groups in QEMU 2.4

QEMU 2.4.0 has just been released, and among many other things it comes with some of the stuff I have been working on lately. In this blog post I am going to talk about disk I/O limits and the new feature to group several disks together.

Disk I/O limits

Disk I/O limits allow us to control the amount of I/O that a guest can perform. This is useful, for example, if we have several VMs on the same host and we want to reduce the impact they have on each other when disk usage is very high.

The I/O limits can be set using the QMP command block_set_io_throttle, or on the command line using the throttling.* options of the -drive parameter (shown in brackets in the examples below). Both the throughput and the number of I/O operations can be limited. For more fine-grained control, the limits can be set separately for read operations, write operations, or the combination of both:

  • bps (throttling.bps-total): Total throughput limit (in bytes/second).
  • bps_rd (throttling.bps-read): Read throughput limit.
  • bps_wr (throttling.bps-write): Write throughput limit.
  • iops (throttling.iops-total): Total I/O operations per second.
  • iops_rd (throttling.iops-read): Read I/O operations per second.
  • iops_wr (throttling.iops-write): Write I/O operations per second.


For example, this sets a write limit of 50 MB/s and a total limit of 6000 I/O operations per second:

-drive if=virtio,file=hd1.qcow2,throttling.bps-write=52428800,throttling.iops-total=6000

In addition to that, it is also possible to configure the maximum burst size, which defines a pool of I/O that the guest can perform without being limited:

  • bps_max (throttling.bps-total-max): Total maximum (in bytes).
  • bps_rd_max (throttling.bps-read-max): Read maximum.
  • bps_wr_max (throttling.bps-write-max): Write maximum.
  • iops_max (throttling.iops-total-max): Total maximum of I/O operations.
  • iops_rd_max (throttling.iops-read-max): Read I/O operations.
  • iops_wr_max (throttling.iops-write-max): Write I/O operations.

One additional parameter named iops_size allows us to deal with the case where big I/O operations can be used to bypass the limits we have set. In this case, if a particular I/O operation is bigger than iops_size then it is counted several times when it comes to calculating the I/O limits. So a 128KB request will be counted as 4 requests if iops_size is 32KB.

  • iops_size (throttling.iops-size): Size of an I/O request (in bytes).

Group throttling

All of these parameters I’ve just described operate on individual disk drives and have been available for a while. Since QEMU 2.4 however, it is also possible to have several drives share the same limits. This is configured using the new group parameter.

The way it works is that each disk with I/O limits is a member of a throttle group, and the limits apply to the combined I/O of all group members using a round-robin algorithm. The way to put several disks together is simply to use the group parameter with all of them, using the same group name. Once the group is set, there’s no need to pass the parameter to block_set_io_throttle anymore unless we want to move the drive to a different group. Since the I/O limits apply to all group members, it is enough to use block_set_io_throttle on just one of them.
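For instance, a QMP invocation along these lines (the device name is just an example) would set a 6000 IOPS limit shared by every member of group foo:

{ "execute": "block_set_io_throttle",
  "arguments": { "device": "virtio0",
                 "bps": 0, "bps_rd": 0, "bps_wr": 0,
                 "iops": 6000, "iops_rd": 0, "iops_wr": 0,
                 "group": "foo" } }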

Here’s an example of how to set groups using the command line:

-drive if=virtio,file=hd1.qcow2,throttling.iops-total=6000,throttling.group=foo
-drive if=virtio,file=hd2.qcow2,throttling.iops-total=6000,throttling.group=foo
-drive if=virtio,file=hd3.qcow2,throttling.iops-total=3000,throttling.group=bar
-drive if=virtio,file=hd4.qcow2,throttling.iops-total=6000,throttling.group=foo
-drive if=virtio,file=hd5.qcow2,throttling.iops-total=3000,throttling.group=bar
-drive if=virtio,file=hd6.qcow2,throttling.iops-total=5000

In this example, hd1, hd2 and hd4 are all members of a group named foo with a combined IOPS limit of 6000, and hd3 and hd5 are members of bar. hd6 is left alone (technically it is part of a 1-member group).

Next steps

I am currently working on providing more I/O statistics for disk drives, including latencies and average queue depth over a user-defined interval. The code is almost ready. Next week I will be in Seattle for the KVM Forum, where I will hopefully be able to finish the remaining bits.

I will also attend LinuxCon North America. Igalia is sponsoring the event and we have a booth there. Come if you want to talk to us or see our latest demos with WebKit for Wayland.

See you in Seattle!

by berto at August 14, 2015 10:22 AM

August 04, 2015

Andy Wingo

developing v8 with guix

a guided descent into hell

It all started off so simply. My primary development machine is a desktop computer that I never turn off. I suspend it when I leave work, and then resume it when I come back. It's always where I left it, as it should be.

I rarely update this machine because it works well enough for me, and anyway my focus isn't the machine, it's the things I do on it. Mostly I work on V8. The setup is so boring that I certainly didn't imagine myself writing an article about it today, but circumstances have forced my hand.

This machine runs Debian. It used to run the testing distribution, but somehow in the past I needed something that wasn't in testing so it runs unstable. I've been using Debian for some 16 years now, though not continuously, so although running unstable can be risky, usually it isn't, and I've unborked it enough times that I felt pretty comfortable.

Perhaps you see where this is going!

I went to install something, I can't even remember what it was now, and the downloads failed because I hadn't updated in a while. So I update, install the thing, and all is well. Except my instant messaging isn't working any more because there are a few moving parts (empathy / telepathy / mission control / gabble / dbus / whatwhat), and the install must have pulled in something that broke one of them. No biggie, this happens. Might as well go ahead and update the rest of the system while I'm at it and get a reboot to make sure I'm not running old software.

Most Debian users know that you probably shouldn't do a dist-upgrade from an old system -- you upgrade and then you dist-upgrade. Or perhaps this isn't even true, it's tribal lore to avoid getting eaten by the wild beasts of bork that roam around the village walls at night. Anyway that's what I did -- an upgrade, let it chunk for a while, then a dist-upgrade, check the list to make sure it didn't decide to remove one of my kidneys to satisfy the priorities of the bearded demon that lives inside apt-get, OK, let it go, all is well, reboot. Swell.

Or not! The computer restarts to a blank screen. Ha ha ha you have been bitten by a bork-beast! Switch to a terminal and try to see what's going on with GDM. It's gone! Ha ha ha! Your organs are being masticated as we speak! How does that feel! Try to figure out which package is causing it, happily with another computer that actually works. Surely this will be fixed in some update coming soon. Oh it's something that's going to take a few weeks!!!! Ninth level, end of the line, all passengers off!

my gods

I know how we got here, I love Debian, but it is just unacceptable and revolting that software development in 2015 is exposed to an upgrade process which (1) can break your system (2) by default and (3) can't be rolled back. The last one is the killer: who would design software this way? If you make a system like this in 2015 I'd say you're committing malpractice.

Well yesterday I resolved that this would be the last time this happens to me. Of course I could just develop in a virtual machine, and save and restore around upgrades, but that's kinda trash. Or I could use btrfs and be able to rewind changes to the file system, but then it would rewind everything, not just the system state.

Fortunately there is a better option in the form of functional package managers, like Nix and Guix. Instead of upgrading your system by mutating /usr, Nix and Guix store all files in a content-addressed store (/nix/store and /gnu/store, respectively). A user accesses the store via a "profile", which is a forest of symlinks into the store.

For example, on my machine with a NixOS system installation, I have:

$ which ls
/run/current-system/sw/bin/ls

$ ls -l /run/current-system/sw/bin/ls
lrwxrwxrwx 1 root nixbld 65 Jan  1  1970
  /run/current-system/sw/bin/ls ->
  /nix/store/wc472nw0kyw0iwgl6352ii5czxd97js2-coreutils-8.23/bin/ls

$ ldd /nix/store/wc472nw0kyw0iwgl6352ii5czxd97js2-coreutils-8.23/bin/ls
  linux-vdso.so.1 (0x00007fff5d3c4000)
  libacl.so.1 => /nix/store/c2p56z920h4mxw12pjw053sqfhhh0l0y-acl-2.2.52/lib/libacl.so.1 (0x00007fce99d5d000)
  libc.so.6 => /nix/store/la5imi1602jxhpds9675n2n2d0683lbq-glibc-2.20/lib/libc.so.6 (0x00007fce999c0000)
  libattr.so.1 => /nix/store/jd3gggw5bs3a6sbjnwhjapcqr8g78f5c-attr-2.4.47/lib/libattr.so.1 (0x00007fce997bc000)
  /nix/store/la5imi1602jxhpds9675n2n2d0683lbq-glibc-2.20/lib/ld-linux-x86-64.so.2 (0x00007fce99f65000)

Content-addressed linkage means that files in the store are never mutated: they will never be overwritten by a software upgrade. Never. Never will I again gaze in horror at the frozen beardcicles of a Debian system in the throes of "oops I just deleted all your programs, like that time a few months ago, wasn't that cool, it's really cold down here, how do you like my frozen facial tresses and also the horns".

At the same time, I don't have to give up upgrades. Paradoxically, immutable software facilitates change and gives me the freedom to upgrade my system without anxiety and lost work.

nix and guix

So, there's Nix and there's Guix. Both are great. I'll get to comparing them, but first a digression on the ways they can be installed.

Both Nix and Guix can be installed either as the operating system of your computer, or just as a user-space package manager. I would actually recommend to people to start with the latter way of working, and move on to the OS if you feel comfortable. The fundamental observation here is that because /nix/store doesn't depend on or conflict with /usr, you can run Nix or Guix as a user on a (e.g.) Debian system with no problems. You can have a forest of symlinks in ~/.guix-profile/bin that links to nifty things you've installed in the store and that's cool, you don't have to tell Debian.

and now look at me

In my case I wanted to also have the system managed by Nix or Guix. GuixSD, the name of the Guix OS install, isn't appropriate for me yet because it doesn't do GNOME. I am used to GNOME and don't care to change, so I installed NixOS instead. It works fine. There have been some irritations -- for example it just took me 30 minutes to figure out how to install dict, with a local wordnet dictionary server -- but mostly it has the packages I need. Again, I don't recommend starting with the OS install though.

GuixSD, the OS installation of Guix, is a bit harder even than NixOS. It has fewer packages, though what it does have tends to be more up-to-date than Nix. There are two big things about GuixSD though. One is that it aims to be fully free, including avoiding non-free firmware. Because they build deterministic build products from source, Nix and Guix can offer completely reproducible builds, which is swell for software reliability. Many reliability people also care a lot about software freedom and although Nix does support software freedom very well, it also includes options to turn on the Flash plugin, for example, and of course includes the Linux kernel with all of the firmware. Well GuixSD eschews non-free firmware, and uses the Linux-Libre kernel. For myself I have a local build on another machine that uses the stock Linux kernel with firmware for my Intel wireless device, and I was really discouraged from even sharing the existence of this hack. I guess it makes sense, it takes a world to make software freedom, but that particular part is not my fight.

The other thing about Guix is that it's really GNU-focused. This is great but also affects the product in some negative ways. They use "dmd" as an init system, for example, which is kinda like systemd but not. One consequence of this is that GuixSD doesn't have an implementation of the org.freedesktop.login1 seat management interface, which these days is implemented by part of systemd, which in turn precludes a bunch of other things GNOME-related. At one point I started working on a fork of systemd that pulled logind out to a separate project, which makes sense to me for distros that want seat management but not systemd, but TBH I have no horse in the systemd race and in fact systemd works well for me. But, a system with elogind would also work well for me. Anyway, the upshot is that unless you care a lot about the distro itself or are willing to adapt to e.g. Xfce or Xmonad or something, NixOS is a more pragmatic choice.

i'm on a horse

I actually like Guix's tools better than Nix's, and not just because they are written in Guile. Guix also has all the tools I need for software development, so I prefer it and ended up installing it as a user-space package manager on this NixOS system. Sounds bizarre but it actually works pretty well.

So, the point of this article is to be a little guide of how to build V8 with Guix. Here we go!

up and running with guix

First, check the manual. It's great and well-written and answers many questions and in fact includes all of this.

Now, I assume you're on an x86-64 Linux system, so we're going to use the awesome binary installation mechanism. Check it out: because everything in /gnu/store is linked directly to each other, all you have to do is to copy a reified /gnu/store onto a working system, then copy a sqlite thing into /var, and you've installed Guix. Sweet, eh? And actually you can take a running system and clone it onto other systems in that way, and Guix even provides a tool to generate such a tarball for you. Neat stuff.

cd /tmp
tar xf guix-binary-0.8.3.x86_64-linux.tar.xz
mv var/guix /var/ && mv gnu /

This Guix installation has a built-in profile for the root user, so let's go ahead and add a link from ~root to the store.

ln -sf /var/guix/profiles/per-user/root/guix-profile \
       ~root/.guix-profile

Since we're root, we can add the bin/ part of the Guix profile to our environment.

export PATH="$HOME/.guix-profile/bin:$HOME/.guix-profile/sbin:$PATH"

Perhaps we add that line to our ~root/.bash_profile. Anyway, now we have Guix. Or rather, we almost have Guix -- we need to start the daemon that actually manages the store. Create some users:

groupadd --system guixbuild

for i in `seq -w 1 10`; do
  useradd -g guixbuild -G guixbuild           \
          -d /var/empty -s `which nologin`    \
          -c "Guix build user $i" --system    \
          guixbuilder$i;
done

And now run the daemon:

guix-daemon --build-users-group=guixbuild

If your host distro uses systemd, there's a unit that you can drop into the systemd folder. See the manual.
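
For illustration, a minimal unit along these lines ought to work; this is only a sketch assuming the binary-install paths used above, so prefer the service file that ships with Guix:

[Unit]
Description=Build daemon for GNU Guix

[Service]
ExecStart=/var/guix/profiles/per-user/root/guix-profile/bin/guix-daemon --build-users-group=guixbuild

[Install]
WantedBy=multi-user.target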

A few more things. One, usually when you go to install something, you'll want to fetch a pre-built copy of that software if it's available. Although Guix is fundamentally a build-from-source distro, Guix also runs a continuous builder service to make sure that binaries are available, if you trust the machine building the binaries of course. To do that, we tell the daemon to trust the public key of the build farm, which ships with Guix:

guix archive --authorize < ~root/.guix-profile/share/guix/

as a user

OK now we have Guix installed. Running Guix commands will install things into the store as needed, and populate the forest of symlinks in the current user's $HOME/.guix-profile. So probably what you want to do is to run, as your user:

/var/guix/profiles/per-user/root/guix-profile/bin/guix \
  package --install guix

This will make Guix available in your own user's profile. From here you can begin to install software; for example, if you run

guix package --install emacs

You'll then have an emacs in ~/.guix-profile/bin/emacs which you can run. Pretty cool stuff.
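
Profile operations are transactional, too, so a botched install is easy to back out of. The stock guix package interface can list the generations of your profile and roll back to the previous one:

guix package --list-generations
guix package --roll-back

Each install or removal creates a new generation of the symlink forest; rolling back just flips the profile link to the previous generation.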

back on the horse

So what does it mean for software development? Well, when I develop software, I usually want to know exactly what the inputs are, and to not have inputs to the build process that I don't control, and not have my build depend on unrelated software upgrades on my system. That's what Guix provides for me. For example, when I develop V8, I just need a few things. In fact I need these things:

;; Save as ~/src/profiles/v8.scm
(use-package-modules gcc llvm base python version-control less ccache)

(packages->manifest
 (list clang
       (list gcc-4.9 "lib")
       coreutils python-2 git less ccache))

This set of Guix packages is what it took for me to set up a V8 development environment. I can make a development environment containing only these packages and no others by saving the above file as v8.scm and then sourcing this script:

~/.guix-profile/bin/guix package -p ~/src/profiles/v8 -m ~/src/profiles/v8.scm
eval `~/.guix-profile/bin/guix package -p ~/src/profiles/v8 --search-paths`
export GYP_DEFINES='linux_use_bundled_gold=0 linux_use_gold_flags=0 linux_use_bundled_binutils=0'
export CXX='ccache clang++'
export CC='ccache clang'
export LD_LIBRARY_PATH=$HOME/src/profiles/v8/lib

Let's take this one line at a time. The first line takes my manifest -- the set of packages that collectively form my build environment -- and arranges to populate a symlink forest at ~/src/profiles/v8.

$ ls -l ~/src/profiles/v8/
total 44
dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 bin
dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 etc
dr-xr-xr-x  4 root guixbuild  4096 Jan  1  1970 include
dr-xr-xr-x  2 root guixbuild 12288 Jan  1  1970 lib
dr-xr-xr-x  2 root guixbuild  4096 Jan  1  1970 libexec
-r--r--r--  2 root guixbuild  4138 Jan  1  1970 manifest
lrwxrwxrwx 12 root guixbuild    59 Jan  1  1970 sbin -> /gnu/store/1g78hxc8vn7q7x9wq3iswxqd8lbpfnwj-glibc-2.21/sbin
dr-xr-xr-x  6 root guixbuild  4096 Jan  1  1970 share
lrwxrwxrwx 12 root guixbuild    58 Jan  1  1970 var -> /gnu/store/1g78hxc8vn7q7x9wq3iswxqd8lbpfnwj-glibc-2.21/var
lrwxrwxrwx 12 root guixbuild    82 Jan  1  1970 x86_64-unknown-linux-gnu -> /gnu/store/wq6q6ahqs9rr0chp97h461yj8w9ympvm-binutils-2.25/x86_64-unknown-linux-gnu

So that's totally scrolling off the right for you, that's the thing about Nix and Guix names. What it means is that I have a tree of software, and most directories contain a union of links from various packages. It so happens that sbin though just has links from glibc, so it links directly into the store. Anyway. The next line in my script arranges to point my shell into that environment.

$ guix package -p ~/src/profiles/v8 --search-paths
export PATH="/home/wingo/src/profiles/v8/bin:/home/wingo/src/profiles/v8/sbin"
export CPATH="/home/wingo/src/profiles/v8/include"
export LIBRARY_PATH="/home/wingo/src/profiles/v8/lib"
export LOCPATH="/home/wingo/src/profiles/v8/lib/locale"
export PYTHONPATH="/home/wingo/src/profiles/v8/lib/python2.7/site-packages"

Having sourced this into my environment, my shell's ls for example now points into my new profile:

$ which ls
/home/wingo/src/profiles/v8/bin/ls

Neat. Next we have some V8 defines. On x86_64 on Linux, v8 wants to use some binutils things that it bundles itself, but oddly enough for months under Debian I was seeing spurious intermittent segfaults while linking with their bundled gold linker binary. I don't want to use their idea of what a linker is anyway, so I set some defines to make v8's build tool use Guix's linker. (Incidentally, figuring out what those defines were took spelunking through makefiles, to gyp files, to the source of gyp itself, to the source of the standard shlex Python module to figure out what delimiters shlex.split actually splits on... yaaarrggh!)

Then some defines to use ccache, then a strange thing: what's up with that LD_LIBRARY_PATH?

Well. I'm not sure. However the normal thing for dynamic linking under Linux is that you end up with binaries that are just linked against a bare library name, to be resolved to wherever the system happens to find that library. That's not what we want in Guix -- we want to link against a specific version of every dependency, not just any old version. Guix's builders normally do this when building software for Guix, but somehow in this case I haven't managed to make that happen, so the binaries that are built as part of the build process can end up not specifying the path of the libraries they are linked to. I don't know whether this is an issue with v8's build system, that it doesn't want to work well with Nix / Guix, or if it's something else. Anyway I hack around it by assuming that whatever's in my artisanally assembled symlink forest ("profile") is the right thing, so I set it as the search path for the dynamic linker. Suggestions welcome here.

And from here... well it just works! I've gained the ability to precisely specify a reproducible build environment for the software I am working on, which is entirely separated from the set of software that I have installed on my system, which I can reproduce precisely with a script, and yet which is still part of my system -- I'm not isolated from it by container or VM boundaries (though I can be; see NixOps for more in that direction).

OK I lied a little bit. I had to apply this patch to V8:

$ git diff
diff --git a/build/standalone.gypi b/build/standalone.gypi
index 2bdd39d..941b9d7 100644
--- a/build/standalone.gypi
+++ b/build/standalone.gypi
@@ -98,7 +98,7 @@
         ['OS=="win"', {
           'gomadir': 'c:\\goma\\goma-win',
         }, {
-          'gomadir': '<!(/bin/echo -n ${HOME}/goma)',
+          'gomadir': '<!(/usr/bin/env echo -n ${HOME}/goma)',
          }],
        ['host_arch!="ppc" and host_arch!="ppc64" and host_arch!="ppc64le"', {
           'host_clang%': '1',

See? Because my system is NixOS, there is no /bin/echo. It does helpfully install a /usr/bin/env though, which other shell invocations in this build script use, so I use that instead. I mention this as an example of what works and what workarounds there are.

dpkg --purgatory

So now I have NixOS as my OS, and I mostly use Guix for software development. This is a new setup and we'll see how it works in practice.

Installing NixOS on top of Debian was a bit irritating. I ended up making a bootable USB installation image, then installing over to my Debian partition, happy in the idea that it wouldn't conflict with my system. But in that I forgot about /etc and /var and all that. So I copied /etc to /etc-debian, just as a backup, and NixOS appeared to install fine. However it wouldn't boot, and that's because some systemd state from my old /etc which was still in place conflicted with... something? In the end I redid the install, moving my old /usr, /etc and such directories to backup names and letting NixOS have control. That worked fine.

I have GuixSD on a laptop but I really don't recommend it right now -- not unless you have time and are willing to hack on it. But that's OK, install NixOS and you'll be happy on the system side, and if you want Guix you can install it as a user.

Comments and corrections welcome, and happy hacking!

by Andy Wingo at August 04, 2015 04:23 PM

July 28, 2015

Andy Wingo

loop optimizations in guile

Sup peeps. So, after the slog to update Guile's intermediate language, I wanted to land some new optimizations before moving on to the next thing. For years I've been meaning to do some loop optimizations, and I was finally able to land a few of them.

loop peeling

For a long time I have wanted to do "loop peeling". Loop peeling means peeling off the first iteration of a loop. If you have a source program that looks like this:

while foo:
  bar()
  baz()

Loop peeling turns it into this:

if foo:
  bar()
  baz()
  while foo:
    bar()
    baz()

You wouldn't think that this is actually an optimization, would you? Well on its own, it's not. But if you combine it with common subexpression elimination, then it means that the loop body is now dominated by all effects and all loop-invariant expressions that must be evaluated for the expression to loop.

In dynamic languages, this is most useful when one source expression expands to a number of low-level steps. So for example if your language runtime implements top-level variable references in three parts, one that gets a reference to a mutable box, a second that checks if the box has a value, and a third that unboxes it, then we would have:

if foo:
  bar_location = lookup("bar")
  bar_value = dereference(bar_location)
  if bar_value is null: throw NotFound("bar")

  baz_location = lookup("baz")
  baz_value = dereference(baz_location)
  if baz_value is null: throw NotFound("baz")

  while foo:
    bar_value = dereference(bar_location)

    baz_value = dereference(baz_location)

The result is that we have hoisted the lookups and null checks out of the loop (if a box can never transition from full back to empty). It's a really powerful transformation that can even hoist things that traditional loop-invariant code motion can't, but more on that later.

Now, the problem with loop peeling is that usually values will escape your loop. For example:

while foo:
  x = qux()
  if x then return x

In this little example, there is a value x, and the return x statement is actually not in the loop. It's syntactically in the loop, but the underlying representation that the compiler uses looks more like this:

function qux(k):
  label loop_header():
    fetch(foo) -> loop_test
  label loop_test(foo_value):
    if foo_value then -> exit else -> body
  label body():
    fetch(x) -> have_x
  label have_x(x_value):
    if x_value then -> return_x else -> loop_header
  label return_x():
    values(x) -> k
  label exit():

This is the "CPS soup" I described in my last post. Only the bold parts are in the loop; notably, the return is outside the loop. Point being, if we peel off the first iteration, then there are two possible values for x that we would return:

if foo:
  x1 = qux()
  if x1 then return x1
  while foo:
    x2 = qux()
    if x2 then return x2

I have them marked as x1 and x2. But I've also duplicated the return x terms, which is not what we want. We want to peel off the first iteration, which will cause code growth equal to the size of the loop body, but we don't want to have to duplicate everything that's after the loop. What we have to do is re-introduce a join point that defines x:

if foo:
  x1 = qux()
  if x1 then join(x1)
  while foo:
    x2 = qux()
    if x2 then join(x2)
label join(x)
  return x

Here I'm playing fast and loose with notation because the real terms are too gnarly. What I'm trying to get across is that for each value that flows out of a loop, you need a join point. That's fine, it's a bit more involved, but what if your loop exits to two different points, but one value is live in both of them? A value can only be defined in one place, in CPS or SSA. You could re-place a whole tree of phi variables, in SSA parlance, with join blocks and such, but it's just too hard.

However we can still get the benefits of peeling in most cases if we restrict ourselves to loops that exit to only one continuation. In that case the live variable set is the intersection of all variables defined in the loop that are live at the exit points. Easy enough, and that's what we have in Guile now. Peeling causes some code growth but the loops are smaller so it should still be a win. Check out the source, if that's your thing.

loop-invariant code motion

Usually when people are interested in moving code out of loops they talk about loop-invariant code motion, or LICM. Contrary to what you might think, LICM is complementary to peeling: some things that peeling+CSE can hoist are not hoistable by LICM, and vice versa.

Unlike peeling, LICM does not cause code growth. Instead, for each expression in a loop, LICM tries to hoist it out of the loop if it can. An expression can be hoisted if all of these conditions are true:

  1. It doesn't cause the creation of an observably new object. In Scheme, the definition of "observable" is quite subtle, so in practice in Guile we don't hoist expressions that can cause any allocation. We could use alias analysis to improve this.

  2. The expression cannot throw an exception, or the expression is always evaluated for every loop iteration.

  3. The expression makes no writes to memory, or if it writes to memory, other expressions in the loop cannot possibly read from that memory. We use effects analysis for this.

  4. The expression makes no reads from memory, or if it reads from memory, no other expression in the loop can clobber those reads. Again, effects analysis.

  5. The expression uses only loop-invariant variables.

This definition is inductive, so once an expression is hoisted, the values it defines are then considered loop-invariant, so you might be able to hoist a whole chain of values.
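
As a sketch of that induction, in the same pseudocode as above (my example, not one from Guile): if x is loop-invariant, then t = x + 1 can be hoisted, which in turn makes u = t * 2 hoistable:

while foo:
  t = x + 1
  u = t * 2
  consume(u)

After LICM, both definitions migrate out of the loop:

t = x + 1
u = t * 2
while foo:
  consume(u)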

Compared to loop peeling, this has the gnarly aspect of having to explicitly reason about loop invariance and manually move code, which is a pain. (Really LICM would be better named "artisanal code motion".) However it causes no code growth, which is a plus, though like peeling it can increase register pressure. But the big difference is that LICM can hoist effect-free expressions that aren't always executed. Consider:

while foo:
  x = qux() ? "hi" : "ho"

Here for some reason it could be faster to cache "hi" or "ho" in registers, which is what LICM allows:

hi, ho = "hi", "ho"
while foo:
  x = qux() ? hi : ho

On the other hand, LICM alone can't hoist the if baz is null checks in this example from above:

while foo:
  bar()
  baz()

The issue is that the call to bar() might not return, so the error that might be thrown if baz is null shouldn't be observed until bar is called. In general we can't hoist anything that might throw an exception past some non-hoisted code that might throw an exception. This specific situation happens in Guile but there are similar ones in any language, I think.

More formally, LICM will hoist effectful but loop-invariant expressions that postdominate the loop header, whereas peeling hoists those expressions that dominate all back-edges. I think? We'll go with that. Again, the source.

loop inversion

Loop inversion is a little hack to improve code generation, and again it's a little counterintuitive. If you have this loop:

while n < x:
  n = n + 1

Loop inversion turns it into:

if n < x:
  do
    n = n + 1
  while n < x

The goal is that instead of generating code that looks like this:

  test n, x;
  branch-if-greater-than-or-equal done;
  n = n + 1
  goto header

You make something that looks like this:

  test n, x;
  branch-if-greater-than-or-equal done;
  n = n + 1
  test n, x;
  branch-if-less-than header;

The upshot is that the loop body now contains one branch instead of two. It's mostly helpful for tight loops.

It turns out that you can express this transformation on CPS (or SSA, or whatever), but that like loop peeling the extra branch introduces an extra join point in your program. If your loop exits to more than one label, then we have the same problems as loop peeling. For this reason Guile restricts loop inversion (which it calls "loop rotation" at the moment; I should probably fix that) to loops with only one exit continuation.

Loop inversion has some other caveats, but probably the biggest one is that in Guile it doesn't actually guarantee that each back-edge is a conditional branch. The reason is that usually a loop has some associated loop variables, and it could be that you need to reshuffle those variables when you jump back to the top. Mostly Guile's compiler manages to avoid shuffling, allowing inversion to have the right effect, but it's not guaranteed. Fixing this is not straightforward, since the shuffling of values is associated with the predecessor of the loop header and not the loop header itself. If instead we reshuffled before the header, that might work, but each back-edge might have a different shuffling to make... anyway. In practice inversion seems to work out fine; I haven't yet seen a case where it doesn't work. Source code here.

loop identification

One final note: what is a loop anyway? Turns out this is a somewhat hard problem, especially once you start trying to identify nested loops. Guile currently does the simple thing and just computes strongly-connected components in a function's flow-graph, and says that a loop is a non-trivial SCC with a single predecessor. That won't tease apart loop nests but oh wells! I spent a lot of time last year or maybe two years ago with that "Loop identification via D-J graphs" paper but in the end simple is best, at least for making incremental steps.

Okeysmokes, until next time, loop on!

by Andy Wingo at July 28, 2015 08:10 AM

July 27, 2015

Andy Wingo

cps soup

Hello internets! This blog goes out to my long-time readers who have followed my saga hacking on Guile's compiler. For the rest of you, a little history, then the new thing.

In the olden days, Guile had no compiler, just an interpreter written in C. Around 8 years ago now, we ported Guile to compile to bytecode. That bytecode is what is currently deployed as Guile 2.0. For many reasons we wanted to upgrade our compiler and virtual machine for Guile 2.2, and the result of that was a new continuation-passing-style compiler for Guile. Check that link for all the backstory.

So, I was going to finish documenting this intermediate language about 5 months ago, in preparation for making the first Guile 2.2 prereleases. But something about it made me really unhappy. You can catch some foreshadowing of this in my article from last August on common subexpression elimination; I'll just quote a paragraph here:

In essence, the scope tree doesn't necessarily reflect the dominator tree, so not all transformations you might like to make are syntactically valid. In Guile 2.2's CSE pass, we work around the issue by concurrently rewriting the scope tree to reflect the dominator tree. It's something I am seeing more and more and it gives me some pause as to the suitability of CPS as an intermediate language.

This is exactly the same concern that Matthew Fluet and Stephen Weeks had back in 2003:

Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.

As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y (i.e. without putting y in the scope of x)

[MLton-devel] CPS vs SSA

To be honest I think all this talk about dominators is distracting. Dominators are but a lightweight flow analysis, and I usually find myself using full-on flow analysis to compute the set of optimizations that I can do on a piece of code. In fact the only use I had for dominators in the nested CPS language was to rewrite scope trees! The salient part of Weeks' observation is that nested scope trees are the problem, not that dominators are the solution.

So, after literally years of hemming and hawing about this, I finally decided to remove nested scope trees from Guile's CPS intermediate language. Instead, a function is now a collection of labelled continuations, with one distinguished entry continuation. There is no more $letk term to nest continuations in each other. A program is now represented as a "soup" -- basically a map from labels to continuation bodies, again with a distinguished entry. As an example, consider this expression:

  return add(x, 1)

If we rewrote it in continuation-passing style, we'd give the function a name for its "tail continuation", ktail, and annotate each expression with its continuation:

function(ktail, x):
  add(x, 1) -> ktail

Here the -> ktail means that the add expression passes its values to the continuation ktail.

With nested CPS, it could look like:

function(ktail, x):
  letk have_one(one): add(x, one) -> ktail
    load_constant(1) -> have_one

Here the label have_one is in a scope, as is the value one. With "CPS soup", though, it looks more like this:

function(ktail, x):
  label have_one(one): add(x, one) -> ktail
  label main(x): load_constant(1) -> have_one

It's a subtle change, but it took a few months to make so it's worth pointing out what's going on. The difference is that there is no scope tree for labels or variables any more. A variable can be used at a label if it flows to the label, in a flow analysis sense. Indeed, determining the set of variables that can be used at a label requires flow analysis; that's what Weeks was getting at in his 2003 mail about the advantages of SSA, which are really the advantages of an intermediate language without nested scope trees.

The question arises, though, now that we've decided on CPS soup, how should we represent a program as a value? We've gone from a nested term to a graph term, and we need to find a way to represent it somehow that facilitates looking up labels by name, and facilitates tree rewrites.

In Guile's IR, labels and variables are both integers, so happily enough, we have such a data structure: Clojure-style maps specialized for integer keys.
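
For a taste, here is a hedged sketch of what using those intmaps looks like in Guile; I'm assuming the (language cps intmap) module and its empty-intmap, intmap-add and intmap-ref exports:

(use-modules (language cps intmap))

(define m (intmap-add empty-intmap 42 'some-continuation-body))
(intmap-ref m 42)   ;; => some-continuation-body

intmap-add returns a new map that shares structure with the old one, which is exactly the persistent, value-oriented behavior discussed below.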

Friends, if there has been one realization or revolution for me in the last year, it has been Clojure-style data structures. Here's why. In compilers, I often have to build up some kind of analysis, then use that analysis to transform data. Often I need to keep the old term around while I build a new one, but it would be nice to share state between old and new terms. With a nested tree, if a leaf changed you'd have to rebuild all surrounding terms, which is gnarly. But with Clojure-style data structures, more and more I find myself computing in terms of values: build up this value, transform this map to that set, fold over this map -- and yes, you can fold over Guile's intmaps -- and so on. By providing an expressive data structure for which I can control performance characteristics by using transients if needed, these data structures make my programs more about data and less about gnarly machinery.

As a concrete example, take the old contification pass in Guile: I didn't have the mental capacity to understand all the moving parts in such a way that I could compute an optimal contification from the beginning; instead I had to iterate to a fixed point, as Kennedy did in his "Compiling with Continuations, Continued" paper. With the new CPS soup language and with Clojure-style data structures, I could actually fit more of the algorithm into my head, with the result that Guile now contifies optimally while avoiding the fixed-point transformation. Also, the old pass used hash tables to represent the analysis, which I found incredibly confusing to reason about -- I totally buy Rich Hickey's argument that place-oriented programming is the source of many evils in programs, and hash tables are nothing if not a place party. Using functional maps let me solve harder problems because they are easier for me to reason about.

Contification isn't an isolated case, either. For example, we are able to do the complete set of optimizations from the "Optimizing closures in O(0) time" paper, including closure sharing, which I think makes Guile unique besides Chez Scheme. I wasn't capable of doing it on the old representation because it was just too hard for me to think about, because my data structures weren't right.

This new "CPS soup" language is still a first-order CPS language in that each term specifies its continuation, and that variable names appear in the continuation of a definition, not the definition itself. This effectively makes every variable a phi variable, in the sense of SSA, and you have to do some work to get to a variable's definition. It could be that still this isn't the right number of names; consider this function:

function foo(k, x):
  label have_y(y) bar(y) -> k
  label y_is_two() load_constant(2) -> have_y
  label y_is_one() load_constant(1) -> have_y
  label main(x) if x -> y_is_one else -> y_is_two

Here there is no distinguished name for the value load_constant(1) versus load_constant(2): both are possible values for y. If we ended up giving them names, we'd have to reintroduce actual phi variables for the joins, which would basically complete the transformation to SSA. Until now though I haven't wanted those names, so perhaps I can put this off. On the other hand, every term has a label, which simplifies many things compared to having to contain terms in basic blocks, as is usually done in SSA. Yet another chapter in CPS is SSA is CPS is SSA, it seems.

Welp, that's all the nerdery for right now. Talk at yall later!

by Andy Wingo at July 27, 2015 02:43 PM

July 25, 2015

Michael Catanzaro

Useful DuckDuckGo bangs

DuckDuckGo bangs are just shortcuts to redirect your search to another search engine. My personal favorites:

  • !gnomebugs — Runs a search on GNOME Bugzilla. Especially useful followed by a bug number. For example, search for ‘!gnomebugs 100000’ and see what you get.
  • !wkb — Same thing for WebKit Bugzilla.
  • !w — Searches Wikipedia.

There’s 6388 more, but those are the three I can remember. If you work on GNOME or WebKit, these are super convenient.

by Michael Catanzaro at July 25, 2015 01:47 AM

July 23, 2015

Xabier Rodríguez Calvar

ReadableStream almost ready

Hello dear readers! Long time no see! You might think that I have been lazy, and I was as far as blog posting goes, but I was coding like mad.

First remarkable thing is that I attended the WebKit Contributors Meeting that happened in March at the Apple campus in Cupertino as part of the Igalia gang. There we discussed, of course, the Streams API, its state and different implementation possibilities. Another very interesting point which would make me very happy would be the movement of the Mac port to CMake.

In a previous post I already introduced the concepts of the Streams API and some of its possible use cases so I’ll save you that part now. The news is that ReadableStream has its basic functionality complete. And what does that mean? It means that you can create a ReadableStream by providing the constructor with the underlying source and the strategy objects, read from it with its reader, and all the internal mechanisms of backpressure and so on will work according to the spec. Yay!
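
In code, that looks roughly like the following; this is a sketch following the shape of the spec draft (the constructor and reader details were still churning at the time, so take the exact names with a grain of salt):

var stream = new ReadableStream({
    start(controller) {
        controller.enqueue("hello");  // push a chunk from the underlying source
        controller.close();
    }
}, new CountQueuingStrategy({ highWaterMark: 1 }));

stream.getReader().read().then(function(result) {
    console.log(result.value, result.done);  // "hello", false
});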

Nevertheless, there’s still quite some work to do to complete the implementation of Streams API, like the implementation of byte streams, writable and transform streams, piping operations and built-in strategies (which is what I am on right now). I don’t know either when Streams API will be activated by default in the next builds of Safari, WebKitGTK+ or WebKit for Wayland, but we’ll make it at some point!

The code already went through lots of changes because we were still figuring out which architecture was best, and Youenn did an awesome job refactoring some things and providing support for promises in the bindings to make the implementation of ReadableStream more straightforward and less “custom”.

The implementation could still undergo quite some important changes: as part of my work implementing the strategies, some reviewers raised concerns about having the Streams API implemented inside WebCore in terms of IDL interfaces. I already have a proof of concept of CountQueuingStrategy and ByteLengthQueuingStrategy implemented inside JavaScriptCore, even a case where we use built-in JavaScript functions, which might help to keep closer to the spec if we can just include JavaScript code directly. We’ll see how we end up!

Last but not least I would like to thank Igalia for sponsoring me to attend the WebKit Contributors Meeting in Cupertino and also Adenilson for being so nice and taking us to very nice places for dinner and drinks that we wouldn’t be able to find ourselves (I owe you, promise to return the favor at the Web Engines Hackfest). It was also really nice to have the opportunity of quickly visiting New York City for some hours because of the long connection there which usually would be a PITA, but it was very enjoyable this time.

by calvaris at July 23, 2015 04:17 PM

July 16, 2015

Víctor Jáquez

GStreamer VA-API: A new release!

A new release of gstreamer-vaapi is now available!! Since we have been working on it for the last months, I would like to talk to you about it.

gstreamer-vaapi is a set of GStreamer plugins and libraries for hardware accelerated video processing using VA-API.

VA-API stands for “Video Acceleration – Application Programming Interface”, and is, simultaneously, a specification of an application programming interface, and its library implementation under the Open Source MIT license, whose purpose is to offer applications access to the GPU accelerated video processing capabilities, offloading those tasks from the CPU. Accelerated processing includes video decoding, sub-picture blending and rendering.

The library implementation (libva) is designed to be a front-end for many different back-ends: the Intel driver, plus bridges to other acceleration APIs such as VDPAU (used with NVidia hardware), among others.

As you can notice, only the Intel driver is well maintained, while the others haven’t been updated in years. And of course, this issue has had an impact on the bugs handled in gstreamer-vaapi (for example, bug #749554).

VA-API is designed to use various entry-points of the hardware accelerated processing in the GPU driver, such as VLD (also known as slice level acceleration), iDCT, among others.

The fact that VA-API uses slice level decode is important, since it is different from the software-based video decoders, where the decoding is at frame level. As a consequence, more information is required from the current video parsers (bug #691712).

But let us get back on track. gstreamer-vaapi, as we already said, is a set of GStreamer elements (vaapidecode, vaapipostproc, vaapisink, and several encoders) and a GObject-friendly library named libgstvaapi. This library wraps libva under GObject/GStreamer semantics. I might talk about this library in another opportunity.

vaapisink is a video sink with support for multiple display server protocols, such as X11 and Wayland.

UPDATE (2015/07/20): As Sree commented, vaapisink does not use GLX/EGL for rendering, it uses VA-API, but rather vaapidecode/vaapipostproc can deal with GLX/EGL contexts so they can share buffers with other OpenGL-based elements.

This element is the most efficient way to render the video processed by other VA-API elements such as decoders and post-processors, since no media transformations are required. If we want to render the stream in other video sinks, some extra operations might be necessary. The worst case scenario is where we need to download the whole image from the GPU memory onto the CPU one, and the overall performance of the pipeline might be impacted.

vaapidecode is a single decoding element that handles several codecs, depending on the back-end and the hardware available. The comprehensive list of possibly enabled codecs is:

  • MPEG-2 (simple and main profiles)
  • MPEG-4 (simple, main and advanced-simple profiles) / DivX / Xvid
  • H263 (baseline profiles)
  • H264 (baseline, constrained-baseline, main, high, multiview-high and stereo-high profiles)
  • H265 (main and main10)
  • WMV3 (simple and main profiles) / VC1
  • VP8 (version 0.3)
  • JPEG (baseline profile)

In the particular case of Intel chip-sets, the next table shows the codec support per each generation:

Skylake Y Y Y Y Y Y
Cherry View Y Y Y Y Y Y
Broadwell Y Y Y Y
Haswell Y Y Y Y
Ivybridge Y Y Y Y
Sandybridge Y Y Y
Ironlake Y Y
G4x Y

In the case of H264, there are multiple profiles, and every chip-set offers a different profile set. But for now, I will not dig into it further.

In other back-ends the support shall be different. For example, I have a box with an NVidia GeForce GT GPU, which supports MPEG2, MPEG4, H264 and VC1.

There is a branch for the G45 GPU that adds H264 decoding support. Nevertheless, it is not official, it doesn’t merge with current master anymore, and the supported output color space is NV12 only (bug #745660).

An interesting thing to know is that, because of the slice level decoding, gstreamer-vaapi does its own parsing inside the decoder, even if a video parser was plugged before.
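
To see the decoder and sink in action, a pipeline along these lines should work; this is only an illustrative example that assumes an H.264 video in an MP4 container at that path, and element availability depends on your back-end and hardware as described above:

gst-launch-1.0 filesrc location=video.mp4 ! qtdemux ! h264parse ! vaapidecode ! vaapisink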

vaapipostproc is a video transform element which can do color space conversions, de-interlacing, sharpening and many other types of video filtering, such as:

  • Color conversion
  • Resize / Scale
  • Noise reduction
  • De-interlacing
  • Sharpening
  • Color Balance
  • Skin tone enhancement

Also, the availability of this element and its filters depends on the back-end and the hardware. For example, the VDPAU back-end doesn’t provide any post-processing capability.

In the case of Intel’s chip-sets these are the provided post-processing capabilities:

CHIPSET Format Noise Deinterlace Sharp C.B. STE
Skylake Y Y Y Y Y Y
Cherry View Y Y Y Y Y Y
Broadwell Y Y Y Y Y Y
Haswell Y Y Y Y Y Y
Ivybridge Y Y Y
Sandybridge Y Y Y
Ironlake Y

In the case of the video encoders, they are split in different elements. Those implemented in gstreamer-vaapi are:

  • vaapiencode_mpeg2
  • vaapiencode_h264
  • vaapiencode_h265
  • vaapiencode_vp8
  • vaapiencode_jpeg

As far as I know, the only VA-API back-end that provides encoders is the Intel one, and this is the table of the encoders per chip-set:

Skylake Y Y Y Y Y
Cherry View Y Y Y
Broadwell Y Y
Haswell Y Y
Ivybridge Y Y
Sandybridge Y

vaapidecodebin is a new bin composed of vaapidecode, a queue and vaapipostproc. Its purpose is to bundle a complete decoding solution, with de-interlacing support. The purpose of the queue is to set the decoder to its full speed. The problem is that, since each back-end and chip-set might or might not support vaapipostproc, we have to check for it at run-time (bug #749554).

|                     vaapidecodebin                    |
|   (-------------)    (-------)    (---------------)   |
|-->| vaapidecode |--->| queue |--->| vaapipostproc |-->|
|   (-------------)    (-------)    (---------------)   |
|                                                       |

In my opinion, this should be just a temporary workaround until we have auto-plugging support for de-interlacers (bug #687182).

So far we have explained what GStreamer VA-API is. Let us now talk about what is new and hot in this release.

This is the short log summary since the last release, 0.5.10:

 1  Adrian Cox
 2  Alban Browaeys
30  Gwenole Beauchesne
 1  Jacobo Aragunde Pérez
 2  Jan Schmidt
 1  Julien Isorce
 1  Lim Siew Hoon
 1  Martin Sherburn
 4  Michael Olbrich
12  Olivier Crete
 3  Simon Farnsworth
74  Sreerenj Balachandran
75  Víctor Manuel Jáquez Leal
 1  Wind Yuan

The major changes in this release include:

  • HEVC (H265) decoding and encoding support (available only in Skylake and Cherry View chip-sets)
  • VP8 encoder (Skylake)
  • JPEG encoder (Skylake and Cherry View)
  • vaapidecodebin element, which we have talked about above.
  • Support for EGL in vaapidecode, vaapipostproc and vaapisink.
  • Skin tone enhancement support in vaapipostproc
  • Support for H.264 Multiview High profile encoding with more than 2 views, which encompasses Jan Schmidt’s efforts for stereoscopic / multi-view support in GStreamer (bug #611157)
  • A lot of improvements and rounding of rough corners all over the place. Suffice it to mention that we closed more than 70 bug reports along this cycle

Now, please test this new release, enjoy it, and if you find something ugly, let us know!!!

by vjaquez at July 16, 2015 05:45 PM

July 10, 2015

Claudio Saavedra

Fri 2015/Jul/10

It's summer! That means that, if you are a student, you could be one of our summer interns in Igalia this season. We have two positions available: the first related to WebKit work and the second to web development. Both positions can be filled in either of our locations in Galicia or you can work remotely from wherever you prefer (plenty of us work remotely, so you'll have to communicate with some of us via jabber and email anyway).

Have a look at the announcement in our web page for more details, and don't hesitate to contact me if you have any doubt about the internships!

July 10, 2015 07:26 AM

July 08, 2015

Iago Toral

Implementing ARB_shader_storage_buffer

In my previous post I introduced ARB_shader_storage_buffer, an OpenGL 4.3 feature that is coming soon to Mesa and the Intel i965 driver. While that post focused on explaining the features introduced by the extension, in this post I’ll dive into some of the implementation aspects, for those who are curious about this kind of stuff. Be warned that some parts of this post will be specific to Intel hardware.

Following the trail of UBOs

As I explained in my previous post, SSBOs are similar to UBOs, but they are read-write. Because there is a lot of code already in place in Mesa’s GLSL compiler to deal with UBOs, it made sense to try to reuse the data structures and code we had for UBOs and specialize the behavior for SSBOs where needed; that let us build on code paths that were already working well.

That path, however, had some issues that bit me a bit further down the road. When it comes to representing these operations in the IR, my first idea was to follow the trail of UBO loads as well, which are represented as ir_expression nodes. There is a fundamental difference between the two though: UBO loads are constant operations because uniform buffers are read-only. This means that a UBO load operation with the same parameters will always return the same value. This has implications related to certain optimization passes that work based on the assumption that other ir_expression operations share this feature. SSBO loads are not like this: since the shader storage buffer is read-write, two identical SSBO load operations in the same shader may not return the same result if the underlying buffer storage has been altered in between by SSBO write operations within the same or other threads. This forced me to alter a number of optimization passes in Mesa to deal with this situation (mostly disabling them for the cases of SSBO loads and stores).

The situation was worse with SSBO stores. These just did not fit into ir_expression nodes: they did not return a value and had side-effects (memory writes) so we had to come up with a different way to represent them. My initial implementation created a new IR node for these, ir_ssbo_store. That worked well enough, but it left us with an implementation of loads and stores that was a bit inconsistent since both operations used very different IR constructs.

These issues were made clear during the review process, where it was suggested that we used GLSL IR intrinsics to represent load and store operations instead. This has the benefit that we can make the implementation more consistent, having both loads and stores represented with the same IR construct and follow a similar treatment in both the GLSL compiler and the i965 backend. It would also remove the need to disable or alter certain optimization passes to be SSBO friendly.

Read/Write coherence

One of the issues we detected early in development was that our reads and writes did not seem to work very well together: sometimes a read after a write would fail to see the last value written to a buffer variable. The problem here also stemmed from following the implementation trail of the UBO path. In the Intel hardware, there are various interfaces to access memory, like the Sampling Engine and the Data Port. The former is a read-only interface and is used, for example, for texture and UBO reads. The Data Port allows for read-write access. Although both interfaces give access to the same memory region, there is something to consider here: if you mix reads through the Sampling Engine and writes through the Data Port you can run into cache coherence issues; this is because the caches in use by the Sampling Engine and the Data Port functions are different. Initially, we implemented SSBO load operations like UBO loads, so we used the Sampling Engine, and ended up running into this problem. The solution, of course, was to rewrite SSBO loads to go through the Data Port as well.

Parallel reads and writes

GPUs are highly parallel hardware and this has some implications for driver developers. Take a statement like this in a fragment shader program:

float cx = 1.0;

This is a simple assignment of the value 1.0 to variable cx that is supposed to happen for each fragment produced. In Intel hardware running in SIMD16 mode, we process 16 fragments simultaneously in the same GPU thread, this means that this instruction is actually 16 elements wide. That is, we are doing 16 assignments of the value 1.0 simultaneously, each one is stored at a different offset into the GPU register used to hold the value of cx.

If cx was a buffer variable in a SSBO, it would also mean that the assignment above should translate to 16 memory writes to the same offset into the buffer. That may seem a bit absurd: why would we want to write 16 times if we are always assigning the same value? Well, because things can get more complex, like this:

float cx = gl_FragCoord.x;

Now we are no longer assigning the same value to all fragments; each of the 16 values assigned with this instruction could be different. If cx was a buffer variable inside a SSBO, then we could potentially be writing 16 different values to it. It is still a bit silly, since only one of the values (the one we write last) would prevail.

Okay, but what if we do something like this?:

int index = int(mod(gl_FragCoord.x, 8));
cx[index] = 1;

Now, depending on the value we are reading for each fragment, we are writing to a separate offset into the SSBO. We still have a single assignment in the GLSL program, but that translates to 16 different writes, and in this case the order may not be relevant, but we want all of them to happen to achieve correct behavior.

The bottom line is that when we implement SSBO load and store operations, we need to understand the parallel environment in which we are running and work with test scenarios that allow us to verify correct behavior in these situations. For example, if we only test scenarios with assignments that give the same value to all the fragments/vertices involved in the parallel instructions (i.e. assignments of values that do not depend on properties of the current fragment or vertex), we could easily overlook fundamental defects in the implementation.

Dealing with helper invocations

From Section 7.1 of the GLSL spec version 4.5:

“Fragment shader helper invocations execute the same shader code
as non-helper invocations, but will not have side effects that
modify the framebuffer or other shader-accessible memory.”

To understand what this means I have to introduce the concept of helper invocations: certain operations in the fragment shader need to evaluate derivatives (explicitly or implicitly) and for that to work well we need to make sure that we compute values for adjacent fragments that may not be inside the primitive that we are rendering. The fragment shader executions for these added fragments are called helper invocations, meaning that they are only needed to help in computations for other fragments that are part of the primitive we are rendering.

How does this affect SSBOs? Because helper invocations are not part of the primitive, they cannot have side-effects, after they had served their purpose it should be as if they had never been produced, so in the case of SSBOs we have to be careful not to do memory writes for helper fragments. Notice also, that in a SIMD16 execution, we can have both proper and helper fragments mixed in the group of 16 fragments we are handling in parallel.

Of course, the hardware knows if a fragment is part of a helper invocation or not and it tells us about this through a pixel mask register that is delivered with all executions of a fragment shader thread, this register has a bitmask stating which pixels are proper and which are helper. The Intel hardware also provides developers with various kinds of messages that we can use, via the Data Port interface, to write to memory, however, the tricky thing is that not all of them incorporate pixel mask information, so for use cases where you need to disable writes from helper fragments you need to be careful with the write message you use and select one that accepts this sort of information.

Vector alignments

Another interesting thing we had to deal with is address alignment. UBOs work with layout std140. In this setup, elements in the UBO definition are aligned to 16-byte boundaries (the size of a vec4). It turns out that GPUs can usually optimize reads and writes to multiples of 16 bytes, so this makes sense. However, as I explained in my previous post, SSBOs also introduce a packed layout mode known as std430.

Intel hardware provides a number of messages that we can use through the Data Port interface to write to memory. Each message has different characteristics that makes it more suitable for certain scenarios, like the pixel mask I discussed before. For example, some of these messages have the capacity to write data in chunks of 16-bytes (that is, they write vec4 elements, or OWORDS in the language of the technical docs). One could think that these messages are great when you work with vector data types, however, they also introduce the problem of dealing with partial writes: what happens when you only write to an element of a vector? or to a buffer variable that is smaller than the size of a vector? what if you write columns in a row_major matrix? etc

In these scenarios, using these messages introduces the need to mask the writes because you need to disable the channels in the vec4 element that you don’t want to write. Of course, the hardware provides means to do this, we only need to set the writemask of the destination register of the message instruction to select the right channels. Consider this example:

struct TB {
    float a, b, c, d;
};

layout(std140, binding=0) buffer Fragments {
   TB s[3];
   int index;
};

void main()
{
   s[0].d = -1.0;
}

In this case, we could use a 16-byte write message that takes 0 as offset (i.e writes at the beginning of the buffer, where s[0] is stored) and then set the writemask on that instruction to WRITEMASK_W so that only the fourth data element is actually written, this way we only write one data element of 4 bytes (-1) at offset 12 bytes (s[0].d). Easy, right? However, how do we know, in general, the writemask that we need to use? In std140 layout mode this is easy: since each element in the SSBO is aligned to a 16-byte boundary, we simply need to take the byte offset at which we are writing, divide it by 16 (to convert it to units of vec4) and the modulo of that operation is the byte offset into the chunk of 16-bytes that we are writing into, then we only have to divide that by 4 to get the component slot we need to write to (a number between 0 and 3).
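
Since that arithmetic is easy to get wrong, here it is spelled out as a small sketch in C; this is my illustration of the computation just described, not the actual driver code, and the WRITEMASK_* naming follows the Mesa convention of one bit per component:

/* std140: which vec4 component does a 4-byte write at byte_offset hit? */
static unsigned writemask_for_offset(unsigned byte_offset)
{
   unsigned component = (byte_offset % 16) / 4;   /* 0=X, 1=Y, 2=Z, 3=W */
   return 1u << component;                        /* e.g. 12 -> bit 3 (W) */
}

For s[0].d, stored at byte offset 12, this yields component 3, that is, WRITEMASK_W, matching the example above.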

However, there is a restriction: we can only set the writemask of a register at compile/link time, so what happens when we have something like this?:

s[i].d = -1.0;

The problem with this is that we cannot evaluate the value of i at compile/link time, which inevitably makes our solution invalid for this. In other words, if we cannot evaluate the actual value of the offset at which we are writing at compile/link time, we cannot use the writemask to select the channels we want to use when we don’t want to write a vec4 worth of data and we have to use a different type of message.

That said, in the case of std140 layout mode, since each data element in the SSBO is aligned to a 16-byte boundary you may realize that the actual value of i is irrelevant for the purpose of the modulo operation discussed above and we can still manage to make things work by completely ignoring it for the purpose of computing the writemask, but in std430 that trick won’t work at all, and even in std140 we would still have row_major matrix writes to deal with.

Also, we may need to tweak the message depending on whether we are running on the vertex shader or the fragment shader, because not all message types have appropriate SIMD modes (SIMD4x2, SIMD8, SIMD16, etc) for both, or because different hardware generations may not have all the message types or support all the SIMD modes we need, etc.

The point of this is that selecting the right message to use can be tricky, there are multiple things and corner cases to consider and you do not want to end up with an implementation that requires using many different messages depending on various circumstances because of the increasing complexity that it would add to the implementation and maintenance of the code.

Closing notes

This post did not cover all the intricacies of the implementation of ARB_shader_storage_buffer_object, I did not discuss things like the optional unsized array or the compiler details of std430 for example, but, hopefully, I managed to give an idea of the kind of problems one would have to deal with when coding driver support for this or other similar features.

by Iago Toral at July 08, 2015 07:02 AM

July 03, 2015

Andy Wingo

Pfmatch, a packet filtering language embedded in Lua

Greets, hackers! I just finished implementing a little embedded language in Lua and wanted to share it with you. First, a bit about the language, then some notes on how it works with Lua to reach the high performance targets of Snabb Switch.

the pfmatch language

Pfmatch is a language designed for filtering, classifying, and dispatching network packets in Lua. Pfmatch is built on the well-known pflang packet filtering language, using the fast pflua compiler for LuaJIT.

Here's an example of a simple pfmatch program that just divides up packets depending on whether they are TCP, UDP, or something else:

match {
   tcp => handle_tcp
   udp => handle_udp
   otherwise => handle_other
}

Unlike pflang filters written for such tools as tcpdump, a pfmatch program can dispatch packets to multiple handlers, potentially destructuring them along the way. In contrast, a pflang filter can only say "yes" or "no" on a packet.

Here's a more complicated example that passes all non-IP traffic, drops all IP traffic that is not going to or coming from certain IP addresses, and calls a handler on the rest of the traffic.

match {
   not ip => forward
   ip src => incoming_ip
   ip dst => outgoing_ip
   otherwise => drop
}

In the example above, the handlers after the arrows (forward, incoming_ip, outgoing_ip, and drop) are Lua functions. The part before the arrow (not ip and so on) is a pflang expression. If the pflang expression matches, its handler will be called with two arguments: the packet data and the length. For example, if the not ip pflang expression is true on the packet, the forward handler will be called.

It's also possible for the handler of an expression to be a sub-match:

match {
   not ip => forward
   ip src => {
      tcp => incoming_tcp(&ip[0], &tcp[0])
      udp => incoming_udp(&ip[0], &udp[0])
      otherwise => incoming_ip(&ip[0])
   }
   ip dst => {
      tcp => outgoing_tcp(&ip[0], &tcp[0])
      udp => outgoing_udp(&ip[0], &udp[0])
      otherwise => outgoing_ip(&ip[0])
   }
   otherwise => drop
}

As you can see, the handlers can also have additional arguments, beyond the implicit packet data and length. In the above example, if not ip doesn't match, then ip src matches, then tcp matches, then the incoming_tcp function will be called with four arguments: the packet data as a uint8_t* pointer, its length in bytes, the offset of byte 0 of the IP header, and the offset of byte 0 of the TCP header. An argument to a handler can be any arithmetic expression of pflang; in this case &ip[0] is actually an extension. More on that later. For language lawyers, check the syntax and semantics over in our source repo.
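
Concretely, such a handler is just a Lua function with that arity. Here's a sketch (the body is mine, only to show the calling convention):

local function incoming_tcp(data, len, ip_base, tcp_base)
   -- data is a uint8_t* cdata pointer, len the packet length in bytes;
   -- ip_base and tcp_base are the offsets of byte 0 of the IP and TCP headers.
   local ttl = data[ip_base + 8]   -- e.g. peek at the IPv4 TTL field
   return ttl > 0
end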

Thanks especially to my colleague Katerina Barone-Adesi for long backs and forths about the language design; they really made it better. Fistbump!

pfmatch and lua

The challenge of designing pfmatch is to gain expressiveness, compared to writing filters by hand, while not endangering the performance targets of Pflua and Snabb Switch. These days Snabb is on target to give ASIC-driven network appliances a run for their money, so anything we come up with cannot sacrifice speed.

In practice what this means is compile, don't interpret. Using the pflua compiler allows us to generalize the good performance that we have gotten on pflang expressions to a multiple-dispatch scenario. It's a pretty straightforward strategy. Naturally though, the interface with Lua is more complex now, so to understand the performance we should understand the interaction with Lua.

How does one make two languages interoperate, anyway? With pflang it's pretty clear: you compile pflang to a Lua function, and call the Lua function to match on packets. It returns true or false. It's a thin interface. Indeed with pflang and pflua you could just match the clauses in order:

not_ip = pf.compile('not ip')
incoming = pf.compile('ip src 1.2.3.4')
outgoing = pf.compile('ip dst 5.6.7.8')

function handle(packet, len)
   if not_ip(packet, len) then return forward(packet, len)
   elseif incoming(packet, len) then return incoming_ip(packet, len)
   elseif outgoing(packet, len) then return outgoing_ip(packet, len)
   else return drop(packet, len) end
end

Not only is this tedious, but you don't get easy access to the packet itself, and you're missing out on opportunities for optimization. For example, if the packet fails the not_ip check, we don't need to check again whether it's an IP packet in the incoming check. Compiling a pfmatch program takes advantage of pflua's optimizer to produce good code for the match expression as a whole.

If this were Scheme I would make the right-hand side of an arrow be an expression and implement pfmatch as a macro; see Racket's match documentation for an example. In Lua or other languages that's harder to do; you would have to parse Lua, and it's not clear which parts of the production as a whole are the host language (Lua) and which are the embedded language (pfmatch).

Instead, I think embedding host language snippets by function name is a fine solution. It seems fairly clear that incoming_ip, for example, is some kind of function. It's easy to parse identifiers in an embedded language, both for humans and for programs, so that takes away a lot of implementation headache and cognitive overhead.

We are left with a few problems: how to map names to functions, what to do about the return value of match expressions, and how to tie it all together in the host language. Again, if this were Scheme then I'd use macros to embed expressions into the pfmatch term, and their names would be scoped into whatever environment the match term was defined. In Lua, the best way to implement a name/value mapping is with a table. So we have:

local handlers = {
   forward = function(data, len) --[[ ... ]] end,
   drop = function(data, len) --[[ ... ]] end,
   incoming_ip = function(data, len) --[[ ... ]] end,
   outgoing_ip = function(data, len) --[[ ... ]] end
}

Then we will pass the handlers table to the matcher function, and the matcher function will call the handlers by name. LuaJIT will mostly take care of the overhead of the table dispatch. We compile the filter like this:

local match = require('pf.match')

local dispatcher = match.compile([[match {
   not ip => forward
   ip src 1.2.3.4 => incoming_ip
   ip dst 5.6.7.8 => outgoing_ip
   otherwise => drop
}]])

To use it, you just invoke the dispatcher with the handlers, data, and length, and the return value is whatever the handler returns. Here let's assume it's a boolean.

function loop(self)
   local i, o = self.input.input, self.output.output
   while not link.empty(i) do
      local pkt = link.receive(i)
      if dispatcher(handlers,, pkt.length) then
         link.transmit(o, pkt)
      end
   end
end

Finally, we're ready for an example of a compiled matcher function. Here's what pflua does with the match expression above:

local cast = require("ffi").cast
return function(self,P,length)
   if length < 14 then return self.forward(P, length) end
   if cast("uint16_t*", P+12)[0] ~= 8 then return self.forward(P, length) end
   if length < 34 then return self.drop(P, length) end
   if P[23] ~= 6 then return self.drop(P, length) end
   if cast("uint32_t*", P+26)[0] == 67305985 then return self.incoming_ip(P, length) end
   if cast("uint32_t*", P+30)[0] == 134678021 then return self.outgoing_ip(P, length) end
   return self.drop(P, length)
end

The result is a pretty good dispatcher. There are always things to improve, but it's likely that the function above is better than what you would write by hand, and it will continue to get better as pflua improves.

Getting back to what I mentioned earlier, when we write filtering code by hand, we inevitably end up writing interpreters for some kind of filtering language. Network functions are essentially linguistic in nature: static appliances are no good because network topologies change, and people want solutions that reflect their problems. Usually this means embedding an interpreter for some embedded language, for example BPF bytecode or iptables rules. Using pflua and pfmatch expressions, we can instead compile a filter suited directly to the problem at hand -- and while we're at it, we can stop worrying about pesky offsets, constants, and bit-shifts.


I'm optimistic about pfmatch or something like it being a success, but there are some challenges too.

One challenge is that pflang is pretty weird. For example, attempting to access ip[100] will abort a filter immediately on a packet that is less than 100 bytes long, not including L2 encapsulation. It's wonky semantics, and in the context of pfmatch, aborting the entire pfmatch program would obviously be the wrong thing. That would abort too much. Instead it should probably just fail the pflang test in which that packet access appears. To this end, in pfmatch we turn those aborts into local expression match failures. However, this leads to an inconsistency with pflang. For example in (ip[100000] == 0 or (1==1)), instead of ip[100000] causing the whole pflang match to fail, it just causes the local test to fail. This leaves us with 1==1, which passes. We abort too little.

This inconsistency is probably a bug. We want people to be able to test clauses with vanilla pflang expressions, and have the result match the pfmatch behavior. Due to limitations in some of pflua's intermediate languages, it's likely to persist for a while. It is the only inconsistency that I know of, though.
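To make that concrete, here is a sketch of the two behaviors side by side, using the pf and pf.match modules shown earlier (the matched/unmatched handler names are mine):

local pf = require('pf')
local match = require('pf.match')

-- Standalone pflang: the out-of-range access aborts the whole filter,
-- so the compiled function just says "no" to such packets.
local standalone = pf.compile('ip[100000] == 0 or (1==1)')

-- In pfmatch, the failed access only fails the local test; the (1==1)
-- arm still passes, so `matched` gets called. We abort too little.
local dispatcher = match.compile([[match {
   ip[100000] == 0 or (1==1) => matched
   otherwise => unmatched
}]])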

Pflang is also underpowered in many ways. It has terrible IPv6 support; for example, tcp[0] only matches IPv4 packets, and at least as implemented in libpcap, most payload access on IPv6 packets does the wrong thing regarding chained extension headers. There is no facility in the language for binding names to intermediate results, there is no linguistic facility for talking about fragmentation, no ability to address IP source and destination addresses in arithmetic expressions by name, and so on. We can solve these in pflua with extensions to the language, but that introduces incompatibilities with pflang.

You might wonder why to stick with pflang, after all of this. If this is you, Juho Snellman wrote a great article on this topic, just for you: What's wrong with pcap filters.

Pflua's optimizer has mostly helped us, but there have been places where it could be more helpful. When compiling just one expression, you can often end up figuring out which branches are dead-ends, which helps the rest of the optimization to proceed. With more than one successful branch, we had to make a few improvements to the optimizer to actually get decent results. We also had to relax one restriction on the optimizer: usually we only permit transformations that make the code smaller. This way we know we're going in the right direction and will eventually terminate. However because of reasons™ we did decide to allow tail calls to be duplicated, so instead of having just one place in the match function that tail-calls a handler, you can end up with multiple calls. I suspect using a tracing compiler will largely make this moot, as control-flow splits effectively lead to trace duplication anyway, and making sure control-flow joins later doesn't effectively counter that. Still, I suspect that the resulting trace shape will rejoin only at the loop head, instead of in some intermediate point, which is probably OK.


With all of these concerns, is pfmatch still a win? Yes, probably! We're going to start using it when building Snabb apps, and will see how it goes. We'll probably end up adding a few more pflang extensions before we're done. If it's something you're into, snabb-devel is the place to try it out, and see you on the bug tracker. Happy packet hacking!

by Andy Wingo at July 03, 2015 11:05 AM

June 29, 2015

Manuel Rego

CSS Grid Layout is just around the corner (CSSConf US 2015)

Coming back to real life after a wonderful week in New York City is not that easy, but here we are on the other side of the pond, writing about CSS Grid Layout again.

First, kudos to Bocoup for the CSSConf US 2015 organization. Especially to Adam Sontag and the rest of the conference staff; you were really supportive during the whole week. And the videos with live transcripts were available just a few days after the conference, awesome job! The only issue was the internet connection, which was really flaky.

So, yeah I attended CSSConf this year, but not only that, I was also speaking about CSS Grid Layout and the video of my talk is already online together with the slides.

During the talk I described the basic concepts, syntax and features of CSS Grid with different live coding examples. Then I tried to explain the main tasks that the browser has to do in order to render a grid and gave some tips about grid performance. Finally, we reviewed the browsers adoption and the status of Chromium/Blink and Safari/WebKit implementations that Igalia is doing.

CSS Grid Layout is just around the corner talk sketchnotes by Susan

The feedback about my talk was incredibly positive and everybody seemed really excited about what CSS Grid Layout can bring to the web platform. Big thanks to you all!

Of course, there were other great talks at CSSConf, as you can check in the videos. Off the top of my head, I loved the one by Lea Verou, an impressive talk as usual, where she even released a polyfill for conic gradients on stage. SVG and animations had two nice talks by Chris Coyier and Sarah Drasner. PostCSS and inline styles were also hot topics. Responsive (and responsible!) images, Fun.css and CSS? WTF! were also great (and I'm probably forgetting some others).

Lastly, on Thursday night we attended BrooklynJS, which had a great panel discussing CSS. The inline styles vs stylesheets topic became hot, as projects like React are moving people away from stylesheets. Chris Coyier (one of the panelists and also a speaker at CSSConf) wrote a nice post last week giving a good overview of this topic. Also, The Four Fives were amazing!

On top of that, as part of the collaboration between Igalia and Bloomberg, I was visiting their fancy office in Manhattan. I had a great time there talking about grids with several people from the team. They really believe that CSS Grid Layout will change the future of the web, benefiting lots of people in different use cases, and hopefully helping to alleviate performance issues in complex scenarios.

Igalia and Bloomberg working together to build a better web

Looking forward to the next opportunity to talk about CSS Grid Layout. We'll keep up the hard work to make it a reality as soon as possible!

June 29, 2015 10:00 PM

June 24, 2015

Javier Fernández

Performance analysis of Grid Layout

Now that we have a quite complete implementation of the CSS Grid Layout specification, it's time to take care of performance analysis and optimizations. In this essay, which is the first of a series of posts about performance, I'll briefly introduce how to use the Blink (Chrome) and WebKit (Safari) performance analysis tools, describe some of the most interesting cases I've seen during my work on the implementation of this spec and, finally, present a basic case comparing the Flexbox and Grid layout models, which I'd like to evolve and analyze further in the coming months.

Performance analysis tools

Both the WebKit and Blink projects provide several useful and easy-to-use (Python) scripts to run a set of test cases, take different measurements and do some early analysis. They were written before the fork; that's why the related documentation can be found on WebKit's trac, but both engines still use them, for the time being.


There is a wide set of performance tests under the PerformanceTests folder, at Blink's/WebKit's root directory, and even though both engines share a substantial number of tests, there are some differences.

(blink’s root directory) $ ls PerformanceTests/
Bindings BlinkGC Canvas CSS DOM Dromaeo Events inspector Layout Mutation OWNERS Parser resources ShadowDOM Skipped SunSpider SVG XMLHttpRequest XSSAuditor

The Chromium project has introduced a new performance tool, called Telemetry, which, in addition to running the above-mentioned tests, is designed to execute more complex cases, like running specific page sets or benchmarking against a preset recording (WebPageReplay). It's also possible to send patches to the performance try bots directly from the gclient or git (depot_tools) command line; there is quite a lot of information available in the Chromium project's documentation.

Regarding profiling tools, it's possible in both WebKit and Blink to use the --profiler option when running the performance tests, so we can collect profiling data. However, while WebKit recommends perf for Linux, Google's Blink engine provides some alternatives.
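For example, from the root of a checkout, something along these lines collects profiling data while running the Layout tests; treat the exact script paths and flags as a sketch and double-check them in your tree:

$ Tools/Scripts/run-perf-tests --profiler=perf PerformanceTests/Layout
$ tools/perf/run_benchmark <benchmark>   # the Telemetry route in Chromium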

CSS Grid Layout performance tests and current status

When implementing a new browser feature, it's not easy to measure performance while the code evolves so much and so quickly and, what's worse, to catch regressions introduced by new logic. When the feature's syntax changes or there is missing or incomplete functionality, it's not always possible to establish a well-defined baseline for performance. It's also a tough decision to determine which use cases we should care about; obviously the faster the better, but adding performance optimizations usually complicates code, may affect its robustness and could lead to unexpected and, even worse, hard-to-find bugs.

At the time of this writing, we had 3 basic performance tests, covering the auto-sized, fixed-sized and stretched grid scenarios described below.

Why did we select those use cases to measure and keep track of performance regressions? First of all, note that auto-sizing is one of the most expensive branches inside the grid track sizing algorithm, so we are really interested in both improving it and keeping track of regressions on this code path.

body {
    display: grid;
    grid-template-rows: repeat(100, auto);
    grid-template-columns: repeat(20, auto);
}
.gridItem {
    height: 200px;
    width: 200px;
}

On the other hand, fixed-sized is the easiest/fastest path of the algorithm, so besides the importance of avoiding regressions (when possible), it’s also a good case to compare with auto-sized.

body {
    display: grid;
    grid-template-rows: repeat(100, 200px);
    grid-template-columns: repeat(20, 200px);
}
.gridItem {
    height: 200px;
    width: 200px;
}

Finally, a stretching use case was added because stretch is the default alignment value for grid items, and the two test cases already described use fixed-size items, hence no stretch happens (even though items fill the whole grid cell area). Given that I implemented the CSS Box Alignment support for grid, I was conscious of how expensive the stretching logic is, so I considered it an important use case to analyze and optimize as much as possible. Actually, I had already introduced several optimizations because the early implementation was quite slow, around 40% slower than using any other basic alignment (start, end, center). We will talk more about this later, when we analyze a case comparing Flexbox and Grid layout performance.

body {
    display: grid;
    grid-template-rows: repeat(100, 200px);
    grid-template-columns: repeat(20, 200px);
}
.gridItem {
    height: auto;
    width: auto;
}

The basic HTML body of these 3 tests is quite simple because we want to analyze the performance of very specific parts of the Grid Layout logic, in order to detect regressions in sensitive code paths. We'd like to eventually have some real use cases to analyze and create many more performance tests, but the Chrome performance platform is definitely not the place to do so. The following graphs show the performance evolution during 2015 for the 3 tests we have defined so far.


Note that the yellow trace shows data taken from a reference build, so we can discount temporary glitches on the machine running the performance tests of the target build, whose results are shown in the blue trace; this reference trace is also useful for detecting invalid regression alerts.

Why is performance so different for these cases?

The 3 tests we have for Grid Layout use runs/second values as a way to measure performance; this is the preferred method for both the WebKit and Blink engines because we can detect regressions with relatively small tests. It's possible, though, to do other kinds of measurements. Looking at the graphs above, we can extract the following data:

  • auto-sized grid: around 650 runs/sec
  • fixed-sized grid: around 1400 runs/sec
  • fixed-sized stretched grid: around 1250 runs/sec

Before analyzing possible causes of the performance drop in each case, I defined some additional tests to stress these 3 cases even more, so we can see how grid size affects the obtained results. I defined 20 tests for these cases, each one with a different number of grid items, from 10×10 up to 200×200 grids. I ran those tests on my own laptop, so let's take the absolute numbers of each case with a grain of salt, although the differences between these 3 scenarios should be coherent. The table below shows some numeric results of this experiment.
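For illustration, the generated pages are conceptually as simple as the following sketch (a hypothetical generator, not the actual script from our repo):

// Build an NxN grid of items using one of the 3 container variants.
function makeGrid(n, className) {
    var container = document.createElement('div');
    container.className = className; // auto-sized, fixed-sized or stretched
    for (var i = 0; i < n * n; i++) {
        var item = document.createElement('div');
        item.className = 'gridItem';
        container.appendChild(item);
    }
    return container;
}
document.body.appendChild(makeGrid(100, 'grid')); // e.g. the 100x100 case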


First of all, recall that these 3 tests produce the same web visualization, consisting of grids with NxN items of 100px each. The only difference is the grid layout strategy used to produce that result: auto-sizing, fixed-sizing or stretching. Focusing on the previous table's data, we can evaluate the cost, in terms of layout performance, of using auto-sized tracks to define the grid (which may be the only solution for certain cases). The performance drop even grows with the number of grid items, but we can conclude that it stabilizes around 60%. On the other hand, stretching is also slower but, unlike auto-sizing, its performance drop does not depend strongly on grid size; it stays more or less constant around 15%.


Impact of auto-sized tracks on layout performance

Basically, the track sizing algorithm can be described in the following 4 steps:

  • 1- Initialize per Grid track variables.
  • 2- Resolve content-based TrackSizingFunctions.
  • 3- Grow all Grid tracks in GridTracks from their baseSize up to their growthLimit value until freeSpace is exhausted.
  • 4- Grow all Grid tracks having a fraction as the MaxTrackSizingFunction.

These steps are executed twice: a first cycle to determine the column tracks' sizes and a second cycle to set the row tracks' sizes, which may depend on the grid's width. When using just fixed-sized tracks, as in the very simple case we are testing, the only computation required to determine the grid's size is completing step 1 and determining the free available space based on the specified fixed-size values of each track.

// 1. Initialize per Grid track variables.
for (size_t i = 0; i < tracks.size(); ++i) {
    GridTrack& track = tracks[i];
    GridTrackSize trackSize = gridTrackSize(direction, i);
    const GridLength& minTrackBreadth = trackSize.minTrackBreadth();
    const GridLength& maxTrackBreadth = trackSize.maxTrackBreadth();
    track.setBaseSize(computeUsedBreadthOfMinLength(direction, minTrackBreadth));
    track.setGrowthLimit(computeUsedBreadthOfMaxLength(direction, maxTrackBreadth, track.baseSize()));
    if (trackSize.isContentSized())
        sizingData.contentSizedTracksIndex.append(i); // remember the tracks step 2 must resolve
    if (trackSize.maxTrackBreadth().isFlex())
        flexibleSizedTracksIndex.append(i); // remember the tracks step 4 must grow
}

for (const auto& track: tracks)
    freeSpace -= track.baseSize();

Focusing now on the auto-sized scenario, we will have the overhead of resolving content-sized functions for all the grid items.

// 2. Resolve content-based TrackSizingFunctions.
if (!sizingData.contentSizedTracksIndex.isEmpty())
    resolveContentBasedTrackSizingFunctions(direction, sizingData);

I didn't add the source code of resolveContentBasedTrackSizingFunctions because it's quite complex, but basically it implies a cost proportional to the number of grid tracks (a minimum of 2x), in order to determine the minContent and maxContent values of each grid item. It might imply additional computation overhead when using spanning items; it would require sorting them based on their spanning value and iterating over them again to resolve their content-sized functions.

Some issues may be interesting to analyze in the future:

  • How much does each content-sized track cost?
  • What is the impact on performance of using flexible-sized tracks? Would it be the worst-case scenario? Considering it requires following all four steps of the track sizing algorithm, it likely will be.
  • What are the performance implications of using spanning items?

Why is stretching such a performance drain?

This is an interesting issue, given that stretch is the default value for both Grid and Flexbox items. Actually, it's the root cause of why Grid beats Flexbox in terms of layout performance in the cases where stretch alignment is used. As I'll explain later, Flexbox doesn't have the optimizations I've implemented for Grid Layout.

Stretching logic takes place during the grid container's layout operations, after all tracks have their sizes precisely determined and we have properly computed all grid tracks' positions relative to the grid container. It happens before the alignment logic is executed, because stretching may imply changing some grid items' sizes, in which case they will be marked for layout (if they weren't already).

Obviously, stretching only takes place when the corresponding self-alignment properties (align-self, justify-self) have either auto or stretch as their value, but there are other conditions that must be fulfilled to trigger this operation:

  • box’s computed width/height (as appropriate to the axis) is auto.
  • neither of its margins (in the appropriate axis) are auto
  • still respecting the constraints imposed by min-height/min-width/max-height/max-width

In that scenario, stretching logic implies the following operations:

LayoutUnit stretchedLogicalHeight = availableAlignmentSpaceForChildBeforeStretching(gridAreaBreadthForChild, child);
LayoutUnit desiredLogicalHeight = child.constrainLogicalHeightByMinMax(stretchedLogicalHeight, -1);

bool childNeedsRelayout = desiredLogicalHeight != child.logicalHeight();
if (childNeedsRelayout || !child.hasOverrideLogicalContentHeight())
    child.setOverrideLogicalContentHeight(desiredLogicalHeight - child.borderAndPaddingLogicalHeight());
if (childNeedsRelayout) {
    // Lay the child out again with the stretched height (body elided here).
}

LayoutUnit LayoutGrid::availableAlignmentSpaceForChildBeforeStretching(LayoutUnit gridAreaBreadthForChild, const LayoutBox& child) const
{
    LayoutUnit childMarginLogicalHeight = marginLogicalHeightForChild(child);
    // Because we want to avoid multiple layouts, stretching logic might be performed before
    // children are laid out, so we can't use the child cached values. Hence, we need to
    // compute margins in order to determine the available height before stretching.
    if (childMarginLogicalHeight == 0)
        childMarginLogicalHeight = computeMarginLogicalHeightForChild(child);
    return gridAreaBreadthForChild - childMarginLogicalHeight;
}

In addition to the extra layout required for changing the grid item's size, computing the available space for stretching adds an additional overhead, especially if we have to compute the grid item's margins because some layout operations are still incomplete.

Given that the grid container relies on generic block layout operations to determine the stretched width, this specific logic is only executed for determining the stretched height. Hence the performance drop is alleviated, compared with the auto-sized tracks scenario.

Grid VS Flexbox layout performance

One of the main goals of the CSS Grid Layout specification is to complement the Flexbox layout model for 2 dimensions. It's to be expected that creating grid designs with Flexbox will be less efficient than using a layout model specifically designed for these cases, not only regarding CSS syntax, but also regarding layout performance.

However, I think it's interesting to measure Grid Layout performance in the 1-dimensional cases usually managed with Flexbox, so we can have comparable scenarios to evaluate both models. In this post I'll start with such cases, using a very simple one on this occasion. I'd like to get to more complex examples in future posts, the ones more usual in Flexbox-based designs.

So, let’s consider the following simple test case:

<div class="className">
   <div class="i1">Item 1</div>
   <div class="i2">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</div>
   <div class="i3">Item 3 longer</div>
</div>

I evaluated the simple HTML example above with both Flexbox and Grid layouts to measure performance. I used a CPU profiler to figure out where the bottlenecks are for each model, trying to explain where differences came from. So, I defined 2 CSS classes for each layout model, as follows:

.flex {
    background-color: silver;
    display: flex;
    height: 100px;
    align-items: start;
}
.grid {
    background-color: silver;
    display: grid;
    grid-template-columns: 100px 1fr auto;
    grid-template-rows: 100px;
    align-items: start;
    justify-items: start;
}
.i1 {
    background-color: cyan;
    flex-basis: 100px;
}
.i2 {
    background-color: magenta;
    flex: 1;
}
.i3 {
    background-color: yellow;
}

Given that there is no concept of row in Flexbox, I evaluated the performance of 100 up to 2000 grid or flex containers, creating 20 tests to be run inside the Chrome performance framework described at the beginning of this post. You can check out the resources and a script to generate them at our github examples repo.
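Each of those tests is a small harness page; sketched from memory, the measuring part looks roughly like this (the helper lives in PerformanceTests/resources/runner.js, and names may differ slightly):

<script src="../resources/runner.js"></script>
<script>
var toggle = 0;
PerfTestRunner.measureRunsPerSecond({
    run: function() {
        // Dirty the containers and force a synchronous relayout.
        document.body.style.width = (toggle++ % 2) ? '99%' : '100%';
        document.body.offsetHeight;
    }
});
</script>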


When comparing both layout models in terms of layout times, we see clearly that Grid Layout beats Flexbox when using the default values of the CSS properties controlling layout and alignment, which means stretch for these containers. As explained before, the stretching logic adds an important computation overhead which, as we can see now in the numeric table above, weighs more for Flexbox than for Grid.

Looking at the plot of the differences in layout time, we see that for the default case, Grid's performance improvement stabilizes around 7%. However, when we avoid the stretching logic, for instance by using any other alignment value, layout performance is considerably worse than Flexbox for this test case, around 15% slower. This makes sense, as this test case is the ideal one for Flexbox, while a bit artificial for Grid; using a single grid with N rows improves performance considerably, getting much better numbers than Flexbox, but we will see those cases in future analysis.

Grid Layout's better results for the default case (stretch) are explained by the several optimizations I implemented for Grid. Flexbox should probably do the same, as stretch is its default value too and it could affect many sites using this layout model in their designs.

Thanks to Bloomberg for sponsoring this work, as part of the efforts that Igalia has been doing all these years pursuing a better and more open web.

Igalia & Bloomberg logos

by jfernandez at June 24, 2015 12:03 PM

June 18, 2015

Andy Wingo

arrow functions coming to chrome 45!

It's been a long time coming, but I just flipped the bit in V8 that will ship arrow functions in Chrome 45! Woo hoo!

You probably know, but arrow functions are a new way to write functions in JavaScript. They look like this:

// Two arguments, body implicitly returned.
(x, y) => x + y

// With just one argument, no parentheses needed.
x => x * 2

// Body can have braces too; in that case use "return".
x => { return x * 2 }

Relative to the other kind of function that is written like function (x) { return x * 2 }, arrow functions don't define this or arguments in their bodies, instead capturing these values from the environment. There are a couple of other minor differences, too, but instead of writing about them here I'll just point to the great article by Jason Orendorff of the SpiderMonkey team.
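A quick illustration of that lexical capture (plain ES6, nothing V8-specific):

var counter = {
  count: 0,
  start: function() {
    // The arrow captures `this` from start(), so this.count is the
    // counter's own property; a plain function would get its own `this`.
    setInterval(() => this.count++, 1000);
  }
};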

Arrow functions are part of the JavaScript language standard that was called "ECMAScript 6" or ES6, and I guess you could still call it that. It seems like a silly thing for the committee to do to throw away all their branding like that but they decided to rename it ECMAScript 2015, which I'm sure is a link that the pedants are glad I have included. The upshot is that the standard is now final, gold master, etched in stone, which from an implementor's perspective is a relief. You can practically feel the anxiety ebbing away by the happy rate at which commits bubble out of source repositories and into shipping browsers, free from the fear that some spec change will force the hack-stream to change course.

From the V8 side, our arrow function implementation has also been a long time coming. My colleague Adrián Pérez did the first half of the work, and I picked up on the back end of things. It seems like such a small feature and in many ways it is, but still it took a long time. Now I know that my readers are a bunch of nerds and many of you like implementing languages, so you might appreciate these nerdish points.

One of the first bits is that arrow functions are hard to parse. Consider: this is a valid JavaScript expression:

(x, y)
It's a "comma expression" that will evaluate x then y and its result will be the result of evaluating y. But add an arrow after the end and you get not an expression but a formal parameter list:

(x, y) => x + y
Now you might think, well OK, when you see an arrow, rewind the input stream and parse in "arrow function mode". Indeed that would be fine, but not in combination with some additional ES6 features, optional and destructuring arguments. Optional arguments look like this:

(x = 42) => x
The =42 part is the expression that will be evaluated to give x a value, if the function is called with no arguments. Note that this bit is still under implementation in V8 so you can't try it in your browser. An optional argument initializer is an expression and not a value, so you can also have:

(x = (y, z) => y + z) => x
Combined, this makes rewinding the token stream a proposition of exponential complexity, which is a no-go for a production JavaScript parser. Parsers are on the hot path for page-load times and no browser vendor wants to introduce a pathological case into their page load.

Instead, V8 does something I hadn't seen before. It keeps an open mind about whether something is a comma expression or a formal parameter list of an arrow function, and only makes a decision when it sees the => (or not). As it parses, V8 records places that it would signal an error for either a parameter list or for an expression, and then when that superimposed wave function collapses it checks that the production is valid, signalling the appropriate error if not. I thought this was a really neat trick, so if you're into that kind of thing, see the expression classifier for the details.

The other thing that's tricky about arrow functions is the this binding. In JavaScript, this is basically a hidden parameter passed to a function when it is called. Calling a function like o.f() passes the value of o to f as its this parameter. If instead f() is called directly, like with no dot before the call, then undefined is passed as this. Also for sloppy-mode functions, if the passed this value isn't an object, then the global object instead is assigned to this. Finally outside a function, this is bound to the global object.
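In code, those rules look like this (sloppy mode; the sketch is mine):

function f() { return this; }
var o = { f: f };

o.f();  // this is o: the receiver before the dot
f();    // this is the global object: undefined was passed, then coerced;
        // in strict mode it would stay undefined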

OK, I know all of you know these things. Thing is, you always have a this, and although it's like a variable it's not a valid variable name, and before ES6 nothing could capture its value, because each function has its own this value. Perhaps you see where I'm going with this (ahem) now. Arrow functions introduce a function scope that doesn't have a this value, and that indeed might capture some other scope's this value, forcing it to be context-allocated. Other parts of ES6 can actually force assignment to this, like a super call, and that assignment can actually come from within an arrow function. Zounds! A simple concept, but there was a lot of incidental complexity in V8 around the implementation. Between Adrián and myself it took like three months to fix this usage in V8 to always just go through the (possibly context-allocated) variable, and there are still probably some devtools bugs to find in the upcoming weeks.

Performance-wise, arrow functions are just like functions. They should be just as fast as if you wrote them with function. So use them with joy, use them with abandon, use them judiciously -- however you decide you use them, don't let perf influence your decision one way or the other.

That's about it! Like all of my JS engine work over the past couple years, this hacking was sponsored by fabulous folks over at Bloomberg, so big ups to them. From me and Adrián at Igalia, until next time! We leave you to puzzle out what this bit of JavaScript evaluates to:


Happy hacking!

by Andy Wingo at June 18, 2015 04:41 PM

June 12, 2015

Samuel Iglesias

piglit (V): how to contribute to piglit and table of contents

Last post and the one before were about how to create your own piglit tests. Previously, I have written an introduction to piglit and how to launch a tailored piglit run (more details about these last two topics in my FOSDEM 2015 talk).

Now it’s time to talk about how to contribute to piglit.

How to contribute to piglit

Once you want to contribute something to piglit, you need to generate the patches and send them to the mailing list for review. They are usually created with git format-patch and sent with the git send-email command (if you need help with git, there are a lot of tutorials online). Remember to rebase your branch against an up-to-date master before creating the patches, so no merge conflicts will appear if the reviewers want to apply them locally.
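If you have never used these commands, the whole flow is something like the following sketch (adjust the remote name to your setup, and double-check the list address on the project page):

$ git fetch origin && git rebase origin/master
$ git format-patch origin/master
$ git send-email --to=piglit@lists.freedesktop.org *.patch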

Whether you have some patches ready to be submitted or you have questions about piglit, subscribe to the piglit mailing list and send them there.

Most piglit developers work in other areas (such as OpenGL driver development!) which means that the review process of piglit patches could take some time, so be patient and wait.

If after some time (something like one or two weeks) there is no answer about your patches, you can send a reminder saying that a review is pending. If you have commit rights and the patch is trivial (or you are very confident that it is right), you can even push it to the repository's master branch after that time. Piglit is not as strict as other projects in this regard; however, do not abuse this rule.

Once you have a good track record of contributions, or other contributors tell you to do so, you can ask for piglit repository commit rights by following these instructions. And don't hesitate to review patches from others!

Table of contents

This is the list of my piglit related posts:

  1. Piglit, an open-source test suite for OpenGL implementations
  2. piglit (II): How to launch a tailored piglit run
  3. piglit (III): How to write GLSL shader tests
  4. piglit (IV): How to write binary test programs
  5. piglit (V): how to contribute to piglit and table of contents

Plus my FOSDEM 2015 talk.

Thanks for following this short introduction to piglit. Happy hacking!

by Samuel Iglesias at June 12, 2015 08:07 AM

June 11, 2015

Samuel Iglesias

piglit (IV): How to write binary test programs

Last post I talked about how to develop GLSL shader tests and how to add them to piglit. This is a nice way to develop simple tests but sometimes you need to do something more complex. For that case, piglit can run binary test programs.


Binary test programs in piglit are written in C, taking advantage of the piglit framework to facilitate test development (there is no main() function, no need to set up GLX (or WGL or EGL or whatever), no need to manually check the available extensions or call glXSwapBuffers() or similar functions…); however, you can still use the whole OpenGL C-language API.

Simple example

The piglit framework is mostly undocumented but easy to understand once you start reading some existing tests. So I will start with a simple example explaining what a typical binary test looks like, and then show you a real example.

/*
 * Copyright © 2015 Intel Corporation
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.
 */

/** @file test-example.c
 *
 * This test shows a skeleton for creating new tests.
 */

#include "piglit-util-gl.h"

PIGLIT_GL_TEST_CONFIG_BEGIN

    config.window_width = 100;
    config.window_height = 100;
    config.supports_gl_compat_version = 10;
    config.supports_gl_core_version = 31;

PIGLIT_GL_TEST_CONFIG_END


static const char vs_pass_thru_text[] =
    "#version 330\n"
    "in vec4 piglit_vertex;\n"
    "void main() {\n"
    "   gl_Position = piglit_vertex;\n"
    "}\n";

static const char fs_source[] =
    "#version 330\n"
    "out vec4 color;\n"
    "void main() {\n"
    "   color = vec4(1.0, 0.0, 0.0, 1.0);\n"
    "}\n";

GLuint prog;

void
piglit_init(int argc, char **argv)
{
    bool pass = true;

    /* piglit_require_extension("GL_ARB_shader_storage_buffer_object"); */

    prog = piglit_build_simple_program(vs_pass_thru_text, fs_source);
    glUseProgram(prog);

    glClearColor(0, 0, 0, 0);

    /* <-- OpenGL commands to be done --> */

    glViewport(0, 0, piglit_width, piglit_height);

    /* piglit_draw_* commands can go into piglit_display() too */
    piglit_draw_rect(-1, -1, 2, 2);

    if (!piglit_check_gl_error(GL_NO_ERROR))
       pass = false;

    piglit_report_result(pass ? PIGLIT_PASS : PIGLIT_FAIL);
}

enum piglit_result
piglit_display(void)
{
    /* <-- OpenGL drawing commands, if needed --> */

    /* UNREACHED */
    return PIGLIT_FAIL;
}

As you see in this example, there are four different parts:

  1. License header and description of the test
    • The license details should be included in each source file. There is one agreed by most contributors, and it's an MIT license assigning the copyright to Intel Corporation. More information in the COPYING file.
    • It includes a brief description of what the test does.

    /*
     * Copyright © 2015 Intel Corporation
     *
     * Permission is hereby granted, free of charge, to any person obtaining a
     * copy of this software and associated documentation files (the "Software"),
     * to deal in the Software without restriction, including without limitation
     * the rights to use, copy, modify, merge, publish, distribute, sublicense,
     * and/or sell copies of the Software, and to permit persons to whom the
     * Software is furnished to do so, subject to the following conditions:
     *
     * The above copyright notice and this permission notice (including the next
     * paragraph) shall be included in all copies or substantial portions of the
     * Software.
     */

    /** @file test-example.c
     *
     * This test shows a skeleton for creating new tests.
     */

  2. Piglit setup. This is needed to check if a test can be executed by a given driver (minimum supported GL version), or to create a window of a specific size, or even to define if we want double buffering.

    PIGLIT_GL_TEST_CONFIG_BEGIN

        config.window_width = 100;
        config.window_height = 100;
        config.supports_gl_compat_version = 10;
        config.supports_gl_core_version = 31;

    PIGLIT_GL_TEST_CONFIG_END

  • piglit_init(). This is the function that is going to be called to configure the test itself. Some tests implement all their code inside piglit_init() because they don't need to draw anything (or don't need to update the drawing frame by frame). In any case, you usually put here the following code:
    • Check for needed extensions.
    • Check for limits or maximum values of different variables like GL_MAX_SHADER_STORAGE_BLOCKS, GL_UNIFORM_BLOCK_SIZE, etc.
    • Setup constant data, upload it.
    • All the initialization setup you need for drawing commands: compile and link shaders, set clear color, etc.

    void
    piglit_init(int argc, char **argv)
    {
        bool pass = true;

        /* piglit_require_extension("GL_ARB_shader_storage_buffer_object"); */

        prog = piglit_build_simple_program(vs_pass_thru_text, fs_source);
        glUseProgram(prog);

        glClearColor(0, 0, 0, 0);
        /* <-- OpenGL commands to be done --> */
        glViewport(0, 0, piglit_width, piglit_height);

        /* piglit_draw_* commands can go into piglit_display() too */
        piglit_draw_rect(-1, -1, 2, 2);

        if (!piglit_check_gl_error(GL_NO_ERROR))
           pass = false;

        piglit_report_result(pass ? PIGLIT_PASS : PIGLIT_FAIL);
    }

  • piglit_display(). This is the function that is going to be executed periodically to update each frame of the rendered window. In some tests, you will find it almost empty (it returns PIGLIT_FAIL) because it is not needed by the test program.

    enum piglit_result
    piglit_display(void)
    {
        /* <-- OpenGL drawing commands, if needed --> */

        /* UNREACHED */
        return PIGLIT_FAIL;
    }

    Notice that you are free to add any helper functions you need, as in any other C program, but the aforementioned parts are required by piglit.

    Piglit API

    Piglit provides a lot of functions in its API to be used by test programs. They are usually often-used helpers that substitute one or several OpenGL function calls and the code that accompanies them.

    The available functions are listed in the piglit-util-gl.h file, which must be included in every binary test's source code.

    /*
     * Copyright (c) The Piglit project 2007
     *
     * Permission is hereby granted, free of charge, to any person obtaining a
     * copy of this software and associated documentation files (the "Software"),
     * to deal in the Software without restriction, including without limitation
     * on the rights to use, copy, modify, merge, publish, distribute, sub
     * license, and/or sell copies of the Software, and to permit persons to whom
     * the Software is furnished to do so, subject to the following conditions:
     *
     * The above copyright notice and this permission notice (including the next
     * paragraph) shall be included in all copies or substantial portions of the
     * Software.
     */

    #pragma once
    #ifndef __PIGLIT_UTIL_GL_H__
    #define __PIGLIT_UTIL_GL_H__

    #ifdef __cplusplus
    extern "C" {
    #endif

    #include "piglit-util.h"

    #include <piglit/gl_wrap.h>
    #include <piglit/glut_wrap.h>

    #define piglit_get_proc_address(x) piglit_dispatch_resolve_function(x)

    #include "piglit-framework-gl.h"
    #include "piglit-shader.h"

    extern const uint8_t fdo_bitmap[];
    extern const unsigned int fdo_bitmap_width;
    extern const unsigned int fdo_bitmap_height;
    extern bool piglit_is_core_profile;
    /** Determine if the API is OpenGL ES. */
    bool piglit_is_gles(void);

    /**
     * \brief Get version of OpenGL or OpenGL ES API.
     * Returned version is multiplied by 10 to make it an integer.  So for
     * example, if the GL version is 2.1, the return value is 21.
     */
    int piglit_get_gl_version(void);

    /** \precondition name is not null */
    bool piglit_is_extension_supported(const char *name);

    /** Reinitialize the supported extension list. */
    void piglit_gl_reinitialize_extensions();

    /**
     * \brief Convert a GL error to a string.
     * For example, given GL_INVALID_ENUM, return "GL_INVALID_ENUM".
     * Return "(unrecognized error)" if the enum is not recognized.
     */
    const char* piglit_get_gl_error_name(GLenum error);

    /**
     * \brief Convert a GL enum to a string.
     * For example, given GL_INVALID_ENUM, return "GL_INVALID_ENUM".
     * Return "(unrecognized enum)" if the enum is not recognized.
     */
    const char *piglit_get_gl_enum_name(GLenum param);

    /**
     * \brief Convert a string to a GL enum.
     * For example, given "GL_INVALID_ENUM", return GL_INVALID_ENUM.
     * abort() if the string is not recognized.
     */
    GLenum piglit_get_gl_enum_from_name(const char *name);

    /**
     * \brief Convert a GL primitive type enum value to a string.
     * For example, given GL_POLYGON, return "GL_POLYGON".
     * We don't use piglit_get_gl_enum_name() for this because there are
     * other enums which alias the prim type enums (ex: GL_POINTS = GL_NONE);
     * Return "(unrecognized enum)" if the enum is not recognized.
     */
    const char *piglit_get_prim_name(GLenum prim);

    /**
     * \brief Check for unexpected GL errors.
     * If glGetError() returns an error other than \c expected_error, then
     * print a diagnostic and return GL_FALSE.  Otherwise return GL_TRUE.
     */
    GLboolean
    piglit_check_gl_error_(GLenum expected_error, const char *file, unsigned line);
    #define piglit_check_gl_error(expected) \
     piglit_check_gl_error_((expected), __FILE__, __LINE__)

    /**
     * \brief Drain all GL errors.
     * Repeatly call glGetError and discard errors until it returns GL_NO_ERROR.
     */
    void piglit_reset_gl_error(void);

    void piglit_require_gl_version(int required_version_times_10);
    void piglit_require_extension(const char *name);
    void piglit_require_not_extension(const char *name);
    unsigned piglit_num_components(GLenum base_format);
    bool piglit_get_luminance_intensity_bits(GLenum internalformat, int *bits);
    int piglit_probe_pixel_rgb_silent(int x, int y, const float* expected, float *out_probe);
    int piglit_probe_pixel_rgba_silent(int x, int y, const float* expected, float *out_probe);
    int piglit_probe_pixel_rgb(int x, int y, const float* expected);
    int piglit_probe_pixel_rgba(int x, int y, const float* expected);
    int piglit_probe_rect_r_ubyte(int x, int y, int w, int h, GLubyte expected);
    int piglit_probe_rect_rgb(int x, int y, int w, int h, const float* expected);
    int piglit_probe_rect_rgb_silent(int x, int y, int w, int h, const float *expected);
    int piglit_probe_rect_rgba(int x, int y, int w, int h, const float* expected);
    int piglit_probe_rect_rgba_int(int x, int y, int w, int h, const int* expected);
    int piglit_probe_rect_rgba_uint(int x, int y, int w, int h, const unsigned int* expected);
    void piglit_compute_probe_tolerance(GLenum format, float *tolerance);
    int piglit_compare_images_color(int x, int y, int w, int h, int num_components,
                    const float *tolerance,
                    const float *expected_image,
                    const float *observed_image);
    int piglit_probe_image_color(int x, int y, int w, int h, GLenum format, const float *image);
    int piglit_probe_image_rgb(int x, int y, int w, int h, const float *image);
    int piglit_probe_image_rgba(int x, int y, int w, int h, const float *image);
    int piglit_compare_images_ubyte(int x, int y, int w, int h,
                    const GLubyte *expected_image,
                    const GLubyte *observed_image);
    int piglit_probe_image_stencil(int x, int y, int w, int h, const GLubyte *image);
    int piglit_probe_image_ubyte(int x, int y, int w, int h, GLenum format,
                     const GLubyte *image);
    int piglit_probe_texel_rect_rgb(int target, int level, int x, int y,
                    int w, int h, const float *expected);
    int piglit_probe_texel_rgb(int target, int level, int x, int y,
                   const float* expected);
    int piglit_probe_texel_rect_rgba(int target, int level, int x, int y,
                     int w, int h, const float *expected);
    int piglit_probe_texel_rgba(int target, int level, int x, int y,
                    const float* expected);
    int piglit_probe_texel_volume_rgba(int target, int level, int x, int y, int z,
                     int w, int h, int d, const float *expected);
    int piglit_probe_pixel_depth(int x, int y, float expected);
    int piglit_probe_rect_depth(int x, int y, int w, int h, float expected);
    int piglit_probe_pixel_stencil(int x, int y, unsigned expected);
    int piglit_probe_rect_stencil(int x, int y, int w, int h, unsigned expected);
    int piglit_probe_rect_halves_equal_rgba(int x, int y, int w, int h);
    bool piglit_probe_buffer(GLuint buf, GLenum target, const char *label,
                 unsigned n, unsigned num_components,
                 const float *expected);
    int piglit_use_fragment_program(void);
    int piglit_use_vertex_program(void);
    void piglit_require_fragment_program(void);
    void piglit_require_vertex_program(void);
    GLuint piglit_compile_program(GLenum target, const char* text);
    GLvoid piglit_draw_triangle(float x1, float y1, float x2, float y2,
                    float x3, float y3);
    GLvoid piglit_draw_triangle_z(float z, float x1, float y1, float x2, float y2,
                      float x3, float y3);
    GLvoid piglit_draw_rect_custom(float x, float y, float w, float h,
                       bool use_patches);
    GLvoid piglit_draw_rect(float x, float y, float w, float h);
    GLvoid piglit_draw_rect_z(float z, float x, float y, float w, float h);
    GLvoid piglit_draw_rect_tex(float x, float y, float w, float h,
                                float tx, float ty, float tw, float th);
    GLvoid piglit_draw_rect_back(float x, float y, float w, float h);
    void piglit_draw_rect_from_arrays(const void *verts, const void *tex,
                      bool use_patches);
    unsigned short piglit_half_from_float(float val);
    void piglit_escape_exit_key(unsigned char key, int x, int y);
    void piglit_gen_ortho_projection(double left, double right, double bottom,
                     double top, double near_val, double far_val,
                     GLboolean push);
    void piglit_ortho_projection(int w, int h, GLboolean push);
    void piglit_frustum_projection(GLboolean push, double l, double r, double b,
                       double t, double n, double f);
    void piglit_gen_ortho_uniform(GLint location, double left, double right,
                      double bottom, double top, double near_val,
                      double far_val);
    void piglit_ortho_uniform(GLint location, int w, int h);
    GLuint piglit_checkerboard_texture(GLuint tex, unsigned level,
        unsigned width, unsigned height,
        unsigned horiz_square_size, unsigned vert_square_size,
        const float *black, const float *white);
    GLuint piglit_miptree_texture(void);
    GLfloat *piglit_rgbw_image(GLenum internalFormat, int w, int h,
                               GLboolean alpha, GLenum basetype);
    GLubyte *piglit_rgbw_image_ubyte(int w, int h, GLboolean alpha);
    GLuint piglit_rgbw_texture(GLenum internalFormat, int w, int h, GLboolean mip,
                GLboolean alpha, GLenum basetype);
    GLuint piglit_depth_texture(GLenum target, GLenum format, int w, int h, int d, GLboolean mip);
    GLuint piglit_array_texture(GLenum target, GLenum format, int w, int h, int d, GLboolean mip);
    GLuint piglit_multisample_texture(GLenum target, GLenum tex,
                      GLenum internalFormat,
                      unsigned width, unsigned height,
                      unsigned depth, unsigned samples,
                      GLenum format, GLenum type, void *data);
    extern float piglit_tolerance[4];
    void piglit_set_tolerance_for_bits(int rbits, int gbits, int bbits, int abits);
    extern void piglit_require_transform_feedback(void);
    void piglit_get_compressed_block_size(GLenum format,
                     unsigned *bw, unsigned *bh, unsigned *bytes);
    unsigned piglit_compressed_image_size(GLenum format, unsigned width, unsigned height);
    unsigned piglit_compressed_pixel_offset(GLenum format, unsigned width,
                       unsigned x, unsigned y);
    void piglit_visualize_image(float *img, GLenum base_internal_format,
                   int image_width, int image_height,
                   int image_count, bool rhs);
    float piglit_srgb_to_linear(float x);
    float piglit_linear_to_srgb(float x);
    extern GLfloat cube_face_texcoords[6][4][3];
    extern const char *cube_face_names[6];
    extern const GLenum cube_face_targets[6];
    /**
     * Common vertex program code to perform a model-view-project matrix transform
     */
    #define PIGLIT_VERTEX_PROGRAM_MVP_TRANSFORM     \
        "ATTRIB iPos = vertex.position;\n"      \
        "OUTPUT oPos = result.position;\n"      \
        "PARAM  mvp[4] = { state.matrix.mvp };\n"   \
        "DP4    oPos.x, mvp[0], iPos;\n"        \
        "DP4    oPos.y, mvp[1], iPos;\n"        \
        "DP4    oPos.z, mvp[2], iPos;\n"        \
        "DP4    oPos.w, mvp[3], iPos;\n"
    /**
     * Handle to a generic fragment program that passes the input color to output.
     *
     * \note
     * Either \c piglit_use_fragment_program or \c piglit_require_fragment_program
     * must be called before using this program handle.
     */
    extern GLint piglit_ARBfp_pass_through;
    static const GLint PIGLIT_ATTRIB_POS = 0;
    static const GLint PIGLIT_ATTRIB_TEX = 1;
    /**
     * Given a GLSL version number, return the lowest-numbered GL version
     * that is guaranteed to support it.
     */
    unsigned
    required_gl_version_from_glsl_version(unsigned glsl_version);

    #ifdef __cplusplus
    } /* end extern "C" */
    #endif

    #endif /* __PIGLIT_UTIL_GL_H__ */

    Most functions are undocumented, although there are a lot of examples of how to use them in other piglit tests. Furthermore, once you know which function you need, it is usually straightforward to learn how to call it.

    Just to mention a few: you can request extensions (piglit_require_extension()) or a GL version (piglit_require_gl_version()), compile a program (piglit_compile_program()), draw a rectangle or triangle, read pixel values and compare them with a list of expected values, check for GL errors, etc.
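    For instance, a typical draw-and-check sequence at the end of a test might look like this short sketch (it reuses the draw and probe helpers listed in the header above; the green color is my example):

    static const float green[4] = {0.0, 1.0, 0.0, 1.0};

    piglit_draw_rect(-1, -1, 2, 2);
    if (!piglit_probe_pixel_rgba(piglit_width / 2, piglit_height / 2, green))
        pass = false;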

    piglit-shader.h has all the shader-related functions: compile a shader, link a simple program, build (i.e. compile and link) a simple program with vertex and fragment shaders, require a specific GLSL version, etc.
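    For example, building and using a simple program with those helpers takes a couple of lines (a sketch reusing the shader sources from the skeleton above):

    GLint prog = piglit_build_simple_program(vs_pass_thru_text, fs_source);
    glUseProgram(prog);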

    /*
     * Copyright (c) The Piglit project 2007
     *
     * Permission is hereby granted, free of charge, to any person obtaining a
     * copy of this software and associated documentation files (the "Software"),
     * to deal in the Software without restriction, including without limitation
     * on the rights to use, copy, modify, merge, publish, distribute, sub
     * license, and/or sell copies of the Software, and to permit persons to whom
     * the Software is furnished to do so, subject to the following conditions:
     *
     * The above copyright notice and this permission notice (including the next
     * paragraph) shall be included in all copies or substantial portions of the
     * Software.
     */

    #pragma once

    /**
     * Null parameters are ignored.
     *
     * \param es Is it GLSL ES?
     */
    void piglit_get_glsl_version(bool *es, int* major, int* minor);

    GLuint piglit_compile_shader(GLenum target, const char *filename);
    GLuint piglit_compile_shader_text_nothrow(GLenum target, const char *text);
    GLuint piglit_compile_shader_text(GLenum target, const char *text);
    GLboolean piglit_link_check_status(GLint prog);
    GLboolean piglit_link_check_status_quiet(GLint prog);
    GLint piglit_link_simple_program(GLint vs, GLint fs);
    GLint piglit_build_simple_program(const char *vs_source, const char *fs_source);
    GLuint piglit_build_simple_program_unlinked(const char *vs_source,
                            const char *fs_source);
    GLint piglit_link_simple_program_multiple_shaders(GLint shader1, ...);
    GLint piglit_build_simple_program_unlinked_multiple_shaders_v(GLenum target1,
                                     const char*source1,
                                     va_list ap);
    GLint piglit_build_simple_program_unlinked_multiple_shaders(GLenum target1,
                                   const char *source1,
                                   ...);
    GLint piglit_build_simple_program_multiple_shaders(GLenum target1,
                              const char *source1,
                              ...);
    extern GLboolean piglit_program_pipeline_check_status(GLuint pipeline);
    extern GLboolean piglit_program_pipeline_check_status_quiet(GLuint pipeline);
    /**
     * Require a specific version of GLSL.
     *
     * \param version Integer version, for example 130
     */
    extern void piglit_require_GLSL_version(int version);
    /** Require any version of GLSL */
    extern void piglit_require_GLSL(void);
    extern void piglit_require_fragment_shader(void);
    extern void piglit_require_vertex_shader(void);

    There are more header files such as piglit-glx-util.h, piglit-matrix.h, piglit-util-egl.h, etc.

    Usually, you only need to add piglit-util-gl.h to your source code; however, I recommend browsing through tests/util/ to find out all the available functions that piglit provides.


    A complete example of what a piglit binary test looks like is the ARB_uniform_buffer_object rendering test.

    /*
     * Copyright (c) 2014 VMware, Inc.
     *
     * Permission is hereby granted, free of charge, to any person obtaining a
     * copy of this software and associated documentation files (the "Software"),
     * to deal in the Software without restriction, including without limitation
     * the rights to use, copy, modify, merge, publish, distribute, sublicense,
     * and/or sell copies of the Software, and to permit persons to whom the
     * Software is furnished to do so, subject to the following conditions:
     *
     * The above copyright notice and this permission notice (including the next
     * paragraph) shall be included in all copies or substantial portions of the
     * Software.
     */

    /** @file rendering.c
     *
     * Test rendering with UBOs.  We draw four squares with different positions,
     * sizes, rotations and colors where those parameters come from UBOs.
     */

    #include "piglit-util-gl.h"

    PIGLIT_GL_TEST_CONFIG_BEGIN

        config.supports_gl_compat_version = 20;
        config.window_visual = PIGLIT_GL_VISUAL_DOUBLE | PIGLIT_GL_VISUAL_RGBA;

    PIGLIT_GL_TEST_CONFIG_END
    static const char vert_shader_text[] =
        "#extension GL_ARB_uniform_buffer_object : require\n"
        "layout(std140) uniform;\n"
        "uniform ub_pos_size { vec2 pos; float size; };\n"
        "uniform ub_rot {float rotation; };\n"
        "void main()\n"
        "   mat2 m;\n"
        "   m[0][0] = m[1][1] = cos(rotation); \n"
        "   m[0][1] = sin(rotation); \n"
        "   m[1][0] = -m[0][1]; \n"
        "   gl_Position.xy = m * gl_Vertex.xy * vec2(size) + pos;\n"
        " = vec2(0, 1);\n"
    static const char frag_shader_text[] =
        "#extension GL_ARB_uniform_buffer_object : require\n"
        "layout(std140) uniform;\n"
        "uniform ub_color { vec4 color; float color_scale; };\n"
        "void main()\n"
        "   gl_FragColor = color * color_scale;\n"
    #define NUM_SQUARES 4
    #define NUM_UBOS 3
    /* Square positions and sizes */
    static const float pos_size[NUM_SQUARES][3] = {
        { -0.5, -0.5, 0.1 },
        {  0.5, -0.5, 0.2 },
        { -0.5, 0.5, 0.3 },
        {  0.5, 0.5, 0.4 }
    /* Square color and color_scales */
    static const float color[NUM_SQUARES][8] = {
        { 2.0, 0.0, 0.0, 1.0,   0.50, 0.0, 0.0, 0.0 },
        { 0.0, 4.0, 0.0, 1.0,   0.25, 0.0, 0.0, 0.0 },
        { 0.0, 0.0, 5.0, 1.0,   0.20, 0.0, 0.0, 0.0 },
        { 0.2, 0.2, 0.2, 0.2,   5.00, 0.0, 0.0, 0.0 }
    /* Square rotations */
    static const float rotation[NUM_SQUARES] = {
    static GLuint prog;
    static GLuint buffers[NUM_UBOS];
    static GLint alignment;
    static bool test_buffer_offset = false;
    static void
        static const char *names[NUM_UBOS] = {
        static GLubyte zeros[1000] = {0};
        int i;
        glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &alignment);
        printf("GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT = %d\n", alignment);
        if (test_buffer_offset) {
            printf("Testing buffer offset %d\n", alignment);
        else {
            /* we use alignment as the offset */
            alignment = 0;
        glGenBuffers(NUM_UBOS, buffers);
        for (i = 0; i < NUM_UBOS; i++) {
            GLint index, size;
            /* query UBO index */
            index = glGetUniformBlockIndex(prog, names[i]);
            /* query UBO size */
            glGetActiveUniformBlockiv(prog, index,
                          GL_UNIFORM_BLOCK_DATA_SIZE, &size);
            printf("UBO %s: index = %d, size = %d\n",
                   names[i], index, size);
            /* Allocate UBO */
            /* XXX for some reason, this test doesn't work at all with
             * nvidia if we pass NULL instead of zeros here.  The UBO data
             * is set/overwritten in the piglit_display() function so this
             * really shouldn't matter.
            glBindBuffer(GL_UNIFORM_BUFFER, buffers[i]);
            glBufferData(GL_UNIFORM_BUFFER, size + alignment,
                                 zeros, GL_DYNAMIC_DRAW);
            /* Attach UBO */
            glBindBufferRange(GL_UNIFORM_BUFFER, i, buffers[i],
                      alignment,  /* offset */
            glUniformBlockBinding(prog, index, i);
            if (!piglit_check_gl_error(GL_NO_ERROR))
    piglit_init(int argc, char **argv)
        if (argc > 1 && strcmp(argv[1], "offset") == 0) {
            test_buffer_offset = true;
        prog = piglit_build_simple_program(vert_shader_text, frag_shader_text);
        glClearColor(0.2, 0.2, 0.2, 0.2);
    static bool
    probe(int x, int y, int color_index)
        float expected[4];
        /* mul color by color_scale */
        expected[0] = color[color_index][0] * color[color_index][4];
        expected[1] = color[color_index][1] * color[color_index][4];
        expected[2] = color[color_index][2] * color[color_index][4];
        expected[3] = color[color_index][3] * color[color_index][4];
        return piglit_probe_pixel_rgba(x, y, expected);
    enum piglit_result
        bool pass = true;
        int x0 = piglit_width / 4;
        int x1 = piglit_width * 3 / 4;
        int y0 = piglit_height / 4;
        int y1 = piglit_height * 3 / 4;
        int i;
        glViewport(0, 0, piglit_width, piglit_height);
        for (i = 0; i < NUM_SQUARES; i++) {
            /* Load UBO data, at offset=alignment */
            glBindBuffer(GL_UNIFORM_BUFFER, buffers[0]);
            glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(pos_size[0]),
            glBindBuffer(GL_UNIFORM_BUFFER, buffers[1]);
            glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(color[0]),
            glBindBuffer(GL_UNIFORM_BUFFER, buffers[2]);
            glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(rotation[0]),
            if (!piglit_check_gl_error(GL_NO_ERROR))
                return PIGLIT_FAIL;
            piglit_draw_rect(-1, -1, 2, 2);
        pass = probe(x0, y0, 0) && pass;
        pass = probe(x1, y0, 1) && pass;
        pass = probe(x0, y1, 2) && pass;
        pass = probe(x1, y1, 3) && pass;
        return pass ? PIGLIT_PASS : PIGLIT_FAIL;

    The source starts with the license header and a description of what the test does. Following those is the piglit config setup: it requests double buffering, an RGBA pixel format and GL compat version 2.0.

    Furthermore, this test defines GLSL shaders, global data and helper functions like any other OpenGL C program. Notice that setup_ubos() includes calls to the OpenGL API but also calls to piglit_check_gl_error() and piglit_report_result(), which are used to check for OpenGL errors and to tell piglit that the test failed, respectively.

    Following the structure introduced before, piglit_init() indicates that the ARB_uniform_buffer_object extension is required, builds the program with the aforementioned shaders, sets up the uniform buffer objects and sets the clear color.

    Finally, piglit_display() is where the relevant work happens. Among other things, it loads the UBOs’ data, draws a rectangle and checks that the rendered pixels have the expected values. Depending on the result of that check, it reports success or failure to piglit.

    How to add a binary test to piglit

    How to build it

    Now that you have written the source code of your test, it’s time to learn how to build it. As we saw in an earlier post, piglit uses cmake to generate its binaries.

    First you need to check whether cmake already tracks the directory containing your source code. If you create a new extension subdirectory under tests/spec, add it to this CMakeLists.txt file.
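
    For example, assuming your new subdirectory were tests/spec/arb_my_extension (a made-up name), the added line would look like this:

        add_subdirectory (arb_my_extension)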

    Once you have done that, cmake needs to know that there are binaries to build and where to find their source code. First, create a CMakeLists.txt file in your test’s directory with the following content:

        piglit_include_target_api()
    If there are binaries in subdirectories below that one, you can add them to the same CMakeLists.txt file you are creating:

    add_subdirectory (compiler)
    add_subdirectory (execution)
    add_subdirectory (linker)

    But this is not enough to compile your test, as we have not yet specified the binary name or where to find its source code. We do that by creating a new file, CMakeLists.gl.txt (the .gl suffix is for OpenGL tests), with content similar to this one:

    include_directories(
        ${GLEXT_INCLUDE_DIR}
        ${OPENGL_INCLUDE_PATH}
    )

    link_libraries (
        piglitutil_${piglit_target_api}
        ${OPENGL_gl_LIBRARY}
    )

    piglit_add_executable (arb_shader_storage_buffer_object-minmax minmax.c)

    # vim: ft=cmake:

    As you see, first we declare where to find the needed headers and libraries. Then we define the binary name (arb_shader_storage_buffer_object-minmax) and its source code file (minmax.c).

    And that’s it. If everything goes fine, the next time you run cmake . && make in piglit’s directory, your test will be built and placed inside the bin/ directory.

    Example in piglit repository

    Let’s review a real example of how a test was added for a new extension in piglit (this commit). As you see, it added the tests/spec/arb_robustness/ subdirectory to tests/spec/CMakeLists.txt, created tests/spec/arb_robustness/CMakeLists.txt so that cmake tracks the new directory, added tests/spec/arb_robustness/CMakeLists.gl.txt to compile the binary test, and added its source code file tests/spec/arb_robustness/draw-vbo-bounds.c.

    tests/spec/CMakeLists.txt                   |   1 +
     tests/spec/arb_robustness/CMakeLists.gl.txt |  19 ++++++++++++++
     tests/spec/arb_robustness/CMakeLists.txt    |   1 +
     tests/spec/arb_robustness/draw-vbo-bounds.c | 205 ++++++++++++++++++++++++
     4 files changed, 226 insertions(+)

    If you run the git log tests/spec/<dir> command in other directories, you will find similar commits.

    How to run it with the all.py profile

    Once you have successfully built the test binary, you can run it standalone. However, it’s better to add it to the tests/all.py profile so it is executed automatically along with the rest of the test suite.
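
    For instance, a standalone run of the minmax binary from before could look like this; the -auto flag makes the test run without user interaction (the output line is just an example):

        $ bin/arb_shader_storage_buffer_object-minmax -auto
        PIGLIT: {"result": "pass" }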

    Open tests/all.py and look for your extension name, then add your test by looking at how other tests are defined and adapting that to your case. If you are developing tests for a new extension, you have to add a few more lines to tests/all.py. For example, this is how it was done for the ARB_shader_storage_buffer_object extension:

    with profile.group_manager(
            grouptools.join('spec', 'arb_shader_storage_buffer_object')) as g:
        g(['arb_shader_storage_buffer_object-minmax'], 'minmax')

    These lines do several things: they indicate under which extension group you will find the results, which binaries to run, and the short name that will appear in the summaries for each binary.
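
    If a binary takes command-line arguments, they go in the same list and each variant gets its own short name. As a hypothetical sketch, a rendering binary with the "offset" mode seen earlier would be registered inside the same with block like this:

        g(['arb_shader_storage_buffer_object-rendering'], 'rendering')
        g(['arb_shader_storage_buffer_object-rendering', 'offset'],
          'rendering-offset')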

    ARB_shader_storage_buffer_object’s minmax test result in HTML summary


    This post and the previous one covered how to create your own piglit tests. I hope you find them useful, and I will be glad to see new contributors because of them :-)

    The next post will explain how to send your contributions to piglit and will include a table of contents so you can easily jump to any post in this series.

    by Samuel Iglesias at June 11, 2015 02:02 PM