Planet Igalia

July 29, 2016

Alejandro Piñeiro

One year of Mesa

Changes

During the last year and something, my work at Igalia was focused on the Intel i965 driver for Mesa, the open source OpenGL implementation. Iago and Samuel were working for some time, then joined Edu and Antía, and then I joined myself, as the fifth igalian.

One of the reasons I will always remember working on this project is that I was also being a father for a year and something, so it will be always easy to relate one and the other. Changes! A project change plus a total life change.

In fact the parenthood affected a little how I joined the project. We (as Igalia) decided that I would be a good candidate to join the project around April. But as my leave of absence was estimated to be around the last weeks of May, and it would be strange to start a project and suddenly disappear, we decided to postpone the joining just after the parental leave. So I spent some time closing and transferring the stuff I was doing at Igalia at that moment, and starting to get into the specifics of Intel driver, and then at the end of May, my parental leave started. Pedro is here!

 

Pedro’s day zero
Parenthood

Mesa can be a challenging project, but nothing compared to how to parent. A lot of changes. In Spain the more usual parental leave for the father is 15 days, where the only  flexibility is that you need to take 2 just after the child has born, and chose when take the other 13. But those 13 needs to be taken in a row. So I decided to took 15 days in a row. Igalia gives you 15 extra days. In fact, the equivalent to 15 days, as they gives you flexibility of how to take them. So instead of took them as 15 full-day, I decided to take 30 part-time days. Additionally Pedro’s grandmother (maternal) came to live with us for a month. That allowed us to do a step by step go back to work. So it was something like:

  • Pedro was born, I am in full parental leave, plus grandmother living at our house helping.
  • I start to work part-time
  • Grandmother goes back to Brazil
  • I start to work full-time.
  • Pedro’s mother goes back to work part-time

And probably the more complicated moment was when Pedro’s mother went back to work. More or less at that time we switched the usual grandparents (paternal) weekend visits to in-week visits. So what it is my usual timetable? Three days per week I work 8:00-14:30 at my home, and I take care of Pedro on the afternoon, when his mother goes to work. Two days per week is 8:00-12:30 at my home, and 15:00-20:00 at my parents home. And that is more or less, as no day is the same that the other 😉 And now I go more to the parks!

Pedro playing on a park
Pedro playing on a park

There were other changes. For example switched from being one of the usual suspects at Igalia office to start to work mostly from home. In any case the experience is totally worthing, and it is incredible how fast the time passed. Pedro is one year old already!

Pedro destroying his birthday cake
Mesa tasks

I almost forgot that this blog post started to talk about my work on Mesa. So many photos! During this year, I have participated (although not alone, specially on spec implementations) in the following tasks (and except last one, in chronological order):

About which task I liked the most, I think that it was the work done on optimizing the NIR to vec4 pass. And not forget that the work done by Igalia to implement internalformat_query2 and vertex_attrib_64bit, plus the work done by Iago and Samuel with fp64, helped to get Mesa 12.0 exposing OpenGL 4.2 support on Broadwell and later.

What happens with accessibility?

Being working full time for the same project for a full year, plus learning how to parent, means that I didn’t have too much spare time for accessibility development. Having said so, I have been reviewing ATK patches, doing the ATK releases regularly and keeping an eye on the GNOME accessibility mailing lists, so I’m still around. And we have Joanmarie Diggs, also fellow igalian, still rocking on improving Orca, WebKitGTK and WAI-ARIA.

Thanks

Finally, I’d like to thank both Intel and Igalia for supporting my work on Mesa and i965 all this time. Specially on allowing me so flexible timetable, where the important is what you deliver, and not when you do the work, allowing me to enjoy parenthood. I also want to thanks the igalian colleagues I has been working during this year, Iago, Samuel, Antía, Edu, Andrés, and Juan, and all those at Intel who have been helping and reviewing my work during all this year, like Jason Ekstrand, Kenneth Graunke, Matt Turner and Francisco Jerez.

 

by infapi00 at July 29, 2016 09:02 AM

July 13, 2016

Iago Toral

Story and status of ARB_gpu_shader_fp64 on Intel GPUs

In case you haven’t heard yet, with the recently announced Mesa 12.0 release, Intel gen8+ GPUs expose OpenGL 4.3, which is quite a big leap from the previous OpenGL 3.3!

OpenGL 4.3
The Mesa i965 Intel driver now exposes OpenGL 4.3 on Broadwell and later!

Although this might surprise some, the truth is that even if the i965 driver only exposed OpenGL 3.3 it had been exposing many of the OpenGL 4.x extensions for quite some time, however, there was one OpenGL 4.0 extension in particular that was still missing and preventing the driver from exposing a higher version: ARB_gpu_shader_fp64 (fp64 for short). There was a good reason for this: it is a very large feature that has been in the works by Intel first and Igalia later for quite some time. We first started to work on this as far back as November 2015 and by that time Intel had already been working on it for months.

I won’t cover here what made this such a large effort because there would be a lot of stuff to cover and I don’t feel like spending weeks writing a series of posts on the subject :). Hopefully I will get a chance to talk about all that at XDC in September, so instead I’ll focus on explaining why we only have this working in gen8+ at the moment and the status of gen7 hardware.

The plan for ARB_gpu_shader_fp64 was always to focus on gen8+ hardware (Broadwell and later) first because it has better support for the feature. I must add that it also has fewer hardware bugs too, although we only found out about that later ;). So the plan was to do gen8+ and then extend the implementation to cover the quirks required by gen7 hardware (IvyBridge, Haswell, ValleyView).

At this point I should explain that Intel GPUs have two code generation backends: scalar and vector. The main difference between both backends is that the vector backend (also known as align16) operates on vectors (surprise, right?) and has native support for things like swizzles and writemasks, while the scalar backend (known as align1) operates on scalars, which means that, for example, a vec4 GLSL operation running is broken up into 4 separate instructions, each one operating on a single component. You might think that this makes the scalar backend slower, but that would not be accurate. In fact it is usually faster because it allows the GPU to exploit SIMD better than the vector backend.

The thing is that different hardware generations use one backend or the other for different shader stages. For example, gen8+ used to run Vertex, Fragment and Compute shaders through the scalar backend and Geometry and Tessellation shaders via the vector backend, whereas Haswell and IvyBridge use the vector backend also for Vertex shaders.

Because you can use 64-bit floating point in any shader stage, the original plan was to implement fp64 support on both backends. Implementing fp64 requires a lot of changes throughout the driver compiler backends, which makes the task anything but trivial, but the vector backend is particularly difficult to implement because the hardware only supports 32-bit swizzles. This restriction means that a hardware swizzle such as XYZW only selects components XY in a dvecN and therefore, there is no direct mechanism to access components ZW. As a consequence, dealing with anything bigger than a dvec2 requires more creative solutions, which then need to face some other hardware limitations and bugs, etc, which eventually makes the vector backend require a significantly larger development effort than the scalar backend.

Thankfully, gen8+ hardware supports scalar Geometry and Tessellation shaders and Intel‘s Kenneth Graunke had been working on enabling that for a while. When we realized that the vector fp64 backend was going to require much more effort than what we had initially thought, he gave a final push to the full scalar gen8+ implementation, which in turn allowed us to have a full fp64 implementation for this hardware and expose OpenGL 4.0, and soon after, OpenGL 4.3.

That does not mean that we don’t care about gen7 though. As I said above, the plan has always been to bring fp64 and OpenGL4 to gen7 as well. In fact, we have been hard at work on that since even before we started sending the gen8+ implementation for review and we have made some good progress.

Besides addressing the quirks of fp64 for IvyBridge and Haswell (yes, they have different implementation requirements) we also need to implement the full fp64 vector backend support from scratch, which as I said, is not a trivial undertaking. Because Haswell seems to require fewer changes we have started with that and I am happy to report that we have a working version already. In fact, we have already sent a small set of patches for review that implement Haswell‘s requirements for the scalar backend and as I write this I am cleaning-up an initial implementation of the vector backend in preparation for review (currently at about 100 patches, but I hope to trim it down a bit before we start the review process). IvyBridge and ValleView will come next.

The initial implementation for the vector backend has room for improvement since the focus was on getting it working first so we can expose OpenGL4 in gen7 as soon as possible. The good thing is that it is more or less clear how we can improve the implementation going forward (you can see an excellent post by Curro on that topic here).

You might also be wondering about OpenGL 4.1’s ARB_vertex_attrib_64bit, after all, that kind of goes hand in hand with ARB_gpu_shader_fp64 and we implemented the extension for gen8+ too. There is good news here too, as my colleague Juan Suárez has already implemented this for Haswell and I would expect it to mostly work on IvyBridge as is or with minor tweaks. With that we should be able to expose at least OpenGL 4.2 on all gen7 hardware once we are done.

So far, implementing ARB_gpu_shader_fp64 has been quite the ride and I have learned a lot of interesting stuff about how the i965 driver and Intel GPUs operate in the process. Hopefully, I’ll get to talk about all this in more detail at XDC later this year. If you are planning to attend and you are interested in discussing this or other Mesa stuff with me, please find me there, I’ll be looking forward to it.

Finally, I’d like to thank both Intel and Igalia for supporting my work on Mesa and i965 all this time, my igalian friends Samuel Iglesias, who has been hard at work with me on the fp64 implementation all this time, Juan Suárez and Andrés Gómez, who have done a lot of work to improve the fp64 test suite in Piglit and all the friends at Intel who have been helping us in the process, very especially Connor Abbot, Francisco Jerez, Jason Ekstrand and Kenneth Graunke.

by Iago Toral at July 13, 2016 02:48 PM

July 12, 2016

Frédéric Wang

MathML Improvements in WebKit

In a previous blog post, I explained the work made by Igalia’s web platform team to refactor WebKit’s MathML layout classes. I stated that although some rendering improvements were a nice side effect, the main goal of the first phase was really to clean the code up so that it is easier for developers to work on MathML in the future. Indeed this really made things easier to review: Quite unexpectedly to me, the second phase only took 4 days to be upstreamed… Kudos to Brent Fulgham for having reviewed so many patches in such a short period of time!

In this blog post, I am going to give an overview of the improvements made during these two first phases taking changeset r203109 as a reference. The changes will be available in WebKitGTK+ 2.14 in September and are likely to be included this month in the next Safari Technology Preview. It definitely remains more work to do such as the third phase or other rendering improvements, but I believe we have already made a big step forward!

Mathematical Fonts

Two years ago, basic support for operator stretching via the OpenType MATH table was added to WebKit. During the refactoring, we improved that support and also made use of more parameters to improve the math layout (see section about OpenType MATH parameters below). While Latin Modern Math will be used in most screenshots, the following one shows that you can use any math fonts. By default WebKit will try and use one of these fonts but if none are available or if you force a non-math font-family then the rendering quality may not be good.

The following screenshot gives the rendering for various fonts. For the last one we used the value sans-serif to illustrate issues with non-math fonts (displaystyle integral too small, mathvariant italic glyphs taken from another font, missing italic correction etc).

Screenshots of a formula with various fonts Integral obtained by contour integration
WebKitGTK+ r203109 with various font-family values.
From top to bottom: TeX Gyre Schola, Latin Modern, DejaVu Math, STIX, and sans-serif.

The href Attribute

This new feature is obvious: You can now create a hyperlink for any part of a mathematical formula! Here is a screenshot of the MathML Torture Test 21 with additional links, as displayed in WebKit r203109. Click the image to load the test case in your browser and test it.

Screenshot of MathML Torture test 21 with hyperlinks

The mathvariant Attribute

Unicode contains Mathematical Alphanumeric Symbols to convey special meaning such as double-struck or specific Arabic styles. Characters for these symbols are generally provided by math fonts. In MathML, mathematical variables are automatically rendered using the italic characters from this Unicode block. One can also access these characters via the mathvariant attribute and that attribute is actually used by many LaTeX-to-MathML converters.

In the following screenshot, you can see that the letters f, x and y are now drawn with this special mathematical italic glyphs and that WebKit uses the conventional fraktur style for the Lie algebra g. Note that the prime is still too small because WebKit does not make use of the ssty feature yet.

Screenshot of Wikipedia article on Lie Algebra Homomorphism of Lie algebra
Top: Safari 9.1.1. Bottom: Safari r203109.

Operators and Spacing

As said in my previous blog post, the rendering of large and stretchy operators have been rewritten a lot and as a consequence the rendering has improved. Also, I mentioned that the width of operators may depend on their height. This may cause accumulated approximations during the computation of preferred widths. The old flexbox-based implementation incorrectly forced layout during preferred computation to avoid that but a quick workaround for that security concern caused the approximate preferred widths to be used for the logical widths. With our new implementation, the logical width is now correctly calculated. Finally, we added partial support for the mpadded element which is often used to tweak spacing in mathematical formulas.

The screenshot below illustrates the fix for a serious regression with large operator (summation symbol) as well as improvements in the rendering of stretchy operators (horizontal braces). Note that the formula has a hack with a zero-width mpadded element which used to cause improper spacing (large gap between the group of a’s and the group of b’s).

Screenshot from Mozilla's MathML torture test using Safari 9.1.1 or the the current refactoring Tests 21 and 22 from the MathML torture test
Left: Safari 9.1.1. Right: Safari r203109.

The following screenshot shows how incorrect width computations used to cause excessive spacing after the stretchy radical and slash symbols:

Screenshot of Wikipedia article on Trigonometric Functions Sine of 1 degree
Top: Safari 9.1.1. Bottom: Safari r203109.

The displaystyle Property

Mathematical formulas can be integrated inside a paragraph of text (inline math in TeX terminology) or displayed in its own horizontally centered paragraph (display math in TeX terminology). In the latter case, the formula is in displaystyle and does not have any restrictions on vertical spacing. In the former case, the layout of the mathematical formula is modified a bit to optimize this vertical spacing and to better integrate within the surrounding text. The displaystyle property can also be set using the corresponding attribute or can change automatically in subformulas (e.g. in fractions).

In the following screenshot the fix for the large operator regression is obvious but you can also notice that the summation is now slightly different for the definition of a Bézier curve (top) and for the one of a rational Bézier curve (bottom). For example, to save some vertical space in the fractions, the sigma symbol is drawn smaller and the scripts attached to it are moved on its right. However, the script size could still be improved when we implement the scriptlevel property.

Screenshot of Wikipedia article on Bézier Curves Definitions of Bézier Curve
Left: Safari 9.1.1. Right: Safari r203109.

OpenType MATH Parameters

Many new OpenType MATH features have been implemented following the MathML in HTML5 Implementation Note:

  • Use of the AxisHeight parameter to set vertical position of fractions, tables and symmetric operators.
  • Use of layout constants for radicals (some of them were already used), scripts and fractions. This improves math spacing and positioning and allows to adjust them according to value of the displaystyle property discussed in the previous section.
  • Use of the italic correction of large operator glyph to set the position of subscripts.

The screenshots below illustrate some of these improvements. For the first one, the use of AxisHeight allows to better align the fraction bar with the plus sign. For the second one, the use of layout constants for scripts as well as the italic correction of the integral improves the placement of limits.

Screenshot of Wikipedia article on the Continued Fractions Generalized Continued Fraction
Top: Safari 9.1.1. Bottom: Safari r203109.
Screenshot of Wikipedia article on the Gamma Function Definition of the Gamma Function
Top: Safari 9.1.1. Bottom: Safari r203109.

Right-to-left radicals

One of the advantage of the old flexbox-based implementation is that right-to-left layout was available for free. This support has of course been preserved in the new implementation. We also added a simple workaround to mirror radicals using a scale transform as shown in the screenshot below. However, general glyph-level mirroring is still missing.

Screenshot of Wikipedia article on Lie Algebra Test 13 from the MathML torture test (Machrek Style)
Top: Safari 9.1.1. Bottom: Safari r203109.

Conclusion

Igalia’s web platform team has been able to follow the MathML in HTML5 Implementation Note in order to significantly improve the rendering of mathematical expressions in WebKit. More work remains to do but we will definitely appreciate any feedback that can help improving native MathML support in web engines. We are also excited to continue work and collaboration at the next Web Engines Hackfest!

July 12, 2016 10:00 PM

July 08, 2016

Frédéric Wang

MathML Refactoring in WebKit

Introduction

If you follow WebKit developments, you are certainly aware that Igalia has been working on WebKit’s MathML implementation for some time. More recently, effort has been made to write a clean implementation addressing issues reported by WebKit reviewers in the past. After joining Igalia in March, I have been in charge of getting this work reviewed and merged into WebKit’s development branch. In the past four months, we have been successful in upstreaming the first phase of the refactoring and the work accomplished is described in this blog post.

Screenshot from Mozilla's MathML torture test using Safari 9.1.1 or the the current refactoring
MathML torture tests with the Latin Modern Math font
Left: Safari 9.1.1. Right: Current MathML branch.

Note that the focus was on code refactoring so the improvement may not be obvious for non-developers. Nevertheless many issues have already been fixed as a consequence of that work: math italic, position of scripts, stretchy and large operators, rendering update and more. More importantly, this preliminary step opens the way for beautiful math rendering based on TeX rules and the OpenType MATH table. Rendering improvements and implementation of new features have already started in the next phase of the refactoring, so stay tuned…

Design Issues

As explained in a previous report, the main design issues in the flexbox-based implementation released in 2013 can essentially be summarized in three points:

  1. WebKit’s code to stretch operator was not efficient at all and was limited to some basic fences buildable via Unicode characters.
  2. WebKit’s MathML code violated many layout invariants, making the code unreliable.
  3. WebKit’s MathML code relied heavily on the C++ renderer classes for flexboxes and had to manage too many anonymous renderers.

For the first point, the performance issue had been fixed by Igalia developers right after the initial feedback from WebKit developers and we improved that again during our refactoring. Partial support for the OpenType MATH table was added during my crowdfunding project and allowed to stretch more operators with the help of math fonts. For the second point, the main issue was also fixed right after the initial feedback. However one could still have some doubts regarding the layout steps, given the complexity implied by the third point. That last issue was still problematic so far and addressing it was the main achievement of our refactoring.

Technically, the dependence on flexbox is unnecessary and the implementation actually only used a limited set of flexbox features. Thus executing the whole flexbox code was overkill. It can also be a burden for people working on other places of the layout code. For example Javi Fernández has worked on improving the box alignments in the past and he had hard time fixing the MathML code impacted by his changes. This is probably the cause of the bad position of the summation symbol that can be seen in the screenshot above.

From the layout perspective, most of the rendering logic was implemented in the flexbox classes and the MathML “renderer” classes were really just managing the creation and update of anonymous nodes and style. Although this sounds good code reuse it actually made impossible to understand how and when layout steps happen or to add more advanced features. The new implementation replaced this manipulation of the render tree with simple arithmetic calculations on box metrics which is safer and more reliable. Also, complex renderers such as RenderMathMLScripts or RenderMathMLRoot actually achieve better rendering quality with less code!

As an example of the complexity, RenderMathMLUnderOver can behave as a RenderMathMLScripts in some situation so we really want the former class to reuse the latter class. As we will see below the old implementation of the two renderers were quite different: RenderMathMLUnderOver only relied on setting column direction in the user agent stylesheet while RenderMathMLScripts created a complex render tree structures with anonymous style. Hence it seemed difficult to share the two cases or to handle DOM changes causing to move from one case to the other one. With our new implementation, this is simply reduced to simple C++ inheritance.

When I started to work on WebKit some years ago, I made the mistake of continuing with the existing approach. The implementation of multiscripts or automatic italic mathvariant added more anonymous objects and made the situation even worse. After the end of my crowdfunding project, Alex Castro did more cleanup and tried to implement important features such as displaystyle but he also soon realized that it was too hard to work with the current code base…

Layout Refactoring

In order to solve the issues discussed in the previous section, Javi and Alex worked on a new MathML branch where the first step was to remove the inheritance on the flexbox layout classes. During the Web Engines Hackfest 2015, I collaborated with the Igalia’s web platform team team to continue the work on this branch. In a second step, we rewrote many MathML renderer functions so that they stop creating anonymous nodes or style. We obtained very encouraging results: The implementation looked much simpler and much more understandable!

Alex announced the initial plan on the webkit-dev mailing list. He started opening bugs and attaching patches to merge the first step. However, that step still required many of the flexbox logic and so made code hard to understand for reviewers. Hence when I joined Igalia four months ago Alex asked me to try and see how to reorganize patches so that the two initial steps can be submitted in one go. This corresponds to the first phase mentioned in the introduction. As indicated on the wiki page, the layout refactoring consisted in rewriting the following member functions of each renderer class:

  • computePreferredLogicalWidths: calculate preferred widths, based on the preferred widths of child renderers.
  • layoutBlock: set final position and size of child renderers.
  • firstLineBaseLine: calculate the ascent of the renderer.
  • paint (optional): perform special painting such as fraction bars.

Refactored renderers no longer rely on any flexbox code nor anonymous renderers and the functions mentioned above essentially perform arithmetic computations. By reading the code, one can be sure that we follow standard layout rules and that we do not perform unnecessary reflow. Moreover, the rules specific to math rendering are only located in the MathML renderers and can be better understood. Details for each class are provided in the next subsections. After all the layout functions were rewritten and the code managing the render tree structure removed, we were able to make the RenderMathMLBlock class inherit from RenderBlock instead of RenderFlexibleBox. Many of the bugs could then be immediately closed or otherwise fixed with small follow-up patches.

Spacing

RenderMathMLSpace is a simple class inserting blank boxes for adjusting spacing of math formulas. Obviously, we do not need any of the complexity of flexbox so it was straightforward to write the layout functions.

3x Large space between 3 and x.

Grouping

RenderMathMLRow performs rendering of a row of math items. Since WebKit does not support linebreaking in MathML at the moment, this is just putting child boxes on a same baseline. One specificity is that some operators can be stretched vertically and so their width may depend on their height.

{ 2x x 3 Row containing a stretched brace, a fraction and a scripted element.

Again, flexbox features are useless here. With the old code, it was not clear whether we were violating the CSS invariant with preferred and logical widths and which kind of relayout or render tree changes would happen when doing the stretch call. By properly implementing the layout functions previously mentioned all of this became much more trustable.

Fractions

RenderMathMLFraction draws a fraction with numerator and denominator.

x+1y+2 Simple fraction.

This used to be implemented using a column direction for the fraction element. Numerator and denominator were wrapped into anonymous nodes with additional style to leave space for the fraction bar and to adjust the horizontal alignments.

RenderMathMLFraction (flexbox with column direction)
  RenderMathMLBlock (anonymous flexbox)
    RenderMathMLRow (numerator)
      ...
  RenderMathMLBlock (anonymous flexbox)
    RenderMathMLRow (denominator)
      ...

It was relatively easy to implement this without any anonymous nodes and again the use of flexbox did not sound justified. For example, to calculate the preferred width we just take the maximum of the preferred widths of the numerator and denominator. For the layout, the calculation of the logical width is similar and we calculate the horizontal coordinates of numerator and denominator so that their centers are aligned. Vertical metrics are similarly calculated from the vertical metrics of the numerator and denominator. During that step, we also fixed some bugs with the linethickness attribute and added support for some OpenType MATH table constants.

Scripts above and below

RenderMathMLUnderOver is used to attach some scripts above and below a base. Each child can itself be a horizontal stretchy operator.

base→ Base with stretchy arrow over it.

This was implemented in the user agent stylesheet by using flexboxes with column direction for the corresponding MathML elements and the C++ class had additional rules to fire the stretching. So the problems and solutions for this class were essentially a mixed of the cases of RenderMathMLFraction and RenderMathMLRow we just discussed.

Subscripts and Superscripts

RenderMathMLScripts is used for a base with some arbitrary number of scripts. All the scripts can have different positions (pre, post, sub, super) and metrics (width, ascent and descent). We must avoid collisions and take care of horizontal and vertical alignements.

base a b c d e f Base with pre and post scripts.

The old code used a complex render tree with additional style to achieve the best possible result. However, the quality was still bad as you can see for the script attached to the integral in the screenshot above. Managing the render tree was a nightmare: Just to give the idea, additional anonymous node and style were used to allow horizontal and vertical adjustments (similar to RenderMathMLFraction above) and prescripts had negative order property so that they were positioned before the base.

RenderMathMLScripts
    Base Wrapper (anonymous flexbox)
      RenderMathMLRow (base)
        ...
    SubSupPair Wrapper (anonymous flexbox with column direction)
      RenderMathMLRow (post-subscript)
        ...
      RenderMathMLRow (subperscript)
        ...
    SubSupPair Wrapper (anonymous flexbox with column direction)
      RenderMathMLRow (post-subscript)
        ...
      RenderMathMLRow (post-superscript)
        ...
    ... (more postscripts)
    RenderMathMLBlock (prescripts separator)
    SubSupPair Wrapper (anonymous flexbox with column direction and order -1)
      RenderMathMLRow (pre-subscript)
        ...
      RenderMathMLRow (pre-subperscript)
        ...
    SubSupPair Wrapper (anonymous flexbox with column direction and order -1)
      RenderMathMLRow (pre-subscript)
        ...
      RenderMathMLRow (pre-superscript)
        ...
    ... (more prescripts)

Rules from TeX and the OpenType MATH table are relatively complex and we decided to implement them directly in the new refactoring as otherwise it was impossible to get decent quality. The code is still complex but we now have clear rules, we only perform simple calculations and the render tree structure matches the DOM tree.

“Enclosing” Notations

RenderMathMLMenclose is a row of math items with some additional notations. Gurpreet Kaur implemented this element two years ago but she followed the same approch, combining anonymous nodes and style for some simple notations and special painting for others.

x+1 circle and strike notations

During the refactoring, the code has been completely rewritten so that RenderMathMLMenclose is now essentially a derived class of RenderMathMLRow with the measuring and painting functions adjusted to take into account the additional notations. During that refactoring, we also removed support for unused radical notation, which was implemented using an anonymous RenderMathMLSquareRoot (see Radicals section below).

Helper Classes for Operators

The RenderMathMLOperator class is used for math operators. It was quite complex class and we decided to extract from it two features that are unrelated to layout:

The remaining code was indeed the real layout part but the mess with anonymous node and style was only removed later (see Text Classes below). Although it seems we just needed to move the code out of RenderMathMLOperator into those new classes, the case of MathOperator was particularly difficult. We had to split the effort into several small steps to make review possible and also fixed many issues due to the entanglement and confusion of these three different features of the RenderMathMLOperator class… The work done for MathOperator actually improved the rendering of stretchy operators as you can see for the horizontal braces in the screenshot above.

Radicals

RenderMathMLRoot is used for square root or arbitrary N-th root. Many of the TeX and OpenType MATH table rules were already used by the old implementation with anonymous nodes and style. However, there were bugs difficult to fix related to zooming, child removal or style change due to the management of the anonymous RenderMathMLOperator to draw the radical sign.

x+1 + x+1 3 square and cube roots

The old implementation actually had two classes for the square and general cases (RenderMathMLSquareRoot and RenderMathMLRoot). The usual technique with various anonymous wrappers and style was used. After the refactoring, we were able to merge everything in a single RenderMathMLRoot class. Because the square root behaves as an mrow, we also made that class derive from RenderMathMLRow to reuse as much code as possible. Here is are how the render trees used to look like:

RenderMathMLSquareRoot
  RenderMathMLBlock (anonymous used for metric adjustements)
    RenderMathMLRadicalOperator (anonymous used for the radical symbol)
      ...
  RenderMathMLRootWrapper (anonymous used for the children)
    RenderMathMLRow (child 1)
      ...
    RenderMathMLRow (child 2)
      ...
    ...
    RenderMathMLRow (child N)
      ...

RenderMathMLRoot
  RenderMathMLRootWrapper (anonymous for the index)
    ...
  RenderMathMLBlock (anonymous used for metric adjustements)
     RenderMathMLRadicalOperator (anonymous used for the radical symbol)
       ...
  RenderMathMLRootWrapper (anonymous for the base)
    ...

Again, we rewrote the implementation using only simple box positioning. The difficult part was to get rid of the anonymous RenderMathMLRadicalOperator to draw the radical symbol. This class was derived from RenderMathMLOperator and extended it with some fallback drawing when math fonts were not available. After having extracted stretchy operator shaping from RenderMathMLOperator it became possible to use the MathOperator helper class to draw the radical symbol. We implemented the fallback for missing math fonts the same as Gecko: Use a scale transform to stretch the base glyph for U+221A SQUARE ROOT. As a bonus, we used such transform to implement glyph mirroring, as required to draw right-to-left radicals in some Arabic mathematical notations.

Text Classes

These classes are containers for math text such as variables or operators. There is a generic RenderMathMLToken class and a derived class RenderMathMLOperator adding features specific to operators such as spacing, dictionary property, stretching… Anonymous wrappers and style were used to implement automatic italic mathvariant or operator spacing. The RenderText child of RenderMathMLOperator was (re)built as an anonymous text node so that is was possible to convert U+002D HYPHEN-MINUS into U+2212 MINUS SIGN or to provide some text for anonymous operators created by RenderMathMLFenced (see Unchanged Classes section).

RenderMathMLToken (e.g. mi element)
  RenderMathMLBlock (anonymous flexbox used to apply CSS italic)
    RenderBlock (anonymous created elsewhere to honor CSS rules)
      RenderText
        text run "x"

RenderMathMLOperator (mo element)
   RenderMathMLBlock (anonymous flexbox used for spacing)
    RenderBlock (anonymous created elsewhere to honor CSS rules)
      RenderText (anonymous destroyed and built again)
         text run "−"

We did a big refactoring to remove all the anonymous nodes created by the MathML renderer classes. Just like for MathOperator, we had to be careful and submit various small pieces as the text rendering was quite sensible to code change.

The simplified operator spacing that was supported by WebKit was easy to implement with the new approach. To do automatic italic mathvariant, we modified the paint function to use Mathematical Alphanumeric Symbols instead of CSS italic as you can notice for the variables displayed in the screenshot above. Hence we could remove the RenderMathMLBlock anonymous wrapper.

The use of an anonymous node for the text prevented it to appear in the dumped render tree of layout tests and also required some hacks in the accessibility code to expose that text. In order to address the cases of the minus sign and of mfenced operators, we decided to use our new MathOperator class again. Indeed MathOperator is actually also able to draw unstretched operators made of a single character and this works for the minus sign and for mfenced operators used in practice.

Unchanged Classes

Two classes have not been modified but such modifications were not needed to remove the dependency on RenderFlexibleBox:

  • RenderMathMLFenced is used for an mrow-like element that is defined in the MathML specification as strictly equivalent to constructions with rows and operators. It is implemented as a derived class of RenderMathMLRow and creates anonymous RenderMathMLOperators. This is the only remaining class that modifies the render tree structure. Note that prominent MathML websites and generators do not use the mfenced element, so it is not a big concern.

  • RenderMathMLTable is used for table layout. It is just derived from RenderTable, not RenderFlexibleBox. We did not change anything for now but we considered creating our own implementation in order to make our code independent from HTML table, to support MathML-specific table features and to make it better integrated with the rest of the MathML code.

Accessibility

Even if our main focus was on rendering, the changes we made also had impact on the MathML accessibility code. Indeed, the accessibility tree is generated from the MathML renderer classes: Since we changed the latter during the refactoring, we also had to adjust the accessibility code. Fortunately, we are lucky to have Joanmarie Diggs in our team and she was able to provide some help here.

First, the accessibility code exposes the linethickness of fractions to implement Apple’s AXMathLineThickness attribute. In practice, this is really necessary to know whether the linethickness is null or not (e.g. binomial coefficient VS the Legendre symbol). Apple’s unit test seemed to expose the ratio between the actual thickness and the default thickness but the accessibility code really just reads the actual thickness calculated by RenderMathMLFraction. Our fix and improvement for linethickness made the Apple’s unit test fail so we had to adjust RenderMathMLFraction to expose the value expected by that test.

In general, the accessibility code does not care about anonymous nodes created for layout purpose and there was some code to avoid exposing them in the accessibility tree. So removing all the anonymous during the layout refactoring was actually a good and safe thing to do. There were some helper functions to implement Apple’s AXMathRootRadicand and AXMathRootIndex attributes that had to be adjusted, though. These functions used to do some work to skip the anonymous wrappers and we were actually able to simplify them.

There was also some specific code for the RenderMathMLOperators and their anonymous RenderText that were necessary to expose the text content. Actually, there was an old bug in the accessibility code and the anonymous RenderMathMLOperators created by mfenced were not correctly exposed. The unit test we had for mfenced operators was only checking the text content but it was still passing and so the regression had never been detected before. After the layout refactoring we removed the anonymous RenderText of mfenced operators and so broke that test… We thus spent some time to fix the RenderMathMLOperator code. Essentially, we removed all the old hacks and only left a specific handling for mfenced operators. We also used this opportunity to improve and extend our MathML accessibility tests.

Finally, the MathML accessibility code was directly implemented into a generic AccessibilityRenderObject class. There was some functions to access math nodes and properties but also specific cases scattered all over the code (anonymous boxes, mfenced operators, math roles etc). In order to facilitate future work and maintenance we decided to move all the MathML code into a new AccessibilityMathMLElement class. Hence the work implied by the layout refactoring actually encouraged us to improve the organization and testing of our accessibility code!

Conclusion

In the past four months, Igalia’s web platform team has successfully upstreamed the refactoring of WebKit’s MathML renderer classes and we are now very confident about the quality of the layout code. In addition to the people mentioned above I would personally like to thank everybody who helped with this work. More specifically, I am very grateful to other people from Igalia (Martin Robinson, Sergio Villar and Manuel Rego) or Apple (Brent Fulgham and Darin Adler) who have spent some time to review patches. As a nice side effect of this work, mathematical formulas look better and the accessibility code has been improved. More is happenning in the next two phases. We are looking forward to continuing implementation of Web standards and collaboration with browser vendors at the next Web Engines Hackfest!

July 08, 2016 10:00 PM

July 01, 2016

Víctor Jáquez

VA-API and DRM/KMS in MinnowBoard

In 2012 I started to work on a video renderer for GStreamer which uses directly the DRM/KMS kernel subsystem to display images. I even blogged about it, but I didn’t finished it.

Nonetheless, in December last year, a customer asked us to finish the element, but this time focusing on i.MX6, rather than OMAP4. Consequently, early this year, kmssink got merged into upstream.

The video sink has some nice features, such as DMA-buf importing through the PRIME kernel interface. This feature makes possible a zero-copy path when the video decoder delivers dmabuf based frames. For this particular project, the kernel we used supports the CODA media subsystem, and through the GStreamer element v4l2videodec we could link pipelines negotiating dmabuf sharing, and thus we played videos very efficiently.

During the last GStreamer Hackfest, Nicolas Dufresne got working the MFC media subsystem, for Exynos, with v4l2videodec and kmssink, but I don’t remember if he also got the dmabuf sharing working.

All in all, kmssink had only been tested in a couple of ARM devices, so I wonder, as a gstremer-vaapi co-maintainer, if it also works in x86 devices.

Some colleagues at Igalia, use a Minnowboard to test WebKit4Wayland, so I borrowed it, installed a minimal Fedora 23 in it, and, surprisingly, I found out that it has a nice VA-API support:

   $ vainfo
   error: can't connect to X server!
   libva info: VA-API version 0.38.1
   libva info: va_getDriverName() returns 0
   libva info: Trying to open /usr/lib64/dri/i965_drv_video.so
   libva info: Found init function __vaDriverInit_0_38
   libva info: va_openDriver() returns 0
   vainfo: VA-API version: 0.38 (libva 1.6.2)
   vainfo: Driver version: Intel i965 driver for Intel(R) Bay Trail - 1.6.2
   vainfo: Supported profile and entrypoints
         VAProfileMPEG2Simple            : VAEntrypointVLD
         VAProfileMPEG2Simple            : VAEntrypointEncSlice
         VAProfileMPEG2Main              : VAEntrypointVLD
         VAProfileMPEG2Main              : VAEntrypointEncSlice
         VAProfileH264ConstrainedBaseline: VAEntrypointVLD
         VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
         VAProfileH264Main               : VAEntrypointVLD
         VAProfileH264Main               : VAEntrypointEncSlice
         VAProfileH264High               : VAEntrypointVLD
         VAProfileH264High               : VAEntrypointEncSlice
         VAProfileH264StereoHigh         : VAEntrypointVLD
         VAProfileVC1Simple              : VAEntrypointVLD
         VAProfileVC1Main                : VAEntrypointVLD
         VAProfileVC1Advanced            : VAEntrypointVLD
         VAProfileNone                   : VAEntrypointVideoProc
         VAProfileJPEGBaseline           : VAEntrypointVLD


Afterwards, I compiled GStreamer using gst-uninstalled in order to have kmssink and the master branch of gstreamer-vaapi. And got it! I had hardware accelerated decoding with VA-API, and video rendering using KMS/DRM. But, sadly, no zero copy, since, in order to have dmabuf sharing, we have to finish and to merge bug 755072 in gstreamer-vaapi.

Running the typical Big Bunny video (1080p, H.264) the playback consumes less than the 50% of the CPU usage, compared with the 100% of CPU usage with software decoders (and a lot of frames dropped).

I recorded a small video with the experiment as a proof.

Update: Javer Martinez told me in twitter, that MFC can do dma-buf export since uses vb2 and Exynos DRM can import, but he couldn’t get zero copy to work due missing format support.

by vjaquez at July 01, 2016 05:00 PM

June 21, 2016

Xavier Castaño

We’re proud to sponsor #Linuxcon North America and to celebrate the 25th anniversary of #Linux and #opensource!

LinuxCon North America and ContainerCon features 175+ sessions for developers, operators, users and other open source professionals with a range of content covering Linux, Containers, Cloud and much more!

Join us in Toronto, Canada where you’ll experience:

  1. Bonus Content Day featuring Docker 101 lab and tutorials
  2. Co-located events such as CloudNativeDay, KVM Forum, Linux Security Summit, Open Source Storage Summit, and Xen Project Developer Summit
  3. 25th Anniversary of Linux Casino Royale Gala on Wednesday, August 24th

Igalia will have a booth there so don’t miss your chance to attend and visit us!

by xavier castano at June 21, 2016 09:54 AM

June 20, 2016

Javier Muñoz

Ansible AWS S3 core module now supports Ceph RGW S3

The Ansible AWS S3 core module now supports Ceph RGW S3. The patch was upstream today and it will be included in Ansible 2.2

This post will introduce the new RGW S3 support in Ansible together with the required bits to run Ansible playbooks handling S3 use cases in Ceph Jewel

.

The Ansible project

Ansible is a simple IT automation engine that automates cloud provisioning, configuration management, application deployment and intra-service orchestration among many other IT needs.

Ansible works by connecting to nodes (SSH/WinRM) and pushing out small programs, called 'Ansible modules' to them. These programs are written to be resource models of the desired state of the system. Ansible then executes these modules and removes them when finished.

Ansible uses playbooks to orchestrate the infrastructure with very detailed control. Those playbooks define configuration policies and orchestration workflows. They are a YAML definition of automation tasks that describe how a particular piece of automation should be done.

Playbooks are modeled as a collection of plays, each of which defines a set of tasks to be executed on a group of remote hosts. A play also defines the environment where the tasks will be executed.

Ansible modules ensure indempotence so it is possible running the same tasks over and over without affecting the final result.

Using the Ansible AWS S3 core module with Ceph

The Ceph RGW S3 support is part of the Amazon S3 core module in Ansible. The AWS S3 core module allows the user to manage S3 buckets and the objects within them. It includes support for creating and deleting both objects and buckets, retrieving objects as files or strings and generating download links.

The patch leverages the AWS S3 use cases without any restriction or limitation with regions, URLs, etc.

To enable the RGW S3 flavour in the S3 core module you set the 'rgw' boolean option to 'true' and the 's3_url' string option to the RGW S3 server.

controller:~$ ansible rgw.test.node -m s3 -a \
"mode=list bucket=my-bucket s3_url=http://rgw.test.server:8000 rgw=true"

The 's3_url' option is mandatory in the RGW S3 flavour.

Testing the RGW S3 core module flavour via Playbook

This Playbook example tests the RGW S3 flavour in a simple three boxes network ('controller', 'rgw.test.node' and 'rgw.test.server')

The 'controller' host is the box running the Ansible engine.

The 'rgw.test.node' is the host connecting to the Ceph RGW server and running the S3 use cases.

The 'rgw.test.server' is the Ceph RGW S3 server.

Running the testing Playbook...

controller:~$ ansible-playbook playbook-test-rgw-s3-core-module.yml

PLAY [all] *********************************************************************

TASK [setup] *******************************************************************
ok: [rgw.test.node]

TASK [create test_file test-file.txt] ******************************************
changed: [rgw.test.node]

TASK [put s3_bucket_items my-test-object-1] ************************************
changed: [rgw.test.node] => (item=my-test-object-1.txt)
changed: [rgw.test.node] => (item=my-test-object-2.txt)

TASK [get s3_bucket_items] *****************************************************
ok: [rgw.test.node]

TASK [download s3_bucket_items] ************************************************
changed: [rgw.test.node] => (item=my-test-object-1.txt)
changed: [rgw.test.node] => (item=my-test-object-2.txt)

TASK [remove s3_bucket_and_content] ********************************************
changed: [rgw.test.node]

PLAY RECAP *********************************************************************
rgw.test.node              : ok=6    changed=4    unreachable=0    failed=0

As expected, the Playbook runs the 'create', 'upload', 'list', 'download' and 'remove' S3 use cases for buckets and objects. Adding the '-v' switch will show a more verbose output.

Wrap-up

All examples, modules and playbooks related to RGW S3 were tested on the new Ceph Jewel release.

Beyond of the RGW S3 support in the AWS S3 core module you could be interested in Ansible Playbooks to set up and configure Ceph clusters automatically. You can find those Playbooks in ceph/ceph-ansible with general support for Monitors, OSDs, MDSs and RGW.

The primary documentation for Ansible is available here. I found the Ansible whitepapers a great resource too.

Acknowledgments

My work in Ansible is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ansible community. Thank you all!

by Javier at June 20, 2016 10:00 PM

Manuel Rego

My BlinkOn 6 Summary: Grid Layout, Houdini & MathML

Igalia could not miss the chance to participate in a new European edition of BlinkOn. So past week my colleague Dape and I were attending BlinkOn 6 in Munich. In this post I’d do a personal summary about the main conference highlights.

Status of CSS Grid Layout implementation

During the conference I gave a talk about the status of Grid Layout on Blink. I went over the whole spec checking the things that are DONE, WIP or TODO. The summary is that on top of the things we’re already working on, and that will be landing very soon (auto-fit repeat, orthogonal flows, normal value in align|justify-items properties, etc.), we’ve just a few tasks pending (most of them changes on the spec in the last 2 months).

The main features pending are:

  • Fragmentation: This is an issue on the whole project and not only related to Grid Layout. For example, Blink doesn’t have proper fragmentation support on Flexbox either, so probably it’s not a blocker in order to ship Grid Layout. This will affect you when you try to print a grid and the rows are cut in the middle, actually they should have been moved completely to the next page. Of course, it’d be really nice to have it fixed and working as expected.
  • Subgrids: This is a hot topic and the opinions vary a lot depending on whom you’re asking. For some people, we should ship without subgrids support and add it later (as the change won’t be breaking any content); other believe that we shouldn’t ship until subgrids are implemented. I think it’s still needed more discussion between the CSS Working Group and the different browser vendors to reach an agreement on this issue.

You can find the slides of my talk on this blog already (notice that you’d need Chrome Canary with the experimental flag enabled to see them properly). Sadly the talks outside the main room were not recorded, so there won’t be any video of mine.

Igalia and Bloomberg working together to build a better web Igalia and Bloomberg working together to build a better web

From the different conversations I had, everyone seems really happy about the work we’re doing on Grid Layout, so thank you very much for the kind words. 😉 As you probably already know Igalia work on this feature has been sponsored by Bloomgerg, big kudos to them!

CSS Houdini

Houdini had a big presence on the conference, it seems clear that Google is really pushing for this to happen. It was nice to see the current status of some APIs that are already working in an experimental way like Typed OM and Painting API. Also, it’s great to check that other browsers are already taking some initial steps too, like Firefox with the Properties and Values API.

Userspace vector graphics using Houdini & Custom Elements by Shane Stephens Userspace vector graphics using Houdini & Custom Elements by Shane Stephens

Apart from the introductory talk on the main room there were 2 more sessions:

MathML in Chromium?

Just in case you don’t know, Igalia has been lately working on the refactoring of the MathML implementation in WebKit. The rationale was that the previous code was a pain to maintain and improve (e.g. it depended a lot on Flexbox making it really hard to implement new features), actually this was one of the reasons why it was removed from Blink after the fork. The refactoring is still ongoing but several patches have already landed.

As the new code (after the refactoring) was looking much better in WebKit my colleague Fred Wang started to port it to Blink on a GitHub repository. The initial results are looking pretty good, but all the code from WebKit hasn’t been ported yet.

MathML mentioned on the Houdini session MathML mentioned on the Houdini session

Thus, I took advantage of the conference to talk about this work with several people in order to check the feasibility of bringing MathML back to Chromium. After some discussions it seems clear that Google would be interested in having MathML support, however at this moment they seem more inclined to do it through the new Houdini APIs like CSS Layout and Fonts Metrics that are still being worked on. MathML seems to be a nice use case to verify that they work as expected.

If they follow this approach, it probably means that we won’t see MathML supported on Blink in the short-term, as we’d need to wait for those APIs to be ready. In any case, Igalia will keep an eye on all this stuff looking for a good opportunity to make it a reality.

Summary

As usual for me the most important part of these conferences is having the chance to meet some new faces and old friends too. It’s really nice to have the chance to spend a while talking face to face with people that have been interacting with you on the Internet for a long time.

Regarding Grid Layout I hope that the different parties can find a good way to bring it to the web authors in the 3 major browser (Chrome, Safari and Firefox) almost synchronously. I cannot say if it’s going to happen this year or the next one, but I’m sure it’s going to be something really big for the Web!

All the CSS Houdini stuff sounds really cool but a long-term thing at this moment. Let’s see how long we need to wait until we can use these new APIs for real.

Munich's Town Hall from from the Tower of St. Peter's Church Munich’s Town Hall from from the Tower of St. Peter’s Church

Finally, it was nice to visit Munich and to have some time for a walk around the city. We couldn’t find any single hill around the city center, next time it might be a good idea to rent a bike.

June 20, 2016 10:00 PM

June 14, 2016

Antonio Gomes

Understanding Chromium’s runtime ozone platform selection

Requisites

For the context of this post, it is assumed a ‘content_shell’ ChromeOS GN build produced with the following commands:

$ gn gen --args='target_os="chromeos"
                 use_ozone=true
                 ozone_platform_wayland=true
                 use_wayland_egl=false
                 ozone_platform_x11=true
                 ozone_platform_headless=true
                 ozone_auto_platforms=false' out-gn-ozone

$ ninja -C out-gn-ozone blink_tests

Ozone at runtime

Looking at the build arguments above, one can foresee that the content_shell app will be able to run on the following Graphics backends: Wayland (no EGL), X11, and “headless” – and it indeed is. This happens thanks to Chromium’s Graphics layer abstraction, Ozone.

So, in order to run content_shell app on Ozone platform “bleh”, one simply does:

$ out-gn-ozone/content_shell --ozone-platform="bleh"

Simple no? Well, yes .. and not as much.

The way the desired Ozone platform classes/objects are instantiated is interesting, involving c++ templates, GYP/GN hooks, python generated code, and more. This post aims to detail the process some more.

Ozone platform selection logic

Two methods kick off OzonePlatform instantiation: ::InitializeForUI and ::InitializeForGPU . They both call ::CreateInstance(), which is our starting point.

This is how simple it looks:

63 void OzonePlatform::CreateInstance() {
64     if (!instance_) {
(..)
69         std::unique_ptr<OzonePlatform> platform =
70             PlatformObject<OzonePlatform>::Create();
71
72         // TODO(spang): Currently need to leak this object.
73         OzonePlatform* pl = platform.release();
74         DCHECK_EQ(instance_, pl);
75     }
76 }

Essentially, when PlatformObject<T>::Create is ran (lines 69 and 70), it ends up calling a method named Create{T}Bleh, where

  • “T” is the template argument name, e.g. “OzonePlatform”.
  • “bleh” is the value passed to –ozone-platform command line parameter.

For instance, in the case of ./content_shell –ozone-platform=x11, the method called would CreateOzonePlatformX11, following the pattern Create{T}Bleh (i.e. “Create”+”OzonePlatform”+”X11”).

The actual logic

In order to understand how PlatformObject class works, lets start by looking at its definition (ui/ozone/platform_object.h & platform_internal_object.h):

template <class T> class PlatformObject {
 public:
  static std::unique_ptr<T> Create();
};
16 template <class T>
17 std::unique_ptr<T> PlatformObject<T>::Create() {
18     typedef typename PlatformConstructorList<T>::Constructor Constructor;
19
20     // Determine selected platform (from --ozone-platform flag, or default).
21     int platform = GetOzonePlatformId();
22
23     // Look up the constructor in the constructor list.
24     Constructor constructor = PlatformConstructorList<T>::kConstructors[platform];
26     // Call the constructor.
27     return base::WrapUnique(constructor());
28 }

In line 24 (highlighted above), the ozone platform runtime selection machinery actually happens. It retrieves a Constructor, which is a typedef for a PlatformConstructorList<T>::Constructor.

By looking at the definition of PlatformConstructorList class (below), Constructor is actually a pointer to a function that returns a T*.

14 template <class T>
15 struct PlatformConstructorList {
16     typedef T* (*Constructor)();
17     static const Constructor kConstructors[kPlatformCount];
18 };

Ok, so basically here is what we know this far:

  1. OzonePlatform::CreateInstance method calls OzonePlatform<bleh>::Create
  2. OzonePlatform<bleh>::Create picks up an index and retrieves a PlatformConstructorList<bleh>::Constructor (via kConstructor[index])
  3. PlatformConstructorList<bleh>::Constructor is a typedef to a function pointer that returns a bleh*.
  4. (..)
  5. This chain ends up calling Create{bleh}{ozone_platform}()

But wait! kConstructors, the array of pointers to functions – that solves the puzzle – is not defined anywhere in src/! 🙂

This is because its actual definition is part of some generated code triggered by specific GN/GYP hooks. They are:

  • generate_ozone_platform_list which generates out/../platform_list.cc,h,txt
  • generate_constructor_list which generates out/../constructor_list.cc though generate_constructor_list.py

In the end out/../generate_constructor_list.cc has the definition of kConstructors.

Again, in the specific case of the given GN build arguments, kConstructors would look like:

template <> const OzonePlatformConstructor
PlatformConstructorList<ui::OzonePlatform>::kConstructors[] = {
    &ui::CreateOzonePlatformHeadless,
    &ui::CreateOzonePlatformWayland,
    &ui::CreateOzonePlatformX11,
};

Logic wrap up

  • GYP/GN hooks are ran at build time, and generate plaform_list.{txt,c,h} as well as constructor_list.cc files respecting ozone_platform_bleh parameters.
    • constructor_list.cc has PlatformConstructorList<bleh>kConstructors actually populated.
  • ./content_shell –ozone-platform=bleh is called
  • OzonePlatform::InitializeFor{UI,Gpu}()
  • OzonePlatform::CreateInstance()
  • PlaformObject<OzonePlatformBleh>::Create()
  • PlatformConstructorList<bleh>::Constructor is retrieved – it is a pointer to a function stored in PlatformConstructorList<bleh>::kConstructor
  • function is ran and an OzonePlatformBleh instance is returned.

by agomes at June 14, 2016 04:28 AM

June 02, 2016

Alejandro Piñeiro

Introducing Mesa intermediate representations on Intel drivers with a practical example

Introduction

The recent big news on the Igalia work on Mesa was that our effort getting the ARB_gpu_shader_fp64 and ARB_vertex_attrib_64bit extensions implemented for Intel Gen8+, allowed to expose OpenGL 4.2 for Gen8+. But I will let other igalians to talk in details about them (no pressures ;)).

In a previous blog post I mentioned that NIR was intended to replace GLSL IR. Although that was true on the context I was talking about, that comment could be somewhat misleading, so I will try to clarify it.

Intermediate representations on Intel Mesa drivers

So first, let’s list the intermediate representations that you would find when working on Mesa Intel drivers:

  • AST (Abstract Syntax Tree): calling it an Intermediate Language is somewhat an abuse of language. This is the tree representation of your GLSL shader just after parsing it with Flex/Bison.
  • Mesa IR (Intermediate Representation): also called HIR and GLSL IR. A real Intermediate Language. It is converted from AST. Here you have optimizations, link support, etc.
  • NIR (New Intermediate Representation): a new Intermediate Language added recently.

So the first questions would be, why three? Having AST and another intermediate representation is easier to explain. AST is a raw tree representation, not useful to generate code. But why IR and NIR?

Current Mesa IR was created some years ago. The design decisions behind it had their advantages. But it has also some disadvantages. For example, their tree-like structure make it complex to traverse, making difficult to navigate, implement optimizations, etc. Other disadvantage was that it was not on SSA form, making again some optimizations hard to write. You can see a summary of all this on Ian Romanick’s presentation “Three Years Experience with a Tree-like Shader IR“.

So at some point an effort was started to “flatenize” Mesa IR and adding SSA support. But the conclusion was that the effort to modify Mesa IR was so big, that it was worth to just start from scratch using the learned lessons, as explained on Connor Abbot’s email “A new IR for Mesa“, in which he proposed this new IR.

Some time later, NIR was ready for production, and as I mentioned on my blog post (that one I’m clarifying right now), some parts of Mesa Intel driver was reimplemented in order to use NIR instead of Mesa IR. So Mesa IR was being replaced there. Where exactly? The parts where the final assembly code was being generated. And now that that is finished (at least on the i965 driver), we can say that Mesa IR is not used to generate code at all.

So right now there are an AST->Mesa IR->NIR chain. What is the plan now? Generate an AST->NIR pass and completely remove Mesa IR? This same question was asked (among other things) on January 2016, on the mesa-dev email “Nir, SCons, and Gallium“.  And the answer was “no”, as you can see on two Ian Romanick’s replies (here and here). The summary is that Mesa IR has several GLSL specifics that aren’t appropriate for NIR’s level. In that sense, NIR is a step below Mesa IR, more near to the GPU needs. It is also worth to mention that since then, Vulkan support was added to Mesa. In order to support Spir-V (Vulkan’s shader language, that is an intermediate representation itself), a SPIRV->NIR pass was created. In that sense, for OpenGL there is an OpenGL-specific intermediate representation, that is Mesa IR, and for Vulkan there is a Vulkan-specific intermediate representation, that is Spirv, and both are translated to the same common representation, NIR.

Practical example:

So, what means that in the practice? Do I need to deal with all those intermediate representations? Well, as anything in life, that would depend. If for example, you want to provide the support for a new GLSL feature or a specific hw, you would need to touch all three. You would need to modify the flex/bison files, so AST would need to be updated, and then Mesa IR, NIR, and the passes that transform one to the other. But if you want to give support for an GLSL feature that is already supported, but on new hw, the most likely is that you will not need changes on any of them, but just on the code that generate the final assembly using NIR.

And what would happen to other features like warnings and errors? Right now most of them are detected at the AST level, and some at the IR level. NIR doesn’t trigger any error/warning yet. It contains several asserts, but basically because it assumes that at that moment the representation of the shader should be correct, so if you find something wrong, means that the developer working on NIR is doing something wrong, in opposite to the developer that wrote the GLSL shader.

So lets go for a practical example I was working on: uninitialized variable warnings (“undefined values” on the bug tracking the issue). In short, this is about warn the developer about things like this:

out color;

void main()

{
 vec4 myTemp;

 color = myTemp;
}

What would be the color on screen? Who knows. So although is a feature you can live without, it is a good nice-to-have.

On the original bug, it was mentioned that this kind of errors are easy to detect, as NIR has a type ssa_undefs, so we just need to check if they are used. And in fact, when I started to work on it, I quickly find how to raise a warning. For example, on the method nir_print.c:print_ssa_def, used to debug, it is easy to modify it in order to point that it is using a undef. But trying to raise the warning there have some problems:

  • As mentioned NIR doesn’t have any warning/error triggering mechanism implemented, you would need to add them.
  • You want to include this warning on the OpenGL InfoLog, but as mentioned NIR is used for both OpenGL and Vulkan, and right now it doesn’t maintains any info about the origin.
  • And perhaps more important, at that point you lack the context data that points which source code line you are working on.

In fact, the last bullet point also applies to Mesa IR. There are some warnings raised at the Mesa IR level, but are “line-less”. Not sure any other developer, but for me, this kind of warning without a reference to the source line number would be annoying to use. Does that mean that the only option would be the totally raw AST tree?

Fortunately, Mesa IR was already saving if a variable was being statically assigned or not (to check some other possible errors). This was being computed on the AST to Mesa IR pass, and in fact the documentation mentions that this value is only valid at that moment. We would be on the middle of AST and Mesa IR. So when to raise the warning? The straightforward solution would be when a variable is , just before/after the error “variableX undeclared” is raised. But that is not so easy. For example:

float myFloat1;
float myFloat2;

myFloat1 = myFloat2;

How many warnings should we raise? Just one, for myFloat2. But technically we are also using myFloat1, and it is uninitialized. So we need to differentiate both cases. Being AST so raw, at that point we don’t have that information, and in fact it is also impossible to go up to the parent expression in order to compute that information. So it was needed to add an attribute on the AST node, that I called is_lhs (as “is left hand side”). That variable would be set when parent expressions are being transformed.

If you are taking attention, probably you start to see what would be the collateral effect of this. Being AST so raw, and OpenGL specific, there would be several corner cases needed to be manually assigned. In fact the  first commit of the series is already covering several corner cases. And in spite of this, once the code reached master, there were two cases of false positives that needed extra checks (for builtin-variables and for inout/out function parameters)

After those two false positives, managing this warning was spread all along the code that made the AST to Mesa IR pass, so seemed easy to broke. So I decided to send too some unit tests to verify that it gets working. First I sent the make check test that tested that warning, and then the unit tests. 30 unit tests (it was initially 28, but reviewer asked two more). Not a bad number for a warning.

Final words

At this point, one would wonder if it still makes sense to have this warning on the AST to Mesa IR pass, and if it would have it better to do it as initially proposed, on NIR. But although it is true that “just detecting it” would be easier on NIR, without dealing with so many corner cases, I still think that adding the support for raising warning/errors compatible with OpenGL Infolog, and bringing somehow the original source code line number, would mean too many changes on both Mesa IR and NIR. More changes that dealing with those corner cases when using the variable on the AST to Mesa IR pass. And in any case, if in the future the situation changes, and makes sense to move the warning to NIR, we would have the unit tests that would help to ensure that we don’t introduce regressions.

In relation to the intermediate representations, just to note that I’m focusing on the Intel driver. Gallium drivers use other intermediate representation, called TGSI. As far as I know, on those drivers, they have a AST->Mesa IR->TGSI chain, and right now there is a work in progress AST->Mesa IR->NIR->TGSI chain that will be used on some specific cases. But all this is beyond my knowledge, so you would need to investigate if you are interested.

Appendix, extra documentation:

If you want more details about MESA IR, you can read:

  • Read Mesa IR README
  • Read past blog posts from fellow igalian Iago Toral (when NIR was not available yet): post 1 and post 2

If you want extra information about Mesa NIR, you can read:

 

 

by infapi00 at June 02, 2016 05:47 PM

Xavier Castaño

Igalia sponsors LinuxCon and ContainerCon Japan 2016, the premier Linux Conference in Asia

Join us at LinuxCon + ContainerCon Japan! LinuxCon Japan is the premier Linux conference in Asia that brings together a unique blend of core developers, administrators, users, community managers and industry experts. It is designed not only to encourage collaboration but to support future interaction between Japan and other Asia Pacific countries and the rest of the global Linux community.

ContainerCon also comes to Japan for the first time this year, after a successful launch in North America in 2015. New developments in Linux containers are driving the adoption of cloud and virtualization technologies in many industries and ContainerCon Japan will provide a platform for exhibiting the best work.

My colleague Mi Sun Silvia Cho will be talking about WebKit For Wayland and we will show some nice demos about our Browsers, Graphics and Compilers experience in our booth in Tokyo next month.

by xavier castano at June 02, 2016 07:10 AM

May 26, 2016

Manuel Rego

CSS Grid Layout and positioned items

As part of the work done by Igalia in the CSS Grid Layout implementation on Chromium/Blink and Safari/WebKit, we’ve been implementing the support for positioned items. Yeah, absolute positioning inside a grid. 😅

Probably the first idea is that come to your mind is that you don’t want to use positioned grid items, but maybe in some use cases it can be needed. The idea of this post is to explain how they work inside a grid container as they have some particularities.

Actually there’s not such a big difference compared to regular grid items. When the grid container is the containing block of the positioned items (e.g. using position: relative; on the grid container) they’re placed almost the same than regular grid items. But, there’re a few differences:

  • Positioned items don't stretch by default.
  • They don't use the implicit grid. They don't create implicit tracks.
  • They don't occupy cells regarding auto-placement feature.
  • autohas a special meaning when referring lines.

Let’s explain with more detail each of these features.

Positioned items shrink to fit

We’re used to regular items that stretch by default to fill their area. However, that’s not the case for positioned items, similar to what a positioned regular block does, they shrink to fit.

This is pretty easy to get, but a simple example will make it crystal clear:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 200px; grid-template-rows: 200px 100px;">
    <div style="grid-column: 1 / 3; grid-row: 1 / 3;"></div>
    <div style="position: absolute; grid-column: 1 / 3; grid-row: 1 / 3;"></div>
</div>

In this example we’ve a simple 2x2 grid. Both the regular item and the positioned one are placed with the same rules taking the whole grid. This defines the area for those items, which takes the 1st & 2nd rows and 1st & 2nd columns.

Positioned items shrink to fit Positioned items shrink to fit

The regular item stretches by default both horizontally and vertically, so it takes the whole size of the grid area. However, the positioned item shrink to fit and adapts its size to the contents.

For the examples in the next points I’m ignoring this difference, as I want to show the area that each positioned item takes. To get the same result than in the pictures, you’d need to set 100% width and height on the positioned items.

Positioned items and implicit grid

Positioned items don’t participate in the layout of the grid, neither they affect how items are placed.

You can place a regular item outside the explicit grid, and the grid will create the required tracks to accommodate the item. However, in the case of positioned items, you cannot even refer to lines in the implicit grid, they'll be treated as auto. Which means that you cannot place a positioned item in the implicit grid. they cannot create implicit tracks as they don't participate in the layout of the grid.

Let’s use an example to understand this better:

<div style="display: grid; position: relative;
            grid-template-columns: 200px 100px; grid-template-rows: 100px 100px;
            grid-auto-columns: 100px; grid-auto-rows: 100px;
            width: 300px; height: 200px;">
    <div style="position: absolute; grid-area: 4 / 4;"></div>
</div>

The example defines a 2x2 grid, but the positioned item is using grid-area: 4 / 4; so it tries to goes to the 4th row and 4th column. However the positioned items cannot create those implicit tracks. So it’s positioned like if it has auto, which in this case will take the whole explicit grid. auto has a special meaning in positioned items, it’ll be properly explained later.

Positioned items do not create implicit tracks Positioned items do not create implicit tracks

Imagine another example where regular items create implicit tracks:

<div style="display: grid; position: relative;
            grid-template-columns: 200px 100px; grid-template-rows: 100px 100px;
            grid-auto-columns: 100px; grid-auto-rows: 100px;
            width: 500px; height: 400px;">
    <div style="grid-area: 1 / 4;"></div>
    <div style="grid-area: 4 / 1;"></div>
    <div style="position: absolute; grid-area: 4 / 4;"></div>
</div>

In this case, the regular items will be creating the implicit tracks, making a 4x4 grid in total. Now the positioned item can be placed on the 4th row and 4th column, even if those columns are on the explicit grid.

Positioned items can be placed on the implicit grid Positioned items can be placed on the implicit grid

As you can see this part of the post has been modified, thanks to @fantasai for notifying me about the mistake.

Positioned items and placement algorithm

Again the positioned items do not affect the position of other items, as they don’t participate in the placement algorithm.

So, if you’ve a positioned item and you’re using auto-placement for some regular items, it’s expected that the positioned one overlaps the other. The positioned items are completely ignored during auto-placement.

Just showing a simple example to show this behavior:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 300px; grid-template-rows: 200px 200px;">
    <div style="grid-column: auto; grid-row: auto;"></div>
    <div style="grid-column: auto; grid-row: auto;"></div>
    <div style="grid-column: auto; grid-row: auto;"></div>
    <div style="position: absolute;
                grid-column: 2 / 3; grid-row: 1 / 2;"></div>
</div>

Here we’ve again a 2x2 grid, with 3 auto-placed regular items, and 1 absolutely positioned item. As you can see the positioned item is placed on the 1st row and 2nd column, but there’s an auto-placed item in that cell too, which is below the positioned one. This shows that the grid container doesn’t care about positioned items and it just ignores them when it has to place regular items.

Positioned items and placement algorithm Positioned items and placement algorithm

If all the children were not positioned, the last one would be placed in the given position (1st row and 2nd column), and the rest of them (auto-placed) will take the other cells, without overlapping.

Positioned items and auto lines

This is probably the biggest difference compared to regular grid items. If you don’t specify a line, it’s considered that you’re using auto, but auto is not resolved as span 1 like in regular items. For positioned items auto is resolved to the padding edge.

The specification introduces the concepts of the lines 0 and -0, despite how weird it can sound, it actually makes sense. The auto lines would be referencing to those 0 and -0 lines, that represent the padding edges of the grid container.

Again let’s use a few examples to explain this:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 200px; grid-template-rows: 200px 200px;
            padding: 50px 100px; width: 500px; height: 400px;">
    <div style="position: absolute;
                grid-column: 1 / auto; grid-row: 2 / auto;"></div>
</div>

Here we have a 2x2 grid container, which has some padding. The positioned item will be placed in the 2nd row and 1st column, but its area will take up to the padding edges (as the end line is auto in both axis).

Positioned items and auto lines Positioned items and auto lines

We could even place positioned grid items on the padding itself. For example using “grid-column: auto / 1;” the item would be on the left padding.

Positioned items using auto lines to be placed on the left padding Positioned items using auto lines to be placed on the left padding

Of course if the grid is wider and we’ve some free space on the content box, the items will take that space too. For example:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 200px; grid-template-rows: 200px 200px;
            padding: 50px 100px; width: 600px; height: 400px;">
    <div style="position: absolute;
                grid-column: 3 / auto; grid-row: 2 / 3;"></div>
</div>

Here the grid columns are 500px, but the grid container has 600px width. This means that we’ve 100px of free space in the grid content box. As you can see in the example, that space will be also used when the positioned items extend up to the padding edges.

Positioned items taking free space and right padding Positioned items taking free space and right padding

Offsets

Of course you can use offsets to place your positioned items (left, right, top and bottom properties).

These offsets will apply inside the grid area defined for the positioned items, following the rules explained above.

Let’s use another example:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 200px; grid-template-rows: 200px 200px;
            padding: 50px 100px; width: 500px; height: 400px;">
    <div style="position: absolute;
                grid-column: 1 / auto; grid-row: 2 / auto;
                left: 90px; right: 70px; top: 80px; bottom: 60px;"></div>
</div>

Again a 2x2 grid container with some padding. The positioned item have some offsets which are applied inside its grid area.

Positioned items and offets Positioned items and offets

Wrap-up

I’m not completely sure about how important is the support of positioned elements for web authors using Grid Layout. You’ll be the ones that have to tell if you really find use cases that need this. I hope this post helps to understand it better and make your minds about real-life scenarios where this might be useful.

The good news is that you can test this already in the most recent versions of some major browsers: Chrome Canary, Safari Technology Preview and Firefox. We hope that the 3 implementations are interoperable, but please let us know if you find any issue.

There’s one last thing missing: alignment support for positioned items. This hasn’t been implemented yet in any of the browsers, but the behavior will be pretty similar to the one you can already use with regular grid items. Hopefully, we’ll have time to add support for this in the coming months.

Igalia and Bloomberg working together to build a better web Igalia and Bloomberg working together to build a better web

Last but least, thanks to Bloomberg for supporting Igalia in the CSS Grid Layout implementation on Blink and WebKit.

May 26, 2016 10:00 PM

May 24, 2016

Alberto Garcia

I/O bursts with QEMU 2.6

QEMU 2.6 was released a few days ago. One new feature that I have been working on is the new way to configure I/O limits in disk drives to allow bursts and increase the responsiveness of the virtual machine. In this post I’ll try to explain how it works.

The basic settings

First I will summarize the basic settings that were already available in earlier versions of QEMU.

Two aspects of the disk I/O can be limited: the number of bytes per second and the number of operations per second (IOPS). For each one of them the user can set a global limit or separate limits for read and write operations. This gives us a total of six different parameters.

I/O limits can be set using the throttling.* parameters of -drive, or using the QMP block_set_io_throttle command. These are the names of the parameters for both cases:

-drive block_set_io_throttle
throttling.iops-total iops
throttling.iops-read iops_rd
throttling.iops-write iops_wr
throttling.bps-total bps
throttling.bps-read bps_rd
throttling.bps-write bps_wr

It is possible to set limits for both IOPS and bps at the same time, and for each case we can decide whether to have separate read and write limits or not, but if iops-total is set then neither iops-read nor iops-write can be set. The same applies to bps-total and bps-read/write.

The default value of these parameters is 0, and it means unlimited.

In its most basic usage, the user can add a drive to QEMU with a limit of, say, 100 IOPS with the following -drive line:

-drive file=hd0.qcow2,throttling.iops-total=100

We can do the same using QMP. In this case all these parameters are mandatory, so we must set to 0 the ones that we don’t want to limit:

   { "execute": "block_set_io_throttle",
     "arguments": {
        "device": "virtio0",
        "iops": 100,
        "iops_rd": 0,
        "iops_wr": 0,
        "bps": 0,
        "bps_rd": 0,
        "bps_wr": 0
     }
   }

I/O bursts

While the settings that we have just seen are enough to prevent the virtual machine from performing too much I/O, it can be useful to allow the user to exceed those limits occasionally. This way we can have a more responsive VM that is able to cope better with peaks of activity while keeping the average limits lower the rest of the time.

Starting from QEMU 2.6, it is possible to allow the user to do bursts of I/O for a configurable amount of time. A burst is an amount of I/O that can exceed the basic limit, and there are two parameters that control them: their length and the maximum amount of I/O they allow. These two can be configured separately for each one of the six basic parameters described in the previous section, but here we’ll use ‘iops-total’ as an example.

The I/O limit during bursts is set using ‘iops-total-max’, and the maximum length (in seconds) is set with ‘iops-total-max-length’. So if we want to configure a drive with a basic limit of 100 IOPS and allow bursts of 2000 IOPS for 60 seconds, we would do it like this (the line is split for clarity):

   -drive file=hd0.qcow2,
          throttling.iops-total=100,
          throttling.iops-total-max=2000,
          throttling.iops-total-max-length=60

Or with QMP:

   { "execute": "block_set_io_throttle",
     "arguments": {
        "device": "virtio0",
        "iops": 100,
        "iops_rd": 0,
        "iops_wr": 0,
        "bps": 0,
        "bps_rd": 0,
        "bps_wr": 0,
        "iops_max": 2000,
        "iops_max_length": 60,
     }
   }

With this, the user can perform I/O on hd0.qcow2 at a rate of 2000 IOPS for 1 minute before it’s throttled down to 100 IOPS.

The user will be able to do bursts again if there’s a sufficiently long period of time with unused I/O (see below for details).

The default value for ‘iops-total-max’ is 0 and it means that bursts are not allowed. ‘iops-total-max-length’ can only be set if ‘iops-total-max’ is set as well, and its default value is 1 second.

Controlling the size of I/O operations

When applying IOPS limits all I/O operations are treated equally regardless of their size. This means that the user can take advantage of this in order to circumvent the limits and submit one huge I/O request instead of several smaller ones.

QEMU provides a setting called throttling.iops-size to prevent this from happening. This setting specifies the size (in bytes) of an I/O request for accounting purposes. Larger requests will be counted proportionally to this size.

For example, if iops-size is set to 4096 then an 8KB request will be counted as two, and a 6KB request will be counted as one and a half. This only applies to requests larger than iops-size: smaller requests will be always counted as one, no matter their size.

The default value of iops-size is 0 and it means that the size of the requests is never taken into account when applying IOPS limits.

Applying I/O limits to groups of disks

In all the examples so far we have seen how to apply limits to the I/O performed on individual drives, but QEMU allows grouping drives so they all share the same limits.

This feature is available since QEMU 2.4. Please refer to the post I wrote when it was published for more details.

The Leaky Bucket algorithm

I/O limits in QEMU are implemented using the leaky bucket algorithm (specifically the “Leaky bucket as a meter” variant).

This algorithm uses the analogy of a bucket that leaks water constantly. The water that gets into the bucket represents the I/O that has been performed, and no more I/O is allowed once the bucket is full.

To see the way this corresponds to the throttling parameters in QEMU, consider the following values:

  iops-total=100
  iops-total-max=2000
  iops-total-max-length=60
  • Water leaks from the bucket at a rate of 100 IOPS.
  • Water can be added to the bucket at a rate of 2000 IOPS.
  • The size of the bucket is 2000 x 60 = 120000.
  • If iops-total-max is unset then the bucket size is 100.

bucket

The bucket is initially empty, therefore water can be added until it’s full at a rate of 2000 IOPS (the burst rate). Once the bucket is full we can only add as much water as it leaks, therefore the I/O rate is reduced to 100 IOPS. If we add less water than it leaks then the bucket will start to empty, allowing for bursts again.

Note that since water is leaking from the bucket even during bursts, it will take a bit more than 60 seconds at 2000 IOPS to fill it up. After those 60 seconds the bucket will have leaked 60 x 100 = 6000, allowing for 3 more seconds of I/O at 2000 IOPS.

Also, due to the way the algorithm works, longer burst can be done at a lower I/O rate, e.g. 1000 IOPS during 120 seconds.

Acknowledgments

As usual, my work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the QEMU development team.

igalia-outscale

Enjoy QEMU 2.6!

by berto at May 24, 2016 11:47 AM

May 23, 2016

Igalia Compilers Team

Awaiting the future of JavaScript in V8

On the evening of Monday, May 16th, 2016, we have made history. We’ve landed the initial implementation of “Async Functions” in V8, the JavaScript runtime in use by the Google Chrome and Node.js. We do these things not because they are easy, but because they are hard. Because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one we are willing to accept. It is very exciting to see this, roughly 2 months of implementation, codereview and standards finangling/discussion to land. It is truly an honour.

 

To introduce you to Async Functions, it’s first necessary to understand two things: the status quo of async programming in JavaScript, as well as Generators (previously implemented by fellow Igalian Andy).

 

Async programming in JavaScript has historically been implemented by callbacks. `window.setTimeout(function toExecuteLaterOnceTimeHasPassed() {}, …)` being the common example. Callbacks on their own are not scalable: when numerous nested asynchronous operations are needed, code becomes extremely difficult to read and reason about. Abstraction libraries have been tacked on to improve this, including caolan’s async package, or Promise libraries such as Q. These abstractions simplify control flow management and data flow management, and are a massive improvement over plain Callbacks. But we can do better! For a more detailed look at Promises, have a look at the fantastic MDN article. Some great resources on why and how callbacks can lead to utter non-scalable disaster exist too, check out http://callbackhell.com!

 

The second concept, Generators, allow a runtime to return from a function at an arbitrary line, and later re-enter that function at the following instruction, in order to continue execution. So right away you can imagine where this is going — we can continue execution of the same function, rather than writing a closure to continue execution in a new function. Async Functions rely on this same mechanism (and in fact, on the underlying Generators implementation), to achieve their goal, immensely simplifying non-trivial coordination of asynchronous operations.

 

As a simple example, lets compare the following two approaches:
function deployApplication() {
  return cleanDirectory(__DEPLOYMENT_DIR__).
    then(fetchNpmDependencies).
    then(
      deps => Promise.all(
        deps.map(
          dep => moveToDeploymentSite(
            dep.files,
            `${__DEPLOYMENT_DIR__}/deps/${dep.name}`
        ))).
    then(() => compileSources(__SRC_DIR__,
                              __DEPLOYMENT_DIR__)).
    then(uploadToServer);
}

The Promise boiler plate makes this preit harder to read and follow than it could be. And what happens if an error occurs? Do we want to add catch handlers to each link in the Promise chain? That will only make it even more difficult to follow, with error handling interleaved in difficult to read ways.

Lets refactor this using async functions:

async function deployApplication() {
    await cleanDIrectory(__DEPLOYMENT_DIR__);
    let dependencies = await fetchNpmDependencies();

    // *see below*
    for (let dep of dependencies) {
        await moveToDeploymentSite(
            dep.files,
            `${__DEPLOYMENT_DIR__}/deps/${dep.name}`);
    }

    await compileSources(__SRC_DIR__,
                         __DEPLOYMENT_DIR__);
    return uploadToServer();
}

 

You’ll notice that the “moveToDeploymentSite” step is slightly different in the async function version, in that it completes each operation in a serial pipeline, rather than completing each operation in parallel, and continuing once finished. This is an unfortunate limitation of the async function specification, which will hopefully be improved on in the future.

 

In the meantime, it’s still possible to use the Promise API in async functions, as you can `await` any Promise, and continue execution after it is resolved. This grants compatibility with  numerous existing Web Platform APIs (such as `fetch()`), which is ultimately a good thing! Here’s an alternative implementation of this step, which performs the `moveToDeploymentSite()` bits in parallel, rather than serially:

 


await Promise.all(dependencies.map(
  dep => moveToDeploymentSite(
    dep.files,
    `${__DEPLOYMENT_DIR__}/deps/${dep.name}`
)));

 

Now, it’s clear from the let dependencies = await fetchNpmDependencies(); line that Promises are unwrapped automatically. What happens if the promise is rejected with an error, rather than resolved with a value? With try-catch blocks, we can catch rejected promise errors inside async functions! And if they are not caught, they will automatically return a rejected Promise from the async function.

function throwsError() { throw new Error("oops"); }

async function foo() { throwsError(); }

// will print the Error thrown in `throwsError`.
foo().catch(console.error)

async function bar() {
  try {
    var value = await foo();
  } catch (error) {
    // Rejected Promise is unwrapped automatically, and
    // execution continues here, allowing us to recover
    // from the error! `error` is `new Error("oops!")`
  }
}

 

There are also lots of convenient forms of async function declarations, which hopefully serve lots of interesting use-cases! You can concisely declare methods as asynchronous in Object literals and ES6 classes, by preceding the method name with the `async` keyword (without a preceding line terminator!)

 


class C {
  async doAsyncOperation() {
    // ...
  }
};

var obj = {
  async getFacebookProfileAsynchronously() {
    /* ... */
  }
};

 

These features allow us to write more idiomatic, easier to understand asynchronous control flow in our applications, and future extensions to the ECMAScript specification will enable even more idiomatic forms for writing complex algorithms, in a maintainable and readable fashion. We are very excited about this! There are numerous other resources on the web detailing async functions, their benefits, and perhaps ways they might be improved in the future. Some good ones include [this piece from Google’s Jake Archibald](https://jakearchibald.com/2014/es7-async-functions/), so give that a read for more details. It’s a few years old, but it holds up nicely!

 

So, now that you’ve seen the overview of the feature, you might be wondering how you can try it out, and when it will be available for use. For the next few weeks, it’s still too experimental even for the “Experimental Javascript” flag.  But if you are adventurous, you can try it already!  Fetch the latest Chrome Canary build, and start Chrome with the command-line-flag `–js-flags=”–harmony-async-await”`. We can’t make promises about the shipping timeline, but it could ship as early as Chrome 53 or Chrome 54, which will become stable in September or October.

 

We owe a shout out to Bloomberg, who have provided us with resources to improve the web platform that we love. Hopefully, we are providing their engineers with ways to write more maintainable, more performant, and more beautiful code. We hope to continue this working relationship in the future!

 

As well, shoutouts are owed to the Chromium team, who have assisted in reviewing the feature, verifying its stability, getting devtools integration working, and ultimately getting the code upstream. Terriffic! In addition, the WebKit team has also been very helpful, and hopefully we will see the feature land in JavaScriptCore in the not too distant future.

by compilers at May 23, 2016 02:08 PM

May 20, 2016

Víctor Jáquez

GStreamer Hackfest 2016

Yes, it happened again: the Gstreamer Spring Hackfest 2016! This time in the beautiful city of Thessaloniki. Thanks a lot, Vivia and Sebastian, for making it happen.

Thessaloniki
Thessaloniki’s promenade

My objective this time was to work with dma-buf support in gstreamer-vaapi. Though it is supported already, it needs a major clean up, and to extend its usage for downstream buffers (bugs 755072 and 765435).

In the way I learned that we need to update our internal API (called `libgstvaapi`), when handling dma-buf, to support mult-plane formats.

On the other hand, Nicolas Dufresne and I talked a bit about kmssink, libdrm and dma-buf. He managed to hack his Odroid U (Exynos3) to enable its V4L2 mem2mem video decoder and share buffers with kmssink. It was amazing. By the way, he promised me to write a blog post with the instructions to replicate his deed.

GStreamer Hackfest
Photo by Luis de Bethencourt

Finally, we had a preview of Edward Hervey‘s decodebin3. It is fun his test of switching the different audio streams in a media container (the different available dubbings) in every second or less. It was truly a multi-language audio!

In the meantime, we shared beers and meals, learning and laughing.

by vjaquez at May 20, 2016 10:32 AM

May 18, 2016

Antonio Gomes

[Chromium] content_shell running on Wayland desktop (Weston Compositor)

During my first weeks at Igalia, I got an interesting and promising task of Understand the status of Wayland support in Chromium upstream.

At first I could see clear signs that Wayland is being actively considered by the Chromium community:

  1. Ozone/Wayland project by Intel – which was my starting point as described later on.
  2. The meta bug “upstream wayland backend for ozone (20% project)“, which has some recent activity by the Chromium community.
  3. This comment in the chromium-dev mailing list (by a Googler):
  4. “(..) I ‘d recommend using ToT. Wayland support is a work-in-progress and newer trees will probably be better.”.

    Chromium’s DEPS file has “wayland” as one its core dependency (search for “wayland”).

Next step, naturally, was get my hands dirty, compiling and experimenting with it. I decided to start with content_shell.


Environment:

Ubuntu 16.04 LTS, regular Gnome session (with X) and Weston Wayland compositor running (no ‘xwayland’ installed) to run Wayland apps.

GYP_DEFINES

component=static_library use_ash=1 use_aura=1 chromeos=0 use_ozone=1 ozone_platform_wayland=1 use_wayland_egl=1 ozone_auto_platforms=0 use_xkbcommon=1 clang=0 use_sysroot=0 linux_use_bundled_gold=0

(note: GYP was used for personal familiarity with it, but GN applies fine here).

Chromium version

Base SHA 5da650481 (as of 2016/May/17)

Initial results

As is, content_shell built fine, but hung at runtime upon launch, hitting a CHECK at desktop_factory_ozone.cc(21).

Analysis and Action

After understanding current Ozone/Wayland upstream support, compare designs/code against 01.org, I could start connecting some of the missing dots.

The following files were “ported” from 01.org:

  • ui/views/widget/desktop_aura/desktop_drag_drop_client_wayland.cc / h
  • ui/views/widget/desktop_aura/desktop_screen_ozone_wayland.cc / h
  • ui/views/widget/desktop_aura/desktop_window_tree_host_ozone_wayland.cc / h

And then, I implemented DesktopFactoryOzoneWayland class (ui/views/widget/desktop_factory_ozone_wayland.cc/h) – it inherits from DesktopFactoryOzone, and implements the following pure virtual methods ::CreateWindowTreeHost and ::CreateDesktopScreen.

Initial result

After that, I could build and run content_shell with Weston Wayland Compositor (with no ‘xwayland’ installed). See a quick preview below.

Remarks

As is, the UI process owns the the Wayland connection, and GPU process runs without GL support.

UI processes initializes Ozone by calling:

#0 ui::WaylandSurfaceFactory::WaylandSurfaceFactory
#1 ui::OzonePlatformWayland::InitializeUI
#2 ui::OzonePlatform::InitializeForUI();
#3 aura::Env::Init()
#4 aura::Env::CreateInstance()
#5 content::BrowserMainLoop::InitializeToolkit()
(…)
#X content::ContentMain()
<UI PROCESS LAUNCH>
On the other side, GPU process gets initialized by calling:
#0 ui::WaylandSurfaceFactory::WaylandSurfaceFactory
#1 ui::OzonePlatformWayland::InitializeGPU
#2 ui::OzonePlatform::InitializeForGPU();
#3 gfx::InitializeStaticGLBindings()
#4 gfx::GLSurface::InitializeOneOffImplementation()
#5 gfx::GLSurface::InitializeOneOff()
#6 content::GpuMain()
<GPU PROCESS LAUNCH>

Differently from UI process, the GPU process call does not initialize OzonePlatformWayland::display_ and instead passes ‘nullptr’ to WaylandSurfaceFactory ctor.

Down the road on the GPU processes initialization WaylandSurfaceFactory::LoadEGLGLES2Bindings is called but bails out earlier explicitly because display_ is NIL.
Then, UI process falls back to software rendering (see call to WaylandSurfaceFactory::CreateCanvasForWidget).

Next step

  • So far I have experimented Ozone/Wayland support using “linux” as the target OS. As far as I can tell, most of the Ozone work upstream though has been focusing on “chromeos” builds instead (e.g. ozone/x11).
  • Hence the idea is to clean up the code and agree with Googlers / 01.org (Intel) people about how to best make use of this code.
  • It is being also discussed with some Googlers what the best way to tackle this lack of GL support is. Some real nice stuff are on the pipe here.

 

by agomes at May 18, 2016 08:21 PM

May 16, 2016

Javier Muñoz

The Ceph RGW storage driver goes upstream in Libcloud

The Ceph RGW storage driver was upstream in Apache Libcloud today. It is available in the Libcloud trunk repository and it will ship with the next release Apache Libcloud 1.0.0.

This post will introduce the new RGW driver together with the proper configuration parameters to run some examples uploading/downloading objects in Ceph Jewel.

The Ceph RGW storage driver

The Ceph RGW storage driver requires Ceph Jewel or above. As of this writing, the last Ceph Jewel version is 10.2.1. This version is available in the downloads section.

The driver extends the Libcloud S3 storage driver to provide a compatible S3 API with Ceph RGW.

The driver also contains support for AWS signature versions 2 (AWS2) and 4 (AWS4). It leverages the Libcloud common auth support on the client side. On the Ceph RGW side it required a little patch to handle unsigned paylods in the AWS4 auth header.

Developers and apps can use the Ceph RGW driver via the S3_RGW provider easily. A simple snippet follows...

from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver
import libcloud

api_key = 'api_key'
secret_key = 'secret_key'

cls = get_driver(Provider.S3_RGW)

driver = cls(api_key,
             secret_key,
	     signature_version='4',
	     region='my-region',
	     host='my-host',
	     port=8000)

container = driver.get_container(...)

If the region has not an explicit value, the driver will use the default region 'default'.

The valid signature versions are '2' (AWS2) and '4' (AWS4). AWS2 is the default signature version.

One host name is always required. No default value here.

The following two examples contain the minimal code to upload/download objects with the new provider:

Running the upload example...

$ ./test-upload-ceph-rgw-driver.py
<Object: name=my-name-abcdabcd-123,
size=110080,
hash=0a5cfeb3bb10e0971895f8899a64e816,
provider=Ceph RGW S3 (my-region) ...>

Running the download example...

$ ./test-download-ceph-rgw-driver.py
<Object: name=my-name-abcdabcd-123,
size=110080,
hash=0a5cfeb3bb10e0971895f8899a64e816,
provider=Ceph RGW S3 (my-region) ...>

Enjoy!

Acknowledgments

My work in Apache Libcloud is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Libcloud community. Thank you all!

by Javier at May 16, 2016 10:00 PM

May 10, 2016

Sergio Villar

Automatizing the Grid

My Igalia colleagues and me have extensively reviewed how to create grids and how to position items inside the grid using different CSS properties. So far everything was more or less static. We declare the sizes of our columns/rows or define a set of grid areas and that’s it. Well, actually there is room for automatic stuff,  you can dynamically create new tracks just by adding items to positions outside the explicit grid. Furthermore the grid is able to auto-position items for you if you don’t really care much about the final destination.

Use Cases

But imagine the following use case. Let’s assume that you are designing the product catalog of your pretty nice web store. CSS Grid Layout is the obvious choice for such layout. Just set some columns and some rows and that’s mostly it. Thing is that your catalog is likely not static, but automatically generated from a query to some database where you store all that data. You cannot know a priori how many items you’re going to show (users normally can also filter results).

Grid already supports that use case. You can already define the number of columns (rows) and let the grid create as many rows (columns) as needed. Something like this:

    grid-template-columns: repeat(3, 100px);
    grid-auto-rows: 100px;
    grid-auto-flow: row;

Picture showing auto row creation

 

The Multicol Example

But you need more flexibility. You’re happy with the grid creating new tracks on demand in one axis, but you also want to have as many tracks as possible (depending on the available size) in the other axis. You’ve already designed web sites using CSS Multicol and you want your grid to behave like columns: 100px;

Note that in the case of multicol the specified column width is an “optimal” size, meaning that it could be enlarged/narrowed to fill the container once the number of columns is calculated.

So is it possible to tell grid to add as many tracks as needed to fill some available space? It was not but now we have…

Repeat to fill: auto-fill and auto-fit

The grid way to implement it is by using the recently added auto repeat syntax. The already known repeat() function accepts as a first argument two new keywords, auto-fill and auto-fit. Both of them will generate as many repetitions as needed to fill the available space with the specified tracks without overflowing. The only difference between them is that, auto-fit will additionally drop any empty track (meaning no items spanning through it) generated by repeat() after positioning grid items.

The use of these two new keywords has some limitations:

  • Tracks with intrinsic or flexible sizes cannot be combined with auto repetitions
  • There can be just one auto-repeat per axis at the most
  • repeat(auto-fill|fit,) accepts only one track size as second argument (all auto repeat tracks will have the same size)

Some examples of valid declarations:

grid-template-columns: repeat(auto-fill, minmax(200px, 1fr)) [last];
grid-template-rows: 100px repeat(auto-fit, 2em) repeat(10, minmax(15%, 1fr));
grid-template-columns: repeat(auto-fill, minmax(max-content, 300px));

And some examples of invalid ones:

grid-template-columns: min-content repeat(auto-fill, 25px) 10px;
grid-template-rows: repeat(auto-fit, 15px) repeat(auto-fill, minmax(10px, 100px);
grid-template-columns: repeat(auto-fill, min-content);

The Details

I mentioned that we cannot mix flexible and/or intrinsic tracks with auto repetitions, so why repeat(auto-fill, minmax(200px, 1fr)) is a valid declaration? Well, according to the grammar, the auto repeat syntax require something called a <fixed-size> track, which is basically a track which has a length (like 10px) or a percentage in either its min or max function. The following are all <fixed-size> tracks:

15%
minmax(min-content, 200px)
minmax(5%, 1fr)

That length/percentage is the one used by the grid to compute the number of repetitions. You should also be aware of how the number of auto repeat tracks is computed. First of all you need a definite size on the axis where you want to use auto-repeat. If that is not the case then the number of repetitions will be 1.

The nice thing is that the definite size could be either the “normal” size (width: 250px), the max size (max-height: 15em) or even the min size (min-width: 200px). But there is an important difference, if either the size or max-size is definite, then the number of repetitions is the largest possible positive integer that does not cause the grid to overflow its grid container. Otherwise, if the min-size is definite, then the number of repetitions is the smallest possible positive integer that fulfills that minimum requirement.

For example, the following declaration will generate 4 rows:

height: 600px;
grid-template-rows: repeat(auto-fill, 140px);

But this one will generate 5:

min-height: 600px;
grid-template-rows: repeat(auto-fill, 140px);

I want to use it now!

Sure, no problem. I’ve just landed the support for auto-fill in both Blink and WebKit meaning that you’ll be able to test it (unprefixed) really soon in Chrome Canary and Safari Technology Preview (Firefox Nightly builds already support both auto-fill and auto-fit). Many thanks to our friends at Bloomberg for sponsoring this work. Enjoy!

by svillar at May 10, 2016 07:25 PM

May 02, 2016

Diego Pino

Network namespaces: IPv6 connectivity

In the last post I introduced network namespaces and showed a practical example on how to share IPv4 connectivity between a network namespace and a host. Before that post, I also wrote a short tutorial on how to set up an IPv6 tunnel using Hurricane Electric broker service. This kind of service allow us to get into the IPv6 realm using an IPv4 connection.

In this article I continue exploring network namespaces. Taking advantage of the work done in the aforementioned posts, I explain in this post how to share IPv6 connectivity between a host and a network namespace.

Let’s assume we already have a SIT tunnel (IPv6-in-IPv4 tunnel) enabled in our host and we’re able to ping an external IPv6 address. If you haven’t, I encourage you to check out set up an IPv6 tunnel.

$ ping6 ipv6.google.com
PING ipv6.google.com(lis01s14-in-x0e.1e100.net) 56 data bytes
64 bytes from lis01s14-in-x0e.1e100.net: icmp_seq=1 ttl=57 time=93.2 ms

I need to write an script which will create the network namespace and set it up accordingly. I call that script ns-ipv6. If the script works correctly, I should be able to ping an external IPv6 host from the namespace. Such script looks like this:

ns-ipv6.sh

 1 #!/bin/bash
 2 
 3 set -x
 4 
 5 if [[ $EUID -ne 0 ]]; then
 6    echo "You must run this script as root."
 7    exit 1
 8 fi
 9 
10 VETH1_IPV6=fd00::1
11 VPEER1_IPV6=fd00::2
12 
13 # Clean up.
14 ip netns del ns-ipv6 &>/dev/null
15 ip li del veth1 &> /dev/null
16 
17 # Create network namespace.
18 ip netns add ns-ipv6
19 
20 # Create veth pair.
21 ip li add name veth1 type veth peer name vpeer1
22 
23 # Setup veth1 (host).
24 ip -6 addr add ${VETH1_IPV6}/64 dev veth1
25 ip li set dev veth1 up
26 
27 # Setup vpeer1 (network namespace).
28 ip li set dev vpeer1 netns ns-ipv6
29 ip netns exec ns-ipv6 ip li set dev lo up
30 ip netns exec ns-ipv6 ip -6 addr add ${VPEER1_IPV6}/64 dev vpeer1
31 ip netns exec ns-ipv6 ip li set vpeer1 up
32 
33 # Make vpeer1 default gw.
34 ip netns exec ns-ipv6 ip -6 route add default dev vpeer1 via ${VETH1_IPV6}
35 
36 # NAT
37 sysctl -w net.ipv6.conf.all.forwarding=1
38 ip6tables -t nat --flush
39 ip6tables -t nat -A POSTROUTING -o he-ipv6 -j MASQUERADE
40 
41 # Get into ns-ipv6.
42 ip netns exec ns-ipv6 /bin/bash --rcfile <(echo "PS1=\"ns-ipv6> \"")

It actually works:

$ sudo ./ns-ipv6.sh
ns-ipv6> ping6 -c 1 2a00:1450:4003:801::200e
PING 2a00:1450:4003:801::200e(2a00:1450:4003:801::200e) 56 data bytes
64 bytes from 2a00:1450:4003:801::200e: icmp_seq=1 ttl=54 time=83.7 ms

Let’s take a deeper look on how it works.

ULAs

The script creates a veth pair to communicate the network namespace with the host. Each virtual interface is assigned an IPv6 address in the ‘fd00::0/64’ network space (Lines 21 and 27). This type of address is known as ULA or Unique Local Address. ULAs are the IPv6 counterpart of IPv4 private addresses.

Before continuing, a brief reminder on how IPv6 addresses work:

An IPv6 address is a 128-bit value represented as 8 blocks of 16-bit (8 x 16-bit = 128-bit). Blocks are separated by a colon (‘:’). Unlike IPv4 addresses, block values are written in hexadecimal. Since each block is a 16-bit value, they can be written in hexadecimal as 4-digit numbers. Leading zeros in each block can be ommitted. On the same hand, when several consecutive block values are zero they can be ommitted too. In that case two colons (‘::’) are written instead, meaning everything in between is nil. For instance, the address ‘fd00::1’ is the short form of the much longer address ‘fd00:0000:0000:0000:0000:0000:0000:0001’.

RFC 4193 (section 3) describes Unique Local Addresses format as:

| 7 bits |1|  40 bits   |  16 bits  |          64 bits           |
+--------+-+------------+-----------+----------------------------+
| Prefix |L| Global ID  | Subnet ID |        Interface ID        |
+--------+-+------------+-----------+----------------------------+

- Prefix is always fc00::/7.
- L:
  * set to 1, for local assigned addresses (block _fd00::/8_).
  * set to 0, not defined yet (block _fc00::/8_).
- Global ID: 40-bit global identifier used to create a globally unique prefix.
- Subnet ID: 16-bit Subnet ID is an identifier of a subnet within the site.
- Interface ID: 64-bit Interface ID.

The RFC reserves the IPv6 address block ‘fc00::/7’ for ULAs. It divides this address in two subnetworks: ‘fc00::/8’ and ‘fd00::/8’. The use of the ‘fc00::/8’ block has not been defined yet, while the ‘fd00:/8’ block is used for IPv6 local assigned addresses (private addresses).

The address ‘fd63:b1f4:7268:d970::1’ is an example of a valid ULA. It starts by the ‘fd’ prefix followed by an unique Global ID (‘63:b1f4:7268’) and a Subnet ID (‘d970’), leaving 64 bits for the Interface ID (‘::1’). I recommend the page Private IPv6 address range to obtain valid random ULAs.

ULAs are not routable in the global Internet. They are meant to be used inside local networks and that’s precisely the reason why they exist.

NAT on IPv6

Lines 34-36 activate IPv6 forwarding and IP Masquerade on the source address. However, this solution is not optimal.

The Hurricane Electric tunnel broker service lends us a ‘::0/64’ block, with 2^64 - 2 valid hosts. NAT, Network Address Translation, grants a host in a private network external connectivity via a proxy that lends the private host its address. This is the most common use case of NAT, known as Source NAT. Besides IP addresses, NAT translates port numbers too and that’s why it’s sometimes referred as NAPT (Network Address and Port Translation). NAT has been an important technology for optimizing the use of IPv4 address space, although it has its costs too.

The original goal of IPv6 was solving the problem of IP address exhaustation. Mechanisms such as NAT are not needed because the IPv6 address space is so big that every host could have an unique address, reachable from another end of the network. Actually, IPv6 brings back the original point-to-point design of the IPv4 Internet, before private addresses and NAT were introduced.

So let’s try to get rid of NAT66 (IPv6-to-IPv6 translation) by:

  • Using global IPv6 addresses.
  • Removing MASQUERADING.

The new script is available as a gist here: ns-ipv6-no-nat.sh. There’s some tricky bits that are worth explaining:

First thing, is to replace the ULAs by IPv6 addresses which belong to /64 block leased by Hurricane Electric:

VETH1_IPV6=2001:XXXX::101
VPEER1_IPV6=2001:XXXX::102

When setting up the interfaces, the host side should add a more restricted routing rule for the other end of the veth pair. The reason is that all addresses belong to the same network. If from the host side a packet needs to get to the network namespace side, it would be routed through the IPv6 tunnel unless there’s a more restricted rule.

# Setup veth1 (host).
ip -6 addr add ${VETH1_IPV6} dev veth1
ip -6 route add ${VPEER1_IPV6}/128 dev veth1

Lastly, NAT66 can be removed but IP forwarding is still necessary as the host acts as a router.

# Enable IPv6 forwarding.
systctl net.ipv6.conf.all.forwarding=1

When a packet arrives from the network namespace into the host, the destination address of the packet doesn’t match any of the interfaces of the host. If IP forwarding were disabled, the packet will simply be dropped. However, since IP forwarding is enabled, non-delivered packets get forwarded through the host’s default gateway reaching their destination, hopefully.

After these changes, the script still works:

ns-ipv6-no-nat> ping6 2a00:1450:4004:801::200e
PING 2a00:1450:4004:801::200e(2a00:1450:4004:801::200e) 56 data bytes
64 bytes from 2a00:1450:4004:801::200e: icmp_seq=1 ttl=56 time=86.0 ms

DNS resolution

In the line above I’m pinging an IPv6 address directly (this address is actually ipv6.google.com). What happens if I try to ping a host name instead?

ns-ipv6> ping6 ipv6.google.com
unknown host

When I ping ipv6.google.com from the network namespace, the /etc/resolv.conf file is queried to obtain a DNS nameserver address.

/etc/resolv.conf

nameserver 8.8.8.8

This address is an IPv4 address, but the network namespace has IPv6 connectivity only. It cannot reach any host in the IPv4 realm. However, DNS resolution works in the host since the host has either IPv6 and IPv4 connectivity. It is necessary to add a DNS server with an IPv6 address. Luckily, Google has DNS servers available in the IPv6 realm too.

nameserver 8.8.8.8
nameserver 2001:4860:4860::8888

Now I should be able to ping ipv6.google.com from the network namespace:

ns-ipv6> ping6 -c 1 ipv6.google.com
PING ipv6.google.com(lis01s14-in-x0e.1e100.net) 56 data bytes
64 bytes from lis01s14-in-x0e.1e100.net: icmp_seq=1 ttl=57 time=85.7 ms

--- ipv6.google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 85.702/85.702/85.702/0.000 ms

Wrapping up

After all these changes we end up with a script that:

  • Uses Hurricane Electric’s IPv6 network addresses, instead of ULAs.
  • Doesn’t do NAT66 to provide external IPv6 connectivity to the network namespace.

It has been a lot of fun writing out this post, it helped me to understand many things better. I definitely encourage everyone interested to run some of the scripts above and try out IPv6, if you haven’t yet. The network namespace part is not fundamental but it makes it more interesting.

Lastly, I’d like to thank my colleague Carlos López for his unvaluable help as well as the StackOverflow community which helped me to figure out the script that gets rid of NAT66.

May 02, 2016 10:00 AM

April 29, 2016

Javier Muñoz

Scalable placement of replicated data in Ceph

One of the most interesting and powerful features in Ceph is the way how it computes the placement and storage of a growing amount of data at hyperscale.

This computation avoids the need to look up data locations in a central directory in order to allow nodes to be added or removed, moving as few objects as possible while still maintaining balance across new cluster configurations.

Ceph and the challenges of data placement at scale

Take into consideration the traditional standard storage array in the industry. It has the usual two controller units (for redundancy) and a lot of disk trays. Those controllers connect with a storage area network (SAN) in order to provide storage to clients (servers mainly).

The disk trays are connected to the storage controllers and all storage clients use the disks through those controllers. Scaling up the capacity and performance of the array requires adding/handling more disks to the controllers.

As expected, raising the order of scale the controllers become a bottleneck. They will get overloaded in some moment. The usual solution for this bottleneck is buying a new pair of controllers with a new array of disks and moving some load onto the new hardware.

This solution is expensive and not very flexible. It can work in the terabyte scale but it doesn't fit really well with the new Cloud industry demmanding elasticity and extreme flexibility on the petabyte/exabyte scale.

The Ceph approach to cope with these challenges is the design and implementation of a new scale-out distributed storage system based on commodity hardware communicating over a regular TCP/IP network.

This approach enables a fast and more affordable non-disruptive operational process on demmand while supporting the new petabyte scale with a very high performance. This kind of hyper scalable storage fits really well with the Cloud models.

The way how Ceph enables hyperscale storage is avoiding any kind of centralized coordination via autonomous and smart storage nodes. In this scale the design of the cluster software determines how many nodes the cluster can handle. If the cluster scales out to a hundred of nodes we will be working around the petabytes and millions of I/O operations per second (IOPS)

Another point to consider in this context is the inclusion of the new failure domains coming from the scale-out architecture. In this configuration each node is a new failure domain. This is usually handled by the software coordinating the cluster via copies of all data on at least two nodes. This approach is known as replica-based data protection.

Ceph is able to enable hyperscale storage via CRUSH, a hash-based algorithm for calculating how and where to store and retrieve data in a distributed-object storage cluster.

This major challenge of 'data placement at scale', and how it is designed and implemented, has a huge impact on the scalability and performance of massive storage solutions.

A "bird's-eye view" in the Ceph's data placement

In order to explore CRUSH we need to understand concepts such as pools, placement groups, OSDs and so on. If you are not familiar with the Ceph architecture I would suggest reading one of my previous posts where all this information is summarised together with pointers to the official documentation.

From now on I will consider you are familiar with the general concepts and we will keep the focus on CRUSH and related components.

Ceph uses CRUSH to determine how to store and retrieve data by computing data storage locations. Those data are handled as objects of variable size by Ceph in a simple flat namespace. On top of these native objects, Ceph build the rest of abstractions such as blocks, file systems or S3 buckets.

In this point we can reformulate the data placement problem how an object-to-osd mapping problem without loss of generality.

In Ceph, the objects belong to pools and pools are comprised of placement groups. Each placement group maps to a list of OSDs. This is the critical path you need to understand.

The pool is the way how Ceph divides the global storage. This division or partition is the abstraction used to define the resilience (number of replicas, etc.), the number of placement groups, the CRUSH ruleset, the ownership and so on. You can consider this abstraction as the right place to define the configuration of your policies, so each pool handles its own number of replicas, number of placement groups, etc.

The placement group is the abstraction used by Ceph to map objects to OSDs in a dynamic way. You can consider it as the placement or distribution unit in Ceph.

So how we go from objects to OSDs via pools and placements groups? It is straight.

In Ceph one object will be stored in a concrete pool so the pool identifier (a number) and the name of the object are used to uniquely identify the object in the system.

Those two values, the pool id and the name of the object, are used to get a placement group via hashing.

When a pool is created it is assigned a number of placement groups (PGs). One object is always stored in a concrete pool so the pool identifier (a number) and the name of the object is used to uniquely identify each object in the system.

With the pool identifier and the hashed name of the object, Ceph will compute the hash modulo the number of PGs to retrieve the dynamic list of OSDs.

In detail, the steps to compute for one placement group for the object named 'tree' in the pool 'images' (pool id 7) with 65536 placement groups would be...

  1. Hash the object name : hash('tree') = 0xA062B8CF
  2. Calculates the hash modulo the number of PGs : 0xA062B8CF % 65536 = 0xB8CF
  3. Get the pool id : 'images' = 7
  4. Prepends the pool id to 0xB8CF to get the placement group: 7.B8CF

Ceph uses this new placement group (7.B8CF) together with the cluster map and the placement rules to get the dynamic list of OSDs...

  • CRUSH('7.B8CF') = [4, 19, 3]

The size of this list is the number of replicas configured in the pool. The first OSD in the list is the primary, the next one is the secondary and so on.

Understanding CRUSH

The CRUSH (Controlled Replication Under Scalable Hashing) algorithm determines how to store and retrieve data by computing data storage locations. As mentioned in the previous lines, Ceph uses this approach to overcome the data placement at scale.

Under the hood, the algorithm requires knowing how the cluster storage is organized (device locations, hierarchies and so on). All this information is defined in the CRUSH map.

The roles and responsibilities of the CRUSH map are the following:

  • Together with a ruleset for each hierarchy, the map determines how Ceph stores data.

  • It contains at least one hierarchy of nodes and leaves.

  • The nodes of a hierarchy are called 'buckets'. Those buckets are defined by their type.

  • The data objects are distributed among the storage devices according to a per-device weight value, approximating an uniform probability distribution.

  • The hierarchies are arbitraries. They are defined according to your own needs but the leaf nodes always represent the OSDs, and each leaf belong to one node or bucket.

Let's see one example of default hierarchy...

$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.16672 root default                                      
-2 0.05557     host node-1                                   
 0 0.02779         osd.0        up  1.00000          1.00000 
 1 0.02779         osd.1        up  1.00000          1.00000 
-3 0.05557     host node-2                                   
 2 0.02779         osd.2        up  1.00000          1.00000 
 3 0.02779         osd.3        up  1.00000          1.00000 
-4 0.05557     host node-3                                   
 4 0.02779         osd.4        up  1.00000          1.00000 
 5 0.02779         osd.5        up  1.00000          1.00000 
$

In this case the bucket hierarchy has six leaf buckets (osd 0-5), three host buckets (node 1-3) and one root node (default)

Having a look in the decompiled map we get the following content...

$ ceph osd getcrushmap -o compiled-crush-map.bin
got crush map from osdmap epoch 65
$ crushtool -d compiled-crush-map.bin -o decompiled-crush-map.txt
$ cat decompiled-crush-map.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node-1 {
id -20 # do not change unnecessarily
# weight 0.056
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.028
item osd.1 weight 0.028
}
host node-2 {
id -32 # do not change unnecessarily
# weight 0.056
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.028
item osd.3 weight 0.028
}
host node-3 {
id -48 # do not change unnecessarily
# weight 0.056
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.028
item osd.5 weight 0.028
}
root default {
id -10 # do not change unnecessarily
# weight 0.167
alg straw
hash 0 # rjenkins1
item node-1 weight 0.056
item node-2 weight 0.056
item node-3 weight 0.056
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
$

As you can see the bucket declaration requires specifying its type, an unique name, weight and hash algorithm.

[bucket-type] [bucket-name] {
   id [a unique negative numeric id]
   weight [the relative capacity/capability of the item(s)]
   alg [uniform, list, tree, straw, straw2]
   hash [the hash type]
   item [item-name] weight [weight]
}

The kinds of buckets (uniform, list, tree, straw2, etc) represent internal (non-leaf) nodes in the cluster hierarchies. Those buckets are based on different internal data structures and utilize different functions for pseudo-random choosing nested items.

The typical example is the uniform bucket, in this case all selected items are restricted in that they must contain items that are all of the same weight.

The hash type represent the hash algorithm used in the functions associated with the different kinds of buckets.

Finally, the CRUSH rules define how a Ceph client and OSDs select buckets.

rule  {
   ruleset 
   type [ replicated | raid4]
   min_size 
   max_size 
   step take 
   step [choose|chooseleaf] [firstn|indep]  
   step emit
}

The Jenkins hash function

We mentioned Ceph, and CRUSH, use a hash function as part of their logic but we didn't comment anything about this function yet.

This function is known as the Jenkins hash function, a hash function for hash table lookups. One paper covering the technical details on this hash function is available here.

The paper presents fast and reliable hash functions for table lookup using 32-bit or 64-bit arithmetic together with a framework for evaluating hash functions.

In Ceph, the Jenkins function is not only used in CRUSH as part of the replicas selection. It is used along the Ceph codebase when some hashing requirement is needed.

As Jenkins comments on his paper, these hashes work equally well on all types of inputs, including text, numbers, compressed data, counting sequences, and sparse bit arrays.

I would highlight the following points related to the hash function design. They are relevant to understand the hashing code while sequencing, masking, etc. in Ceph/CRUSH the different input values.

  1. If the hash value needs to be smaller that 32 (64) bits, you can mask the high bits.
  2. The hash functions work best if the size of the hash table is a power of 2.
  3. If the hash table has more than 232 (264) entries, this can be handled by calling the hash function twice with different initial initvals then concatenating the results.
  4. If the key consists of multiple strings, the strings can be hashed sequentially, passing in the hash value from the previous string as the initval for the next.
  5. Hashing a key with different initial initvals produces independent hash values.

Ceph and data placement in practice

Let's locate one object...

$ ceph health
HEALTH_OK
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.16672 root default                                      
-2 0.05557     host node-1                                   
 0 0.02779         osd.0        up  1.00000          1.00000 
 1 0.02779         osd.1        up  1.00000          1.00000 
-3 0.05557     host node-2                                   
 2 0.02779         osd.2        up  1.00000          1.00000 
 3 0.02779         osd.3        up  1.00000          1.00000 
-4 0.05557     host node-3                                   
 4 0.02779         osd.4        up  1.00000          1.00000 
 5 0.02779         osd.5        up  1.00000          1.00000 
$ ceph osd pool create mypool 256 256 replicated
pool 'mypool' created
$ ceph osd lspools
0 rbd,1 mypool,
$ ceph osd dump | grep mypool
pool 1 'mypool' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins
pg_num 256 pgp_num 256 last_change 59 flags hashpspool stripe_width 0
$ dd if=/dev/zero of=myobject bs=25M count=1 2> /dev/null
$ ls -sh myobject
25M myobject
$ rados -p mypool put myobject myobject
$ rados -p mypool stat myobject
mypool/myobject mtime 2016-04-30 10:12:10.000000, size 26214400
$ rados -p mypool ls
myobject
$ ceph osd map mypool myobject
osdmap e60 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,2,1], p4) acting ([4,2,1], p4)
$

It is mapping the object 'myobject' in the pool 'mypool' to pg 1.5da41c62 (1.62) and OSDs 4, 2 and 1

Let's see how it balances/replicates the object when OSDs go down...

$ ceph osd dump | grep mypool
pool 1 'mypool' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins
pg_num 256 pgp_num 256 last_change 59 flags hashpspool stripe_width 0
$ ceph osd map mypool myobject
osdmap e60 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,2,1], p4) acting ([4,2,1], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.16672 root default                                      
-2 0.05557     host node-1                                   
 0 0.02779         osd.0        up  1.00000          1.00000 
 1 0.02779         osd.1        up  1.00000          1.00000 
-3 0.05557     host node-2                                   
 2 0.02779         osd.2      down  1.00000          1.00000 
 3 0.02779         osd.3        up  1.00000          1.00000 
-4 0.05557     host node-3                                   
 4 0.02779         osd.4        up  1.00000          1.00000 
 5 0.02779         osd.5        up  1.00000          1.00000 
$ ceph osd map mypool myobject
osdmap e66 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,1], p4) acting ([4,1], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.16672 root default                                      
-2 0.05557     host node-1                                   
 0 0.02779         osd.0        up  1.00000          1.00000 
 1 0.02779         osd.1      down  1.00000          1.00000 
-3 0.05557     host node-2                                   
 2 0.02779         osd.2      down  1.00000          1.00000 
 3 0.02779         osd.3        up  1.00000          1.00000 
-4 0.05557     host node-3                                   
 4 0.02779         osd.4        up  1.00000          1.00000 
 5 0.02779         osd.5        up  1.00000          1.00000 
$ ceph osd map mypool myobject
osdmap e68 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4], p4) acting ([4], p4)
$ ceph osd map mypool myobject
osdmap e96 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,3], p4) acting ([4,3], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.16672 root default                                      
-2 0.05557     host node-1                                   
 0 0.02779         osd.0        up  1.00000          1.00000 
 1 0.02779         osd.1        up  1.00000          1.00000 
-3 0.05557     host node-2                                   
 2 0.02779         osd.2      down        0          1.00000 
 3 0.02779         osd.3        up  1.00000          1.00000 
-4 0.05557     host node-3                                   
 4 0.02779         osd.4        up  1.00000          1.00000 
 5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd map mypool myobject
osdmap e103 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,1,3], p4) acting ([4,1,3], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.16672 root default                                      
-2 0.05557     host node-1                                   
 0 0.02779         osd.0        up  1.00000          1.00000 
 1 0.02779         osd.1        up  1.00000          1.00000 
-3 0.05557     host node-2                                   
 2 0.02779         osd.2        up  1.00000          1.00000 
 3 0.02779         osd.3        up  1.00000          1.00000 
-4 0.05557     host node-3                                   
 4 0.02779         osd.4        up  1.00000          1.00000 
 5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd map mypool myobject
osdmap e108 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,2,1], p4) acting ([4,2,1], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.16672 root default                                      
-2 0.05557     host node-1                                   
 0 0.02779         osd.0        up  1.00000          1.00000 
 1 0.02779         osd.1        up  1.00000          1.00000 
-3 0.05557     host node-2                                   
 2 0.02779         osd.2        up  1.00000          1.00000 
 3 0.02779         osd.3        up  1.00000          1.00000 
-4 0.05557     host node-3                                   
 4 0.02779         osd.4      down  1.00000          1.00000 
 5 0.02779         osd.5        up  1.00000          1.00000 
$ ceph osd map mypool myobject
osdmap e110 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([2,1], p2) acting ([2,1], p2)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.16672 root default                                      
-2 0.05557     host node-1                                   
 0 0.02779         osd.0        up  1.00000          1.00000 
 1 0.02779         osd.1        up  1.00000          1.00000 
-3 0.05557     host node-2                                   
 2 0.02779         osd.2        up  1.00000          1.00000 
 3 0.02779         osd.3        up  1.00000          1.00000 
-4 0.05557     host node-3                                   
 4 0.02779         osd.4        up  1.00000          1.00000 
 5 0.02779         osd.5        up  1.00000          1.00000 
$ ceph osd map mypool myobject
osdmap e115 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,2,1], p4) acting ([4,2,1], p4)

It works as expected. It uses three replicas with a minimal size of two replicas.

Wrap-up

Along this post some challenges of data placement at hyperscale were introduced together with the Ceph approach (smart storage nodes, CRUSH algorithm, Jenkins hash function...) to address them.

Some practical examples to illustrate the mapping and data placement path are also available in the previous lines. They were tested on the new Ceph Jewel release.

As usual, I would say the primary reference to understand the current Ceph data placement is the source code. I would suggest to read the Sage's thesis (6.2, 6.3 and 6.4 sections) to know more on the roots of the current solution too. These sections cover reliable autonomic storage, performance and scalability. Beyond of these references you might also find useful the official documentation.

If you are looking for some kind of support related to development, design, deployment, etc. in Ceph or you would love to see some new feature in the next releases. Feel free to contact me!

by Javier at April 29, 2016 10:00 PM

April 26, 2016

Adrián Pérez

Identifying Layer-7 packet flows in SnabbWall

Spring is here already, the snow has melted a while ago, and it looks like a good time to write a bit about network traffic flows, as promised in my previous post about ljndpi. Why so? Well, looking at network traffic and grouping it into logical streams between two endpoints is something that needs to be done for SnabbWall, a suite of Snabb applications which implement a Layer-7 analyzer and firewall which Igalia is developing with sponsorship from the NLnet Foundation.

 

(For those interested in following only my Snabb-related posts, there are separate feeds you can use: RSS, Atom.)

Going With the Flow

Any sequence of related network packets between two hosts can be a network traffic flow. But not quite so: the exact definition may vary, depending on the level at which we are working. For example, an ISP may want to consider all packets between the pair of hosts —regardless of their contents— as part of the same flow in order to account for transferred data in metered connections, but for SnabbWall we want “application-level” traffic flows. That is: all packets generated (or received) by the same application should be classified into the same flow.

But that can get tricky, because even if we looked only at TCP traffic from one application, it is not possible just map a single connection to one flow. Take FTP for example: in active mode it uses a control connection, plus an additional data connection, and both should be considered part of the same flow because both belong to the same application. On the other side of the spectrum are web browsers like the one you are probably using to read this article: it will load the HTML using one connection, and then other related content (CSS, JavaScript, images) needed to display the web page.

In SnabbWall, the assignment of packets to flows is done based on the following fields from the packet:

  • 802.1Q VLAN tag.
  • Source and destination IP addresses.
  • Source and destination port numbers.

The VLAN tag is there to force classifying packets with the same source and destination but in different logical networks in separate packet flows. As for port numbers, in practice these are only extracted from packets when the upper layer protocol is UDP or TCP. There are other protocols which use port numbers, but they are deliberately left out (for now) because either nDPI does not support them, or they are not widely adopted (SCTP comes to mind).

Some Implementation Details

Determining the flow to which packets belong is an important task which is performed for each single packet scanned. Even before packet contents are inspected, they have to be classified.

Handling flows is split in two in the SnabbWall packet scanner: a generic implementation inspects packets to extract the fields above (VLAN tag, addresses, ports) and calculates a unique flow key from them, while backend-specific code inspects the contents of the packet and identifies the application for a flow of packets. Once the generic part of the code has calculated a key, it can be used to keep tables which associate additional data to each flow. While SnabbWall has only one backend which at the moment which uses nDPI, this split makes it easier to add others in the future.

For efficiency —both in terms of memory and CPU usage— flow keys are represented using a C struct. The following snippet shows the one for IPv4 packets, with a similar one where the address fields are 16 bytes wide being used for IPv6 packets:

ffi.cdef [[
   struct swall_flow_key_ipv4 {
      uint16_t vlan_id;
      uint8_t  __pad;
      uint8_t  ip_proto;
      uint8_t  lo_addr[4];
      uint8_t  hi_addr[4];
      uint16_t lo_port;
      uint16_t hi_port;
   } __attribute__((packed));
]]

local flow_key_ipv4 = ffi.metatype("struct swall_flow_key_ipv4", {
   __index = {
      hash = make_cdata_hash_function(ffi.sizeof("struct swall_flow_key_ipv4")),
   }
})

The struct is laid out with an extra byte of padding, to ensure that its size is a multiple of 4. Why so? The hash function (borrowed from the lib.ctable module) used for flow keys works on inputs with sizes multiple of 4 bytes because calculations are done in a word-by-word basis. In Lua the hash value for userdata values is their memory address, which makes them all different to each other: defining our own hashing function allows using the hash values as keys into tables, instead of the flow key itself. Let's see how this works with the following snippet, which counts per-flow packets:

local flows = {}
while not ended do
   local key = key_from_packet(read_packet())
   if flows[key:hash()] then
      flows[key:hash()].num_packets = flows[key:hash()].num_packets + 1
   else
      flows[key:hash()] = { key = key, num_packets = 1 }
   end
end

If we used the keys themselves instead of key:hash() for indexing the flows table, this wouldn't work because the userdata for the new key is created for each packet being processed, which means that keys with the same content created for different packets would have different hash values (their address in memory). On the other hand, the :hash() method always returns the same value keys with the same contents.

Highs and Lows

You may be wondering why our flow key struct has its members named lo_addr, hi_addr, lo_port and hi_port. It turns out that in packets which belong to the same application travel between two hosts in both directions. Let's consider the following:

  • Host A, with address 10.0.0.1.
  • Host B, with address 10.0.0.2.
  • A web browser from A connects (using randomly assigned port 10205) to host B, which has an HTTP server running in port 80.

The sequence of packets observed will go like this:

# Source IP Destination IP Source Port Destination Port
1 10.0.0.1 10.0.0.2 10205 80
2 10.0.0.2 10.0.0.1 80 10205
3 10.0.0.1 10.0.0.2 10205 80
4

If the flow key fields would be src_addr, dst_addr and so on, the first and second packets would be classified in separate flows — but they belong in the same one! This is sidestepped by sorting the addresses and ports of each packet when calculating its flow key. For the example connection above, all packets involved have 10.0.0.1 as the “low IP address” (lo_addr), 10.0.0.2 as the “high IP address” (hi_addr), 80 as the “low port” (lo_port), and 10205 as the “high port” (hi_port) — effectively classifying all the packets into the same flow.

This translates into some minor annoyance in the nDPI scanner backend because nDPI expects us to pass a pair of identifiers for the source and destination hosts for each packet inspected. Not a big deal, though.

Flow(er Power)

Something we have to do for IPv6 packets is traversing the chain of extension headers to get to the upper-layer protocol and extract port numbers from TCP and UDP packets. There can be any number of extension headers, and while in practice they should never be lots, this makes the amount of work needed to derive a flow key from a packet is not constant.

The good news is that RFC 6437 specifies using the 20-bit flow label field of the fixed IPv6 header in a way that, combined with the source and destination addresses, they uniquely identify the flow of the packet. This is all rainbows and ponies, but in practice the current behaviour would still be needed: the specification considers that an all-zeroes value indicates “packets that have not been labeled”. Which means that it is still needed to use the source and destination ports as fallback. What is even worse: while forbidden by the specification, flow labels can mutate while packets are en-route without any means of verifying that the change was made. Also, it is allowed to assign a new flow label to an unlabeled packet when packets are being forwarded. Nevertheless, using the flow label may be interesting to be used instead of the port numbers when the upper layer protocol is neither TCP nor UDP. Due to the limited usefulness, using IPv6 flow labels remains unimplemented for now, but I have not discarded adding support later on.

Something Else

Alongside with the packet scanner, I have implemented the L7Spy application, and the snabb wall command during this phase of the SnabbWall project. Expect another post soon about them!

by aperez (adrian@perezdecastro.org) at April 26, 2016 08:34 PM

April 15, 2016

Frédéric Wang

OpenType MATH in HarfBuzz

TL;DR:

  • Work is in progress to add OpenType MATH support in HarfBuzz and will be instrumental for many math rendering engines relying on that library, including browsers.

  • For stretchy operators, an efficient way to determine the required number of glyphs and their overlaps has been implemented and is described here.

In the context of Igalia browser team effort to implement MathML support using TeX rules and OpenType features, I have started implementation of OpenType MATH support in HarfBuzz. This table from the OpenType standard is made of three subtables:

  • The MathConstants table, which contains layout constants. For example, the thickness of the fraction bar of ab\frac{a}{b}.

  • The MathGlyphInfo table, which contains glyph properties. For instance, the italic correction indicating how slanted an integral is e.g. to properly place the subscript in ∫D\displaystyle\displaystyle\int_{D}.

  • The MathVariants table, which provides larger size variants for a base glyph or data to build a glyph assembly. For example, either a larger parenthesis or a assembly of U+239B, U+239C, U+239D to write something like:

    (abcdefgh\left(\frac{\frac{\frac{a}{b}}{\frac{c}{d}}}{\frac{\frac{e}{f}}{\frac{g}{h}}}\right.

Code to parse this table was added to Gecko and WebKit two years ago. The existing code to build glyph assembly in these Web engines was adapted to use the MathVariants data instead of only private tables. However, as we will see below the MathVariants data to build glyph assembly is more general, with arbitrary number of glyphs or with additional constraints on glyph overlaps. Also there are various fallback mechanisms for old fonts and other bugs that I think we could get rid of when we move to OpenType MATH fonts only.

In order to add MathML support in Blink, it is very easy to import the OpenType MATH parsing code from WebKit. However, after discussions with some Google developers, it seems that the best option is to directly add support for this table in HarfBuzz. Since this library is used by Gecko, by WebKit (at least the GTK port) and by many other applications such as Servo, XeTeX or LibreOffice it make senses to share the implementation to improve math rendering everywhere.

The idea for HarfBuzz is to add an API to

  1. 1.

    Expose data from the MathConstants and MathGlyphInfo.

  2. 2.

    Shape stretchy operators to some target size with the help of the MathVariants.

It is then up to a higher-level math rendering engine (e.g. TeX or MathML rendering engines) to beautifully display mathematical formulas using this API. The design choice for exposing MathConstants and MathGlyphInfo is almost obvious from the reading of the MATH table specification. The choice for the shaping API is a bit more complex and discussions is still in progress. For example because we want to accept stretching after glyph-level mirroring (e.g. to draw RTL clockwise integrals) we should accept any glyph and not just an input Unicode strings as it is the case for other HarfBuzz shaping functions. This shaping also depends on a stretching direction (horizontal/vertical) or on a target size (and Gecko even currently has various ways to approximate that target size). Finally, we should also have a way to expose italic correction for a glyph assembly or to approximate preferred width for Web rendering engines.

As I mentioned at the beginning, the data and algorithm to build glyph assembly is the most complex part of the OpenType MATH and deserves a special interest. The idea is that you have a list of n≥1n\geq 1 glyphs available to build the assembly. For each 0≤i≤n-10\leq i\leq n-1, the glyph gig_{i} has advance aia_{i} in the stretch direction. Each gig_{i} has straight connector part at its start (of length sis_{i}) and at its end (of length eie_{i}) so that we can align the glyphs on the stretch axis and glue them together. Also, some of the glyphs are “extenders” which means that they can be repeated 0, 1 or more times to make the assembly as large as possible. Finally, the end/start connectors of consecutive glyphs must overlap by at least a fixed value omino_{\mathrm{min}} to avoid gaps at some resolutions but of course without exceeding the length of the corresponding connectors. This gives some flexibility to adjust the size of the assembly and get closer to the target size tt.

gig_{i}

sis_{i}

eie_{i}

aia_{i}

gi+1g_{i+1}

si+1s_{i+1}

ei+1e_{i+1}

ai+1a_{i+1}

oi,i+1o_{i,i+1}

Figure 1: Two adjacent glyphs in an assembly

To ensure that the width/height is distributed equally and the symmetry of the shape is preserved, the MATH table specification suggests the following iterative algorithm to determine the number of extenders and the connector overlaps to reach a minimal target size tt:

  1. 1.

    Assemble all parts by overlapping connectors by maximum amount, and removing all extenders. This gives the smallest possible result.

  2. 2.

    Determine how much extra width/height can be distributed into all connections between neighboring parts. If that is enough to achieve the size goal, extend each connection equally by changing overlaps of connectors to finish the job.

  3. 3.

    If all connections have been extended to minimum overlap and further growth is needed, add one of each extender, and repeat the process from the first step.

We note that at each step, each extender is repeated the same number of times r≥0r\geq 0. So if IExtI_{\mathrm{Ext}} (respectively INonExtI_{\mathrm{NonExt}}) is the set of indices 0≤i≤n-10\leq i\leq n-1 such that gig_{i} is an extender (respectively is not an extender) we have ri=rr_{i}=r (respectively ri=1r_{i}=1). The size we can reach at step rr is at most the one obtained with the minimal connector overlap omino_{\mathrm{min}} that is

∑i=0N-1(∑j=1riai-omin)+omin=(∑i∈INonExtai-omin)+(∑i∈IExtr⁢(ai-omin))+omin\sum_{i=0}^{N-1}\left(\sum_{j=1}^{r_{i}}{a_{i}-o_{\mathrm{min}}}\right)+o_{ \mathrm{min}}=\left(\sum_{i\in I_{\mathrm{NonExt}}}{a_{i}-o_{\mathrm{min}}} \right)+\left(\sum_{i\in I_{\mathrm{Ext}}}r{(a_{i}-o_{\mathrm{min}})}\right)+o% _{\mathrm{min}}

We let NExt=|IExt|N_{\mathrm{Ext}}={|I_{\mathrm{Ext}}|} and NNonExt=|INonExt|N_{\mathrm{NonExt}}={|I_{\mathrm{NonExt}}|} be the number of extenders and non-extenders. We also let SExt=∑i∈IExtaiS_{\mathrm{Ext}}=\sum_{i\in I_{\mathrm{Ext}}}a_{i} and SNonExt=∑i∈INonExtaiS_{\mathrm{NonExt}}=\sum_{i\in I_{\mathrm{NonExt}}}a_{i} be the sum of advances for extenders and non-extenders. If we want the advance of the glyph assembly to reach the minimal size tt then

SNonExt-omin⁢(NNonExt-1)+r⁢(SExt-omin⁢NExt)≥t{S_{\mathrm{NonExt}}-o_{\mathrm{min}}\left(N_{\mathrm{NonExt}}-1\right)}+{r% \left(S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}\right)}\geq t

We can assume 0" display="inline">SExt-omin⁢NExt>0S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}>0 or otherwise we would have the extreme case where the overlap takes at least the full advance of each extender. Then we obtain

r≥rmin=max⁡(0,⌈t-SNonExt+omin⁢(NNonExt-1)SExt-omin⁢NExt⌉)r\geq r_{\mathrm{min}}=\max\left(0,\left\lceil\frac{t-{S_{\mathrm{NonExt}}+o_{ \mathrm{min}}\left(N_{\mathrm{NonExt}}-1\right)}}{S_{\mathrm{Ext}}-o_{\mathrm{ min}}N_{\mathrm{Ext}}}\right\rceil\right)

This provides a first simplification of the algorithm sketched in the MATH table specification: Directly start iteration at step rminr_{\mathrm{min}}. Note that at each step we start at possibly different maximum overlaps and decrease all of them by a same value. It is not clear what to do when one of the overlap reaches omino_{\mathrm{min}} while others can still be decreased. However, the sketched algorithm says all the connectors should reach minimum overlap before the next increment of rr, which means the target size will indeed be reached at step rminr_{\mathrm{min}}.

One possible interpretation is to stop overlap decreasing for the adjacent connectors that reached minimum overlap and to continue uniform decreasing for the others until all the connectors reach minimum overlap. In that case we may lose equal distribution or symmetry. In practice, this should probably not matter much. So we propose instead the dual option which should behave more or less the same in most cases: Start with all overlaps set to omino_{\mathrm{min}} and increase them evenly to reach a same value oo. By the same reasoning as above we want the inequality

SNonExt-o⁢(NNonExt-1)+rmin⁢(SExt-o⁢NExt)≥t{S_{\mathrm{NonExt}}-o\left(N_{\mathrm{NonExt}}-1\right)}+{r_{\mathrm{min}} \left(S_{\mathrm{Ext}}-oN_{\mathrm{Ext}}\right)}\geq t

which can be rewritten

SNonExt+rmin⁢SExt-o⁢(NNonExt+rmin⁢NExt-1)≥tS_{\mathrm{NonExt}}+r_{\mathrm{min}}S_{\mathrm{Ext}}-{o\left(N_{\mathrm{NonExt% }}+{r_{\mathrm{min}}N_{\mathrm{Ext}}}-1\right)}\geq t

We note that N=NNonExt+rmin⁢NExtN=N_{\mathrm{NonExt}}+{r_{\mathrm{min}}N_{\mathrm{Ext}}} is just the exact number of glyphs used in the assembly. If there is only a single glyph, then the overlap value is irrelevant so we can assume NNonExt+r⁢NExt-1=N-1≥1N_{\mathrm{NonExt}}+{rN_{\mathrm{Ext}}}-1=N-1\geq 1. This provides the greatest theorical value for the overlap oo:

omin≤o≤omaxtheorical=SNonExt+rmin⁢SExt-tNNonExt+rmin⁢NExt-1o_{\mathrm{min}}\leq o\leq o_{\mathrm{max}}^{\mathrm{theorical}}=\frac{S_{ \mathrm{NonExt}}+r_{\mathrm{min}}S_{\mathrm{Ext}}-t}{N_{\mathrm{NonExt}}+{r_{ \mathrm{min}}N_{\mathrm{Ext}}}-1}

Of course, we also have to take into account the limit imposed by the start and end connector lengths. So omaxo_{\mathrm{max}} must also be at most min⁡(ei,si+1)\min{(e_{i},s_{i+1})} for 0≤i≤n-20\leq i\leq n-2. But if rmin≥2r_{\mathrm{min}}\geq 2 then extender copies are connected and so omaxo_{\mathrm{max}} must also be at most min⁡(ei,si)\min{(e_{i},s_{i})} for i∈IExti\in I_{\mathrm{Ext}}. To summarize, omaxo_{\mathrm{max}} is the minimum of omaxtheoricalo_{\mathrm{max}}^{\mathrm{theorical}}, of eie_{i} for 0≤i≤n-20\leq i\leq n-2, of sis_{i} 1≤i≤n-11\leq i\leq n-1 and possibly of e0e_{0} (if 0∈IExt0\in I_{\mathrm{Ext}}) and of of sn-1s_{n-1} (if n-1∈IExt{n-1}\in I_{\mathrm{Ext}}).

With the algorithm described above NExtN_{\mathrm{Ext}}, NNonExtN_{\mathrm{NonExt}}, SExtS_{\mathrm{Ext}}, SNonExtS_{\mathrm{NonExt}} and rminr_{\mathrm{min}} and omaxo_{\mathrm{max}} can all be obtained using simple loops on the glyphs gig_{i} and so the complexity is O⁢(n)O(n). In practice nn is small: For existing fonts, assemblies are made of at most three non-extenders and two extenders that is n≤5n\leq 5 (incidentally, Gecko and WebKit do not currently support larger values of nn). This means that all the operations described above can be considered to have constant complexity. This is much better than a naive implementation of the iterative algorithm sketched in the OpenType MATH table specification which seems to require at worst

∑r=0rmin-1NNonExt+r⁢NExt=NNonExt⁢rmin+rmin⁢(rmin-1)2⁢NExt=O⁢(n×rmin2)\sum_{r=0}^{r_{\mathrm{min}}-1}{N_{\mathrm{NonExt}}+rN_{\mathrm{Ext}}}=N_{ \mathrm{NonExt}}r_{\mathrm{min}}+\frac{r_{\mathrm{min}}\left(r_{\mathrm{min}}-% 1\right)}{2}N_{\mathrm{Ext}}={O(n\times r_{\mathrm{min}}^{2})}

and at least Ω⁢(rmin)\Omega(r_{\mathrm{min}}).

One of issue is that the number of extender repetitions rminr_{\mathrm{min}} and the number of glyphs in the assembly NN can become arbitrary large since the target size tt can take large values e.g. if one writes \underbrace{\hspace{65535em}} in LaTeX. The improvement proposed here does not solve that issue since setting the coordinates of each glyph in the assembly and painting them require Θ⁢(N)\Theta(N) operations as well as (in the case of HarfBuzz) a glyph buffer of size NN. However, such large stretchy operators do not happen in real-life mathematical formulas. Hence to avoid possible hangs in Web engines a solution is to impose a maximum limit NmaxN_{\mathrm{max}} for the number of glyph in the assembly so that the complexity is limited by the size of the DOM tree. Currently, the proposal for HarfBuzz is Nmax=128N_{\mathrm{max}}=128. This means that if each assembly glyph is 1em large you won’t be able to draw stretchy operators of size more than 128em, which sounds a quite reasonable bound. With the above proposal, rminr_{\mathrm{min}} and so NN can be determined very quickly and the cases N≥NmaxN\geq N_{\mathrm{max}} rejected, so that we avoid losing time with such edge cases…

Finally, because in our proposal we use the same overlap oo everywhere an alternative for HarfBuzz would be to set the output buffer size to nn (i.e. ignore r-1r-1 copies of each extender and only keep the first one). This will leave gaps that the client can fix by repeating extenders as long as oo is also provided. Then HarfBuzz math shaping can be done with a complexity in time and space of just O⁢(n)O(n) and it will be up to the client to optimize or limit the painting of extenders for large values of NN…

April 15, 2016 10:00 PM

April 10, 2016

Javier Muñoz

The Outscale OSU driver goes upstream in Libcloud

Apache Libcloud 1.0.0-rc2 (preview) was released today and it contains the new Outscale storage driver I contributed upstream several days ago.

This release together with the digital signatures are available in the download section. You can read the change log here.

Along this entry I will introduce the Apache Libcloud project, the Outscale driver and how a new provider can be used to connect with the Outscale object storage service.

The Apache Libcloud project

Apache Libcloud is an Open Source Python library that provides a vendor-neutral interface to cloud provider APIs. It is used to support diversification without vendor lock-in.

The library eases the interaction with the cloud resources through an unified API and backend drivers.

There are available backend drivers to support popular and well-known cloud service providers. As expected, those drivers contain the functionalities exported to developers and applications via the unified API.

From a logical perspective, the library divides the supported resources among the following categories:

  • Cloud Servers and Block Storage
  • Cloud Object Storage and CDN
  • Load Balancers as a Service
  • DNS as a Service
  • Container Services
  • Backup as a Service

The next picture shows a simplified Libcloud diagram:

As we shall see later, the Outscale driver implements a new Libcloud provider to be used with the Cloud Object Storage API and the Outscale Object Storage Unit (OSU) based on Ceph.

The Outscale object storage driver

Righ now Outscale provides two different storage services in its cloud. They go under the names Block Storage Unit (BSU) and Object Storage Unit (OSU)

The BSU is the way to work with block storage on standard hard drives or SSDs. It is used to create and attach block storage volumes to instances in the Flexible Compute Unit (FCU) service.

The OSU service is all about storing and managing objects in a replicated and high-availability environment. The service exports an AWS S3 compatible API.

The Outscale object storage driver targets this second service, the OSU service. The driver extends the Libcloud S3 storage driver to provide a compatible S3 API together with the required bits to connect with the OSU service.

Using the Outscale OSU storage driver in Libcloud

You can use the OSU driver via the S3_RGW_OUTSCALE provider easily. A simple snippet follows...

from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver
import libcloud

api_key = 'api_key'
secret_key = 'secret_key'

cls = get_driver(Provider.S3_RGW_OUTSCALE)
driver = cls(api_key, secret_key, region='eu-west-1')
container = driver.get_container(...)

The 'region' parameter is an optional parameter. If you don't set up an explicit region the driver will use the default region 'eu-west-2'.

The driver supports the following five Outscale regions:

  • eu-west-1
  • eu-west-2
  • us-west-1
  • us-east-2
  • cn-southeast-1

A full example loading the S3_RGW_OUTSCALE provider and then retrieving an object is available here.

Running the example...

$ ./test-osu-outscale-driver.py
<Object: name=my-name-abcdabcd-123,
size=3453,
hash=89c28be4b979a529afa5f24fae439858,
provider=OUTSCALE Ceph RGW S3 (eu-west-1) ...>

Acknowledgments

My work in Apache Libcloud is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Libcloud community. Thank you all!

by Javier at April 10, 2016 10:00 PM

Diego Pino

Network namespaces

Namespaces and cgroups are two of the main kernel technologies most of the new trend on software containerization (think Docker) rides on. To put it simple, cgroups are a metering and limiting mechanism, they control how much of a system resource (CPU, memory) you can use. On the other hand, namespaces limit what you can see. Thanks to namespaces processes have their own view of the system’s resources.

The Linux kernel provides 6 types of namespaces: pid, net, mnt, uts, ipc and user. For instance, a process inside a pid namespace only sees processes in the same namespace. Thanks to the mnt namespace, it’s possible to attach a process to its own filesystem (like chroot). In this article I focus only in network namespaces.

If you have grasped the concept of namespaces you may have at this point an intuitive idea of what a network namespace might offer. Network namespaces provide a brand-new network stack for all the processes within the namespace. That includes network interfaces, routing tables and iptables rules.

Network namespaces

From the system’s point of view, when creating a new process via clone() syscall, passing the flag CLONE_NEWNET will create a brand-new network namespace into the new process. From the user perspective, we simply use the tool ip (package is iproute2) to create a new persistent network namespace:

$ ip netns add ns1

This command will create a new network namespace called ns1. When the namespace is created, the ip command adds a bind mount point for it under /var/run/netns. This allows the namespace to persist even if there’s no process attached to it. To list the namespaces available in the system:

$ ls /var/run/netns
ns1

Or via ip:

$ ip netns
ns1

As previously said, a network namespace contains its own network resources: interfaces, routing tables, etc. Let’s add a loopback interface to ns1:

1 $ ip netns exec ns1 ip link set dev lo up
2 $ ip netns exec ns1 ping 127.0.0.1
3 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
4 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.115 ms
  • Line 1 brings up the loopback interface inside the network namespace ns1.
  • Line 2 executes the command ping 127.0.0.1 inside the network namespace.

An alternative syntax to bring up the loopback interface could be:

$ ip netns exec ns1 ifconfig lo up

However, I tend to use the command ip as it has become the preferred networking tool in Linux, obsoleting the old but more familiar commands ifconfig, route, etc. Notice that ip requires root privileges, so run it as root or prepend sudo.

A network namespace has its own routing table too:

$ ip netns exec ns1 ip route show

Which at this point returns nothing as we haven’t add any routing table rule yet. Generally speaking, any command run within a network namespace is prepend by the prologue:

$ ip netns exec <network-namespace>

A practical example

One of the consequences of network namespaces is that only one interface could be assigned to a namespace at a time. If the root namespace owns eth0, which provides access to the external world, only programs within the root namespace could reach the Internet. The solution is to communicate a namespace with the root namespace via a veth pair. A veth pair works like a patch cable, connecting two sides. It consists of two virtual interfaces, one of them is assigned to the root network namespace, while the other lives within a network namespace. Setting up their IP addresses and routing rules accordingly, plus enabling NAT in the host side, will be enough to provide Internet access to the network namespace.

Additionally, I feel like I need to make a clarification at this point. I’ve read in several articles about network namespaces that physical device interfaces can only live in the root namespace. At least that’s not the case with my current kernel (Linux 3.13). I can assign eth0 to a namespace other than the root and when setting it up properly have Internet access from the namespace. However, the limitation of one interface living only in one single namespace at a time still applies, and that’s a reason powerful enough to need connecting network namespace via a veth pair.

To start with, let’s create a new network namespace called ns1.

# Remove namespace if it exists.
ip netns del ns1 &>/dev/null

# Create namespace
ip netns add ns1

Next, create a veth pair. Interface v-eth1 will remain inside the root network namespace, while its peer, v-peer1, will be moved to the ns1 namespace.

# Create veth link.
ip link add v-eth1 type veth peer name v-peer1

# Add peer-1 to NS.
ip link set v-peer1 netns ns1

Next, setup IPv4 addresses for both interfaces and bring them up.

# Setup IP address of v-eth1.
ip addr add 10.200.1.1/24 dev v-eth1
ip link set v-eth1 up

# Setup IP address of v-peer1.
ip netns exec ns1 ip addr add 10.200.1.2/24 dev v-peer1
ip netns exec ns1 ip link set v-peer1 up
ip netns exec ns1 ip link set lo up

Additionally I brought up the loopback interface inside ns1.

Now it’s necessary we make all external traffic leaving ns1 to go through v-eth1.

ip netns exec ns1 ip route add default via 10.200.1.1

However this won’t be enough. As with any host sharing Its internet connection, it’s necessary to enable IPv4 forwarding in the host and enable masquerading.

# Share internet access between host and NS.

# Enable IP-forwarding.
echo 1 > /proc/sys/net/ipv4/ip_forward

# Flush forward rules, policy DROP by default.
iptables -P FORWARD DROP
iptables -F FORWARD

# Flush nat rules.
iptables -t nat -F

# Enable masquerading of 10.200.1.0.
iptables -t nat -A POSTROUTING -s 10.200.1.0/255.255.255.0 -o eth0 -j MASQUERADE

# Allow forwarding between eth0 and v-eth1.
iptables -A FORWARD -i eth0 -o v-eth1 -j ACCEPT
iptables -A FORWARD -o eth0 -i v-eth1 -j ACCEPT

If everything went fine, it would be possible to ping an external host from ns1.

$ ip netns exec ns1 ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=50 time=48.5 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=50 time=50.8 ms

This is how the routing table inside ns1 would look like after the setup:

$ ip netns exec ns1 ip route sh
default via 10.200.1.1 dev v-peer1 
10.200.1.0/24 dev v-peer1  proto kernel  scope link  src 10.200.1.2 

Prepending the ip netns exec prologue for every command to run from the namespace might be a bit tedious. Once the most basic features inside the namespace are setup, a more interesting possibility is to run a bash shell and attach it to the network namespace:

$ ip netns exec ns1 /bin/bash --rcfile <(echo "PS1=\"namespace ns1> \"")
namespace ns1> ping www.google.com
PING www.google.com (178.60.128.38) 56(84) bytes of data.
64 bytes from cache.google.com (178.60.128.38): icmp_seq=1 ttl=58 time=17.6 ms

Type exit to leave end the bash process and leave the network namespace.

Conclusion

Network namespaces, as well as other containerization technologies provided by the Linux kernel, are a lightweight mechanism for resource isolation. Processes attached to a network namespace see their own network stack, while not interfering with the rest of the system’s network stack.

Network namespaces are easy to use too. A similar network-level isolation could have been set up using a VM. However, that seems a much more expensive solution in terms of system resources and time investment to build up such environment. If you only need process isolation at the networking level, network namespaces are definitively something to consider.

The full script is available as a GitHub gist at: ns-inet.sh.

More readings

April 10, 2016 09:30 PM

April 06, 2016

Víctor Jáquez

gstreamer-vaapi 1.8: the codec split

On march 23th GStreamer 1.8 was released, along with all its bundled modules, and, of course, one of those modules is gstreamer-vaapi.

Let us talk about this gstreamer-vaapi release, since there are several sweets!

First thing to notice is that the encoders have been renamed. Before they followed the pattern vaapiencode_{codec}, now they follow the pattern vaapi{codec}enc. The purpose of this change is twofold: to fix the plugins gtk-docs and to keep the usual element names in GStreamer. The conversion table is this:

Old New
vaapiencode_h264 vaapih264enc
vaapiencode_h265 vaapih265enc
vaapiencode_mpeg2 vaapimpeg2enc
vaapiencode_jpeg vaapijpegenc
vaapiencode_vp8 vaapivp8enc

But those were not the only name changes, we also have split the vaapidecode. Now we have a vaapijpegdec, which only decodes JPEG images, while keeping the old vaapidecode for video decoding. Also, vaapijpegdec was demoted to a marginal rank, because there are some problems in the Intel VA driver (which is the only one which supports JPEG decoding right now).

Note that in future releases, all the decoders will be split by codec, just as we did the JPEG decoder now; but first, we need to modify vaapidecodebin to choose a decoder in run-time based on the negotiated caps.

Please, update your scripts and code accordingly.

There are a ton of enhancements and optimizations too. Let me enumerate some of them: Vineeth TM fixed several memory leaks, and some compilations issues; Tim enabled vaapisink to send unhandled keyboard or mouse events to the application, making the usage of apps like gst-play-1.0 or apps based on GstPlayer be more natural; Thiago fixed the h264/h265 parsers, meanwhile Sree fixed the vp9 and the h265 ones too; Scott also fixed the h265 parser; et cetera. As may you see, H265/HEVC parser has been very active lately, it is the new thing!

I have to thank Sebastian Dröge, he did all the release work and also fixed a couple compilation issues.

This is the short log summary since 1.6:

 2  Lim Siew Hoon
 1  Scott D Phillips
 8  Sebastian Dröge
 3  Sreerenj Balachandran
 5  Thiago Santos
 1  Tim-Philipp Müller
 8  Vineeth TM
16  Víctor Manuel Jáquez Leal

by vjaquez at April 06, 2016 10:47 AM