# Planet Igalia

## June 21, 2016

### Xavier Castaño

#### We’re proud to sponsor #Linuxcon North America and to celebrate the 25th anniversary of #Linux and #opensource!

LinuxCon North America and ContainerCon feature 175+ sessions for developers, operators, users and other open source professionals with a range of content covering Linux, Containers, Cloud and much more!

1. Bonus Content Day featuring Docker 101 lab and tutorials
2. Co-located events such as CloudNativeDay, KVM Forum, Linux Security Summit, Open Source Storage Summit, and Xen Project Developer Summit
3. 25th Anniversary of Linux Casino Royale Gala on Wednesday, August 24th

Igalia will have a booth there so don’t miss your chance to attend and visit us!

## June 20, 2016

### Javier Muñoz

#### Ansible AWS S3 core module now supports Ceph RGW S3

The Ansible AWS S3 core module now supports Ceph RGW S3. The patch was merged upstream today and will be included in Ansible 2.2.

This post will introduce the new RGW S3 support in Ansible together with the required bits to run Ansible playbooks handling S3 use cases in Ceph Jewel.

The Ansible project

Ansible is a simple IT automation engine that automates cloud provisioning, configuration management, application deployment and intra-service orchestration among many other IT needs.

Ansible works by connecting to nodes (SSH/WinRM) and pushing out small programs, called 'Ansible modules' to them. These programs are written to be resource models of the desired state of the system. Ansible then executes these modules and removes them when finished.

Ansible uses playbooks to orchestrate the infrastructure with very detailed control. Those playbooks define configuration policies and orchestration workflows. They are a YAML definition of automation tasks that describe how a particular piece of automation should be done.

Playbooks are modeled as a collection of plays, each of which defines a set of tasks to be executed on a group of remote hosts. A play also defines the environment where the tasks will be executed.

Ansible modules ensure idempotence, so the same tasks can be run over and over without affecting the final result.
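To make the idempotence point concrete, here is a minimal Python sketch. It is not Ansible's module code: `ensure_bucket` is a hypothetical helper, and the in-memory `buckets` set is an assumption standing in for a real S3 endpoint. The helper describes a desired state and reports whether it had to change anything, which is the same "ok" versus "changed" distinction Ansible prints for each task.

```python
def ensure_bucket(buckets, name, state="present"):
    """Bring `buckets` to the desired state and report whether it changed.

    `buckets` is a plain set standing in for a real S3 endpoint
    (an assumption made for illustration only).
    """
    if state not in ("present", "absent"):
        raise ValueError("state must be 'present' or 'absent'")
    exists = name in buckets
    if state == "present" and not exists:
        buckets.add(name)
        return True   # would be reported as "changed"
    if state == "absent" and exists:
        buckets.remove(name)
        return True   # would be reported as "changed"
    return False      # already in the desired state: reported as "ok"


buckets = set()
first = ensure_bucket(buckets, "my-bucket")
second = ensure_bucket(buckets, "my-bucket")
print(first, second, sorted(buckets))  # True False ['my-bucket']
```

Running the same call twice leaves the final state untouched; only the first call reports a change.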

Using the Ansible AWS S3 core module with Ceph

The Ceph RGW S3 support is part of the Amazon S3 core module in Ansible. The AWS S3 core module allows the user to manage S3 buckets and the objects within them. It includes support for creating and deleting both objects and buckets, retrieving objects as files or strings and generating download links.

The patch leverages the AWS S3 use cases without any restriction or limitation with regions, URLs, etc.

To enable the RGW S3 flavour in the S3 core module you set the 'rgw' boolean option to 'true' and the 's3_url' string option to the RGW S3 server.

    controller:~$ ansible rgw.test.node -m s3 -a \
        "mode=list bucket=my-bucket s3_url=http://rgw.test.server:8000 rgw=true"

The 's3_url' option is mandatory in the RGW S3 flavour.

Testing the RGW S3 core module flavour via Playbook

This Playbook example tests the RGW S3 flavour in a simple three-box network ('controller', 'rgw.test.node' and 'rgw.test.server').

The 'controller' host is the box running the Ansible engine. The 'rgw.test.node' is the host connecting to the Ceph RGW server and running the S3 use cases. The 'rgw.test.server' is the Ceph RGW S3 server.

Running the testing Playbook...

    controller:~$ ansible-playbook playbook-test-rgw-s3-core-module.yml

PLAY [all] *********************************************************************

ok: [rgw.test.node]

changed: [rgw.test.node]

changed: [rgw.test.node] => (item=my-test-object-1.txt)
changed: [rgw.test.node] => (item=my-test-object-2.txt)

ok: [rgw.test.node]

changed: [rgw.test.node] => (item=my-test-object-1.txt)
changed: [rgw.test.node] => (item=my-test-object-2.txt)

changed: [rgw.test.node]

PLAY RECAP *********************************************************************
rgw.test.node              : ok=6    changed=4    unreachable=0    failed=0


As expected, the Playbook runs the 'create', 'upload', 'list', 'download' and 'remove' S3 use cases for buckets and objects. Adding the '-v' switch will show a more verbose output.
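For scripted ad-hoc runs, the option rules described earlier ('rgw' set to true, 's3_url' mandatory in that case) can be captured in a small helper. This is only a sketch: `s3_module_args` is a hypothetical name, and it merely assembles the `key=value` string passed to `-a`; it does not talk to Ansible or to any S3 server.

```python
def s3_module_args(mode, bucket, s3_url=None, rgw=False):
    """Build the key=value argument string for an ad-hoc `-a` invocation.

    Hypothetical helper for illustration; only the option names
    (mode, bucket, s3_url, rgw) come from the post.
    """
    if rgw and not s3_url:
        # The post states that s3_url is mandatory in the RGW S3 flavour.
        raise ValueError("s3_url is mandatory when rgw=true")
    parts = ["mode=" + mode, "bucket=" + bucket]
    if s3_url:
        parts.append("s3_url=" + s3_url)
    if rgw:
        parts.append("rgw=true")
    return " ".join(parts)


print(s3_module_args("list", "my-bucket",
                     s3_url="http://rgw.test.server:8000", rgw=True))
# mode=list bucket=my-bucket s3_url=http://rgw.test.server:8000 rgw=true
```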

Wrap-up

All examples, modules and playbooks related to RGW S3 were tested on the new Ceph Jewel release.

Beyond the RGW S3 support in the AWS S3 core module, you may be interested in Ansible Playbooks to set up and configure Ceph clusters automatically. You can find those Playbooks in ceph/ceph-ansible, with general support for Monitors, OSDs, MDSs and RGW.

The primary documentation for Ansible is available here. I found the Ansible whitepapers a great resource too.

Acknowledgments

My work in Ansible is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ansible community. Thank you all!

### Manuel Rego

#### My BlinkOn 6 Summary: Grid Layout, Houdini & MathML

Igalia could not miss the chance to participate in a new European edition of BlinkOn, so last week my colleague Dape and I attended BlinkOn 6 in Munich. In this post I’ll give a personal summary of the main conference highlights.

### Status of CSS Grid Layout implementation

During the conference I gave a talk about the status of Grid Layout on Blink. I went over the whole spec checking the things that are DONE, WIP or TODO. The summary is that on top of the things we’re already working on, and that will be landing very soon (auto-fit repeat, orthogonal flows, normal value in align|justify-items properties, etc.), we’ve just a few tasks pending (most of them changes on the spec in the last 2 months).

The main features pending are:

• Fragmentation: This is an issue on the whole project and not only related to Grid Layout. For example, Blink doesn’t have proper fragmentation support on Flexbox either, so it’s probably not a blocker for shipping Grid Layout. This will affect you when you try to print a grid and the rows are cut in the middle, when they should have been moved completely to the next page. Of course, it’d be really nice to have it fixed and working as expected.
• Subgrids: This is a hot topic and opinions vary a lot depending on whom you ask. For some people, we should ship without subgrid support and add it later (as the change won’t break any content); others believe that we shouldn’t ship until subgrids are implemented. I think more discussion is still needed between the CSS Working Group and the different browser vendors to reach an agreement on this issue.

You can find the slides of my talk on this blog already (notice that you’d need Chrome Canary with the experimental flag enabled to see them properly). Sadly the talks outside the main room were not recorded, so there won’t be any video of mine.

Igalia and Bloomberg working together to build a better web

From the different conversations I had, everyone seems really happy about the work we’re doing on Grid Layout, so thank you very much for the kind words. 😉 As you probably already know, Igalia’s work on this feature has been sponsored by Bloomberg, big kudos to them!

### CSS Houdini

Houdini had a big presence at the conference, and it seems clear that Google is really pushing for this to happen. It was nice to see the current status of some APIs that are already working in an experimental way, like Typed OM and the Painting API. Also, it’s great to see that other browsers are already taking some initial steps too, like Firefox with the Properties and Values API.

Userspace vector graphics using Houdini & Custom Elements by Shane Stephens

Apart from the introductory talk on the main room there were 2 more sessions:

### MathML in Chromium?

Just in case you don’t know, Igalia has lately been working on refactoring the MathML implementation in WebKit. The rationale was that the previous code was a pain to maintain and improve (e.g. it depended a lot on Flexbox, making it really hard to implement new features). In fact, this was one of the reasons why it was removed from Blink after the fork. The refactoring is still ongoing but several patches have already landed.

As the new code (after the refactoring) was looking much better in WebKit, my colleague Fred Wang started to port it to Blink in a GitHub repository. The initial results are looking pretty good, but not all the code from WebKit has been ported yet.

MathML mentioned on the Houdini session

Thus, I took advantage of the conference to talk about this work with several people in order to check the feasibility of bringing MathML back to Chromium. After some discussions it seems clear that Google would be interested in having MathML support. However, at this moment they seem more inclined to do it through the new Houdini APIs, like CSS Layout and Font Metrics, that are still being worked on. MathML seems to be a nice use case to verify that they work as expected.

If they follow this approach, it probably means that we won’t see MathML supported on Blink in the short-term, as we’d need to wait for those APIs to be ready. In any case, Igalia will keep an eye on all this stuff looking for a good opportunity to make it a reality.

### Summary

As usual for me the most important part of these conferences is having the chance to meet some new faces and old friends too. It’s really nice to have the chance to spend a while talking face to face with people that have been interacting with you on the Internet for a long time.

Regarding Grid Layout, I hope that the different parties can find a good way to bring it to web authors in the 3 major browsers (Chrome, Safari and Firefox) almost synchronously. I cannot say if it’s going to happen this year or the next one, but I’m sure it’s going to be something really big for the Web!

All the CSS Houdini stuff sounds really cool but a long-term thing at this moment. Let’s see how long we need to wait until we can use these new APIs for real.

Munich’s Town Hall from the Tower of St. Peter’s Church

Finally, it was nice to visit Munich and to have some time for a walk around the city. We couldn’t find a single hill around the city center, so next time it might be a good idea to rent a bike.

## June 14, 2016

### Requisites

For the context of this post, a ‘content_shell’ ChromeOS GN build is assumed, produced with the following commands:

    $ gn gen --args='target_os="chromeos" use_ozone=true ozone_platform_wayland=true use_wayland_egl=false ozone_platform_x11=true ozone_platform_headless=true ozone_auto_platforms=false' out-gn-ozone
    $ ninja -C out-gn-ozone blink_tests

### Ozone at runtime

Looking at the build arguments above, one can foresee that the content_shell app will be able to run on the following Graphics backends: Wayland (no EGL), X11, and “headless” – and it indeed is. This happens thanks to Chromium’s Graphics layer abstraction, Ozone.
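Before going into the details, the runtime selection this post goes on to describe boils down to a table of constructors, one per platform compiled into the build, indexed by the platform chosen on the command line. The following Python sketch only illustrates that idea. It is not Chromium code: the class names, the table order and the "first entry is the default" rule are all assumptions made for the example.

```python
# Platforms compiled into this hypothetical build, in table order.
PLATFORMS = ["headless", "wayland", "x11"]


class OzonePlatformHeadless:
    pass


class OzonePlatformWayland:
    pass


class OzonePlatformX11:
    pass


# One constructor per platform, stored at the same index as in PLATFORMS.
CONSTRUCTORS = [OzonePlatformHeadless, OzonePlatformWayland, OzonePlatformX11]


def platform_id(name=None):
    """Map a platform name to its table index.

    Falling back to the first entry when no name is given is an
    assumption of this sketch.
    """
    if name is None:
        return 0
    return PLATFORMS.index(name)  # raises ValueError for unknown platforms


def create_platform(name=None):
    """Look up the constructor for the selected platform and call it."""
    return CONSTRUCTORS[platform_id(name)]()


print(type(create_platform("x11")).__name__)  # OzonePlatformX11
```

Because the table is built from the list of compiled-in platforms, asking for a platform that was not enabled at build time fails at lookup rather than at construction.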

So, in order to run content_shell app on Ozone platform “bleh”, one simply does:

    out-gn-ozone/content_shell --ozone-platform="bleh"

Simple, no? Well, yes... and not quite. The way the desired Ozone platform classes/objects are instantiated is interesting, involving C++ templates, GYP/GN hooks, Python-generated code, and more. This post aims to detail the process some more.

### Ozone platform selection logic

Two methods kick off OzonePlatform instantiation: ::InitializeForUI and ::InitializeForGPU. They both call ::CreateInstance(), which is our starting point. This is how simple it looks:

    63 void OzonePlatform::CreateInstance() {
    64   if (!instance_) {
         (..)
    69     std::unique_ptr<OzonePlatform> platform =
    70         PlatformObject<OzonePlatform>::Create();
    71
    72     // TODO(spang): Currently need to leak this object.
    73     OzonePlatform* pl = platform.release();
    74     DCHECK_EQ(instance_, pl);
    75   }
    76 }

Essentially, when PlatformObject<T>::Create is run (lines 69 and 70), it ends up calling a method named Create{T}Bleh, where

• “T” is the template argument name, e.g. “OzonePlatform”.
• “bleh” is the value passed to the --ozone-platform command line parameter.

For instance, in the case of ./content_shell --ozone-platform=x11, the method called would be CreateOzonePlatformX11, following the pattern Create{T}Bleh (i.e. “Create” + “OzonePlatform” + “X11”).

### The actual logic

In order to understand how the PlatformObject class works, let’s start by looking at its definition (ui/ozone/platform_object.h & platform_internal_object.h):

    template <class T>
    class PlatformObject {
     public:
      static std::unique_ptr<T> Create();
    };

    16 template <class T>
    17 std::unique_ptr<T> PlatformObject<T>::Create() {
    18   typedef typename PlatformConstructorList<T>::Constructor Constructor;
    19
    20   // Determine selected platform (from --ozone-platform flag, or default).
    21   int platform = GetOzonePlatformId();
    22
    23   // Look up the constructor in the constructor list.
    24   Constructor constructor = PlatformConstructorList<T>::kConstructors[platform];

    26   // Call the constructor.
    27   return base::WrapUnique(constructor());
    28 }

Line 24 is where the Ozone platform runtime selection actually happens. It retrieves a Constructor, which is a typedef for a PlatformConstructorList<T>::Constructor. Looking at the definition of the PlatformConstructorList class (below), Constructor is actually a pointer to a function that returns a T*.

    template <class T>
    struct PlatformConstructorList {
      typedef T* (*Constructor)();
      static const Constructor kConstructors[kPlatformCount];
    };

Ok, so basically here is what we know so far:

1. The OzonePlatform::CreateInstance method calls PlatformObject<bleh>::Create
2. PlatformObject<bleh>::Create picks up an index and retrieves a PlatformConstructorList<bleh>::Constructor (via kConstructors[index])
3. PlatformConstructorList<bleh>::Constructor is a typedef to a function pointer that returns a bleh*.
4. (..)
5. This chain ends up calling Create{bleh}{ozone_platform}()

But wait! kConstructors, the array of pointers to functions that solves the puzzle, is not defined anywhere in src/! This is because its actual definition is part of some generated code triggered by specific GN/GYP hooks. They are:

• generate_ozone_platform_list, which generates out/../platform_list.{cc,h,txt}
• generate_constructor_list, which generates out/../constructor_list.cc through generate_constructor_list.py

In the end, out/../constructor_list.cc has the definition of kConstructors. Again, in the specific case of the given GN build arguments, kConstructors would look like:

    template <>
    const OzonePlatformConstructor
    PlatformConstructorList<ui::OzonePlatform>::kConstructors[] = {
        &ui::CreateOzonePlatformHeadless,
        &ui::CreateOzonePlatformWayland,
        &ui::CreateOzonePlatformX11,
    };

### Logic wrap up

• GYP/GN hooks are run at build time, and generate platform_list.{txt,cc,h} as well as constructor_list.cc files respecting the ozone_platform_bleh parameters.
• constructor_list.cc has PlatformConstructorList<bleh>::kConstructors actually populated.
• ./content_shell --ozone-platform=bleh is called
• OzonePlatform::InitializeFor{UI,GPU}()
• OzonePlatform::CreateInstance()
• PlatformObject<OzonePlatformBleh>::Create()
• PlatformConstructorList<bleh>::Constructor is retrieved. It is a pointer to a function stored in PlatformConstructorList<bleh>::kConstructors
• The function is run and an OzonePlatformBleh instance is returned.

## June 02, 2016

### Alejandro Piñeiro

#### Introducing Mesa intermediate representations on Intel drivers with a practical example

###### Introduction

The recent big news about Igalia’s work on Mesa was that our effort to get the ARB_gpu_shader_fp64 and ARB_vertex_attrib_64bit extensions implemented for Intel Gen8+ made it possible to expose OpenGL 4.2 on Gen8+. But I will let other Igalians talk about them in detail (no pressure ;)).

In a previous blog post I mentioned that NIR was intended to replace GLSL IR. Although that was true in the context I was talking about, that comment could be somewhat misleading, so I will try to clarify it.

###### Intermediate representations on Intel Mesa drivers

So first, let’s list the intermediate representations that you would find when working on Mesa Intel drivers:

• AST (Abstract Syntax Tree): calling it an Intermediate Language is somewhat an abuse of language. This is the tree representation of your GLSL shader just after parsing it with Flex/Bison.
• Mesa IR (Intermediate Representation): also called HIR and GLSL IR. A real Intermediate Language. It is converted from AST. Here you have optimizations, link support, etc.
• NIR (New Intermediate Representation): a new Intermediate Language added recently.

So the first question would be, why three? Having AST and another intermediate representation is easier to explain. AST is a raw tree representation, not useful to generate code. But why IR and NIR?

Current Mesa IR was created some years ago.
The design decisions behind it had their advantages, but it also has some disadvantages. For example, its tree-like structure makes it complex to traverse, which makes it difficult to navigate, implement optimizations, etc. Another disadvantage was that it was not in SSA form, which again made some optimizations hard to write. You can see a summary of all this in Ian Romanick’s presentation “Three Years Experience with a Tree-like Shader IR”.

So at some point an effort was started to flatten Mesa IR and add SSA support. But the conclusion was that the effort to modify Mesa IR was so big that it was worth just starting from scratch using the lessons learned, as explained in Connor Abbott’s email “A new IR for Mesa”, in which he proposed this new IR.

Some time later, NIR was ready for production, and as I mentioned in my blog post (the one I’m clarifying right now), some parts of the Mesa Intel driver were reimplemented in order to use NIR instead of Mesa IR. So Mesa IR was being replaced there. Where exactly? The parts where the final assembly code was being generated. And now that that is finished (at least on the i965 driver), we can say that Mesa IR is not used to generate code at all.

So right now there is an AST->Mesa IR->NIR chain. What is the plan now? Generate an AST->NIR pass and completely remove Mesa IR? This same question was asked (among other things) in January 2016, in the mesa-dev email thread “Nir, SCons, and Gallium”. And the answer was “no”, as you can see in two of Ian Romanick’s replies (here and here). The summary is that Mesa IR has several GLSL specifics that aren’t appropriate for NIR’s level. In that sense, NIR is a step below Mesa IR, closer to the GPU’s needs.

It is also worth mentioning that since then, Vulkan support was added to Mesa. In order to support SPIR-V (Vulkan’s shader language, which is an intermediate representation itself), a SPIR-V->NIR pass was created.
In that sense, for OpenGL there is an OpenGL-specific intermediate representation, Mesa IR, and for Vulkan there is a Vulkan-specific intermediate representation, SPIR-V, and both are translated to the same common representation, NIR.

###### Practical example:

So, what does that mean in practice? Do I need to deal with all those intermediate representations? Well, as with anything in life, that depends. If, for example, you want to provide support for a new GLSL feature or for specific hw, you would need to touch all three. You would need to modify the flex/bison files, so AST would need to be updated, and then Mesa IR, NIR, and the passes that transform one into the other. But if you want to support a GLSL feature that is already supported, but on new hw, most likely you will not need changes in any of them, just in the code that generates the final assembly using NIR.

And what would happen to other features like warnings and errors? Right now most of them are detected at the AST level, and some at the IR level. NIR doesn’t trigger any error/warning yet. It contains several asserts, but that is because it assumes that by that point the representation of the shader should be correct. If something is wrong there, it means that the developer working on NIR is doing something wrong, as opposed to the developer who wrote the GLSL shader.

So let’s go for a practical example I was working on: uninitialized variable warnings (“undefined values” on the bug tracking the issue). In short, this is about warning the developer about things like this:

    out vec4 color;

    void main() {
        vec4 myTemp;
        color = myTemp;
    }

What would be the color on screen? Who knows. So although it is a feature you can live without, it is a good nice-to-have.

On the original bug, it was mentioned that this kind of error is easy to detect, as NIR has a type ssa_undefs, so we just need to check if they are used.
And in fact, when I started to work on it, I quickly found out how to raise a warning. For example, the method nir_print.c:print_ssa_def, used for debugging, is easy to modify in order to point out that it is using an undef. But trying to raise the warning there has some problems:

• As mentioned, NIR doesn’t have any warning/error triggering mechanism implemented, so you would need to add one.
• You want to include this warning in the OpenGL InfoLog, but as mentioned NIR is used for both OpenGL and Vulkan, and right now it doesn’t maintain any info about the origin.
• And perhaps more importantly, at that point you lack the context data that tells you which source code line you are working on.

In fact, the last bullet point also applies to Mesa IR. There are some warnings raised at the Mesa IR level, but they are “line-less”. I’m not sure about other developers, but for me this kind of warning, without a reference to the source line number, would be annoying to use.

Does that mean that the only option would be the totally raw AST tree? Fortunately, Mesa IR was already saving whether a variable was being statically assigned or not (to check some other possible errors). This was being computed on the AST to Mesa IR pass, and in fact the documentation mentions that this value is only valid at that moment. So we would be in the middle of AST and Mesa IR.

So when to raise the warning? The straightforward solution would be when a variable is used, just before/after the error “variableX undeclared” is raised. But that is not so easy. For example:

    float myFloat1;
    float myFloat2;
    myFloat1 = myFloat2;

How many warnings should we raise? Just one, for myFloat2. But technically we are also using myFloat1, and it is uninitialized. So we need to differentiate both cases. AST being so raw, at that point we don’t have that information, and in fact it is also impossible to go up to the parent expression in order to compute that information.
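The distinction between the two cases can be illustrated with a toy model. This is a Python sketch under loud assumptions: it is not Mesa code (Mesa is written in C/C++), the tuple-based statement format is invented for the example, and it handles only plain declarations and assignments. What it shows is that a name on the left-hand side of an assignment is a write and never warns, while a name on the right-hand side is a read and warns if nothing has been assigned to it yet, with the source line attached.

```python
def find_uninitialized_reads(statements):
    """Return (line, name) pairs for variables read before being assigned.

    Each statement is a tuple in a made-up format:
      ("declare", line, name)
      ("assign", line, lhs_name, [rhs_names])
    """
    assigned = set()
    warnings = []
    for stmt in statements:
        kind, line = stmt[0], stmt[1]
        if kind == "declare":
            continue  # a declaration alone neither reads nor writes
        if kind == "assign":
            lhs, rhs_names = stmt[2], stmt[3]
            # Names on the right-hand side are reads: warn if never assigned.
            for name in rhs_names:
                if name not in assigned:
                    warnings.append((line, name))
            # The left-hand side is written, not read, so it never warns.
            assigned.add(lhs)
    return warnings


program = [
    ("declare", 1, "myFloat1"),
    ("declare", 2, "myFloat2"),
    ("assign", 3, "myFloat1", ["myFloat2"]),
]
print(find_uninitialized_reads(program))  # [(3, 'myFloat2')]
```

Only myFloat2 is reported, even though myFloat1 is also uninitialized when the assignment is reached.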
So it was necessary to add an attribute on the AST node, which I called is_lhs (as in “is left hand side”). That variable would be set when parent expressions are being transformed.

If you are paying attention, you can probably start to see the collateral effect of this. AST being so raw, and OpenGL specific, there would be several corner cases that need to be manually assigned. In fact the first commit of the series already covers several corner cases. And in spite of this, once the code reached master, there were two cases of false positives that needed extra checks (for builtin variables and for inout/out function parameters).

After those two false positives, the handling of this warning was spread all over the code that implements the AST to Mesa IR pass, so it seemed easy to break. So I decided to also send some unit tests to verify that it keeps working. First I sent the make check test that tested that warning, and then the unit tests. 30 unit tests (it was initially 28, but the reviewer asked for two more). Not a bad number for a warning.

###### Final words

At this point, one would wonder if it still makes sense to have this warning on the AST to Mesa IR pass, and whether it would have been better to do it as initially proposed, on NIR. But although it is true that “just detecting it” would be easier on NIR, without dealing with so many corner cases, I still think that adding support for raising warnings/errors compatible with the OpenGL InfoLog, and somehow bringing along the original source code line number, would mean too many changes on both Mesa IR and NIR. More changes than dealing with those corner cases when using the variable on the AST to Mesa IR pass. And in any case, if in the future the situation changes, and it makes sense to move the warning to NIR, we would have the unit tests to help ensure that we don’t introduce regressions.

In relation to the intermediate representations, just note that I’m focusing on the Intel driver.
Gallium drivers use another intermediate representation, called TGSI. As far as I know, those drivers have an AST->Mesa IR->TGSI chain, and right now there is a work-in-progress AST->Mesa IR->NIR->TGSI chain that will be used in some specific cases. But all this is beyond my knowledge, so you would need to investigate if you are interested.

###### Appendix, extra documentation:

If you want more details about Mesa IR, you can read:

• The Mesa IR README
• Past blog posts from fellow Igalian Iago Toral (when NIR was not available yet): post 1 and post 2

If you want extra information about Mesa NIR, you can read:

### Xavier Castaño

#### Igalia sponsors LinuxCon and ContainerCon Japan 2016, the premier Linux Conference in Asia

Join us at LinuxCon + ContainerCon Japan!

LinuxCon Japan is the premier Linux conference in Asia that brings together a unique blend of core developers, administrators, users, community managers and industry experts. It is designed not only to encourage collaboration but to support future interaction between Japan and other Asia Pacific countries and the rest of the global Linux community.

ContainerCon also comes to Japan for the first time this year, after a successful launch in North America in 2015. New developments in Linux containers are driving the adoption of cloud and virtualization technologies in many industries and ContainerCon Japan will provide a platform for exhibiting the best work.

My colleague Mi Sun Silvia Cho will be talking about WebKit For Wayland, and we will show some nice demos of our Browsers, Graphics and Compilers experience at our booth in Tokyo next month.

## May 26, 2016

### Manuel Rego

#### CSS Grid Layout and positioned items

As part of the work done by Igalia in the CSS Grid Layout implementation on Chromium/Blink and Safari/WebKit, we’ve been implementing the support for positioned items. Yeah, absolute positioning inside a grid.
😅

Probably the first idea that comes to your mind is that you don’t want to use positioned grid items, but in some use cases they may be needed. The idea of this post is to explain how they work inside a grid container, as they have some particularities.

Actually there’s not such a big difference compared to regular grid items. When the grid container is the containing block of the positioned items (e.g. using position: relative; on the grid container) they’re placed almost the same as regular grid items. But there are a few differences:

• Positioned items don’t stretch by default.
• They don’t create implicit tracks.
• They don’t occupy cells as far as the auto-placement feature is concerned.
• auto has a special meaning when referring to lines.

Let’s explain each of these features in more detail.

### Positioned items shrink to fit

We’re used to regular items that stretch by default to fill their area. However, that’s not the case for positioned items. Similar to what a positioned regular block does, they shrink to fit.

This is pretty easy to get, but a simple example will make it crystal clear:

In this example we’ve a simple 2x2 grid. Both the regular item and the positioned one are placed with the same rules, taking the whole grid. This defines the area for those items, which takes the 1st & 2nd rows and 1st & 2nd columns.

Positioned items shrink to fit

The regular item stretches by default both horizontally and vertically, so it takes the whole size of the grid area. However, the positioned item shrinks to fit and adapts its size to the contents.

For the examples in the next points I’m ignoring this difference, as I want to show the area that each positioned item takes. To get the same result as in the pictures, you’d need to set 100% width and height on the positioned items.

### Positioned items and implicit grid

Positioned items don’t participate in the layout of the grid, nor do they affect how items are placed.
You can place a regular item outside the explicit grid, and the grid will create the required tracks to accommodate the item. However, positioned items cannot create implicit tracks, as they don’t participate in the layout of the grid.

Let’s use an example to understand this better:

The example defines a 2x2 grid, but the positioned item is using grid-area: 4 / 4; so it tries to go to the 4th row and 4th column. However, positioned items cannot create those implicit tracks, so it’s positioned as if it had auto, which in this case will take the whole explicit grid. auto has a special meaning in positioned items; it’ll be properly explained later.

Positioned items do not create implicit tracks

Imagine another example where regular items create implicit tracks:

In this case, the regular items will be creating the implicit tracks, making a 4x4 grid in total. Now the positioned item can be placed on the 4th row and 4th column, even though that row and column are in the implicit grid.

Positioned items can be placed on the implicit grid

As you can see this part of the post has been modified, thanks to @fantasai for notifying me about the mistake.

### Positioned items and placement algorithm

Again, positioned items do not affect the position of other items, as they don’t participate in the placement algorithm.

So, if you’ve a positioned item and you’re using auto-placement for some regular items, it’s expected that the positioned one overlaps the others. Positioned items are completely ignored during auto-placement.

Here is a simple example to show this behavior:

Here we’ve again a 2x2 grid, with 3 auto-placed regular items, and 1 absolutely positioned item.
As you can see the positioned item is placed on the 1st row and 2nd column, but there’s an auto-placed item in that cell too, which is below the positioned one. This shows that the grid container doesn’t care about positioned items and just ignores them when it has to place regular items.

Positioned items and placement algorithm

If none of the children were positioned, the last one would be placed in the given position (1st row and 2nd column), and the rest of them (auto-placed) would take the other cells, without overlapping.

### Positioned items and auto lines

This is probably the biggest difference compared to regular grid items. If you don’t specify a line, it’s considered that you’re using auto, but auto is not resolved as span 1 like in regular items. For positioned items auto is resolved to the padding edge.

The specification introduces the concepts of the lines 0 and -0. Despite how weird that may sound, it actually makes sense. The auto lines reference those 0 and -0 lines, which represent the padding edges of the grid container.

Again let’s use a few examples to explain this:

Here we have a 2x2 grid container, which has some padding. The positioned item will be placed in the 2nd row and 1st column, but its area will extend up to the padding edges (as the end line is auto in both axes).

Positioned items and auto lines

We could even place positioned grid items on the padding itself. For example, using “grid-column: auto / 1;” the item would be on the left padding.

Positioned items using auto lines to be placed on the left padding

Of course if the grid is wider and we’ve some free space on the content box, the items will take that space too. For example:

Here the grid columns are 500px, but the grid container has 600px width. This means that we’ve 100px of free space in the grid content box. As you can see in the example, that space will also be used when the positioned items extend up to the padding edges.
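The arithmetic behind these auto lines can be worked through with a small Python sketch. It is not how any browser implements it, and it makes several assumptions that are not in the post: two 250px columns (the post only gives the 500px total), 50px of padding on each side, left-to-right direction, tracks packed at the start of the content box, no gaps, and positive line numbers only. Positions are measured from the left padding edge of the grid container.

```python
def positioned_extent(tracks, content_width, pad_start, pad_end,
                      start=None, end=None):
    """Horizontal extent of a positioned item's grid area, in px.

    `start` and `end` are 1-based grid line numbers; None stands for `auto`,
    which resolves to the container's padding edge on that side.
    """
    def line_pos(n):
        # Line n sits after the first n - 1 tracks, inside the content box.
        return pad_start + sum(tracks[: n - 1])

    left = 0 if start is None else line_pos(start)
    right = (pad_start + content_width + pad_end) if end is None else line_pos(end)
    return left, right


columns = [250, 250]  # assumed split of the 500px of columns

# grid-column: 2 / auto -> from line 2 up to the right padding edge.
print(positioned_extent(columns, 600, 50, 50, start=2))  # (300, 700)

# grid-column: auto / 1 -> the item sits on the left padding.
print(positioned_extent(columns, 600, 50, 50, end=1))    # (0, 50)
```

In the first case the 400px extent is made of the second column (250px), the free space in the content box (100px) and the right padding (50px).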
Positioned items taking free space and right padding

### Offsets

Of course you can use offsets to place your positioned items (left, right, top and bottom properties).

These offsets will apply inside the grid area defined for the positioned items, following the rules explained above.

Let’s use another example:

Again a 2x2 grid container with some padding. The positioned item has some offsets which are applied inside its grid area.

Positioned items and offsets

### Wrap-up

I’m not completely sure how important the support of positioned elements is for web authors using Grid Layout. You’ll be the ones that have to tell if you really find use cases that need this. I hope this post helps you understand it better and make up your minds about real-life scenarios where this might be useful.

The good news is that you can test this already in the most recent versions of some major browsers: Chrome Canary, Safari Technology Preview and Firefox. We hope that the 3 implementations are interoperable, but please let us know if you find any issue.

There’s one last thing missing: alignment support for positioned items. This hasn’t been implemented yet in any of the browsers, but the behavior will be pretty similar to the one you can already use with regular grid items. Hopefully, we’ll have time to add support for this in the coming months.

Igalia and Bloomberg working together to build a better web

Last but not least, thanks to Bloomberg for supporting Igalia in the CSS Grid Layout implementation on Blink and WebKit.

## May 24, 2016

### Alberto Garcia

#### I/O bursts with QEMU 2.6

QEMU 2.6 was released a few days ago. One new feature that I have been working on is the new way to configure I/O limits in disk drives to allow bursts and increase the responsiveness of the virtual machine. In this post I’ll try to explain how it works.

## The basic settings

First I will summarize the basic settings that were already available in earlier versions of QEMU.
Two aspects of the disk I/O can be limited: the number of bytes per second and the number of operations per second (IOPS). For each one of them the user can set a global limit or separate limits for read and write operations. This gives us a total of six different parameters.

I/O limits can be set using the throttling.* parameters of -drive, or using the QMP block_set_io_throttle command. These are the names of the parameters for both cases:

| -drive | block_set_io_throttle |
|---|---|
| throttling.iops-total | iops |
| throttling.iops-read | iops_rd |
| throttling.iops-write | iops_wr |
| throttling.bps-total | bps |
| throttling.bps-read | bps_rd |
| throttling.bps-write | bps_wr |

It is possible to set limits for both IOPS and bps at the same time, and for each case we can decide whether to have separate read and write limits or not, but if iops-total is set then neither iops-read nor iops-write can be set. The same applies to bps-total and bps-read/write.

The default value of these parameters is 0, and it means unlimited.

In its most basic usage, the user can add a drive to QEMU with a limit of, say, 100 IOPS with the following -drive line:

    -drive file=hd0.qcow2,throttling.iops-total=100

We can do the same using QMP. In this case all these parameters are mandatory, so we must set to 0 the ones that we don’t want to limit:

    { "execute": "block_set_io_throttle",
      "arguments": {
         "device": "virtio0",
         "iops": 100,
         "iops_rd": 0,
         "iops_wr": 0,
         "bps": 0,
         "bps_rd": 0,
         "bps_wr": 0
      }
    }

## I/O bursts

While the settings that we have just seen are enough to prevent the virtual machine from performing too much I/O, it can be useful to allow the user to exceed those limits occasionally. This way we can have a more responsive VM that is able to cope better with peaks of activity while keeping the average limits lower the rest of the time.

Starting from QEMU 2.6, it is possible to allow the user to do bursts of I/O for a configurable amount of time.
A burst is an amount of I/O that can exceed the basic limit, and there are two parameters that control them: their length and the maximum amount of I/O they allow. These two can be configured separately for each one of the six basic parameters described in the previous section, but here we’ll use ‘iops-total’ as an example.

The I/O limit during bursts is set using ‘iops-total-max’, and the maximum length (in seconds) is set with ‘iops-total-max-length’. So if we want to configure a drive with a basic limit of 100 IOPS and allow bursts of 2000 IOPS for 60 seconds, we would do it like this (the line is split for clarity):

    -drive file=hd0.qcow2,
           throttling.iops-total=100,
           throttling.iops-total-max=2000,
           throttling.iops-total-max-length=60

Or with QMP:

    { "execute": "block_set_io_throttle",
      "arguments": {
         "device": "virtio0",
         "iops": 100,
         "iops_rd": 0,
         "iops_wr": 0,
         "bps": 0,
         "bps_rd": 0,
         "bps_wr": 0,
         "iops_max": 2000,
         "iops_max_length": 60
      }
    }

With this, the user can perform I/O on hd0.qcow2 at a rate of 2000 IOPS for 1 minute before it’s throttled down to 100 IOPS.

The user will be able to do bursts again if there’s a sufficiently long period of time with unused I/O (see below for details).

The default value for ‘iops-total-max’ is 0 and it means that bursts are not allowed. ‘iops-total-max-length’ can only be set if ‘iops-total-max’ is set as well, and its default value is 1 second.

## Controlling the size of I/O operations

When applying IOPS limits all I/O operations are treated equally regardless of their size. This means that the user can take advantage of this in order to circumvent the limits and submit one huge I/O request instead of several smaller ones.

QEMU provides a setting called throttling.iops-size to prevent this from happening. This setting specifies the size (in bytes) of an I/O request for accounting purposes. Larger requests will be counted proportionally to this size.
For example, if iops-size is set to 4096 then an 8KB request will be counted as two, and a 6KB request will be counted as one and a half. This only applies to requests larger than iops-size: smaller requests will be always counted as one, no matter their size.

The default value of iops-size is 0 and it means that the size of the requests is never taken into account when applying IOPS limits.

## Applying I/O limits to groups of disks

In all the examples so far we have seen how to apply limits to the I/O performed on individual drives, but QEMU allows grouping drives so they all share the same limits.

This feature is available since QEMU 2.4. Please refer to the post I wrote when it was published for more details.

## The Leaky Bucket algorithm

I/O limits in QEMU are implemented using the leaky bucket algorithm (specifically the “Leaky bucket as a meter” variant).

This algorithm uses the analogy of a bucket that leaks water constantly. The water that gets into the bucket represents the I/O that has been performed, and no more I/O is allowed once the bucket is full.

To see the way this corresponds to the throttling parameters in QEMU, consider the following values:

    iops-total=100
    iops-total-max=2000
    iops-total-max-length=60

- Water leaks from the bucket at a rate of 100 IOPS.
- Water can be added to the bucket at a rate of 2000 IOPS.
- The size of the bucket is 2000 x 60 = 120000.
- If iops-total-max is unset then the bucket size is 100.

The bucket is initially empty, therefore water can be added until it’s full at a rate of 2000 IOPS (the burst rate). Once the bucket is full we can only add as much water as it leaks, therefore the I/O rate is reduced to 100 IOPS. If we add less water than it leaks then the bucket will start to empty, allowing for bursts again.

Note that since water is leaking from the bucket even during bursts, it will take a bit more than 60 seconds at 2000 IOPS to fill it up.
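The fill time mentioned above can be estimated with a small model of the analogy. This is only a sketch: the `LeakyBucket` class and its method names are made up for illustration and are not QEMU's throttling code. It simply applies the bucket size and leak rate described in this section.

```python
class LeakyBucket:
    """Toy model of the bucket analogy (hypothetical, not QEMU's code)."""

    def __init__(self, leak_rate, burst_rate=0, burst_length=1):
        self.leak_rate = leak_rate
        # Bucket size as described in the text: burst rate x burst length,
        # or just the basic limit when no burst rate is configured.
        if burst_rate:
            self.size = burst_rate * burst_length
        else:
            self.size = leak_rate

    def seconds_until_full(self, io_rate):
        """Time an empty bucket takes to fill at a constant I/O rate."""
        if io_rate <= self.leak_rate:
            # Water leaks at least as fast as it is added: never fills.
            return float("inf")
        # The level grows by (io_rate - leak_rate) units per second.
        return self.size / (io_rate - self.leak_rate)


bucket = LeakyBucket(leak_rate=100, burst_rate=2000, burst_length=60)
print(bucket.size)                                # 120000
print(round(bucket.seconds_until_full(2000), 1))  # 63.2
```

Because the model assumes a constant request rate and an initially empty bucket, it only gives a rough estimate of how long a burst can last.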
After those 60 seconds the bucket will have leaked 60 x 100 = 6000, allowing for 3 more seconds of I/O at 2000 IOPS.

Also, due to the way the algorithm works, longer bursts can be done at a lower I/O rate, e.g. 1000 IOPS during 120 seconds.

## Acknowledgments

As usual, my work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the QEMU development team.

Enjoy QEMU 2.6!

## May 23, 2016

### Igalia Compilers Team

#### Awaiting the future of JavaScript in V8

On the evening of Monday, May 16th, 2016, we made history. We’ve landed the initial implementation of “Async Functions” in V8, the JavaScript runtime used by Google Chrome and Node.js. We do these things not because they are easy, but because they are hard. Because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one we are willing to accept. It is very exciting to see this land after roughly 2 months of implementation, code review and standards finagling/discussion. It is truly an honour.

To introduce you to Async Functions, it’s first necessary to understand two things: the status quo of async programming in JavaScript, as well as Generators (previously implemented by fellow Igalian Andy).

Async programming in JavaScript has historically been implemented by callbacks. window.setTimeout(function toExecuteLaterOnceTimeHasPassed() {}, …) being the common example. Callbacks on their own are not scalable: when numerous nested asynchronous operations are needed, code becomes extremely difficult to read and reason about. Abstraction libraries have been tacked on to improve this, including caolan’s async package, or Promise libraries such as Q. These abstractions simplify control flow management and data flow management, and are a massive improvement over plain callbacks. But we can do better! For a more detailed look at Promises, have a look at the fantastic MDN article.
Some great resources on why and how callbacks can lead to utter non-scalable disaster exist too, check out http://callbackhell.com!

The second concept, Generators, allows a runtime to return from a function at an arbitrary line, and later re-enter that function at the following instruction, in order to continue execution. So right away you can imagine where this is going: we can continue execution of the same function, rather than writing a closure to continue execution in a new function. Async Functions rely on this same mechanism (and in fact, on the underlying Generators implementation), to achieve their goal, immensely simplifying non-trivial coordination of asynchronous operations.

As a simple example, let’s compare the following two approaches:

    function deployApplication() {
      return cleanDirectory(__DEPLOYMENT_DIR__).
        then(fetchNpmDependencies).
        then(
          deps => Promise.all(
            deps.map(
              dep => moveToDeploymentSite(
                dep.files,
                `${__DEPLOYMENT_DIR__}/deps/${dep.name}`
              )))).
        then(() => compileSources(__SRC_DIR__, __DEPLOYMENT_DIR__)).
        then(uploadToServer);
    }

The Promise boilerplate makes this a bit harder to read and follow than it could be. And what happens if an error occurs? Do we want to add catch handlers to each link in the Promise chain? That will only make it even more difficult to follow, with error handling interleaved in difficult to read ways.

Let’s refactor this using async functions:

    async function deployApplication() {
      await cleanDirectory(__DEPLOYMENT_DIR__);
      let dependencies = await fetchNpmDependencies();

      // *see below*
      for (let dep of dependencies) {
        await moveToDeploymentSite(
          dep.files,
          `${__DEPLOYMENT_DIR__}/deps/${dep.name}`);
      }

      await compileSources(__SRC_DIR__, __DEPLOYMENT_DIR__);
      return uploadToServer();
    }

You’ll notice that the “moveToDeploymentSite” step is slightly different in the async function version, in that it completes each operation in a serial pipeline, rather than completing each operation in parallel, and continuing once finished.
This is an unfortunate limitation of the async function specification, which will hopefully be improved on in the future.

In the meantime, it’s still possible to use the Promise API in async functions, as you can await any Promise, and continue execution after it is resolved. This grants compatibility with numerous existing Web Platform APIs (such as fetch()), which is ultimately a good thing! Here’s an alternative implementation of this step, which performs the moveToDeploymentSite() bits in parallel, rather than serially:

    await Promise.all(dependencies.map(
      dep => moveToDeploymentSite(
        dep.files,
        `${__DEPLOYMENT_DIR__}/deps/${dep.name}`
      )));

Now, it’s clear from the let dependencies = await fetchNpmDependencies(); line that Promises are unwrapped automatically. What happens if the promise is rejected with an error, rather than resolved with a value? With try-catch blocks, we can catch rejected promise errors inside async functions! And if they are not caught, they will automatically return a rejected Promise from the async function.

    function throwsError() { throw new Error("oops"); }

    async function foo() { throwsError(); }

    // will print the Error thrown in throwsError.
    foo().catch(console.error)

    async function bar() {
      try {
        var value = await foo();
      } catch (error) {
        // Rejected Promise is unwrapped automatically, and
        // execution continues here, allowing us to recover
        // from the error! error is new Error("oops")
      }
    }

There are also lots of convenient forms of async function declarations, which hopefully serve lots of interesting use-cases! You can concisely declare methods as asynchronous in Object literals and ES6 classes, by preceding the method name with the async keyword (without a preceding line terminator!).

    class C {
      async doAsyncOperation() {
        // ...
      }
    };

    var obj = {
      async getFacebookProfileAsynchronously() {
        /* ... */
      }
    };

These features allow us to write more idiomatic, easier to understand asynchronous control flow in our applications, and future extensions to the ECMAScript specification will enable even more idiomatic forms for writing complex algorithms, in a maintainable and readable fashion. We are very excited about this! There are numerous other resources on the web detailing async functions, their benefits, and perhaps ways they might be improved in the future. Some good ones include [this piece from Google’s Jake Archibald](https://jakearchibald.com/2014/es7-async-functions/), so give that a read for more details. It’s a few years old, but it holds up nicely!

So, now that you’ve seen the overview of the feature, you might be wondering how you can try it out, and when it will be available for use. For the next few weeks, it’s still too experimental even for the “Experimental Javascript” flag. But if you are adventurous, you can try it already! Fetch the latest Chrome Canary build, and start Chrome with the command-line flag --js-flags="--harmony-async-await". We can’t make promises about the shipping timeline, but it could ship as early as Chrome 53 or Chrome 54, which will become stable in September or October.

We owe a shout out to Bloomberg, who have provided us with resources to improve the web platform that we love. Hopefully, we are providing their engineers with ways to write more maintainable, more performant, and more beautiful code. We hope to continue this working relationship in the future!

As well, shoutouts are owed to the Chromium team, who have assisted in reviewing the feature, verifying its stability, getting devtools integration working, and ultimately getting the code upstream. Terrific!

In addition, the WebKit team has also been very helpful, and hopefully we will see the feature land in JavaScriptCore in the not too distant future.
## May 20, 2016

### Víctor Jáquez

#### GStreamer Hackfest 2016

Yes, it happened again: the GStreamer Spring Hackfest 2016! This time in the beautiful city of Thessaloniki. Thanks a lot, Vivia and Sebastian, for making it happen.

My objective this time was to work with dma-buf support in gstreamer-vaapi. Though it is supported already, it needs a major clean up, and to extend its usage for downstream buffers (bugs 755072 and 765435). Along the way I learned that we need to update our internal API (called libgstvaapi), when handling dma-buf, to support multi-plane formats.

On the other hand, Nicolas Dufresne and I talked a bit about kmssink, libdrm and dma-buf. He managed to hack his Odroid U (Exynos3) to enable its V4L2 mem2mem video decoder and share buffers with kmssink. It was amazing. By the way, he promised me to write a blog post with the instructions to replicate his deed.

Finally, we had a preview of Edward Hervey’s decodebin3. His test of switching between the different audio streams in a media container (the different available dubbings) every second or less was fun. It was truly multi-language audio!

In the meantime, we shared beers and meals, learning and laughing.

## May 18, 2016

### Antonio Gomes

#### [Chromium] content_shell running on Wayland desktop (Weston Compositor)

During my first weeks at Igalia, I got an interesting and promising task: understanding the status of Wayland support in Chromium upstream.

At first I could see clear signs that Wayland is being actively considered by the Chromium community:

1. Ozone/Wayland project by Intel – which was my starting point as described later on.
2. The meta bug “upstream wayland backend for ozone (20% project)”, which has some recent activity by the Chromium community.
3. This comment in the chromium-dev mailing list (by a Googler): “(..) I’d recommend using ToT. Wayland support is a work-in-progress and newer trees will probably be better.”
4. Chromium’s DEPS file has “wayland” as one of its core dependencies (search for “wayland”).

The next step, naturally, was to get my hands dirty, compiling and experimenting with it. I decided to start with content_shell.

Environment: Ubuntu 16.04 LTS, regular Gnome session (with X) and Weston Wayland compositor running (no ‘xwayland’ installed) to run Wayland apps.

GYP_DEFINES

    component=static_library use_ash=1 use_aura=1 chromeos=0 use_ozone=1 ozone_platform_wayland=1 use_wayland_egl=1 ozone_auto_platforms=0 use_xkbcommon=1 clang=0 use_sysroot=0 linux_use_bundled_gold=0

(note: GYP was used for personal familiarity with it, but GN applies fine here).

Chromium version

Base SHA 5da650481 (as of 2016/May/17)

Initial results

As is, content_shell built fine, but hung at runtime upon launch, hitting a CHECK at desktop_factory_ozone.cc(21).

Analysis and Action

After understanding the current Ozone/Wayland upstream support and comparing designs/code against 01.org, I could start connecting some of the missing dots. The following files were “ported” from 01.org:

- ui/views/widget/desktop_aura/desktop_drag_drop_client_wayland.cc / h
- ui/views/widget/desktop_aura/desktop_screen_ozone_wayland.cc / h
- ui/views/widget/desktop_aura/desktop_window_tree_host_ozone_wayland.cc / h

And then, I implemented the DesktopFactoryOzoneWayland class (ui/views/widget/desktop_factory_ozone_wayland.cc/h) – it inherits from DesktopFactoryOzone, and implements the following pure virtual methods: ::CreateWindowTreeHost and ::CreateDesktopScreen.

Initial result

After that, I could build and run content_shell with Weston Wayland Compositor (with no ‘xwayland’ installed). See a quick preview below.

Remarks

As is, the UI process owns the Wayland connection, and the GPU process runs without GL support.
The UI process initializes Ozone by calling:

    #0 ui::WaylandSurfaceFactory::WaylandSurfaceFactory
    #1 ui::OzonePlatformWayland::InitializeUI
    #2 ui::OzonePlatform::InitializeForUI();
    #3 aura::Env::Init()
    #4 aura::Env::CreateInstance()
    #5 content::BrowserMainLoop::InitializeToolkit()
    (…)
    #X content::ContentMain()
    <UI PROCESS LAUNCH>

On the other side, the GPU process gets initialized by calling:

    #0 ui::WaylandSurfaceFactory::WaylandSurfaceFactory
    #1 ui::OzonePlatformWayland::InitializeGPU
    #2 ui::OzonePlatform::InitializeForGPU();
    #3 gfx::InitializeStaticGLBindings()
    #4 gfx::GLSurface::InitializeOneOffImplementation()
    #5 gfx::GLSurface::InitializeOneOff()
    #6 content::GpuMain()
    <GPU PROCESS LAUNCH>

Unlike the UI process, the GPU process call does not initialize OzonePlatformWayland::display_ and instead passes ‘nullptr’ to the WaylandSurfaceFactory ctor. Down the road in the GPU process initialization, WaylandSurfaceFactory::LoadEGLGLES2Bindings is called but bails out early, explicitly because display_ is null. Then, the UI process falls back to software rendering (see the call to WaylandSurfaceFactory::CreateCanvasForWidget).

Next step

- So far I have experimented with Ozone/Wayland support using “linux” as the target OS. As far as I can tell, most of the Ozone work upstream though has been focusing on “chromeos” builds instead (e.g. ozone/x11).
- Hence the idea is to clean up the code and agree with Googlers / 01.org (Intel) people about how to best make use of this code.
- It is also being discussed with some Googlers what the best way to tackle this lack of GL support is. Some really nice stuff is in the pipeline here.

## May 16, 2016

### Javier Muñoz

#### The Ceph RGW storage driver goes upstream in Libcloud

The Ceph RGW storage driver was merged upstream in Apache Libcloud today. It is available in the Libcloud trunk repository and it will ship with the next release, Apache Libcloud 1.0.0.
This post will introduce the new RGW driver together with the proper configuration parameters to run some examples uploading/downloading objects in Ceph Jewel.

The Ceph RGW storage driver

The Ceph RGW storage driver requires Ceph Jewel or above. As of this writing, the latest Ceph Jewel version is 10.2.1. This version is available in the downloads section.

The driver extends the Libcloud S3 storage driver to provide an S3 API compatible with Ceph RGW. The driver also contains support for AWS signature versions 2 (AWS2) and 4 (AWS4).

It leverages the Libcloud common auth support on the client side. On the Ceph RGW side it required a little patch to handle unsigned payloads in the AWS4 auth header.

Developers and apps can use the Ceph RGW driver via the S3_RGW provider easily. A simple snippet follows...

    from libcloud.storage.types import Provider
    from libcloud.storage.providers import get_driver

    import libcloud

    api_key = 'api_key'
    secret_key = 'secret_key'

    cls = get_driver(Provider.S3_RGW)

    driver = cls(api_key,
                 secret_key,
                 signature_version='4',
                 region='my-region',
                 host='my-host',
                 port=8000)

    container = driver.get_container(...)

If the region is not set explicitly, the driver will use the default region 'default'. The valid signature versions are '2' (AWS2) and '4' (AWS4). AWS2 is the default signature version.

A host name is always required; it has no default value.

The following two examples contain the minimal code to upload/download objects with the new provider:

Running the upload example...

    $ ./test-upload-ceph-rgw-driver.py
    <Object: name=my-name-abcdabcd-123,
    size=110080,
    hash=0a5cfeb3bb10e0971895f8899a64e816,
    provider=Ceph RGW S3 (my-region) ...>

Running the download example...

    $ ./test-download-ceph-rgw-driver.py
    <Object: name=my-name-abcdabcd-123,
    size=110080,
    hash=0a5cfeb3bb10e0971895f8899a64e816,
    provider=Ceph RGW S3 (my-region) ...>

Enjoy!

Acknowledgments

My work in Apache Libcloud is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Libcloud community. Thank you all!

## May 10, 2016

### Sergio Villar

#### Automatizing the Grid

My Igalia colleagues and I have extensively reviewed how to create grids and how to position items inside the grid using different CSS properties. So far everything was more or less static. We declare the sizes of our columns/rows or define a set of grid areas and that’s it. Well, actually there is room for automatic stuff: you can dynamically create new tracks just by adding items to positions outside the explicit grid. Furthermore the grid is able to auto-position items for you if you don’t really care much about the final destination.

### Use Cases

But imagine the following use case. Let’s assume that you are designing the product catalog of your pretty nice web store. CSS Grid Layout is the obvious choice for such a layout. Just set some columns and some rows and that’s mostly it. The thing is that your catalog is likely not static, but automatically generated from a query to some database where you store all that data. You cannot know a priori how many items you’re going to show (users normally can also filter results).

Grid already supports that use case. You can already define the number of columns (rows) and let the grid create as many rows (columns) as needed. Something like this:

    grid-template-columns: repeat(3, 100px);
    grid-auto-rows: 100px;
    grid-auto-flow: row;

### The Multicol Example

But you need more flexibility. You’re happy with the grid creating new tracks on demand in one axis, but you also want to have as many tracks as possible (depending on the available size) in the other axis.
You’ve already designed web sites using CSS Multicol and you want your grid to behave like:

    columns: 100px;

Note that in the case of multicol the specified column width is an “optimal” size, meaning that it could be enlarged/narrowed to fill the container once the number of columns is calculated.

So is it possible to tell grid to add as many tracks as needed to fill some available space? It was not but now we have…

### Repeat to fill: auto-fill and auto-fit

The grid way to implement it is by using the recently added auto repeat syntax. The already known repeat() function accepts as a first argument two new keywords, auto-fill and auto-fit. Both of them will generate as many repetitions as needed to fill the available space with the specified tracks without overflowing. The only difference between them is that auto-fit will additionally drop any empty track (meaning no items spanning through it) generated by repeat() after positioning grid items.

The use of these two new keywords has some limitations:

- Tracks with intrinsic or flexible sizes cannot be combined with auto repetitions
- There can be just one auto-repeat per axis at the most
- repeat(auto-fill|fit,) accepts only one track size as second argument (all auto repeat tracks will have the same size)

Some examples of valid declarations:

    grid-template-columns: repeat(auto-fill, minmax(200px, 1fr)) [last];
    grid-template-rows: 100px repeat(auto-fit, 2em) repeat(10, minmax(15%, 1fr));
    grid-template-columns: repeat(auto-fill, minmax(max-content, 300px));

And some examples of invalid ones:

    grid-template-columns: min-content repeat(auto-fill, 25px) 10px;
    grid-template-rows: repeat(auto-fit, 15px) repeat(auto-fill, minmax(10px, 100px));
    grid-template-columns: repeat(auto-fill, min-content);

### The Details

I mentioned that we cannot mix flexible and/or intrinsic tracks with auto repetitions, so why is repeat(auto-fill, minmax(200px, 1fr)) a valid declaration?
Well, according to the grammar, the auto repeat syntax requires something called a <fixed-size> track, which is basically a track which has a length (like 10px) or a percentage in either its min or max function. The following are all <fixed-size> tracks:

    15%
    minmax(min-content, 200px)
    minmax(5%, 1fr)

That length/percentage is the one used by the grid to compute the number of repetitions. You should also be aware of how the number of auto repeat tracks is computed. First of all you need a definite size on the axis where you want to use auto-repeat. If that is not the case then the number of repetitions will be 1.

The nice thing is that the definite size could be either the “normal” size (width: 250px), the max size (max-height: 15em) or even the min size (min-width: 200px). But there is an important difference: if either the size or max-size is definite, then the number of repetitions is the largest possible positive integer that does not cause the grid to overflow its grid container. Otherwise, if the min-size is definite, then the number of repetitions is the smallest possible positive integer that fulfills that minimum requirement. For example, the following declaration will generate 4 rows:

    height: 600px;
    grid-template-rows: repeat(auto-fill, 140px);

But this one will generate 5:

    min-height: 600px;
    grid-template-rows: repeat(auto-fill, 140px);

### I want to use it now!

Sure, no problem. I’ve just landed the support for auto-fill in both Blink and WebKit meaning that you’ll be able to test it (unprefixed) really soon in Chrome Canary and Safari Technology Preview (Firefox Nightly builds already support both auto-fill and auto-fit). Many thanks to our friends at Bloomberg for sponsoring this work. Enjoy!

## May 02, 2016

### Diego Pino

#### Network namespaces: IPv6 connectivity

In the last post I introduced network namespaces and showed a practical example on how to share IPv4 connectivity between a network namespace and a host.
Before that post, I also wrote a short tutorial on how to set up an IPv6 tunnel using the Hurricane Electric broker service. This kind of service allows us to get into the IPv6 realm using an IPv4 connection.

In this article I continue exploring network namespaces. Taking advantage of the work done in the aforementioned posts, I explain in this post how to share IPv6 connectivity between a host and a network namespace.

Let’s assume we already have a SIT tunnel (IPv6-in-IPv4 tunnel) enabled in our host and we’re able to ping an external IPv6 address. If you haven’t, I encourage you to check out set up an IPv6 tunnel.

I need to write a script which will create the network namespace and set it up accordingly. I call that script ns-ipv6. If the script works correctly, I should be able to ping an external IPv6 host from the namespace. The script looks like this:

ns-ipv6.sh

It actually works:

Let’s take a deeper look at how it works.

## ULAs

The script creates a veth pair to communicate the network namespace with the host. Each virtual interface is assigned an IPv6 address in the ‘fd00::0/64’ network space (Lines 21 and 27). This type of address is known as ULA or Unique Local Address. ULAs are the IPv6 counterpart of IPv4 private addresses.

Before continuing, a brief reminder on how IPv6 addresses work:

An IPv6 address is a 128-bit value represented as 8 blocks of 16 bits (8 x 16-bit = 128-bit). Blocks are separated by a colon (‘:’). Unlike IPv4 addresses, block values are written in hexadecimal. Since each block is a 16-bit value, they can be written in hexadecimal as 4-digit numbers. Leading zeros in each block can be omitted. Likewise, when several consecutive block values are zero they can be omitted too. In that case two colons (‘::’) are written instead, meaning everything in between is nil.

For instance, the address ‘fd00::1’ is the short form of the much longer address ‘fd00:0000:0000:0000:0000:0000:0000:0001’.
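The shortening rules above can be checked with Python's standard `ipaddress` module. This is a small sketch for illustration only; the helper names `expand` and `shorten` are made up here and are not part of the ns-ipv6 script.

```python
import ipaddress


def expand(address):
    """Return the full 8-block form, with leading zeros restored."""
    return ipaddress.IPv6Address(address).exploded


def shorten(address):
    """Return the short form, dropping leading zeros and using '::'."""
    return ipaddress.IPv6Address(address).compressed


print(expand("fd00::1"))
# fd00:0000:0000:0000:0000:0000:0000:0001
print(shorten("fd00:0000:0000:0000:0000:0000:0000:0001"))
# fd00::1

# Both spellings denote the same 128-bit value, and that value falls
# inside the fd00::/64 network used for the veth pair.
network = ipaddress.IPv6Network("fd00::/64")
print(ipaddress.IPv6Address("fd00::1") in network)
# True
```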
RFC 4193 (section 3) describes Unique Local Addresses format as:

The RFC reserves the IPv6 address block ‘fc00::/7’ for ULAs. It divides this block in two subnetworks: ‘fc00::/8’ and ‘fd00::/8’. The use of the ‘fc00::/8’ block has not been defined yet, while the ‘fd00::/8’ block is used for IPv6 locally assigned addresses (private addresses).

The address ‘fd63:b1f4:7268:d970::1’ is an example of a valid ULA. It starts with the ‘fd’ prefix followed by a unique Global ID (‘63:b1f4:7268’) and a Subnet ID (‘d970’), leaving 64 bits for the Interface ID (‘::1’). I recommend the page Private IPv6 address range to obtain valid random ULAs.

ULAs are not routable in the global Internet. They are meant to be used inside local networks and that’s precisely the reason why they exist.

## NAT on IPv6

Lines 34-36 activate IPv6 forwarding and IP Masquerade on the source address. However, this solution is not optimal. The Hurricane Electric tunnel broker service lends us a ‘::0/64’ block, with 2^64 - 2 valid hosts.

NAT, Network Address Translation, grants a host in a private network external connectivity via a proxy that lends the private host its address. This is the most common use case of NAT, known as Source NAT. Besides IP addresses, NAT translates port numbers too and that’s why it’s sometimes referred to as NAPT (Network Address and Port Translation). NAT has been an important technology for optimizing the use of IPv4 address space, although it has its costs too.

The original goal of IPv6 was solving the problem of IP address exhaustion. Mechanisms such as NAT are not needed because the IPv6 address space is so big that every host could have a unique address, reachable from another end of the network. Actually, IPv6 brings back the original point-to-point design of the IPv4 Internet, before private addresses and NAT were introduced.

So let’s try to get rid of NAT66 (IPv6-to-IPv6 translation) by:

- Using global IPv6 addresses.
- Removing MASQUERADING.
The new script is available as a gist here: ns-ipv6-no-nat.sh. There are some tricky bits that are worth explaining:

The first thing is to replace the ULAs with IPv6 addresses which belong to the /64 block leased by Hurricane Electric:

When setting up the interfaces, the host side should add a more specific route for the other end of the veth pair. The reason is that all addresses belong to the same network. If a packet on the host side needs to get to the network namespace side, it would be routed through the IPv6 tunnel unless there’s a more specific route.

Lastly, NAT66 can be removed but IP forwarding is still necessary as the host acts as a router. When a packet arrives from the network namespace into the host, the destination address of the packet doesn’t match any of the interfaces of the host. If IP forwarding were disabled, the packet would simply be dropped. However, since IP forwarding is enabled, non-delivered packets get forwarded through the host’s default gateway, hopefully reaching their destination.

After these changes, the script still works:

## DNS resolution

In the line above I’m pinging an IPv6 address directly (this address is actually ipv6.google.com). What happens if I try to ping a host name instead?

ns-ipv6> ping6 ipv6.google.com
unknown host

When I ping ipv6.google.com from the network namespace, the /etc/resolv.conf file is queried to obtain a DNS nameserver address.

/etc/resolv.conf

nameserver 8.8.8.8

This address is an IPv4 address, but the network namespace has IPv6 connectivity only. It cannot reach any host in the IPv4 realm. However, DNS resolution works in the host since the host has both IPv6 and IPv4 connectivity. It is necessary to add a DNS server with an IPv6 address. Luckily, Google has DNS servers available in the IPv6 realm too.
nameserver 8.8.8.8
nameserver 2001:4860:4860::8888

Now I should be able to ping ipv6.google.com from the network namespace:

ns-ipv6> ping6 -c 1 ipv6.google.com
PING ipv6.google.com(lis01s14-in-x0e.1e100.net) 56 data bytes
64 bytes from lis01s14-in-x0e.1e100.net: icmp_seq=1 ttl=57 time=85.7 ms

--- ipv6.google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 85.702/85.702/85.702/0.000 ms

## Wrapping up

After all these changes we end up with a script that:

• Uses Hurricane Electric’s IPv6 network addresses, instead of ULAs.
• Doesn’t do NAT66 to provide external IPv6 connectivity to the network namespace.

It has been a lot of fun writing this post, and it helped me to understand many things better. I definitely encourage everyone interested to run some of the scripts above and try out IPv6, if you haven’t yet. The network namespace part is not fundamental but it makes it more interesting.

Lastly, I’d like to thank my colleague Carlos López for his invaluable help as well as the StackOverflow community, which helped me to figure out the script that gets rid of NAT66.

## April 29, 2016

### Javier Muñoz

#### Scalable placement of replicated data in Ceph

One of the most interesting and powerful features in Ceph is the way it computes the placement and storage of a growing amount of data at hyperscale.

This computation avoids the need to look up data locations in a central directory in order to allow nodes to be added or removed, moving as few objects as possible while still maintaining balance across new cluster configurations.

Ceph and the challenges of data placement at scale

Consider the traditional standard storage array in the industry. It has the usual two controller units (for redundancy) and a lot of disk trays. Those controllers connect with a storage area network (SAN) in order to provide storage to clients (servers mainly).
The disk trays are connected to the storage controllers and all storage clients use the disks through those controllers.

Scaling up the capacity and performance of the array requires attaching more disks to the controllers. As expected, as the scale grows the controllers become a bottleneck: at some point they get overloaded.

The usual solution for this bottleneck is buying a new pair of controllers with a new array of disks and moving some load onto the new hardware. This solution is expensive and not very flexible. It can work at the terabyte scale but it doesn't fit well with the new Cloud industry, which demands elasticity and extreme flexibility at the petabyte/exabyte scale.

The Ceph approach to cope with these challenges is the design and implementation of a new scale-out distributed storage system based on commodity hardware communicating over a regular TCP/IP network.

This approach enables a fast, more affordable and non-disruptive on-demand operational process while supporting the new petabyte scale with very high performance. This kind of hyper-scalable storage fits really well with the Cloud models.

Ceph enables hyperscale storage by avoiding any kind of centralized coordination, relying on autonomous and smart storage nodes instead. At this scale the design of the cluster software determines how many nodes the cluster can handle. If the cluster scales out to a hundred nodes we will be working with petabytes and millions of I/O operations per second (IOPS).

Another point to consider in this context is the inclusion of the new failure domains coming from the scale-out architecture. In this configuration each node is a new failure domain. This is usually handled by the software coordinating the cluster, which keeps copies of all data on at least two nodes. This approach is known as replica-based data protection.
Ceph enables hyperscale storage via CRUSH, a hash-based algorithm for calculating how and where to store and retrieve data in a distributed object storage cluster.

This major challenge of 'data placement at scale', and how it is designed and implemented, has a huge impact on the scalability and performance of massive storage solutions.

A "bird's-eye view" of Ceph's data placement

In order to explore CRUSH we need to understand concepts such as pools, placement groups, OSDs and so on. If you are not familiar with the Ceph architecture I would suggest reading one of my previous posts where all this information is summarised together with pointers to the official documentation.

From now on I will assume you are familiar with the general concepts and we will keep the focus on CRUSH and related components.

Ceph uses CRUSH to determine how to store and retrieve data by computing data storage locations. Those data are handled by Ceph as objects of variable size in a simple flat namespace. On top of these native objects, Ceph builds the rest of its abstractions such as blocks, file systems or S3 buckets.

At this point we can reformulate the data placement problem as an object-to-OSD mapping problem without loss of generality.

In Ceph, objects belong to pools and pools are comprised of placement groups. Each placement group maps to a list of OSDs. This is the critical path you need to understand.

The pool is the way Ceph divides the global storage. This division or partition is the abstraction used to define the resilience (number of replicas, etc.), the number of placement groups, the CRUSH ruleset, the ownership and so on. You can consider this abstraction as the right place to define the configuration of your policies, so each pool handles its own number of replicas, number of placement groups, etc.

The placement group is the abstraction used by Ceph to map objects to OSDs in a dynamic way.
You can consider it as the placement or distribution unit in Ceph.

So how do we go from objects to OSDs via pools and placement groups? It is straightforward.

When a pool is created it is assigned a number of placement groups (PGs). An object is always stored in a concrete pool, so the pool identifier (a number) and the name of the object uniquely identify the object in the system. Those two values are used to get a placement group via hashing: Ceph hashes the object name and computes the hash modulo the number of PGs, and the resulting placement group is then used to retrieve the dynamic list of OSDs.

In detail, the steps to compute the placement group for the object named 'tree' in the pool 'images' (pool id 7) with 65536 placement groups would be...

1. Hash the object name: hash('tree') = 0xA062B8CF
2. Calculate the hash modulo the number of PGs: 0xA062B8CF % 65536 = 0xB8CF
3. Get the pool id: 'images' = 7
4. Prepend the pool id to 0xB8CF to get the placement group: 7.B8CF

Ceph uses this new placement group (7.B8CF) together with the cluster map and the placement rules to get the dynamic list of OSDs...

• CRUSH('7.B8CF') = [4, 19, 3]

The size of this list is the number of replicas configured in the pool. The first OSD in the list is the primary, the next one is the secondary and so on.

Understanding CRUSH

The CRUSH (Controlled Replication Under Scalable Hashing) algorithm determines how to store and retrieve data by computing data storage locations. As mentioned in the previous lines, Ceph uses this approach to address the data placement problem at scale.

Under the hood, the algorithm requires knowing how the cluster storage is organized (device locations, hierarchies and so on). All this information is defined in the CRUSH map.
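The object-to-placement-group steps described earlier for the object 'tree' can be sketched in a few lines of Python. This is a minimal illustration, not Ceph code: the hash value is the one quoted in the post rather than being computed with Ceph's real hash function, the `placement_group` helper is a hypothetical name, and the sketch ignores any extra handling the real implementation may apply.

```python
def placement_group(pool_id: int, name_hash: int, pg_num: int) -> str:
    """Return a placement group id formatted as '<pool id>.<hex PG index>'."""
    # Step 2: hash modulo the number of placement groups in the pool.
    pg_index = name_hash % pg_num
    # Steps 3 and 4: prepend the pool id.
    return f"{pool_id}.{pg_index:X}"


# Step 1: value quoted in the post for hash('tree'); it is not computed here.
TREE_HASH = 0xA062B8CF

pg = placement_group(pool_id=7, name_hash=TREE_HASH, pg_num=65536)
print(pg)
```

Because 65536 is a power of two, the modulo simply keeps the low 16 bits of the hash, which is why the result is the tail of the original hash value.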
The roles and responsibilities of the CRUSH map are the following:

• Together with a ruleset for each hierarchy, the map determines how Ceph stores data.
• It contains at least one hierarchy of nodes and leaves.
• The nodes of a hierarchy are called 'buckets'. Those buckets are defined by their type.
• The data objects are distributed among the storage devices according to a per-device weight value, approximating a uniform probability distribution.
• The hierarchies are arbitrary. They are defined according to your own needs, but the leaf nodes always represent the OSDs, and each leaf belongs to one node or bucket.

Let's see one example of a default hierarchy...

$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.16672 root default
-2 0.05557     host node-1
0 0.02779         osd.0        up  1.00000          1.00000
1 0.02779         osd.1        up  1.00000          1.00000
-3 0.05557     host node-2
2 0.02779         osd.2        up  1.00000          1.00000
3 0.02779         osd.3        up  1.00000          1.00000
-4 0.05557     host node-3
4 0.02779         osd.4        up  1.00000          1.00000
5 0.02779         osd.5        up  1.00000          1.00000
$

In this case the bucket hierarchy has six leaf buckets (osd.0 to osd.5), three host buckets (node-1 to node-3) and one root node (default).

Having a look at the decompiled map we get the following content...

$ ceph osd getcrushmap -o compiled-crush-map.bin
got crush map from osdmap epoch 65
$ crushtool -d compiled-crush-map.bin -o decompiled-crush-map.txt
$ cat decompiled-crush-map.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node-1 {
id -20 # do not change unnecessarily
# weight 0.056
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.028
item osd.1 weight 0.028
}
host node-2 {
id -32 # do not change unnecessarily
# weight 0.056
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.028
item osd.3 weight 0.028
}
host node-3 {
id -48 # do not change unnecessarily
# weight 0.056
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.028
item osd.5 weight 0.028
}
root default {
id -10 # do not change unnecessarily
# weight 0.167
alg straw
hash 0 # rjenkins1
item node-1 weight 0.056
item node-2 weight 0.056
item node-3 weight 0.056
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
$

As you can see the bucket declaration requires specifying its type, a unique name, weight and hash algorithm.

[bucket-type] [bucket-name] {
id [a unique negative numeric id]
weight [the relative capacity/capability of the item(s)]
alg [uniform, list, tree, straw, straw2]
hash [the hash type]
item [item-name] weight [weight]
}

The kinds of buckets (uniform, list, tree, straw2, etc) represent internal (non-leaf) nodes in the cluster hierarchies. Those buckets are based on different internal data structures and use different functions to pseudo-randomly choose among nested items. A typical example is the uniform bucket, which requires all of its items to have the same weight.

The hash type represents the hash algorithm used in the functions associated with the different kinds of buckets.

Finally, the CRUSH rules define how a Ceph client and OSDs select buckets.

rule {
ruleset
type [ replicated | raid4]
min_size
max_size
step take
step [choose|chooseleaf] [firstn|indep]
step emit
}

The Jenkins hash function

We mentioned that Ceph and CRUSH use a hash function as part of their logic, but we have not said anything about this function yet.

This function is known as the Jenkins hash function, a hash function for hash table lookups. One paper covering the technical details on this hash function is available here.

The paper presents fast and reliable hash functions for table lookup using 32-bit or 64-bit arithmetic together with a framework for evaluating hash functions.

In Ceph, the Jenkins function is not only used in CRUSH as part of the replica selection. It is used throughout the Ceph codebase wherever hashing is needed.

As Jenkins comments in his paper, these hashes work equally well on all types of inputs, including text, numbers, compressed data, counting sequences, and sparse bit arrays.

I would highlight the following points related to the hash function design.
They are relevant to understanding how the hashing code in Ceph/CRUSH sequences and masks its different input values.

1. If the hash value needs to be smaller than 32 (64) bits, you can mask the high bits.
2. The hash functions work best if the size of the hash table is a power of 2.
3. If the hash table has more than 2^32 (2^64) entries, this can be handled by calling the hash function twice with different initvals and then concatenating the results.
4. If the key consists of multiple strings, the strings can be hashed sequentially, passing in the hash value from the previous string as the initval for the next.
5. Hashing a key with different initvals produces independent hash values.

Ceph and data placement in practice

Let's locate one object...

$ ceph health
HEALTH_OK
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.16672 root default
-2 0.05557     host node-1
0 0.02779         osd.0        up  1.00000          1.00000
1 0.02779         osd.1        up  1.00000          1.00000
-3 0.05557     host node-2
2 0.02779         osd.2        up  1.00000          1.00000
3 0.02779         osd.3        up  1.00000          1.00000
-4 0.05557     host node-3
4 0.02779         osd.4        up  1.00000          1.00000
5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd pool create mypool 256 256 replicated
pool 'mypool' created
$ ceph osd lspools
0 rbd,1 mypool,
$ ceph osd dump | grep mypool
pool 1 'mypool' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins
pg_num 256 pgp_num 256 last_change 59 flags hashpspool stripe_width 0
$ dd if=/dev/zero of=myobject bs=25M count=1 2> /dev/null
$ ls -sh myobject
25M myobject
$ rados -p mypool put myobject myobject
$ rados -p mypool stat myobject
mypool/myobject mtime 2016-04-30 10:12:10.000000, size 26214400
$ rados -p mypool ls
myobject
$ ceph osd map mypool myobject
osdmap e60 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,2,1], p4) acting ([4,2,1], p4)
$

It is mapping the object 'myobject' in the pool 'mypool' to pg 1.5da41c62 (1.62) and OSDs 4, 2 and 1.

Let's see how it balances/replicates the object when OSDs go down...

$ ceph osd dump | grep mypool
pool 1 'mypool' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins
pg_num 256 pgp_num 256 last_change 59 flags hashpspool stripe_width 0
$ ceph osd map mypool myobject
osdmap e60 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,2,1], p4) acting ([4,2,1], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.16672 root default
-2 0.05557     host node-1
0 0.02779         osd.0        up  1.00000          1.00000
1 0.02779         osd.1        up  1.00000          1.00000
-3 0.05557     host node-2
2 0.02779         osd.2      down  1.00000          1.00000
3 0.02779         osd.3        up  1.00000          1.00000
-4 0.05557     host node-3
4 0.02779         osd.4        up  1.00000          1.00000
5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd map mypool myobject
osdmap e66 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,1], p4) acting ([4,1], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.16672 root default
-2 0.05557     host node-1
0 0.02779         osd.0        up  1.00000          1.00000
1 0.02779         osd.1      down  1.00000          1.00000
-3 0.05557     host node-2
2 0.02779         osd.2      down  1.00000          1.00000
3 0.02779         osd.3        up  1.00000          1.00000
-4 0.05557     host node-3
4 0.02779         osd.4        up  1.00000          1.00000
5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd map mypool myobject
osdmap e68 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4], p4) acting ([4], p4)
$ ceph osd map mypool myobject
osdmap e96 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,3], p4) acting ([4,3], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.16672 root default
-2 0.05557     host node-1
0 0.02779         osd.0        up  1.00000          1.00000
1 0.02779         osd.1        up  1.00000          1.00000
-3 0.05557     host node-2
2 0.02779         osd.2      down        0          1.00000
3 0.02779         osd.3        up  1.00000          1.00000
-4 0.05557     host node-3
4 0.02779         osd.4        up  1.00000          1.00000
5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd map mypool myobject
osdmap e103 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,1,3], p4) acting ([4,1,3], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.16672 root default
-2 0.05557     host node-1
0 0.02779         osd.0        up  1.00000          1.00000
1 0.02779         osd.1        up  1.00000          1.00000
-3 0.05557     host node-2
2 0.02779         osd.2        up  1.00000          1.00000
3 0.02779         osd.3        up  1.00000          1.00000
-4 0.05557     host node-3
4 0.02779         osd.4        up  1.00000          1.00000
5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd map mypool myobject
osdmap e108 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,2,1], p4) acting ([4,2,1], p4)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.16672 root default
-2 0.05557     host node-1
0 0.02779         osd.0        up  1.00000          1.00000
1 0.02779         osd.1        up  1.00000          1.00000
-3 0.05557     host node-2
2 0.02779         osd.2        up  1.00000          1.00000
3 0.02779         osd.3        up  1.00000          1.00000
-4 0.05557     host node-3
4 0.02779         osd.4      down  1.00000          1.00000
5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd map mypool myobject
osdmap e110 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([2,1], p2) acting ([2,1], p2)
$ ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.16672 root default
-2 0.05557     host node-1
0 0.02779         osd.0        up  1.00000          1.00000
1 0.02779         osd.1        up  1.00000          1.00000
-3 0.05557     host node-2
2 0.02779         osd.2        up  1.00000          1.00000
3 0.02779         osd.3        up  1.00000          1.00000
-4 0.05557     host node-3
4 0.02779         osd.4        up  1.00000          1.00000
5 0.02779         osd.5        up  1.00000          1.00000
$ ceph osd map mypool myobject
osdmap e115 pool 'mypool' (1) object 'myobject' -> pg 1.5da41c62 (1.62)
-> up ([4,2,1], p4) acting ([4,2,1], p4)


It works as expected. It uses three replicas with a minimal size of two replicas.
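The numbers in the transcript above can be cross-checked with a short Python sketch. It is only an illustration built on the printed output: the state labels and the `pg_state` helper are hypothetical names, and it assumes the usual reading of `size` as the wanted number of replicas and `min_size` as the minimum number of replicas needed for the placement group to keep serving I/O.

```python
PG_NUM = 256     # 'pg_num 256' from 'ceph osd dump'
POOL_SIZE = 3    # 'size 3'
MIN_SIZE = 2     # 'min_size 2'

# 'ceph osd map' printed: pg 1.5da41c62 (1.62)
object_hash = 0x5DA41C62
pg_id = f"1.{object_hash % PG_NUM:x}"
print(pg_id)


def pg_state(acting):
    """Classify an acting set by its number of replicas (labels are illustrative)."""
    if len(acting) >= POOL_SIZE:
        return "fully replicated"
    if len(acting) >= MIN_SIZE:
        return "degraded, still above min_size"
    return "below min_size"


# Some of the acting sets observed in the transcript, in order.
acting_sets = [[4, 2, 1], [4, 1], [4], [4, 3], [4, 1, 3]]
states = [pg_state(acting) for acting in acting_sets]
for acting, state in zip(acting_sets, states):
    print(acting, state)
```

The short placement group id (1.62) is the full hash reduced modulo the 256 placement groups of the pool, and the acting set shrinks and grows again as OSDs go down and come back.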

Wrap-up

Throughout this post some challenges of data placement at hyperscale were introduced together with the Ceph approach (smart storage nodes, CRUSH algorithm, Jenkins hash function...) to address them.

Some practical examples to illustrate the mapping and data placement path are also available in the previous lines. They were tested on the new Ceph Jewel release.

As usual, I would say the primary reference to understand the current Ceph data placement is the source code. I would also suggest reading Sage Weil's thesis (sections 6.2, 6.3 and 6.4) to learn more about the roots of the current solution. These sections cover reliable autonomic storage, performance and scalability. Beyond these references, you might also find the official documentation useful.

If you are looking for some kind of support related to development, design, deployment, etc. in Ceph, or you would love to see some new feature in the next releases, feel free to contact me!

## April 26, 2016

#### Identifying Layer-7 packet flows in SnabbWall

Spring is here already, the snow melted a while ago, and it looks like a good time to write a bit about network traffic flows, as promised in my previous post about ljndpi. Why so? Well, looking at network traffic and grouping it into logical streams between two endpoints is something that needs to be done for SnabbWall, a suite of Snabb applications implementing a Layer-7 analyzer and firewall, which Igalia is developing with sponsorship from the NLnet Foundation.

(For those interested in following only my Snabb-related posts, there are separate feeds you can use: RSS, Atom.)

## Going With the Flow

Any sequence of related network packets between two hosts can be a network traffic flow. But not quite so: the exact definition may vary, depending on the level at which we are working. For example, an ISP may want to consider all packets between the pair of hosts —regardless of their contents— as part of the same flow in order to account for transferred data in metered connections, but for SnabbWall we want “application-level” traffic flows. That is: all packets generated (or received) by the same application should be classified into the same flow.

But that can get tricky, because even if we looked only at TCP traffic from one application, it is not possible to just map a single connection to one flow. Take FTP for example: in active mode it uses a control connection, plus an additional data connection, and both should be considered part of the same flow because both belong to the same application. On the other side of the spectrum are web browsers like the one you are probably using to read this article: it will load the HTML using one connection, and then other related content (CSS, JavaScript, images) needed to display the web page.

In SnabbWall, the assignment of packets to flows is done based on the following fields from the packet:

• 802.1Q VLAN tag.
• Source and destination IP addresses.
• Source and destination port numbers.

The VLAN tag is there to force classifying packets with the same source and destination but in different logical networks in separate packet flows. As for port numbers, in practice these are only extracted from packets when the upper layer protocol is UDP or TCP. There are other protocols which use port numbers, but they are deliberately left out (for now) because either nDPI does not support them, or they are not widely adopted (SCTP comes to mind).

## Some Implementation Details

Determining the flow to which packets belong is an important task which is performed for each single packet scanned. Even before packet contents are inspected, they have to be classified.

Handling flows is split in two in the SnabbWall packet scanner: a generic implementation inspects packets to extract the fields above (VLAN tag, addresses, ports) and calculates a unique flow key from them, while backend-specific code inspects the contents of the packet and identifies the application for a flow of packets. Once the generic part of the code has calculated a key, it can be used to keep tables which associate additional data to each flow. While SnabbWall has only one backend at the moment, which uses nDPI, this split makes it easier to add others in the future.

For efficiency —both in terms of memory and CPU usage— flow keys are represented using a C struct. The following snippet shows the one for IPv4 packets, with a similar one where the address fields are 16 bytes wide being used for IPv6 packets:

ffi.cdef [[
   struct swall_flow_key_ipv4 {
      uint16_t vlan_id;
      uint8_t  __pad;       /* padding byte: keeps the size a multiple of 4 */
      uint8_t  ip_proto;
      uint8_t  lo_addr[4];
      uint8_t  hi_addr[4];
      uint16_t lo_port;
      uint16_t hi_port;
   } __attribute__((packed));
]]

local flow_key_ipv4 = ffi.metatype("struct swall_flow_key_ipv4", {
   __index = {
      hash = make_cdata_hash_function(ffi.sizeof("struct swall_flow_key_ipv4")),
   }
})


The struct is laid out with an extra byte of padding, to ensure that its size is a multiple of 4. Why so? The hash function (borrowed from the lib.ctable module) used for flow keys works on inputs whose size is a multiple of 4 bytes, because calculations are done on a word-by-word basis. In Lua the hash value for userdata values is their memory address, which makes them all different from each other: defining our own hashing function allows using the hash values as keys into tables, instead of the flow key itself. Let's see how this works with the following snippet, which counts per-flow packets:

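local flows = {}
while not ended do
   -- "key" is the flow key calculated for the packet being processed
   -- (a new cdata object for every packet; its construction is elided here).
   if flows[key:hash()] then
      flows[key:hash()].num_packets = flows[key:hash()].num_packets + 1
   else
      flows[key:hash()] = { key = key, num_packets = 1 }
   end
end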


If we used the keys themselves instead of key:hash() for indexing the flows table, this wouldn't work because the userdata for the new key is created for each packet being processed, which means that keys with the same content created for different packets would have different hash values (their address in memory). On the other hand, the :hash() method always returns the same value for keys with the same contents.
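The same idea, indexing a table by the contents of the key rather than by the identity of the object, can be sketched in Python with the standard `struct` module. This is an analogy and not SnabbWall code: the field order (VLAN id, one padding byte, protocol, low and high address, low and high port) is an assumption based on the fields described in this post, and `make_key` is a hypothetical helper.

```python
import struct
from collections import Counter

# Assumed layout: vlan_id (2 bytes), padding (1), ip_proto (1),
# lo_addr (4), hi_addr (4), lo_port (2), hi_port (2).
# '=' disables alignment padding, similar in spirit to a packed C struct.
FLOW_KEY_IPV4 = struct.Struct("=HxB4s4sHH")


def make_key(vlan_id, ip_proto, lo_addr, hi_addr, lo_port, hi_port):
    """Pack the flow fields into an immutable bytes value."""
    return FLOW_KEY_IPV4.pack(vlan_id, ip_proto, lo_addr, hi_addr, lo_port, hi_port)


packets_per_flow = Counter()
for _ in range(3):
    # A fresh key object is built for every packet, as in the loop above.
    key = make_key(0, 6, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]), 80, 10205)
    packets_per_flow[key] += 1

print(FLOW_KEY_IPV4.size)       # size of the packed key in bytes
print(len(packets_per_flow))    # number of distinct flows seen
```

Python `bytes` values hash by content, so three separately built keys land in the same table entry, which is the property the custom :hash() method provides for cdata keys.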

## Highs and Lows

You may be wondering why our flow key struct has its members named lo_addr, hi_addr, lo_port and hi_port. It turns out that packets which belong to the same application travel between two hosts in both directions. Let's consider the following:

• Host A, with address 10.0.0.1.
• Host B, with address 10.0.0.2.
• A web browser from A connects (using randomly assigned port 10205) to host B, which has an HTTP server running in port 80.

The sequence of packets observed will go like this:

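| # | Source IP | Destination IP | Source Port | Destination Port |
|---|-----------|----------------|-------------|------------------|
| 1 | 10.0.0.1 | 10.0.0.2 | 10205 | 80 |
| 2 | 10.0.0.2 | 10.0.0.1 | 80 | 10205 |
| 3 | 10.0.0.1 | 10.0.0.2 | 10205 | 80 |
| 4 | … | … | … | … |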

If the flow key fields were src_addr, dst_addr and so on, the first and second packets would be classified in separate flows, but they belong in the same one! This is sidestepped by sorting the addresses and ports of each packet when calculating its flow key. For the example connection above, all packets involved have 10.0.0.1 as the “low IP address” (lo_addr), 10.0.0.2 as the “high IP address” (hi_addr), 80 as the “low port” (lo_port), and 10205 as the “high port” (hi_port), effectively classifying all the packets into the same flow.
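The sorting trick can be sketched as follows. This is a minimal Python illustration rather than the SnabbWall implementation: `flow_key` is a hypothetical helper, and representing the key as a tuple is an assumption made for readability.

```python
import ipaddress


def flow_key(vlan_id, ip_proto, src_ip, dst_ip, src_port, dst_port):
    """Build a direction-independent flow key by sorting addresses and ports."""
    a = ipaddress.ip_address(src_ip)
    b = ipaddress.ip_address(dst_ip)
    lo_addr, hi_addr = min(a, b), max(a, b)
    lo_port, hi_port = min(src_port, dst_port), max(src_port, dst_port)
    return (vlan_id, ip_proto, str(lo_addr), str(hi_addr), lo_port, hi_port)


# Packets 1 and 2 from the table above (no VLAN tag, TCP is protocol 6).
request = flow_key(0, 6, "10.0.0.1", "10.0.0.2", 10205, 80)
response = flow_key(0, 6, "10.0.0.2", "10.0.0.1", 80, 10205)

print(request)
print(request == response)
```

Addresses and ports are sorted independently of each other, so the key no longer says which host owns which port; that is the price of making both directions compare equal.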

This translates into some minor annoyance in the nDPI scanner backend because nDPI expects us to pass a pair of identifiers for the source and destination hosts for each packet inspected. Not a big deal, though.

## Flow(er Power)

Something we have to do for IPv6 packets is traverse the chain of extension headers to get to the upper-layer protocol and extract port numbers from TCP and UDP packets. There can be any number of extension headers, and while in practice there should never be many, this means that the amount of work needed to derive a flow key from a packet is not constant.

The good news is that RFC 6437 specifies using the 20-bit flow label field of the fixed IPv6 header in a way that, combined with the source and destination addresses, uniquely identifies the flow of the packet. This is all rainbows and ponies, but in practice the current behaviour would still be needed: the specification considers that an all-zeroes value indicates “packets that have not been labeled”. This means the source and destination ports are still needed as a fallback. What is even worse: while forbidden by the specification, flow labels can mutate while packets are en route, without any means of verifying that the change was made. Also, it is allowed to assign a new flow label to an unlabeled packet when packets are being forwarded. Nevertheless, the flow label may be an interesting substitute for the port numbers when the upper layer protocol is neither TCP nor UDP. Due to its limited usefulness, using IPv6 flow labels remains unimplemented for now, but I have not ruled out adding support later on.
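To make the fallback concrete, here is a small Python sketch of how the flow label could be read and combined with the port-based behaviour. It is not what SnabbWall does (flow labels remain unimplemented there); the function names and the returned tuples are hypothetical. The only fact relied upon is the layout of the first 32 bits of the fixed IPv6 header: 4 bits of version, 8 bits of traffic class and 20 bits of flow label.

```python
def ipv6_flow_label(header: bytes) -> int:
    """Return the 20-bit flow label from the start of a fixed IPv6 header."""
    if len(header) < 4:
        raise ValueError("need at least the first 4 bytes of the IPv6 header")
    first_word = int.from_bytes(header[:4], "big")
    if first_word >> 28 != 6:
        raise ValueError("not an IPv6 header")
    return first_word & 0xFFFFF


def flow_discriminator(header: bytes, src_port: int, dst_port: int):
    """Use the flow label when present, otherwise fall back to sorted ports."""
    label = ipv6_flow_label(header)
    if label != 0:
        return ("label", label)
    # An all-zeroes label means the packet has not been labeled.
    return ("ports", min(src_port, dst_port), max(src_port, dst_port))


labeled = bytes([0x60, 0x0A, 0xBC, 0xDE])
unlabeled = bytes([0x60, 0x00, 0x00, 0x00])

print(hex(ipv6_flow_label(labeled)))
print(flow_discriminator(unlabeled, 10205, 80))
```

Note that the ports still have to be extracted for unlabeled packets, so the extension header traversal cannot be skipped in general.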

## Something Else

Along with the packet scanner, I have implemented the L7Spy application and the snabb wall command during this phase of the SnabbWall project. Expect another post soon about them!

## April 15, 2016

### Frédéric Wang

#### OpenType MATH in HarfBuzz

TL;DR:

• Work is in progress to add OpenType MATH support in HarfBuzz and will be instrumental for many math rendering engines relying on that library, including browsers.

• For stretchy operators, an efficient way to determine the required number of glyphs and their overlaps has been implemented and is described here.

In the context of the Igalia browser team's effort to implement MathML support using TeX rules and OpenType features, I have started implementing OpenType MATH support in HarfBuzz. This table from the OpenType standard is made of three subtables:

• The MathConstants table, which contains layout constants. For example, the thickness of the fraction bar of $\frac{a}{b}$.

• The MathGlyphInfo table, which contains glyph properties. For instance, the italic correction indicating how slanted an integral is e.g. to properly place the subscript in $\displaystyle\int_{D}$.

• The MathVariants table, which provides larger size variants for a base glyph or data to build a glyph assembly. For example, either a larger parenthesis or an assembly of U+239B, U+239C, U+239D to write something like:

$$\left(\frac{\frac{\frac{a}{b}}{\frac{c}{d}}}{\frac{\frac{e}{f}}{\frac{g}{h}}}\right.$$

Code to parse this table was added to Gecko and WebKit two years ago. The existing code to build glyph assembly in these Web engines was adapted to use the MathVariants data instead of only private tables. However, as we will see below the MathVariants data to build glyph assembly is more general, with arbitrary number of glyphs or with additional constraints on glyph overlaps. Also there are various fallback mechanisms for old fonts and other bugs that I think we could get rid of when we move to OpenType MATH fonts only.

In order to add MathML support in Blink, it is very easy to import the OpenType MATH parsing code from WebKit. However, after discussions with some Google developers, it seems that the best option is to directly add support for this table in HarfBuzz. Since this library is used by Gecko, by WebKit (at least the GTK port) and by many other applications such as Servo, XeTeX or LibreOffice, it makes sense to share the implementation to improve math rendering everywhere.

The idea for HarfBuzz is to add an API to

1. Expose data from the MathConstants and MathGlyphInfo.

2. Shape stretchy operators to some target size with the help of the MathVariants.

It is then up to a higher-level math rendering engine (e.g. TeX or MathML rendering engines) to beautifully display mathematical formulas using this API. The design choice for exposing MathConstants and MathGlyphInfo is almost obvious from reading the MATH table specification. The choice for the shaping API is a bit more complex and discussions are still in progress. For example, because we want to accept stretching after glyph-level mirroring (e.g. to draw RTL clockwise integrals) we should accept any glyph and not just an input Unicode string, as is the case for other HarfBuzz shaping functions. This shaping also depends on a stretching direction (horizontal/vertical) or on a target size (and Gecko even currently has various ways to approximate that target size). Finally, we should also have a way to expose italic correction for a glyph assembly or to approximate preferred width for Web rendering engines.

As I mentioned at the beginning, the data and algorithm to build glyph assembly is the most complex part of the OpenType MATH and deserves a special interest. The idea is that you have a list of $n\geq 1$ glyphs available to build the assembly. For each $0\leq i\leq n-1$, the glyph $g_{i}$ has advance $a_{i}$ in the stretch direction. Each $g_{i}$ has a straight connector part at its start (of length $s_{i}$) and at its end (of length $e_{i}$) so that we can align the glyphs on the stretch axis and glue them together. Also, some of the glyphs are “extenders”, which means that they can be repeated 0, 1 or more times to make the assembly as large as possible. Finally, the end/start connectors of consecutive glyphs must overlap by at least a fixed value $o_{\mathrm{min}}$ to avoid gaps at some resolutions but of course without exceeding the length of the corresponding connectors. This gives some flexibility to adjust the size of the assembly and get closer to the target size $t$.

To ensure that the width/height is distributed equally and the symmetry of the shape is preserved, the MATH table specification suggests the following iterative algorithm to determine the number of extenders and the connector overlaps to reach a minimal target size $t$:

1. Assemble all parts by overlapping connectors by maximum amount, and removing all extenders. This gives the smallest possible result.
2. Determine how much extra width/height can be distributed into all connections between neighboring parts. If that is enough to achieve the size goal, extend each connection equally by changing overlaps of connectors to finish the job.
3. If all connections have been extended to minimum overlap and further growth is needed, add one of each extender, and repeat the process from the first step.

We note that at each step, each extender is repeated the same number of times $r\geq 0$. So if $I_{\mathrm{Ext}}$ (respectively $I_{\mathrm{NonExt}}$) is the set of indices $0\leq i\leq n-1$ such that $g_{i}$ is an extender (respectively is not an extender) we have $r_{i}=r$ (respectively $r_{i}=1$). The size we can reach at step $r$ is at most the one obtained with the minimal connector overlap $o_{\mathrm{min}}$, that is

$$\sum_{i=0}^{n-1}\left(\sum_{j=1}^{r_{i}}\left(a_{i}-o_{\mathrm{min}}\right)\right)+o_{\mathrm{min}}=\left(\sum_{i\in I_{\mathrm{NonExt}}}\left(a_{i}-o_{\mathrm{min}}\right)\right)+\left(\sum_{i\in I_{\mathrm{Ext}}}r\left(a_{i}-o_{\mathrm{min}}\right)\right)+o_{\mathrm{min}}$$

We let $N_{\mathrm{Ext}}={|I_{\mathrm{Ext}}|}$ and $N_{\mathrm{NonExt}}={|I_{\mathrm{NonExt}}|}$ be the number of extenders and non-extenders. We also let $S_{\mathrm{Ext}}=\sum_{i\in I_{\mathrm{Ext}}}a_{i}$ and $S_{\mathrm{NonExt}}=\sum_{i\in I_{\mathrm{NonExt}}}a_{i}$ be the sum of advances for extenders and non-extenders. If we want the advance of the glyph assembly to reach the minimal size $t$ then

$$S_{\mathrm{NonExt}}-o_{\mathrm{min}}\left(N_{\mathrm{NonExt}}-1\right)+r\left(S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}\right)\geq t$$

We can assume $S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}>0$ or otherwise we would have the extreme case where the overlap takes at least the full advance of each extender. Then we obtain

$$r\geq r_{\mathrm{min}}=\max\left(0,\left\lceil\frac{t-S_{\mathrm{NonExt}}+o_{\mathrm{min}}\left(N_{\mathrm{NonExt}}-1\right)}{S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}}\right\rceil\right)$$

This provides a first simplification of the algorithm sketched in the MATH table specification: directly start iteration at step $r_{\mathrm{min}}$. Note that at each step we start at possibly different maximum overlaps and decrease all of them by the same value. It is not clear what to do when one of the overlaps reaches $o_{\mathrm{min}}$ while others can still be decreased. However, the sketched algorithm says all the connectors should reach minimum overlap before the next increment of $r$, which means the target size will indeed be reached at step $r_{\mathrm{min}}$.
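As a rough illustration, the closed-form expression for $r_{\mathrm{min}}$ can be sketched in a few lines of Python. The function name, the parallel-list representation of the glyph parts and the sample numbers are assumptions made for this example only; they are not part of the HarfBuzz proposal.

```python
import math


def min_extender_repetitions(advances, is_extender, o_min, target):
    """Return r_min, the smallest number of repetitions of each extender
    for which the assembly can reach `target` with overlap o_min."""
    # Sums and counts of advances, split into extenders and non-extenders.
    s_ext = sum(a for a, ext in zip(advances, is_extender) if ext)
    s_nonext = sum(advances) - s_ext
    n_ext = sum(1 for ext in is_extender if ext)
    n_nonext = len(advances) - n_ext

    # Growth obtained each time every extender is repeated once more.
    growth = s_ext - o_min * n_ext
    if growth <= 0:
        raise ValueError("overlap takes the full advance of the extenders")

    # Largest size reachable with no extender at all (r = 0).
    base = s_nonext - o_min * (n_nonext - 1)
    return max(0, math.ceil((target - base) / growth))


# Hypothetical assembly: two end pieces of advance 10 around one extender
# of advance 5, with a minimal overlap of 1.
advances = [10, 5, 10]
is_extender = [False, True, False]
r_min = min_extender_repetitions(advances, is_extender, o_min=1, target=30)
```

With these sample values the assembly without extenders measures 19 and each extender repetition adds 4, so three repetitions are needed to reach a target of 30.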

One possible interpretation is to stop overlap decreasing for the adjacent connectors that reached minimum overlap and to continue uniform decreasing for the others until all the connectors reach minimum overlap. In that case we may lose equal distribution or symmetry. In practice, this should probably not matter much. So we propose instead the dual option which should behave more or less the same in most cases: start with all overlaps set to $o_{\mathrm{min}}$ and increase them evenly to reach the same value $o$. By the same reasoning as above we want the inequality

$$S_{\mathrm{NonExt}}-o\left(N_{\mathrm{NonExt}}-1\right)+r_{\mathrm{min}}\left(S_{\mathrm{Ext}}-oN_{\mathrm{Ext}}\right)\geq t$$

which can be rewritten

$$S_{\mathrm{NonExt}}+r_{\mathrm{min}}S_{\mathrm{Ext}}-o\left(N_{\mathrm{NonExt}}+r_{\mathrm{min}}N_{\mathrm{Ext}}-1\right)\geq t$$

We note that $N=N_{\mathrm{NonExt}}+r_{\mathrm{min}}N_{\mathrm{Ext}}$ is just the exact number of glyphs used in the assembly. If there is only a single glyph, then the overlap value is irrelevant so we can assume $N_{\mathrm{NonExt}}+r_{\mathrm{min}}N_{\mathrm{Ext}}-1=N-1\geq 1$. This provides the greatest theoretical value for the overlap $o$:

$$o_{\mathrm{min}}\leq o\leq o_{\mathrm{max}}^{\mathrm{theoretical}}=\frac{S_{\mathrm{NonExt}}+r_{\mathrm{min}}S_{\mathrm{Ext}}-t}{N_{\mathrm{NonExt}}+r_{\mathrm{min}}N_{\mathrm{Ext}}-1}$$

Of course, we also have to take into account the limit imposed by the start and end connector lengths. So $o_{\mathrm{max}}$ must also be at most $\min{(e_{i},s_{i+1})}$ for $0\leq i\leq n-2$. But if $r_{\mathrm{min}}\geq 2$ then extender copies are connected and so $o_{\mathrm{max}}$ must also be at most $\min{(e_{i},s_{i})}$ for $i\in I_{\mathrm{Ext}}$. To summarize, $o_{\mathrm{max}}$ is the minimum of $o_{\mathrm{max}}^{\mathrm{theoretical}}$, of $e_{i}$ for $0\leq i\leq n-2$, of $s_{i}$ for $1\leq i\leq n-1$ and, when $r_{\mathrm{min}}\geq 2$, possibly of $s_{0}$ (if $0\in I_{\mathrm{Ext}}$) and of $e_{n-1}$ (if ${n-1}\in I_{\mathrm{Ext}}$), since these are the only two connector lengths not already covered by the previous ranges.
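The bounds on the overlap can be combined in a short Python sketch. It assumes that $r_{\mathrm{min}}$ has already been computed from the formula above; the function name, the list-based representation of the connector lengths and the sample values are illustrative assumptions, not an existing API.

```python
def max_overlap(advances, starts, ends, is_extender, o_min, target, r_min):
    """Return o_max, the largest common overlap that still lets the
    assembly reach `target` when each extender is repeated r_min times."""
    n = len(advances)
    s_ext = sum(a for a, ext in zip(advances, is_extender) if ext)
    s_nonext = sum(advances) - s_ext
    n_ext = sum(1 for ext in is_extender if ext)
    n_nonext = n - n_ext

    # N in the text: the exact number of glyphs used in the assembly.
    glyph_count = n_nonext + r_min * n_ext
    if glyph_count <= 1:
        return o_min  # a single glyph has no connection, overlap is irrelevant

    # Greatest theoretical overlap from the inequality on the total advance.
    o_max = (s_nonext + r_min * s_ext - target) / (glyph_count - 1)

    # Connector limits between consecutive parts, as stated in the text.
    for i in range(n - 1):
        o_max = min(o_max, ends[i], starts[i + 1])

    # Copies of the same extender are glued together only when r_min >= 2.
    if r_min >= 2:
        for i in range(n):
            if is_extender[i]:
                o_max = min(o_max, ends[i], starts[i])
    return o_max


# Hypothetical assembly: advances 10, 5, 10 with the middle glyph an extender.
overlap = max_overlap(
    advances=[10, 5, 10], starts=[0, 2, 3], ends=[3, 2, 0],
    is_extender=[False, True, False], o_min=1, target=30, r_min=3)
```

In this example the five glyphs have a total advance of 35, so spreading the excess of 5 over the four connections gives an overlap of 1.25, which is within every connector length.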

With the algorithm described above, $N_{\mathrm{Ext}}$, $N_{\mathrm{NonExt}}$, $S_{\mathrm{Ext}}$, $S_{\mathrm{NonExt}}$, $r_{\mathrm{min}}$ and $o_{\mathrm{max}}$ can all be obtained using simple loops on the glyphs $g_{i}$ and so the complexity is $O(n)$. In practice $n$ is small: for existing fonts, assemblies are made of at most three non-extenders and two extenders, that is $n\leq 5$ (incidentally, Gecko and WebKit do not currently support larger values of $n$). This means that all the operations described above can be considered to have constant complexity. This is much better than a naive implementation of the iterative algorithm sketched in the OpenType MATH table specification which seems to require at worst

$$\sum_{r=0}^{r_{\mathrm{min}}-1}\left(N_{\mathrm{NonExt}}+rN_{\mathrm{Ext}}\right)=N_{\mathrm{NonExt}}r_{\mathrm{min}}+\frac{r_{\mathrm{min}}\left(r_{\mathrm{min}}-1\right)}{2}N_{\mathrm{Ext}}=O(n\times r_{\mathrm{min}}^{2})$$

and at least $\Omega(r_{\mathrm{min}})$.

One issue is that the number of extender repetitions $r_{\mathrm{min}}$ and the number of glyphs in the assembly $N$ can become arbitrarily large since the target size $t$ can take large values e.g. if one writes \underbrace{\hspace{65535em}} in LaTeX. The improvement proposed here does not solve that issue since setting the coordinates of each glyph in the assembly and painting them require $\Theta(N)$ operations as well as (in the case of HarfBuzz) a glyph buffer of size $N$. However, such large stretchy operators do not happen in real-life mathematical formulas. Hence to avoid possible hangs in Web engines a solution is to impose a maximum limit $N_{\mathrm{max}}$ for the number of glyphs in the assembly so that the complexity is limited by the size of the DOM tree. Currently, the proposal for HarfBuzz is $N_{\mathrm{max}}=128$. This means that if each assembly glyph is 1em large you won’t be able to draw stretchy operators of size more than 128em, which sounds like a quite reasonable bound. With the above proposal, $r_{\mathrm{min}}$ and so $N$ can be determined very quickly and the cases $N\geq N_{\mathrm{max}}$ rejected, so that we avoid losing time with such edge cases…

Finally, because in our proposal we use the same overlap $o$ everywhere, an alternative for HarfBuzz would be to set the output buffer size to $n$ (i.e. ignore $r-1$ copies of each extender and only keep the first one). This will leave gaps that the client can fix by repeating extenders as long as $o$ is also provided. Then HarfBuzz math shaping can be done with a complexity in time and space of just $O(n)$ and it will be up to the client to optimize or limit the painting of extenders for large values of $N$…

## April 10, 2016

### Javier Muñoz

#### The Outscale OSU driver goes upstream in Libcloud

Apache Libcloud 1.0.0-rc2 (preview) was released today and it contains the new Outscale storage driver I contributed upstream several days ago.

This release together with the digital signatures are available in the download section. You can read the change log here.

In this entry I will introduce the Apache Libcloud project, the Outscale driver and how the new provider can be used to connect with the Outscale object storage service.

The Apache Libcloud project

Apache Libcloud is an Open Source Python library that provides a vendor-neutral interface to cloud provider APIs. It is used to support diversification without vendor lock-in.

The library eases the interaction with cloud resources through a unified API and backend drivers.

Backend drivers are available to support popular and well-known cloud service providers. As expected, those drivers contain the functionalities exported to developers and applications via the unified API.

From a logical perspective, the library divides the supported resources among the following categories:

• Cloud Servers and Block Storage
• Cloud Object Storage and CDN
• Load Balancers as a Service
• DNS as a Service
• Container Services
• Backup as a Service

The next picture shows a simplified Libcloud diagram:

As we shall see later, the Outscale driver implements a new Libcloud provider to be used with the Cloud Object Storage API and the Outscale Object Storage Unit (OSU) based on Ceph.

The Outscale object storage driver

Right now Outscale provides two different storage services in its cloud. They go under the names Block Storage Unit (BSU) and Object Storage Unit (OSU).

The BSU is the way to work with block storage on standard hard drives or SSDs. It is used to create and attach block storage volumes to instances in the Flexible Compute Unit (FCU) service.

The OSU service is all about storing and managing objects in a replicated and high-availability environment. The service exports an AWS S3 compatible API.

The Outscale object storage driver targets this second service, the OSU service. The driver extends the Libcloud S3 storage driver to provide a compatible S3 API together with the required bits to connect with the OSU service.

Using the Outscale OSU storage driver in Libcloud

You can use the OSU driver via the S3_RGW_OUTSCALE provider easily. A simple snippet follows...

from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver
import libcloud

api_key = 'api_key'
secret_key = 'secret_key'

cls = get_driver(Provider.S3_RGW_OUTSCALE)
driver = cls(api_key, secret_key, region='eu-west-1')
container = driver.get_container(...)


The 'region' parameter is optional. If you don't set an explicit region, the driver will use the default region 'eu-west-2'.

The driver supports the following five Outscale regions:

• eu-west-1
• eu-west-2
• us-west-1
• us-east-2
• cn-southeast-1

A full example loading the S3_RGW_OUTSCALE provider and then retrieving an object is available here.

Running the example...

\$ ./test-osu-outscale-driver.py
<Object: name=my-name-abcdabcd-123,
size=3453,
hash=89c28be4b979a529afa5f24fae439858,
provider=OUTSCALE Ceph RGW S3 (eu-west-1) ...>


Acknowledgments

My work in Apache Libcloud is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Libcloud community. Thank you all!

### Diego Pino

#### Network namespaces

Namespaces and cgroups are two of the main kernel technologies that most of the new trend in software containerization (think Docker) rides on. To put it simply, cgroups are a metering and limiting mechanism: they control how much of a system resource (CPU, memory) you can use. On the other hand, namespaces limit what you can see. Thanks to namespaces, processes have their own view of the system’s resources.

The Linux kernel provides 6 types of namespaces: pid, net, mnt, uts, ipc and user. For instance, a process inside a pid namespace only sees processes in the same namespace. Thanks to the mnt namespace, it’s possible to attach a process to its own filesystem (like chroot). In this article I focus only on network namespaces.

If you have grasped the concept of namespaces you may have at this point an intuitive idea of what a network namespace might offer. Network namespaces provide a brand-new network stack for all the processes within the namespace. That includes network interfaces, routing tables and iptables rules.

## Network namespaces

From the system’s point of view, when creating a new process via the clone() syscall, passing the flag CLONE_NEWNET will create a brand-new network namespace for the new process. From the user perspective, we simply use the tool ip (package is iproute2) to create a new persistent network namespace:

This command will create a new network namespace called ns1. When the namespace is created, the ip command adds a bind mount point for it under /var/run/netns. This allows the namespace to persist even if there’s no process attached to it. To list the namespaces available in the system:

Or via ip:

As previously said, a network namespace contains its own network resources: interfaces, routing tables, etc. Let’s add a loopback interface to ns1:

• Line 1 brings up the loopback interface inside the network namespace ns1.
• Line 2 executes the command ping 127.0.0.1 inside the network namespace.

An alternative syntax to bring up the loopback interface could be:

However, I tend to use the command ip as it has become the preferred networking tool in Linux, obsoleting the old but more familiar commands ifconfig, route, etc. Notice that ip requires root privileges, so run it as root or prepend sudo.

A network namespace has its own routing table too:

Which at this point returns nothing as we haven’t added any routing table rule yet. Generally speaking, any command run within a network namespace is prefixed with the prologue:
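The prologue can be captured in a small Python helper that builds the argument list for a command to be run inside a namespace. This is only a sketch: the helper name `netns_exec` is made up for this example, and the commands shown are common iproute2 spellings rather than necessarily the exact ones used in this article. Actually running the resulting lists (for instance with `subprocess.run`) requires root privileges.

```python
def netns_exec(namespace, *command):
    """Prefix a command with the `ip netns exec <namespace>` prologue."""
    return ["ip", "netns", "exec", namespace, *command]


# Bring up the loopback interface inside ns1, then ping it from there.
loopback_up = netns_exec("ns1", "ip", "link", "set", "dev", "lo", "up")
ping_loopback = netns_exec("ns1", "ping", "127.0.0.1")

# Show the routing table as seen from inside ns1.
show_routes = netns_exec("ns1", "ip", "route")

for argv in (loopback_up, ping_loopback, show_routes):
    print(" ".join(argv))
```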

## A practical example

One of the consequences of network namespaces is that an interface can only be assigned to one namespace at a time. If the root namespace owns eth0, which provides access to the external world, only programs within the root namespace can reach the Internet. The solution is to connect a namespace with the root namespace via a veth pair. A veth pair works like a patch cable, connecting two sides. It consists of two virtual interfaces: one of them is assigned to the root network namespace, while the other lives within a network namespace. Setting up their IP addresses and routing rules accordingly, plus enabling NAT on the host side, will be enough to provide Internet access to the network namespace.

Additionally, I feel like I need to make a clarification at this point. I’ve read in several articles about network namespaces that physical device interfaces can only live in the root namespace. At least that’s not the case with my current kernel (Linux 3.13). I can assign eth0 to a namespace other than the root and, when setting it up properly, have Internet access from the namespace. However, the limitation of one interface living only in one single namespace at a time still applies, and that’s a reason powerful enough to need connecting network namespaces via a veth pair.

Next, create a veth pair. Interface v-eth1 will remain inside the root network namespace, while its peer, v-peer1, will be moved to the ns1 namespace.

Next, set up IPv4 addresses for both interfaces and bring them up.

Additionally I brought up the loopback interface inside ns1.

Now we need to make all external traffic leaving ns1 go through v-eth1.

However, this won’t be enough. As with any host sharing its Internet connection, it’s necessary to enable IPv4 forwarding in the host and enable masquerading.

If everything went fine, it should now be possible to ping an external host from ns1.
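To recap the steps above in one place, here is a Python sketch that only builds the list of commands for the veth setup, without running anything. The function name, the 10.200.1.0/24 addresses and the choice of eth0 as uplink are assumptions made for illustration; the actual values used in the original script may differ, and every command needs root privileges when executed.

```python
import ipaddress


def veth_setup_commands(ns="ns1", veth="v-eth1", peer="v-peer1",
                        veth_addr="10.200.1.1", peer_addr="10.200.1.2",
                        prefix=24, uplink="eth0"):
    """Return the shell commands that connect `ns` to the root namespace."""
    in_ns = f"ip netns exec {ns}"
    # Network containing both ends of the pair, used for the NAT rule.
    subnet = ipaddress.ip_interface(f"{veth_addr}/{prefix}").network
    return [
        # Create the pair and move one end into the namespace.
        f"ip link add {veth} type veth peer name {peer}",
        f"ip link set {peer} netns {ns}",
        # Address both ends and bring them up, plus loopback inside ns.
        f"ip addr add {veth_addr}/{prefix} dev {veth}",
        f"ip link set {veth} up",
        f"{in_ns} ip addr add {peer_addr}/{prefix} dev {peer}",
        f"{in_ns} ip link set {peer} up",
        f"{in_ns} ip link set lo up",
        # Route all external traffic leaving the namespace through the pair.
        f"{in_ns} ip route add default via {veth_addr}",
        # Enable IPv4 forwarding and masquerading on the host side.
        "sysctl -w net.ipv4.ip_forward=1",
        f"iptables -t nat -A POSTROUTING -s {subnet} -o {uplink} -j MASQUERADE",
    ]


commands = veth_setup_commands()
print("\n".join(commands))
```

Keeping the plan as plain strings makes it easy to review before piping it to a root shell or feeding each line to `subprocess.run`.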

This is how the routing table inside ns1 looks after the setup:

Prepending the ip netns exec prologue to every command to run from the namespace might be a bit tedious. Once the most basic features inside the namespace are set up, a more interesting possibility is to run a bash shell and attach it to the network namespace:

Type exit to end the bash process and leave the network namespace.

## Conclusion

Network namespaces, as well as other containerization technologies provided by the Linux kernel, are a lightweight mechanism for resource isolation. Processes attached to a network namespace see their own network stack, while not interfering with the rest of the system’s network stack.

Network namespaces are easy to use too. A similar network-level isolation could have been set up using a VM. However, that seems a much more expensive solution in terms of system resources and time investment to build up such an environment. If you only need process isolation at the networking level, network namespaces are definitely something to consider.

The full script is available as a GitHub gist at: ns-inet.sh.

## April 06, 2016

### Víctor Jáquez

#### gstreamer-vaapi 1.8: the codec split

On March 23rd GStreamer 1.8 was released, along with all its bundled modules, and, of course, one of those modules is gstreamer-vaapi.

First thing to notice is that the encoders have been renamed. Before they followed the pattern vaapiencode_{codec}, now they follow the pattern vaapi{codec}enc. The purpose of this change is twofold: to fix the plugins gtk-docs and to keep the usual element names in GStreamer. The conversion table is this:

| Old | New |
| --- | --- |
| vaapiencode_h264 | vaapih264enc |
| vaapiencode_h265 | vaapih265enc |
| vaapiencode_mpeg2 | vaapimpeg2enc |
| vaapiencode_jpeg | vaapijpegenc |
| vaapiencode_vp8 | vaapivp8enc |

But those were not the only name changes, we also have split the vaapidecode. Now we have a vaapijpegdec, which only decodes JPEG images, while keeping the old vaapidecode for video decoding. Also, vaapijpegdec was demoted to a marginal rank, because there are some problems in the Intel VA driver (which is the only one which supports JPEG decoding right now).

Note that in future releases, all the decoders will be split by codec, just as we did the JPEG decoder now; but first, we need to modify vaapidecodebin to choose a decoder in run-time based on the negotiated caps.

There are a ton of enhancements and optimizations too. Let me enumerate some of them: Vineeth TM fixed several memory leaks and some compilation issues; Tim enabled vaapisink to send unhandled keyboard or mouse events to the application, making the usage of apps like gst-play-1.0 or apps based on GstPlayer more natural; Thiago fixed the h264/h265 parsers, meanwhile Sree fixed the vp9 and the h265 ones too; Scott also fixed the h265 parser; et cetera. As you may see, the H265/HEVC parser has been very active lately, it is the new thing!

I have to thank Sebastian Dröge, who did all the release work and also fixed a couple of compilation issues.

This is the short log summary since 1.6:

 2  Lim Siew Hoon
1  Scott D Phillips
8  Sebastian Dröge
3  Sreerenj Balachandran
5  Thiago Santos
1  Tim-Philipp Müller
8  Vineeth TM
16  Víctor Manuel Jáquez Leal


## March 31, 2016

### Jacobo Aragunde

#### A new PhpReport for 2016

It’s been three years without any new release of our venerable time tracking tool, PhpReport. It doesn’t mean the project has been still during all this time; despite the slower development pace, you will find more than 80 patches in the repository since the last release, which account for 45 fixes or small features in the project.

It’s time to gather them all in a new release, PhpReport 2.16. These are the highlights:

### CSV export

All reports now have an option to export data to CSV, so they can be imported into other software. This was made with spreadsheets in mind, and I can confirm it works perfectly with our dear LibreOffice Calc.

### Quick-access buttons

Most reports got quick-access buttons added for the most common time periods: current and previous week, current month and year, etc.

### Smarter “copy from yesterday”

The “copy from date” feature lets you copy tasks from a certain date into the current one. Its main use case is to copy everything from your previous work day, because you probably have been doing the same work, more or less during the same timetable… You can conveniently copy the tasks and just add some tweaks. The default date to copy from used to be the previous day, but you don’t usually want to copy your tasks from Sunday to Monday! Now it defaults to the previous date on which you worked.
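The new default can be pictured with a few lines of Python. This is only an illustration of the rule described above, not PhpReport code: the function name and the idea of passing the set of dates that already have tasks are assumptions.

```python
from datetime import date


def default_copy_date(today, worked_dates):
    """Return the most recent date before `today` that has tasks, if any."""
    earlier = [d for d in worked_dates if d < today]
    return max(earlier) if earlier else None


# On a Monday, the default is the previous Friday rather than Sunday.
worked = {date(2016, 3, 24), date(2016, 3, 25)}
copy_from = default_copy_date(date(2016, 3, 28), worked)
```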

### Nit-picking

Addressed several slightly annoying behaviors to make the application more enjoyable to use. For example, when you create a project, the grid will scroll automatically to the newly added project so you can keep editing the project attributes. A similar thing happens when you have some row selected in the accumulated hours report and you load a new set of dates: the previously selected row will be kept selected and visible. Opening the project details from a report now pops up a new window so your report remains untouched when you go back to it. Inexplicably small grids now use all the screen space, and we increased the consistency among the different edition screens.

### Moved to GitHub

The project sources, releases and the information previously available on the project website have been moved to GitHub, so any potential contributors will find a familiar environment if they happen to be interested in PhpReport.

### Future

I’ve bumped the version number from 2.1 straight to 2.16. The intention is to make one release per year, to guarantee some predictability regarding how and when the changes in the repository will reach users. You can expect PhpReport 2.17 to be released in March next year; it may look like a long time, but I think it’s reasonable for a project with a low level of activity.

### Looking for help!

Igalia has recently announced open positions in our Coding Experience program, and one of them is aimed at working on PhpReport. We have many ideas to improve this tool to make it more productive and powerful. Check the conditions of the program on the website if you are interested!

### Michael Catanzaro

#### Positive progress on WebKitGTK+ security updates

I previously reported that, although WebKitGTK+ releases regular upstream security updates, most Linux distributions are not taking the updates. At the time, only Arch Linux and Fedora were reliably releasing our security updates. So I’m quite pleased that openSUSE recently released a WebKitGTK+ security update, and then Mageia did too. Gentoo currently has an update in the works. It remains to be seen if these distros regularly follow up on updates (expect a follow-up post on this in a few months), but, optimistically, you now have several independent distros to choose from to get an updated version of WebKitGTK+, plus any distros that regularly receive updates directly from these distros.

Unfortunately, not all is well yet. It’s still not safe to use WebKitGTK+ on the latest releases of Debian or Ubuntu, or on derivatives like Linux Mint, elementary OS, or Raspbian. (Raspbian is notable because it uses an ancient, insecure version of Epiphany as its default web browser, and Raspberry Pis are kind of popular.)

And of course, no distribution has been able to get rid of old, insecure WebKitGTK+ 2.4 compatibility packages, so many applications on distributions that do provide security updates for modern WebKitGTK+ will still be insecure. (Don’t be fooled by the recent WebKitGTK+ 2.4.10 update; it contains only a few security fixes that were easy to backport, and was spurred by the need to add GTK+ 3.20 compatibility. It is still not safe to use.) Nor have distributions managed to remove QtWebKit, which is also old and insecure. You still need to check individual applications to see if they are running safe versions of WebKit.

But at least there are now several distros providing WebKitGTK+ security updates. That’s good.

Special thanks to Apple and to my colleagues at Igalia for their work on the security advisories that motivate these updates.

#### Epiphany 3.20

So, what’s new in Epiphany 3.20?

First off: overlay scrollbars. Because web sites have the ability to style their scrollbars (which you’ve probably noticed on Google sites), WebKit embedders cannot use a normal GtkScrolledWindow to display content; instead, WebKit has to paint the scrollbars itself. Hence, when overlay scrollbars appeared in GTK+ 3.16, WebKit applications were left out. Carlos García Campos spent some time to work on this, and the result speaks for itself (if you fullscreen this video to see it properly):

Overlay scrollbars did not actually require any changes in Epiphany itself — all applications using an up-to-date version of WebKit will immediately benefit — but I mention it here as it’s one of the most noticeable changes. Read about other WebKit improvements, like the new Faster Than Light FTL/B3 JavaScript compilation tier, on Carlos’s blog.

Next up, there is a new downloads manager, also by Carlos García Campos. This replaces the old downloads bar that used to appear at the bottom of the screen:

I flipped the switch in Epiphany to enable WebGL:

If you watched that video in fullscreen, you might have noticed that page is marked as insecure, even though it doesn’t use HTTPS. Like most browsers, we used to have several confusing security states. Pages with mixed content received a security warning that all users ignored, but pages with no security at all received no such warning. That’s pretty dumb, which is why Firefox and Chrome have been talking about changing this for a year or so now. I went ahead and implemented it. We now have exactly two security states: secure and insecure. If your page loads any content not over HTTPS, it will be marked as insecure. The vast majority of pages will be displayed as insecure, but it’s no less than such sites deserve. I’m not concerned at all about “warning fatigue,” because users are not generally expected to take any action on seeing these warnings. In the future, we will take this further, and use the insecure indicator for sites that use SHA-1 certificates.

Moving on. By popular request, I exposed the previously-hidden setting to disable session restore in the preferences dialog, as “Remember previous tabs on startup:”

Meanwhile, Carlos worked in both WebKit and Epiphany to greatly improve session restoration. Previously, Epiphany would save the URLs of the pages loaded in each tab, and when started it would load each URL in a new tab, but you wouldn’t have any history for those tabs, for example, and the state of the tab would otherwise be lost. Carlos worked on serializing the WebKit session state and exposing it in the WebKitGTK+ API, allowing us to restore full back/forward history for each tab, plus details like your scroll position on each tab. Thanks to Carlos, we also now make use of this functionality when reopening closed tabs, so your reopened tab will have a full back/forward list of history, and also when opening new tabs, so the new tab will inherit the history of the tab it was opened from (a feature that we had in the past, but lost when we switched to WebKit2).

Interestingly, we found the session restoration was at first too good: it would restore the page really exactly as you last viewed it, without refreshing the content at all. This means that if, for example, you were viewing a page in Bugzilla, then when starting the browser, you would miss any new comments from the last time you loaded the page until you refresh the page manually. This is actually the current behavior in Safari; it’s desirable on iOS to make the browser launch instantly, but questionable for desktop Safari. Carlos decided to always refresh the page content when restoring the session for WebKitGTK+.

Last, and perhaps least, there’s a new empty state displayed for new users, developed by Lorenzo Tilve and polished up by me, so that we don’t greet new users with a completely empty overview (where your most-visited sites are normally displayed):

That, plus a bundle of the usual bugfixes, significant code cleanups, and internal architectural improvements (e.g. I converted the communication between the UI process and the web process extension to use private D-Bus connections instead of the session bus). The best things have not changed: it still starts up about 5-20 times faster than Firefox in my unscientific testing; I expect you’ll find similar results.

Enjoy!

## March 24, 2016

### Andy Wingo

#### a simple (local) solution to the pay gap

International Working Women's Day was earlier this month, a day that reminds the world how far it has yet to go to achieve just treatment of women in the workplace. Obviously there are many fronts on which to fight to dismantle patriarchy, and also cissexism, and also transphobia, and also racism, and sometimes it gets a bit overwhelming just to think of a world where people treat each other right.

Against this backdrop, it's surprising that some policies are rarely mentioned by people working on social change. This article is about one of them -- a simple local change that can eliminate the pay gap across all axes of unfair privilege.

OK here it is: just pay everyone in a company the same hourly wage.

That's it!

on simple, on easy

But, you say, that's impossible!

Rich Hickey has this famous talk where he describes one thing as simple and the other as easy. In his narrative, simple is good but hard, and easy is bad but, you know, easy. I enjoy this talk because it's easy (hah!) to just call one thing simple and the other easy and it's codewords for good and bad, and you come across as having the facile prestidigitatory wisdom of a Malcolm Gladwell.

As far as simple, the substance of equal pay is as simple as it gets. And as far as practical implementation goes, it only needs buy-in from one person: your boss could do it tomorrow.

But, you say, a real business would never do this! This is getting closer to the real issues, but not there yet. There are plenty of instances of real businesses that do this. Incidentally, mine is one of them! I do not intend this to be an advertisement for my company, but I have to mention this early because society does its best to implant inside our brains the ideas that certain ideas are possible and certain others are not.

But, you say, this would be terrible for business! Here I think we are almost there. There's a question underneath, if we can manage to phrase it in a more scientific way -- I mean, the ideal sense in which science is a practice of humankind in which we use our limited powers to seek truth, with hypotheses but without prejudice. It might sound a bit pompous to invoke capital-S Science here, but I think few conversations of this kind try to honestly even consider existence proofs in the form of already-existing histories (like the company I work for), much less an unbiased study of the implications of modelling the future on those histories.

Let's assume that you and I want to work for justice, and in this more perfect world, men and women and nonbinary people will have equal pay for equal work, as will all people that lie on all axes of privilege that currently operate in society. If you are with me up to here: great. If not, we don't share a premise so it's not much use to go farther. You can probably skip to the next article in your reading list.

So, then, the questions: first of all, would a flat equal wage within a company actually help people in marginalized groups? What changes would happen to a company if it enacted a flat wage tomorrow? What are its limitations? How could this change come about?

would it help?

Let's take the most basic question first. How would this measure affect people in marginalized groups?

Let us assume that salaries are distributed inversely: the higher salaries are made by fewer people. A lower salary corresponds to more people. So firstly, we are in a situation where the median salary is less than the mean, which means that if we switched to paying everyone the mean, most people would see an increase in their salary.
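To make the arithmetic concrete, here is a minimal sketch with invented numbers. The wage list is purely hypothetical (it is not data from any real company); it only illustrates what happens when a few high wages pull the mean above the median.

```python
from statistics import mean, median

# Hypothetical hourly wages for a ten-person company (invented numbers):
# many people near the bottom, a few near the top.
wages = [18, 18, 20, 20, 22, 25, 30, 40, 60, 97]

flat_wage = mean(wages)      # what everyone would earn under a flat wage
middle_wage = median(wages)  # what the "typical" person earns today

# People whose current wage is below the flat wage would get a raise.
better_off = sum(1 for w in wages if w < flat_wage)

print(f"flat wage (mean): {flat_wage}")
print(f"median wage today: {middle_wage}")
print(f"{better_off} of {len(wages)} people would see a raise")
```

The total payroll is unchanged, since the flat wage times the head count equals the sum of the current wages. The money simply moves from the few at the top to the many below the mean.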

Assuming that marginalized people were evenly placed in a company, that would mean that most would benefit. But we know that is not the case: "marginalized" is the operative term. People are categorized at a lower point than their abilities; people's climb of the organizational hierarchy (and to higher salaries) is hindered by harassment, by undervalued diversity work, and by external structural factors, like institutionalized racism or the burden of having to go through a gender transition. So probably, even if a company touts equal pay within job classifications, the job classifications themselves unfairly put marginalized people lower than white dudes like me. So, proportionally marginalized people would benefit from an equal wage more than most.

Already this plan is looking pretty good: more money going to marginalized people is a necessary step to bootstrap a more just world.

All that said, many (but not most) people from marginalized groups will earn more than the mean. What for them? Some will decide that paying for a more just company as a whole is worth a salary reduction. (Incidentally, this applies to everyone: everyone has their price for justice. It might be 0.1%, it might be 5%, it might be 50%.)

Some, though, will decide it is not worth paying. They will go work elsewhere, probably for even more money (changing jobs being the best general way to advance your salary). I don't blame marginalized folks for getting all they can: more power to them.

From what I can tell, things are looking especially good for marginalized people under a local equal-wage initiative. Not perfect, not in all cases, but generally better.

won't someone think of the dudes

I don't believe in value as a zero-sum proposition: there are many ways in which a more fair world could be more productive, too. But in the short term, a balance sheet must balance. Salary increases in the bottom will come from salary decreases from the top, and the dudebro is top in tech.

We should first note that many and possibly most white men will see their wages increase under a flat-wage scheme, as most people earn below the mean.

Secondly, some men will be willing to pay for justice in the form of equal pay for equal work. An eloquent sales pitch as to what they are buying will help.

Some men would like to pay but have other obligations that a "mean" salary just can't even. Welp, there are lots of jobs out there. We'll give you a glowing recommendation :)

Finally there will be dudes that are fine with the pay gap. Maybe they have some sort of techno-libertarian justification? Oh well! They will find other jobs. As someone who cares about justice, you don't really want to work with these people anyway. Call it "bad culture fit", and treat it as a great policy to improve the composition of your organization.

an aside: what are we here for anyway?

A frequent objection to workplace change comes in the form of a pandering explanation of what companies are for, that corporations are legally obligated to always proceed along the most profitable path.

I always find it extraordinarily ignorant to hear this parroted by people in tech: it's literally part of the CS canon to learn about the limitations of hill-climbing as an optimization strategy. But on the other hand, I do understand; the power of just-so neoliberal narrative is immense, filling your mind with pat explanations, cooling off your brain into a poorly annealed solid mass.

The funny thing about corporate determinism is that it's not even true. Folks who say this have rarely run companies, otherwise they would know better. Loads of corporate decisions are made with a most tenuous link to profitability, and some probably even go against the profit interest. It's always easy to go in a known-profitable direction, but that doesn't mean it's the only way to go, nor that all the profitable directions are known.

Sometimes this question is framed in the language of "what MyDesignCo really cares about is good design; we're worried about how this measure might affect our output". I respect this question more, because it's more materialist (you can actually answer the question!), but I disagree with the premise. I don't think any company really cares about the product in a significant way. Take the design company as an example. What do you want on your tombstone: "She made good advertisements"??? Don't get me wrong, I like my craft, and I enjoy practicing it with my colleagues. But if on my tombstone they wrote "He worked for justice", and also if there were a heaven, I would be p OK with that. What I'm saying is, you start a company, you have an initial idea, you pivot, whatever, it doesn't matter in the end. What matters is your relationship with life on the planet, and that is the criterion you should use to evaluate what you do.

Beyond all that -- it's amazing how much wrong you can wrap up in a snarky hacker news one-liner -- beyond all that, the concern begs the question by assuming that a flat-wage arrangement is less profitable. People will mention any down-side they can but never an up-side.

possible flat-wage up-sides from a corporate perspective

With that in mind, let's consider some ways that a flat wage can actually improve the commercial fate of a company.

A company with a flat wage already has a marketing point that they can use to attract people that care about this sort of thing. It can make your company stand out from the crowd and attract good people.

The people you attract will know you're doing the flat-wage thing, and so will be predisposed to want to work together. This can increase productivity. It also eliminates some material sources of conflict between different roles in an organization. You would still need "human resources" people but they would need to spend less time on mitigating the natural money-based conflicts that exist in other organizations.

Another positive side relates to the ability of the company to make collective sacrifices. For example a company that is going through harder times can collectively decide not to raise wages or even to lower them, rather than fire people. Obviously this outcome depends on the degree to which people feel responsible for the organization, which is incomplete without a feeling of collective self-management as in a cooperative, but even in a hierarchical organization these effects can be felt.

Incidentally a feeling of "investment" in the organization is another plus. When you work in a company in which compensation depends on random factors that you can't see, you always wonder if you're being cheated out of your true value. If everyone is being paid the same you know that everyone's interest in improving company revenue is aligned with their own salary interest -- you can't gain by screwing someone else over.

limitations of a flat wage at improving justice

All that said, paying all workers/partners/employees the same hourly wage is not a panacea for justice. It won't dismantle patriarchy overnight. It won't stop domestic violence, and it won't stop the cops from killing people of color. It won't stop microaggressions or harassment in the workplace, and in some ways if there are feelings of resentment, it could even exacerbate them. It won't arrest attrition of marginalized people from the tech industry, and it won't fix hiring. Enacting the policy in a company won't fix the industry as a whole, even if all companies enacted it, as you would still have different wages at different companies. It won't fix the situation outside of the tech industry; a particularly egregious example being that in almost all places, cleaning staff are hired via subcontracts and not as employees. And finally, it won't resolve class conflict at work: the owner still owns. There are still pressures on the owner to keep the whole balance sheet secret, even if the human resources side of things is transparent.

All that said, these are mainly ways in which an equal wage policy is incomplete. A step in the right direction, on a justice level, but incomplete. In practice though the objections you get will be less related to justice and more commercial in nature. Let's take a look at some of them.

commercial challenges to a flat wage

Having everyone paid the same makes it extraordinarily difficult to hire people that are used to being paid on commission, like sales people. Sales people drive Rolexes and wear Mercedes. It is very, very tough to hire good sales people on salary. At my work we have had some limited success hiring, and some success growing technical folks into sales roles, but this compensation package will hinder your efforts to build and/or keep your sales team.

On the other hand, having the same compensation between sales and engineering does eliminate some of the usual sales-vs-product conflicts of interest.

Another point is that if you institute a flat-wage policy, you should expect to lose some fraction of your highly-skilled workers, as many of these are more highly paid. There are again some mitigations but it's still a reality. Perhaps more perniciously, you will have greater difficulties hiring senior people: you literally can't get into a bidding war with a competitor over a potential hire.

On the flip side, a flat salary can make it difficult to hire more junior positions. There are many theories here but I think that a company is healthy when it has a mix of experiences, that senior folks and junior folks bring different things to the table. But if your flat wage is higher than the standard junior wage, then your potential junior hires are now competing against more senior people -- internally it will be hard to keep a balance between different experiences.

Indeed junior workers that you already have are now competing at their wage level with potential hires that might be more qualified in some way. An unscrupulous management could fire those junior staff members and replace them with more senior candidates. An equal wage policy does not solve internal class conflicts; you need to have equal ownership and some form of workplace democracy for that.

You could sort people into pay grades, but in many ways this would formalize injustice. Marginalized people are by definition not equally distributed across pay grades.

Having a flat wage also removes a standard form of motivation, that your wage is always rising as you get older. It could be that after 5 years in a job, maybe your wages went up because the company's revenues went up, but they're still the same as a new hire's -- how do you feel about that? It's a tough question. I think an ever-rising wage has a lot of negative aspects, including decreasing the employability of older workers, but it's deeply rooted in tech culture at least.

Another point is motivation of people within the same cadre. Some people are motivated by bonuses, by performing relatively well compared to their peers. This wouldn't be an option in an organization with a purely flat wage. Does it matter? I do not know.

work with me tho

As the prophet Pratchett said, "against one perfect moment, the centuries beat in vain". There are some definite advantages to a flat wage within a company: it's concrete, it can be immediately enacted, it solves some immediate problems in a local way. Its commercial impact is unclear, but the force of narrative can bowl over many concerns in that department: what's important is to do the right thing. Everybody knows that!

As far as implementation, I see three-and-a-half ways this could happen in a company.

The first is that equal pay could be a founding principle of the company. This was mostly the case in the company I work for (and operate, and co-own equally with the other 40 or so partners). I wasn't a founder of the company, and the precise set of principles and policies has changed over the 15 years of the company's life, but it's more obvious for this arrangement to continue from a beginning than to change from the normal pay situation.

The second is, the change could come from the top down. Some CEOs get random brain waves and this happens. In this case, the change is super-easy to make: you proclaim the thing and it's done. As a person who has had to deal with cash-flow and payroll and balance sheets, I can tell you that this considerably simplifies HR from a management perspective.

The third is via collective action. This only works if workers are able to organize and can be convinced to be interested in justice in this specific way. In some companies, a worker's body might simply be able to negotiate this with management -- e.g., we try it out for 6 months and see. In most others you'd probably need to unionize and strike.

Finally, if this practice were more widespread in a sector, it could be that it just becomes "best practice" in some way -- that company management could be shamed into doing it, or it could just be the way things are done.

fin

Many of these points are probably best enacted in the context of a worker-owned cooperative, where you can do away with the worker-owner conflict at the same time. But still, they are worth thinking of in a broader context, and worth evaluating in the degree to which they work for (or against) justice in the workplace. But enough blathering from me today :) Happy hacking!

## March 22, 2016

### Carlos García Campos

#### WebKitGTK+ 2.12

We did it again, the Igalia WebKit team is pleased to announce a new stable release of WebKitGTK+, with a bunch of bugs fixed, some new API bits and many other improvements. I’m going to talk here about some of the most important changes, but as usual you have more information in the NEWS file.

## FTL

FTL JIT is a JavaScriptCore optimizing compiler that was developed using LLVM to do low-level optimizations. It’s been used by the Mac port since 2014 but we hadn’t been able to use it because it required some patches for LLVM to work on x86-64 that were not included in any official LLVM release, and there were also some crashes that only happened in Linux. At the beginning of this release cycle we already had LLVM 3.7 with all the required patches and the crashes had been fixed as well, so we finally enabled FTL for the GTK+ port. But in the middle of the release cycle Apple surprised us by announcing that they had the new FTL B3 backend ready. B3 replaces LLVM and it’s entirely developed inside WebKit, so it doesn’t require any external dependency. JavaScriptCore developers quickly managed to make B3 work on Linux-based ports and we decided to switch to B3 as soon as possible, to avoid making a new release with LLVM only to remove it in the next one. I’m not going to go into the technical details of FTL and B3, because they are very well documented and probably too boring for most people. The key point is that it improves overall JavaScript performance.

## Persistent GLib main loop sources

Another performance improvement introduced in WebKitGTK+ 2.12 has to do with main loop sources. WebKitGTK+ makes extensive use of the GLib main loop: it has its own RunLoop abstraction on top of the GLib main loop that is used by all secondary processes and most of the secondary threads as well, scheduling main loop sources to send tasks between threads. JavaScript timers, animations, multimedia, the garbage collector, and many other features are based on scheduling main loop sources. In most cases we are actually scheduling the same callback all the time, but creating and destroying the GSource each time. We realized that creating and destroying main loop sources caused an overhead with a significant impact on performance. In WebKitGTK+ 2.12 all main loop sources were replaced by persistent sources, which are normal GSources that are never destroyed (unless they are not going to be scheduled anymore). We simply use the GSource ready time to make them active/inactive when we want to schedule/stop them.
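The idea can be sketched with a toy model in plain Python. This is not the GLib API or WebKit's RunLoop code: `PersistentSource` and `ToyLoop` are hypothetical names invented for illustration. The only detail borrowed from GLib is the convention that a ready time of -1 means the source is never ready.

```python
class PersistentSource:
    """Toy stand-in for a source that is created once and then reused."""

    INACTIVE = -1  # mirrors GLib's convention: ready time -1 means "never ready"

    def __init__(self, callback):
        self.callback = callback
        self.ready_time = self.INACTIVE

    def schedule(self, now, delay):
        # Activating the source only updates its ready time.
        self.ready_time = now + delay

    def stop(self):
        # Deactivating keeps the object attached to the loop.
        self.ready_time = self.INACTIVE


class ToyLoop:
    """Minimal tick-based loop that dispatches sources whose time has come."""

    def __init__(self):
        self.sources = []
        self.now = 0

    def attach(self, source):
        self.sources.append(source)

    def advance(self, ticks):
        for _ in range(ticks):
            self.now += 1
            for source in self.sources:
                if source.ready_time != source.INACTIVE and source.ready_time <= self.now:
                    source.stop()       # one-shot: deactivate before dispatching
                    source.callback()


loop = ToyLoop()
fired = []
timer = PersistentSource(lambda: fired.append(loop.now))
loop.attach(timer)            # attached once, never destroyed

timer.schedule(loop.now, 2)   # first use
loop.advance(3)
timer.schedule(loop.now, 1)   # reused: no new object, no re-attach
loop.advance(1)
```

The same source object fires twice while the loop's source list never changes, which is the saving the paragraph above describes: the cost of scheduling becomes a single field update instead of an allocation, an attach and a destroy.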

## Overlay scrollbars

GNOME designers have been asking us to implement overlay scrollbars since they were introduced in GTK+, because WebKitGTK+ based applications didn’t look consistent with all other GTK+ applications. Since WebKit2, the web view is no longer a GtkScrollable, but it’s scrollable by itself using the native scrollbar appearance or the one defined in the CSS. This means we have our own scrollbars implementation that we try to render as close as possible to the native ones, and that’s why it took us so long to find the time to implement overlay scrollbars. But WebKitGTK+ 2.12 finally implements them and they are, of course, enabled by default. There’s no API to disable them, but we honor the GTK_OVERLAY_SCROLLING environment variable, so they can be disabled at runtime.

But the appearance was not the only thing that made our scrollbars inconsistent with the rest of the GTK+ applications, we also had a different behavior regarding the actions performed for mouse buttons, and some other bugs that are all fixed in 2.12.
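As a small illustration of the runtime switch, the sketch below builds an environment for launching an application with overlay scrollbars turned off. It assumes that the value `0` is what disables overlay scrolling, as in GTK+ 3 itself, and the helper name `env_without_overlay_scrollbars` is made up for this example.

```python
import os


def env_without_overlay_scrollbars(base_env=None):
    """Return a copy of the environment with overlay scrolling disabled.

    Assumption: GTK_OVERLAY_SCROLLING=0 disables overlay scrollbars.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GTK_OVERLAY_SCROLLING"] = "0"
    return env


# Hypothetical usage; the application name is only an example:
#   subprocess.Popen(["epiphany"], env=env_without_overlay_scrollbars())
```

Copying the environment instead of mutating `os.environ` keeps the setting scoped to the launched process.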

## The NetworkProcess is now mandatory

The network process was introduced in WebKitGTK+ 2.4 to be able to use multiple web processes. We had two different paths for loading resources depending on the process model being used. When using the shared secondary process model, resources were loaded by the web process directly, while when using the multiple web process model, the web processes sent the requests to the network process to be loaded. Maintaining these two different paths was not easy, with some bugs happening only when using one model or the other, and the network process also gained features like the disk cache that were not available in the web process. In WebKitGTK+ 2.12 the non network process path has been removed, and the shared single process model has become the multiple web process model with a limit of 1. In practice it means that a single web process is still used, but networking happens in the network process.

## NPAPI plugins in Wayland

I have read in many bug reports and mailing lists that NPAPI plugins will not be supported in Wayland, so things like http://extensions.gnome.org will not work. That’s not entirely true. NPAPI plugins can be windowed or windowless. Windowed plugins are those that use their own native window for rendering and handling events, implemented in X11 based systems using the XEmbed protocol. Since Wayland doesn’t support XEmbed and doesn’t provide an alternative either, it’s true that windowed plugins will not be supported in Wayland. Windowless plugins don’t require any native window, they use the browser window for rendering and events are handled by the browser as well, using an X11 drawable and X events in X11 based systems. So, it’s also true that windowless plugins having a UI will not be supported by Wayland either. However, not all windowless plugins have a UI, and there’s nothing X11 specific in the rest of the NPAPI plugins API, so there’s no reason why those can’t work in Wayland. And that’s exactly the case of http://extensions.gnome.org, for example. In WebKitGTK+ 2.12 the X11 implementation of NPAPI plugins has been factored out, leaving the rest of the API implementation common and available to any window system used. That made it possible to support windowless NPAPI plugins with no UI in Wayland, and any other non X11 system, of course.

## New API

And as usual we have completed our API with some new additions:

## March 14, 2016

### Javier Muñoz

#### Requester Pays Bucket goes upstream in Ceph

The last Requester Pays Buckets patches went upstream in Ceph some days ago. This new feature is available in the master branch now, and it will be part of the next Ceph Jewel release.

In S3, this feature is used to configure buckets in such a way that the user who requests the content pays the transfer fee.

In this post I will introduce the feature and explain how the concept maps to Ceph and how it works under the hood.

Understanding the feature

The Requester Pays Buckets feature originates in the Amazon S3 storage. It is part of the Amazon business model related to the Cloud storage.

In S3, the bucket owners pay for all Amazon S3 storage and data transfer costs associated with their buckets. This approach makes sense to cover the use cases where the bucket owners use the service to host/consume the content and/or they want to share the content with some authenticated users.

On the other hand, a relevant number of use cases use S3 to share a huge amount of content, requiring some kind of option to balance the costs among the different content consumers in that bucket. This option is known as 'Requester Pays Buckets'.

Bear in mind this feature becomes critical when the content is fairly popular and the bucket owner receives many requests. This is the typical use case among global content distributors. In this case, the transfer fees may become a major issue.

Mapping the feature to Ceph

As mentioned, this feature comes from the S3 storage where it is used to balance costs related to data transfer in buckets. When enabled, the requester pays for the data transfer and the request, although the bucket owner still pays for the data storage.

The S3 algorithm also charges the bucket owner for the request under the following conditions:

• The requester doesn't tag the request as 'Requester Pays Bucket'
• The request authentication fails
• The request is anonymous
• The request is a SOAP request

Ceph does not implement any billing and account management service oriented to charging users, so the feature cannot be ported as is.

At this point we decided to implement the mechanisms behind this feature while keeping the billing policies out of Ceph. This way you can find the proper support to reproduce the original Amazon S3 billing behaviour, although you will be free to wrap this support with different and more flexible billing policies if needed.

To keep the compatibility with the tools in the S3 ecosystem, the S3 interface of this feature is in place. The Boto library was used to test the proper behaviour.

In the backend, the usage logging was extended to accommodate the new feature. Usage log records are no longer listed by bucket owner. They are now listed by bucket user, where this user may or may not be the owner.

This change required a new way to index the records in the usage logging although it doesn't break the compatibility with the previous Ceph versions. Bear in mind the old records are displayed in the new format.

There are two notable differences between the S3 and the RGW S3 algorithms. The S3 algorithm charges the bucket owner for the request if the requester doesn't tag the request as 'Requester Pays Bucket' or the request authentication fails. In the case of RGW S3 both cases are logged under the requester instead of the owner.
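The rules above can be summarised as a small decision function. This is only a sketch of the accounting logic as described in this post, not code from Ceph or Amazon: the `Request` class and `accounted_to` function are hypothetical, and it assumes that anonymous and SOAP requests are treated the same way by both backends, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical description of a request against a Requester Pays bucket."""
    requester: str
    tagged_requester_pays: bool = True  # request carries the Requester Pays tag
    authenticated: bool = True          # False when authentication fails
    anonymous: bool = False
    soap: bool = False


def accounted_to(request, owner, backend="s3"):
    """Return the identity the request is accounted to."""
    if backend == "s3":
        owner_pays = (
            not request.tagged_requester_pays
            or not request.authenticated
            or request.anonymous
            or request.soap
        )
    elif backend == "rgw":
        # Untagged requests and authentication failures are logged under
        # the requester in RGW S3. Anonymous/SOAP handling is an assumption.
        owner_pays = request.anonymous or request.soap
    else:
        raise ValueError(f"unknown backend: {backend}")
    return owner if owner_pays else request.requester


untagged = Request(requester="alice", tagged_requester_pays=False)
print(accounted_to(untagged, owner="bob", backend="s3"))
print(accounted_to(untagged, owner="bob", backend="rgw"))
```

The two calls at the end show the first of the two differences: the same untagged request lands on the owner in S3 and on the requester in RGW S3.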

The S3 REST API

The Ceph RGW S3 REST API was extended to support the following use cases:

The Ceph RGW S3 REST API implements the same behaviour and semantics as S3. This is needed to support the S3 tooling ecosystem in a transparent way.
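For reference, this is roughly what travels over the wire in the Amazon S3 version of the API, built here with the Python standard library only. In S3 a bucket is switched by sending a `RequestPaymentConfiguration` document to the bucket's `requestPayment` subresource, and requesters acknowledge the charge with the `x-amz-request-payer` header. The helper name `request_payment_body` is made up for this sketch, and no request is actually sent.

```python
import xml.etree.ElementTree as ET

S3_NS = "http://s3.amazonaws.com/doc/2006-03-01/"

# Header a requester adds to requests against a Requester Pays bucket.
REQUEST_PAYER_HEADER = {"x-amz-request-payer": "requester"}


def request_payment_body(payer):
    """Build the XML body for a PUT on <bucket>?requestPayment."""
    if payer not in ("Requester", "BucketOwner"):
        raise ValueError("payer must be 'Requester' or 'BucketOwner'")
    root = ET.Element("RequestPaymentConfiguration", xmlns=S3_NS)
    ET.SubElement(root, "Payer").text = payer
    return ET.tostring(root, encoding="unicode")


body = request_payment_body("Requester")
print(body)
```

Client libraries such as Boto wrap these details, so in practice you rarely build the document by hand.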

Some examples with Python and Boto

You can use the following Python scripts to set, retrieve and download objects using the Requester Pays Buckets feature in Ceph. Those examples require Boto.

The usage log

The usage log is the place where the Requester Pays Bucket information is aggregated. There are three fields related to the feature ('user', 'owner' and 'payer').

The 'user' (or 'requester') is the client credential accessing the bucket content.

The 'owner' is the client credential creating the bucket.

The 'payer' is the client credential paying the data transfer. If this field doesn't exist the 'owner' is the client credential paying the data transfer.
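The fallback rule for 'payer' fits in one line. The sketch below assumes a usage log entry is available as a plain dictionary with the three fields named above. The entries themselves are invented, and the real output of the usage log has more structure than this.

```python
def effective_payer(entry):
    """Return who pays the data transfer for one usage log entry."""
    # If there is no 'payer' field, the owner pays.
    return entry.get("payer", entry["owner"])


# Hypothetical entries, for illustration only.
entries = [
    {"user": "alice", "owner": "bob", "payer": "alice"},
    {"user": "carol", "owner": "bob"},
]
payers = [effective_payer(e) for e in entries]
print(payers)
```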

The new virtual error buckets

A virtual error bucket is a new abstraction to log the usage on nonexistent resources (404 Not Found). All virtual error buckets have the same name ('-').

If you look at the 'ops' and 'successful_ops' fields under the virtual error buckets, you will see that the second one is always zero.

Each user has its own virtual error bucket to collect 404 errors. Ceph adds a user's virtual error bucket when that user's first 404 error occurs. The virtual error buckets live in the usage logging only.
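A toy model of this bookkeeping might look as follows. It is not the RGW implementation: the `record` function and the dictionary layout are invented for illustration, and it simplifies by sending every 404 response to the virtual error bucket.

```python
from collections import defaultdict

ERROR_BUCKET = "-"  # name shared by all virtual error buckets


def record(usage, user, bucket, status):
    """Account one request in a per-user, per-bucket usage table."""
    name = ERROR_BUCKET if status == 404 else bucket
    # The virtual error bucket appears on the user's first 404.
    stats = usage[user].setdefault(name, {"ops": 0, "successful_ops": 0})
    stats["ops"] += 1
    if 200 <= status < 300:
        stats["successful_ops"] += 1


usage = defaultdict(dict)
record(usage, "alice", "photos", 200)
record(usage, "alice", "no-such-bucket", 404)
record(usage, "alice", "also-missing", 404)
print(dict(usage))
```

Since only failed lookups ever land in the '-' bucket, its 'successful_ops' counter never moves, which matches the observation above.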

Wrap up

With this new feature in place, Ceph implements the required support to know in detail who is accessing the RGW S3 buckets (owners vs authenticated users).

The feature brings in new ways to understand and track the bucket content in massive use cases where the costs may be assigned to different users.

The new usage logging also contains more detailed information to be used in regular reports to customers ('owner vs payer' categories, ops on virtual error buckets, etc).

Acknowledgments

My work in Ceph is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ceph development team. Thank you guys!

## March 13, 2016

### Michael Catanzaro

#### Do you trust this application?

Much of the software you use is riddled with security vulnerabilities. Anyone who reads Matthew Garrett knows that most proprietary software is a lost cause. Some Linux advocates claim that free software is more secure than proprietary software, but it’s an open secret that tons of popular desktop Linux applications have many known, unfixed vulnerabilities. I rarely see anybody discuss this, as if it’s taboo, but it’s been obvious to me for a long time.

Usually vulnerabilities go unreported simply because nobody cares to look. Here’s an easy game: pick any application that makes HTTP connections — anything stuck on an old version of WebKit is a good place to start — and look for the following basic vulnerabilities:

• Failure to use TLS when required (GNOME Music, GNOME Weather; note these are the only apps I mention here that do not use WebKit). This means the application has no security.
• Failure to perform TLS certificate verification (Shotwell and Pantheon Photos). This means the application has no security against active attackers.
• Failure to perform TLS certificate verification on subresources (Midori and Xombrero, Liferea). As sites usually send JavaScript in subresources, this means active attackers can get total control of the page by changing the script, without being detected (update: provided JavaScript is enabled). (Regrettably, Epiphany prior to 3.14.0 was also affected by this issue.)
• Failure to perform TLS certificate verification before sending HTTP headers (private Midori bug, Banshee). This leaks secure cookies, usually allowing attackers full access to your user account on a website. It also leaks the page you’re visiting, which HTTPS is supposed to keep private. (Update: Regrettably, Epiphany prior to 3.14.0 was affected by this issue. Also, the WebKit 2 API in WebKitGTK+ prior to 2.6.6, CVE-2015-2330.)
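For contrast, here is a minimal sketch of the safe ordering using Python's standard `ssl` module, which is unrelated to the applications listed above. The default context requires a valid certificate and checks the hostname, and the handshake completes inside `wrap_socket`, so a verification failure raises before the caller has had any chance to send HTTP headers or cookies. The function name `open_verified` is made up for this example.

```python
import socket
import ssl

# Default context: certificate verification and hostname checking are on.
context = ssl.create_default_context()


def open_verified(host, port=443, timeout=10):
    """Return a TLS socket, or raise before any application data is sent."""
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        # The handshake, including certificate and hostname verification,
        # happens here. On failure ssl.SSLCertVerificationError is raised.
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise


# Only after open_verified() returns is it safe to write request headers.
```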

Except where noted, the latest releases of all of the applications listed above are still vulnerable at the time of this writing, even though almost all of these bugs were reported long ago. With the exception of Shotwell, nobody has fixed any of these issues. Perhaps nobody working on the project cares to fix it, or perhaps nobody working on the project has the time or expertise to fix it, or perhaps nobody is working on the project anymore at all. This is all common in free software.

In the case of Shotwell, the issue has been fixed in git, but it might never be released because nobody works on Shotwell anymore. I informed distributors of the Shotwell vulnerability three months ago via the GNOME distributor list, our official mechanism for communicating with distributions, and advised them to update to a git snapshot. Most distributions ignored it. This is completely typical; to my knowledge, the stable releases of all Linux distributions except Fedora are still vulnerable.

If you want to play the above game, it should be very easy for you to add to my list by checking only popular desktop software. A good place to start would be to check if Liferea or Xombrero (supposedly a security-focused browser) perform TLS certificate verification before sending HTTP headers, or if Banshee performs verification on subresources, on the principle that vulnerable applications probably have other related vulnerabilities. (I did not bother to check.)

On a related note, many applications use insecure dependencies. Tons of popular GTK+ applications are stuck on an old, deprecated version of WebKitGTK+, for example. Many popular KDE applications use QtWebKit, which is old and deprecated. These deprecated versions of WebKit suffer from well over 100 remote code execution vulnerabilities fixed upstream that will probably never be backported. (100 is a lowball estimate; I would be unsurprised if the real number for QtWebKit was much, much higher.)

I do not claim that proprietary software is generally more secure than free software, because that is absolutely not true. Proprietary software vendors, including big name corporations that you might think would know better, are still churning out consumer products based on QtWebKit, for example. (This is unethical, but most proprietary software vendors do not care about security.) Not that it matters too much, as proprietary software vendors rarely provide comprehensive security updates anyway. (If your Android phone still gets updates, guess what: they’re superficial.) A few prominent proprietary software vendors really do care about security and do good work to keep their users safe, but they are rare exceptions, not the rule.

It’s a shame we’re not able to do better with free software.

## March 12, 2016

### Michael Catanzaro

#### Do you trust this website?

TLS certificate validation errors are much less common on today’s Internet than they used to be, but you can still expect to run into them from time to time. Thanks to a decade of poor user interface decisions by web browsers (only very recently fixed in major browsers), users do not understand TLS and think it’s OK to bypass certificate warnings if they trust the site in question.

This is completely backwards. You should only bypass the warning if you do not trust the site.

The TLS certificate does not exist to state that the site is somehow trustworthy. It exists only to state that the site is the site you think it is: to ensure there is no man in the middle (MITM) attacker. If you are visiting https://www.example.com and get a certificate validation error, that means that even though your browser is displaying the URL https://www.example.com, there’s zero reason to believe you’re really visiting https://www.example.com rather than an attack site. Your browser can tell the difference, and it’s warning you. (More often, the site is just broken, or “misconfigured” if you want to be generous, but you and your browser have no way to know that.)

If you do not trust the site in question (e.g. you do not have any user account on the site), then there is not actually any harm in bypassing the warning. You don’t trust the site, so you do not care if a MITM is changing the page, recording your passwords, sending fake data to the site in your name, or whatever else.

But if you do trust the site, this error is cause to freak out and not continue, because it gives you strong reason to believe there is a MITM attacker. Once you click continue, you should assume the MITM has total control over your interaction with the trusted website.

I will pick on Midori for an example of how bad design can confuse users:

As you can see from the label, Midori has this very wrong. Users are misled into continuing if they trust the website: the very situation in which it is unsafe to continue.

Firefox and Chrome handle this much better nowadays, but not perfectly. Firefox says “Your connection is not secure” while Chrome says “Your connection is not private.” It would be better to say: “This doesn’t look like the real www.example.com.”

## February 29, 2016

### Javier Muñoz

#### AWS Signature Version 4 goes upstream in Ceph

The first stable AWS4 implementation in Ceph went upstream some days ago. It is now available in the master branch and it will ship with the next Ceph release, Jewel, as planned.

I will use this blog post to talk about this new feature shipping in Ceph Jewel and the current effort by Outscale and Igalia to raise the level of compatibility between the Ceph RGW S3 and Amazon S3 interfaces.

In detail, I will describe the signing process in AWS4, how it works in Ceph RGW, the current coverage and the next steps in the pipeline around this authentication algorithm.

S3 request authentication algorithms

If you are not familiar with request authentication, regions, endpoints, credential scopes, etc. in Amazon S3, you may want to read one of my previous posts on the subject. It offers a simple and quick overview while introducing the concepts and terms I will use in this blog post. A longer, lower-level technical reference is available in the AWS documentation too. I will use the latter to drive and compare the implementations at a level that is reasonable for everybody.

Amazon S3 provides storage through web services interfaces (REST, SOAP and BitTorrent). By the way, Ceph RGW implements a compatible S3 REST interface in order to be interoperable with the Amazon S3 REST ecosystem (tools, libraries, third-party services and so on).

This S3 REST interface works over the Hypertext Transfer Protocol (HTTP) with the same HTTP verbs (GET, POST, PUT, DELETE, etc) that web browsers use to retrieve web pages and to send data to remote servers.

There are two kinds of Amazon S3 RESTful interactions: authenticated and anonymous. Request authentication is implemented by signing these requests with an authentication algorithm. The public specification of the Amazon signing process describes two authentication algorithms currently in use: AWS2 and AWS4.

AWS2 and AWS4 in Ceph

The new Signature Version 4 (AWS4) is the current AWS signing protocol. It improves significantly on the previous Signature Version 2 (AWS2). Consider these strengths of AWS4 over AWS2:

• To sign a message, the signer uses a signing key that is derived from her secret access key rather than using the secret access key itself
• The signer derives the signing key from the credential scope, which means that she doesn't need to include the key itself in the request
• Each signing task requires the credential scope

The benefits of using AWS4 in Ceph are clear:

• Verification of the identity of the requester via access key ID and secret access key
• Request tampering prevention while the request is in transit
• Replay attack protection within 15 minutes of the timestamp in the request
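The replay protection in the last item boils down to a freshness check on the request timestamp. The following is a minimal sketch of such a check, assuming the timestamp arrives in the ISO 8601 basic format used by AWS4 (for example `20160229T115000Z`). The function name is hypothetical and this is not the Ceph RGW code.

```python
from datetime import datetime, timedelta, timezone

# Requests older (or newer) than this relative to the server clock are rejected.
WINDOW = timedelta(minutes=15)


def within_window(amz_date: str, now: datetime) -> bool:
    """Return True if the request timestamp is within 15 minutes of `now`."""
    stamp = datetime.strptime(amz_date, "%Y%m%dT%H%M%SZ")
    stamp = stamp.replace(tzinfo=timezone.utc)
    return abs(now - stamp) <= WINDOW
```

Since the timestamp is covered by the signature, an attacker cannot simply refresh it when replaying a captured request.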

Authentication methods

The signing process can express authentication information by using one of the following methods:

• HTTP Authorization header. The most common method of authenticating. The signature calculations vary depending on the method you choose to transfer the request payload; 'transfer payload in a single chunk' vs 'transfer payload in multiple chunks (chunked upload)'
• Query string parameters. It uses a query string to express a request entirely in a URL together with the authorization information. This type of URL is also known as a presigned URL.

The current Ceph AWS4 implementation supports all authentication methods except transferring the payload in multiple chunks (chunked upload). It is in the pipeline though.

Lacking chunked upload does not impact Ceph RGW performance. The server side always uses a streaming-hash approach to compute the signature.
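To illustrate what a streaming-hash approach means in general, here is a small Python sketch (not the RGW implementation; the function name is hypothetical). The payload hash is updated piece by piece as data arrives, so the whole body never has to sit in memory, and the result is identical to hashing the payload in one go.

```python
import hashlib
from typing import Iterable


def streaming_sha256(chunks: Iterable[bytes]) -> str:
    """Hash a payload incrementally, one received piece at a time."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)  # only the running state is kept in memory
    return digest.hexdigest()
```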

Computing a Signature

The idea behind computing a signature is to apply a cryptographic hash function to the request, and then use the hash value, some other values from the request, and a secret access key to create a signed hash. That is the signature.

Depending on the kind of authentication method used and the concrete request the algorithm requires different inputs. As one example to illustrate one of the authentication paths we can explore the required steps to craft a signature in the HTTP Authorization header case.

As you can see it computes a canonical request, a string to sign and a signing key as part of the process.

The final signature is the result of a keyed hash over the string to sign, using the signing key as the key. The keyed-hash message authentication code (HMAC) used throughout the signature computation is HMAC-SHA256.
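The steps can be sketched in a few lines of Python using only the standard library. This follows the publicly documented AWS4 scheme (signing key derived through chained HMAC-SHA256 over date, region, service and the literal `aws4_request`). The function names are hypothetical, the canonical request is taken as a ready-made string, and this is an illustration rather than the Ceph RGW code.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the signing key from the secret key and the credential scope."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def string_to_sign(canonical_request: str, timestamp: str, scope: str) -> str:
    """Algorithm name, timestamp, credential scope and canonical request hash."""
    hashed = hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()
    return "\n".join(["AWS4-HMAC-SHA256", timestamp, scope, hashed])


def sign(secret_key: str, canonical_request: str, timestamp: str,
         region: str, service: str = "s3") -> str:
    """Return the hex signature for a request stamped `timestamp`."""
    date = timestamp[:8]  # YYYYMMDD part of e.g. 20160229T000000Z
    scope = f"{date}/{region}/{service}/aws4_request"
    key = derive_signing_key(secret_key, date, region, service)
    to_sign = string_to_sign(canonical_request, timestamp, scope)
    return hmac.new(key, to_sign.encode("utf-8"), hashlib.sha256).hexdigest()
```

The secret access key never leaves the signer: only the resulting signature and the credential scope travel with the request, and the server repeats the same derivation to verify it.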

Default configuration in Ceph Jewel

Ceph Jewel is planned to ship with AWS2 and AWS4 enabled by default. You will not need to configure any extra switch to authenticate with AWS2 or AWS4.

Region constraints

In Amazon S3 the region enforces the allowed authentication algorithms.

In the case of Ceph RGW the code doesn't implement any kind of constraint related to the region names.

The next steps in the pipeline

The chunked upload feature, which transfers the payload in multiple chunks, is definitely part of the pipeline.

Some kind of integration with zones/regions to provide 'signature binding' could make sense too. It would help to enforce auth policies and so on.

Acknowledgments

My work in Ceph is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the Ceph development team. Thank you guys!

## February 26, 2016

### Xabier Rodríguez Calvar

#### Über latest Media Source Extensions improvements in WebKit with GStreamer

In this post I am going to talk about the implementation of the Media Source Extensions (known as MSE) in the WebKit ports that use GStreamer. These ports are WebKitGTK+, WebKitEFL and WebKitForWayland, though only the latter has the latest work-in-progress implementation. Of course we hope to upstream WebKitForWayland soon and with it, this backend for MSE and the one for EME.

My colleague Enrique at Igalia wrote a post about this about a week ago. I recommend you read it before continuing with mine to understand the general picture and some of the issues that I managed to fix in that implementation. Come on, go and read it, I’ll wait.

One of the challenges here is something a bit unnatural in the GStreamer world. We have to process the stream information and then make some metadata available to the JavaScript app before playing instead of just pushing everything to a playing pipeline and being happy. For this we created the AppendPipeline, which processes the data and extracts that information and keeps it under control for the playback later.

The idea of our AppendPipeline is to put a data stream into it and get it processed at the other side. It has an appsrc, a demuxer (qtdemux currently) and an appsink to pick up the processed data. One tricky part of the spec is that when you append data into the SourceBuffer, that operation has to block it, rejecting with errors any other append operation while the current one is ongoing, and signal when it finishes. Our main issue with this is that the appends can contain any amount of data, from headers and buffers to only headers or just partial headers. Basically, the information can be partial.

First I’ll present again Enrique’s AppendPipeline internal state diagram:

First let me explain the easiest case, which is headers and buffers being appended. As soon as the process is triggered, we move from Not started to Ongoing. Then, as the headers are processed, we get the pads at the demuxer and begin to receive buffers, which makes us move to Sampling. Then we have to detect that the operation has ended and move to Last sample and then again to Not started. If we have received only headers, we will not move to Sampling because we will not receive any buffers, but we still have to detect this situation and be able to move to Data starve and then again to Not started.
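The transitions just described can be written down as a small table. This is only a Python model of the states named in this post. The event names (`append`, `buffer`, `end`, `reset`) are my own labels for illustration and do not come from the WebKit code.

```python
from enum import Enum, auto


class State(Enum):
    NOT_STARTED = auto()
    ONGOING = auto()
    SAMPLING = auto()
    LAST_SAMPLE = auto()
    DATA_STARVE = auto()


# (current state, event) -> next state
TRANSITIONS = {
    (State.NOT_STARTED, "append"): State.ONGOING,
    (State.ONGOING, "buffer"): State.SAMPLING,     # first buffer arrives
    (State.SAMPLING, "buffer"): State.SAMPLING,    # more buffers
    (State.SAMPLING, "end"): State.LAST_SAMPLE,    # headers and buffers case
    (State.ONGOING, "end"): State.DATA_STARVE,     # headers only, no buffers
    (State.LAST_SAMPLE, "reset"): State.NOT_STARTED,
    (State.DATA_STARVE, "reset"): State.NOT_STARTED,
}


def run(events, state=State.NOT_STARTED):
    """Feed a sequence of events through the table and return the final state."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {event!r} in state {state.name}")
        state = TRANSITIONS[key]
    return state
```

The hard part in practice is producing the `end` event reliably, which is what the rest of this post is about.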

Our first approach was using two different timeouts: one to detect that we should move from Ongoing to Data starve if we did not receive any buffer, and another to move from Sampling to Last sample if we stopped receiving buffers. This solution worked but it was a bit racy, so we tried to find a less error-prone solution.

We then tried to use custom downstream events injected from the source. When they were received at the sink, we could move from Sampling to Last sample or, if only headers were injected and the pads were created, from Ongoing to Data starve. It took some time and several iterations to fine-tune this, and we managed to solve all cases but one: receiving only partial headers and no buffers.

If the demuxer received partial headers and no buffers it stalled and we were not receiving any pads or any event at the output so we could not tell when the append operation had ended. Tim-Philipp gave me the idea of using the need-data signal on the source that would be fired when the demuxer ran out of useful data. I realized then that the events were not needed anymore and that we could handle all with that signal.

The need-data signal is sometimes fired when the pipeline is linked, and also when the demuxer finishes processing data, regardless of whether the stream contains partial headers, complete headers or headers and buffers. It works perfectly once we are able to disregard that first signal we sometimes receive. To solve that, we use a pad probe to ensure that at least one buffer has left the appsrc: if we receive the signal before any buffer was detected at the probe, we do not take it to mean that the append has finished. Otherwise, if we have already seen a buffer at the probe, we can consider that any need-data signal means that the processing has ended, and we can tell the JavaScript app that the append process has ended.
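The decision logic is small enough to model in plain Python. The class and method names below are hypothetical and no GStreamer API is involved; the two callbacks simply stand in for the pad probe on the appsrc and the need-data signal handler.

```python
class AppendTracker:
    """Decides whether a need-data signal marks the end of an append."""

    def __init__(self) -> None:
        self.buffer_seen = False

    def start_append(self) -> None:
        # A new append begins: nothing has left the source yet.
        self.buffer_seen = False

    def on_probe_buffer(self) -> None:
        # Called from the pad probe when a buffer leaves the source.
        self.buffer_seen = True

    def on_need_data(self) -> bool:
        """Return True if this signal means the append has finished."""
        if not self.buffer_seen:
            return False  # spurious signal fired when the pipeline was linked
        self.buffer_seen = False
        return True
```

In the real implementation both callbacks arrive on GStreamer threads, so the equivalent of this state would have to be protected or moved to the main thread.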

Both the need-data signal and the probe information come in GStreamer internal threads, so we could use mutexes to overcome any race conditions. We thought, though, that deferring the operations to the main thread through the pipeline bus was a better idea that would create fewer issues with race conditions or deadlocks.

To finish, I would like to give some good news about performance. We mainly use the YouTube conformance tests to ensure our implementation works, and I can proudly say that these changes cut the execution time in half!

That’s all folks!

## February 25, 2016

### Javier Muñoz

#### Ceph, a free unified distributed storage system

Over the last few months I have been working on Ceph, a free unified distributed storage system, in order to implement some missing features in the RADOS gateway, help some customers with Ceph clusters in production and fix bugs.

This effort is part of my daily work here at Igalia on upstream projects. As you may know, Igalia works in the Cloud arena providing services for development, deployment and orchestration around interesting open projects.

Together with Ceph (storage) we are also working upstream on Qemu (compute) and Snabb (networking). All these projects are at the core of building private and public clouds with Open Source.

My goal with this first post is to introduce Ceph in a simple way that makes this marvelous piece of software easy to understand. I will cover the design and main innovations in Ceph together with its architecture, major use cases and relationship with OpenStack (a well-known free and open-source software platform for cloud computing).

Understanding Ceph

Ceph is a free software storage platform, based on object storage, that stores data on a single distributed computer cluster. I would say this definition captures the essence of Ceph perfectly. It is also the foundation for understanding its innovations, the architecture and the performance/scalability factors in Ceph.

Let's start with object storage. Object storage is a storage architecture that manages data as objects, as opposed to other storage architectures like file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors and tracks. Each object typically includes the data, a variable amount of metadata, and a globally unique identifier.

On top of this object storage, Ceph provides a block interface (RBD), an object interface (RGW) and a filesystem interface (CephFS).

If we add a smart cluster approach to the previous design, we get a reliable object storage service that can scale to many thousands of devices. This reliable object storage service is known as RADOS (Reliable Autonomic Distributed Object Store) in the current Ceph implementation.

But what is a 'smart cluster approach' here? At the petabyte and exabyte scale, systems are necessarily dynamic. They are built incrementally, they grow and contract with the deployment of new storage and decommissioning of old devices, devices fail and recover on a continuous basis, and large amounts of data are created and destroyed. RADOS takes care of a consistent view of the data distribution and consistent read and write access to data objects.

RADOS also provides storage nodes with complete knowledge of the distribution of data in the system, so devices can act semi-autonomously, using peer-to-peer-like protocols to self-manage data replication, participate in failure detection, and respond to device failures and the resulting changes in the distribution of data by replicating or migrating data objects.

If we consider the minimal configuration together with the basic components needed to set up a RADOS system, we will have a set of object storage daemons (OSDs) and a small group of monitors (MONs) responsible for managing OSD cluster membership.

In Ceph this OSD cluster membership requires a cluster map. This cluster map specifies cluster membership, device state and the mapping of data objects to devices. The data distribution is specified first by mapping objects to placement groups (PGs) and then mapping each PG onto a set of devices. The algorithm taking care of these steps is known as CRUSH (Controlled Replication Under Scalable Hashing).
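To make the two-step mapping concrete, here is a toy sketch in Python. It is not CRUSH: as a stand-in it uses plain rendezvous (highest-random-weight) hashing and ignores weights, failure domains and the cluster map hierarchy. All names are hypothetical. What it does share with the real thing is the key property that any party knowing the list of devices can compute an object's location without asking a central table.

```python
import hashlib


def _hash(*parts) -> int:
    """Stable 64-bit hash of the given parts."""
    data = "/".join(str(p) for p in parts).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Step 1: map an object name to a placement group."""
    return _hash(object_name) % pg_num


def pg_to_osds(pg: int, osds: list, replicas: int) -> list:
    """Step 2: map a placement group onto `replicas` distinct devices."""
    ranked = sorted(osds, key=lambda osd: _hash(pg, osd), reverse=True)
    return ranked[:replicas]


def locate(object_name: str, pg_num: int, osds: list, replicas: int = 3) -> list:
    """Compute where an object lives from its name and the device list."""
    return pg_to_osds(object_to_pg(object_name, pg_num), osds, replicas)
```

For example, `locate("my-photo.jpg", 128, ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"])` returns the same three devices on every client that holds the same device list.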

With this information in mind we may consider two major innovations in Ceph RADOS:

• The CRUSH algorithm. Ceph clients and Ceph OSD daemons compute object locations (with a hashing function) instead of having to depend on a central lookup table
• Smart daemons. Ceph OSD daemons and Ceph clients are cluster aware. This enables OSDs to interact directly with other OSDs and MONs. Ceph clients interact with OSDs directly.

Both items add significant intelligence to the solution to avoid bottlenecks and, at the same time, pursue hyperscale at the petabyte and exabyte scale.

At this point we should have enough information to understand the raw Ceph architecture. Let's have a look at the usual block diagram for Ceph:

• RGW. A web services gateway for object storage, compatible with S3 and Swift
• RBD. A reliable, fully distributed block device with cloud platform integration
• CEPHFS. A distributed file system with POSIX semantics and scale-out metadata management
• LIBRADOS. A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
• RADOS. A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

Mapping out the major components involved under the hood and their interactions gives a more detailed version of this architecture:

The OpenStack basics

Although this is an introductory post on Ceph, I will briefly describe OpenStack and its relationship with Ceph. It will be useful later.

Ceph may be used alone but some of its most interesting use cases take place as part of OpenStack. A quick overview of OpenStack will be useful to understand how the OpenStack and Ceph components work together to provide reliable and scalable storage.

The current stable release of OpenStack is 'Liberty' and it includes 17 components (compute, image services, object store, etc.). All those components have well-known code names (Nova, Glance, Swift, etc.).

The next picture captures a very high-level abstraction of OpenStack:

As you can see, Glance (VM image manager) and Cinder (block storage) are two core services in the solution.

As mentioned, the previous picture shows a simple view of OpenStack. A more accurate diagram, together with the relationships among the services, is available in the next picture for 'Folsom', an earlier release (2012).

While OpenStack evolves and includes new services, this 'Folsom' picture should be good enough to introduce the services related to storage and the level of complexity of OpenStack.

So the storage services in place are Swift (object store service), Glance (image service) and Cinder (block storage service).

Those services work in tandem to cover the general and specific requirements for storage in OpenStack.

Using Ceph in OpenStack

The main integration points between OpenStack and Ceph are the object and block device interfaces.

The RADOS gateway (RGW) and the RADOS block device (RBD) interfaces are used to provide the required storage to 5 services (Keystone, Swift, Cinder, Glance and Nova).

It is worth mentioning that the compute service (Nova) interfaces with the RBD layer via a hypervisor. An open source hypervisor that works like a charm with Ceph is Qemu/KVM. It uses librbd and librados.

Another component worth mentioning in the stack is libvirt. OpenStack uses libvirt to configure Qemu/KVM properly.

Ceph RBD currently dominates the choice of Cinder drivers, as stated in the sixth public survey of OpenStack users (page 31).

The physical deployment of Ceph and OpenStack

Setting up and operating a reliable and scalable storage cluster is always demanding. It requires careful planning across many different aspects. Some of these critical decisions are related to the cluster capacity (RAM, disks, number of nodes, use profiles, etc.).

Although it is always possible to go with your own custom configuration, some hardware providers offer several standard configurations.

As a random and arbitrary example, we can have a look at the HPE Helion portfolio. This set of solutions is a mix of open-source software and integrated systems for enterprise cloud computing.

The next picture shows the physical space required and how it compares to the different logical components in the architecture.

The new and old use cases

The production of data is expanding at an astonishing pace. Two major drivers in this rapid growth of global data are the analog-to-digital switch (software is everywhere) and the rapid increase in data generation by individuals and companies.

The new use cases related to storage nowadays are radically different from those of a few years ago. These new use cases are all about storing and retrieving unstructured data like photos, videos and social media at massive scale. All this stuff requires real-time analytics and reporting together with efficient processing.

To meet these requirements, some companies are extending/migrating their current datacenters to support software-defined approaches. As a consequence, those new datacenters apply virtualization concepts such as abstraction, pooling, and automation to all of the data center’s resources and services to achieve IT as a service. In this vision all elements of the infrastructure (compute, storage, networking and security) are virtualized and delivered as a service.

In this context, we can identify some new and well-known use cases across the next 5 categories. The original classification is used by the Red Hat Storage team. Note that I am merging Cloud infrastructure and Virtualization here.

• Big data analytics. Storing, integrating, and analyzing data at petabyte scale
• Cloud infrastructure and Virtualization. Virtual machine storage and storage for tenant applications (Swift/S3 API)
• Rich media. Massive scalability and cost containment (scaling out with commodity hardware)
• File sync and share. Secure mobility, collaboration and the need for anytime, anywhere access to files
• Archival data. Agile, scalable, cost-effective and flexible unified storage (objects, blocks and file systems)

Ceph is used to support all these use cases in production with great results.

Pushing Ceph to the limit

Some folks in the CERN IT Department are pushing Ceph to the limit. They use Ceph as part of an OpenStack deployment and I have to say the numbers are great.

The solution is a large distributed OpenStack infrastructure with around 10,000 VMs and 100,000 CPU cores (1000 Cinder volumes and 1500 Glance images). The Cloud is predominantly used for physics data analysis but they also reported on a long tail of conventional IT services and user-managed application VMs.

If you want to know more about this Ceph cluster operated by CERN, I recommend watching this video from the Vancouver Summit 2015.

In brief, and beyond the great insights shared during the talk, the current Ceph version scales out to 10 PB. At that scale it just works. Over that threshold, it requires extra configuration adjustments.

Wrap-up!

I told you! This piece of software is marvelous!

I plan to add new blog entries to cover some of the new features implemented in the previous months. They are upstream code now so you will be able to enjoy them in Jewel!

If you are looking for some kind of support related to development, design, deployment, etc. in Ceph, or you would love to see some new feature in the next releases, feel free to contact me!

## February 24, 2016

### Manuel Rego

#### Igalia Coding Experience on Web Engines

At Igalia we’re looking for people to join the Igalia Coding Experience program. Basically, we’re opening positions for internships in different fields related to some of our teams, like multimedia, compilers, networking or web platform. The main purpose is to give students and recent graduates an initial taste of coding in industry, where you can work together with several Igalia hackers on different free software projects.

I’m part of the web platform team where we work on different tasks related to the core of several web engines. Apart from our work on CSS Grid Layout in Blink and WebKit, that you probably know if you follow my blog, Igalia has been working on other topics like:

In our team we’re looking for a student willing to help with some of these topics. The final work will probably be related to CSS Grid Layout, where we have a bunch of peripheral tasks that would be really useful. Some ideas off the top of my head:

This is not meant to be an exhaustive list, just some examples so you can get an idea of the type of tasks you’ll be doing. Of course, depending on the profile of the selected person, we’ll choose the task that fits best.

If you’re interested in this internship or any other from the rest of the teams, you can find all the details and conditions on our website. We’re a company spread all around the globe, with igalians in different countries and timezones (from Seoul to San Francisco). And, of course, these internships are remote friendly.

On top of that, Igalia is hiring too, just in case you already have some experience and are looking for a job. Again you can find all the information at igalia.com.

Last but not least, Igalia welcomes everyone and encourages applications from members of underrepresented groups in the free software community. We’re aiming to keep a diverse environment in our company.