Planet Igalia

September 19, 2017

Asumu Takikawa

IPFIX app for Snabb

As you know if you’ve been following this blog, at Igalia we build network functions using the Snabb toolkit. When we’re not directly working on customer projects, we often invest time into building new features into Snabb.

One of these recent investments has been building a basic IP flow export (IPFIX) app for Snabb. The app is now available in the v2017.08 Snabb release. The app’s documentation is up on the web here.

As you can see from the commit log, Andy Wingo helped out a great deal in building the app (and was responsible for much of the performance engineering).

What is IP flow export?

IPFIX refers to a widely used set of tools that let you monitor IP flows in your network. An IP flow is a set of IP packets that share some common characteristics. Often these are described by a flow key composed of a standard 5-tuple of source & destination address, protocol, and TCP/UDP/etc ports.

Monitoring flows can be useful for traffic measurement and management, usage-based billing, and other use cases (many of which are spelled out in RFC 3917).

You may have also heard of “Netflow”, which is the trade name used by Cisco for these tools. IPFIX is the standardized version (see RFC 5470 and related RFCs) that is backwards compatible with Netflow.

(A great long-form overview of the topic is “Flow Monitoring Explained” by Hofstede et al., which was very helpful in designing the app.)

The basic architecture of IPFIX consists of a flow metering process which monitors the traffic, an exporter that communicates the data collected by the meter, and finally a collector that aggregates the data. In practice, the metering process and exporter are often combined into one program (so I’ll just refer to it as an exporter or as a probe).

Here’s some ASCII art from the RFC showing the structure:

                             +----------------+     +----------------+
                             |[*Application 1]| ... |[*Application n]|
                             +--------+-------+     +-------+--------+
                                      ^                     ^
                                      |                     |
                                      + = = = = -+- = = = = +
                                                 ^
                                                 |
   +------------------------+            +-------+------------------+
   |IPFIX Exporter          |            | Collector(1)             |
   |[Exporting Process(es)] |<---------->| [Collecting Process(es)] |
   +------------------------+            +--------------------------+
           ....                                  ....
   +------------------------+           +---------------------------+
   |IPFIX Device(i)         |           | Collector(j)              |
   |[Observation Point(s)]  |<--------->| [Collecting Process(es)]  |
   |[Metering Process(es)]  |     +---->| [*Application(s)]         |
   |[Exporting Process(es)] |     |     +---------------------------+
   +------------------------+     .
          ....                    .              ....
   +------------------------+     |     +--------------------------+
   |IPFIX Device(m)         |     |     | Collector(n)             |
   |[Observation Point(s)]  |<----+---->| [Collecting Process(es)] |
   |[Metering Process(es)]  |           | [*Application(s)]        |
   |[Exporting Process(es)] |           +--------------------------+
   +------------------------+

(what the RFC calls a “Device” contains some number of “metering processes”)

The exporter might track information such as the total number of packets, payload sizes, and start/end times for a unique flow (these are called information elements and are standardized by the IANA).

The exporter periodically sends its summarized data to a flow collector, which uses some kind of database to keep and track all of the flow information (unlike the exporter which may evict inactive flows). The collector can present this information to users in a variety of ways (e.g., a web UI).

An exporter communicates with a collector using the IPFIX protocol, so that you can use an exporter with any off-the-shelf collector.

Snabb app

For Snabb, we implemented an IPFIX exporter as an app that can be integrated with other apps. For convenience, we provide a snabb ipfix probe commandline program that runs an exporter with a simple configuration.

The flow keys and record fields to be collected are described using Lua tables, like the following:

v4 = make_template_info {
   id     = 256,
   filter = "ip",
   keys   = { "sourceIPv4Address",
              "destinationIPv4Address",
              "protocolIdentifier",
              "sourceTransportPort",
              "destinationTransportPort" },
   values = { "flowStartMilliseconds",
              "flowEndMilliseconds",
              "packetDeltaCount",
              "octetDeltaCount"}
}

This table describes a flow record that’s then used to dynamically generate the appropriate FFI data structures and configure the control flow of the app. The fields are described using the names from the IANA information element table.

Since the exporter collects some information from all of the IP packets going through it, it has to be performant and use minimal resources.

Making IPFIX perform well

Performance was the most difficult part of implementing the IPFIX app. The core of the app is fairly simple. For each packet, it looks at the fields that define a flow (currently we just support the usual 5-tuple) and uses this as a flow key to index a ctable.
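As a rough illustration (not the app’s actual code, and assuming Snabb’s lib.ctable API), the per-packet bookkeeping might look something like this; the real app generates its key and value types from the template table shown earlier rather than writing them by hand:

-- Hypothetical sketch of the flow-key/ctable bookkeeping, assuming Snabb's
-- lib.ctable API; the real app derives these FFI types from the template.
local ffi    = require("ffi")
local ctable = require("lib.ctable")

local flow_key_t = ffi.typeof([[struct {
   uint32_t src_ip, dst_ip;       /* sourceIPv4Address / destinationIPv4Address */
   uint16_t src_port, dst_port;   /* transport ports */
   uint8_t  protocol;             /* protocolIdentifier */
}]])
local flow_value_t = ffi.typeof("struct { uint64_t packets, octets; }")

local flows = ctable.new{ key_type = flow_key_t, value_type = flow_value_t }

-- Per packet: fill the key from the parsed headers, then update the entry.
local key, value = flow_key_t(), flow_value_t()
-- ... parse the packet headers into key ...
local entry = flows:lookup_ptr(key)
if entry then
   entry.value.packets = entry.value.packets + 1   -- octets updated similarly
else
   value.packets = 1
   flows:add(key, value)
end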

Since ctables are implemented as hash tables, hashing can be a major bottleneck for the app. This isn’t really surprising: hashing is such a hot operation in networking that it is often heavily optimized or even offloaded to hardware.

It turns out that the hashing algorithm used by ctable performed well with keys of certain sizes (4 or 8 bytes), but not with larger keys such as flow keys. We tried out several hash algorithms (e.g., FNV hashing) and got some incremental improvements. In the end, Andy was able to get better performance by implementing SipHash using DynASM.

Andy also implemented a bunch of other performance improvements, such as separating out the IPv4 and IPv6 data paths and re-organizing the multiple original apps into a single app that contains several mini-queues and work loops. In the end, these improvements got the app to line-rate performance in our synthetic benchmark for most packet sizes (performance on 64-byte packet sizes needs more work).

In terms of real-world performance, we still need to do more work but are making progress thanks to our early adopters. Alexander Gall from SWITCH has been providing us with valuable feedback from running and hacking on the app with production data.

Future improvements

At this point, we have an IPFIX app that can be integrated into other Snabb solutions or used standalone on commodity Intel hardware. There’s still a lot of work that can be done on it though. One idea for improvement on the performance side is to parallelize it using the RSS feature on 10G network cards. RSS (receive-side scaling) lets you use multiple processes running in parallel on several receive queues on a single NIC.

Conveniently, it turns out that at Igalia we’ve also been working on improving Snabb’s network drivers to make it easier to use RSS. There’s a good chance that we can parallelize IPFIX very easily, since there’s minimal coordination that’s needed between IPFIX instances.

There are a number of other improvements we could make too. Probably the most useful is to add support for more information elements, and to make the observed IEs more configurable.

Another area for improvement is the app configuration. Snabb has support for app configuration with YANG schemas and the IETF has a schema for IPFIX that we could use.

Another limitation is that the IPFIX app currently just exports over UDP to a collector. The RFCs technically require support for SCTP as well (adding that could be a lot of work; maybe we would offload it to an existing userspace library).

Final thoughts

Working on IPFIX was pretty fun overall. One of the rewarding aspects of working on the IPFIX app is that it’s relatively easy to test it and see that it’s doing something. For example, you can hook it up to Wireshark and manually inspect the IPFIX packets to see what’s going on.

You can also plug it into an off-the-shelf flow collector like nfdump and see some useful output. In fact, we have a test written using nix-shell that spawns a shell in a new Nix environment with nfdump installed and tests the app with it. Nix can be very nice for this kind of test environment setup.

In any case, please feel free to try out the app. We would appreciate any feedback or bug reports!

by Asumu Takikawa at September 19, 2017 01:23 AM

September 14, 2017

Jacobo Aragunde

Attending BlinkOn 8

Next week I will be in Tokyo to attend BlinkOn 8! It will be a great opportunity to meet the Chromium community and share what we are doing.

I will give a lightning talk about the challenges of making Chromium run on embedded platforms. I hope to spark the curiosity of the audience in this complex field!

by Jacobo Aragunde Pérez at September 14, 2017 05:28 PM

September 09, 2017

Carlos García Campos

WebDriver support in WebKitGTK+ 2.18

WebDriver is an automation API to control a web browser. It allows you to create automated tests for web applications independently of the browser and platform. WebKitGTK+ 2.18, which will be released next week, includes an initial implementation of the WebDriver specification.

WebDriver in WebKitGTK+

There’s a new process (WebKitWebDriver) that works as the server, processing the clients’ requests to spawn and control the web browser. The WebKitGTK+ driver is not tied to any specific browser; it can be used with any WebKitGTK+-based browser, but it uses MiniBrowser as the default. The driver uses the same remote control protocol as the remote inspector to communicate with and control the web browser instance. The implementation is not complete yet, but it’s enough for what many users need.

The clients

The web application tests are the clients of the WebDriver server. The Selenium project provides APIs for different languages (Java, Python, Ruby, etc.) to write the tests. Python is the only language supported by WebKitGTK+ for now. It’s not yet upstream, but we hope it will be integrated soon. In the meantime you can use our fork on GitHub. Let’s see an example to understand how it works and what we can do.

from selenium import webdriver

# Create a WebKitGTK driver instance. It spawns WebKitWebDriver 
# process automatically that will launch MiniBrowser.
wkgtk = webdriver.WebKitGTK()

# Let's load the WebKitGTK+ website.
wkgtk.get("https://www.webkitgtk.org")

# Find the GNOME link.
gnome = wkgtk.find_element_by_partial_link_text("GNOME")

# Click on the link. 
gnome.click()

# Find the search form. 
search = wkgtk.find_element_by_id("searchform")

# Find the first input element in the search form.
text_field = search.find_element_by_tag_name("input")

# Type epiphany in the search field and submit.
text_field.send_keys("epiphany")
text_field.submit()

# Let's count the links in the contents div to check we got results.
contents = wkgtk.find_element_by_class_name("content")
links = contents.find_elements_by_tag_name("a")
assert len(links) > 0

# Quit the driver. The session is closed so MiniBrowser 
# will be closed and then WebKitWebDriver process finishes.
wkgtk.quit()

Note that this is just an example to show how to write a test and what kinds of things you can do; there are better ways to achieve the same results. It also depends on the current markup of public websites, so it might not work in the future.

Web browsers / applications

As I said before, the WebKitWebDriver process supports any WebKitGTK+-based browser, but that doesn’t mean all browsers can automatically be controlled by automation (that would be scary). WebKitGTK+ 2.18 also provides new API for applications to support automation.

  • First of all the application has to explicitly enable automation using webkit_web_context_set_automation_allowed(). It’s important to know that the WebKitGTK+ API doesn’t allow enabling automation in several WebKitWebContexts at the same time. The driver will spawn the application when a new session is requested, so the application should enable automation at startup. It’s recommended that applications add a new command line option to enable automation, and only enable it when provided.
  • After launching the application the driver will request the browser to create a new automation session. The signal “automation-started” will be emitted in the context to notify the application that a new session has been created. If automation is not allowed in the context, the session won’t be created and the signal won’t be emitted either.
  • A WebKitAutomationSession object is passed as a parameter to the “automation-started” signal. It can be used to provide information about the application (name and version) to the driver, which will match it against what the client requires and accept or reject the session request.
  • The WebKitAutomationSession will emit the signal “create-web-view” every time the driver needs to create a new web view. The application can then create a new window or tab containing the new web view, which should be returned by the signal. This signal will always be emitted even if the browser already has an initial web view open; in that case it’s recommended to return the existing empty web view.
  • Web views are also automation aware: similar to ephemeral web views, web views that allow automation should be created with the constructor property “is-controlled-by-automation” enabled. (A minimal sketch of these hooks follows this list.)
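To make these hooks more concrete, here is a minimal, hypothetical sketch of what a GTK+ browser might do. It is not taken from MiniBrowser or any real browser, and the window handling is a simplification; only the WebKitGTK+ API mentioned above is used:

/* Hypothetical sketch: enable_automation() would be called at startup with
 * the default WebKitWebContext, and only when e.g. an --automation-mode
 * command line option was given. */
#include <webkit2/webkit2.h>

static WebKitWebView *
create_web_view_cb (WebKitAutomationSession *session, gpointer user_data)
{
    /* Web views controlled by automation must be created with the
     * "is-controlled-by-automation" construct property enabled. */
    GtkWidget *view = g_object_new (WEBKIT_TYPE_WEB_VIEW,
                                    "is-controlled-by-automation", TRUE,
                                    NULL);
    GtkWidget *window = gtk_window_new (GTK_WINDOW_TOPLEVEL);

    gtk_container_add (GTK_CONTAINER (window), view);
    gtk_widget_show_all (window);

    return WEBKIT_WEB_VIEW (view);
}

static void
automation_started_cb (WebKitWebContext *context, WebKitAutomationSession *session, gpointer user_data)
{
    /* A new automation session was created; hand it web views on demand. */
    g_signal_connect (session, "create-web-view",
                      G_CALLBACK (create_web_view_cb), NULL);
}

static void
enable_automation (WebKitWebContext *context)
{
    webkit_web_context_set_automation_allowed (context, TRUE);
    g_signal_connect (context, "automation-started",
                      G_CALLBACK (automation_started_cb), NULL);
}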

This is the new API that applications need to implement to support WebDriver. It’s designed to be as safe as possible, but there are many things that can’t be controlled by WebKitGTK+, so we have several recommendations for applications that want to support automation:

  • Add a way to enable automation in your application at startup, like a command line option, that is disabled by default. Never allow automation in a normal application instance.
  • Enabling automation is not the only thing the application should do, so add an automation mode to your application.
  • Add visual feedback when in automation mode, like changing the theme or the window title, or whatever makes it clear that a window or instance of the application is controllable by automation.
  • Add a message to explain that the window is being controlled by automation and the user is not expected to use it.
  • Use ephemeral web views in automation mode.
  • Use a temporary user profile in automation mode; do not allow automation to change the history, bookmarks, etc. of an existing user.
  • Do not load any homepage in automation mode, just keep an empty web view (about:blank) that can be used when a new web view is requested by automation.

The WebKitGTK client driver

Applications need to implement the new automation API to support WebDriver, but the WebKitWebDriver process doesn’t know how to launch the browsers. That information should be provided by the client using the WebKitGTKOptions object. The driver constructor can receive an instance of a WebKitGTKOptions object with the browser information and other options. Let’s see how it works with an example that launches Epiphany:

from selenium import webdriver
from selenium.webdriver import WebKitGTKOptions

options = WebKitGTKOptions()
options.browser_executable_path = "/usr/bin/epiphany"
options.add_browser_argument("--automation-mode")
epiphany = webdriver.WebKitGTK(browser_options=options)

Again, this is just an example; Epiphany doesn’t even support WebDriver yet. Browsers or applications could create their own drivers on top of the WebKitGTK one to make it more convenient to use.

from selenium import webdriver
epiphany = webdriver.Epiphany()

Plans

During the next release cycle, we plan to do the following tasks:

  • Complete the implementation: add support for all commands in the spec and complete the ones that are partially supported now.
  • Add support for running the WPT WebDriver tests in the WebKit bots.
  • Add a WebKitGTK driver implementation for other languages in Selenium.
  • Add support for automation in Epiphany.
  • Add WebDriver support to WPE/dyz.

by carlos garcia campos at September 09, 2017 05:33 PM

September 06, 2017

Frédéric Wang

Review of Igalia's Web Platform activities (H1 2017)

Introduction

For many years Igalia has been committed to improving the Web Platform, dedicating efforts to all the open-source web engines (Chromium, WebKit, Servo, Gecko) and JavaScript implementations (V8, SpiderMonkey, ChakraCore, JSC). We have been working on the implementation and standardization of some important technologies (CSS Grid/Flexbox, ECMAScript, WebRTC, WebVR, ARIA, MathML, etc). This blog post contains a review of these activities performed during the first half (and a bit more) of 2017.

Projects

CSS

A few years ago Bloomberg and Igalia started a collaboration to implement a new layout model for the Web Platform. Bloomberg had complex layout requirements and what the Web provided was not enough and caused performance issues. CSS Grid Layout seemed to be the right choice, a feature that would provide such complex designs with more flexibility than the currently available methods.

We’ve been implementing CSS Grid Layout in Blink and WebKit, initially behind some flags as an experimental feature. This year, after some coordination effort to ensure interoperability (talking to the different parties involved like browser vendors, the CSS Working Group and the web authors community), it has been shipped by default in Chrome 58 and Safari 10.1. This is a huge step for the layout on the web, and modern websites will benefit from this new model and enjoy all the features provided by CSS Grid Layout spec.

Since CSS Grid Layout shares the same alignment properties as the CSS Flexible Box feature, a new spec has been defined to generalize alignment for all the layout models. We started implementing this new spec as part of our work on Grid, with Grid being the first layout model to support it.

Finally, we worked on other minor CSS features in Blink such as caret-color or :focus-within and also several interoperability issues related to Editing and Selection.

MathML

MathML is a W3C recommendation to represent mathematical formulae that has been included in many other standards such as ISO/IEC, HTML5, ebook and office formats. There are many tools available to handle it, including various assistive technologies as well as generators from the popular LaTeX typesetting system.

After the improvements we performed in WebKit’s MathML implementation, we have regularly been in contact with Google to see how we can implement MathML in Chromium. Early this year, we had several meetings with Google’s layout team to discuss this in further details. We agreed that MathML is an important feature to consider for users and that the right approach would be to rely on the new LayoutNG model currently being implemented. We created a prototype for a small LayoutNG-based MathML implementation as a proof-of-concept and as a basis for future technical discussions. We are going to follow-up on this after the end of Q3, once Chromium’s layout team has made more progress on LayoutNG.

Servo

Servo is Mozilla’s next-generation web content engine based on Rust, a language that guarantees memory safety. Servo relies on a Rust project called WebRender which replaces the typical rasterizer and compositor duo in the web browser stack. WebRender makes extensive use of GPU batching to achieve very exciting performance improvements in common web pages. Mozilla has decided to make WebRender part of the Quantum Render project.

We’ve had the opportunity to collaborate with Mozilla for a few years now, focusing on the graphics stack. Our work has focused on bringing full support for CSS stacking and clipping to WebRender, so that it will be available in both Servo and Gecko. This has involved creating a data structure similar to what WebKit calls the “scroll tree” in WebRender. The scroll tree divides the scene into independently scrolled elements, clipped elements, and various transformation spaces defined by CSS transforms. The tree allows WebRender to handle page interaction independently of page layout, allowing maximum performance and responsiveness.

WebRTC

WebRTC is a collection of communications protocols and APIs that enable real-time communication over peer-to-peer connections. Typical use cases include video conferencing, file transfer, chat, or desktop sharing. Igalia has been working on the WebRTC implementation in WebKit and this development is currently sponsored by Metrological.

This year we have continued the implementation effort in WebKit for the WebKitGTK and WebKit WPE ports, as well as the maintenance of two test servers for WebRTC: Ericsson’s p2p and Google’s apprtc. Finally, a lot of progress has been made on adding support for Jitsi using the existing OpenWebRTC backend.

Since OpenWebRTC is no longer an active project, and given that libwebrtc is gaining traction in both Blink and the WebRTC implementation of WebKit for Apple software, we are taking the first steps to replace the original OpenWebRTC-based WebRTC implementation in WebKitGTK with a new one based on libwebrtc. Hopefully, this way we will share more code between platforms and get more robust WebRTC support for end users. GStreamer integration in this new implementation is an issue we will have to study, as it’s not built into libwebrtc. libwebrtc offers many services, but not every WebRTC implementation uses all of them. This seems to be the case for the Apple WebRTC implementation, and it may become our case too if we need tighter integration with GStreamer or hardware decoding.

WebVR

WebVR is an API that provides support for virtual reality devices in Web engines. Implementations and devices are currently actively developed by browser vendors and it looks like it is going to be a huge thing. Igalia has started to investigate this topic to see how we can join the effort. This year, we have been in discussions with Mozilla, Google and Apple to see how we could help in the implementation of WebVR on Linux. We decided to start experimenting with an implementation within WebKitGTK. We announced our intention on the webkit-dev mailing list and got encouraging feedback from Apple and the WebKit community.

ARIA

ARIA defines a way to make Web content and Web applications more accessible to people with disabilities. Igalia strengthened its ongoing commitment to the W3C: Joanmarie Diggs joined Richard Schwerdtfeger as a co-Chair of the W3C’s ARIA working group, and became editor of the Core Accessibility API Mappings, Digital Publishing Accessibility API Mappings (https://w3c.github.io/aria/dpub-aam/dpub-aam.html), and Accessible Name and Description: Computation and API Mappings specifications. Her main focus over the past six months has been to get ARIA 1.1 transitioned to Proposed Recommendation through a combination of implementation and bugfixing in WebKit and Gecko, creation of automated testing tools to verify platform accessibility API exposure in GNU/Linux and macOS, and working with fellow Working Group members to ensure the platform mappings stated in the various “AAM” specs are complete and accurate. We will provide more information about these activities after ARIA 1.1 and the related AAM specs are further along on their respective REC tracks.

Web Platform Predictability for WebKit

The AMP Project has recently sponsored Igalia to improve WebKit’s implementation of the Web platform. We have worked on many issues, the main ones being:

  • Frame sandboxing: Implementing sandbox values to allow trusted third-party resources to open unsandboxed popups or restrict unsafe operations of malicious ones.
  • Frame scrolling on iOS: Addressing issues with scrollable nodes; trying to move to a more standard and interoperable approach with scrollable iframes.
  • Root scroller: Finding a solution to the old interoperability issue about how to scroll the main frame; considering a new rootScroller API.

This project aligns with Web Platform Predictability which aims at making the Web more predictable for developers by improving interoperability, ensuring version compatibility and reducing footguns. It has been a good opportunity to collaborate with Google and Apple on improving the Web. You can find further details in this blog post.

JavaScript

Igalia has been involved in design, standardization and implementation of several JavaScript features in collaboration with Bloomberg and Mozilla.

In implementation, Bloomberg has been sponsoring implementation of modern JavaScript features in V8, SpiderMonkey, JSC and ChakraCore, in collaboration with the open source community:

  • Implementation of many ES6 features in V8, such as generators, destructuring binding and arrow functions
  • Async/await and async iterators and generators in V8 and some work in JSC
  • Optimizing SpiderMonkey generators
  • Ongoing implementation of BigInt in SpiderMonkey and class field declarations in JSC

On the design/standardization side, Igalia is active in TC39, with Bloomberg’s support.

In partnership with Mozilla, Igalia has been involved in the specification of various JavaScript standard library features for internationalization, in specification, implementation in V8, code reviews in other JavaScript engines, as well as working with the underlying ICU library.

Other activities

Preparation of Web Engines Hackfest 2017

Igalia has been organizing and hosting the Web Engines Hackfest since 2009. This event, held in an unconference format, has been a great opportunity for web engine developers to meet, discuss and work together on the web platform and on web engines in general. We announced the 2017 edition and many developers have already confirmed their attendance. We would like to thank our sponsors for supporting this event and we are looking forward to seeing you in October!

Coding Experience

Emilio Cobos has completed his coding experience program on the implementation of web standards. He has been working on the implementation of “display: contents” in Blink, but some work is pending due to unresolved CSS WG issues. He also started the corresponding work in WebKit, though the implementation is still very partial. It has been a pleasure to mentor a skilled hacker like Emilio and we wish him the best for his future projects!

New Igalians

During this semester we have been glad to welcome new Igalians who will help us pursue Web Platform developments:

  • Daniel Ehrenberg joined Igalia in January. He is an active contributor to the V8 JavaScript engine and has been representing Igalia at the ECMAScript TC39 meetings.
  • Alicia Boya joined Igalia in March. She has experience in many areas of computing, including web development, computer graphics, networks, security, and performance-oriented software design, which we believe will be valuable for our Web Platform activities.
  • Ms2ger joined Igalia in July. He is a well-known hacker of the Mozilla community and has wide experience in both Gecko and Servo. He has notably worked on the DOM implementation and web platform test automation.

Conclusion

Igalia has been involved in a wide range of Web Platform technologies, ranging from JavaScript and layout engines to accessibility and multimedia features. Efforts have been made in all parts of the process:

  • Participation in standardization bodies (W3C, TC39).
  • Elaboration of conformance tests (web-platform-tests, test262).
  • Implementation and bug fixes in all open source web engines.
  • Discussion with users, browser vendors and other companies.

Although some of this work has been sponsored by Google or Mozilla, it is important to highlight how external companies (other than browser vendors) can make good contributions to the Web Platform, playing an important role in its evolution. Alan Stearns already pointed out the responsibility of Web Platform users in the evolution of CSS, while Rachel Andrew emphasized how any company or web author can effectively contribute to the W3C in many ways.

As mentioned in this blog post, Bloomberg is an important contributor to several open source projects and has been a key player in the development of CSS Grid Layout and JavaScript. Similarly, Metrological’s support has been instrumental for the implementation of WebRTC in WebKit. We believe others could follow their example and we are looking forward to seeing more companies sponsoring Web Platform developments!

September 06, 2017 10:00 PM

August 31, 2017

Xabier Rodríguez Calvar

Some rough numbers on WebKit code

My wife asked me for some rough LOC numbers on the WebKit project and I think I could share them with you here as well. They come from r221232. As I’ll take into account some generated code it is relevant to mention that I built WebKitGTK+ with the default CMake options.

The first thing I did was run sloccount Source, which gave the following numbers:

cpp: 2526061 (70.57%)
ansic: 396906 (11.09%)
asm: 207284 (5.79%)
javascript: 175059 (4.89%)
java: 74458 (2.08%)
perl: 73331 (2.05%)
objc: 44422 (1.24%)
python: 38862 (1.09%)
cs: 13011 (0.36%)
ruby: 11605 (0.32%)
xml: 11396 (0.32%)
sh: 3747 (0.10%)
yacc: 2167 (0.06%)
lex: 1007 (0.03%)
lisp: 89 (0.00%)
php: 10 (0.00%)

These numbers do not include IDL code, so I did some grepping to get that number myself, which gave me 19632 IDL lines:

$ find Source/ -name "*.idl" | xargs cat | grep -ve "^[[:space:]]\/*" -ve "^[[:space:]]*" -ve "^[[:space:]]$" -ve "^[[:space:]][$" -ve "^[[:space:]]};$" | wc -l
19632

The interesting part of the IDL files is that they are used to generate code so those 19632 IDL lines expand to:

ansic: 699140 (65.25%)
cpp: 368720 (34.41%)
python: 1492 (0.14%)
xml: 1040 (0.10%)
javascript: 883 (0.08%)
asm: 169 (0.02%)
perl: 11 (0.00%)

Let’s now have a look at the LayoutTests (they test the functionality of WebCore plus the platform). Tests are composed mainly of HTML files, so if you run sloccount LayoutTests you get:

javascript: 401159 (76.74%)
python: 87231 (16.69%)
xml: 22978 (4.40%)
php: 4784 (0.92%)
ansic: 3661 (0.70%)
perl: 2726 (0.52%)
sh: 199 (0.04%)

It’s quite interesting to see that sloccount does not consider HTML, which is quite relevant when you’re testing a web engine, so again we have to count those lines manually (thanks to Carlos López, who helped me grep properly here, as some binary lines were giving me a headache):

$ find LayoutTests/ -name "*.html" -print0 | xargs -0 cat | strings | grep -Pv "^[[:space:]]$" | wc -l
2205690

You can see 2205690 “meaningful lines” that combine HTML with the other languages listed above. I can’t just subtract here to get only the HTML lines, because the sloccount numbers above take into account files with a different extension than HTML, while many of the HTML files themselves include other languages, especially JavaScript.

But the LayoutTests do not include only pure WebKit tests. There are some imported ones, so it might be interesting to run the same procedure under LayoutTests/imported to see which ones are imported and not written directly in the WebKit project. I emphasize that because they can be written by WebKit developers in other repositories; I can actually present myself and Youenn Fablet as an example, as we wrote some tests that were eventually moved into the specification and included back later when imported. So again, sloccount LayoutTests/imported:

python: 84803 (59.99%)
javascript: 51794 (36.64%)
ansic: 3661 (2.59%)
php: 575 (0.41%)
xml: 250 (0.18%)
sh: 199 (0.14%)
perl: 86 (0.06%)

The same procedure to count HTML + other stuff lines inside that directory gives a number of 295490:

$ find LayoutTests/imported/ -name "*.html" -print0 | xargs -0 cat | strings | grep -Pv "^[[:space:]]$" | wc -l
295490

There are also some other tests that we can talk about, for example the JSTests. I’ll just give the numbers summed up by language, plus the manually counted HTML lines (if you made it here, you know the drill already):

javascript: 1713200 (98.64%)
xml: 20665 (1.19%)
perl: 2449 (0.14%)
python: 421 (0.02%)
ruby: 56 (0.00%)
sh: 38 (0.00%)
HTML+stuff: 997

ManualTests:

javascript: 297 (41.02%)
ansic: 187 (25.83%)
java: 118 (16.30%)
xml: 103 (14.23%)
php: 10 (1.38%)
perl: 9 (1.24%)
HTML+stuff: 16026

PerformanceTests:

javascript: 950916 (83.12%)
cpp: 147194 (12.87%)
ansic: 38540 (3.37%)
asm: 5466 (0.48%)
sh: 872 (0.08%)
ruby: 419 (0.04%)
perl: 348 (0.03%)
python: 325 (0.03%)
xml: 5 (0.00%)
HTML+stuff: 238002

TestsWebKitAPI:

cpp: 44753 (99.45%)
ansic: 163 (0.36%)
objc: 76 (0.17%)
xml: 7 (0.02%)
javascript: 1 (0.00%)
HTML+stuff: 3887

And this is all. Remember that these are just some rough statistics, not a “scientific” paper.

Update:

In her expert opinion, in the WebKit project we are devoting around 50% of the total LOC to testing, which makes it a software engineering “textbook” project regarding testing and I think we can be proud of it!

by calvaris at August 31, 2017 09:03 AM

August 29, 2017

Frédéric Wang

The AMP Project and Igalia working together to improve WebKit and the Web Platform

TL;DR

The AMP Project and Igalia have recently been collaborating to improve WebKit’s implementation of the Web platform. Both teams are committed to make the Web better and we expect that all developers and users will benefit from this effort. In this blog post, we review some of the bug fixes and features currently being considered:

  • Frame sandboxing: Implementing sandbox values to allow trusted third-party resources to open unsandboxed popups or restrict unsafe operations of malicious ones.

  • Frame scrolling on iOS: Trying to move to a more standard and interoperable approach via iframe elements; addressing miscellaneous issues with scrollable nodes (e.g. visual artifacts while scrolling, view not scrolled when using “Find Text”…).

  • Root scroller: Finding a solution to the old interoperability issue about how to scroll the main frame; considering a new rootScroller API.

Some demo pages for frame sandboxing and scrolling are also available if you wish to test features discussed in this blog post.

Introduction

AMP is an open-source project to enable websites and ads that are consistently fast, beautiful and high-performing across devices and distribution platforms. Several interoperability bugs and missing features in WebKit have caused problems for AMP users and for Web developers in general. Although it is possible to add platform-specific workarounds to AMP, the best way to help the Web Platform community is to directly fix these issues in WebKit, so that everybody can benefit from these improvements.

Igalia is a consulting company with a team dedicated to Web Platform developments in all open-source Web Engines (Chromium, WebKit, Servo, Gecko) working in the implementation and standardization of miscellaneous technologies (CSS Grid/flexbox, ECMAScript, WebRTC, WebVR, ARIA, MathML, etc). Given this expertise, the AMP Project sponsored Igalia so that they can lead these developments in WebKit. It is worth noting that this project aligns with the Web Predictability effort supported by both Google and Igalia, which aims at making the Web more predictable for developers. In particular, the following aspects are considered:

  • Interoperability: Effort is made to write Web Platform Tests (WPT), to follow Web standards and ensure consistent behaviors between web engines or operating systems.
  • Compatibility: Changes are carefully analyzed using telemetry techniques or user feedback in order to avoid breaking compatibility with previous versions of WebKit.
  • Reducing footguns: Removals of non-standard features (e.g. CSS vendor prefixes) are attempted while new features are carefully introduced.

Below we provide further description of the WebKit improvements, showing concretely how the above principles are followed.

Frame sandboxing

A sandbox attribute can be specified on the iframe element in order to enable a set of restrictions on any content it hosts. These restrictions can be relaxed by specifying a list of values such as allow-scripts (to allow JavaScript execution in the frame) or allow-popups (to allow the frame to open popups). By default, the same restrictions apply to a popup opened by a sandboxed frame.

Figure 1: Example of sandboxed frames (Can they navigate their top frame or open popups? Are such popups also sandboxed?)

However, sometimes this behavior is not wanted. Consider for example the case of an advertisement inside a sandboxed frame. If a popup is opened from this frame then it is likely that a non-sandboxed context is desired on the landing page. In order to handle this use case, a new allow-popups-to-escape-sandbox value has been introduced. The value is now supported in Safari Technology Preview 34.
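As an illustration, such an ad frame could be declared roughly like this (hypothetical markup; the src URL is just a placeholder):

<!-- Hypothetical markup: a sandboxed ad frame whose popups escape the
     sandbox; the src URL is just a placeholder. -->
<iframe src="https://ads.example.com/ad.html"
        sandbox="allow-scripts allow-popups allow-popups-to-escape-sandbox">
</iframe>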

While performing that work, we noticed that some WPT tests for the sandbox attribute were still failing. It turns out that WebKit does not really follow the rules for allowing navigation. More specifically, navigating a top context is never allowed when that context corresponds to an opened popup. We have made some changes to WebKit so that it follows the specification more closely. This is integrated into Safari Technology Preview 35 and you can for example try this W3C test. Note that this test requires changing preferences to allow popups.

It is worth noting that web engines may slightly depart from the specification regarding the previously mentioned rules. In particular, WebKit checks a same-origin condition to be sure that one frame is allowed to navigate another one. WebKit has always contained a special case to ignore this condition when a sandboxed frame with the allow-top-navigation flag tries to navigate its top frame. This feature, sometimes known as “frame busting,” has been used by third-party resources to perform malicious auto-redirecting. As a consequence, Chromium developers proposed to restrict frame busting to the case where the navigation is triggered by a user gesture.

According to Chromium’s telemetry, frame busting without a user gesture is very rare. But when experimenting with the behavior change of allow-top-navigation, several regressions were reported. Hence it was instead decided to introduce the allow-top-navigation-by-user-activation flag in order to provide this improved safety context while still preserving backward compatibility. We implemented this feature in WebKit and it is now available in Safari Technology Preview 37.

Finally, another proposed security improvement is to use an allow-modals flag to explicitly allow sandboxed frames to display modal dialogs (with alert, prompt, etc). That is, the default behavior for sandboxed frames will be to forbid such modal dialogs. Again, such a change of behavior must be done with care. Experiments in Chromium showed that the usage of modal dialogs in sandboxed frames is very low and no users complained. Hence we implemented that behavior in WebKit and the feature should arrive in Safari Technology Preview soon.

Check out the frame sandboxing demos if you want to test the new allow-popups-to-escape-sandbox, allow-top-navigation-by-user-activation and allow-modals flags.

Frame scrolling on iOS

Apple’s UI choice was to (almost) always “flatten” (expand) frames so that users do not need to scroll them. The rationale for this is that it avoids getting trapped in a hierarchy of nested frames. Changing that behavior is likely to cause a big backward compatibility issue on iOS, so for now we proposed a less radical solution: add a heuristic to support the case of “fullscreen” iframes used by the AMP Project. Note that such exceptions already exist in WebKit, e.g. to avoid making offscreen content visible.

We thus added the following heuristic into WebKit Nightly: do not flatten out-of-flow iframes (e.g. position: absolute) that have viewport units (e.g. vw and vh). This includes the case of the “fullscreen” iframe previously mentioned. For now it is still under a developer flag so that WebKit developers can control when they want to enable it. Of course, if this is successful we might consider more advanced heuristics.

The fact that frames are never scrollable in iOS is an obvious interoperability issue. As a workaround, it is possible to emulate such “scrollable nodes” behavior using overflow: scroll nodes with the -webkit-overflow-scrolling: touch property set. This is not really ideal for our Web Predictability goal as we would like to get rid of browser vendor prefixes. Also, in practice such workarounds lead to even more problems in AMP as explained in these blog posts. That’s why implementing scrolling of frames is one of the main goals of this project and significant steps have already been made in that direction.

Figure 2: C++ classes involved in frame scrolling

The (relatively complex) class hierarchy involved in frame scrolling is summarized in Figure 2. The frame flattening heuristic mentioned above is handled in the WebCore::RenderIFrame class (in purple). The WebCore::ScrollingTreeFrameScrollingNodeIOS and WebCore::ScrollingTreeOverflowScrollingNodeIOS classes from the scrolling tree (in blue) are used to scroll, respectively, the main frame and overflow nodes on iOS. Scrolling of non-main frames will obviously have some code to share with the former, but it will also have some parts in common with the latter. For example, passing an extra UIScrollView layer is needed instead of relying on the one contained in the WKWebView of the main frame. An important step is thus to introduce a special class for scrolling inner frames that would share some logic from the two other classes and some refactoring to ensure optimal code reuse. Similar refactoring has been done for scrolling node states (in red) to move the scrolling layer parameter into WebCore::ScrollingStateNode instead of having separate members for WebCore::ScrollingStateOverflowScrollingNode and WebCore::ScrollingStateFrameScrollingNode.

The scrolling coordinator classes (in green) are also important, for example to handle hit testing. At the moment, this is not really implemented for overflow nodes but it might be important to have it for scrollable frames. Again, one sees that some logic is shared for asynchronous scrolling on macOS (WebCore::ScrollingCoordinatorMac) and iOS (WebCore::ScrollingCoordinatorIOS) in ancestor classes. Indeed, our effort to make frames scrollable on iOS is also opening the possibility of asynchronous scrolling of frames on macOS, something that is currently not implemented.

Figure 4: Video of this demo page on WebKit iOS with experimental patches to make frames scrollable (2017/07/10)

Finally, some more work is necessary in the render classes (purple) to ensure that the layer hierarchies are correctly built. Patches have been uploaded and you can view the result on the video of Figure 4. Notice that this work has not been reviewed yet and there are known bugs, for example with overlapping elements (hit testing not implemented) or position: fixed elements.

Various other scrolling bugs were reported, analyzed and sometimes fixed by Apple. The switch from overflow nodes to scrollable iframes is unlikely to address them. For example, the “Find Text” operation in iOS has advanced features done by the UI process (highlight, smart magnification) but the scrolling operation needed only works for the main frame. It looks like this could be fixed by unifying a bit the scrolling code path with macOS. There are also several jump and flickering bugs with position: fixed nodes. Finally, Apple fixed inconsistent scrolling inertia used for the main frame and the one used for inner scrollable nodes by making the former the same as the latter.

Root Scroller

The CSSOM View specification extends the DOM element with some scrolling properties. That specification indicates that the element to consider for scrolling the main view is document.body in quirks mode while it is document.documentElement in no-quirks mode. This is the behavior that has always been followed by browsers like Firefox or Internet Explorer. However, WebKit-based browsers always treat document.body as the root scroller. This interoperability issue has been a big problem for web developers. One convenient workaround was to introduce the document.scrollingElement property, which returns the element to use for scrolling the main view (document.body or document.documentElement) and was recently implemented in WebKit. Use this test page to verify whether your browser supports the document.scrollingElement property and which DOM element is used to scroll the main view in no-quirks mode.

Nevertheless, this does not solve the issue with existing web pages. Chromium’s Web Platform Predictability team has made a huge communication effort with Web authors and developers which has drastically reduced the use of document.body in no-quirks mode. For instance, Chromium’s telemetry on Figure 3 indicates that the percentage of document.body.scrollTop in no-quirks pages has gone from 18% down to 0.0003% during the past three years. Hence the Chromium team is now considering shipping the standard behavior.

UseCounter for ScrollTopBodyNotQuirksMode
Figure 3: Use of document.body.scrollTop in no-quirks mode over time (Chromium's UseCounter)

In WebKit, the issue has been known for a long time and an old attempt to fix it was reverted for causing regressions. For now, we imported the CSSOM View tests and just marked the one related to the scrolling element as failing. An analysis of the situation has been left on WebKit’s bug; depending on how things evolve on Chromium’s side we could consider resuming the discussion and implementation work in WebKit.

Related to that work, a new API is being proposed to set the root scroller to an arbitrary scrolling element, giving more flexibility to authors of Web applications. Today, this is unfortunately not possible without losing some of the special features of the main view (e.g. on iOS, Safari’s URL bar is hidden when scrolling the main view to maximize the screen space). Such API is currently being experimented in Chromium and we plan to investigate whether this can be implemented in WebKit too.

Conclusion

In the past months, the AMP Project and Igalia have worked on analyzing some interoperability issues and fixing them in WebKit. Many improvements for frame sandboxing are going to be available soon. Significant progress has also been made for frame scrolling on iOS, and collaboration continues with Apple reviewers to ensure that the work will be integrated in future versions of WebKit. Improvements to “root scrolling” are also being considered, although they are pending on the evolution of the issues on Chromium’s side. All these efforts are expected to be useful for WebKit users and the Web platform in general.


Last but not least, I would like to thank Apple engineers Simon Fraser, Chris Dumez, and Youenn Fablet for their reviews and help, as well as Google and the AMP team for supporting that project.

August 29, 2017 10:00 PM

August 24, 2017

Eleni Maria Stea

A terrain rendering approach (part 1)

There are several methods to create and display a terrain, in real-time. In this post, I will explain the approach I followed on the demo I’m writing for my work at Igalia. Some work is still in progress.

The terrain had to meet the following requirements:

  • its size should be arbitrary
  • parts outside the viewer’s field of view should be culled

Parameters that describe the terrain

For reasons that will become obvious when I explain the terrain generation and drawing, I decided to use a heightfield made by tiles and I used the following parameters to generate it:

/* parameters needed in terrain generation */

struct TerrainParams {
    float xsz; /* terrain size in x axis */
    float ysz; /* terrain size in y axis */
    float max_height; /* max height of the heightfield */
    int xtiles; /* number of tiles in x axis */
    int ytiles; /* number of tiles in y axis */
    int tile_usub;
    int tile_vsub;
    int num_octaves; /* Perlin noise sums */
    float noise_freq; /* Perlin noise scaling factor */
    Image coarse_heightmap; /* mask for low detail heightmap */
};

Let’s explain them a little bit:

Imagine the terrain as an xsz * ysz grid of tiles where each tile is subdivided into usub × vsub smaller parts. Each grid point can have an “arbitrary” height (we’ll see later how we calculate it) that cannot exceed the maximum height: max_height. The number of tiles in each terrain axis is xtiles for the x-axis and ytiles for the y-axis (which is practically the z-axis of our 3-D space). The variables tile_usub and tile_vsub give the number of subdivisions of each tile.

Image 1: a wireframe terrain

Note that in general, I use the u, v notation in normalized spaces and the x, y for the world space.

The variables num_octaves and noise_freq and the coarse_heightmap image are used to calculate the heights in different terrain points and will be explained later.

Generating the geometry, applying the textures

We’ve already seen that the terrain is a grid of xsz * ysz with xtiles * ytiles tiles, that are subdivided by tile_usub in the x-axis and by tile_vsub in the z-axis. In order to generate a height at every point of this grid, I needed to calculate uniformly distributed random values, like those of Perlin Noise (PN). But I also needed some higher distortions for “mountains” and “hills” here and there that cannot be simulated with PN. For them I used a function that calculates the sum of several Perlin Noise frequencies and results in a fractal-like heightfield. The number of octaves in the PN sum and the frequency (num_octaves, noise_freq) can be customized depending on how much distortion we want.
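For illustration, such a fractal sum (often called fBm) can be sketched roughly as below; the demo uses gph::fbm, whose actual implementation may differ, and perlin() here stands for any 2-D Perlin noise function:

/* Rough fBm sketch (not necessarily gph::fbm): sum several octaves of
 * Perlin noise, each with half the amplitude and twice the frequency of
 * the previous one.  perlin(x, y) is assumed to return noise in [-1, 1]. */
float fbm(float x, float y, int num_octaves)
{
    float sum = 0.0f;
    float amplitude = 1.0f;
    float frequency = 1.0f;
    float norm = 0.0f;

    for (int i = 0; i < num_octaves; i++) {
        sum += amplitude * perlin(x * frequency, y * frequency);
        norm += amplitude;
        amplitude *= 0.5f;
        frequency *= 2.0f;
    }
    return sum / norm;   /* keep the result roughly in [-1, 1] */
}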

Having a terrain with scattered random-looking mountains is nice, but it would be much better if we could further customize its appearance to make it more suitable for our scene. For example, I wanted to place a number of objects in the middle of the terrain, so I wanted it to look flatter in the center. Also, I wanted to put some high mountains at the edges to hide some skybox parts (for example mountains and buildings that were part of the skybox texture and looked ugly). In order to modify the terrain shape, I used a grayscale image, the coarse_heightmap, as a mask and I multiplied the terrain height values by the mask intensities (I’ll explain this better later). The mask looked like the image below:

Image 2: terrain coarse heightmap

Since the heightmap is black at the center (the intensity values there are close to 0), the height in the middle will be low and the terrain will look flat, whereas at the edges, where the intensity takes its maximum values, the heights will stay close to their original values. If you take a closer look at Image 1 and Image 2 you will better understand the relationship between the terrain heights and the mask intensities.

Here is the function that generates each tile’s heightfield, calculating the height at each point as a sum of Perlin noise frequencies:

float Terrain::get_height(float u, float v) const
{
    float sn = gph::fbm(u * params.noise_freq, 
               v * params.noise_freq,
               params.num_octaves);
    sn = sn * 0.5 + 0.5;

    if(params.coarse_heightmap.pixels) {
        Vec4 texel = params.coarse_heightmap.lookup_linear(u, v,
                     1.0 / params.tile_usub, 
                     1.0 / params.tile_vsub);
        sn *= texel.x;
    }
    return sn;
}

Note that the function performs the calculations in u-v space. I found it convenient to write a similar function for the world space to use it in other places.

Now, let’s see how we use the coarse_heightmap on top of that:

The coarse_heightmap image (Image 2) has a very low resolution of 128x128, which means that if we just looked up the closest pixel value for each terrain point on a big terrain, many terrain points would map to the same pixel and we’d start seeing aliasing. To avoid this artifact, I used bilinear interpolation among the neighboring pixels, weighting them by the distance of the terrain point from each pixel of the neighborhood.
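For illustration, the bilinear lookup idea can be sketched roughly as follows (this is not the demo’s actual Image::lookup_linear; the Image members used here are assumptions):

/* Rough sketch of a bilinear lookup (not the demo's actual
 * Image::lookup_linear).  Assumes a grayscale image with width/height
 * members and a pixel(x, y) accessor returning an intensity in [0, 1]. */
float lookup_bilinear(const Image &img, float u, float v)
{
    float x = u * (img.width - 1);
    float y = v * (img.height - 1);
    int x0 = (int)x, y0 = (int)y;
    int x1 = x0 + 1 < img.width  ? x0 + 1 : x0;
    int y1 = y0 + 1 < img.height ? y0 + 1 : y0;
    float tx = x - x0, ty = y - y0;

    /* Interpolate along x on the two neighboring rows, then along y. */
    float top    = img.pixel(x0, y0) * (1 - tx) + img.pixel(x1, y0) * tx;
    float bottom = img.pixel(x0, y1) * (1 - tx) + img.pixel(x1, y1) * tx;
    return top * (1 - ty) + bottom * ty;
}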

Image 3: terrain and skybox (before optimisations)

The following video is a preview of this early stage of the demo where you can see the terrain from different views (as well as a cow called Spot :P):

Improvements:

Fog:

A good trick to make the terrain edges appear more distant than they really are, and to get a more realistic horizon, is to add some fog that is denser at the more distant terrain points and less dense at the points close to the viewer. The fog can be simulated by interpolating the terrain color at each pixel with a color that looks like the sky (I used a light blue here). The interpolation factor must be a function of the distance between the point and the viewer. I used the following function:

exp(-density * distance)

(taken from the glFog OpenGL man page). I hard-coded the density to a value that looked good and I used -pos.z as the distance, where pos is the vertex position in view space.
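As an illustration, the fog blend inside the fragment shader might look roughly like the following sketch (fog_color, fog_density, dist and final_color are assumed names; the demo’s actual shader may differ):

// Hypothetical fragment-shader snippet (inside main); "pos" is assumed to be
// the view-space vertex position and "object_color" the lit terrain color.
vec3 fog_color = vec3(0.65, 0.75, 0.9);        // light blue, close to the sky
float fog_density = 0.004;                     // hard-coded, tuned by eye

float dist = -pos.z;                           // distance from the viewer
float fog_factor = exp(-fog_density * dist);   // e^(-density * distance)
vec3 final_color = mix(fog_color, object_color, fog_factor);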

Image 4 shows the result of this operation:

Image 4: the terrain with the fog and a cow 😉
Image-based lighting:

The idea behind IBL is that instead of using point lights with a fixed position in our 3D world, we can calculate the lighting as the radiance coming from each skybox pixel. For this purpose, we need an irradiance map. I avoided the overhead of calculating one by using a nice tool I found on GitHub, cmft, to pre-calculate it from the images of my skybox. Then, I used the normal direction at each terrain point (in world space coordinates) to sample the map and calculated the color inside the pixel shader:

vec4 itexel = textureCube(dstex, normalize(world_normal));
vec3 object_color = diffuse.xyz * texel.xyz * itexel.xyz;

As you can guess, world_normal is the normal in world space (after the modelling transformation) and object_color is the color we get if we multiply the material’s diffuse color with the texel value (from the terrain texture) and the irradiance map texel value (itexel). The result of this operation can be seen in this video:

TODO LIST

Although the terrain looks better with these additions, I think that it can be further improved if I add the following things:

  1. Multiple textures for different parts of the terrain
  2. (Maybe) specular color calculated by the irradiance map (since atm I have support for diffuse only)
Performance optimisations:

I also want to optimize its performance by adding:

  1. View Frustum Culling: The idea is that we only draw the terrain tiles that are visible (inside the view frustum).
  2. Tessellation shaders: use TC and TE shaders to improve the terrain tessellation.

Some of the above TODOs are in progress and some are still on my wishlist, but all of them are too long to analyze in this post anyway.

I will post about these additions as soon as I finish them, so stay tuned! 😉

by hikiko at August 24, 2017 07:59 AM

August 20, 2017

Hyunjun Ko

Support GstContext for VA-API elements

Since I started working on gstreamer-vaapi, one of the things that has disappointed me is that vaapisink is not very popular, even though it should be the best choice on a machine with VA-API installed. There are some reasonable causes, and one of them is probably that it doesn’t provide a convenient way for application developers to integrate it.

Until now, we provided a way to set the X11 window handle via gst_video_overlay_set_window_handle, which tells the overlay to display the video output in a specific window. But this is not enough, since the VA and X11 display handles are managed internally inside the gstreamer-vaapi elements, which means that users can’t manage them themselves.

In short, there was no way to share a display handle created by the application. This also leads to some additional problems, such as the following:

  • If users want to handle multiple displays separately, it’s not possible. bug 754820
  • If users run multiple decoding pipelines with vaapisink, performance is down critically since there’s some locks in each vaapisink with same VADisplay. bug 747946

Recently we have merged a series of patches that provide a way to set an external VA display and X11 display from the application via GstContext. GstContext provides a way of sharing context not only between elements but also with the application, using queries and messages. (For more details, see https://developer.gnome.org/gstreamer/stable/gstreamer-GstContext.html)

With these patches, the application can set its own VA display and X11 display on the VA-API elements as follows:

  • Create a VADisplay instance with vaGetDisplay; it doesn’t need to be initialized at startup or terminated at shutdown.
  • Call gst_element_set_context with a context in which each display instance has been set.

Example: sharing a VADisplay and an X11 display via the bus callback. This is almost the same as other examples using GstContext.

static GstBusSyncReply
bus_sync_handler (GstBus * bus, GstMessage * msg, gpointer data)
{
  switch (GST_MESSAGE_TYPE (msg)) {
    case GST_MESSAGE_NEED_CONTEXT:{
      const gchar *context_type;
      GstContext *context;
      GstStructure *s;
      VADisplay va_display;
      Display *x11_display;

      gst_message_parse_context_type (msg, &context_type);
      gst_println ("Got need context %s from %s", context_type,
          GST_MESSAGE_SRC_NAME (msg));

      if (g_strcmp0 (context_type, "gst.vaapi.app.Display") != 0)
        break;

      x11_display = XOpenDisplay (NULL); /* or get the X11 Display somehow */
      va_display = vaGetDisplay (x11_display);

      context = gst_context_new ("gst.vaapi.app.Display", TRUE);
      s = gst_context_writable_structure (context);
      gst_structure_set (s, "va-display", G_TYPE_POINTER, va_display, NULL);
      gst_structure_set (s, "x11-display", G_TYPE_POINTER, x11_display, NULL);

      gst_element_set_context (GST_ELEMENT (GST_MESSAGE_SRC (msg)), context);
      gst_context_unref (context);
      break;
    }   
    default:
      break;
  }

  return GST_BUS_PASS;
}
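
For reference, here is a minimal sketch (assuming an existing pipeline element) of how the application could install this handler on the pipeline’s bus before starting playback:

GstBus *bus = gst_pipeline_get_bus (GST_PIPELINE (pipeline));
/* bus_sync_handler is the callback defined above */
gst_bus_set_sync_handler (bus, bus_sync_handler, NULL, NULL);
gst_object_unref (bus);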

Also you can find the entire example code here.

Furthermore, we know we need to support Wayland for this feature. See bug 705821. There’s already some pending patches but they need to be rebased and modified based on the current way. I’ll be working on this in the near future.

We really want to test this feature more, especially in practical cases, before the next release. I’d appreciate it if someone reports any bug or issue, and I promise I’ll look into it closely.

Thanks for reading!

August 20, 2017 03:00 PM

August 09, 2017

Michael Catanzaro

On Firefox Sync

Epiphany 3.26 is, unfortunately, not going to be packed with cool new features like 3.24 was. We’ve just been too busy working on improving WebKit this cycle. But there is one cool new thing: Firefox Sync support. You can sync bookmarks, history, passwords, and open tabs with other Epiphany instances as well as with both desktop and mobile Firefox. This is already enabled in 3.25.90. Just go to the Sync tab in Preferences and sign in or create your Firefox account there. Please test it out and report bugs now, so we can quash problems you find before 3.26.0 rather than after.

Some thank yous are in order:

  • Thanks to Gabriel Ivascu, for writing all the code.
  • Thanks to Google and Igalia for sponsoring Gabriel’s work.
  • Thanks to Mozilla. This project would never have been possible if Mozilla had not carefully written its terms of service to allow such use.

Go forth and sync!

by Michael Catanzaro at August 09, 2017 07:57 PM

August 06, 2017

Michael Catanzaro

Endgame for WebKit Woes

In my original blog post On WebKit Security Updates, I identified three separate problems affecting WebKit users on Linux:

  • Distributions were not providing updates for WebKitGTK+. This was the main focus of that post.
  • Distributions were shipping an insecure compatibility package for old, unmaintained WebKitGTK+ 2.4 (“WebKit1”).
  • Distributions were shipping QtWebKit, which was also unmaintained and insecure.

Let’s review these problems one at a time.

Distributions Are Updating WebKitGTK+

Nowadays, most major community distributions are providing regular WebKitGTK+ updates, so this is no longer a problem for the vast majority of Linux users. If you’re using a supported version of Ubuntu (except Ubuntu 14.04), Fedora, or most other mainstream distributions, then you are good to go.

My main concern here is still Debian, but there are reasons to be optimistic. It’s too soon to say what Debian’s policy will be going forward, but I am encouraged that it broke freeze just before the Stretch release to update from WebKitGTK+ 2.14 to 2.16.3. Debian is slow and conservative and so has not yet updated to 2.16.6, which is sad because 2.16.3 is affected by a bug that causes crashes on a huge number of websites, but my understanding is it is likely to be updated in the near future. I’m not sure if Debian will update to 2.18 or not. We’ll have to wait and see.

openSUSE is another holdout. The latest stable version of openSUSE Leap, 42.3, is currently shipping WebKitGTK+ 2.12.5. That is disappointing.

Most other major distributions seem to be current.

Distributions Are Removing WebKitGTK+ 2.4

WebKitGTK+ 2.4 (often informally referred to as “WebKit1”) was the next problem. Tons of desktop applications depended on this old, insecure version of WebKitGTK+, and due to large API changes, upgrading applications was not going to be easy. But this transition is going much smoother and much faster than I expected. Several distributions, including Debian, Fedora, and Arch, have recently removed their compatibility packages. There will be no WebKitGTK+ 2.4 in Debian 10 (Buster) or Fedora 27 (scheduled for release this October). Most noteworthy applications have either ported to modern WebKitGTK+, or have configure flags to disable use of WebKitGTK+. In some cases, such as GnuCash in Fedora, WebKitGTK+ 2.4 is being bundled as part of the application build process. But more often, applications that have not yet ported simply no longer work or have been removed from these distributions.

Soon, users will no longer need to worry that a huge amount of WebKitGTK+ applications are not receiving security updates. That leaves one more problem….

QtWebKit is Back

Upstream QtWebKit has not been receiving security updates for the past four years or thereabouts, since it was abandoned by the Qt project. That is still the status quo for most distributions, but Arch and Fedora have recently switched to Konstantin Tokarev’s fork of QtWebKit, which is based on WebKitGTK+ 2.12. (Thank you Konstantin!) If you are using any supported version of Fedora, you should already have been switched to this fork. I am hopeful that the fork will be rebased on WebKitGTK+ 2.16 or 2.18 in the near future, to bring it current on security updates, but in the meantime, being a year and a half behind is an awful lot better than being four years behind. Now that Arch and Fedora have led the way, other distributions should find little trouble in making the switch to Konstantin’s QtWebKit. It would be a disservice to users to continue shipping the upstream version.

So That’s Cool

Things are better. Some distributions, notably Arch and Fedora, have resolved all of the above problems (or will in the very near future). Yay!

by Michael Catanzaro at August 06, 2017 09:47 PM

Modifying hidden settings in Epiphany 3.24

We’re just one short month away from releasing Epiphany 3.26, but this is not a post about that. Turns out there is some confusion about how to edit hidden settings in Epiphany 3.24. Many users previously relied on the dconf-editor tool to tweak hidden settings like the user agent or minimum font size, but this no longer works in 3.24. What gives?

The problem is that these settings can now be configured separately for your main browsing instance and for each web app. This gives you a lot more flexibility, but it does make it harder to change the settings because dconf-editor will not work anymore. The technical problem is that dconf-editor does not support relocatable settings schemas: settings definitions that are reused in many different places. So you will unfortunately have to use the command line to change these settings now. For example:

# Old command, *this no longer works*
$ gsettings set org.gnome.Epiphany.web user-agent 'Mozilla/5.0'

# Replacement command
$ gsettings set org.gnome.Epiphany.web:/org/gnome/epiphany/web/ user-agent 'Mozilla/5.0'
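
You can read the value back with the same relocatable path to verify the change:

# Verify the new value
$ gsettings get org.gnome.Epiphany.web:/org/gnome/epiphany/web/ user-agent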

Changing a global setting like this will also affect newly-created web apps, but not existing web apps.

by Michael Catanzaro at August 06, 2017 06:13 PM

August 03, 2017

Eleni Maria Stea

Debugging graphics code using replacement shaders (Linux, Mesa)

Sometimes, when working with the mesa drivers, modifying or replacing a shader might be extremely useful for debugging. Mesa allows users to replace their shaders at runtime without having to change the original code by providing these environment variables:

MESA_SHADER_READ_PATH and MESA_SHADER_DUMP_PATH

Example usage:

In the following example we are going to use these two environment variables with a small OpenGL program called demo.

Step 1:

We create a directory (tmp) to store the shaders and two more directories read and dump inside it:

tmp/
├── dump
└── read

It’s necessary that these dump and read directories exist before running the program that will be debugged (the demo in our case).

Step 2: We export the environment variables:

export MESA_SHADER_READ_PATH=tmp/read
export MESA_SHADER_DUMP_PATH=tmp/dump

The first one sets the directory where the mesa driver will look for replacement shaders, whereas the second one sets the directory where the shaders will be dumped.

Step 3: We run the program once to dump its original shaders inside the tmp/dump directory:

./demo

Step 4: We copy the shaders from the dump directory to the read directory and then we modify the ones in the read directory:

cp tmp/dump/* tmp/read/

It is important not to change the filenames of the shader files. After this step both directories should contain some shaders with long names similar to these:

tmp/
├── dump
│   ├── FS_41bfd6998229f924b0bc409fafc85043d0819adc.glsl
│   ├── FS_efc090363ee2378fbae150e66f53a891e072e983.glsl
│   ├── VS_17c27d658ec6d02901f45c88b67111bd4ee955cb.glsl
│   └── VS_9668281d927970b6ff023d45da67b38fc930dafe.glsl
└── read
    ├── FS_41bfd6998229f924b0bc409fafc85043d0819adc.glsl
    ├── FS_efc090363ee2378fbae150e66f53a891e072e983.glsl
    ├── VS_17c27d658ec6d02901f45c88b67111bd4ee955cb.glsl
    └── VS_9668281d927970b6ff023d45da67b38fc930dafe.glsl

As you can guess the VS_*.glsl are the program’s vertex shaders and the FS_*.glsl the fragment ones.

The reason we see two VS_*.glsl (vertex shaders) and two FS_*.glsl (fragment shaders) in the dump directory is that the demo program was originally using two vertex and two fragment shaders for rendering.

We could also see dumped shader names starting with GS, TC, or TE (for Geometry, Tessellation Control and Tessellation Evaluation shaders) if the program was using such shaders.

Now, every shader in the read directory can be safely modified. I will only change one of the fragment shaders for simplicity.

The FS_41bfd6998229f924b0bc409fafc85043d0819adc.glsl file is the dump of my original fragment shader, named sky.f.glsl, which was calculating the colors of the pixels of a skybox using this code:

#version 450
uniform samplerCube stex;
in vec3 normal;
out vec4 color;
void main()
{
    vec4 texel = textureCube(stex, normalize(normal));

    color.rgb = texel.rgb;
    color.a = 1.0;
}

I used the sky.f.glsl fragment shader with some code that draws a green quad and the result was:

quad1

If we modify the tmp/read/FS_41bfd6998229f924b0bc409fafc85043d0819adc.glsl replacement shader to simply return blue like that:

#version 450
uniform samplerCube stex;
in vec3 normal;
out vec4 color;
void main()
{
    color = vec4(0.0, 0.0, 1.0, 1.0);
}

and run ./demo again, the result will be a blue sky, as expected:

Debugging graphics code using replacement shaders (Linux, Mesa)

We could safely play with any replacement shader in the read directory and then simply delete it. The demo program’s code would remain the same.

Let’s try a more interesting example with a more complex program, like blender.

We create the same directory tree, we export the variables and then we run blender (select the material and choose GLSL at the rendering options) to dump its default shaders and end up with a directory tree similar to this one:

tmp
├── dump
│   ├── FS_e025add3a93498ca49ba96c38260c36138430d54.glsl
│   └── VS_c5310c724728053b7bf1e0b1055546f530afa9ca.glsl
└── read
    ├── FS_e025add3a93498ca49ba96c38260c36138430d54.glsl
    └── VS_c5310c724728053b7bf1e0b1055546f530afa9ca.glsl

On the screen we see something like this:

That is the default blender scene.

We can then open the file tmp/read/FS_e025add3a93498ca49ba96c38260c36138430d54.glsl and search for the main function. We will modify the output color by adding, at the end of the function, this line that sets the output color to pink:

gl_FragColor = vec4(0.84, 0.16, 0.63, 1.0);

The code will look like this:

[...]
	shade_add(vec4(tmp73, 1.0), tmp63, tmp76);
	mtex_alpha_to_col(tmp76, cons78, tmp79);
	shade_mist_factor(varposition, unf81, unf82, unf83, unf84, unf85, tmp86);
	mix_blend(tmp86, tmp79, unf89, tmp90);
	shade_alpha_opaque(tmp90, tmp92);
	linearrgb_to_srgb(tmp92, tmp94);

	gl_FragColor = tmp94;
	gl_FragColor = vec4(0.84, 0.16, 0.63, 1.0);

If we exit and run blender again, we’ll see that selecting the material makes the cube pink:

This is because the blender shader that calculates the material color is replaced by our read/FS_e025add3a93498ca49ba96c38260c36138430d54.glsl shader where we explicitly set the output color (gl_FragColor) to pink (0.84, 0.16, 0.63, 1.0).

Note: the Mesa documentation mentions that we need to compile the mesa driver using the --with-sha1 option for the environment variables to take effect. This option doesn’t seem to exist anymore but fortunately the trick works without it, and it seems that the only thing we need to pay attention to is to keep the replacement shader filenames in the read directory unchanged.

by hikiko at August 03, 2017 05:06 PM

Gyuyoung Kim

How to maintain your code downstream?

Nowadays most of my projects use opensource or are built on top of opensource. If you just need to use it as-is, you are in a good situation. However, we usually have projects that keep an opensource code base for years for our products, so if you have to hack a lot of modules inside the opensource code, it can be a nightmare to rebase your current source base against the latest opensource after months or years. In this article I would like to share some of my experiences on how to maintain downstream patches when working with opensource.

  1. Try to contribute your patch to the opensource project as much as possible

    • I think this is the best way to reduce the heavy burden of maintaining patches in your downstream source code. If the opensource project you’re using is being developed fast, you will often face many conflicts during a rebase because it’s likely that the original code or architecture has changed frequently in the meantime. To avoid such conflicts, it is best to contribute your patches to the opensource project as much as possible.
    • The documents below are good examples of how to contribute your code to an opensource project.
  2. Make your downstream port

    • Although we know well that #1 is the best way to reduce downstream patches, it is often hard to follow because downstream patches are frequently too hacky or unstable. Even if you submit a downstream patch upstream, you might get many review comments or objections from the opensource maintainers. In such cases, what else can you do? In my experience, it was important to separate our downstream implementation from the original code. For example, we can add new TriangleFoo.h/cpp files instead of modifying the original Triangle.h/cpp files, and then use them by modifying a few build scripts, as in the example below.
      1. Figure class
       class Figure {
        public:
            virtual int calculateSize() = 0;
       };
       
      2. Triangle class
       class Triangle : public Figure {
        public:
            virtual int calculateSize() override;
       };

       int Triangle::calculateSize() {
           return width * height / 2;
       }
      
      3. TriangleFoo class
       class TriangleFoo : public Figure {
           public: virtual int calculateSize() override;
       };

       int TriangleFoo::calculateSize() {
           return new_width * height / 2;
       }
      
      4. Build script. In this example we use cmake,
       list(APPEND Figure_SOURCES
           Figure.cpp
           Triangle.cpp
           TriangleFoo.cpp
       )
       
       list(REMOVE_ITEM Figure_SOURCES
           Triangle.cpp
       )

      This way we can avoid conflicts in the Triangle.h/cpp files during the next rebase against the latest opensource. However, we still need to modify TriangleFoo.h/cpp if Figure.h is changed, for example when a new parameter is added or a return type is changed.

  3. Use #if ~ #endif guard

    • When we only need to modify a few lines or just change the logic inside a function, we can use an #if ~ #else ~ #endif guard. The guard helps us to know which code was added or modified by us. Besides, it can help us easily check whether a downstream patch generated side effects, simply by turning it off. In my previous projects, most issues came from downstream patches because they lacked code review, missed test cases, or were too hacky against the original architecture. In such cases, you can check the issue just by turning the guard off. However, if you use #if ~ #endif guards in many places, they can mess your code up, so I’d recommend using them only when you really need to.
      int Triangle::calculateSize()
      {
      #if defined(DOWNSTREAM_ENABLED)
          return new_width * height / 2;
      #else
          return width * height / 2;
      #endif
      }
  4. Try to make one patch per feature

    • As you may have experienced before, it is very hard to implement a feature in a single commit. Even if you succeed in implementing a new feature perfectly in one commit, you may need to touch the implementation again in order to fix a bug or satisfy new requirements. In such cases your git history will get messier and messier, which will make it difficult to rebase on the latest opensource. To avoid that, you may have manually squashed the original implementation with the fixup commits added later. But there are two useful git commands for this case: git commit --fixup and git rebase --autosquash.
        • git commit --fixup : automatically marks your commit as a fix of a previous commit, and constructs a commit message for use with git rebase --autosquash.
        • git rebase -i --autosquash : automatically organizes the merging of these fixup commits with their associated normal commits.

    • Example
      There is a good article that explains this method [1]. If you want to understand it further, it is worth visiting the URL. Let’s assume that we have 3 commits in your local repository.
$ git log --oneline
  new commit1 (7ae79f6)
  new commit2 (9e4c1de)
  new commit3 (480ee07)
  previous commit (19c8abf)

But if we just noticed that we forgot to add a comment in commit2, it’s time to use the --fixup option.

$ git add [modified file]
$ git commit --fixup [new commit2's commit-id]
  (i.e. git commit --fixup 9e4c1de)

Then you can clean up your branch before merging it using the --autosquash option.

$ git rebase -i --autosquash [previous commit id]
  (i.e. git rebase -i --autosquash 19c8abf)

 

Reference
[1] http://fle.github.io/git-tip-keep-your-branch-clean-with-fixup-and-autosquash.html

by gyuyoung at August 03, 2017 01:28 AM

August 01, 2017

Gyuyoung Kim

Hello world!

Welcome to Igalia Blogs. This is your first post. Edit or delete it, then start blogging!

by gyuyoung at August 01, 2017 11:30 AM

July 30, 2017

Iago Toral

Working with lights and shadows – Part II: The shadow map

In the previous post we talked about the Phong lighting model as a means to represent light in a scene. Once we have light, we can think about implementing shadows, which are the parts of the scene that are not directly exposed to light sources. Shadow mapping is a well known technique used to render shadows in a scene from one or multiple light sources. In this post we will start discussing how to implement this, specifically, how to render the shadow map image, and the next post will cover how to use the shadow map to render shadows in the scene.

Note: although the code samples in this post are for Vulkan, it should be easy for the reader to replicate the implementation in OpenGL. Also, my OpenGL terrain renderer demo implements shadow mapping and can also be used as a source code reference for OpenGL.

Algorithm overview

Shadow mapping involves two passes: the first pass renders the scene from the point of view of the light with depth testing enabled and records depth information for each fragment. The resulting depth image (the shadow map) contains depth information for the fragments that are visible from the light source, and therefore, are occluders for any other fragment behind them from the point of view of the light. In other words, these represent the only fragments in the scene that receive direct light, every other fragment is in the shade. In the second pass we render the scene normally to the render target from the point of view of the camera, then for each fragment we need to compute the distance to the light source and compare it against the depth information recorded in the previous pass to decide if the fragment is behind a light occluder or not. If it is, then we remove the diffuse and specular components for the fragment, making it look shadowed.

In this post I will cover the first pass: generation of the shadow map.

Producing the shadow map image

Note: those looking for OpenGL code can have a look at this file ter-shadow-renderer.cpp from my OpenGL terrain renderer demo, which contains the shadow map renderer that generates the shadow map for the sun light in that demo.

Creating a depth image suitable for shadow mapping

The shadow map is a regular depth image where we will record depth information for fragments in light space. This image will be rendered into and sampled from. In Vulkan we can create it like this:

...
VkImageCreateInfo image_info = {};
image_info.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
image_info.pNext = NULL;
image_info.imageType = VK_IMAGE_TYPE_2D;
image_info.format = VK_FORMAT_D32_SFLOAT;
image_info.extent.width = SHADOW_MAP_WIDTH;
image_info.extent.height = SHADOW_MAP_HEIGHT;
image_info.extent.depth = 1;
image_info.mipLevels = 1;
image_info.arrayLayers = 1;
image_info.samples = VK_SAMPLE_COUNT_1_BIT;
image_info.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
image_info.usage = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
                   VK_IMAGE_USAGE_SAMPLED_BIT;
image_info.queueFamilyIndexCount = 0;
image_info.pQueueFamilyIndices = NULL;
image_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
image_info.flags = 0;

VkImage image;
vkCreateImage(device, &image_info, NULL, &image);
...

The code above creates a 2D image with a 32-bit float depth format. The shadow map’s width and height determine the resolution of the depth image: larger sizes produce higher quality shadows but of course this comes with an additional computing cost, so you will probably need to balance quality and performance for your particular target. In the first pass of the algorithm we need to render to this depth image, so we include the VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT usage flag, while in the second pass we will sample the shadow map from the fragment shader to decide if each fragment is in the shade or not, so we also include the VK_IMAGE_USAGE_SAMPLED_BIT.

One more tip: when we allocate and bind memory for the image, we probably want to request device local memory too (VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) for optimal performance, since we won’t need to map the shadow map memory in the host for anything.
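
For illustration, here is a minimal sketch of that allocation; find_memory_type() is a hypothetical helper that picks a memory type index from mem_reqs.memoryTypeBits with the requested property flags:

...
VkMemoryRequirements mem_reqs;
vkGetImageMemoryRequirements(device, image, &mem_reqs);

VkMemoryAllocateInfo alloc_info = {};
alloc_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc_info.allocationSize = mem_reqs.size;
/* find_memory_type() is assumed to select a suitable memory type with
 * the DEVICE_LOCAL property from mem_reqs.memoryTypeBits */
alloc_info.memoryTypeIndex =
   find_memory_type(physical_device, mem_reqs.memoryTypeBits,
                    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);

VkDeviceMemory image_mem;
vkAllocateMemory(device, &alloc_info, NULL, &image_mem);
vkBindImageMemory(device, image, image_mem, 0);
...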

Since we are going to render to this image in the first pass of the process we also need to create a suitable image view that we can use to create a framebuffer. There are no special requirements here, we just create a view with the same format as the image and with a depth aspect:

...
VkImageViewCreateInfo view_info = {};
view_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
view_info.pNext = NULL;
view_info.image = image;
view_info.format = VK_FORMAT_D32_SFLOAT;
view_info.components.r = VK_COMPONENT_SWIZZLE_R;
view_info.components.g = VK_COMPONENT_SWIZZLE_G;
view_info.components.b = VK_COMPONENT_SWIZZLE_B;
view_info.components.a = VK_COMPONENT_SWIZZLE_A;
view_info.subresourceRange.aspectMask = VK_IMAGE_ASPECT_DEPTH_BIT;
view_info.subresourceRange.baseMipLevel = 0;
view_info.subresourceRange.levelCount = 1;
view_info.subresourceRange.baseArrayLayer = 0;
view_info.subresourceRange.layerCount = 1;
view_info.viewType = VK_IMAGE_VIEW_TYPE_2D;
view_info.flags = 0;

VkImageView shadow_map_view;
vkCreateImageView(device, &view_info, NULL, &shadow_map_view);
...

Rendering the shadow map

In order to generate the shadow map image we need to render the scene from the point of view of the light, so first, we need to compute the corresponding View and Projection matrices. How we calculate these matrices depends on the type of light we are using. As described in the previous post, we can consider 3 types of lights: spotlights, positional lights and directional lights.

Spotlights are the easiest for shadow mapping, since with these we use regular perspective projection.

Positional lights work similarly to spotlights in the sense that they also use perspective projection; however, because these are omnidirectional, they see the entire scene around them. This means that we need to render a shadow map that contains scene objects in all directions around the light. We can do this by using a cube texture for the shadow map instead of a regular 2D texture and rendering the scene 6 times, adjusting the View matrix to capture scene objects in front of the light, behind it, to its left, to its right, above and below. In this case we want to use a field of view of 90º (with a 1:1 aspect ratio) for the projection matrix so that the set of 6 images captures the full scene around the light source with no gaps and no image overlaps.
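
As a rough sketch (assuming glm and a light_pos vec3 holding the light’s world space position), the 6 per-face View matrices could be built like this, following the usual cube map face orientation conventions:

glm::mat4 face_projection =
      glm::perspective(glm::radians(90.0f), 1.0f, LIGHT_NEAR, LIGHT_FAR);

/* Look directions and up vectors for the +X, -X, +Y, -Y, +Z, -Z faces */
glm::vec3 dirs[6] = {
   glm::vec3( 1, 0, 0), glm::vec3(-1, 0, 0),
   glm::vec3( 0, 1, 0), glm::vec3( 0,-1, 0),
   glm::vec3( 0, 0, 1), glm::vec3( 0, 0,-1)
};
glm::vec3 ups[6] = {
   glm::vec3(0,-1, 0), glm::vec3(0,-1, 0),
   glm::vec3(0, 0, 1), glm::vec3(0, 0,-1),
   glm::vec3(0,-1, 0), glm::vec3(0,-1, 0)
};

for (int i = 0; i < 6; i++) {
   glm::mat4 face_view = glm::lookAt(light_pos, light_pos + dirs[i], ups[i]);
   /* render the scene into face i of the cube shadow map using
      face_projection * face_view as the light's ViewProjection */
}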

Finally, we have directional lights. In the previous post I mentioned that these lights model light sources whose rays are parallel, and because of this they cast regular shadows (that is, shadows that are not perspective projected). Thus, to render shadow maps for directional lights we want to use orthographic projection instead of perspective projection.

Projected shadow from a point light source
Regular shadow from a directional light source

In this post I will focus on creating a shadow map for a spotlight source only. I might write follow up posts in the future covering other light sources, but for the time being, you can have a look at my OpenGL terrain renderer demo if you are interested in directional lights.

So, for a spotlight source, we just define a regular perspective projection, like this:

glm::mat4 clip = glm::mat4(1.0f, 0.0f, 0.0f, 0.0f,
                           0.0f,-1.0f, 0.0f, 0.0f,
                           0.0f, 0.0f, 0.5f, 0.0f,
                           0.0f, 0.0f, 0.5f, 1.0f);

glm::mat4 light_projection = clip *
      glm::perspective(glm::radians(45.0f),
                       (float) SHADOW_MAP_WIDTH / SHADOW_MAP_HEIGHT,
                       LIGHT_NEAR, LIGHT_FAR);

The code above generates a regular perspective projection with a field of view of 45º. We should adjust the light’s near and far planes to make them as tight as possible to reduce artifacts when we use the shadow map to render the shadows in the scene (I will go deeper into this in a later post). In order to do this we should consider that the near plane can be increased to reflect the closest that an object can be to the light (that might depend on the scene, of course) and the far plane can be decreased to match the light’s area of influence (determined by its attenuation factors, as explained in the previous post).

The clip matrix is not specific to shadow mapping; it just makes it so that the resulting projection considers the particularities of how the Vulkan coordinate system is defined (the Y axis is inverted, the Z range is halved).

As usual, the projection matrix provides us with a projection frustum, but we still need to point that frustum in the direction in which our spotlight is facing, so we also need to compute the view matrix transform of our spotlight. One way to define the direction in which our spotlight is facing is by having the rotation angles of the spotlight on each axis, similarly to what we would do to compute the view matrix of our camera:

glm::mat4
compute_view_matrix_for_rotation(glm::vec3 origin, glm::vec3 rot)
{
   glm::mat4 mat(1.0);
   float rx = DEG_TO_RAD(rot.x);
   float ry = DEG_TO_RAD(rot.y);
   float rz = DEG_TO_RAD(rot.z);
   mat = glm::rotate(mat, -rx, glm::vec3(1, 0, 0));
   mat = glm::rotate(mat, -ry, glm::vec3(0, 1, 0));
   mat = glm::rotate(mat, -rz, glm::vec3(0, 0, 1));
   mat = glm::translate(mat, -origin);
   return mat;
}

Here, origin is the position of the light source in world space, and rot represents the rotation angles of the light source on each axis, representing the direction in which the spotlight is facing.

Now that we have the View and Projection matrices that define our light space we can go on and render the shadow map. For this we need to render the scene as we normally would, but instead of using our camera’s View and Projection matrices, we use the light’s. Let’s have a look at the shadow map rendering code:

Render pass

static VkRenderPass
create_shadow_map_render_pass(VkDevice device)
{
   VkAttachmentDescription attachments[2];

   // Depth attachment (shadow map)
   attachments[0].format = VK_FORMAT_D32_SFLOAT;
   attachments[0].samples = VK_SAMPLE_COUNT_1_BIT;
   attachments[0].loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
   attachments[0].storeOp = VK_ATTACHMENT_STORE_OP_STORE;
   attachments[0].stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
   attachments[0].stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
   attachments[0].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
   attachments[0].finalLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
   attachments[0].flags = 0;

   // Attachment references from subpasses
   VkAttachmentReference depth_ref;
   depth_ref.attachment = 0;
   depth_ref.layout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;

   // Subpass 0: shadow map rendering
   VkSubpassDescription subpass[1];
   subpass[0].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
   subpass[0].flags = 0;
   subpass[0].inputAttachmentCount = 0;
   subpass[0].pInputAttachments = NULL;
   subpass[0].colorAttachmentCount = 0;
   subpass[0].pColorAttachments = NULL;
   subpass[0].pResolveAttachments = NULL;
   subpass[0].pDepthStencilAttachment = &depth_ref;
   subpass[0].preserveAttachmentCount = 0;
   subpass[0].pPreserveAttachments = NULL;

   // Create render pass
   VkRenderPassCreateInfo rp_info;
   rp_info.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
   rp_info.pNext = NULL;
   rp_info.attachmentCount = 1;
   rp_info.pAttachments = attachments;
   rp_info.subpassCount = 1;
   rp_info.pSubpasses = subpass;
   rp_info.dependencyCount = 0;
   rp_info.pDependencies = NULL;
   rp_info.flags = 0;

   VkRenderPass render_pass;
   VK_CHECK(vkCreateRenderPass(device, &rp_info, NULL, &render_pass));

   return render_pass;
}

The render pass is simple enough: we only have one attachment with the depth image and one subpass that renders to the shadow map target. We will start the render pass by clearing the shadow map and by the time we are done we want to store it and transition it to layout VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL so we can sample from it later when we render the scene with shadows. Notice that because we only care about depth information, the render pass doesn’t include any color attachments.

Framebuffer

Every rendering job needs a target framebuffer, so we need to create one for our shadow map. For this we will use the image view we created from the shadow map image. We link this framebuffer target to the shadow map render pass description we have just defined:

VkFramebufferCreateInfo fb_info;
fb_info.sType = VK_STRUCTURE_TYPE_FRAMEBUFFER_CREATE_INFO;
fb_info.pNext = NULL;
fb_info.renderPass = shadow_map_render_pass;
fb_info.attachmentCount = 1;
fb_info.pAttachments = &shadow_map_view;
fb_info.width = SHADOW_MAP_WIDTH;
fb_info.height = SHADOW_MAP_HEIGHT;
fb_info.layers = 1;
fb_info.flags = 0;

VkFramebuffer shadow_map_fb;
vkCreateFramebuffer(device, &fb_info, NULL, &shadow_map_fb);

Pipeline description

The pipeline we use to render the shadow map also has some particularities:

Because we only care about recording depth information, we can typically skip any vertex attributes other than the positions of the vertices in the scene:

...
VkVertexInputBindingDescription vi_binding[1];
VkVertexInputAttributeDescription vi_attribs[1];

// Vertex attribute binding 0, location 0: position
vi_binding[0].binding = 0;
vi_binding[0].inputRate = VK_VERTEX_INPUT_RATE_VERTEX;
vi_binding[0].stride = 2 * sizeof(glm::vec3);

vi_attribs[0].binding = 0;
vi_attribs[0].location = 0;
vi_attribs[0].format = VK_FORMAT_R32G32B32_SFLOAT;
vi_attribs[0].offset = 0;

VkPipelineVertexInputStateCreateInfo vi;
vi.sType = VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO;
vi.pNext = NULL;
vi.flags = 0;
vi.vertexBindingDescriptionCount = 1;
vi.pVertexBindingDescriptions = vi_binding;
vi.vertexAttributeDescriptionCount = 1;
vi.pVertexAttributeDescriptions = vi_attribs;
...
pipeline_info.pVertexInputState = &vi;
...

The code above defines a single vertex attribute for the position, but assumes that we read this from a vertex buffer that packs interleaved positions and normals for each vertex (each being a vec3) so we use the binding’s stride to jump over the normal values in the buffer. This is because in this particular example, we have a single vertex buffer that we reuse for both shadow map rendering and normal scene rendering (which requires vertex normals for lighting computations).

Again, because we do not produce color data, we can skip the fragment shader and our vertex shader is a simple passthrough instead of the normal vertex shader we use with the scene:

....
VkPipelineShaderStageCreateInfo shader_stages[1];
shader_stages[0].sType =
   VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
shader_stages[0].pNext = NULL;
shader_stages[0].pSpecializationInfo = NULL;
shader_stages[0].flags = 0;
shader_stages[0].stage = VK_SHADER_STAGE_VERTEX_BIT;
shader_stages[0].pName = "main";
shader_stages[0].module = create_shader_module("shadowmap.vert.spv", ...);
...
pipeline_info.pStages = shader_stages;
pipeline_info.stageCount = 1;
...

This is what the shadow map vertex shader (shadowmap.vert) looks like in GLSL:

#version 400

#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

layout(std140, set = 0, binding = 0) uniform vp_ubo {
    mat4 ViewProjection;
} VP;

layout(std140, set = 0, binding = 1) uniform m_ubo {
     mat4 Model[16];
} M;

layout(location = 0) in vec3 in_position;

void main()
{
   vec4 pos = vec4(in_position.x, in_position.y, in_position.z, 1.0);
   vec4 world_pos = M.Model[gl_InstanceIndex] * pos;
   gl_Position = VP.ViewProjection * world_pos;
}

The shader takes the ViewProjection matrix of the light (we have already multiplied both together in the host) and a UBO with the Model matrices of each object in the scene as external resources (we use instanced rendering in this particular example) as well as a single vec3 input attribute with the vertex position. The only job of the vertex shader is to compute the position of the vertex in the transformed space (the light space, since we are passing the ViewProjection matrix of the light), nothing else is done here.

Command buffer

The command buffer is pretty similar to the one we use with the scene, only that we render to the shadow map image instead of the usual render target. In the shadow map render pass description we have indicated that we will clear it, so we need to include a depth clear value. We also need to make sure that we set the viewport and scissor to match the shadow map dimensions:

...
VkClearValue clear_values[1];
clear_values[0].depthStencil.depth = 1.0f;
clear_values[0].depthStencil.stencil = 0;

VkRenderPassBeginInfo rp_begin;
rp_begin.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
rp_begin.pNext = NULL;
rp_begin.renderPass = shadow_map_render_pass;
rp_begin.framebuffer = shadow_map_framebuffer;
rp_begin.renderArea.offset.x = 0;
rp_begin.renderArea.offset.y = 0;
rp_begin.renderArea.extent.width = SHADOW_MAP_WIDTH;
rp_begin.renderArea.extent.height = SHADOW_MAP_HEIGHT;
rp_begin.clearValueCount = 1;
rp_begin.pClearValues = clear_values;

vkCmdBeginRenderPass(shadow_map_cmd_buf,
                     &rp_begin,
                     VK_SUBPASS_CONTENTS_INLINE);

VkViewport viewport;
viewport.height = SHADOW_MAP_HEIGHT;
viewport.width = SHADOW_MAP_WIDTH;
viewport.minDepth = 0.0f;
viewport.maxDepth = 1.0f;
viewport.x = 0;
viewport.y = 0;
vkCmdSetViewport(shadow_map_cmd_buf, 0, 1, &viewport);

VkRect2D scissor;
scissor.extent.width = SHADOW_MAP_WIDTH;
scissor.extent.height = SHADOW_MAP_HEIGHT;
scissor.offset.x = 0;
scissor.offset.y = 0;
vkCmdSetScissor(shadow_map_cmd_buf, 0, 1, &scissor);
...

Next, we bind the shadow map pipeline we created above, bind the vertex buffer and descriptor sets as usual and draw the scene geometry.

...
vkCmdBindPipeline(shadow_map_cmd_buf,
                  VK_PIPELINE_BIND_POINT_GRAPHICS,
                  shadow_map_pipeline);

const VkDeviceSize offsets[1] = { 0 };
vkCmdBindVertexBuffers(shadow_map_cmd_buf, 0, 1, &vertex_buf, offsets);

vkCmdBindDescriptorSets(shadow_map_cmd_buf,
                        VK_PIPELINE_BIND_POINT_GRAPHICS,
                        shadow_map_pipeline_layout,
                        0, 1,
                        &shadow_map_descriptor_set,
                        0, NULL);

vkCmdDraw(shadow_map_cmd_buf, ...);

vkCmdEndRenderPass(shadow_map_cmd_buf);
...

Notice that the shadow map pipeline layout will also be different from the one used with the scene. Specifically, during scene rendering we will at least need to bind the shadow map for sampling, and we will probably also bind additional resources to access light information, surface materials, etc. that we don’t need when rendering the shadow map, where we only need the View and Projection matrices of the light plus the UBO with the model matrices of the objects in the scene.
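
As an illustration, here is a minimal sketch (names are illustrative) of what the shadow map pipeline layout could look like, matching the two UBO bindings (ViewProjection and Model matrices) used by the vertex shader above:

...
VkDescriptorSetLayoutBinding bindings[2];

/* binding 0: UBO with the light's ViewProjection matrix */
bindings[0].binding = 0;
bindings[0].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
bindings[0].descriptorCount = 1;
bindings[0].stageFlags = VK_SHADER_STAGE_VERTEX_BIT;
bindings[0].pImmutableSamplers = NULL;

/* binding 1: UBO with the Model matrices of the objects */
bindings[1] = bindings[0];
bindings[1].binding = 1;

VkDescriptorSetLayoutCreateInfo set_layout_info = {};
set_layout_info.sType =
   VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
set_layout_info.bindingCount = 2;
set_layout_info.pBindings = bindings;

VkDescriptorSetLayout shadow_map_set_layout;
vkCreateDescriptorSetLayout(device, &set_layout_info, NULL,
                            &shadow_map_set_layout);

VkPipelineLayoutCreateInfo layout_info = {};
layout_info.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
layout_info.setLayoutCount = 1;
layout_info.pSetLayouts = &shadow_map_set_layout;

VkPipelineLayout shadow_map_pipeline_layout;
vkCreatePipelineLayout(device, &layout_info, NULL,
                       &shadow_map_pipeline_layout);
...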

We are almost there, now we only need to submit the command buffer for execution to render the shadow map:

...
VkPipelineStageFlags shadow_map_wait_stages = 0;
VkSubmitInfo submit_info = { };
submit_info.pNext = NULL;
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.waitSemaphoreCount = 0;
submit_info.pWaitSemaphores = NULL;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores = &signal_sem;
submit_info.pWaitDstStageMask = 0;
submit_info.commandBufferCount = 1;
submit_info.pCommandBuffers = &shadow_map_cmd_buf;

vkQueueSubmit(queue, 1, &submit_info, NULL);
...

Because the next pass of the algorithm will need to sample the shadow map during the final scene rendering, we use a semaphore to ensure that this work is complete before we start using it in that pass, as shown in the sketch below.
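
For instance, a sketch of how the scene rendering submission could wait on that semaphore; scene_cmd_buf is assumed to be the command buffer that renders the final scene:

...
VkPipelineStageFlags scene_wait_stages =
   VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;

VkSubmitInfo scene_submit_info = { };
scene_submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
scene_submit_info.pNext = NULL;
scene_submit_info.waitSemaphoreCount = 1;
scene_submit_info.pWaitSemaphores = &signal_sem;
scene_submit_info.pWaitDstStageMask = &scene_wait_stages;
scene_submit_info.commandBufferCount = 1;
scene_submit_info.pCommandBuffers = &scene_cmd_buf;
scene_submit_info.signalSemaphoreCount = 0;
scene_submit_info.pSignalSemaphores = NULL;

vkQueueSubmit(queue, 1, &scene_submit_info, NULL);
...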

In most scenarios, we will want to render the shadow map on every frame to account for dynamic objects that move in the area of effect of the light, or even moving lights. However, if we can ensure that no objects have altered their positions inside the area of effect of the light and that the light’s description (position/direction) hasn’t changed, we may not need to regenerate the shadow map, saving some precious rendering time.

Visualizing the shadow map

After executing the shadow map rendering job our shadow map image contains the depth information of the scene from the point of view of the light. Before we go on and start using this as input to produce shadows in our scene, we should probably try to visualize the shadow map to verify that it is correct. For this we just need to submit a follow-up job that takes the shadow map image as a texture input and renders it to a quad on the screen. There is one caveat though: when we use perspective projection, Z values in the depth buffer are not linear; instead, precision is higher at distances closer to the near plane and drops as we get closer to the far plane, in order to improve accuracy in areas closer to the observer and avoid Z-fighting artifacts. This means that we probably want to linearize our shadow map values when we sample from the texture so that we can actually see things, otherwise most things that are not close enough to the light source will be barely visible:

#version 400

#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

layout(std140, set = 0, binding = 0) uniform mvp_ubo {
    mat4 mvp;
} MVP;

layout(location = 0) in vec2 in_pos;
layout(location = 1) in vec2 in_uv;

layout(location = 0) out vec2 out_uv;

void main()
{
   gl_Position = MVP.mvp * vec4(in_pos.x, in_pos.y, 0.0, 1.0);
   out_uv = in_uv;
}

#version 400

#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

layout (set = 1, binding = 0) uniform sampler2D image;

layout(location = 0) in vec2 in_uv;

layout(location = 0) out vec4 out_color;

void main()
{
   float depth = texture(image, in_uv).r;
   out_color = vec4(1.0 - (1.0 - depth) * 100.0);
}

We can use the vertex and fragment shaders above to render the contents of the shadow map image onto a quad. The vertex shader takes the quad’s vertex positions and texture coordinates as attributes and passes them to the fragment shader, while the fragment shader samples the shadow map at the provided texture coordinates and then “linearizes” the depth value so that we can see better. The code in the shader doesn’t properly linearize the depth values we read from the shadow map (that requires passing the Z-near and Z-far values used in the projection), but for debugging purposes this works well enough for me. If you use different Z clipping planes you may need to alter the ‘100.0’ value to get good results (or you might as well do a proper conversion considering your actual Z-near and Z-far values).

Visualizing the shadow map

The image shows the shadow map on top of the scene. Darker colors represent smaller depth values, so these are fragments closer to the light source. Notice that we are not rendering the floor geometry to the shadow map since it can’t cast shadows on any other objects in the scene.

Conclusions

In this post we have described the shadow mapping technique as a combination of two passes: the first pass renders a depth image (the shadow map) with the scene geometry from the point of view of the light source. To achieve this, we need a passthrough vertex shader that only transforms the scene vertex positions (using the view and projection transforms from the light) and we can skip the fragment shader completely since we do not care for color output. The second pass, which we will cover in the next post, takes the shadow map as input and uses it to render shadows in the final scene.

by Iago Toral at July 30, 2017 09:49 PM

Philippe Normand

The GNOME-Shell Gajim extension maintenance

Back in January 2011 I wrote a GNOME-Shell extension allowing Gajim users to carry on with their chats using the Empathy infrastructure and UI present in the Shell. For some time the extension was also part of the official gnome-shell-extensions module and then I had to move it to Github as a standalone extension. Sadly I stopped using Gajim a few years ago and my interest in maintaining this extension has decreased quite a lot.

I don’t know if this extension is actively used by anyone beyond the few bugs reported in Github, so this is a call for help. If anyone still uses this extension and wants it supported in future versions of GNOME-Shell, please send me a mail so I can transfer ownership of the Github repository and see what I can do for the extensions.gnome.org page as well.

(Huh, also. Hi blogosphere again! My last post was in 2014 it seems :))

by Philippe Normand at July 30, 2017 01:53 PM

July 28, 2017

Eleni Maria Stea

Creating cube map images from HDR panoramas on GNU/Linux

As part of my work for Igalia I wanted to do some environment mapping. I was able to find plenty of high quality .hdr images online but I couldn’t find any (OSS) tool to convert them to cubemap images. Then, Nuclear (John Tsiombikas) gave me the solution: he wrote a minimal tool that does the job quickly and produces high quality cube maps.

So, here’s a short “how to” on creating cubemaps on Linux using his “cubemapper” program in combination with other OSS tools:

Prerequisites:

Install pfstools pfsview

Install the cubemapper dependencies:

1- libimago

git clone https://github.com/jtsiomb/libimago.git
make
sudo make install

2- libgmath

git clone https://github.com/jtsiomb/gph-math.git
make
sudo make install

Get/Install Cubemapper:

Get the cubemapper code from here: cubemapper-0.1.tar.gz

tar xzvf cubemapper-0.1.tar.gz
cd cubemapper-0.1/
make
sudo make install

Create the cubemaps:

Before we begin, we can check our hdr images using pfsview like that:

pfsin foobar_in.hdr | pfsview

Sometimes the image is too big and we might need to resize it (if it’s really really big pfsview might crash).

Resize can be done by running:

 pfsin foobar_in.hdr | pfssize --maxy 2048 | pfsout foobar_out.hdr

(You can replace 2048 or maxy with another value)

After resizing to something more reasonable / suitable for our app, we can use the cubemapper to create the cubemap images:

cubemapper foobar_out.hdr

With this command we should see something like that:

Pressing c will save the cubemap images in the current directory.

We can now show a cubemap made by the images we just saved, just to make sure that there aren’t any artifacts, by pressing space:

Exiting the program, we can see that the current directory contains 6 new .hdr files:

cubemap_px.hdr, cubemap_py.hdr cubemap_pz.hdr,
cubemap_nx.hdr, cubemap_ny.hdr, cubemap_nz.hdr

(one for each cubemap direction).

Creating cube map images from HDR panoramas on GNU/Linux

These 6 images can now be used as textures for cube mapping with OpenGL.
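
For example, here is a rough sketch of uploading the 6 faces to an OpenGL cube map texture; load_hdr_image() is a hypothetical helper returning float RGB pixels plus their dimensions (it could be implemented with libimago or any other HDR loader):

const char *faces[6] = {
    "cubemap_px.hdr", "cubemap_nx.hdr",
    "cubemap_py.hdr", "cubemap_ny.hdr",
    "cubemap_pz.hdr", "cubemap_nz.hdr"
};

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_CUBE_MAP, tex);
glTexParameteri(GL_TEXTURE_CUBE_MAP, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_CUBE_MAP, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

for (int i = 0; i < 6; i++) {
    int w, h;
    /* load_hdr_image() is a hypothetical helper returning float RGB data */
    float *pixels = load_hdr_image(faces[i], &w, &h);
    /* the cube map face enums are sequential, starting at +X */
    glTexImage2D(GL_TEXTURE_CUBE_MAP_POSITIVE_X + i, 0, GL_RGB16F,
                 w, h, 0, GL_RGB, GL_FLOAT, pixels);
    free(pixels);
}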

Cubemapper also works with other types of images (e.g. jpg, png).

Note: The initial .hdr panorama I used on this post is from: http://noemotionhdrs.net/hdrday.html

by hikiko at July 28, 2017 01:07 PM

July 14, 2017

Alejandro Piñeiro

Bringing VK_KHR_16bit_storage to Intel GPUs

Just yesterday, Vulkan 1.0.54 was released. Among other things, it includes the specification for a new extension, VK_KHR_16bit_storage. And just yesterday, we sent to mesa-dev the implementation of this extension for Intel gen8+ GPUs, which is the outcome of the effort of the Igalians José María Casanova, Andrés Gómez, Eduardo Lima, and myself.

In short, this extension allows the use of 16-bit types (half floats, 16-bit ints, and 16-bit uints) in shader input and output interfaces, push constant blocks, and buffers (shader storage buffer objects). The only operations that you can do with those 16-bit variables in the shader are 16-bit to 32-bit and 32-bit to 16-bit conversions. So no arithmetic (adds, muls, etc) operations for now. The value of this feature at this point is to reduce the memory bandwidth when feeding and getting the data from a shader. It will also be the basis for future extensions defining 16-bit arithmetic operations.
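
As an aside (this is not part of the driver series itself), here is a rough sketch of how an application could detect these subfeatures with the new extension, assuming VK_KHR_get_physical_device_properties2 is enabled on the instance:

VkPhysicalDevice16BitStorageFeaturesKHR storage_16bit = {};
storage_16bit.sType =
   VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_16BIT_STORAGE_FEATURES_KHR;

VkPhysicalDeviceFeatures2KHR features2 = {};
features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2_KHR;
features2.pNext = &storage_16bit;

/* Query which 16-bit storage subfeatures the device supports */
vkGetPhysicalDeviceFeatures2KHR(physical_device, &features2);

if (storage_16bit.storageInputOutput16) {
   /* 16-bit shader input/output interfaces are supported; chain this
    * same struct into VkDeviceCreateInfo::pNext and enable the
    * VK_KHR_16bit_storage extension when creating the device. */
}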

Taking into account that the series is still in the review process, I will not go too deep into the technical details of the implementation. In general, most of the changes were related to the assumption that all we had were 32 or 64-bit types, so we just needed to update some conditions to take into account the 16-bit types supported by the HW. In any case, I think that I can list three issues that required some extra work from our side:

  • One of the subfeatures we needed to support is being able to define 16-bit input vertex attributes. A really good read about how this is implemented and supported on Intel HW is Ben Widawsky’s post “GEN Graphics And The URB”. This post explains in detail how this is done for 32-bit vertex inputs. We used this post as another source of documentation when we implemented the support for 64-bit vertex attributes last year (I briefly mentioned it in my previous blog post). In the case of 64-bit, when feeding the shader with the data, you can configure how the 64-bit data is passed to the shader: there is a surface format that does an implicit conversion to 32-bit, and another that passes it without any conversion (PASSTHRU format). You use one or the other depending on the type of your variable in the shader. But for the case of 16-bit there is just one surface format, and as the FormatConversion section of the reference manual points out, this surface format does an implicit 16-bit to 32-bit conversion. In order to work around it, we needed to change the surface format on the fly, using a 32-bit format instead, and then reorder the data when it arrives at the shader.
  • Most of the surface reads/writes used in the Intel driver are untyped surface read/write messages. Unfortunately, those are 32-bit wide messages, so we needed to implement support for, and use, a different kind of message: byte scattered read/write messages. The reference manual already warns that it is likely better to use a different message for performance reasons. In any case, using this kind of message is only really needed for variables of one or three components. Eduardo already has a patch that uses 32-bit untyped read/write messages when possible.

  • For a render target write message (so, for example, the output of a fragment shader), we enabled the 16-bit payload using the data format bit (Data Format on the Message Descriptor Definition of Send Messages). But this bit is not available on Broadwell, and doesn’t support unsigned ints on Cherryview/Braswell. So for those cases, as a workaround, we needed to use the 32-bit payload, doing an extra conversion from 16-bit to 32-bit before the HW deals with the surface format conversion when writing 32-bit values to a 16-bit format surface.

    So the next steps are to get the series reviewed, update the patches accordingly and land it on master. In parallel we are working on optimizations and other improvements that we listed while working on the extension (such as Eduardo’s patch, already mentioned above).

    Finally, I would like to thank Intel for sponsoring this work and for their support. Also, thanks to Iago Toral and Samuel Iglesias for sharing with us the experience they gained while developing the 64-bit support on both OpenGL and Vulkan, which helped us to implement this extension.

    by infapi00 at July 14, 2017 02:40 PM

    July 06, 2017

    Hyunjun Ko

    100 commits in GStreamer

    It’s been 3 years since I started working on GStreamer, and in the meantime I have fortunately contributed over 100 commits!

    Let’s look at my commits in each project in GStreamer.

    As I write this article, I have made 128 commits.

    At Samsung Electronics, my previous company, I had the chance to work on gstreamer, which is the main multimedia framework on Tizen. Since then, I have realized that there are lots of opportunities in the open source world and I started to enjoy contributing to this project.

    This is my first commit:

    Yes. It’s just a typo fix. It landed just five minutes after I proposed it, and I realized that the maintainers look at all the issues in bugzilla. To be honest, I had doubted that a bit. :P

    Let’s look at some other commits that I was really happy with.

    While I was working with gst-rtsp-server at that time, I found that RTP retransmission was not working properly on the server. I reported the issue, the discussion went very positively, and my proposed patches finally landed thanks to Sebastian.

    This was an enhancement of the RTSP/RTP infrastructure in GStreamer, providing a way to report stats for the sender/receiver.

    Then I contributed a large set of patches creating new APIs for transformation between SDP and GstCaps, including the removal of duplicated code. Thanks again, Sebastian.

    Until this point I had focused on RTSP/RTP streaming on the server side, since I was working on Miracast on Tizen, which uses gst-rtsp-server. Around this time I started looking for a company where I could work on open source more closely. Eventually I found Igalia, which is doing great work in the open source world, including WebKit, Chromium and GStreamer.

    Since I joined Igalia I have been focusing on gstreamer-vaapi with my great colleague Victor, who is one of the maintainers of the GStreamer project, and I have had many more chances to contribute than before. As I said, I worked on the RTSP server side before, which meant focusing on encoders, muxers and networking. But since this move, I have started focusing on playback, including decoders and sinks, to make media playable on various platforms.

    These are my best patches, I think.

    With this set of patches, the performance of playback on GL/VAAPI has improved dramatically.

    Besides, I have contributed some patches that improve the vaapi decoder and encoder, most of them for h264, which also makes me happy.

    During the last three years working on GStreamer, I have grown in software development skills, in my understanding of open source, and in my insight into the world of software. I am deeply grateful to Igalia for giving me this opportunity, and I also thank you, Victor, for giving me a lot of motivation.

    Even at this moment, I’m still working on, enjoying and sometimes struggling with GStreamer. I really want to keep up this work and find a chance to contribute something new that could be applied to GStreamer.

    Thanks for reading!

    July 06, 2017 06:05 AM

    Iago Toral

    Working with lights and shadows – Part I: Phong reflection model

    Some time ago I promised to write a bit more about how shadow mapping works. It has taken me a while to bring myself to actually deliver on that front, but I have finally decided to put together some posts on this topic, this being the first one. However, before we cover shadow mapping itself we need to cover some lighting basics first. After all, without light there can’t be shadows, right?

    This post will introduce the popular Phong reflection model as the basis for our lighting model. A lighting model provides a simplified representation of how light works in the natural world that allows us to simulate light in virtual scenes at reasonable computing costs. So let’s dive into it:

    Light in the natural world

    In the real world, the light that reaches an object is a combination of both direct and indirect light. Direct light is that which comes straight from a light source, while indirect light is the result of light rays hitting other surfaces in the scene, bouncing off of them and eventually reaching the object as a result, maybe after multiple reflections from other objects. Because each time a ray of light hits a surface it loses part of its energy, indirect light reflection is less bright than direct light reflection and its color might have been altered. The contrast between surfaces that are directly hit by the light source and surfaces that only receive indirect light is what creates shadows. A shadow is simply the part of a scene that doesn’t receive direct light but might still receive some amount of (less intense) indirect light.

    Direct vs Indirect light

    Light in the digital world

    Unfortunately, implementing realistic light behavior like that is too expensive, especially for real-time applications, so instead we use simplifications that can produce similar results with much lower computing requirements. The Phong reflection model in particular describes the light reflected from surfaces or emitted by light sources as the combination of 3 components: diffuse, ambient and specular. The model also requires information about the direction in which a particular surface is facing, provided via vectors called surface normals. Let’s introduce each of these concepts:

    Surface normals

    When we study the behavior of light, we notice that the direction in which surfaces reflect incoming light affects our perception of the surface. For example, if we light a shiny surface (such as a piece of metal) with a strong light source so that incoming light is reflected off the surface in the exact opposite direction in which we are looking at it, we will see a strong reflection in the form of highlights. If we move around so that we look at the same surface from a different angle, the reflection gets dimmer and the highlights eventually disappear. In order to model this behavior we need to know the direction in which the surfaces we render reflect incoming light. The way to do this is by associating vectors called normals with the surfaces we render, so that shaders can use that information to produce lighting calculations akin to what we see in the natural world.

    Usually, modeling programs can compute normal vectors for us, and even model loading libraries can do this work automatically, but sometimes, for example when we define vertex meshes programmatically, we need to define them manually. I won’t cover here how to do this in general (you can see this article from Khronos if you’re interested in specific algorithms), but I’ll point out something relevant: given a plane, we can compute normal vectors in two opposite directions, one correct for the front face of the plane/polygon and the other correct for the back face. Make sure that if you compute normals manually, you use the correct direction for each face, otherwise you won’t be reflecting light in the correct direction and results won’t be as you expect.

    Light reflected using correct normal vector for the front face of the triangle

    In most scenarios, we only render the front faces of the polygons (by enabling back face culling) and thus, we only care about one of the normal vectors (the one for the front face).

    Another thing to notice about normal vectors is that they need to be transformed with the model to be correct for transformed models: if we rotate a model we need to rotate the normals too, since the faces they represent are now rotated and thus their normal directions have rotated too. Scaling also affects normals, specifically when the scaling is not uniform, since in that case the orientation of the surfaces may change and with it the direction of their normal vectors. Because normal vectors represent directions, their position in world space is irrelevant, so for the purpose of lighting calculations, a normal vector such as (1, 0, 0) defined for a surface placed at (0, 0, 0) is still valid to represent the same surface at any other position in the world; in other words, we do not need to apply translation transforms to our normal vectors.

    In practice, the above means that we want to apply the rotation and scale transforms from our models to their normal vectors, but we can skip the translation transform. The matrix representing these transforms is usually called the normal matrix. We can compute the normal matrix from our model matrix by computing the transpose of the inverse of the 3×3 submatrix of the model matrix. Usually, we’d want to compute this matrix in the application and feed it to our vertex shader like we do with our model matrix, but for reference, here is how this can be achieved in the shader code itself, plus how to use this matrix to transform the original normal vectors:

    mat3 NormalMatrix = transpose(inverse(mat3(ModelMatrix)));
    vec3 out_normal = normalize(NormalMatrix * in_normal);
    

    Notice that the code above normalizes the resulting normal before it is fed to the fragment shader. This is important because the rasterizer will compute normals for all fragments in the surface automatically, and for that it will interpolate between the normals for each vertex we emit. For the interpolated normals to be correct, all vertex normals we output in the vertex shader must have the same length, otherwise the larger normals will deviate the direction of the interpolated vectors towards them because their larger size will increase their weight in the interpolation computations.

    Finally, even if we emit normalized vectors in the vertex shader stage, we should note that the interpolated vectors that arrive at the fragment shader are not guaranteed to be normalized. Think for example of the normal vectors (1, 0, 0) and (0, 1, 0) being assigned to the two vertices in a line primitive. At the half-way point between these two vertices, the interpolator will compute a normal vector of (0.5, 0.5, 0), which is not unit-sized. This means that, in the general case, input normals in the fragment shader will need to be normalized again even if we emitted normalized vertex normals at the vertex shader stage.
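
    In practice this amounts to a single normalization at the top of the fragment shader. A minimal sketch, assuming the interpolated input from the vertex stage is called in_normal:

    // Re-normalize the interpolated normal before using it in any lighting math.
    vec3 normal = normalize(in_normal);
    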

    Diffuse reflection

    The diffuse component represents the reflection produced from direct light. It is important to notice that the intensity of the diffuse reflection is affected by the angle between the light coming from the source and the normal of the surface that receives the light. This makes a surface that faces the light source directly the brightest, with the reflection intensity dropping as the angle increases:

    Diffuse light (spotlight source)

    In order to compute the diffuse component for a fragment we need its normal vector (the direction in which the surface is facing), the vector from the fragment’s position to the light source, the diffuse component of the light and the diffuse reflection of the fragment’s material:

    vec3 normal = normalize(surface_normal);
    vec3 pos_to_light_norm = normalize(pos_to_light);
    float dp_reflection = max(0.0, dot(normal, pos_to_light_norm));
    vec3 diffuse = material.diffuse * light.diffuse * dp_reflection;
    

    Basically, we multiply the diffuse component of the incoming light with the diffuse reflection of the fragment’s material to produce the diffuse component of the light reflected by the fragment. The diffuse component of the surface tells how the object absorbs and reflects incoming light. For example, a pure yellow object (diffuse material vec3(1,1,0)) would absorb the blue component and reflect 100% of the red and green components of the incoming light. If the light is a pure white light (diffuse vec3(1,1,1)), then the observer would see a yellow object. However, if we are using a red light instead (diffuse vec3(1,0,0)), then the light reflected from the surface of the object would only contain the red component (since the light isn’t emitting a green component at all) and we would see it red.

    As we said before though, the intensity of the reflection depends on the angle between the incoming light and the direction of the reflection. We account for this with the dot product between the normal at the fragment (surface_normal) and the direction of the light (or rather, the vector pointing from the fragment to the light source). Notice that because the vectors that we use to compute the dot product are normalized, dp_reflection is exactly the cosine of the angle between these two vectors. At an angle of 0º the surface is facing straight at the light source, and the intensity of the diffuse reflection is at its peak, since cosine(0º)=1. At an angle of 90º (or larger) the cosine will be 0 or smaller and will be clamped to 0, meaning that no light is effectively being reflected by the surface (the computed diffuse component will be 0).

    Ambient reflection

    Computing all possible reflections and bounces of all rays of light from each light source in a scene is way too expensive. Instead, the Phong model approximates this by making indirect reflection from a light source constant across the scene. In other words: it assumes that the amount of indirect light received by any surface in the scene is the same. This eliminates all the complexity while still producing reasonable results in most scenarios. We call this constant factor ambient light.

    Ambient light

    Adding ambient light to the fragment is then as simple as multiplying the light source’s ambient light by the material’s ambient reflection. The meaning of this product is exactly the same as in the case of the diffuse light, only that it affects the indirect light received by the fragment:

    vec3 ambient = material.ambient * light.ambient;
    

    Specular reflection

    Very sharp, smooth surfaces such as metal are known to produce specular highlights, which are those bright spots that we can see on shiny objects. Specular reflection depends on the angle between the observer’s view direction and the direction in which the light is reflected off the surface. Specifically, the specular reflection is strongest when the observer is facing exactly in the opposite direction in which the light is reflected. Depending on the properties of the surface, the specular reflection can be more or less focused, affecting how the specular component scatters after being reflected. This property of the material is usually referred to as its shininess.

    Specular light

    Implementing specular reflection requires a bit more work:

    vec3 specular = vec3(0);
    vec3 light_dir_norm = normalize(vec3(light.direction));
    if (dot(normal, -light_dir_norm) >= 0.0) {
       vec3 reflection_dir = reflect(light_dir_norm, normal);
       float shine_factor = dot(reflection_dir, normalize(in_view_dir));
       specular = light.specular.xyz * material.specular.xyz *
             pow(max(0.0, shine_factor), material.shininess.x);
    }
    

    Basically, the code above checks if there is any specular reflection at all by computing the cosine of the angle between the fragment’s normal and the direction of the light (notice that, once again, both vectors are normalized prior to using them in the call to dot()). If there is specular reflection, then we compute how shiny the reflection is perceived by the viewer based on the angle between the vector from this fragment to the observer (in_view_dir) and the direction of the light reflected off the fragment’s surface (reflection_dir). The smaller the angle, the more parallel the directions are, meaning that the camera receives more of the reflection and the specular component is stronger. Finally, we modulate the result based on the shininess of the fragment. We can compute in_view_dir in the vertex shader using the inverse of the View matrix like this:

    mat4 ViewInv = inverse(View);
    out_view_dir =
       normalize(vec3(ViewInv * vec4(0.0, 0.0, 0.0, 1.0) - world_pos));
    

    The code above takes advantage of the fact that camera transformations are an illusion created by applying the transforms to everything else we render. For example, if we want to create the illusion that the camera is moving to the right, we just apply a translation to everything we render so they show up a bit to the left. This is what our View matrix achieves. From the point of view of GL or Vulkan, the camera is always fixed at (0,0,0). Taking advantage of this, we can compute the position of the virtual observer (the camera) in world space coordinates by applying the inverse of our camera transform to its fixed location (0,0,0). This is what the code above does, where world_pos is the position of this vertex in world space and View is the camera’s view matrix.

    In order to produce the final look of the scene according to the Phong reflection model, we need to compute these 3 components for each fragment and add them together:

    out_color = vec4(diffuse + ambient + specular, 1.0);
    
    Diffuse + Ambient + Specular (spotlight source)

    Attenuation

    In most scenarios, light intensity isn’t constant across the scene. Instead, it is brightest at its source and gets dimmer with distance. We can easily model this by computing an attenuation factor from the distance between the fragment and the light source and multiplying the light components by it. Typically, light intensity decreases quite fast with distance, so a linear attenuation factor alone may not produce the best results and a quadratic term is preferred:

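    // dist holds the distance from the fragment's position to the light source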
    float attenuation = 1.0 /
        (light.attenuation.constant +
         light.attenuation.linear * dist +
         light.attenuation.quadratic * dist * dist);
    
    diffuse = diffuse * attenuation;
    ambient = ambient * attenuation;
    specular = specular * attenuation;
    

    Of course, we may decide not to apply attenuation to the ambient component at all if we really want it to look constant across the scene. Do notice, however, that when multiple light sources are present, the ambient factors from each source will accumulate and may produce too much ambient light unless they are attenuated.

    Types of lights

    When we model a light source we also need to consider the kind of light we are manipulating:

    Directional lights

    These are light sources that emit rays that travel along a specific direction, so that all of them are parallel to each other. We typically use this model to represent bright, distant light sources that produce constant light across the scene. An example would be sunlight. Because the distance to the light source is so large compared to distances in the scene, the attenuation factor is irrelevant and can be discarded. Another particularity of directional light sources is that, because the light rays are parallel, shadows cast from them are regular (we will talk more about this once we cover shadow mapping in future posts).

    Directional light

    If we had used a directional light in the scene, it would look like this:

    Scene with a directional light

    Notice how the brightness of the scene doesn’t lower with the distance to the light source.

    Point lights

    These are light sources for which light originates at a specific position and spreads outwards in all directions. Shadows cast by point lights are not regular; instead, they are projected. An example would be the light produced by a light bulb. The attenuation code I showed above would be appropriate to represent point lights.

    Point light

    Here is how the scene would look with a point light:

    Scene with a point light

    In this case, we can see how attenuation plays a role and brightness decreases as we move away from the light source (which is close to the blue cuboid).
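
    To make the difference concrete, here is a hedged GLSL sketch of how the per-fragment light vector and attenuation could be obtained for each type of source. The light.is_directional flag and light.position field are assumptions for illustration; the rest follows the names used earlier in this post:

    vec3 pos_to_light_norm;
    float attenuation;
    if (light.is_directional) {
       // Directional: the same direction everywhere, no attenuation.
       pos_to_light_norm = normalize(-vec3(light.direction));
       attenuation = 1.0;
    } else {
       // Point (or spot): direction and attenuation depend on the fragment position.
       vec3 pos_to_light = vec3(light.position) - vec3(world_pos);
       float dist = length(pos_to_light);
       pos_to_light_norm = normalize(pos_to_light);
       attenuation = 1.0 / (light.attenuation.constant +
                            light.attenuation.linear * dist +
                            light.attenuation.quadratic * dist * dist);
    }
    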

    Spotlights

    This is the light source I used to illustrate the diffuse, ambient and specular components. They are similar to point lights, in the sense that light originates from a specific point in space and spreads outwards; however, instead of scattering in all directions, rays scatter forming a cone with its tip at the origin of the light. The angle formed by the light’s direction and the sides of the cone is usually called the cutoff angle, because no light is cast outside its limits. Flashlights are a good example of this type of light.

    Spotlight

    In order to create spotlights we need to consider the cutoff angle of the light and make sure that no diffuse or specular component is reflected by a fragment which is beyond the cutoff threshold:

    vec3 light_to_pos_norm = -pos_to_light_norm;
    float dp = dot(light_to_pos_norm, light_dir_norm);
    if (dp <= light.cutoff) {
       diffuse = vec3(0);
       specular = vec3(0);
    }
    

    In the code above we compute the cosine of the angle between the light’s direction and the vector from the light to the fragment (dp). Here, light.cutoff also represents a cosine (that of the spotlight’s cutoff angle), so when dp is smaller than light.cutoff it means that the fragment is outside the light cone emitted by the spotlight, and we remove its diffuse and specular reflections completely.
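
    Note that light.cutoff must therefore hold a cosine rather than an angle. A small sketch of how that value could be produced (typically once, on the application side) for a spotlight with a 30-degree cone:

    // Store the cosine of the cutoff angle, so it can be compared against dp directly.
    float cutoff = cos(radians(30.0));
    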

    Multiple lights

    Handling multiple lights is easy enough: we only need to compute the color contribution for each light separately and then add all of them together for each fragment (pseudocode):

    vec3 fragColor = vec3(0);
    foreach light in lights
        fragColor += compute_color_for_light(light, ...);
    ...
    

    Of course, light attenuation plays a vital role here to limit the area of influence of each light so that scenes where we have multiple lights don’t get too bright.
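
    For reference, a hedged GLSL version of that loop could look like the sketch below, where the lights array, the num_lights uniform and the compute_color_for_light() helper are illustrative names rather than the demo's actual interface:

    vec3 frag_color = vec3(0.0);
    for (int i = 0; i < num_lights; i++) {
       // compute_color_for_light() evaluates the attenuated diffuse, ambient and
       // specular terms for one light, exactly as described in the sections above.
       frag_color += compute_color_for_light(lights[i]);
    }
    out_color = vec4(frag_color, 1.0);
    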

    An important thing to notice about the pseudocode above is that this process involves looping through costly per-fragment light computations for each light source, which can lead to significant performance hits as the number of lights in the scene increases. This shading model, as described here, is called forward rendering. Its benefit is that it is very simple to implement, but its downside is that we may incur many costly lighting computations for fragments that, eventually, won’t be visible on the screen (because they are occluded by other fragments). This is particularly important when the number of lights in the scene is quite large and its complexity is such that there are many occluded fragments. Another technique that may be more suitable for these situations is called deferred rendering, which postpones costly shader computations to a later stage (hence the word deferred) in which we only evaluate them for fragments that are known to be visible; but that is a topic for another day, and in this series we will focus on forward rendering only.

    Lights and shadows

    For the purpose of shadow mapping in particular, we should note that objects that are directly lit by the light source reflect all 3 of the light components, while objects in the shadow only reflect the ambient component. Because objects that only reflect ambient light are less bright, they appear shadowed, in a similar fashion to how they would in the real world. We will see the details of how this is done in the next post, but for the time being, keep this in mind.
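
    A common way to express this, sketched here assuming a shadow_factor value in [0, 1] obtained from the shadow map (which we have not covered yet), is to modulate only the diffuse and specular terms:

    // shadow_factor is 0.0 for fragments fully in shadow and 1.0 for fully lit ones.
    out_color = vec4(ambient + shadow_factor * (diffuse + specular), 1.0);
    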

    Source code

    The scene images in this post were obtained from a simple shadow mapping demo I wrote in Vulkan. The source code for that is available here, and it also includes the shadow mapping implementation that I’ll cover in the next post. Specifically relevant to this post are the scene vertex and fragment shaders, where the lighting calculations take place.

    Conclusions

    In order to represent shadows we first need a means to represent light. In this post we discussed the Phong reflection model as a simple, yet effective way to model light reflection in a scene as the addition of three separate components: diffuse, ambient and specular. Once we have a representation of light we can start discussing shadows, which are parts of the scene that only receive ambient light because other objects occlude the diffuse and specular components of the light source.

    by Iago Toral at July 06, 2017 06:00 AM

    July 04, 2017

    Charlie Turner

    Qt Creator Tips for WebKit GTK

    Introduction

    Using an IDE on GNU/Linux for C/C++ development is slightly contentious in many circles. People either don’t seem to find an IDE’s value-add worthwhile compared to their cult text editor + UNIX tools, or have tried them in the past and not had good experiences, so they soldier on with the cult text editor approach. I’ve tended to be in the latter camp of people, knowing that in a perfect world an IDE would help me, but they don’t seem to be up-to-snuff in this environment yet.

    I’d limped along with taggers like GNU global and etags alongside Emacs. Contorted find+grep commands wired into Emacs’ helm package were my “Find all references” and “Jump to definition”. It worked to an extent, but it does feel a little primitive, and GUD always frustrated me. New semantic taggers such as SourceWeb and rtags looked interesting and I hope they continue to mature, but I was struggling to get WebKit through them. The Clang tooling is also rather slow at processing the source files, upon which both these tools are based.

    You can make Qt Creator build and install WebKit into a jhbuild for, say, Epiphany. I describe those steps in case you’re inclined to have that full IDE experience. The instructions below are annotated with [Building?] for steps that are applicable only to that configuration. I don’t personally do this because I prefer to run the build/install commands outside the IDE. With those introductions out of the way, what I’ve ended up with is a decent code navigator (an alternative to Eclipse!) and a good debugger frontend. I’m happier with the combination of cult editor + Qt Creator for working on C++ projects. It’s not perfect, but I hope you might find it useful as well.

    First get Qt Creator installed. If using your distribution’s package manager, just check the version is fairly recent. If it isn’t, download it from the Qt Creator site, but during the installation process, I recommend not installing the Qt libraries.

    Load the project into Qt Creator

    • Always run Qt Creator from the WebKit jhbuild environment. E.g., ./Tools/jhbuild/jhbuild-wrapper --gtk run /usr/bin/qtcreator. If you don’t, CMake will find all kinds of random junk it calls dependencies on your system, if you’re lucky.
    • Go to File > Open File or Project.
    • Navigate to the top-level CMakeLists.txt in the WebKit checkout.
    • In the Configure Project screen, change the build directory output to taste. If you’re not planning on building from IDE, this doesn’t matter.
    • [Building?] For build directories, you could put it in $WEBKIT_ROOT/WebKitBuild/{Release,Debug} to match WebKit conventions. Don’t bother with the other configurations, especially not RelWithDebInfo; there are problems in WebKit with that configuration. Now click on Manage in the Desktop kit page (it’s one of those buttons that magically appear on the screen when you hover over it…), scroll down to CMake Configuration and click Change. Remove the Qt install prefix definitions, and add CMAKE_INSTALL_PREFIX:INTERNAL=/path/to/jhbuild/install/ and CMAKE_INSTALL_LIBDIR:INTERNAL=lib. Note carefully that shell variables are not expanded here, so don’t use something like $HOME. You can also change the compiler and debugger used by the kit. Also make sure Qt version is None.
    • Once you get into the IDE after these steps, CMake will fail because we haven’t specified a port.
    • From the mode selector, click Projects (thing with the wrench icon) and go to the build settings. Set PORT=GTK and also add a boolean property for DEVELOPER_MODE=ON and ENABLE_WAYLAND_TARGET=OFF if you’re working on X11.
    • [Building?] Click Advanced if you’d like to change the default compiler switches for Debug/Release configurations. With GCC I like to use -ggdb -Og in Debug.
    • Configuring should now succeed
    • [Building?] Click Build and it will likely fail. I’ve found Qt Creator needs to be restarted at this point. Restart and the build should now work.
    • [Building?] As a finishing touch, you can configure the run configuration for launching Epiphany. In the Run Settings window under the projects mode, create a custom deploy step to run command jhbuild with arguments run cmake -P cmake_install.cmake. This will install WebKit in the jhbuild environment. Now add a custom executable and specify the executable to be jhbuild and the arguments to be run epiphany. The Run button will now install WebKit for use by Epiphany and launch the browser ready for attachment (see next section).

    Debugging

    • Due to the multi-process nature of WebKit, you can’t just click on “Start Debugging”, since there’s several processes you might want to attach to. Launch WebKit and once it’s running, go to Debug > Start Debugging > Attach to Running Application and select the PID of the process you’d like to attach to.
    • It’s likely Qt Creator will time out the GDB launch and ask if you’d like to give it more time. Say yes, and go to Tools > Options > Debugger > GDB and bump the timeout up to 60 seconds.
    • If you’re getting assembly instructions when you hit a breakpoint, it’s likely your source isn’t getting found by the debugger. This shouldn’t happen to you, but if it does you’ll want to add a ../../Source -> $WEBKIT_CHECKOUT/Source source mapping. This can be done in Tools > Options > Debugger > General. The build system doesn’t force the compiler to emit absolute paths in debugging info (there are ways around that, but this is easier).
    • GDB commands can be issued by bringing up the poorly named “Debugger Log” in the debugger views menu. Some helpful commands I’ve used on WebKit are handle SIGUSR1 noprint to stop being interrupted by IPC, and set scheduler-locking on to single-step through the current thread (you really don’t want to enable that from the start though 😉 just use it in the middle of debug session when you want to step a thread).
    • Everything else I’ve found convenient to do via the IDE.

    Issues

    • Header files don’t have their #if conditions parsed properly, I think because config.h is only indirectly available to header files, which is really unfriendly to the static analysis tools used by IDEs. This is with the default code model; I’m sure it would be better if you tried the Clang code model, but the current support for that in Qt Creator is limited, and the tradeoff is much, much slower indexing. This isn’t really an issue with the IDE but rather with the coding style guidelines of WebKit.
    • Switching kits often requires restarting the IDE, otherwise you get build step errors. I’m guessing this has something to do with the CMake caching the IDE uses. When in doubt, restart the IDE.
    • When you do an expensive interaction with the code model, it blocks the UI thread, rendering the whole IDE unresponsive. This is much worse with the Clang code model because it’s so much slower than the default. It can also be a problem with the Qt Code Model if you ask for things like the type hierarchy.

    by cturner at July 04, 2017 12:58 PM

    June 29, 2017

    Andy Wingo

    a new concurrent ml

    Good morning all!

    In my last article I talked about how we composed a lightweight "fibers" facility in Guile out of lower-level primitives. What we implemented there is enough to be useful, but it is missing an important aspect of concurrency: communication. Sure, being able to spawn off fibers is nice, but you have to be able to actually talk to them.

    Fibers had just gotten to the state described above about a year ago as I caught a train from Geneva to Rome for Curry On 2016. Train rides are magnificent for organizing thoughts, and I was in dire need of clarity. I had tentatively settled on Go-style channels by the time I got there, but when I saw that Matthias Felleisen and Matthew Flatt were there, I had to take advantage of the opportunity to ask them what they thought. Would they recommend Racket-like threads and channels? Had that been a good experience?

    The answer that I got in return was a "yes, that's what you should do", but also a "you should look at Concurrent ML". Concurrent ML? What's that? I looked and was a bit skeptical. It seemed old and hoary and maybe channels were just as expressive. I looked more deeply into this issue and it seemed CML is a bit more expressive than just channels but damn, it looked complicated to implement.

    I was wrong. This article shows that what you need to do to implement multi-core CML is actually the same as what you need to do to implement channels in a multi-core environment. By building CML first and channels and whatever later, you get more power for the same amount of work.

    Note that this article has an associated talk! If video is your thing, see my Curry On 2017 talk here:

    Or, watch on the youtube if the inline video above doesn't work; slides here as well.

    on channels

    Let's first have a crack at implementing channels. Before we begin, we should be a bit more explicit about what a channel is. My first hack in this area did the wrong thing: I was used to asynchronous queues, and I thought that's what a channel was. Besides ignorance, apparently that's what Erlang does; a process's inbox is an unbounded queue of messages with only very slight back-pressure.

    But an asynchronous queue is not a channel, at least in its classic sense. As they were originally formulated in "Communicating Sequential Processes" by Tony Hoare, adopted into David May's occam, and from there into many other languages, channels are meeting-places. Processes meet at a channel to exchange values; whichever party arrives first has to wait for the other party to show up. The message that is handed off in a channel send/receive operation is never "owned" by the channel; it is either owned by a sender who is waiting at the meeting point for a receiver, or it's accepted by a receiver. After the transaction is complete, both parties continue on.

    You'd think this is a fine detail, but meeting-place channels are strictly more expressive than buffered channels. I was actually called out for this because my first implementation of channels for Fibers had effectively a minimum buffer size of 1. In Go, whose channels are unbuffered by default, you can use a channel for RPC:

    package main
    
    func double(ch chan int) {
      for { ch <- (<-ch * 2) }
    }
    
    func main() {
      ch := make(chan int)
      go double(ch)
      ch <- 2
      x := <-ch
      print(x)
    }
    

    Here you see that the main function sent a value on ch, then immediately read a response from the same channel. If the channel were buffered, then we'd probably read the value we sent instead of the doubled value supplied by the double goroutine. I say "probably" because it's not deterministic! Likewise the double routine could read its responses as its inputs.

    Anyway, the channels we are looking to build are meeting-place channels. If you are interested in the broader design questions, you might enjoy the incomplete history of language facilities for concurrency article I wrote late last year.

    With that prelude out of the way, here's a first draft at the implementation of the "receive" operation on a channel.

    (define (recv ch)
      (match ch
        (($ $channel recvq sendq)
         (match (try-dequeue! sendq)
           (#(value resume-sender)
            (resume-sender)
            value)
           (#f
            (suspend
             (lambda (k)
               (define (resume val)
             (schedule (lambda () (k val))))
               (enqueue! recvq resume))))))))
    
    ;; Note: this code has a race!  Fixed later.
    

    A channel is a record with two fields, its recvq and sendq. The receive queue (recvq) holds a FIFO queue of continuations that are waiting to receive values, and the send queue holds continuations that are waiting to send values, along with the value that they are sending. Both the recvq and the sendq are lockless queues.

    To receive a value from a meeting-place channel, there are two possibilities: either there's a sender already there and waiting, or we have to wait for a sender. Those two cases are handled above, in that order. We use the suspend primitive from the last article to arrange for the fiber to wait; presumably the sender will resume us when they arrive later at the meeting-point.

    an aside on lockless data structures

    We'll go more deeply into the channel receive mechanics later, but first, a general question: what's the right way to implement a data structure that can be accessed and modified concurrently without locks? Though I am full of hubris, I don't have enough to answer this question definitively. I know many ways, but none that's optimal in all ways.

    For what I needed in Fibers, I chose to err on the side of simplicity.

    Some data in Fibers is never modified; this immutable data is safe to access concurrently from any code. This is the best, obviously :)

    Some mutable data is only ever mutated from an "owner" core; it's safe to read without a lock from that owner core, and in Fibers we do not access this data from other cores. An example of this kind of data structure is the i/o map from file descriptors to continuations; it's core-local. I say "core-local" because in fibers we typically run one scheduler per core, with each core having a pinned POSIX thread; it's really thread-local but I don't want to use the word "thread" too much here as it's confusing.

    Some mutable data needs to be read and written from many cores. An example of this is the recvq of a channel; many receivers and senders may want to read and write there at once. The approach we take in Fibers is just to use immutable data stored inside an "atomic box". An atomic box holds a single value, and exposes operations to read, write, swap, and compare-and-swap (CAS) the value. To read a value, just fetch it from the box; you then have immutable data that you can analyze without locks. Having read a value, you can compute a new state and use CAS on the atomic box to publish that change. If the CAS succeeds, then great; otherwise the state changed in the meantime, so you typically want to loop and try again.

    Single-word CAS suffices for Guile when every value stored into an atomic box will be unique, a property that freshly-allocated objects have and of which GC ensures us an endless supply. Note that for this to work, the values can share structure internally but the outer reference has to be freshly allocated.

    The combination of freshly-allocated data structures and atomic variables is a joy to use: no hassles about multi-word compare-and-swap or the ABA problem. Assuming your GC can keep up (Guile's does about 700 MB/s), it can be an effective strategy, and is certainly less error-prone than others.

    back at the channel recv ranch

    Now, the theme here is "growing a language": taking primitives and using them to compose more expressive abstractions. In that regard, sure, channel send and receive are nice, but what about select, which allows us to wait on any channel in a set of channels? How do we take what we have and built non-determinism on top?

    I think we should begin by noting that select in Go for example isn't just about receiving messages. You can select on the first channel that can send, or between send and receive operations.

    select {
    case c <- x:
      x, y = y, x+y
    case <-quit:
      return
    }
    

    As you can see, Go provides special syntax for select. Although in Guile we can of course provide macros, usually those macros expand out to a procedure call; the macro is sugar for a function. So we want select as a function. But because we need to be able to select over receiving and sending at the same time, the function needs to take some kind of annotation on what we are going to do with the channels:

    (select (recv A) (send B v))
    

    So what we do is to introduce the concept of an operation, which is simply data describing some event which may occur in the future. The arguments to select are now operations.

    (select (recv-op A) (send-op B v))
    

    Here recv-op is obviously a constructor for the channel-receive operation, and likewise for send-op. And actually, given that we've just made an abstraction over sending or receiving on a channel, we might as well make an abstraction over choosing the first available op among a set of operations. The implementation of select now creates such a choice-op, then performs it.

    (define (select . ops)
      (perform (apply choice-op ops)))
    

    But what we're missing here is the ability to know which operation actually happened. In Go, select's special syntax associates a clause of code with each sub-operation. In Scheme a clause of code is just a function, and so what we want to do is to be able to annotate an operation with a function that will get run if the operation succeeds.

    So we define a (wrap-op op k), which makes an operation that itself annotates op, associating it with k. If op occurs, its result values will be passed to k. For example, if we make a fiber that tries to perform this operation:

    (perform
     (wrap-op
      (recv-op A)
      (lambda (v)
        (string-append "hello, " v))))
    

    If we send the string "world" on the channel A, then the result of this perform invocation will be "hello, world". Providing "wrapped" operations to select allows us to handle the various cases in separate, appropriate ways.

    we just made concurrent ml

    Hey, we just reinvented Concurrent ML! In his PLDI 1988 paper "Synchronous operations as first-class values", John Reppy proposes just this abstraction. I like to compare it to the relationship between an expression (exp) and wrapping that expression in a lambda ((lambda () exp)); evaluating an expression gives its value, and the expression just goes away, whereas evaluating a lambda gives a procedure that you can call in the future to evaluate the expression. You can call the lambda many times, or no times. In the same way, a channel-receive operation is an abstraction over receiving a value from a channel. You can perform that operation many times, once, or not at all.

    Reppy consolidated this work in his PLDI 1991 paper, "CML: A higher-order concurrent language". Note that he uses the term "event" instead of "operation". To me the name "event" to apply to this abstraction never felt quite right; I guess I wrote too much code in the past against event loops. I see "events" as single instances in time and not an abstraction over the possibility of a, well, of an event. Indeed I wish I could refer to an instantiation of an operation as an event, but better not to muddy the waters. Likewise Reppy uses "synchronize" where I use "perform". As you like, really, it's still Concurrent ML; I just prefer to explain to my users using terms that make sense to me.

    what's an op?

    Let's return to that channel recv implementation. It had basically two parts: an optimistic part, where the operation could complete immediately, and a pessimistic part, where we had to wait for the other party to arrive. However, there was a race condition, as I noted in the comment. If a sender and a receiver concurrently arrive at a channel, it could be that they concurrently do the optimistic check, don't notice that the other is there, then they both suspend, waiting for each other to arrive: deadlock. To fix this for recv, we have to recheck the sendq after publishing our presence to the recvq.

    I'll get to the details in a bit for channels, but it turns out that this is a general pattern. All kinds of ops have optimistic and pessimistic behavior.

    (define (perform op)
      (match op
        (($ $op try block wrap)
         (define (do-op)
           ;; Return a thunk that has result values.
           (or optimistic
               pessimistic))
         ;; Return values, passed through wrap function.
         ((compose wrap do-op)))))
    

    In the optimistic phase, the calling fiber will try to commit the operation directly. If that succeeds, then the calling fiber resumes any other fibers that are part of the transaction, and the calling fiber continues. In the pessimistic phase, we park the calling fiber, publish the fact that we're ready and waiting for the operation, then to resolve the race condition we have to try again to complete the operation. In either case we pass the result(s) through the wrap function.

    Given that the pessimistic phase has to include a re-check for operation completability, the optimistic phase is purely an optimization. It's a good optimization that everyone will want to implement, but it's not strictly necessary. It's OK for a try function to always return #f.

    As shown in the above function, an operation is a plain old data structure with three fields: a try, a block, and a wrap function. The optimistic behavior is implemented by the try function; the pessimistic side is partly implemented by perform, which handles the fiber suspension part, and by the operation's block function. The wrap function implements the wrap-op behavior described above, and is applied to the result(s) of a successful operation.

    Now, it was about at this point that I was thinking "jeebs, this CML thing is complicated". I was both wrong and right -- there's some complication inherent in multicore lockless communication, yes, but I believe CML captures something close to the minimum, and certainly it's just as much work as with a direct implementation of channels. In that spirit, I continue on with the implementation of channel operations in Fibers.

    channel receive operation

    Here's an implementation of a try function for a channel.

    (define (try-recv ch)
      (match ch
        (($ $channel recvq sendq)
         (let ((q (atomic-ref sendq)))
           (match q
             (() #f)
             ((head . tail)
              (match head
                (#(val resume-sender state)
                 (match (CAS! state 'W 'S)
                   ('W
                    (resume-sender (lambda () (values)))
                    (CAS! sendq q tail) ; *
                    (lambda () val))
                   (_ #f))))))))))
    

    In Fibers, a try function either succeeds and returns a thunk, or fails and returns #f. For channel receive, we only succeed if there is a sender already in the queue: the sender has arrived, suspended itself, and published its availability. The state variable is an atomic box that holds the operation state, which initially starts as W and when complete is S. More on that in a minute. If the CAS! compare-and-swap operation managed to change the state from W to S, then the optimistic phase succeeded -- yay! We resume the sender with no values, take the value that the sender gave us, and keep on trucking, returning that value wrapped in a thunk.

    Additionally, the sender's entry on the sendq is now stale, as the operation is already complete; we try to pop it off the queue at the line indicated with *, but that could fail due to concurrent queue modification. In that case, no biggie, someone else will collect our garbage for us.

    The pessimistic case is a bit more involved. It's the last bit of code though; almost done here! I express the pessimistic phase as a function of the operation's block function.

    (define (pessimistic block)
      ;; For consistency with optimistic phase, result of
      ;; pessimistic phase is a thunk that "perform" will
      ;; apply.
      (lambda ()
        ;; 1. Suspend the thread.  Expect to be resumed
        ;; with a thunk, which we arrange to invoke directly.
        ((suspend
           (lambda (k)
            (define (resume values-thunk)
              (schedule (lambda () (k values-thunk))))
            ;; 2. Make a fresh opstate.
            (define state (make-atomic-box 'W))
            ;; 3. Call op's block function.
            (block resume state))))))
    

    So what about that state variable? Well basically, once we publish the fact that we're ready to perform an operation, fibers from other cores might concurrently try to complete our operation. We need for this perform invocation to complete at most once! So we introduce a state variable, the "opstate", held in an atomic box. It has three states:

    • W: "Waiting"; initial state

    • C: "Claimed"; temporary state

    • S: "Synched"; final state

    There are four possible state transitions, of two kinds. Firstly there are the "local" transitions W->C, C->W, and C->S. These transitions may only ever occur as part of the "retry" phase of a block function; notably, no remote fiber will cause these transitions on "our" state variable. Remote fibers can only make the W->S transition, committing an operation. The W->S transition can also be made locally of course.

    Every time an operation is instantiated via the perform function, we make a new opstate. Operations themselves don't hold any state; only their instantiations do.

    The need for the C state wasn't initially obvious to me, but after seeing the recv-op block function below, it will be clear to you I hope.

    block functions

    The block function itself has two jobs to do. Recall that it's called after the calling fiber was suspended, and is passed two arguments: a procedure that can be called to resume the fiber with some number of values, and the fresh opstate for this instantiation. The block function has two jobs: it needs to publish the resume function and the opstate to the channel's recvq, and then it needs to try again to receive. That's the "retry" phase I was mentioning before.

    Retrying the recv can have three possible results:

    1. If the retry succeeds, we resume the sender. We also have to resume the calling fiber, as it has been suspended already. In general, whatever code manages to commit an operation has to resume any fibers that were waiting on it to complete.

    2. If the operation was already in the S state, that means some other party concurrently completed our operation on our behalf. In that case there's nothing to do; the other party resumed us already.

    3. Otherwise if the operation couldn't proceed, then when the other party or parties arrive, they will be responsible for completing the operation and ultimately resuming our fiber in the future.

    With that long prelude out of the way, here's the gnarlies!

    (define (block-recv ch resume-recv recv-state)
      (match ch
        (($ $channel recvq sendq)
         ;; Publish -- now others can resume us!
         (enqueue! recvq (vector resume-recv recv-state))
         ;; Try again to receive.
         (let retry ()
           (let ((q (atomic-ref sendq)))
             (match q
               ((head . tail)
                (match head
                  (#(val resume-send send-state)
                   (match (CAS! recv-state 'W 'C)   ; Claim txn.
                     ('W
                      (match (CAS! send-state 'W 'S)
                        ('W                         ; Case (1): yay!
                         (atomic-set! recv-state 'S)
                         (CAS! sendq q tail)        ; Maybe GC.
                         (resume-send (lambda () (values)))
                         (resume-recv (lambda () val)))
                        ('C                         ; Conflict; retry.
                         (atomic-set! recv-state 'W)
                         (retry))
                        ('S                         ; GC and retry.
                         (atomic-set! recv-state 'W)
                         (CAS! sendq q tail)
                         (retry))))
                     ('S #f)))))                    ; Case (2): cool!
               (() #f)))))))                        ; Case (3): we wait.
    

    As we said, first we publish, then we retry. If there is a sender on the queue, we will try to complete their operation, but before we do that we have to prevent other fibers from completing ours; that's the purpose of going into the C state. If we manage to commit the sender's operation, then we commit ours too, going from C to S; otherwise we roll back to W. If the sender itself was in C then we had a conflict, and we spin to retry. We also try to GC off any completed operations from the sendq via unchecked CAS. If there's no sender on the queue, we just wait.

    And that's it for the code! Thank you for suffering through this all. I only left off a few details: the try function can loop if the sender is in the C state, and the block function needs to prevent a (choice-op (send-op A v) (recv-op A)) from sending v to itself. But because opstates are fresh allocations, we can know if a sender is actually ourself by comparing its opstate to ours (with eq?).

    what about select?

    I started about all this "op" business because I needed to annotate the arguments to select. Did I actually get anywhere? Good news, everyone: it turns out that select doesn't have to be a primitive!

    Firstly, note that the choice-op try function just needs to run all try functions of sub-operations (possibly in random order), returning early if one succeeds. Pretty straightforward. And actually the story with the block function is the same: we just run the sub-operation block functions, knowing that the operation will commit at most one time. The only complication is plumbing through the respective wrap functions to all of the sub-operations, but of course that's the point of the wrap facility, so we pay the cost willingly.

    (define (choice-op . ops)
      (define (try)
        (or-map
         (match-lambda
          (($ $op sub-try sub-block sub-wrap)
           (define thunk (sub-try))
           (and thunk (compose sub-wrap thunk))))
         ops))
      (define (block resume opstate)
        (for-each
         (match-lambda
          (($ $op sub-try sub-block sub-wrap)
           (define (wrapped-resume results-thunk)
             (resume (compose sub-wrap results-thunk)))
           (sub-block wrapped-resume opstate)))
         ops))
      (define wrap values)
      (make-op try block wrap))
    

    There are optimizations possible, for example to randomize the order of visiting the sub-operations to make the behavior less deterministic, but this is really all there is.

    concurrent ml is inevitable

    As far as I understand things, the protocol to implement CML-style operations on channels in a lock-free environment is exactly the same as what's needed if you wrote out the recv function by hand, without abstracting it to a recv-op.

    You still need the ability to park a fiber in the block function, and you still need to retry the operation after parking. Although try is just an optimization, it's an optimization that you'll want.

    So given that the cost of parallel CML is necessary, you might as well get what you pay for and have your language expose the more expressive CML interface in addition to the more "standard" channel operations.

    concurrent ml between pthreads and fibers

    One really cool aspect about implementing CML is that the bit that suspends the current thread is isolated in the perform function. Of course if you're in a fiber, you suspend the current fiber as we have described above. But what if you're not? What if you want to use CML to communicate between POSIX threads? You can do that, just create a mutex/cond pair and pass a procedure that will signal the cond as the resume argument to the block function. It just works! The channels implementation doesn't need to know anything about pthreads, or even fibers for that matter.

    In fact, you can actually use CML operations to communicate between fibers and full pthreads. This can be really useful if you need to run some truly blocking operation in a side pthread, but you want most of your program to be in fibers.

    a meta-note for a meta-language

    This implementation was based on the Parallel CML paper from Reppy et al, describing the protocol implemented in Manticore. Since then there's been a lot of development there; you should check out Manticore! I also hear that Reppy has a new version of his "Concurrent Programming in ML" book coming out soon (not sure though).

    This work is in Fibers, a concurrency facility for Guile Scheme, built as a library. Check out the manual for full details. Relative to the Parallel CML paper, this work has a couple differences beyond the superficial operation/perform event/sync name change.

    Most significantly, Reppy's CML operations have three phases: poll, do, and block. Fibers uses just two, as in a concurrent context it doesn't make sense to check-then-do. There is no do, only try :)

    Additionally the Fibers channel implementation is lockless, with an atomic sendq and recvq. In contrast, Manticore uses a spinlock and hence needs to mask/unmask interrupts at times.

    On the other hand, the Parallel CML paper included some model checking work, which Fibers doesn't have. It would be nice to have some more confidence on correctness!

    but what about perf

    Performance! Does it scale? Let's poke it. Here I'm going to try to isolate my tests to measure the overhead of communication of channels as implemented in terms of Parallel CML ops. I have more real benchmarks for Fibers on a web server workload where it does well, but here I am really trying to focus on CML.

    My test system is a 2 x E5-2620v3, which is two sockets each having 6 2.6GHz cores, hyperthreads off, performance governor on all cores. This is a system we use for Snabb testing, so the first core on each socket handles interrupts and all others are reserved; Linux won't schedule anything on them. When you run a fibers system, it will spawn a thread per available core, then set the thread's affinity to that core. In these tests, I'll give benchmarks progressively more cores and see how they do with the workload.

    So this is a benchmark measuring total message sends per second on a chain of fibers communicating over channels. For 0 links, that means that there's just a sender and a receiver and no intermediate links. For 10 links, each message is relayed 10 times, for 11 total sends in the chain and 12 total fibers. For 0 links we expect pretty much no parallel speedup, and no slowdown, and that's what we see; but when we get to more links, we should expect more throughput. The fibers are allocated to cores at random (a randomized round-robin initial scheduling, then after that fibers have core affinity; though there is a limited work-stealing phase).

    You would think that the 1-core case would be the same for all of them. Unfortunately it seems that currently there is a fixed cost for bouncing through epoll to pick up new I/O tasks, even though there are no I/O runnables in this test and the timeout is 0, so it will return immediately. It's definitely something to look into as it's a cost that all cores are paying.

    Initially I expected a linear speedup but that's not what we're seeing. But then I thought about it and revised my expectations :) As we add more cores, we add more communication; we should see sublinear speedups as we have to do more cross-core wakeups and synchronizations. After all, we aren't measuring a nice parallelizable computational workload: we're measuring overhead.

    On the other hand, the diminishing returns effect is pretty bad, and then we hit the NUMA cliff: as we cross from 6 to 7 cores, we start talking to the other CPU socket and everything goes to shit.

    But here it's hard to isolate the test from three external factors, whose impact I don't understand: firstly, that Fibers itself has a significant wakeup cost for remote schedulers. I haven't measured contention on scheduler inboxes, but I suspect one issue is that when a remote scheduler has decided it has no runnables, it will sleep in epoll; and to wake it up we need to write on a socketpair. Guile can avoid that when there are lots of runnables and we see the remote scheduler isn't sleeping, but it's not perfect.

    Secondly, Guile is a bytecode VM. I measured that Guile retires about 0.4 billion instructions per second per core on the test machine, whereas a 4 IPC native program will retire about 10 billion. There's overhead at various points, some of which will go away with native compilation in Guile but some might not for a while, given that Go (for example) has baked-in support for channels. So to what extent is it the protocol and to what extent the implementation overhead? I don't know.

    Finally, and perhaps most importantly, we can't isolate this test from the garbage collector. Guile still uses the Boehm GC, which is just OK I think. It does have a nice parallel mark phase, but it uses POSIX signals to pause program threads instead of having those threads reach safepoints; and it's completely NUMA-unaware.

    So, with all of those caveats mentioned, let's see a couple more graphs :) Firstly, similar to the previous one, here's total message send rate for N pairs of fibers that ping-pong their message back and forth. Core allocation was randomized round-robin.

    My conclusion here is that when more fibers are runnable per scheduler turn, the overhead of the epoll phase is less.

    Here's a test where there's one fiber producer, and N fibers competing to consume the messages sent. Ultimately we expect that the rate will be limited on the producer side, but there's still a nice speedup.

    Next is a pretty weak-sauce benchmark where we're computing diagonal lengths on an N-dimensional cube; the squares of the dimensions happen in parallel fibers, then one fiber collects those lengths, sums and makes a square root.

    The workload on that one is just very low, and the serial components become a bottleneck quickly. I think I need to rework that test case.

    Finally, there's a false sieve of Eratosthenes, in which every time we find a prime, we add another fiber onto the sieve chain that filters out multiples of that prime.

    Even though the workload is really small, we still see speedups, which is somewhat satisfying. Still, on all of these, the NUMA cliff is something fierce.

    For me what these benchmarks show is that there are still some bottlenecks to work on. We do OK in the handful-of-cores scenario, but the system as a whole doesn't really scale past that. On more real benchmarks with bigger workloads and proportionally much less communication, I get much more satisfactory results; but those tend to be I/O heavy anyway, so the bottleneck is elsewhere.

    closing notes

    There are other parts to CML events, namely guard functions and withNack functions. My understanding is that these are implementable in terms of this "primitive" CML as described here; that was a result of earlier work by Matthew Fluet. I haven't actually implemented these yet! A to-do item, truly.

    There are other event types in CML systems of course! Besides being able to implement operations yourself, there are built-in condition variables (cvars), timeouts, thread join events, and so on. The Fibers manual mentions some of these, but it's an open set.

    Finally and perhaps most significantly, Aaron Turon did some work a few years ago on "Reagents", a pattern library for composing parallel and concurrent operations, initially in Scala. It's claimed that Reagents generalizes CML. Is this the case? I am looking forward to finding out.

    OK, that's it for this verrrrry long post :) I hope that you found that this made parallel CML seem a bit more approachable and interesting, whether as a language implementor, a library implementor, or a user. Comments and corrections welcome. Check out Fibers and give it a go!

    by Andy Wingo at June 29, 2017 02:37 PM

    June 27, 2017

    Javier Muñoz

    AWS4 browser-based upload goes upstream in Ceph

    Some days ago Matt committed Radek's great effort to give the Ceph RGW auth subsystem a more coherent and structured scaffolding that supports the differences among the available auth algorithms.

    As part of this effort on the RGW auth subsystem, Radek was kind enough to include in this big patchset my latest patches adding AWS4 authentication support for the S3 POST Object API.

    This entry comments on this AWS4 feature upgrade and how it works with Ceph RGW S3.

    Browser-Based Uploads Using POST (AWS Signature Version 4)

    The Amazon S3 feature documentation is available here. It describes how users upload content to Amazon S3 by using their browsers via authenticated HTTP POST requests and HTML forms.

    Those HTML forms consist of a form declaration and form fields. The form declaration contains high-level information about the request and the form fields contain detailed request information.

    The technical details to craft an S3 HTML form are available here. The HTML form also requires a proper POST policy (have a look here to learn how to create one).

    The process for sending browser-based POST requests is as follows:

    1. Create a security policy specifying conditions restricting what you want to allow in the request.
    2. Create a signature that is based on the policy. For authenticated requests, the form must include a valid signature and the policy.
    3. Create an HTML form that your users can access in order to upload objects to your Amazon S3 bucket directly. (A boto3 sketch of these steps follows below.)
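
    If you prefer to generate the policy and signature programmatically, here is a minimal sketch using the boto3 client (one of the testing options mentioned later in this post). The endpoint, bucket, key and size limit are assumptions for illustration only; adjust them to your own setup.

    import boto3
    from botocore.client import Config

    # Hypothetical endpoint and bucket; point these at your own RGW instance.
    s3 = boto3.client(
        's3',
        endpoint_url='http://s3.eu-central-1.amazonaws.com:8000',
        region_name='eu-central-1',
        config=Config(signature_version='s3v4'),  # AWS4 (SigV4) signing
    )

    # Steps 1 and 2: build the policy conditions and derive the signature.
    post = s3.generate_presigned_post(
        Bucket='test-1-2-1-bucket',
        Key='test-1-2-1-key',
        Conditions=[['content-length-range', 0, 1048576]],
        ExpiresIn=3600,
    )

    # Step 3: post['url'] is the form action; post['fields'] holds the hidden
    # inputs (key, policy, x-amz-signature, ...) to embed in the HTML form.
    print(post['url'])
    print(post['fields'])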

    Using the feature with Ceph RGW S3 and AWS4

    Ceph RGW S3 supports HTTP POST requests under AWS2. With the new patch in place Ceph RGW S3 also authenticates HTTP POST requests under AWS4.

    To test the feature you can use a browser, the boto3 client or the AWS command line interface. Try the following commands:

    1. Create a new bucket

    $ aws s3 mb s3://test-1-2-1-bucket --region eu-central-1 \
    > --endpoint-url http://s3.eu-central-1.amazonaws.com:8000
    make_bucket: test-1-2-1-bucket
    

    2. Generate some test HTML code with the minimal form fields required to authenticate under AWS4, proper policy encoding, etc. Feel free to use this Python script to get a simple and tested skeleton.

    $ ./rgw-s3-aws4-form.py
    test-rgw-s3-aws4-form.html created.
    

    3. Load test-rgw-s3-aws4-form.html in a browser and upload a test file. You should receive a 204 (No Content) response.

    4. Verify the object is in place and the content is good.

    $ md5sum test-1-2-1-key
    aaf3b5e3b7505131a6baf9fb6ec1f9dc test-1-2-1-key
    
    $ aws s3 cp s3://test-1-2-1-bucket/test-1-2-1-key --region eu-central-1 \
    > --endpoint-url http://s3.eu-central-1.amazonaws.com:8000 - | md5sum
    aaf3b5e3b7505131a6baf9fb6ec1f9dc -
    

    Enjoy!

    Note: The example uses s3.eu-central-1.amazonaws.com as an example hostname in the local network. You should use the names of your own RGWs here.

    Acknowledgments

    My work in Ceph has been made possible by Igalia and the invaluable help of the Ceph development team!

    by Javier at June 27, 2017 10:00 PM

    Andy Wingo

    growing fibers

    Good day, Schemers!

    Over the last 12 to 18 months, as we were preparing for the Guile 2.2 release, I was growing increasingly dissatisfied at not having a good concurrency story in Guile.

    I wanted to be able to spawn a million threads on a core, to support highly-concurrent I/O servers, and Guile's POSIX threads are just not the answer. I needed something different, and this article is about the search for and the implementation of that thing.

    on pthreads

    It's worth being specific about why POSIX threads are not a great abstraction. One reason is that they don't compose: two pieces of code that use mutexes won't necessarily compose together. A correct component A that takes locks might call a correct component B that takes locks, and vice versa, and if both happen concurrently you get the classic deadly-embrace deadlock.

    POSIX threads are also terribly low-level. Asking someone to build a system with mutexes and cond vars is like building a house with exploding toothpicks.

    I want to program network services in a straightforward way, and POSIX threads don't help me here either. I'd like to spawn a million "threads" (scare-quotes!), one for each client, each one just looping, reading a request, computing and writing the response, and so on. POSIX threads aren't the concrete implementation of this abstraction though, as in most systems you can't have more than a few thousand of them.

    Finally as a Guile maintainer I have a duty to tell people the good ways to make their programs, but I can't in good conscience recommend POSIX threads to anyone. If someone is a responsible programmer, then yes we can discuss details of POSIX threads. But for a new Schemer? Never. Recommending POSIX threads is malpractice.

    on scheme

    In Scheme we claim to be minimalists. Whether we actually are that or not is another story, but it's true that we have a culture of trying to grow expressive systems from minimal primitives.

    It's sometimes claimed that in Scheme, we don't need threads because we have call-with-current-continuation, an ultrapowerful primitive that lets us implement any kind of control structure we want. (The name screams for an abbreviation, so the alias call/cc is blessed; minimalism is whatever we say it is, right?) Unfortunately it turned out that while call/cc can implement any control abstraction, it can't implement any two. Abstractions built on call/cc don't compose!

    Fortunately, there is a way to build powerful control abstractions that do compose. This article covers the first half of composing a concurrency facility out of a set of more basic primitives.

    Just to be concrete, I have to start with a simple implementation of an event loop. We're going to build on it later, but for now, here we go:

    (define (run sched)
      (match sched
        (($ $sched inbox i/o)
         (define (dequeue-tasks)
           (append (dequeue-all! inbox)
                   (poll-for-tasks i/o)))
         (let lp ()
           (for-each (lambda (task) (task))
                     (dequeue-tasks))
           (lp)))))
    

    This is a scheduler that is a record with two fields, inbox and i/o.

    The inbox holds a queue of pending tasks, as thunks (procedures of no arguments). When something wants to enqueue a task, it posts a thunk to the inbox.

    On the other hand, when a task needs to wait on some external input or output becoming available, it will register an event with i/o. Typically i/o will be a simple combination of an epollfd and a mapping of tasks to enqueue when a file descriptor becomes readable or writable. poll-for-tasks does the underlying epoll_wait call that pulls new I/O events from the kernel.

    There are some details I'm leaving out, like when to have epoll_wait return directly, and when to have it wait for some time, and how to wake it up if it's sleeping while a task is posted to the scheduler's inbox, but ultimately this is the core of an event loop.

    a long digression

    Now you might think that I'm getting a little far afield from what my goal was, which was threads or fibers or something. But that's OK, let's go a little farther and talk about "prompts". The term "prompt" comes from the experience you get when you work on the command-line:

    /home/wingo% ./prog
    

    I don't know about you all, but I have the feeling that the /home/wingo% has a kind of solid reality, that my screen is not just an array of characters but there is a left-hand-side that belongs to the system, and a right-hand-side that's mine. The two parts are delimited by a prompt. Well prompts in Scheme allow you to provide this abstraction within your program: you can establish a program part that's a "system" facility, for whatever definition of "system" suits your purposes, and a part that's for the "user".

    In a way, prompts generalize a pattern of system/user division that has special facilities in other programming languages, such as a try/catch block.

    try {
      foo();
    } catch (e) {
      bar();
    }
    

    Here again, the code inside the try block is the "user" part. Some other examples of control flow patterns that prompts generalize would be early exit of a subcomputation, coroutines, and nondeterministic choice like SICP's amb operator. Coroutines are obviously where I'm headed in the context of this article, but still there are some details to go over.

    To make a prompt in Guile, you can use the % operator, which is pronounced "prompt":

    (use-modules (ice-9 control))
    
    (% expr
       (lambda (k . args) #f))
    

    The name for this operator comes from Dorai Sitaram's 1993 paper, Handling Control; it's actually a pun on the tcsh prompt, if you must know. Anyway the basic idea in this example is that we run expr, but if it aborts we run the lambda handler instead, which just returns #f.

    Really % is just syntactic sugar for call-with-prompt though. The previous example desugars to something like this:

    (let ((tag (make-prompt-tag)))
      (call-with-prompt tag
        ;; Body:
        (lambda () expr)
        ;; Escape handler:
        (lambda (k . args) #f)))
    

    (It's not quite the same; % uses a "default prompt tag". This is just a detail though.)

    You see here that call-with-prompt is really the primitive. It will call the body thunk, but if an abort occurs within the body to the given prompt tag, then the body aborts and the handler is run instead.

    So if you want to define a primitive that runs a function but allows early exit, we can do that:

    (define-module (my-module)
      #:export (with-return))
    
    (define-syntax-rule (with-return return body ...)
      (let ((t (make-prompt-tag)))
        (define (return . args)
          (apply abort-to-prompt t args))
        (call-with-prompt t
          (lambda () body ...)
          (lambda (k . rvals)
            (apply values rvals)))))
    

    Here we define a module with a little with-return macro. We can use it like this:

    (use-modules (my-module))
    
    (with-return return
      (+ 3 (return 42)))
    ;; => 42
    

    As you can see, calling return within the body will abort the computation and cause the with-return expression to evaluate to the arguments passed to return.

    But what's up with the handler? Let's look again at the form of the call-with-prompt invocations we've been making.

    (let ((tag (make-prompt-tag)))
      (call-with-prompt tag
        (lambda () ...)
        (lambda (k . args) ...)))
    

    With the with-return macro, the handler took a first k argument, threw it away, and returned the remaining values. But the first argument to the handler is pretty cool: it is the continuation of the computation that was aborted, delimited by the prompt: meaning, it's the part of the computation between the abort-to-prompt and the call-with-prompt, packaged as a function that you can call.

    If you call the k, the delimited continuation, you reinstate it:

    (define (f)
      (define tag (make-prompt-tag))
      (call-with-prompt tag
       (lambda ()
         (+ 3
            (abort-to-prompt tag)))
       (lambda (k) k)))
    
    (let ((k (f)))
      (k 1))
    ;; => 4
    

    Here, the abort-to-prompt invocation behaved simply like a "suspend" operation, returning the suspended computation k. Calling that continuation resumes it, supplying the value 1 to the saved continuation (+ 3 []), resulting in 4.

    Basically, when a delimited continuation suspends, the first argument to the handler is a function that can resume the continuation.

    tasks to fibers

    And with that, we just built coroutines in terms of delimited continuations. We can turn our scheduler inside-out, giving the illusion that each task runs in its own isolated fiber.

    (define tag (make-prompt-tag))
    
    (define (call/susp thunk)
      (define (handler k on-suspend) (on-suspend k))
      (call-with-prompt tag thunk handler))
    
    (define (suspend on-suspend)
      (abort-to-prompt tag on-suspend))
    
    (define (schedule thunk)
      (match (current-scheduler)
        (($ $sched inbox i/o)
         (enqueue! inbox (lambda () (call/susp thunk))))))
    

    So! Here we have a system that can run a thunk in a scheduler. Fine. No big deal. But if the thunk calls suspend, then it causes an abort back to a prompt. suspend takes a procedure as an argument, the on-suspend procedure, which will be called with one argument: the suspended continuation of the thunk. We've layered coroutines on top of the event loop.

    Guile's virtual machine is a normal register virtual machine with a stack composed of function frames. It's not necessary to do full CPS conversion to implement delimited control, but if you don't, then your virtual machine needs primitive support for call-with-prompt, as Guile's VM does. In Guile then, a suspended continuation is an object composed of the slice of the stack captured between the prompt and the abort, and also the slice of the dynamic stack. (Guile keeps a parallel stack for dynamic bindings. Perhaps we should unify these; dunno.) This object is wrapped in a little procedure that uses VM primitives to push those stack frames back on, and continue.

    I say all this just to give you a mental idea of what it costs to suspend a fiber. It will allocate storage proportional to the stack depth between the prompt and the abort. Usually this is a few dozen words, if there are 5 or 10 frames on the stack in the fiber.

    We've gone from prompts to coroutines, and from here to fibers there's just a little farther to go. First, note that spawning a new fiber is simply scheduling a thunk:

    (define (spawn-fiber thunk)
      (schedule thunk))
    

    Many threading libraries provide a "yield" primitive, which simply suspends the current thread, allowing others to run. We can do this for fibers directly:

    (define (yield)
      (suspend schedule))
    

    Note that the on-suspend procedure here is just schedule, which re-schedules the continuation (but presumably at the back of the queue).

    Similarly if we are reading on a non-blocking file descriptor and detect that we need more input before we can continue, but none is available, we can suspend and arrange for the epollfd to resume us later:

    (define (wait-for-readable fd)
      (suspend
       (lambda (k)
         (match (current-scheduler)
           (($ $sched inbox i/o)
            (add-read-fd! i/o fd
                          (lambda () (schedule k))))))))
    

    In Guile you can arrange to install this function as the "current read waiter", causing it to run whenever a port would block. The details are a little gnarly currently; see the Non-blocking I/O manual page for more.

    Anyway the cool thing is that I can run any thunk within a spawn-fiber, without modification, and it will run as if in a new thread of some sort.

    solid abstractions?

    I admit that although I am very happy with Emacs, I never really took to using the shell from within Emacs. I always have a terminal open with a bunch of tabs. I think the reason for that is that I never quite understood why I could move the cursor over the bash prompt, or into previous expressions or results; it seemed like I was waking up groggily from some kind of dream where nothing was real. I like the terminal, where the only bit that's "mine" is the current command. All the rest is immutable text in the scrollback.

    Similarly when you make a UI, you want to design things so that people perceive the screen as being composed of buttons and so on, not just lines. In essence you trick the user, a willing user who is ready to be tricked, into seeing buttons and text and not just weird pixels.

    In the same way, with fibers we want to provide the illusion that fibers actually exist. To solidify this illusion, we're still missing a few elements.

    One point relates to error handling. As it is, if an error happens in a fiber and the fiber doesn't handle it, the exception propagates out of the fiber, through the scheduler, and might cause the whole program to error out. So we need to wrap fibers in a catch-all.

    (define (spawn-fiber thunk)
      (schedule
       (lambda ()
         (catch #t thunk
           (lambda (key . args)
             (print-exception (current-error-port) #f key args))))))
    

    Well, OK. Exceptions won't propagate out of fibers, yay. In fact in Guile we add another catch inside the print-exception, in case the print-exception throws an exception... Anyway. Cool.

    Another point relates to fiber-local variables. In an operating system, each process has a number of variables that are local to it, notably in UNIX we have the umask, the current effective user, the current directory, the open files and what file descriptors they are associated with, and so on. In Scheme we have similar facilities in the form of parameters.

    Now the usual way that parameters are used is to bind a new value within the extent of some call:

    (define (with-output-to-string thunk)
      (let ((p (open-output-string)))
        (parameterize ((current-output-port p))
          (thunk))
        (get-output-string p)))
    

    Here the parameterize invocation established p as the current output port during the call to thunk. Parameters already compose quite well with prompts; Guile, like Racket, implements the protocol described by Kiselyov, Shan, and Sabry in their Delimited Dynamic Binding paper (well worth a read!).

    The one missing piece is that parameters in Scheme are mutable (by default). Normally if you call (current-input-port), you just get the current value of the current input port parameter. But if you pass an argument, like (current-input-port p), then you actually set the current input port to that new value. This value will be in place until we leave some parameterize invocation that parameterizes the current input port.

    The problem here is that it could be that there's an interesting parameter which some piece of Scheme code will want to just mutate, so that all further Scheme code will use the new value. This is fine if you have no concurrency: there's just one thing running. But when you have many fibers, you want to avoid mutations in one fiber from affecting others. You want some isolation with regards to parameters. In Guile, we do this with the with-dynamic-state facility, which isolates changes to the dynamic state (parameters and so on) within the extent of the with-dynamic-state call.

    (define (spawn-fiber thunk)
      (let ((state (current-dynamic-state)))
        (schedule
         (lambda ()
           (catch #t
             (lambda ()
               (with-dynamic-state state thunk))
             (lambda (key . args)
               (print-exception (current-error-port) #f key args)))))))
    

    Interestingly, with-dynamic-state solves another problem as well. You would like for newly spawned fibers to inherit the parameters from the point at which they were spawned.

    (parameterize ((current-output-port p))
      (spawn-fiber
       ;; New fiber should inherit current-output-port
       ;; binding as "p"
       (lambda () ...)))
    

    Capturing the (current-dynamic-state) outside the thunk does this for us.

    When I made this change in Guile, making sure that with-dynamic-state did not impose a continuation barrier, I ran into a problem. In Guile we implemented exceptions in terms of delimited continuations and dynamic binding. The current stack of exception handlers was a list, and each element included the exceptions handled by that handler, and what prompt to which to abort before running the exception handler. See where the problem is? If we ship this exception handler stack over to a new fiber, then an exception propagating out of the new fiber would be looking up handlers from another fiber, for prompts that probably aren't even on the stack any more.

    The problem here is that if you store a heap-allocated stack of current exception handlers in a dynamic variable, and that dynamic variable is captured somehow (say, by a delimited continuation), then you capture the whole stack of handlers, not (in the case of delimited continuations) the delimited set of handlers that were active within the prompt. To fix this, we had to change Guile's exceptions to instead make catch just rebind the exception handler parameter to hold the handler installed by the catch. If Guile needs to walk the chain of exception handlers, we introduced a new primitive fluid-ref* to do so, building the chain from the current stack of parameterizations instead of some representation of that stack on the heap. It's O(n), but life is that way sometimes. This way also, delimited continuations capture the right set of exception handlers.

    Finally, Guile also supports asynchronous interrupts. We can arrange to interrupt a Guile process (or POSIX thread) every so often, as measured in wall-clock or process time. It used to be that interrupt handlers caused a continuation barrier, but this is no longer the case, so now we can add pre-emption to fibers using interrupts.

    summary and reflections

    In Guile we were able to create a solid-seeming abstraction for fibers by composing other basic building blocks from the Scheme toolkit. Guile users can take an abstraction that's implemented in terms of an event loop (any event loop) and layer fibers on top in a way that feels "real". We were able to do this because we have prompts (delimited continuation) and parameters (dynamic binding), and we were able to compose the two. Actually getting it all to work required fixing a few bugs.

    In Fibers, we just use delimited continuations to implement coroutines, and then our fibers are coroutines. If we had coroutines as a primitive, that would work just as well. As it is, each suspension of a fiber will allocate a new continuation. Perhaps this is unimportant, given the average continuation size, but it would be comforting in a way to be able to re-use the allocation from the previous suspension (if any). Other languages with coroutine primitives might have an advantage here, though delimited dynamic binding is still relatively uncommon.

    Another point is that because we use prompts to suspend fibers, we are effectively always unwinding and rewinding the dynamic state. In practice this should be transparent to the user, and the implementor should make it transparent from a performance perspective, with the exception of dynamic-wind. Basically any fiber suspension will run the "out" guard of any enclosing dynamic-wind, and resumption will run the "in" guard. In practice we find that we defer "finalization" issues to with-throw-handler / catch, which unlike dynamic-wind don't run on every entry or exit of a dynamic extent but rather just on exceptional exits. We will see over time if this situation is acceptable. It's certainly another nail in the coffin of dynamic-wind though.

    This article started with pthreads malaise, and although we've solved the problem of having a million fibers, we haven't solved the communications problem. How should fibers communicate with each other? This is the topic for my next article. Until then, happy hacking :)

    by Andy Wingo at June 27, 2017 10:17 AM

    June 26, 2017

    Andy Wingo

    an early look at p4 for software networking

    Happy midsummer, hackfriends!

    As you know at work we have been trying to find ways to apply compilers technology to the networking space. We will compile high-level configurations into low-level network processing graphs, search algorithms into lookup routines optimized for the target data structures, packet filters into code ready to be further trace-compiled, or hash functions into parallel AVX2 code.

    On one side, we try to provide fast implementations of existing "languages"; on the other side we can't help but try out new co-designed domain-specific languages that can be expressive and run fast. As an example, with pfmatch we extended pflang, the tcpdump language, with a more match-action kind of semantics. It worked fine but the embedding between pfmatch and the host language could have been smoother; in the end the abstractions it offers don't really apply to what we have needed to build. For a long time we have been wondering if indeed there is a better domain-specific programming language to apply to the networking domain.

    P4 claims to be this language, and I think it's worth a look. P4's goal is to be able to define switches and other networking equipment in software, with the specific goal that P4 programs can be synthesized to ASICs, installed in the FPGA of a "Smart NIC", or compiled to CPUs. It's a wide target domain and the silicon-bakery side of things definitely constrains what is possible. Indeed P4 explicitly disclaims any ambition to be a general-purpose programming language. Still, I think they manage to achieve an admirable balance between declarative programming and transparent low-level compilability.

    The best, most current intro to P4 out there is probably Vladimir Gurevich's slides from last month's P4 "developer day" in California. I think it does a good job linking the language's syntax and semantics with how they are intended to be applied to the target domain. For a more PL-friendly and abstract introduction, the P4_16 specification is a true delight.

    Like I said, at work we build software switches and other network functions, and our target is commodity hardware. We write most of our work in Snabb, a powerful network toolkit built on LuaJIT, though we are branching out now to VPP/fd.io as well, just to broaden the offering a bit. Generally we try to build solutions that don't have any dependencies other than a commodity Xeon server and a commodity NIC like Intel's 82599. So how could P4 help us in what we're doing?

    My first thought in this regard was that if there is a library of P4 building blocks out there, that it would be really convenient to be able to incorporate a functional block written in P4 within the graph of a Snabb program. For example, if we have an IPFIX collector written in Snabb (and we do!), it would be cool to stick that in the middle of a P4 traffic conditioner.

    (Immediately I run into the problem that I am straining my mind to think of a network function that we wouldn't rather just write in Snabb -- something valuable enough that we wouldn't want to "own" it and instead we would import someone else's black box into our data-plane. Maybe this interesting in-network key-value cache counts? But I digress, let's assume that something exists here.)

    One question is, why bother doing P4 in software? I can understand that if you have 1Tbps ports that you definitely need custom silicon pushing your packets around. You would like to be able to program that silicon, so P4 looks to be a compelling step forward. But if your needs are satisfied with 40Gbps ports and you have chosen a software networking solution for its low cost, low lock-in, high flexibility, and sufficient performance -- well does P4 buy you something?

    Right now it would seem that the answer is "no". A Cisco group wrote a custom P4 compiler to VPP, which is architecturally pretty much the same as Snabb, and they had to do some work to get the performance within a couple percent of the hand-coded application. The only win I can see is if people start building up libraries of reusable P4 components that can be linked together -- but the language itself currently doesn't support any more global composition primitive than #include (yes, it uses CPP :).

    Additionally, at least as things are now, it doesn't seem that there's a library of reusable, open source P4 components out there to take advantage of. If this changes, I'll have to have another look. And of course it's worth keeping an eye on what kinds of cool things people are building :)

    Thanks to Luke Gorrie for conversations leading to this blog post. All opinions and errors mine, of course!

    by Andy Wingo at June 26, 2017 02:00 PM

    June 16, 2017

    Jacobo Aragunde

    GENIVI-fying Chromium, part 3: multi-seat

    In the previous blog posts, we described the work to bring the Chromium browser to the GENIVI Development Platform (GDP) using the latest version of the Ozone-Wayland project. We also introduced our intention to develop multi-seat capabilities on that version of the Chromium browser. This post covers the details of the multi-seat implementation.

    Goal

    The GENIVI stack is supposed to allow applications to run in multi-seat mode. A seat is a set of input/output devices like, for example, a touchscreen and a keyboard; one computer (the head unit) connected to several seats should be able to assign applications to each seat and let them run independently. Hence, our goal is to let one Chromium instance manage several browser windows at the same time and independently, getting their input from different seats.

    Renesas Salvator-X board running Chromium on two seats

    Problem

    We started with an analysis of the browser in a multi-seat environment, comparing its behavior with other applications, and we identified some problems. First, we noticed that keyboard focus could be stolen by other browser windows; second, we found that only one browser window was receiving all input events regardless of seat configuration.

    Let me first illustrate the flow of events between Chromium processes in Ozone-Wayland:

    All browser window surfaces belong to the GPU process; events that affect those surfaces arrive at this process and then they are sent to the browser process via internal IPC. There, events are processed and their effects sent to the render processes if necessary.

    The concept of “focus”, as implemented in Ozone-Wayland, means there can only be one focused window, and that window receives all kinds of events. All events received by the GPU process that belong to different surfaces/windows are merged and received by the focused window in the browser process. Important information is lost in the process, like the original device ids. Besides, there is no awareness of the different seats in the browser process, and the GPU process ignores that info despite having it.

    Solution

    The basis of the solution is to break the assumption of having only one focused window and integrate seat information in the event flow.

    We started by creating separate concepts of keyboard and pointer focus, which fixed the first issue for the most part. For the complete solution, we also had to add extra wires to link seats and devices in the GPU process using already existing information, and transfer the required information to the browser process. In particular, we added extra information to the internal IPC messages related to the device ids that produce every event. We also added the concept of seats in the browser process, with new IPC signals to sync the seat objects and seat assignment information. This information is obtained using the ivi-input interface from the Wayland IVI Extension project.

    You can see a class diagram with the highlighted changes (blue: added, red: removed) below:

    The multi-seat implementation of Ozone-Wayland described above is available in my GitHub fork, branch wip/multi-seat.

    Testing it

    Patches haven’t been merged yet into genivi-dev-platform master, but there is a chromium branch with all the integration work so far. The last PR has been recently merged, which includes multi-seat, patches to support the Salvator-X board and a backported fix for the Wayland IVI Extensions.

    You can already do your own builds by cloning the genivi-dev-platform chromium branch. Then, just follow the GDP instructions. We have successfully tested the multi-seat implementation on Intel hardware and also on Renesas R-Car generation 3 boards like the Salvator-X shown above.

    If you are building your own HMI controller, you have to use the Wayland IVI Extension APIs to properly set up the screens, layers, surfaces and seat assignment. Seat configuration is done via udev (see Advanced use). For testing purposes, you may want to use the LayerManagerControl command-line tool to simulate the HMI controller; I can share with you the commands I used to set up the Salvator-X with two seats: two keyboards and two touchscreens, one of them plugged via VGA and another one through HDMI.

    First, this is my udev configuration to create the seats, in the file /etc/udev/rules.d/seats.rules. Touchscreens are identified by their physical USB address because they are the same brand and model:

    ATTRS{name}=="Dell Dell USB Keyboard", ENV{WL_SEAT}="seat_1"
    ATTRS{name}=="Logitech USB Keyboard", ENV{WL_SEAT}="seat_2"
    
    ATTRS{phys}=="usb-ee0a0100.usb-1.1/input0", ENV{WL_SEAT}="seat_1", ENV{WL_OUTPUT}="VGA-1"
    ATTRS{phys}=="usb-ee0a0100.usb-1.2/input0", ENV{WL_SEAT}="seat_2", ENV{WL_OUTPUT}="HDMI-A-1"
    

    To manage layers, surfaces and focus on my own, I had to stop the GENIVI HMI:

    systemctl --user stop gdp-new-hmi
    

    I started by setting up one layer for each screen, with sizes that match the screen resolutions:

    LayerManagerControl create layer 1000 1024 768
    LayerManagerControl set layer 1000 visibility 1
    LayerManagerControl set screen 0 render order 1000
    
    LayerManagerControl create layer 2000 1280 720
    LayerManagerControl set layer 2000 visibility 1
    LayerManagerControl set screen 1 render order 2000
    

    Then I ran the chromium browser twice (you will probably want to have several terminals open into the device for convenience) to get two browser windows with surfaces 7001 and 7002. I configured the surface sizes and assigned them to each layer:

    LayerManagerControl set surface 7001 visibility 1
    LayerManagerControl set surface 7001 source region 0 0 1728 1080
    LayerManagerControl set surface 7001 destination region 0 0 1024 768
    
    LayerManagerControl set surface 7002 visibility 1
    LayerManagerControl set surface 7002 source region 0 0 1728 1080
    LayerManagerControl set surface 7002 destination region 0 0 1280 720
    
    LayerManagerControl set layer 1000 render order 7001
    LayerManagerControl set layer 2000 render order 7002
    

    Finally, I configured seat acceptance for each surface so it receives events from only one seat, and gave keyboard focus to both:

    LayerManagerControl set surface 7001 input acceptance to seat_1
    LayerManagerControl set surface 7002 input acceptance to seat_2
    LayerManagerControl set surfaces 7001,7002 input focus keyboard
    

    This work is performed by Igalia and has been made possible by the funding provided by the GENIVI Alliance through the Challenge Grant Program. Thank you!


    by Jacobo Aragunde Pérez at June 16, 2017 09:01 AM

    June 15, 2017

    Michael Catanzaro

    Debian Stretch ships latest WebKitGTK+

    I’ll keep this update short. Debian has decided to ship the latest version of WebKitGTK+, 2.16.3, in its upcoming Stretch release. Since Debian was the last major distribution holding out on providing WebKit security updates, this is a big deal. Huge thanks to Jeremy Bicha for making this possible.

    The bad news is that Debian is still considering whether or not to provide periodic security updates after the release, so there might not be any. But maybe there will be. We will have to wait and see. At least releasing with the latest version is a big step in the right direction.

    by Michael Catanzaro at June 15, 2017 04:34 PM

    June 09, 2017

    Maksim Sisov

    Running Chromium m60 on R-Car M3 board & AGL/Wayland.

    It has been some time since my fellow igalian Frédéric Wang wrote a blog post about running Chromium with Wayland on a Renesas R-Car M3 board. Since that time, we have had great success adding Wayland support to Chromium with Ozone, in a way that aligns with Google's plans. A blog post about these achievements can be found on my fellow igalian Antonio Gomes' blog.

    Performed by …
    www.igalia.com

    … and sponsored by …
    https://www.renesas.com/en-eu/

    Since the last build, the Automotive Grade Linux distribution, which is used to power the R-Car M3 board, has had some updates: the CC branch was released, and the next release brought many changes, like an update from Weston 1.09 to 1.11 and an update of binutils from 2.26 to 2.27. The binutils update brought up some problems with linking, which were reported to AGL (the issue can be tracked here).

    Due to the above mentioned linking problems, we decided to use the CC branch to run tests with the latest Chromium/Ozone with Wayland and present our work during the Automotive Linux Summit in Tokyo, Japan, where my fellow igalian Antonio Gomes gave a talk and presented the demo. The demo ran smoothly and flawlessly. Afterwards, we reran the tests that had been run previously in December and compared the results. The outcome was very good, as the overall performance of the browser increased.

    But we still wanted to try the browser with the latest AGL branch and spent some time resolving the issue, which was a relocation overflow in the R_AARCH64_LD64_GOTPAGE_LO15 and R_AARCH64_ABS32 relocations. The specs for those relocations can be found in the ELF for the ARM® 64-bit Architecture (AArch64) document.

    In order to overcome the problem and fit into the overflow check, which can be found in the above mentioned document, we used the -Os and -fPIE flags, which, overall, optimized the final binary for space and reduced the size of the image, but led to some performance decrease. After the image was ready, we booted the R-Car M3 board and successfully started the browser using the following command:

    /usr/bin/chromium/chrome --mus --user-data-dir=/tmp/user-data-dir --no-sandbox

    The recipe for our meta-browser can be found in Igalia's meta-browser GitHub repository. It is also possible to test Chromium/Ozone with Wayland and X11 support by cloning and building our other chromium repository, but please note that the work is still in progress and some issues may occur.

    by msisov at June 09, 2017 10:59 AM

    June 05, 2017

    Diego Pino

    Dive into Lightweight 4over6

    In the previous articles I took a look at the status of the IPv4 address exhaustion problem. I also reviewed the current state of IPv6 adoption as well as covering some of the transition technologies that could ease the adoption of IPv6. One of these transition technologies is Lightweight 4over6 (RFC 7596), often abbreviated as lw4o6, an extension to the Dual-Stack Lite Architecture (RFC 6333).

    In this article I explain the Lightweight 4over6 standard and its main concepts. I will discuss its implementation in another article.

    lw4o6 in a nutshell

    Lightweight 4over6 is an extension to the Dual-Stack Lite architecture. As a reminder, a DS-Lite architecture enables a carrier to provide IPv4 connectivity to its customers over an IPv6-only network. The main components of a DS-Lite architecture are two:

    • B4 (Basic Bridging BroadBand). Runs at the customer’s home router.
    • AFTR (Address Family Transition Router). Runs in the carrier’s network.

    The B4 element tunnels the customer’s IPv4 packets, sending them over an IPv6 network (the carrier’s network). The AFTR runs a Carrier-Grade NAT function on the decapsulated IPv6 packets. The Carrier-Grade NAT function allows sharing of the carrier’s public IPv4 address pool, enlarging the potential total number of IPv4 connections.

    Lightweight 4over6 inherits these two main components from DS-Lite, rebranding them as lwB4 (pronounced as “lightweight before”) and lwAFTR (pronounced as “lightweight after”). However, they're slightly different from their DS-Lite counterparts.

    Lightweight 4over6 mainly differs from DS-Lite in moving the Carrier-Grade NAT function back to the customer's home router. One of the main disadvantages of DS-Lite is that the AFTR works as a centralized element in the network, which can cause scaling problems.

    Instead of NAPT, lw4o6 uses a technique called A+P (“The Address plus Port (A+P) Approach to the IPv4 Address Shortage”, RFC 6346) to enable sharing of a single public IPv4 address. This technique partitions the port space of a single public IPv4 address among different customers; in other words, the port is used as part of the addressing schema. Every customer is assigned an IPv4 public address and a port range. Example:

    Customer’s ID   IPv4 public address   Port-range
    Customer 1      198.51.100.1          0-1023
    Customer 2      198.51.100.1          1024-2047
    Customer 64     198.51.100.1          64512-65535

    This partition guarantees that the customer's connections are uniquely identified and won't overlap within the carrier's network. A+P represents a stateless alternative to NAPT, as the A+P gateway does not need to keep track of every network flow. However, it requires A+P software on the customer's side, capable of limiting the source port to the assigned range.
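
    As a toy illustration of this partitioning (not part of any lw4o6 implementation), here is how the example table above maps a source port back to its customer, assuming 1024 contiguous ports per customer:

    # Hypothetical helper: which customer owns a given port of the shared
    # 198.51.100.1 address, with 1024 contiguous ports per customer.
    def customer_for_port(port, ports_per_customer=1024):
        return port // ports_per_customer + 1  # customers numbered from 1

    print(customer_for_port(80))     # 1  (ports 0-1023)
    print(customer_for_port(2000))   # 2  (ports 1024-2047)
    print(customer_for_port(65000))  # 64 (ports 64512-65535)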

    The diagram below summarizes what an lw4o6 deployment looks like:

    lw4o6 chart
    • Red arrows represent IPv4 packet flows.
    • Blue arrows represent IPv6 packet flows. They're actually IPv4-in-IPv6 packet flows.
    • The carrier’s network is an IPv6 network. It only handles IPv4 traffic at its ends (Customer’s PC and Border-Router’s Internet facing side).
    • The home router executes the lwB4 network function. This function performs a NAPT on the customer's packets, encapsulates them into an IPv6 tunnel and forwards them to the carrier's network.
    • The border router executes the lwAFTR network function. This function maintains a table of softwires (the binding-table). Softwires are mappings between an IPv6 address and an IPv4 public address plus an IPv4 port-set. Each softwire represents a customer's network flow.

    Life of a packet in lw4o6

    Let's take a look at how a packet is routed over an lw4o6 network. I take as a use case outgoing packets originating at the customer's PC.

    • Like in most environments, a customer's PC gets an IPv4 private address assigned by its home router. In networking jargon, a home router is more generally called a CPE (Customer Premises Equipment). When a packet leaves the customer's PC, it first reaches its CPE.
    • The CPE runs the lwB4 network function. This function performs a NAPT44 on the customer's IPv4 packet source address and source port. That's possible because every CPE in the carrier's network is provisioned with a public IPv4 address and a port-set. CPEs are also assigned a unique IPv6 address.
    • In addition to NAPT44, the lwB4 function also encapsulates the customer's packets into an IPv6 tunnel, as sketched in the code after this list. The result is an IPv4-in-IPv6 packet where the encapsulated IPv4 packet's source address is actually an IPv4 public address. The CPE forwards this packet to its next hop so it reaches the carrier's network.
    • IPv4-in-IPv6 packets get routed within the carrier’s network and eventually reach a networking element called a Border-Router. The Border-Router is the Internet facing side of an lw4o6 network and it runs the lwAFTR function.
    • The lwAFTR function inspects incoming IPv4-in-IPv6 packets arriving onto its internal interface (the interface facing the carrier’s network) and tries to find a matching softwire for each packet.
    • A softwire is a mapping between IPv6 address and IPv4 address + IPv4 port-set. If there’s a match for the incoming IPv4-in-IPv6 packet, the packet gets decapsulated and forwarded onto the Border-Router external interface getting out of the carrier’s network into the Internet realm.
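
    To make the encapsulation step concrete, here is a rough sketch of what the resulting IPv4-in-IPv6 packet looks like, built with scapy (scapy is not part of lw4o6 or of any CPE; the addresses and ports are purely illustrative):

    from scapy.layers.inet import IP, UDP
    from scapy.layers.inet6 import IPv6

    # Inner packet: already NAPT44'd, so its source is the shared public IPv4
    # address and a port taken from the CPE's assigned port-set.
    inner = IP(src="198.51.100.1", dst="203.0.113.10") / UDP(sport=1025, dport=53)

    # Outer header: lwB4 IPv6 address as source, lwAFTR address as destination;
    # next header 4 means the payload is an IPv4 packet.
    outer = IPv6(src="fc00:1:2:3:4:5:3:1", dst="fc00::100", nh=4) / inner

    outer.show()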

    Similarly, the opposite process happens for incoming packets from the Internet:

    • Incoming IPv4 packets from the Internet first reach the Border-Router's external interface.
    • The lwAFTR function inspects incoming packets and performs a softwire lookup using the packet's IPv4 destination address and port as a key.
    • If there’s a matching softwire, the packet gets IPv6 encapsulated using the matching softwire’s IPv6 address as destination address.
    • Eventually the encapsulated IPv4-in-IPv6 packet reaches the targeted CPE.
    • The lwB4 function at the CPE decapsulates the packet, resolves the NAPT44 and forwards the incoming IPv4 packet to its final destination: the customer’s PC.

    Softwires data-model

    One of the core elements of lw4o6 is the lwAFTR's binding-table. Basically, this table consists of a collection of softwires. A softwire defines a mapping between an IPv4 address and an IPv6 address. Softwires are a key concept for connecting IPv4 networks across IPv6 networks and vice versa.

    The IETF Softwire Working Group has proposed a YANG data model for IPv4-in-IPv6 softwires. This data model is used not only by lw4o6 but also by other A+P mechanisms such as MAP-E (RFC 7597) or MAP-T (RFC 7599). It defines a softwire configuration as:

    +--rw softwire-config
    |  +--...
    |  +--rw binding {binding}?
    |     +--rw br {br}?
    |     |  +--rw enable?                          boolean
    |     |  +--rw br-instances
    |     |     +--rw br-instance* [id]
    |     |        +--rw binding-table
    |     |           +--rw binding-entry* [binding-ipv6info]
    |     |              +--rw binding-ipv6info     union
    |     |              +--rw binding-ipv4-addr    inet:ipv4-address
    |     |              +--rw port-set
    |     |              |  +--rw psid-offset       uint8
    |     |              |  +--rw psid-len          uint8
    |     |              |  +--rw psid              uint16
    |     |              +--rw br-ipv6-addr         inet:ipv6-address
    |     |              +--rw lifetime?            uint32

    An instance of this data-model would look something like this:

    softwire-config {
      binding {
        br-instances {
          br-instance {
            binding-table {
              binding-entry {
                binding-ipv6info fc00:1:2:3:4:5:3:1;
                binding-ipv4-addr 198.51.100.1;
                port-set {
                  psid-len 6;
                  psid 1;
                }
                br-ipv6-addr fc00::100;
              }
            }
          }
        }
      }
    }

    A softwire-config file can contain several binding-tables. Each binding-table is composed of several binding-entry elements. Each of these entries represents a mapping between an IPv4 address plus port-set and an IPv6 address; in other words, a softwire. Let's look in detail at each of a softwire's elements:

    • binding-ipv6info: it's the IPv6 address of the lwB4. It can be written as an IPv6 address or an IPv6 address plus a CIDR prefix.
    • binding-ipv4-addr: it’s a shared public IPv4 address.
    • port-set: it identifies a range of ports. In this case the port-set defined is 1024-2047. I’ll take a closer look at port-set definition later.
    • br-ipv6-addr: it’s the IPv6 address of the lwAFTR. It’s not used for a softwire lookup.

    Imagine an IPv4-in-IPv6 packet with source address fc00:1:2:3:4:5:3:1 arrives at the lwAFTR's internal interface. If the source address of the encapsulated IPv4 packet is 198.51.100.1 and its source port is in the [1024-2047] range, then there's a match. The lwAFTR function will decapsulate the packet and forward it through its external interface out into the Internet realm.
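
    As a sketch of that match (just a toy model, not how the Snabb lwAFTR is implemented), the lookup over the example entry could be expressed like this; a real lwAFTR would hash on the address and PSID rather than scan a list:

    # Values come from the binding-entry above; the structure is made up.
    softwires = [
        {"b4_ipv6": "fc00:1:2:3:4:5:3:1",
         "ipv4": "198.51.100.1",
         "port_min": 1024,
         "port_max": 2047},
    ]

    def match_softwire(ipv6_src, ipv4_src, sport):
        """Return the softwire matching a decapsulated packet, if any."""
        for sw in softwires:
            if (sw["b4_ipv6"] == ipv6_src and sw["ipv4"] == ipv4_src
                    and sw["port_min"] <= sport <= sw["port_max"]):
                return sw
        return None

    print(match_softwire("fc00:1:2:3:4:5:3:1", "198.51.100.1", 1500) is not None)  # True
    print(match_softwire("fc00:1:2:3:4:5:3:1", "198.51.100.1", 5000) is not None)  # False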

    In addition to the definition of softwires, the proposed YANG data model also defines the relevant parameters for provisioning a lwB4.

    Port-mapping

    lw4o6 uses the same port-mapping algorithm as MAP-E (RFC 7597 - Section 5.1).

    A simple way of defining a port-set could be expressing it as a [min, max] pair where min and max are 16-bit positive integers. For instance, [0, 63] defines a port-set between 0 and 63 inclusive. However, crafting port-sets by hand would be error-prone, hard to maintain and would make automatic provisioning difficult.

    Another way of expressing a port-set is to use an approach similar to CIDR (Classless Inter-Domain Routing). In CIDR notation, an IPv4 netmask is expressed as a value between 0 and 32, representing the number of leading contiguous bits that are set. For instance:

    IPv4 netmask    CIDR notation
    255.255.255.0   24
    255.255.0.0     16
    255.0.0.0       8

    Following this approach, the port-set space can be described by a number n between 0 and 16. This value determines the number of port-sets, computed as 2^n. All the remaining bits are used to express the number of ports available per port-set, 2^(16-n).

    In the example above:

    port-set {
      psid-len 6;
      psid 1;
    }

    • psid-len: the length in bits of the port-set identifier. Since its value is 6, the total number of port-sets is 64 (2^6).
    • psid: it’s the port-set identifier in the softwire.

    All the remaining bits (16 - 6 = 10) are the number of ports available per port-set. Since in the example above psid-len was 6 the number of ports per port-set is 1024 (2^10).

    Considering the configuration above, what’s the actual port-set for PSID 1? Taking a look at this table would help to figure it out:

    PSID   port-set
    0      0 - 1023
    1      1024 - 2047
    63     64512 - 65535

    So [1024, 2047] is the actual port-set for PSID 1 when psid-len is 6, leaving 10 bits to address the ports within each set.
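
    The computation behind that table is straightforward; here is a small sketch of it (assuming psid-offset is 0, the lw4o6 default):

    def port_range(psid, psid_len):
        ports_per_set = 2 ** (16 - psid_len)
        first = psid * ports_per_set
        return first, first + ports_per_set - 1

    print(port_range(0, 6))   # (0, 1023)
    print(port_range(1, 6))   # (1024, 2047)
    print(port_range(63, 6))  # (64512, 65535)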

    The YANG model also specifies another attribute called psid-offset. In lw4o6 this attribute is zero by default. If set, psid-offset reserves the lowest port numbers so that they don't belong to any port-set. In MAP-E or MAP-T, psid-offset is generally used to exclude the system ports. In lw4o6, excluding the system ports mostly means not using PSID 0 in a softwire configuration.

    Summary

    In this article I introduced lw4o6 and explained how it works at a high-level view. I also took a look at IETF’s YANG softwire data-model for the definition of softwires, a key element of lw4o6. Lastly, we learned how port-sets are defined in lw4o6.

    In the next article, I will discuss the implementation of a lwAFTR function using Snabb, a toolkit for the development of high-performance network functions.

    June 05, 2017 06:00 AM

    May 30, 2017

    Víctor Jáquez

    GstSpringHackfest2017: a quick report

    Two weeks ago was the GStreamer Spring Hackfest 2017 and I am very happy about how it went. I have the feeling that most of the attendees had a good time, and made some progress in their projects. I want to thank all the people that participated, in some way or another.

    Along that weekend when the hackfest happened, besides my duties as organizer (with a lot of help from my colleagues at Igalia), I managed to hack a bit on GstPlayer, proposing the missing API for setting the subtitles font description (782858). Also I helped Nicolas a bit with the upstreaming of the v4l2 video encoder (728438). Julien Isorce and I talked about the missing parts of DMABuf support in gstreamer-vaapi, in particular the action path when the new libva API, for importing and exporting DMABuf, got merged (779146). With Thibault we played with the idea of a Jenkins server doing CI for gstreamer-vaapi. Also I did some kernel debugging, and found out why kmssink failed on the db410c when the caps changed from RGB to YUV, so Rob Clark cooked a patch.

    Finally, I worked on a time-lapse video of the hackfest's main room, using only GStreamer with gstreamer-vaapi on an Atom-based NUC. You can glance at the code of the video grabber. Thanks to Luis de Bethencourt for the original idea and code.

    by vjaquez at May 30, 2017 03:54 PM

    Jacobo Aragunde

    PhpReport 2017

    Time for our yearly release of PhpReport! There has been a lot of activity during the last year as part of the Igalia Coding Experience program. Thanks to Tony Thomas for having done great work!

    These are my highlights of the PhpReport 2.17 release:

    Simplified UI for tasks

    We have changed the tasks UI to give more room to the most important fields and sorted them by importance, so they come first in the tab-navigation order. The goal is to be able to fill in tasks faster and more efficiently.

    In particular, the new projects field is very interesting because it lets users search both by customer and by project name. With this change, we have been able to remove the customer field for every task. Choice in the project field is limited by default to the most common projects, the ones users have been directly assigned to; the full list of open projects can be loaded with the special load all entry.

    Simplification extends to the data model, related web services and reports. Now a project can only be assigned to one customer removing the many-to-many relation there used to be. The ability to assign several clients to the same project was barely used, and even felt unnatural in most cases, so it’s not a big loss.

    Auto-save tasks

    The tasks screen got the ability to auto-save changes every certain number of seconds. We have kept one exception: deleted tasks must be manually saved, and we have kept the save buttons for that purpose. This exception will be around until we have some way to undo task deletion.

    Persistent templates

    Templates are finally kept on the server side so they can be accessed from any browser and any computer. Their usage is also less cryptic, now a name for the template is explicitly asked upon creation instead of using the description field as a name. Finally, we added one permanent template to create a task that comprises the entire work day. It’s very useful to fill holidays in, because the length of the work day is calculated for that user and day.

    User weekly goals

    To better keep extra hours under control, we have added a new entry to the User work summary box in the tasks screen. The week goal entry will tell users how many hours they should work every week to finish the year with 0 accumulated hours. It updates every week, taking into account the number of hours accumulated since the beginning of the year. For example, if you worked some extra time in the last month, the weekly goal will give you a figure that is lower than your standard weekly timetable, so at the end of the year the extra time is compensated.

    It’s possible to define per-user weekly goals, with custom time periods and numbers of accumulated hours.

    Weekly hours in a project

    A new grid in the project details report will show the number of hours worked every week by project members. It is useful to keep a weekly control of the time devoted to the project.

    New manager user profile

    We have limited standard PhpReport users' access to only certain reports, and a new user profile called manager has been added. Manager users can access system-wide reports and the details of any existing project.

    And more

    Days start with an empty date so you can start typing your progress right away, there are keyboard shortcuts to jump to the next or previous days, more direct access to project details reports…

    Check the other many features and fixes in the release page, and enjoy PhpReport 2.17!

    by Jacobo Aragunde Pérez at May 30, 2017 12:06 PM

    Diego Pino

    IPv6 deployment status and transition technologies

    IPv4 has served us well for the last 35 years, but in a world of already exhausted address space its future seems uncertain. Everyone knew it wouldn't last forever. However, most ISPs didn't start actively deploying IPv6 networks until the address pool was almost depleted. Why was that? Why did it take so long to react? There are several reasons:

    • Necessary period of time to implement the standard. The IPv6 proposed standard was published in December 1998, but it takes a long time to implement and test a standard, and several years can pass until it finally ships in products. In the case of IPv6, it was also necessary to deploy pilot networks to test it in real environments, and operators needed time to learn how to work with the new technology. All in all, it's fair to say that the decade that followed the publication of RFC 2460 (“Internet Protocol, Version 6 (IPv6)”) was a period for experimenting, testing and learning.
    • Lack of a clear benefit. Replacing IPv4 with IPv6 won't result in a performance boost or more reliable networks, and users won't notice the change. If things are working well, why bother changing them? In addition, deploying an IPv6 network is not a trivial task. It often implies a significant financial and human-resources effort. For instance, if the network depends on legacy software or hardware that doesn't support IPv6, it's necessary to replace it. There are operational costs as well, since it's necessary to maintain an additional network.
    • Lack of incentive. Some governments encouraged the adoption of IPv6 by requiring networks to be upgraded. That was the case of the US government, which in 2005 set a three-year deadline to add IPv6 support to the backbones of all its federal agencies. Other governments defined agendas but failed to fulfill them. Without demand from customers, and lacking a clear benefit, ISPs didn't take action to make the switch on their own initiative.
    • Dependency on IPv4. The IPv6 transition depends on carriers, governments, standards organizations, hardware manufacturers and content providers. It's a distributed effort, and it was known from the start that the transition wouldn't happen simultaneously everywhere. For some period of time, IPv4 and IPv6 will have to co-exist. But we still depend on IPv4 today, and it's not only about content or connectivity, it's about software too. For instance, Skype is reported not to work in IPv6-only environments (Skype 5.0 for Linux, announced two months ago, doesn't have that problem). Steam is another application which still depends on IPv4. Another infamous case is Windows XP, which depends on IPv4 connectivity to perform DNS resolution.

    Worldwide deployment status

    Since IPv4 and IPv6 are disjoint networks, it's not possible to reach an IPv4 server from an IPv6 client (unless our ISP provides some sort of bridging). According to Alexa, as of today 25% of the world's top 1000 sites are reachable over IPv6.

    Alexa Top 1000 websites reachable over IPv6
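
    As a rough sketch of how this kind of reachability number can be approximated (checking only whether a name publishes an AAAA record, which is just a proxy for IPv6 reachability, not a guarantee), something like the following Python snippet could be used; the hostnames are only examples:

        # Illustrative sketch: does the hostname resolve to any IPv6 address?
        # Having an AAAA record does not guarantee the service actually works
        # over IPv6, but it is a cheap first approximation.
        import socket

        def has_ipv6_address(hostname):
            try:
                infos = socket.getaddrinfo(hostname, 443, socket.AF_INET6)
                return len(infos) > 0
            except socket.gaierror:
                return False

        for site in ("google.com", "example.com"):   # example hostnames
            print(site, "publishes AAAA records:", has_ipv6_address(site))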

    Companies such as Google and Akamai provide statistics on the status of IPv6 adoption. Adoption is uneven worldwide, with countries such as Belgium (48%), the USA (32%) and Greece (30%) at the top of the list in end-user connectivity.

    In the case of the USA, mobile connectivity has helped a lot to increase IPv6 adoption, thanks in large part to operators such as Verizon. Unlike 3G networks, 4G networks are packet-switched only, which means that voice services run on VoIP (voice over IP). Verizon mandates that all its 4G networks work on IPv6 only, deprecating IPv4 capability.

    According to Google, 18% of the world’s Internet traffic today is IPv6.

    IPv6 end-user adoption

    Cable TV, which nowadays is also delivered over IP, has helped a lot to increase IPv6 adoption as well. It's not surprising to see companies such as Comcast, the largest cable television company in the world, ranking in the top 10 of IPv6 network operator adoption.

    Network operator measurements (Top 10)

    It's important, though, to distinguish IPv6 end-user adoption from network adoption. Network adoption is measured by counting the number of ASes (Autonomous Systems) that are IPv6 capable, while end-user adoption is often measured by tracking requests to dual-stack websites (websites reachable over either IPv4 or IPv6). According to RIPE NCC, 23% of the networks worldwide are IPv6 enabled.

    Worldwide IPv6 enabled networks

    That's a global trend: network adoption is usually higher than end-user adoption. There are several reasons for that. Some carriers are IPv6 capable but are not allocating IPv6 addresses to their customers yet. Another reason is carriers providing IPv4 services on IPv6-only networks; I will get into that later.

    Transitioning to IPv6: efforts and challenges

    Since IPv6 was proposed it was clear that the new protocol would need to live together with IPv4, at least for some period of time. The Internet is now much more complex and distributed than when the ARPAnet migrated from NCP to IPv4, so this time the transition will happen much more gradually, at different paces in different countries.

    For this reason, the scenario that everyone foresaw 10 years ago was Dual-Stack networks. A Dual-Stack network supports both IPv4 and IPv6 connectivity. IPv6 connectivity is preferred, but if a site or service is not available over IPv6, the client falls back to IPv4.
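
    As a rough illustration of that preference, a simplified client could try IPv6 first and only then fall back to IPv4; real stacks use smarter strategies such as Happy Eyeballs (RFC 6555), which race both families instead of trying them strictly in order. This is only a sketch:

        # Sketch of the dual-stack preference described above: resolve and
        # connect over IPv6 first, fall back to IPv4 if that fails.
        import socket

        def connect_dual_stack(host, port):
            for family in (socket.AF_INET6, socket.AF_INET):
                try:
                    for af, socktype, proto, _, addr in socket.getaddrinfo(
                            host, port, family, socket.SOCK_STREAM):
                        s = socket.socket(af, socktype, proto)
                        try:
                            s.connect(addr)
                            return s            # connected; IPv6 was preferred
                        except OSError:
                            s.close()
                except socket.gaierror:
                    pass                        # no addresses in this family
            raise OSError("neither IPv6 nor IPv4 worked for %s" % host)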

    In addition to Dual-Stack networks, there is a myriad of mechanisms that provide interoperability between IPv4 and IPv6. Usually these technologies involve some type of tunneling or translation. The possible scenarios are IPv4 connectivity over an IPv6-only network, IPv6 connectivity over an IPv4-only network, or a combination of both. Here is a summary:

    Connectivity   Type of network   Transition technology
    IPv4, IPv6     Dual-Stack        None
    IPv6           IPv4-only         Tunnel broker/6in4, 6over4, 6rd, 6to4/Teredo, ISATAP, IVI/NAT64/SIIT
    IPv4           IPv6-only         Dual-Stack Lite, Lightweight 4over6, MAP, 4in6, 464XLAT, IVI/NAT64/SIIT

    IPv6 connectivity over an IPv4-only network is a useful scenario for testing and trying out IPv6 without incurring the costs and troubles of deploying a full IPv6 network. One technology that enables this is 6rd (RFC 5969, “IPv6 Rapid Deployment on IPv4 Infrastructures”). RFC 6264 (“An Incremental Carrier-Grade NAT (CGN) for IPv6 Transition”) is also an interesting proposal. Tunnel brokers/6in4 are useful if you're interested in trying out IPv6 but your ISP hasn't assigned you an IPv6 address yet. I discussed how to set up a tunnel broker with Hurricane Electric in this other blog post: IPv6 tunnel.

    Translation between IPv6 and IPv4, also known as IVI translation, is another interesting mechanism. NAT64 (RFC 6052, “IPv6 Addressing of IPv4/IPv6 Translators”) is the most popular form of IVI translation and works in all the scenarios. NAT64 translates the headers of IPv4 packets into IPv6 headers, and vice versa. NAT64 deprecates SIIT.
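
    As a small illustration of the RFC 6052 addressing used by NAT64, the snippet below embeds and extracts an IPv4 address in the well-known prefix 64:ff9b::/96; this is a sketch of the address format only, not of a translator:

        # RFC 6052 address mapping with the well-known NAT64 prefix
        # 64:ff9b::/96: the IPv4 address is carried in the low 32 bits.
        import ipaddress

        WKP = ipaddress.IPv6Network("64:ff9b::/96")

        def ipv4_to_nat64(ipv4):
            v4 = ipaddress.IPv4Address(ipv4)
            return ipaddress.IPv6Address(int(WKP.network_address) | int(v4))

        def nat64_to_ipv4(ipv6):
            v6 = ipaddress.IPv6Address(ipv6)
            return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)

        print(ipv4_to_nat64("192.0.2.1"))          # 64:ff9b::c000:201
        print(nat64_to_ipv4("64:ff9b::c000:201"))  # 192.0.2.1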

    Lastly, the other possible scenario is providing IPv4 connectivity over an IPv6-only network. There's an increasing interest in IPv6-only deployments, since one of the disadvantages of Dual-Stack is having to maintain two networks. However, as many services still depend on IPv4, operators need to provide IPv4 connectivity to their customers. IPv4 services can still be delivered over an IPv6 network by using tunnels and introducing some business logic in the carrier.

    Some of the most popular IPv4-on-IPv6 technologies are 464XLAT and Dual-Stack Lite. In the next section I will cover the latter, but before that I need to discuss Carrier-Grade NAT.

    Carrier-Grade NAT

    For many years it was thought that the transition to IPv6 would be completed before the IPv4 address pool was totally exhausted. But the transition barely started a few years ago, and ISPs now face a scenario where they need to extend the lifetime of IPv4. The proposed solution is called CGN (Carrier-Grade NAT), sometimes also called LSN (Large-Scale NAT).

    But before diving into the underpinnings of CGN, let me explain how an ISP assigns public addresses to its customers.

    Normally, service providers assign public IP addresses to their customers via DHCP. Each CPE (Customer Premises Equipment), also known as a home gateway, receives a public IP address. Sometimes the ISP also hands out a private IP network address (RFC 1918, “Address Allocation for Private Internets”), although generally it is the user who picks a preferred private address. This private addressing is used within the customer's private network, while the home gateway uses the public address to communicate within the carrier's network. The CPE runs a NAT44 function to share its public address with all the devices in the customer's private network. ISPs tend to lease public addresses to home gateways for a limited period of time: if the ISP detects a customer is inactive, it may reclaim its public address and put it back into the ISP's pool of public addresses.

    Strictly speaking, a Carrier-Grade NAT is simply a NAT placed in the service provider's network. The more devices a NAT can serve, the more useful it is. Normally a CPE's NAT serves a limited number of devices, depending on the size of the customer's network. A NAT performed at the carrier can serve multiple customers' private networks, maximizing the use of public addresses.

    Usually CGN involves a NAT444 scenario: outbound packets from the customer pass through 3 different domains: the customer's private network, the carrier's private network and the public Internet. To avoid address conflicts between the customers' private networks and the carrier's private network, the IETF reserved a /10 block called the Shared Address Space (RFC 6598, “IANA-Reserved IPv4 Prefix for Shared Address Space”). The Shared Address Space block is 100.64.0.0/10.
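
    A quick way to convince yourself that this block cannot collide with the customers' RFC 1918 ranges (and to see how many addresses the carrier domain gets) is to check it with Python's ipaddress module; a minimal sketch:

        # The Shared Address Space block (RFC 6598) used in the carrier's
        # NAT444 domain does not overlap any RFC 1918 private range.
        import ipaddress

        shared = ipaddress.ip_network("100.64.0.0/10")
        rfc1918 = [ipaddress.ip_network(n) for n in
                   ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

        print(any(shared.overlaps(n) for n in rfc1918))  # False, no conflict
        print(shared.num_addresses)                      # 4194304 addresses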

    Carrier-Grade NAT (Source: Wikipedia)

    CGN is not an IPv6 transition mechanism. It’s an approach to extend the lifespan of IPv4 addresses until a full migration is completed.

    On the other hand, CGN has some important disadvantages:

    • It introduces a centralized element in the network, which might cause bottlenecks and scalability issues.
    • It makes it impossible to host services on the customer's hosts.
    • Since different customers share the same public IPv4 address, a website that bans a customer by IP might affect other customers too.

    Dual-Stack Lite

    Now we know about CGN, which matters because CGN is a key component of Dual-Stack Lite. But what is Dual-Stack Lite?

    One of the inconveniences of Dual-Stack networks is maintaining two networks. That often means double the operational costs, as it's now necessary to configure, provision, monitor, diagnose and troubleshoot two different networks. Wouldn't it be simpler to provide both IPv4 and IPv6 services over one single IPv6 network? Enter Dual-Stack Lite.

    Dual-Stack Lite, often referred to as DS-Lite, provides IPv4 connectivity on IPv6-only networks. To do that, DS-Lite relies on IPv4-in-IPv6 tunnels to deliver IPv4 services. Tunneled packets reach an element in the carrier's network called the Address Family Transition Router, which runs a Carrier-Grade NAT function. Two elements are fundamental for the deployment of DS-Lite:

    • B4 (Basic Bridging BroadBand element): a network function that runs at the WAN interface of a customer's CPE. The B4 function is responsible for encapsulating IPv4 packets into IPv6. The CPE should not run a NAT44 on the outbound packets, since the NAT function is performed at the carrier.
    • AFTR (Address Family Transition Router): decapsulates IPv4-in-IPv6 packets and runs a CGN function over them. The AFTR keeps a binding table that groups together the CPE's IPv6 address, the private IPv4 address and the TCP/UDP port. When an inbound packet reaches the AFTR's external interface, the CGN undoes the NAT, obtaining the associated private IPv4 address; together with the destination port, both elements are used to look up the B4's IPv6 address. The AFTR then encapsulates the packet and forwards it to the customer's B4.
    Dual-Stack Lite (Source: Wikipedia)

    Compared to CGN over IPv4 networks, only one NAT is necessary in DS-Lite. The model encourages service providers to move to IPv6 while guaranteeing continued support for IPv4 services.

    However, DS-Lite also has some inconveniences. The AFTR must maintain per-flow state in the form of active NAT sessions, and if an AFTR serves a large number of B4 clients, that may cause bottlenecks and scalability issues.
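
    To make that per-flow state more concrete, here is a purely illustrative sketch of an AFTR binding table; all names and addresses are made up for the example, and a real AFTR of course works on packets, not Python tuples:

        # Illustrative-only sketch of DS-Lite AFTR state: each active session
        # maps (public address, public port, protocol) back to the B4's IPv6
        # address plus the customer's private IPv4 address and port. Note how
        # two customers may even reuse the same private address and port; the
        # B4 IPv6 address is what keeps them apart.
        binding_table = {
            ("198.51.100.7", 40001, "tcp"): ("2001:db8:ffff::2", "192.168.1.10", 52311),
            ("198.51.100.7", 40002, "tcp"): ("2001:db8:abcd::9", "192.168.1.10", 52311),
        }

        def route_inbound(dst_ip, dst_port, proto):
            """Find which B4 an inbound packet must be tunnelled back to."""
            b4_ipv6, priv_ip, priv_port = binding_table[(dst_ip, dst_port, proto)]
            # Rewrite the destination to the private address/port, then
            # encapsulate the IPv4 packet in IPv6 towards b4_ipv6.
            return b4_ipv6, priv_ip, priv_port

        print(route_inbound("198.51.100.7", 40001, "tcp"))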

    In the next article, I will discuss Lightweight 4over6, a model based on DS-Lite that tries to solve its shortcomings.

    May 30, 2017 10:00 AM

    May 25, 2017

    Diego Pino

    A brief history of IPv4 address space exhaustion

    IPv4 address space exhaustion was a hot topic in the 90s, when everyone started to foresee that inevitable future. However, we're still relying on IPv4 today. So, what actually happened? Did anyone find a vast range of unused IPv4 addresses locked in a closet? What happened to IPv6?

    Reviewing the history of IPv4 address depletion is also reviewing the history of the Internet. Many decisions about the Internet have been made with the goal of solving or mitigating this problem. In this post I start from the beginning and work up to today. It's not intended to be an exhaustive guide, but a recap of the most important events.

    8-bit Internet

    The Internet has its origin in the ARPAnet, a research network funded by the Advanced Research Projects Agency of the United States Department of Defense.

    ARPAnet came to life in 1969, connecting just 4 hosts. The network grew in size over the years, connecting more and more hosts, mainly universities and research centers in the US. In 1981, there were a total of 213 hosts connected.

    But back in the days of the ARPAnet there was no TCP/IP. Its equivalent was NCP (Network Control Protocol). Addresses in NCP were 8-bit numbers, which means each host could be addressed by a simple number such as 10, 23 or 145. Although popular, the ARPAnet was not the only computer network that flourished during the 70s, and there was a need to connect these networks into an inter-network, or internet.

    Already in the early 70s, Robert Kahn from DARPA and Vinton Cerf, developer of NCP, started to work on a new protocol that would allow communication across several heterogeneous networks. The proposed protocol was called the Transmission Control Program, first published in 1974 (“A Protocol for Packet Network Intercommunication”). Implementations of the protocol went through 4 major versions. In version 3, the protocol was split in two: the Transmission Control Protocol and the Internet Protocol. The first TCP/IP v4 draft was published in 1978, but 3 more years passed until the draft became a standard.

    On the 1st of January 1983, also known as flag day, the ARPAnet switched from NCP to TCP/IP.

    4.3 billion addresses will be enough

    One of the novelties that TCP/IP introduced was 32-bit addresses. Vinton Cerf has often taken the blame for that decision, but 32-bit addresses seemed very reasonable back in the 70s. In those days, the world's population was 4.5 billion people and the personal computing revolution hadn't started yet. Upgrading the address space to 16 bits seemed too little, and something bigger than 32 bits (4.3 billion addresses) unreasonable and unjustified.

    In 1981, another TCP/IP network was created: CSnet (Computer Science Network), funded by the National Science Foundation. In 1983, DARPA decided to split the ARPAnet in two: a public ARPAnet and MILnet. Finally, in 1985, the NSF funded yet another network, NSFnet (National Science Foundation Network).

    NSFnet was the most popular TCP/IP network in the 80s and eventually became the primary backbone of the Internet at that time. By the end of the decade, the Internet was composed of almost 1000 networks (RFC 1118, “The Hitchhikers Guide to the Internet”) and had approximately 3 million users. The ARPAnet ceased operations in 1990, and CSnet followed in 1991.

    The first concerns about the scalability of the Internet appeared in the early 90s, even before the Web was invented. RFC 1287 (“Towards the Future Internet Architecture”) is the first RFC to discuss the IP address space exhaustion problem.

    One of the first measures to simplify the management of the Internet was the creation of RIRs, or Regional Internet Registries, in 1992. Before that, the global IP address registry was managed by a single organization, the IANA (Internet Assigned Numbers Authority). Each region was allocated a range of IP addresses. The regions have evolved over time; today there are 5 RIRs:

    • AFRINIC (Africa).
    • APNIC (Asia-Pacific).
    • ARIN (Canada, many Caribbean and North Atlantic islands, and the United States).
    • LACNIC (Latin America and the Caribbean).
    • RIPE NCC (Europe, the Middle East, and parts of Central Asia).

    The glorious 90s: the Internet explodes

    The World Wide Web debuted in the early 90s, leading to an exponential growth of the Internet. But even before that, there were already concerns about its scalability.

    The IETF created the ROAD WG (Routing and Addressing Working Group) to come up with proposals that could help solve this problem. Some of the proposed solutions were:

    • RFC 1519: “Classless Inter-Domain Routing” (September 1993).
    • RFC 1597: “Address Allocation for Private Internets” (March 1994).
    • RFC 1631: “The IP Network Address Translator (NAT)” (May 1994).

    RFC 791 (“Internet Protocol”) defines an IP address as:

    Addresses are fixed length of four octets (32 bits). An address begins with a network number, followed by local address (called the “rest” field)

    It also defines 3 classes of network addresses:

    There are three formats or classes of internet addresses: in class a, the high order bit is zero, the next 7 bits are the network, and the last 24 bits are the local address; in class b, the high order two bits are one-zero, the next 14 bits are the network and the last 16 bits are the local address; in class c, the high order three bits are one-one-zero, the next 21 bits are the network and the last 8 bits are the local address.

    Summarizing:

    Class   Leading bits   Start Address   End Address       Network field   Rest field
    A       0              0.0.0.0         127.255.255.255   8 bits          24 bits
    B       10             128.0.0.0       191.255.255.255   16 bits         16 bits
    C       110            192.0.0.0       223.255.255.255   24 bits         8 bits

    This scheme is known as classful network.
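
    As a small sketch of the scheme above, the class (and therefore the split between network and rest fields) can be derived from the first octet alone; this is illustrative only:

        # Classful addressing: derive the class and network-field width from
        # the leading bits of the first octet, as in the table above.
        def address_class(addr):
            first_octet = int(addr.split(".")[0])
            if first_octet < 128:        # leading bit 0
                return "A", 8
            elif first_octet < 192:      # leading bits 10
                return "B", 16
            elif first_octet < 224:      # leading bits 110
                return "C", 24
            return "D/E", None           # multicast / reserved, outside the table

        for ip in ("10.0.0.1", "172.16.0.1", "193.1.254.10"):
            cls, net_bits = address_class(ip)
            print(ip, "-> class", cls, "network field:", net_bits, "bits")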

    Classless Inter-Domain Routing defines a variable-length network field for IP addresses that doesn't depend on the address's class. This scheme allows two things:

    • To divide a network address into subnetworks, which leads to a more efficient use of the address space.
    • To group networks into supernetworks, which reduces the number of entries in the routing tables.

    This latter issue was the main motivation for the creation of CIDR. Before that, a routing table had to contain one entry per network. For instance:

    Network address Gateway
    193.1.255.0 1.2.3.4
    193.1.254.0 1.2.3.4

    Since 193.1.255.0 and 193.1.254.0 are contiguous networks, an equivalent table can be represented as:

    Network address Gateway
    193.1.254.0/23 1.2.3.4

    Classless Inter-Domain Routing also introduced a new IP address notation, known as CIDR notation, in which an address is represented as a pair {IPv4 address/bit-mask}. The bit mask is a number between 0 and 32 that represents the number of contiguous bits used as the network mask. The address 193.1.254.0/23 is equivalent to 193.1.254.0/255.255.254.0.

    Classless Inter-Domain Routing greatly helped to reduce the size of routing tables, as well as to optimize IP address use and simplify IP address allocation.
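
    The routing-table example above can be reproduced with Python's ipaddress module, which also shows that the /23 prefix length is just another spelling of the 255.255.254.0 netmask; a minimal sketch:

        # CIDR in practice: the two contiguous /24 networks collapse into a
        # single /23 route, and /23 corresponds to netmask 255.255.254.0.
        import ipaddress

        routes = [ipaddress.ip_network("193.1.254.0/24"),
                  ipaddress.ip_network("193.1.255.0/24")]

        print(list(ipaddress.collapse_addresses(routes)))
        # [IPv4Network('193.1.254.0/23')]
        print(ipaddress.ip_network("193.1.254.0/23").netmask)
        # 255.255.254.0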

    Another standard that enormously helped to mitigate IPv4 address exhaustion was RFC 1597 (“Address Allocation for Private Internets”).

    At its conception, the Internet was designed as a peer-to-peer network where every host was addressable from any other host. Hosts inside private networks that only needed to communicate over TCP/IP with other hosts within the same network were also addressable from the Internet. RFC 1597 explains:

    With the proliferation of TCP/IP technology worldwide, including outside the Internet itself, an increasing number of non-connected enterprises use this technology and its addressing capabilities for sole intra-enterprise communications, without any intention to ever directly connect to other enterprises or the Internet itself. The current practice is to assign globally unique addresses to all hosts that use TCP/IP. There is a growing concern that the finite IP address space might become exhausted.

    The standard proposed the reservation of 3 blocks, one per network class, for private addresses. Hosts using private addresses are not reachable from the Internet, but can communicate with other peers inside the same intranet.

    Class   Start Address   End Address       Total IP addresses
    A       10.0.0.0        10.255.255.255    16,777,216
    B       172.16.0.0      172.31.255.255    1,048,576
    C       192.168.0.0     192.168.255.255   65,536

    The last standard that highly reduced IP address exhaustion was RFC 1631 (“The IP Network Address Translator (NAT)”).

    NAT maps one IP address realm onto another. It enables a host using a private address, and thus non-routable on the Internet, to borrow the address of another host which does have a public address assigned. That allows the private host to be reachable from the Internet.

    The original NAT proposal came with some experimental implementations that proved it successful and made it be adopted very quickly. On the downside, NAT broke the original design of the Internet as a peer-to-peer network.
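
    As a toy illustration of the idea (not of any particular NAT implementation), a NAT44 can be thought of as a table that lends ports of a single public address to private hosts; all addresses below are examples:

        # Minimal, illustrative NAT44 sketch in the spirit of RFC 1631:
        # outbound flows from private hosts borrow the router's single public
        # address by being assigned a fresh public port; the mapping is kept
        # so replies can be translated back.
        import itertools

        PUBLIC_IP = "203.0.113.5"          # example public address
        _next_port = itertools.count(40000)
        nat_table = {}                      # (private ip, private port) -> public port

        def translate_outbound(priv_ip, priv_port):
            key = (priv_ip, priv_port)
            if key not in nat_table:
                nat_table[key] = next(_next_port)
            return PUBLIC_IP, nat_table[key]

        def translate_inbound(public_port):
            for (priv_ip, priv_port), port in nat_table.items():
                if port == public_port:
                    return priv_ip, priv_port
            raise KeyError("no mapping for port %d" % public_port)

        print(translate_outbound("192.168.0.10", 51000))  # ('203.0.113.5', 40000)
        print(translate_inbound(40000))                   # ('192.168.0.10', 51000)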

    Lastly, I think it's also worth mentioning DHCP (RFC 1541, “Dynamic Host Configuration Protocol”). DHCP has its origins in BOOTP, which in turn is an evolution of RARP (RFC 903, “A Reverse Address Resolution Protocol”, June 1984). DHCP didn't come out as a result of the ROAD WG's deliberations, but when used by ISPs it has greatly helped to optimize public address usage.

    The birth of IPv6

    In addition to the mitigation efforts discussed above, during the early 90s the IETF also started to evaluate whether to develop a new version of IP which could definitively solve the address space problem. With that goal in mind, the IETF created the IPng area (Internet Protocol Next Generation area).

    In 1994, the IPng area came up with RFC 1752 (“The Recommendation for the IP Next Generation Protocol”), which encouraged the development of a new successor to IPv4.

    The IPng area also created a new working group called ALE (Address Lifetime Expectations). The goal of this group was to determine the expected lifetime of IPv4: if the IPv4 address space was estimated to last many more years, the new version of the Internet Protocol could feature new functionality; on the other hand, if its lifetime was estimated to be short, only the address space exhaustion problem would be handled. The working group estimated IPv4 address space exhaustion would happen sometime between 2005 and 2011.

    That very same year, 1994, the Internet Engineering Steering Group approved the IPng area's IPv6 recommendation and drafted a proposed standard. In 1995, RFC 1883 (“Internet Protocol, Version 6 (IPv6)”) was published. After several iterations on the proposed standard, an updated document was published in December 1998 (RFC 2460).

    Note: version 5 of the IP protocol was used during the development of an experimental protocol called the Internet Stream Protocol, back in 1979. There was a second revision of this protocol in 1995 (RFC 1819, “Internet Stream Protocol Version 2 (ST2)”), but it never gained traction. To avoid potential confusion, the IETF preferred to skip that number and picked version 6 for the successor of IPv4.

    In the years that followed, many hardware vendors and operating system developers began to implement support for IPv6 in their products. A first alpha version of IPv6 appeared in the Linux kernel as early as 1996, although it remained in experimental status until 2005. Also in 1996, a testbed backbone for IPv6, called 6bone, was deployed. The original mission of this backbone was to establish a network to foster the development, testing and deployment of IPv6; it ceased operations in 2006. Another important milestone occurred in 2008, when the IANA added AAAA records for the IPv6 addresses of 6 root name servers. That made it possible to resolve domain names using only IPv6.

    The decade after the publication of RFC 2460 served as a period of development, testing and refinement of IPv6, as well as adaptation of existing products. Originally, the IETF estimated that massive adoption of the new protocol would happen around 2005, but that never materialized. Part of the reason was that many of the new features of IPv6, for instance IPsec, were back-ported one way or another to IPv4, so the actual need to make the transition was less urgent, and NAT was working pretty well. The only feature which could not be back-ported was the bigger address space…

    The day the world ran out of IPv4 addresses

    On 31st January 2011, the IANA allocated its two remaining top-level address blocks to APNIC. APNIC ran out of IPv4 public addresses some months later. RIPE followed the next year, then LACNIC in 2014 and ARIN in 2015. Today, the only RIR with IPv4 public addresses still available is AFRINIC, but that won't last long, only until 2018.

    The event didn't catch anyone off guard; in fact, the dates were in line with the estimations of the ALE working group. Perhaps witnessing the actual IPv4 address depletion in 2011 served as a wake-up call to accelerate IPv6 adoption worldwide. Since 2010, adoption has been constantly increasing, in some cases doubling every year. Today, IPv6 traffic represents 18% of worldwide Internet traffic according to Google. But that's the world's average; the truth is that IPv6 adoption is uneven across countries. Belgium ranks first, with 47% of its traffic over IPv6, while there are many other countries, such as Italy or Spain, where the IPv6 roll-out hasn't even started yet.

    And that's all for now. In the next article I will cover IPv6 adoption and the strategies ISPs are implementing to complete the transition from an already exhausted IPv4 address space to IPv6.

    May 25, 2017 06:30 AM