Netdevconf 0x16

After a long stretch of online-only events, Netdev 0x16, a conference about the technical aspects of Linux networking, was organized as a hybrid event: online and on-site in Lisbon.

Day 1

As at other conferences, Netdev 0x16 is about exchanging ideas with other developers. Even before the actual conference program with its discussion rounds and breakout sessions starts, discussions begin at the coffee tables in the foyer. Not surprisingly, the availability of network hardware without open documentation is a considerable challenge for developers and end users alike. The often non-public documentation of hardware components poses a challenge to projects like DENT or Open vSwitch, which often struggle to provide hardware support beyond the reference platforms.

Often, the hardware that is actually deployed in the field runs Linux, but with very old kernel versions that cannot be updated due to firmware blobs or non-standardized vendor extensions.

The developers at Netdev agreed that a selection of well-documented switch ICs in various performance classes and from different manufacturers, together with good support for this hardware in mainline Linux, would be ideal. This would improve the situation for manufacturers of whitebox hardware (i.e. generic hardware modules) and of end devices, as well as for end users in the field, who could then benefit from up-to-date and well-maintained software.

Introduction to time synchronization

In the first session of the day, Maciek Machnikowski explained the precise time synchronization of multiple systems over a network connection using PTP. The "Precision Time Protocol", or PTP for short, is standardized in IEEE 1588 and allows configuration-free synchronization of several systems over a network with nanosecond accuracy.

After an introduction to the motivation, Maciek briefly introduced the different components in a Linux system that are responsible for the precise acquisition of time (the so-called PTP Hardware Clock, or PHC for short), as well as the different paths over which the kernel transports the information needed by the PTP daemon to user space.

Maciek dedicated a larger part of his talk to practical examples and tools from the linuxptp suite:

  • ptp4l, the main daemon that handles the synchronization of the leader and follower clocks (for the old-timers among us: grandmaster and slave) and is responsible for running the Best Master Clock Algorithm (BMCA).
  • ts2phc, which synchronizes the PTP hardware clock to time events, e.g. Pulse Per Second (PPS) events from a precise clock such as a GNSS receiver. ts2phc also supports parsing the NMEA strings sent by GNSS receivers over UART to obtain the Time of Day.
  • phc2sys to synchronize the system time to the PTP hardware clock.
  • pmc to read debugging information from the various components.
  • timemaster to synchronize PTP and NTP.
  • phc_ctl to debug the PTP hardware clock.
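
All of these tools operate on the PHC, which the kernel exposes to user space as a character device. As a minimal sketch (my own illustration, not from the talk, assuming the clock is available as /dev/ptp0), reading such a clock directly only takes a few lines of C:

    /* Minimal sketch: read the PTP Hardware Clock (PHC) from user space.
     * Assumes the PHC is exposed as /dev/ptp0; tools like phc_ctl use the
     * same "dynamic POSIX clock" mechanism internally. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    /* Convert an open /dev/ptpN file descriptor into a dynamic clock id. */
    #define FD_TO_CLOCKID(fd) ((clockid_t)((((unsigned int)~(fd)) << 3) | 3))

    int main(void)
    {
        int fd = open("/dev/ptp0", O_RDONLY);
        if (fd < 0) {
            perror("open /dev/ptp0");
            return 1;
        }

        struct timespec ts;
        if (clock_gettime(FD_TO_CLOCKID(fd), &ts)) {
            perror("clock_gettime");
            close(fd);
            return 1;
        }

        printf("PHC time: %lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
        close(fd);
        return 0;
    }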

After this very practical part, Maciek briefly discussed the differences between the various PTP profiles supported in ptp4l. He explained that all practically relevant profiles are supported in ptp4l, but that a few features are subject to certain restrictions.

At the end of his talk, Maciek used a simple setup as an example to briefly explain the limitations and common difficulties of setting up PTP and gave a short outlook on the upcoming linuxptp 4.0 release.

After the lunch break, the topics of the talks on the agenda were only of limited relevance for embedded applications, but they nevertheless gave a good overview of the variety of requirements placed on the network subsystem of the Linux kernel.

P4TC - Your Network Datapath Will Be P4 Scripted

In this workshop, Jamal Hadi Salim explained the basics of P4 and its integration into the Linux kernel as P4TC. P4 stands for "Programming Protocol-independent Packet Processors", a programming language for the packet forwarding plane in network devices, i.e. a language that describes how network packets are forwarded in hardware directly to their destination, without generating CPU load.

While P4 has been around for some time and has become an industry standard for describing the packet forwarding plane, the kernel so far has no direct support for it. With P4TC, the kernel is now to be extended with the ability, analogous to BPF, to load external P4 programs.

These are compiled together with a description of the hardware capabilities, such that as many tasks as possible are performed in hardware, typically in SmartNICs, neural accelerators or mesh processors, while the remaining tasks are performed transparently in software.

Another motivation for the initiators of the project at Intel was to use P4TC for implementing traffic control features without having to compile kernel modules, since that is not possible in the data center environment for various reasons.

Furthermore, this would avoid the time-consuming mainlining of traffic control features, though this aspect met with a divided response in the subsequent discussion.

There was also a need for further discussion from the ranks of the FRR developers: for an efficient implementation, hardware offloading should be used for frequently used routes, while the limited hardware resources often leave no room for the less frequently used ones. In order to make the best use of the available hardware offloading resources, a communication channel between the routing daemon in user space and P4TC will be required, probably by extending the offload flag in switchdev.

RDMA programming tutorial

In the last talk of the day, Roland Dreier used examples to explain the basics of Remote DMA and how to program applications against this framework.

RDMA is a feature that originates primarily from the Big Data and data center world. It allows the transfer of memory contents over the network without the CPUs of the target device having to explicitly execute code. RAM is therefore accessed remotely via the network, similar to a local DMA access.

After an introduction to the different forms of asynchronous queues, Roland gave an overview of their integration into the various subsystems of the kernel and of the abstractions used.

Since different implementations exist at layer 2 and layer 3, the use of RDMA is no longer limited to InfiniBand, probably its most popular implementation. With rxe, a Soft-RoCE implementation is available in the kernel that can be used over ordinary Ethernet.

After a short overview of the userspace libraries librdmacm for connection setup and libibverbs for the actual requests and datapath operations, Roland demonstrated with some code snippets how an application connects to and accesses the memory of a local process or of a remote device via RDMA.
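
To give an impression of what the first steps with libibverbs look like, here is a minimal sketch (my own, not from the talk) that opens an RDMA device and registers a memory region a remote peer could later access; connection setup via librdmacm and the actual work requests are omitted:

    /* Sketch: open an RDMA device with libibverbs and register a memory
     * region for remote access. Not a complete application: connection
     * setup (librdmacm), queue pairs and work requests are omitted. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no RDMA devices found (try the rxe Soft-RoCE driver)\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Buffer that will be exposed for remote reads and writes. */
        size_t len = 4096;
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            perror("ibv_reg_mr");
            return 1;
        }

        /* lkey/rkey are what local and remote work requests refer to. */
        printf("registered MR: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }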

For details, he recommended studying the examples from the rdma-core repository <https://github.com/linux-rdma/rdma-core>.

Day 2

The second day of NetDev 0x16 started with a workshop on XDP. The eXpress Data Path makes it possible to use eBPF (extended Berkeley Packet Filter) to decide at a very early stage after the receive interrupt whether network packets should be dropped or forwarded.

In addition, packets can bypass a large part of the network stack and be delivered directly into user space. This increases performance for special applications with high throughput.
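
For readers who have not worked with XDP yet, a minimal program is quite compact. The following sketch (my own illustration, assuming it is compiled with clang for the BPF target and attached with a loader such as ip or bpftool) drops all incoming IPv4/UDP packets and passes everything else on to the regular stack:

    /* Minimal XDP sketch: drop all IPv4/UDP packets very early in the
     * receive path, pass everything else to the regular network stack. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Every pointer access must be bounds-checked for the verifier. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return XDP_PASS;

        return iph->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";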

The workshop presented the latest changes and developments in this area, such as dynamic pointers by Joanne Koong from Meta, who implemented a way to interact with data whose size is not yet known at compile time of a BPF program.

Other topics, like XDP hints for adding metadata to XDP buffers, or the work of Zhan Xue (Intel), who uses XDP_REDIRECT to feed packets directly into hardware accelerators (e.g. for cryptography), were also discussed.

From an embedded developer's perspective, the latter approach is very exciting. Applications such as feeding video streams over the network directly into hardware video codecs with only little CPU intervention sound very promising.

It is not yet clear whether all the required infrastructure is already available for such applications, but it is nevertheless quite exciting to think about possible future system designs based on this approach.

Once again, the strength of doing development in a vibrant community is very obvious: experts from different backgrounds come together and collaborate on the design and implementation of the subsystem, covering lots of different requirements thanks to their widespread expertise.

The coffee breaks at the conference are a valuable addition to the otherwise usual discussions on the mailing lists and allow a more direct and rapid exchange of opinions, experience and ideas.

Today, for example, I had the rare opportunity to talk to some developers who, like me, work on TSN (Time Sensitive Networking), about the latest developments on FRER (Frame Replication and Elimination for Reliability) and about implementation variants of recent RFCs (Requests for Comments) from the DetNet (Deterministic Networking) working group at the IETF on redundant data transmission in real-time networks.

FRR Workshop

In the second session of the day, the developers of FRRouting showed the current developments and improvements to the free routing protocol suite. Mobashshera Rasool from VMware presented the design decisions and the support for MLD (Multicast Listener Discovery) and PIM (Protocol Independent Multicast), which are now available for IPv4 and IPv6 via the pimd daemon in FRR. The technical substructure required for this has been available since kernel version 4.19.

Thanks to the work of Dr. Olivier Dugeon at Orange Innovation Networks, FRR now also supports segment routing. This allows a list of segment identifiers to be attached to a network packet that determines the packet's path through the network. In contrast to regular routing algorithms, this allows multiple loop-free paths to be spanned through a network, for example to send redundant packets over multiple routes, a common approach to address link failures in real-time applications. Segment routing is already implemented in various commercial routers, and their implementations are fully compatible with the implementation in FRR.

There are interesting use cases of this for embedded systems as well. Combined with DetNet, it can for example be used to explicitly create redundant real-time links in a network. I am pretty sure it won't be long until segment routing is spotted in the wild in these kinds of applications.

Finally, Donald Sharp gave an overview of the current developments and a preview of the features in the next FRR release. In addition to many incremental improvements, bug fixes and continuously increasing code quality, he particularly mentioned the efforts of Donatas Abraitis, who keeps merging the ongoing changes in the BGP (Border Gateway Protocol) ecosystem into FRR.

Network view of embedded specific challenges for non-embedded network developers

After lunch, Oleksij Rempel from Pengutronix gave two talks. In "Network view of embedded specific challenges for non-embedded network developers", Oleksij gave an overview, with various practical examples, of the everyday work of a kernel hacker with a focus on embedded hardware.

Oleksij first introduced the differences between embedded and non-embedded applications. Although the boundaries are becoming increasingly blurred, embedded applications come with quite a few specific restrictions such as power constraints or maximum connection initialization times. Analogous to the softening of the boundary between embedded and non-embedded devices, field buses and on-board buses, such as CAN, are also increasingly being replaced by Ethernet.

Nevertheless, it turns out that embedded use cases often dictate very different requirements than conventional IT use cases. As an example, Oleksij mentioned the fq_codel queuing discipline, which works well for TCP but turns out to be very problematic when used for CAN buses. Since Linux is increasingly being used in embedded applications, developers need to be sensitized to the special requirements of this environment in addition to those of standard IT networks.

As another example, Oleksij introduced the various field buses in the automotive environment, of which only CAN is supported in mainline, and which are increasingly being replaced by Ethernet. Due to the different requirements, however, deviating substandards such as BASE-T1 over a single twisted pair are used instead of conventional BASE-TX. These requirements are mostly related to cost, performance and energy consumption. Some applications also have special timing requirements: autonegotiation, for example, cannot be used in automotive applications because the process takes too long; a link must be up within a few milliseconds. Existing configuration interfaces must be expanded for this.

Oleksij mentioned explosion-protected areas as another example. In these, the amount of energy in a system is limited to a specific amount, which also poses special challenges for network connections. With 10BASE-T1L, the amplitude on the link can be limited to 1.0 Vpp, which allows use in potentially explosive environments. Support for this standard has recently been added to the kernel, but required changes, for example to prevent 2.4 Vpp operation from being announced during autonegotiation when establishing connections with remote stations.

Another example is a special implementation variant for link redundancy, where several PHYs are connected to one MAC. There is no suitable abstraction for this in the kernel, which is why there is an ongoing discussion about it on the mailing list. Another variant of this application is connecting several PHYs to one MAC in order to support different layer-1 standards.

Similar challenges arise from the communication chains of MDIO links for SFP cages.

Using these examples, Oleksij showed which effects these special requirements have on the kernel. Some of them can be met by extending existing interfaces; sometimes, however, it is necessary to create completely new interfaces in the kernel.

"We've got realtime networking at home" - Why many systems are moving to TSN so slowly

At the end of this day, I had the pleasure to give my own talk. In this talk I summarized the basic requirements for real-time networks: time synchronization, maximum (bounded) transmission delays and quality of service, i.e. the guaranteed transmission of prioritized content.

I then explained how conventional Ethernet implementations have tried to fulfill these requirements and compared them to TSN (Time Sensitive Networking). I then discussed which strategies can be used to systematically migrate brownfield systems until a system is fully TSN-capable.

At the end of my talk, I encouraged the developers to add more documentation, examples and tests. This would help system engineers and end users and enable them to adopt these new technologies.

After the talks had ended, I took the chance at the social event to exchange ideas with maintainers and developers and to get to know the people behind the names from the mailing list postings.

Day 3

The third day of Netdev 0x16 was dedicated to data center networking. Although the requirements for this application differ significantly from most embedded use cases, the talks on these topics nevertheless offered an exciting look beyond one's own horizon.

It's Time to Replace TCP in the Datacenter

In his keynote, John Ousterhout argued that even though TCP is very widely used, it is actually rather unsuitable for data center applications.

He showed in detail that the basic properties of the protocol are diametrically opposed to the fundamental requirements of this rather special workload. He postulates that this makes it impossible to enhance TCP so that it provides the performance required for data center applications.

From the results of his research, his working group developed the HOMA protocol. He discussed the advantages of HOMA over TCP in detail and presented some quite astonishing measurement results. Even though HOMA does not use hardware offloading yet, it already shows significant improvements over TCP for typical data center workloads, in some cases more than an order of magnitude. He concludes, however, that to reach line rate, i.e. the theoretically available bandwidth, hardware offloading in the network cards will also be necessary for HOMA.

HomaLS: Tunneling messages through secure segments

In the follow-up talk, Tianyi Gao showed how HOMA segments can be encrypted analogously to DCTCP segments. He presented a prototype of his implementation of hardware offloading on Mellanox CX16 hardware together with first measurements. His prototype already achieves a 20% improvement compared to the usual DCTCP implementations.

dcPIM: Low–latency, High–throughput, Receiver–driven Transport Protocol

In the subsequent talk, the authors gave a first outlook on the challenges of link speeds beyond 200 Gbit/s, so-called terabit links. Based on measurements, they made clear that the main source of latency is the residence time in switches. With terabit Ethernet, small network packets only need a few microseconds to be sent, which is why the switch queues have a significant impact on data throughput. Because the architecture of data centers and the data flows in them are quite similar to the coarse architecture of a switch, Qizhe Cai and Rachit Agarwal suggest solving this issue with a method already used in Ethernet switches: they extend Parallel Iterative Matching to the data center, which they call datacenter Parallel Iterative Matching (dcPIM). dcPIM is an iterative algorithm that calculates the optimal assignment of data streams to links and resources.

Since the mathematical model behind PIM is well researched, the algorithm converges quickly, and no changes to the protocols or APIs used are required, the authors assume that this approach can be deployed easily.

After so much theory, with many use cases rather far from my usual embedded daily work, some interesting contacts and discussions arose over lunch, among others with developers from the WiFi stack, which ended in a joint review of five-year-old proof-of-concept code that applies surprisingly well to the multi-link extensions in the current iterations of the WiFi standards.

bring network and time together using Linux tracing

After the lunch break, Alexander Aring showed in his talk how to record time-synchronized traces from different physical or virtual machines of a compute cluster with trace-cmd. The result can be converted to the slog2sdk format and analyzed using Jumpshot, to simplify, for example, the analysis of the DLM locking protocol in a Gantt-chart-like representation.

In–Kernel Fast Path Performance For Containers Running Telecom Workload

Nishanth Shyamkumar, Piotr Raczynski, Dave Cremins, Michal Kubiak and Ashok Sunder Rajan demonstrated how a conventional setup of a 4G base station, running proprietary software on proprietary hardware with proprietary operating systems (the default in the telecommunications industry for many years), can be migrated to commodity hardware, using Docker to run the software for multiple base stations on a single hardware instance. The main challenge was the efficient connection of the individual containerized subsystems to the network hardware of the host system, which could be solved well with SR-IOV and AF_XDP zero copy.
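
AF_XDP is the socket family that makes the XDP fast path usable from user space. As a rough sketch (my own, not from the talk, with arbitrarily chosen sizes), the first steps of setting up such a socket look like this; configuring the rings and binding to a specific interface queue would follow:

    /* Sketch of the first steps of an AF_XDP socket: create the socket
     * and register a UMEM, the packet buffer area shared between kernel
     * and userspace. Ring setup, mmap and bind are omitted here. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <linux/if_xdp.h>

    #ifndef AF_XDP
    #define AF_XDP 44
    #endif
    #ifndef SOL_XDP
    #define SOL_XDP 283
    #endif

    #define NUM_FRAMES 4096
    #define FRAME_SIZE 2048

    int main(void)
    {
        int fd = socket(AF_XDP, SOCK_RAW, 0);
        if (fd < 0) {
            perror("socket(AF_XDP)");
            return 1;
        }

        /* Page-aligned buffer shared between kernel and userspace. */
        void *umem_area;
        if (posix_memalign(&umem_area, 4096, (size_t)NUM_FRAMES * FRAME_SIZE))
            return 1;

        struct xdp_umem_reg umem = {
            .addr = (uintptr_t)umem_area,
            .len = (uint64_t)NUM_FRAMES * FRAME_SIZE,
            .chunk_size = FRAME_SIZE,
            .headroom = 0,
        };
        if (setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &umem, sizeof(umem)) < 0) {
            perror("setsockopt(XDP_UMEM_REG)");
            return 1;
        }

        printf("AF_XDP socket created, UMEM with %d frames registered\n", NUM_FRAMES);
        /* Next (not shown): set up fill/completion and RX/TX rings,
         * mmap them, and bind the socket to an interface queue. */
        return 0;
    }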

This makes it possible to operate the infrastructure on significantly cheaper hardware and to save time and cost when rolling out and operating mobile communications infrastructure.

The result of their work also applies to embedded use cases, where containers are seen more frequently nowadays. This makes the 20-fold increase in performance they achieved compared to the initial trivial implementation quite exciting.

Linux kernel networking acceleration using P4–OVS on IPU

Sandeep Nagapattinam, Nupur Uttarwar, Venkata Suresh Kumar and Namrata Limaye showed in their talk how P4 can be used to describe the offloading of acceleration features, which are then implemented in an Infrastructure Processing Unit (IPU). They demonstrated how this technique can achieve significant performance gains for layer-2 forwarding, routing and VXLAN.

The throughput they are targeting will probably never be required for embedded use cases, but it is nevertheless exciting to see how companies successfully work around the saturation of silicon performance caused by the physical limits of Moore's law.

High Performance Programmable Parsers

The last talk of the day, by Tom A. Herbert, showed an interesting solution to a problem that can also often be observed in embedded projects: quite often, network packets have to be parsed, i.e. the individual fields in a data frame need to be analyzed, in order to extract all the necessary information from a packet.

The necessary software, a so-called parser, is usually implemented manually, which is a laborious, lengthy and error-prone process.

In his talk, Tom showed how CPL, the Common Parser Language, can be used to describe a protocol declaratively in a JSON file. From this description, a parser can be generated that uses the kparser infrastructure in the kernel to parse network packets. In combination with XDP and eBPF, an incoming packet can be examined easily and efficiently. Depending on the results, it can then either be dropped, forwarded to the network stack, or forwarded directly to a userspace recipient via the eXpress Data Path.

Finally, Tom showed how hardware offloading via P4 can be integrated into this infrastructure. There are, however, still some issues to be solved. I am curious to see whether we will soon be able to use this technology in one of our customers' projects and how much development time it will save them.

Day 4

The fourth day of NetDev 0x16 featured talks addressing many different networking-related topics, giving a good overview of the variety of subjects one can encounter when dealing with networking.

When regular expressions meet XDP

In the first talk of the day, Ivan Koveshnikov explained how he implemented DDoS protection for game traffic with regular expressions in XDP at GCore. Since this application requires parsing many small UDP packets with a very regular structure, Ivan chose to use regular expressions.

For regex matching he uses the BSD-licensed engine Hyperscan, which is also often used for deep packet inspection and is already optimized for high-performance scanning of network packets. For an efficient implementation of regex matching in eBPF programs, he needed to patch the kernel, both because of the size of the regexes and because of eBPF's limited access to vector instructions (XDP runs in a SoftIRQ context where the FPU instructions are not available). Ivan explained that they therefore save and restore the FPU state in SoftIRQs, which however requires turning off interrupts and preemption while FPU load/store instructions are used. As a result, the vector instructions become available and the regex parser can run very efficiently in an eBPF helper.

For configuration, one can either use eBPF maps, which already provide synchronization but enforce a fixed entry size and are application-specific, or entries in configfs.

Because of the greater flexibility of configfs, Ivan decided on this implementation variant.

He also presented some benchmarks for XDP_DROP (i.e. for dropped packets) and XDP_TX (i.e. for forwarded packets).

For packets larger than a few hundred bytes of payload, their implementation can drop invalid packets "at line rate", i.e. as fast as the connected network allows, but forwarding valid packets is somewhat more complex and therefore less efficient.

Ivan still sees room for improvement here, although the new system is already being successfully used in production.

At the end of his talk, Ivan called on network card manufacturers to improve the interaction of XDP with hardware offloading, since drivers often only support DPDK (Data Plane Development Kit) and provide no or only insufficient support for XDP. He sees XDP hints in particular as a possible way to improve this situation.

Ivan has uploaded his code to GitHub and suggested that listeners take their own measurements.

Towards µs Tail Latency and Terabit Ethernet: Disaggregating the Host Network Stack

In the subsequent talk, Qizhe Cai, Midhul Vuppalapati, Jaehyun Hwang, Christos Kozyrakis and Rachit Agarwal presented their research results. After upgrading their network lab from 10 Gbit/s to aggregated 100 Gbit/s links, they discovered that the basic structure of the Linux network stack imposes severe limitations on the performance of fast links.

The underlying pipeline-based architecture is difficult to optimize, since supporting long-lived data flows as well as short data bursts would require changes in many different places, which also have cross-dependencies.

They therefore suggest changing the underlying basic architecture to something similar to the Linux storage subsystem, which has loosely coupled layers.

Their first prototype, which they published on Github, already achieves very impressive increases in performance compared to the current mainline implementation.

The Anatomy of Networking in High Frequency Trading

After this unusual proposal, PJ Waskiewicz presented the specific requirements for network infrastructures in high-frequency trading. Even though high-frequency trading is very secretive and hides behind a veil of trade secrets, PJ was able to give his listeners the broad outlines of this very unusual world.

While the connection to the stock exchanges still uses conventional 10 Gbit/s Ethernet links, the decisions about purchases and sales are made in high-performance computing clusters, in which the corresponding inference models run partly in software and partly in hardware.

In both cases, high performance, low latency, and especially deterministic and predictable latency of network communication are absolutely essential to avoid waiting times, which would ultimately lead to poorer results in the financial transactions.

On the OS side, PJ showed that optimizations quite similar to what we often build into embedded applications, for example the use of PREEMPT_RT, CPU isolation, interrupt pinning, or choosing suitable CPU power states, improve not only the average case and the median, but also the worst-case performance significantly.

I was surprised to learn that for this special purpose the loss of determinism through the SoftIRQs in the receive path is simply circumvented by permanent polling.

For the high-performance computing clusters, RDMA still prevails as the communication medium of choice, often via InfiniBand. PJ hopes for further performance improvements from io_uring and XDP in particular, since it is becoming increasingly difficult to find developers and administrators experienced with the established RDMA technologies. Besides the need to train younger engineers and administrators on them, InfiniBand is extremely expensive because it is an absolute niche product. PJ therefore raised the question of to what extent RoCE or iWARP can meet their requirements, especially in converged networks (i.e. in mixed operation with conventional Ethernet traffic).

New protocols like HOMA are being watched with great interest in the industry; however, investigations into whether they meet the low-latency and low-jitter requirements are still ongoing.

Merging the Networking Worlds

After this interesting and special talk, David Ahern and Shrijeet Mukherjee gave a talk about how to combine Berkeley sockets, in particular for access to the control plane, with RDMA or verbs-based communication in the data plane. They argue that combining the advantages of both worlds for high-performance data transmission could benefit the network stack significantly.

After the various talks demanding the replacement of the "old world" by better upcoming approaches, this approach of merging well-functioning, well-known and established building blocks of the infrastructure looks very promising, at least from the perspective of a veteran network programmer. Evolution instead of revolution.

The TSN building blocks in Linux

After this rather special topic, Ferenc Fejes from Ericsson Research Traffic Lab gave a brief introduction to the various standards for Time Sensitive Networking and a detailed overview of the current status of the individual components' implementation in Linux.

He not only showed the kernel components used, but also provided the associated userspace interfaces for configuration as well as options for hardware offloading.
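
One example of such a userspace interface is the SO_TXTIME socket option, which, together with the etf qdisc, lets an application specify when a packet should leave the NIC. A minimal sketch (my own, not from Ferenc's slides, assuming an already-connected UDP socket fd and an absolute transmit time txtime_ns on the CLOCK_TAI timescale):

    /* Sketch: request a per-packet transmit time via SO_TXTIME.
     * Assumes fd is a connected UDP socket, txtime_ns is an absolute
     * CLOCK_TAI timestamp in nanoseconds, and an etf qdisc is installed
     * on the egress queue. */
    #include <string.h>
    #include <stdint.h>
    #include <time.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <linux/net_tstamp.h>

    #ifndef SO_TXTIME
    #define SO_TXTIME 61
    #define SCM_TXTIME SO_TXTIME
    #endif

    static ssize_t send_at(int fd, const void *payload, size_t len, uint64_t txtime_ns)
    {
        /* Enable SO_TXTIME, clocked by CLOCK_TAI (could also be done once
         * at socket setup time). */
        struct sock_txtime cfg = { .clockid = CLOCK_TAI, .flags = 0 };
        if (setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg)) < 0)
            return -1;

        struct iovec iov = { .iov_base = (void *)payload, .iov_len = len };
        char control[CMSG_SPACE(sizeof(txtime_ns))] = { 0 };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = control, .msg_controllen = sizeof(control),
        };

        /* Attach the desired transmit time as ancillary data. */
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_TXTIME;
        cmsg->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
        memcpy(CMSG_DATA(cmsg), &txtime_ns, sizeof(txtime_ns));

        return sendmsg(fd, &msg, 0);
    }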

In addition to his own measurements for some of the components, he showed examples for configuration settings and discussed the remaining open issues, for example the missing network configuration daemons and services.

In his opinion, self-tests and verification tools need to be implemented and provided as open source, in order to support users during development, commissioning, and verification of setups.

Finally, Ferenc underlined once again that TSN-capable switch hardware is already supported in the switchdev and DSA frameworks today, which makes it a first-class citizen. He also added that the hardware offloading features already work well with the existing infrastructure, even though not every device has all of its hardware offloading implemented yet.

Towards a layer–3 data–center fabric with accelerated Linux E–VPN on the DPU

In the following talk, Roopa Prabhu and Rohith Basavaraj demonstrated how they implemented hardware offloading for EVPN in the Data Processing Unit (DPU), a hardware accelerator from NVIDIA, and achieved quite astonishing performance improvements.

State of the union in TCP land

For the last talk of the day, Eric Dumazet summarized the developments of the last year in the Linux TCP stack. Apart from some security fixes, most of the work concerned performance improvements.

Kernel Selftests and Switch Testing

After the official event ended, some developers met on site, joined by some colleagues from Hildesheim via video call, to discuss strategies for improving the in-kernel selftests for switches.

Also, a prototype from this year's Pengutronix techweek for exporting network interfaces to labgrid, which could help with automated switch hardware testing, was presented and well received. We hope to see this building block implemented as a labgrid exporter soon; this would also enable other developers in the community to improve their testing infrastructure and would allow regressions in network drivers to be found much faster.

Day 5

After the usual conversations over the first coffee of the day in the corridors of the Centro de Congressos do Instituto Superior Técnico, the last day of NetDev 0x16 started with an insight into the current developments at OpenVPN.

Pushing OpenVPN down the stack: Data Channel Offload (DCO)

In this talk, Antonio Quartulli explained the current optimization efforts at OpenVPN. So far, OpenVPN implements both the control plane and the data plane for VPN connections in user space. While this offers some advantages for implementation variants and a much less restrictive environment compared to an in-kernel data plane, it comes with a significant loss of performance.

With the userspace implementation, the network traffic from an application (e.g. a web browser) is sent from userspace to a tun device, i.e. a layer-3 tunnel device, into the kernel, which then loops the traffic back to the OpenVPN application in userspace.

This userspace application then encrypts the data and sends it back to the kernel, which eventually sends the data to its target via the network stack and onto the network. Each data packet must therefore cross the kernel-userspace boundary twice, resulting in significant performance penalties compared to in-kernel implementations (like WireGuard).
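
For illustration, this is roughly what the classic userspace path looks like from the VPN daemon's point of view; a minimal sketch (my own, assuming /dev/net/tun and sufficient privileges) that creates a tun device and receives packets from the kernel:

    /* Sketch: how a userspace VPN daemon classically receives packets
     * from a layer-3 tun device. Error handling shortened. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/if_tun.h>

    int main(void)
    {
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) {
            perror("open /dev/net/tun");
            return 1;
        }

        struct ifreq ifr = { 0 };
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI;   /* layer-3 tunnel, no extra header */
        strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
            perror("TUNSETIFF");
            return 1;
        }

        /* Every packet routed to tun0 is handed to userspace here; after
         * encryption it would be sent out again via a UDP or TCP socket,
         * crossing the kernel/userspace boundary a second time. */
        unsigned char pkt[2048];
        ssize_t n = read(fd, pkt, sizeof(pkt));
        if (n > 0)
            printf("got %zd bytes from the kernel via tun0\n", n);

        close(fd);
        return 0;
    }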

Combined with the fact that the OpenVPN application is single-threaded and therefore cannot benefit from the multitude of processor cores available on modern CPUs, this leads to OpenVPN's performance being well behind the bandwidth available in modern networks. Since the OpenVPN code base is getting a bit long in the tooth anyway, Antonio did not optimize the userspace code but rather examined how this problem can be solved by offloading the data plane directly into the kernel.

Since proprietary extensions of the control plane, i.e. the bring-up, teardown, and management of connections, are common practice ("BYOV" - bring your own VPN), the control plane is still kept in userspace, while encryption is implemented using the kernel crypto API and the encapsulation of the network data is implemented in the DCO (Data Channel Offload). DCO is a virtual device driver that is configured via a netlink API and exchanges data with userspace via standard APIs.

In contrast to the previous implementation, which required keeping an entire routing table in the userspace client, this also makes it possible to use the kernel routing table for the peer-to-multipeer mode (server mode), which is then just extended by some more specific routes.

We will see whether this implementation can keep up with the performance of WireGuard, but it will certainly make up some ground.

The solution presented by Antonio is already fully functional and can be downloaded from GitHub. However, using it with OpenVPN 2 requires the current master branch or the soon-to-be-released version 2.6 of OpenVPN.

NVMeTCP Offload – Implementation and Performance Gains

In their talk, Shai Malin and Aurelien Aptel presented optimizations to NVMe-TCP, a protocol for attaching storage via TCP. Each storage queue is connected via a TCP socket, over which read and write operations, identified by a Command Identifier (CID), are tunnelled to a remote system.

In order to achieve the highest possible performance, servers are allowed to reorder requests and to prioritize small requests over longer accesses. Data integrity is guaranteed by a checksum (CRC).

After a very brief introduction to this unusual subject, Shai and Aurelien explained that in the current implementation, despite various optimizations, the biggest performance penalty is the copying of received data. CRC validation takes the second-largest amount of computing time and is therefore an attractive optimization target as well.

Since the NVMe-TCP PDUs can also be distributed over several TCP segments (or multiple NVMe-TCP PDUs can be contained in a single TCP segment), a trivial zero-copy approach cannot work for this application.

Instead, they propose the DDP (Direct Data Placement) infrastructure. The basic idea of this approach is to use dedicated, pre-allocated buffers for protocols based on request-response pairs, such as NVMe-TCP.

Since the NIC places the received data into these dedicated buffers via DMA, it can be determined immediately and easily whether hardware offloading is possible.

If this approach is combined with, for example, CRC offloading, up to 58% more bandwidth can be achieved for NVMe-TCP, while at the same time reducing the required CPU load by 20%.

Shai and Aurelien recommended that the audience take careful measurements with perf and flame graphs, so that the most expensive operations in the complete chain are optimized first.

For example, in their case they found that the context switches triggered by the interrupts for each UMR event per PDU led to a significant drop in performance. By treating the offloading as opportunistic and not waiting for completion, they were able to improve performance significantly.

They also explained that coalescing packets for offloading, i.e. executing offloading operations for several packets at once, leads to significant performance gains compared to offloading individual packets. If the offloading for the respective individual operations (here CRC calculation and DDP) is carried out separately, packets can be pooled more often, which increases performance again.

To achieve the best results when optimizing offloading in real-world applications, they recommend carefully examining real traffic with tcpdump and Wireshark for the statistical distribution of packet sizes.

To TLS or not to TLS - That is not the question

Nabil Bitar, Jamal Hadi Salim and Pedro Tammela presented their results on analyzing the performance of different variants of TLS on x86.

They compared measurement results for uTLS (i.e. TLS in userspace) with different variants of kTLS (TLS in the kernel, with and without platform optimization), each with and without hardware offloading.

While, unsurprisingly, kTLS with offloading offers a significant advantage for larger data streams, for short data streams the overhead required for offloading is greater than the achievable improvements. They conclude that uTLS therefore yields better performance for short streams.

The authors presented extensive measurements and demonstrated how they analyzed the performance overhead on short streams with ftrace and tracked it down to the point where the offloading socket options are set.
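
For context, kTLS is enabled on a regular TCP socket via exactly such socket options; a minimal sketch (my own, assuming a finished TLS 1.2 handshake from which key, IV, salt and record sequence number are taken):

    /* Sketch: hand the TLS record encryption for the TX path over to the
     * kernel (kTLS). The key material must come from a TLS handshake
     * performed in userspace, e.g. with OpenSSL. */
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/tls.h>

    #ifndef TCP_ULP
    #define TCP_ULP 31
    #endif
    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif

    static int enable_ktls_tx(int sock,
                              const unsigned char *key,     /* 16 bytes */
                              const unsigned char *iv,      /* 8 bytes  */
                              const unsigned char *salt,    /* 4 bytes  */
                              const unsigned char *rec_seq) /* 8 bytes  */
    {
        /* Attach the "tls" upper layer protocol to the TCP socket. */
        if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        struct tls12_crypto_info_aes_gcm_128 ci = { 0 };
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        /* From now on, data written with plain send() is encrypted into
         * TLS records by the kernel (or by the NIC, if offload is available). */
        return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }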

While the direct results of this talk are probably less relevant for embedded development, the approach reminds me very much of my mentor's mantra: always check your assumptions.

Cross–Layer Telemetry Support in Linux Kernel

In their talk, Justin Iurman and Benoit Donnet demonstrated that the ubiquitous migration from monolithic services to distributed micro-service architectures poses new challenges, especially for debugging and performance analysis.

Application performance management (APM) and distributed tracing (e.g. OpenTelemetry) tackle this issue by rolling out tracing infrastructure over the entire cluster. This allows fine-grained tracing of requests and the corresponding function calls throughout all instances, layers and micro-services involved in handling a request.

A particular challenge here is the different orders of magnitude of the time spans involved, since, for example, an HTTP request has a relatively long round-trip time compared to a single function call. Furthermore, correlating network traffic with function calls is difficult, since the distributed calls need to be tracked across multiple sub-components of the architecture. The authors suggest using generic identifiers in the network packets for this, which could then be propagated across the different layers. This suggestion, however, sparked a lively discussion, and the corresponding drafts in the IETF have also faced significant objections, because the advantage of better traceability comes with considerable overhead. For this reason, the authors invited an open discussion and hope to find alternative and less invasive solutions to this requirement together.

Integrating Power over Ethernet and Power over Dataline support to the Linux kernel

In the last regular talk of the conference, Oleksij Rempel from Pengutronix explained how he created a framework for power delivery over the Media Dependent Interface (Power over MDI).

While Power over MDI in the form of PoE, i.e. Power over Ethernet, is probably known to most engineers, the original set of IEEE standards has been expanded several times and also extends to other layer-1 implementations such as the single-pair BASE-T1 variants.

Oleksij explained that not only the huge variety of standards (which are IEEE standards and usually not freely available), but also the variety of names for the individual components, were challenging when creating a generic infrastructure in the kernel and a generic userspace API.

Furthermore, the standards include various optional and mandatory functional components and communication channels, for example to negotiate power availability between components out of band from the network traffic.

Oleksij therefore chose to bring an as-minimal-as-possible implementation to mainline. In particular, his decision to only add the components of Power over MDI that he could actually test, and to submit further aspects later, is very understandable and sensible.

For this reason, only PoDL PSE support is currently implemented in this framework, i.e. the Power Sourcing Equipment (energy source) side of PoDL (Power over Data Line, for BASE-T1). The per-port admin state of the PSE can already be checked using the framework; for obvious reasons, this must happen independently of the link admin state. As soon as the uapi has stabilized, Oleksij will also add ethtool support for this; he has already implemented a proof of concept and demonstrated it as part of his talk.

The next steps will be to implement classification, i.e. the out-of-band negotiation of the details of a specific PoMDI implementation, as well as interfaces to the power delivery framework. This will make it possible to manage the total available power and to split the power budget between the hardware ports or to prioritize specific ports. The corresponding interfaces for the Powered Device (PD), i.e. the power sink, also still need to be added. This would allow powered devices to deal with the available power budget and to adapt their power consumption, or at least report the available power, depending on the PSE's delivery capabilities.

In addition, Oleksij hopes to be able to add support for conventional PoE in the kernel soon.

Outreachy

Roopa Prabhu then presented the Outreachy program for networking projects. Outreachy arranges internships in open source projects and provides mentorship programs, with a particular focus on supporting groups of people who are underrepresented in the technical industry.

As an example of a successful Outreachy internship, Jaehee Park presented the results of her internship last summer, in which she worked on network stack optimizations in various places, such as using the AVX2 extensions for accelerated clearing of skb structures, accelerating and improving the selftests, and improvements for gratuitous ARPs.

Closing Ceremony

NetDev 0x16 ended with the closing ceremony. In addition to a lot of praise for the organizers and speakers, the challenges of a hybrid event were discussed among organizers and participants. It is refreshing to work in a community that is aware of its strengths and weaknesses, rises to its challenges, and openly discusses how things can be improved.

