rsc's Diary: ELC-E 2019 - Day 1
Day 1 at ELC-E started with the announcement that Kernel-CI is now a Linux Foundation project. Read more about the talks I heard today in Lyon below...
I'm usually not too excited by the keynotes at LF events, but my first talk at this year's ELC-E was the announcement that Kernel-CI is now an LF project, and the LF will sponsor the kernel testing effort with $500,000 per year.
Testing is an important topic and this is definitely a good move; however, is it just me who is pretty unhappy with the LF's attitude of "until now there was a variety of test projects out there, but now the community knows which project to choose"? People have quite different reasons for testing, and (at least from our perspective, having done Linux testing for one and a half decades now) experience has shown that it is pretty difficult to squash all these different perspectives into one framework.
One of the really good things in open source is diversity: there are different OSS projects to choose from, and you can learn about their perspective on the actual problem and make your own choices. Open Source is about a marketplace where all the good ideas out there can compete against each other, isn't it?
However, let's see how things turn out in reality. Having more focus and resources on testing is a good thing, given that today's systems are becoming more and more complex. The real challenge will be to find enough smart people to actually fix the bugs unveiled by the testing efforts.
Long-Term Maintenance with Yocto + Debian
Jan Kiszka and Kazuhiro Hayashi talked about the long-term maintenance of Linux systems for industrial applications and the Civil Infrastructure Platform (CIP). As they are aiming for 10 years of support, one of their concerns about updating to new versions is regressions. They want an Open Source Base Layer (OSBL) they can support for 10+ years - this is what the CIP Core group is working on: they have to care about an SLTS (super long-term support) kernel, realtime support and security issues, and they have to do continuous testing on reference hardware. Their concept is based on Debian, as it has proven to provide good overall software quality and stability over a long time.
For the generic profile they use the tool "Isar"; for the tiny profile, "Deby". Deby takes the Debian source packages and uses Yocto to build a system out of that code base; with that approach, their minimal system has a footprint of about 2 MB. The meta-debian layer provides mechanisms to cross-build code from the Debian code base. Currently they are using the Morty release of Yocto. "Isar", on the other hand, re-bundles binary components from Debian. Build times for Isar based systems are around 10 minutes, while Deby builds need about an hour, but obviously offer broader configurability.
The results of the builds have to be tested. They use GitLab, with build runners on AWS, and a LAVA instance pulls the results into a testlab and tests on hardware.
The teams are now focussing on more project-ready features, from software update to security hardening.
It is interesting for us to see that Debian based build workflows have recently been gaining momentum; besides the variants discussed in this talk, there is also the Apertis approach out there. However, CIP seems to have already solved some of the customisation problems that turn up in a project based on a general-purpose distribution.
Fully Automated Power Measurement Solution
Jerome Neanne and Pascal Mareau then talked about the thermo-regulated power measurement platform (TPMP), a mechanism to control all parameters of a device-under-test, repeat tests and collect data offline.
Their test setup consists of a thermo-regulation part with a Peltier element (controlled by a TEC-1091 thermoelectric controller) and a fan, plus a regulation and measurement part based on BayLibre's ACME cape, which offers connections for 8 probes. The whole thing is connected to a host PC via USB. BayLibre also has HE-10 power probes and other test equipment.
In contrast to more complex thermal controlling solutions such as thermal chambers or thermo streams, TPMP is much more desktop and budget friendly.
Jerome and Pascal showed the influence of temperature on several physical parameters of the chip. They found that temperature has a significant impact, for example, on the power dissipation of the i.MX8M they are playing with. The device could be helpful for chip characterisation, aging tests or general power CI (continuous integration).
On the software side, they have Python scripts that can be used to control, for instance, the die temperature of the DUT. Currently the team is looking into replacing the home-grown test framework with the ARM Workload Automation framework and integrating with LAVA.
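The regulation loop behind such scripts is conceptually simple. Here is a minimal, hedged sketch of what die-temperature regulation could look like; the `PeltierDriver` class and all names are hypothetical stand-ins (the real setup talks to a TEC-1091 controller), with a simulated thermal plant so the loop is runnable:

```python
# Sketch: regulate a DUT die temperature via a Peltier element.
# PeltierDriver is a hypothetical placeholder for the real TEC-1091
# interface; it simulates a crude first-order thermal plant.

class PeltierDriver:
    """Simulated thermoelectric controller (stand-in for real hardware I/O)."""
    def __init__(self, ambient=25.0):
        self.temp = ambient
        self.power = 0.0  # -1.0 (full cooling) .. +1.0 (full heating)

    def set_power(self, power):
        self.power = max(-1.0, min(1.0, power))

    def read_temp(self):
        # Simulated plant: applied power moves the die temperature.
        self.temp += 2.0 * self.power
        return self.temp

def regulate(driver, setpoint, steps=200, gain=0.1, tolerance=0.5):
    """Proportional control loop: drive the die temperature to 'setpoint'."""
    for _ in range(steps):
        error = setpoint - driver.read_temp()
        if abs(error) <= tolerance:
            return driver.temp  # settled within tolerance
        driver.set_power(gain * error)
    return driver.temp

if __name__ == "__main__":
    tec = PeltierDriver()
    final = regulate(tec, setpoint=85.0)
    print(f"final die temperature: {final:.1f} °C")
```

With real hardware, `read_temp()` would query the DUT's thermal sensor and `set_power()` would drive the TEC; the control structure stays the same.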
During Q&A it turned out that people working on whole devices would be more interested in heating the entire device, not only the die. That, however, remains a task for a thermal chamber...
Low Latency Deterministic Networking with XDP
Magnus Karlsson and Björn Töpel talked about XDP and AF_XDP sockets, a mechanism in the kernel since 4.18 that makes it possible to do socket communication with high throughput, low latency and determinism. This is interesting for TSN (time sensitive networking), and thus for our industrial activities.
Creating an XDP socket is simple: just use PF_XDP with socket(), allocate buffers and set up memory-mapped ring buffers with a set of socket options. For each XDP socket, there are at least four ring buffers (Rx, Tx, Fill and Completion) that handle the access to the memory shared between the kernel and userspace. As there is direct access to the memory, the application can process packets without too many unnecessary system calls. However, with this kind of control over the packets, applications need to handle them carefully; it is easy to make the packet flow stop if userspace doesn't do the processing right.
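The Fill/Rx handoff is the part where userspace can stall the flow. The following is a simplified conceptual model of that descriptor recycling, not the real AF_XDP API (which is lock-free rings in shared memory, driven from C); plain deques stand in for the rings:

```python
from collections import deque

# Conceptual model of the AF_XDP Fill/Rx descriptor handoff.
# Userspace posts free buffer addresses on the Fill ring; the kernel
# consumes them to store received packets and hands the filled
# descriptors back on the Rx ring.

fill_ring = deque()  # userspace -> kernel: free buffer addresses
rx_ring = deque()    # kernel -> userspace: filled packet descriptors

def userspace_refill(buffers):
    for addr in buffers:
        fill_ring.append(addr)

def kernel_receive(packet):
    """Kernel side: needs a free buffer from the Fill ring for each packet."""
    if not fill_ring:
        return False  # no free buffers: reception stalls
    addr = fill_ring.popleft()
    rx_ring.append((addr, packet))
    return True

def userspace_process():
    """Userspace: consume Rx descriptors, then recycle buffers to Fill."""
    while rx_ring:
        addr, packet = rx_ring.popleft()
        # ... process 'packet' here ...
        fill_ring.append(addr)  # recycle, or reception eventually stalls

userspace_refill([0x1000, 0x2000])
assert kernel_receive(b"pkt0") and kernel_receive(b"pkt1")
assert not kernel_receive(b"pkt2")  # Fill ring empty: flow stopped
userspace_process()                 # recycles the buffers
assert kernel_receive(b"pkt2")      # flow resumes
```

The Tx/Completion pair works symmetrically for the transmit direction: userspace posts descriptors to Tx, and the kernel returns finished buffers via the Completion ring.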
In order to provide XDP support, drivers need to take care of certain things (such as using NAPI and softirqs so that kernel and userspace processing happen on different CPUs). There is a standard hook directly after the SKB is received, but that's just the fallback slow path. If a driver wants not only to work but to perform, it needs to implement zero-copy mechanisms. Currently, not that many drivers support XDP yet; most of them are Intel drivers; Mellanox and Broadcom are working on it.
Performance measurements show an improvement by a factor of 10 to 20 in terms of packets/s. Worst-case latencies go down from 16 ms to about 100 µs.
Safety vs. Security
Jeremy Rosen then reported on his experiences from a project that has both safety and security requirements. This is a topic that touches our customer projects as well, so I joined in order to learn how others tackle it. Experience shows that safety practice is to test a system extensively and then never update it again, while security practice demands that systems be updated constantly in order to avoid security risks.
The focus of safety is to prove that systems are "correct" - however, you cannot prove that with today's hardware and software (caches, for instance, change timing behaviour). In the end, every change to the system leads to re-certification and a lot of effort. For tackling security issues, on the other hand, speed is everything: if a security flaw turns up, you need to be fast to close it. So, after all, there are contradictory requirements which cannot be resolved.
Another challenge comes from the fact that it is extraordinarily difficult to actually update a system in the field, because if something goes wrong, the device is a brick. You have to deal with devices that don't have network connectivity, and you have to take care of bad blocks, kernel updates and configuration updates over a long time. There are even systems that are not allowed to stop for an update. Also, there are old hardware variants out there that have to be supported for decades. And as devices are in the hands of someone else, it might be necessary to crypto-lockdown them. Finally, people are starting to add computers to all kinds of devices, having no idea how to deal with all this over a long time.
His conclusion is that you have to choose: bricked or pwned? He told his customers that, and they have no idea what to do (well, why does all this sound so familiar to me...?)
One question that needs to be solved is how often to update. Monthly updates, for example, conflict with systems whose certification takes six months.
Is there a solution? Probably not, but there are variants to mitigate the issue:
- not all products are safety critical, but all connected products have to care about security
- you need a robust update system
- automated testing for re-certification
- you should minimize your safety critical perimeter and update it separately
- you could separate safety and security (containers, hypervisors, hardware separation)
- have a plan for security updates, with a maintenance process and a documented end of life for your product that you communicate to your customers
Finally, I cannot agree more. The talk concluded exactly with what we tell customers every day...
Building a Network Operating System using Linux and Yocto
In the last regular talk of the day, John Mehaffey talked about how Aruba builds ArubaOS-CX (code name "Halon"), based on OpenSwitch and OPX, the Linux Foundation's network operating system.
The team started small, uses Yocto to integrate platform dependent features and bases its system on BusyBox in the standard configuration. They found that updating the kernel provides critical fixes, but also provides a convenient excuse for problems: when updating from 4.4 to 4.9, they found 18 kernel bugs, but it turned out that 9 were actually design flaws and not something to blame the kernel for. While it was difficult to convince management of kernel updates in the beginning, things went more smoothly once people got used to it. Currently they are using two architectures and six variants. As they are addressing modular rack switches, they have to make sure that the PCI bus stays operable while CPUs switch between active and passive roles.
The talk didn't go into the details of how they handled the actual switch, but it seems like they are doing all that in userspace using a big set of blob code instead of making use of the kernel infrastructure.
BoF: Low Spec Devices
My day concluded with a BoF about low spec embedded Linux devices: systems that run in just a few megabytes of RAM. Why do people choose this kind of low end hardware? There seem to be two reasons: cost (for very high quantity projects) and ultra low power. The crowd discussed what the smallest system running Linux is; it turned out that a well maintained one in mainline is the STM32F4 with 2 MB of on-chip SRAM. Microchip last week released a brand new ARM926 chip with 8 MB internal SRAM, which is very, very cheap. However, it remains unclear what you can do with such a system.
One interesting question from the audience was: why use Linux on such low end hardware at all? Linux drivers for filesystems and networking have lots of corner cases covered, but other than that, the value of these low spec systems stayed a bit in the dark.