What Kernel Should I Use (Embedded Edition)
Some days ago, Greg Kroah-Hartman wrote a great blog post about Which Stable Kernel One Should Use?. I fully agree with his position; however, I'd like to make some additions for the industrial device manufacturer use case and point out some common pitfalls and misunderstandings we see in that area.
If you haven't read Greg's article yet, please do so first; Greg does a great job maintaining the stable kernels, and in his article, he explains the motivation and mechanisms of the kernel's "stable" process and its different variants.
In the "Older LTS release" section in particular, he describes the dilemma industrial device manufacturers (like Pengutronix' customers) are in: SoC manufacturers often start Linux development years before a new chip comes out, and by the time the chip is released, the kernel support they give to their customers is already quite old and contains tons of out-of-tree patches.
LTS Isn't Automatically Long Term Stable
While the chip manufacturers may originally have started with an LTS series, the name is misleading. A "long term stable" kernel does not magically become long term stable just because of its name: you only get the promises of the maintainer teams for your industrial product if…
- … you use an LTS kernel unmodified, and
- … you always update to the latest version of that LTS series, because that is the version that contains all the bugfixes and security fixes known and backported from mainline at a certain point in time.
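The two conditions above can be turned into a small sanity check. The following is only a minimal sketch: the function names are hypothetical, the latest version of the series has to be supplied by the caller (for example, looked up by hand), and treating any suffix after the numeric version as "probably not vanilla" is a heuristic assumption, not a rule of the stable process.

```python
import re


def parse_release(release):
    """Split a release string such as '4.14.20-rt17' into ((4, 14, 20), '-rt17')."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(.*)$", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch, suffix = match.groups()
    # A release like '4.14' has no patch level; treat it as patch level 0.
    return (int(major), int(minor), int(patch or 0)), suffix


def lts_status(running, latest_in_series):
    """Compare a running kernel release against the latest release of an LTS series.

    Both arguments are plain strings; nothing is fetched from the network.
    """
    running_version, running_suffix = parse_release(running)
    latest_version, _ = parse_release(latest_in_series)
    same_series = running_version[:2] == latest_version[:2]
    return {
        "same_series": same_series,
        "up_to_date": same_series and running_version >= latest_version,
        # Number of point releases missing, only meaningful within one series.
        "releases_behind": (
            max(0, latest_version[2] - running_version[2]) if same_series else None
        ),
        # Heuristic (assumption): a suffix hints at a vendor or locally modified build.
        "looks_vanilla": running_suffix == "",
    }
```

On a running system, the first argument could come from `platform.release()`; a result with `up_to_date` set to `False` or `looks_vanilla` set to `False` suggests that at least one of the two conditions is not met.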
What we see in many industrial projects is that chip vendors push BSPs to their customers that start with an LTS version and then add up to 3000 patches for their chips that never went through the community review process. However, the review process is what makes Linux the rock solid and high quality operating system we all love!
Another thing to notice is that many of the bug fixes that end up in the LTS kernel have no good test coverage: they were developed on top-of-tree kernels, then somebody decided that they look like stable material and backported them. However, the number of developers who actually test the patches on the LTS kernels is quite small. We recently had a case where important NAND bugs we found and fixed for mainline were backported, but nobody noticed that some other infrastructure in the MTD subsystem was missing in those old kernels; as a result, it looked as if the patches had improved things, but in fact they were simply broken and untested.
Industrial device manufacturers often still stick to the old waterfall development model: design, implement, test, ship, then never touch a device again; this is especially true in regulated markets like medical, automotive or machine control. If this mindset is combined with an LTS based vendor frankenkernel, you get the worst of all worlds: the resulting kernel is old, badly tested and far away from anything that gets community support.
If this strategy is used, it doesn't mean ANYTHING that it is called "Long Term Stable".
Define "Long Term"
Another pitfall is that industrial device manufacturers then conclude they need something which is stable for even longer: if you look at Greg's table, you'll notice that what the IT industry calls longterm is supported for 2 to 5 years. That is nowhere near long enough for embedded industrial devices! If you build a tractor, it might be used for 25 years. If you build a medical device, it might be used for 10…15 years. Vendors of chips for industrial devices often promise 10…15 years availability of their chips, but in reality, you can still buy embedded chips from the early 1990s.
So even if the argument is "we got this from our chip vendor and it is based on an LTS kernel", you should be prepared to have 3…5 kernel updates over the lifetime of the device. With a vendor kernel, you are then absolutely dependent on the chip manufacturer's kernel team and its willingness to forward-port 3000 patches for a 10 year old chip! We have seen more than once that chip vendor kernel teams simply don't do that, because their software resources are limited and focussed on new devices. It might even happen that they don't remember the exact reasons for the patches any more…
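As a rough back-of-the-envelope check of these numbers, the sketch below counts how many kernel series a device needs over its lifetime. It rests on simplifying assumptions that are not from the original text: every series is supported for the same number of whole years, and the device moves to a new series exactly when support for the old one ends. The function names are hypothetical.

```python
def series_needed(device_lifetime_years, support_years):
    """Return how many kernel series are needed to cover the device lifetime.

    Both arguments are whole years. The count includes the series the
    device ships with.
    """
    if device_lifetime_years <= 0 or support_years <= 0:
        raise ValueError("both durations must be positive")
    # Ceiling division on integers, avoiding floating point rounding.
    return -(-device_lifetime_years // support_years)


def estimate_table(lifetimes=(10, 15, 25), supports=(2, 5)):
    """Map (lifetime, support) pairs to the number of kernel series needed."""
    return {
        (lifetime, support): series_needed(lifetime, support)
        for lifetime in lifetimes
        for support in supports
    }
```

For a 15 year device lifetime and 3 to 5 years of support per series, this gives between 5 and 3 series, which is in line with the 3…5 figure above; for a 25 year tractor and only 2 years of support, it would be 13.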
For Conservative Goals, Be Progressive!
Our strategy here at Pengutronix is different when it comes to kernel decisions: when we talk to a customer who wants to build a new industrial device, we strongly suggest starting with a top-of-tree mainline kernel; in many cases, it is even a totally "unstable" -rc kernel. The newer, the better. You can imagine that this is often surprising for our customers and leads to many discussions…
We observe that industrial customers who do full custom or module based product development have a development cycle of about 1 to 2 years from the first product idea to Start-of-Production (SoP). It doesn't matter at all that we are not on an LTS kernel during that time, as that kernel is not what's used in the final product.
When starting with a top-of-tree mainline kernel, we have the advantage that we get all the bugfixes, performance patches and infrastructure patches people have developed until now, on exactly the kernel they were developed on (which is also top-of-tree). We test those kernels on customer hardware quickly, which makes sure that bugs and regressions turn up during the time when developers actually work with the hardware. If something is missing in the kernel, we can do the development on top-of-tree as well, which makes it easy to get the patches into mainline. Product specific patches go through the normal kernel quality cycle, and even our experienced kernel developers always get really great feedback from the other community developers! While we try hard to keep our patch stack small, BSP integration technologies like PTXdist or Yocto make it possible to integrate all the feature patches on top of a mainline kernel and iterate quickly. Testing is automated as far as possible with the help of labgrid.
While the customer heads towards SoP, we mainline patches and add support for all of the necessary hardware. The kernel our customers work with during that time is typically synchronized every few weeks and is often an -rc kernel.
Once the product enters its more intensive production testing phase, we of course look at what kernel we suggest for the final product. The result might be an LTS kernel (for headless i.MX6 devices, we have several products in the field which are patch free on vanilla LTS kernels from upstream), a normal "stable" kernel or even a normal kernel release.
An important feature for this strategy is the ability to upgrade systems in the field: that's why we developed RAUC, the Robust Auto Update Controller. Even after Start-of-Production and roll-out of the first devices, it is possible to bring new kernels to the devices. Once the "hot" product development phase is over, we often continue to bring patches into mainline during a maintenance project. As a result, the customer hardware might not be fully supported by an LTS kernel right from the start, but once it is, we have a really low patch count!
In conclusion, our experience is that with this development strategy, "having a device on an LTS kernel" is really possible for industrial devices. However, it is necessary to understand that "longterm" doesn't come automatically, but needs a certain amount of effort. Fortunately, the community is there to help: once you are on a mainline strategy, there are so many friendly community folks out there who really care about mainline, and even do some of the work!
If you want to learn more about our mainline strategy, you should watch my colleague Jan Lübbe's ELCE talk from 2016: