Sunday, November 8, 2015

Porting legacy RTOS software to a custom OS.

So lately I've been working on porting a large application from a well-known RTOS to a custom kernel. The RTOS is VxWorks and the application is router firmware. Unfortunately, I can't share any code from the porting layer, so I'll just briefly go over the problems I've encountered and some thoughts I've had along the way.

The VxWorks compiler (called "diab") is, like many other proprietary C compilers, based on the Edison Design Group (EDG) front end. Unlike GCC or clang, it is quite lax about following the standards by default. Besides, the original firmware project uses the VxWorks Workbench IDE (which is, of course, a fork of Eclipse) and is built on Windows.

The first thing I had to do was move the project's build system to Linux. The ideal approach would be to write a parser that reads the original build system and produces a Makefile. Its advantage is that a new Makefile could be regenerated instantly whenever someone changes the original project. However, during the prototyping stage it makes a lot of sense to take shortcuts wherever possible, so I ended up writing a tiny wrapper program to replace the compiler and linker binaries. It intercepted its arguments, wrote them to a log file and then acted as a proxy, launching the original program. After that it was just a matter of making minor changes to the log file to turn it into a bash script and later into a Makefile.
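The wrapper can be sketched roughly as follows. This is a POSIX sketch, not the original tool. It assumes the path of the real compiler comes from a hypothetical REAL_TOOL environment variable and the log file from a hypothetical WRAPPER_LOG variable. Each argument is quoted for bash, so the log can be turned into a script with little editing. On a Windows build host, the final execv() call would have to be replaced by something like _spawnv() or CreateProcess().

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Characters that never need quoting on a bash command line. */
#define SAFE_CHARS "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" \
                   "0123456789-_./=+,:@%"

/* Append s to buf, which currently holds *len chars; -1 if it won't fit. */
static int put(char *buf, size_t cap, size_t *len, const char *s)
{
    size_t n = strlen(s);
    if (*len + n + 1 > cap)
        return -1;
    memcpy(buf + *len, s, n + 1);
    *len += n;
    return 0;
}

/* Render argv as one bash-ready line: safe words verbatim, everything else
 * single-quoted, with embedded single quotes written as '\''. */
int format_command_line(char *buf, size_t cap, int argc, char *const argv[])
{
    size_t len = 0;
    if (cap == 0)
        return -1;
    buf[0] = '\0';
    for (int i = 0; i < argc; i++) {
        const char *a = argv[i];
        if (i > 0 && put(buf, cap, &len, " ") != 0)
            return -1;
        if (a[0] != '\0' && a[strspn(a, SAFE_CHARS)] == '\0') {
            if (put(buf, cap, &len, a) != 0)
                return -1;
            continue;
        }
        if (put(buf, cap, &len, "'") != 0)
            return -1;
        for (const char *p = a; *p != '\0'; p++) {
            char one[2] = { *p, '\0' };
            if (put(buf, cap, &len, *p == '\'' ? "'\\''" : one) != 0)
                return -1;
        }
        if (put(buf, cap, &len, "'") != 0)
            return -1;
    }
    return 0;
}

#ifdef WRAPPER_MAIN
int main(int argc, char *argv[])
{
    static char line[65536];
    const char *real = getenv("REAL_TOOL");   /* hypothetical variable */
    const char *log = getenv("WRAPPER_LOG");  /* hypothetical variable */

    if (real == NULL) {
        fprintf(stderr, "wrapper: REAL_TOOL is not set\n");
        return 127;
    }
    if (log != NULL && format_command_line(line, sizeof line, argc, argv) == 0) {
        FILE *f = fopen(log, "a");
        if (f != NULL) {
            fprintf(f, "%s\n", line);
            fclose(f);
        }
    }
    execv(real, argv);          /* returns only on failure */
    perror("wrapper: execv");
    return 127;
}
#endif
```

Built with -DWRAPPER_MAIN and installed in place of the compiler binary, every invocation appends one line to the log, for example `dcc -c 'a b.c'` for an argument containing a space.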

Because development was done on Windows, most files share the same set of defects that prevent them from being compiled by a Linux-based compiler: paths in "#include" directives often have the wrong case, use backslashes, and so on. Worst of all, some "#include" directives listed multiple files inside one set of brackets. Luckily, this was easy to fix with a script that parsed the compilation log for "file not found" errors, looked for the corresponding file while ignoring case, and fixed up the source code. In the end I was left with about a dozen places that had to be fixed manually.
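Here is a minimal sketch of the two helpers such a fix-up script needs, written in C using POSIX opendir()/readdir() and strcasecmp(). The first converts backslashes to forward slashes. The second finds the real on-disk spelling of a file name within one directory. A complete tool would apply the lookup to each path component and rewrite the #include line. That part, and the function names used here, are assumptions.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Turn a Windows-style include path into a POSIX one, in place:
 * "..\Common\Types.H" becomes "../Common/Types.H". */
void normalize_slashes(char *path)
{
    for (char *p = path; *p != '\0'; p++)
        if (*p == '\\')
            *p = '/';
}

/* Search dir for an entry whose name equals name, ignoring case.
 * On success copy the real spelling into out and return 0; return -1 if
 * nothing matches or the directory cannot be opened. */
int find_case_insensitive(const char *dir, const char *name, char *out, size_t cap)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int rc = -1;

    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL) {
        if (strcasecmp(e->d_name, name) == 0) {
            snprintf(out, cap, "%s", e->d_name);
            rc = 0;
            break;
        }
    }
    closedir(d);
    return rc;
}
```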

Implementing the compatibility layer.

I did some quick research into the available options and found that the only up-to-date solution implementing the VxWorks API on other OSes is Xenomai. However, it is quite intrusive, because it relies on a loadable kernel module and some patches to the Linux kernel. We were not interested in realtime behaviour; we wanted to run entirely in userspace, on both our OS and Linux. So I decided to write yet another VxWorks emulation layer.
The original firmware comes as a single ELF file, which is reasonable because in VxWorks all processes are implemented as threads in a shared address space. Besides, VxWorks provides a POSIX-compatible API for developers. So to identify which functions needed implementing, it was enough to try linking the compiled code into a single executable and look at the unresolved symbols.

"One weird trick" useful for creating porting layers and DDEKits is the GCC/clang option "-include", which prepends an include of the given header to every file compiled. This is super useful. You can use such an uber-header to hold the definitions of the data types and the function prototypes for the target platform. You can also use it to hook library calls at compile time.

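Here is a minimal sketch of such an uber-header, assuming it is passed as `-include compat.h` (the file name is hypothetical). It defines a few target types modelled on VxWorks' STATUS, OK and ERROR; check these against the real vxWorks.h before relying on them. It also hooks malloc() at compile time by redirecting it through a macro. The implementation part is shown in the same block only to keep the example self-contained.

```c
/* compat.h: hypothetical uber-header, prepended to every file with
 *   gcc -include compat.h ...   (clang accepts the same flag)
 * The real system headers come first, so their prototypes are seen before
 * any macro below renames a library call. */
#include <stdio.h>
#include <stdlib.h>

/* Target-platform types, modelled on VxWorks names. */
typedef int STATUS;
#define OK    0
#define ERROR (-1)

/* Compile-time hook: every malloc() call in the ported code is redirected
 * to compat_malloc(), which also receives the call site. */
void *compat_malloc(size_t n, const char *file, int line);
#define malloc(n) compat_malloc((n), __FILE__, __LINE__)

/* --- compat.c (normally a separate file) --- */
unsigned long compat_alloc_count = 0;

void *compat_malloc(size_t n, const char *file, int line)
{
    compat_alloc_count++;
    /* Parenthesising the name suppresses the function-like macro, so this
     * calls the real allocator instead of recursing. */
    void *p = (malloc)(n);
    if (p == NULL)
        fprintf(stderr, "%s:%d: malloc(%zu) failed\n", file, line, n);
    return p;
}
```

The header is prepended before any source line. As a result, every `malloc(n)` in the ported code becomes `compat_malloc(n, __FILE__, __LINE__)` without any edits to the sources.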
One problem that took a great deal of time was implementing the synchronization primitives, namely mutexes and semaphores. The tricky semantic difference between semaphores and mutexes in VxWorks is that the latter are recursive. Once a thread has acquired a mutex, it may lock it again any number of times, as long as the lock and unlock counts are balanced.
Before I realized this, I couldn't figure out why the software kept locking up, and disabling locking altogether led to totally random crashes.
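The semantics in question can be sketched as a recursive mutex built from a plain pthread mutex and a condition variable. This is an illustration, not the implementation linked in the next paragraph. In a port, semTake()/semGive() on a VxWorks mutex semaphore (created with semMCreate()) would presumably map onto these calls. Timeouts, priority inheritance and deletion safety are deliberately left out. Where the host supports it, the POSIX PTHREAD_MUTEX_RECURSIVE mutex type gives the same recursion without custom code.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t guard;     /* protects owner and depth */
    pthread_cond_t  released;  /* signalled when depth drops back to 0 */
    pthread_t       owner;     /* meaningful only while depth > 0 */
    int             depth;     /* 0 means the mutex is free */
} rec_mutex_t;

void rec_mutex_init(rec_mutex_t *m)
{
    pthread_mutex_init(&m->guard, NULL);
    pthread_cond_init(&m->released, NULL);
    m->depth = 0;
}

void rec_mutex_lock(rec_mutex_t *m)
{
    pthread_t self = pthread_self();

    pthread_mutex_lock(&m->guard);
    if (m->depth > 0 && pthread_equal(m->owner, self)) {
        m->depth++;                      /* re-entry by the current owner */
    } else {
        while (m->depth > 0)             /* another thread holds it: wait */
            pthread_cond_wait(&m->released, &m->guard);
        m->owner = self;
        m->depth = 1;
    }
    pthread_mutex_unlock(&m->guard);
}

void rec_mutex_unlock(rec_mutex_t *m)
{
    pthread_mutex_lock(&m->guard);
    /* Unlocking a mutex we do not own is a bug (checked in debug builds). */
    assert(m->depth > 0 && pthread_equal(m->owner, pthread_self()));
    if (--m->depth == 0)
        pthread_cond_signal(&m->released);
    pthread_mutex_unlock(&m->guard);
}
```

A plain non-recursive mutex would deadlock on the second rec_mutex_lock() call from the same thread. That matches the lock-ups described above.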

Ultimately, out of frustration, I ended up with a simple implementation of a recursive mutex, which allowed me to move much further (Simple Recursive Mutex Implementation @ github). Later, for debugging purposes, I also added options to print a backtrace identifying the previous lock owner. It fires when a thread tries to enter the critical section, or when the spinlock takes too many attempts.
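Here is a sketch of that kind of debugging aid, using a spinlock. Every successful acquirer records where it took the lock (file and line, captured by a macro). A contender that spins past an arbitrary threshold prints that location together with its own backtrace, using glibc's backtrace() and backtrace_symbols_fd() from <execinfo.h>. Recording only file and line, rather than the owner's full backtrace, is a simplification made here so the fields can be read race-free through C11 atomics. All names and the threshold are hypothetical.

```c
#include <execinfo.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define SPIN_WARN_ATTEMPTS 1000000L   /* arbitrary threshold */
#define MAX_FRAMES 32

typedef struct {
    atomic_flag locked;
    _Atomic(const char *) owner_file;  /* where the current holder locked it */
    atomic_int owner_line;
} dbg_spinlock_t;

#define DBG_SPINLOCK_INIT { ATOMIC_FLAG_INIT, NULL, 0 }

/* Callers use dbg_spin_lock(&l) so the call site is recorded automatically. */
#define dbg_spin_lock(l) dbg_spin_lock_at((l), __FILE__, __LINE__)

void dbg_spin_lock_at(dbg_spinlock_t *l, const char *file, int line)
{
    long attempts = 0;

    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        if (++attempts == SPIN_WARN_ATTEMPTS) {
            void *frames[MAX_FRAMES];
            int n = backtrace(frames, MAX_FRAMES);
            const char *of = atomic_load(&l->owner_file);
            fprintf(stderr, "%s:%d: spinlock %p still busy after %ld attempts; "
                    "held since %s:%d\n", file, line, (void *)l, attempts,
                    of ? of : "?", atomic_load(&l->owner_line));
            backtrace_symbols_fd(frames, n, STDERR_FILENO);  /* our own stack */
        }
        sched_yield();
    }
    atomic_store(&l->owner_file, file);
    atomic_store(&l->owner_line, line);
}

void dbg_spin_unlock(dbg_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```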

Hunting for code defects

Uninitialized variables and what we can do about them

One problem I came across was that the code had a lot of uninitialized variables, hundreds of them. On the one hand, the proper solution is to inspect each case manually and write the correct initializer. On the other hand, the code works when compiled with the diab compiler, so presumably that compiler zero-initializes them.
So I went ahead and wrote a clang rewriter plugin to add the initializers: zero for primitive types and a pair of curly braces for structs (clang rewriter to add initializers). However, I realized that the biggest problem is the mix of return conventions. Some functions return zero to indicate a failure, while others signal failure with a non-zero value. This means we cannot have a generic safe initializer that would make the code take the "fault" path when it reaches the rewritten code. An alternative to manual inspection could be writing sophisticated rules for the rewriter to detect the convention in use.
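The following contrived illustration shows why zero is not a universally safe default. Both functions below originally left their result variable uninitialized on one path. After the rewrite, that path reports success under the VxWorks-style STATUS convention (0 means OK) but failure under the boolean convention. The function names and do_work() are hypothetical.

```c
/* Stand-in for real work; returns 0 on success. */
static int do_work(void)
{
    return 0;
}

/* Convention 1, VxWorks-style STATUS: 0 (OK) means success. */
int start_status(int unit)
{
    int rc = 0;                 /* was "int rc;", zero added by the rewriter */
    if (unit > 0)
        rc = do_work();
    return rc;                  /* unit <= 0: silently reports success */
}

/* Convention 2, boolean style: non-zero means success. */
int start_bool(int unit)
{
    int ok = 0;                 /* was "int ok;", zero added by the rewriter */
    if (unit > 0)
        ok = (do_work() == 0);
    return ok;                  /* unit <= 0: reports failure, the safe path */
}
```

The same mechanical rewrite therefore sends start_status(0) down the success path and start_bool(0) down the failure path.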

I ended up using valgrind and manually patching some warnings. AddressSanitizer was also useful. However, fixing each warning and maintaining a blacklist is too tiresome. Instead, I set a breakpoint on the "__asan_report_error" function and wrote a script that made gdb print a backtrace, return and continue execution.
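A possibly simpler alternative to the gdb script, not what was used here, is to let ASan itself report each error and continue. This needs the code compiled with -fsanitize=address -fsanitize-recover=address (clang, or a recent GCC) and the runtime option halt_on_error=0. That option can be baked into the binary by defining __asan_default_options(), a hook the ASan runtime calls at startup to read default options.

```c
/* Build with, for example:
 *   clang -m32 -g -fsanitize=address -fsanitize-recover=address ...
 * The ASan runtime calls this function at startup to obtain its default
 * options, so no ASAN_OPTIONS export is needed for the setting below. */
const char *__asan_default_options(void)
{
    /* Print each report (with its stack trace) and keep running instead of
     * aborting on the first error. */
    return "halt_on_error=0";
}
```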

Duplicate structures

One problem I suspected could be present in the code (due to the deep hierarchy of #ifdefs) is structures with the same name but different contents. I made up an example C program demonstrating the effect: the compiler does not warn about the type mismatch, but at runtime the code silently corrupts memory.

I figured out an easy way of dealing with the problem. I used clang to emit LLVM bitcode for each file instead of object files with machine code. Then I linked them together into a single bitcode file and disassembled it with llvm-dis.

The nice thing about LLVM is that when it finds two structures with the same name but different declarations, it appends a different numeric suffix to the struct name. One can then remove the suffixes and look for unique lines with differing structure declarations.
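The suffix-stripping step can be sketched as a small C filter over the llvm-dis output. That output could come, for instance, from clang -emit-llvm -c per file, llvm-link to merge the bitcode, and llvm-dis to disassemble the result. The sketch assumes struct definitions appear on lines of the form `%struct.name = type { ... }`, and that a renamed duplicate gets a trailing `.<digits>` suffix. Unions (`%union.*`) would need the same treatment. Names and limits are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_TYPES 1024
#define LINE_LEN  1024

/* Copy an LLVM type name such as "%struct.config.12" into out, dropping a
 * trailing ".<digits>" suffix if there is one. */
void strip_numeric_suffix(const char *name, char *out, size_t cap)
{
    snprintf(out, cap, "%s", name);
    char *dot = strrchr(out, '.');
    if (dot == NULL || dot[1] == '\0')
        return;
    for (const char *p = dot + 1; *p != '\0'; p++)
        if (!isdigit((unsigned char)*p))
            return;
    *dot = '\0';
}

/* Read llvm-dis output and print every struct whose name (suffix removed)
 * appears with a different body than an earlier definition. Returns the
 * number of such conflicting definitions. */
int count_struct_conflicts(FILE *in)
{
    static char names[MAX_TYPES][256], bodies[MAX_TYPES][LINE_LEN];
    char line[LINE_LEN], name[256];
    int n = 0, conflicts = 0;

    while (n < MAX_TYPES && fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        if (strncmp(line, "%struct.", 8) != 0)
            continue;
        char *sep = strstr(line, " = type ");
        if (sep == NULL)
            continue;
        *sep = '\0';
        const char *body = sep + 8;            /* strlen(" = type ") == 8 */
        strip_numeric_suffix(line, name, sizeof name);
        for (int i = 0; i < n; i++) {
            if (strcmp(names[i], name) == 0 && strcmp(bodies[i], body) != 0) {
                printf("conflict: %s\n  %s\n  %s\n", name, bodies[i], body);
                conflicts++;
                break;
            }
        }
        snprintf(names[n], sizeof names[n], "%s", name);
        snprintf(bodies[n], sizeof bodies[n], "%s", body);
        n++;
    }
    return conflicts;
}
```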

Luckily for me, there was only one place where I suspected an incorrect definition. It was not in the part of the code I was executing, so I ruled it out as a source of incorrect behavior.

Further work

Improving tooling for the 32-bit world

Unfortunately, the project I have been working on is 32-bit, and one cannot simply convert it to 64-bit with a compiler flag. The code has quite a lot of places where pointers are, for some reason, stored in packed structures, and some other structures implicitly rely on the structure layout. So this fragile mess is very difficult to modify.
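The following hypothetical example shows the kind of pattern that makes a flag-only 64-bit build unsafe. It is a packed header whose layout is fixed and whose 32-bit field is also used to carry a pointer. The struct and field names are invented for illustration.

```c
#include <stdint.h>

/* Fixed layout (7 bytes, no padding) that other code depends on. */
struct __attribute__((packed)) msg_hdr {
    uint8_t  type;
    uint32_t ctx;      /* also used to carry a pointer cast to an integer */
    uint16_t len;
};

/* Stash a pointer in the 32-bit field. The round trip is lossless only when
 * pointers are 32 bits wide; on a 64-bit build the upper half is silently
 * dropped. Returns 1 if the value survived. */
int stash_ptr(struct msg_hdr *h, void *p)
{
    uintptr_t v = (uintptr_t)p;
    h->ctx = (uint32_t)v;
    return (uintptr_t)h->ctx == v;
}
```

Widening ctx to 64 bits would fix the truncation, but it would also change the size and offsets that the rest of the code relies on.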
It is sad that two great tools, MemorySanitizer and ThreadSanitizer, are not available for 32-bit applications. It is understandable, because the 32-bit address space is too small to fit the shadow memory. I am thinking of ways to make them available in the 32-bit world, and so far I see two ways of solving the problem.
First, we can use the fragile and non-portable MAP_32BIT flag for mmap (Linux-only, and only on x86-64) to force allocations into the low part of the address space. Strictly speaking, it places mappings within the first 2 GB, not below 4 GB. Then one could write an ELF loader that would load all the code into that low range and use the upper memory range for shadows. Besides being non-portable, this approach has other disadvantages, including having to deal with 64-bit identifiers such as file handles.
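The allocation half of the first idea can be sketched with the Linux MAP_32BIT flag (x86-64 only). The ELF loader and the shadow layout are not shown. The function name is hypothetical, and the code returns NULL where the flag is unavailable, for example on non-x86 Linux.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Allocate len bytes of anonymous memory in the low part of the address
 * space, or return NULL if the request fails or the flag is unsupported. */
void *alloc_low(size_t len)
{
#if defined(MAP_32BIT) && defined(MAP_ANONYMOUS)
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    (void)len;
    return NULL;
#endif
}
```

The everything-above-that-range part of the address space would then remain free for shadow mappings, which is the point of the scheme.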
Alternatively, we could store the shadow in a separate process and use shared memory or sockets for communication. That would likely be at least an order of magnitude slower than using the corresponding sanitizer in the 64-bit world, but probably still faster than valgrind. Besides, it is compile-time instrumentation, so it has access to more internal data from the compiler.
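The second idea could look roughly like this on the instrumented side. Instead of writing shadow bytes in-process, the 32-bit program batches shadow updates and streams them over a socket to a 64-bit server that owns the shadow memory. The record layout, batch size and function names are all hypothetical, and the server side is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* One shadow update produced by instrumented 32-bit code. */
struct shadow_update {
    uint32_t addr;     /* application address */
    uint32_t size;     /* number of bytes affected */
    uint8_t  op;       /* SHADOW_POISON or SHADOW_UNPOISON */
};

enum { SHADOW_POISON = 1, SHADOW_UNPOISON = 2, BATCH = 256 };

struct shadow_client {
    int fd;                              /* socket to the shadow server */
    struct shadow_update buf[BATCH];     /* batched to amortise syscalls */
    size_t n;
};

/* Send all buffered updates; returns 0 on success, -1 on write error. */
int shadow_flush(struct shadow_client *c)
{
    const char *p = (const char *)c->buf;
    size_t bytes = c->n * sizeof c->buf[0];

    while (bytes > 0) {
        ssize_t w = write(c->fd, p, bytes);
        if (w < 0)
            return -1;
        p += w;
        bytes -= (size_t)w;
    }
    c->n = 0;
    return 0;
}

/* Queue one update, flushing first if the batch is already full. */
int shadow_record(struct shadow_client *c, uint32_t addr, uint32_t size, uint8_t op)
{
    if (c->n == BATCH && shadow_flush(c) != 0)
        return -1;
    c->buf[c->n++] = (struct shadow_update){ addr, size, op };
    return 0;
}
```

Batching is the main lever here. A round trip per memory access would make the "order of magnitude" estimate optimistic.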

Verifying the porting was done correctly

Now I am left with one challenging task: verifying that the ported software is identical to what the original build system would produce.

One may notice that simply intercepting the calls to the compiler may not be enough, because the build system may copy files or export shell variables during the build. Besides, different compilers handle "inline" and some other directives differently. It would be good to verify that the call graph of the original binary is similar to the one produced by our Makefile. (Of course, we will need to mark some library functions as leaf nodes and not analyze them.) For a start I could inspect some of the unresolved symbols manually, but I'm thinking of automating the process. I think for this task I'll need a disassembler that can identify basic blocks; the Capstone engine should probably do the job. Any ideas on that?

P.S. Oh, and I once tried visualizing the dependency graph of the separate ".o" files (before I realized I could just link them all together and get the list of missing symbols), and trust me, those graphs grow really fast. I found that a tool called Gephi does a decent job of visualizing really huge graphs and supports Graphviz's dot as an input format.

EDIT 2015-11-10
The challenge is complicated by the fact that some subprojects have multiple copies (with various changes), and one should also ensure that the headers and sources are picked up from the correct copy. However, I've found an easy and acceptable solution. I wrote a tool that parses the call graph generated by IDA and, for every edge in the graph, looks up the function names of the corresponding vertices. It then prints a pair "A -> B" for every function A calling function B. After that, we can sort the file alphabetically and remove the nodes that are uninteresting to us (the OS and library functions). Next, we can compare the files side by side with a tool like kdiff3 (or do it automatically). Whenever there is a significant difference (for example, 5 callees differ for the same caller), we can inspect it manually and verify we're compiling the correct file with the correct options. Using this method I identified several places where we had chosen the wrong object file for linking. Now we're only concerned with the porting layer and the OS kernel, without having to worry about the application itself.
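The automatic comparison can be sketched as a merge over two edge lists in the "A -> B" format. Each list is assumed to be already filtered and sorted in plain byte order (for example with LC_ALL=C sort, so the order matches strcmp()). Lines sharing a caller prefix are contiguous in sorted order, so differing edges can be counted per caller in a single pass. The threshold of 5 follows the example above; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Length of the caller part of an edge line "A -> B". */
static size_t caller_len(const char *edge)
{
    const char *arrow = strstr(edge, " -> ");
    return arrow ? (size_t)(arrow - edge) : strlen(edge);
}

/* Report one caller's run of differing edges if it reaches the threshold. */
static int flush_run(const char *caller, size_t len, int count, int threshold)
{
    if (caller == NULL || count < threshold)
        return 0;
    printf("%.*s: %d differing callees\n", (int)len, caller, count);
    return 1;
}

/* Compare two sorted edge lists (original binary vs. our build) and return
 * how many callers have at least `threshold` edges present on one side only. */
int report_callgraph_diffs(const char **a, size_t na, const char **b, size_t nb,
                           int threshold)
{
    size_t i = 0, j = 0, cur_len = 0;
    const char *cur = NULL;     /* caller of the current run of differences */
    int count = 0, flagged = 0;

    while (i < na || j < nb) {
        int c = (i == na) ? 1 : (j == nb) ? -1 : strcmp(a[i], b[j]);
        if (c == 0) {           /* edge present in both graphs */
            i++;
            j++;
            continue;
        }
        const char *diff = (c < 0) ? a[i++] : b[j++];
        size_t len = caller_len(diff);
        if (cur == NULL || len != cur_len || strncmp(diff, cur, len) != 0) {
            flagged += flush_run(cur, cur_len, count, threshold);
            cur = diff;
            cur_len = len;
            count = 0;
        }
        count++;
    }
    return flagged + flush_run(cur, cur_len, count, threshold);
}
```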