University of Minnesota
Program Analysis for Security

Hands-on Assignment 2: binary rewriting

This assignment consists of three questions that ask you to try out techniques from binary rewriting, a class of program transformation techniques that can be used to provide isolation, enforce security policies, or instrument low-level properties of code. You'll be seeing what binary rewriting can do and also thinking about performance trade-offs, which can be important for making rewriting techniques practical. However, rewriting from a raw binary is challenging, so instead we'll work with rewriting code at the assembly-language level, which is easier and good for initial prototypes.

These questions don't build on any large pre-existing tools; you'll just write programs in your favorite programming language to create modified versions of assembly language text files. You'll need to run on a Linux machine that supports 32-bit binaries and has a 32-bit compatible version of GCC installed: we pass the flag -m32 to GCC to tell it to generate 32-bit binaries even on a 64-bit host. The sample rewriting programs are written in Perl, because that's your instructor's favorite scripting language, but for the code you write you can use any other language you're comfortable with, such as Python, Ruby, Lisp, ML, C++, etc. It would be a little bit more work to use pure C, because it doesn't have convenient higher-level features for string processing and data structures like hash tables, but you can use it if you want.

One Unix file-naming convention to be aware of when working with assembly language: the usual extension for assembly language files on Unix is .s, rather than .asm as is traditional on Windows. Specifically, a lowercase .s is used for assembly-language files that are generated directly by a compiler, while a capital .S is used for hand-written assembly-language files, which are also processed by the C preprocessor so they can contain features like #define macros.

A quick reminder on the modes of GCC: if you give GCC the option -S, it will stop after producing an assembly-language file with extension .s. If you give it the option -c, it will take in a C file or an assembly file and generate an object file with extension .o. If you give it a bunch of .o files and libraries, it will link them together into a final executable, whose name you should specify with the -o option. (If your program has only a single .c file, it's possible to compile and link in one step, but you'll never need to use that mode for this assignment.)

Remember that this is an individual assignment. It's permitted to discuss it at a high level with other students, but each student must submit their own answer, having written all the code and prose in it themselves. Provide proper attribution to any people (other than the instructor) or other resources (books, web sites) that you got ideas from.

Benchmarks setup

For testing the speed of your rewritten code, I've given you a benchmark suite of CPU-intensive programs, similar to a simplified version of the SPEC{int,fp}200{0,6} suites we've seen mentioned in the papers. However, this suite has just five programs, and to make the build process easier to manage, each of them is written in a single .c file. In order of increasing size, fib is the standard recursive implementation of the Fibonacci function, gzip and bzip2 are two well-known lossless compression programs, oggenc is an encoder for the Ogg Vorbis lossy audio compression format (similar to MP3, but not patent-encumbered), and gcc is the main phase of an older version of the GNU C compiler. You can get the source code in the following compressed archive: benchmark-programs.tar.bz2.

You'll also need some benchmark data. For oggenc, a good benchmark is an uncompressed audio file in .wav format that's 25-50MB or 5-10 minutes long. If you have a file like this lying around you can use it, or you can download a suitable file such as this historical speech from the Internet Archive. For gzip, a good test input is a file of about 100MB; you can create such a file by concatenating several other large files together. For bzip2, I recommend using the source code gcc.c as a test input: this is somewhat smaller than the test input for gzip, because bzip2 uses a more compute-intensive algorithm.

To help automate the process of running the benchmarks, I've given you a script named run-benchmarks.pl (updated 4/1 to add a missing declaration and fix a filename). The list of programs and their benchmark inputs is controlled by the table initialized under my %programs: you can either adjust this if you want to use different benchmark inputs, or else make sure the test files mentioned in the paragraph above are where the programs expect them to be. The initialization of the array @benchmarks_to_run controls which benchmarks the script will execute: for your initial testing you may want to run just one smaller one. The table initialized under my %rewrites controls which transformations to test, in addition to the untransformed case, which is called "baseline". For each transformation the table gives the program that performs the transformation (which should read from the standard input and write to the standard output), as well as two lists of extra arguments, if any, to pass to the compiler and to the linker. Again, there's a corresponding @rewrites_to_run array to let you control which transformations you want to test. When you run the script, it will compile, transform, and execute each benchmark under each transformation, and record the execution time by appending to the file results.

Replacement C library

One point to consider when designing a binary rewriting system is whether you also need to rewrite all the library code that a rewritten program calls. It usually makes a system easier to deploy if you can use pre-existing libraries, because it limits the amount of code a user needs to recompile; this is especially true if there are system libraries that may be closed-source or cumbersome to recompile. For questions 1 and 2b, you'll implement transformations that can work with an unmodified system library, but for 2c, you need to recompile the libraries to reserve a register. For question 3, you'll need to decide what is needed when designing your transformation.

The main system library on Unix-like systems is traditionally called libc, the C library. The real Linux C library is large and complicated to compile, but for the purposes of this project I've given you a simplified C library replacement that you can easily recompile, and use in place of the system library. This library consists of three files, named libc.h, libc.c, and libc-syscalls-start.S. Because this "library" is only a few files, it doesn't make sense to use the standard static library mechanisms (.a files); you'll just compile the library into two .o object files.

libc.h (updated 4/2 to remove duplicate declarations) is a single header file that replaces all of the separate header files that normally go with the library. In order to use this library with unmodified source programs, you'll need to make a directory of include files which are all links to the main file. Given a list of target locations from the file libc-h-copies.txt, you can do that with a sequence of shell commands like (bash syntax):

mkdir libc-includes
mkdir libc-includes/sys
for f in $(cat libc-h-copies.txt); do cp -lf libc.h libc-includes/$f; done

libc.c (updated 4/2 to fix the size of a termios structure and a problem with GCC versions prior to 4.5) is the main library implementation written in C, with a few uses of inlined assembly. If you study it you'll see that not all of the library features are completely implemented, but it should be enough for the benchmarks and other simple programs.

libc-syscalls-start.S contains the implementation of some low-level functionality that is best expressed completely in assembly language. As the name implies, this includes low level system calls, which are sent as requests to the kernel using interrupt 0x80, and the very first code that runs when the program starts executing (even before main()), named _start.

When you are compiling programs to use this replacement library, you should tell the compiler to avoid the standard library include files using the flag -nostdinc, and to use the replacement directory you created above with -Ilibc-includes. Analogously when linking, you should use -nostdlib to exclude the standard libraries, and specify libc.o and libc-syscalls-start.o explicitly.
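
To pull those flags together, here is a small Python sketch that builds the compile and link command lines (the names libc-includes, libc.o, and libc-syscalls-start.o are the ones used in this handout; -fno-stack-protector is also included, per the note about the stack protector below):

```python
# Sketch of helpers that build the GCC command lines for using the
# replacement library.  The file and directory names (libc-includes,
# libc.o, libc-syscalls-start.o) are the ones used in this handout.

def compile_cmd(src, obj):
    """Compile one .c file against the replacement headers only."""
    return ["gcc", "-m32", "-fno-stack-protector",
            "-nostdinc", "-Ilibc-includes",
            "-c", src, "-o", obj]

def link_cmd(objs, exe):
    """Link object files without the standard libraries or startup code."""
    return (["gcc", "-m32", "-nostdlib", "-o", exe]
            + objs + ["libc.o", "libc-syscalls-start.o"])
```

A script like run-benchmarks.pl can then splice any per-transformation extra arguments into these lists before invoking GCC.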

Finally, GCC has a security feature for protecting return addresses on the stack which is incompatible with the replacement library, because the replacement library does not set up the %gs segment register the way that feature expects. Some Linux distributions including Ubuntu have configured their compiler so that this option is turned on by default, but when using the replacement library you'll need to turn it off with the option -fno-stack-protector or else your program will crash mysteriously.

1. Adding no-op instructions (20 pts)

As a first example we'll look at a transformation that clearly should not change the program's behavior at all: adding no-op instructions. This would already be a difficult transformation to do on a binary, since inserting even one instruction changes the location of every instruction after it, but it's pretty easy to do at the assembly-language level. One additional feature of the add-nops.pl script is that you can control the length in bytes of the added no-op instructions using a command-line option.
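
The supplied script is in Perl, but the core idea fits in a few lines of any language. Here's a minimal Python sketch, using the simplifying assumption (adequate for GCC -S output) that instruction lines start with a tab and are not assembler directives:

```python
def add_nops(asm_text, nop="nop"):
    """Insert a no-op after every instruction line of an assembly file.

    Heuristic, adequate for GCC -S output: an instruction line starts
    with a tab and is not an assembler directive (which starts with '.').
    Longer no-op encodings could be passed in place of "nop".
    """
    out = []
    for line in asm_text.splitlines():
        out.append(line)
        stripped = line.strip()
        if line.startswith("\t") and stripped and not stripped.startswith("."):
            out.append("\t" + nop)  # one no-op per original instruction
    return "\n".join(out) + "\n"
```

Labels and directives are left untouched, so jump targets still work; only the dynamic instruction count changes.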

Try running the add-nops transformation on the supplied benchmarks, including with different lengths of no-ops. If you had a naive model of how fast a CPU runs, with every instruction taking an equal amount of time, you might expect adding no-ops to exactly double the execution time. But we can see that's not the case: clearly different instructions take different amounts of time, with no-ops being faster than average. And though there may be an occasional outlier, you should generally see that the transformed programs slow down further as you add longer no-ops.

Your job for this question is to explore in more detail which features of a program affect how much the added no-op transformation slows the program down. Thinking back to when you studied computer architecture or CPU organization (and/or reviewing these topics), which kinds of factors affect CPU performance?

Based on this, construct two example programs, one which slows down a lot when you apply the add-nops transformation (more than a 2x slowdown is possible), and another that slows down very little. Note that what you're trying to maximize or minimize here is the ratio between the run-time of the transformed program to the run-time of the un-transformed program. So you should think both about what factors affect the running time of the no-ops, and what factors affect the running time of the rest of the program.

Your writeup should include a description of the programs, source code, and a description of which features of the programs affect their running time.

2. Stopping infinite loops (30 pts)

The next transformation is somewhat more interesting, though it might still be a stretch to call it a security application. Our goal here is to stop programs from running forever, by adding another way of controlling how long they run for. The effect is somewhat like the CPU time resource limit that can be enforced by an operating system, but we're not going to use real time as the measurement of program execution. Instead, we'll give the program a 64-bit counter, and make sure that it periodically decrements the counter: if the counter reaches zero, the program will exit. (With a 64-bit counter, it should be no problem to set the count large enough for any execution you would want to allow.)

2a. Design (10 pts)

Here's the basic design. We'll have a function named check_and_count whose job is to decrement the counter and exit if the counter ever reaches zero. We'll use binary rewriting to insert calls to this function. It doesn't need to be called too often, but we have to be sure that the program never gets into an infinite loop that doesn't include calling check_and_count. Our approach is going to be to insert calls to check_and_count before control transfer instructions. Specifically, let me propose the following three rules:

  1. Insert a call to check_and_count before each indirect call or jump (one whose target address is computed at runtime).
  2. For a direct jump, call, or branch to another instruction within the program, insert a call to check_and_count if the target instruction occurs before the jump (a "backwards jump").
  3. Insert a call to check_and_count before each call to a library function.

First, why is each of these rules necessary? For each rule, explain how, if it were not present, you could write a program that executed for a very long time without ever calling check_and_count.

Second, are these three rules sufficient? Can you think of any other ways a program could cause an infinite loop despite these rules? If so, how would you amend the rules to fix the problem?

2b. Slow implementation (10 pts)

First we'll implement the simplest version of our loop-breaking transformation, which will require the least code, but will not be very fast. We'll just use the call instruction to insert a call at every location where the counter should be decremented and checked. I've already implemented the routine check_and_count for you, in the file check_and_count.c. (The most complex part of this function is actually parsing the count from the environment variable BREAK_LOOPS_COUNT, which you shouldn't need to change. Note that if this variable is missing, we just supply a very large default value, so the program executes as normal; this is fine for benchmarking purposes.)

However, we have to be careful about how we call this function. The normal calling conventions expect some parts of the CPU's state to be saved and restored by the calling function, but since we might insert calls to check_and_count anywhere, there might not be existing code to do the saving. To deal with this issue, we'll take the approach of writing a separate function in assembly language named wrap_check_and_count. The purpose of wrap_check_and_count will be to save all the parts of the machine state that need to be saved, call check_and_count, and then restore the state. Your classmate Ben has started the file wrap_check_and_count.S, but hasn't written any instructions for saving and restoring state.

Thus there are two pieces of code you'll need to modify for this part. First, you'll need to add some instructions to save and restore machine state in wrap_check_and_count.S. Second, you'll need to modify the incomplete break-loops.pl (updated 4/1 to remove a spurious assignment) script to add calls to wrap_check_and_count in the appropriate places, as described in rules 1-3 above. (If you don't like Perl, feel free to rewrite this script in another language, though the changes needed are small enough that it's probably more efficient to just use the given version.) If you also proposed new rules in 2a, you can implement them too if you want, but you don't have to.
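
For orientation, here is a Python sketch of the insertion logic that rules 1-3 call for (the real break-loops.pl is in Perl and handles more of the assembly syntax; this version assumes labels end with ":" and treats any call to a function not defined in the current file as a library call):

```python
import re

# Sketch of rules 1-3: scan assembly lines in order and, for each
# control transfer, decide whether a checking call must be inserted
# before it.  Simplifications: labels are tracked per file, and a call
# to any function not defined in this file counts as a library call.

CHECK = "\tcall wrap_check_and_count"

def insert_checks(lines):
    defined = {l.strip().rstrip(":") for l in lines if l.strip().endswith(":")}
    seen = set()  # labels whose definitions we've already passed
    out = []
    for line in lines:
        s = line.strip()
        if s.endswith(":"):
            seen.add(s.rstrip(":"))
        m = re.match(r"(call|j[a-z]+)\s+(\S+)$", s)
        if m:
            op, target = m.groups()
            if (target.startswith("*")            # rule 1: indirect call/jump
                    or target in seen             # rule 2: backwards jump
                    or (op == "call" and target not in defined)):  # rule 3
                out.append(CHECK)
        out.append(line)
    return out
```

Forward direct jumps within the file get no check, which is what makes this cheaper than checking at every control transfer.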

2c. Faster implementation (10 pts)

The translation you implemented for 2b is pretty slow, for several reasons: every call to wrap_check_and_count requires two levels of function calls, decrementing a 64-bit value in memory takes 6 instructions, and saving and restoring the flags register is expensive. So next you'll implement a different design that is more efficient. Here are the key differences:

  • Store only the low-order bits of the count in a dedicated register, %esi. Inline the common case of the decrement and check, so that you need to call a function only when %esi reaches zero.
  • Compile the program and libraries so that they do not otherwise use %esi, but initialize it when the program starts.
  • Perform the decrement and check using instructions that do not modify the processor flags, lea and jecxz, so the flags do not need to be saved and restored in the common case.

Of course you'll also have to think about how to move the data around as needed: for instance, how will you get the value of %esi to and from check_count()? And given that jecxz only operates on %ecx, you'll need to move the count between %esi and %ecx.
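
One possible shape for the inline sequence is sketched below, as the Python that a rewriter could use to generate it. The slow-path label is hypothetical, and its stub (which would call the checking function, reload the count, swap the registers back, and jump back) is not shown; note that lea, register-to-register xchg, and jecxz all leave the processor flags unmodified:

```python
# Sketch of the inline instruction sequence for the faster design.
# Assumptions: the low-order count lives in %esi, and a per-site
# slow-path label .Lcount_zero_N exists elsewhere (not shown).

def inline_check(site):
    """Return the instruction lines to insert at check site number `site`."""
    return [
        "\tlea -1(%esi), %esi",            # decrement count; flags untouched
        "\txchg %esi, %ecx",               # jecxz can only test %ecx
        "\tjecxz .Lcount_zero_%d" % site,  # slow path when count hits zero
        "\txchg %esi, %ecx",               # common case: swap back, continue
    ]
```

If the jecxz is taken, the registers are still swapped, so the slow-path stub has to account for that before resuming.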

For this translation you also need to recompile all the code in the program with the GCC option -ffixed-esi, which tells it not to use %esi. This also means you'll need to use the replacement libc mentioned above. However, you don't need to apply the translation that adds calls to check_and_count to the C library itself. Doing this in a straightforward way would cause infinite loops, since check_and_count itself depends on C library services. It would be possible to avoid that problem, but you don't need to bother.

Measure the performance of the two different implementations using the supplied benchmarks, and report the results. (If the version that's supposed to be faster isn't faster, you may be doing something wrong.) Also, report the smallest value for BREAK_LOOPS_COUNT that allows the bzip2 benchmark to complete successfully compressing gcc.c.

3. Build your own transformation (50 pts)

Last but certainly not least, the final part of the assignment is to design and implement a binary translation of your own, which uses the techniques and infrastructure from the previous steps and from the papers we've discussed. Your translation should have some application to security, but that can be broadly defined.

Your translation doesn't need to be very complex: aim for roughly the level of the transformations in problem 2, on the order of 100 lines of new code counting both the transformation itself and any helper code the translated code calls. Design and implementation are both important, and you should give consideration both to your security goal and to avoiding unnecessary overhead.

Here are some examples of kinds of transformations that might be suitable, but you aren't limited to these:

  • A simple version of CFI (say with one identifier each for calls and returns)
  • Profile indirect jump targets to estimate how many different IDs could be used for CFI
  • Randomize some aspect of the code or data to frustrate code injection attacks
  • Store return addresses on a shadow stack to stop stack-smashing attacks
  • XOR the return address with a canary value to stop stack-smashing attacks
  • Catch null-pointer dereference bugs and print a nice error message instead of a segmentation fault
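
To make one of these ideas concrete, here is a sketch of the XOR-canary transformation as an assembly-level rewrite. The canary value and the heuristic for spotting function entries (a label named by a preceding .type directive, as in GCC -S output) are assumptions for illustration; a real version would randomize the canary and handle functions with nonstandard epilogues:

```python
import re

CANARY = "0x5a5a5a5a"  # assumed fixed value; a real defense would randomize it
XOR = "\txorl $%s, (%%esp)" % CANARY

def xor_return_addresses(lines):
    """Mangle the saved return address at entry; unmangle it before ret.

    Heuristic: a function entry is a label named by a preceding
    '.type name, @function' directive, as in GCC -S output.
    """
    funcs = set()
    for l in lines:
        m = re.match(r"\.type\s+([\w.]+)\s*,\s*@function", l.strip())
        if m:
            funcs.add(m.group(1))
    out = []
    for l in lines:
        s = l.strip()
        out.append(l)
        if s.endswith(":") and s.rstrip(":") in funcs:
            out.append(XOR)      # mangle return address on entry
        if s == "ret":
            out[-1:-1] = [XOR]   # unmangle just before returning
    return out
```

An overwritten return address would be XORed with the canary before use and so point somewhere useless to the attacker, at the cost of two memory operations per call.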

You should turn in both a description of your design and the implementation. In the design description, be sure to explain what the security goal is. If your system is a defense, explain which attacks are blocked and which attacks are still possible. Also discuss runtime overhead: does your translation make the code noticeably slower, and why? If you wanted to further optimize the translation, how might you do so?

Submission logistics

Questions 1 and 2 of this assignment are due by 11:55pm on Monday, April 8th, and question 3 is due by 11:55pm on Monday, April 15th. You can submit your answers using a form on the class Moodle page.