Hands-on Assignment 2: binary rewriting, solutions

1. Adding no-op instructions (20 pts)

Construct two example programs: one that slows down a lot when you apply the add-nops transformation (more than a 2x slowdown is possible), and another that slows down very little. Note that what you're trying to maximize or minimize here is the ratio of the run-time of the transformed program to the run-time of the un-transformed program. So you should think both about what factors affect the running time of the no-ops, and about what factors affect the running time of the rest of the program.

Your writeup should include a description of the programs, source code, and a description of which features of the programs affect their running time.

There were a number of possible answers to this question. One way of classifying them is to think separately about the performance of the original instructions and the performance of the new no-ops. It seems easier to control the performance of the original instructions, both because you get to pick them and because the number of factors that affect the performance of a no-op is small. So if you treat the performance of the no-ops as fixed, then to minimize the performance ratio you want to make the original instructions as slow as possible, while to maximize the performance ratio you want to make the original instructions as fast as possible. This first approach is the one the instructor originally had in mind. On the other hand, it is actually possible to change the performance of no-ops in several ways, which leads to another class of solutions.

The instructor's idea for minimizing the ratio was to make the original instructions slow by having them be random accesses to a large array, so as to incur a large memory latency. There are a number of other slow operations that could work equally well; perhaps most extreme, you could make the original program call a slow system operation like sleep. The most effective approach I saw that fell more into the category of making the no-ops fast was to introduce data dependencies among the original instructions (such as having them all increment a stack location). This takes advantage of how pipelining works: the data dependencies slow the original instructions down somewhat, because each must be delayed waiting for the previous one to execute, but the no-ops can execute during the time when the processor is waiting for the original instructions, so they become virtually free.
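
As a concrete illustration, here's a minimal C sketch of the minimizing idea; the 64MB array size and the loop constants are arbitrary choices I'm assuming for a machine with roughly an 8MB last-level cache. Nearly every iteration costs a cache miss of hundreds of cycles, so a few extra no-ops per instruction barely register.

#include <stdlib.h>

int main(void) {
    /* An array much larger than the last-level cache, accessed at
       pseudo-random locations: nearly every access is a cache miss. */
    const unsigned size = 64 * 1024 * 1024;
    char *a = malloc(size);
    unsigned x = 12345;
    long sum = 0;
    long i;
    for (i = 0; i < 100000000L; i++) {
        x = x * 1103515245U + 12345U;  /* cheap linear congruential step */
        sum += a[x % size];
    }
    return sum & 1;  /* keep the compiler from deleting the loop */
}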

The instructor's idea for maximizing the ratio can be seen as either speeding up the original instructions or slowing the no-ops, since it was to create an original program that itself consisted of a very large number of no-ops. Of course no-ops as the original program execute quite quickly, but the major effect is seen once the size of the instrumented program becomes so large that it no longer fits in the CPU caches (on the machine I was testing with, I saw this only with the last-level cache, which is 8MB). This illustrates that though a no-op instruction doesn't contain any explicit memory references, like any instruction it implicitly requires a memory reference to fetch the instruction itself.
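
Here's a hedged sketch of that maximizing idea in C, using GCC inline assembly (.rept is a GNU assembler repetition directive). The sizes are assumptions: roughly an 8MB last-level cache, and a transformation that roughly doubles code size, so the ~6MB of original no-ops fits in cache while the transformed version does not.

int main(void) {
    int i;
    for (i = 0; i < 100; i++) {
        /* About 6MB of straight-line one-byte nops per iteration: the
           original just fits in an 8MB cache, but once the transformed
           code outgrows it, instruction fetch misses dominate. */
        asm volatile(".rept (6 * 1024 * 1024)\n\tnop\n\t.endr");
    }
    return 0;
}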

2. Stopping infinite loops (30 pts)

2a. Design (10 pts)

  1. Insert a call to check_and_count before each indirect call or jump (one whose target address is computed at runtime).
  2. For a direct jump, call, or branch to another instruction within the program, insert a call to check_and_count if the target instruction occurs before the jump (a "backwards jump").
  3. Insert a call to check_and_count before each call to a library function.

First, why is each of these rules necessary? For each rule, explain how, if it were not present, you could write a program that executed for a very long time without ever calling check_and_count.

2. Backwards direct jumps. This is the most basic requirement, since the usual way you would write an infinite loop is with a backwards direct jump at the end of the loop body. (Relatedly, the control-flow edge that closes a loop is called the loop's "back edge" in compiler terminology.) Note that as stated, the rule is ambiguous about an instruction that jumps directly back to itself, but such an instruction constitutes an infinite loop all by itself, so it definitely needs a check. (So "non-forward jumps" might actually be better terminology.) The implementation handles this correctly as long as the label of the jump target comes on the line before the instruction.

1. (Any) indirect jump. Without this rule, we could get the same effect as a backward direct jump from #2 by using an indirect jump whose target appeared earlier. For instance, if you had a function that called through a function pointer, and you initialized that pointer to point to the function itself, you'd get an infinite sequence of calls (a concrete sketch appears after the discussion of rule 3).

3. A call to a library function. This rule is needed because the library itself isn't rewritten in our approach. Without this rule, you could get the equivalent of the function pointer loop from #1 by using a standard library function that takes a callback, such as qsort.
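
For concreteness, here's a hedged C sketch of the function-pointer loop from #1 (the names fp and spin are mine). With optimization, GCC typically compiles the call through fp as an indirect tail jump (jmp *fp), so the program spins forever without even growing the stack; without optimization it's an infinite sequence of indirect calls, as described above.

/* Without rule 1, this runs forever with no check: the only transfer of
   control back to earlier code is through a function pointer, whose
   target isn't known until runtime. */
void (*fp)(void);

void spin(void) {
    fp();  /* indirect: with -O2 this typically becomes jmp *fp */
}

int main(void) {
    fp = spin;
    spin();
    return 0;
}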

Second, are these three rules sufficient? Can you think of any other ways a program could cause an infinite loop despite these rules? If so, how would you amend the rules to fix the problem?

If you make enough additional assumptions, the given rules are sufficient. A good way to flesh out some of those assumptions is to make the argument for why they're sufficient.

Suppose for contradiction that there were a program that obeyed the rules, but had an infinite loop that did not contain any checking calls. One possibility is that the infinite loop occurs entirely within the library; but let's assume that the library is written to not have any infinite loops. If the infinite loop consisted of both program and library code, it would have to include a call from the program to the library, but that would contradict rule 3.

For the case of a loop entirely in program code, let n be the total number of instructions in the program, and consider a sequence of n+1 instructions executed by the infinite loop. By the pigeonhole principle, there must be at least one instruction h that appears more than once in that sequence, so consider the sub-sequence of instructions from one occurrence of h to the next. Let i1 and i2 be any two adjacent instructions in that sub-sequence. If i1 is a sequential (non-jump) instruction whose address does not wrap around, then i2 will have a higher address. i1 cannot be an indirect jump (rule 1 would have put a checking call before it, inside the loop), and if it's a direct jump, rule 2 means it must be a forward jump, so in this case too i2's address will be higher. Chaining this reasoning across all the instructions in the sub-sequence, each instruction has a higher address than its predecessor, so the second occurrence of h would have a higher address than the first, contradicting that they are the same instruction.

In addition to the assumptions that were mentioned explicitly in the previous paragraph, another tricky point is return instructions. Return instructions are effectively indirect jumps in which the target address is read from the stack, so it would be sufficient to apply rule 1 to them. But you might notice that they are not included as indirect jumps by the parsing rules I gave you in the partial implementation, and so some further assumptions are needed. A well-behaved program will never have an infinite sequence of return instructions without intervening calls (among other things, since a return pops the stack, the stack would eventually underflow without matching calls), and you couldn't make an infinite loop without one of those calls being non-forward, by our usual argument. However, it's a well-known security problem that some programs have bugs allowing saved return addresses to be overwritten, so if a buggy program allowed both this and some way of causing a balancing stack push, you could create an infinite loop (see the sketch below).
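
As a concrete, if contrived, illustration of that last point, here's a hedged sketch (assuming 32-bit, non-PIC code, written with GCC inline assembly) in which a hand-written push plays the role of the balancing stack push, so the ret acts as a backward jump that the given parsing rules never instrument:

int main(void) {
    /* The push rebalances the stack each time around, and the ret jumps
       back to the local label: an infinite loop even though the code
       contains no backward direct jump and nothing the given parsing
       rules classify as an indirect jump. */
    asm volatile("0:\n\t"
                 "pushl $0b\n\t"
                 "ret");
    return 0;  /* never reached */
}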

2b. Slow implementation (10 pts)

Thus there are two pieces of code you'll need to modify for this part. First, you'll need to add some instructions to save and restore machine state in wrap_check_and_count.S. Second you'll need to modify the incomplete break-loops.pl script to add calls to wrap_check_and_count in the appropriate places as described in rules 1-3 above. (If you don't like Perl, feel free to rewrite this script in another language, though the changes needed are small enough that it's probably more efficient to just use the given version.) If you also proposed new rules in 2a, you can implement them too if you want, but you don't have to.

Here's a simple implementation of wrap_check_and_count:

wrap_check_and_count:
        pusha
        pushf
        call    check_and_count
        popf
        popa
        ret

pusha saves all the general-purpose registers, pushf saves the flags register that contains the condition-code flags, and the corresponding popa and popf restore them.

Several optimizations are possible here. Since check_and_count respects the normal calling conventions, it will already take care of saving and restoring those registers that are callee-saved: thus the only general purpose registers this code really needs to save are %eax, %ecx, and %edx. Also it turns out that on modern x86 architectures, pushf and popf are quite slow, in part because they might affect other control flags within the EFLAGS register. In this case we only need to save and restore the six condition code flags, and that can be done faster using the following longer sequence (from DynInst via Michael Oneil Lam):

                           /* Assume %eax already saved */
        lahf               /* Save SF, ZF, CF, AF, PF to %ah */
        seto    %al        /* Save OF to %al */
        push    %eax
        ...
        pop     %eax
        add     $0x7f, %al /* Restore OF via possible overflow */
        sahf               /* Restore SF, ZF, CF, AF, PF */

(However, the optimizations discussed in 2c are more important than these.)

Here's my implementation of the instrumentation code:

        if ($arg =~ /^\*/) {
            # Indirect jump
            $indir = "indirect";
            $dir = "unknown";
        } elsif ($arg =~ /^([.\w]+)/) {
            # Direct jump
            $indir = "direct";
            my $target = $1;    # the label captured by the pattern above
            if (exists $label2num{$target}) {
                if ($label2num{$target} < $insn_num) {
                    $dir = "backward";
                } elsif ($label2num{$target} > $insn_num) {
                    $dir = "forward";
                } else {
                    die;
                }
            } else {
                $dir = "external";
            }
        } else {
            die "Unexpected argument format in $opcode";
        }
        print "\t// \u$indir $type, $dir\n";
        if ($dir ne "forward") {
            print "\tcall\twrap_check_and_count\n";
        }

2c. Faster implementation (10 pts)

Decrementing %esi can be expressed in AT&T syntax as lea -1(%esi), %esi. (The reason to use lea rather than dec or sub, and jecxz rather than a compare-and-branch, is that neither lea nor jecxz modifies the condition-code flags, so the instrumentation doesn't have to save and restore them.) For using jecxz, the two tricky parts are getting the value into %ecx and controlling the direction of the jump. Obviously we want to take the value from %esi and put it into %ecx, but if we just did that with a move, it would clobber whatever value the program was keeping in %ecx. You could save the old value of %ecx on the stack, but a more efficient approach is to use %esi to store the old %ecx: i.e., swap the two registers using xchg before the branch, and swap them back afterward. (FYI, one performance caveat to be aware of is that xchg with a memory operand is slow because it's implicitly locked; but this doesn't apply to the all-register version.)

It would be convenient if there were a jecxnz instruction, since we want to skip a call when %ecx is not zero, as in:

        jecxnz    no_call
        call      wrap_check_and_count
no_call:
        ....

Unfortunately there's no such jecxnz instruction. The trick for getting the effect of the negated condition is a "branch-over-jump" construction: add an unconditional jump that skips the call, and use jecxz to skip over that unconditional jump. Another way of looking at this is that the most general form of a two-way branch specifies two locations: go to A if the condition is true, and to B otherwise. This can be expressed by the two-instruction sequence jcond A; jmp B, and in that sequence it's easy to see that you can swap A and B to get the same effect as negating the condition.

A final practical point is that the labels here will need to occur many times in the same assembly file, so all the occurrences can't have the same name. The most general fix is just to assign them sequential numbers (you can see that this is how the compiler generates its labels). But Unix assemblers have another convenient feature intended for situations like this: small integer labels, which are referenced with an "f" or "b" suffix for forward or backward and always refer to the closest occurrence. So in sum, our instrumentation sequence looks like:

	leal	-1(%esi), %esi	/* decrement the countdown; lea leaves the flags alone */
	xchg	%esi, %ecx	/* countdown into %ecx, old %ecx parked in %esi */
	jecxz	1f		/* countdown reached zero: make the check */
	jmp	2f		/* otherwise, skip the call */
1:
	call	wrap_check_and_count
2:
	xchg	%ecx, %esi	/* restore %ecx; (possibly new) countdown back in %esi */

Because wrap_check_and_count is now called only when %esi counts down to zero, from here on out all the instrumentation code we write will be called about chunk_size times (more than a hundred million times) less often than the code we just wrote, so the performance of the remaining code is no longer so important, but of course it still needs to be correct. We need to pass the value that used to be in %esi, and is now in %ecx, on to check_and_count, and take back a possibly modified version. We could do this using the normal calling conventions, but since check_and_count isn't recursive, an easier way is to just use a global variable. The wrapper looks like:

wrap_check_and_count:
        mov     %ecx, countdown
        pusha
        pushf
        call    check_and_count
        popf
        popa
        mov     countdown, %ecx
        ret

A common mistake was to omit the saving and restoring of registers and flags in this code (such as by skipping wrap_check_and_count and calling straight to check_and_count). This might seem like an optimization, but because the code inside the zero check is rarely executed, the performance difference would be hard to measure. What's worse, the program will behave incorrectly in the cases when the function is called (and clobbers the flags and caller-save registers), but because this case is now so much less frequent, the bug will probably be difficult to track down and will not easily show up in testing. A design approach you can take to make this sort of problem easier to test is to make the number of checks between function calls a parameter (like chunk_size in my code below), and to test with this parameter set to a very small value (see the sketch after the code).

Then there are a number of choices for how to modify check_and_count. Having the looping countdown makes it somewhat more complicated: in particular, you need to differentiate the case when the count hasn't been set up yet from the case when it has run out. Here's what my code looked like:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

unsigned long long the_count = 0;

unsigned long countdown = 0;
int initialized = 0;

/* Number of checks between calls to check_and_count; must be a power
   of two for the masking below to work. */
static const unsigned long chunk_size = 0x8000000;

void check_and_count(void) {
    assert(countdown == 0);
    if (!initialized) {
        assert(the_count == 0);
        /* read into the_count as before */
        /* Peel off the remainder modulo chunk_size first, so that every
           later chunk is exactly chunk_size. */
        unsigned long next_count = the_count & (chunk_size - 1);
        the_count -= next_count;
        countdown = next_count;
        initialized = 1;
        if (countdown != 0)
            return;
    }
    if (the_count == 0) {
        fprintf(stderr, "BREAK_LOOPS_COUNT exceeded\n");
        exit(42);
    } else {
        unsigned long next_count;
        if (the_count >= (unsigned long long)chunk_size) {
            next_count = chunk_size;
        } else {
            next_count = the_count;
        }
        the_count -= next_count;
        countdown = next_count;
    }
}

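Following the testability suggestion above, one possible tweak (hypothetical, not part of the original handout) is to make the chunk size a compile-time parameter, so a test build exercises the rarely-run path constantly:

/* Hypothetical: build a test version with gcc -DCHUNK_SIZE=4 so that
   wrap_check_and_count runs every few checks, making any
   register-clobbering bug show up immediately in testing. The value
   must stay a power of two for the masking in check_and_count to work. */
#ifndef CHUNK_SIZE
#define CHUNK_SIZE 0x8000000
#endif
static const unsigned long chunk_size = CHUNK_SIZE;
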
There's one more thing we need to worry about: the initial value of %esi. check_and_count can tell for itself when it's being called for the first time, but there's no similar flag for the high-speed instrumentation, and adding one would slow it down. It also isn't safe to assume that %esi will have any particular value at the start of the program, unless we enforce it. Luckily this is fairly straightforward to do with our modified libc, which has very little assembly code: the only hand-written assembly code that might modify %esi is in libc-syscalls-start.S. If you look at _start, which is the very first code that runs, you will see that it uses %esi as a temporary to hold the environment variable pointer. But after that it's unused, so you can set it to whatever you'd like. With our approach of decrementing before the check, we want to set it to 1 so that check_and_count gets called at the very first opportunity.

Measure the performance of the two different implementations using the supplied benchmarks, and report the results. (If the version that's supposed to be faster isn't faster, you may be doing something wrong.) Also, report the smallest value for BREAK_LOOPS_COUNT that allows the bzip2 benchmark to complete successfully compressing gcc.c.

For the implementations described above, the minimal values of BREAK_LOOPS_COUNT I saw varied between 414979020 and 414979050. I realized in retrospect that the results vary slightly because bzip2 does the equivalent of strdup(argv[0]), so there's code whose execution count depends on the length of the name of your binary. Differing values might indicate a bug (for instance, if you forgot to initialize %esi), or they could be caused by differing implementation choices. A good sanity check, which I should have asked you to do, is to verify that the slow and fast versions require the same BREAK_LOOPS_COUNT.

3. Build your own transformation (50 pts)

Last but certainly not least, the final part of the assignment is to design and implement a binary translation of your own, using the techniques and infrastructure from the previous steps and the approaches discussed in the papers. Your translation should have some application to security, though that can be broadly defined.

Since everyone did their own transformation, I won't give a sample solution here.