Monday 28 May 2007

C Function Calling Convention

Ever wondered what happens inside the computer when a program runs? Well, there are many levels that need to be explained. Here I'm going to explain the abstraction layer that lies between assembly language (the most abstract of the machine-level languages) and C (the least abstract of the structured languages). And although a good assembler programmer would say he writes structured programs, and a good C programmer would claim his programs translate 1:1 to machine code, there is a great difference between the two languages. This difference has to be handled somehow by the compiler (the best example is GCC's C compiler, which actually translates C code into assembler first).

The most obvious difference is that C programs are structured: first logically into functions, and those into blocks. Secondly, C lets you use data structures and operate on variables (furthermore, variables are typed in C). In assembler the only thing that exists is a one-dimensional sequence of bits. You can put whatever you want there, and it will be executed from the beginning to the end (in most cases). There are no variables, no functions, no types. As a matter of fact, assembler also shares an approach known from languages of a very high abstraction level, like LISP, because it does not distinguish variables or data from executable code. Executable code is whatever you address with the %eip register (the instruction pointer; guess what it does, or google it), and data is whatever you address with any other register. In assembler, everything depends on the context.

What, then, is the C Function Calling Convention? It's the way a C compiler translates one programming approach into the other (C into assembler). It's also the way assembler programmers tend to follow when they want to practice structured programming. There is basically only one instruction in assembler that lets you change program flow: jmp [offset] (there are many others, but all of them derive from it; the conditional jumps are the most important ones, but they don't matter for our topic). Now, the CFCC is the way function calls are translated into the flat assembler structure. In order to understand it you will need to know the basic data structure called the stack and two additional registers named %ebp (base pointer) and %esp (stack pointer). I assume basic programming knowledge, so I won't bother explaining what a stack is; rather, I will explain what it looks like in a program's memory.

So, a program's stack is an ordinary stack structure whose top address is pointed to by the %esp register. The operating system takes care of giving the stack the memory it needs (or telling the program when there is none left). You don't need to allocate memory for the stack: push and pop operations do it automatically, as does directly adding some offset to %esp or subtracting it. I will concentrate on Intel-compatible processors, as they're the most popular ones and the only ones I know. On an Intel processor the stack grows down. This is very important to remember! I would say it is crucial at some points and in some uses. It means that pushing a value onto the stack actually decreases %esp, and popping a value from the top increases it. So, to extend the stack, you subtract some value from %esp.
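To make the "grows down" part concrete, here is a minimal sketch of a toy stack in C. It is only a model, not real machine state: the names toy_stack, toy_init, toy_push and toy_pop are made up for this illustration, and I assume 32-bit (4-byte) stack slots as on a 32-bit Intel processor.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* A toy model of a downward-growing stack (hypothetical, for illustration). */
struct toy_stack {
    uint32_t mem[STACK_WORDS]; /* stack memory, one 32-bit word per slot */
    uint32_t esp;              /* byte offset of the current top of the stack */
};

/* An empty stack starts with esp just past the highest address. */
static void toy_init(struct toy_stack *s)
{
    s->esp = STACK_WORDS * 4;
}

/* push: first decrease esp by one word, then store the value there. */
static void toy_push(struct toy_stack *s, uint32_t value)
{
    assert(s->esp >= 4);       /* no room left would be a stack overflow */
    s->esp -= 4;
    s->mem[s->esp / 4] = value;
}

/* pop: read the value at esp, then increase esp by one word. */
static uint32_t toy_pop(struct toy_stack *s)
{
    assert(s->esp < STACK_WORDS * 4); /* popping an empty stack is a bug */
    uint32_t value = s->mem[s->esp / 4];
    s->esp += 4;
    return value;
}
```

Pushing two values onto a fresh toy stack moves esp from 64 to 60 and then to 56; popping them moves it back up to 64.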

Now that we know how the stack works, we can proceed. To do so, let's consider the following function:

int func(int a, int b, int c) { return 0; }

When you call func(1,2,3); quite a lot happens at the assembler level. The first thing done is argument handling: the arguments are all pushed onto the stack in reverse order. Do you already see what that means? Yes, as a matter of fact it makes variable-length argument lists really easy to implement in C, since the first argument always ends up at a known place; the va_* macros do it for you. Next, call [offset] is executed. This instruction is a form of jmp, but it does one additional thing: it pushes the current %eip value (the address of the instruction following the call) onto the stack. Then it jumps (by overwriting %eip) to the given offset. Let's have a look at our stack at this point:

[....]
[....]
[3]
[2]
[1]
[saved %eip] <- %esp
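Since the va_* macros were mentioned above, here is a small sketch of how they are used from the C side. The function name sum_ints is my own invention for the example; the macros themselves come from the standard header stdarg.h. Note that the function has to be told the number of arguments somehow (here through count), because the stack itself doesn't record it.

```c
#include <stdarg.h>

/* Sums the `count` int arguments that follow the fixed parameter. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);            /* start right after the last named parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* fetch the next argument as an int */
    va_end(ap);

    return total;
}
```

A call like sum_ints(3, 1, 2, 3) pushes 3, 2, 1 and then the count, so the count sits closest to the saved %eip no matter how many arguments follow it.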

When a function is entered, the first thing it does is save the base pointer by pushing it onto the stack, and then copy the current %esp into the %ebp register. The purpose of the base pointer is to remember the address in memory where all this crucial data is stored while %esp is being manipulated by pushes and pops. Then the stack is extended for the local variables and the code of our function is executed. In the meantime our stack looks like this:

[....]
[....]
[3]
[2]
[1]
[saved %eip]
[saved %ebp] <- %ebp
[....] \
[....] > local variables
[....] |
[....] / <- %esp
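The picture above can be replayed on a toy machine written in C. This is only a sketch under my own assumptions: the names toy_cpu, push, at_ebp and enter_func are hypothetical, slots are 4 bytes wide, and I reserve 16 bytes for locals just to match the four slots drawn in the diagram.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_BYTES 256

/* Toy 32-bit machine: a small memory plus the two stack registers. */
struct toy_cpu {
    uint32_t mem[MEM_BYTES / 4];
    uint32_t esp;
    uint32_t ebp;
};

static void push(struct toy_cpu *c, uint32_t v)
{
    assert(c->esp >= 4);
    c->esp -= 4;
    c->mem[c->esp / 4] = v;
}

/* Read the 32-bit word at ebp + offset, like offset(%ebp) in assembler. */
static uint32_t at_ebp(const struct toy_cpu *c, int32_t offset)
{
    uint32_t addr = c->ebp + (uint32_t)offset;
    assert(addr % 4 == 0 && addr < MEM_BYTES);
    return c->mem[addr / 4];
}

/* Build the frame of func(1, 2, 3) up to the point where its body starts. */
static void enter_func(struct toy_cpu *c, uint32_t caller_ebp,
                       uint32_t return_addr)
{
    c->esp = MEM_BYTES;
    c->ebp = caller_ebp;

    /* Caller: push the arguments in reverse order, then "call". */
    push(c, 3);
    push(c, 2);
    push(c, 1);
    push(c, return_addr);   /* what call does before jumping */

    /* Callee prologue. */
    push(c, c->ebp);        /* pushl %ebp                      */
    c->ebp = c->esp;        /* movl  %esp, %ebp                */
    c->esp -= 16;           /* subl  $16, %esp (four locals)   */
}
```

After enter_func, at_ebp(c, 0) is the saved %ebp, at_ebp(c, 4) is the saved %eip, and the arguments 1, 2 and 3 sit at offsets 8, 12 and 16. Locals live at negative offsets, starting at -4.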

The next interesting moment is the return statement. On return, the return value is stored in the %eax register, which is otherwise a general-purpose register. Then %ebp is copied into %esp, so that %esp points at the saved %ebp. As you have probably guessed, that value is then popped back into %ebp. Now a jump is made to the location given by the saved %eip, which is now at the top of the stack (and is popped in the process). The last thing to be done is to remove the arguments that were pushed onto the stack before the call, by adding their total size to %esp (remember, the stack grows down). This is done by the calling function. At this point we've got the stack in the state from before the call and the return value in the %eax register. Note that you can't expect all registers to hold the values you assigned to them before the call: in the usual convention on Intel processors %eax, %ecx and %edx may be overwritten by the called function, so you need to save and restore them manually, while %ebx, %esi, %edi, %ebp and %esp are preserved (and %eip is as expected).
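The whole round trip, call and return, can be sketched on a toy machine in C. Again this is a model built on my own assumptions, not what a compiler literally emits: toy_cpu, push, pop, toy_func and toy_call are hypothetical names, the return address is simply passed in as a number, and 16 bytes of locals are reserved for no better reason than to have something to clean up.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_BYTES 256

/* Toy 32-bit machine with just the registers this walk-through needs. */
struct toy_cpu {
    uint32_t mem[MEM_BYTES / 4];
    uint32_t esp, ebp, eip, eax;
};

static void push(struct toy_cpu *c, uint32_t v)
{
    assert(c->esp >= 4);
    c->esp -= 4;
    c->mem[c->esp / 4] = v;
}

static uint32_t pop(struct toy_cpu *c)
{
    assert(c->esp < MEM_BYTES);
    uint32_t v = c->mem[c->esp / 4];
    c->esp += 4;
    return v;
}

/* Callee: prologue, body (return 0;), epilogue. */
static void toy_func(struct toy_cpu *c)
{
    push(c, c->ebp);    /* pushl %ebp                               */
    c->ebp = c->esp;    /* movl  %esp, %ebp                         */
    c->esp -= 16;       /* subl  $16, %esp (room for locals)        */

    c->eax = 0;         /* return 0; the value goes into %eax       */

    c->esp = c->ebp;    /* movl  %ebp, %esp (drop the locals)       */
    c->ebp = pop(c);    /* popl  %ebp (restore caller's base ptr)   */
    c->eip = pop(c);    /* ret (pop the saved %eip and jump there)  */
}

/* Caller: func(1, 2, 3); returns whatever the callee left in eax. */
static uint32_t toy_call(struct toy_cpu *c, uint32_t return_addr)
{
    push(c, 3);             /* arguments in reverse order */
    push(c, 2);
    push(c, 1);
    push(c, return_addr);   /* the push that call performs */
    toy_func(c);
    c->esp += 12;           /* addl $12, %esp (three 4-byte arguments) */
    return c->eax;
}
```

If you start with esp at 256 and some value in ebp, both are back to exactly those values after toy_call returns, eip holds the return address, and eax holds 0.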

Why is that useful to know? Firstly, it's very helpful in debugging, where you can trace your program flow just by following the chain of base pointers (they do form a chain, if you think about it). Furthermore, you now know where your local variables are stored and can trace them too! What other uses does it have? God knows, but not only He. Aleph1 also knew, and he wrote a great and famous paper about one of those other uses. To give you a clue, I'll just say that most of today's exploits wouldn't exist if the stack grew up and not down! Just google for it. As for other reading, I would point to the book "Programming from the Ground Up" by Jonathan Bartlett, which is a great introduction to programming in assembler, and more. And it's freely available online!
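Coming back to the base pointer chain for a moment: here is a rough sketch of how a debugger-like tool could follow it, done on a toy memory image rather than on a live process. The function name toy_backtrace and the rule that a saved %ebp of 0 ends the chain are assumptions of this example; real programs mark the outermost frame in various ways.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_BYTES 256

/*
 * Walk a chain of saved base pointers in a toy memory image.
 *   mem[ebp / 4]      holds the caller's saved %ebp,
 *   mem[ebp / 4 + 1]  holds the return address into the caller.
 * A saved %ebp of 0 marks the outermost frame (toy-model assumption).
 * Returns the number of return addresses written to out[].
 */
static size_t toy_backtrace(const uint32_t *mem, uint32_t ebp,
                            uint32_t *out, size_t max)
{
    size_t n = 0;

    while (ebp != 0 && n < max) {
        assert(ebp % 4 == 0 && ebp + 4 < MEM_BYTES);
        out[n++] = mem[ebp / 4 + 1];  /* return address of this frame */
        ebp = mem[ebp / 4];           /* follow the link to the caller */
    }
    return n;
}
```

Given three nested frames, the function yields the three return addresses from the innermost call outwards, which is the same order a debugger's backtrace lists them in.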

I hope this was helpful. Responses and feedback are welcome.
