How does the stack work in assembly language?

I'm currently trying to understand how the stack works, so I've decided teach myself some assembly language, I'm using this book:

http://savannah.nongnu.org/projects/pgubook/

I'm using Gas and doing my development on Linux Mint.

I'm a bit confused by something:

As far as I was aware a stack is simply a data structure. So I assumed if I was coding in assembly I'd have to implement the stack myself. However this doesn't seem to be the case as there are commands like

pushl
popl

So when coding in assembly for the x86 architecture and using the Gas syntax: is the stack just a data structure that's already implemented? Or is it actually implemented at the hardware level? Or is it something else? Also would most assembly languages for other chip sets have the stack already implemented?

I know this is a bit of a foolish question but I'm actually quite confused by this.


I think primarily you're getting confused between a program's stack and any old stack.

A Stack

Is an abstract data structure which consists of information in a Last In First Out system. You put arbitrary objects onto the stack and then you take them off again, much like an in/out tray, the top item is always the one that is taken off and you always put on to the top.

A Programs Stack

Is a stack, it's a section of memory that is used during execution, it generally has a static size per program and frequently used to store function parameters. You push the parameters onto the stack when you call a function and the function either address the stack directly or pops off the variables from the stack.

A programs stack isn't generally hardware (though it's kept in memory so it can be argued as such), but the Stack Pointer which points to a current area of the Stack is generally a CPU register. This makes it a bit more flexible than a LIFO stack as you can change the point at which the stack is addressing.

You should read and make sure you understand the wikipedia article as it gives a good description of the Hardware Stack which is what you are dealing with.

There is also this tutorial which explains the stack in terms of the old 16bit registers but could be helpful and another one specifically about the stack.

From Nils Pipenbrinck:

It's worthy of note that some processors do not implement all of the instructions for accessing and manipulating the stack (push, pop, stack pointer, etc) but the x86 does because of it's frequency of use. In these situations if you wanted a stack you would have to implement it yourself (some MIPS and some ARM processors are created without stacks).

For example, in MIPs a push instruction would be implemented like:

addi $sp, $sp, -4  # Decrement stack pointer by 4  
sw   $t0, ($sp)   # Save $t0 to stack  

and a Pop instruction would look like:

lw   $t0, ($sp)   # Copy from stack to $t0  
addi $sp, $sp, 4   # Increment stack pointer by 4  

(I've made a gist of all the code in this answer in case you want to play with it)

I have only ever did most basic things in asm during my CS101 course back in 2003. And I had never really "got it" how asm and stack work until I've realized that it's all basicaly like programming in C or C++ ... but without local variables, parameters and functions. Probably doesn't sound easy yet :) Let me show you (for x86 asm with Intel syntax).


1. What is the stack

Stack is usually a contiguous chunk of memory allocated for every thread before they start. You can store there whatever you want. In C++ terms (code snippet #1):

const int STACK_CAPACITY = 1000;
thread_local int stack[STACK_CAPACITY];

2. Stack's top and bottom

In principle, you could store values in random cells of stack array (snippet #2.1):

stack[333] = 123;
stack[517] = 456;
stack[555] = stack[333] + stack[517];

But imagine how hard would it be to remember which cells of stack are already in use and wich ones are "free". That's why we store new values on the stack next to each other.

One weird thing about (x86) asm's stack is that you add things there starting with the last index and move to lower indexes: stack[999], then stack[998] and so on (snippet #2.2):

stack[999] = 123;
stack[998] = 456;
stack[997] = stack[999] + stack[998];

And still (caution, you're gonna be confused now) the "official" name for stack[999] is bottom of the stack.
The last used cell (stack[997] in the example above) is called top of the stack (see Where the top of the stack is on x86).


3. Stack pointer (SP)

For the purpose of this discussion let's assume CPU registers are represented as global variables (see General-Purpose Registers).

int AX, BX, SP, BP, ...;
int main(){...}

There is special CPU register (SP) that tracks the top of the stack. SP is a pointer (holds a memory address like 0xAAAABBCC). But for the purposes of this post I'll use it as an array index (0, 1, 2, ...).

When a thread starts, SP == STACK_CAPACITY and then the program and OS modify it as needed. The rule is you can't write to stack cells beyond stack's top and any index less then SP is invalid and unsafe (because of system interrupts), so you first decrement SP and then write a value to the newly allocated cell.

When you want to push several values in the stack in a row, you can reserve space for all of them upfront (snippet #3):

SP -= 3;
stack[999] = 12;
stack[998] = 34;
stack[997] = stack[999] + stack[998];

Note. Now you can see why allocation on the stack is so fast - it's just a single register decrement.


4. Local variables

Let's take a look at this simplistic function (snippet #4.1):

int triple(int a) {
    int result = a * 3;
    return result;
}

and rewrite it without using of local variable (snippet #4.2):

int triple_noLocals(int a) {
    SP -= 1; // move pointer to unused cell, where we can store what we need
    stack[SP] = a * 3;
    return stack[SP];
}

and see how it is being called (snippet #4.3):

// SP == 1000
someVar = triple_noLocals(11);
// now SP == 999, but we don't need the value at stack[999] anymore
// and we will move the stack index back, so we can reuse this cell later
SP += 1; // SP == 1000 again

5. Push / pop

Addition of a new element on the top of the stack is such a frequent operation, that CPUs have a special instruction for that, push. We'll implent it like this (snippet 5.1):

void push(int value) {
    --SP;
    stack[SP] = value;
}

Likewise, taking the top element of the stack (snippet 5.2):

void pop(int& result) {
    result = stack[SP];
    ++SP; // note that `pop` decreases stack's size
}

Common usage pattern for push/pop is temporarily saving some value. Say, we have something useful in variable myVar and for some reason we need to do calculations which will overwrite it (snippet 5.3):

int myVar = ...;
push(myVar); // SP == 999
myVar += 10;
... // do something with new value in myVar
pop(myVar); // restore original value, SP == 1000

6. Function parameters

Now let's pass parameters using stack (snippet #6):

int triple_noL_noParams() { // `a` is at index 999, SP == 999
    SP -= 1; // SP == 998, stack[SP + 1] == a
    stack[SP] = stack[SP + 1] * 3;
    return stack[SP];
}

int main(){
    push(11); // SP == 999
    assert(triple(11) == triple_noL_noParams());
    SP += 2; // cleanup 1 local and 1 parameter
}

7. return statement

Let's return value in AX register (snippet #7):

void triple_noL_noP_noReturn() { // `a` at 998, SP == 998
    SP -= 1; // SP == 997

    stack[SP] = stack[SP + 1] * 3;
    AX = stack[SP];

    SP += 1; // finally we can cleanup locals right in the function body, SP == 998
}

void main(){
    ... // some code
    push(AX); // save AX in case there is something useful there, SP == 999
    push(11); // SP == 998
    triple_noL_noP_noReturn();
    assert(triple(11) == AX);
    SP += 1; // cleanup param
             // locals were cleaned up in the function body, so we don't need to do it here
    pop(AX); // restore AX
    ...
}

8. Stack base pointer (BP) (also known as frame pointer) and stack frame

Lets take more "advanced" function and rewrite it in our asm-like C++ (snippet #8.1):

int myAlgo(int a, int b) {
    int t1 = a * 3;
    int t2 = b * 3;
    return t1 - t2;
}

void myAlgo_noLPR() { // `a` at 997, `b` at 998, old AX at 999, SP == 997
    SP -= 2; // SP == 995

    stack[SP + 1] = stack[SP + 2] * 3; 
    stack[SP]     = stack[SP + 3] * 3;
    AX = stack[SP + 1] - stack[SP];

    SP += 2; // cleanup locals, SP == 997
}

int main(){
    push(AX); // SP == 999
    push(22); // SP == 998
    push(11); // SP == 997
    myAlgo_noLPR();
    assert(myAlgo(11, 22) == AX);
    SP += 2;
    pop(AX);
}

Now imagine we decided to introduce new local variable to store result there before returning, as we do in tripple (snippet #4.1). The body of the function will be (snippet #8.2):

SP -= 3; // SP == 994
stack[SP + 2] = stack[SP + 3] * 3; 
stack[SP + 1] = stack[SP + 4] * 3;
stack[SP]     = stack[SP + 2] - stack[SP + 1];
AX = stack[SP];
SP += 3;

You see, we had to update every single reference to function parameters and local variables. To avoid that, we need an anchor index, which doesn't change when the stack grows.

We will create the anchor right upon function entry (before we allocate space for locals) by saving current top (value of SP) into BP register. Snippet #8.3:

void myAlgo_noLPR_withAnchor() { // `a` at 997, `b` at 998, SP == 997
    push(BP);   // save old BP, SP == 996
    BP = SP;    // create anchor, stack[BP] == old value of BP, now BP == 996
    SP -= 2;    // SP == 994

    stack[BP - 1] = stack[BP + 1] * 3;
    stack[BP - 2] = stack[BP + 2] * 3;
    AX = stack[BP - 1] - stack[BP - 2];

    SP = BP;    // cleanup locals, SP == 996
    pop(BP);    // SP == 997
}

The slice of stack, wich belongs to and is in full control of the function is called function's stack frame. E.g. myAlgo_noLPR_withAnchor's stack frame is stack[996 .. 994] (both idexes inclusive).
Frame starts at function's BP (after we've updated it inside function) and lasts until the next stack frame. So the parameters on the stack are part of the caller's stack frame (see note 8a).

Notes:
8a. Wikipedia says otherwise about parameters, but here I adhere to Intel software developer's manual, see vol. 1, section 6.2.4.1 Stack-Frame Base Pointer and Figure 6-2 in section 6.3.2 Far CALL and RET Operation. Function's parameters and stack frame are part of function's activation record (see The gen on function perilogues).
8b. positive offsets from BP point to function parameters and negative offsets point to local variables. That's pretty handy for debugging
8c. stack[BP] stores the address of the previous stack frame, stack[stack[BP]] stores pre-previous stack frame and so on. Following this chain, you can discover frames of all the functions in the programm, which didn't return yet. This is how debuggers show you call stack
8d. the first 3 instructions of myAlgo_noLPR_withAnchor, where we setup the frame (save old BP, update BP, reserve space for locals) are called function prologue


9. Calling conventions

In snippet 8.1 we've pushed parameters for myAlgo from right to left and returned result in AX. We could as well pass params left to right and return in BX. Or pass params in BX and CX and return in AX. Obviously, caller (main()) and called function must agree where and in which order all this stuff is stored.

Calling convention is a set of rules on how parameters are passed and result is returned.

In the code above we've used cdecl calling convention:

  • Parameters are passed on the stack, with the first argument at the lowest address on the stack at the time of the call (pushed last <...>). The caller is responsible for popping parameters back off the stack after the call.
  • the return value is placed in AX
  • EBP and ESP must be preserved by the callee (myAlgo_noLPR_withAnchor function in our case), such that the caller (main function) can rely on those registers not having been changed by a call.
  • All other registers (EAX, <...>) may be freely modified by the callee; if a caller wishes to preserve a value before and after the function call, it must save the value elsewhere (we do this with AX)

(Source: example "32-bit cdecl" from Stack Overflow Documentation; copyright 2016 by icktoofay and Peter Cordes ; licensed under CC BY-SA 3.0. An archive of the full Stack Overflow Documentation content can be found at archive.org, in which this example is indexed by topic ID 3261 and example ID 11196.)


10. Function calls

Now the most interesting part. Just like data, executable code is also stored in memory (completely unrelated to memory for stack) and every instruction has an address.
When not commanded otherwise, CPU executes instructions one after another, in the order they are stored in memory. But we can command CPU to "jump" to another location in memory and execute instructions from there on. In asm it can be any address, and in more high-level languages like C++ you can only jump to addresses marked by labels (there are workarounds but they are not pretty, to say the least).

Let's take this function (snippet #10.1):

int myAlgo_withCalls(int a, int b) {
    int t1 = triple(a);
    int t2 = triple(b);
    return t1 - t2;
}

And instead of calling tripple C++ way, do the following:

  1. copy tripple's code to the beginning of myAlgo body
  2. at myAlgo entry jump over tripple's code with goto
  3. when we need to execute tripple's code, save on the stack address of the code line just after tripple call, so we can return here later and continue execution (PUSH_ADDRESS macro below)
  4. jump to the address of the 1st line (tripple function) and execute it to the end (3. and 4. together are CALL macro)
  5. at the end of the tripple (after we've cleaned up locals), take return address from the top of the stack and jump there (RET macro)

Because there is no easy way to jump to particular code address in C++, we will use labels to mark places of jumps. I won't go into detail how macros below work, just believe me they do what I say they do (snippet #10.2):

// pushes the address of the code at label's location on the stack
// NOTE1: this gonna work only with 32-bit compiler (so that pointer is 32-bit and fits in int)
// NOTE2: __asm block is specific for Visual C++. In GCC use https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
#define PUSH_ADDRESS(labelName) {               \
    void* tmpPointer;                           \
    __asm{ mov [tmpPointer], offset labelName } \
    push(reinterpret_cast<int>(tmpPointer));    \
}

// why we need indirection, read https://stackoverflow.com/a/13301627/264047
#define TOKENPASTE(x, y) x ## y
#define TOKENPASTE2(x, y) TOKENPASTE(x, y)

// generates token (not a string) we will use as label name. 
// Example: LABEL_NAME(155) will generate token `lbl_155`
#define LABEL_NAME(num) TOKENPASTE2(lbl_, num)

#define CALL_IMPL(funcLabelName, callId)    \
    PUSH_ADDRESS(LABEL_NAME(callId));       \
    goto funcLabelName;                     \
    LABEL_NAME(callId) :

// saves return address on the stack and jumps to label `funcLabelName`
#define CALL(funcLabelName) CALL_IMPL(funcLabelName, __LINE__)

// takes address at the top of stack and jump there
#define RET() {                                         \
    int tmpInt;                                         \
    pop(tmpInt);                                        \
    void* tmpPointer = reinterpret_cast<void*>(tmpInt); \
    __asm{ jmp tmpPointer }                             \
}

void myAlgo_asm() {
    goto my_algo_start;

triple_label:
    push(BP);
    BP = SP;
    SP -= 1;

    // stack[BP] == old BP, stack[BP + 1] == return address
    stack[BP - 1] = stack[BP + 2] * 3;
    AX = stack[BP - 1];

    SP = BP;     
    pop(BP);
    RET();

my_algo_start:
    push(BP);   // SP == 995
    BP = SP;    // BP == 995; stack[BP] == old BP, 
                // stack[BP + 1] == dummy return address, 
                // `a` at [BP + 2], `b` at [BP + 3]
    SP -= 2;    // SP == 993

    push(AX);
    push(stack[BP + 2]);
    CALL(triple_label);
    stack[BP - 1] = AX;
    SP -= 1;
    pop(AX);

    push(AX);
    push(stack[BP + 3]);
    CALL(triple_label);
    stack[BP - 2] = AX;
    SP -= 1;
    pop(AX);

    AX = stack[BP - 1] - stack[BP - 2];

    SP = BP; // cleanup locals, SP == 997
    pop(BP);
}

int main() {
    push(AX);
    push(22);
    push(11);
    push(7777); // dummy value, so that offsets inside function are like we've pushed return address
    myAlgo_asm();
    assert(myAlgo_withCalls(11, 22) == AX);
    SP += 1; // pop dummy "return address"
    SP += 2;
    pop(AX);
}

Notes:
10a. because return address is stored on the stack, in principle we can change it. This is how stack smashing attack works
10b. the last 3 instructions at the "end" of triple_label (cleanup locals, restore old BP, return) are called function's epilogue


11. Assembly

Now let's look at real asm for myAlgo_withCalls. To do that in Visual Studio:

  • set build platform to x86 (not x86_64)
  • build type: Debug
  • set break point somewhere inside myAlgo_withCalls
  • run, and when execution stops at break point press Ctrl + Alt + D

One difference with our asm-like C++ is that asm's stack operate on bytes instead of ints. So to reserve space for one int, SP will be decremented by 4 bytes.
Here we go (snippet #11.1, line numbers in comments are from the gist):

;   114: int myAlgo_withCalls(int a, int b) {
 push        ebp        ; create stack frame 
 mov         ebp,esp  
; return address at (ebp + 4), `a` at (ebp + 8), `b` at (ebp + 12)
 
 sub         esp,0D8h   ; reserve space for locals. Compiler can reserve more bytes then needed. 0D8h is hexadecimal == 216 decimal 
 
 push        ebx        ; cdecl requires to save all these registers
 push        esi  
 push        edi  
 
 ; fill all the space for local variables (from (ebp-0D8h) to (ebp)) with value 0CCCCCCCCh repeated 36h times (36h * 4 == 0D8h)
 ; see https://stackoverflow.com/q/3818856/264047
 ; I guess that's for ease of debugging, so that stack is filled with recognizable values
 ; 0CCCCCCCCh in binary is 110011001100...
 lea         edi,[ebp-0D8h]     
 mov         ecx,36h    
 mov         eax,0CCCCCCCCh  
 rep stos    dword ptr es:[edi]  
 
;   115:    int t1 = triple(a);
 mov         eax,dword ptr [ebp+8]   ; push parameter `a` on the stack
 push        eax  
 
 call        triple (01A13E8h)  
 add         esp,4                   ; clean up param 
 mov         dword ptr [ebp-8],eax   ; copy result from eax to `t1`
 
;   116:    int t2 = triple(b);
 mov         eax,dword ptr [ebp+0Ch] ; push `b` (0Ch == 12)
 push        eax  
 
 call        triple (01A13E8h)  
 add         esp,4  
 mov         dword ptr [ebp-14h],eax ; t2 = eax
 
 mov         eax,dword ptr [ebp-8]   ; calculate and store result in eax
 sub         eax,dword ptr [ebp-14h]  

 pop         edi  ; restore registers
 pop         esi  
 pop         ebx  
 
 add         esp,0D8h  ; check we didn't mess up esp or ebp. this is only for debug builds
 cmp         ebp,esp  
 call        __RTC_CheckEsp (01A116Dh)  
 
 mov         esp,ebp  ; destroy frame
 pop         ebp  
 ret  

And asm for tripple (snippet #11.2):

 push        ebp  
 mov         ebp,esp  
 sub         esp,0CCh  
 push        ebx  
 push        esi  
 push        edi  
 lea         edi,[ebp-0CCh]  
 mov         ecx,33h  
 mov         eax,0CCCCCCCCh  
 rep stos    dword ptr es:[edi]  
 imul        eax,dword ptr [ebp+8],3  
 mov         dword ptr [ebp-8],eax  
 mov         eax,dword ptr [ebp-8]  
 pop         edi  
 pop         esi  
 pop         ebx  
 mov         esp,ebp  
 pop         ebp  
 ret  

Hope, after reading this post, assembly doesn't look as cryptic as before :)


Here are links from the post's body and some further reading:

  • Eli Bendersky, Where the top of the stack is on x86 - top/bottom, push/pop, SP, stack frame, calling conventions
  • Eli Bendersky, Stack frame layout on x86-64 - args passing on x64, stack frame, red zone
  • University of Mariland, Understanding the Stack - a really well-written introduction to stack concepts. (It's for MIPS (not x86) and in GAS syntax, but this is insignificant for the topic). See other notes on MIPS ISA Programming if interested.
  • x86 Asm wikibook, General-Purpose Registers
  • x86 Disassembly wikibook, The Stack
  • x86 Disassembly wikibook, Functions and Stack Frames
  • Intel software developer's manuals - I expected it to be really hardcore, but surprisingly it's pretty easy read (though amount of information is overwhelming)
  • Jonathan de Boyne Pollard, The gen on function perilogues - prologue/epilogue, stack frame/activation record, red zone

Regarding whether the stack is implemented in the hardware, this Wikipedia article might help.

Some processors families, such as the x86, have special instructions for manipulating the stack of the currently executing thread. Other processor families, including PowerPC and MIPS, do not have explicit stack support, but instead rely on convention and delegate stack management to the operating system's Application Binary Interface (ABI).

That article and the others it links to might be useful to get a feel for stack usage in processors.


You confuse an abstract stack and the hardware implemented stack. The latter is already implemented.


The Concept

First think of the whole thing as if you were the person who invented it. Like this:

First think of an array and how it is implemented at the low level --> it is basically just a set of contiguous memory locations (memory locations that are next to each other). Now that you have that mental image in your head, think of the fact that you can access ANY of those memory locations and delete it at your will as you remove or add data in your array. Now think of that same array but instead of the possibility to delete any location you decide that you will delete only the LAST location as you remove or add data in your array. Now your new idea to manipulate the data in that array in that way is called LIFO which means Last In First Out. Your idea is very good because it makes it easier to keep track of the content of that array without having to use a sorting algorithm every time you remove something from it. Also, to know at all times what the address of the last object in the array is, you dedicate one Register in the Cpu to keep track of it. Now, the way that register keeps track of it is so that every time you remove or add something to your array you also decrement or increment the value of the address in your register by the amount of objects you removed or added from the array (by the amount of address space they occupied). You also want to make sure that that amount by which you decrement or increment that register is fixed to one amount (like 4 memory locations ie. 4 bytes) per object, again, to make it easier to keep track and also to make it possible to use that register with some loop constructs because loops use fixed incrementation per iteration (eg. to loop trough your array with a loop you construct the loop to increment your register by 4 each iteration, which would not be possible if your array has objects of different sizes in it). Lastly, you choose to call this new data structure a "Stack", because it reminds you of a stack of plates in a restaurant where they always remove or add a plate on the top of that stack.

The Implementation

As you can see a stack is nothing more than an array of contiguous memory locations where you decided how to manipulate it. Because of that you can see that you don't need to even use the special instructions and registers to control the stack. You can implement it yourself with the basic mov, add and sub instructions and using general purpose registers instead the ESP and EBP like this:

mov edx, 0FFFFFFFFh

; --> this will be the start address of your stack, furthest away from your code and data, it will also serve as that register that keeps track of the last object in the stack that i explained earlier. You call it the "stack pointer", so you choose the register EDX to be what ESP is normally used for.

sub edx, 4

mov [edx], dword ptr [someVar]

; --> these two instructions will decrement your stack pointer by 4 memory locations and copy the 4 bytes starting at [someVar] memory location to the memory location that EDX now points to, just like a PUSH instruction decrements the ESP, only here you did it manually and you used EDX. So the PUSH instruction is basically just a shorter opcode that actually does this with ESP.

mov eax, dword ptr [edx]

add edx, 4

; --> and here we do the opposite, we first copy the 4 bytes starting at the memory location that EDX now points to into the register EAX (arbitrarily chosen here, we could have copied it anywhere we wanted). And then we increment our stack pointer EDX by 4 memory locations. This is what the POP instruction does.

Now you can see that the instructions PUSH and POP and the registers ESP ans EBP were just added by Intel to make the above concept of the "stack" data structure easier to write and read. There are still some RISC (Reduced Instruction Set) Cpu-s that don't have the PUSH ans POP instructions and dedicated registers for stack manipulation, and while writing assembly programs for those Cpu-s you have to implement the stack by yourself just like i showed you.