Cortex-M0+ dual application/image linking

For most products, we implement a bootloader+application approach which uses an external SPI flash for storing different application versions. Upon startup, the bootloader checks whether a new image is stored in the SPI flash. If so, it flashes the image to the application area of the µC and starts it. The internal flash layout typically looks like this:

+-----------------+
| 6k Bootloader   |
+-----------------+
| 2k EEPROM       |
+-----------------+
| 56K Application |
+-----------------+

However, for our next product, we want to eliminate the external flash and use a µC with a bigger internal flash that holds two applications, so that we can dynamically switch between them.

+-----------------+
| 6k Bootloader   |
+-----------------+
| 2k EEPROM       |
+-----------------+
| 60K App A       |
+-----------------+
| 60K App B       |
+-----------------+

In our plans, App A is able to receive an OTA update and flash it to the memory region of App B. It then marks App B as the newer application and, upon reboot, the bootloader jumps to App B instead of App A. Naturally, App B can likewise flash App A and mark it as the newer image.

So far so good. Now the actual problem: when rolling out the OTA update, we do not know whether it will be flashed as App A or App B. Thus, when linking the application, we do not know which memory region in the internal flash is used. In other words, we do not know the offset the application is started from and cannot define jump addresses or the position of the ISR tables.

How do I link an application so that it can be flashed to either region? Is there a way to tell the compiler/linker to use "relative" jumps instead of "absolute" jumps? If not, are there any other solutions for such an approach? For example, telling the M0+ to treat all addresses with an offset which is set up by the bootloader?


Solution 1:

Basically I don't think this is possible with any widely available compiler.

You would have to use program-counter relative addressing for all data in flash, but absolute addressing for addresses in RAM. While the ELF for ARM specification has these kinds of relocations, I don't think any compiler knows how to generate code that uses them. Also, it wouldn't know which to use in each case, given that what is in flash and what is in RAM isn't decided until the linker stage.

One solution (which I have used in production) is to compile all your sources once but link them twice with two different linker scripts with absolute addresses. You will then have to distribute a double-sized update image, only half of which is used on any occasion.

Alternatively if you only want to have one image, then you need to do a double-shuffle. Have your working image write the new image to one area of internal flash, then reboot and have the bootloader copy it from there to the working location. You couldn't run it from the temporary location because the embedded addresses would be wrong (this is identical to your current solution but only uses the internal flash).
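
As a rough illustration of the double-shuffle, here is a host-side sketch in C. The two arrays stand in for the staging and working flash regions, and `app_store_update`, `bootloader_apply_pending` and `staging_len` are hypothetical names, not part of any real flash driver. On real hardware the `memset`/`memcpy` calls would be the part's flash erase and program routines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 64u /* tiny stand-in for the real 60K regions */

/* Hypothetical stand-ins for the two internal flash regions. */
static uint8_t working_slot[SLOT_SIZE]; /* the only address the image is linked for */
static uint8_t staging_slot[SLOT_SIZE]; /* temporary storage, never executed */
static uint32_t staging_len;            /* 0 means "no update pending" */

/* Called by the running application: store the new image, do not run it. */
void app_store_update(const uint8_t *image, uint32_t len)
{
    assert(len > 0 && len <= SLOT_SIZE);
    memcpy(staging_slot, image, len);
    staging_len = len;
}

/* Called by the bootloader after reset: copy a pending image into place.
   Returns 1 if an image was copied, 0 if nothing was pending. */
int bootloader_apply_pending(void)
{
    if (staging_len == 0) {
        return 0;
    }
    memset(working_slot, 0xFF, SLOT_SIZE);           /* "erase" the working region */
    memcpy(working_slot, staging_slot, staging_len); /* "program" the new image */
    staging_len = 0;                                 /* clear the pending marker */
    return 1;
}
```

A real implementation would also have to survive a power loss in the middle of the copy, which this sketch ignores.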

Solution 2:

You build the application as position independent. The startup code looks at the address it is executing from and adjusts the GOT before it launches into main() or whatever your C entry point is. You can certainly do this with the GNU tools; I assume LLVM supports position independence as well.

Now, you can only boot from one place, and that entry point is common to both paths. So you do know which path you are taking, because you have to create a non-volatile scheme that marks which path to try first or which one is the most recent. There are various ways to do that, too broad to answer here. That common code can fix up the GOT just like a normal program loader would for an operating system, or each application can patch itself up before calling compiled code, or you can do a hybrid.
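
One of the many possible marking schemes, sketched in C purely as an assumption: each slot carries a small record with a validity flag and a sequence number, and the common entry code picks the valid slot with the higher number. `slot_info_t`, `select_boot_slot` and the record layout are hypothetical, and counter wrap-around is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-slot record kept in non-volatile memory. */
typedef struct {
    uint32_t valid;    /* nonzero once a complete image has been written */
    uint32_t sequence; /* incremented on every update; higher means newer */
} slot_info_t;

enum { SLOT_NONE = -1, SLOT_A = 0, SLOT_B = 1 };

/* Decide which slot the common entry code should start. */
int select_boot_slot(const slot_info_t *a, const slot_info_t *b)
{
    assert(a != NULL && b != NULL);
    if (a->valid && b->valid) {
        return (b->sequence > a->sequence) ? SLOT_B : SLOT_A;
    }
    if (a->valid) {
        return SLOT_A;
    }
    if (b->valid) {
        return SLOT_B;
    }
    return SLOT_NONE; /* nothing bootable, stay in the bootloader */
}
```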

You can have the application run from SRAM instead of from flash: the single entry point code copies whichever image is the current one to RAM. Then you do not need position independence; you link for RAM, and either image is just a copy and a jump.

You can link two versions of each application. Part of the update process is then to determine in some way (many ways to do it, too broad for here) which partition is the target and to write the version linked for that partition. The common entry point can then simply branch to the right address.
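
A minimal sketch of the "write the one linked for that partition" step, assuming the update bundle describes each variant by the address it was linked for. The `image_variant_t` layout and `pick_variant` are hypothetical, as are any addresses you pass in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one linked variant inside an update bundle. */
typedef struct {
    uint32_t link_address; /* flash address this variant was linked for */
    const uint8_t *data;   /* image bytes */
    size_t length;         /* image size in bytes */
} image_variant_t;

/* Return the variant linked for target_address, or NULL if there is none. */
const image_variant_t *pick_variant(const image_variant_t *variants,
                                    size_t count,
                                    uint32_t target_address)
{
    assert(variants != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (variants[i].link_address == target_address) {
            return &variants[i];
        }
    }
    return NULL; /* no variant fits the inactive partition */
}
```

The running application would call this with the start address of the partition it is not executing from, then program only that variant.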

Some MCUs have banked flashes and a way to put one bank out front; you can buy one of those products.

The common code can live in each image just in case; the copy in A, assumed to be at the entry point for the chip, will be the real bootloader that determines which partition to use and makes adjustments as needed.

Your chip might have an MMU, in which case you can remap whichever image into the same address space and link for that address space.

Not only is this doable, it is not complicated. There are too many ways to do it to describe them all; you simply have to pick the one you want to implement and maintain. The chip may or may not help in this decision. If you are already a bare-metal programmer, then dealing with startup code, PIC, the GOT and so on is simple, not scary.


What does GOT mean? Global offset table, a side effect of how at least gcc/binutils implement position independent code.

Just use the tools, no target hardware necessary....

flash.s

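.cpu cortex-m0
.thumb

@ minimal vector table: initial stack pointer, then the reset handler
.thumb_func
.word 0x20001000
.word reset

.thumb_func
reset:
    bl notmain      @ call the C entry point
.thumb_func
hang:   b .         @ spin forever if notmain returns

@ local function, close enough for a direct bl
.thumb_func
.globl lbounce
lbounce:
    bx lr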

bounce.c

void fbounce ( unsigned int x )
{
}

notmain.c

void lbounce ( unsigned int );
void fbounce ( unsigned int );
unsigned int x = 9;
int notmain ( void )
{
    x++;
    lbounce(5);
    fbounce(7);
    return(0);
}

flash.ld

MEMORY
{
    rom : ORIGIN = 0x00000000, LENGTH = 0x1000
    mor : ORIGIN = 0x80000000, LENGTH = 0x1000 /* far away, so calls into it need a veneer */
    ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .ftext   : { bounce.o } > mor       /* everything from bounce.o */
    .text   : { *(.text*)   } > rom
    .data    : { *(.data*)    } > ram
}

Build it position dependent:

arm-linux-gnueabi-as --warn --fatal-warnings -mcpu=cortex-m0 flash.s -o flash.o
arm-linux-gnueabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m0 -mthumb -c notmain.c -o notmain.o
arm-linux-gnueabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m0 -mthumb -c bounce.c -o bounce.o
arm-linux-gnueabi-ld -nostdlib -nostartfiles -T flash.ld flash.o notmain.o bounce.o -o notmain.elf
arm-linux-gnueabi-objdump -D notmain.elf > notmain.list
arm-linux-gnueabi-objcopy -O binary notmain.elf notmain.bin

arm-linux-gnueabi vs arm-none-eabi does not matter for this code; the differences are not relevant here. It just happened to be what I had in this makefile.

arm-none-eabi-objdump -d notmain.o

notmain.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <notmain>:
   0:   4a06        ldr r2, [pc, #24]   ; (1c <notmain+0x1c>)
   2:   b510        push    {r4, lr}
   4:   6813        ldr r3, [r2, #0]
   6:   2005        movs    r0, #5
   8:   3301        adds    r3, #1
   a:   6013        str r3, [r2, #0]
   c:   f7ff fffe   bl  0 <lbounce>
  10:   2007        movs    r0, #7
  12:   f7ff fffe   bl  0 <fbounce>
  16:   2000        movs    r0, #0
  18:   bd10        pop {r4, pc}
  1a:   46c0        nop         ; (mov r8, r8)
  1c:   00000000    .word   0x00000000

So the variable x is in .data and the code is in .text. They are separate segments, so the compiler cannot make assumptions about their placement when compiling a module; the pieces only come together when linking. This code is therefore built so that the linker will fill in the value shown in this output at address 1c. The address of x is read using that pc-relative load, then x itself is accessed through that address to do the x++.

lbounce and fbounce are not known to this code; they are external functions, so the best the compiler can really do is emit a branch-and-link to each of them.

once linked

00000010 <notmain>:
  10:   4a06        ldr r2, [pc, #24]   ; (2c <notmain+0x1c>)
  12:   b510        push    {r4, lr}
  14:   6813        ldr r3, [r2, #0]
  16:   2005        movs    r0, #5
  18:   3301        adds    r3, #1
  1a:   6013        str r3, [r2, #0]
  1c:   f7ff fff7   bl  e <lbounce>
  20:   2007        movs    r0, #7
  22:   f000 f805   bl  30 <__fbounce_veneer>
  26:   2000        movs    r0, #0
  28:   bd10        pop {r4, pc}
  2a:   46c0        nop         ; (mov r8, r8)
  2c:   20000000    andcs   r0, r0, r0

Per the linker script, .data starts at 0x20000000 and we only have one item there, so that is where x is. Now fbounce is too far away to reach with a single bl instruction, so the linker creates a trampoline, or veneer, to connect them:

00000030 <__fbounce_veneer>:
  30:   b401        push    {r0}
  32:   4802        ldr r0, [pc, #8]    ; (3c <__fbounce_veneer+0xc>)
  34:   4684        mov ip, r0
  36:   bc01        pop {r0}
  38:   4760        bx  ip
  3a:   bf00        nop
  3c:   80000001    andhi   r0, r0, r1

Note that this is 0x80000000 with the lsbit set; the address is 0x80000000, not 0x80000001. See the ARM documentation: the lsbit marks this as a Thumb address and tells the bx instruction logic to go to or remain in Thumb mode, and the bit is then stripped off to form the actual address. A Cortex-M is always in Thumb mode, so a zero there would make the processor fault because it has no ARM mode. The tool generated that properly.
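
The handling of that low bit can be written down as a tiny C helper. This is only an illustration of the rule described above; `bx_destination` is a made-up name, not something the tools provide.

```c
#include <assert.h>
#include <stdint.h>

/* Given the value handed to bx on a Cortex-M, return the address that is
   actually fetched from. Bit 0 only selects Thumb state and is not part of
   the address. */
uint32_t bx_destination(uint32_t bx_value)
{
    /* On a Cortex-M bit 0 must be set; a clear bit would request ARM state
       and the core would fault. */
    assert((bx_value & 1u) != 0);
    return bx_value & ~UINT32_C(1);
}
```

Applied to the veneer's literal, `bx_destination(0x80000001)` gives 0x80000000, the start of the section that holds fbounce.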

Now position independent.

arm-linux-gnueabi-as --warn --fatal-warnings -mcpu=cortex-m0 flash.s -o flash.o
arm-linux-gnueabi-gcc -fPIC -Wall -O2 -ffreestanding -mcpu=cortex-m0 -mthumb -c notmain.c -o notmain.o
arm-linux-gnueabi-gcc -fPIC -Wall -O2 -ffreestanding -mcpu=cortex-m0 -mthumb -c bounce.c -o bounce.o
arm-linux-gnueabi-ld -nostdlib -nostartfiles -T flash.ld flash.o notmain.o bounce.o -o notmain.elf
arm-linux-gnueabi-objdump -D notmain.elf > notmain.list
arm-linux-gnueabi-objcopy -O binary notmain.elf notmain.bin

on my machine gives

0000000e <lbounce>:
   e:   4770        bx  lr

00000010 <notmain>:
  10:   b510        push    {r4, lr}
  12:   4b07        ldr r3, [pc, #28]   ; (30 <notmain+0x20>)
  14:   4a07        ldr r2, [pc, #28]   ; (34 <notmain+0x24>)
  16:   447b        add r3, pc
  18:   589a        ldr r2, [r3, r2]
  1a:   2005        movs    r0, #5
  1c:   6813        ldr r3, [r2, #0]
  1e:   3301        adds    r3, #1
  20:   6013        str r3, [r2, #0]
  22:   f7ff fff4   bl  e <lbounce>
  26:   2007        movs    r0, #7
  28:   f000 f806   bl  38 <__fbounce_veneer>
  2c:   2000        movs    r0, #0
  2e:   bd10        pop {r4, pc}
  30:   1fffffea    svcne   0x00ffffea
  34:   00000000    andeq   r0, r0, r0

00000038 <__fbounce_veneer>:
  38:   b401        push    {r0}
  3a:   4802        ldr r0, [pc, #8]    ; (44 <__fbounce_veneer+0xc>)
  3c:   4684        mov ip, r0
  3e:   bc01        pop {r0}
  40:   4760        bx  ip
  42:   bf00        nop
  44:   80000001    andhi   r0, r0, r1

There may be some other interesting PIC flags, but so far this already shows a difference.

  12:   4b07        ldr r3, [pc, #28]   ; (30 <notmain+0x20>)
  14:   4a07        ldr r2, [pc, #28]   ; (34 <notmain+0x24>)
  16:   447b        add r3, pc
  18:   589a        ldr r2, [r3, r2]

  1c:   6813        ldr r3, [r2, #0]
  1e:   3301        adds    r3, #1
  20:   6013        str r3, [r2, #0]

Now it takes all of this to increment x. It uses a pc-relative offset, stored in the literal pool at the end of the function, to get from where we are to the GOT in .data (important: that means if you move .text around in memory, you also need to move .data so that it stays at the same relative offset). Then it has to do those math steps to get the address of x, and only then can it do the read-modify-write.

Also note that the trampoline doesn't change; it still has the hardcoded address for the section that contains fbounce, which tells us that we cannot move fbounce as built.

Disassembly of section .data:

20000000 <x>:
20000000:   00000009    andeq   r0, r0, r9

Disassembly of section .got:

20000004 <.got>:
20000004:   20000000    andcs   r0, r0, r0
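
To check the arithmetic against the listing: in Thumb state, `add r3, pc` reads the pc as the address of the instruction plus 4. The helper below is only a sketch of that calculation, and `got_base` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Address produced by "ldr r3, <literal>; add r3, pc" in Thumb state.
   add_insn_addr is the address of the add instruction itself. */
uint32_t got_base(uint32_t add_insn_addr, uint32_t literal)
{
    assert((add_insn_addr & 1u) == 0); /* Thumb instructions are halfword aligned */
    return (add_insn_addr + 4u) + literal;
}
```

With the values from the listing, `got_base(0x16, 0x1fffffea)` is 0x1a + 0x1fffffea = 0x20000004, the start of .got. The second literal, 0x00000000, is the offset of the entry for x within the GOT, and that entry holds 0x20000000, the address of x.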

So what is a GOT? A global offset table. In this trivial example there is one data item, x, in .data, and there is a global offset table that holds its address. (Correcting a statement above: it is the .got section that needs to stay in the same place relative to .text for this to work as built; .data can move, and that is the whole point.) For this to work as built, the .got needs to be in SRAM, which naturally means a proper bootstrap needs to put it there, just as .data needs to be copied from flash to SRAM before the C entry point.

So if you wanted to move .data to some other address in memory, because you are copying it from flash to a different address than it was linked for, you would need to go through the global offset table and modify those addresses. Think about the position dependent build: for every function that touches the global variable x, there is a word with its address in the literal pool after that function, in places you cannot unravel. You would have to know the dozens to hundreds of places to modify if you wanted to move .data. With this solution you only need to know where the GOT is and update it. So if instead of 0x20000000 you put .data at 0x20008000, then you would go through the GOT and add 0x8000 to each entry before you launch this code, or at least before the code touches any .data item.
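
A minimal sketch of that fix-up loop in C. The function name and parameters are hypothetical. The range check is an added assumption: it leaves alone any entry that does not point into the region being moved, for example an entry that refers to something in flash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add delta to every GOT entry that points into the region that was linked
   at [linked_base, linked_base + linked_size). */
void relocate_got(uint32_t *got, size_t entries,
                  uint32_t linked_base, uint32_t linked_size,
                  uint32_t delta)
{
    assert(got != NULL || entries == 0);
    for (size_t i = 0; i < entries; i++) {
        /* Unsigned subtraction wraps, so this is a single range test. */
        if (got[i] - linked_base < linked_size) {
            got[i] += delta;
        }
    }
}
```

For the example above, calling it with `linked_base` 0x20000000, `linked_size` 0x1000 and `delta` 0x8000 turns the entry for x from 0x20000000 into 0x20008000.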

And if you wanted to move only .text and not .data then, at least as I have built this here, you need to move/copy .text to the other address space and copy .got to the same relative address, but not modify the GOT.

In your case you want to execute from two different address spaces but not move .data. So ideally you need to get the tools to use a fixed location for the .got, which I did not dig into; exercise for the reader. If you can get the .got to be fixed and independent of the location of .text, then a single build of your program can be executed from two different flash locations without any modification, and without the need for two separately linked binaries, etc.

.got : { *(.got*) } > rom

done...

Disassembly of section .data:

20000000 <x>:
20000000:   00000009    andeq   r0, r0, r9

Disassembly of section .got:

00000048 <.got>:
  48:   20000000    andcs   r0, r0, r0

That means the GOT is in flash; it ends up inside the one binary blob that gets programmed into either flash location, at the correct position relative to .text.

So my fbounce was for demonstration.

If you build position independent, and there is not enough flash to have to worry about veneers, then a single binary can be created that will execute from two different flash locations without modification.

Thus the term position independent code. But this is the trivial part of your problem, the part that takes little effort (it took me a few minutes to see and solve the one issue I had). You still have other issues to solve, which can be solved dozens of ways, and you have to decide on, design, and implement your solution. Way too broad for Stack Overflow.