Custom bootloader booted via USB drive produces incorrect output on some computers

I am fairly new to assembly, but I'm trying to dive into the world of low level computing. I'm trying to learn how to write assembly code that would run as bootloader code; so independent of any other OS like Linux or Windows. After reading this page and a few other lists of x86 instruction sets, I came up with some assembly code that is supposed to print 10 A's on the screen and then 1 B.

      BITS 16
start: 
    mov ax, 07C0h       ; Set up 4K stack space after this bootloader
    add ax, 288     ; (4096 + 512) / 16 bytes per paragraph
    mov ss, ax
    mov sp, 4096

    mov ax, 07C0h       ; Set data segment to where we're loaded
    mov ds, ax

    mov cl, 10          ; Use this register as our loop counter
    mov ah, 0Eh         ; This register holds our BIOS instruction

.repeat:
    mov al, 41h         ; Put ASCII 'A' into this register
    int 10h             ; Execute our BIOS print instruction
    cmp cl, 0           ; Find out if we've reached the end of our loop
    dec cl              ; Decrement our loop counter
    jnz .repeat         ; Jump back to the beginning of our loop
    jmp .done           ; Finish the program when our loop is done

.done:
    mov al, 42h         ; Put ASCII 'B' into this register
    int 10h             ; Execute BIOS print instruction
    ret


times 510-($-$$) db 0   ; Pad remainder of boot sector with 0s
dw 0xAA55

So the output should look like this:

AAAAAAAAAAB

I assembled the code using the nasm assembler running on the Windows 10 Ubuntu Bash program. After it produced the .bin file, I opened it using a hex editor. I used the same hex editor to copy the contents of that .bin file into the first 512 bytes of a flash drive. Once I had written my program to the flash drive, I disconnected it and plugged it into a computer with an Intel Core i3-7100. On bootup, I selected my USB flash drive as the boot device, only to get the following output:

A

After changing various things in the program, I finally got frustrated and tried the program on another computer. The other computer was a laptop with an i5-2520m. I followed the same process as I mentioned before. Sure enough, it gave me the expected output:

AAAAAAAAAAB

I immediately tried it on my original computer with the i3, but it still didn't work.

So my question is: Why does my program work with one x86 processor but not the other? They both support the x86 instruction set. What gives?


Solution:
Ok, I've been able to track down the real solution with some help. If you read Michael Petch's answer below, you'll find a solution that will fix my problem, and another problem of a BIOS looking for a BPB.

Here was the problem with my code: I was writing the program to the first bytes of my flash drive. Those bytes were loaded into memory, but some BIOS interrupts were using those bytes for itself. So my program was being overwritten by the BIOS. To prevent this, you can add a BPB description as shown below. If your BIOS works the same way mine does, it will simply overwrite the BPB in memory, but not your program. Alternatively, you can add the following code to the top of your program:

jmp start
resb 0x50

start: 
;enter code here

This code (courtesy of Ross Ridge) will push your program to memory location 0x50 (offset from 0x7c00) to prevent it from being overwritten by the BIOS during execution.

Also keep in mind that whenever you call any subroutine, the values of the registers you were using could be overwritten. Make sure you either use push, pop or save your values to memory before calling a subroutine. Look at Martin Rosenau's answer below to read more about that.

Thank you to all who replied to my question. I now have a better understanding of how this low-level stuff works.


This could probably be made into a canonical answer on this subject.

Real Hardware / USB / Laptop Issues

If you are attempting to use USB to boot on real hardware then you may encounter another issue even if you get it working in BOCHS and QEMU. If your BIOS is set to do USB FDD emulation (and not USB HDD or something else) you may need to add a BIOS Parameter Block(BPB) to the beginning of your bootloader. You can create a fake one like this:

org 0x7c00
bits 16

boot:
    jmp main
    TIMES 3-($-$$) DB 0x90   ; Support 2 or 3 byte encoded JMPs before BPB.

    ; Dos 4.0 EBPB 1.44MB floppy
    OEMname:           db    "mkfs.fat"  ; mkfs.fat is what OEMname mkdosfs uses
    bytesPerSector:    dw    512
    sectPerCluster:    db    1
    reservedSectors:   dw    1
    numFAT:            db    2
    numRootDirEntries: dw    224
    numSectors:        dw    2880
    mediaType:         db    0xf0
    numFATsectors:     dw    9
    sectorsPerTrack:   dw    18
    numHeads:          dw    2
    numHiddenSectors:  dd    0
    numSectorsHuge:    dd    0
    driveNum:          db    0
    reserved:          db    0
    signature:         db    0x29
    volumeID:          dd    0x2d7e5a1a
    volumeLabel:       db    "NO NAME    "
    fileSysType:       db    "FAT12   "

main:
    [insert your code here]

Adjust the ORG directive to what you need or omit it if you just need the default 0x0000.

If you were to modify your code to have the layout above the Unix/Linux file command may be able to dump out the BPB data that it thinks makes up your VBR in the disk image. Run the command file disk.img and you may get this output:

disk.img: DOS/MBR boot sector, code offset 0x3c+2, OEM-ID "mkfs.fat", root entries 224, sectors 2880 (volumes <=32 MB) , sectors/FAT 9, sectors/track 18, serial number 0x2d7e5a1a, unlabeled, FAT (12 bit)


How the Code in this Question Could be Modified

In the case of this OPs original code it could have been modified to look like this:

bits 16

boot:
    jmp main
    TIMES 3-($-$$) DB 0x90   ; Support 2 or 3 byte encoded JMPs before BPB.

    ; Dos 4.0 EBPB 1.44MB floppy
    OEMname:           db    "mkfs.fat"  ; mkfs.fat is what OEMname mkdosfs uses
    bytesPerSector:    dw    512
    sectPerCluster:    db    1
    reservedSectors:   dw    1
    numFAT:            db    2
    numRootDirEntries: dw    224
    numSectors:        dw    2880
    mediaType:         db    0xf0
    numFATsectors:     dw    9
    sectorsPerTrack:   dw    18
    numHeads:          dw    2
    numHiddenSectors:  dd    0
    numSectorsHuge:    dd    0
    driveNum:          db    0
    reserved:          db    0
    signature:         db    0x29
    volumeID:          dd    0x2d7e5a1a
    volumeLabel:       db    "NO NAME    "
    fileSysType:       db    "FAT12   "

main:
    mov ax, 07C0h       ; Set up 4K stack space after this bootloader
    add ax, 288     ; (4096 + 512) / 16 bytes per paragraph
    mov ss, ax
    mov sp, 4096

    mov ax, 07C0h       ; Set data segment to where we're loaded
    mov ds, ax

    mov cl, 10          ; Use this register as our loop counter
    mov ah, 0Eh         ; This register holds our BIOS instruction

.repeat:
    mov al, 41h         ; Put ASCII 'A' into this register
    int 10h             ; Execute our BIOS print instruction
    cmp cl, 0           ; Find out if we've reached the end of our loop
    dec cl              ; Decrement our loop counter
    jnz .repeat         ; Jump back to the beginning of our loop
    jmp .done           ; Finish the program when our loop is done

.done:
    mov al, 42h         ; Put ASCII 'B' into this register
    int 10h             ; Execute BIOS print instruction
    ret

times 510-($-$$) db 0   ; Pad remainder of boot sector with 0s
dw 0xAA55

Other Suggestions

As has been pointed out - you can't ret to end a bootloader. You can put it into an infinite loop or halt the processor with cli followed by hlt.

If you ever allocate a large amount of data on the stack or start writing to data outside the 512 bytes of your bootloader you should set your own stack pointer (SS:SP) to a region of memory that won't interfere with your own code. The original code in this question does setup a stack pointer. This is a general observation for anyone else reading this Q/A. I have more information on that in my Stackoverflow answer that contains General Bootloader Tips.


Test Code to See if Your BIOS is Overwriting the BPB

If you want to know if the BIOS might be overwriting data in the BPB and to determine what values it wrote you could use this bootloader code to dump the BPB as the bootloader sees it after control is transferred to it. Under normal circumstances the first 3 bytes should be EB 3C 90 followed by a series of AA. Any value that isn't AA was likely overwritten by the BIOS. This code is in NASM and can be assembled into a bootloader with nasm -f bin boot.asm -o boot.bin

; Simple bootloader that dumps the bytes in the BIOS Parameter
; Block BPB. First 3 bytes should be EB 3C 90. The rest should be 0xAA
; unless you have a BIOS that wrote drive geometry information
; into what it thinks is a BPB.

; Macro to print a character out with char in BX
%macro print_char 1
    mov al, %1
    call bios_print_char
%endmacro

org 0x7c00
bits 16

boot:
    jmp main
    TIMES 3-($-$$) DB 0x90   ; Support 2 or 3 byte encoded JMPs before BPB.

    ; Fake BPB filed with 0xAA
    TIMES 59 DB 0xAA

main:
    xor ax, ax
    mov ds, ax
    mov ss, ax              ; Set stack just below bootloader at 0x0000:0x7c00
    mov sp, boot
    cld                     ; Forward direction for string instructions

    mov si, sp              ; Print bytes from start of bootloader
    mov cx, main-boot       ; Number of bytes in BPB
    mov dx, 8               ; Initialize column counter to 8
                            ;     So first iteration prints address
.tblloop:
    cmp dx, 8               ; Every 8 hex value print CRLF/address/Colon/Space
    jne .procbyte
    print_char 0x0d         ; Print CRLF
    print_char 0x0a
    mov ax, si              ; Print current address
    call print_word_hex
    print_char ':'          ; Print ': '
    print_char ' '
    xor dx, dx              ; Reset column counter to 0
.procbyte:
    lodsb                   ; Get byte to print in AL
    call print_byte_hex     ; Print the byte (in BL) in HEX
    print_char ' '
    inc dx                  ; Increment the column count
    dec cx                  ; Decrement number of  bytes to process
    jnz .tblloop

    cli                     ; Halt processor indefinitely
.end:
    hlt
    jmp .end

; Print the character passed in AL
bios_print_char:
    push bx
    xor bx, bx              ; Attribute=0/Current Video Page=0
    mov ah, 0x0e
    int 0x10                ; Display character
    pop bx
    ret

; Print the 16-bit value in AX as HEX
print_word_hex:
    xchg al, ah             ; Print the high byte first
    call print_byte_hex
    xchg al, ah             ; Print the low byte second
    call print_byte_hex
    ret

; Print lower 8 bits of AL as HEX
print_byte_hex:
    push bx
    push cx
    push ax

    lea bx, [.table]        ; Get translation table address

    ; Translate each nibble to its ASCII equivalent
    mov ah, al              ; Make copy of byte to print
    and al, 0x0f            ;     Isolate lower nibble in AL
    mov cl, 4
    shr ah, cl              ; Isolate the upper nibble in AH
    xlat                    ; Translate lower nibble to ASCII
    xchg ah, al
    xlat                    ; Translate upper nibble to ASCII

    xor bx, bx              ; Attribute=0/Current Video Page=0
    mov ch, ah              ; Make copy of lower nibble
    mov ah, 0x0e
    int 0x10                ; Print the high nibble
    mov al, ch
    int 0x10                ; Print the low nibble

    pop ax
    pop cx
    pop bx
    ret
.table: db "0123456789ABCDEF", 0

; boot signature
TIMES 510-($-$$) db 0
dw 0xAA55

Output should look like this for any BIOS that didn't update the BPB before transferring control to the bootloader code:

7C00: EB 3C 90 AA AA AA AA AA
7C08: AA AA AA AA AA AA AA AA
7C10: AA AA AA AA AA AA AA AA
7C18: AA AA AA AA AA AA AA AA
7C20: AA AA AA AA AA AA AA AA
7C28: AA AA AA AA AA AA AA AA
7C30: AA AA AA AA AA AA AA AA
7C38: AA AA AA AA AA AA

Assembly code only works on one of my two x86 processors

It is not the processors but the BIOSes:

The int instruction actually is a special variant of the call instruction. The instruction calls some sub-routine (typically written in assembler).

(You can even replace that sub-routine by your own one - which is actually done by MS-DOS, for example.)

On two computers you have two different BIOS versions (or even vendors) which means that the sub-routine called by the int 10h instruction has been written by different programmers and therefore does not exactly do the same.

only to get the following output

The problem I suspect here is that the sub-routine called by int 10h on the first computer does not save the register values while the routine on the second computer does.

In other words:

On the first computer the routine called by int 10h may look like this:

...
mov cl, 5
mov ah, 6
...

... so after the int 10h call the ah register does no longer contain the value 0Eh and it may even be the case that the cl register is modified (which will end in an endless loop then).

To avoid the problem you could save the cl register using push (you have to save the entire cx register) and restore it after the int instruction. You also have to set the value of the ah register before each call of the int 10h sub-routine because you cannot be sure that it has not modified since then:

push cx
mov ah, 0Eh
int 10h
pop cx

mov sp, ... ... ret

Please think about Peter Cordes' comment:

How does the ret instruction work and how is it related to the sp and ss registers?

The ret instruction here will definitely not do what you expect!

On floppy disks the boot sectors typically contain the following code instead:

mov ax, 0  ; (may be written as "xor ax, ax")
int 16h
int 19h

int 19h does exactly what you expect from the ret instruction.

However the BIOS will boot the computer again which means that it will load the code from your USB stick and execute it again.

You'll get the following result:

AAAAABAAAAABAAAAABAAAAAB...

Therefore the int 16h instruction is inserted. This will wait for the user to press a key on the keyboard when the ax register has the value 0 before calling the int 16h sub-routine.

Alternatively you can simply add an endless loop:

.endlessLoop:
    jmp .endlessLoop

mov ss, ...

When an interrupt occurs between these two instructions:

mov ss, ax
    ; <--- Here
mov sp, 4096

... the combination of the sp and ss registers does not represent a "valid" representation of values.

If you are unlucky the interrupt will write data somewhere to memory where you don't want it. It may even overwrite your program!

Therefore you typically lock interrupts when modifying the ss register:

cli          ; Forbid interrupts
mov ss, ax
mov sp, 4096
sti          ; Allow interrupts again