What are the segment and offset in real mode memory addressing?

I am reading about memory addressing. I read about segment offset and then about descriptor offset. I know how to calculate the exact addresses in real mode. All this is OK, but I am unable to understand what exactly an offset is. Everywhere I read:

In real mode, the registers are only 16 bits, so you can only address up to 64k. In order to allow addressing of more memory, addresses are calculated from segment * 16 + offset.

Here I can understand the first line. We have 16 bits, so we can address up to 2^16 = 64k.

But what is this second line? What does the segment represent? Why do we multiply it by 16? Why do we add the offset? I just can't understand what this offset is. Can anybody explain this to me or give me a link, please?

Solution 1:

When Intel was building the 8086, there was a valid case for having more than 64KB in a machine, but there was no way it'd ever use a 32-bit address space. Back then, even a megabyte was a whole lot of memory. (Remember the infamous quote "640K ought to be enough for anybody"? It's essentially a mistranslation of the fact that back then, 1MB was freaking huge.) The word "gigabyte" wouldn't be in common use for another 15-20 years, and it wouldn't be referring to RAM for like another 5-10 years after that.

So instead of implementing an address space so huge that it'd "never" be fully utilized, what they did was implement 20-bit addresses. They still used 16-bit words for addresses, because after all, this is a 16-bit processor. The upper word was the "segment" and the lower word was the "offset". The two parts overlapped considerably, though -- a "segment" is a 64KB chunk of memory that starts at (segment) * 16, and the "offset" can point anywhere within that chunk. In order to calculate the actual address, you multiply the segment part of the address by 16 (or shift it left by 4 bits...same thing), and then add the offset. When you're done, you have a 20-bit address.

 19           4  0
  +--+--+--+--+
  |  segment  |
  +--+--+--+--+--+
     |   offset  |
     +--+--+--+--+

For example, if the segment were 0x8000, and the offset were 0x0100, the actual address comes out to ((0x8000 << 4) + 0x0100) == 0x80100.

   8  0  0  0
      0  1  0  0
  ---------------
   8  0  1  0  0
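
The same calculation can be sketched in a few lines of Python. The function name `physical_address` is made up for this illustration, and the sketch does not model what happens when the sum needs more than 20 bits; it only mirrors the shift-and-add described above.

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a linear address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Shifting left by 4 bits is the same as multiplying by 16.
    return (segment << 4) + offset


print(hex(physical_address(0x8000, 0x0100)))  # 0x80100
```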

The math is rarely that neat, though -- 0x80100 can be represented by literally thousands of different segment:offset combinations (4096, if my math is right).
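
To check that count, here is a small brute-force sketch in Python. The helper name `aliases` is hypothetical; it simply tries every 16-bit segment and keeps the ones for which the remaining offset still fits in 16 bits.

```python
def aliases(address: int) -> list[tuple[int, int]]:
    """Return every (segment, offset) pair with segment * 16 + offset == address."""
    pairs = []
    for segment in range(0x10000):
        offset = address - (segment << 4)
        if 0 <= offset <= 0xFFFF:
            pairs.append((segment, offset))
    return pairs


pairs = aliases(0x80100)
print(len(pairs))  # 4096
print([f"{s:04X}:{o:04X}" for s, o in (pairs[0], pairs[-1])])
# ['7011:FFF0', '8010:0000']
```

Addresses very close to the bottom of memory have fewer aliases, because there are not enough lower segments to reach them from.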

Solution 2:

In x86 real mode, the physical address is 20 bits long and is therefore calculated as:

PhysicalAddress = Segment * 16 + Offset

Check also: Real-Mode Memory Management

Solution 3:

I want to add an answer here just because I've been scouring the internet trying to understand this too. The other answers were leaving out a key piece of information that I did get from the link presented in one of the answers. However, I almost totally missed it. Reading through the linked page, I still wasn't understanding how this was working.

The problem I was probably having came from only really understanding how the Commodore 64 (6502 processor) lays out memory. It uses a similar notation to address memory. It has 64k of total memory, and uses 8-bit values of PAGE:OFFSET to access memory. Each page is 256 bytes long (8-bit number) and the offset points to one of the values in that page. Pages are laid out back-to-back in memory, so page 2 starts where page 1 ends. I was going into the 386 expecting the same scheme. This is not so.

Real mode uses a similar style, just with different wording: SEGMENT:OFFSET. A segment is 64k in size. However, the segments themselves are not laid out back-to-back like the Commodore's pages were. Their starting addresses are spaced only 16 bytes apart, so neighbouring segments overlap. The offset still works the same way, indicating how many bytes from the start of the page/segment.
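
A short Python sketch makes the overlap visible. The names `segment_range`, `SEGMENT_SIZE` and `PARAGRAPH` are made up for this illustration; the numbers come straight from the description above (64k per segment, 16 bytes between segment starts).

```python
SEGMENT_SIZE = 0x10000  # each segment spans 64k
PARAGRAPH = 16          # consecutive segments start 16 bytes apart


def segment_range(segment: int) -> tuple[int, int]:
    """Return the first and last linear address reachable through a segment."""
    start = segment * PARAGRAPH
    return start, start + SEGMENT_SIZE - 1


for seg in (0x0000, 0x0001, 0x0002):
    start, end = segment_range(seg)
    print(f"segment {seg:04X}: {start:05X}..{end:05X}")
# segment 0000: 00000..0FFFF
# segment 0001: 00010..1000F
# segment 0002: 00020..1001F
```

Segment 0001 starts only 16 bytes after segment 0000, so almost all of its 64k is shared with its neighbour, unlike the back-to-back 256-byte pages on the Commodore.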

I hope this explanation helps anyone else who finds this question; it has helped me in writing it.

Solution 4:

I can see the question and answers are some years old, but they contain a wrong statement: that only 16-bit registers exist in real mode.

In real mode the registers are not only 16-bit, because there are 8-bit registers too. Each of these 8-bit registers is the lower or the higher half of a 16-bit register.

And when real mode runs on an 80386 or later, we get 32-bit registers and additionally two new instruction prefixes: one to override the default operand size and one to override the default address size of a single instruction inside a code segment.

These instruction prefixes can be combined to override both the operand size and the address size of one instruction. In real mode the default operand size and address size are 16 bits. With these two prefixes we can use a 32-bit operand or register, for example to calculate a 32-bit value in a 32-bit register, or to move a 32-bit value to and from a memory location. We can also use all 32-bit registers (possibly in a base+index*scale+displacement combination) as address registers, but the resulting effective address must not exceed the 64 KB segment limit.

(On the OSDev Wiki page, the table for the "Operand-size and address-size override prefix" says that the "0x66 operand prefix" and the "0x67 address prefix" are N/A (not available) for real mode and virtual 8086 mode. http://wiki.osdev.org/X86-64_Instruction_Encoding
But this is totally wrong, because the Intel manual contains this statement: "These prefixes can be used in real-address mode as well as in protected mode and virtual-8086 mode".)

Starting with the Pentium MMX we get eight 64-bit MMX registers.
Starting with the Pentium 3 we get eight 128-bit XMM registers.
..

If I am not wrong, the 256-bit YMM registers, the 512-bit ZMM registers and the 64-bit general-purpose registers of x64 cannot be used in real mode.

Dirk

Solution 5:

Minimal example

With:

offset = msg
segment = ds

mov $0, %ax
mov %ax, %ds
mov %ds:msg, %al
/* %al contains 1 */

mov $1, %ax
mov %ax, %ds
mov %ds:msg, %al
/* %al contains 2: 1 * 16 bytes forward. */

msg:
.byte 1
.fill 15
.byte 2
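
If you do not have an emulator at hand, the effect of changing %ds can be imitated with a rough Python model. This is only a sketch: the `memory` array, the `load_byte` helper and the value chosen for `MSG` are assumptions for illustration, since the real offset of msg depends on where the code is loaded.

```python
memory = bytearray(1 << 20)  # 1 MiB of simulated real-mode memory

MSG = 0x7C20                 # hypothetical offset of the msg label
memory[MSG] = 1              # .byte 1
# .fill 15 leaves the next 15 bytes at zero
memory[MSG + 16] = 2         # .byte 2


def load_byte(segment: int, offset: int) -> int:
    """Read one byte through segment:offset, like mov %ds:msg, %al."""
    return memory[(segment << 4) + offset]


print(load_byte(0, MSG))  # 1
print(load_byte(1, MSG))  # 2, because the segment moved the base 16 bytes forward
```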

So if you want to access memory above 64k:

mov $0xF000, %ax
mov %ax, %ds

Note that this allows for addresses larger than 20 bits wide if you use something like:

0x10 * 0xFFFF + 0xFFFF == 0x10FFEF

On earlier processors which had only 20 address wires, it was simply truncated, but later on things got complicated with the A20 line (21st address wire): https://en.wikipedia.org/wiki/A20_line
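
The two behaviours can be compared with a simplified Python sketch. The function `linear_address` and its `a20_enabled` flag are hypothetical names; the model just masks the result to 20 bits when the 21st address line is treated as absent or disabled, and ignores everything else real hardware does.

```python
def linear_address(segment: int, offset: int, a20_enabled: bool) -> int:
    """Compute segment * 16 + offset, optionally wrapping around at 1 MiB."""
    address = (segment << 4) + offset
    if not a20_enabled:
        address &= 0xFFFFF  # keep only 20 bits, as with 20 address wires
    return address


print(hex(linear_address(0xFFFF, 0xFFFF, a20_enabled=True)))   # 0x10ffef
print(hex(linear_address(0xFFFF, 0xFFFF, a20_enabled=False)))  # 0xffef
```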

The example is available in a GitHub repo with the required boilerplate to run it.