What is the difference between native code, machine code and assembly code?
The terms are indeed a bit confusing, because they are sometimes used inconsistently.
Machine code: This is the most well-defined one. It is code made up of the binary instructions that your processor (the physical chip that does the actual work) understands and executes directly. All other code must be translated or transformed into machine code before your machine can execute it.
Native code: This term is sometimes used in places where machine code (see above) is meant. However, it is also sometimes used to mean unmanaged code (see below).
Unmanaged code and managed code: Unmanaged code refers to code written in a programming language such as C or C++, which is compiled directly into machine code. It contrasts with managed code, which is written in C#, VB.NET, Java, or similar, and executed in a virtual environment (such as .NET or the Java VM) which kind of “simulates” a processor in software. The main difference is that managed code “manages” the resources (mostly the memory allocation) for you by employing garbage collection and by keeping references to objects opaque. Unmanaged code is the kind of code that requires you to manually allocate and de-allocate memory, sometimes causing memory leaks (when you forget to de-allocate) and sometimes segmentation faults (when you de-allocate too soon and then use the memory anyway). Unmanaged also usually implies that there are no run-time checks for common errors such as null-pointer dereferencing or out-of-bounds array access.
Strictly speaking, most dynamically-typed languages — such as Perl, Python, PHP and Ruby — are also managed code. However, they are not commonly described as such, which shows that managed code is actually somewhat of a marketing term for the really big, serious, commercial programming environments (.NET and Java).
Assembly code: This term generally refers to the kind of source code people write when they really want to write byte-code. An assembler is a program that turns this source code into real byte-code. It is usually not called a compiler because the transformation is essentially 1-to-1: each line of assembly corresponds to one instruction. However, the term is ambiguous as to what kind of byte-code is used: it could be managed or unmanaged. If it is unmanaged, the resulting byte-code is machine code. If it is managed, it results in the byte-code used behind the scenes by a virtual environment such as .NET. Managed code (e.g. C#, Java) is compiled into this special byte-code language, which in the case of .NET is called Common Intermediate Language (CIL) and in Java is called Java byte-code. There is usually little need for the common programmer to access this code or to write in this language directly, but when people do, they often refer to it as assembly code because they use an assembler to turn it into byte-code.
What you see when you use Debug + Windows + Disassembly when debugging a C# program is a good guide for these terms. Here's an annotated version of it when I compile a 'hello world' program written in C# in the Release configuration with JIT optimization enabled:
static void Main(string[] args) {
Console.WriteLine("Hello world");
00000000 55 push ebp ; save stack frame pointer
00000001 8B EC mov ebp,esp ; setup current frame
00000003 E8 30 BE 03 6F call 6F03BE38 ; Console.Out property getter
00000008 8B C8 mov ecx,eax ; setup "this"
0000000a 8B 15 88 20 BD 02 mov edx,dword ptr ds:[02BD2088h] ; arg = "Hello world"
00000010 8B 01 mov eax,dword ptr [ecx] ; TextWriter reference
00000012 FF 90 D8 00 00 00 call dword ptr [eax+000000D8h] ; TextWriter.WriteLine()
00000018 5D pop ebp ; restore stack frame pointer
}
00000019 C3 ret ; done, return
Right-click the window and tick "Show Code Bytes" to get a similar display.
The column on the left is the machine code address. Its value is faked by the debugger; the code is actually located somewhere else. But that could be anywhere, depending on the location selected by the JIT compiler, so the debugger just starts numbering addresses from 0 at the start of the method.
The second column is the machine code: the actual 1s and 0s that the CPU executes. Machine code, like here, is commonly displayed in hex. Illustrative perhaps is that 0x8B selects a MOV instruction; the additional bytes are there to tell the CPU exactly what needs to be moved. Also note the two flavors of the CALL instruction: 0xE8 is the direct call, 0xFF (together with the byte that follows it) is the indirect call.
The third column is the assembly code. Assembly is a simple language, designed to make it easier to write machine code. It compares to C# being compiled to IL. The compiler used to translate assembly code is called an "assembler". You probably have the Microsoft assembler on your machine; its executable name is ml.exe, or ml64.exe for the 64-bit version. There are two common syntaxes for this assembly language in use. The one you see is the one that Intel and AMD use. In the open source world, assembly in the AT&T notation is common. The language itself is heavily dependent on the kind of CPU for which it was written; the assembly language for a PowerPC is very different.
Okay, that tackles two of the terms in your question. "Native code" is a fuzzy term; it is often used to describe code written in an unmanaged language. Instructive perhaps is to see what kind of machine code is generated by a C compiler. This is the 'hello world' version in C:
int _tmain(int argc, _TCHAR* argv[])
{
00401010 55 push ebp
00401011 8B EC mov ebp,esp
printf("Hello world");
00401013 68 6C 6C 45 00 push offset ___xt_z+128h (456C6Ch)
00401018 E8 13 00 00 00 call printf (401030h)
0040101D 83 C4 04 add esp,4
return 0;
00401020 33 C0 xor eax,eax
}
00401022 5D pop ebp
00401023 C3 ret
I didn't annotate it, mostly because it is so similar to the machine code generated by the C# program. The printf() function call is quite different from the Console.WriteLine() call but everything else is about the same. Also note that the debugger is now showing the real machine code address and that it is a bit smarter about symbols, a side effect of generating debug info after generating machine code, like unmanaged compilers often do. I should also mention that I turned off a few machine code optimization options to make the machine code look similar. C/C++ compilers have a lot more time available to optimize code, and the result is often hard to interpret and very hard to debug.
The key point here is that there are very few differences between machine code generated from a managed language by the JIT compiler and machine code generated by a native code compiler, which is the primary reason why the C# language can be competitive with a native code compiler. The only real difference between them is the support function calls, many of which are implemented in the CLR. And that revolves primarily around the garbage collector.
Native code and machine code are the same thing -- the actual bytes that the CPU executes.
Assembly code has two meanings. One is the machine code translated into a more human-readable form, with the bytes for the instructions translated into short wordlike mnemonics such as "JMP" (which "jumps" to another spot in the code). The other is the IL bytecode that lives in a DLL or EXE: instruction bytes that compilers like C# or VB generate, which will end up translated into machine code eventually but aren't yet.