In C or C++, when two variables are declared with identical integer values, do we end up with two distinct memory blocks for those two integers?

If we have

int a = 123;
int b = 123;

will we end up with two distinct memory blocks allocated for the integer 123, or only one memory block allocated for 123, with variables a and b both located at the same memory address?

What about

int a = 123;
int b = a;

Does this change the answer?

I tried to print out the memory addresses of both variables in C++ and found that they are different

  int a = 123;
  cout << &a << endl; // 0x7fff46512da0
  int b = 123; 
  cout << &b << endl; // 0x7fff46512da4

Does this mean in that specific environment the program stores duplicate int 123 at two different memory blocks?

Does the answer change if the values are strings?

The reason I am asking this question is that I found out in Python the memory addresses are always the same for primitive values if they are equal. I heard it is because of the constant pool. I wonder if this still applies to C and C++?

e.g.

a = 123
b = 123

print(id(a))  # 9792896
print(id(b))  # 9792896

Solution 1:

C and C++ programs have dual natures. The meaning of a program is described using a theoretical model with an abstract computer that executes the program literally as the source describes it. In this model, each object has different memory from every other object, because an object is by definition reserved memory, associated with a type. (Note that string literals in source code may overlap, referring to one common array.)

However, a compiler is not required to produce assembly code that executes this meaning literally. It may produce any program that has the same observable behavior as the original source code. Observable behavior includes the output the program writes to files, input/output interactions, and accesses to volatile objects. In between observable behavior, the compiler can optimize the program, including eliminating unnecessary memory use.

Whenever you define an object, the compiler might not reserve memory for it at all if it is able to make your program work without using such memory. For example, in:

#include <stdio.h>

int main(void)
{
    int a = 123;
    int b;
    scanf("%d", &b);
    printf("%d\n", a+b);
}

the compiler is likely to perform the calculation by loading the constant 123 as an immediate operand of an instruction, without reserving separate memory for it.

If the compiler does need memory, perhaps because it does not have enough processor registers to keep everything it is working with in registers, then it might keep only one copy of a constant that is used to initialize two objects which are never changed and whose addresses are not taken.

If you pass the objects by address to other routines or give them different values, the compiler is more likely to reserve separate memory for them, depending on circumstances.

Solution 2:

Eric's answer is very good. I will add some practical cases using C as the base language for my answer.

Take the following code:

#include <stdio.h>

int main() {

    int a = 123;
    int b = 123;

    printf("%d", a);
    printf("%d", b);
}

If you compile this code with the GCC 11.2 x86-64 C compiler (Intel syntax), the following assembly is produced:

.LC0:
        .string "%d"
main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 123
        mov     DWORD PTR [rbp-8], 123
        mov     eax, DWORD PTR [rbp-4]
        mov     esi, eax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     eax, DWORD PTR [rbp-8]
        mov     esi, eax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     eax, 0
        leave
        ret

As you can see, storage is provided for both variables, at [rbp-4] and [rbp-8].

Now, if I use the -O optimization flag, the following assembly is produced:

.LC0:
        .string "%d"
main:
        sub     rsp, 8
        mov     esi, 123
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     esi, 123
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     eax, 0
        add     rsp, 8
        ret

The compiler just uses the 123 literal. Because no changes are made to those variables, it figures they can be treated as constant values and that no storage is needed.

That doesn't mean that the literal exists in the ether: it has to be embedded in the assembly, here as an immediate operand of the mov instructions.

In Python everything is an object, even values of primitive types. Notice that print(id(a)) and print(id(123)) will render the same result. In both cases you get the identifier of the specific object 123 (a pointer or reference to it, if you will), and nothing related to the variable to which it is assigned.

C and C++, on the other hand, are not like Python: int literals are not objects and there are no references to them, just the bits. A literal has no address to take, so for the 123 example the closest we can get is to convert the value itself to a pointer and print that:

printf("%p\n", (void*)123);

What happens here:

mov esi, 123 // sets ESI register to 123
mov edi, OFFSET FLAT:.LC0 //unimportant, gets the specifier string
mov eax, 0 // sets EAX register to 0
call printf // prints the literal

The output:

0x7b // 123 hexadecimal

Now let's also print the address of a variable that has 123 assigned:

int a = 123;   
printf("%p", (void*)&a);

Looking at the assembly we can spot the difference:

mov DWORD PTR [rsp+12], 123 // moving `123` literal to its address
lea rsi, [rsp+12] // placing the address in the register
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf // printing the address

In this case the address of the variable is printed, as expected. The literal was placed in the memory location where the variable a lives, therefore we can print its address.

If you have two variables with the same value, they're probably going to have different addresses, but if the compiler finds a way to have only one address for the two variables and still produce the desired outcome, there is no rule preventing it.

The language standard places few constraints on how a compiler achieves this: it just has to conform to the language rules and produce a program that behaves in a consistent, defined manner in all circumstances, provided that the source is correctly coded.

The assignment of a to b by itself changes nothing, nor does the fact that the literal is a string. Modern compilers are very smart, so there will probably be only one copy of the same string literal (especially considering that string literals created by assignment to pointers are immutable), unless there are platform-specific constraints that prevent it. The bottom line is that you can't assume either way: no language rule mandates or prevents it.

Side note:

C and C++ are different languages. I want to point this out explicitly because C++ is often mistakenly regarded as a superset of C. That may have been the case in the early years, but it is not true today: these are very different languages, despite the fact that C++ retains compatibility, for the most part, with C code.

Solution 3:

will we end up with two distinct memory blocks allocated

As far as the abstract machine is concerned: Yes, the variables have overlapping storage duration, so they must have distinct memory addresses.

As far as the language implementation is concerned: It depends. There could even be no memory used at all if it isn't needed.

Does this change the answer?

No.

Solution 4:

If we have

int a = 123;
int b = 123;

will we end up with two distinct memory blocks allocated for the integer 123, or only one memory block allocated for 123, with variables a and b both located at the same memory address?

Regardless of what values they are initialized with or hold at any time, the two objects declared by those two distinct declarations are logically distinct objects, with, therefore, logically distinct storage.

Compilers may play all manner of games and trickery under the hood, but there is no way for a conforming C program containing those declarations, running on a conforming C implementation, to perceive them as referring to the same object or having the same or overlapping storage.

What about

int a = 123;
int b = a;

Does this change the answer?

No. The initial values specified, if any, have nothing to do with whether the two objects declared have the same storage (as far as C semantics are concerned or can discern).

I tried to print out the memory addresses of both variables in C++ and found that they are different

  int a = 123;
  cout << &a << endl; // 0x7fff46512da0
  int b = 123; 
  cout << &b << endl; // 0x7fff46512da4

Does this mean in that specific environment the program stores duplicate int 123 at two different memory blocks?

Yes.

Does the answer change if the values are strings?

No, but see below. Whether by "strings" you mean std::strings or arrays of char or pointers to char, separately declared objects are separate objects, with separate storage.

However, different appearances of C string literals with the same content do not necessarily have separate storage. These are not declared objects, so this does not conflict with the "No"s above, but it does mean that in a case such as this ...

const char *a = "foo";
const char *b = "foo";

... it might be true that a == b. Even then, however, you can still rely on &a == &b to evaluate to false, because the storage for the pointers identified by a and b is different, even if they point to the same object. If you don't understand why I'm calling out this case then that's fine -- all the better, in fact.

The reason I am asking this question is that I found out in Python the memory addresses are always the same for primitive values if they are equal. I heard it is because of the constant pool. I wonder if this still applies to c/c++?

Python is very different from C++, and even more different from C. Python's built-in types (its language specification does not use the term "primitive types") are all object types, analogous to C++ classes. Values of Python built-in types are analogous to C++ class instances. For most numeric types among them, these require several times more storage than would be needed to represent the numeric value alone, so for efficiency, Python has a constant pool so that some of those objects can be shared rather than duplicated. That works out because the types involved are immutable, so for the most part, only the values they represent are important, not the identities of the objects containing them.

That is not the case for C's built-in types, whether in C or in C++. These are not represented by wrapper objects, just by the bits that make up the values themselves. Moreover, it's not even an apples-to-apples comparison. All Python variables behave similarly to C++ references (and even more like Java references), and there are no pointers. Thus, in Python you cannot even talk about the storage for a variable itself, only about the storage for the object to which it refers.

Different Python variables do have their own, separate storage in exactly the same sense that C and C++ variables do, but you cannot look at it or touch it except to determine the object to which the variable refers or to make it refer to a different object. But you know that they have distinct storage because if they didn't then assigning a new value to one would change the value of the other as well.