Why do we use arrays instead of other data structures?

Solution 1:

Time to go back in time for a lesson. While we don't think about these things much in our fancy managed languages today, they are built on the same foundation, so let's look at how memory is managed in C.

Before I dive in, a quick explanation of what the term "pointer" means. A pointer is simply a variable that "points" to a location in memory. It doesn't contain the actual value stored at that location; it contains the memory address of it. Think of a block of memory as a mailbox. The pointer would be the address of that mailbox.
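
To make the mailbox analogy concrete, here is a minimal C sketch (the variable names are purely illustrative):

    #include <stdio.h>

    int main(void)
    {
        int value = 42;       /* the "mail" stored in the mailbox           */
        int *ptr  = &value;   /* the "address" of the mailbox, not the mail */

        printf("value itself:        %d\n", value);
        printf("address of value:    %p\n", (void *)&value);
        printf("what ptr holds:      %p\n", (void *)ptr);  /* same address  */
        printf("what ptr points to:  %d\n", *ptr);         /* back to 42    */

        return 0;
    }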

In C, an array is simply a pointer plus an offset; the offset specifies how far from that pointer to look in memory. This provides O(1) access time.

  MyArray   [5]
     ^       ^
  Pointer  Offset

All other data structures either build upon this or do not use adjacent memory for storage, resulting in poor random-access lookup time (though there are other benefits to not using sequential memory).

For example, let's say we have an array containing the 6 numbers (6, 4, 2, 3, 1, 5). In memory it would look like this:

=====================================
|  6  |  4  |  2  |  3  |  1  |  5  |
=====================================

In an array, each element sits next to the others in memory. A C array (called MyArray here) is simply a pointer to the first element:

=====================================
|  6  |  4  |  2  |  3  |  1  |  5  |
=====================================
   ^
MyArray
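
A quick way to convince yourself of this is to print the address of each element; the snippet below is a minimal sketch (the exact addresses, and the spacing between them, depend on the platform's int size):

    #include <stdio.h>

    int main(void)
    {
        int MyArray[6] = {6, 4, 2, 3, 1, 5};

        /* Each address should differ from the previous one by sizeof(int). */
        for (int i = 0; i < 6; i++)
            printf("MyArray[%d] = %d  at address %p\n",
                   i, MyArray[i], (void *)&MyArray[i]);

        return 0;
    }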

If we wanted to look up MyArray[4], internally it would be accessed like this:

   0     1     2     3     4 
=====================================
|  6  |  4  |  2  |  3  |  1  |  5  |
=====================================
                           ^
MyArray + 4 ---------------/
(Pointer + Offset)

Because we can directly access any element in the array by adding the offset to the pointer, we can look up any element in the same amount of time, regardless of the size of the array. This means that getting MyArray[1000] would take the same amount of time as getting MyArray[5].
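
In C this relationship is literal: the indexing syntax is defined in terms of pointer arithmetic, so MyArray[4] and *(MyArray + 4) read the same element. A minimal sketch, reusing the example array:

    #include <stdio.h>

    int main(void)
    {
        int MyArray[6] = {6, 4, 2, 3, 1, 5};

        int a = MyArray[4];        /* normal indexing syntax         */
        int b = *(MyArray + 4);    /* pointer + offset, same element */

        printf("%d %d\n", a, b);   /* prints: 1 1                    */
        return 0;
    }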

An alternative data structure is the linked list. This is a linear sequence of nodes, each holding its data and a pointer to the next node:

========    ========    ========    ========    ========
| Data |    | Data |    | Data |    | Data |    | Data |
|      | -> |      | -> |      | -> |      | -> |      | 
|  P1  |    |  P2  |    |  P3  |    |  P4  |    |  P5  |        
========    ========    ========    ========    ========

P(X) stands for the pointer to the next node.

Note that I made each "node" into its own block. This is because they are not guaranteed to be (and most likely won't be) adjacent in memory.

If I want to access P3, I can't directly access it, because I don't know where it is in memory. All I know is where the root (P1) is, so instead I have to start at P1, and follow each pointer to the desired node.

This gives an O(N) lookup time (the cost of a lookup grows with the number of elements). It is much more expensive to get to P1000 than to get to P4.
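
Roughly sketched in C (the node layout and the get_node helper are just illustrative, not from any particular library), the traversal looks like this; reaching the Nth node costs N pointer hops:

    #include <stdlib.h>

    struct Node {
        int          data;
        struct Node *next;   /* the "P" from the diagram: pointer to the next node */
    };

    /* Walk the list from the head, following one pointer per step,
       until we reach the node at position `index` (0-based).        */
    struct Node *get_node(struct Node *head, int index)
    {
        struct Node *current = head;
        while (current != NULL && index > 0) {
            current = current->next;   /* one hop per element: O(N) overall */
            index--;
        }
        return current;   /* NULL if the list was shorter than index */
    }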

Higher-level data structures, such as hash tables, stacks, and queues, may all use an array (or multiple arrays) internally, while linked lists and binary trees usually use nodes and pointers.

You might wonder why anyone would use a data structure that requires linear traversal to look up a value instead of just using an array, but they have their uses.

Take our array again. This time, I want to find the array element that holds the value '5'.

=====================================
|  6  |  4  |  2  |  3  |  1  |  5  |
=====================================
   ^     ^     ^     ^     ^   FOUND!

In this situation, I don't know what offset to add to the pointer to find that value, so I have to start at index 0 and work my way up until I find it. In this case that means performing 6 checks.

Because of this, searching for a value in an array is considered O(N). The cost of searching increases as the array gets larger.
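
In code, that search is nothing more than a loop over every element until the value turns up; a small sketch:

    /* Returns the index of `target`, or -1 if it is not in the array.
       In the worst case every element is checked: O(N).               */
    int linear_search(const int *array, int length, int target)
    {
        for (int i = 0; i < length; i++) {
            if (array[i] == target)
                return i;
        }
        return -1;
    }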

Remember up above where I said that non-sequential data structures can have advantages of their own? Searching for data is one of those advantages, and one of the best examples is the Binary Tree.

A Binary Tree is a data structure similar to a linked list; however, instead of linking to a single next node, each node can link to two child nodes.

         ==========
         |  Root  |         
         ==========
        /          \ 
  =========       =========
  | Child |       | Child |
  =========       =========
                  /       \
            =========    =========
            | Child |    | Child |
            =========    =========

 Assume that each connector is really a pointer.

When data is inserted into a binary tree, it uses several rules to decide where to place the new node. The basic concept is that if the new value is greater than the parent's value, it is inserted to the left; if it is lower, it is inserted to the right.

This means that the values in a binary tree could look like this:

         ==========
         |   100  |         
         ==========
        /          \ 
  =========       =========
  |  200  |       |   50  |
  =========       =========
                  /       \
            =========    =========
            |   75  |    |   25  |
            =========    =========
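
A rough sketch of that insertion rule in C, following the same convention as the diagram (larger values go to the left, smaller to the right; many implementations mirror this, but the idea is identical). The names here are just illustrative:

    #include <stdlib.h>

    struct TreeNode {
        int              value;
        struct TreeNode *left;
        struct TreeNode *right;
    };

    /* Insert `value` into the tree rooted at `root`, using the same rule
       as the diagram: greater than the parent goes left, smaller goes right. */
    struct TreeNode *insert(struct TreeNode *root, int value)
    {
        if (root == NULL) {   /* empty spot found: place the new node here */
            struct TreeNode *node = malloc(sizeof *node);
            node->value = value;
            node->left  = node->right = NULL;
            return node;
        }

        if (value > root->value)
            root->left  = insert(root->left, value);
        else
            root->right = insert(root->right, value);

        return root;
    }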

When searching a binary tree for the value 75, we only need to visit 3 nodes (O(log N)) because of this structure:

  • Is 75 less than 100? Look at Right Node
  • Is 75 greater than 50? Look at Left Node
  • There is the 75!

Even though there are 5 nodes in our tree, we did not need to look at the remaining two, because we knew that they (and their children) could not possibly contain the value we were looking for. In the worst case (a completely unbalanced tree) we would still have to visit every node, but when the tree is reasonably balanced we only have to visit a small portion of the nodes.
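
And the matching search, which discards half of the remaining subtree at each comparison; this sketch reuses the struct TreeNode from the insert example above:

    /* Search for `target`, mirroring the insert rule: greater values live
       to the left, smaller values to the right. On a reasonably balanced
       tree this visits only O(log N) nodes.                               */
    struct TreeNode *search(struct TreeNode *root, int target)
    {
        while (root != NULL && root->value != target) {
            if (target > root->value)
                root = root->left;
            else
                root = root->right;
        }
        return root;   /* NULL if the value is not in the tree */
    }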

That is where arrays get beaten: they provide only a linear O(N) search time, despite their O(1) access time.

This is an extremely high-level overview of data structures in memory, skipping over a lot of details, but hopefully it illustrates an array's strengths and weaknesses compared to other data structures.

Solution 2:

For O(1) random access, which cannot be beaten.