How does the external merge sort algorithm work?

Solution 1:

I guess that after such a long time you must have found an answer, but I am still providing some example links to help anyone else who hits this question.

NOTE: Before looking into these links, you should have an idea of the heap data structure. Take a look at Example of Two-Way Sorting and Example of multiway external sorting, and you will get a complete idea of how an external sorting algorithm is implemented.

Solution 2:

First of all, by sorting the numbers in groups of 4, you should get 3 sorted chunks:

A:[1,2,3,5]  
B:[4,6,8,9]
C:[7]
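
The chunk-creation step could be sketched roughly like this. The function name `make_sorted_chunks` and the unsorted input list are made up for illustration (the input is simply one ordering that yields the chunks A, B and C above), and plain lists stand in for the chunk files:

```python
from itertools import islice


def make_sorted_chunks(numbers, chunk_size):
    """Split an iterable into sorted chunks of at most chunk_size items."""
    it = iter(numbers)
    chunks = []
    while True:
        # Only chunk_size items are pulled into memory and sorted at a time.
        chunk = sorted(islice(it, chunk_size))
        if not chunk:
            break
        # A real implementation would write the chunk to a temporary file here.
        chunks.append(chunk)
    return chunks


# Hypothetical unsorted input, chosen so the chunks match A, B and C.
chunks = make_sorted_chunks([5, 1, 3, 2, 9, 4, 8, 6, 7], 4)
print(chunks)  # [[1, 2, 3, 5], [4, 6, 8, 9], [7]]
```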

Then you read half of each of A and B (leave C for a later pass, since it won't fit in memory alongside them) and merge them. So you load {[1, 2], [4, 6]} into memory, do an ordinary two-way merge, and write the result to a new chunk D:

Compare 1 and 4 -> D:[1]
Compare 2 and 4 -> D:[1, 2]

The part of A that was in RAM has now been fully merged, so you have to bring the second half of A into memory. Your memory now holds {[3, 5], [4, 6]}.

Compare 3 and 4 -> D:[1, 2, 3]
Compare 5 and 4 -> D:[1, 2, 3, 4]
Compare 5 and 6 -> D:[1, 2, 3, 4, 5]

All of chunk A has been merged, so now just append the rest of B to D (the 6 still in memory, then 8 and 9 read from disk):

D:[1,2,3,4,5,6,8,9]
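
As a rough sketch of the walkthrough above: the names `buffered_reader` and `merge_two_chunks` are hypothetical, lists stand in for the chunk files, and each slice of `buffer_size` items models one read from disk.

```python
def buffered_reader(chunk, buffer_size):
    """Yield the items of a sorted chunk, loading buffer_size items at a time.

    The list `chunk` stands in for a file on disk; each slice models one read.
    """
    for start in range(0, len(chunk), buffer_size):
        buffer = chunk[start:start + buffer_size]  # only this part is "in RAM"
        yield from buffer


def merge_two_chunks(chunk_a, chunk_b, buffer_size):
    """Merge two sorted chunks using one small buffer per chunk."""
    reader_a = buffered_reader(chunk_a, buffer_size)
    reader_b = buffered_reader(chunk_b, buffer_size)
    merged = []  # stands in for the output chunk D on disk
    a, b = next(reader_a, None), next(reader_b, None)
    while a is not None and b is not None:
        if a <= b:
            merged.append(a)
            a = next(reader_a, None)
        else:
            merged.append(b)
            b = next(reader_b, None)
    # One chunk is exhausted: append the rest of the other one.
    while a is not None:
        merged.append(a)
        a = next(reader_a, None)
    while b is not None:
        merged.append(b)
        b = next(reader_b, None)
    return merged


D = merge_two_chunks([1, 2, 3, 5], [4, 6, 8, 9], buffer_size=2)
print(D)  # [1, 2, 3, 4, 5, 6, 8, 9]
```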

Now you would have to do the same process with chunks C and D. Remember that C could have more than one number in another example. By merging C and D you will get a new chunk E that will be the final sorted file.

Also, note that a bigger example might need more merge phases. For example, if you had 20 numbers to sort, you would create 5 chunks of 4 numbers and then merge them two at a time, resulting in 2 chunks of 8 numbers (plus one leftover chunk of 4). You would then merge the two chunks of 8 into one of 16 (the leftover chunk of 4 is carried along again), and finally merge the 16 and the 4 into the sorted file of 20.
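
To see how the chunk sizes evolve in the 20-number example, here is a small sketch that only tracks sizes (no actual data is merged). The helper name `merge_passes` is made up, and it assumes chunks are merged pairwise in order, with an odd chunk carried over to the next phase:

```python
def merge_passes(chunk_sizes):
    """Return the list of chunk sizes after each pairwise merge phase."""
    passes = []
    while len(chunk_sizes) > 1:
        # Merge neighbours two at a time; a leftover chunk is carried over as-is.
        chunk_sizes = [sum(chunk_sizes[i:i + 2])
                       for i in range(0, len(chunk_sizes), 2)]
        passes.append(chunk_sizes)
    return passes


phases = merge_passes([4, 4, 4, 4, 4])
print(phases)  # [[8, 8, 4], [16, 4], [20]]
```

With pairwise merging the number of chunks roughly halves each phase, so 5 chunks need 3 phases here.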

Solution 3:

You'll iterate through the files at the same time.

Just start from the beginning of each file and keep picking whichever file's current element is not greater than (i.e. smaller than or equal to) the other's; output that element to the new file and advance that file's iterator.

From your last statement, it's unclear whether or not you already know to do this, but this is all you need to do, because:

  • You'd only need to have one number in memory for each of the files, and of course any indices and other variables that are presumably ignored for the purpose of this exercise.

  • You only need to read each file once, as you can keep the files open at the correct position during this process so you don't need to read the whole file again to get to the correct position.

So, for:

A:[1,2,3,5]
B:[4,6,8,9]

You'd start off with the first element from each file - 1 and 4.

The 1 is smaller, so you output that to the new file and move on to 2.

2 is smaller than 4, so you output that and move on to 3.

3 is smaller than 4, so you output that and move on to 5.

4 is smaller than 5, so you output that and move on to 6.

5 is smaller than 6, so you output that and then you've reached the end of A.

Now just output the rest of B: 6, 8, 9.

This gives you [1,2,3,4,5,6,8,9].
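
A minimal sketch of this with real files might look like the following. It assumes one integer per line with no blank lines; `read_int`, `merge_files` and the temporary file names are made up for the example. Only one number per input file is held at a time, and each file is read once from start to end.

```python
import os
import tempfile


def read_int(f):
    """Read the next integer from an open file, or None at end of file."""
    line = f.readline()
    return int(line) if line.strip() else None


def merge_files(path_a, path_b, path_out):
    """Merge two sorted files (one integer per line) into path_out."""
    with open(path_a) as fa, open(path_b) as fb, open(path_out, 'w') as out:
        a, b = read_int(fa), read_int(fb)
        while a is not None and b is not None:
            if a <= b:
                out.write(f'{a}\n')
                a = read_int(fa)
            else:
                out.write(f'{b}\n')
                b = read_int(fb)
        # One file is exhausted: output the rest of the other one.
        while a is not None:
            out.write(f'{a}\n')
            a = read_int(fa)
        while b is not None:
            out.write(f'{b}\n')
            b = read_int(fb)


with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, values in (('A', [1, 2, 3, 5]), ('B', [4, 6, 8, 9])):
        paths[name] = os.path.join(tmp, name + '.txt')
        with open(paths[name], 'w') as f:
            f.writelines(f'{v}\n' for v in values)
    out_path = os.path.join(tmp, 'merged.txt')
    merge_files(paths['A'], paths['B'], out_path)
    with open(out_path) as f:
        result = [int(line) for line in f]

print(result)  # [1, 2, 3, 4, 5, 6, 8, 9]
```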

Solution 4:

External sorting is usually used when you need to sort files that are too large to fit into memory.

The trick is to break the large input file into k smaller sorted chunks and then merge the chunks into one large sorted file. Use a min-heap for the merge. k will depend on your memory threshold.

Read a certain number of records (depending on your memory threshold) from each chunk and put them in a queue, one queue per chunk.

Pop the leftmost item from each queue (this is the smallest item, since the items in each queue are sorted) and push it onto the heap.

Pop the min item from the heap and note which queue it came from.

Replenish that queue with the next item from its corresponding chunk that is not yet in the queue.

Pop the leftmost item from that queue and push it onto the heap.

Write the min item to the output file.

Repeat the above 4 steps until the heap is empty.

Sample Python code (does not merge in place):

import os
import heapq
import itertools
import linecache
from collections import deque
import sys


def external_sort(input_directory, input_file_name, output_file_name):
    with open(os.path.expanduser(input_directory + '/' + output_file_name), 'w+') as f:
        heap = []
        pages = {}
        next_line_numbers = {}
        has_more_items = {}
        chunk_file_paths, max_chunk_size = create_sorted_chunks(input_directory, input_file_name)
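        # Keep at least one item per page; small inputs would otherwise get a page size of 0.
        max_page_size = max(1, max_chunk_size // 10)
        for chunk_file_path in chunk_file_paths:
            pages[chunk_file_path] = populate_page(chunk_file_path, max_page_size)
            # linecache line numbers are 1-based, so the next unread line is len(page) + 1.
            next_line_numbers[chunk_file_path] = len(pages[chunk_file_path]) + 1
            has_more_items[chunk_file_path] = True
        for chunk_file_path in chunk_file_paths:
            # Skip chunks that turned out to be empty.
            if pages[chunk_file_path]:
                heapq.heappush(heap, pages[chunk_file_path].popleft())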
        while heap:
            item, chunk_file_path = heapq.heappop(heap)
            f.write(str(item)+'\n')
            if has_more_items[chunk_file_path]:
                has_more_items[chunk_file_path] = append_next(pages, chunk_file_path, next_line_numbers[chunk_file_path])
                next_line_numbers[chunk_file_path] += 1
            if pages[chunk_file_path]:
                heapq.heappush(heap, pages[chunk_file_path].popleft())
    for chunk_file_path in chunk_file_paths:
        os.remove(chunk_file_path)


def populate_page(chunk_file_path, max_page_size):
    chunk = deque()
    with open(chunk_file_path, 'r') as f:
        for line in itertools.islice(f, 0, max_page_size):
            chunk.append((int(line), chunk_file_path))
    return chunk


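def append_next(chunks, chunk_file_path, line_number):
    """Append line `line_number` (1-based) of the chunk file to its page.

    Returns False once the chunk file has no more lines.
    """
    chunk = chunks[chunk_file_path]
    # Note: linecache keeps the lines of every file it reads cached in memory.
    item = linecache.getline(chunk_file_path, line_number)
    if item.strip():
        chunk.append((int(item), chunk_file_path))
        return True
    return False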


def create_sorted_chunks(input_file_directory, input_file_name):
    input_file_path = os.path.expanduser(input_file_directory + '/' + input_file_name)
    num_chunks = 10
    chunk_file_paths = []
    # Count the lines by streaming; readlines() would load the whole file into memory.
    with open(input_file_path, 'r') as f:
        tot = sum(1 for _ in f)
    chunk_size = tot // num_chunks
    # Read the input once, front to back, cutting it into num_chunks pieces.
    with open(input_file_path, 'r') as f:
        for suffix in range(1, num_chunks + 1):
            chunk_file_name = 'temp' + str(suffix) + '.txt'
            chunk_file_path = os.path.expanduser(input_file_directory + '/' + chunk_file_name)
            # The last chunk also takes the tot % num_chunks leftover lines.
            stop = chunk_size if suffix < num_chunks else None
            buffer = [int(line) for line in itertools.islice(f, stop)]
            # Always rewrite the chunk file; a stale temp file from an
            # earlier run would give wrong results.
            create_chunk(chunk_file_path, buffer)
            chunk_file_paths.append(chunk_file_path)
    return chunk_file_paths, chunk_size


def create_chunk(chunk_file_path, buffer):
    buffer.sort()
    with open(chunk_file_path, 'w+') as f:
        for i in buffer:
            f.write(str(i) + '\n')


if __name__ == '__main__':
    external_sort(sys.argv[1], sys.argv[2], sys.argv[3])
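
As a possible alternative for the merge phase only (not what the sample above does): the standard library's `heapq.merge` already performs a lazy heap-based k-way merge, so you can keep every chunk file open and stream from it instead of looking lines up by number. The function name `merge_chunk_files` and the file names below are made up for this sketch, and it assumes one integer per line.

```python
import heapq
import os
import tempfile
from contextlib import ExitStack


def merge_chunk_files(chunk_paths, output_path):
    """K-way merge of sorted chunk files (one integer per line)."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(path)) for path in chunk_paths]
        # Each generator reads its file lazily, so only the current value
        # per chunk (plus the file object's I/O buffer) is held in memory.
        streams = [(int(line) for line in f) for f in files]
        with open(output_path, 'w') as out:
            for value in heapq.merge(*streams):
                out.write(f'{value}\n')


with tempfile.TemporaryDirectory() as tmp:
    chunk_paths = []
    for name, values in (('A', [1, 2, 3, 5]), ('B', [4, 6, 8, 9]), ('C', [7])):
        path = os.path.join(tmp, name + '.txt')
        with open(path, 'w') as f:
            f.writelines(f'{v}\n' for v in values)
        chunk_paths.append(path)
    output_path = os.path.join(tmp, 'sorted.txt')
    merge_chunk_files(chunk_paths, output_path)
    with open(output_path) as f:
        merged = [int(line) for line in f]

print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Keep in mind that this holds one open file handle per chunk, so for a very large k you may hit the operating system's open-file limit and need several merge phases.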

Solution 5:

Please read the README file linked below to properly understand external merge sort.

It describes a step-by-step implementation:

https://github.com/melvilgit/external-Merge-Sort/blob/master/README.md