Select random lines from a file
In a Bash script, I want to pick N random lines from an input file and write them to another file.
How can this be done?
Solution 1:
Use shuf with the -n option, as shown below, to get N random lines:
shuf -n N input > output
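For example, to write 100 random words from the system dictionary to a new file (the output name here is just an illustration):
shuf -n 100 /usr/share/dict/words > random_words.txt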
Solution 2:
Sort the file randomly and pick the first 100 lines:
lines=100
input_file=/usr/share/dict/words
# This is the basic selection method
<$input_file sort -R | head -n $lines
# If the file has duplicates that must never cause duplicate results
<$input_file sort | uniq | sort -R | head -n $lines
# If the file has blank lines that must be filtered, use sed
<$input_file sed $'/^[ \t]*$/d' | sort -R | head -n $lines
Of course, <$input_file can be replaced with any piped standard input. This (sort -R and $'...\t...' to get sed to match tab chars) works with GNU/Linux and BSD/macOS.
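For example, to sample from the concatenation of two files instead of a single input (the file names here are hypothetical):
cat part1.txt part2.txt | sort -R | head -n $lines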
Solution 3:
Well, according to a comment on the shuf answer, it shuffled 78 000 000 000 lines in under a minute.
Challenge accepted...
EDIT: I beat my own record
powershuf did it in 0.047 seconds:
$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 0.02s user 0.01s system 80% cpu 0.047 total
The reason it is so fast is that I don't read the whole file at all: I just move the file pointer to ten random offsets and print the line right after each one.
Gitlab Repo
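For illustration, here is a minimal shell sketch of that idea (this is not powershuf's actual code): pick random byte offsets with shuf, skip the partial line at each offset, and print the next full line. It ignores end-of-file edge cases and slightly favors lines that follow long ones.
file=lines_78000000000.txt
size=$(stat -c %s "$file")  # GNU stat; on BSD/macOS use: stat -f %z "$file"
for offset in $(shuf -i 0-$((size - 1)) -n 10); do
    # Start reading at the offset, drop the partial first line,
    # and keep the first complete line after it.
    tail -c +$((offset + 1)) "$file" | head -n 2 | tail -n 1
done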
Old attempt
First I needed a file of 78 000 000 000 lines:
seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt
This gives me a file with 78 billion newlines ;-)
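As a sanity check (slow on a file this size), the newline count can be confirmed with wc:
wc -l lines_78000000000.txt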
Now for the shuf part:
$ time shuf -n 10 lines_78000000000.txt
shuf -n 10 lines_78000000000.txt 2171.20s user 22.17s system 99% cpu 36:35.80 total
The bottleneck was the CPU, and the fact that shuf doesn't use multiple threads: it pinned one core at 100% while the other 15 went unused.
Python is what I regularly use, so that's what I'll use to make this faster:
#!/usr/bin/env python3
import random

with open("lines_78000000000.txt", "rt") as f:
    # First pass: count the lines by reading 64 KiB chunks.
    count = 0
    while True:
        buffer = f.read(65536)
        if not buffer:
            break
        count += buffer.count('\n')

    # Pick 10 distinct random line numbers, then rewind and print
    # those lines on a second pass.
    targets = set(random.sample(range(count), 10))
    f.seek(0)
    for number, line in enumerate(f):
        if number in targets:
            print(line, end='')
This got me just under a minute:
$ time ./shuf.py
./shuf.py 42.57s user 16.19s system 98% cpu 59.752 total
I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe drive, which gives me plenty of read and write speed.
I know it can get faster, but I'll leave some room to give others a try.
Line counter source: Luther Blissett