Why does 'cat' show this strange timing behaviour?

I'm using cat to pipe different files into one big file. The number of different files varies, from two files up to ten, but the total size of all files is always the same (a couple of GB).

My problem: whenever I have a total of six files, the time it takes to concatenate them peaks (i.e. it is significantly longer than with five or seven files), and I have no idea why.

Does anyone have an idea?

The files (all the same size):

output
outputTEMP1
outputTEMP2
outputTEMP3
outputTEMP4
outputTEMP5

The command:

cat outputTEMP* >> output && rm -f outputTEMP*

The machine is currently busy with other calculations, but I will update this question when new measurements are available.


One way to debug this problem is to use strace.

strace -tt -e trace=open,close -o /tmp/strace.cat.log cat apt.list authors.txt >/tmp/t.test
cat /tmp/strace.cat.log 

23:12:08.022588 open("apt.list", O_RDONLY|O_LARGEFILE) = 3
23:12:08.023451 close(3)                = 0
23:12:08.023717 open("authors.txt", O_RDONLY|O_LARGEFILE) = 3
23:12:08.025403 close(3)                = 0

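The -tt option prefixes each system call with a time stamp at microsecond resolution. -e trace=open,close restricts the log to the open and close system calls. Try removing it and you will see a very noisy log file. On newer systems cat may open files through openat instead of open; if the log comes out empty, try -e trace=open,openat,close.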
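To turn such a log into one number per file, the open and close time stamps can be paired up. The following is only a rough sketch: open_durations is a made-up helper name, and it assumes the log has exactly the layout shown above (time stamp first, file name as the first quoted string, descriptor after the "=").

```shell
# Print "<file> <seconds open>" for each open/close pair in an "strace -tt" log.
# open_durations is a hypothetical helper, not part of strace.
open_durations() {
    awk '
    # Convert HH:MM:SS.uuuuuu to seconds since midnight.
    function secs(ts,   p) { split(ts, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
    / open(at)?\(/ {
        fd = $NF                          # descriptor returned by the call
        if (fd !~ /^[0-9]+$/) next        # skip failed opens
        split($0, q, "\"")                # q[2] is the first quoted string
        name[fd] = q[2]; start[fd] = secs($1)
    }
    / close\(/ {
        fd = $2; gsub(/[^0-9]/, "", fd)   # "close(3)" -> "3"
        if (fd in start) {
            printf "%s %.6f\n", name[fd], secs($1) - start[fd]
            delete start[fd]
        }
    }' "$1"
}
```

Calling open_durations /tmp/strace.cat.log on the log above should report roughly 0.000863 s for apt.list and 0.001686 s for authors.txt. With the real TEMP files, a single file that stays open much longer than the others would stand out immediately.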


So Davide's comment is spot on. We need two things here to make an accurate assessment:

  1. assurance that caching isn't part of the scenario
  2. actual measurement of the time it's taking.

Assuming you have the disk space, I'll describe a test scenario that will determine more accurately whether this is a real issue. If it is, the supporting evidence from this approach WILL help the developers to know it's real and to reproduce it.

To help with problem isolation, let's not do the rm part here at all. Let the TEMP files sit around afterward. You can then repeat the tests with the 'rm' part later, if you wish.

Here's the test scenario:

  • make 9 directories - one for each quantity of files ( 2 3 4 5 6 7 8 9 and 10) - if you don't have space, maybe just do 2, 5, 6, 7, and 10.
  • ensure you are putting DIFFERENT files into each of these directories; NO duplicates anywhere
  • use the time command like this:

    time (cat outputTEMP* >> output)

Capture the real, user, and sys numbers reported for each test you run.
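The scenario above could be scripted roughly as follows. This is a sketch, not a finished benchmark: make_case and run_case are made-up helper names, the case2 ... case10 directory names are arbitrary, and random data from /dev/urandom is just one way to guarantee that no two files are duplicates. The timing uses time sh -c '...' instead of time ( ... ) so that it does not depend on the shell's time keyword.

```shell
# make_case DIR N TOTAL_MIB: create DIR with N distinct files of equal size.
# (hypothetical helper; file names follow the question's outputTEMP pattern)
make_case() {
    dir=$1 n=$2 total_mib=$3
    per_file=$((total_mib / n))   # integer division: the total may come out slightly under
    mkdir -p "$dir" || return 1
    i=1
    while [ "$i" -le "$n" ]; do
        # Random data means no two files share content.
        dd if=/dev/urandom of="$dir/outputTEMP$i" bs=1048576 count="$per_file" 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# run_case DIR: time the concatenation in DIR and leave the TEMP files in place.
# sync flushes pending writes first so they do not land inside the timed region.
# Note: freshly written files may still sit in the page cache; on Linux, root
# can empty it beforehand by writing 3 to /proc/sys/vm/drop_caches.
run_case() {
    ( cd "$1" && sync && time sh -c 'cat outputTEMP* >> output' )
}

# Example run (2048 MiB per directory is an assumed total; adjust to your data):
#   for n in 2 5 6 7 10; do make_case "case$n" "$n" 2048; done
#   for n in 2 5 6 7 10; do echo "== $n files =="; run_case "case$n"; done
```

Running each case several times and comparing the real, user, and sys values across directories should show whether the six-file peak is reproducible or just noise.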

I agree with Reynolds; if this is real, you should definitely email details to [email protected].