Why does the same data take more space in a text file than in an Excel file?
The xlsx
format used by modern Excel is actually a compressed format. It's a ZIP archive that contains text (XML) files in a certain structure.
If you compress your plain text file with a similar ZIP compression tool, you should achieve similar file sizes.
Additionally, as mentioned by Bradley Uffner and Morgen in the comments, Excel will deduplicate identical strings and only store one copy of them. I'm not sure about the exact gains of such a method, and it will depend on your data set, but simple zip compression will probably get you most of the way there.1
9.1.3 Physical Packages
Each Office Open XML document is implemented as a ZIP archive.
— ECMA-376-1:2016
1 My guess is that this deduplication is most effective when you have multiple worksheets, since zip compression applies independently to each file in an archive and only over limited sections of the data at a time - by storing all strings together in a single file, there should be some benefit to the later compression. More practically, if your plain text format is in a single file anyway then there'll probably be little difference.
The answer given is correct, it is due to Excel storing your data as xml. It is also due to this, that sorting your data efficiently will also reduce the file size. Test it yourself - say you have data like
A B C
John Smith-Johnson-Williamson 12345
Sally Smith-Johnson-Williamson 67890
John Williams 34567
If you sort by C (a column with all or almost all unique values) only, then the identical values of B will not be adjacent. In Excel's xml it looks like this:
<12345><John><Smith-Johnson-Williamson>
<34567><John><Williams>
<67890><Sally><Smith-Johnson-Williamson>
If you sort by B (a column with common values), then the identical values are adjacent. In Excel's xml it looks like this:
<Smith-Johnson-Williamson><John><12345>
<Sally><67890>
<Williams><John><34567>
Because that long string is identical and adjacent, Excel knows it can pack them together, similar to when people write lists, and to repeat part of the above line, they type quotes rather than re-write the same thing. I did not find any evidence of a shared-string dictionary in my investigation - just this indentation in place of the repeated field value.
I had mailing lists of 250,000 customers across only 11 states, and on each record there was a field that was one of two strings identifying the offer they get. Our employees were for some reason accustomed to looking people up by their street address as spoken, so it was sorted on the street number column, then the street name, city, etc... when I resorted by the offer first, then state, zip code, city, street name, street number, and finally address-line-2, the file size was reduced incredibly. I examined the unpacked xml on the file sorted each way to see what was going on, and the above is what I deduced. If there are multiple fields with more than say 5 characters, but the values are of a limited set (say, ticket dispositions like 'resolved','rejected','approved',etc), then give some thought to sorting and see if it helps.