How to find the N largest files in a git repository?
I wanted to find the 10 largest files in my repository. The script I came up with is as follows:
REP_HOME_DIR=<top level git directory>
max_huge_files=10
cd ${REP_HOME_DIR}
git verify-pack -v ${REP_HOME_DIR}/.git/objects/pack/pack-*.idx | \
grep blob | \
sort -r -k 3 -n | \
head -${max_huge_files} | \
awk '{ system("printf \"%-80s \" `git rev-list --objects --all | grep " $1 " | cut -d\" \" -f2`"); printf "Size:%5d MB Size in pack file:%5d MB\n", $3/1048576, $4/1048576; }'
cd -
Is there a better/more elegant way to do the same?
By "files" I mean the files that have been checked into the repository.
Solution 1:
I found another way to do it:
git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10
Quoted from: SO: git find fat commit
Solution 2:
This bash "one-liner" displays the 10 largest blobs in the repository, sorted from smallest to largest. In contrast to the other answers, this includes all files tracked by the repository, even those not present in any branch tip.
It's very fast, easy to copy & paste and only requires standard GNU utilities.
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| tail -n 10 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
The first four lines implement the core functionality, the fifth limits the number of results, while the last two lines provide the nice human-readable output that looks like this:
...
0d99bb931299 530KiB path/to/some-image.jpg
2ba44098e28f 12MiB path/to/hires-image.png
bd1741ddce0d 63MiB path/to/some-video-1080p.mp4
For more information, including further filtering use cases and an output format more suitable for script processing, see my original answer to a similar question.
macOS users: Since numfmt
is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils
.
Solution 3:
How about
git ls-files | xargs ls -l | sort -nrk5 | head -n 10
-
git ls-files
: List all the files in the repo -
xargs ls -l
: performls -l
on all the files returned ingit ls-files
-
sort -nrk5
: Numerically reverse sort the lines based on 5th column -
head -n 10
: Print the top 10 lines
Solution 4:
Cannot comment. ypid's answer modified for powershell
git ls-tree -r -l --abbrev --full-name HEAD | Sort-Object {[int]($_ -split "\s+")[3]} | Select-Object -last 10
Edit raphinesse's solution(ish)
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | Where-Object {$_ -like "blob*"} | Sort-Object {[int]($_ -split "\s+")[2]} | Select-Object -last 10