Search for large checkins in a Subversion repository

In the past, people have checked massive files into our Subversion repository and later regretted it. To fix it, they just deleted the directory and made another checkin, which leaves the massive files in the repository's history.

Since these massive files were accidents and not intended to be in the history, I wanted to filter them out using svndumpfilter. Is there any easy way to find large directories that were checked in? Perhaps sort revision diffs by size?


I found it by analysing the output of svnadmin dump with a small Python script:

$ # dump repository to file
$ svnadmin dump /var/lib/svn/ > svn_full_dump.txt

$ # record the byte offset of every 'Revision-number' line (-a treats the binary dump as text) and save to a file
$ grep -E -boa '^Revision-number: .+$' svn_full_dump.txt > revisions.txt

$ head revisions.txt
75:Revision-number: 0
195:Revision-number: 1
664:Revision-number: 2
863:Revision-number: 3
1058:Revision-number: 4
1254:Revision-number: 5
1858:Revision-number: 6

$ # find size of checkins and sort by size
$ python revision_size.py | sort -nr | head
1971768485 r1528
44453981 r2375
39073877 r1507
34731033 r2394
30499012 r484
...

The Python file (revision_size.py) is:

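#!/usr/bin/env python
# Print "<size in bytes> r<revision>" for each revision listed in revisions.txt.

last_offset = 0
last_revision = None

with open('revisions.txt') as f:
    for line in f:
        # each line looks like "195:Revision-number: 1"
        offset, _, revision = line.strip().split(':')
        offset = int(offset)

        # the previous revision runs from its own offset up to this one
        if last_revision is not None:
            print('%s r%s' % (offset - last_offset, last_revision.strip()))

        last_revision = revision
        last_offset = offset

# the last revision is not printed, because its end offset is not in revisions.txt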

Update: fixed a bug in the revision_size.py script where the size wasn't matched up to the right revision.
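
If you would rather skip the intermediate revisions.txt and also get a size for the last revision, the same idea can be done in one pass over the dump. This is only a sketch, assuming Python 3; the names revision_sizes and REVISION_LINE are made up for illustration. Like the grep approach, it just looks for lines that start with "Revision-number: ", so file content that happens to contain such a line would be miscounted as a revision boundary.

```python
#!/usr/bin/env python3
import re
import sys

# Same idea as the grep step: a whole line of the form "Revision-number: N".
REVISION_LINE = re.compile(rb'^Revision-number: (\d+)\s*$')


def revision_sizes(lines):
    """Yield (size_in_bytes, revision) for each revision.

    `lines` is any iterable of bytes lines, e.g. a dump file opened in 'rb' mode.
    """
    offset = 0        # byte offset of the line currently being read
    start = None      # offset at which the current revision started
    revision = None   # number of the current revision
    for line in lines:
        match = REVISION_LINE.match(line)
        if match:
            if revision is not None:
                yield offset - start, revision
            start, revision = offset, int(match.group(1))
        offset += len(line)
    # The end of the dump closes the last revision.
    if revision is not None:
        yield offset - start, revision


if __name__ == '__main__' and len(sys.argv) == 2:
    with open(sys.argv[1], 'rb') as dump:
        # Print the ten largest revisions, biggest first.
        for size, rev in sorted(revision_sizes(dump), reverse=True)[:10]:
            print('%d r%d' % (size, rev))
```

Saved under a name of your choosing and run with svn_full_dump.txt as its only argument, it prints the ten largest revisions in the same "size rN" format as above.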