Getting all the filenames (not content) recursively from an HTTP directory
I actually had the exact same problem. Neither of these solutions worked for me. However, this did:
Install lftp, then do

```
lftp https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/
du -a > manifest.txt
```

and that'll give you all the directories and file names.
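If you'd rather not type the `du` command at the lftp prompt interactively, the same thing can be done in one shot; a minimal sketch, assuming lftp's `-c` option (which runs the given commands and exits) behaves as documented:

```bash
# Non-interactive variant: open the URL and dump the recursive
# listing (directories and file names) into a local manifest.txt.
lftp -c "open https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/; du -a > manifest.txt"
```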
As opposed to FTP, HTTP has no concept of a directory listing. Thus, `wget` can only look for links and follow them according to the rules the user defines.

That being said, if you absolutely want it, you can abuse `wget`'s debug mode to gather a list of the links it encounters while analyzing the HTML pages. It sure isn't pretty, but here goes:
```
wget -d -r -np -N --spider -e robots=off --no-check-certificate \
    https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ \
    2>&1 | grep " -> " | grep -Ev "\/\?C=" | sed "s/.* -> //"
```
Some sidenotes:
- This will produce a list which still contains duplicates (of directories), so you need to redirect the output to a file and run it through `uniq` for a pruned list.
- `--spider` causes `wget` not to download anything, but it will still issue an HTTP HEAD request for each file it decides to enqueue. This causes a lot more traffic than is actually needed/intended and makes the whole thing quite slow.
- `-e robots=off` is needed to ignore a `robots.txt` file which may cause `wget` not to start searching at all (which is the case for the server you gave in your question).
- If you have `wget` 1.14 or newer, you can use `--reject-regex="\?C="` to reduce the number of needless requests (for those "sort-by" links already mentioned by @slm). This also eliminates the need for the `grep -Ev "\/\?C="` step afterwards; see the combined sketch after this list.
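Putting those sidenotes together, the command might end up looking like the sketch below. It assumes `wget` 1.14+ (for `--reject-regex`) and swaps the separate `uniq` step for `sort -u`, so the output does not need to be pre-sorted before deduplication:

```bash
# Sketch: crawl the tree without downloading, drop the "sort-by" links,
# extract the discovered URLs from the debug output, and deduplicate.
wget -d -r -np -N --spider -e robots=off --no-check-certificate \
    --reject-regex="\?C=" \
    https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ \
    2>&1 | grep " -> " | sed "s/.* -> //" | sort -u > manifest.txt
```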