How to list all files in a directory and its subdirectories in hadoop hdfs
I have a folder in hdfs which has two subfolders each one has about 30 subfolders which,finally,each one contains xml files. I want to list all xml files giving only the main folder's path. Locally I can do this with apache commons-io's FileUtils.listFiles(). I have tried this
FileStatus[] status = fs.listStatus( new Path( args[ 0 ] ) );
but it only lists the two first subfolders and it doesn't go further. Is there any way to do this in hadoop?
Solution 1:
If you are using hadoop 2.* API there are more elegant solutions:
Configuration conf = getConf();
Job job = Job.getInstance(conf);
FileSystem fs = FileSystem.get(conf);
//the second boolean parameter here sets the recursion to true
RemoteIterator<LocatedFileStatus> fileStatusListIterator = fs.listFiles(
new Path("path/to/lib"), true);
while(fileStatusListIterator.hasNext()){
LocatedFileStatus fileStatus = fileStatusListIterator.next();
//do stuff with the file like ...
job.addFileToClassPath(fileStatus.getPath());
}
Solution 2:
You'll need to use the FileSystem object and perform some logic on the resultant FileStatus objects to manually recurse into the subdirectories.
You can also apply a PathFilter to only return the xml files using the listStatus(Path, PathFilter) method
The hadoop FsShell class has examples of this for the hadoop fs -lsr command, which is a recursive ls - see the source, around line 590 (the recursive step is triggered on line 635)
Solution 3:
/**
* @param filePath
* @param fs
* @return list of absolute file path present in given path
* @throws FileNotFoundException
* @throws IOException
*/
public static List<String> getAllFilePath(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
List<String> fileList = new ArrayList<String>();
FileStatus[] fileStatus = fs.listStatus(filePath);
for (FileStatus fileStat : fileStatus) {
if (fileStat.isDirectory()) {
fileList.addAll(getAllFilePath(fileStat.getPath(), fs));
} else {
fileList.add(fileStat.getPath().toString());
}
}
return fileList;
}
Quick Example : Suppose you have the following file structure:
a -> b
-> c -> d
-> e
-> d -> f
Using the code above, you get:
a/b
a/c/d
a/c/e
a/d/f
If you want only the leaf (i.e. fileNames), use the following code in else
block :
...
} else {
String fileName = fileStat.getPath().toString();
fileList.add(fileName.substring(fileName.lastIndexOf("/") + 1));
}
This will give:
b
d
e
f