How to get the input file name in the mapper in a Hadoop program?

How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory, each mapper may read a different file, and I need to know which file the mapper has read.


Solution 1:

First you need to get the input split. Using the newer mapreduce API, it is done as follows:

context.getInputSplit();

But in order to get the file path and the file name, you first need to cast the result to FileSplit.

So, to get the input file path, you can do the following:

Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();

Similarly, to get the file name, you can just call getName(), like this:

String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
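
Putting it together, a minimal sketch of a mapper using the new API might look like this (the class name and the emitted key/value are illustrative):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Cast the InputSplit to FileSplit to reach the file backing this split
        FileSplit split = (FileSplit) context.getInputSplit();
        Path filePath = split.getPath();
        String fileName = filePath.getName();

        // For example, emit the file name as the output key
        context.write(new Text(fileName), value);
    }
}

Note that the cast throws a ClassCastException if the split is not a FileSplit, for example when the job uses MultipleInputs or CombineFileInputFormat; Solution 4 covers those cases.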

Solution 2:

Use this inside your mapper:

FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();

Edit:

Try this if you want to do it inside configure() through the old API:

private String fileName;

public void configure(JobConf job)
{
   // "map.input.file" is set by the framework to the path of the current input file
   fileName = job.get("map.input.file");
}
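
Putting this into a complete old-API mapper, a minimal sketch might look like the following (the class name is illustrative; see Solution 5 for a caveat about this property on Hadoop 2.4 and later):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiFileNameMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String fileName;

    @Override
    public void configure(JobConf job) {
        // Populated by the framework for each map task (may be null on Hadoop 2.4+, see Solution 5)
        fileName = job.get("map.input.file");
    }

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Emit the input file name as the key, just as an example
        output.collect(new Text(fileName), value);
    }
}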

Solution 3:

If you are using Hadoop Streaming, you can use the JobConf variables in a streaming job's mapper/reducer.

As for the mapper's input file name, see the Configured Parameters section: the map.input.file variable (the filename that the map is reading from) is the one that gets the job done. But note that:

Note: During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. To get the values in a streaming job's mapper/reducer use the parameter names with the underscores.


For example, if you are using Python, you can put these lines in your mapper file:

import os

# the dots in "map.input.file" become underscores in the environment variable name
file_name = os.getenv('map_input_file')
print(file_name)

Solution 4:

If you're using the regular InputFormat, use this in your Mapper:

InputSplit is = context.getInputSplit();
Method method = is.getClass().getMethod("getInputSplit");
method.setAccessible(true);
FileSplit fileSplit = (FileSplit) method.invoke(is);
String currentFileName = fileSplit.getPath().getName();
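
Note that the reflective calls throw checked exceptions, so in practice you may want to wrap them in a small helper; here is a minimal sketch (the helper class name is illustrative, and the instanceof check simply short-circuits the common case where the split is already a FileSplit):

import java.lang.reflect.Method;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public final class SplitUtils {

    // Returns the name of the file backing the given split.
    public static String currentFileName(InputSplit split) {
        try {
            if (split instanceof FileSplit) {
                // Plain FileSplit: no unwrapping needed
                return ((FileSplit) split).getPath().getName();
            }
            // Wrapped split (e.g. TaggedInputSplit from MultipleInputs):
            // reach the inner split reflectively, as in the snippet above
            Method method = split.getClass().getMethod("getInputSplit");
            method.setAccessible(true);
            FileSplit fileSplit = (FileSplit) method.invoke(split);
            return fileSplit.getPath().getName();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not determine the input file name", e);
        }
    }
}

In the mapper it would then be called as SplitUtils.currentFileName(context.getInputSplit()).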

If you're using CombineFileInputFormat, it's a different approach, because it combines several small files into one relatively large split (depending on your configuration). Both the Mapper and the RecordReader run in the same JVM, so you can pass data between them at run time. You need to implement your own CombineFileRecordReaderWrapper and do as follows:

public class MyCombineFileRecordReaderWrapper<K, V> extends RecordReader<K, V> {
    ...
    private static String mCurrentFilePath;
    ...
    public void initialize(InputSplit combineSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        assert this.fileSplitIsValid(context);
        mCurrentFilePath = mFileSplit.getPath().toString();
        this.mDelegate.initialize(this.mFileSplit, context);
    }
    ...
    public static String getCurrentFilePath() {
        return mCurrentFilePath;
    }
    ...
}

Then, in your Mapper, use this:

String currentFileName = MyCombineFileRecordReaderWrapper.getCurrentFilePath();
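
For completeness, a sketch of what this might look like inside a full mapper (the class name and key/value types are illustrative, and the wrapper class is assumed to be importable):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CombineAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Works because the wrapper's RecordReader runs in the same JVM as the mapper
        // and updates the static field before delivering each file's records
        String currentFileName = MyCombineFileRecordReaderWrapper.getCurrentFilePath();
        context.write(new Text(currentFileName), value);
    }
}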

Hope I helped :-)

Solution 5:

Note that on Hadoop 2.4 and greater, using the old API, this method produces a null value:

String fileName = new String();
public void configure(JobConf job)
{
   fileName = job.get("map.input.file");
}

Alternatively, you can use the Reporter object passed to your map function to get the InputSplit and cast it to a FileSplit to retrieve the file name:

public void map(LongWritable offset, Text record,
        OutputCollector<NullWritable, Text> out, Reporter rptr)
        throws IOException {

    FileSplit fsplit = (FileSplit) rptr.getInputSplit();
    String inputFileName = fsplit.getPath().getName();
    ....
}