Hadoop DistributedCache is deprecated - what is the preferred API?
My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.
The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows:
// In the driver
JobConf conf = new JobConf(getConf(), WordCount.class);
...
DistributedCache.addCacheFile(new Path(filename).toUri(), conf);
// In the mapper
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
...
However, DistributedCache is marked as deprecated in Hadoop 2.2.0.
What is the new preferred way to achieve this? Is there an up-to-date example or tutorial covering this API?
Solution 1:
The Distributed Cache API now lives in the Job class itself. See the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
The code should look something like this:
// The Job constructors are deprecated in Hadoop 2; use the factory method instead.
Job job = Job.getInstance();
...
job.addCacheFile(new Path(filename).toUri());
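In your mapper code:
// getLocalCacheFiles() is itself deprecated in Hadoop 2.
// getCacheFiles() returns the URIs that were added in the driver.
URI[] cacheFiles = context.getCacheFiles();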
...
Solution 2:
To expand on @jtravaglini's answer, the preferred way of using the DistributedCache for YARN/MapReduce 2 is as follows:
In your driver, use Job.addCacheFile():
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, "MyJob");
job.setMapperClass(MyMapper.class);
// ...
// Mind the # sign after the absolute file location.
// You will be using the name after the # sign as your
// file name in your Mapper/Reducer
job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));
return job.waitForCompletion(true) ? 0 : 1;
}
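The role of the # fragment in the driver code above can be illustrated without Hadoop, using only java.net.URI. This is a minimal sketch, not Hadoop's own logic: the class CacheLinkName and the method linkName are hypothetical names, and the fallback to the last path component when no fragment is given is an assumption about the usual naming convention.

```java
import java.net.URI;

final class CacheLinkName {
    /**
     * Returns the name under which a cached file is expected to appear in
     * the task's working directory: the URI fragment if one is present,
     * otherwise (assumption) the last component of the path.
     */
    static String linkName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        URI withFragment = URI.create("/user/yourname/cache/some_file.json#some");
        URI withoutFragment = URI.create("/user/yourname/cache/other_file.json");
        System.out.println(linkName(withFragment));    // prints "some"
        System.out.println(linkName(withoutFragment)); // prints "other_file.json"
    }
}
```

With the fragment in place, the mapper can open the file under the short name "some" instead of having to know the full HDFS path.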
And in your Mapper/Reducer, override the setup(Context context) method:
@Override
protected void setup(
Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
if (context.getCacheFiles() != null
&& context.getCacheFiles().length > 0) {
File some_file = new File("./some");
File other_file = new File("./other");
// Do things to these two files, like read them
// or parse as JSON or whatever.
}
super.setup(context);
}
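The "do things to these two files" step in setup() can be sketched with plain java.nio, since the localized file is read like any other local file. This is a hedged illustration rather than part of the Hadoop API: CachedFileReader and readLines are hypothetical names, and the temporary file in main only stands in for the "./some" link that the framework would create in the task's working directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class CachedFileReader {
    /**
     * Reads all lines of a localized cache file, e.g. Paths.get("./some").
     * Fails with an IOException if the file is missing or unreadable,
     * which usually means it was never added in the driver.
     */
    static List<String> readLines(Path localLink) throws IOException {
        if (!Files.isReadable(localLink)) {
            throw new IOException("Cache file not localized: " + localLink);
        }
        return Files.readAllLines(localLink, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the link Hadoop would create in the task directory.
        Path demo = Files.createTempFile("some", ".json");
        Files.write(demo, "{\"key\": \"value\"}\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(demo));
        Files.delete(demo);
    }
}
```

Inside a real setup(), the call would be something like readLines(Paths.get("./some")), with the result stored in a field for use in map().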
Solution 3:
The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class: Job.addCacheFile()
Unfortunately, there are not yet many comprehensive tutorial-style examples of this.
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29