Calling a mapreduce job from a simple java program
Solution 1:
Oh please don't do it with runJar
, the Java API is very good.
See how you can start a job from normal code:
// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
Job job = new Job(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar which is containing your
// map/reduce class, so you can use the mapper class
job.setJarByClass(Mapper.class);
// key/value of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this is setting the format of your input, can be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same with output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes possible output paths to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the empty out path
TextOutputFormat.setOutputPath(job, out);
// this waits until the job completes and prints debug out to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);
If you are using an external cluster, you have to put the following infos to your configuration via:
// this should be like defined in your mapred-site.xml
conf.set("mapred.job.tracker", "jobtracker.com:50001");
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
This should be no problem when the hadoop-core.jar
is in your application containers classpath.
But I think you should put some kind of progress indicator to your web page, because it may take minutes to hours to complete a hadoop job ;)
For YARN (> Hadoop 2)
For YARN, the following configurations need to be set.
// this should be like defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");
// framework is now "yarn", should be defined like this in mapred-site.xm
conf.set("mapreduce.framework.name", "yarn");
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
Solution 2:
Calling MapReduce job from java web application (Servlet)
You can call a MapReduce job from web application using Java API. Here is a small example of calling a MapReduce job from servlet. The steps are given below:
Step 1: At first create a MapReduce driver servlet class. Also develop map & reduce service. Here goes a sample code snippet:
CallJobFromServlet.java
public class CallJobFromServlet extends HttpServlet {
protected void doPost(HttpServletRequest request,HttpServletResponse response) throws ServletException, IOException {
Configuration conf = new Configuration();
// Replace CallJobFromServlet.class name with your servlet class
Job job = new Job(conf, " CallJobFromServlet.class");
job.setJarByClass(CallJobFromServlet.class);
job.setJobName("Job Name");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class); // Replace Map.class name with your Mapper class
job.setNumReduceTasks(30);
job.setReducerClass(Reducer.class); //Replace Reduce.class name with your Reducer class
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// Job Input path
FileInputFormat.addInputPath(job, new
Path("hdfs://localhost:54310/user/hduser/input/"));
// Job Output path
FileOutputFormat.setOutputPath(job, new
Path("hdfs://localhost:54310/user/hduser/output"));
job.waitForCompletion(true);
}
}
Step 2: Place all the related jar (hadoop, application specific jars) files inside lib folder of the web server (e.g. Tomcat). This is mandatory for accessing the Hadoop configurations ( hadoop ‘conf’ folder has configuration xml files i.e. core-site.xml , hdfs-site.xml etc ) . Just copy the jars from hadoop lib folder to web server(tomcat) lib directory. The list of jar names are as follows:
1. commons-beanutils-1.7.0.jar
2. commons-beanutils-core-1.8.0.jar
3. commons-cli-1.2.jar
4. commons-collections-3.2.1.jar
5. commons-configuration-1.6.jar
6. commons-httpclient-3.0.1.jar
7. commons-io-2.1.jar
8. commons-lang-2.4.jar
9. commons-logging-1.1.1.jar
10. hadoop-client-1.0.4.jar
11. hadoop-core-1.0.4.jar
12. jackson-core-asl-1.8.8.jar
13. jackson-mapper-asl-1.8.8.jar
14. jersey-core-1.8.jar
Step 3: Deploy your web application into web server (in ’webapps’ folder for Tomcat).
Step 4: Create a jsp file and link the servlet class (CallJobFromServlet.java) in form action attribute. Here goes a sample code snippet:
Index.jsp
<form id="trigger_hadoop" name="trigger_hadoop" action="./CallJobFromServlet ">
<span class="back">Trigger Hadoop Job from Web Page </span>
<input type="submit" name="submit" value="Trigger Job" />
</form>