OK, I agree that processing data at large scale with Hadoop is cool, but it sometimes frustrated me while I was working on my course project.
In a MapReduce job we often need a join, since the input to the whole job may consist of multiple files.
How to handle multiple input files in a single MapReduce job:
Multiple mappers: each mapper handles its own file: Code Demo
```java
MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, CustomerMap.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, TransactionMap.class);
FileOutputFormat.setOutputPath(conf, new Path(args[2]));
```
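The demo above only wires up the job driver; the mapper classes themselves are not shown. Here is a minimal sketch of what one of them might look like, assuming the old `mapred` API (to match the `JobConf`-style `conf` above). `CustomerMap`, the comma-separated record layout, and the `CUST` tag are my assumptions for illustration, not the original post's code:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper for the customer file: emits the join key
// (customer id) and tags each record so the reducer can tell the
// two inputs apart when it performs the join.
public class CustomerMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Assumed record layout: customerId,name,...
        String[] fields = value.toString().split(",");
        output.collect(new Text(fields[0]), new Text("CUST\t" + value));
    }
}
```

A `TransactionMap` would look the same, except it emits the transaction's customer-id field as the key and a different tag, so the reducer sees both sides of the join grouped under one key.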
Single mapper: one mapper handles all input files: Code Demo (as the following code shows, we can determine the source of the data inside the mapper and then take the corresponding action)
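The original code demo is not reproduced here, so below is a minimal sketch of the idea under the same old `mapred` API: the mapper asks the framework for its current input split, reads the file name off it, and branches on that. `JoinMap`, the file-name prefixes, the field indices, and the tags are assumptions for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical single mapper that handles every input file. It asks
// the framework which split (and hence which file) the current record
// came from, then tags the record accordingly.
public class JoinMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // The split tells us which file this record belongs to.
        String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
        String[] fields = value.toString().split(",");
        if (fileName.startsWith("customer")) {
            // Assumed: customer records keyed by their first field.
            output.collect(new Text(fields[0]), new Text("CUST\t" + value));
        } else {
            // Assumed: transaction records carry the customer id in field 1.
            output.collect(new Text(fields[1]), new Text("TXN\t" + value));
        }
    }
}
```

This variant needs only one `FileInputFormat.addInputPath` call per file and no `MultipleInputs`, at the cost of one mapper class that must understand every input format.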