Normally when using Avro files as input or output to a MapReduce job, you write a Java main method to set up the Job using AvroJob. That documentation page does a good job of explaining where to use AvroMappers, AvroReducers, and AvroKey and AvroValue (N.B. if you want a file full of a particular Avro object, rather than key-value pairs of two Avro types, use AvroKeyOutputFormat as the OutputFormat, AvroKey as the key, and NullWritable as the value).
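As a rough sketch, a driver using the new-API AvroJob helper might look like the following. This is illustrative only: `MyRecord` is a placeholder for your generated Avro specific record, and `MyMapper`/`MyReducer` stand in for your own implementations.

```java
// Hypothetical driver sketch: MyRecord, MyMapper, and MyReducer are
// placeholders for your generated Avro class and your own map/reduce classes.
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyAvroDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro-example");
    job.setJarByClass(MyAvroDriver.class);

    // Read Avro files as input; tell Avro what schema to expect.
    job.setInputFormatClass(AvroKeyInputFormat.class);
    AvroJob.setInputKeySchema(job, MyRecord.getClassSchema());

    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);

    // A file full of a single Avro type: AvroKey + NullWritable.
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    AvroJob.setOutputKeySchema(job, MyRecord.getClassSchema());
    job.setOutputValueClass(NullWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each AvroJob.set* call ultimately just writes one of the configuration keys listed below, which is why you can replicate the whole setup in Oozie without any driver code.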
Sometimes (like if you’re using Oozie), you need to set everything up without using AvroJob as a helper. The documentation is less clear here, so here’s a list of Hadoop keys and the appropriate values (for MRv2):
- avro.schema.output.key – The JSON representation of the output key’s Avro schema. For large objects you may run afoul of Oozie’s 100,000 character workflow limit, in which case you can isolate your Avro job in a subflow
- avro.schema.output.value – Likewise, if you’re emitting key-value pairs instead of using AvroKeyOutputFormat, put your value’s JSON schema here
- avro.mapper – your mapper class, which must extend AvroMapper. You can also use a normal Mapper (with the normal Mapper configuration option), but you’ll have to handle converting the AvroKey/AvroValue yourself
- avro.reducer – likewise, a class that extends AvroReducer
- mapreduce.job.output.key.class – always AvroKey
- mapreduce.job.output.value.class – AvroValue or NullWritable, as above
- mapreduce.input.format.class – if you’re reading Avro files as input, set this to org.apache.avro.mapreduce.AvroKeyInputFormat
- mapreduce.map.output.key.class – AvroKey, if you’re using a subclass of AvroMapper. If you write your own Mapper, you can pick any key class you like
- mapreduce.map.output.value.class – AvroValue or NullWritable, unless you write a Mapper without subclassing AvroMapper
- io.serializations – AvroJob sets this value to the following:
org.apache.hadoop.io.serializer.WritableSerialization, org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization, org.apache.hadoop.io.serializer.avro.AvroReflectSerialization, org.apache.avro.hadoop.io.AvroSerialization
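Putting the list above together, a map-reduce action in an Oozie workflow might carry properties like the following. This is an illustrative sketch, not a drop-in workflow: the com.example class names and the one-field schema are placeholders, and you should trim the property set to match how your particular job reads and writes Avro.

```xml
<!-- Sketch of an Oozie map-reduce action; class names and schema are placeholders. -->
<map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
        <property>
            <name>avro.schema.output.key</name>
            <value>{"type":"record","name":"MyRecord","fields":[{"name":"id","type":"long"}]}</value>
        </property>
        <property>
            <name>avro.mapper</name>
            <value>com.example.MyAvroMapper</value>
        </property>
        <property>
            <name>avro.reducer</name>
            <value>com.example.MyAvroReducer</value>
        </property>
        <property>
            <name>mapreduce.job.output.key.class</name>
            <value>org.apache.avro.mapred.AvroKey</value>
        </property>
        <property>
            <name>mapreduce.job.output.value.class</name>
            <value>org.apache.hadoop.io.NullWritable</value>
        </property>
        <property>
            <name>mapreduce.input.format.class</name>
            <value>org.apache.avro.mapreduce.AvroKeyInputFormat</value>
        </property>
        <property>
            <name>io.serializations</name>
            <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,org.apache.avro.hadoop.io.AvroSerialization</value>
        </property>
    </configuration>
</map-reduce>
```

Note that the io.serializations value must be a single comma-separated list, with no whitespace or line breaks inside the value element.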
With these configuration options you should be able to set up an Avro job in Oozie, or any other place where you have to set up your MapReduce job manually.