Avro MapReduce jobs in Oozie


Normally, when using Avro files as input or output to a MapReduce job, you write a Java main() method that sets up the Job using AvroJob. That documentation page does a good job of explaining where to use AvroMapper, AvroReducer, and the AvroKey and AvroValue wrappers (N.B. if you want a file full of a particular Avro object, rather than key-value pairs of two Avro types, use AvroKeyOutputFormat as the OutputFormat, AvroKey as the key, and NullWritable as the value).
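
For reference, here's a minimal driver sketch using the newer org.apache.avro.mapreduce API (where mappers subclass the plain Mapper and work with AvroKey directly); MyRecord, MyMapper, and MyReducer are hypothetical stand-ins for your own generated record class, Mapper, and Reducer:

    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.avro.mapreduce.AvroKeyOutputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AvroDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "avro-example");
        job.setJarByClass(AvroDriver.class);

        // Read Avro files; each record arrives in the mapper as AvroKey<MyRecord>
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, MyRecord.getClassSchema()); // hypothetical generated record

        job.setMapperClass(MyMapper.class);   // hypothetical
        job.setReducerClass(MyReducer.class); // hypothetical

        // A file of MyRecord objects: AvroKey as the key, NullWritable as the value
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        AvroJob.setOutputKeySchema(job, MyRecord.getClassSchema());
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }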

Sometimes (like if you're using Oozie), you need to set everything up without using AvroJob as a helper. The documentation is less clear here, so here's a list of the Hadoop configuration keys and their appropriate values (for MRv2), followed by a sketch that sets them all programmatically:

  • avro.schema.output.key – The JSON representation of the output key's Avro schema. For large objects you may run afoul of Oozie's 100,000-character workflow definition limit, in which case you can isolate your Avro job in a sub-workflow
  • avro.schema.output.value – Likewise, if you're emitting key-value pairs instead of using AvroKeyOutputFormat, put your value's JSON schema here
  • avro.mapper – your mapper class, which extends AvroMapper. You can also use a normal Mapper (with the normal Mapper configuration option), but you'll have to handle converting the AvroKey/AvroValue wrappers yourself
  • avro.reducer – likewise, a class that extends AvroReducer
  • mapreduce.job.output.key.class – always AvroKey
  • mapreduce.job.output.value.class – AvroValue or NullWritable, as above
  • mapreduce.input.format.class – if you're reading Avro files as input, you'll need to set this to AvroKeyInputFormat
  • mapreduce.map.output.key.class – AvroKey, if you're using a subclass of AvroMapper. If you write your own Mapper, you can pick your own key class
  • mapreduce.map.output.value.class – AvroValue or NullWritable, as above, unless you write a Mapper without subclassing AvroMapper
  • io.serializations – AvroJob sets this value to the following:

org.apache.hadoop.io.serializer.WritableSerialization, org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization, org.apache.hadoop.io.serializer.avro.AvroReflectSerialization, org.apache.avro.hadoop.io.AvroSerialization
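
Putting this together, here's a rough sketch of the equivalent manual setup on a plain Configuration object, mirroring the keys listed above. MyRecord, MyMapper, and MyReducer are again hypothetical placeholders:

    import org.apache.hadoop.conf.Configuration;

    public class ManualAvroSetup {
      public static Configuration buildConf() {
        Configuration conf = new Configuration();

        // JSON schema of the output key (add avro.schema.output.value if you emit pairs)
        conf.set("avro.schema.output.key", MyRecord.getClassSchema().toString());

        // AvroMapper/AvroReducer subclasses
        conf.set("avro.mapper", MyMapper.class.getName());
        conf.set("avro.reducer", MyReducer.class.getName());

        // Job output types: a file of Avro objects uses AvroKey + NullWritable
        conf.set("mapreduce.job.output.key.class", "org.apache.avro.mapred.AvroKey");
        conf.set("mapreduce.job.output.value.class", "org.apache.hadoop.io.NullWritable");

        // Read Avro files as input
        conf.set("mapreduce.input.format.class", "org.apache.avro.mapreduce.AvroKeyInputFormat");

        // Intermediate (map output) types
        conf.set("mapreduce.map.output.key.class", "org.apache.avro.mapred.AvroKey");
        conf.set("mapreduce.map.output.value.class", "org.apache.hadoop.io.NullWritable");

        // The serializations AvroJob would have registered
        conf.set("io.serializations",
            "org.apache.hadoop.io.serializer.WritableSerialization,"
            + "org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,"
            + "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,"
            + "org.apache.avro.hadoop.io.AvroSerialization");

        return conf;
      }
    }

In an Oozie workflow, each conf.set line above corresponds to a property name/value entry in the configuration block of your map-reduce action.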

With these configuration options you should be able to set up an Avro job in Oozie, or any other place where you have to set up your MapReduce job manually.


1 Comment

Thanks for this update! Apache Avro is a very popular data serialization format in the Hadoop technology stack. The Avro MapReduce API is an Avro module for running MapReduce programs which produce or consume Avro data files. The avro-mapred JAR does not ship with the CDH3 hadoop-0.20 package; it is intended to be used as a library which you can retrieve using a tool such as Maven or Ant. More at http://www.youtube.com/watch?v=1jMR4cHBwZE
