Elephant Bird

About

Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDe, HBase miscellanea, etc. The majority of these are in production at Twitter running over data every day.

Join the conversation about Elephant-Bird on the developer mailing list.

License

Apache licensed.

Quickstart

Get the code: git clone git://github.com/kevinweil/elephant-bird.git
Build the jar: mvn package
Explore what's available: mvn javadoc:javadoc

Note: For any of the LZO-based code, make sure that the native LZO libraries are on your java.library.path. Generally this is done by setting JAVA_LIBRARY_PATH in pig-env.sh or hadoop-env.sh. You can also add lines like

PIG_OPTS=-Djava.library.path=/path/to/my/libgplcompression/dir

to pig-env.sh. See the instructions for Hadoop-LZO for more details.

There are a few simple examples that use the input formats. Note how the Protocol Buffer and Thrift classes are passed to input formats through configuration.
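The idea of passing a class through configuration can be sketched in plain Java. This is only an illustration of the pattern, not Elephant Bird's actual code: `java.util.Properties` stands in for Hadoop's `Configuration`, and the key `example.record.class` and the helper names are hypothetical.

```java
import java.util.Properties;

// Illustrative sketch only: Properties stands in for a Hadoop Configuration.
class ConfiguredClassSketch {
    // Hypothetical configuration key; not a key used by Elephant Bird.
    static final String CLASS_KEY = "example.record.class";

    // Job-setup side: store the class name as a string in the configuration.
    static void setRecordClass(Properties conf, Class<?> recordClass) {
        conf.setProperty(CLASS_KEY, recordClass.getName());
    }

    // Task side: read the name back and load the class by reflection.
    static Class<?> getRecordClass(Properties conf) {
        String name = conf.getProperty(CLASS_KEY);
        if (name == null) {
            throw new IllegalStateException("No record class configured under " + CLASS_KEY);
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Configured class not on classpath: " + name, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        setRecordClass(conf, StringBuilder.class);
        System.out.println(getRecordClass(conf).getName());
    }
}
```

Because only the class name travels in the configuration, the class itself must be on the classpath of every task that reads it.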

Maven repository

Elephant Bird release artifacts are published to the Sonatype OSS releases repository and promoted from there to Maven Central. From time to time we may also deploy snapshot releases to the Sonatype OSS snapshots repository.

Version compatibility

Protocol Buffers 2.3 (not compatible with 2.4+)
Pig 0.8, 0.9 (not compatible with 0.7 and below)
Hive 0.7 (with HIVE-1616)
Thrift 0.5.0, 0.6.0, 0.7.0
Mahout 0.6
Cascading2 (as the API is evolving, see libraries.properties for the currently supported version)

Protocol Buffer and Thrift compiler dependencies

Elephant Bird requires the Protocol Buffer compiler, version 2.3, at build time because generated classes are used internally. The Thrift compiler is required to generate classes used in tests. Because these are native-code tools, they must be installed on the build machine (Java library dependencies are pulled from Maven repositories during the build).

Hadoop API wrappers

Hadoop provides two API implementations: the old-style org.apache.hadoop.mapred and new-style org.apache.hadoop.mapreduce packages. Elephant-Bird provides wrapper classes that allow unmodified usage of mapreduce input and output formats in contexts where the mapred interface is required.

For more information, see DeprecatedInputFormatWrapper.java and DeprecatedOutputFormatWrapper.java.
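The wrapping approach is an instance of the adapter pattern. The sketch below shows the shape of it with simplified, hypothetical interfaces (`NewStyleReader`, `OldStyleReader`) that only mimic the two Hadoop reader styles; it does not use the Hadoop API or the actual wrapper classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a new-style (mapreduce) reader: advance, then read.
interface NewStyleReader<V> {
    boolean nextKeyValue();
    V getCurrentValue();
}

// Hypothetical stand-in for an old-style (mapred) reader: fill a caller-supplied holder.
interface OldStyleReader<V> {
    boolean next(List<V> holder);
}

// Adapter: exposes an unmodified new-style reader through the old-style interface.
final class OldStyleReaderWrapper<V> implements OldStyleReader<V> {
    private final NewStyleReader<V> delegate;

    OldStyleReaderWrapper(NewStyleReader<V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean next(List<V> holder) {
        if (!delegate.nextKeyValue()) {
            return false;
        }
        holder.clear();
        holder.add(delegate.getCurrentValue());
        return true;
    }
}

// A trivial new-style reader over an in-memory list, used for the demo.
final class ListReader<V> implements NewStyleReader<V> {
    private final Iterator<V> it;
    private V current;

    ListReader(List<V> values) {
        this.it = values.iterator();
    }

    @Override
    public boolean nextKeyValue() {
        if (!it.hasNext()) {
            return false;
        }
        current = it.next();
        return true;
    }

    @Override
    public V getCurrentValue() {
        return current;
    }
}

class WrapperSketch {
    // Old-style consumer code: it only knows about OldStyleReader.
    static List<String> readAll(OldStyleReader<String> reader) {
        List<String> out = new ArrayList<String>();
        List<String> holder = new ArrayList<String>();
        while (reader.next(holder)) {
            out.add(holder.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        OldStyleReader<String> reader = new OldStyleReaderWrapper<String>(
            new ListReader<String>(Arrays.asList("a", "b", "c")));
        System.out.println(readAll(reader));
    }
}
```

The consumer never sees the new-style type, which is what allows a format written against one API to be reused unchanged where the other is required.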

Hadoop Writables

Elephant-Bird provides Protocol Buffer and Thrift Writables for working directly with these formats in MapReduce jobs.

Pig Support

Loaders and storers are available for the input and output formats listed above. Additionally, Pig-specific features include:

JSON loader (including nested structures)
Regex-based loader
Includes converter interface for turning Tuples into Writables and vice versa
Provides implementations to convert generic Writables, Thrift, Protobufs, and other specialized classes, such as Apache Mahout's VectorWritable.

Hive Support

Elephant-Bird provides Hive support for reading Thrift and Protocol Buffers. For more information, see How to use Elephant Bird with Hive.

Lucene Integration

Elephant-Bird provides Hadoop Input/Output Formats and Pig Load/Store Funcs for creating and searching Lucene indexes. See Elephant Bird Lucene.

Utilities

Counters in Pig
Protocol Buffer utilities
Thrift utilities
Conversions from Protocol Buffers and Thrift messages to Pig tuples
Conversions from Thrift to Protocol Buffer's DynamicMessage
Reading and writing block-based Protocol Buffer format (see ProtobufBlockWriter)

Working with Thrift and Protocol Buffers in Hadoop

We provide InputFormats, OutputFormats, Pig Load / Store functions, Hive SerDes, and Writables for working with Thrift and Google Protocol Buffers. We haven't written up the docs yet, but for reflection-based dynamic usage, look at ProtobufMRExample.java, ThriftMRExample.java, people_phone_number_count.pig, and people_phone_number_count_thrift.pig under the examples directory. We also provide utilities for generating Protobuf-specific Loaders, Input/Output Formats, etc., if for some reason you want to avoid the dynamic bits.

Hadoop SequenceFiles and Pig

Reading and writing Hadoop SequenceFiles with Pig is supported via classes SequenceFileLoader and SequenceFileStorage. These classes make use of a WritableConverter interface, allowing pluggable conversion of key and value instances to and from Pig data types.

Here's a short example: suppose you have SequenceFile<Text, LongWritable> data under the path input. We can load that data with the following Pig script:

REGISTER '/path/to/elephant-bird.jar';

%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';

pairs = LOAD 'input' USING $SEQFILE_LOADER (
  '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
) AS (key: chararray, value: long);

To store {key: chararray, value: long} data as SequenceFile<Text, LongWritable>, the following may be used:

%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';

STORE pairs INTO 'output' USING $SEQFILE_STORAGE (
  '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
);
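The pluggable-conversion idea behind the two `-c` arguments can be sketched in plain Java. Everything here is hypothetical and simplified: `ValueConverter`, `Utf8Converter`, `LongConverter`, and `PairCodec` are illustrative names, raw byte arrays stand in for Writables, and the real WritableConverter interface may look quite different.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified converter contract between a stored form and a Pig-side value.
interface ValueConverter<P> {
    P toPig(byte[] stored);
    byte[] fromPig(P pigValue);
}

// Stand-in for a Text <-> chararray conversion.
final class Utf8Converter implements ValueConverter<String> {
    @Override
    public String toPig(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }

    @Override
    public byte[] fromPig(String pigValue) {
        return pigValue.getBytes(StandardCharsets.UTF_8);
    }
}

// Stand-in for a LongWritable <-> long conversion (8 bytes, big-endian).
final class LongConverter implements ValueConverter<Long> {
    @Override
    public Long toPig(byte[] stored) {
        return ByteBuffer.wrap(stored).getLong();
    }

    @Override
    public byte[] fromPig(Long pigValue) {
        return ByteBuffer.allocate(Long.BYTES).putLong(pigValue).array();
    }
}

// Loader/storer stand-in: one converter is plugged in for keys and one for values.
final class PairCodec<K, V> {
    private final ValueConverter<K> keyConverter;
    private final ValueConverter<V> valueConverter;

    PairCodec(ValueConverter<K> keyConverter, ValueConverter<V> valueConverter) {
        this.keyConverter = keyConverter;
        this.valueConverter = valueConverter;
    }

    // "Load": stored key and value become a two-field tuple.
    Object[] load(byte[] rawKey, byte[] rawValue) {
        return new Object[] { keyConverter.toPig(rawKey), valueConverter.toPig(rawValue) };
    }

    // "Store": a (key, value) pair becomes its stored form.
    byte[][] store(K key, V value) {
        return new byte[][] { keyConverter.fromPig(key), valueConverter.fromPig(value) };
    }
}

class ConverterSketch {
    public static void main(String[] args) {
        PairCodec<String, Long> codec =
            new PairCodec<String, Long>(new Utf8Converter(), new LongConverter());
        byte[][] raw = codec.store("clicks", 42L);
        Object[] tuple = codec.load(raw[0], raw[1]);
        System.out.println(tuple[0] + "\t" + tuple[1]);
    }
}
```

Keeping the key and value converters separate is what lets the same loader or storer handle any combination of key and value types.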

For details, please see the Javadocs for SequenceFileLoader, SequenceFileStorage, and WritableConverter.

How To Contribute

Bug fixes, features, and documentation improvements are welcome! Please fork the project and send us a pull request on GitHub.

Each new release since 2.1.3 has a tag. The latest version on master is what we are actively running on Twitter's Hadoop clusters daily, over hundreds of terabytes of data.

Contributors

Major contributors are listed below. Lots of others have helped too; thanks to all of them! See git logs for credits.

Kevin Weil (@kevinweil)
Dmitriy Ryaboy (@squarecog)
Raghu Angadi (@raghuangadi)
Andy Schlaikjer (@sagemintblue)
Travis Crawford (@tc)
Johan Oskarsson (@skr)
