Hadoop provides you with the Writable interface if you want to write your object to a SequenceFile. It's up to you to implement the write() and readFields() methods for your object. It's easy if your object is simple: just write each of your instance variables to a DataOutput and read them back in the same order from a DataInput.
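For a simple class, that pattern looks something like this. This is just a sketch with made-up fields (count and name), not one of the classes from this story:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class RecordWritable implements Writable {
    private int count;
    private String name;

    @Override
    public void write(DataOutput out) throws IOException {
        // Write each instance variable in a fixed order.
        out.writeInt(count);
        out.writeUTF(name);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Read them back in exactly the same order they were written.
        count = in.readInt();
        name = in.readUTF();
    }
}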
Don't Write Your Object As A Serialized Byte Array
I got lazy when I was implementing the Writable interface with one of our classes because it had a ton of instance variables. I figured I'd just serialize it to a byte array, then write the array length and the whole array to the DataOutput. And on the read, well, just unserialize the object from the byte array. This was my write():
@Override
public void write(DataOutput out) throws IOException {
    // Serialize the contained object to a byte array with Java serialization.
    ByteArrayOutputStream byteOutStream = new ByteArrayOutputStream();
    ObjectOutputStream objectOut = new ObjectOutputStream(byteOutStream);
    objectOut.writeObject(getContainedObject());
    objectOut.close();
    byte[] serializedObject = byteOutStream.toByteArray();
    // Write the length, then the raw bytes.
    out.writeInt(serializedObject.length);
    out.write(serializedObject);
}
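The read side was the mirror image. Here's a sketch of how that looks; setContainedObject() is just a stand-in for whatever setter pairs with getContainedObject(), and you'd also need ByteArrayInputStream and ObjectInputStream imports:
@Override
public void readFields(DataInput in) throws IOException {
    // Read the length, then the serialized bytes, then deserialize the whole object.
    int length = in.readInt();
    byte[] serializedObject = new byte[length];
    in.readFully(serializedObject);
    ObjectInputStream objectIn = new ObjectInputStream(new ByteArrayInputStream(serializedObject));
    try {
        // setContainedObject() is hypothetical; cast readObject()'s result to the real type.
        setContainedObject(objectIn.readObject());
    } catch (ClassNotFoundException e) {
        throw new IOException(e);
    } finally {
        objectIn.close();
    }
}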
Naw, dude. Bad idea.
I knew that I'd be paying some overhead in both space and time for this little scheme, but I didn't know how much. It was just a little bit per object, but when we started seeing MapReductions take way too much time in I/O, it was time to revisit this.
What This Cost In Space And Time
First, the Java serialization space overhead. On a toy example of this object, serialization to a byte array used 953 bytes. Properly writing out the instance variables consumed 296 bytes. In production, doing it the right way shrank a 1,600-record SequenceFile from 1.4GB to 825MB.
Time savings were great, too. In the same toy example, it took my JVM 7.2 milliseconds to serialize the object and 1.7 milliseconds to unserialize it. Doing it with stream I/O took only 76,000 nanoseconds to serialize and 58,000 nanoseconds to unserialize.
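If you want to see the gap for yourself, a quick-and-dirty timing harness along these lines will do it. Payload is a made-up stand-in with only a couple of fields, so expect your numbers to vary:
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationTiming {
    // Stand-in object; the real class had a ton of instance variables.
    static class Payload implements Serializable {
        int count = 42;
        String name = "example";

        void writeFields(DataOutput out) throws IOException {
            out.writeInt(count);
            out.writeUTF(name);
        }
    }

    public static void main(String[] args) throws Exception {
        Payload payload = new Payload();

        // Java serialization of the whole object graph.
        long start = System.nanoTime();
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream(javaBytes);
        objectOut.writeObject(payload);
        objectOut.close();
        long javaNanos = System.nanoTime() - start;

        // Writing the instance variables directly, Writable-style.
        start = System.nanoTime();
        ByteArrayOutputStream fieldBytes = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(fieldBytes);
        payload.writeFields(dataOut);
        dataOut.close();
        long fieldNanos = System.nanoTime() - start;

        System.out.printf("Java serialization: %d bytes, %d ns%n", javaBytes.size(), javaNanos);
        System.out.printf("Direct field writes: %d bytes, %d ns%n", fieldBytes.size(), fieldNanos);
    }
}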
I love order-of-magnitude improvements.
Lesson learned: get off your lazy ass and do it right.