Version: 1.2.0

Avro

Apache Avro is a data serialization system that provides rich data structures and a compact, fast, binary data format. Originally developed within the Apache Hadoop ecosystem, Avro is designed for schema evolution and language-neutral data exchange.

Binary Layout

| Section | Internal Name | Description | Possible Values / Format |
| --- | --- | --- | --- |
| File Header | `magic` | 4-byte magic number identifying Avro files | ASCII `Obj` followed by the byte `0x01` (hex: `4F 62 6A 01`) |
| File Header | `meta` | Metadata map storing key-value pairs (e.g., schema, codec) | Map of string keys to byte values (e.g., `"avro.schema"` → JSON schema string) |
| File Header | `sync` | 16-byte random sync marker used between blocks | 16 random bytes (unique per file) |
| Data Block | `blockCount` | Number of records in the block | Long (variable-length zigzag encoding) |
| Data Block | `blockSize` | Size in bytes of the serialized records (after compression, if any) | Long (variable-length zigzag encoding) |
| Data Block | `blockData` | Serialized records (optionally compressed) | Binary-encoded data per schema |
| Data Block | `sync` | Sync marker repeated after each block | Same 16-byte value as in header |
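
The layout above can be walked with a few small helpers. The following is a minimal sketch using only the Python standard library, not the reference implementation: the function names (`read_long`, `write_long`, `read_bytes`, `read_header`, `read_blocks`) are hypothetical, the stream is assumed to be seekable, and block data is returned as raw bytes without decompression or record decoding.

```python
import io
import os

MAGIC = b"Obj\x01"   # 4-byte magic number from the file header
SYNC_SIZE = 16       # length of the sync marker in bytes


def read_long(stream):
    """Decode one long: a base-128 varint holding a zigzag-encoded value."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a long")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:      # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag


def write_long(value):
    """Encode a 64-bit signed integer as a long (inverse of read_long)."""
    raw = (value << 1) ^ (value >> 63)  # zigzag: small magnitudes stay short
    out = bytearray()
    while raw & ~0x7F:
        out.append((raw & 0x7F) | 0x80)
        raw >>= 7
    out.append(raw)
    return bytes(out)


def read_bytes(stream):
    """Read a length-prefixed byte string (a long, then that many bytes)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a byte string")
    return data


def read_header(stream):
    """Return (meta, sync) from the file header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:              # a zero count terminates the map
            break
        if count < 0:               # negative count: a byte size follows
            count = -count
            read_long(stream)       # size is not needed here, skip it
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise EOFError("stream ended inside the sync marker")
    return meta, sync


def read_blocks(stream, sync):
    """Yield (blockCount, blockData) pairs, checking each trailing sync."""
    while True:
        if not stream.read(1):      # clean end of file
            return
        stream.seek(-1, io.SEEK_CUR)
        count = read_long(stream)
        data = read_bytes(stream)   # blockSize followed by blockData
        if stream.read(SYNC_SIZE) != sync:
            raise ValueError("sync marker mismatch")
        yield count, data
```

Opening a file in binary mode and calling `read_header(f)` followed by `read_blocks(f, sync)` would list each block's record count and raw payload. Checking the sync marker after every block is what allows a reader to detect corruption or resynchronize when a file is split.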

Schema Types (Stored in Metadata)

| Type | Internal Name | Description | Example / Format |
| --- | --- | --- | --- |
| Primitive | `null`, `boolean`, `int`, `long`, `float`, `double`, `bytes`, `string` | Basic types | `"type": "string"` |
| Record | `record` | Named collection of fields | `{ "type": "record", "name": "Person", "fields": [...] }` |
| Enum | `enum` | Named set of symbols | `{ "type": "enum", "name": "Suit", "symbols": ["SPADES", "HEARTS"] }` |
| Array | `array` | Ordered list of items | `{ "type": "array", "items": "string" }` |
| Map | `map` | Key-value pairs with string keys | `{ "type": "map", "values": "int" }` |
| Union | JSON array | Multiple possible types | `[ "null", "string" ]` |
| Fixed | `fixed` | Fixed-size byte array | `{ "type": "fixed", "name": "md5", "size": 16 }` |
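
To show how these types compose, here is a small sketch that builds one record schema as a Python dictionary and serializes it to the JSON bytes that would be stored under `avro.schema`. The `Person` field names are hypothetical and chosen only to exercise each type from the table.

```python
import json

# Hypothetical record schema using one field per complex type listed above.
person_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        # Union: the field may be null or a string; null is the default.
        {"name": "nickname", "type": ["null", "string"], "default": None},
        {"name": "favorite_suit",
         "type": {"type": "enum", "name": "Suit",
                  "symbols": ["SPADES", "HEARTS"]}},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {"name": "scores", "type": {"type": "map", "values": "int"}},
        {"name": "checksum",
         "type": {"type": "fixed", "name": "md5", "size": 16}},
    ],
}

# The schema is stored in the file metadata as UTF-8 encoded JSON.
schema_bytes = json.dumps(person_schema).encode("utf-8")
```

Listing `"null"` first in the union lets the field default to null, which is a common way to model optional fields.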

Metadata Keys (in meta)

| Key | Description | Example Value |
| --- | --- | --- |
| `avro.schema` | JSON-encoded schema | JSON string defining the schema |
| `avro.codec` | Compression codec used (optional) | `"null"` (default), `"deflate"`, `"snappy"`, `"bzip2"`, `"xz"` |
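
As an illustration of how a reader might interpret these two keys, the sketch below takes an already-parsed metadata map (string keys to byte values) and returns the schema and codec name. The helper name `describe_meta` is hypothetical; the fallback to `"null"` follows the table's note that the codec key is optional.

```python
import json


def describe_meta(meta):
    """Return (schema, codec) from a metadata map of str -> bytes."""
    # avro.schema is required and holds the schema as JSON text.
    schema = json.loads(meta["avro.schema"].decode("utf-8"))
    # avro.codec is optional; a missing key means no compression.
    codec = meta.get("avro.codec", b"null").decode("utf-8")
    return schema, codec
```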

Compression Codecs

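| Codec | Description | Best For |
| --- | --- | --- |
| `null` | No compression applied | Small files or testing |
| `deflate` | DEFLATE compression (RFC 1951), the algorithm used by ZIP and gzip | General-purpose compression |
| `snappy` | Fast compression and decompression with a moderate compression ratio | Real-time streaming applications |
| `bzip2` | High compression ratio, slower than `deflate` | Storage-constrained environments |
| `xz` | LZMA2-based compression with a very high ratio, slow to compress | Workloads where compression ratio matters more than speed |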
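
A reader typically selects a decompression routine based on the `avro.codec` value. The sketch below maps the codecs that the Python standard library can handle; the names `DECOMPRESSORS` and `decompress_block` are hypothetical. It assumes `deflate` blocks are raw DEFLATE streams without a zlib header, as RFC 1951 describes. `snappy` is left out because it needs a third-party library, and snappy-compressed blocks also carry a trailing 4-byte CRC32 checksum that would have to be stripped and verified.

```python
import bz2
import lzma
import zlib

# Codec name -> function turning compressed block bytes into record bytes.
DECOMPRESSORS = {
    "null": lambda data: data,
    # wbits=-15 tells zlib to expect a raw DEFLATE stream (no header).
    "deflate": lambda data: zlib.decompress(data, wbits=-15),
    "bzip2": bz2.decompress,
    "xz": lzma.decompress,
}


def decompress_block(codec, data):
    """Decompress one block's data according to the file's codec name."""
    try:
        decompress = DECOMPRESSORS[codec]
    except KeyError:
        raise ValueError(f"unsupported codec: {codec!r}") from None
    return decompress(data)
```

Keeping the codecs in a dictionary makes it straightforward to register an additional entry, such as a snappy handler, without touching the dispatch logic.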