Avro
Apache Avro is a data serialization system that provides rich data structures and a compact, fast, binary data format. Originally developed within the Apache Hadoop ecosystem, Avro is designed for schema evolution and language-neutral data exchange.
Binary Layout
Section | Internal Name | Description | Possible Values / Format |
---|---|---|---|
File Header | magic | 4-byte magic number identifying Avro files | ASCII "Obj" followed by the byte 0x01 (hex: 4F 62 6A 01) |
File Header | meta | Metadata map storing key-value pairs (e.g., schema, codec) | Map of string keys to byte values (e.g., "avro.schema" → JSON schema string) |
File Header | sync | 16-byte random sync marker used between blocks | 16 random bytes (unique per file) |
Data Block | blockCount | Number of records in the block | Long (variable-length zigzag encoding) |
Data Block | blockSize | Size in bytes of the serialized records (after compression, if any) | Long (variable-length zigzag encoding) |
Data Block | blockData | Serialized records (optionally compressed) | Binary-encoded data per schema |
Data Block | sync | Sync marker repeated after each block | Same 16-byte value as in header |
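The header layout in the table above can be read with a few lines of standard-library Python. The following is a minimal sketch, not a replacement for an Avro library: the function names `read_long`, `read_bytes`, and `read_header` are hypothetical, and the code assumes the whole header is available from a seekable or in-memory binary stream.

```python
import io

MAGIC = b"Obj\x01"  # 4F 62 6A 01


def read_long(stream):
    """Decode one Avro long: a base-128 varint holding a zigzag-encoded value."""
    shift = 0
    acc = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a variable-length long")
        byte = chunk[0]
        acc |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear: last byte of the varint
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)  # undo zigzag


def read_bytes(stream):
    """Read a length-prefixed byte string (used for map keys and values)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a byte string")
    return data


def read_header(stream):
    """Return (meta, sync) from the start of an Avro container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("missing Avro magic bytes")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(16)
    if len(sync) != 16:
        raise EOFError("stream ended inside the sync marker")
    return meta, sync
```

After `read_header` returns, the stream is positioned at the first data block, so `blockCount` and `blockSize` can be read with two further `read_long` calls.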
Schema Types (Stored in Metadata)
Type | Internal Name | Description | Example / Format |
---|---|---|---|
Primitive | null, boolean, int, long, float, double, bytes, string | Basic types | "type": "string" |
Record | record | Named collection of fields | { "type": "record", "name": "Person", "fields": [...] } |
Enum | enum | Named set of symbols | { "type": "enum", "name": "Suit", "symbols": ["SPADES", "HEARTS"] } |
Array | array | Ordered list of items | { "type": "array", "items": "string" } |
Map | map | Key-value pairs with string keys | { "type": "map", "values": "int" } |
Union | JSON array | Multiple possible types | [ "null", "string" ] |
Fixed | fixed | Fixed-size byte array | { "type": "fixed", "name": "md5", "size": 16 } |
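To show how the types in the table compose, here is a small sketch that builds one schema as a Python dictionary and serializes it to the JSON string that would be stored in the file metadata. The field names and the helper `kind_of` are made up for illustration; only the type constructs themselves come from the table.

```python
import json

# Hypothetical record schema that uses each type family from the table.
person_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "nickname", "type": ["null", "string"], "default": None},
        {"name": "suit", "type": {"type": "enum", "name": "Suit",
                                  "symbols": ["SPADES", "HEARTS"]}},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {"name": "scores", "type": {"type": "map", "values": "int"}},
        {"name": "checksum", "type": {"type": "fixed", "name": "md5", "size": 16}},
    ],
}


def kind_of(schema):
    """Classify a schema node: a JSON string names a type directly,
    a JSON array is a union, and a JSON object carries a "type" key."""
    if isinstance(schema, str):
        return schema
    if isinstance(schema, list):
        return "union"
    return schema["type"]


# The JSON text is what a writer would place under the "avro.schema" key.
schema_json = json.dumps(person_schema)
field_kinds = [kind_of(field["type"]) for field in person_schema["fields"]]
```

A union has no type name of its own, which is why `kind_of` detects it by its JSON shape rather than by a `"type"` key.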
Metadata Keys (in meta)
Key | Description | Example Value |
---|---|---|
avro.schema | JSON-encoded schema | JSON string defining the schema |
avro.codec | Compression codec used (optional; treated as "null" when absent) | "null" (default), "deflate", "snappy", "bzip2", "xz" |
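As a sketch of how a reader might interpret these two keys, the function below takes a metadata map whose values are raw bytes and returns the parsed schema together with the codec name. The name `interpret_meta` is hypothetical, and the input is assumed to be a plain `dict` of string keys to `bytes` values.

```python
import json


def interpret_meta(meta):
    """Return (schema, codec) from a metadata map of str -> bytes."""
    # The schema is stored as JSON text encoded in UTF-8.
    schema = json.loads(meta["avro.schema"].decode("utf-8"))
    # A missing codec entry means the blocks are not compressed.
    codec = meta.get("avro.codec", b"null").decode("ascii")
    return schema, codec
```

For example, `interpret_meta({"avro.schema": b'"string"'})` yields the schema `"string"` and the codec `"null"`.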
Compression Codecs
Codec | Description | Best For |
---|---|---|
null | No compression applied | Small files or testing |
deflate | DEFLATE (RFC 1951), the algorithm used by ZIP and gzip | General-purpose compression |
snappy | Fast compression/decompression | Real-time streaming applications |
bzip2 | High compression ratio | Storage-constrained environments |
xz | LZMA2-based compression in the xz format | Maximum compression efficiency |
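The sketch below shows one way the `blockData` bytes could be decompressed for the codecs that Python's standard library covers. The function name `decompress_block` is hypothetical. Snappy is left out because it needs a third-party package, and Avro's snappy blocks also carry a trailing 4-byte CRC32 of the uncompressed data that a real reader would have to strip and check.

```python
import bz2
import lzma
import zlib


def decompress_block(codec, data):
    """Return the uncompressed bytes of one data block."""
    if codec == "null":
        return data
    if codec == "deflate":
        # Avro stores raw DEFLATE with no zlib header or checksum,
        # which zlib selects with a negative wbits value.
        return zlib.decompress(data, wbits=-15)
    if codec == "bzip2":
        return bz2.decompress(data)
    if codec == "xz":
        return lzma.decompress(data)
    raise ValueError(f"unsupported codec: {codec!r}")
```

The codec name passed in would normally come from the `avro.codec` metadata entry, falling back to `"null"` when that entry is absent.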