Attachment

Enrich Elastic Compatible

Synopsis

Extracts content and metadata from common document formats like XLSX, DOCX, PDF, RTF, and ODT using native format libraries.

Schema

attachment:
- description: <text>
- field: <ident>
- target_field: <ident>
- if: <script>
- indexed_chars_field: <ident>
- indexed_chars: <number>
- ignore_failure: <boolean>
- ignore_missing: <boolean>
- on_failure: <processor[]>
- on_success: <processor[]>
- properties: <array>
- remove_binary: <boolean>
- resource_name: <string>
- tag: <string>

Configuration

| Field | Required | Default | Description |
|---|---|---|---|
| field | Y | | Field to get the base64-encoded data from |
| description | N | - | Explanatory text |
| if | N | - | Condition to run |
| indexed_chars_field | N | null | Field to override the value in the indexed_chars field |
| indexed_chars | N | 100000 | Number of characters that can be used for extraction. This is to avoid oversized fields. To specify no limits, use -1 |
| ignore_failure | N | false | See Handling Failures |
| ignore_missing | N | false | If set to true and field does not exist, exit quietly without making any modifications |
| on_failure | N | | See Handling Failures |
| on_success | N | - | See Handling Success |
| properties | N | all | Array of properties to be stored. Available options: author, content_type, content_length, date, name, keywords, language, and title |
| remove_binary | N | | If set to true, the binary field will be removed from the document |
| resource_name | N | - | Name of the resource to decode. |
| tag | N | - | Identifier |
| target_field | N | attachment | Field containing the attachment |

Details

The source field must contain base64-encoded binary data of the document to be processed. For optimal performance with large binary files, consider using binary data formats instead of base64 encoding.
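
As a minimal sketch of preparing such a document, the following Python reads a file and wraps its base64 encoding in a JSON object. The data key matches the examples on this page; the helper name build_document and the file path in the usage line are hypothetical.

```python
import base64
import json
from pathlib import Path


def build_document(path: str, field: str = "data") -> str:
    """Return a JSON document whose `field` holds the file's base64 text."""
    raw = Path(path).read_bytes()
    # b64encode returns bytes; decode to ASCII so the value fits in JSON.
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({field: encoded})
```

For instance, build_document("report.pdf") would yield a document that can be fed to a pipeline whose attachment processor reads from data.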

The following fields can be extracted from a document:

altitude, author, comments, content, content_length, content_type, contributor, coverage, creator_tool, date, description, format, identifier, keywords, language, latitude, longitude, metadata_date, modified, modifier, print_date, publisher, rating, relation, rights, source, title, type

The processor can extract these fields from the following document formats:

  • Microsoft Excel (XLSX)
  • Microsoft Word (DOCX)
  • PDF documents
  • Rich Text Format (RTF)
  • OpenDocument Text (ODT)
  • Plain text (TXT)
Note: For multiple attachments, use the foreach processor.

Examples

Basic Extraction

Files attached to JSON documents must first be encoded as base64 strings. The processor then decodes the string and extracts the document's content and metadata:

Extract content from an Excel file...

{
  "data": "<base64-encoded XLSX>"
}

attachment:
- field: data

Get the sheet content and metadata:

{
  "data": "<base64-encoded XLSX>",
  "attachment": {
    "content": "Sheet1\nName Lastname Age\nJohn Smith 25\nJane Doe 30",
    "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "content_length": 55,
    "author": "John Smith",
    "sheets": "Sheet1",
    "application_name": "Microsoft Excel"
  }
}

Character Limits

Limit extracted content length...

{
  "data": "<base64-encoded PDF>",
  "max_size": 50
}

attachment:
- field: data
- indexed_chars_field: max_size

to prevent memory issues with large documents:

{
  "data": "<base64-encoded PDF>",
  "max_size": 50,
  "attachment": {
    "content": "This is the first 50 characters of the document...",
    "content_type": "application/pdf",
    "content_length": 50
  }
}
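
The interaction between indexed_chars and indexed_chars_field can be sketched in Python as follows. This is an illustration of the behavior described in the configuration table, not the processor's actual code: the function names are hypothetical, and falling back to indexed_chars when the per-document field is absent is an assumption.

```python
from typing import Any, Mapping, Optional

DEFAULT_INDEXED_CHARS = 100_000  # default listed in the configuration table


def effective_limit(doc: Mapping[str, Any],
                    indexed_chars: int = DEFAULT_INDEXED_CHARS,
                    indexed_chars_field: Optional[str] = None) -> int:
    """Use the per-document override when present, else the processor value."""
    if indexed_chars_field is not None and indexed_chars_field in doc:
        return int(doc[indexed_chars_field])
    # Assumed fallback: no override in the document, keep indexed_chars.
    return indexed_chars


def truncate(content: str, limit: int) -> str:
    """Cut content to `limit` characters; -1 means no limit."""
    return content if limit < 0 else content[:limit]
```

With the document above, effective_limit(doc, indexed_chars_field="max_size") evaluates to 50, so only the first 50 characters of the extracted text would be kept.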

Metadata Extraction

Extract specific metadata properties from a Word document...

{
  "data": "<base64-encoded DOCX>"
}

attachment:
- field: data
- properties:
  - author
  - title
  - content_status

only the requested properties are returned:

{
  "data": "<base64-encoded DOCX>",
  "attachment": {
    "author": "Jane Doe",
    "title": "Project Report",
    "content_status": "Final"
  }
}
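
The effect of properties can be pictured as a filter over everything the processor extracted. The Python below is only a sketch under that reading: select_properties is a hypothetical helper, and silently skipping requested properties that the document does not provide is an assumption.

```python
from typing import Any, Dict, Iterable, Optional


def select_properties(extracted: Dict[str, Any],
                      properties: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Keep only the requested properties; None keeps everything."""
    if properties is None:
        return dict(extracted)
    wanted = set(properties)
    # Preserve the extraction order; absent properties are skipped.
    return {key: value for key, value in extracted.items() if key in wanted}
```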

Content Type Detection

The processor auto-detects content type using MIME detection, but you can help with resource_name...

{
  "data": "<base64-encoded data>"
}

attachment:
- field: data
- resource_name: document.rtf

which aids in format identification:

{
  "data": "<base64-encoded data>",
  "attachment": {
    "content_type": "application/rtf",
    "content": "Document content...",
    "content_length": 17
  }
}
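
To see why a file name can help, consider that XLSX, DOCX, and ODT files are all ZIP containers and start with the same signature bytes. The Python sketch below is not the processor's detection logic; guess_content_type and the ZIP_BASED table are hypothetical and only illustrate how a name can break the tie when the leading bytes are ambiguous.

```python
from pathlib import PurePath
from typing import Optional

# Extension -> MIME type for the ZIP-based formats named on this page.
ZIP_BASED = {
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".odt": "application/vnd.oasis.opendocument.text",
}


def guess_content_type(data: bytes, resource_name: Optional[str] = None) -> str:
    """Guess a MIME type from leading bytes, using the name as a tie-breaker."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"{\\rtf"):
        return "application/rtf"
    if data.startswith(b"PK\x03\x04"):
        # The ZIP signature alone cannot tell XLSX, DOCX, and ODT apart,
        # so fall back on the extension of the supplied name.
        suffix = PurePath(resource_name or "").suffix.lower()
        return ZIP_BASED.get(suffix, "application/zip")
    return "application/octet-stream"
```

Without a name, a ZIP-based payload can only be reported as application/zip in this sketch; passing a name such as report.docx narrows it to the Word MIME type.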