Attachment

Enrich Elastic Compatible

Synopsis

Extracts content and metadata from common document formats like XLSX, DOCX, PDF, RTF, and ODT using native format libraries.

Schema

attachment:
  - description: <text>
  - field: <ident>
  - target_field: <ident>
  - if: <script>
  - indexed_chars_field: <ident>
  - indexed_chars: <number>
  - ignore_failure: <boolean>
  - ignore_missing: <boolean>
  - on_failure: <processor[]>
  - on_success: <processor[]>
  - properties: <array>
  - remove_binary: <boolean>
  - resource_name: <string>
  - tag: <string>

Configuration

Field	Required	Default	Description
`field`	Y		Field to get the base64-encoded data from
`description`	N	-	Explanatory text
`if`	N	-	Condition to run
`indexed_chars_field`	N	`null`	Field to override the value in the `indexed_chars` field
`indexed_chars`	N	100000	Number of characters that can be used for extraction. This is to avoid oversized fields. To specify no limits, use `-1`
`ignore_failure`	N	`false`	See Handling Failures
`ignore_missing`	N	`false`	If set to `true` and `field` does not exist, exit quietly without making any modifications
`on_failure`	N		See Handling Failures
`on_success`	N	-	See Handling Success
`properties`	N	all	Array of properties to be stored. Available options: `author`, `content_type`, `content_length`, `date`, `name`, `keywords`, `language`, and `title`
`remove_binary`	N		If set to `true`, the binary field will be removed from the document
`resource_name`	N	-	Name of the resource to decode.
`tag`	N	-	Identifier
`target_field`	N	attachment	Field containing the attachment

Details

The source field must contain base64-encoded binary data of the document to be processed. For optimal performance with large binary files, consider using binary data formats instead of base64 encoding.

The following fields can be extracted from a document:

altitude author comments content content_length content_type contributor coverage creator_tool date description format identifier keywords language latitude longitude metadata_date modified modifier print_date publisher rating relation rights source title type

The processor supports the following libraries to extract these fields:

Microsoft Excel (XLSX)
Microsoft Word (DOCX)
PDF documents
Rich Text Format (RTF)
OpenDocument Text (ODT)
Plain text (TXT)

note

For multiple attachments, use the foreach processor.

Examples

Basic Extraction

The files to be attached to the JSON documents must first be encoded as base64 strings. Then, to decode the string the processor can be used:

Extract content from an Excel file...

{
  "data": "<base64-encoded XLSX>"
}

attachment:
  - field: data

Get the sheet content and metadata:

{
  "data": "<base64-encoded XLSX>",
  "attachment": {
    "content": "Sheet1\nName Lastname Age\nJohn Smith 25\nJane Doe 30",
    "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "content_length": 55,
    "author": "John Smith",
    "sheets": "Sheet1",
    "application_name": "Microsoft Excel"
  }
}

Character Limits

Limit extracted content length...

{
  "data": "<base64-encoded PDF>",
  "max_size": 50
}

attachment:
  - field: data
  - indexed_chars_field: max_size

to prevent memory issues with large documents:

{
  "data": "<base64-encoded PDF>",
  "max_size": 50,
  "attachment": {
    "content": "This is the first 50 characters of the document...",
    "content_type": "application/pdf",
    "content_length": 50
  }
}

Metadata Extraction

Extract specific metadata properties from a Word document...

{
  "data": "<base64-encoded DOCX>"
}

attachment:
  - field: data
  - properties:
    - author
    - title
    - content_status

only the requested properties are returned:

{
  "data": "<base64-encoded DOCX>",
  "attachment": {
    "author": "Jane Doe",
    "title": "Project Report",
    "content_status": "Final"
  }
}

Content Type Detection

The processor auto-detects content type using MIME detection, but you can help with resource_name...

{
  "data": "<base64-encoded data>"
}

attachment:
  - field: data
  - resource_name: document.rtf

which aids in format identification:

{
  "data": "<base64-encoded data>",
  "attachment": {
    "content_type": "application/rtf",
    "content": "Document content...",
    "content_length": 17
  }
}

Synopsis​

Schema​

Configuration​

Details​

Examples​

Basic Extraction​

Character Limits​

Metadata Extraction​

Content Type Detection​