Attachment
Synopsis
Extracts content and metadata from common document formats like XLSX, DOCX, PDF, RTF, and ODT using native format libraries.
Schema
attachment:
- description: <text>
- field: <ident>
- target_field: <ident>
- if: <script>
- indexed_chars_field: <ident>
- indexed_chars: <number>
- ignore_failure: <boolean>
- ignore_missing: <boolean>
- on_failure: <processor[]>
- on_success: <processor[]>
- properties: <array>
- remove_binary: <boolean>
- resource_name: <string>
- tag: <string>
Configuration
Field | Required | Default | Description |
---|---|---|---|
field | Y | - | Field containing the base64-encoded data |
description | N | - | Explanatory text |
if | N | - | Condition to run |
indexed_chars_field | N | null | Field whose value overrides the indexed_chars setting |
indexed_chars | N | 100000 | Maximum number of characters to extract, which prevents oversized fields. Use -1 for no limit |
ignore_failure | N | false | See Handling Failures |
ignore_missing | N | false | If set to true and field does not exist, exit quietly without making any modifications |
on_failure | N | - | See Handling Failures |
on_success | N | - | See Handling Success |
properties | N | all | Array of properties to be stored. Available options: author, content_type, content_length, date, name, keywords, language, and title |
remove_binary | N | - | If set to true, the binary field will be removed from the document |
resource_name | N | - | Name of the resource to decode. |
tag | N | - | Identifier |
target_field | N | attachment | Field that will hold the extracted attachment information |
Details
The source field must contain base64-encoded binary data of the document to be processed. For optimal performance with large binary files, consider using binary data formats instead of base64 encoding.
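To make the cost of base64 concrete, the following Python sketch computes the encoded size for a given number of raw bytes. It is an illustration only and not part of the processor; the helper name `base64_size` is made up for this example. Every 3 input bytes become 4 output characters, so the encoded payload is roughly one third larger than the original file.

```python
import base64

def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of `raw_bytes` bytes."""
    # Each group of up to 3 bytes is encoded as exactly 4 characters.
    return 4 * ((raw_bytes + 2) // 3)

# Compare the formula with the real encoder on a 3 MB dummy payload.
payload = bytes(3_000_000)
encoded = base64.b64encode(payload)
print(len(payload), len(encoded), base64_size(len(payload)))
```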
The following fields can be extracted from a document:
altitude
author
comments
content
content_length
content_type
contributor
coverage
creator_tool
date
description
format
identifier
keywords
language
latitude
longitude
metadata_date
modified
modifier
print_date
publisher
rating
relation
rights
source
title
type
The processor can extract these fields from the following formats:
- Microsoft Excel (XLSX)
- Microsoft Word (DOCX)
- PDF documents
- Rich Text Format (RTF)
- OpenDocument Text (ODT)
- Plain text (TXT)
For multiple attachments, use the foreach processor.
Examples
Basic Extraction
Files attached to JSON documents must first be encoded as base64 strings. The processor can then be used to decode the string and extract its content.
Extract content from an Excel file to get the sheet content and metadata.
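The encoding step that must happen before the processor runs can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the field name `data`, the helper `build_document`, and the file name mentioned in the comment are hypothetical and would need to match the `field` setting of your own processor configuration.

```python
import base64
import json

def build_document(raw: bytes, field: str = "data") -> str:
    """Return a JSON document whose `field` holds the base64-encoded bytes."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({field: encoded})

# In practice `raw` would come from something like
# open("report.xlsx", "rb").read(); a literal is used here so the
# sketch runs without any file on disk.
doc = build_document(b"example file bytes")
print(doc)
```

The resulting JSON string is what would be sent into the pipeline; the processor reads the base64 text from the configured source field.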
Character Limits
Limit the extracted content length to prevent memory issues with large documents.
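How `indexed_chars` and `indexed_chars_field` interact can be modelled with a short Python sketch. It only mirrors the behavior described in the configuration table (a default of 100000, a per-document override field, and -1 meaning no limit); it is not the processor's implementation, and the function names and the `max_chars` field are invented for illustration.

```python
from typing import Optional

def effective_limit(doc: dict, indexed_chars: int = 100_000,
                    indexed_chars_field: Optional[str] = None) -> int:
    """Pick the character limit for one document."""
    # A value stored in the override field wins over the static setting.
    if indexed_chars_field is not None and indexed_chars_field in doc:
        return int(doc[indexed_chars_field])
    return indexed_chars

def truncate(content: str, limit: int) -> str:
    """Cut extracted text to `limit` characters."""
    # The documentation defines only -1 as "no limit"; treating every
    # negative value the same way is an assumption of this sketch.
    return content if limit < 0 else content[:limit]

text = "x" * 250
doc = {"max_chars": 20}  # hypothetical override field
print(len(truncate(text, effective_limit({}, indexed_chars=100))))
print(len(truncate(text, effective_limit(doc, 100, "max_chars"))))
print(len(truncate(text, effective_limit({}, indexed_chars=-1))))
```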
Metadata Extraction
Extract specific metadata properties from a Word document. Only the requested properties are returned.
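The effect of the `properties` setting can be pictured as a filter over the extracted fields. The Python sketch below is illustrative only: `select_properties` is a made-up helper, and the sample values are placeholders rather than real processor output.

```python
from typing import Iterable, Optional

def select_properties(extracted: dict,
                      properties: Optional[Iterable[str]] = None) -> dict:
    """Keep only the requested properties; None keeps everything."""
    if properties is None:
        return dict(extracted)
    wanted = set(properties)
    return {key: value for key, value in extracted.items() if key in wanted}

# Placeholder values standing in for what an extraction might produce.
extracted = {
    "author": "A. Writer",
    "title": "Quarterly Report",
    "language": "en",
    "content_length": 1234,
}
print(select_properties(extracted, ["author", "title"]))
```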
Content Type Detection
The processor auto-detects content type using MIME detection, but you can help with resource_name, which aids in format identification.
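Why a resource name helps can be shown with a simplified Python sketch of content sniffing. It is not how the processor detects types; it only checks well-known leading bytes. DOCX, XLSX, and ODT files are all ZIP containers that start with the same signature, so in this sketch the file extension is what tells them apart. The function name `guess_content_type` and the file names are hypothetical.

```python
import os
from typing import Optional

# Formats that share the ZIP signature, keyed by file extension.
ZIP_BASED = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".odt": "application/vnd.oasis.opendocument.text",
}

def guess_content_type(data: bytes, resource_name: Optional[str] = None) -> str:
    """Guess a MIME type from leading bytes, using the name as a hint."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"{\\rtf"):
        return "application/rtf"
    if data.startswith(b"PK\x03\x04"):
        # The bytes alone only say "ZIP"; the extension disambiguates.
        ext = os.path.splitext(resource_name or "")[1].lower()
        return ZIP_BASED.get(ext, "application/zip")
    return "application/octet-stream"

zip_bytes = b"PK\x03\x04" + bytes(16)
print(guess_content_type(zip_bytes))
print(guess_content_type(zip_bytes, "report.xlsx"))
```

A real detector can also inspect the entries inside the ZIP container, so the name is a hint rather than a requirement.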