Azure Blob Storage
Synopsis
The Azure Blob Storage device reads and processes files from Azure storage containers. This pull-type device connects to a container, retrieves files in JSON, JSONL, or Parquet format, and processes them through DataStream pipelines. It supports two authentication methods: connection string and service principal.
Schema
- id: <numeric>
  name: <string>
  description: <string>
  type: azblob
  tags: <string[]>
  pipelines: <pipeline[]>
  status: <boolean>
  properties:
    connection_string: <string>
    container_name: <string>
    tenant_id: <string>
    client_id: <string>
    client_secret: <string>
    account: <string>
    path_prefix: <string>
    file_format: <string>
    batch_size: <number>
    poll_interval: <number>
    delete_after_processing: <boolean>
    max_concurrent_files: <number>
Configuration
Field | Type | Required | Default | Description |
---|---|---|---|---|
id | numeric | Y | - | Unique numeric identifier |
name | string | Y | - | Device name |
description | string | N | - | Optional description of the device's purpose |
type | string | Y | - | Device type identifier (must be azblob) |
tags | string[] | N | - | Array of labels for categorization |
pipelines | pipeline[] | N | - | Array of preprocessing pipeline references |
status | boolean | N | true | Boolean flag to enable/disable the device |
connection_string | string | Y* | - | Azure storage account connection string for authentication |
container_name | string | Y | - | Name of the Azure Blob Storage container to read from |
tenant_id | string | Y* | - | Azure tenant ID for service principal authentication |
client_id | string | Y* | - | Azure client ID for service principal authentication |
client_secret | string | Y* | - | Azure client secret for service principal authentication |
account | string | Y* | - | Azure storage account name for service principal authentication |
path_prefix | string | N | "" | Path prefix filter to limit which files are processed |
file_format | string | N | json | File format to expect: json, jsonl, or parquet |
batch_size | number | N | 1000 | Number of records to process in each batch |
poll_interval | number | N | 60 | Interval in seconds between container polling cycles |
delete_after_processing | boolean | N | false | Whether to delete files after successful processing |
max_concurrent_files | number | N | 5 | Maximum number of files to process concurrently |
* = Conditionally required (see authentication methods below)
Choose either connection string OR service principal authentication:
- Connection String: Requires connection_string and container_name
- Service Principal: Requires tenant_id, client_id, client_secret, account, and container_name
Avoid hardcoding connection_string and client_secret in plain text. Prefer referencing encrypted secrets (e.g., environment variables, vault integrations, or secret files) supported by DataStream. Rotate credentials regularly and restrict scope/permissions to least privilege.
Details
The Azure Blob Storage device operates as a pull-type data source that periodically scans Azure storage containers for new files. The device supports multiple file formats and provides flexible authentication options for enterprise environments.
File Format Processing: The device automatically detects and processes files based on the configured format. JSON files are parsed as individual objects, JSONL files process each line as a separate record, and Parquet files are read using columnar processing for efficient large-data handling.
Polling Behavior: The device maintains state to track processed files and only processes new or modified files during each polling cycle. The polling interval can be adjusted based on data arrival patterns and processing requirements.
Concurrent Processing: Multiple files can be processed simultaneously to improve throughput. The concurrency level is configurable and should be tuned based on available system resources and storage account limits.
Error Handling: Files that fail processing are marked and can be retried on subsequent polling cycles. The device provides detailed logging for troubleshooting connection and processing issues.
Examples
Basic Connection String Authentication
Configuring Azure Blob Storage device with connection string authentication to process JSON files...
Device polls the 'logs' container every 5 minutes for JSON files and processes each file as individual records...
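A minimal sketch of a configuration matching this description, using only fields from the schema and nesting the connection settings under properties. The id, name, and connection string placeholder are hypothetical; poll_interval is set to 300 because the interval is expressed in seconds.

```yaml
# Sketch only: id, name, and the connection string value are placeholders.
- id: 1
  name: azblob_logs
  type: azblob
  properties:
    # Prefer a secret reference over a literal value here.
    connection_string: "<your-connection-string>"
    container_name: logs
    file_format: json
    poll_interval: 300   # 5 minutes, in seconds
```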
Service Principal Authentication
Using service principal authentication for enterprise security compliance...
Service principal provides enterprise-grade authentication with path filtering for production logs only...
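A possible configuration for this scenario is sketched below. It sets the four service principal fields plus container_name and omits connection_string. All identifiers, the account and container names, and the path_prefix value are hypothetical placeholders.

```yaml
# Sketch only: every value below is a placeholder, not a real credential.
- id: 2
  name: azblob_service_principal
  type: azblob
  properties:
    tenant_id: "<your-tenant-id>"
    client_id: "<your-client-id>"
    # Prefer a secret reference over a literal value here.
    client_secret: "<your-client-secret>"
    account: "<your-storage-account>"
    container_name: "<your-container>"
    path_prefix: "production/"   # assumed prefix for production logs
    file_format: json
```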
High-Volume Parquet Processing
Processing large Parquet files with optimized settings for high-volume data...
Optimized for processing large Parquet files with batching and automatic cleanup after successful processing...
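One way such a setup might look, as a sketch: file_format selects Parquet, batch_size is raised above its default of 1000, and delete_after_processing enables cleanup. The specific numbers are illustrative assumptions, not recommended values, and the names are placeholders.

```yaml
# Sketch only: numeric values are illustrative and should be tuned.
- id: 3
  name: azblob_parquet
  type: azblob
  properties:
    connection_string: "<your-connection-string>"
    container_name: "<your-container>"
    file_format: parquet
    batch_size: 10000               # larger than the default of 1000
    max_concurrent_files: 10        # larger than the default of 5
    delete_after_processing: true   # remove files after successful processing
```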
Pipeline Processing
Integrating blob storage device with preprocessing pipeline for data transformation...
Raw blob data is processed through pipelines for timestamp normalization and field enrichment before routing to targets...
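A rough sketch of attaching pipelines to the device follows. The pipeline names are hypothetical, and referencing pipelines as a plain list of names is an assumption; the Configuration table only states that pipelines is an array of preprocessing pipeline references, so check the pipeline documentation for the exact form.

```yaml
# Sketch only: pipeline names and the reference form are assumptions.
- id: 4
  name: azblob_with_pipelines
  type: azblob
  pipelines:
    - timestamp_normalization   # hypothetical pipeline name
    - field_enrichment          # hypothetical pipeline name
  properties:
    connection_string: "<your-connection-string>"
    container_name: "<your-container>"
    file_format: jsonl
```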
Path-Based File Organization
Using path prefixes to organize and process files from specific subdirectories...
Device only processes files from the specific path structure, enabling organized data ingestion patterns...
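As an illustration, path_prefix can restrict the device to one subdirectory of the container. The prefix shown here is a made-up directory layout, and the other values are placeholders.

```yaml
# Sketch only: the prefix reflects a hypothetical directory layout.
- id: 5
  name: azblob_path_filtered
  type: azblob
  properties:
    connection_string: "<your-connection-string>"
    container_name: "<your-container>"
    path_prefix: "applications/web/"   # only files under this prefix are processed
    file_format: json
```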
Error Recovery Configuration
Configuring robust error handling with retry logic and processing state management...
Conservative settings preserve files after processing and limit concurrency for stable processing of critical data...
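A conservative configuration along these lines might look like the sketch below. It keeps source files by leaving delete_after_processing at false and lowers max_concurrent_files below its default of 5. The schema defines no dedicated retry fields, so none are shown; the numeric values are illustrative assumptions.

```yaml
# Sketch only: values are illustrative, names are placeholders.
- id: 6
  name: azblob_conservative
  type: azblob
  properties:
    connection_string: "<your-connection-string>"
    container_name: "<your-container>"
    file_format: json
    batch_size: 500                  # smaller than the default of 1000
    max_concurrent_files: 1          # process one file at a time
    delete_after_processing: false   # keep files so failed runs can be retried
```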