Pipelines: Overview
Director pipelines are designed to automate your data ingestion and data transformation processes. They can be used to extract values, transform or convert them, enrich them by correlating them with other available information, and to design the overall workflow of your telemetry data processing scheme.
Pipelines are created to route incoming data streams from sources to targeted outgoing streams or some other final destination.
The sources can be devices, networks, or pipelines, and the targets can be the console, data storage files, or again other pipelines. Sources and targets can be connected to each other in one-to-one, one-to-many, many-to-one, and many-to-many configurations.
Each incoming stream can be queried with criteria specific to the properties of the data, and each outgoing stream can be enriched with inferences made from the data for further processing on the target side.
Definitions
Pipelines are chains of processors that run sequentially, operating on the incoming data streamed from providers named sources, and routing their output to consumers called targets for consumption or further processing.
Schematically, this looks like so:
As can be expected, in real-life scenarios the configurations connecting sources to targets will vary based on the requirements of the various actors.
Configurations
A pipeline can consume data from multiple providers and, once it has finished processing the data, direct some of the processed items to a specific consumer in a one-to-one scheme:
It can also consume data from a single provider and deliver the processed results to multiple consumers:
Finally, and in most real-life situations, we will have multiple providers and multiple consumers directed to each other in complex arrangements:
Effectively, each source may be the target of an upstream pipeline, and each target may serve as the source for a downstream one. The detail to keep in mind is that each Provider side delegates some processing to the next pipeline based on the established requirements of the Consumer side. The pipeline acts as the middleman for the data interchange. Even if it routes some data items without any processing, it should do so because the Consumer side demands it, so the pipeline still performs a meaningful role by fulfilling a configured routing policy.
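To make such an arrangement a little more tangible, the sketch below wires two providers into one pipeline and fans the results out to two consumers. The syntax is purely illustrative: the sources, targets, pipelines, and routes keys, and all the names used, are hypothetical placeholders rather than the product's actual schema.

# Hypothetical wiring only: key names and identifiers are placeholders.
sources:
  - name: firewall-syslog
  - name: linux-auditd
pipelines:
  - name: normalize-auth-events
    processors:
      - processor-1:
          field: foo
          value: "A"
targets:
  - name: analytics-store
  - name: console
routes:
  - name: auth-route
    sources: [firewall-syslog, linux-auditd]    # many-to-one into the pipeline
    pipelines: [normalize-auth-events]
    targets: [analytics-store, console]         # one-to-many out of the pipeline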
Telemetry
In telemetry, these basic configurations are clustered according to a scheme where the primary provider is named Source and the ultimate consumer is named Target or Destination.
There is, however, a middle stage—where the software solutions do their real work—called Routing.
Until now, we have not mentioned two interim stages—which also act as both consumers and providers—required to link those stages in a chain. These are called Pre-processing and Post-processing, respectively, and they both involve what is known as normalization, i.e. preparing the data for processing downstream.
Pipelines are, once again, our primary tool at these interim stages. Schematically:
As can be seen from the graphics, the components in the middle of this setup are actually called routes.
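A minimal sketch of this staging might look like the following. The route-level keys and names are hypothetical placeholders chosen only to show where the pre-processing and post-processing pipelines sit in the chain; the pipelines referenced by name would follow the layout documented later in this overview.

# Hypothetical staging only: the route-level keys are illustrative placeholders.
route:
  name: telemetry-route
  source: edge-devices                 # primary provider (Source)
  preprocess: normalize-incoming       # pre-processing pipeline attached to the source
  postprocess: prepare-for-storage     # post-processing pipeline attached to the target
  target: data-warehouse               # ultimate consumer (Target or Destination)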
The purpose of these simple illustrations is to emphasize one thing: pipelines simplify telemetry through division of labor.
Let us illustrate.
Data Streams
In any telemetry operation, the incoming raw data streams will naturally have their own structure. This structure, however, may not—and frequently does not—lend itself to specific uses, such as analysis or transformation. Trying to use the data as is for such purposes would be time-consuming, processor-intensive, and possibly error-prone. Therefore, it is unwise to implement a curating strategy on the Source side, i.e. where the data originates. Similarly, if the curated data needs to be transformed or enriched as per the requirements of a downstream consumer, it is unwise to delegate these transformation tasks to the Target side, i.e. where the data items to be consumed are expected to be in an easily consumable format.
And these are exactly the places where pipelines enter the picture.
Divide and Rule
A pipeline is essentially based on the principle of divide-and-rule: divide the operations to be performed on the available data into the smallest meaningful and coherent actions, sequence them in a logical order based on the expected inputs and the generated outputs, and run them as a chain.
The basic layout of the configuration for a typical pipeline looks like so:
pipeline:
  description: "What the pipeline does"
  processors:
    - processor-1:
        field: foo
        value: "A"
    - processor-2:
        description: "What processor-2 does"
        field: bar
        value: true
    - processor-3:
        field: baz
        value: 10
This pipeline is created based on the assumption that the incoming data stream will have the following structure:
data:
  - foo: ["A", "B", "C", ...]
  - baz: [5, 10, 1, ...]
  - bar: true
Note that the pipeline definition contains a mixture of required and optional fields, and that—in this form—it disregards the order of fields in the source data. All that it cares about is that
- the enumerated fields exist, and
- they contain the expected types of data
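For instance, a stream that delivers the same fields in a different order would satisfy this pipeline just as well, since only the presence and the types of foo, bar, and baz are checked:

data:
  - bar: true
  - baz: [5, 10, 1, ...]
  - foo: ["A", "B", "C", ...]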
Simplified Functionality
Let us point out here that the above pipeline may appear incomplete: it does not explicitly specify what will happen to the selected values when the processors are run. Assume that, after this pipeline has been run, the data appears like so:
data:
  - foo: ["Z", "B", "C", ...]
  - baz: [5, 100, 1, ...]
  - bar: true
With a brief glance, one can notice that the value A in foo has become Z, that the 10 in baz has become 100, whereas bar is left untouched since it was already true.
Although the pipeline appears underspecified, the example is meant to show that this is all a sequence of processors does: each performs a single operation on a specific field, and that is the pipeline's power. Once the incoming data is normalized, i.e. converted into a structure that makes it easy to extract data items from it, it becomes very easy to curate, transform, and enrich its values into a desirable form on the fly.
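To make that explicit, the missing pieces could be sketched as below. The set and multiply processors, and their replacement and factor parameters, are hypothetical stand-ins for whatever concrete processors a deployment actually provides; they are only meant to show one operation per field.

pipeline:
  description: "Illustrative sketch; processor names and parameters are hypothetical"
  processors:
    - set:                   # would replace matching values of foo
        field: foo
        value: "A"
        replacement: "Z"     # hypothetical parameter
    - multiply:              # would scale matching values of baz
        field: baz
        value: 10
        factor: 10           # hypothetical parameter
    - set:                   # bar is already true, so this leaves it untouched
        field: bar
        value: true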
Let us now explain the concepts of curation, transformation, and enrichment a bit further.
Streamlined Streams
As we have said earlier, the incoming raw data streams will have their own structure. As such, they will at best be partially suitable for analysis and, therefore, decision making. In its raw form, data frequently is not, or at least may not be, what it seems to be. It has to be normalized and sifted through.
The process of selecting or filtering data based on specific criteria is called curation. This involves checking whether a field's value matches or contains specific values or fragments of values.
After this is done, the selected data may need to be converted into forms making them more suitable for analysis and consumption. This second phase is called transformation.
Finally, the data may contain hints or fragments of information which, when correlated with other available data, may yield insights that may be required for analysis and use. The process of adding correlated information in order to render the data more relevant—or increase its relevance—is known as enrichment.
The overall process can be represented schematically as:
It is through this three-fold staging that a pipeline truly shines and becomes indispensable.
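Put in configuration terms, a pipeline staged along these three lines might be sketched as follows, with hypothetical processor names (filter, convert, lookup) and parameters standing in for actual curation, transformation, and enrichment processors:

pipeline:
  description: "Curate, then transform, then enrich; processor names are hypothetical"
  processors:
    - filter:                # curation: keep only the events of interest
        field: event_type
        value: "login"
    - convert:               # transformation: make the value easier to analyze
        field: duration
        type: integer        # hypothetical parameter
    - lookup:                # enrichment: correlate with other available data
        field: source_ip
        table: geo-database  # hypothetical reference to an external lookup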
Best Practices
Processors are designed to select data points by a combination of field values. Their primary design principle, as stated above, is to serve one purpose only, which streamlines the work they do.
Choice and Sequencing
This naturally has an impact on performance: the fewer the data points or field values a processor has to act upon, the better the overall performance. If processor p operates on field foo only when its value is A, whereas processor q operates on bar only when its value is B, using the two in sequence guarantees that each completes its work in a shorter time.
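In the layout used throughout this overview, that sequencing would simply appear as two narrowly scoped entries in the processors list; processor-p and processor-q are placeholders, as before:

pipeline:
  description: "Two narrowly scoped processors run in sequence"
  processors:
    - processor-p:       # acts only on foo, and only when its value is "A"
        field: foo
        value: "A"
    - processor-q:       # acts only on bar, and only when its value is "B"
        field: bar
        value: "B"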
However, this implies that their overuse is likely to cancel whatever benefits they may offer, and their sequencing has to be carefully considered, particularly if the output of one is fed into the next as input.
Additionally, their selection will generally be dictated by the consumers downstream. This is where premature optimization, as usual, can be a source of frustration.
Therefore, the guiding principles should be
- use only the processors directly relevant to the curation process
- use them in the correct order
- avoid any premature processing
Purpose of Use
It is also essential to be cognizant of the type of pipeline that is being designed.
Pre-processing pipelines are attached to sources, and prepare the data before it enters the routing stage. They focus on:
- Data reduction - Filtering unnecessary events, removing redundant fields, sampling high-volume data, and aggregating similar events
- Initial normalization - Field and protocol standardization, and format conversion and time normalization
- Early enrichment - Geolocation data, asset information, basic threat intelligence, and custom metadata
Normalization pipelines handle the conversion between different log formats throughout the processing chain. The primary transformations are field name standardization, data type normalization, structure unification, and time format alignment.
Post-processing pipelines are attached to targets, and perform final transformations before data storage and analysis:
- Format finalization - Target-specific formatting, schema alignment, and final field mapping
- Storage optimization - Compression configuration, index preparation, partitioning strategy, and retention setup
- Integration - Target-specific transformations, protocol adaptation, authentication preparation, and error handling
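As a rough illustration of the last type, a post-processing pipeline attached to a target might be sketched like so; the rename and format processors and their parameters are hypothetical, chosen only to suggest final field mapping and target-specific formatting:

pipeline:
  description: "Post-processing before storage; processor names are hypothetical"
  processors:
    - rename:                  # final field mapping to the target's schema
        field: src_ip
        value: "source.address"
    - format:                  # target-specific formatting
        field: timestamp
        value: "ISO8601"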
Keep the pipelines focused on their purpose, and minimize cross-type dependencies.
Leverage the separation of concerns this brings: it clarifies the role of each pipeline and enables a modular architecture, which in turn reduces overhead through better resource utilization, improves scalability and routing efficiency, and makes robust testing possible.
For performance optimization, deal with heavy transformations early, optimize routing decisions, and monitor type-specific metrics.
For error handling, implement stage-appropriate handlers, use type-specific failure responses, maintain clear error boundaries, and log errors with context.