Pipelines: Quick Start
Creating a pipeline requires a systematic approach that primarily involves two key factors:
Ingestion Source: The origin of the data. A pipeline must be designed to handle data whose characteristics are determined by its source.
Configuration: The specific arrangement of the processors. A pipeline must be configured so that its output meets the intended objectives.
In other words, the pipeline has an input and an ultimate output, and the selection and configuration of the processors that make up the pipeline are dictated by what is to be consumed and what is to be produced.
Design Considerations
When designing a pipeline, two key aspects need to be considered: the order of and dependencies between its processors, and the interactions expected to take place between pipelines.
Pipeline design is an iterative process. Always start simple and progressively improve your configuration as you better grasp the requirements of your specific use cases.
Order and Dependency
The first thing to consider is the relations between the processors. There are three possibilities:
- Run simultaneously without relying on each other's output

  Example

  The `parser` and `enricher` processors run independently:

  ```yaml
  pipelines:
    processors:
      - parser
      - enricher
  ```
- Use the output of a previous one as their input

  Example

  The `normalizer` processor uses the output of `parser`, and the `enricher` processor uses the normalized data:

  ```yaml
  pipelines:
    processors:
      - parser
      - normalizer
      - enricher
  ```
- Run based on specific conditions, such as when the result of a previous one meets certain criteria or when a computation completes:
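  Example

  A minimal sketch of what this might look like, assuming a hypothetical `condition` setting; the exact key and expression syntax depend on your configuration schema, and the processor names are illustrative:

  ```yaml
  pipelines:
    processors:
      - parser
      - name: enricher
        # Hypothetical setting: only run when the parser marked the event as valid.
        condition: parsed.status == "ok"
  ```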
Interaction Patterns
Next come the interactions between pipelines. Real-world scenarios often require complex exchanges between them. There are three possible layouts:
- Run simultaneously:

  Example

  The `network_logs` and `security_logs` pipelines run independently, so they can run simultaneously:

  ```yaml
  pipelines:
    - name: network_logs
      processors:
        - network_parser
        - network_enricher
    - name: security_logs
      processors:
        - security_parser
        - threat_detector
  ```
- Trigger one another upon completion:

  Example

  The `secondary` pipeline is triggered by the `primary` pipeline:

  ```yaml
  pipelines:
    - name: primary
      processors:
        - initial_parser
      on_complete:
        trigger: secondary
    - name: secondary
      processors:
        - advanced_enrichment
  ```
- Run based on a pre-defined hierarchical order, and potentially relay data:
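  Example

  As a sketch, a hierarchy can be expressed by chaining completion triggers, as in the previous example. Whether and how data is relayed between the stages depends on your configuration schema, and the pipeline and processor names here are illustrative:

  ```yaml
  pipelines:
    - name: collection          # top of the hierarchy
      processors:
        - raw_parser
      on_complete:
        trigger: normalization
    - name: normalization       # runs after collection completes
      processors:
        - field_normalizer
      on_complete:
        trigger: enrichment
    - name: enrichment          # runs last, consuming the relayed data
      processors:
        - context_enricher
  ```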
Best Practices
Finally, there are a few guidelines for designing effective and efficient pipelines. Always keep the following in mind.
Modularity
Reusability is one of the ever-present requirements in IT design.
In the context of pipelines, this means focusing on specific transformations. If, for example, a string field needs to be stripped of formatting tags before a certain value is extracted from it, it is best to keep those two steps together in a single, reusable pipeline.
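As a sketch, such a pipeline could group the two steps under one reusable name; the pipeline and processor names are illustrative:

```yaml
pipelines:
  - name: clean_and_extract      # reusable unit: strip tags, then extract the value
    processors:
      - tag_stripper             # removes formatting tags from the string field
      - value_extractor          # extracts the target value from the cleaned field
```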
Simplicity
Complex pipelines introduce overhead which, if not handled carefully, may degrade performance. Check that every processor included is genuinely needed for the pipeline's primary task. Keep in mind that performance is also related to modularity.
Volumes
Anticipate handling varying data volumes.
Pipelines really shine at scale, so keep large amounts of data in mind from the start; this consideration often exposes inefficient design choices early.
Failures
Always implement robust error handling.
This is particularly relevant to expensive computations: repeating them consumes extra system resources and quickly becomes wasteful.
A well-designed logging mechanism is a valuable aid when implementing error handling.
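A minimal sketch of what this might look like, assuming hypothetical `on_failure` settings; the actual keys depend on your configuration schema, and the pipeline and processor names are illustrative:

```yaml
pipelines:
  - name: billing_events
    processors:
      - parser
      - expensive_enricher        # costly step we do not want to repeat needlessly
    # Hypothetical error-handling settings:
    on_failure:
      retries: 2                  # cap retries so failures do not waste resources
      log_level: error            # record failures for later analysis
      dead_letter: billing_errors # park failed events instead of reprocessing them
```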
Optimization
This requires observing the following:
- Use parallel processing where possible

  If your pipelines are modular enough, running them simultaneously should pose no difficulty. Conversely, if you want to do as much parallel processing as possible, mind the modularity of your pipelines.
- Streamline data transformations

  Do not include a data transformation in a pipeline unless it is directly relevant to the pipeline's goal. Managing intricate data manipulation requirements is always a challenge, so make sure that each pipeline serves a specific and clear purpose.
- Reduce computational complexity

  The key to achieving this is choosing an appropriate processor order. A sloppy design in that regard, i.e. one that ignores the input-output sequence, may increase the computational burden of a pipeline in unexpected ways. A common case is placing a cheap filter before an expensive processor, as sketched after this list.
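As a sketch of the ordering point above, filtering events before an expensive step keeps the costly processor from handling data that would be discarded anyway; the processor names are illustrative:

```yaml
pipelines:
  - name: ordered_for_efficiency
    processors:
      - parser
      - irrelevant_event_filter   # cheap: drops events that do not need enrichment
      - geoip_enricher            # expensive: now runs only on the remaining events
```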
Data Integrity
This requires consistent data typing across processors, validation steps, and handling of edge cases and unexpected input formats.
The most common challenge in this regard is format variation. Make sure that you have paid enough attention to this stage of the design.
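As a sketch, validation and type normalization can be placed early in the pipeline so that downstream processors receive consistent input; the processor names are illustrative:

```yaml
pipelines:
  - name: validated_ingest
    processors:
      - schema_validator      # rejects or flags events with unexpected formats
      - type_normalizer       # coerces fields to consistent types (e.g. timestamps)
      - enricher              # downstream processors can now assume clean input
```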
Next Steps
Always review the available processors and their specific configurations before embarking on a specific design. Build incrementally and iteratively, and test your design at every step.