
Pipelines: Quick Start

Creating a pipeline requires a systematic approach that primarily involves two key factors:

Ingestion Source: The origin of the data. Pipelines have to be designed to handle data whose specific characteristics are determined by the source.

Configuration: The specific arrangement of the processors. Pipelines need to be configured so that their output meets specific objectives.

In other words, the pipeline has an input and an ultimate output, and the selection and configuration of the processors that make up the pipeline are dictated by what is to be consumed and what is to be produced.

Design Considerations

When designing a pipeline, two key aspects need to be considered: the sequential relations and dependencies between the processors, and the interactions anticipated to take place between pipelines.

note

Pipeline design is an iterative process. Always start simple and progressively improve your configuration as you better grasp the requirements of your specific use cases.

Order and Dependency

The first thing to consider is the relations between the processors. There are three possibilities:

  • Run simultaneously without relying on each other's output

    Example

    The parser and enricher processors run independently:

    pipelines:
      processors:
        - parser
        - enricher
  • Use the output of a previous one as their input

    Example

    The normalizer processor uses the output of parser, and the enricher processor uses normalized data:

    pipelines:
      processors:
        - parser
        - normalizer
        - enricher
  • Run based on specific conditions, such as when the result of a previous one meets certain criteria or when a computation completes:
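
    Example

    This is an illustrative sketch only: the when key and the condition expression are hypothetical placeholders, since the exact syntax for conditional execution depends on your processor configuration schema. Here the enricher processor runs only when the parser reports success:

    pipelines:
      processors:
        - parser
        - processor: enricher
          when: parser.success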

Interaction Patterns

Next, consider the interactions between pipelines. Real-world scenarios often require complex exchanges between them. There are three possible layouts:

  • Run simultaneously:

    Example

    The network_logs and security_logs pipelines run independently, and so they can run simultaneously:

    pipelines:
      - name: network_logs
        processors:
          - network_parser
          - network_enricher

      - name: security_logs
        processors:
          - security_parser
          - threat_detector
  • Trigger one another upon completion:

    Example

    The secondary pipeline is triggered by the primary pipeline:

    pipelines:
      - name: primary
        processors:
          - initial_parser
        on_complete:
          trigger: secondary

      - name: secondary
        processors:
          - advanced_enrichment
  • Run based on a pre-defined hierarchical order, and potentially relay data:
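
    Example

    This is an illustrative sketch only: the forward_to and depends_on keys are hypothetical placeholders used to show the idea of a hierarchical order with data relay; the actual syntax depends on your configuration schema. Here the aggregation pipeline runs after collection and receives its output:

    pipelines:
      - name: collection
        processors:
          - raw_parser
        forward_to: aggregation

      - name: aggregation
        depends_on: collection
        processors:
          - summarizer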

Best Practices

Finally, we have to consider a few guidelines for designing effective and efficient pipelines. Always keep the following in mind.

Modularity

Reusability is one of the ever-present requirements in IT design.

In the context of pipelines, this means focusing each pipeline on a specific transformation. If, for example, a string field needs to be stripped of formatting tags before a certain value is extracted, it is best to keep these two steps together.
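
For example, a small, reusable pipeline might bundle the two steps mentioned above. The strip_tags and extract_value processor names below are hypothetical placeholders:

    pipelines:
      - name: clean_and_extract
        processors:
          - strip_tags       # remove formatting tags from the string field
          - extract_value    # pull the target value out of the cleaned string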

Simplicity

Complex pipelines are likely to introduce overhead which, if not handled carefully, may degrade performance. Therefore, check whether every processor included is truly necessary for the pipeline's primary task. Keep in mind that performance is also related to modularity.

Volumes

Anticipate handling varying data volumes.

Pipelines really shine at scale. Design with large volumes of data in mind, since that is where inefficient design choices are most likely to be exposed.

Failures

Always implement robust error handling.

This is particularly relevant to expensive computations: repeating them after a failure consumes extra system resources and quickly becomes wasteful.

Given the above, a well-designed logging mechanism is a valuable aid when implementing error handling.
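
As a sketch of what this might look like in configuration, and assuming a hypothetical on_failure key and log_error processor (the actual error-handling syntax depends on your configuration schema):

    pipelines:
      - name: enrichment
        processors:
          - expensive_lookup    # costly computation we do not want to repeat needlessly
        on_failure:
          - log_error           # record the failure so it can be diagnosed rather than silently retried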

Optimization

This requires observing the following:

  • Use parallel processing where possible

    If your pipelines are modular enough, you should not have any difficulty running them simultaneously. Conversely, if you want to be able to do as much parallel processing as possible, mind the modularity of your pipelines.

  • Streamline data transformations

It is best not to include a data transformation in a pipeline unless it is directly relevant to the pipeline's goal. Managing intricate data manipulation requirements is always a challenge, so make sure that each pipeline serves a specific and clear purpose.

  • Reduce computational complexity

The key to achieving this is choosing the appropriate processor order. A sloppy design in this regard, i.e. not paying attention to the input-output sequence, may increase the computational burden of a pipeline in unexpected ways.
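
    Example

    A cheap filtering step placed before an expensive enrichment step reduces the amount of data the costly processor has to handle. The processor names below are hypothetical placeholders:

    pipelines:
      - name: ordered_for_efficiency
        processors:
          - drop_irrelevant_events    # cheap filter first, so less data reaches the next step
          - geoip_enricher            # expensive lookup now runs only on the remaining events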

Data Integrity

This requires maintaining consistent data typing across the processors, implementing validation steps, and handling edge cases and unexpected input formats.

The most common challenge in this regard is format variation. Make sure that you have given this stage of the design enough attention.
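
As an illustration, a validation step can be placed early in the pipeline. The schema_validator and type_coercer processor names below are hypothetical placeholders:

    pipelines:
      - name: integrity_checked
        processors:
          - schema_validator    # flag or reject events with missing or malformed fields
          - type_coercer        # normalize field types before downstream processors run
          - enricher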

Next Steps

Always review the available processors and their specific configurations before embarking on a specific design. Build incrementally and iteratively, and test your design at every step.