Version: 1.2.0

Regex Replace

Parse, Pattern Matching, Data Transformation

Synopsis

A text processor that finds and replaces text patterns using regular expressions. It supports pattern-based transformations for data cleaning, formatting, and normalization.

Schema

- regex_replace:
    field: <ident>
    regex: <string>
    replacement: <string>
    target_field: <ident>
    description: <text>
    if: <script>
    ignore_failure: <boolean>
    ignore_missing: <boolean>
    on_failure: <processor[]>
    on_success: <processor[]>
    tag: <string>

Configuration

The following fields are used to define the processor:

Field           Required  Default  Description
field           Y         -        Field containing the text to process
regex           Y         -        Regular expression pattern to match
replacement     Y         -        Replacement text or pattern
target_field    N         field    Field to store the modified text
description     N         -        Explanatory note
if              N         -        Condition to run
ignore_failure  N         false    Continue if regex processing fails
ignore_missing  N         false    Continue if source field doesn't exist
on_failure      N         -        See Handling Failures
on_success      N         -        See Handling Success
tag             N         -        Identifier
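
To make the interaction between field, target_field, and ignore_missing concrete, here is a minimal Python sketch that emulates those three options on a plain dictionary. It is not the processor's implementation: the function name and signature are hypothetical, raising KeyError for a missing field is an assumption about what "failure" means, and Python's re module may differ from the processor's regex engine.

```python
import re
from typing import Any, Dict, Optional


def regex_replace(
    event: Dict[str, Any],
    field: str,
    regex: str,
    replacement: str,
    target_field: Optional[str] = None,
    ignore_missing: bool = False,
) -> Dict[str, Any]:
    """Emulate the documented options on a dict (illustrative only)."""
    if field not in event:
        if ignore_missing:
            return event  # skip quietly and leave the event unchanged
        # Assumption: a missing source field counts as a failure.
        raise KeyError(f"field {field!r} is missing")
    # Python writes backreferences as \g<1>; translate the $1 style used here.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = re.sub(regex, py_replacement, event[field])
    # target_field defaults to the source field, so the value is overwritten.
    event[target_field or field] = result
    return event


event = {"message": "Error: Failed to connect to server"}
regex_replace(event, "message", r"Error:", "WARNING:", target_field="normalized")
print(event["normalized"])  # WARNING: Failed to connect to server
print(event["message"])     # Error: Failed to connect to server
```

When target_field is omitted, the same call overwrites message in place, which matches the default shown in the table.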

Details

The processor uses regular expressions to find and replace text patterns within string fields. It supports both simple text replacement and complex pattern matching with capture groups and backreferences.

Note: This processor is an alias for the gsub processor, providing the same functionality with a more descriptive name.

Patterns can use character classes, quantifiers, anchors, and grouping. The replacement string can include backreferences ($1, $2, etc.) to groups captured by the regex pattern.
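
As a rough illustration of how capture groups feed backreferences, the sketch below uses Python's re module. Python is used only for demonstration: it writes backreferences as \1 and \2 where the processor uses $1 and $2, and its regex dialect is not guaranteed to match the processor's. The sample line is made up.

```python
import re

line = "host=web01 status=500"

# Two capture groups: group 1 is the key, group 2 is the value.
# The processor-style replacement "$2<-$1" is written r"\2<-\1" in Python.
swapped = re.sub(r"(\w+)=(\w+)", r"\2<-\1", line)
print(swapped)  # web01<-host 500<-status
```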

The processor replaces every occurrence of the pattern in the text, not only the first, which makes it suitable for comprehensive text cleaning and transformation tasks.
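
The difference between replacing all occurrences and replacing only the first can be seen in this small Python sketch. The count=1 call is shown purely for contrast; the schema above does not list an equivalent option for the processor.

```python
import re

text = "a1b22c333"

# Global replacement: every run of digits is rewritten.
every = re.sub(r"\d+", "#", text)

# Only the first match is rewritten (for contrast only).
first = re.sub(r"\d+", "#", text, count=1)

print(every)  # a#b#c#
print(first)  # a#b22c333
```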

Warning: Test regex patterns thoroughly to avoid unintended matches or performance issues with complex patterns.
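
One lightweight way to test a pattern before deploying it is a table of inputs and expected outputs, including inputs that must stay untouched. The Python sketch below is an assumption about workflow, not a feature of the processor; the check helper and the sample lines are hypothetical.

```python
import re


def check(pattern, replacement, cases):
    """Return (input, expected, actual) for every case that fails."""
    failures = []
    for text, expected in cases:
        actual = re.sub(pattern, replacement, text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


cases = [
    ("Error: disk full", "WARNING: disk full"),
    ("No Error: all good", "No Error: all good"),  # must stay untouched
]

loose = check(r"Error:", "WARNING:", cases)    # unanchored pattern
strict = check(r"^Error:", "WARNING:", cases)  # anchored to line start

print(loose)   # [('No Error: all good', 'No Error: all good', 'No WARNING: all good')]
print(strict)  # []
```

The unanchored pattern rewrites text in the middle of a line, which is the kind of unintended match the warning refers to; anchoring with ^ restricts it to the start of the value.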

Examples

Basic Text Replacement

Replacing simple text patterns...

{
  "message": "Error: Failed to connect to server"
}

- regex_replace:
    field: message
    regex: "Error:"
    replacement: "WARNING:"

updates the error level:

{
  "message": "WARNING: Failed to connect to server"
}

Pattern Matching with Capture Groups

Using capture groups for reformatting...

{
  "timestamp": "2024-01-15 14:30:25"
}

- regex_replace:
    field: timestamp
    regex: "(\\d{4})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2})"
    replacement: "$2/$3/$1 $4:$5:$6"
    target_field: formatted_timestamp

reformats the date:

{
  "timestamp": "2024-01-15 14:30:25",
  "formatted_timestamp": "01/15/2024 14:30:25"
}

Email Masking

Masking email addresses for privacy...

{
  "user_info": "Contact [email protected] for support"
}

- regex_replace:
    field: user_info
    regex: "([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})"
    replacement: "***@$2"

masks the username portion:

{
  "user_info": "Contact ***@example.com for support"
}

Log Level Normalization

Normalizing various log level formats...

{
  "log_line": "[ERR] Database connection failed",
  "log_line2": "WARN: Memory usage high"
}

- regex_replace:
    field: log_line
    regex: "\\[(ERR|ERROR)\\]"
    replacement: "[ERROR]"
- regex_replace:
    field: log_line2
    regex: "WARN:"
    replacement: "[WARNING]"

standardizes log levels:

{
  "log_line": "[ERROR] Database connection failed",
  "log_line2": "[WARNING] Memory usage high"
}

URL Path Extraction

Extracting paths from URLs...

{
  "request_url": "https://api.example.com/v1/users/123?param=value"
}

- regex_replace:
    field: request_url
    regex: "https?://[^/]+(/[^?]*).*"
    replacement: "$1"
    target_field: url_path

extracts just the path:

{
  "request_url": "https://api.example.com/v1/users/123?param=value",
  "url_path": "/v1/users/123"
}
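
A substitution rewrites only the matched span, so whether the query string survives depends on whether the pattern consumes it. The Python sketch below illustrates this with two variants of the pattern; Python's \1 stands in for $1, and the processor's regex engine may differ in other details.

```python
import re

url = "https://api.example.com/v1/users/123?param=value"

# The pattern stops before "?", so the query string lies outside the match
# and is left in place after the substitution.
partial = re.sub(r"https?://[^/]+(/[^?]*)", r"\1", url)

# A trailing .* extends the match to the end of the string,
# so only the captured path remains.
whole = re.sub(r"https?://[^/]+(/[^?]*).*", r"\1", url)

print(partial)  # /v1/users/123?param=value
print(whole)    # /v1/users/123
```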

Multi-Pattern Replacement

Applying multiple regex replacements...

{
  "raw_text": "User ID: 12345, Phone: (555) 123-4567, Email: [email protected]"
}

- regex_replace:
    field: raw_text
    regex: "\\d{5}"
    replacement: "XXXXX"
- regex_replace:
    field: raw_text
    regex: "\\(\\d{3}\\) \\d{3}-\\d{4}"
    replacement: "XXX-XXX-XXXX"
- regex_replace:
    field: raw_text
    regex: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
    replacement: "***@***.***"

sanitizes sensitive information:

{
  "raw_text": "User ID: XXXXX, Phone: XXX-XXX-XXXX, Email: ***@***.***"
}
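
Because each processor in the chain works on the output of the previous one, the order of the patterns can matter. The Python sketch below mimics the three steps as a simple loop; the email address in the sample is made up for illustration, and Python's re module is only a stand-in for the processor's engine.

```python
import re

# Hypothetical sample; the address is invented for this sketch.
raw_text = "User ID: 12345, Phone: (555) 123-4567, Email: jane@example.com"

steps = [
    (r"\d{5}", "XXXXX"),
    (r"\(\d{3}\) \d{3}-\d{4}", "XXX-XXX-XXXX"),
    (r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "***@***.***"),
]

for pattern, replacement in steps:
    # Each step sees the result of the previous step.
    raw_text = re.sub(pattern, replacement, raw_text)

print(raw_text)  # User ID: XXXXX, Phone: XXX-XXX-XXXX, Email: ***@***.***
```

Note that \d{5} matches any run of five digits, so an unformatted number such as 5551234567 would be masked by the first step as well. Word boundaries (\b\d{5}\b) are one way to narrow it, assuming the processor's engine supports them.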