April 25, 2025

A Principled Architecture.

Ari Zilka

The Smart Telemetry Hub - Powered by OpenTelemetry

So far in this channel, we have been talking about observability 1.0 and how it might have gaps. To me, observability 1.0 is an analytics tool. If you look at New Relic, Datadog, Splunk, or Dynatrace, they provide a few “golden path” solutions that represent automatic, intelligent analytics, but those are still made up of canned queries and visualizations. They may even be robust solutions to point problems, but for the most part you send data to a cloud DB and page back cursor-based answers to your queries.

If you want to move from passive analytics to active automation, you need a different approach. Let's go through the architecture of an approach that gets us beyond analytics.


Static vs. Dynamic OpenTelemetry: Why Flexibility Matters

OpenTelemetry (OTEL) offers deep control over telemetry data — allowing you to rewrite, enrich, route, and store data as it flows through your systems. It's incredibly powerful. But the way most teams use OTEL today is still fairly static: configuration is locked into YAML files and embedded Go code. Once it's deployed, it rarely changes without significant effort.

And that's a problem — because your software never stops changing. If your observability pipelines can't evolve with your application, your visibility degrades just when you need it most.

Dynamic OpenTelemetry: Built for Change

Dynamic OpenTelemetry unlocks runtime flexibility. It allows telemetry behavior to adapt on the fly — without system restarts, manual redeployments, or waiting on centralized observability teams to push config updates. This makes it easier to reduce costs, improve relevance, and respond to real-time events.

Let's revisit the ongoing example we have been covering in this blog series: data filtration.

Imagine you have two teams working in the same environment. One team needs strict guarantees that certain logs are never filtered, regardless of volume. The other does not. Meanwhile, your site reliability engineers are scrambling to stop a misbehaving service that's flooding the system with repetitive, low-value log entries. Now your developers and SREs both want to manage the filterProcessor. Any changes one team makes could break the other team's requirements.

You need a global set of filtration rules — but with local overrides. You need expressions like “always include” and “always exclude” that can adapt per team, per service, or even per deployment — all without restarting or redeploying OTEL components. And you need a way for separate teams' requirements to be merged into the production filterProcessor configuration that gets deployed to your OTEL collector environment. Rest assured, things will change over time, so this isn't a one-time challenge.
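
To make this concrete, here is a minimal sketch in Go of how per-team rules could be merged, assuming a hypothetical FilterRule type and illustrative rule expressions (not an existing OTEL API). "Always include" rules are ordered ahead of exclusions so one team's guarantees survive another team's volume controls; a real system would compile the merged list into the collector's filterProcessor conditions.

```go
package main

import "fmt"

// FilterRule is a hypothetical representation of one team's filtration requirement.
// Action "include" means "never drop"; "exclude" means "drop when matched".
type FilterRule struct {
    Team   string
    Match  string // e.g. an OTTL-style condition or a service-name pattern
    Action string // "include" or "exclude"
}

// MergeRules combines every team's rules into one ordered list for the
// production filter configuration. "Always include" rules are emitted first
// so they win over any volume-driven exclusions added by the SREs.
func MergeRules(teamRules map[string][]FilterRule) []FilterRule {
    var includes, excludes []FilterRule
    for _, rules := range teamRules {
        for _, r := range rules {
            if r.Action == "include" {
                includes = append(includes, r)
            } else {
                excludes = append(excludes, r)
            }
        }
    }
    return append(includes, excludes...)
}

func main() {
    merged := MergeRules(map[string][]FilterRule{
        "payments": {{Team: "payments", Match: `service.name == "billing"`, Action: "include"}},
        "sre":      {{Team: "sre", Match: `body matches "heartbeat"`, Action: "exclude"}},
    })
    for _, r := range merged {
        fmt.Printf("%-8s %-7s %s\n", r.Team, r.Action, r.Match)
    }
}
```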

Decentralize Control, Avoid Bottlenecks

To scale the use of OpenTelemetry effectively across your organization, you can't bottleneck observability control behind a central OTEL engineering team. Teams must be empowered to manage their own telemetry behavior within guardrails — allowing them to iterate quickly and meet their own operational needs.

That's the promise of dynamic OpenTelemetry: observability pipelines that are flexible, responsive, and built for the pace of modern software delivery.

Rethinking Telemetry Control with Variables

If we want to evolve OpenTelemetry (OTEL) from a static configuration tool into a dynamic, code-integrated platform for runtime control of observability pipelines, we need to start with the introduction of variables. We call these telemetry control variables — configuration, enrichment, and automation variables — each serving a distinct role in modern observability systems.

Configuration Variables: From Static YAML to Programmable Pipelines

Traditionally, OTEL configurations lived in static YAML files or were hardcoded into environment variables. This approach limited flexibility and made runtime changes difficult. Today, configuration variables can be promoted into application code itself. Think of this as introducing an OTEL manifest within your app — a codified interface for adjusting telemetry behavior via environment-driven parameters.

This unlocks new capabilities. Instead of embedding filtering logic directly into OTEL's filterProcessor block, you can now define, register, and adjust filters programmatically at the application level. This means filter logic becomes part of your deployable artifact, enabling versioning, testing, and conditional activation based on business context. For now, let’s not complicate the discussion with how we might be able to wire our manifest full of OTEL directives into a living, breathing OTEL collector instance at deployment time; we will figure that out in another blog entry.
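
As a sketch of what such a manifest might look like, here is a hypothetical TelemetryManifest read from environment variables at startup. The TELEMETRY_* variable names and the default filter expression are illustrative assumptions, not an existing OTEL API; only OTEL_SERVICE_NAME is a standard OTEL environment variable.

```go
package main

import (
    "fmt"
    "os"
    "strconv"
)

// TelemetryManifest is a hypothetical, in-app "OTEL manifest": a codified
// interface for the telemetry behavior this service wants, driven by
// environment variables rather than a hand-edited collector YAML.
type TelemetryManifest struct {
    ServiceName   string
    FilterEnabled bool     // configuration variable: is volume filtering active?
    AlwaysInclude []string // expressions that must never be filtered out
    SampleRatio   float64  // configuration variable: fraction of logs to keep when filtering
}

func envBool(key string, def bool) bool {
    if v, ok := os.LookupEnv(key); ok {
        if b, err := strconv.ParseBool(v); err == nil {
            return b
        }
    }
    return def
}

func envFloat(key string, def float64) float64 {
    if v, ok := os.LookupEnv(key); ok {
        if f, err := strconv.ParseFloat(v, 64); err == nil {
            return f
        }
    }
    return def
}

// LoadManifest builds the manifest from the environment; the defaults and
// the OTTL-style "always include" expression are placeholders.
func LoadManifest() TelemetryManifest {
    return TelemetryManifest{
        ServiceName:   os.Getenv("OTEL_SERVICE_NAME"),
        FilterEnabled: envBool("TELEMETRY_FILTER_ENABLED", false),
        AlwaysInclude: []string{`severity_number >= SEVERITY_NUMBER_ERROR`},
        SampleRatio:   envFloat("TELEMETRY_SAMPLE_RATIO", 1.0),
    }
}

func main() {
    m := LoadManifest()
    fmt.Printf("manifest for %s: filter=%v ratio=%.2f include=%v\n",
        m.ServiceName, m.FilterEnabled, m.SampleRatio, m.AlwaysInclude)
}
```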

Enrichment Variables: Telemetry Meets Context

Enrichment variables are fundamentally different. They're about context injection — joining runtime data streams with reference information (tables) using key-based relationships. Whether you're appending metadata to a metric, embedding lookup-based tags in log lines, or inserting conditional logic (e.g., CASE statements) into telemetry streams, enrichment variables make it possible.

The goal is smarter telemetry: data that's not just collected, but enriched with relevance and context for better diagnostics, alerting, and analysis.
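
Here is a minimal sketch of that kind of key-based join, assuming a hypothetical ownership table and attribute names; in practice the reference table would be refreshed from a CMDB, ClickHouse, or another external source rather than hardcoded.

```go
package main

import "fmt"

// LogRecord stands in for a telemetry record flowing through the pipeline.
type LogRecord struct {
    Attributes map[string]string
    Body       string
}

// ownershipTable is a hypothetical reference table (an enrichment variable)
// keyed by service name.
var ownershipTable = map[string]string{
    "checkout": "team-payments",
    "search":   "team-discovery",
}

// Enrich joins the record against the lookup table and applies a simple
// CASE-style rule to tag escalation priority.
func Enrich(rec *LogRecord) {
    svc := rec.Attributes["service.name"]
    if team, ok := ownershipTable[svc]; ok {
        rec.Attributes["owner.team"] = team
    }
    // conditional (CASE-like) enrichment
    switch {
    case rec.Attributes["owner.team"] == "team-payments":
        rec.Attributes["escalation"] = "page"
    default:
        rec.Attributes["escalation"] = "ticket"
    }
}

func main() {
    rec := &LogRecord{Attributes: map[string]string{"service.name": "checkout"}, Body: "charge failed"}
    Enrich(rec)
    fmt.Println(rec.Attributes)
}
```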


Automation Variables: Observability that Responds

Automation variables are where observability becomes truly adaptive. These variables are not about setting up pipelines — they're about changing them in real time based on what the system is doing. Imagine flipping a switch to enable log filtering when traffic spikes, triggering data replays during incidents, or adjusting buffer thresholds dynamically based on observed risk patterns.

In short, automation variables let your observability infrastructure respond — not just report. They bridge the gap between telemetry and action, enabling use cases like anomaly detection, incident remediation, and real-time ops tuning.

A Practical Example: Dynamic Data Filters

We can ground all this theory using a practical use case — dynamic log filtering. Filters should:

  • Include or exclude services based on lifecycle stage (e.g., retain full logs for newly deployed services).
  • Be toggled when telemetry volume exceeds defined thresholds.
  • Share quotas to maximize data pass-through without breaching observability budgets.

Achieving this requires a blend of configuration (to define what and how to filter) and automation (to turn filters on/off in response to volume metrics or incidents). Together, they enable a telemetry system that is cost-aware, context-sensitive, and operationally agile.
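
Here is a minimal sketch of that blend: a hypothetical FilterState pairs a configured shared quota (configuration) with a toggle flipped by observed volume (automation). The thresholds and the 80% hysteresis are illustrative assumptions.

```go
package main

import "fmt"

// FilterState blends configuration (the quota) with automation (the toggle).
type FilterState struct {
    SharedQuotaBytes int64 // configured observability budget shared across services
    UsedBytes        int64 // running total reported by a byte-counter eval
    FilterActive     bool  // automation variable flipped by the volume check
}

// Observe is called with each eval result; it turns filtering on when the
// shared quota is breached and off again once volume falls below 80% of it.
func (s *FilterState) Observe(bytesThisWindow int64) {
    s.UsedBytes = bytesThisWindow
    switch {
    case !s.FilterActive && s.UsedBytes > s.SharedQuotaBytes:
        s.FilterActive = true
    case s.FilterActive && s.UsedBytes < s.SharedQuotaBytes*8/10:
        s.FilterActive = false
    }
}

func main() {
    state := &FilterState{SharedQuotaBytes: 5_000_000}
    for _, volume := range []int64{3_000_000, 6_500_000, 3_500_000} {
        state.Observe(volume)
        fmt.Printf("window volume=%d filterActive=%v\n", volume, state.FilterActive)
    }
}
```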



From Observation to Action: The Role of Evaluation Functions in Dynamic Observability

To truly operationalize dynamic observability, it's not enough to just make telemetry pipelines configurable at runtime — we also need a way to evaluate telemetry in motion and take action based on what we observe. This is where evaluation functions (or "evals") come into play.

Evaluation functions act like continuous queries over your telemetry streams. They monitor live data in real time, apply logic, and return actionable signals. Combined with telemetry control variables, evals enable systems that not only observe their own behavior but also respond to it dynamically.

What Do Evaluation Functions Look Like?

Evals come in different forms, depending on what you want to measure or detect. A few key examples:

  • Byte Counters: Track the volume of telemetry (e.g., log data) sent by each service. You can aggregate by service or team by enriching the data with service ownership metadata.
  • Error Rate Trackers: Continuously compute 5-minute rolling error counts per service. If error rates cross a threshold, trigger alerts or automate mitigation workflows (a minimal sketch of this one follows the list).
  • Pattern Matchers: Scan telemetry for sensitive or targeted patterns — like traces containing a specific username or database table — and route relevant spans to developers or incident response systems.
  • Security Detectors: Scan telemetry data for package manifests and store them in ClickHouse as a software bill of materials (SBOM). Separately, join that SBOM data with known CVE metadata as a lookup table and escalate to SecOps any production deployments or memory-resident copies of known vulnerable libraries.
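
As a concrete example of the error rate tracker above, here is a minimal sketch of a 5-minute rolling counter that emits a signal when a service crosses a threshold. The types, threshold, and window are illustrative.

```go
package main

import (
    "fmt"
    "time"
)

// errorEvent is one observed error with its service and timestamp.
type errorEvent struct {
    service string
    at      time.Time
}

// ErrorRateEval keeps a rolling window of errors per service and
// returns a signal whenever a service crosses the threshold.
type ErrorRateEval struct {
    Window    time.Duration
    Threshold int
    events    []errorEvent
}

func (e *ErrorRateEval) Record(service string, at time.Time) (string, bool) {
    e.events = append(e.events, errorEvent{service, at})
    // drop events that have aged out of the rolling window
    cutoff := at.Add(-e.Window)
    kept := e.events[:0]
    count := 0
    for _, ev := range e.events {
        if ev.at.After(cutoff) {
            kept = append(kept, ev)
            if ev.service == service {
                count++
            }
        }
    }
    e.events = kept
    if count >= e.Threshold {
        return fmt.Sprintf("error rate breach: %s had %d errors in %s", service, count, e.Window), true
    }
    return "", false
}

func main() {
    eval := &ErrorRateEval{Window: 5 * time.Minute, Threshold: 3}
    now := time.Now()
    for i := 0; i < 3; i++ {
        if msg, fired := eval.Record("checkout", now.Add(time.Duration(i)*time.Second)); fired {
            fmt.Println(msg) // this signal would feed the Event Processor described later
        }
    }
}
```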

These functions can be grouped into three broad categories:

  • Statistical Evaluations (counters, rates, thresholds)
  • Continuous Queries & Pattern Matchers
  • Custom Plug-in Functions, which can be authored to support domain-specific needs

Evals Don't Just Return Values — They Trigger Actions

What makes evals powerful is that they're not passive. When conditions are met, they can drive automation:

  • Triggering alerts in Slack or PagerDuty (a minimal webhook sketch follows this list)
  • Launching ServiceNow or CI/CD workflows
  • Calling cloud provider APIs to scale infrastructure or quarantine workloads
  • Reconfiguring observability pipelines in real time
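
For instance, here is a minimal sketch of the first item: posting an eval's signal to a Slack incoming webhook. The webhook URL is a placeholder, and PagerDuty, ServiceNow, or a CI/CD system would be driven the same way through their own HTTP APIs.

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// notifySlack posts an eval's signal to a Slack incoming webhook.
func notifySlack(webhookURL, message string) error {
    payload, err := json.Marshal(map[string]string{"text": message})
    if err != nil {
        return err
    }
    resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("slack webhook returned %s", resp.Status)
    }
    return nil
}

func main() {
    // hypothetical webhook URL; substitute your own
    err := notifySlack("https://hooks.slack.com/services/XXX/YYY/ZZZ",
        "error rate breach: checkout had 3 errors in 5m")
    if err != nil {
        fmt.Println("notification failed:", err)
    }
}
```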

This approach pushes observability, and specifically OpenTelemetry, beyond monitoring. It turns OpenTelemetry into a smart telemetry hub for your systems.


Injected Capabilities: Dynamic Dependency Injection in OpenTelemetry

One of the more compelling use cases for dynamic OpenTelemetry is the ability to inject system capabilities at runtime — much like in modern development languages. With configuration variables and evaluation functions working in tandem, OpenTelemetry can dynamically alter its behavior based on live conditions, data patterns, or business logic.

In a previous post, we explored buffer replay — selectively routing stored telemetry during an incident. That's a perfect example of this model in action, where the routingProcessor can dynamically determine where to send log data based on runtime criteria.

But we can extend this concept much further. Let's look at a few examples of injected capabilities that can dramatically enhance the observability pipeline:

  • Storage Tiering: Dynamically route telemetry to different storage backends based on data criticality or freshness — e.g., hot for active debugging, warm for recent history, and cold for archival compliance.
  • Asynchronous External Queries: Instead of querying external systems every time you evaluate a variable, you can inject an asynchronous update loop. This decouples data freshness from query latency and keeps enriched context up to date without degrading pipeline performance (a sketch follows below).
  • AI/ML Model Evaluation & Decisioning: Inject user-defined AI/ML models into your telemetry pipelines to enable intelligent decision branches. For example, use a model to classify anomalies in real time and decide whether to trigger alerts, escalate incidents, or initiate remediation workflows.

This model allows you to shift OpenTelemetry from being a static observability tool into a dynamic, programmable telemetry control plane — one that can adapt to your systems and evolve with your architecture.
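
To illustrate the asynchronous external queries idea above, here is a minimal example of a background refresh loop that keeps an enrichment lookup warm so the hot path never blocks on an external call. The fetch function and interval are stand-ins for a real CMDB or ServiceNow query.

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

// AsyncLookup keeps an enrichment table fresh in the background so that
// pipeline-time reads never block on an external query.
type AsyncLookup struct {
    mu    sync.RWMutex
    table map[string]string
}

// Get is called on the hot path; it only reads the cached copy.
func (a *AsyncLookup) Get(key string) (string, bool) {
    a.mu.RLock()
    defer a.mu.RUnlock()
    v, ok := a.table[key]
    return v, ok
}

// Refresh runs in its own goroutine, calling fetch on an interval and
// swapping in the result.
func (a *AsyncLookup) Refresh(interval time.Duration, fetch func() map[string]string, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        fresh := fetch()
        a.mu.Lock()
        a.table = fresh
        a.mu.Unlock()
        select {
        case <-ticker.C:
        case <-stop:
            return
        }
    }
}

func main() {
    lookup := &AsyncLookup{table: map[string]string{}}
    stop := make(chan struct{})
    go lookup.Refresh(30*time.Second, func() map[string]string {
        // stand-in for the real external query
        return map[string]string{"checkout": "team-payments"}
    }, stop)

    time.Sleep(100 * time.Millisecond) // let the first refresh land
    if team, ok := lookup.Get("checkout"); ok {
        fmt.Println("owner:", team)
    }
    close(stop)
}
```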


Runtime Logic Loop: Putting OpenTelemetry, Variables, Evals, and Injection to Work Together

The following diagram provides a reference architecture for how to put all these concepts together in a core logic loop that turns the OpenTelemetry collector into a smart telemetry hub for enterprises who want to move from passive observation to automated actions.

Architecture diagram

Options for integration:

  • Data science > scalars > vars loop
  • Alerting and eventing computed using Structured Streaming
    1. statistics (tm-kidding)
  • Evals computed using Structured Streaming:
    1. toptalkers
    2. leasttalkers
    3. recentAnomalous
    4. youngestnodes
    5. oldestnodes
    6. anomaliesbyDiskIO6sigma
    7. others TBD through partnership

Core Components of a Dynamic Observability System

To deliver a fully dynamic, smart telemetry hub, we rely on a modular architecture built around three key components: the Evaluator Core, the Event Processor, and the Compiler. Together, they transform static OpenTelemetry into a responsive, adaptable control plane for observability.

Evaluator Core: Continuous Rule Evaluation

The Evaluator Core is responsible for executing logical rules—either on a scheduled basis or in response to specific events. Its primary role is to monitor telemetry data and system state, evaluate conditions, and emit events that drive system behavior.

Key Capabilities:

  • Executes rules and triggers one or more resulting events
  • Supports multiple event types from a single rule evaluation
  • Allows different evaluation frequencies based on rule type and operational scope
  • Prevents conflicting or redundant events from being emitted
  • Enforces rate limits (min/max) for event types across defined scopes

Example Use Case: Updating a Dynamic Lookup Table

  • A webhook triggers a call to ServiceNow to fetch updated metadata
  • A State Update event modifies the dynamic lookup table if the team mapping has changed
  • A Recompile & Deploy event initiates a pipeline recompile if required (see the sketch below)
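
Here is a minimal sketch of that rule, with hypothetical Event and payload types: it compares freshly fetched metadata against current state and emits zero, one, or both event types, which also keeps redundant events from being emitted.

```go
package main

import "fmt"

// Event is what the Evaluator Core emits for the Event Processor to act on.
// The types mirror the ones described below: State Update and Recompile & Deploy.
type Event struct {
    Type    string // "state_update" or "recompile_deploy"
    Payload map[string]string
}

// evaluateTeamMapping is one rule: compare freshly fetched metadata (e.g. the
// response of a ServiceNow call triggered by a webhook, stubbed here) against
// current state and emit the events that follow from the difference.
func evaluateTeamMapping(current, fetched map[string]string) []Event {
    changed := false
    for svc, team := range fetched {
        if current[svc] != team {
            changed = true
            break
        }
    }
    if !changed {
        return nil // no redundant events
    }
    events := []Event{{Type: "state_update", Payload: fetched}}
    // Assume here that any change to routing-relevant keys requires a recompile.
    events = append(events, Event{Type: "recompile_deploy", Payload: map[string]string{"reason": "team mapping changed"}})
    return events
}

func main() {
    current := map[string]string{"checkout": "team-payments"}
    fetched := map[string]string{"checkout": "team-billing"} // stand-in for the ServiceNow response
    for _, ev := range evaluateTeamMapping(current, fetched) {
        fmt.Println("emit:", ev.Type, ev.Payload)
    }
}
```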

Event Processor: Orchestrating System Response

The Event Processor receives events from the Evaluator Core and other internal or external sources. It acts as a central orchestrator, coordinating execution and dispatching follow-on events to the appropriate systems, including external tools and platforms. A dispatch sketch follows the list of core event types below.

Core Event Types:

  • Recompile & Deploy (R&D)
    • Triggers configuration recompilation and redeployment
    • Adjusts parameterized settings using evaluated logic
    • Enables real-time routing changes and pipeline optimization
    • Creates adaptive feedback loops within the telemetry system
  • State Update
    • Recalculates dynamic variables and updates state stores
    • Refreshes lookup tables used across pipelines
    • Supports dynamic queries and statistical calculations
  • External Trigger
    • Initiates workflows in external systems (e.g., CI/CD, ITSM, cloud providers)
    • Abstracts integration logic via modular interfaces
    • Enables an extensible plugin-based observability architecture
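
Here is a minimal sketch of that dispatch model, with hypothetical handler registrations standing in for the Compiler, the state store, and the external integrations behind their modular interfaces.

```go
package main

import "fmt"

// Handler reacts to one event type; real handlers would call the Compiler,
// a state store, or an external system behind a modular interface.
type Handler func(payload map[string]string)

// EventProcessor routes incoming events to the registered handler for their type.
type EventProcessor struct {
    handlers map[string]Handler
}

func NewEventProcessor() *EventProcessor {
    return &EventProcessor{handlers: map[string]Handler{}}
}

func (p *EventProcessor) Register(eventType string, h Handler) {
    p.handlers[eventType] = h
}

func (p *EventProcessor) Dispatch(eventType string, payload map[string]string) {
    if h, ok := p.handlers[eventType]; ok {
        h(payload)
        return
    }
    fmt.Println("no handler registered for", eventType)
}

func main() {
    p := NewEventProcessor()
    p.Register("state_update", func(payload map[string]string) {
        fmt.Println("refreshing lookup tables:", payload)
    })
    p.Register("recompile_deploy", func(payload map[string]string) {
        fmt.Println("recompiling collector config:", payload["reason"])
    })
    p.Register("external_trigger", func(payload map[string]string) {
        fmt.Println("launching external workflow:", payload["target"])
    })

    p.Dispatch("recompile_deploy", map[string]string{"reason": "team mapping changed"})
    p.Dispatch("external_trigger", map[string]string{"target": "ci-cd"})
}
```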

Compiler: Generating Deployable OTEL Configurations

The Compiler translates high-level, dynamic configurations into valid OpenTelemetry Collector YAML files. It resolves all parameterized settings—both static and logical—into actionable, deployable configurations. This way we do not have to fork the OTEL code base to turn it into the smart telemetry hub we need.

Core Capabilities:

  • Continuously compiles OTEL configurations from master templates
  • Resolves variables and evaluation logic into concrete values
  • Differentiates between:
    • Variables: Static or external values (e.g., team mappings, thresholds)
    • Evals: Logical conditions evaluated at compile-time (can reference state, lookups, or user-defined variables)

To minimize operational overhead, recompilation is intentionally decoupled from evaluation and event processing. This separation ensures that only necessary changes trigger costly operations like redeployments.
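
As a sketch of the compile step, here is a tiny example that resolves a list of exclusion conditions into a fragment of filterProcessor YAML using Go's text/template. The master template fragment and the OTTL-style conditions are illustrative; a real master template would cover receivers, exporters, and full pipelines.

```go
package main

import (
    "fmt"
    "os"
    "text/template"
)

// masterTemplate is a hypothetical fragment of a collector config template.
const masterTemplate = `processors:
  filter/dynamic:
    logs:
      log_record:
{{- range .ExcludeConditions }}
        - '{{ . }}'
{{- end }}
`

// CompileInput carries resolved variables and eval results into the template.
type CompileInput struct {
    ExcludeConditions []string
}

func main() {
    // Values here would come from the state store and compile-time evals;
    // the conditions are illustrative OTTL-style expressions.
    input := CompileInput{
        ExcludeConditions: []string{
            `IsMatch(body, "heartbeat")`,
            `resource.attributes["service.name"] == "legacy-batch"`,
        },
    }
    tmpl := template.Must(template.New("otelcol").Parse(masterTemplate))
    if err := tmpl.Execute(os.Stdout, input); err != nil {
        fmt.Fprintln(os.Stderr, "compile failed:", err)
    }
}
```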


What's Next: Smarter, More Efficient Compilation

As this architecture matures, the focus shifts to optimizing the configuration compilation lifecycle. By refining the interactions between evaluators, events, and the compiler, we aim to reduce unnecessary pipeline churn and improve responsiveness without sacrificing stability.

As always, thoughts? Join us on Slack: the MyDecisive community Slack
Ari
