If you have ever been on-call, you know the absolute dread of the "noisy tenant." You are watching the dashboard as a single customer suddenly floods your system with too many connections, eating up your application resources. Before you can manually intervene, a single tenant's load causes a cluster-wide failover, turning a localized issue into an outage for thousands of customers.
After routinely living through this nightmare during his last two weeks on call, community contributor Ali Salehi decided to fix it.
Using the newly released mdai-labs, which builds on OpenTelemetry pipelines, Ali gave the community a dynamic, agentic workflow that not only prevents database failovers but also slashes observability costs in the process. Here is how he did it.
The Problem: Cascading Failures & Wasted Telemetry
Standard static thresholds fail during sudden spikes. If a tenant pushes massive database load, generating 500s and 429s (rate limiting), traditional monitoring just alerts you while the database goes down. And if you do manage to rate limit the tenant, your system starts generating thousands of 429 Too Many Requests traces. You end up paying massive ingestion fees to store 429 logs that drown out the real issues your tenants are facing. You already know you are rate limiting and which tenant is causing it. What you want to know is the impact.
The Solution: Dynamic Mitigation & Telemetry FinOps
Salehi (with an assist from Claude) used MyDecisive to wire telemetry signals directly into an active feedback loop, and he did it with a single-paragraph prompt and less than 10 minutes of his time. Here is the logic he built for the MDAI pipeline:
- Detect the Spike: MyDecisive monitors the DB error rate across four tenants. When errors spike, it isolates the noisy tenant by activating a rate-limit flag in the application's JDBC layer via a webhook plus an application API call.
- Agentic Rate Limiting: Instead of waking up an engineer, MyDecisive dynamically rate-limits the noisy tenant, returning 429s and immediately dropping the database load back to healthy levels. MyDecisive then tracks the frequency of 429s from the noisy tenant and can automatically remove the throttle when the tenant stops firing calls at the system.
- Smart Trace Sampling: Here is the brilliant FinOps part. The pipeline explicitly drops the trace sampling rate to near zero for the 429s that the rate limiter generates for that specific tenant (the example configuration below keeps a 1% sanity sample), saving massive ingestion costs.
- Preserve the Signal: Real application errors for the noisy tenant are still tracked, and trace sampling for healthy tenants is temporarily lowered during the incident to reduce noise while the team investigates.
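The detect, throttle, and release steps above can be sketched in a few lines of Python. This is only an illustration of the feedback loop, not MyDecisive's implementation: the `TenantWindow` and `ThrottleController` names, the 20% error-rate trigger, and the release threshold are all assumptions made up for the example. In the real setup the flag is flipped through a webhook and an application API call, which this sketch replaces with a plain in-memory set.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TenantWindow:
    """Counts for one tenant over one evaluation window (hypothetical shape)."""
    requests: int = 0
    db_errors: int = 0
    throttled: int = 0  # requests rejected with a 429 by the rate limiter


@dataclass
class ThrottleController:
    # Both thresholds are illustrative values, not MyDecisive defaults.
    error_rate_trigger: float = 0.20  # throttle when DB errors exceed 20% of requests
    release_below_429s: int = 5       # release when 429s per window fall below this
    throttled_tenants: Set[str] = field(default_factory=set)

    def evaluate(self, windows: Dict[str, TenantWindow]) -> Set[str]:
        """Update the throttled set from one window of per-tenant counts."""
        for tenant, w in windows.items():
            if tenant in self.throttled_tenants:
                # Already throttled: release once the tenant stops firing calls.
                if w.throttled < self.release_below_429s:
                    self.throttled_tenants.discard(tenant)
            elif w.requests and w.db_errors / w.requests > self.error_rate_trigger:
                # Error spike: flag this tenant for rate limiting.
                self.throttled_tenants.add(tenant)
        return set(self.throttled_tenants)


controller = ThrottleController()
spike = {
    "tenant-a": TenantWindow(requests=1000, db_errors=400),
    "tenant-b": TenantWindow(requests=100, db_errors=1),
}
print(controller.evaluate(spike))  # {'tenant-a'}
```

Releasing on the 429 count rather than on the database error rate matters here: once the throttle is active the database looks healthy again, so the error rate alone cannot tell you whether the tenant has actually backed off.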
The Code Snippet
Here is an example of what that dynamic, stateful policy looks like inside a MyDecisive pipeline configuration:
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # 0. THROTTLE-DROP. A noisy-tenant span that was rejected by our own
      # rate-limiter (throttled==true). These 429s are MDAI's mitigation
      # working as intended — no diagnostic value — so we keep ~1% as a
      # sanity sample and drop the other 99%. This policy is FIRST so it
      # wins over noisy-tenant-normal (which would otherwise keep 50% of
      # them). The 429 *rate* is still fully visible in metrics
      # (app_throttled_total / the Action-1 panel); we just stop paying
      # to store the redundant per-request traces.
      - name: noisy-tenant-throttled-drop
        type: and
        and:
          and_sub_policy:
            - name: is-noisy-tenant
              type: string_attribute
              string_attribute:
                values:
                  - ${env:NOISY_TENANT_LIST:-a^}
                key: tenant
                enabled_regex_matching: true
            - name: is-throttled
              type: ottl_condition
              ottl_condition:
                error_mode: ignore
                span:
                  - 'attributes["throttled"] == true'
            - name: keep-1-percent
              type: probabilistic
              probabilistic:
                sampling_percentage: 1
      # 1. KEEP EVERYTHING from a noisy tenant that hit a *genuine* DB error (5xx).
      # Deliberately does NOT match 429s — those were handled by policy 0
      # above. We match the injected-error attribute our /work handler sets,
      # OR a Status.ERROR that is not a throttle.
      - name: noisy-tenant-errors
        type: and
        and:
          and_sub_policy:
            - name: is-noisy-tenant
              type: string_attribute
              string_attribute:
                key: tenant
                # When NOISY_TENANT_LIST is unset/empty, expand to "a^" — a regex
                # that matches no string. Without this, the empty regex matches
                # everything and inverts the policy semantics.
                values:
                  - ${env:NOISY_TENANT_LIST:-a^}
                enabled_regex_matching: true
            # Genuine DB failures only. We match the injected-error attribute our
            # /work handler sets, OR a Status.ERROR that is NOT a throttle (the
            # second clause excludes 429 spans, which carry throttled==true).
            - name: is-genuine-error
              type: ottl_condition
              ottl_condition:
                error_mode: ignore
                span:
                  - 'attributes["error.injected"] == true'
                  - 'status.code == STATUS_CODE_ERROR and attributes["throttled"] != true'
      # 2. Sample 50% of the noisy tenant's non-error traffic (still want context).
      - name: noisy-tenant-normal
        type: and
        and:
          and_sub_policy:
            - name: is-noisy-tenant
              type: string_attribute
              string_attribute:
                key: tenant
                # When NOISY_TENANT_LIST is unset/empty, expand to "a^" — a regex
                # that matches no string. Without this, the empty regex matches
                # everything and inverts the policy semantics.
                values:
                  - ${env:NOISY_TENANT_LIST:-a^}
                enabled_regex_matching: true
            - name: probabilistic
              type: probabilistic
              probabilistic:
                sampling_percentage: 50
      # 3. Healthy tenants ride 2%. They were fine before, they're fine now — save the budget.
      - name: healthy-tenant-sample
        type: and
        and:
          and_sub_policy:
            - name: is-healthy-tenant
              type: string_attribute
              string_attribute:
                key: tenant
                values:
                  - ${env:HEALTHY_TENANT_LIST:-a^}
                enabled_regex_matching: true
            - name: probabilistic
              type: probabilistic
              probabilistic:
                sampling_percentage: 2
      # 4. Default catch-all for spans without a tenant attribute (e.g., infra).
      - name: default-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
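The `a^` fallback in the tenant matchers is easy to misread, so here is a small Python analogue of what the configuration comments describe. It is a sketch under assumptions: the Collector evaluates these patterns with Go's regular expression engine rather than Python's, the `tenant_pattern` and `matches_tenant` helpers are hypothetical names, matching is assumed to be unanchored, and the `^tenant-a$` value is only a guess at what a populated `NOISY_TENANT_LIST` might look like.

```python
import re
from typing import Dict


def tenant_pattern(env: Dict[str, str], var: str) -> str:
    """Mimic ${env:VAR:-a^}: use the variable if set and non-empty, else "a^"."""
    value = env.get(var, "")
    return value if value else "a^"


def matches_tenant(pattern: str, tenant: str) -> bool:
    """Unanchored regex match against the span's tenant attribute."""
    return re.search(pattern, tenant) is not None


tenants = ["tenant-a", "tenant-b", ""]

# No tenant flagged: "a^" needs an 'a' followed by start-of-string, which is
# impossible, so the policy matches nobody.
idle = tenant_pattern({}, "NOISY_TENANT_LIST")
print([matches_tenant(idle, t) for t in tenants])  # [False, False, False]

# Without the fallback, an empty pattern would match every tenant.
print([matches_tenant("", t) for t in tenants])    # [True, True, True]

# Tenant flagged (hypothetical value): only that tenant matches.
active = tenant_pattern({"NOISY_TENANT_LIST": "^tenant-a$"}, "NOISY_TENANT_LIST")
print([matches_tenant(active, t) for t in tenants])  # [True, False, False]
```

The practical consequence is that the noisy-tenant policies stay inert until something sets the variable, instead of silently treating every tenant as noisy.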
Full Pull Request can be found here
The Video Demo
Want to see it in action? Watch the 3-minute Loom walkthrough where Salehi spins up the load test, isolates the noisy tenant, and forces the trace drops in real time.
Why MyDecisive
This is where MyDecisive breaks the cycle. By allowing teams to deploy OpenTelemetry in minutes rather than weeks, MyDecisive drops a stateful, AI-driven proxy directly into your cloud environment. Instead of acting as a dumb pipe that blindly forwards every event to an expensive backend, it intercepts the telemetry stream at the edge. Because it maintains state internally and understands the context of the incident from the telemetry itself, MyDecisive instantly throttles the offending tenant while simultaneously filtering out the resulting storm of useless 429 traces. You stop the failover, kill the noise, and only pay to ingest the high-value errors you actually need for debugging and customer management.
Come Build With Us
This is exactly why we open-sourced the MyDecisive platform - to let engineers build the tools they actually wish they had at 3:00 AM on a Saturday.
We are actively looking for more engineers to test the limits of MyDecisive and MDAI Labs. Whether you want to build intelligent auto-rollbacks, dynamic sampling pipelines, or cost-deflection rules, we want your code.
Star the repo, check out CONTRIBUTING.md (we just added our CLA!), and take the platform for a spin.
See MyDecisive Solutions in action here.
Find us on GitHub.
