Finding the Unknown Unknowns: Why your PR review bots are failing

The Limits of Known Defenses
In our previous post, "Avoiding Integration Bugs: Beyond SDKs & Contract Testing," we explored the common defenses teams use to prevent services from breaking each other in a distributed architecture. We covered the strengths and weaknesses of Typed SDKs, Contract Testing, and full-blown Integration Testing. These tools are valuable, creating essential safety nets that catch known failure modes.
But what about the failures you can’t predict?
The most frustrating, expensive, and time-consuming bugs often don't come from a failure in one of these known patterns. They come from the "unknown unknowns"—the undocumented dependencies and forgotten consumers that no one on the current team even knows exist. This is a problem of architectural blindness, and it’s a blind spot that even the most sophisticated, AI-powered tools share. This post dives into that blind spot and shows how to eliminate it.
To see why this is so critical, let's walk through a scenario that plays out in engineering teams every single day.
A developer on your team submits a pull request. It’s a simple, well-reasoned change to a core service—let’s call it the ETA-Service in a logistics platform. The goal is to refactor an endpoint, renaming a field from estimated_arrival_in_minutes to a more precise estimated_arrival_utc. The change is clean. The unit tests are updated. The PR description is clear.
Within seconds, the automated checks pass. The code is merged.
Days later, a different team flags an issue in QA. A low-priority, internal dashboard used by the operations team to predict warehouse load is now completely broken. It turns out this dashboard was an undocumented consumer of the ETA-Service. The release is now blocked while your developer, who had already moved on, context-switches back to diagnose a problem in a service they’ve never touched. It’s a frustrating, expensive, and entirely avoidable delay.
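To make the failure mode concrete, here is a minimal sketch of the mismatch, assuming a JSON HTTP endpoint and an untyped consumer. The endpoint URL, response shapes, and dashboard code below are invented for illustration.

```typescript
// Hypothetical sketch of the rename and the consumer it breaks. The endpoint
// URL, response shapes, and dashboard code are invented for illustration.

// Response shape before the PR.
interface EtaResponseV1 {
  order_id: string;
  estimated_arrival_in_minutes: number;
}

// Response shape after the PR.
interface EtaResponseV2 {
  order_id: string;
  estimated_arrival_utc: string; // ISO-8601 timestamp
}

// The undocumented warehouse-load dashboard, still coded against V1.
// With an untyped fetch, the removed field comes back as undefined and the
// arithmetic below silently yields NaN: no compile error, no failing test.
async function warehouseLoadInMinutes(orderId: string): Promise<number> {
  const res = await fetch(`https://eta-service.internal/eta/${orderId}`);
  const body = (await res.json()) as EtaResponseV1; // stale assumption
  return body.estimated_arrival_in_minutes * 1.2;   // NaN once the field is renamed
}
```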
The Local Context Blind Spot
This scenario is common because the tools we rely on for safety—from simple linters to sophisticated AI-powered PR bots—operate with a fundamental limitation: local context. They are architecturally blind.
But what about modern AI agents? Can't they just read the whole monorepo and figure it out? The promise of agentic systems and Model Context Protocol (MCP) servers is that an AI can be given a toolbox and reason about how to solve a problem. This is a "pull" model: the AI agent must orchestrate a series of tool calls, pulling information from various sources to piece together an answer.
This approach fails for complex architectural analysis because of cognitive load and decision fatigue—for the AI. By requiring the agent to decide when and how to find and parse every potential dependency, you pollute its context and dramatically increase the likelihood of hallucination and error. The AI has to judge when to stop looking, and it can never be sure it has found everything. The best mitigations so far are structured task lists, as seen in Claude Code and, more recently, Cursor. Unfortunately, they only reduce the problem; they don't solve it.
The alternative is a "push" model. In this setup, a composite tool like Nimbus performs the complex, multi-step analysis automatically behind the scenes, traversing the entire system to build a complete picture. It then pushes a simple, deterministic, and context-rich result to the AI. This improves performance and reliability by reducing the AI's task from complex analysis to simple reasoning.
The pull model, by contrast, runs into two fundamental problems:
Completeness: An AI agent might find the most obvious consumers of an API by searching for its name. But did it find all of them? Did it check the repository for that legacy PHP service nobody has touched in three years? The probabilistic nature of LLMs means you can never be certain you have a complete picture. For architectural safety, "mostly correct" isn't good enough.
Synthesis: A large context window doesn't equate to understanding. An AI can find a "needle in a haystack" if you ask it a specific question, but it can't connect the dots of a distributed system on its own to see the full impact of a change. This only gets worse as the number of relevant pieces of information grows, despite improvements in the underlying technology. It's the difference between reading a dictionary and understanding a conversation. There's a reason "context engineering" has become such a common phrase lately.
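To make the contrast concrete, here is a rough sketch of the two models. Every type and method name in it is invented for illustration; none of this is a real agent-framework or Nimbus API.

```typescript
// Rough sketch of the two models. Every type and method name here is invented
// for illustration; none of this is a real agent-framework or Nimbus API.

const MAX_STEPS = 20;

interface ToolCall { kind: "tool" | "done"; name?: string; args?: unknown; review?: string }
interface ToolBox { list(): string[]; run(call: ToolCall): Promise<string> }
interface LLM {
  chooseNextTool(context: string, tools: string[]): Promise<ToolCall>;
  finalize(context: string): Promise<string>;
  review(context: string): Promise<string>;
}
interface SystemGraph { impactOf(diff: string): { consumer: string; field: string }[] }

// Pull model: the agent orchestrates open-ended tool calls and must judge
// when it has seen "enough"; completeness is never guaranteed.
async function pullModelReview(llm: LLM, tools: ToolBox, diff: string): Promise<string> {
  let context = `Review this diff:\n${diff}`;
  for (let step = 0; step < MAX_STEPS; step++) {
    const action = await llm.chooseNextTool(context, tools.list());
    if (action.kind === "done") return action.review ?? ""; // the model decides to stop
    context += "\n" + (await tools.run(action));            // grep, read file, list services, ...
  }
  return llm.finalize(context); // budget exhausted; coverage unknown
}

// Push model: a deterministic analysis runs first and hands the agent a
// complete, structured impact report; the LLM only reasons over it.
async function pushModelReview(llm: LLM, graph: SystemGraph, diff: string): Promise<string> {
  const impact = graph.impactOf(diff); // exhaustive traversal, no guessing
  return llm.review(`Diff:\n${diff}\n\nDownstream impact:\n${JSON.stringify(impact, null, 2)}`);
}
```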
The Solution: A Deterministic System Graph
True architectural safety requires a structured, queryable system graph—a reliable map of your entire architecture that can be queried for definitive answers.
Instead of asking an AI agent to read a million lines of code and hope for the best, you give it a new, powerful tool: access to a live, comprehensive graph of your architecture. This graph is built by continuously parsing the source code, dependency manifests, and infrastructure definitions across all your repositories, creating a rich, always-up-to-date model of how your system actually works.
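A minimal sketch of what such a graph's data model might look like, assuming nodes for services, endpoints, and fields with typed edges between them. The node kinds, edge kinds, and identifiers here are assumptions for illustration, not Nimbus's actual schema.

```typescript
// Minimal sketch of a possible data model for the graph. The node kinds,
// edge kinds, and identifiers are assumptions, not Nimbus's actual schema.

type NodeKind = "service" | "endpoint" | "field" | "topic" | "database";
type EdgeKind = "exposes" | "has_field" | "calls" | "reads_field" | "publishes" | "consumes";

interface GraphNode { id: string; kind: NodeKind; repo: string }
interface GraphEdge { from: string; to: string; kind: EdgeKind }

interface SystemGraph {
  nodes: Map<string, GraphNode>;
  edges: GraphEdge[];
}

// Populated continuously by parsers walking source code, dependency manifests,
// and infrastructure definitions across every repository.
const graph: SystemGraph = {
  nodes: new Map<string, GraphNode>([
    ["ETA-Service", { id: "ETA-Service", kind: "service", repo: "logistics/eta" }],
    ["ETA-Service./eta", { id: "ETA-Service./eta", kind: "endpoint", repo: "logistics/eta" }],
    ["ETA-Service./eta.estimated_arrival_in_minutes",
      { id: "ETA-Service./eta.estimated_arrival_in_minutes", kind: "field", repo: "logistics/eta" }],
    ["Warehouse-Load-Dashboard", { id: "Warehouse-Load-Dashboard", kind: "service", repo: "ops/warehouse-dashboard" }],
  ]),
  edges: [
    { from: "ETA-Service", to: "ETA-Service./eta", kind: "exposes" },
    { from: "ETA-Service./eta", to: "ETA-Service./eta.estimated_arrival_in_minutes", kind: "has_field" },
    { from: "Warehouse-Load-Dashboard", to: "ETA-Service./eta", kind: "calls" },
    { from: "Warehouse-Load-Dashboard", to: "ETA-Service./eta.estimated_arrival_in_minutes", kind: "reads_field" },
  ],
};
```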
When a PR changes an API, an intelligent agent armed with this tool doesn't just guess at the impact. It asks a precise question: "Traversing the system graph, what is the complete set of downstream consumers for this specific ETA-Service endpoint, and do any of them reference the estimated_arrival_in_minutes field?"
The answer comes back in milliseconds, and it is deterministic, not probabilistic.
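A sketch of how such a query could be answered with a plain, exhaustive walk over the edges rather than a probabilistic search. The edge shape and names mirror the hypothetical data model sketched above and are illustrative only.

```typescript
// Sketch of the query itself: a plain, exhaustive filter over the edges rather
// than a probabilistic search. Edge shape and names are illustrative only.

interface GraphEdge { from: string; to: string; kind: string }

function downstreamFieldReaders(edges: GraphEdge[], endpoint: string, field: string): string[] {
  // Every service that calls the endpoint...
  const callers = edges.filter(e => e.kind === "calls" && e.to === endpoint).map(e => e.from);
  // ...and, of those, every one that references the specific field.
  return callers.filter(caller =>
    edges.some(e => e.kind === "reads_field" && e.from === caller && e.to === field)
  );
}

// Hypothetical edges: which consumers break if the field is renamed?
const edges: GraphEdge[] = [
  { from: "Driver-App-Gateway", to: "ETA-Service./eta", kind: "calls" },
  { from: "Driver-App-Gateway", to: "ETA-Service./eta.estimated_arrival_in_minutes", kind: "reads_field" },
  { from: "Warehouse-Load-Dashboard", to: "ETA-Service./eta", kind: "calls" },
  { from: "Warehouse-Load-Dashboard", to: "ETA-Service./eta.estimated_arrival_in_minutes", kind: "reads_field" },
];

console.log(downstreamFieldReaders(edges, "ETA-Service./eta", "ETA-Service./eta.estimated_arrival_in_minutes"));
// -> ["Driver-App-Gateway", "Warehouse-Load-Dashboard"], the same answer on every run
```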
Industry Context: Anatomy of a Cascading Failure
The challenge of "unknown unknowns" is a well-documented growing pain for any company scaling with microservices. In their post-mortem of the February 22, 2022 outage, Slack's engineering team described a "textbook example of complex systems failure." The story is a masterclass in how a small, routine change can cause a massive, unexpected ripple effect.
Here’s a simplified breakdown of what happened:
1. A Routine Restart: A maintenance task restarted a health-checker service (Consul) on some servers.
2. An Efficient Helper Backfires: A manager service (Mcrib) saw the restarts. Trying to be helpful, it immediately refreshed the system's short-term memory (Memcached), causing it to be temporarily empty.
3. A Hidden Flaw Is Exposed: With the cache empty, the main Slack application was forced to ask the main database for everything. Crucially, it did this using a highly inefficient type of query (a "scatter query") that was normally harmless because it was so rarely used.
4. The System Overloads: The database was flooded with these inefficient queries. As the Slack team wrote, "The database was overwhelmed as the read load increased superlinearly... which meant that the cache could not be filled." This created a vicious feedback loop that brought the entire system down.
No developer reviewing the routine maintenance change could have reasonably been expected to trace this complex chain of events. The problem wasn't a single mistake, but a series of interconnected behaviors. By mapping these services and their interactions, we can start to draw conclusions about these hidden risks.
Here's how an automated analysis engine like Nimbus would use this graph to identify the blast radius of a change. When a PR proposes modifying the Consul Agent restart process, the analysis doesn't just see one change; it simulates the chain of effects:
1. Initial Impact: It identifies that the change to Consul (A) will trigger a high-frequency Writes New action in Mcrib (B).
2. Immediate Side Effect: This action updates the Mcrouter Config (C), which causes Mcrouter (D) to Flush the Cache (F) more often. The system models a significant drop in the cache hit rate.
3. Tracing the Ripple Effect: The analysis engine follows the dependencies from the Slack Web App (E). It sees two paths: a normal Cache Hit path and a Cache Miss path.
4. Surfacing Latent Risk: It analyzes the Cache Miss path to the Database (G) and finds its properties: Query Type: Scatter, Cost: High. Because the model predicts a sharp drop in cache hits, it concludes that this high-cost, normally rare path will now be executed at a frequency far outside its normal operating parameters.
The system flags that the seemingly innocent change to the health-checker has a high probability of causing a database overload. The "unknown unknown" becomes a known, quantifiable risk, allowing the team to mitigate it before it takes down production.
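To make that last step concrete, here is a simplified sketch of the kind of check an analysis engine could run over the modeled paths. The request rates, costs, thresholds, and predicted cache hit rates are invented numbers for illustration, not figures from Slack's incident report.

```typescript
// Simplified sketch of the latent-risk check in step 4. All numbers here are
// invented for illustration, not figures from Slack's incident report.

interface PathProps { queryType: "point" | "scatter"; costPerCall: number }

// Properties the graph stores for the Slack Web App -> Database cache-miss path.
const cacheMissPath: PathProps = { queryType: "scatter", costPerCall: 50 };

const requestsPerSec = 1_000;
const baselineHitRate = 0.99;  // normal operation: the miss path is rarely taken
const predictedHitRate = 0.6;  // modeled after frequent cache flushes

// Traffic on the miss path scales with (1 - hit rate).
const baselineLoad = requestsPerSec * (1 - baselineHitRate) * cacheMissPath.costPerCall;   // ~500
const predictedLoad = requestsPerSec * (1 - predictedHitRate) * cacheMissPath.costPerCall; // ~20,000

// Flag the change when a high-cost, normally rare path is pushed far outside
// its usual operating range.
if (cacheMissPath.queryType === "scatter" && predictedLoad > 10 * baselineLoad) {
  console.warn(
    `Latent risk: cache-miss load predicted to rise from ${Math.round(baselineLoad)} to ` +
      `${Math.round(predictedLoad)} cost-units/sec (hit rate ${baselineHitRate} -> ${predictedHitRate}).`
  );
}
```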
From Blind Spots to Blueprints
Relying on developers—or even AI agents—to manually discover architectural context during a PR review is fundamentally flawed. It's slow, unreliable, and invites the exact kind of "unknown unknown" that causes the most painful outages.
When you shift from a "pull" model (hope the reviewer finds the problem) to a "push" model (the system tells the reviewer the impact), the entire dynamic of the code review changes.
The Nimbus bot posting a comment in the PR is the first step:
⚠️ API Impact Warning: This change to ETA-Service will break 3 downstream consumers: Driver-App-Gateway, Customer-Notification-Service, and the internal Warehouse-Load-Dashboard. This change is not safe to merge.
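A small sketch of how a comment like that could be rendered from the deterministic impact result. The result shape and function are hypothetical; the message format simply mirrors the example above.

```typescript
// Sketch of rendering the PR comment from the deterministic query result.
// The result shape and function are hypothetical.

interface ImpactResult { service: string; brokenConsumers: string[] }

function renderImpactComment(impact: ImpactResult): string {
  if (impact.brokenConsumers.length === 0) {
    return `✅ No downstream consumers are affected by this change to ${impact.service}.`;
  }
  return (
    `⚠️ API Impact Warning: This change to ${impact.service} will break ` +
    `${impact.brokenConsumers.length} downstream consumers: ${impact.brokenConsumers.join(", ")}. ` +
    `This change is not safe to merge.`
  );
}

console.log(renderImpactComment({
  service: "ETA-Service",
  brokenConsumers: ["Driver-App-Gateway", "Customer-Notification-Service", "Warehouse-Load-Dashboard"],
}));
```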
This level of automated analysis transforms the pull request. It shifts the focus from a manual, often stressful hunt for potential risks to a straightforward verification of the change against its known architectural impacts. The long-term goal is to use this same analysis to empower an AI agent to not only identify these breakages but to automatically generate the PRs that fix them, turning a week of cross-team coordination into a five-minute task. This creates a foundation for safer, more efficient development cycles.