Skip to content

Decision Automation Needs Three Layers, Not One

Rules, ML signals, and LLM reasoning each have different jobs. Treating them as one layer creates brittle systems.

8 min
Decision Automation Needs Three Layers, Not One

Framework at a glance

How to read the model and what each layer is doing.

Rules, ML signals, and LLM reasoning each have different jobs. Treating them as one layer creates brittle systems.

Watch the video

Decision Automation Needs Three Layers, Not One

0:00

# Decision Automation Needs Three Layers, Not One

A lot of AI decisioning discussions collapse three very different mechanisms into one bucket. That is how teams end up asking a language model to do work that should have stayed in a rule engine or a scoring model.

I keep seeing this pattern everywhere. A team gets excited about GPT-4 or Claude, throws every decision problem at it, then wonders why their inference costs are through the roof and their system takes 30 seconds to approve a simple transaction. In practice that most decision automation needs three distinct layers, each optimized for different types of problems.

Three layers, three jobs

The first layer is deterministic logic. If the condition is explicit, high-volume, and low-ambiguity, rules are still the best answer. They are explainable, fast, and cheap.

Three Layer Decision Architecture

I'm talking about decisions like "Is this customer in California? Then apply sales tax rate X." Or "Has this account been flagged for review? Route to manual queue." These aren't AI problems. They're if-then-else problems. Tools like Drools, AWS Step Functions, or even hardcoded business logic handle millions of these decisions per second at negligible cost.

The second layer is probabilistic scoring. When the problem needs ranking, confidence, or prediction, ML belongs there. It turns noisy signals into usable probabilities.

Think fraud scoring, lead qualification, or recommendation ranking. You're not asking "What is the rule?" You're asking "What is the likelihood?" Traditional ML models (XGBoost, neural networks, logistic regression) excel here. They're orders of magnitude cheaper than LLMs and give you actual probability distributions you can threshold and monitor.

The third layer is LLM reasoning. This is the layer for ambiguity, synthesis, natural-language interpretation, and cases where the system has to explain itself in human terms.

This is where you actually need Claude or GPT-4. Complex contract review. Customer complaint categorization that goes beyond simple keywords. Generating human-readable explanations for why a loan was denied. These are problems where the input is messy, the logic is nuanced, and the output needs to be conversational.

The routing problem

The real architecture question is routing. Which cases stop at rules? Which escalate to scoring? Which require language reasoning? What confidence threshold decides the handoff?

Cost Comparison Matrix

I've seen teams build this wrong so many times. They either create a monolithic LLM that tries to handle everything (expensive and slow) or they build three separate systems with manual handoffs (operational nightmare). The key is building an intelligent router that knows which tool to use for which decision.

Here's what I typically recommend: Start with a lightweight classifier that can quickly categorize incoming decisions. This doesn't need to be an LLM. A simple embedding model like Sentence-BERT can map requests to decision types. Based on that classification, route to the appropriate layer.

Real architectures I've seen work

Let me walk through a concrete example from a fintech client. They process loan applications with three distinct decision points:

  1. Basic eligibility rules (credit score thresholds, income requirements, geographic restrictions)
  2. Risk scoring (probability of default based on 200+ features)
  3. Exception handling (explaining denials, handling appeals, processing non-standard documentation)

Their architecture looks like this:

ComponentTechnologyLatencyCost per DecisionVolume
Rule EngineAWS Lambda + DynamoDB rules<50ms$0.0000195% of applications
ML ScoringSageMaker XGBoost endpoint<200ms$0.000140% of applications
LLM ReasoningAnthropic Claude via Bedrock2-5s$0.025% of applications

Notice how the volumes funnel down. Most decisions never need expensive AI. The rule engine handles the obvious approvals and denials. ML scoring catches the middle ground. Only the truly complex cases hit the LLM.

Another pattern I like comes from an e-commerce platform doing customer service automation. They use LangGraph to orchestrate the flow between layers. Simple queries ("Where's my order?") hit a rule-based lookup. Sentiment analysis and priority scoring use a fine-tuned BERT model. Only complex complaints or special requests invoke GPT-4 for response generation.

The cost reality nobody talks about

Let's do the math on why this matters. Say you're processing 1 million decisions per month:

All-LLM approach:

  • 1M decisions × $0.02 per call = $20,000/month
  • Average latency: 3 seconds
  • Governance nightmare (how do you audit GPT-4's reasoning?)

Three-layer approach:

  • 950K rule decisions × $0.00001 = $9.50
  • 45K ML decisions × $0.0001 = $4.50
  • 5K LLM decisions × $0.02 = $100
  • Total: $114/month
  • Average latency: 150ms
  • Clear audit trail for each layer

That's a 175x cost reduction. I've seen teams burn through their entire AI budget in weeks because they didn't think about this architecture.

Implementation patterns that scale

The key to making this work is choosing the right orchestration pattern. I see three main approaches in production:

Sequential filtering: Start with rules, then ML, then LLM. Each layer can handle the decision or pass it up. This works well when you have clear escalation criteria. Uber uses something similar for ride pricing, where surge calculations start with basic supply/demand rules before invoking more complex models.

Parallel consultation: Hit all three layers simultaneously and combine results. This is faster but more expensive. I've seen it work for high-value decisions like large B2B quotes where latency matters more than cost.

Confidence-based routing: Use the ML layer to assign confidence scores, then route based on thresholds. Low confidence goes to LLM for reasoning. High confidence stays in the ML layer. This is what most fraud systems do today.

For orchestration, I'm seeing more teams adopt agentic frameworks. CrewAI and AutoGen work well for complex multi-step decisions. For simpler flows, Step Functions or Temporal give you better control and observability.

Common failure modes

I want to call out the mistakes I see repeatedly:

Using LLMs for structured data extraction. I watched a team use GPT-4 to extract prices from invoices. They were paying $0.02 per invoice for work that a regex or simple OCR + rules could handle for effectively free.

Ignoring confidence calibration. Your ML models might output probabilities, but are they calibrated? If your model says 90% confidence, does that actually mean 9 out of 10 predictions are correct? Most teams skip this step and end up routing too many decisions to expensive LLM reasoning.

Building without fallbacks. What happens when Claude is down? When your ML endpoint is overloaded? The best architectures gracefully degrade. High-priority decisions might skip straight to LLM. Low-priority ones might wait in a queue.

Over-engineering the router. I've seen teams build complex meta-models to decide which model to use. Keep it simple. Start with heuristics. You can always add intelligence later.

What this means for your architecture decisions

If you're building decision automation today, here's my advice:

First, map your decisions. Actually sit down and categorize them. What percentage are simple rules? What percentage need probabilistic scoring? What percentage require natural language reasoning? If you don't know these numbers, you're guessing at architecture.

Second, instrument everything. You need to know the cost, latency, and accuracy of each layer. I use Datadog for performance monitoring and Arize for ML observability. Without this data, you can't optimize the routing logic.

Third, plan for evolution. Start simple. Maybe everything goes through rules first, and you manually review exceptions. As volumes grow, add the ML layer. As complexity increases, add LLM reasoning. Don't build all three layers on day one.

Fourth, consider your team's skills. Rules engines need business analysts. ML models need data scientists. LLM systems need prompt engineers. Make sure you have the right people for each layer.

The vendors are starting to catch up to this pattern. AWS Bedrock now has built-in routing between models. Google's Vertex AI has similar capabilities. But I still think you get better control building your own routing logic.

For most teams, I recommend starting with a simple Lambda function that implements routing logic. As you scale, consider moving to a proper orchestration framework. LangGraph is my current favorite for LLM-heavy workflows. Temporal works better for mixed workloads.

The teams that get this right save millions in inference costs while actually improving decision quality. The ones that don't end up with expensive, slow systems that nobody trusts. The choice is pretty clear to me.