February 2026 · 12 min read

Building a Scalable ML Pipeline That Scores Hundreds of Sales Calls Daily

How we designed and shipped a production ML system on AWS for a sales intelligence client — and how we'd rebuild it today with LLMs on Modal and fine-tuned ColBERT for retrieval that actually understands sales conversations.

The Problem

A sales intelligence platform needed to score sales calls at scale. Their clients' sales teams were making hundreds of calls daily, and each call needed to be transcribed, analyzed for key moments (objection handling, discovery questions, closing techniques), and scored against a methodology framework. Doing this manually was impossible. Doing it with basic keyword matching was inaccurate.

The platform needed an ML pipeline that could:

  - Ingest hundreds of completed calls per day without ever losing one.
  - Find the key moments in each transcript: objection handling, discovery questions, closing techniques.
  - Score every call against the client's sales methodology framework.
  - Absorb volume spikes and recover from failures automatically.
  - Flag uncertain results for human review instead of silently publishing bad scores.

The Architecture We Built

[Figure: Production ML pipeline architecture — AWS-native, event-driven, built to scale.]

We designed an event-driven, fully serverless architecture on AWS that decouples every stage of the pipeline. Here's how it works end-to-end:

1. Ingestion Layer

When a sales call is completed, the backend sends an API request to API Gateway, which triggers a Lambda function. This Lambda drops a message into SQS (Simple Queue Service) — our primary decoupling mechanism. Using SQS means we never lose a call even if downstream processing is temporarily overwhelmed. The queue absorbs spikes and feeds processing at a sustainable rate.
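As a rough sketch (the queue URL, field names, and payload shape are illustrative, not the client's actual schema), the ingestion Lambda can be this small:

```python
# Sketch of the ingestion Lambda: accept the "call completed" request and
# enqueue a pointer to the call. Names here are illustrative.
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["CALLS_QUEUE_URL"]  # assumed env var


def handler(event, context):
    """Triggered by API Gateway when the backend reports a finished call."""
    body = json.loads(event["body"])

    # Enqueue a pointer to the call rather than the transcript itself, so the
    # message stays well under SQS's 256 KB limit.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(
            {"call_id": body["call_id"], "transcript_url": body["transcript_url"]}
        ),
    )
    return {"statusCode": 202, "body": json.dumps({"status": "queued"})}
```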

2. Transcript Processing

A second Lambda picks messages off the queue and stages raw transcripts in S3. This staging layer is important — it gives us a durable record of every input and makes reprocessing trivial if we update our models.

Another SQS queue feeds the transcripts to the ML compute layer. We deliberately used multiple queues rather than one monolithic queue, because different stages have different throughput characteristics and retry needs.
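Sketched the same way (bucket, queue, and key layout are assumptions), the staging Lambda reads from the first queue, writes to S3, and hands off to the second queue:

```python
# Sketch of the staging Lambda. Bucket, queue, and key names are illustrative.
import json
import os
from urllib.request import urlopen

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = os.environ["TRANSCRIPTS_BUCKET"]            # assumed env var
SCORING_QUEUE_URL = os.environ["SCORING_QUEUE_URL"]  # assumed env var


def fetch_transcript(url: str) -> str:
    # Stand-in for pulling the transcript from the transcription provider.
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def handler(event, context):
    """Triggered by the ingestion queue; stages transcripts, then hands off to scoring."""
    for record in event["Records"]:  # SQS event source mappings deliver batches
        msg = json.loads(record["body"])
        key = f"raw-transcripts/{msg['call_id']}.txt"

        # Durable staging: every raw input lands in S3 before any ML runs,
        # which is what makes later reprocessing trivial.
        s3.put_object(Bucket=BUCKET, Key=key, Body=fetch_transcript(msg["transcript_url"]))

        # Feed the next stage through its own queue.
        sqs.send_message(
            QueueUrl=SCORING_QUEUE_URL,
            MessageBody=json.dumps({"call_id": msg["call_id"], "s3_key": key}),
        )
```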

3. ML Compute

The heavy lifting happens on AWS Batch backed by EC2 compute instances. We chose Batch over Lambda for ML inference because:

  - Lambda's 15-minute execution limit is too tight for heavyweight inference on long transcripts.
  - Model weights and ML dependencies are awkward to squeeze into Lambda's packaging and memory limits.
  - Batch lets us pick the right EC2 instance types for the models and handles job queues, retries, and scaling for us.

Custom tracking components monitor the state of every file through the pipeline. This was essential for reliability: we always know exactly where every call is in the process, and we can detect and retry failures automatically.
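We won't reproduce the custom tracker here, but one common shape for it is a per-call status row, for example in DynamoDB (the table name, stage names, and fields below are assumptions):

```python
# Sketch of per-call pipeline state tracking, assuming a DynamoDB table keyed
# by call_id. Table name, stage names, and attributes are illustrative.
import datetime
import os

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("PIPELINE_STATE_TABLE", "pipeline-state"))


def mark_stage(call_id: str, stage: str, status: str = "done") -> None:
    """Record that a call reached a stage (queued, staged, scored, published, ...)."""
    table.update_item(
        Key={"call_id": call_id},
        UpdateExpression="SET #stage = :status, updated_at = :ts",
        ExpressionAttributeNames={"#stage": stage},
        ExpressionAttributeValues={
            ":status": status,
            ":ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
    )
```

A watchdog job can then scan for calls stuck in one stage past a timeout and re-drive them through the relevant queue.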

4. Output Routing

After scoring, the pipeline splits into two paths:

  - High-confidence results flow straight through to the database and dashboard that sales managers use.
  - Low-confidence or unusual results are flagged for human review before they are published.

The human-in-the-loop path was critical. No ML model is 100% accurate, and for a product where managers make coaching decisions based on scores, we needed a safety valve. Flagged calls get reviewed, and those corrections feed back into training data — making the model better over time.
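A minimal sketch of that split, with an assumed confidence cutoff (the real threshold and downstream calls are specific to the client):

```python
# Illustrative routing rule; the 0.8 cutoff and field names are assumptions.
from dataclasses import dataclass


@dataclass
class ScoredCall:
    call_id: str
    scores: dict        # methodology dimension -> score
    confidence: float   # model's confidence in its own scoring


def route(call: ScoredCall, threshold: float = 0.8) -> str:
    """Publish confident results; hold the rest for human review. Reviewer
    corrections are stored next to the model's scores so they can become
    training data later."""
    return "publish" if call.confidence >= threshold else "review"
```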

5. Container Workloads

AWS Fargate handles longer-running containerized tasks — batch reprocessing, model evaluation runs, and periodic retraining jobs. Fargate gave us the flexibility of containers without managing EC2 instances for these less predictable workloads.

What Worked Well

  - Queue-based decoupling absorbed spikes and kept failures isolated to a single stage.
  - Staging every transcript in S3 made reprocessing with updated models trivial.
  - Per-call state tracking meant failures were detected and retried automatically instead of calls silently disappearing.
  - The human-in-the-loop path caught model mistakes before they reached managers, and the corrections became a steady source of training data.

How We'd Rebuild It Today: LLMs + Fine-Tuned Retrieval

The architecture above was built in 2023-2024. It works. It's in production. But if we were starting today, two things would change dramatically: the scoring model and the retrieval layer.

Replacing Custom ML Scoring with LLMs on Modal

The original pipeline used custom-trained classification models to score calls against sales methodology dimensions. Training these models required significant labeled data, iteration cycles, and ongoing maintenance. Every time a new scoring dimension was needed, it meant collecting labels and retraining.

Today, we'd use an LLM — specifically Llama 3.1 8B — for the scoring task. Why?

  - No labeled training data: a prompt describing the scoring rubric replaces weeks of labeling and retraining.
  - New scoring dimensions become a prompt change rather than a new model.
  - An 8B model is small enough to self-host economically, so client transcripts never leave our infrastructure.
  - It returns structured scores that slot into the existing pipeline unchanged.

Why Modal for LLM Hosting?

We host our LLMs on Modal rather than AWS Bedrock or self-managed GPU instances. Here's the math: a dedicated g5.xlarge on AWS costs ~$730/month whether you're using it or not. Modal charges per second of actual GPU compute. For a pipeline that processes calls in batches (not 24/7 real-time), this means we pay for ~3-4 hours of GPU time per day instead of 24. That's roughly $150-200/month vs $730. Modal also handles scaling to zero, cold starts in seconds, and version deployments with a single command. For batch workloads with variable volume, it's a no-brainer.

The integration is clean. The SQS + Lambda architecture stays the same — instead of calling AWS Batch for ML inference, the Lambda calls a Modal endpoint. The LLM processes the transcript, returns structured scores, and everything downstream is unchanged. Same S3 staging, same dashboard, same human-in-the-loop path.
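Here's a minimal sketch of what that Modal function could look like, assuming Llama 3.1 8B served via Hugging Face transformers; the app name, prompt, and output parsing are illustrative, and a production version would keep the model warm (for example with a Modal class) and validate the JSON it gets back:

```python
# Sketch of the scoring service on Modal. Names, prompt, and model loading are
# simplified; the real service keeps the model warm and validates its output.
import json

import modal

app = modal.App("call-scoring")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

SCORING_PROMPT = """You are scoring a sales call against a methodology framework.
For each dimension (discovery, objection_handling, closing), return JSON with a
1-5 score and a one-sentence justification. Return only JSON.

Transcript:
{transcript}
"""


@app.function(image=image, gpu="A10G", timeout=600)
def score_call(transcript: str) -> dict:
    """Run Llama 3.1 8B over a transcript and return structured scores."""
    from transformers import pipeline  # imported here so it loads on the GPU container

    llm = pipeline(
        "text-generation",
        model="meta-llama/Llama-3.1-8B-Instruct",  # gated model; assumes HF access is set up
        device_map="auto",
    )
    out = llm(
        SCORING_PROMPT.format(transcript=transcript),
        max_new_tokens=512,
        return_full_text=False,
    )
    # Assumes the model returns valid JSON; production code validates and retries.
    return json.loads(out[0]["generated_text"])
```

Deployed with `modal deploy`, this function can be exposed as a web endpoint that the existing Lambda calls with the staged transcript; during development, `score_call.remote(transcript)` runs the same code from a local script.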

Replacing OpenAI Embeddings with Fine-Tuned ColBERT

The retrieval layer is where things get interesting. When scoring a call, the model needs to find relevant moments — the exact section where the rep handled an objection, or where they asked a discovery question. In the original system, we used basic text search and heuristics. In a modern RAG setup, most people reach for OpenAI's embedding API.

We use neither. We use fine-tuned ColBERTv2, a late interaction retrieval model. Here's why it matters:

The Problem with Standard Embeddings

Dense embedding models (OpenAI text-embedding-3-small, BGE, MiniLM) compress an entire passage into a single vector. This works well for general semantic search, but it crushes nuance. In sales conversations, the difference between a good discovery question and a mediocre one might be a single word or phrase. A single-vector embedding averages that signal away.

How ColBERT Is Different

ColBERT generates a vector per token, not per passage. When you search, it does fine-grained token-level matching between the query and every document. Think of it as the model checking "does this specific word in my query match this specific word in the document?" across all combinations, then summing the best matches.
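That "sum of best matches" is ColBERT's MaxSim operation, which is simple enough to show with toy vectors standing in for real token embeddings:

```python
# Toy illustration of ColBERT's late-interaction (MaxSim) scoring.
import numpy as np


def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """For each query token, take its best-matching document token (max
    similarity), then sum those maxima across the query."""
    similarity = query_vecs @ doc_vecs.T           # (query_tokens, doc_tokens)
    return float(similarity.max(axis=1).sum())     # best doc token per query token


# Random vectors just to show the shapes; real ColBERTv2 embeddings are
# 128-dimensional and L2-normalized, so the dot product acts as a cosine similarity.
rng = np.random.default_rng(0)
query = rng.standard_normal((6, 128))    # e.g. "how did the rep handle the pricing objection"
chunk = rng.standard_normal((300, 128))  # one transcript chunk, one vector per token
print(maxsim_score(query, chunk))
```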

For sales call analysis, this means:

  - A query like "rep handles a pricing objection" can match the exact sentence where it happens, not just a chunk that is vaguely about pricing.
  - The small wording differences that separate a strong discovery question from a weak one still carry signal instead of being averaged away.
  - Retrieval surfaces the precise moment to score, rather than a loosely related stretch of the call.

Fine-Tuning Makes It Domain-Specific

Out-of-the-box ColBERTv2 is trained on general web data. Fine-tuning it on sales conversation pairs — "this query should match this transcript segment" — makes it dramatically better at finding the right moments in calls. We create training pairs from the human-reviewed calls (the human-in-the-loop data that was already being collected), so the model improves continuously with zero additional labeling effort.
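A sketch of that data step, assuming reviewed calls are stored as tagged segments (the record layout is hypothetical; the output is a JSONL triples file, one common input format for ColBERT-style trainers):

```python
# Turn human-reviewed calls into (query, positive, negative) training triples.
# The input record layout is an assumption about how review data is stored.
import json
import random


def build_triples(reviewed_calls: list[dict], out_path: str) -> None:
    """Each reviewer tag, e.g. {"query": "discovery question about budget",
    "segment": "<the transcript segment the reviewer marked>"}, becomes one triple."""
    all_segments = [seg for call in reviewed_calls for seg in call["segments"]]
    with open(out_path, "w") as f:
        for call in reviewed_calls:
            for tag in call["reviewer_tags"]:
                positive = tag["segment"]
                # Random negative from the corpus; mining hard negatives works
                # better in practice, but this keeps the sketch short.
                negative = random.choice([s for s in all_segments if s != positive])
                f.write(json.dumps([tag["query"], positive, negative]) + "\n")
```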

| Retrieval Method | Approach | Strength | Weakness |
| --- | --- | --- | --- |
| BM25 (keyword) | Term-frequency matching | Fast, no ML needed | Misses semantic meaning entirely |
| OpenAI Embeddings | Single vector per passage | Good general semantic search | Compresses away token-level nuance, API dependency, per-call cost |
| BGE / MiniLM | Single vector, self-hosted | No API cost, decent quality | Same compression problem as OpenAI |
| Fine-Tuned ColBERTv2 | Multi-vector, token-level matching | Precise retrieval, domain-adapted, self-hosted, no API cost | Higher storage, slightly more compute |

The storage trade-off is real — ColBERT stores 128-dimensional vectors per token instead of one vector per chunk. But for a corpus of sales call transcripts (not billions of web pages), the storage cost is negligible. We're talking gigabytes, not terabytes.
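To put rough numbers on that (the call volume and transcript length here are assumptions, not the client's actual traffic):

```python
# Back-of-envelope index size for ColBERT over a sales-call corpus.
calls_per_day = 300        # assumed volume
tokens_per_call = 6_000    # assumed transcript length

# Uncompressed: 128 dims x 2 bytes (fp16) per token embedding.
raw_gb_per_year = tokens_per_call * 128 * 2 * calls_per_day * 365 / 1e9
print(f"uncompressed: ~{raw_gb_per_year:.0f} GB/year")   # on the order of 170 GB

# ColBERTv2's residual compression stores roughly 20-40 bytes per token
# embedding, which lands in the tens of GB per year.
compressed_gb_per_year = tokens_per_call * 30 * calls_per_day * 365 / 1e9
print(f"compressed:   ~{compressed_gb_per_year:.0f} GB/year")  # on the order of 20 GB
```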

The Modern Stack

If we were rebuilding this pipeline today, the architecture diagram would look simpler, not more complex:

  - API Gateway, Lambda, SQS, and the S3 staging layer stay exactly as they are.
  - AWS Batch and the custom classification models are replaced by a single Modal endpoint serving Llama 3.1 8B.
  - Keyword search and heuristics are replaced by a fine-tuned ColBERTv2 index over the transcript corpus.
  - The dashboard and the human-in-the-loop review path are unchanged, and the review data now feeds both prompt iteration and ColBERT fine-tuning.

The biggest win? Time to add a new scoring dimension drops from weeks to hours. Write a prompt describing the new dimension, test it against a few calls, deploy. No labeling, no retraining, no waiting.
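As a hypothetical example, adding a "multi-threading" dimension (did the rep engage more than one stakeholder?) would be a prompt edit rather than a training project:

```python
# Hypothetical new scoring dimension: no labels, no retraining, just a prompt.
MULTI_THREADING_PROMPT = """Score this call from 1 to 5 on multi-threading:
did the rep identify and engage more than one stakeholder in the buying process?
Return JSON with keys "score" (1-5) and "evidence" (a short quote from the call).

Transcript:
{transcript}
"""
# Test against a handful of already-reviewed calls, compare with what a manager
# would say, then ship the updated prompt to the Modal scoring endpoint.
```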

Lessons for Your Pipeline

Whether you're scoring sales calls, reviewing legal contracts, or analyzing field reports, the principles are the same:

  1. Decouple with queues. SQS (or any message queue) between every stage. Your pipeline will fail — make sure failures are isolated and retryable.
  2. Stage everything in durable storage. You will want to reprocess. Make it easy.
  3. Build human-in-the-loop from day one. It's a quality safety net today and a training data source tomorrow.
  4. Use the right retrieval for your domain. If precision matters — and in any professional context, it does — fine-tuned ColBERT outperforms general-purpose embeddings. The extra engineering effort pays for itself in result quality.
  5. Host LLMs where the economics make sense. Paying for dedicated GPU instances to serve variable workloads is burning money. Pay-per-second platforms like Modal align cost with actual usage.
  6. Keep your data private. Self-hosted models + self-hosted retrieval means client data never leaves your infrastructure. For regulated industries, this isn't optional.

Want to Build Something Similar?

We build and deploy private AI systems for businesses that need their data to stay private. Whether it's sales intelligence, legal document analysis, or operational analytics — we handle the AI so you can focus on your business. Book a free session to discuss your use case.