I Built an LLM-Powered Campaign Engine for Financial Services
What actually works for customer segmentation in Indian banking - language diversity, regional nuance, and product-mix complexity.
For four years I ran priority banking campaigns at Axis Bank. The playbook was simple: segment customers by AUM tier, pick a product, write copy, blast it out. Response rates hovered around 2-3%, and we called that good.
When I started building an LLM-powered campaign engine from scratch, I assumed the hard part would be prompt engineering. I was wrong. The hard part was everything upstream (getting customer data into a shape where an LLM could reason about it meaningfully) and everything downstream (making sure the output didn't violate SEBI advertising guidelines or RBI's fair practices code).
Here’s what I actually built, what surprised me, and what I’d do differently.
The Architecture
The system has four stages, each with distinct failure modes.
flowchart LR
A[Customer Data<br/>Ingestion] --> B[Behavioral<br/>Clustering]
B --> C[LLM Segment<br/>Profiling]
C --> D[Campaign Content<br/>Generation]
A1[CRM + Transaction<br/>+ Digital Footprint] --> A
B1[Embedding-based<br/>Similarity] --> B
C1[GPT-4 / Claude<br/>+ Domain Context] --> C
D1[Multi-language<br/>+ Compliance Check] --> D
style A fill:#1a1a2e,stroke:#e94560,color:#fff
style B fill:#1a1a2e,stroke:#e94560,color:#fff
style C fill:#1a1a2e,stroke:#e94560,color:#fff
style D fill:#1a1a2e,stroke:#e94560,color:#fff
Stage 1: Data Ingestion. Customer data comes from at least five sources - core banking (account balances, tenure), transaction history (UPI, NEFT, card spends), CRM (interaction logs, complaints), digital footprint (app usage patterns, feature adoption), and market context (what products they browsed but didn’t buy). The join key problem alone took two weeks. Indian banks have customers with multiple CIFs (Customer Information Files), name variations across systems, and address formats that range from PIN code precision to “near temple, behind petrol pump.”
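The CIF merge problem rewards a boring, deterministic pre-pass before anything clever. A minimal sketch of the idea - fuzzy name matching anchored on an exact identifier - using hypothetical record fields (`name`, `pan`, `mobile`) and illustrative thresholds, not the production matcher:

```python
import difflib
import re

def normalize_name(name: str) -> str:
    """Lowercase, strip honorifics and punctuation so 'Shri R. Kumar'
    and 'r kumar' compare cleanly."""
    name = name.lower()
    name = re.sub(r'\b(mr|mrs|ms|shri|smt|dr)\.?\b', '', name)
    return re.sub(r'[^a-z ]', '', name).strip()

def likely_same_customer(rec_a: dict, rec_b: dict,
                         name_threshold: float = 0.85) -> bool:
    """Heuristic cross-CIF match: fuzzy name similarity plus an exact
    anchor (PAN or mobile). Thresholds and fields are illustrative."""
    name_sim = difflib.SequenceMatcher(
        None, normalize_name(rec_a['name']), normalize_name(rec_b['name'])
    ).ratio()
    shared_anchor = (
        rec_a.get('pan') and rec_a.get('pan') == rec_b.get('pan')
    ) or (
        rec_a.get('mobile') and rec_a.get('mobile') == rec_b.get('mobile')
    )
    return bool(shared_anchor) and name_sim >= name_threshold
```

Requiring a shared exact anchor keeps fuzzy matching from merging two different R. Kumars; the fuzzy part only tolerates formatting drift, it never establishes identity on its own.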
Stage 2: Behavioral Clustering. This is where embeddings earn their keep. Instead of segmenting on static attributes (age, income, AUM), I encode behavioral sequences into dense vectors and cluster on those. More on this below.
Stage 3: LLM Segment Profiling. Given a cluster, the LLM generates a rich profile - not just demographics, but inferred life stage, financial sophistication, likely next product, preferred communication channel, risk appetite. It’s surprisingly good at this when you give it enough behavioral context.
Stage 4: Campaign Generation. Personalized copy, in the right language, for the right channel, with compliance guardrails baked in. This is the part that looks easy and isn’t.
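To make Stage 3 concrete: the profiling prompt is assembled from cluster-level aggregates rather than individual records. The field names and template below are illustrative, not the production prompt:

```python
def build_segment_profile_prompt(cluster_stats: dict) -> str:
    """Assemble an LLM profiling prompt from cluster aggregates.
    The stats schema and wording here are illustrative."""
    return (
        "You are a retail banking analyst. Given the behavioral summary "
        "of a customer cluster, infer: life stage, financial sophistication, "
        "likely next product, preferred communication channel, and risk "
        "appetite.\n\n"
        f"Cluster size: {cluster_stats['size']}\n"
        f"Median monthly UPI transactions: {cluster_stats['median_upi_txns']}\n"
        f"Top spend categories: {', '.join(cluster_stats['top_categories'])}\n"
        f"Products held: {', '.join(cluster_stats['products_held'])}\n"
        f"App features used weekly: {', '.join(cluster_stats['app_features'])}\n\n"
        "Ground every inference in the behavioral evidence above."
    )
```

Feeding aggregates rather than raw customer rows keeps individual data out of the prompt and forces the model to reason about the segment, not memorize one customer.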
The Clustering Pipeline
Traditional RFM (Recency, Frequency, Monetary) segmentation gives you maybe 8-12 segments. Useful, but crude. A customer who does 50 UPI transactions of Rs 200 each is behaviorally different from one who does 2 NEFT transfers of Rs 5,000 - even if the monetary value is identical.
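For contrast, the RFM baseline fits in a few lines - which is exactly why it's crude. A minimal sketch, with illustrative thresholds (real deployments derive them from portfolio quantiles):

```python
from datetime import date

def rfm_segment(last_txn: date, txn_count_90d: int,
                total_value_90d: float, today: date) -> str:
    """Classic RFM bucketing. Thresholds are illustrative."""
    recency_days = (today - last_txn).days
    r = 2 if recency_days <= 7 else 1 if recency_days <= 30 else 0
    f = 2 if txn_count_90d >= 40 else 1 if txn_count_90d >= 10 else 0
    m = 2 if total_value_90d >= 100_000 else 1 if total_value_90d >= 10_000 else 0
    return f"R{r}F{f}M{m}"  # 27 possible cells, usually merged to ~8-12
```

The 50-UPI-transactions customer and the 2-NEFT-transfers customer land in different F buckets here, but RFM still sees nothing about channel, category, or timing - which is the gap the embedding approach closes.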
Here’s the core concept of the embedding-based approach:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import HDBSCAN

# Load the encoder once at module level - model init is expensive,
# and reloading it per customer would dominate runtime
model = SentenceTransformer('all-MiniLM-L6-v2')

def bucket_amount(amount: float) -> str:
    """Discretize amounts to reduce noise while preserving signal."""
    if amount < 500: return "micro"
    if amount < 5000: return "small"
    if amount < 50000: return "medium"
    if amount < 500000: return "large"
    return "high-value"

def encode_customer_behavior(transactions: list[dict]) -> np.ndarray:
    """
    Convert a customer's transaction history into a behavioral
    embedding by encoding transaction narratives and aggregating.
    """
    # Build behavioral narratives from raw transactions
    narratives = []
    for txn in transactions:
        narratives.append(
            f"{txn['channel']} {txn['type']} of {bucket_amount(txn['amount'])} "
            f"to {txn['category']} on {txn['day_of_week']} "
            f"{'recurring' if txn.get('is_recurring') else 'one-time'}"
        )
    # Encode and aggregate (mean pooling over transaction embeddings)
    embeddings = model.encode(narratives)
    return np.mean(embeddings, axis=0)

# Cluster customers using HDBSCAN (no need to predefine k)
customer_vectors = np.array([
    encode_customer_behavior(txns) for txns in all_customer_transactions
])
clusterer = HDBSCAN(min_cluster_size=50, min_samples=10)
labels = clusterer.fit_predict(customer_vectors)
The thing that made this work: by encoding transaction narratives rather than raw numbers, you capture behavioral patterns that pure numerical features miss. “UPI micro payment to food delivery on weekday recurring” tells a very different story than “NEFT large transfer to brokerage one-time.”
HDBSCAN matters here because it finds clusters of varying density and marks outliers as noise (-1). You want that. Forcing every customer into a cluster is how you end up sending gold loan campaigns to NRIs.
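The downstream handling of that noise label is a one-liner, but it's worth being explicit about, because it's where the "don't force everyone into a cluster" decision actually lands. A self-contained sketch with synthetic labels:

```python
import numpy as np

# Synthetic labels as HDBSCAN would emit them: -1 marks noise/outliers
labels = np.array([0, 0, 1, -1, 1, 0, -1])
customer_ids = np.array([101, 102, 103, 104, 105, 106, 107])

# Only clustered customers enter the campaign pipeline
campaign_eligible = customer_ids[labels != -1]

# Noise points get routed to generic servicing or manual review,
# never to segment-targeted campaigns
needs_review = customer_ids[labels == -1]
```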
The India Problem: 22 Languages, 28 States, One Pipeline
When I first got the system working in English, I figured multilingual support was a configuration problem. Add a translation step, done.
That assumption lasted about a week.
Regional product affinity isn’t just cultural - it’s economic. Gold loan demand in Kerala and Tamil Nadu isn’t a stereotype. It’s driven by gold holding patterns that go back generations and a pawnbroking infrastructure that formal credit is only now matching. SIP penetration in Maharashtra and Gujarat correlates with equity market proximity and a broker network that simply doesn’t exist at the same density in Bihar or Jharkhand. The LLM needs this context to generate relevant campaigns, not just translated ones.
Financial terminology doesn’t translate cleanly. “Systematic Investment Plan” in Hindi becomes something that sounds either like a government scheme or a chit fund, depending on how you phrase it. We ended up maintaining a financial terminology knowledge base across 8 languages (Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Marathi, Gujarati). Not translations - culturally appropriate phrasings vetted by regional banking teams. This was tedious, unglamorous work that turned out to be one of the most important things we did.
Compliance is language-dependent. SEBI’s advertising code requires specific disclaimers. “Mutual fund investments are subject to market risks” has an approved Hindi version. You can’t just translate regulatory phrasings on the fly. You have to know the approved versions in each language.
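The mechanical consequence is that disclaimers must be a lookup, never a generation step. A sketch of the shape - the English entry is the well-known SEBI-mandated line, the Hindi entry is a placeholder (the real table holds compliance-approved wordings only):

```python
# Approved disclaimers keyed by (product_class, language). Entries must
# be regulator-approved wordings maintained by compliance - never
# machine-translated. The Hindi string here is a placeholder.
APPROVED_DISCLAIMERS: dict[tuple[str, str], str] = {
    ("mutual_fund", "en"): (
        "Mutual fund investments are subject to market risks, "
        "read all scheme related documents carefully."
    ),
    ("mutual_fund", "hi"): "<compliance-approved Hindi wording>",
}

def append_disclaimer(copy: str, product_class: str, language: str) -> str:
    """Fail closed: if no approved disclaimer exists for this
    (product, language) pair, block the copy rather than ship it."""
    key = (product_class, language)
    if key not in APPROVED_DISCLAIMERS:
        raise ValueError(f"No approved disclaimer for {key}; cannot send.")
    return f"{copy}\n\n{APPROVED_DISCLAIMERS[key]}"
```

Failing closed matters here: a campaign that silently goes out without the approved disclaimer is a regulatory incident, not a formatting bug.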
What Surprised Me
LLMs are remarkably good at cross-sell identification. Our rule-based system had maybe 15 cross-sell rules (has home loan + no home insurance = pitch home insurance). The LLM, given a customer's full behavioral profile, found patterns we'd never codified. One that stuck with me: customers who recently started recurring SIPs and had a spike in term insurance research were often in a life stage transition (new child, typically). The LLM caught this and suggested child education plan campaigns. Our rule-based system had no path to that insight - we wouldn't have written that rule in a hundred years.
LLMs are terrible at compliant financial copy without guardrails. Left unconstrained, the LLM would generate copy that was persuasive, well-written, and would get you a notice from SEBI. “Guaranteed returns,” implicit performance promises, missing risk disclaimers - they showed up constantly. We built a compliance layer (essentially a second LLM pass with a constitution of regulatory constraints) that flags and rewrites non-compliant content before it reaches the campaign queue.
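A cheap first line of defense before that second LLM pass is a deterministic phrase filter. The pattern list below is a small illustrative subset, not the full rulebook we maintained with compliance:

```python
import re

# Illustrative subset of phrases SEBI's advertising code effectively
# prohibits in investment marketing. The production list was far longer
# and owned by the compliance team.
BANNED_PATTERNS = [
    r"\bguaranteed\s+returns?\b",
    r"\bassured\s+returns?\b",
    r"\brisk[- ]free\b",
    r"\bno\s+risk\b",
    r"\bdouble\s+your\s+money\b",
]

def flag_noncompliant(copy: str) -> list[str]:
    """Return the banned patterns found in campaign copy
    (case-insensitive). An empty list means 'passed the deterministic
    filter' - the LLM compliance pass still runs afterwards."""
    return [p for p in BANNED_PATTERNS
            if re.search(p, copy, flags=re.IGNORECASE)]
```

The regex pass catches the blatant violations for free; the LLM pass exists for the subtler stuff, like implicit performance promises that no phrase list will ever enumerate.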
Embeddings cluster life stages better than demographics. A 28-year-old earning Rs 15 lakhs and a 35-year-old earning Rs 15 lakhs might be in the same income segment, but their transaction embeddings tell completely different stories. The younger customer’s vectors cluster with “early career, high discretionary spend, exploratory investing.” The older one clusters with “family formation, EMI-heavy, insurance-seeking.” That information exists in behavior data but gets lost in traditional segmentation. It was kind of obvious in retrospect, but I didn’t expect the gap to be this large.
Response Rate Impact
I want to be honest about results because overstating outcomes is everywhere in this space. Moving from rule-based to LLM-powered segmentation improved campaign response rates from roughly 2.5% to 4.8%. Nearly double, which is good. But it’s not the 10x that LinkedIn posts would have you believe.
Where the real impact showed up, though, was in reducing negative outcomes. Irrelevant campaign volume dropped by about 60%. Compliance flags in pre-review dropped by 40% - the compliance LLM layer caught issues that human reviewers were also catching, just earlier in the pipeline. And campaign creation time went from 2-3 weeks (segment definition, brief, copywriting, translation, compliance review) to 2-3 days. That last one might be the most valuable thing, honestly.
What I’d Do Differently
Start with the compliance layer, not the generation layer. I built generation first and compliance as an afterthought. In financial services, that’s backwards. The compliance constraints should shape the generation space, not filter its output. I learned this the hard way.
Invest more in evaluation. We measured response rates, but what I really wanted was a way to measure segment quality independent of campaign performance. A great segment paired with mediocre creative looks like a mediocre segment. I’d build segment evaluation metrics (coherence, separation, stability over time) as a first-class concern next time around.
Use smaller, fine-tuned models for the production path. GPT-4 level models are great for segment profiling and ideation. For production campaign generation across 8 languages at scale, the latency and cost don’t justify the marginal quality improvement over a well-fine-tuned smaller model with strong guardrails. Save the big model for the thinking, use the small one for the doing.
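The routing itself can be trivial - the discipline is in deciding which tasks genuinely need the large model. A sketch, where the model names are placeholders for "a frontier model" and "a fine-tuned small model", not specific products:

```python
# Task-to-model routing. Names are placeholders, not real model IDs.
MODEL_ROUTES = {
    "segment_profiling": "frontier-model",   # low volume, heavy reasoning
    "campaign_ideation": "frontier-model",
    "copy_generation": "small-finetuned",    # high volume, latency-bound
    "translation_polish": "small-finetuned",
    "compliance_check": "small-finetuned",   # plus deterministic filters
}

def route_model(task: str) -> str:
    """Default to the cheap model - unknown tasks should not silently
    burn frontier-model budget."""
    return MODEL_ROUTES.get(task, "small-finetuned")
```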
LLMs don’t replace the domain expertise needed to run financial services campaigns. They amplify it. The system I built works because it encodes real understanding of Indian banking - regional economics, regulatory constraints, product interdependencies - not because a language model is magically good at marketing. The model is the engine. The domain knowledge is the fuel.