
Keeping ML Models Honest: Lessons from Production in Indian Banking

Model drift, retraining cadence, and governance in practice - what I've learned running ML systems in financial services.

There’s a moment every ML engineer in banking dreads. Your model has been in production for eight months. Performance metrics look stable. Then Diwali hits, and your credit risk model starts flagging 40% of transactions as anomalous - because festival spending patterns look nothing like the steady-state behavior the model trained on. Alerts go haywire. The ops team wants to know if the model is broken. It’s not broken. It just doesn’t understand India.

I’ve spent the past year running ML systems in Indian financial services, and the single biggest lesson is this: drift detection in India is a fundamentally different problem than what the textbooks describe. High variance isn’t a bug - it’s the baseline. If you don’t design your monitoring around that reality, you’ll either drown in false alerts or miss genuine degradation. There’s no middle ground.

What Drift Actually Looks Like in Indian Banking

Standard ML monitoring literature treats drift as a deviation from a stationary distribution. Feature distributions shift, prediction accuracy degrades, you retrain. Clean and simple.

Indian financial data isn’t stationary to begin with. Here are the seasonal patterns I’ve had to deal with:

Festival spending cycles. Diwali (October-November), Dussehra, Eid, Christmas, regional festivals like Pongal and Onam - each creates distinct spending spikes. Credit card transaction volumes can jump 3-5x during Diwali week. Average transaction values shift upward. Category distributions change entirely: jewellery, electronics, and travel spike while grocery and utility stay flat. A model trained on January-June data will look at October and basically see an alien planet.

Monsoon-linked agricultural credit. For banks with rural and semi-urban exposure, monsoon quality directly affects credit behavior. Good monsoon means timely crop loan repayments, higher rural consumption, lower NPAs. Bad monsoon reverses all of that. This isn’t drift - it’s a known, recurring exogenous variable. But if your drift detector doesn’t know that, every kharif season triggers alerts.

Salary credit patterns. Government employees get paid on the 1st, private-sector salaries arrive between the 25th and the 1st, and the gig economy has no fixed pattern at all. The intra-month distribution of deposit and spending behavior is wildly non-uniform, and the mix shifts as the bank's customer base evolves.

Tax-season behavior. March is advance tax season. January is tax-saving season (ELSS, PPF, NPS contributions spike). Predictable, but significant distribution shifts.

The real challenge is telling these known seasonal patterns apart from genuine drift - the kind caused by a product change, a competitor’s move, a regulatory shift, or an actual change in customer behavior.

Why Standard Drift Detection Fails

Most production drift detection uses statistical tests comparing recent input distributions to a reference distribution. PSI (Population Stability Index) and the Kolmogorov-Smirnov (KS) test are the workhorses. Both have the same problem in Indian banking: they have no concept of seasonality.

If your reference distribution is the last 90 days, and those 90 days didn’t include a festival season, then festival season will always look like drift. If your reference is the same month last year, you’re assuming year-over-year stationarity - which breaks when structural changes (UPI adoption, new product launches, regulatory changes) alter the baseline.

I learned this the hard way. In my first production deployment, I set up a standard PSI monitor with a 0.25 threshold. During the first Diwali after deployment, it flagged 14 of 22 input features as drifted. The model was performing fine: predictions were accurate, business metrics were healthy. But the monitoring system was screaming. After the third false-alarm escalation, the ops team started ignoring alerts entirely - which is exactly the worst possible outcome, because it means genuine drift also gets ignored.

A Practical Approach: Season-Aware Drift Detection

Here’s the monitoring pipeline I use now, with governance checkpoints built in:

graph TD
    A[Production Model] --> B[Prediction Logs]
    B --> C[Feature Distribution Extraction]
    C --> D{Season-Aware Drift Detection}
    D -->|No Drift| E[Continue Monitoring]
    D -->|Drift Detected| F[Alert + Root Cause Analysis]
    F --> G{Governance Review}
    G -->|Seasonal / Expected| H[Log & Update Baseline]
    G -->|Genuine Drift| I[Trigger Retraining Pipeline]
    I --> J[Retrain on Updated Data]
    J --> K[Validation Against Holdout + Business Rules]
    K --> L{Model Risk Committee Review}
    L -->|Approved| M[Shadow Deployment]
    M --> N[A/B Validation in Production]
    N --> O{Performance Acceptable?}
    O -->|Yes| P[Full Deployment]
    O -->|No| Q[Rollback + Investigation]
    L -->|Rejected| R[Back to Development]

    style D fill:#FF9800,color:#fff
    style G fill:#5C6BC0,color:#fff
    style L fill:#5C6BC0,color:#fff
    style P fill:#4CAF50,color:#fff
    style Q fill:#EF5350,color:#fff

The key design choices:

Season-aware reference distributions. Instead of comparing against a single reference window, I maintain seasonal baselines. The reference for October isn’t the last 90 days - it’s October of previous years, adjusted for trend. You need at least 2-3 years of historical data to build reliable seasonal baselines, which is a real constraint for newer models.
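A minimal sketch of building those baselines from history, assuming a simple transactions table with hypothetical txn_date and txn_amount columns:

```python
import numpy as np
import pandas as pd


def build_seasonal_baselines(history: pd.DataFrame,
                             feature_cols: list,
                             date_col: str = "txn_date",
                             min_years: int = 2) -> dict:
    """Bucket historical feature values by calendar month.

    Returns {feature: {"month_01": np.ndarray, ...}}. A month is
    included only if it spans at least `min_years` distinct years,
    so thin histories don't produce unreliable baselines.
    """
    history = history.copy()
    dates = pd.to_datetime(history[date_col])
    history["_month"] = dates.dt.month
    history["_year"] = dates.dt.year

    baselines = {}
    for feature in feature_cols:
        per_month = {}
        for month, grp in history.groupby("_month"):
            if grp["_year"].nunique() >= min_years:
                per_month[f"month_{month:02d}"] = grp[feature].to_numpy()
        baselines[feature] = per_month
    return baselines


# Two Octobers of synthetic history -> month_10 qualifies
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "txn_date": pd.to_datetime(["2022-10-15"] * 100 + ["2023-10-15"] * 100),
    "txn_amount": rng.lognormal(8.0, 1.3, 200),
})
baselines = build_seasonal_baselines(df, ["txn_amount"])
print(sorted(baselines["txn_amount"].keys()))  # ['month_10']
```

The min_years guard encodes the 2-3 year constraint directly: a month that only appears in one year of history simply doesn't get a seasonal baseline, and the drift check falls back to the rolling comparison alone.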

Two-tier alerting. Tier 1 alerts fire on statistical drift (PSI, KS) and go to the data science team for investigation. Tier 2 alerts fire on performance drift - accuracy, precision, recall degradation against business-relevant metrics - and escalate to the model risk committee. Only Tier 2 alerts trigger the retraining pipeline. This distinction alone cut our false escalations by about 80%.
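The routing itself is simple. A sketch of the two-tier split, with hypothetical field names, where the retraining trigger is gated on performance drift only:

```python
def route_alerts(drift_results: list, perf_degraded: bool) -> dict:
    """Route drift findings into two tiers.

    Tier 1: statistical drift -> data science team investigates.
    Tier 2: performance drift -> model risk committee; only this
    tier is allowed to trigger the retraining pipeline.
    """
    tier1 = [r["feature"] for r in drift_results
             if r["is_drifted"] and not r["is_seasonal"]]
    return {
        "tier1_investigate": tier1,
        "tier2_escalate": perf_degraded,
        "trigger_retraining": perf_degraded,  # never on Tier 1 alone
    }


results = [
    {"feature": "txn_amount", "is_drifted": True, "is_seasonal": False},
    {"feature": "txn_count", "is_drifted": False, "is_seasonal": True},
]
print(route_alerts(results, perf_degraded=False))
# {'tier1_investigate': ['txn_amount'], 'tier2_escalate': False,
#  'trigger_retraining': False}
```

The asymmetry is the point: statistical drift earns an investigation, not a retrain, which is what cut the false escalations.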

Governance checkpoints at two stages. Before retraining begins (to confirm drift is genuine and not seasonal), and before redeployment (to validate the retrained model meets regulatory and business requirements). This isn’t checkbox compliance - it’s the actual decision architecture that keeps bad models out of production.

Here’s a practical implementation of season-aware PSI with the KS test as a secondary check:

import numpy as np
from scipy import stats
from typing import Dict, List
from dataclasses import dataclass


@dataclass
class DriftResult:
    feature: str
    psi: float
    ks_statistic: float
    ks_pvalue: float
    is_drifted: bool
    is_seasonal: bool
    recommendation: str


def calculate_psi(reference: np.ndarray, current: np.ndarray,
                  n_bins: int = 10) -> float:
    """
    Population Stability Index between reference
    and current distributions.
    """
    # Create bins from reference distribution
    breakpoints = np.percentile(reference,
                                np.linspace(0, 100, n_bins + 1))
    breakpoints = np.unique(breakpoints)

    ref_counts = np.histogram(reference, bins=breakpoints)[0]
    curr_counts = np.histogram(current, bins=breakpoints)[0]

    # Add small constant to avoid division by zero
    ref_pct = (ref_counts + 1e-6) / (ref_counts.sum() + 1e-6 * len(ref_counts))
    curr_pct = (curr_counts + 1e-6) / (curr_counts.sum() + 1e-6 * len(curr_counts))

    psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
    return psi


def season_aware_drift_check(
    current_data: Dict[str, np.ndarray],
    seasonal_baselines: Dict[str, Dict[str, np.ndarray]],
    rolling_baseline: Dict[str, np.ndarray],
    current_month: int,
    psi_threshold: float = 0.25,
    ks_alpha: float = 0.01,
) -> List[DriftResult]:
    """
    Compare current feature distributions against both
    seasonal and rolling baselines to separate genuine
    drift from expected seasonal variation.
    """
    month_key = f"month_{current_month:02d}"
    results = []

    for feature, current_values in current_data.items():
        # PSI against rolling baseline (last 90 days)
        rolling_ref = rolling_baseline.get(feature)
        if rolling_ref is None:
            continue

        psi_rolling = calculate_psi(rolling_ref, current_values)
        ks_stat, ks_p = stats.ks_2samp(rolling_ref, current_values)

        # Check against seasonal baseline if available
        seasonal_ref = None
        psi_seasonal = None
        if feature in seasonal_baselines:
            seasonal_ref = seasonal_baselines[feature].get(month_key)

        if seasonal_ref is not None and len(seasonal_ref) > 50:
            psi_seasonal = calculate_psi(seasonal_ref, current_values)

        # Decision logic
        drifted_vs_rolling = (
            psi_rolling > psi_threshold or ks_p < ks_alpha
        )

        if not drifted_vs_rolling:
            # No drift against rolling baseline - all clear
            results.append(DriftResult(
                feature=feature,
                psi=psi_rolling,
                ks_statistic=ks_stat,
                ks_pvalue=ks_p,
                is_drifted=False,
                is_seasonal=False,
                recommendation="No action needed",
            ))
        elif psi_seasonal is not None and psi_seasonal < psi_threshold:
            # Drifted vs rolling but consistent with seasonal pattern
            results.append(DriftResult(
                feature=feature,
                psi=psi_rolling,
                ks_statistic=ks_stat,
                ks_pvalue=ks_p,
                is_drifted=False,
                is_seasonal=True,
                recommendation=(
                    f"Seasonal shift detected (PSI vs season: "
                    f"{psi_seasonal:.3f}). Log and monitor."
                ),
            ))
        else:
            # Drifted vs both baselines - genuine drift
            results.append(DriftResult(
                feature=feature,
                psi=psi_rolling,
                ks_statistic=ks_stat,
                ks_pvalue=ks_p,
                is_drifted=True,
                is_seasonal=False,
                recommendation=(
                    "Genuine drift detected. Escalate to "
                    "Tier 2 review and evaluate retraining."
                ),
            ))

    return results


# Example: simulating Diwali-season drift detection
np.random.seed(42)

# Normal spending distribution (non-festival baseline)
normal_spending = np.random.lognormal(mean=7.5, sigma=1.2, size=5000)

# Diwali spending - higher mean, fatter tail
diwali_spending = np.random.lognormal(mean=8.2, sigma=1.4, size=2000)

# Historical Diwali baseline (from previous years)
historical_diwali = np.random.lognormal(mean=8.1, sigma=1.35, size=4000)

results = season_aware_drift_check(
    current_data={"txn_amount": diwali_spending},
    seasonal_baselines={
        "txn_amount": {"month_10": historical_diwali}
    },
    rolling_baseline={"txn_amount": normal_spending},
    current_month=10,
)

for r in results:
    print(f"Feature: {r.feature}")
    print(f"  PSI (vs rolling): {r.psi:.4f}")
    print(f"  KS stat: {r.ks_statistic:.4f}, p-value: {r.ks_pvalue:.4e}")
    print(f"  Drifted: {r.is_drifted}, Seasonal: {r.is_seasonal}")
    print(f"  Recommendation: {r.recommendation}")

The output will show Diwali spending appearing drifted against the rolling baseline but consistent with the seasonal baseline - so the system logs it as seasonal variation instead of triggering a retraining pipeline. That’s the difference between a monitoring system that works in a textbook and one that works in India.

The Retraining Cadence Question

How often should you retrain? Honest answer: it depends. Anyone giving you a universal number hasn’t run models in production.

Here’s the framework I use. Three triggers, operating independently.

Scheduled retraining (quarterly). Every 90 days, retrain on the most recent 18-24 months of data regardless of drift signals. This catches slow distribution shifts that are individually below alert thresholds but add up over time. The 18-24 month window is deliberate - it captures at least one full seasonal cycle while staying recent enough to reflect current behavior.
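As a sketch, the window and cadence arithmetic (months approximated as 30.44 days purely for illustration):

```python
from datetime import date, timedelta


def training_window(as_of: date, months: int = 24) -> tuple:
    """Return (start, end) of the training window: the most recent
    `months` months of data ending at `as_of`. 18-24 months captures
    at least one full seasonal cycle while staying recent."""
    # Approximate a month as 30.44 days; close enough for a sketch
    start = as_of - timedelta(days=round(months * 30.44))
    return start, as_of


def next_scheduled_retrain(last_retrain: date,
                           cadence_days: int = 90) -> date:
    """Quarterly retrain regardless of drift signals."""
    return last_retrain + timedelta(days=cadence_days)


start, end = training_window(date(2024, 10, 1))
print(start, end)                                   # 2022-10-01 2024-10-01
print(next_scheduled_retrain(date(2024, 10, 1)))    # 2024-12-30
```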

Drift-triggered retraining. When the season-aware pipeline flags genuine drift (not seasonal), I initiate retraining within 2 weeks. That 2-week buffer isn’t procrastination - it’s time for root cause analysis. Sometimes what looks like “drift” is actually a data pipeline issue, a feature engineering bug, or an upstream system change. Retraining on corrupted data just makes things worse.

Event-triggered retraining. Major structural changes - new product launch, regulatory change (new LTV norms, revised risk weights), a macro event - trigger immediate model review and potential retraining. This is where institutional knowledge really matters. The data science team needs to know enough about the business to recognize when a structural break has happened, even before the statistical tests catch up.

Lessons from Structural Breaks

Two events from recent Indian financial history show why static models fail.

Post-demonetization behavior. Models trained on pre-November 2016 data had a specific cash-digital mix baked into their transaction pattern features. After demonetization, digital transactions surged permanently. Any credit risk model using transaction channel mix as a feature was fundamentally miscalibrated. This wasn’t a temporary shift that washed out - it was a permanent structural break. Models needed to be retrained from scratch, not incrementally updated.

COVID distribution shift. The pandemic created a multi-year distribution shift. Spending patterns collapsed in April 2020, partially recovered by late 2020, shifted again with the second wave in 2021, and only normalized (to a new baseline) by mid-2023. For nearly three years, historical data was unreliable for training. What worked in practice: shorter training windows (6-9 months instead of 18-24), more frequent retraining (monthly instead of quarterly), and heavier regularization to avoid overfitting to pandemic-era distributions.
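Those adjustments amount to a regime switch. The values below mirror what the paragraph describes, but the exact knobs (the l2_regularization figures in particular) are illustrative, not a recipe:

```python
# Two hypothetical training regimes; values are illustrative.
NORMAL_REGIME = {
    "training_window_months": 24,   # 18-24 months: one full seasonal cycle
    "retrain_cadence_days": 90,     # quarterly
    "l2_regularization": 0.01,
}

STRUCTURAL_BREAK_REGIME = {
    "training_window_months": 9,    # 6-9 months: drop unreliable history
    "retrain_cadence_days": 30,     # monthly
    "l2_regularization": 0.1,       # heavier, to avoid overfitting the break
}


def select_regime(structural_break_active: bool) -> dict:
    """Switch training configuration when a structural break is declared."""
    return STRUCTURAL_BREAK_REGIME if structural_break_active else NORMAL_REGIME


print(select_regime(True)["retrain_cadence_days"])   # 30
print(select_regime(False)["retrain_cadence_days"])  # 90
```

Declaring the break is a human decision (the event-triggered path above); the config switch just makes the response repeatable.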

Monsoon-linked agri credit. From my time at NABARD, I learned that monsoon quality isn’t a binary variable. A delayed monsoon is different from a deficit monsoon is different from an excess monsoon, and each affects agricultural credit behavior in its own way. Models that use a single monsoon-quality feature miss all of this. I now encode monsoon data as a vector: cumulative rainfall deviation, regional distribution (a national surplus can mask a regional deficit), and temporal pattern (early vs. late season rainfall matters for different crops).
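A sketch of that vector encoding, with illustrative field names rather than any official IMD schema:

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class MonsoonFeatures:
    """Monsoon encoded as a vector, not a single quality flag.
    Fields are illustrative, not an official schema."""
    cumulative_deviation_pct: float  # season-to-date rainfall vs long-period avg
    regional_deviation_pct: dict     # region -> deviation; a national surplus
                                     # can mask a regional deficit
    early_season_share: float        # fraction of rain in first half of season


def encode_monsoon(weekly_rain: np.ndarray, lpa_weekly: np.ndarray,
                   regional_dev: dict) -> MonsoonFeatures:
    cum_dev = 100.0 * (weekly_rain.sum() - lpa_weekly.sum()) / lpa_weekly.sum()
    half = len(weekly_rain) // 2
    early_share = weekly_rain[:half].sum() / max(weekly_rain.sum(), 1e-9)
    return MonsoonFeatures(cum_dev, regional_dev, early_share)


# A season with a normal total but rain arriving late: the single
# "monsoon quality" flag would read normal, the vector does not.
weekly = np.array([5.0, 5.0, 10.0, 40.0, 40.0])
lpa = np.array([20.0, 20.0, 20.0, 20.0, 20.0])
f = encode_monsoon(weekly, lpa, {"north": -30.0, "south": 25.0})
print(f.cumulative_deviation_pct, f.early_season_share)  # 0.0 0.1
```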

Governance That Actually Works

Let me be direct about model governance: most governance frameworks I’ve seen in Indian banking are compliance theater. A 40-page model documentation template, reviewed once before deployment and never touched again. A model risk committee that meets quarterly and rubber-stamps whatever the data science team presents.

Here’s what I’ve found actually works.

Living model cards. One page per model, updated with every retraining. What data was it trained on, what are its current performance metrics, what are the known failure modes, what were the last three drift events. Accessible to the business team, not buried in a SharePoint folder that nobody opens.
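A living model card can be a small data structure checked into the repo rather than a document. The fields below are illustrative:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ModelCard:
    """One-page living model card, updated with every retraining.
    Fields are illustrative of what the post describes."""
    model_name: str
    training_data_window: tuple       # (start, end) of training data
    last_retrained: date
    current_metrics: dict             # e.g. {"auc": 0.81}
    known_failure_modes: list
    recent_drift_events: list = field(default_factory=list)

    def log_drift_event(self, event: str) -> None:
        # Keep only the last three drift events
        self.recent_drift_events = (self.recent_drift_events + [event])[-3:]


card = ModelCard(
    model_name="credit_risk_v4",
    training_data_window=(date(2023, 1, 1), date(2024, 9, 30)),
    last_retrained=date(2024, 10, 1),
    current_metrics={"auc": 0.81},
    known_failure_modes=["festival-season transaction spikes"],
)
for e in ["2024-10 Diwali seasonal shift", "2024-11 pipeline bug",
          "2025-01 tax-season shift", "2025-03 genuine drift"]:
    card.log_drift_event(e)
print(card.recent_drift_events)  # last three only
```

Because the card is code, "updated with every retraining" can be enforced in the pipeline instead of relying on someone remembering to edit a document.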

Pre-mortem reviews. Before deployment, the team explicitly asks: “How will this model fail? What data shift will break it?” This generates a watch list of features and scenarios to monitor. Way more useful than a generic risk assessment, and it takes an hour.

Business metric alignment. Every model gets at least one business metric (not just an ML metric) tracked continuously. For a credit risk model, that might be actual-vs-predicted default rate at the 30-day mark. For a campaign targeting model, conversion rate on targeted offers. When the ML metrics look fine but the business metric degrades - that’s the most dangerous kind of drift. And it’s the kind most governance frameworks completely miss.
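A minimal sketch of that check for a credit risk model, with an illustrative 20% relative-deviation tolerance:

```python
def business_metric_drift(predicted_default_rate: float,
                          actual_default_rate: float,
                          tolerance_pct: float = 20.0) -> bool:
    """Flag when the actual default rate at the 30-day mark deviates
    from the model's predicted rate by more than `tolerance_pct`
    (relative). The threshold is illustrative; calibrate it to the
    portfolio's base rate and volume."""
    if predicted_default_rate <= 0:
        return actual_default_rate > 0
    rel_dev = 100.0 * abs(actual_default_rate - predicted_default_rate) \
        / predicted_default_rate
    return rel_dev > tolerance_pct


# ML metrics can look fine while the business metric degrades:
print(business_metric_drift(0.020, 0.021))  # False: 5% deviation, in tolerance
print(business_metric_drift(0.020, 0.027))  # True: 35% relative deviation
```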

Quarterly model reviews with business stakeholders. Not just the data science team. The business heads who rely on model outputs need to be in the room. They often know about drift before the statistical tests do, because they see the business impact first.

The Honest Truth

Running ML in Indian banking is harder than the conference talks make it sound. The data is messier, the seasonality is more extreme, structural breaks happen more often, and the governance requirements are more demanding than in markets with longer digital histories.

But that difficulty is also the moat. Anyone can train a model. Keeping it honest in production, season after season, structural break after structural break - that’s the hard part. It requires understanding why October looks different from June, why a monsoon deficit matters for a personal loan model, why governance needs to be a living practice and not just a compliance artifact.

The models I trust most aren’t the ones with the best initial metrics. They’re the ones with the most thoughtful monitoring, the most honest failure documentation, and the most engaged governance around them. Unglamorous? Yes. But it’s the part that actually matters.
