Embeddings-Based Portfolio Intelligence: An Architecture for Indian Markets
How continuous signal processing from BSE, NSE, and Indian news sources can drive portfolio allocation beyond periodic rebalancing.
Most portfolio allocation systems - even the ones marketed as “AI-powered” - run on a periodic rebalancing cadence. Monthly, quarterly, maybe weekly if you’re lucky. They pull updated fundamentals, run an optimizer, spit out target weights. For US large-cap equities, where price discovery is continuous and information diffuses efficiently, this is fine.
Indian markets don’t work that way. I’ve watched a single RBI MPC announcement reprice the entire banking sector in 90 minutes. A monsoon forecast revision can move agri-input and FMCG stocks within hours. And FII flow reversals - driven by US Treasury yield movements or some geopolitical flare-up halfway around the world - will shift market character from momentum-driven to mean-reverting in a single trading session. Union Budget speech clauses trigger sectoral rotations while the finance minister is still talking.
Periodic rebalancing misses all of this. What you actually need is a system that continuously processes information signals - price data, yes, but also textual signals from news, regulatory announcements, corporate disclosures - and turns them into portfolio allocation adjustments. That’s what I’ve been building, and this post walks through the architecture.
Text as a Leading Indicator
Price data is a lagging indicator. By the time a stock moves, the information that caused the move has already been priced in. But the text that precedes price moves - analyst reports, news articles, regulatory circulars, management commentary - contains signals that can give you a temporal edge, if you process it right.
And “process it right” is doing a lot of work in that sentence. Reading news and making trades is as old as markets. The difference is in how you represent and compare text. Traditional NLP approaches - keyword matching, sentiment scoring - throw away information. Consider two headlines: “RBI maintains status quo on repo rate” and “RBI holds rates steady amid inflation concerns.” Same fact, very different implications. Keyword and sentiment approaches can’t really distinguish between “maintains status quo” (neutral-to-positive for rate-sensitives) and “holds rates amid inflation concerns” (neutral-to-negative, hinting at future tightening bias).
Embeddings preserve these distinctions. Map text into high-dimensional vector spaces where semantic similarity corresponds to geometric proximity, and you can detect nuanced signal changes - not just “positive vs. negative” but “hawkish hold vs. dovish hold vs. genuinely neutral.”
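To make the geometry concrete, here's a toy sketch with hand-made 3-dimensional vectors standing in for real embeddings (a real model emits hundreds of dimensions). The axes and numbers are purely illustrative, not output from any actual model:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d stand-ins for real embeddings. The axes are purely
# illustrative: (rates-on-hold, tightening-hint, easing-hint).
status_quo_hold = np.array([0.9, 0.1, 0.1])   # "RBI maintains status quo"
hawkish_hold    = np.array([0.8, 0.6, 0.0])   # "holds rates amid inflation concerns"
dovish_ref      = np.array([0.7, 0.0, 0.7])   # reference vector: dovish hold

# Both headlines describe a hold, so they sit close to each other...
assert cosine(status_quo_hold, hawkish_hold) > 0.8
# ...but the hawkish variant is measurably farther from the dovish reference.
assert cosine(hawkish_hold, dovish_ref) < cosine(status_quo_hold, dovish_ref)
```

A keyword or sentiment model collapses all three onto roughly the same point; the vector representation keeps the "hawkish vs. dovish hold" distinction as measurable distance.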
System Architecture
Here’s the full architecture. I’ll walk through each component below.
```mermaid
flowchart TB
subgraph Sources["Data Sources"]
S1[BSE/NSE Price Feeds<br/>via NSEpy + BSElib]
S2[News Sources<br/>ET, Livemint, Moneycontrol]
S3[Regulatory Feeds<br/>RBI, SEBI Circulars]
S4[Corporate Filings<br/>BSE XBRL, Annual Reports]
S5[Macro Indicators<br/>MOSPI, RBI DBIE]
end
subgraph Processing["Signal Processing Layer"]
P1[Text Preprocessing<br/>Deduplication + Chunking]
P2[Embedding Generation<br/>Domain-Tuned Model]
P3[Signal Extraction<br/>Cosine Similarity + Clustering]
P4[Regime Detection<br/>Hidden Markov Model]
end
subgraph Portfolio["Portfolio Intelligence"]
Q1[Signal Aggregation<br/>Sector + Stock Level]
Q2[Risk Model Integration<br/>VaR + Drawdown Constraints]
Q3[Allocation Engine<br/>Black-Litterman + Signal Views]
Q4[Rebalance Decision<br/>Threshold-Based Trigger]
end
subgraph Output["Output Layer"]
O1[Allocation Recommendations<br/>with Confidence Scores]
O2[Signal Attribution<br/>Why This Change?]
O3[Compliance Check<br/>SEBI PMS Guidelines]
end
S1 & S2 & S3 & S4 & S5 --> P1
P1 --> P2
P2 --> P3
P3 --> P4
P4 --> Q1
S1 --> Q2
Q1 & Q2 --> Q3
Q3 --> Q4
Q4 --> O1 & O2 & O3
style Sources fill:#e8f5e9,stroke:#2e7d32
style Processing fill:#e3f2fd,stroke:#1565c0
style Portfolio fill:#fff3e0,stroke:#ef6c00
style Output fill:#fce4ec,stroke:#c62828
```
Data Sources
Price feeds come from BSE and NSE via standard APIs. Nothing novel here - it's the same data everyone has access to. The alpha isn't in the price data.
News sources are where it gets India-specific. The Economic Times, Livemint, Moneycontrol, and Business Standard are the primary sources for Indian market-moving news. I also ingest Reuters and Bloomberg for global macro context, but the India-specific sources carry signals that international feeds miss - particularly around policy leaks, government source-based reporting, and sectoral commentary that reflects Indian market microstructure.
Regulatory feeds from RBI and SEBI are critical and underutilized. An RBI circular on priority sector lending norms doesn’t just affect compliance - it changes the economics of lending for specific bank categories. A SEBI circular on mutual fund expense ratios reprices the AMC business model. These signals need to be ingested and processed within hours, not at the next quarterly rebalance.
Corporate filings via BSE’s XBRL repository provide structured financial data, but the unstructured management commentary in annual reports and investor presentations often contains forward-looking signals that the structured data misses.
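All of these sources feed the same downstream pipeline, so the first job is normalizing everything into one record shape. Here's a minimal sketch of that step; the helper name is hypothetical, but the `{"text", "source", "timestamp"}` field convention matches what the signal processor below expects:

```python
from datetime import datetime, timezone

def normalize_article(raw_text: str, source: str, published: datetime) -> dict:
    """
    Collapse whitespace and tag each item with its source and timestamp so
    downstream embedding code sees one uniform record shape, whether the
    input was a news article, a circular, or filing commentary.
    (Hypothetical helper; field names are this post's convention.)
    """
    return {
        "text": " ".join(raw_text.split()),   # collapse runs of whitespace/newlines
        "source": source,                      # key into the credibility table
        "timestamp": published.astimezone(timezone.utc),
    }

rec = normalize_article(
    "RBI  issues circular on\npriority sector lending norms",
    "rbi_circular",
    datetime(2025, 2, 7, 11, 30, tzinfo=timezone.utc),
)
```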
The Embeddings Pipeline
This is the core technical component. Here’s the conceptual implementation:
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict
from datetime import datetime, timedelta


class IndianMarketSignalProcessor:
    """
    Continuous signal extraction from Indian market news
    using embeddings-based semantic comparison.
    """

    # Reference embeddings for known market-moving themes
    REFERENCE_SIGNALS = {
        "rbi_hawkish": [
            "RBI signals tightening bias amid persistent inflation",
            "Central bank concerned about inflation expectations unanchoring",
            "Monetary policy stance shifts to withdrawal of accommodation",
        ],
        "rbi_dovish": [
            "RBI signals comfort with inflation trajectory",
            "Growth concerns may prompt rate reduction consideration",
            "Monetary policy committee votes for accommodative stance",
        ],
        "fii_outflow_pressure": [
            "Foreign institutional investors continue selling in Indian equities",
            "Dollar strengthening triggers emerging market outflows",
            "FII net sellers for consecutive sessions amid global risk-off",
        ],
        "capex_cycle_positive": [
            "Government capital expenditure accelerates in infrastructure",
            "Order book pipeline strengthens for capital goods companies",
            "Private capex cycle revival signals broad-based recovery",
        ],
        "monsoon_risk": [
            "IMD revises monsoon forecast downward for key growing regions",
            "Rainfall deficit in major agricultural states raises concern",
            "El Nino conditions may impact kharif crop output",
        ],
    }

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.reference_embeddings = self._build_reference_embeddings()
        self.signal_history = defaultdict(list)

    def _build_reference_embeddings(self):
        """Pre-compute embeddings for reference signal descriptions."""
        ref_emb = {}
        for signal_name, descriptions in self.REFERENCE_SIGNALS.items():
            embeddings = self.model.encode(descriptions)
            # Average the reference embeddings for robustness
            ref_emb[signal_name] = np.mean(embeddings, axis=0)
        return ref_emb

    def process_news_batch(self, articles: list[dict]) -> dict:
        """
        Process a batch of news articles and extract signal strengths.

        Each article: {"text": str, "source": str, "timestamp": datetime}
        Returns: signal strengths for each reference theme.
        """
        if not articles:
            return {}

        # Encode all articles in batch for efficiency
        texts = [a["text"] for a in articles]
        article_embeddings = self.model.encode(texts)

        # Compare each article against reference signals
        signal_scores = defaultdict(list)
        for i, emb in enumerate(article_embeddings):
            for signal_name, ref_emb in self.reference_embeddings.items():
                similarity = cosine_similarity(
                    emb.reshape(1, -1),
                    ref_emb.reshape(1, -1)
                )[0][0]
                # Apply source credibility weighting
                source_weight = self._source_weight(articles[i]["source"])
                weighted_score = similarity * source_weight
                if weighted_score > 0.45:  # Relevance threshold
                    signal_scores[signal_name].append(weighted_score)

        # Aggregate: mean of top-3 scores per signal (reduces noise)
        aggregated = {}
        for signal_name, scores in signal_scores.items():
            top_scores = sorted(scores, reverse=True)[:3]
            aggregated[signal_name] = np.mean(top_scores)

        # Track temporal evolution
        timestamp = max(a["timestamp"] for a in articles)
        for signal_name, strength in aggregated.items():
            self.signal_history[signal_name].append(
                (timestamp, strength)
            )

        return aggregated

    def detect_signal_shift(self, signal_name: str,
                            window_days: int = 5) -> dict:
        """
        Detect if a signal has strengthened or weakened
        relative to its recent history.
        """
        history = self.signal_history.get(signal_name, [])
        if len(history) < 2:
            return {"shift": "insufficient_data"}

        cutoff = datetime.now() - timedelta(days=window_days)
        recent = [s for t, s in history if t > cutoff]
        prior = [s for t, s in history if t <= cutoff]

        if not recent or not prior:
            return {"shift": "insufficient_data"}

        recent_mean = np.mean(recent)
        prior_mean = np.mean(prior)
        shift_magnitude = recent_mean - prior_mean

        return {
            "signal": signal_name,
            "current_strength": round(recent_mean, 4),
            "prior_strength": round(prior_mean, 4),
            "shift": round(shift_magnitude, 4),
            "direction": "strengthening" if shift_magnitude > 0.05
                         else "weakening" if shift_magnitude < -0.05
                         else "stable",
        }

    def _source_weight(self, source: str) -> float:
        """Weight signals by source credibility for market impact."""
        weights = {
            "reuters": 1.0,
            "bloomberg": 1.0,
            "economic_times": 0.9,
            "livemint": 0.85,
            "moneycontrol": 0.8,
            "business_standard": 0.85,
            "rbi_circular": 1.0,  # Regulatory sources get max weight
            "sebi_circular": 1.0,
        }
        return weights.get(source, 0.7)
```
A few design decisions worth explaining.
Reference signal architecture. Instead of training a separate classifier for each market theme, I pre-define reference descriptions for known signal types and use cosine similarity to see how closely incoming news matches each one. More flexible than classification (a single article can partially match multiple signals) and more interpretable - you can inspect exactly which reference descriptions drove the match.
Source credibility weighting. An RBI circular is a primary source. A Moneycontrol article might just be commentary on that same circular. Weighting by source keeps the system from double-counting a signal that’s bouncing across multiple outlets with varying levels of interpretation.
Shift detection over absolute levels. This one took me a while to internalize. The detect_signal_shift method matters way more than absolute signal strength. Markets don’t move on “inflation is high.” They move on “inflation expectations are shifting.” It’s the temporal derivative of the signal, not the signal itself, that drives allocation decisions.
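A quick numeric illustration of the point (the numbers are made up for the example): two signals can sit at the same absolute strength today while carrying opposite allocation implications, and only the window-over-window difference reveals that.

```python
import numpy as np

def shift(prior: list, recent: list) -> float:
    """Window-over-window change in mean signal strength,
    mirroring the split used in detect_signal_shift."""
    return float(np.mean(recent) - np.mean(prior))

# Both signals currently read 0.60, but with opposite trajectories.
building = shift(prior=[0.40, 0.45, 0.50], recent=[0.58, 0.60, 0.62])  # +0.15
fading   = shift(prior=[0.75, 0.72, 0.70], recent=[0.62, 0.60, 0.58])  # about -0.12

# Same current level, opposite direction against the ±0.05 band.
assert building > 0.05 and fading < -0.05
```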
India-Specific Design Challenges
T+1 Settlement and Position Sizing
India moved to T+1 settlement in January 2023, ahead of most global markets. What this means in practice: your capital gets locked faster, and reversing a position costs more than in T+2 markets. So my system applies a higher confidence threshold for allocation changes. A signal shift needs to be stronger to trigger a rebalance, because the cost of being wrong is recovered more slowly.
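As a sketch of how that design choice could look in code (the scaling factor here is an illustrative assumption, not the production calibration):

```python
def shift_threshold(base_shift: float = 0.05, t_plus: int = 1) -> float:
    """
    Require a larger signal shift before rebalancing in faster-settlement
    markets, where a wrong position is costlier to reverse.
    (Hypothetical helper; the 1.5x factor is illustrative, not calibrated.)
    """
    scale = {1: 1.5, 2: 1.0}.get(t_plus, 1.0)
    return base_shift * scale

# Under T+1, the same signal shift may not clear the rebalance bar.
assert shift_threshold(t_plus=1) > shift_threshold(t_plus=2)
```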
Sectoral Concentration Risk
Nifty 50 is heavily concentrated in financial services (~33% weight as of early 2025) and IT services (~13%). A naive signal-driven system will over-allocate to financials simply because banks and NBFCs generate more news. I learned this the hard way during early testing. The fix: normalize signal strength by sector. A “strong” signal in a sector that generates 40% of financial news gets weighted differently than an equally strong signal from a sector with 5% of news coverage.
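A minimal sketch of that normalization, assuming a square-root damping by news share (the functional form and baseline are illustrative choices, not the deployed calibration):

```python
def normalize_by_coverage(raw_signal: float, sector_news_share: float,
                          baseline_share: float = 0.10) -> float:
    """
    Dampen signals from over-covered sectors: a sector generating 40% of
    financial news needs a proportionally stronger raw signal to register
    the same normalized strength as a thinly covered one.
    (Sqrt damping is an illustrative assumption.)
    """
    return raw_signal * (baseline_share / sector_news_share) ** 0.5

# Identical raw signals, very different news coverage.
financials = normalize_by_coverage(0.70, sector_news_share=0.40)  # damped to 0.35
specialty  = normalize_by_coverage(0.70, sector_news_share=0.05)  # mildly amplified
assert specialty > financials
```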
FII/DII Flow Dynamics
FII flows are a dominant driver of Indian market direction. In 2024, FIIs were net sellers of roughly Rs 1.2 lakh crore in Indian equities, while DIIs were net buyers of Rs 1.7 lakh crore. This tug-of-war creates a market regime that simply doesn’t exist in US markets.
I track FII flow sentiment as a separate signal channel. When FII outflow signals strengthen but domestic macro signals stay positive, the system reads this as a “flow-driven dislocation” - fundamentally sound sectors getting sold because of foreign portfolio rebalancing, not because anything went wrong domestically. Those are often buying opportunities, and the system flags them as such.
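The rule reduces to a simple classification over the two signal channels. A toy version, with thresholds chosen purely for illustration:

```python
def classify_flow_regime(fii_outflow: float, domestic_macro: float,
                         threshold: float = 0.5) -> str:
    """
    Strong FII-outflow signal with still-positive domestic macro reads as a
    flow-driven dislocation (potential buying opportunity) rather than a
    fundamental deterioration. (Thresholds are illustrative.)
    """
    if fii_outflow > threshold and domestic_macro > threshold:
        return "flow_driven_dislocation"
    if fii_outflow > threshold:
        return "fundamental_risk_off"
    return "normal"

assert classify_flow_regime(fii_outflow=0.72, domestic_macro=0.65) == "flow_driven_dislocation"
assert classify_flow_regime(fii_outflow=0.72, domestic_macro=0.20) == "fundamental_risk_off"
```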
Event Calendar Integration
Indian markets have a distinct event calendar that creates predictable volatility windows:
- Union Budget (typically February 1): Broad sectoral impacts, fiscal policy shifts
- RBI MPC meetings (six per year): Rate-sensitive sector repricing
- Monsoon forecasts (April-June): Agri-input, FMCG, and rural consumption plays
- Quarterly earnings seasons: Concentrated in April-May and October-November
- GST collection data (monthly): Leading indicator for economic activity
During these windows, the system cranks up signal sensitivity and lowers the threshold for allocation changes. Textual signals are just more predictive when the market is expecting a specific catalyst.
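The threshold adjustment can be sketched as a calendar lookup; the dates and discount factor below are hypothetical placeholders, where a real system would load MPC and budget dates from a maintained calendar feed:

```python
from datetime import date

# Hypothetical event windows (start, end); illustrative dates only.
EVENT_WINDOWS = [
    (date(2025, 1, 25), date(2025, 2, 5)),   # Union Budget run-up + speech
    (date(2025, 4, 1),  date(2025, 4, 15)),  # first IMD monsoon forecast
]

def allocation_threshold(today: date, base: float = 0.05,
                         event_discount: float = 0.6) -> float:
    """Lower the signal-shift threshold inside known event windows, where
    textual signals are more predictive. (Discount is illustrative.)"""
    in_window = any(start <= today <= end for start, end in EVENT_WINDOWS)
    return base * event_discount if in_window else base

# Budget week trips the rebalance trigger at a weaker signal shift.
assert allocation_threshold(date(2025, 2, 1)) < allocation_threshold(date(2025, 3, 1))
```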
From Signals to Allocation
The signal processing layer spits out a vector of signal strengths. Turning that into portfolio weights requires combining these views with traditional risk models. I use a modified Black-Litterman approach.
Market equilibrium weights serve as the prior - basically, the market-cap weighted portfolio. Signal-derived views layer on top: “overweight financials by X basis points because RBI dovish signals are strengthening while FII outflow pressure is weakening.” Risk constraints bound everything - max sector deviation from benchmark, max single-stock weight, VaR limits. And the confidence scores from signal processing map to Black-Litterman’s uncertainty parameter, so weak signals produce small tilts and strong ones produce larger tilts.
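Here's a minimal numeric sketch of that last mapping, using the standard Black-Litterman posterior mean with a single view. The confidence-to-Omega scaling (a common heuristic where view uncertainty shrinks as confidence rises) and all the numbers are illustrative assumptions, not the production model:

```python
import numpy as np

def bl_posterior(pi, Sigma, P, q, confidence, tau=0.05):
    """
    Black-Litterman posterior mean. Smaller view uncertainty Omega
    (higher confidence) tilts the posterior further from the prior pi.
    """
    PSP = P @ (tau * Sigma) @ P.T
    # confidence in (0, 1): Omega = ((1 - c) / c) * P (tau Sigma) P'
    Omega = PSP * (1.0 - confidence) / confidence
    adj = (tau * Sigma) @ P.T @ np.linalg.inv(PSP + Omega) @ (q - P @ pi)
    return pi + adj

# Three sectors: financials, IT, FMCG. Illustrative equilibrium returns.
pi = np.array([0.11, 0.10, 0.09])
Sigma = np.diag([0.04, 0.05, 0.03])
P = np.array([[1.0, 0.0, 0.0]])   # single view on financials
q = np.array([0.14])              # view: financials return 14%

weak   = bl_posterior(pi, Sigma, P, q, confidence=0.2)
strong = bl_posterior(pi, Sigma, P, q, confidence=0.8)

# A stronger signal pulls financials further toward the view, never past it.
assert pi[0] < weak[0] < strong[0] < q[0]
```

Weak signals produce small tilts (here, 11% drifts to roughly 11.6%); strong signals produce larger ones (roughly 13.4%), and assets outside the view stay at the prior.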
The output isn’t a trade list. It’s a set of allocation recommendations with confidence scores and signal attribution. Something like: “Recommend increasing financial sector weight from 31% to 34%, driven by RBI dovish shift (signal strength 0.72) and declining FII outflow pressure (signal strength 0.58).”
What This Actually Gets You
Going from periodic to continuous signal processing changes how portfolio management works in practice.
Event response drops from days to hours. When the RBI MPC announces a surprise rate hold (or cut, or hike), the system processes the announcement text, the governor’s statement, and analyst commentary within hours. Not at next week’s review meeting.
You get institutional memory for free. Every allocation recommendation comes with a trail - which signals drove it, how strong they were, how they compare to historical patterns. Useful for performance attribution, but also for regulatory compliance under SEBI’s PMS guidelines. I didn’t fully appreciate this until a compliance review where we could trace every recommendation back to specific signals.
Regime detection helps with drawdowns. The Hidden Markov Model layer picks up when the market shifts from momentum-driven (typically during sustained FII inflows) to mean-reverting (typically during FII outflows with DII support). These are very different regimes, and an allocation strategy that works in one can be destructive in the other. Continuous signal processing catches these shifts earlier than price-only models.
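The HMM itself is beyond a blog-post snippet, but a cheap proxy for the same distinction is the sign of the lag-1 autocorrelation of daily returns: positive looks momentum-like, negative looks mean-reverting. This is a deliberately simpler stand-in than the actual HMM layer, which also consumes flow and text signals:

```python
import numpy as np

def regime_proxy(returns: np.ndarray) -> str:
    """Classify by the sign of lag-1 autocorrelation of daily returns.
    (A crude stand-in for the HMM regime layer, for illustration only.)"""
    r = returns - returns.mean()
    ac1 = float((r[:-1] @ r[1:]) / (r @ r))
    return "momentum" if ac1 > 0 else "mean_reverting"

# Synthetic return series: persistent smooth moves vs daily sign flips.
smooth = 0.01 * np.sin(np.linspace(0, 3, 60))
choppy = 0.01 * np.array([1, -1] * 30, dtype=float)

assert regime_proxy(smooth) == "momentum"
assert regime_proxy(choppy) == "mean_reverting"
```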
Practical Caveats
This isn’t a black box that prints money. Some honest limitations.
Embedding models carry biases. Models trained mostly on English text from US/European sources don’t always capture Indian financial terminology or the particular way Indian business journalism uses language. You need domain adaptation - even just fine-tuning on a corpus of Indian financial news makes a noticeable difference.
Garbage in, garbage out. Indian financial news sources sometimes carry speculative or poorly sourced articles. Source credibility weighting helps but doesn’t eliminate the problem. A false rumor from a credible source still generates a signal, and I haven’t found a great way around that yet.
Latency matters more than you’d think. For this to provide value over periodic rebalancing, the full pipeline - ingestion through allocation recommendation - needs to complete in under 4 hours. Achievable, but you have to be careful about the embedding generation step, which is the bottleneck.
Regulatory boundaries. Any system generating allocation recommendations has to operate within SEBI’s framework for investment advisory and portfolio management. The system generates recommendations for human review, not automated trades. That’s both a regulatory requirement and, honestly, the right design choice given where the technology is.
Indian markets - event-driven, multi-regime, sensitive to flow dynamics - are a genuinely interesting environment for this kind of work. What I’ve described here is a starting point. I keep refining it as the signal processing gets better and India’s market data infrastructure matures. There’s a wide gap between what’s possible and what’s actually deployed in Indian portfolio management, and that gap is where the opportunity sits.