Project H
Engineering log

The AI/ML approach to options selection — why we picked ranking over classification

Harish Subramanian

The wrong question gets the wrong model

A lot of retail ML-for-trading projects start with the same instinct: train a binary classifier on "did this stock go up tomorrow?" Then they're surprised when the model produces a score that's 0.51 for everything — basically flat, basically useless.

The problem isn't the data. It's the framing.

In a portfolio context — and every options-trading book is a portfolio, even one stock at a time — the question that matters is not "will RELIANCE go up tomorrow?" It's "of these 146 F&O stocks, which 50 are most likely to outperform the median tomorrow?" That's a ranking question, not a classification question. And it changes the entire model.

What ranking buys you

A ranking model has three concrete advantages over binary classification for this kind of work:

1. It learns relative ordering, not an absolute boundary

A binary classifier spends much of its capacity locating the absolute boundary between "up" and "down." That boundary moves from day to day with overall market direction, so for stock selection the model is mostly fitting noise.

A ranker doesn't care about the absolute boundary. It cares about which stocks are stronger than which, holding the day's overall direction constant. That's the actual signal in cross-sectional alpha.
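A minimal sketch of that idea, assuming daily returns are held in a plain dictionary. The symbols, the numbers and the `excess_over_median` helper are hypothetical, chosen only to show how a binary label flips with market direction while the cross-sectional ordering does not:

```python
from statistics import median

def excess_over_median(returns_by_day):
    """Map {day: {symbol: return}} to {day: {symbol: return - that day's median}}."""
    out = {}
    for day, rets in returns_by_day.items():
        med = median(rets.values())
        out[day] = {sym: r - med for sym, r in rets.items()}
    return out

# Hypothetical toy data: an up day and a down day with the same relative ordering.
returns_by_day = {
    "up_day": {"AAA": 0.030, "BBB": 0.020, "CCC": 0.010},
    "down_day": {"AAA": -0.010, "BBB": -0.020, "CCC": -0.030},
}

excess = excess_over_median(returns_by_day)

# Binary "up = 1, down = 0" labels follow the market, not the stock.
binary = {
    day: {sym: int(r > 0) for sym, r in rets.items()}
    for day, rets in returns_by_day.items()
}

# Ordering by excess return, strongest first, for each day.
order = {
    day: sorted(vals, key=vals.get, reverse=True)
    for day, vals in excess.items()
}
```

On the up day every binary label is 1 and on the down day every label is 0, so the classifier target carries no stock-level information on either day. The ordering by excess return is identical on both.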

2. It handles regime shifts more gracefully

A bull-market binary classifier learns "most things go up." A bear-market classifier learns "most things go down." Neither generalises well. A ranker learns "stocks with characteristics X, Y, Z tend to outperform peers," which is much closer to a regime-invariant statement.

In practice, this is what makes a system survive when the macro tide turns.

3. The outputs compose with other signals naturally

If you have a ranking score (a per-stock number that is comparable across that day's universe), you can blend it with a news score, a corporate-event flag, or a macro-regime indicator, and the ranking still makes sense. Try doing that with a poorly calibrated classifier output and you'll spend weeks on calibration.
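One simple way to make signals composable is to standardise each one within the day and then take a weighted sum. The sketch below assumes exactly that: a plain linear blend of z-scores. The signal names, values and weights are hypothetical, and this is not a description of how Project H combines its signals.

```python
from statistics import mean, pstdev

def zscore(scores):
    """Standardise one day's {symbol: score} to zero mean and unit variance."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        # A signal with no cross-sectional spread carries no ranking information.
        return {sym: 0.0 for sym in scores}
    return {sym: (v - mu) / sigma for sym, v in scores.items()}

def blend(signals, weights):
    """Weighted sum of z-scored signals; both dicts are keyed by signal name."""
    z = {name: zscore(vals) for name, vals in signals.items()}
    symbols = next(iter(signals.values())).keys()
    return {sym: sum(weights[name] * z[name][sym] for name in signals)
            for sym in symbols}

# Hypothetical one-day inputs on very different raw scales.
signals = {
    "model": {"AAA": 2.0, "BBB": 0.0, "CCC": -2.0},
    "news": {"AAA": 0.1, "BBB": 0.3, "CCC": 0.2},
}
weights = {"model": 0.7, "news": 0.3}  # illustrative weights only

combined = blend(signals, weights)
ranking = sorted(combined, key=combined.get, reverse=True)
```

Because both inputs are standardised before blending, the raw scale of each signal does not decide how much it counts; only the chosen weights do.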

Why this isn't a one-line model swap

Ranking models — LambdaRank, pairwise / listwise objectives — train differently from classifiers. They need:

  • Group-aware loss: each trading day is a "group" of cross-sectional comparisons. The model has to know which examples are comparable to which.
  • Relevance tiers, not binary labels: instead of "up = 1, down = 0," you split each day's stocks into tiers (top 20%, middle 60%, bottom 20%) and train the model to predict the tier ordering.
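  • Validation against rank-IC, not AUC: AUC measures how well the model separates binary "up" from "down" labels; rank-IC (the rank correlation between predicted scores and realised forward returns, computed per day) measures whether the predicted ranking actually tracks those returns.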
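The first two requirements in the list above can be sketched in plain Python. The 20/60/20 split follows the text; the helper names, the toy returns, the dates and the 2/1/0 tier encoding are assumptions for illustration. Rows are kept contiguous per day and the size of each day's block is recorded, which is the shape that ranking libraries generally expect (LightGBM's `LGBMRanker.fit`, for example, takes a `group` argument of per-query sizes).

```python
def relevance_tiers(day_returns, top=0.2, bottom=0.2):
    """Assign 2 / 1 / 0 to the top / middle / bottom of one day's returns."""
    ranked = sorted(day_returns, key=day_returns.get, reverse=True)
    n = len(ranked)
    n_top = max(1, round(n * top))
    n_bottom = max(1, round(n * bottom))
    labels = {}
    for i, sym in enumerate(ranked):
        if i < n_top:
            labels[sym] = 2          # top tier
        elif i >= n - n_bottom:
            labels[sym] = 0          # bottom tier
        else:
            labels[sym] = 1          # middle tier
    return labels

def build_ranking_rows(returns_by_day):
    """Flatten {day: {symbol: return}} into rows, labels and per-day group sizes."""
    rows, labels, group_sizes = [], [], []
    for day in sorted(returns_by_day):
        tiers = relevance_tiers(returns_by_day[day])
        for sym in sorted(tiers):
            rows.append((day, sym))
            labels.append(tiers[sym])
        group_sizes.append(len(tiers))  # one group per trading day
    return rows, labels, group_sizes

# Hypothetical forward returns for a five-stock universe over two days.
returns_by_day = {
    "2024-01-01": {"A": 0.03, "B": 0.01, "C": 0.00, "D": -0.01, "E": -0.02},
    "2024-01-02": {"A": -0.02, "B": 0.02, "C": 0.00, "D": 0.01, "E": -0.01},
}
rows, labels, group_sizes = build_ranking_rows(returns_by_day)
```

Ties in returns are broken arbitrarily by sort order here; a production pipeline would need an explicit rule for them.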

Each of those choices is a fork in the road where teams new to ranking get stuck. The model behaviour we see in production reflects a sequence of these calls — and each can be re-tuned as more data accumulates.
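For the validation metric, rank-IC is commonly computed as the Spearman correlation between one day's predicted scores and that day's realised forward returns, then averaged over days. A minimal standard-library sketch follows; the function names and toy numbers are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def rank_ic(scores, forward_returns):
    """Spearman correlation between one day's scores and realised returns."""
    symbols = sorted(scores)
    s = average_ranks([scores[k] for k in symbols])
    r = average_ranks([forward_returns[k] for k in symbols])
    return pearson(s, r)

# Hypothetical one-day example: scores that order the stocks perfectly,
# and the same scores against fully reversed returns.
scores = {"A": 0.9, "B": 0.5, "C": 0.1, "D": -0.3}
good_day = {"A": 0.02, "B": 0.01, "C": 0.00, "D": -0.01}
bad_day = {"A": -0.01, "B": 0.00, "C": 0.01, "D": 0.02}

daily_ics = [rank_ic(scores, good_day), rank_ic(scores, bad_day)]
mean_ic = mean(daily_ics)
```

A perfectly ordered day gives an IC of 1, a perfectly reversed day gives -1, and the mean across days is the number to track.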

Where AI / ML stops and human judgement starts

Project H is unapologetically AI-driven for the selection step: the model ranks the universe, top names get a directional bias one way, and bottom names get the other. That decision is mechanical.

But the mechanics around the model are very deliberate human design:

  • What features the model gets to see — what news is fed in, how it's normalised, what time horizons are encoded
  • How model outputs combine with other signals — a model score and a news score should reinforce each other, not just average each other
  • When to trust the model and when to abstain — a model that's confident in a stock with no fresh news is different from one confident in a stock with a corporate-action filing 30 minutes ago. The system reads both.
  • Risk overlays — even a high-confidence ranker gets sized down in unusual VIX regimes or near major event calendars
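The risk-overlay idea in the last bullet can be sketched as a multiplier applied to a base position size. Every threshold and scaling factor below is a hypothetical placeholder, as are the function and parameter names; the actual Project H rules are not published.

```python
def size_multiplier(vix, days_to_event, vix_ceiling=25.0, event_window_days=2):
    """Return a multiplier in (0, 1] for the base position size.

    All thresholds and the 0.5 haircuts are illustrative assumptions.
    days_to_event is None when no major event is on the calendar.
    """
    scale = 1.0
    if vix > vix_ceiling:
        scale *= 0.5   # unusual volatility regime: size down
    if days_to_event is not None and days_to_event <= event_window_days:
        scale *= 0.5   # close to a major scheduled event: size down again
    return scale

calm = size_multiplier(vix=15.0, days_to_event=None)
stressed = size_multiplier(vix=30.0, days_to_event=None)
stressed_near_event = size_multiplier(vix=30.0, days_to_event=1)
```

The point of keeping this outside the model is that the ranker's score is left untouched; only the size of the resulting position changes.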

This is the part that gets called "the secret sauce" in the industry. We don't publish the exact weights, the exact feature set, or the exact training pipeline. Those are private. But the principles are laid out above, and they're enough that someone smart could build a different but credibly competitive system from scratch.

What separates a good system from a great one

In our experience, three things:

  1. Honest validation: out-of-sample testing that doesn't leak future information. This sounds trivial. It's not. Most retail backtests have lookahead bias somewhere.

  2. Wide, diverse feature input: more independent signals beat deeper feature engineering on a few signals. The marginal news source you almost don't bother adding is often the one that flips a few stock-level decisions on a regime day.

  3. Discipline on what to drop: most signals don't earn their keep. A continuous information-coefficient audit — does this signal predict anything new over what other signals already capture? — is what keeps the model from accumulating decorative inputs that look smart but don't actually help.
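One way to frame the audit in point 3 is as a partial rank correlation: how well does the new signal rank forward returns once the existing signal has been accounted for? The sketch below assumes that framing. The helper names and toy numbers are hypothetical, and the rank function assumes no ties for brevity.

```python
def ranks(values):
    """Ranks 1..n; assumes no tied values, for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def incremental_ic(new_signal, existing_signal, forward_returns):
    """Partial rank correlation of new_signal with returns, given existing_signal."""
    r_ny = spearman(new_signal, forward_returns)
    r_ey = spearman(existing_signal, forward_returns)
    r_ne = spearman(new_signal, existing_signal)
    denom = ((1 - r_ne ** 2) * (1 - r_ey ** 2)) ** 0.5
    if denom < 1e-12:
        # The new signal duplicates the existing one (or the existing one
        # already explains the returns): nothing incremental to measure.
        return 0.0
    return (r_ny - r_ne * r_ey) / denom

# Hypothetical one-day cross-section of five stocks.
existing = [1.0, 2.0, 3.0, 4.0, 5.0]
returns = [2.0, 1.0, 4.0, 3.0, 5.0]
candidate = [2.0, 1.0, 3.0, 5.0, 4.0]       # partly new information
duplicate = [10.0, 20.0, 30.0, 40.0, 50.0]  # same ranking as `existing`

useful = incremental_ic(candidate, existing, returns)
decorative = incremental_ic(duplicate, existing, returns)
```

In this toy case the candidate has a raw rank-IC of 0.7 but an incremental IC of only about 0.17, because most of what it knows is already in the existing signal. The duplicate scores 0.0, which is the "decorative input" case.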

What's next

Future posts will cover:

  • Risk management and exit logic — why CALL/PUT timing matters as much as direction
  • The 30-day paper-trading results once the live arc accumulates evidence
  • What happens when an AI-ranked system meets a real-world incident (network outage, exchange anomaly, news source going dark) — the resilience layer

The model is one piece. The system around the model is the other. Both have to work for either to be valuable.