How We Built BullAlert: A Multi-Session Stock Scanner in Production

Engineering deep-dive into a multi-session momentum scanner: 60+ sources, 20 internal scoring gates, three sessions, catch% → peak% per session.

What this post is — and isn't

This is an engineering deep-dive into how BullAlert is built as a data system. It is not a trading-bot post (that's a separate pillar). BullAlert doesn't trade — it surfaces positive-momentum small-cap tickers within five minutes of their session crossing, publishes catch% and peak% per session, and leaves the execution layer entirely to the user.

We're going to walk through the shape of the system: the three sessions, the two-stage scoring loop, the twenty internal gates, the catch%/peak% truth model, and the daily cleanup pass that reconciles live samples against bar-resolution truth. We are not going to name the specific market-data vendor, news provider, language model, or backtest engine the pipeline uses. Those choices are the moat. The architecture, however, is fair game and educational. Nothing here is financial advice.

Three sessions, three threshold sets

The single most important architectural decision in BullAlert is that the three US trading sessions are treated as independent universes. Pre-market (4:00–9:30 AM ET), regular hours (9:30 AM–4:00 PM ET), and after-hours (4:00–8:00 PM ET) have radically different liquidity profiles, average volumes, and microstructure. A signal that's noteworthy at 4:30 AM against the prior day's regular-session close is unremarkable at 10:30 AM against today's open. Treating them as one continuous tape would produce false positives by construction.

Concretely: each (ticker, day, session) triple gets its own leaderboard row. catch% is measured against the correct prior-session reference: pre-market and regular hours against the prior day's regular-session close, after-hours against today's regular-session close. peak% is the running max within that session, refreshed every five minutes and frozen when the session ends. At a session boundary, every live row resets and the previous session's row is archived to the snapshot table.
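A minimal sketch of the session split and reference selection described above. The names (`Session`, `session_for`, `reference_close`, `catch_pct`) are hypothetical, and the half-open boundaries (9:30 belongs to regular hours, 4:00 PM to after-hours) are an assumption; the post only gives the session windows.

```python
from datetime import time
from enum import Enum
from typing import Optional


class Session(Enum):
    PRE = "pre-market"
    RTH = "regular"
    AH = "after-hours"


def session_for(t: time) -> Optional[Session]:
    """Map an Eastern-time wall-clock time to a session, or None when closed."""
    if time(4, 0) <= t < time(9, 30):
        return Session.PRE
    if time(9, 30) <= t < time(16, 0):
        return Session.RTH
    if time(16, 0) <= t < time(20, 0):
        return Session.AH
    return None


def reference_close(session: Session,
                    prior_day_close: float,
                    today_close: Optional[float]) -> float:
    """Pick the reference price the session's percentages are measured against."""
    if session is Session.AH:
        if today_close is None:
            raise ValueError("after-hours needs today's regular-session close")
        return today_close
    # Pre-market and regular hours both use the prior day's regular close.
    return prior_day_close


def catch_pct(price: float, reference: float) -> float:
    """Percentage change of price versus the session reference."""
    return (price / reference - 1.0) * 100.0
```

Keeping the reference lookup in one function means no downstream surface has to decide which close applies to which session.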

The thresholds inside each session are different too — required volume floors, RVOL multipliers, sentiment density thresholds — but those numbers stay internal. What's public is the principle: the same generic signal is scored against three different bars depending on when it fires.

The sentiment layer — accumulation, not decision

The first stage runs continuously, around the clock. Its job is to track which tickers are being talked about across a basket of sixty-plus financial communities and newsfeeds. It doesn't care about price. It accumulates: how many independent sources have surfaced the ticker in the last 24 hours, how recently, with what tone, with what catalyst tag.

Architecturally, the sentiment stage is a simple upsert loop. Pull a batch of recent posts, extract the tickers (NER plus a regex universe), validate each ticker against a live snapshot (the universe shifts daily as new tickers list and old ones delist; you don't want to score noise on dead symbols), score the sentiment with a fast language model on the rolling text, and upsert a row into the candidates table. That's it.
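The loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the production job: the cashtag regex stands in for the NER-plus-regex extraction, `score_tone` is a hypothetical callable standing in for the language model, the candidates table is a plain dict, and the 24-hour window is left out.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Set, Tuple

# Assumption: tickers appear as cashtags such as "$ABCD".
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")

Post = Tuple[str, float, str]  # (source name, unix timestamp, text)


@dataclass
class Candidate:
    ticker: str
    sources: Set[str] = field(default_factory=set)
    mentions: int = 0
    tone_sum: float = 0.0
    last_seen: float = 0.0


def accumulate(posts: Iterable[Post],
               universe: Set[str],
               score_tone: Callable[[str], float],
               pool: Dict[str, Candidate]) -> None:
    """Upsert one batch of posts into the candidate pool, in place."""
    for source, ts, text in posts:
        # Validate against the live universe before doing any scoring.
        tickers = set(CASHTAG.findall(text)) & universe
        if not tickers:
            continue
        tone = score_tone(text)
        for ticker in tickers:
            row = pool.setdefault(ticker, Candidate(ticker))
            row.sources.add(source)
            row.mentions += 1
            row.tone_sum += tone
            row.last_seen = max(row.last_seen, ts)
```

Note that nothing here looks at price or market hours, which is what lets the stage run around the clock.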

The job doesn't make signal decisions. It feeds a candidate pool that the next stage will rank. Separating accumulation from decision is the cleanest simplification we made — the sentiment job runs 24/7 without caring about market hours, and the decision job runs only inside sessions and consumes the accumulated state.

The hunter loop — bulk-screen, enrich, score, emit

The decision stage runs every five minutes inside trading sessions. The shape:

  1. Heat scoring. Rank the candidate pool by a composite heat score (recency, mention density, community spread, sentiment, persistence, freshness). The composition stays internal; the principle is that "lots of independent recent sources saying something" beats "one source shouting".
  2. Bulk screen. Pull current snapshots for the top ~150 candidates in one batched call. This is where the heaviest pre-filtering happens — apply session-aware thresholds for volume, price band, change-from-reference, and reject anything outside the universe.
  3. Enrich. For the survivors (typically ~25), pull intraday bars and compute the technical layer: indicators, VWAP distance, opening-range structure, pattern detection. This is where the per-call cost lives; doing it for 25 candidates instead of 150 is what makes the pipeline economical.
  4. Score and gate. Run the twenty internal gates against each enriched candidate. Anything that fails a hard gate drops out. Tier-grade what's left (S, A, B, C, D) based on how many confirming layers stack.
  5. Emit one signal per cycle. The highest-conviction candidate that passes all gates and isn't a duplicate of an existing same-session signal becomes the published signal. Per-session caps prevent a single session from over-firing (5 pre-market, 10 regular-hours, 5 after-hours).
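The five steps above can be sketched as one funnel function. Everything pluggable here (`screen`, `enrich`, `passes_gates`, `conviction`) is a hypothetical callable, since the real implementations are internal; only the top-150 cut and the 5 / 10 / 5 caps come from the post.

```python
from typing import Callable, Dict, List, Optional

# Caps per session as stated in the post: pre-market, regular hours, after-hours.
SESSION_CAPS = {"PM": 5, "RTH": 10, "AH": 5}
TOP_N = 150  # size of the bulk-screen batch


def run_cycle(session: str,
              heat: Dict[str, float],
              screen: Callable[[List[str]], List[str]],
              enrich: Callable[[str], dict],
              passes_gates: Callable[[dict], bool],
              conviction: Callable[[dict], float],
              published: List[str]) -> Optional[str]:
    """Run one five-minute cycle and return the emitted ticker, if any."""
    if len(published) >= SESSION_CAPS[session]:
        return None  # session cap already reached
    # 1. Heat scoring: rank the pool, keep the hottest TOP_N.
    ranked = sorted(heat, key=heat.get, reverse=True)[:TOP_N]
    # 2. Bulk screen: one batched pass that drops most candidates cheaply.
    survivors = screen(ranked)
    best, best_score = None, float("-inf")
    for ticker in survivors:
        if ticker in published:
            continue  # duplicate of an existing same-session signal
        # 3. Enrich: the expensive per-ticker step, survivors only.
        features = enrich(ticker)
        # 4. Score and gate: hard pass/fail, then rank what is left.
        if not passes_gates(features):
            continue
        score = conviction(features)
        if score > best_score:
            best, best_score = ticker, score
    # 5. Emit at most one signal per cycle.
    if best is not None:
        published.append(best)
    return best
```

The ordering is the design choice: the cheap batched screen runs before the per-ticker enrichment, so the expensive calls are only spent on survivors.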

There is intentionally no model in the loop that "predicts" outcomes. The gates measure structural facts about the current state of a ticker; the gating stack ranks; the cap discriminates. Predictive ML over the catch could be added — and might be, in the future — but the current pipeline is deliberately not predictive. It's a filter.

The twenty internal gates

Every candidate has to clear twenty gates to become a published signal. They are organised into four groups; the categories are public, the thresholds are not.

  • Liquidity & structure (4 gates): active session, volume floor, float guardrails, price band.
  • Momentum & flow (4 gates): relative volume regime, momentum acceleration, session-aware change, multi-bar trend confirmation.
  • Quality & catalyst (4 gates): sentiment density, catalyst presence, scoring threshold, tier eligibility.
  • Late-catch & fader rejection (4 gates): late-catch guard, fader rejection, drawdown brake, per-session cap.

The gates are hard, not weighted. A candidate has to clear every relevant gate; there's no partial credit. We chose hard gates because partial-credit scoring is the fastest way to over-fit on retrospective data — every "this would have caught X if we weighted Y a bit higher" change leaks tomorrow's losers into today's winners. Hard gates that are explicit, named, and have a binary outcome are easier to reason about and easier to test.
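As an illustration of what "hard, not weighted" means in code, here is a small sketch. The gate names echo the public categories, but the thresholds are made-up placeholders and the `Gate` and `evaluate` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


@dataclass(frozen=True)
class Gate:
    name: str
    check: Callable[[Features], bool]  # binary outcome, no partial credit


def evaluate(gates: List[Gate], features: Features) -> Tuple[bool, List[str]]:
    """Return (passed, names of failed gates). One failure rejects the candidate."""
    failed = [g.name for g in gates if not g.check(features)]
    return (not failed, failed)


# Placeholder thresholds for illustration only; the real ones are not public.
EXAMPLE_GATES = [
    Gate("volume_floor", lambda f: f["volume"] >= 100_000),
    Gate("price_band", lambda f: 1.0 <= f["price"] <= 20.0),
    Gate("relative_volume", lambda f: f["rvol"] >= 2.0),
]
```

Returning the list of failed gate names, rather than a blended score, is what makes a rejection easy to explain and to unit-test.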

catch% and peak% — the only user-facing metrics

Here is the single most important rule in the system: the catch% and peak% shown on every surface (in-app, in email, on the public history, on the landing page) must be byte-identical for the same (ticker, day, session) triple. There is no "approximately" and no "trending now" proxy. Any surface that displays catch% must trace its read back to one canonical column.

The two numbers:

  • catch% — frozen at the moment we caught the ticker, against the correct session reference. Once written, it never updates. This is what determines "how early" a catch was.
  • peak% — running max from catch through the session end. Refreshed every five minutes during live cycles, then reconciled against bar-resolution truth in the daily cleanup pass.
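The two definitions above map onto a small row model. This is a sketch with hypothetical names (`SignalRow`, `observe`, `close`), and it assumes peak% starts equal to catch% at the moment of the catch.

```python
from dataclasses import dataclass


@dataclass
class SignalRow:
    ticker: str
    day: str
    session: str
    catch_pct: float   # written once at catch time, never updated
    peak_pct: float    # running max from catch to session end
    closed: bool = False

    @classmethod
    def open(cls, ticker: str, day: str, session: str, catch_pct: float) -> "SignalRow":
        # Assumption: the running max starts at the catch value.
        return cls(ticker, day, session, catch_pct, peak_pct=catch_pct)

    def observe(self, change_pct: float) -> None:
        """Apply one live five-minute sample; ignored once the session is closed."""
        if not self.closed:
            self.peak_pct = max(self.peak_pct, change_pct)

    def close(self) -> None:
        """Freeze the row at the session boundary."""
        self.closed = True
```

Every display surface would read `catch_pct` and `peak_pct` from this one row rather than recomputing them.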

The honest measure of edge is the gap between them: the catch-to-peak delta. We sort every leaderboard surface by that delta, not by raw peak%. A catch at +107% that peaks at +136% (delta +29pp) outranks a catch at +40% that peaks at +64% (delta +24pp) because its delta is larger; the raw peak% and the catch% level play no part in the ordering. The delta is the only number that is actually about the scanner's earliness.
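Using the two catches from the paragraph above, the sort can be sketched like this. The ticker symbols and function names are placeholders.

```python
rows = [
    {"ticker": "AAAA", "catch_pct": 40.0, "peak_pct": 64.0},
    {"ticker": "BBBB", "catch_pct": 107.0, "peak_pct": 136.0},
]


def delta_pp(row: dict) -> float:
    """Catch-to-peak delta in percentage points."""
    return row["peak_pct"] - row["catch_pct"]


def leaderboard(rows: list) -> list:
    """Order rows by delta, largest first; raw peak% is not part of the key."""
    return sorted(rows, key=delta_pp, reverse=True)
```

Here `BBBB` (delta 29pp) sorts ahead of `AAAA` (delta 24pp).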

Backtest validation — used differently than you'd expect

We use an event-driven backtester to validate the scoring layer, not to generate trade signals. The distinction matters. A typical backtest asks "if I had traded these signals, what would my P&L look like?" Our backtest asks "if I had used this version of the scoring stack on the last 90 days of accumulated candidate data, what would the catch-to-peak delta have looked like, on average and at the tails?"

That framing keeps the backtester focused on a metric we can measure honestly — catch% to peak% — and avoids the fee/slippage/fill-modelling rabbit hole that strategy backtests have to confront. We're not optimising for realised P&L (we don't trade) — we're optimising for the distance between what the scanner caught and what the session actually did.

Walk-forward validation, fold-by-fold floors on per-session recall and fader rejection, and a strict no-regression rule on each scoring change keep the metric honest over time. The choice of backtest engine — like the choice of market data feed — stays internal. There are several good options out there; the backtest engines guide covers them honestly.
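A rough sketch of the walk-forward split and the no-regression check, under assumptions: days are plain integer indices, the window sizes are arbitrary, and the metric names and floor values are hypothetical stand-ins for per-session recall and fader rejection.

```python
from typing import Dict, List, Tuple


def walk_forward_folds(n_days: int,
                       train_days: int,
                       test_days: int) -> List[Tuple[range, range]]:
    """Rolling folds where every test window lies strictly after its train window."""
    folds = []
    start = 0
    while start + train_days + test_days <= n_days:
        train = range(start, start + train_days)
        test = range(start + train_days, start + train_days + test_days)
        folds.append((train, test))
        start += test_days  # slide forward by one test window
    return folds


def no_regression(baseline: List[Dict[str, float]],
                  candidate: List[Dict[str, float]],
                  floors: Dict[str, float]) -> bool:
    """Accept a scoring change only if, in every fold, each metric clears
    its floor and is no worse than the baseline."""
    for base, cand in zip(baseline, candidate):
        for metric, floor in floors.items():
            if cand[metric] < floor or cand[metric] < base[metric]:
                return False
    return True
```

For a 90-day window, `walk_forward_folds(90, 60, 10)` yields three folds whose test windows cover days 60 to 89.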

Daily cleanup — where bar-truth wins

Live samples lie a little. The peak we observe at the 5-minute cycle that fires nearest the session high is usually accurate to within a fraction of a percent, but it's a sample, not a guarantee. The cleanup pass at session end re-derives peak% from actual bars, scoped to the session window, and overwrites the live sample with the bar-truth value when they disagree.
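The reconciliation step might look roughly like this. It is a sketch with hypothetical names. It assumes each bar is a (start time, high) pair and that the window runs from the catch time to the session end; the real bar format and the handling of the bar that contains the catch are not described in the post.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple

Bar = Tuple[datetime, float]  # (bar start time, bar high)


def bar_truth_peak_pct(bars: Iterable[Bar],
                       catch_time: datetime,
                       session_end: datetime,
                       reference: float) -> Optional[float]:
    """Re-derive peak% from bar highs inside the session window."""
    highs = [high for ts, high in bars if catch_time <= ts < session_end]
    if not highs:
        return None
    return (max(highs) / reference - 1.0) * 100.0


def reconcile(live_peak_pct: float,
              bars: Iterable[Bar],
              catch_time: datetime,
              session_end: datetime,
              reference: float,
              tol: float = 1e-9) -> float:
    """Bar truth wins whenever it disagrees with the live sample."""
    truth = bar_truth_peak_pct(bars, catch_time, session_end, reference)
    if truth is None or abs(truth - live_peak_pct) <= tol:
        return live_peak_pct
    return truth
```

Scoping the filter to the session window matters: a high printed after the session end must not leak into that session's peak%.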

Two cleanup runs per day:

  • Market close (4:15 PM ET): close regular-hours signals, refresh prices, ship the daily report.
  • After-hours final (8:15 PM ET): close pre-market and after-hours signals, run the full cleanup, archive performance, prune dormants.

The reason this matters: every signal in the public history reflects bar-resolution truth. Anyone is welcome to verify that catch% / peak% line up with public bar data on their own provider of choice.

Lessons we wish we'd known earlier

Per-session truth is a feature, not a complication

The first version of the system tried to maintain a single "today's catch" across all three sessions. It was a constant source of off-by-session bugs. Splitting into three independent (ticker, day, session) leaderboards is more rows, more reconciliation, and materially fewer wrong numbers downstream.

Hard gates beat soft scores

Soft-scored candidates always look better in retrospect because you can tune the weights against any past distribution. Hard gates fail loudly and stay honest. We have softer gates only at the periphery — the heat score that ranks the candidate pool — and only because it's a ranking, not a decision.

Don't predict, surface

The single biggest temptation in this space is to add a predictive model that ranks candidates by expected peak. It would probably even work, in-sample. We've chosen not to do that yet because the moment we ship a "predicted peak" number, every user reads it as a target, and the data tool starts implying execution decisions. Surface what's structurally real; let the user decide.

Pre-stage corporate actions

Reverse splits, stock splits, special dividends, and ticker-symbol changes are cheap to detect at start-of-day and expensive to debug after the fact. We pre-stage them every morning before the first cycle so that catch% never reflects a contaminated baseline.
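To see why a contaminated baseline matters, consider a 1-for-10 reverse split. The sketch below is illustrative only: `adjusted_reference` is a hypothetical helper, `split_ratio` is assumed to mean new shares per old share, and subtracting a special dividend before applying the split is one possible convention, not necessarily the one the pipeline uses.

```python
def adjusted_reference(prior_close: float,
                       split_ratio: float = 1.0,
                       special_dividend: float = 0.0) -> float:
    """Adjust a prior close for corporate actions effective today.

    split_ratio is new shares per old share: 2.0 for a 2-for-1 split,
    0.1 for a 1-for-10 reverse split.
    """
    if split_ratio <= 0:
        raise ValueError("split_ratio must be positive")
    return (prior_close - special_dividend) / split_ratio


def change_pct(price: float, reference: float) -> float:
    """Percentage change of price versus a reference close."""
    return (price / reference - 1.0) * 100.0
```

A stock that closed at 2.00 and opens at 22.00 after a 1-for-10 reverse split shows roughly +1000% against the raw close but only about +10% against the adjusted reference of 20.00.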

What's not here, and why

We've described the architecture of the data system but not the recipe. The choice of market data feed, news provider, scoring weights, gating thresholds, and validation fixture is the moat — and a public-facing engineering post is the wrong place to publish a moat. The methodology of catch% and peak% is on the methodology page; every signal we've ever published is on the public history. Those are the surfaces that let any reader verify the system. The recipe stays internal.

Frequently asked questions

Why is BullAlert framed as a data tool and not a trading bot?

Because we don't trade. The product surfaces positive-momentum tickers within five minutes of their session crossing — the user computes their own P&L, owns their own broker, and makes their own entry/exit decisions. Keeping execution out of the product simplifies the surface area, sidesteps regulatory ambiguity, and lets the same data feed any downstream consumer (a notebook, a paper-trading sim, a Python bot).

What does "multi-session" actually mean in practice?

Three independent sessions per day: pre-market (4:00–9:30 AM ET), regular hours (9:30 AM–4:00 PM ET), after-hours (4:00–8:00 PM ET). Each session has its own threshold set, its own catch%, its own peak%, and its own per-session caps. A pre-market signal does not carry forward into regular hours. Each session is scored against its correct prior-session reference.

How do the 20 internal scoring gates work?

They're hard filters, not weights. A candidate has to clear every relevant gate to become a published signal; there's no "partial credit". The gates are organised into four groups: liquidity & structure, momentum & flow, quality & catalyst, and late-catch / fader rejection. Each gate measures one thing and contributes a binary pass/fail. The recipe (thresholds, exact formulas, vendor weights) stays internal; the categories are public on the methodology page.

Why only one signal per cycle?

To force discrimination. Without a per-cycle cap, the scanner would publish the entire passing list every five minutes and the feed would be noisy by construction. One per cycle means each signal is the highest-conviction candidate at that moment; it forces the gating stack to actually rank, not just filter.

What's the daily-cleanup pass for?

Two jobs. First, reconcile each session's peak% from its live running-max sample to bar-resolution truth (re-derived from actual session-scoped bars at session close); catch% was already frozen at catch time and is not rewritten. Second, prune stale candidates that haven't been touched in 60 days and clear enrichment data older than 30 days. The cleanup is what makes the public history more trustworthy than a live snapshot.
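The pruning half of that pass could be sketched as below. The 60-day and 30-day windows come from the answer above; the row layout (`last_touched`, `enriched_at`, `enrichment`) and the function name are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict

STALE_AFTER = timedelta(days=60)      # candidates untouched this long are dropped
ENRICHMENT_TTL = timedelta(days=30)   # enrichment older than this is cleared


def prune(candidates: Dict[str, dict], now: datetime) -> Dict[str, dict]:
    """Return a pruned copy of the candidate pool; the input is not mutated."""
    kept = {}
    for ticker, row in candidates.items():
        if now - row["last_touched"] > STALE_AFTER:
            continue  # dormant candidate
        row = dict(row)  # shallow copy so the caller's row stays intact
        enriched_at = row.get("enriched_at")
        if enriched_at is not None and now - enriched_at > ENRICHMENT_TTL:
            row["enrichment"] = None
            row["enriched_at"] = None
        kept[ticker] = row
    return kept
```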
