Retail Financial Sentiment as a Systematic Signal: 18 Years, 426M Messages

Abstract

We characterize retail financial sentiment as a systematic, tradeable signal built on the largest social-finance platform. The underlying corpus holds 426M+ human-written messages posted since 2009, roughly 18 years of continuous coverage across 50,000+ tickers, growing by 6-7M new posts per month. Critically, sentiment is human-labeled bullish or bearish at the point of authorship - it is ground truth, not inferred by a language model after the fact.

From this corpus we derive a per-ticker sentiment score on a 0-100 scale, refreshed every five minutes and constructed to be point-in-time safe. Long/short strategies driven by the score earn an 18.1% CAGR at a 1.20 Sharpe on the NASDAQ-100 and a 16.4% CAGR at a 0.86 Sharpe on the S&P-500, and the ground-truth signal outperforms an X/Twitter NLP-sentiment baseline (122% versus 95% cumulative long/short return). We are explicit about scope: sentiment is one signal among many, the population skews retail, and nothing here is investment advice.

Data & Coverage

The dataset is the message history of the largest social-finance platform. Every post is human-written and carries an author-supplied directional tag, which makes the corpus uniquely suited to sentiment work: the label is the user's own stated stance, not a model's guess. The current footprint:

Dimension	Value
Human-written messages	426M+ since 2009
History	~18 years, continuous
Tickers covered	50,000+
New posts per month	6-7M
Corpus size	250-270 TB

Three products expose this corpus at different layers of resolution, from the raw stream to a distilled score:

Product	Description	Cadence
Firehose	Full real-time message stream	~200ms lag
Symbol Event	Ticker-level engagement: volume, likes, pageviews, watchlist adds	Real-time
Sentiment	Per-ticker human-labeled sentiment score, 0-100	5-minute refresh

Firehose carries the full real-time message stream at roughly 200ms lag. Symbol Event aggregates ticker-level engagement - message volume, likes, pageviews, and watchlist adds - and Sentiment distills the human labels into the five-minute per-ticker score that the rest of this paper studies.

Sentiment Construction: Human Labels vs NLP Inference

Most financial-sentiment products begin with unlabeled text and apply a natural-language model to guess whether a post is bullish or bearish. That inference step is where error and look-ahead bias enter: the model can be wrong, it can drift as language changes, and re-scoring historical text with a newer model silently rewrites the past.

Our construction removes that step entirely. Each message is labeled bullish or bearish by its author at the moment it is posted, so the label is observed, not predicted. We aggregate these ground-truth labels per ticker into a single score on a 0-100 scale, where higher values reflect more bullish posting. The score is recomputed every five minutes, so it tracks shifts in retail positioning intraday rather than only at the daily close.
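
To make the aggregation concrete, here is a minimal sketch of one way author-supplied labels could be rolled up into a 0-100 score per ticker and five-minute bucket. The exact production formula is not stated in this paper; the bullish-share definition below (100 x bullish / (bullish + bearish)), the function names, and the message tuple layout are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its five-minute bucket."""
    return ts - timedelta(
        minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond
    )


def sentiment_scores(messages):
    """Aggregate author labels into a 0-100 bullish-share score.

    messages: iterable of (ticker, timestamp, label), where label is
    'bullish' or 'bearish' (hypothetical layout, not the real schema).
    Returns {(ticker, bucket_start): score}.
    """
    counts = defaultdict(lambda: [0, 0])  # [bullish, bearish] per key
    for ticker, ts, label in messages:
        key = (ticker, bucket_start(ts))
        if label == "bullish":
            counts[key][0] += 1
        elif label == "bearish":
            counts[key][1] += 1
        # Unlabeled or unknown labels are ignored in this sketch.
    return {
        key: 100.0 * bull / (bull + bear)
        for key, (bull, bear) in counts.items()
        if bull + bear > 0
    }
```

Under this assumed definition, a bucket with three bullish posts and one bearish post scores 75, and a bucket with no labeled posts has no score at all rather than a default value.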

Ground truth

Sentiment source

Bullish/bearish labels are supplied by message authors at post time, not inferred by an NLP model after the fact. There is no inference step to be wrong or to drift.

Backtest Results

We test the score as a long/short signal: go long the most bullish names and short the most bearish, rebalanced on the score, on two standard universes, the NASDAQ-100 and the S&P-500. Results are reported as-is.
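
The portfolio rule can be sketched as a simple rank-and-select step. The paper does not specify how many names sit on each side, how positions are weighted, or how ties are broken, so the equal weighting, the `n_each` parameter, and the alphabetical tie-break below are assumptions for illustration only.

```python
def long_short_weights(scores, n_each):
    """Build an equal-weighted, dollar-neutral long/short book.

    scores: {ticker: score on the 0-100 scale}.
    n_each: number of names on each side (hypothetical parameter).
    Returns {ticker: weight}; longs sum to +1, shorts sum to -1.
    """
    if n_each < 1 or len(scores) < 2 * n_each:
        return {}  # not enough names to fill both sides without overlap
    # Sort ascending by score; break ties by ticker for determinism.
    ranked = sorted(scores, key=lambda t: (scores[t], t))
    shorts = ranked[:n_each]   # most bearish names
    longs = ranked[-n_each:]   # most bullish names
    weights = {t: 1.0 / n_each for t in longs}
    weights.update({t: -1.0 / n_each for t in shorts})
    return weights
```

Calling this at each rebalance with the latest scores gives target weights; names in the middle of the ranking receive no position.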

18.1%

NASDAQ-100 long/short CAGR

Compound annual growth rate of the long/short strategy on the NASDAQ-100 universe.

1.20

NASDAQ-100 long/short Sharpe

Risk-adjusted return of the same NASDAQ-100 long/short strategy.

Strategy	CAGR	Sharpe
NASDAQ-100 long/short	18.1%	1.20
S&P-500 long/short	16.4%	0.86
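
For readers who want to reproduce the two metrics in the table on their own return series, the standard definitions can be sketched as follows. The return frequency, annualization factor, and risk-free rate behind the reported figures are not stated here, so the defaults below (daily returns, 252 periods per year, zero risk-free rate) are assumptions.

```python
import math
from statistics import mean, stdev


def cagr(returns, periods_per_year=252):
    """Compound annual growth rate from a list of per-period simple returns."""
    growth = math.prod(1.0 + r for r in returns)  # total growth multiple
    years = len(returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0


def sharpe(returns, periods_per_year=252, risk_free_per_period=0.0):
    """Annualized Sharpe ratio: mean excess return over its sample std dev."""
    excess = [r - risk_free_per_period for r in returns]
    sd = stdev(excess)  # sample standard deviation; needs >= 2 observations
    if sd == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return mean(excess) / sd * math.sqrt(periods_per_year)
```

Both functions take the strategy's per-period long/short returns; changing `periods_per_year` adapts them to weekly or monthly series.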

The ground-truth signal also beats an inferred-sentiment baseline directly. Holding the strategy construction fixed and swapping only the sentiment source, human labels outperform an X/Twitter NLP-sentiment feed on cumulative long/short return:

122% vs 95%

Ground-truth vs NLP baseline (cumulative long/short return)

Human-labeled sentiment returns 122% cumulatively against 95% for an X/Twitter NLP-sentiment baseline, over the same period and construction.

Why Ground-Truth Labels Beat Inferred Sentiment

The 122% versus 95% gap is not incidental - it follows from where the label comes from. An NLP classifier sits between the text and the signal, and that layer introduces classification error, sarcasm and slang failures, and model drift over an 18-year window where market language has changed substantially. A label authored by the trader carries none of that intermediate noise.

Ground-truth labels are also robust to the look-ahead trap that quietly inflates many inferred-sentiment backtests. Because the label exists at post time and never changes, there is no temptation to re-score history with a better model. The signal you would have had in real time is exactly the signal the backtest sees - which is why the live edge survives where re-inferred sentiment often does not.

Point-in-Time Integrity & Spam Filtering

For the score to be usable in a backtest, it must reflect only information available at each timestamp. The score is constructed to be point-in-time safe: it is refreshed every five minutes and stabilizes within roughly ten minutes of the underlying activity, so a value read at time t encodes only posts authored at or before t. This is what lets the backtest results above stand without a hidden look-ahead.
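
The point-in-time property amounts to an as-of lookup: a backtest asking for the score at time t must receive the latest value published at or before t, never a later one. The sketch below illustrates that idea with a hypothetical `ScoreHistory` class; it is not the product's API, and it does not model the roughly ten-minute stabilization window.

```python
from bisect import bisect_right


class ScoreHistory:
    """Append-only score history for one ticker (illustrative, hypothetical)."""

    def __init__(self):
        self._times = []   # publication timestamps, ascending
        self._values = []  # score published at each timestamp

    def append(self, ts, value):
        """Record a score; timestamps must arrive in non-decreasing order."""
        if self._times and ts < self._times[-1]:
            raise ValueError("history is append-only; past values never change")
        self._times.append(ts)
        self._values.append(value)

    def as_of(self, t):
        """Return the latest score published at or before t, else None."""
        i = bisect_right(self._times, t)  # count of entries with time <= t
        return self._values[i - 1] if i else None
```

Timestamps can be any mutually comparable values, for example epoch seconds or `datetime` objects. Because the history is append-only, a lookup for a past t returns the same answer no matter when it is run.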

Raw social data is also noisy with promotion and manipulation, which would otherwise contaminate the labels. An AI/ML spam filter removes coordinated stock-pumps and promotional posts before they reach the score:

~95%

Spam filter catch rate

The AI/ML filter catches roughly 95% of 2024 promotions and stock-pumps, so the sentiment score reflects genuine retail positioning rather than coordinated manipulation.

Limitations

Three limits bound these results. First, sentiment is one signal, not a complete strategy - it is most useful combined with price, liquidity, and fundamental inputs rather than traded in isolation. Second, the population is retail-skewed: it captures the positioning of individual traders on a social platform, which is informative but is not the whole market and behaves differently from institutional flow. Third, the spam filter, while strong, is not perfect, so a residual fraction of promotional posts can still reach the score.

None of the above is investment advice. The results are descriptive and historical; past performance does not guarantee future results.

Conclusion

Eighteen years and 426M+ human-written, human-labeled messages give retail financial sentiment a depth and a cleanliness that inferred-sentiment feeds cannot match. Because the bullish/bearish label is ground truth rather than a model's guess, the derived per-ticker score is both point-in-time safe and predictive, powering long/short strategies at 18.1% CAGR and a 1.20 Sharpe on the NASDAQ-100 and beating an NLP-sentiment baseline 122% to 95%. We hold the signal to an honest standard: it is one input, the population is retail-skewed, and it is offered as a measured edge, not as investment advice.
