RETAIL INVESTOR DATA
Financial Social Intelligence
Exclusive retail investor sentiment from the largest social finance platform.
See the data
Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.
Raw message, author-labeled (firehose)
Representative shape, not real data. sentiment is the author's own label at post time - the ground truth, not inferred.
{
"id": 582914037,
"body": "$NVDA incredible earnings beat. Blackwell demand is insane.",
"created_at": "2026-02-27T14:23:11Z",
"sentiment": "Bullish",
"user": { "id": 1847293, "username": "<handle>", "followers": 12840, "ideas": 8741, "join_date": "2015-03-12" },
"symbols": ["NVDA"],
"likes_total": 312,
"replies": 47
}
Curated message, relinked symbol
Representative. Shows the curation and symbol-mapping provenance that makes every downstream signal row auditable.
{
"record_id": "post:1lpgxpa",
"source_platform": "<social-finance platform>",
"created_at_utc": "2025-07-01T23:34:45Z",
"body_or_title": "QQQ calls printing today",
"symbols_detected": ["QQQ"],
"requested_symbol": null,
"mapping_method": "alias_relinked",
"mapping_confidence": 0.65,
"finance_intent_score": 0.68,
"curation_band": "strict",
"curation_bucket": "signal_keep"
}
Signal panel row, one symbol-day (the quant product)
Representative. One row per symbol-day; trade_date makes it look-ahead-free for backtesting.
{
"date": "2023-01-02",
"symbol": "AAPL",
"mention_count": 13,
"weighted_mention_count": 6.96,
"unique_author_count": 5,
"sentiment_score": 0.123,
"conviction_score": 0.236,
"composite_signal": 0.324,
"confidence_score": 0.617,
"avg_spam_probability": 0.005,
"trade_date": "2023-01-03",
"trade_lag_days": 1,
"company_name": "Apple Inc."
}
Record shape
Every field, its type, whether it can be null, and a representative value.
| Field | Type | Constraint | Description |
|---|---|---|---|
| date | date | required | Signal (observation) date for the symbol-day row; the tradeable date is carried separately in trade_date. e.g. 2023-01-02 |
| symbol | string | required | Ticker the row aggregates. e.g. AAPL |
| mention_count | int | required | Raw message mentions that day. e.g. 13 |
| weighted_mention_count | float | required | Mentions weighted by message and author quality. e.g. 6.96 |
| unique_author_count | int | required | Distinct authors mentioning the symbol. e.g. 5 |
| sentiment_score | float · 0..1 | required | Quality-weighted continuous sentiment. e.g. 0.123 |
| conviction_score | float | nullable | Conviction from engagement and author authority. e.g. 0.236 |
| composite_signal | float | nullable | Composite multi-feature daily signal. e.g. 0.324 |
| confidence_score | float | required | Coverage and quality confidence for the row. e.g. 0.617 |
| avg_spam_probability | float · 0..1 | required | Mean spam-model probability across the day's messages. e.g. 0.005 |
| trade_date | date | required | First date the signal is tradeable - makes the panel look-ahead-free. e.g. 2023-01-03 |
| trade_lag_days | int · days | required | Lag from signal date to tradeable date. e.g. 1 |
| company_name | string | nullable | Security-master join for the ticker. e.g. Apple Inc. |
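A minimal sketch of how a consumer might check a delivered row against the table above. The types, nullability, and the two 0..1 ranges are transcribed from the table; the function name `validate_row`, the treatment of dates as ISO-8601 strings (as in the sample JSON), and the choice to accept integers where a float is expected are assumptions for illustration, not part of the delivered tooling.

```python
from datetime import date

# Field -> (kind, nullable, optional inclusive range), transcribed from the table.
# "date" is an assumption: dates are treated as ISO-8601 strings, as in the sample JSON.
SCHEMA = {
    "date": ("date", False, None),
    "symbol": (str, False, None),
    "mention_count": (int, False, None),
    "weighted_mention_count": (float, False, None),
    "unique_author_count": (int, False, None),
    "sentiment_score": (float, False, (0.0, 1.0)),
    "conviction_score": (float, True, None),
    "composite_signal": (float, True, None),
    "confidence_score": (float, False, None),
    "avg_spam_probability": (float, False, (0.0, 1.0)),
    "trade_date": ("date", False, None),
    "trade_lag_days": (int, False, None),
    "company_name": (str, True, None),
}


def validate_row(row):
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    for field, (kind, nullable, bounds) in SCHEMA.items():
        if field not in row:
            problems.append(f"{field}: missing")
            continue
        value = row[field]
        if value is None:
            if not nullable:
                problems.append(f"{field}: null not allowed")
            continue
        if kind == "date":
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{field}: not an ISO date")
            continue
        # Accept ints where a float is expected; reject bools, which subclass int.
        type_ok = isinstance(value, kind) or (kind is float and isinstance(value, int))
        if isinstance(value, bool) or not type_ok:
            problems.append(f"{field}: expected {kind.__name__}")
            continue
        if bounds is not None and not (bounds[0] <= value <= bounds[1]):
            problems.append(f"{field}: outside {bounds[0]}..{bounds[1]}")
    return problems


# The representative signal panel row shown earlier on this page.
sample = {
    "date": "2023-01-02", "symbol": "AAPL", "mention_count": 13,
    "weighted_mention_count": 6.96, "unique_author_count": 5,
    "sentiment_score": 0.123, "conviction_score": 0.236,
    "composite_signal": 0.324, "confidence_score": 0.617,
    "avg_spam_probability": 0.005, "trade_date": "2023-01-03",
    "trade_lag_days": 1, "company_name": "Apple Inc.",
}
print(validate_row(sample))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a batch at once.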
Firehose
Full message stream with user metadata, tickers, sentiment flags, and activity (likes, reshares, follows).
Symbol Event
Ticker-level engagement - message volume, likes, pageviews, watchlist changes. Real-time.
Sentiment
Per-ticker sentiment scores refreshed every 5 minutes. Published backtest: 18.1% CAGR on a NASDAQ-100 long/short portfolio.
How it is built
- 01
Licensed collection
Bulk export under license from private social-finance apps with 10 to 18 years of history, plus real-time ingestion from public-platform APIs. No web scraping.
- 02
PII removal
All personally identifiable information is stripped before delivery: user IDs pseudonymized, real names removed, in-text PII redacted. The corpus is PII-free and MNPI-free.
- 03
Three-layer construction
The broad layer (the cleaned underlying feed with source attribution) feeds the curated layer (strict and balanced bands, the audit bridge), which in turn feeds the signal layer (the post-processed symbol-day panel). The wide band is excluded from delivery.
- 04
Spam and promo filtering
Multi-layer filtering: platform moderation, bot detection on posting frequency, content patterns and account age, and known-spam exclusion. The filter removes roughly 95% of coordinated 2024 promotions.
- 05
Symbol mapping with provenance
Each message links to securities through a mapping view, and every signal row is traceable back to the exact contributing messages by record_id.
- 06
Point-in-time aggregation
Only message-time information enters feature construction - no future-return data. The signal panel carries trade_date and trade_lag_days so backtests are look-ahead-free.
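The point-in-time step above can be sketched as a join that keys returns on trade_date rather than on the signal date. This is an illustration under stated assumptions, not the delivered pipeline: `align_signals` and the `forward_returns` mapping (keyed by symbol and ISO date, holding the return earned by a position opened on that date) are hypothetical, and the sanity check that trade_date never precedes date is our own addition.

```python
from datetime import date


def align_signals(signal_rows, forward_returns):
    """Pair each signal with the return keyed by its trade_date, never its signal date.

    forward_returns is a hypothetical mapping {(symbol, iso_date): return}.
    """
    aligned = []
    for row in signal_rows:
        signal_day = date.fromisoformat(row["date"])
        trade_day = date.fromisoformat(row["trade_date"])
        if trade_day < signal_day:
            # A tradeable date before the signal date would imply look-ahead.
            raise ValueError(f"trade_date precedes date for {row['symbol']}")
        key = (row["symbol"], row["trade_date"])
        if key in forward_returns:
            aligned.append({
                "symbol": row["symbol"],
                "signal_date": row["date"],
                "trade_date": row["trade_date"],
                "signal": row["composite_signal"],
                "forward_return": forward_returns[key],
            })
    return aligned


rows = [{"symbol": "AAPL", "date": "2023-01-02", "trade_date": "2023-01-03",
         "composite_signal": 0.324}]
# Illustrative numbers only, not real returns.
returns = {("AAPL", "2023-01-02"): 0.50, ("AAPL", "2023-01-03"): 0.01}
print(align_signals(rows, returns)[0]["forward_return"])  # prints 0.01
```

Joining on the signal date instead would pick up the 0.50 entry in this toy example, which is exactly the look-ahead error the trade_date column is meant to prevent.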
How we validate
What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.
NASDAQ-100 long/short
Measures
Whether the daily sentiment score separates winners from losers in a tradeable portfolio.
Method
Long the most-bullish names, short the most-bearish, rebalanced on the score across the NASDAQ-100 universe.
Result
Published backtest: 18.1% CAGR, 1.20 Sharpe. See the research paper for construction and caveats.
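For intuition, here is a minimal sketch of one rebalance of a score-ranked long/short book. The page does not state the quantile cut-off, the weighting scheme, or the rebalance frequency, so the 20% cut-off, equal weighting, and the function name `long_short_weights` are all assumptions; the tickers and scores are made up.

```python
def long_short_weights(scores, quantile=0.2):
    """Equal-weight long the top quantile by score, short the bottom quantile.

    scores: {symbol: sentiment score} for one rebalance date.
    The 20% default and equal weighting are illustrative assumptions.
    """
    ranked = sorted(scores, key=scores.get)  # ascending: most bearish first
    n = max(1, int(len(ranked) * quantile))
    if len(ranked) < 2 * n:
        raise ValueError("universe too small for separate long and short legs")
    weights = {symbol: 0.0 for symbol in ranked}
    for symbol in ranked[-n:]:  # most bullish names
        weights[symbol] = 1.0 / n
    for symbol in ranked[:n]:   # most bearish names
        weights[symbol] = -1.0 / n
    return weights


# Hypothetical tickers and scores, for illustration only.
scores = {"AAA": 0.91, "BBB": 0.12, "CCC": 0.55, "DDD": 0.40, "EEE": 0.73}
weights = long_short_weights(scores)
print(weights["AAA"], weights["BBB"])  # prints 1.0 -1.0
```

The long and short legs sum to zero net exposure, which is the usual reason a long/short construction is used to isolate the ranking signal from overall market direction.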
Ground truth vs NLP baseline
Measures
Whether the author-stated sentiment label beats an inferred NLP-sentiment feed.
Method
Hold the strategy construction fixed and swap only the sentiment source - human labels vs an NLP feed - over the same period.
Result
Author labels win: 122% vs 95% cumulative long/short return. The label is observed, not modeled.
Spam-filter catch rate
Measures
How much coordinated pump-and-promo content the filter removes.
Method
Evaluated against known 2024 stock-promotion campaigns.
Result
Roughly 95% of coordinated promotions removed.
Ground truth
What correct means for this data, and how it is established.
Ground truth
For sentiment, the author's own bullish or bearish label at post time - observed, not inferred, so there is zero sentiment-model risk. For signal validation, realized forward returns from the price panel.
How it is established
Every signal row is auditable end to end: security_day_signal back to message_security_view, join on record_id into curated messages, and optionally into the broad feed. Two delivered notebooks walk the full provenance trace. Baseline factors are scored against the price panel.
Agreement
No inter-rater figure applies - by design the sentiment label is the author's stated stance, so there is no separate human-rater pass to agree with.
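The provenance trace described under "How it is established" can be sketched with in-memory rows. Only record_id as the join key and the table names security_day_signal, message_security_view, and curated messages come from this page; the `symbol` and `date` columns assumed on the mapping view, and the function name `trace_signal_row`, are hypothetical. The delivered notebooks remain the reference for the real column names.

```python
def trace_signal_row(signal_row, message_security_view, curated_messages):
    """Return the curated messages that contributed to one symbol-day signal row.

    Assumes (hypothetically) that message_security_view rows carry
    symbol, date, and record_id columns.
    """
    key = (signal_row["symbol"], signal_row["date"])
    record_ids = [
        link["record_id"]
        for link in message_security_view
        if (link["symbol"], link["date"]) == key
    ]
    # Join on record_id into the curated layer.
    curated_by_id = {msg["record_id"]: msg for msg in curated_messages}
    return [curated_by_id[rid] for rid in record_ids if rid in curated_by_id]


# Toy rows shaped like the representative records on this page.
signal_row = {"symbol": "QQQ", "date": "2025-07-01"}
view = [
    {"symbol": "QQQ", "date": "2025-07-01", "record_id": "post:1lpgxpa"},
    {"symbol": "AAPL", "date": "2025-07-01", "record_id": "post:other"},
]
curated = [
    {"record_id": "post:1lpgxpa", "body_or_title": "QQQ calls printing today"},
    {"record_id": "post:other", "body_or_title": "placeholder"},
]
for msg in trace_signal_row(signal_row, view, curated):
    print(msg["record_id"])  # prints post:1lpgxpa
```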
Retail Sentiment Signals
Identify retail investor conviction shifts before they show up in order flow. Track ticker-level attention as a leading indicator for L/S equity strategies.
Event-Driven Flow Detection
Detect retail pile-ins and narrative shifts around earnings, FDA decisions, and macro events. Real-time firehose enables sub-minute signal generation.
Attention as Alpha
Use engagement metrics (pageviews, watchlist adds, message velocity) as a proxy for retail attention - a proven uncorrelated signal for systematic strategies.
How you load it
Delivery
S3 push, REST API, WebSocket firehose, Restricted data room
Formats
Parquet, JSON, CSV
Auth
Licensed for internal research, model development, and portfolio management; redistribution prohibited without a separate agreement. PII-free and MNPI-free public commentary.
Cadence
Real-time streaming at roughly 200ms firehose latency, or daily batch. History is corrected and backfilled weekly.
# Quant entry point: the symbol-day signal panel
signal/normalized/security_day_signal.parquet
signal/normalized/security_day_research.parquet
# Engineering entry: raw feed + manifest
broad/messages.parquet
delivery_manifest.json
# Recommended first tests: 1-day, 5-day, 20-day horizons.
# Audit any signal row back to its source messages by record_id.
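As a starting point for the recommended 1-day, 5-day, and 20-day tests, the sketch below computes simple forward returns from a list of closing prices. It assumes horizons are counted in trading days (rows of the price series) and that you supply your own price data; `forward_returns` is a hypothetical helper, not part of the delivery.

```python
def forward_returns(closes, horizons=(1, 5, 20)):
    """Simple forward returns at each horizon, counted in rows of the price series.

    closes: closing prices for one symbol in date order.
    Returns {horizon: [return or None]}, with None where the horizon runs past the data.
    """
    result = {}
    for h in horizons:
        result[h] = [
            closes[i + h] / closes[i] - 1.0 if i + h < len(closes) else None
            for i in range(len(closes))
        ]
    return result


# Illustrative prices only.
closes = [100.0, 110.0, 121.0]
fwd = forward_returns(closes, horizons=(1, 2))
print(round(fwd[1][0], 4), round(fwd[2][0], 4))  # prints 0.1 0.21
```

When pairing these returns with the signal panel, index them by trade_date rather than by the signal date so the test stays look-ahead-free.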
Request access.
Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.