RETAIL INVESTOR DATA

Financial Social Intelligence

Exclusive retail investor sentiment from the largest social finance platform.

16 YEARS · JSON · CSV · Parquet · Real-time (WebSocket) or daily batch
426M+
Messages
18 yrs
Continuous history
250+ TB
Raw corpus
3-layer
Auditable signal
01 Sample

See the data

Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.

Raw message, author-labeled (firehose)

Representative shape, not real data. The sentiment field is the author's own label at post time - the ground truth, not inferred.

messages.jsonl (representative)
{
  "id": 582914037,
  "body": "$NVDA incredible earnings beat. Blackwell demand is insane.",
  "created_at": "2026-02-27T14:23:11Z",
  "sentiment": "Bullish",
  "user": { "id": 1847293, "username": "<handle>", "followers": 12840, "ideas": 8741, "join_date": "2015-03-12" },
  "symbols": ["NVDA"],
  "likes_total": 312,
  "replies": 47
}
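
As a rough illustration of consuming this shape, the sketch below parses one firehose line with the Python standard library and maps the author label to a sign. The "Bearish" spelling, the handling of unlabeled posts, and the helper name `parse_message` are assumptions for illustration, not part of the delivered spec.

```python
import json
from datetime import datetime

# One line of messages.jsonl, copied from the representative record above.
RAW_LINE = (
    '{"id": 582914037, "body": "$NVDA incredible earnings beat. Blackwell demand is insane.", '
    '"created_at": "2026-02-27T14:23:11Z", "sentiment": "Bullish", '
    '"user": {"id": 1847293, "username": "<handle>", "followers": 12840, "ideas": 8741, '
    '"join_date": "2015-03-12"}, "symbols": ["NVDA"], "likes_total": 312, "replies": 47}'
)

# Assumption: labels are spelled "Bullish"/"Bearish"; anything else counts as unlabeled.
LABEL_TO_SIGN = {"Bullish": 1, "Bearish": -1}

def parse_message(line: str) -> dict:
    """Parse one JSONL line into the handful of fields a signal pipeline might need."""
    rec = json.loads(line)
    # Replace the trailing "Z" so fromisoformat also works on older Python versions.
    created = datetime.fromisoformat(rec["created_at"].replace("Z", "+00:00"))
    return {
        "id": rec["id"],
        "created_at": created,
        "symbols": rec.get("symbols", []),
        "label_sign": LABEL_TO_SIGN.get(rec.get("sentiment")),  # None when unlabeled
        "author_id": rec["user"]["id"],
    }

msg = parse_message(RAW_LINE)
print(msg["symbols"], msg["label_sign"], msg["created_at"].isoformat())
```

Keeping the label as a signed integer (or None) makes the observed stance easy to aggregate without running any sentiment model.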

Curated message, relinked symbol

Representative. Shows the curation and symbol-mapping provenance that makes every downstream signal row auditable.

curated/messages.jsonl (representative)
{
  "record_id": "post:1lpgxpa",
  "source_platform": "<social-finance platform>",
  "created_at_utc": "2025-07-01T23:34:45Z",
  "body_or_title": "QQQ calls printing today",
  "symbols_detected": ["QQQ"],
  "requested_symbol": null,
  "mapping_method": "alias_relinked",
  "mapping_confidence": 0.65,
  "finance_intent_score": 0.68,
  "curation_band": "strict",
  "curation_bucket": "signal_keep"
}
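
One way a consumer might use the curation fields is to gate records by band and mapping confidence before building features. The sketch below assumes a 0.6 confidence cut and a strict-only band; both thresholds, the helper `keep_for_signal`, and the second sample record are hypothetical.

```python
import json

# The first line mirrors the representative record above (trimmed to the fields used);
# the second is a made-up lower-confidence record added only to exercise the filter.
CURATED_LINES = [
    '{"record_id": "post:1lpgxpa", "mapping_method": "alias_relinked", '
    '"mapping_confidence": 0.65, "curation_band": "strict", "curation_bucket": "signal_keep"}',
    '{"record_id": "post:hypothetical", "mapping_method": "alias_relinked", '
    '"mapping_confidence": 0.40, "curation_band": "balanced", "curation_bucket": "signal_keep"}',
]

def keep_for_signal(rec, min_confidence=0.6, bands=("strict",)):
    """Return True when a curated record passes the assumed band and confidence cut."""
    conf = rec.get("mapping_confidence")
    return (
        rec.get("curation_band") in bands
        and conf is not None
        and conf >= min_confidence
    )

records = [json.loads(line) for line in CURATED_LINES]
kept = [r["record_id"] for r in records if keep_for_signal(r)]
print(kept)
```

Loosening the parameters, for example `bands=("strict", "balanced")` with a lower `min_confidence`, trades precision of symbol mapping for coverage.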

Signal panel row, one symbol-day (the quant product)

Representative. One row per symbol-day; trade_date makes it look-ahead-free for backtesting.

security_day_signal.jsonl (representative)
{
  "date": "2023-01-02",
  "symbol": "AAPL",
  "mention_count": 13,
  "weighted_mention_count": 6.96,
  "unique_author_count": 5,
  "sentiment_score": 0.123,
  "conviction_score": 0.236,
  "composite_signal": 0.324,
  "confidence_score": 0.617,
  "avg_spam_probability": 0.005,
  "trade_date": "2023-01-03",
  "trade_lag_days": 1,
  "company_name": "Apple Inc."
}
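
To show how trade_date is meant to be used, the sketch below pairs each signal value with a return keyed on trade_date rather than on date. The `entry_returns` mapping (the return of a position opened on that date) and its values are placeholders invented for illustration; `align_on_trade_date` is a hypothetical helper.

```python
from datetime import date

signal_rows = [
    {"date": "2023-01-02", "symbol": "AAPL", "composite_signal": 0.324,
     "trade_date": "2023-01-03", "trade_lag_days": 1},
]

# Placeholder: return of a position opened on the keyed date. Values are made up.
entry_returns = {("AAPL", "2023-01-02"): 0.0, ("AAPL", "2023-01-03"): -0.01}

def align_on_trade_date(rows, returns):
    """Pair each signal with the return that starts on its trade_date."""
    aligned = []
    for row in rows:
        # Guard against look-ahead: the tradeable date must come after the signal date.
        if date.fromisoformat(row["trade_date"]) <= date.fromisoformat(row["date"]):
            raise ValueError(f"trade_date must follow date: {row}")
        ret = returns.get((row["symbol"], row["trade_date"]))
        if ret is not None:
            aligned.append((row["symbol"], row["trade_date"], row["composite_signal"], ret))
    return aligned

aligned = align_on_trade_date(signal_rows, entry_returns)
print(aligned)
```

Joining on date instead would pair the signal with a return that was already under way when the signal was observed, which is the look-ahead the panel is designed to avoid.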
02 Schema

Record shape

Every field, its type, whether it can be null, and a representative value.

| Field | Type | Constraint | Description | Example |
|---|---|---|---|---|
| date | date | required | Trading date for the symbol-day observation. | 2023-01-02 |
| symbol | string | required | Ticker the row aggregates. | AAPL |
| mention_count | int | required | Raw message mentions that day. | 13 |
| weighted_mention_count | float | required | Mentions weighted by message and author quality. | 6.96 |
| unique_author_count | int | required | Distinct authors mentioning the symbol. | 5 |
| sentiment_score | float · 0..1 | required | Quality-weighted continuous sentiment. | 0.123 |
| conviction_score | float | nullable | Conviction from engagement and author authority. | 0.236 |
| composite_signal | float | nullable | Composite multi-feature daily signal. | 0.324 |
| confidence_score | float | required | Coverage and quality confidence for the row. | 0.617 |
| avg_spam_probability | float · 0..1 | required | Mean spam-model probability across the day's messages. | 0.005 |
| trade_date | date | required | First date the signal is tradeable - makes the panel look-ahead-free. | 2023-01-03 |
| trade_lag_days | int · days | required | Lag from signal date to tradeable date. | 1 |
| company_name | string | nullable | Security-master join for the ticker. | Apple Inc. |
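
The schema can be turned into a simple row check. The sketch below transcribes the types, nullability, and 0..1 ranges listed above into a dictionary and validates a JSON-decoded row. It assumes dates arrive as ISO strings, that nullable fields are still present as keys, and that an integer is acceptable where a float is declared; `validate` is an illustrative helper, not a delivered tool.

```python
from datetime import date

# (python type, nullable, must lie in 0..1), transcribed from the schema above.
SCHEMA = {
    "date": (str, False, False),
    "symbol": (str, False, False),
    "mention_count": (int, False, False),
    "weighted_mention_count": (float, False, False),
    "unique_author_count": (int, False, False),
    "sentiment_score": (float, False, True),
    "conviction_score": (float, True, False),
    "composite_signal": (float, True, False),
    "confidence_score": (float, False, False),
    "avg_spam_probability": (float, False, True),
    "trade_date": (str, False, False),
    "trade_lag_days": (int, False, False),
    "company_name": (str, True, False),
}
DATE_FIELDS = ("date", "trade_date")

def validate(row):
    """Return a list of human-readable problems; an empty list means the row passes."""
    errors = []
    for name, (typ, nullable, unit_interval) in SCHEMA.items():
        if name not in row:
            errors.append(f"{name}: missing")
            continue
        value = row[name]
        if value is None:
            if not nullable:
                errors.append(f"{name}: null not allowed")
            continue
        accepted = (int, float) if typ is float else typ
        if isinstance(value, bool) or not isinstance(value, accepted):
            errors.append(f"{name}: expected {typ.__name__}")
            continue
        if unit_interval and not 0.0 <= value <= 1.0:
            errors.append(f"{name}: outside 0..1")
        if name in DATE_FIELDS:
            try:
                date.fromisoformat(value)
            except ValueError:
                errors.append(f"{name}: not an ISO date")
    return errors

sample = {
    "date": "2023-01-02", "symbol": "AAPL", "mention_count": 13,
    "weighted_mention_count": 6.96, "unique_author_count": 5,
    "sentiment_score": 0.123, "conviction_score": 0.236, "composite_signal": 0.324,
    "confidence_score": 0.617, "avg_spam_probability": 0.005,
    "trade_date": "2023-01-03", "trade_lag_days": 1, "company_name": "Apple Inc.",
}
print(validate(sample))
```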
03 What's included

Firehose

Full message stream with user metadata, tickers, sentiment flags, and activity (likes, reshares, follows).

Symbol Event

Ticker-level engagement - message volume, likes, pageviews, watchlist changes. Real-time.

Sentiment

Per-ticker sentiment scores refreshed every 5 minutes. Published backtest: 18.1% CAGR on a NASDAQ-100 long/short strategy.

04 Methodology

How it is built

  1. Licensed collection

    Bulk export under license from private social-finance apps with 10 to 18 years of history, plus real-time ingestion from public-platform APIs. No web scraping.

  2. PII removal

    All personally identifiable information is stripped before delivery: user IDs pseudonymized, real names removed, in-text PII redacted. The corpus is PII-free and MNPI-free.

  3. Three-layer construction

    Data moves through three layers: broad (the cleaned underlying feed with source attribution), then curated (strict and balanced bands, the audit bridge), then signal (the post-processed symbol-day panel). The wide band is excluded from delivery.

  4. Spam and promo filtering

    Multi-layer filtering: platform moderation; bot detection on posting frequency, content patterns, and account age; and known-spam exclusion. Evaluated against known 2024 stock-promotion campaigns, the filter removes roughly 95% of coordinated promotions.

  5. Symbol mapping with provenance

    Each message links to securities through a mapping view, and every signal row is traceable back to the exact contributing messages by record_id.

  6. Point-in-time aggregation

    Only message-time information enters feature construction - no future-return data. The signal panel carries trade_date and trade_lag_days so backtests are look-ahead-free.
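
The point-in-time aggregation step can be sketched in a few lines. The example below groups messages into symbol-days using only message-time fields and stamps each row with a tradeable date. It is a simplified stand-in, not the production method: the UTC day bucket, the next-weekday rule for trade_date (holidays ignored), the unweighted `bullish_share`, and the `author` and `label` keys are all assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta

def next_weekday(d):
    """Assumed tradeable date: the next Monday-to-Friday day (holidays ignored)."""
    nxt = d + timedelta(days=1)
    while nxt.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        nxt += timedelta(days=1)
    return nxt

def aggregate(messages):
    """Build one row per (symbol, UTC day) from message-time information only."""
    groups = defaultdict(list)
    for m in messages:
        day = m["created_at_utc"][:10]  # assumption: bucket by UTC calendar day
        for sym in m["symbols"]:
            groups[(sym, day)].append(m)
    rows = []
    for (sym, day), msgs in sorted(groups.items()):
        labeled = [m for m in msgs if m["label"] in ("Bullish", "Bearish")]
        bullish = sum(m["label"] == "Bullish" for m in labeled)
        d = date.fromisoformat(day)
        td = next_weekday(d)
        rows.append({
            "date": day,
            "symbol": sym,
            "mention_count": len(msgs),
            "unique_author_count": len({m["author"] for m in msgs}),
            "bullish_share": bullish / len(labeled) if labeled else None,
            "trade_date": td.isoformat(),
            "trade_lag_days": (td - d).days,
        })
    return rows

# Made-up messages with pseudonymized authors, for illustration only.
messages = [
    {"author": "a1", "created_at_utc": "2023-01-06T14:00:00Z", "symbols": ["AAPL"], "label": "Bullish"},
    {"author": "a1", "created_at_utc": "2023-01-06T15:30:00Z", "symbols": ["AAPL"], "label": "Bearish"},
    {"author": "a2", "created_at_utc": "2023-01-06T20:05:00Z", "symbols": ["AAPL"], "label": None},
]
rows = aggregate(messages)
print(rows[0])
```

Nothing in `aggregate` reads prices or returns, which is the property the methodology describes: features depend only on what was known when the messages were posted.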

05 Evals

How we validate

What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.

NASDAQ-100 long/short

Measures

Whether the daily sentiment score separates winners from losers in a tradeable portfolio.

Method

Long the most-bullish names, short the most-bearish, rebalanced on the score across the NASDAQ-100 universe.

Result

Published backtest: 18.1% CAGR, 1.20 Sharpe. See the research paper for construction and caveats.
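
For orientation, a long/short construction of this general kind can be sketched as follows. This is a generic equal-weight top-k/bottom-k illustration, not the construction used in the published backtest; the tickers, scores, returns, and the choice of k are invented.

```python
def long_short_weights(scores, k):
    """Equal-weight the k highest-scored names long and the k lowest-scored names short."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if k < 1 or 2 * k > len(ranked):
        raise ValueError("k must be at least 1 and at most half the universe size")
    weights = {sym: 0.0 for sym in ranked}
    for sym in ranked[:k]:
        weights[sym] = 1.0 / k
    for sym in ranked[-k:]:
        weights[sym] = -1.0 / k
    return weights

def portfolio_return(weights, returns):
    """Weighted sum of per-name returns over the holding period."""
    return sum(w * returns[sym] for sym, w in weights.items())

# Hypothetical scores and next-period returns for a four-name universe.
scores = {"AAA": 0.9, "BBB": 0.7, "CCC": 0.5, "DDD": 0.2}
next_returns = {"AAA": 0.02, "BBB": 0.0, "CCC": 0.01, "DDD": -0.01}

weights = long_short_weights(scores, k=1)
pnl = portfolio_return(weights, next_returns)
print(weights, pnl)
```

The weights sum to zero, so the portfolio is dollar-neutral; the returns passed in should be those earned from trade_date onward so the test stays look-ahead-free.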

Ground truth vs NLP baseline

Measures

Whether the author-stated sentiment label beats an inferred NLP-sentiment feed.

Method

Hold the strategy construction fixed and swap only the sentiment source - human labels vs an NLP feed - over the same period.

Result

Author labels win: 122% vs 95% cumulative long/short return. The label is observed, not modeled.

Spam-filter catch rate

Measures

How much coordinated pump-and-promo content the filter removes.

Method

Evaluated against known 2024 stock-promotion campaigns.

Result

Roughly 95% of coordinated promotions removed.

06 Graders

Ground truth

What correct means for this data, and how it is established.

Ground truth

For sentiment, the author's own bullish or bearish label at post time - observed, not inferred, so there is zero sentiment-model risk. For signal validation, realized forward returns from the price panel.

How it is established

Every signal row is auditable end to end: security_day_signal back to message_security_view, join on record_id into curated messages, and optionally into the broad feed. Two delivered notebooks walk the full provenance trace. Baseline factors are scored against the price panel.
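
The audit path described here amounts to two lookups. The sketch below walks from a signal row to its mapping rows and then to curated messages by record_id, using plain dictionaries. The column names assumed for message_security_view (`symbol`, `date`, `record_id`) and the helper `trace` are hypothetical; the delivered notebooks define the actual joins.

```python
def trace(signal_row, view_rows, curated_by_id):
    """Return the curated messages that contributed to one symbol-day signal row."""
    record_ids = [
        v["record_id"]
        for v in view_rows
        if v["symbol"] == signal_row["symbol"] and v["date"] == signal_row["date"]
    ]
    missing = [rid for rid in record_ids if rid not in curated_by_id]
    if missing:
        # A broken link means the row is not auditable; surface it instead of skipping.
        raise KeyError(f"record_ids without a curated message: {missing}")
    return [curated_by_id[rid] for rid in record_ids]

# Illustrative rows; the view layout is an assumption.
signal_row = {"date": "2025-07-01", "symbol": "QQQ"}
view_rows = [
    {"symbol": "QQQ", "date": "2025-07-01", "record_id": "post:1lpgxpa"},
    {"symbol": "SPY", "date": "2025-07-01", "record_id": "post:other"},
]
curated_by_id = {
    "post:1lpgxpa": {"record_id": "post:1lpgxpa", "body_or_title": "QQQ calls printing today"},
    "post:other": {"record_id": "post:other", "body_or_title": "(placeholder)"},
}

sources = trace(signal_row, view_rows, curated_by_id)
print([m["record_id"] for m in sources])
```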

Agreement

No inter-rater figure applies - by design the sentiment label is the author's stated stance, so there is no separate human-rater pass to agree with.

07 Application

Retail Sentiment Signals

Identify retail investor conviction shifts before they show up in order flow. Track ticker-level attention as a leading indicator for L/S equity strategies.

Event-Driven Flow Detection

Detect retail pile-ins and narrative shifts around earnings, FDA decisions, and macro events. Real-time firehose enables sub-minute signal generation.

Attention as Alpha

Use engagement metrics (pageviews, watchlist adds, message velocity) as a proxy for retail attention - a proven uncorrelated signal for systematic strategies.

08 Environment & integration

How you load it

Delivery

S3 push, REST API, WebSocket firehose, restricted data room

Formats

Parquet, JSON, CSV

License

Licensed for internal research, model development, and portfolio management; redistribution prohibited without a separate agreement. PII-free and MNPI-free public commentary.

Cadence

Real-time streaming at roughly 200ms firehose latency, or daily batch. History is corrected and backfilled weekly.

quickstart.sh
# Quant entry point: the symbol-day signal panel
signal/normalized/security_day_signal.parquet
signal/normalized/security_day_research.parquet
 
# Engineering entry: raw feed + manifest
broad/messages.parquet
delivery_manifest.json
 
# Recommended first tests: 1-day, 5-day, 20-day horizons.
# Audit any signal row back to its source messages by record_id.
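
The recommended 1-day, 5-day, and 20-day horizons can be computed from any price series with a small helper. The sketch below is stdlib-only and uses a synthetic price list; `forward_returns` and the prices are illustrative, and in practice index t would correspond to the row's trade_date.

```python
HORIZONS = (1, 5, 20)

def forward_returns(prices, horizon):
    """Simple return from each index t to t + horizon; None where the window runs off the end."""
    out = []
    for t, price in enumerate(prices):
        j = t + horizon
        out.append(prices[j] / price - 1.0 if j < len(prices) else None)
    return out

# Synthetic prices rising by 1.0 per day, for illustration only.
prices = [100.0 + i for i in range(22)]
by_horizon = {h: forward_returns(prices, h) for h in HORIZONS}

for h in HORIZONS:
    defined = sum(r is not None for r in by_horizon[h])
    print(f"{h}-day horizon: {defined} usable observations")
```

Leaving the trailing windows as None, rather than truncating the list, keeps each return aligned with the signal row it would be tested against.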

Request access.

Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.