From 4M comments to a style-controlled comment generator

I ended up with 4M+ comments, each paired with a username and a short description of the content being commented on.

I didn’t want a chatbot. I wanted a generator that can produce believable comments in a requested style.

The whole pipeline

Clean + dedup: normalize text, drop very short comments (< 12 chars), exact dedup by hash, near-dedup with MinHash/LSH.
Label: hand-label ~1k comments, fine-tune a DeBERTaV3 classifier on them, then pseudo-label the rest with a human in the loop.
Filter hard: keep only labels with classifier confidence >= 70%, and treat noise as a first-class bucket that is excluded from training.
Train: fine-tune SmolLM3 (3B) with LoRA (SFT-only) to condition on the target style.
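
The cleaning and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the function names, the `"noise"` label string, the style names in the usage lines, and the choice of NFKC plus whitespace collapsing as "normalization" are all assumptions. Only the 12-character minimum and the 70% threshold come from the steps above. Near-dedup with MinHash/LSH is left out to keep the sketch short.

```python
import hashlib
import re
import unicodedata

MIN_CHARS = 12          # from the pipeline: drop comments shorter than 12 chars
MIN_CONFIDENCE = 0.70   # from the pipeline: keep labels with confidence >= 70%
NOISE_LABEL = "noise"   # hypothetical name for the noise bucket


def normalize(text: str) -> str:
    """Unicode-normalize and collapse whitespace (one possible normalization)."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_and_dedup(comments):
    """Yield normalized comments, skipping short ones and exact duplicates."""
    seen = set()
    for raw in comments:
        text = normalize(raw)
        if len(text) < MIN_CHARS:
            continue
        # Exact dedup: hash the normalized text and skip hashes already seen.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


def keep_confident(labeled):
    """Keep (text, label) pairs whose label is confident and not noise.

    `labeled` is an iterable of (text, label, confidence) tuples, for example
    the classifier's top class and its probability.
    """
    return [
        (text, label)
        for text, label, conf in labeled
        if conf >= MIN_CONFIDENCE and label != NOISE_LABEL
    ]


# Usage with made-up comments and hypothetical style labels.
raw = ["Nice   video, thanks a lot!", "Nice video, thanks a lot!", "lol"]
cleaned = list(clean_and_dedup(raw))

pseudo_labeled = [
    ("Nice video, thanks a lot!", "supportive", 0.93),
    ("sure, that will totally work", "sarcastic", 0.55),
    ("asdf qwerty zxcv 12345", "noise", 0.99),
]
train_rows = keep_confident(pseudo_labeled)
```

Filtering on the noise label separately from the confidence threshold matters here: a comment the classifier is very sure is noise passes the threshold but should still stay out of the training set.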

Styles:

I ran a simple arena-style “spot the model” test (n = 500) and correctly picked out the model's comment only about 57% of the time.
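
To put the 57% in context, here is a rough check of how far it sits from guessing. It assumes each of the 500 trials was a two-way choice, so pure guessing would score 50%, and it takes "about 57%" to mean 285 correct out of 500; the exact trial format and count are not given above. The interval is a plain normal-approximation (Wald) interval, picked for simplicity.

```python
import math


def accuracy_interval(correct: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) confidence interval for an accuracy."""
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumption: about 57% of 500 trials is 285 correct guesses.
low, high = accuracy_interval(285, 500)
print(f"95% CI: {low:.3f} to {high:.3f}")
```

Under those assumptions the interval runs from roughly 53% to 61%: above chance, but not by much.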

The main lesson: data cleaning + strict filtering beats raw scale when you want controllable style.