From 4M comments to a style-controlled comment generator
tl;dr summary
I cleaned and deduped 4M scraped comments, bootstrapped style labels with a DeBERTaV3 classifier + pseudo-labeling, then fine-tuned SmolLM3 with LoRA to generate comments in controllable styles.
I accidentally built a dataset that sounds like a flex and behaves like a responsibility: 4M+ comments, each paired with a username, a short description of the thing being commented on, and the comment text itself.
The goal was not “train a general chatbot”. I wanted something narrower and more useful: generate believable comments in a specific style on demand.
The styles I ended up using:
- happy
- toxic
- sarcastic
- cringe
- wholesome
- noise (spam / contextless junk / “do not teach the model this”; excluded later)
This post is the 5-minute version of what actually made it work: cleaning, labeling at scale, and then LoRA fine-tuning a small model to be controllable.
Step 0: accept that scraped comments are disgusting
If you scrape comments at scale, you get all the classics:
- exact duplicates (bots, reposts, copy-paste)
- near duplicates (templates with tiny edits)
- whitespace garbage
- ultra-short reactions (“lol”, “.”, “ok”)
If you do not fix this early, you label the same sentence 50 times, your classifier learns templates instead of style, and your generator becomes a slot machine that keeps repeating itself.
Step 1: clean + dedup before you label anything
My cleaning pass ran directly on the SQLite DB that stored the raw scrape, because it was the simplest way to do chunked reads + bulk deletes without loading 4M rows into RAM.
The core operations:
Normalize text
Before comparing anything, I normalize each comment so cosmetic differences stop mattering:
- trim
- collapse repeated whitespace
- lowercase
That turns things like " Nice!!! ", "nice!!!", and "NICE!!!" into the same canonical string.
Drop very short comments
Anything under 12 characters gets dropped. This is where you delete a shocking amount of low-signal junk.
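The normalization and length filter described above can be sketched in a few lines. This is a minimal stand-in, not the exact pipeline code; the 12-character cutoff is the one from the post.

```python
import re

MIN_LEN = 12  # drop anything shorter than this after normalization

def normalize(text: str) -> str:
    """Canonicalize a comment so cosmetic variants compare equal."""
    text = text.strip()                # trim
    text = re.sub(r"\s+", " ", text)   # collapse repeated whitespace
    return text.lower()                # lowercase

def keep(text: str) -> bool:
    """Length filter, applied to the normalized form."""
    return len(normalize(text)) >= MIN_LEN
```

With this, `"  Nice!!!  "`, `"nice!!!"`, and `"NICE!!!"` all normalize to the same canonical string, and short reactions fail the length gate.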
Exact dedup (fast path)
After normalization, I hash each comment (xxHash) and keep a seen set. If a hash repeats, the row gets deleted.
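A sketch of the hash-and-skip pass. The real pipeline used xxHash for speed; `hashlib.blake2b` is used here as a stdlib stand-in so the sketch runs anywhere.

```python
import hashlib

def comment_hash(normalized: str) -> str:
    # Stand-in for xxHash: any fast, stable hash of the normalized text works.
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=8).hexdigest()

def exact_dedup(comments):
    """Yield only the first occurrence of each normalized comment."""
    seen = set()
    for c in comments:
        h = comment_hash(c)
        if h in seen:
            continue  # repeat hash: this row gets deleted in the real pipeline
        seen.add(h)
        yield c
```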
Near dedup (the important part)
Exact dedup barely scratches the surface because the internet loves templates.
I used MinHash + LSH over character n-grams and treated comments as duplicates when their approximate Jaccard similarity crossed a high threshold. This catches:
- emoji/no-emoji variants
- template spam with one word swapped
- “same comment, slightly rewritten”
The chunked processing part matters: reading in batches and buffering deletes is the difference between “pipeline” and “my laptop is a space heater”.
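To make the MinHash idea concrete, here is a self-contained toy version over character n-grams. A production run would use a proper library with LSH banding (so you never compare all pairs); this sketch only shows the signature-and-estimate part, with parameter values that are illustrative, not the ones from the actual pipeline.

```python
import hashlib

NUM_PERM = 64   # number of hash functions in the signature
NGRAM = 3       # character n-gram size

def shingles(text: str, n: int = NGRAM) -> set:
    """Character n-grams of the (already normalized) comment."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def _h(token: str, seed: int) -> int:
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(text: str) -> list:
    """Signature: for each seed, the minimum hash over the text's shingles."""
    sh = shingles(text)
    return [min(_h(s, seed) for s in sh) for seed in range(NUM_PERM)]

def est_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two comments are treated as near-duplicates when the estimate crosses a high threshold; LSH exists so that you only ever compute `est_jaccard` for candidate pairs that share a signature band.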
After cleaning, the dataset landed at roughly:
- 3.7M comments remaining (from 4M+)
- average comment length of ~44 characters
Step 2: pick labels you can actually label
The taxonomy stayed intentionally small. If I cannot define a label in one sentence, it is not a label, it is a future argument with myself.
`noise` is a first-class label on purpose. It is not an insult, it is a safety/quality boundary: spam, unreadable fragments, and stuff that only makes sense in-thread all go there.
This is also how you keep “style control” from collapsing into “style soup”.
Step 3: bootstrap labels with a classifier (DeBERTaV3)
I started with about 1k hand-labeled comments. That is not enough to do science, but it is enough to bootstrap.
Then I fine-tuned DeBERTaV3 base as a multi-class classifier on comment text only.
I got around 85% validation accuracy on a random split, which was useful as an iteration signal, not a publication result. With text data you always worry about leakage from near-duplicates and topical clusters.
The real value was that it became a labeling multiplier.
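The classifier setup is standard sequence classification over the style taxonomy. A hedged sketch, assuming the `transformers` library and the public `microsoft/deberta-v3-base` checkpoint; the label ordering here is my own assumption, not taken from the post.

```python
# Style taxonomy; `noise` is a real class so junk has somewhere to go.
LABELS = ["happy", "toxic", "sarcastic", "cringe", "wholesome", "noise"]
id2label = dict(enumerate(LABELS))
label2id = {v: k for k, v in id2label.items()}

def build_classifier():
    """Assumes `transformers` is installed; downloads the base checkpoint."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base",
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tok, model
```

From there it is ordinary supervised fine-tuning on (comment text, style label) pairs.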
Step 4: pseudo-labeling with a human in the loop
The loop looked like this:
- sample a comment
- classifier predicts a style
- I accept/correct
- periodically retrain
- repeat
Once the model is “good enough”, this becomes dramatically faster than labeling from scratch. That got me to about 8k solid labels.
One trick that helped: once I could generate comments in a target style, I reviewed generations and aggressively threw failures into `noise`. It sounds circular, but it surfaced failure modes early and made `noise` a practical quality gate.
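The loop above is simple enough to write down. This is an illustrative skeleton, with the predictor and the human-review step injected as callables so the structure stays testable; the real loop also periodically retrains on the accumulated labels.

```python
def pseudo_label_round(samples, predict, review):
    """One human-in-the-loop labeling pass.

    `predict(text)` returns the classifier's proposed style label;
    `review(text, label)` returns the accepted (possibly corrected) label.
    """
    labeled = []
    for text in samples:
        proposed = predict(text)
        final = review(text, proposed)  # human accepts or corrects
        labeled.append((text, final))
    return labeled  # retrain the classifier on this batch, then repeat
```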
Step 5: auto-label at scale, then filter hard
After the classifier stabilized, I ran it across the cleaned corpus and kept only high-confidence labels.
The filtering rules were intentionally blunt:
- keep predictions with >= 70% probability
- if `noise` had > 40% confidence, drop the sample entirely
- exclude `noise` from the final generator training set
This tradeoff is worth it. Fewer samples with strong labels beat millions of weak guesses that blur boundaries.
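The rules above fit in one function. A minimal sketch, assuming `probs` is the classifier's per-style probability dict for one comment; the thresholds are the ones from the post.

```python
KEEP_THRESHOLD = 0.70   # minimum top-class probability to keep a label
NOISE_DROP = 0.40       # drop the row outright if noise is this likely

def accept(probs: dict):
    """Return the label to keep for generator training, or None to drop."""
    if probs.get("noise", 0.0) > NOISE_DROP:
        return None  # too likely to be junk: drop the sample entirely
    label = max(probs, key=probs.get)
    if label == "noise" or probs[label] < KEEP_THRESHOLD:
        return None  # noise never enters training; weak labels are dropped
    return label
```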
Result: about 1.8M confidently labeled samples for generator training.
Step 6: train the generator (SmolLM3 + LoRA)
With the dataset in good shape, I trained a style-controlled generator:
- Base model: SmolLM3 (3B)
- Fine-tune: LoRA, SFT-only
- Hardware: RTX 4090
- Stack: Unsloth (CUDA/tooling kept current)
Prompting stayed deliberately simple so the model learned the conditioning reliably:
- System prompt: the requested style + a couple constraints
- User message: `<username>...</username><description>...</description>`
- Assistant output: the comment
No fancy post-processing. The whole point was “consistent conditioning → consistent style”.
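Assembling a training example in that format is mechanical. A sketch in chat-message form (what most SFT stacks, Unsloth included, consume); the exact system-prompt wording here is illustrative, not the one from the post.

```python
def build_example(style: str, username: str, description: str, comment: str):
    """One SFT example: style goes in the system prompt, context in the user turn."""
    return [
        {"role": "system",
         "content": f"Write a {style} comment. Stay short and in character."},
        {"role": "user",
         "content": f"<username>{username}</username>"
                    f"<description>{description}</description>"},
        {"role": "assistant", "content": comment},
    ]
```

Because every example conditions the same way, the model has nothing to learn except “style token in, style out”.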
Quick eval: could I spot my own model?
I did a small arena-style test:
- each trial: two comments, guess which one is generated
- latest run: n = 500
- I identified the model about 57% of the time
That is not invisibility, but it is close enough to random guessing to be a meaningful win for this use case.
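To put the 57% in context, the binomial standard error at n = 500 is about 2.2 percentage points, so the detection rate sits a few points above chance but well below confident detection. A quick sanity check:

```python
import math

n = 500
hits = round(0.57 * n)             # correct identifications in the latest run
p = hits / n                        # observed detection rate
se = math.sqrt(p * (1 - p) / n)     # binomial standard error
lo, hi = p - 1.96 * se, p + 1.96 * se   # rough 95% interval
```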
What actually mattered
If you only steal one idea from this project, steal this: treat it like a pipeline.
- Clean first (dedup + short-comment removal stops repetition learning)
- Make `noise` real (a strict junk bucket is a superpower)
- Use pseudo-labeling (1k → 8k happens fast once the classifier helps)
- Filter aggressively (confidence thresholds matter more than raw scale)
- LoRA is enough when the task is narrow and the conditioning is clean