From 4M comments to a style-controlled comment generator
tl;dr summary
I cleaned and deduped 4M scraped comments, bootstrapped style labels with a DeBERTaV3 classifier + pseudo-labeling, then fine-tuned SmolLM3 with LoRA to generate comments in controllable styles.
I accidentally built a dataset that sounds like a flex and behaves like a responsibility: 4M+ comments, each paired with a username, a short description of the thing being commented on, and the comment text itself.
The goal was not “train a general chatbot”. I wanted something narrower and more useful: generate believable comments in a specific style on demand.
The styles I ended up using:
- happy
- toxic
- sarcastic
- cringe
- wholesome
- noise (spam / contextless junk / “do not teach the model this”; excluded later)
This post is the 5-minute version of what actually made it work: cleaning, labeling at scale, and then LoRA fine-tuning a small model to be controllable.
Step 0: accept that scraped comments are disgusting
If you scrape comments at scale, you get all the classics:
- exact duplicates (bots, reposts, copy-paste)
- near duplicates (templates with tiny edits)
- whitespace garbage
- ultra-short reactions (“lol”, “.”, “ok”)
If you do not fix this early, you label the same sentence 50 times, your classifier learns templates instead of style, and your generator becomes a slot machine that keeps repeating itself.
Step 1: clean + dedup before you label anything
My cleaning pass ran directly on the SQLite DB that stored the raw scrape, because it was the simplest way to do chunked reads + bulk deletes without loading 4M rows into RAM.
The core operations:
Normalize text
Before comparing anything, I normalize each comment so cosmetic differences stop mattering:
- trim
- collapse repeated whitespace
- lowercase
That turns things like " Nice!!! ", "nice!!!", and "NICE!!!" into the same canonical string.
Drop very short comments
Anything under 12 characters gets dropped. This is where you delete a shocking amount of low-signal junk.
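The normalization and length filter described above can be sketched in a few lines. This is a minimal stand-in, not the exact pipeline code; the 12-character cutoff is the one from the post.

```python
import re

MIN_LEN = 12  # drop anything shorter than this after normalization

def normalize(text: str) -> str:
    """Canonicalize a comment so cosmetic variants compare equal."""
    text = text.strip()                # trim
    text = re.sub(r"\s+", " ", text)   # collapse repeated whitespace
    return text.lower()                # lowercase

def keep(text: str) -> bool:
    """Length filter, applied to the normalized form."""
    return len(normalize(text)) >= MIN_LEN
```

With this, `"  Nice!!!  "`, `"nice!!!"`, and `"NICE!!!"` all normalize to the same canonical string, and short reactions fail the length gate.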
Exact dedup (fast path)
After normalization, I hash each comment (xxHash) and keep a seen set. If a hash repeats, the row gets deleted.
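A sketch of the hash-and-skip pass. The real pipeline used xxHash for speed; `hashlib.blake2b` is used here as a stdlib stand-in so the sketch runs anywhere.

```python
import hashlib

def comment_hash(normalized: str) -> str:
    # Stand-in for xxHash: any fast, stable hash of the normalized text works.
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=8).hexdigest()

def exact_dedup(comments):
    """Yield only the first occurrence of each normalized comment."""
    seen = set()
    for c in comments:
        h = comment_hash(c)
        if h in seen:
            continue  # repeat hash: this row gets deleted in the real pipeline
        seen.add(h)
        yield c
```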
Near dedup (the important part)
Exact dedup barely scratches the surface because the internet loves templates.
I used MinHash + LSH over character n-grams and treated comments as duplicates when their approximate Jaccard similarity crossed a high threshold. This catches:
- emoji/no-emoji variants
- template spam with one word swapped
- “same comment, slightly rewritten”
The chunked processing part matters: reading in batches and buffering deletes is the difference between “pipeline” and “my laptop is a space heater”.
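To make the MinHash idea concrete, here is a self-contained toy version over character n-grams. A production run would use a proper library with LSH banding (so you never compare all pairs); this sketch only shows the signature-and-estimate part, with parameter values that are illustrative, not the ones from the actual pipeline.

```python
import hashlib

NUM_PERM = 64   # number of hash functions in the signature
NGRAM = 3       # character n-gram size

def shingles(text: str, n: int = NGRAM) -> set:
    """Character n-grams of the (already normalized) comment."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def _h(token: str, seed: int) -> int:
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(text: str) -> list:
    """Signature: for each seed, the minimum hash over the text's shingles."""
    sh = shingles(text)
    return [min(_h(s, seed) for s in sh) for seed in range(NUM_PERM)]

def est_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two comments are treated as near-duplicates when the estimate crosses a high threshold; LSH exists so that you only ever compute `est_jaccard` for candidate pairs that share a signature band.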
After cleaning, the dataset landed at roughly:
- 3.7M comments remaining (from 4M+)
- average comment length of ~44 characters
Step 2: pick labels you can actually label
The taxonomy stayed intentionally small. If I cannot define a label in one sentence, it is not a label, it is a future argument with myself.
`noise` is a first-class label on purpose. It is not an insult, it is a safety/quality boundary: spam, unreadable fragments, and stuff that only makes sense in-thread all go there.
This is also how you keep “style control” from collapsing into “style soup”.
Step 3: bootstrap labels with a classifier (DeBERTaV3)
I started with about 1k hand-labeled comments. That is not enough to do science, but it is enough to bootstrap.
Then I fine-tuned DeBERTaV3 base as a multi-class classifier on comment text only.
I got around 85% validation accuracy on a random split, which was useful as an iteration signal, not a publication result. With text data you always worry about leakage from near-duplicates and topical clusters.
The real value was that it became a labeling multiplier.
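The classifier setup is standard sequence classification over the style taxonomy. A hedged sketch, assuming the `transformers` library and the public `microsoft/deberta-v3-base` checkpoint; the label ordering here is my own assumption, not taken from the post.

```python
# Style taxonomy; `noise` is a real class so junk has somewhere to go.
LABELS = ["happy", "toxic", "sarcastic", "cringe", "wholesome", "noise"]
id2label = dict(enumerate(LABELS))
label2id = {v: k for k, v in id2label.items()}

def build_classifier():
    """Assumes `transformers` is installed; downloads the base checkpoint."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base",
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tok, model
```

From there it is ordinary supervised fine-tuning on (comment text, style label) pairs.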
Step 4: pseudo-labeling with a human in the loop
The loop looked like this:
- sample a comment
- classifier predicts a style
- I accept/correct
- periodically retrain
- repeat
Once the model is “good enough”, this becomes dramatically faster than labeling from scratch. That got me to about 8k solid labels.
One trick that helped: once I could generate comments in a target style, I reviewed generations and aggressively threw failures into `noise`. It sounds circular, but it surfaced failure modes early and made `noise` a practical quality gate.
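The loop above is simple enough to write down. This is an illustrative skeleton, with the predictor and the human-review step injected as callables so the structure stays testable; the real loop also periodically retrains on the accumulated labels.

```python
def pseudo_label_round(samples, predict, review):
    """One human-in-the-loop labeling pass.

    `predict(text)` returns the classifier's proposed style label;
    `review(text, label)` returns the accepted (possibly corrected) label.
    """
    labeled = []
    for text in samples:
        proposed = predict(text)
        final = review(text, proposed)  # human accepts or corrects
        labeled.append((text, final))
    return labeled  # retrain the classifier on this batch, then repeat
```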
Step 5: auto-label at scale, then filter hard
After the classifier stabilized, I ran it across the cleaned corpus and kept only high-confidence labels.
The filtering rules were intentionally blunt:
- keep predictions with >= 70% probability
- if `noise` had > 40% confidence, drop the sample entirely
- exclude `noise` from the final generator training set
This tradeoff is worth it. Fewer samples with strong labels beat millions of weak guesses that blur boundaries.
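The rules above fit in one function. A minimal sketch, assuming `probs` is the classifier's per-style probability dict for one comment; the thresholds are the ones from the post.

```python
KEEP_THRESHOLD = 0.70   # minimum top-class probability to keep a label
NOISE_DROP = 0.40       # drop the row outright if noise is this likely

def accept(probs: dict):
    """Return the label to keep for generator training, or None to drop."""
    if probs.get("noise", 0.0) > NOISE_DROP:
        return None  # too likely to be junk: drop the sample entirely
    label = max(probs, key=probs.get)
    if label == "noise" or probs[label] < KEEP_THRESHOLD:
        return None  # noise never enters training; weak labels are dropped
    return label
```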
Result: about 1.8M confidently labeled samples for generator training.
Step 6: train the generator (SmolLM3 + LoRA)
With the dataset in good shape, I trained a style-controlled generator:
- Base model: SmolLM3 (3B)
- Fine-tune: LoRA, SFT-only
- Hardware: RTX 4090
- Stack: Unsloth (CUDA/tooling kept current)
Prompting stayed deliberately simple so the model learned the conditioning reliably:
- System prompt: the requested style + a couple constraints
- User message: `<username>...</username><description>...</description>`
- Assistant output: the comment
No fancy post-processing. The whole point was “consistent conditioning → consistent style”.
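Assembling a training example in that format is mechanical. A sketch in chat-message form (what most SFT stacks, Unsloth included, consume); the exact system-prompt wording here is illustrative, not the one from the post.

```python
def build_example(style: str, username: str, description: str, comment: str):
    """One SFT example: style goes in the system prompt, context in the user turn."""
    return [
        {"role": "system",
         "content": f"Write a {style} comment. Stay short and in character."},
        {"role": "user",
         "content": f"<username>{username}</username>"
                    f"<description>{description}</description>"},
        {"role": "assistant", "content": comment},
    ]
```

Because every example conditions the same way, the model has nothing to learn except “style token in, style out”.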
Quick eval: could I spot my own model?
I did a small arena-style test:
- each trial: two comments, guess which one is generated
- latest run: n = 500
- I identified the model about 57% of the time
That is not invisibility, but it is close enough to random guessing to be a meaningful win for this use case.
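To put the 57% in context, the binomial standard error at n = 500 is about 2.2 percentage points, so the detection rate sits a few points above chance but well below confident detection. A quick sanity check:

```python
import math

n = 500
hits = round(0.57 * n)             # correct identifications in the latest run
p = hits / n                        # observed detection rate
se = math.sqrt(p * (1 - p) / n)     # binomial standard error
lo, hi = p - 1.96 * se, p + 1.96 * se   # rough 95% interval
```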
What actually mattered
If you only steal one idea from this project, steal this: treat it like a pipeline.
- Clean first (dedup + short-comment removal stops repetition learning)
- Make `noise` real (a strict junk bucket is a superpower)
- Use pseudo-labeling (1k → 8k happens fast once the classifier helps)
- Filter aggressively (confidence thresholds matter more than raw scale)
- LoRA is enough when the task is narrow and the conditioning is clean