
From 4M comments to a style-controlled comment generator

tl;dr summary

I cleaned and deduped 4M scraped comments, bootstrapped style labels with a DeBERTaV3 classifier + pseudo-labeling, then fine-tuned SmolLM3 with LoRA to generate comments in controllable styles.

I accidentally built a dataset that sounds like a flex and behaves like a responsibility: 4M+ comments, each paired with a username, a short description of the thing being commented on, and the comment text itself.

The goal was not “train a general chatbot”. I wanted something narrower and more useful: generate believable comments in a specific style on demand.

The styles I ended up using:

  • happy
  • toxic
  • sarcastic
  • cringe
  • wholesome
  • noise (spam / contextless junk / “do not teach the model this”; excluded later)

This post is the 5-minute version of what actually made it work: cleaning, labeling at scale, and then LoRA fine-tuning a small model to be controllable.


Step 0: accept that scraped comments are disgusting

If you scrape comments at scale, you get all the classics:

  • exact duplicates (bots, reposts, copy-paste)
  • near duplicates (templates with tiny edits)
  • whitespace garbage
  • ultra-short reactions (“lol”, “.”, “ok”)

If you do not fix this early, you label the same sentence 50 times, your classifier learns templates instead of style, and your generator becomes a slot machine that keeps repeating itself.


Step 1: clean + dedup before you label anything

My cleaning pass ran directly on the SQLite DB that stored the raw scrape, because it was the simplest way to do chunked reads + bulk deletes without loading 4M rows into RAM.
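The chunked-read + buffered-delete pattern is simple enough to sketch. Everything here is an assumption about the schema (a `comments` table with `id` and `text` columns); the point is keying the scan on the primary key and batching the deletes.

```python
import sqlite3

def iter_chunks(conn, chunk_size=50_000):
    """Yield rows in fixed-size chunks, keyed on id so the scan stays
    cheap even on a multi-million-row table."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, text FROM comments WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]

def delete_buffered(conn, doomed_ids, batch_size=10_000):
    """Flush deletes in batches instead of issuing one DELETE per row."""
    for i in range(0, len(doomed_ids), batch_size):
        batch = doomed_ids[i : i + batch_size]
        conn.executemany("DELETE FROM comments WHERE id = ?",
                         [(d,) for d in batch])
    conn.commit()
```

Keyset pagination (`WHERE id > ?`) beats `OFFSET` here because `OFFSET` re-scans skipped rows on every chunk.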

The core operations:

Normalize text

Before comparing anything, I normalize each comment so cosmetic differences stop mattering:

  • trim
  • collapse repeated whitespace
  • lowercase

That turns things like " Nice!!! ", "nice!!!", and "NICE!!!" into the same canonical string.
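The whole normalization pass fits in one line; this is the canonical form used for comparison only, not what gets stored or trained on:

```python
import re

def normalize(text: str) -> str:
    """Canonical form for dedup comparisons: collapse whitespace runs,
    trim, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()
```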

Drop very short comments

Anything under 12 characters gets dropped. This is where you delete a shocking amount of low-signal junk.

Exact dedup (fast path)

After normalization, I hash each comment (xxHash) and keep a seen set. If a hash repeats, the row gets deleted.
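A minimal sketch of the fast path, using stdlib `blake2b` as a stand-in for xxHash (any fast, stable hash works; xxHash is just faster):

```python
import hashlib

def exact_dedup(comments):
    """Keep the first occurrence of each normalized comment.
    Returns the ids of exact duplicates to delete."""
    seen = set()
    doomed = []
    for row_id, text in comments:
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        if digest in seen:
            doomed.append(row_id)
        else:
            seen.add(digest)
    return doomed
```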

Near dedup (the important part)

Exact dedup barely scratches the surface because the internet loves templates.

I used MinHash + LSH over character n-grams and treated comments as duplicates when their approximate Jaccard similarity crossed a high threshold. This catches:

  • emoji/no-emoji variants
  • template spam with one word swapped
  • “same comment, slightly rewritten”

The chunked processing part matters: reading in batches and buffering deletes is the difference between “pipeline” and “my laptop is a space heater”.
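A from-scratch sketch of MinHash + banded LSH over character 3-grams, to show the shape of the idea (in practice a library like datasketch does this, and a real pipeline would verify candidate pairs with actual Jaccard similarity before deleting):

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length
BANDS = 8      # 8 bands x 8 rows: flags pairs above roughly 0.75 Jaccard

def shingles(text, n=3):
    """Character n-grams; very short texts fall back to the whole string."""
    return {text[i : i + n] for i in range(len(text) - n + 1)} or {text}

def minhash(text):
    """Simulate NUM_PERM hash functions by salting one hash with the slot index."""
    grams = shingles(text)
    return tuple(
        min(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest()
            for g in grams)
        for seed in range(NUM_PERM)
    )

def lsh_near_dupes(comments):
    """Mark any comment whose signature shares a full band with an earlier one."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    doomed = []
    for row_id, text in comments:
        sig = minhash(text)
        hit = False
        for b in range(BANDS):
            key = (b, sig[b * rows : (b + 1) * rows])
            if buckets[key]:
                hit = True
            buckets[key].append(row_id)
        if hit:
            doomed.append(row_id)
    return doomed
```

The band/row split is the tuning knob: more rows per band raises the effective similarity threshold, more bands lowers it.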

After cleaning, the dataset landed at roughly:

  • 3.7M comments remaining (from 4M+)
  • ~44 characters average comment length

Step 2: pick labels you can actually label

The taxonomy stayed intentionally small. If I cannot define a label in one sentence, it is not a label, it is a future argument with myself.

noise is a first-class label on purpose. It is not an insult, it is a safety/quality boundary: spam, unreadable fragments, and stuff that only makes sense in-thread goes there.

This is also how you keep “style control” from collapsing into “style soup”.


Step 3: bootstrap labels with a classifier (DeBERTaV3)

I started with roughly 1k hand-labeled comments. That is not enough to do science, but it is enough to bootstrap.

Then I fine-tuned DeBERTaV3 base as a multi-class classifier on comment text only.

I got around 85% validation accuracy on a random split, which was useful as an iteration signal, not a publication result. With text data you always worry about leakage from near-duplicates and topical clusters.

The real value was that it became a labeling multiplier.


Step 4: pseudo-labeling with a human in the loop

The loop looked like this:

  1. sample a comment
  2. classifier predicts a style
  3. I accept/correct
  4. periodically retrain
  5. repeat

Once the model is “good enough”, this becomes dramatically faster than labeling from scratch. That got me to roughly 8k solid labels.
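The loop itself is plain enough to sketch. Here `predict`, `review`, and `retrain` are stand-ins for the real DeBERTaV3 call, the human in the loop, and the periodic fine-tune:

```python
def pseudo_label_loop(unlabeled, predict, review, retrain, retrain_every=200):
    """Classifier proposes, human accepts or corrects, model retrains
    periodically on everything labeled so far."""
    labeled = []
    for i, text in enumerate(unlabeled, start=1):
        guess = predict(text)            # classifier predicts a style
        final = review(text, guess)      # human accepts/corrects
        labeled.append((text, final))
        if i % retrain_every == 0:
            retrain(labeled)             # periodic retrain
    return labeled
```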

One trick that helped: once I could generate comments in a target style, I reviewed generations and aggressively threw failures into noise. It sounds circular, but it surfaced failure modes early and made noise a practical quality gate.


Step 5: auto-label at scale, then filter hard

After the classifier stabilized, I ran it across the cleaned corpus and kept only high-confidence labels.

The filtering rules were intentionally blunt:

  • keep predictions with >= 70% probability
  • if noise had > 40% confidence, drop the sample entirely
  • exclude noise from the final generator training set

This tradeoff is worth it. Fewer samples with strong labels beat millions of weak guesses that blur boundaries.
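The three rules reduce to a few lines; `probs` here is assumed to be a label-to-softmax-probability mapping from the classifier:

```python
def keep_sample(probs, keep_threshold=0.70, noise_ceiling=0.40):
    """Apply the blunt filtering rules. Returns the kept label,
    or None if the sample is dropped."""
    if probs.get("noise", 0.0) > noise_ceiling:
        return None                        # too noise-like: drop entirely
    label = max(probs, key=probs.get)
    if label == "noise" or probs[label] < keep_threshold:
        return None                        # noise never trains the generator;
                                           # weak predictions are dropped
    return label
```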

Result: roughly 1.8M confidently labeled samples for generator training.


Step 6: train the generator (SmolLM3 + LoRA)

With the dataset in good shape, I trained a style-controlled generator:

  • Base model: SmolLM3 (3B)
  • Fine-tune: LoRA, SFT-only
  • Hardware: RTX 4090
  • Stack: Unsloth (CUDA/tooling kept current)
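The setup looks roughly like the sketch below. The model id, sequence length, and every LoRA hyperparameter here are illustrative placeholders, not the values from the actual run:

```python
from unsloth import FastLanguageModel

# Load the base model; 4-bit quantization keeps a 3B model comfortable on a 4090.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="HuggingFaceTB/SmolLM3-3B",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```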

Prompting stayed deliberately simple so the model learned the conditioning reliably:

  • System prompt: the requested style + a couple constraints
  • User message:
<username>...</username><description>...</description>
  • Assistant output: the comment

No fancy post-processing. The whole point was “consistent conditioning → consistent style”.
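Assembling one SFT example in that format is mechanical. The exact system-prompt wording below is illustrative; consistency is what matters:

```python
def build_example(style, username, description, comment):
    """Assemble one training example in the fixed conditioning format."""
    return [
        {"role": "system",
         "content": f"Write a {style} comment. Stay short and in character."},
        {"role": "user",
         "content": f"<username>{username}</username>"
                    f"<description>{description}</description>"},
        {"role": "assistant", "content": comment},
    ]
```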


Quick eval: could I spot my own model?

I did a small arena-style test:

  • each trial: two comments, guess which one is generated
  • latest run: n = 500
  • I identified the model about 57% of the time

That is not invisibility, but it is close enough to random guessing to be a meaningful win for this use case.


What actually mattered

If you only steal one idea from this project, steal this: treat it like a pipeline.

  • Clean first (dedup + short-comment removal stops repetition learning)
  • Make noise real (a strict junk bucket is a superpower)
  • Use pseudo-labeling (1k → 8k happens fast once the classifier helps)
  • Filter aggressively (confidence thresholds matter more than raw scale)
  • LoRA is enough when the task is narrow and the conditioning is clean