IMDb Sentiment Analysis: RNN vs Pretrained Transformers

I used IMDb movie reviews to test how far a simple sequence model can go before pretrained Transformers become worth the extra cost. The story is not just "positive versus negative"; the reviews are long, noisy, full of contrast, and often hide the real opinion late in the text. I compared a BiLSTM baseline with DistilGPT-2 and XLNet, then tested built-in classifier heads against a custom MLP head and different max sequence lengths.

NLP • Sentiment Analysis • Transformers • RNN • Truncation Study • Kaggle

The Problem I Was Testing

IMDb sentiment is a clean binary task on the surface, but the raw reviews are not clean model inputs. Sentiment can depend on a word like "but", on a late reversal, or on a complaint that only makes sense after several sentences of setup. The dataset also contains HTML tags, repeated punctuation, capitalization, and other artifacts that are common in scraped review text.

01

Long-context sentiment

Reviews average 234 words, but the longest review reaches 2,470 words. Truncation can remove exactly the part where the opinion turns.
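To make the truncation risk concrete, here is a minimal sketch that cuts a review at the word level. The helper `truncate_head` and the toy review are made up for illustration; the experiments truncate tokenized sequences at max_len, and keeping the start of the review is an assumption about which side is cut.

```python
def truncate_head(text: str, max_len: int) -> str:
    """Keep only the first max_len whitespace-separated words."""
    words = text.split()
    return " ".join(words[:max_len])


# Toy review (invented): a long setup with the verdict in the last sentence.
setup = "The opening act builds the characters slowly and carefully. " * 30
review = setup + "But in the end it was a complete waste of time."

short = truncate_head(review, max_len=100)
verdict_kept = "waste of time" in short

# The truncated text no longer contains the sentence that carries the label.
print(len(review.split()), len(short.split()), verdict_kept)
```

With a larger max_len the final sentence survives, which is the effect the max_len comparison is meant to measure.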

02

Noisy natural text

HTML tags appeared in 58.67% of reviews, repeated punctuation in 36.49%, and all-caps words in 22.56%.
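Prevalence figures like these can be measured with a few regular expressions. The sketch below is one way to do it; the exact patterns behind the reported percentages are not given here, so the three regexes, the function name `noise_rates`, and the sample reviews are all assumptions.

```python
import re

# Assumed patterns; the definitions behind the reported numbers may differ.
HTML_TAG = re.compile(r"<[^>]+>")
REPEATED_PUNCT = re.compile(r"([!?.])\1+")    # "!!", "...", "??"
ALL_CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")  # "AWFUL", but not "I" or "A"


def noise_rates(reviews):
    """Return the fraction of reviews that match each noise pattern."""
    patterns = {
        "html": HTML_TAG,
        "repeated_punct": REPEATED_PUNCT,
        "all_caps": ALL_CAPS_WORD,
    }
    n = len(reviews)
    return {
        name: sum(bool(p.search(r)) for r in reviews) / n
        for name, p in patterns.items()
    }


# Invented sample reviews, not taken from the dataset.
sample = [
    "Great film.<br /><br />Loved it!!!",
    "This was AWFUL, truly.",
    "A quiet, careful drama.",
    "Why... just why?",
]
rates = noise_rates(sample)
print(rates)
```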

03

Harder gains after strong baselines

The BiLSTM reached 0.8905 TestAccuracy, but the real experiment was whether pretrained models and better heads could push the result into the 0.93-0.96 range.

A Few Reviews That Explain the Task

Short snippets from the dataset show why this is more than keyword spotting.

Negative: contrast + complaint
"The idea is good, but there are too many stupid errors in the movie..."

The sentence starts with a positive setup, then flips into the real label. A model that overweights early tokens can miss the turn.

Positive: craft + recommendation
"The pacing, the camera work, the emotion, the haunting musical score... make it a must see."

This is easier for sentiment, but it still depends on a cluster of descriptive phrases rather than one isolated word.

Messy sample: markup + intensity
"*Spoiler warning*<br /><br />First of all I rated this movie 2 out of 10..."

Real inputs include spoiler tags, HTML breaks, ratings, punctuation, and emotional language all at once. Cleaning removes markup noise, while the model still needs to keep sentiment-heavy signals such as low ratings and strong complaints.
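A cleaning step along these lines could look like the sketch below. It is not the exact pipeline used in the experiments: the function name `clean_review`, the choice to cap punctuation runs at three characters, and the sample string (the snippet above with an invented ending) are assumptions for illustration.

```python
import html
import re


def clean_review(text: str) -> str:
    """Strip markup noise but keep sentiment-bearing content."""
    text = html.unescape(text)                                   # "&amp;" -> "&"
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)  # HTML line breaks
    text = re.sub(r"<[^>]+>", " ", text)                         # any remaining tags
    text = re.sub(r"([!?.])\1{2,}", r"\1\1\1", text)             # cap runs at three
    return re.sub(r"\s+", " ", text).strip()                     # collapse whitespace


raw = "*Spoiler warning*<br /><br />First of all I rated this movie 2 out of 10!!!!!"
cleaned = clean_review(raw)
print(cleaned)
# *Spoiler warning* First of all I rated this movie 2 out of 10!!!
```

The rating "2 out of 10" and the emphatic punctuation survive, while the HTML breaks are gone. Case is left untouched so that all-caps words stay available to a cased tokenizer.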

Approach

I kept the task fixed and changed the model family, classifier head, and amount of review context the model could see.

BiLSTM (from scratch)

Word-level tokenizer, embedding layer, and bidirectional LSTM trained from scratch as a practical sequential baseline.

What changed: max_len 500 vs 1024
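A word-level tokenizer of the kind described for the BiLSTM can be sketched in a few lines. The reserved ids, the vocabulary cap of 20,000, lowercasing, and right-padding are all assumptions here; the baseline's actual settings are not listed.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids (assumed convention)


def build_vocab(texts, max_size=20000):
    """Map the most frequent lowercase words to integer ids, starting at 2."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_size))}


def encode(text, vocab, max_len):
    """Convert text to ids, truncate to max_len, then right-pad with PAD."""
    ids = [vocab.get(word, UNK) for word in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Invented training texts for illustration.
train = ["a good movie", "a bad movie"]
vocab = build_vocab(train)
encoded = encode("a good film", vocab, max_len=5)
print(encoded)
```

Words not seen in training, such as "film" here, map to UNK, and every output has exactly max_len ids, which is what lets max_len be varied as a single experiment variable.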

DistilGPT-2 (fine-tune)

Fine-tuned a pretrained GPT-style model to test how far transfer learning helps over the RNN baseline.

What changed: built-in head vs custom MLP, max_len 500 vs 1024

XLNet (fine-tune)

Fine-tuned XLNet to test whether stronger bidirectional context modeling helps on long-form sentiment.

What changed: built-in head vs custom MLP, max_len 500/1024/1200

Dataset facts + experiment variable

Training reviews

25,000 labeled IMDb reviews

Class balance

12,500 negative / 12,500 positive

Length stats

mean 234, median 174, IQR 127-284, max 2,470 (words)

max_len

BiLSTM and DistilGPT-2: 500/1024 • XLNet: 500/1024/1200
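The length statistics and the share of reviews affected by a given max_len can be computed with the standard library. This is a sketch on invented data, with hypothetical helper names `length_summary` and `share_truncated`. It counts whitespace-separated words, whereas max_len counts tokens, and subword tokenizers usually produce more tokens than words, so treat the truncated share as a lower bound.

```python
import statistics


def length_summary(reviews):
    """Word-count summary: mean, median, quartiles, and max."""
    lengths = sorted(len(r.split()) for r in reviews)
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "iqr": (q1, q3),
        "max": lengths[-1],
    }


def share_truncated(reviews, max_len):
    """Fraction of reviews longer than max_len words."""
    return sum(len(r.split()) > max_len for r in reviews) / len(reviews)


# Invented reviews with known word counts, standing in for the real data.
reviews = ["w " * n for n in (50, 120, 180, 300, 1500)]
summary = length_summary(reviews)
share = share_truncated(reviews, max_len=500)
print(summary, share)
```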

Results

TestAccuracy is the Kaggle metric. Columns are max_len values; the best overall value is listed below the table.

Model / head         max_len=500   max_len=1024   max_len=1200
BiLSTM               0.8727        0.8905         -
DistilGPT-2          0.9225        0.9282         -
DistilGPT-2 + MLP    0.9243        0.9299         -
XLNet                0.9524        0.9578         0.9571
XLNet + MLP          0.9525        0.9572         0.9570

Best overall: XLNet 0.9578 @ max_len=1024

TestAccuracy computed on Kaggle using 50% of the test set.

Takeaways

  • Pretrained Transformers clearly outperform the BiLSTM baseline on this long-review sentiment task: at max_len=1024, DistilGPT-2 reaches 0.9282 and XLNet 0.9578 against the BiLSTM's 0.8905.
  • XLNet gives the strongest result overall, reaching 0.9578 TestAccuracy at max_len=1024.
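  • The custom MLP head gives DistilGPT-2 a small lift (about +0.002 at both lengths), while XLNet's built-in head is already strong enough that the MLP makes no consistent difference (0.9572 vs 0.9578 at max_len=1024).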
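  • Increasing max_len from 500 to 1024 helps every model (about +0.018 for the BiLSTM, +0.005 to +0.006 for the Transformers), but XLNet at 1200 does no better than at 1024 (0.9571 vs 0.9578), so the gain flattens around 1024 tokens.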
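The arithmetic behind these takeaways can be checked directly from the results table. The sketch below copies the reported TestAccuracy values into a dictionary; the helper name `gain` is made up for illustration.

```python
# TestAccuracy values copied from the results table, keyed by max_len.
results = {
    "BiLSTM":            {500: 0.8727, 1024: 0.8905},
    "DistilGPT-2":       {500: 0.9225, 1024: 0.9282},
    "DistilGPT-2 + MLP": {500: 0.9243, 1024: 0.9299},
    "XLNet":             {500: 0.9524, 1024: 0.9578, 1200: 0.9571},
    "XLNet + MLP":       {500: 0.9525, 1024: 0.9572, 1200: 0.9570},
}


def gain(model, short, long):
    """Accuracy change when max_len grows from short to long."""
    return round(results[model][long] - results[model][short], 4)


for model in results:
    print(f"{model:<18} 500->1024: {gain(model, 500, 1024):+.4f}")

# Best (model, max_len) pair across the whole table.
best_model, best_len = max(
    ((m, n) for m, scores in results.items() for n in scores),
    key=lambda pair: results[pair[0]][pair[1]],
)
print(best_model, best_len, results[best_model][best_len])
```

Differences of a few ten-thousandths, such as XLNet with and without the MLP head, are small enough that they should be read with caution on a score computed from half the test set.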