IMDb Sentiment Analysis: RNN vs Pretrained Transformers

We compared a BiLSTM baseline against fine-tuned DistilGPT-2 and XLNet under a matched sentiment-classification setup, varying the classification head (built-in vs a custom MLP) and the truncation/max sequence length. XLNet performed best, and longer context helped, with gains becoming marginal beyond about 1024 tokens.

Tags: NLP · Sentiment Analysis · Transformers · RNN · Truncation Study · Kaggle

Approach

Three model tracks under the same sentiment task and evaluation setup.

BiLSTM (from scratch)

Bidirectional LSTM baseline trained on tokenized IMDb reviews for binary sentiment classification.

What we varied: max_len + head (built-in vs MLP)
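A minimal sketch of the BiLSTM track in PyTorch. The embedding size, hidden size, and vocabulary size below are illustrative assumptions, not the values used in the experiments; the source only specifies a bidirectional LSTM with a binary classification head.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over token ids; final fwd/bwd hidden states -> binary logit."""

    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)  # concat of fwd/bwd final states

    def forward(self, ids):
        x = self.embed(ids)                       # (B, T, E)
        _, (h, _) = self.lstm(x)                  # h: (2, B, H)
        feats = torch.cat([h[0], h[1]], dim=-1)   # (B, 2H)
        return self.head(feats).squeeze(-1)       # (B,) raw logits

model = BiLSTMClassifier()
# batch of 4 reviews truncated/padded to max_len=500
logits = model(torch.randint(1, 10000, (4, 500)))
print(logits.shape)  # torch.Size([4])
```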

DistilGPT-2 (fine-tune)

Fine-tuned DistilGPT-2 on the same review split to compare pretrained transfer against the RNN baseline.

What we varied: max_len + head (built-in vs MLP)

XLNet (fine-tune)

Fine-tuned XLNet with matched preprocessing and evaluation to test stronger long-context handling.

What we varied: max_len + head (built-in vs MLP)
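The "built-in vs MLP" head variation in all three tracks can be sketched as swapping the single linear classification layer for a small MLP over the pooled transformer features. The layer widths and dropout rate here are assumptions for illustration; 768 matches the base hidden size of both DistilGPT-2 and XLNet-base.

```python
import torch
import torch.nn as nn

def mlp_head(hidden_size=768, inner=256, n_classes=2, dropout=0.1):
    """Custom MLP head over the pooled transformer output, replacing the
    built-in single Linear classification layer."""
    return nn.Sequential(
        nn.Linear(hidden_size, inner),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(inner, n_classes),
    )

pooled = torch.randn(4, 768)   # stand-in for pooled transformer features
logits = mlp_head()(pooled)
print(logits.shape)            # torch.Size([4, 2])
```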

Dataset facts + experiment variable

  • Training reviews: 25,000 labeled IMDb
  • Class balance: 12,500 negative / 12,500 positive
  • Length stats: mean 234, median 174, IQR 127-284, max 2470
  • max_len: RNN: 500/1024 • Transformers: 500/1024/1200
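The max_len variable amounts to truncating each tokenized review before feeding it to the model. A minimal sketch, assuming head truncation (keeping the first max_len tokens), which the source does not specify:

```python
def truncate(token_ids, max_len):
    """Keep only the first max_len tokens (head truncation, assumed here)."""
    return token_ids[:max_len]

review = list(range(2470))  # longest review in the set is 2470 tokens
for max_len in (500, 1024, 1200):
    print(max_len, len(truncate(review, max_len)))
```

With the length stats above (median 174, IQR 127-284), most reviews fit untruncated even at max_len=500; only the long tail is cut, which is consistent with the diminishing returns beyond ~1024.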

Results

Test accuracy only.

Model               max_len=500   max_len=1024   max_len=1200
BiLSTM              0.8727        0.8905         -
DistilGPT-2         0.9225        0.9282         -
DistilGPT-2 + MLP   0.9243        0.9299         -
XLNet               0.9524        0.9578         0.9571
XLNet + MLP         0.9525        0.9572         0.9570

Best overall: XLNet 0.9578 @ max_len=1024

Test accuracy computed on Kaggle using 50% of the test set.

Takeaways + Links

  • Pretrained transformers outperform the BiLSTM baseline on IMDb sentiment.
  • XLNet consistently achieves the top accuracy.
  • Increasing max sequence length helps, with diminishing returns beyond ~1024.