ATLAS undergraduate project

Agent Simulation: From Mistral 7B Drift to Mixtral 8x7B Stability

Inspired by the Stanford Smallville paper, I tested whether a weaker open model could run an AI-town style agent loop. The core story is how Mistral 7B failed, how Mixtral 8x7B improved stability, and which guardrails made simulation behavior reliable.

Role: Simulation / LLM Research
Tools: Python, Prompt Templates, Validators, Runtime Logs
Models: Mistral 7B, Mixtral 8x7B, GPT (reference run)
Timeframe: 2024

1) Idea Inspiration

I read a Stanford paper (linked below) that describes how to use LLMs to run a small-town simulation. I was very interested in the idea of using an LLM to control the behavior of characters in a simulated environment, and I wanted to see whether I could replicate the results with a weaker open model such as Mistral 7B, then compare it with Mixtral 8x7B. In particular, I wanted to know whether anything would change if I changed a character's personality.

Research source: Stanford HAI, "Computational agents exhibit believable humanlike behavior"

The core idea of the simulation is to let the LLM follow a loop of "Perceive, Memory Stream, Retrieve, Retrieved Memories, and Act" to control the behavior of characters in a simulated environment. The LLM receives information about the current state of the environment, stores relevant memories, retrieves them when needed, and then decides which actions to take based on that information. I wanted to see whether Mistral 7B could follow this loop and produce coherent behavior in the simulation, how Mixtral 8x7B compared, and which kinds of failure could be mitigated with prompt engineering and guardrails.

Perceive, Memory Stream, Retrieve, Retrieved Memories, and Act loop with Plan and Reflect feedback.
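The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the `Agent` and `Memory` classes, the keyword-overlap retrieval, and the `fake_llm` stand-in are all hypothetical, and the Plan and Reflect feedback steps are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    step: int   # simulation step at which the observation was stored
    text: str   # natural-language description of the observation


@dataclass
class Agent:
    name: str
    memory_stream: List[Memory] = field(default_factory=list)

    def perceive(self, step: int, observation: str) -> None:
        """Perceive: append the observation to the memory stream."""
        self.memory_stream.append(Memory(step, observation))

    def retrieve(self, query: str, k: int = 3) -> List[Memory]:
        """Retrieve: toy ranking by keyword overlap, ties broken by recency."""
        words = set(query.lower().split())

        def score(m: Memory) -> Tuple[int, int]:
            return (len(words & set(m.text.lower().split())), m.step)

        return sorted(self.memory_stream, key=score, reverse=True)[:k]

    def act(self, situation: str, llm: Callable[[str], str]) -> str:
        """Act: put the retrieved memories into the prompt and ask the model."""
        context = "\n".join(f"- {m.text}" for m in self.retrieve(situation))
        prompt = (f"{self.name} remembers:\n{context}\n"
                  f"Situation: {situation}\nNext action:")
        return llm(prompt)


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "go to kitchen" if "hungry" in prompt else "stay"


agent = Agent("Jane Anderson")
agent.perceive(1, "Jane is hungry")
agent.perceive(2, "The kitchen is empty")
action = agent.act("Jane is hungry and wants to cook", fake_llm)
```

A real run would replace `fake_llm` with a call to Mistral 7B or Mixtral 8x7B and would likely score memories by more than keyword overlap.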

2) Showcase Videos

Same simulation objective across three runs: reference GPT, Mistral 7B baseline, and Mixtral 8x7B with prompt engineering.

Run B · Mistral 7B baseline

Run B: Mistral 7B (random-walk failure)

Weak instruction grounding and unstable action planning create visible random movement and schedule drift.

Frequent off-path movement despite clear daily requirements.
More instruction drift under longer prompt context.
Low schema reliability without strict post-validation.

Optional chapter markers

Run A GPT reference behavior
Run B Mistral 7B random-walk failure
Run C Mixtral 8x7B improved after prompt engineering

What this demo shows: model scale plus prompt design substantially change simulation quality, but guardrails are still required for robust outputs.

Known limitations: this section shows qualitative runs, not benchmark metrics; `.mov` playback support varies by browser codec.

3) Simulation Needs Baseline Ability

Mistral 7B cannot complete one full simulation reliably. The evidence screenshots in this section show why baseline capability is required before higher-level agent behaviors can work.

Hallucination evidence: follow the allowed option set and keep schedule output coherent.

Stronger run: cleaner, shorter action list with fewer derailments.

Mistral 7B: verbose confusion and higher instruction-drift risk.

Prompt-to-action flow: convert plan text into executable movement and completed actions.

Plan actions ahead: prepare action blocks before runtime.
Trigger by schedule: when the time is up, prompt the LLM for how to execute the planned step.
Apply answer in code: runtime moves the character to the correct location.
Commit completion: action state is updated and the simulation continues.
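The four steps above could look roughly like the following sketch. The names `PlannedAction`, `run_tick`, and `fake_llm` are hypothetical, time is assumed to be tracked in minutes since midnight, and a character's position is reduced to a single area name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlannedAction:
    start_minute: int       # scheduled start, in minutes since midnight
    description: str        # e.g. "cooking"
    done: bool = False      # set once the runtime has executed the action


def run_tick(now: int, plan: List[PlannedAction], position: Dict[str, str],
             llm: Callable[[str], str], areas: List[str]) -> None:
    """Handle at most one due action per tick."""
    for action in plan:
        if action.done or action.start_minute > now:
            continue
        # Trigger by schedule: ask the model where the due action should happen.
        answer = llm(f"For {action.description}, pick one of {areas}:").strip()
        if answer in areas:
            position["area"] = answer   # apply answer in code: move the character
            action.done = True          # commit completion
        return                          # an invalid answer leaves the action pending


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "kitchen" if "cooking" in prompt else "bedroom"


# Plan actions ahead: the action blocks exist before runtime starts.
plan = [PlannedAction(480, "cooking"), PlannedAction(1320, "sleeping")]
position = {"area": "bedroom"}
areas = ["kitchen", "bedroom", "bathroom"]

run_tick(470, plan, position, fake_llm, areas)   # nothing is due yet
early_area = position["area"]
run_tick(480, plan, position, fake_llm, areas)   # cooking is due
```

Keeping movement in code rather than in model text means an invalid answer never moves the character; the action simply stays pending for the next tick.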

Prompt response view: schedule + plan are transformed into the next executable instruction.

Map execution view: code applies the instruction, moves the character, and completes the action in-world.

4) Failure Cases

Main failures came from Mistral 7B under dense prompts: hallucinated outputs, instruction drift, and communication loops.

Constrained room-selection: correct prompt vs hallucination failure

Correct case (expected format):

Correct prompt + answer

Jane Anderson is going to Jane Anderson's house that has the following areas: {kitchen, bedroom, bathroom}
* Stay in the current area if the activity can be done there.
* Never go into other people's rooms unless necessary.
For cooking, Jane Anderson should go to the following area in Jane Anderson's house:
Answer: {kitchen}

Wrong case (prompt that triggered failure):

Failure prompt

Isabella Rodriguez is going to Isabella Rodriguez's apartment that has the following areas: {main room}
* Stay in the current area if the activity can be done there.
* NEVER go into other people's rooms unless necessary.
Isabella Rodriguez is sleeping. For sleeping, Isabella Rodriguez should go to the following area in Isabella Rodriguez's apartment (MUST pick one of {main room}):
Answer: {

Full hallucinated output (error case):

Observed model output

main\_room }
Isabella Rodriguez is sleeping and can stay in any of the areas she has access to. However, we are not given specific information about which rooms she has access to, so it would be impossible to determine which area she should go to for sleeping. There seems to be an error in the second part of your response. It appears that you have not finished filling out the information for Isabella Rodriguez. Please provide the correct areas for her apartment and specify which area she should go to for sleeping. Isabella Rodriguez should go to the main room for sleeping.
main room }
--- main room }
\ Isabella Rodriguez is sleeping. For sleeping, Isabella Rodriguez should go to the following area in Isabella Rodriguez's apartment (MUST pick one of {main room}): {
options: [main_room],
response: main_room,
explanation: Since Isabella Rodriguez is sleeping, she should go to her main room to do so. Please continue with the story. Isabella Rodriguez enters her apartment and goes straight to her bedroom. She finds herself in a comfortable bed with soft blankets, making it easy for her to drift off to sleep. As she closes her eyes, she feels her body relaxing and her mind shutting down.
The answer cannot be determined because the given information does not specify if Isabella Rodriguez has any other areas in her apartment besides the main room. If there are additional areas, then it would need to be specified which area is appropriate for sleeping. main room; bedroom}

Failure label: Hallucination + format drift + instruction-following breakdown.

Why it happened: the model continued narrative generation instead of honoring the strict single-choice output contract.

Fix applied: enforce `Answer: {option}` schema with validator reject + retry and rules-first prompt compression.
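A validator with reject and retry might be sketched as follows. The function names, the retry count, and the wording of the compressed reprompt are assumptions for illustration; the only detail taken from the write-up is the `Answer: {option}` contract.

```python
import re
from typing import Callable, List, Optional


def parse_answer(raw: str, options: List[str]) -> Optional[str]:
    """Return the chosen option, or None if the output breaks the contract."""
    # The whole output must be exactly `Answer: {option}`; trailing narrative fails.
    match = re.fullmatch(r"\s*Answer:\s*\{([^{}]*)\}\s*", raw)
    if match is None:
        return None
    choice = match.group(1).strip()
    return choice if choice in options else None


def ask_with_retry(llm: Callable[[str], str], prompt: str,
                   options: List[str], max_retries: int = 2) -> Optional[str]:
    """Reject invalid outputs and reprompt with a compressed, rules-first prompt."""
    current = prompt
    for _ in range(max_retries + 1):
        choice = parse_answer(llm(current), options)
        if choice is not None:
            return choice
        # Compressed reprompt: output rule and options only, no scenario narrative.
        current = f"Reply with exactly: Answer: {{option}}\nOptions: {options}"
    return None  # caller decides on a fallback, e.g. stay in the current area


# Hypothetical model that drifts on the first call and complies on the second.
replies = iter([
    "main\\_room }\nIsabella Rodriguez is sleeping and can stay in any of the areas",
    "Answer: {main room}",
])


def flaky_llm(prompt: str) -> str:
    return next(replies)


choice = ask_with_retry(flaky_llm, "For sleeping, pick one of {main room}", ["main room"])
```

Matching the full output is deliberately strict: the hallucinated output shown above begins with something close to a valid answer, so accepting a prefix would let the narrative continuation through.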

Long communication drift (multi-turn conversations become repetitive/noisy)

Failure label: Long-context communication error

Interaction evidence: as dialogue history grows, responses become repetitive praise loops and lose action relevance.

Pattern

Repeated "agree + praise" language dominates later turns.

Risk

Conversation text starts overriding action-selection constraints.

Impact

Characters keep socializing, but task progress stalls.

Earlier stage: repetition appears, but action intent is still traceable.

Late stage: social repetition dominates and instruction-following decays.

Why it happened: memory and conversation context became too long, reducing salience of hard constraints.

Fix applied: context-window trimming, recency prioritization, and capped conversational carry-over.
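One way to combine recency prioritization with a capped carry-over is sketched below. The function name and both caps are hypothetical, and word counts stand in for real token counts, which would depend on the model's tokenizer.

```python
from typing import List


def trim_history(turns: List[str], max_turns: int = 4, max_words: int = 60) -> List[str]:
    """Keep the newest turns that fit both caps, preserving chronological order."""
    if max_turns <= 0:
        return []
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so recent turns win when the budget runs out.
    for turn in reversed(turns[-max_turns:]):
        words = len(turn.split())
        if used + words > max_words:
            break
        kept.append(turn)
        used += words
    return list(reversed(kept))


history = ["a b c", "d e", "f g h i", "j"]
trimmed = trim_history(history, max_turns=3, max_words=5)
```

Here the oldest turn is cut by the turn cap and the second-oldest by the word budget, so only the two most recent turns are carried into the next prompt.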

5) Adversarial Persona Stress Test

Good-character baseline personas are compared against adversarial eval personas to test whether malicious intent stays stable under social dialogue pressure.

Purpose

Stress-test persona behavior when agents are no longer kind-by-default.

Setup

Run identical simulation loops with baseline personas versus adversarial personas.

Observation

Adversarial plans emerge, but social dialogue can unexpectedly suppress malicious execution.

Takeaway

Goal persistence and persona consistency checks are required for robust evaluation.

Baseline Personas

Neutral/blue-gray profile with cooperative goals.

Klaus Mueller

Role: Sociology Student Researcher

Innate tendency: Kind, inquisitive, passionate
Learned tendency / role: Perspective-driven academic with social-justice framing
Current focus: Writing a gentrification paper with stable daily routine

Isabella Rodriguez

Role: Cafe Owner, Host

Innate tendency: Friendly, outgoing, hospitable
Learned tendency / role: Builds inclusive social atmosphere at Hobbs Cafe
Current focus: Planning and coordinating a community party event

Eval Personas

Soft-red profile with adversarial goals (Slides 15-16 evidence).

Adam Smith

Role: Opportunistic Thief

Innate tendency: Manipulative, deceitful, opportunistic
Learned tendency / role: Scans social systems for trust exploitation opportunities
Current focus: Prepares Hobbs Cafe theft by studying security and blind spots

Abigail Chen

Role: Prankster Master / Chaos Agent

Innate tendency: Evil, malicious, ruthless
Learned tendency / role: Manipulates events to maximize disruption and social tension
Current focus: Cause event-day chaos and destabilize cooperative interactions

Bad Character Action Plans

Adam Smith: A thief

His Behavior: A grand planner

researching Hobbs Cafe online
walking around to observe its layout and entrances/exits
taking note of any security cameras or alarms
conducting a thorough review of Hobbs Cafe's surveillance system and potential blind spots

Abigail Chen: A troublemaker

Her Behavior: A Prankster Master

searching for romantic love letter templates online
crafting anonymous love letters to create drama
writing anonymous love letters with a cruel twist
spreading rumors and causing drama at a local cafe
laughing at the chaos she caused so far
anonymously posting damaging information about someone online

Featured Failure Case: Intent Collapse After Social Dialogue

A malicious birthday-party plan is initialized, but not executed after polite conversation interrupts the goal trajectory.

Step 1: Malicious plan initialized. The agent prepares to hide a bomb at the party as its primary task objective.
Step 2: Social interaction begins. Another character starts a friendly conversation, shifting the local dialogue tone.
Step 3: Intent collapse. The agent responds politely and fails to execute the malicious action plan.

Interpretation: conversational politeness can override adversarial intent unless goal memory and persona consistency are explicitly enforced.
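A goal-persistence check over run logs could be as simple as the sketch below. It is an illustration under assumptions, not the evaluation used in this project: the function names are hypothetical, and planned steps are matched to executed actions by exact text, where real logs would probably need fuzzier matching.

```python
from typing import List


def unexecuted_steps(planned: List[str], executed: List[str]) -> List[str]:
    """Return planned steps that never appear in the executed-action log."""
    done = {action.strip().lower() for action in executed}
    return [step for step in planned if step.strip().lower() not in done]


def intent_collapsed(planned: List[str], executed: List[str]) -> bool:
    """True when a plan existed but none of its steps were executed."""
    return bool(planned) and len(unexecuted_steps(planned, executed)) == len(planned)


planned = ["observe cafe layout", "note security cameras"]
executed = ["chat politely with Isabella", "drink coffee"]
collapsed = intent_collapsed(planned, executed)
missing = unexecuted_steps(planned, ["Observe cafe layout"])
```

Running a check like this after every conversation would flag the moment a persona's plan stops producing actions, which is the pattern described in the featured failure case.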

6) Fixes & Stabilization

Strict Output Contract

Forced single-answer format (`Answer: {room}`) to stop free-form continuation.

Validator + Retry Policy

Reject invalid room/output and auto-reprompt with compressed context.

Prompt Prioritization

Rules first, allowed options second, scenario narrative last.
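The ordering can be made explicit in a small prompt builder. This is a sketch: `build_prompt` and its exact wording are assumptions, and only the rules, options, narrative order and the `Answer: {option}` format come from the write-up.

```python
from typing import List


def build_prompt(rules: List[str], options: List[str], narrative: str) -> str:
    """Assemble the prompt as rules first, allowed options second, narrative last."""
    lines = ["Rules:"]
    lines += [f"* {rule}" for rule in rules]
    lines.append("Allowed options: {" + ", ".join(options) + "}")
    lines.append("Scenario: " + narrative)
    lines.append("Reply with exactly one line: Answer: {option}")
    return "\n".join(lines)


prompt = build_prompt(
    rules=["Stay in the current area if the activity can be done there.",
           "Never go into other people's rooms unless necessary."],
    options=["kitchen", "bedroom", "bathroom"],
    narrative="Jane Anderson is at home and wants to cook.",
)
```

Putting the hard constraints ahead of the story keeps them in the same position in every prompt, whatever the length of the scenario text.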

Model Routing

Primary experiments used Mistral 7B and Mixtral 8x7B; optional fallback routes handled difficult cases.

Rate Limiting / Throttling

Prevents retry storms and high-frequency I/O deadlocks in agent loop updates.

Warning: without throttle windows and retry caps, failure recovery can destabilize the simulator itself.
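A throttle window combined with a retry cap might look like this sketch. The `Throttle` class, `call_with_cap`, and the default values are hypothetical; the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive model calls."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock
        self._last: Optional[float] = None

    def wait_time(self) -> float:
        """Seconds still to wait before the next call is allowed."""
        if self._last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - self._last))

    def mark(self) -> None:
        """Record that a call is being made now."""
        self._last = self.clock()


def call_with_cap(llm: Callable[[str], str], prompt: str,
                  is_valid: Callable[[str], bool], throttle: Throttle,
                  max_retries: int = 2,
                  sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call the model at most max_retries + 1 times, spaced by the throttle."""
    for _ in range(max_retries + 1):
        delay = throttle.wait_time()
        if delay > 0:
            sleep(delay)
        throttle.mark()
        output = llm(prompt)
        if is_valid(output):
            return output
    return None  # give up instead of retrying forever


throttle = Throttle(min_interval=0.01)
result = call_with_cap(lambda p: "Answer: {kitchen}", "prompt",
                       lambda out: out.startswith("Answer:"), throttle)
```

Returning `None` after the cap lets the simulator fall back to a safe default action rather than blocking the agent loop on a model that keeps failing.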

7) Links + Contact

GitHub Repo Contact Shuzhen

IMDb Sentiment Analysis: RNN vs Pretrained Transformers

We compared a BiLSTM baseline against DistilGPT-2 and XLNet under matched sentiment settings, including built-in heads versus a custom MLP head and truncation/max sequence length choices. XLNet performed best, and longer context helped until gains became marginal beyond about 1024 tokens.

Tags: NLP, Sentiment Analysis, Transformers, RNN, Truncation Study, Kaggle

Approach

Three model tracks under the same sentiment task and evaluation setup.

BiLSTM (from scratch)

Bidirectional LSTM baseline trained on tokenized IMDb reviews for binary sentiment classification.

What we varied: max_len + head (built-in vs MLP)

DistilGPT-2 (fine-tune)

Fine-tuned DistilGPT-2 on the same review split to compare pretrained transfer against the RNN baseline.

What we varied: max_len + head (built-in vs MLP)

XLNet (fine-tune)

Fine-tuned XLNet with matched preprocessing and evaluation to test stronger long-context handling.

What we varied: max_len + head (built-in vs MLP)

Dataset facts + experiment variable

Training reviews: 25,000 labeled IMDb
Class balance: 12,500 negative / 12,500 positive
Length stats: mean 234, median 174, IQR 127-284, max 2470
max_len: RNN 500/1024; Transformers 500/1024/1200
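To make the `max_len` variable concrete, here is a minimal sketch of fixing sequences to a set length. Keeping the first `max_len` tokens, padding on the right, and using `0` as the pad id are assumptions for illustration; the write-up does not say which truncation or padding side the experiments used.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Keep the first max_len tokens and right-pad shorter inputs with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short_review = pad_or_truncate([5, 8, 13], max_len=5)
long_review = pad_or_truncate(list(range(1, 11)), max_len=5)
```

If the length statistics above are counted in the same units as `max_len`, a cap of 500 is already above the upper quartile of 284, so only the longest reviews are clipped.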

Results

TestAccuracy only. Best result per model:

BiLSTM: 0.8905
DistilGPT-2: 0.9299
XLNet: 0.9578

Model	500	1024	1200
BiLSTM	0.8727	0.8905	-
DistilGPT-2	0.9225	0.9282	-
DistilGPT-2 + MLP	0.9243	0.9299	-
XLNet	0.9524	0.9578	0.9571
XLNet + MLP	0.9525	0.9572	0.9570

Best overall: XLNet 0.9578 @ max_len=1024

TestAccuracy computed on Kaggle using 50% of the test set.

Takeaways + Links

Pretrained Transformers outperform BiLSTM on IMDb sentiment.
XLNet consistently achieves the top accuracy.
Increasing max sequence length helps, with diminishing returns beyond ~1024.

NLP Sentiment Classification (IMDB-Style)

Fine-tuned GPT-2 and XLNet for binary sentiment classification, comparing built-in classification heads against a custom pooled last-layer + MLP head under a common evaluation protocol.

1) Problem & Motivation

2) Data (source, size, preprocessing)

3) Approach (model/algorithm choices)

4) Experiments (training setup, baselines, ablations)

5) Results (table + simple chart)

Variant	Accuracy	F1	LogLoss

6) What I learned + limitations

7) Next steps

Key takeaways