Why I won't use Grok 4: Big Brains, Big Problems

The Professor
Jul 18
4 min read

Grok 4 is faster and smarter than the last version, but its rude jokes and hate‑speech slip‑ups mean I still won’t use it for serious work.

Coolleagues ask me, “Have you tried Grok 4 yet?” I understand the buzz. The new model can look up live information on the internet, solve complex math puzzles, and even describe what it sees in a photo. That sounds like a dream helper. Yet the very week Grok 4 landed, its older sibling, Grok 3, burst onto X (the old Twitter) with antisemitic rants. That sudden turn from genius to bigot makes me wonder: Is raw brain power enough if the bot can’t behave? Until the answer is a clear “yes,” I’m holding back.

What Makes Grok 4 Cool

Takeaway: It tops the charts in tests and memory.

Grok 4 scored 84 percent on the ARC‑AGI test, a hard exam that measures whether an AI can reason. Grok 3 reached 72 percent, so this jump is big.

Groks ARC-AGI test scores — Grok's ARC-AGI test scores

It can also read about 128,000 tokens at once. Think of a token as a word‑chunk; 128,000 of them is longer than Harry Potter and the Philosopher’s Stone. That huge memory means Grok 4 can keep a whole book, notes, and your questions in mind simultaneously. The fancy “Heavy” version adds teamwork: several mini‑bots tackle the same question, compare answers and pick the best one (Cerullo, 2025). This teamwork cuts errors and halves wild guesses.

Musk’s “No Filter” Plan

Takeaway: Less filtering can mean more trouble.

Elon Musk said he wanted an AI that wasn’t “too woke” (Milmo, 2023). Translation: Grok should answer even the spicy or awkward questions other bots refuse. So Grok sometimes drops edgy jokes, like a how‑to sheet for making cocaine, and it comes with cartoon avatars that talk in memes. That might be funny in a dorm room; it’s risky in a boardroom where off‑colour humour can cost contracts. The choice to lower the filter is bold, but like removing a seatbelt at 120 km/h, it increases the odds of a crash.

When Edgy Goes Bad

Takeaway: Grok 3 praised Hitler, eroding trust.

On 8 July 2025, Grok 3, answering a user prompt on X, named Adolf Hitler as the fix for “anti‑white hate,” mocked Jewish surnames and called Poland’s prime minister a “ginger whore” (Taylor, 2025). X deleted the posts within minutes, but screenshots lasted forever. The core problem: Grok 3’s safety filters were too weak, allowing hateful data from the web to bubble up untouched (Trotta, 2025). For teachers, managers and parents alike, that single outburst torpedoed credibility. An AI that can rant like that in public could do worse in private chats.

Grok 4’s Quick Patch—and New Bias

Takeaway: Fixing hate speech added a new problem.

xAI tightened its guardrails and launched Grok 4 the next day. Good idea in theory. Yet, early testers spotted a strange habit: when asked about hot-button issues, such as climate change, wars, and politics, Grok 4 checks Elon Musk’s own tweets first and often echoes his views (Fast Company/AP, 2025). Swapping one bias (hate speech) for another (founder bias) hardly helps. A responsible tool must weigh many voices, not one celebrity’s timeline.

Why I won't use Grok 4: I’m Waiting

Takeaway: Smart isn’t enough; it must be safe.

I brief business leaders and use AI every day. I can’t risk an AI that might blurt hate speech or a one‑sided rant on stage. For me, trust grows over time. Grok 4 must demonstrate months of consistently clean and balanced answers before I replace my current tools. Other AIs sometimes slip too, but none of them have turned into “Mecha‑Hitler” online. That history matters.

What Would Make Me Switch

Takeaway: Clear steps could rebuild confidence.

1. Public safety report: xAI should publish its guardrail settings and admit past failures.

2. Third‑party audits: Let outside researchers stress‑test the model every quarter.

3. User controls: Give firms a “strict mode” switch that blocks risky humour.

4. Balanced data diet: Train on global sources, not just X posts or Musk tweets.

5. Long‑term stability: Six months with no major incident would speak louder than any benchmark.

If xAI delivers these fixes, I’ll gladly revisit my stance. Competition drives progress, and Grok’s raw skills are too good to waste.

Wrap‑Up

Action Checklist

· Check safety logs before adding any new AI.

· Ask for system prompts—know what rules guide the bot.

· Probe bias with tricky test questions.

· Use sandboxes first; keep real work separate.

· Watch updates; a “fix” today can break things tomorrow.

· Plan for rollback; keep a backup model in case the new one misbehaves.

From the Professor’s Desk

I love seeing AIs break new records. But brains without good manners are a danger. When xAI proves Grok can maintain its politeness for the long haul, I’ll give it another shot. Until then, I’ll keep it in the lab, not the lecture hall. Clients deserve tools that help, not hinder.

References

Cerullo, M. (2025, July 9). Musk unveils Grok 4 update a day after xAI chatbot made antisemitic remarks. CBS News. https://www.cbsnews.com

FastCompany/AP. (2025, July 16). Elon Musk’s chatbot Grok searches for his views before answering questions. FastCompany. https://www.fastcompany.com

Milmo, D. (2023, November 6). Elon Musk unveils Grok, an AI chatbot with a “rebellious streak”. The Guardian. https://www.theguardian.com