Wall Street Keeps Testing AI Traders, But Most Are Still Underperforming
What they're not telling you: large language models are losing money when given actual trading authority, and Wall Street's quiet experiments reveal a technology industry overselling capabilities it hasn't yet delivered. Recent competitions testing AI models from OpenAI, Anthropic, Google, and xAI show a pattern of consistent underperformance that contradicts the hype surrounding artificial intelligence in finance. According to Bloomberg's analysis of these tests, the models frequently lost money, traded excessively, and made erratic decisions even when given identical instructions.
What the Documents Show
In Alpha Arena, a competition created by startup Nof1, eight AI models were each given $10,000 to trade U.S. tech stocks over two weeks. Across four separate competitions, the models collectively lost roughly one-third of their capital, and only six of the 32 outcomes (eight models in each of four contests), or about 19%, ended in profit. The results suggest that, despite months of headlines promising AI-driven investing, the technology remains unreliable for actual portfolio management. The inconsistency extends beyond mere losses.
Follow the Money
Under identical prompts, competing models behaved very differently: xAI's Grok 4.20 executed just 158 trades in one contest, while Alibaba's Qwen made 1,418 under the exact same conditions, roughly nine times as many. This variation suggests the models lack the coherent decision-making frameworks necessary for investing, instead generating responses that diverge dramatically based on subtle differences in how they process information.

Nof1 founder Jay Azhang identified the core problem: current models struggle with foundational trading concepts including "position sizing, timing, signal weighting and overtrading." These aren't edge-case failures; they're fundamental gaps in how these systems approach risk management.

Broader data point the same way. Research blog Flat Circle analyzed 11 public AI trading competitions and found that while each event produced at least one profitable model, only two generated a profitable median return. The distinction matters: in the other nine events the median model failed to turn a profit, which means at least half the field did no better than break even. A single winner per contest says little about the typical bot, and that is a red flag for any technology being positioned as a replacement for human judgment.
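The gap between a contest's best result and its median result is easy to see with a few lines of arithmetic. The sketch below uses invented returns for a single eight-model contest; the figures and the `summarize` helper are hypothetical, chosen only to illustrate the distinction, and are not drawn from Alpha Arena or the Flat Circle analysis.

```python
from statistics import median

# Hypothetical per-model returns (fractions of starting capital) for one
# contest. These numbers are invented for illustration only.
contest_returns = [0.12, -0.05, -0.18, -0.30, -0.02, -0.41, 0.03, -0.22]


def summarize(returns):
    """Return (best, median, share of profitable models) for one contest."""
    profitable = sum(1 for r in returns if r > 0)
    return max(returns), median(returns), profitable / len(returns)


best, mid, win_share = summarize(contest_returns)
print(f"best: {best:+.1%}, median: {mid:+.1%}, profitable: {win_share:.0%}")
```

In this made-up field the top model finishes well in the black while the median model loses money, so the contest would count toward "at least one profitable model" but not toward "profitable median return."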
What Else We Know
What's notable is how cautiously Wall Street itself has approached this technology despite years of promotional messaging. JPMorgan Chase and Balyasny Asset Management already deploy AI for research and fraud detection, but both firms have deliberately avoided delegating actual investment decisions to these systems. Azhang was direct about the current state: deploying an LLM with autonomous trading authority "isn't a thing yet."

For ordinary investors, this gap between promise and performance carries real implications. While venture capital and tech companies have aggressively marketed AI as the future of wealth management, the evidence from these competitions suggests retail investors should remain skeptical of robo-advisors or AI-driven services that make independent trading decisions. The technology may eventually reach that capability, but current testing shows we're nowhere close. Until these systems demonstrate consistent profitability and coherent decision-making, the safest assumption is that human oversight, or a traditional passive strategy, remains the more reliable path to protecting wealth.
Primary Sources
- Source: ZeroHedge
- Category: Money & Markets
Disclosure: NewsAnarchist aggregates from public records, API feeds (Federal Register, CourtListener, MuckRock, Hacker News), and independent media. AI-assisted synthesis. Always verify primary sources linked above.

