Predicting World Cup Knockout Matches: How Much Do Different AIs Differ?

Odaily Senior Author

2026-07-02 01:42

Gemini and DeepSeek script upsets and dark horses; Grok and Qwen nail narrow scorelines in matches with a clear favorite; ChatGPT and Claude are better suited to analyzing how a match will unfold.

AI Summary

Core Insight: This article compares the predictions of six AI models, including ChatGPT, Grok, and DeepSeek, for the knockout stage of the 2026 World Cup. The results reveal a clear divergence in performance: DeepSeek and Gemini accurately predicted upsets (e.g., Morocco eliminating the Netherlands on penalties), Grok and Qwen correctly forecast scores in matches with a clear favorite, while analytical models like ChatGPT leaned towards favorites and struggled to capture surprises.
Key Elements:
1. DeepSeek and Gemini successfully predicted the Netherlands vs. Morocco match: Gemini directly gave a "script" of a 1-1 draw in regular time and a Morocco win on penalties, matching the actual 1-1 result and the 3-2 penalty shootout; DeepSeek leaned towards Morocco pulling off an upset through defensive counterattacks.
2. Grok and Qwen accurately predicted the exact scores in three matches with a clear favorite (Canada 1-0 South Africa, Brazil 2-1 Japan, Norway 2-1 Ivory Coast), demonstrating a more nuanced judgment of whether a favorite would win narrowly or dominate.
3. In matches like Brazil vs. Japan and England vs. DR Congo, ChatGPT got the direction right while emphasizing the challenges posed by opponents (e.g., Japan's pressure, DR Congo's deep defense). Its process analysis was accurate, but its score predictions did not stand out. Like Claude, it sided with the favorite in matches that ended in upsets (e.g., Netherlands vs. Morocco).
4. All AI models collectively failed in the Germany vs. Paraguay match: they unanimously favored Germany beforehand (predicting scores like 2-0 and 3-0) and underestimated Paraguay's defensive resilience. Germany was ultimately eliminated in a penalty shootout.
5. Summary of each model's applicability: DeepSeek and Gemini excel at spotting upsets and crafting scripts; Grok and Qwen are best for predicting scores in matches with a clear favorite; ChatGPT and Claude are suitable for understanding match challenges but tend to side with the favorites and lack the decisiveness to call upsets.

Original | Odaily Planet Daily (@OdailyChina)

Author | Asher (@Asher_0210)

Before every World Cup match, I ask AI to make predictions. Almost every model sounds plausible and provides detailed insights.

Some talk about team values, some analyze group stage data, some discuss injuries and tactics, and others directly predict scores, extra time, and penalty shootout scenarios. At first glance, ChatGPT, Grok, Qwen, DeepSeek, Gemini, and Claude all seem to understand football quite well.

But as a prediction market user, what I really care about is not which model gives the most complete explanation, but which one is more trustworthy.

Since the World Cup entered the knockout stage, Odaily Planet Daily has asked different AI models the same questions before each game, starting from the first match, and then compared their predictions with the actual results. The aim is to see which models merely sounded analytical and which ones truly captured the match's trajectory in advance.
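The comparison described above can be made concrete with a small tally. The following is only a minimal sketch, not the procedure Odaily actually used: the names `Result`, `Prediction`, `grade`, and `tally` are hypothetical, the three grades ("exact", "direction", "miss") are an assumed scheme, and only a handful of predictions and results quoted in this article are entered.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Score before any penalty shootout, in the order the teams are named.
Score = Tuple[int, int]


@dataclass(frozen=True)
class Result:
    advancer: str  # team that went through
    score: Score


@dataclass(frozen=True)
class Prediction:
    model: str
    match: str
    advancer: str
    score: Optional[Score] = None  # None if the model named no single scoreline


def grade(pred: Prediction, result: Result) -> str:
    """Classify one prediction as 'exact', 'direction', or 'miss' (assumed scheme)."""
    if pred.advancer != result.advancer:
        return "miss"
    if pred.score is not None and pred.score == result.score:
        return "exact"
    return "direction"


def tally(preds: List[Prediction], results: Dict[str, Result]) -> Dict[str, Dict[str, int]]:
    """Count grades per model."""
    table: Dict[str, Dict[str, int]] = {}
    for p in preds:
        counts = table.setdefault(p.model, {"exact": 0, "direction": 0, "miss": 0})
        counts[grade(p, results[p.match])] += 1
    return table


# Only results and predictions quoted in this article are entered here.
RESULTS = {
    "Canada vs. South Africa": Result("Canada", (1, 0)),
    "Brazil vs. Japan": Result("Brazil", (2, 1)),
    "Norway vs. Ivory Coast": Result("Norway", (2, 1)),
    "Netherlands vs. Morocco": Result("Morocco", (1, 1)),  # Morocco won 3-2 on penalties
}

PREDICTIONS = [
    Prediction("Grok", "Canada vs. South Africa", "Canada", (1, 0)),
    Prediction("Grok", "Brazil vs. Japan", "Brazil", (2, 1)),
    Prediction("Grok", "Norway vs. Ivory Coast", "Norway", (2, 1)),
    Prediction("Gemini", "Netherlands vs. Morocco", "Morocco", (1, 1)),
    # DeepSeek gave "1-1 or 0-0", so no single scoreline is recorded.
    Prediction("DeepSeek", "Netherlands vs. Morocco", "Morocco"),
    # ChatGPT leaned towards the Netherlands; its scoreline is not quoted.
    Prediction("ChatGPT", "Netherlands vs. Morocco", "Netherlands"),
]

if __name__ == "__main__":
    for model, counts in tally(PREDICTIONS, RESULTS).items():
        print(model, counts)
```

With just these entries, Grok registers three exact hits, Gemini one exact hit, DeepSeek one direction-only hit, and ChatGPT one miss. This is a partial illustration rather than a full scoreboard. Separating "exact" from "direction" is what lets the tally distinguish a model that names the right winner from one that also calls the scoreline.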

In the knockout matches concluded so far, Canada edged South Africa 1-0, Brazil narrowly beat Japan 2-1, Germany was eliminated by Paraguay after a penalty shootout, and the Netherlands also fell to Morocco on penalties. Belgium vs. Senegal was level at 2-2 before being settled by a comeback in extra time, fully showcasing the uncertainty of the knockout stage.

DeepSeek and Gemini Shine with Their Predictions on Morocco

The most memorable predictions so far are those by DeepSeek and Gemini for the Netherlands vs. Morocco match. Before this game, it was easy to pick the wrong side: the Netherlands were stronger on paper and had a more complete lineup. Many models knew Morocco would be tough, but ultimately leaned toward the Netherlands advancing.

Where DeepSeek and Gemini excelled was that they didn't stop at saying "this match will be tight." Instead, they wrote out a script for how it would unfold. Gemini directly predicted a 1-1 draw in regular time, with Morocco winning on penalties. The match indeed ended 1-1, and Morocco eliminated the Netherlands 3-2 on penalties. Gemini didn't just get the direction right; it accurately foresaw how the game would be dragged into penalties and who would eventually prevail.

Gemini's prediction for the Netherlands vs. Morocco match

DeepSeek was also very close. It judged that regular time would likely end 1-1 or 0-0 and that the match could drag into extra time or even penalties, and it leaned towards Morocco pulling off an upset through defense and counterattacks.

DeepSeek's prediction for the Netherlands vs. Morocco match

After this match, DeepSeek and Gemini's standing rose immediately. Gemini, in particular, seemed less to be making a pre-match prediction than to have seen the game script in advance.

Grok and Qwen Consistently Hit Exact Scores, Showing Stronger Stability Than Expected

Beyond DeepSeek and Gemini's standout performance with Morocco, Grok and Qwen also made their mark. Their most impressive feat came in matches where the winner was relatively clear: they not only correctly predicted the advancing team but also gave specific scores very close to the final results.

The South Africa vs. Canada match is a case in point. Most AI models favored Canada before the match, but the debate was whether Canada would win comfortably. Grok predicted a 1-0 win for Canada, and Qwen also suggested a narrow one-goal victory. In the end, Canada indeed won by just one goal, failing to achieve the lopsided victory many imagined.

Qwen's prediction for the South Africa vs. Canada match

The Brazil vs. Japan match was similar. Most AI models acknowledged Brazil's strength, but the key question was whether Japan could keep the game tight. Both Grok and Qwen predicted a 2-1 scoreline, and the match indeed ended with Brazil narrowly winning 2-1. They correctly assessed not only that Brazil would win, but that Japan would pose significant challenges.

In the Ivory Coast vs. Norway match, both models were also accurate. Norway boasts Haaland, so their advancement wasn't hard to predict, but Ivory Coast's physicality and wing attacks would prevent a one-sided affair. Grok and Qwen both predicted a 2-1 win for Norway, and the final score matched this "script."

Grok's prediction for the Ivory Coast vs. Norway match

The strength of Grok and Qwen lies in their more detailed analysis of matches with a clear favorite. They didn't predict major upsets like Morocco eliminating the Netherlands, but in matches involving Canada, Brazil, Norway, and France, they picked the right winner and gave accurate scorelines. In other words, they may not be the best at spotting underdog wins, but they excel at judging whether a favorite will dominate or scrape through.

ChatGPT Didn't Deliver Many Perfect Score Predictions, but Its Match Process Analysis Was Accurate

Unlike Gemini, ChatGPT didn't predict Morocco eliminating the Netherlands on penalties, and unlike Grok and Qwen, it didn't consistently hit specific scores. Its strength is that in many matches where the favorite seemed to have an edge, it warned more clearly that the game might not be so easy.

Take Brazil vs. Japan. ChatGPT predicted Brazil would advance, but instead of portraying an easy win, it highlighted that Japan's pressing, movement, and discipline would make Brazil uncomfortable, and Japan might even score first or equalize. Similarly, for Ivory Coast vs. Norway, ChatGPT predicted a Norway win but warned it wouldn't be easy, as Ivory Coast's physicality, wing attacks, and transition play would create problems.

Furthermore, in the England vs. DR Congo knockout match, ChatGPT didn't simply predict a big England win but suggested the game might be quite dull, with DR Congo using a low block to slow the tempo. England did advance, but not comfortably.

ChatGPT's prediction for the England vs. DR Congo match

ChatGPT's forte is not nailing exact scores every time, but identifying a match's potential obstacles beforehand. It is useful for understanding the game's dynamics, but less suited to those who just want a final score prediction. It can accurately describe the process, but it lacks decisiveness when it comes to predicting major upsets.

Germany's Exit Became a Collective Failure for AI Models

If the previous matches showed different models' strengths, the Germany vs. Paraguay match was a collective failure.

Before the match, all the AI models sided with Germany. ChatGPT, Grok, Qwen, Gemini, and Claude each favored a German win, with score predictions mostly concentrated on 2-0, 3-0, or 3-1. Their reasoning was consistent: Germany was stronger on paper, with better squad depth and superior attacking firepower.

But that's where things went wrong. The AI models underestimated Paraguay's ability to drag the game into a quagmire. Germany failed to secure a win in regular time, couldn't break the deadlock in extra time, and was ultimately eliminated in a penalty shootout.

Who is the Most Accurate So Far?

Based on the knockout matches concluded so far, different models' characteristics are becoming apparent.

DeepSeek and Gemini have had the most standout moments. They didn't just predict the advancement of favorites like Brazil and France; they also provided highly valuable answers in harder-to-judge upset matches. In the Netherlands vs. Morocco match, their key advantage was daring to predict Morocco's upset and a penalty shootout in advance. Gemini, in particular, directly predicted Morocco's advancement on penalties, a truly impressive performance.

Grok and Qwen are more like "score-oriented players." They nailed several specific scores, performing notably well in matches involving Canada, Brazil, Norway, and France. However, when faced with traditional powerhouses like Germany and the Netherlands, they ultimately leaned toward the favorites.

ChatGPT and Claude are more like "analysis-oriented players." They provide complete reasoning, their directions are mostly sound, and they can also warn about the risk of extra time. However, they often see the challenges but rarely commit to an upset verdict. The Netherlands vs. Morocco match is a clear example: despite identifying the risks of extra time and penalties, they ultimately trusted the Netherlands more.

So, instead of hastily asking which model knows football best, it's better to consider which scenarios each one is suitable for.
