On November 12, 1970, the Bhola cyclone slammed into the coast of what was then East Pakistan. The storm brought maximum sustained wind speeds of 130 miles per hour (205 kilometers per hour) and a 35-foot (10.5-meter) storm surge, killing an estimated 300,000 to 500,000 people.
Today, the Bhola cyclone remains the deadliest tropical storm on record. But if it had struck a decade later, it might not have been so devastating. Weather forecasting changed dramatically in the 1970s as meteorologists adopted physics-based computer models that improved storm prediction. With the rise of AI, forecasting is evolving again—but this time, experts worry the new models may be less reliable when it comes to predicting unprecedented weather events.
Researchers are calling this the “gray swan” problem. Gray swan weather extremes are physically plausible but so rare that they are poorly represented in training datasets. The trouble is, climate change is leading to more first-of-their-kind weather extremes. Think: the 2021 Pacific Northwest heatwave. This event was so severe that it would have been virtually impossible without climate change.
Physical forecast models can simulate gray swan events like the Pacific Northwest heatwave, though they treat such events as extremely rare. They can do that because they are built on the laws of physics. AI models, by contrast, are trained on past weather data, in which gray swans are practically nonexistent.
“They fail on gray swans,” Pedram Hassanzadeh, an associate professor of geophysical sciences at the University of Chicago, told Gizmodo. He and his colleagues published a study last April that removed all Category 3 through 5 hurricanes from an AI model’s training dataset, then tested the model on Category 5 storms. The results showed that AI models cannot accurately forecast previously unseen events, as this would require extrapolating beyond the range of their training data.
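The design of that experiment can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the `Storm` class, the `split_for_extrapolation_test` helper, and the six storms are hypothetical, and it assumes storms are classified by peak sustained wind using the standard Saffir-Simpson thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Storm:
    name: str             # hypothetical label, not a real storm
    peak_wind_mph: float  # peak 1-minute sustained wind, in mph


def saffir_simpson_category(wind_mph: float) -> int:
    """Return the Saffir-Simpson category (1-5), or 0 below hurricane strength."""
    # Lower wind-speed bound (mph) for each category, strongest first.
    for lower_bound, category in [(157, 5), (130, 4), (111, 3), (96, 2), (74, 1)]:
        if wind_mph >= lower_bound:
            return category
    return 0


def split_for_extrapolation_test(storms):
    """Train only on storms below Category 3; test only on Category 5."""
    train = [s for s in storms if saffir_simpson_category(s.peak_wind_mph) < 3]
    test = [s for s in storms if saffir_simpson_category(s.peak_wind_mph) == 5]
    return train, test


# Synthetic storms for illustration only.
storms = [
    Storm("A", 60), Storm("B", 80), Storm("C", 105),
    Storm("D", 120), Storm("E", 145), Storm("F", 165),
]
train, test = split_for_extrapolation_test(storms)

strongest_seen = max(s.peak_wind_mph for s in train)
weakest_tested = min(s.peak_wind_mph for s in test)
print(f"strongest training storm: {strongest_seen} mph")
print(f"weakest test storm: {weakest_tested} mph")
```

Every test storm is stronger than anything in the training set, so a model that does well here has to extrapolate rather than interpolate.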
“The concern isn’t occasional misses. It’s that AI models can miss silently, producing confident forecasts of unremarkable weather while a record-breaking event is unfolding,” Rose Yu, an associate professor of computer science and engineering at the University of California San Diego, told Gizmodo in an email.
“Other risks matter too,” she said. “AI models can violate conservation laws in subtle ways that don’t show up in standard metrics. When they bust a forecast, diagnosing why is harder. They depend on stable observing systems, which is a real concern given current pressure on satellite programs. And institutionally, if we consolidate around AI too quickly and let physics-based infrastructure atrophy, we lose the redundancy that currently catches AI’s failures.”
The case for AI forecasting
Despite these pitfalls, meteorologists are rapidly adopting AI forecast models, and it’s easy to see why. They’re faster, cheaper, and require far less computational infrastructure than physical models. When it comes to predicting typical weather patterns and events (not gray swans), their accuracy is comparable to that of physical models and improving rapidly.
“The typical rate of progress for most state-of-the-art physical models has been something like a day more accurate per decade, which doesn’t sound like a lot, but that’s consequential,” Andrew Charlton-Perez, a professor of meteorology and head of the School of Mathematical, Physical, and Computational Sciences at the University of Reading, told Gizmodo.
“The rate of accuracy growth for machine learning models has vastly exceeded that,” he said. “They are now competitive, and two-three years ago, they were not even in the same ballpark.”
During the 2025 Atlantic hurricane season, for example, Google DeepMind’s model outperformed nearly every physical model on storm track and intensity. In fact, since 2023, leading AI models such as GraphCast, Pangu-Weather, and the ECMWF’s AIFS have matched or outperformed the best physical models on medium-range forecasting metrics, according to Yu.
AI models are proving especially valuable in parts of the world that lack traditional forecasting resources—regions that are often on the frontlines of climate change. Hassanzadeh co-directed an initiative that provided 38 million farmers across India with AI-based monsoon forecasts, giving them up to four weeks’ advance notice of the rainy season’s onset.
“A lot of countries were left behind in that first revolution of weather forecasting, because [traditional] weather forecasting requires a supercomputer, hundreds of millions of dollars, various fields, workforce, and experts,” Hassanzadeh explained. AI models, by comparison, are far more accessible to lower-income countries.
Filling the knowledge gaps
Still, rapidly adopting these models without addressing the risks would be dangerous, especially in parts of the world highly vulnerable to the impacts of climate change. Shruti Nath, a postdoctoral research associate at the University of Oxford, recently co-authored an editorial calling for more rigorous testing of AI forecast models before public agencies widely adopt them.
“There is still a lot of work to be done in understanding the limits of these models, alongside where they could supplement physical models and why,” she told Gizmodo in an email.
Nath’s editorial outlines a framework for testing AI forecast models that would deliberately withhold a designated set of “iconic” extreme events (like the Pacific Northwest heatwave) from the training dataset. These events would be reserved solely for testing in order to assess the models’ ability to extrapolate to unprecedented weather extremes, or gray swans.
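In code, the core of such a hold-out scheme is a filter on the training data. The sketch below is one possible reading of the idea, not the editorial's specification: `IconicEvent` and `split_by_iconic_events` are hypothetical names, and the late-June 2021 window is an approximate placeholder rather than an agreed benchmark definition. A real protocol would likely also restrict by region and add a time buffer around each event.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class IconicEvent:
    name: str
    start: date  # first day withheld from training
    end: date    # last day withheld from training (inclusive)


def split_by_iconic_events(sample_dates, events):
    """Route each daily sample to training or to the reserved test set."""
    train, held_out = [], []
    for day in sample_dates:
        if any(event.start <= day <= event.end for event in events):
            held_out.append(day)
        else:
            train.append(day)
    return train, held_out


# Placeholder window (approximate) for the 2021 Pacific Northwest heatwave.
events = [IconicEvent("Pacific Northwest heatwave", date(2021, 6, 25), date(2021, 7, 1))]

# Sixteen consecutive daily samples, June 20 through July 5, 2021.
sample_dates = [date(2021, 6, 20) + timedelta(days=i) for i in range(16)]

train_days, test_days = split_by_iconic_events(sample_dates, events)
print(len(train_days), "training days,", len(test_days), "reserved for testing")
```

The model never sees the reserved days during training, so its forecast for them shows how it handles an extreme it has not seen before.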
Actually implementing this AI Retraining Without Iconic Events (AIRWIE) protocol “would require the meteorological community to agree on which high-impact events constitute a rigorous benchmark,” the editorial states. That would be a major undertaking, but Nath believes most researchers agree that there is an urgent need for this kind of testing.
“We need to be a bit more organized, however, in ensuring that proper protocols can be followed and that robust safeguards are put in place and maintained by the community,” Nath said. “This is difficult when things are in such a hype phase and no one wants to miss out on the bandwagon.”
Other researchers, like Hassanzadeh, are developing ways to teach AI forecast models to predict gray swans. He and his colleagues are investigating whether combining AI systems with “relevant sampling” methods, which generate samples of gray swan events, can improve the models’ ability to extrapolate to unprecedented extremes.
Efforts to understand and address the limitations of AI forecasting will be critical, because there’s no turning back now. AI is already reshaping the way we predict the weather, and as the climate becomes increasingly volatile, meteorologists will need every tool in their arsenal to be sharp and reliable. Despite these systems’ current limitations, there is much to gain from continuing to push them forward and figuring out how best to integrate them with physical forecasting.
“The research agenda is about making AI models physically consistent, well-calibrated, and robust to distribution shift,” Yu said. “Abandoning this approach because of the gray swan problem means giving up the biggest improvement in forecasting in a generation.”
