WindAI: A Deep Learning Approach to Global Wind Resource Assessment Using Multi-Source Reanalysis Data
Abstract
Accurate wind resource assessment is a prerequisite for wind energy project development, yet conventional methods remain costly, time-consuming, and geographically constrained. This paper presents WindAI, a deep learning system that predicts hourly wind farm capacity factors for any location on Earth using freely available meteorological reanalysis data. The model is a deep neural network with multiple hidden layers, batch normalization, and regularization, trained on over 10 million hourly generation-weather observation pairs from more than 300 wind farms across eight countries: Australia, the United Kingdom, Belgium, Denmark, Canada (Ontario), the United States (Texas), New Zealand, and Brazil. Input features are drawn from four independent data sources (ERA5 reanalysis, MERRA2 reanalysis, ERA5 static fields, and Copernicus DEM elevation data), sampled at multiple spatial grid points surrounding each site, yielding 400+ features per observation. Evaluated on six geographically and technologically diverse held-out wind farms never seen during training, WindAI achieves an hourly root mean square error (RMSE) of 0.147 and a coefficient of determination (R²) of 0.777. When hourly predictions are aggregated to annual mean capacity factors, relative prediction errors range from 2.1% to 7.8% across the test plants. The system provides predictions in minutes at negligible marginal cost, compared to weeks and tens of thousands of dollars for traditional consultant-led assessments.
1. Introduction
1.1 The Problem
Wind energy is one of the fastest-growing sources of electricity worldwide, yet the fundamental bottleneck in wind project development remains the same as it was decades ago: determining whether a given site has enough wind to justify construction. Traditional wind resource assessments (WRAs) are expensive, slow, and inherently local. A preliminary assessment from a specialized consultant typically costs $8,000 to $9,000 and takes two to four weeks. A bankable WRA — the level of analysis required to secure project financing — can cost $15,000 to $50,000 or more and take four to twelve weeks, often requiring the installation and operation of on-site meteorological masts for one or more years.
Physics-based wind resource modeling tools such as WAsP (developed by DTU Wind Energy) and DNV's WindFarmer provide high-fidelity predictions but require detailed site-specific inputs, expert calibration, and per-site software licenses. A WAsP license costs approximately €2,100, while DNV WindFarmer carries an annual fee of approximately €5,639.
1.2 The Opportunity
Three converging developments create an opportunity to close the gap between these costly, site-by-site methods and fast, inexpensive screening of candidate locations. First, decades of high-quality hourly weather reanalysis data are now freely available through ERA5 and MERRA2. Second, growing volumes of publicly disclosed wind farm generation data provide ground-truth observations. Third, advances in deep learning enable models that learn the complex, nonlinear mapping between gridded weather variables and actual power output.
1.3 Our Approach
WindAI takes a data-driven approach. Rather than modeling the physics of atmospheric flow and turbine aerodynamics from first principles, we train a single neural network on the joint distribution of weather conditions and observed power output across hundreds of wind farms spanning diverse geographies, climates, turbine technologies, and terrain types. The model ingests 400+ features per hourly observation and outputs a scalar capacity factor. The system is deployed as a REST API that accepts latitude, longitude, and turbine specifications, fetches the relevant reanalysis data on demand, and returns hourly capacity factor predictions, annual energy production estimates, and statistical summaries — all within minutes.
2. Data Sources
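WindAI integrates six datasets, described in the subsections below. Five provide input features (ERA5 meteorological variables, MERRA2 winds, ERA5 boundary layer height, ERA5 static fields, and Copernicus DEM elevation), and the sixth, wind farm generation records, provides the prediction target.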
2.1 ERA5 Reanalysis
ERA5 is the fifth-generation atmospheric reanalysis produced by ECMWF. It provides hourly estimates of atmospheric variables on a global grid at 0.25-degree spatial resolution. Six ERA5 variables are extracted at 16 grid points per site (6 × 16 = 96 features):
| Variable | Description |
|---|---|
| u100 | Eastward wind at 100 metres |
| v100 | Northward wind at 100 metres |
| u10 | Eastward wind at 10 metres |
| v10 | Northward wind at 10 metres |
| t2m | Air temperature at 2 metres |
| sp | Surface air pressure |
2.2 MERRA2 Reanalysis
MERRA2 is produced by NASA at 0.5° × 0.625° resolution. Including MERRA2 alongside ERA5 provides an independent estimate of atmospheric conditions. Two MERRA2 wind variables (U50M, V50M) are extracted at 16 grid points, yielding 32 features.
2.3 ERA5 Boundary Layer Height
The planetary boundary layer height (BLH) serves as a proxy for atmospheric stability. It is extracted at 16 grid points, contributing 16 features.
2.4 ERA5 Static and Invariant Fields
Time-invariant fields describing terrain and surface characteristics: geopotential at surface, land-sea mask, standard deviation of orography, slope, anisotropy, and angle of sub-grid orography. Extracted at 16 grid points (6 × 16 = 96 features).
2.5 Copernicus Digital Elevation Model
The Copernicus GLO-30 DEM provides global terrain elevation at 30-metre resolution. Ten summary statistics are computed per site (min, p20, p50, p80, max, std, mean, range, slope_mean, slope_std).
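The ten statistics listed above could be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `dem_summary`, the tile extent, the use of degrees for slope, and the treatment of each pixel as a 30 m square in projected metres are all assumptions made for illustration.

```python
import numpy as np

def dem_summary(elev_m, pixel_m=30.0):
    """Ten terrain statistics for a 2-D elevation tile (values in metres)."""
    elev = np.asarray(elev_m, dtype=float)
    # Finite-difference elevation gradients along rows and columns (m per m).
    dz_drow, dz_dcol = np.gradient(elev, pixel_m)
    # Assumption: slope is expressed as an angle in degrees.
    slope_deg = np.degrees(np.arctan(np.hypot(dz_drow, dz_dcol)))
    p20, p50, p80 = np.percentile(elev, [20, 50, 80])
    return {
        "min": elev.min(), "p20": p20, "p50": p50, "p80": p80,
        "max": elev.max(), "std": elev.std(), "mean": elev.mean(),
        "range": elev.max() - elev.min(),
        "slope_mean": slope_deg.mean(), "slope_std": slope_deg.std(),
    }

# Synthetic tile (not real DEM data): a plane rising 3 m per 30 m pixel.
tile = np.tile(np.arange(5) * 3.0, (5, 1))
stats = dem_summary(tile)
```

On the synthetic plane the slope is uniform, so `slope_std` is effectively zero and `slope_mean` equals the arctangent of 0.1 expressed in degrees.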
2.6 Spatial Sampling: The 16-Point Grid
Rather than extracting weather data at a single grid point, WindAI samples at 16 points in a 4×4 grid surrounding each site. This captures spatial gradients in wind speed, pressure, and temperature. Points are enumerated in a perimeter-spiral order:
| | | | |
|---|---|---|---|
| 1 | 2 | 3 | 4 |
| 12 | 13 | 14 | 5 |
| 11 | 16 | 15 | 6 |
| 10 | 9 | 8 | 7 |
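The perimeter-spiral enumeration can be generated programmatically. The sketch below builds the visiting order for a 4×4 grid and attaches coordinates on a 0.25-degree grid. How the 4×4 block is positioned relative to the site, and the convention that the first row is the northernmost, are assumptions for illustration; the function names are hypothetical.

```python
import numpy as np

def spiral_indices(n=4):
    """n x n array holding the 1-based visiting order of a clockwise
    inward spiral that starts at the top-left corner."""
    order = np.zeros((n, n), dtype=int)
    top, bottom, left, right = 0, n - 1, 0, n - 1
    k = 1
    while top <= bottom and left <= right:
        for c in range(left, right + 1):          # along the top row
            order[top, c] = k
            k += 1
        top += 1
        for r in range(top, bottom + 1):          # down the right column
            order[r, right] = k
            k += 1
        right -= 1
        if top <= bottom:
            for c in range(right, left - 1, -1):  # back along the bottom row
                order[bottom, c] = k
                k += 1
            bottom -= 1
        if left <= right:
            for r in range(bottom, top - 1, -1):  # up the left column
                order[r, left] = k
                k += 1
            left += 1
    return order

def grid_points(lat, lon, res=0.25, n=4):
    """(lat, lon) of the n*n grid nodes around a site, in spiral order.
    Assumption: the block brackets the site, with row 0 northernmost."""
    lat0 = np.floor(lat / res) * res       # node at or just south of the site
    lon0 = np.floor(lon / res) * res       # node at or just west of the site
    offsets = np.arange(n) - (n // 2 - 1)  # -1, 0, 1, 2 for n = 4
    lats = lat0 + offsets[::-1] * res
    lons = lon0 + offsets * res
    order = spiral_indices(n)
    pts = np.empty((n * n, 2))
    for r in range(n):
        for c in range(n):
            pts[order[r, c] - 1] = (lats[r], lons[c])
    return pts

print(spiral_indices())
```

The printed array reproduces the layout above, with points 1 to 12 on the perimeter and 13 to 16 forming the inner 2×2 block.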
2.7 Wind Farm Generation Data
Hourly generation records from wind farms in eight countries, sourced from grid operators and regulatory bodies (AEMO, ENTSO-E, IESO, ERCOT, ONS). In total, approximately 10.5 million hourly observations across 300+ wind farms spanning 2006 to 2020.
| Country / Region | Data Source |
|---|---|
| Australia (South) | AEMO dispatch SCADA |
| Australia (West) | SW Australia facility data |
| Brazil | ONS hourly generation |
| United Kingdom | ENTSO-E |
| Belgium | ENTSO-E |
| Denmark | ENTSO-E |
| Canada (Ontario) | IESO |
| USA (Texas) | ERCOT |
3. Model Architecture
3.1 Network Design
WindAI employs a multi-layer deep neural network that maps 400+ input features to a scalar capacity factor. The architecture combines batch normalization, dropout regularization, and a funnel-shaped stack of hidden layers that progressively compresses the high-dimensional input.
3.2 Design Rationale
- Batch normalization: Stabilizes training by normalizing internal activations, which matters because the inputs arrive on very different scales (wind speed in m/s, pressure in pascals, temperature in kelvin).
- Dropout regularization: Reduces co-adaptation between units and improves generalization to unseen locations.
- Funnel architecture: Forces progressively compressed representations, distilling hundreds of raw and physics-derived features into a single prediction.
3.3 Feature Categories
The model combines raw meteorological variables with physics-derived features including wind speed, wind shear exponents, wind direction components, and air density. These are computed from the underlying reanalysis data at multiple spatial grid points surrounding each site.
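The derived quantities named above follow from standard relations, sketched here for a single grid point using the ERA5 variable names from Section 2.1. The exact formulas used in WindAI are not given in this paper, so the power-law shear definition, the dry-air density approximation, and the sine/cosine direction encoding are assumptions for illustration.

```python
import numpy as np

R_DRY = 287.05  # specific gas constant of dry air, J/(kg K)

def derived_features(u100, v100, u10, v10, t2m, sp):
    """Physics-derived features from ERA5 fields at one grid point.
    Inputs may be scalars or equally shaped arrays (m/s, K, Pa)."""
    ws100 = np.hypot(u100, v100)   # wind speed at 100 m
    ws10 = np.hypot(u10, v10)      # wind speed at 10 m
    # Power-law shear exponent: ws100 / ws10 = (100 / 10) ** alpha.
    eps = 1e-6                     # guards against division by zero in calm hours
    alpha = np.log((ws100 + eps) / (ws10 + eps)) / np.log(100.0 / 10.0)
    # Meteorological direction (where the wind blows FROM), encoded as
    # sine and cosine to avoid the 0/360 degree discontinuity.
    theta = np.arctan2(-u100, -v100)
    # Air density from the ideal gas law, neglecting humidity (assumption).
    rho = sp / (R_DRY * t2m)
    return {
        "ws100": ws100, "ws10": ws10, "shear_alpha": alpha,
        "dir_sin": np.sin(theta), "dir_cos": np.cos(theta),
        "air_density": rho,
    }

# A northerly wind (blowing toward the south) under standard sea-level conditions.
feats = derived_features(u100=0.0, v100=-10.0, u10=0.0, v10=-5.0,
                         t2m=288.15, sp=101325.0)
```

For this example the density comes out close to the standard 1.225 kg/m³, and the direction encoding corresponds to a wind from due north.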
3.4 Feature Inventory (400+ total)
| Category | Description |
|---|---|
| Plant attributes | Hub height, turbine count, rated power, rotor diameter, etc. |
| Spatial distances | Distance from each grid point to plant location |
| ERA5 meteorological | Wind, temperature, and pressure variables at multiple grid points |
| ERA5 boundary layer | Boundary layer height at multiple grid points |
| MERRA2 wind | Independent wind estimates at multiple grid points |
| ERA5 static fields | Invariant terrain and surface fields at multiple grid points |
| Elevation | Terrain statistics from Copernicus DEM |
| Temporal encoding | Hour-of-day and month-of-year |
| Derived physics | Wind speed, shear, direction, air density |
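Two of the simpler categories in the inventory, spatial distances and temporal encoding, can be illustrated as follows. The paper does not state how either is computed, so the haversine great-circle distance and the cyclical sine/cosine encoding shown here are assumptions, and the function names are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def cyclical(value, period):
    """Encode a periodic quantity as (sin, cos) so that the end of one
    cycle sits next to the start of the following one."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# One 0.25-degree step in latitude, the ERA5 grid spacing.
step_km = haversine_km(50.0, 3.0, 50.25, 3.0)
hour_sin, hour_cos = cyclical(6, 24)    # 06:00
month_sin, month_cos = cyclical(0, 12)  # January, with months counted from 0
```

With this encoding, hour 23 and hour 0 are neighbours in feature space, which a raw integer hour would not capture.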
4. Training
4.1 Data Split
The dataset is split by plant identity rather than by random sampling. Six wind farms are held out entirely for evaluation. This plant-level holdout ensures the model is evaluated on its ability to generalize to completely unseen locations and turbine configurations.
| Plant | Country | Type | Turbines | Rated Power (kW) |
|---|---|---|---|---|
| Albany Grasmere | Australia | Onshore | 6 | 2,300 |
| Amazon Wind Farm TX | USA (Texas) | Onshore | 110 | 2,300 |
| Belwind I | Belgium | Offshore | 55 | 3,000 |
| Bobcat Bluff TX | USA (Texas) | Onshore | 100 | 1,500 |
| Comber | Canada (Ontario) | Onshore | 72 | 2,300 |
| Kingsbridge I | Canada (Ontario) | Onshore | 22 | 1,800 |
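A plant-level split of this kind can be expressed as a mask over the plant identifier of each hourly row. The sketch below assumes each observation carries a plant name; the training plant names in the toy example are hypothetical.

```python
import numpy as np

# The six held-out plants listed in the table above.
HELD_OUT = (
    "Albany Grasmere", "Amazon Wind Farm TX", "Belwind I",
    "Bobcat Bluff TX", "Comber", "Kingsbridge I",
)

def split_by_plant(plant_ids, held_out=HELD_OUT):
    """Return (train_idx, test_idx) so that every row of a held-out plant
    lands in the test set and no plant appears in both sets."""
    plant_ids = np.asarray(plant_ids)
    test_mask = np.isin(plant_ids, list(held_out))
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy example: "Plant A" and "Plant B" are hypothetical training plants.
rows = ["Plant A", "Comber", "Plant B", "Comber", "Belwind I", "Plant A"]
train_idx, test_idx = split_by_plant(rows)
```

Splitting on plant identity rather than on random rows prevents hours from the same farm, which are strongly autocorrelated, from appearing on both sides of the split.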
4.2 Optimization
| Parameter | Value |
|---|---|
| Optimizer | AdamW (weight_decay = 1e-4) |
| Learning rate schedule | OneCycleLR (0.001 → 0.005) |
| Batch size | 8,192 |
| Epochs | 50 |
| Loss function | Mean Squared Error (MSE) |
The model is implemented in PyTorch. The full training run completes in approximately 4 minutes on an NVIDIA A10G GPU. The model weights and normalization statistics are exported to a portable NumPy archive (~1.6 MB).
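The optimization settings in the table map onto PyTorch roughly as shown below. This is a sketch under stated assumptions rather than the actual training script: the hidden-layer widths, the dropout rate, the ReLU activation, and the input dimension of 400 are placeholders not given in this paper, and "0.001 → 0.005" is interpreted as the initial and peak learning rates of the one-cycle schedule.

```python
import math
import torch
from torch import nn

def build_model(n_features, widths=(512, 256, 128, 64), p_drop=0.2):
    """Funnel-shaped MLP with batch normalization and dropout.
    Widths and dropout rate are illustrative placeholders."""
    layers = [nn.BatchNorm1d(n_features)]  # normalizes heterogeneous input scales
    prev = n_features
    for w in widths:
        layers += [nn.Linear(prev, w), nn.BatchNorm1d(w), nn.ReLU(), nn.Dropout(p_drop)]
        prev = w
    layers.append(nn.Linear(prev, 1))      # scalar capacity factor
    return nn.Sequential(*layers)

model = build_model(n_features=400)        # placeholder for the 400+ features
EPOCHS, BATCH_SIZE = 50, 8192
# Approximate: uses the full 10.5 million rows, before removing held-out plants.
steps_per_epoch = math.ceil(10_500_000 / BATCH_SIZE)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
# The initial learning rate is max_lr / div_factor = 0.005 / 5 = 0.001.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-3, div_factor=5.0,
    epochs=EPOCHS, steps_per_epoch=steps_per_epoch,
)
loss_fn = nn.MSELoss()

def train_step(x, y):
    """One optimization step on a batch (x: [B, n_features], y: [B])."""
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x).squeeze(-1), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                       # OneCycleLR is stepped once per batch
    return loss.item()
```

The output layer is left linear here; clipping predictions to the range 0 to 1, or adding a sigmoid, would be a further design choice that this paper does not specify.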
5. Results
5.1 Overall Performance
| Metric | Value |
|---|---|
| RMSE | 0.147 |
| MAE | 0.100 |
| R² | 0.777 |
An hourly RMSE of 0.147 capacity factor units means that the root-mean-square deviation of hourly predictions from actuals is approximately 15 percentage points of installed capacity; the mean absolute deviation, given by the MAE, is 10 percentage points. The practical relevance lies in aggregation to monthly and annual scales, where random hourly fluctuations largely cancel out.
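For reference, the three metrics in the table follow the standard definitions below. The sketch uses a toy four-hour series, not WindAI output, and assumes R² is computed against the mean of the observed values.

```python
import numpy as np

def hourly_metrics(y_true, y_pred):
    """RMSE, MAE and R² between observed and predicted capacity factors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))   # penalizes large misses more heavily
    mae = np.mean(np.abs(err))          # average absolute deviation
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"rmse": rmse, "mae": mae, "r2": 1.0 - ss_res / ss_tot}

# Toy values for illustration only.
m = hourly_metrics([0.2, 0.4, 0.6, 0.8], [0.3, 0.4, 0.5, 0.8])
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE, which is consistent with the reported values of 0.147 and 0.100.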
5.2 Per-Plant Results
| Plant | Country | Actual CF | Predicted CF | Relative Error |
|---|---|---|---|---|
| Albany Grasmere | Australia | 23.7% | 23.2% | 2.1% |
| Amazon Wind Farm TX | USA | 44.3% | 42.3% | 4.5% |
| Belwind I (offshore) | Belgium | 35.9% | 37.7% | 5.0% |
| Bobcat Bluff TX | USA | 33.6% | 36.1% | 7.4% |
| Comber | Canada | 29.0% | 28.2% | 2.8% |
| Kingsbridge I | Canada | 30.6% | 28.2% | 7.8% |
The mean absolute relative error across these plants is approximately 4.9%. Four of the six plants are predicted within about 5% of their actual annual capacity factor; all six are within 8%.
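The relative errors in the table can be reproduced from the rounded capacity factors, taking relative error as |predicted − actual| / actual. The short check below uses only the values printed in the table; small differences from figures computed on unrounded data are possible.

```python
# (actual CF %, predicted CF %) as listed in the per-plant table.
plants = {
    "Albany Grasmere": (23.7, 23.2),
    "Amazon Wind Farm TX": (44.3, 42.3),
    "Belwind I": (35.9, 37.7),
    "Bobcat Bluff TX": (33.6, 36.1),
    "Comber": (29.0, 28.2),
    "Kingsbridge I": (30.6, 28.2),
}

# Relative error in percent of the actual capacity factor.
rel_err = {
    name: abs(pred - actual) / actual * 100.0
    for name, (actual, pred) in plants.items()
}
mean_rel_err = sum(rel_err.values()) / len(rel_err)

for name, err in rel_err.items():
    print(f"{name}: {err:.1f}%")
print(f"mean: {mean_rel_err:.1f}%")
```

The per-plant values match the table column, and the mean computed from these rounded inputs comes out just under 5%.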
5.3 Temporal Aggregation Effects
Prediction accuracy improves substantially with temporal aggregation. While hourly RMSE is approximately 0.147 (about 15 percentage points of capacity factor), monthly errors are typically 2 to 5 percentage points, and annual relative errors range from about 2% to 8%. For the primary use case of estimating annual energy production, the relevant metric is annual accuracy.
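Converting hourly capacity factors into an annual energy production estimate is a simple aggregation, sketched below. The function name is hypothetical, a non-leap year of 8,760 hours is assumed, and no separate loss or availability adjustments are applied. The example uses the Comber configuration from Section 4.1 (72 turbines rated at 2,300 kW) with a constant capacity factor equal to its predicted annual mean of 28.2%.

```python
import numpy as np

HOURS_PER_YEAR = 8760  # non-leap year (assumption)

def annual_energy_mwh(hourly_cf, rated_kw, n_turbines):
    """Annual energy in MWh from hourly capacity factors (0 to 1)."""
    capacity_mw = rated_kw * n_turbines / 1000.0   # installed capacity
    mean_cf = float(np.mean(hourly_cf))            # annual mean capacity factor
    return mean_cf * capacity_mw * HOURS_PER_YEAR

# Stand-in series: a flat 28.2% capacity factor for every hour of the year.
hourly_cf = np.full(HOURS_PER_YEAR, 0.282)
aep_pred = annual_energy_mwh(hourly_cf, rated_kw=2300, n_turbines=72)
aep_actual = annual_energy_mwh(np.full(HOURS_PER_YEAR, 0.290), 2300, 72)
```

For this plant the predicted figure is roughly 409 GWh against roughly 421 GWh at the observed capacity factor, the same 2.8% relative gap reported in Section 5.2, since energy scales linearly with mean capacity factor.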
6. Comparison with Alternative Approaches
| Characteristic | WindAI | WAsP | WindFarmer | Consultant WRA |
|---|---|---|---|---|
| Cost per site | $49.99 | €2,100 (license) | €5,639/year | $8,000–50,000+ |
| Time per site | 2–5 minutes | Days | Days–weeks | 2–12 weeks |
| Calibration required | No | Yes | Yes | Yes (met mast) |
| Wake modeling | Implicit (learned) | Explicit | Explicit | Explicit |
| Global coverage | Yes | Requires local data | Requires local data | Per-site |
| Temporal resolution | Hourly | Statistical | Statistical | Statistical |
7. Limitations and Future Work
7.1 Current Limitations
- Reanalysis resolution: ERA5's 0.25-degree resolution (~28 km) means terrain features smaller than this scale are not explicitly resolved. Sites in exceptionally complex terrain may exhibit larger prediction errors.
- No site-specific calibration: The model does not incorporate site-specific measurement data. It cannot capture local channeling effects or unusual turbulence regimes.
- No explicit wake modeling: Wake effects are learned implicitly from aggregate farm-level data but cannot be modeled for different turbine layouts.
- Geographic training bias: Training data is concentrated in temperate/subtropical climates. Performance in tropical or extreme-latitude regions may be less reliable.
7.2 Future Work
- Higher-resolution data: ERA5-Land at 0.1-degree resolution (~11 km) for improved predictions in complex terrain.
- Temporal lag features: Incorporating lagged features and rolling statistics for improved hourly accuracy.
- Transfer learning: Fine-tuning on site-specific SCADA data for sites with short-term measurements.
- Uncertainty quantification: Monte Carlo dropout or quantile regression for prediction intervals.
- Expanded training data: Additional plants from Northern Europe, continental Europe, and Asia for improved generalization.
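As an illustration of the quantile regression option mentioned under uncertainty quantification, the pinball (quantile) loss below could replace MSE when training a model to output a chosen quantile of the capacity factor. This is a generic sketch of the technique, not part of the current WindAI system.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q (0 < q < 1).
    Under-predictions are weighted by q, over-predictions by (1 - q)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# For q = 0.9, predicting too low costs nine times more than predicting too high.
loss_low = pinball_loss([1.0], [0.0], q=0.9)   # under-prediction by 1.0
loss_high = pinball_loss([1.0], [2.0], q=0.9)  # over-prediction by 1.0
```

Training one output per quantile level, for example 0.1, 0.5, and 0.9, would give a prediction interval around the median estimate.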
8. Conclusion
WindAI demonstrates that a single, globally trained deep learning model can produce useful wind resource assessments for diverse locations worldwide, using only freely available reanalysis data and basic turbine specifications as inputs. Trained on over 10 million hourly observations from more than 300 wind farms across eight countries, the model achieves an hourly RMSE of 0.147 and R² of 0.777 on six held-out test plants spanning four countries. Annual capacity factor predictions fall within 2–8% of observed values.
The model's practical value lies in its ability to provide rapid, low-cost pre-feasibility assessments at scale. Whereas traditional wind resource assessments cost $8,000 to $50,000 and require weeks to months, WindAI delivers results in minutes at a fraction of the cost. This enables developers to screen large portfolios of candidate sites efficiently, focusing detailed assessment resources on the most promising locations.