This document presents a rigorous evaluation of the Wattness race time prediction engine, based on a multi-discipline physics model (swim, bike, run) coupled with an individual coefficient personalization system.
The evaluation covers 511 real race results from 93 athletes across 24 courses (Ironman and 70.3), from 2017 to 2025. The protocol faithfully reproduces production conditions: chronological leave-future-out validation, where personal coefficients are only computed from prior races.
Two prediction modes are evaluated: the physics baseline alone, and the baseline adjusted with personal coefficients.
Each prediction is generated by the exact production pipeline: calculateRacePacing() produces the physics baseline, then fetchPersonalCoefficients() adjusts per discipline if sufficient history exists. No parameters are tuned after the fact.
For each athlete's race, personal coefficients are computed only from prior races (exactly as in production). Future races are never visible, eliminating any risk of data leakage.
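The chronological leave-future-out protocol can be sketched as follows. This is a minimal illustration, not the production code: the `Race` record, its field names, and the `leave_future_out` helper are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Race:
    # Hypothetical record layout, for illustration only.
    athlete_id: str
    race_date: date
    observed_min: float   # observed finish time, in minutes
    baseline_min: float   # physics baseline prediction, in minutes


def leave_future_out(races):
    """Yield (race, prior_races) pairs in chronological order.

    prior_races only contains strictly earlier races of the same athlete,
    so a prediction never sees the race itself or any later race.
    """
    history_by_athlete = {}
    for race in sorted(races, key=lambda r: r.race_date):
        history = history_by_athlete.setdefault(race.athlete_id, [])
        prior = [r for r in history if r.race_date < race.race_date]
        yield race, prior
        history.append(race)
```

Each yielded `prior` list is what a coefficient computation would be allowed to see for that race; an athlete's first race always gets an empty history and therefore falls back to the baseline.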
| Metric | Definition |
|---|---|
| MAPE | Mean absolute percentage error relative to observed time |
| MAE | Mean absolute error (in minutes) |
| MedAE | Median absolute error (robust to outliers) |
| P90 | 90th percentile of absolute error (upper bound for 90% of predictions) |
| Bias | Mean signed error. Positive = predicted too slow; negative = predicted too fast |
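As an illustration of the definitions in the table above, the metrics could be computed as in the sketch below. The function name and the nearest-rank percentile convention for P90 are assumptions; the report does not state which percentile method is used.

```python
import math
from statistics import mean, median


def error_metrics(predicted_min, observed_min):
    """Compute the report's metrics from times expressed in minutes.

    Signed error is predicted - observed, so a positive bias means
    the prediction was too slow.
    """
    errors = [p - o for p, o in zip(predicted_min, observed_min)]
    abs_errors = sorted(abs(e) for e in errors)
    n = len(errors)
    # Nearest-rank 90th percentile (assumed convention).
    p90_index = max(1, math.ceil(0.9 * n)) - 1
    return {
        "mape_pct": 100 * mean(abs(e) / o for e, o in zip(errors, observed_min)),
        "mae_min": mean(abs_errors),
        "medae_min": median(abs_errors),
        "p90_min": abs_errors[p90_index],
        "bias_min": mean(errors),
    }
```

For example, predictions of 310, 290, 400 and 500 minutes against observed times of 300, 300, 400 and 450 minutes give an MAE of 17.5 min, a MedAE of 10 min and a bias of +12.5 min.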
Athletes are classified into three levels based on a multi-criteria algorithm combining physiological thresholds (functional threshold power FTP, critical swim speed CSS, critical speed CS), training volume, competition history, and consistency:
| Level | Typical profile | n (dataset) |
|---|---|---|
| Elite | Advanced athletes (ADV_* categories): high thresholds, consistent volume, proven history | 196 |
| Competitive | Confirmed and regular (CMP_*, EST_RESILIENT categories): solid base, race experience | 260 |
| Age-group | Developing or irregular (DEV_*, EST_FRAGILE/STANDARD categories): variable profile, limited history | 55 |
| Rating | MAPE threshold | Interpretation |
|---|---|---|
| Good | ≤ 5 % | Sufficient accuracy for reliable race planning |
| Fair | 5 – 8 % | Usable with caution, noticeable gap on long races |
| Needs work | > 8 % | Significant gap, prediction should be taken as a rough estimate |
| Inconclusive | n < 5 | Sample too small to draw conclusions |
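The rating grid above maps directly to a small helper. The function name is hypothetical; the thresholds are the ones listed in the table, with the sample-size check taking precedence.

```python
def rate_accuracy(mape_pct, n):
    """Return the rating for a subgroup from its MAPE (in %) and sample size."""
    if n < 5:
        return "Inconclusive"
    if mape_pct <= 5.0:
        return "Good"
    if mape_pct <= 8.0:
        return "Fair"
    return "Needs work"
```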
The baseline is the engine's foundation: a deterministic physics model that predicts times without personal history. This is the version available to all users, including new ones.
| Level | n | MAPE | MAE | MedAE | P90 | Bias |
|---|---|---|---|---|---|---|
| Elite | 196 | 4.6 % | 21 min | 14 min | 46 min | -15 min |
| Competitive | 260 | 5.5 % | 24 min | 17 min | 1h00 | ≈ 0 |
| Age-group | 55 | 14.0 % | 1h01 | 59 min | 1h42 | +37 min |
| Level | Format | n | MAPE | MAE | Bias |
|---|---|---|---|---|---|
| Elite | 70.3 | 106 | 4.1 % | 12 min | -8 min |
| Elite | Full Ironman | 90 | 4.7 % | 29 min | -19 min |
| Competitive | 70.3 | 174 | 4.8 % | 15 min | -3 min |
| Competitive | Full Ironman | 86 | 6.6 % | 44 min | -20 min |
| Age-group | Full Ironman | 23 | 9.6 % | 1h04 | +38 min |
| Age-group | 70.3 | 32 | 13.4 % | 45 min | +11 min |
| Level | Swim MAPE | Bike MAPE | Run MAPE | Swim Bias | Bike Bias | Run Bias |
|---|---|---|---|---|---|---|
| Elite | 8.9 % | 4.3 % | 6.9 % | -2 min | ≈ 0 | -8 min |
| Competitive | 11.1 % | 8.3 % | 8.6 % | -2 min | +11 min | -9 min |
| Age-group | 8.5 % | 18.3 % | 11.3 % | -2 min | +31 min | -1 min |
Known baseline limitations: The model applies the athlete's current thresholds (FTP, CSS) to past races (2017-2025). If an athlete has improved or declined, this introduces a temporal bias, particularly visible for age-groupers (bike +31 min). Resolving this bias (storing historical thresholds) is improvement priority #1.
Personal coefficients are the median of observed / baseline ratios computed from the athlete's prior races (minimum 5, outlier-filtered). They correct the baseline per discipline, capturing systematic individual deviations.
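A minimal sketch of this adjustment for one discipline is shown below. The function names are hypothetical, and so is the outlier filter (a fixed ratio window of 0.7 to 1.3): the report does not specify the filter. The sketch also assumes the minimum of 5 races applies after filtering.

```python
from statistics import median

MIN_PRIOR_RACES = 5  # minimum stated in the report


def personal_coefficient(observed_min, baseline_min, low=0.7, high=1.3):
    """Median of observed / baseline ratios over an athlete's prior races.

    Returns None when too few usable races remain, in which case the
    baseline is used unchanged.
    """
    ratios = [o / b for o, b in zip(observed_min, baseline_min) if b > 0]
    # Hypothetical outlier filter: keep ratios inside a fixed window.
    kept = [r for r in ratios if low <= r <= high]
    if len(kept) < MIN_PRIOR_RACES:
        return None
    return median(kept)


def adjusted_prediction(baseline_min, coefficient):
    """Scale the baseline by the coefficient, or fall back to the baseline."""
    if coefficient is None:
        return baseline_min
    return baseline_min * coefficient
```

With prior swim times of 33, 34, 32, 35, 33 and 60 minutes against a 30-minute baseline each, the 60-minute race is filtered out, the coefficient is the median of the five remaining ratios (1.1), and a new 30-minute baseline becomes a 33-minute adjusted prediction.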
| Level | n | MAPE baseline | MAPE adjusted | Gain | Bias baseline | Bias adjusted |
|---|---|---|---|---|---|---|
| Elite | 86 | 3.8 % | 3.3 % | -0.5 pts | -9 min | -4 min |
| Competitive | 76 | 5.2 % | 4.9 % | -0.3 pts | -4 min | -14 min |
| Age-group | 15 | 15.2 % | 7.3 % | -7.9 pts | +46 min | +6 min |
Key observation: The most dramatic gain is seen for age-groupers (MAPE 15.2 % → 7.3 %, bias +46 min → +6 min). This confirms that personal coefficients effectively compensate for baseline bias in this profile, though the sample remains limited (n=15).
| Level | n | Mode | MAPE | MAE | MedAE | P90 | Bias |
|---|---|---|---|---|---|---|---|
| Elite | 196 | Baseline | 4.6 % | 21 min | 14 min | 46 min | -15 min |
| Elite | 196 | Adjusted | 4.4 % | 20 min | 12 min | 46 min | -13 min |
| Competitive | 260 | Baseline | 5.5 % | 24 min | 17 min | 1h00 | ≈ 0 |
| Competitive | 260 | Adjusted | 5.4 % | 25 min | 15 min | 1h00 | -5 min |
| Age-group | 55 | Baseline | 14.0 % | 1h01 | 59 min | 1h42 | +37 min |
| Age-group | 55 | Adjusted | 11.8 % | 53 min | 51 min | 1h42 | +22 min |
The Wattness engine is built on physics and physiology models documented in the scientific literature:
| Module | Foundation | Reference |
|---|---|---|
| Swim hydrodynamics | Drag/velocity relationship in open water | Chatard et al. (1998) [7] |
| Bike power solver | Newton-Raphson solution of the power balance equation with resistive forces (CdA, Crr, gravity, wind) | Coggan (2003) [8], Blocken et al. (2018) [6] |
| Run (elevation) | Energy cost as a function of gradient | Minetti et al. (2002) [3] |
| Heat penalty (WBGT) | Temperature impact on marathon performance | Ely et al. (2007) [2] |
| Bike → Run coupling | Transition and pre-run fatigue effect | Hausswirth & Brisswalter (2008) [4], Millet & Vleck (2000) [5] |
| Triathlon decomposition | Relative contribution of each discipline | Rust et al. (2021) [1] |
The model is not a black box: each prediction is decomposable segment by segment, with penalties explicitly attributed (heat, elevation, coupling, glycogen).
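To make the bike power solver row of the table concrete, here is a generic sketch of solving the standard cycling power balance for speed with Newton-Raphson. It is not the Wattness implementation: the function names, default air density, starting speed and the absence of drivetrain losses are all assumptions of this example.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def power_required(v, cda, crr, mass_kg, grade=0.0, headwind=0.0, rho=1.2):
    """Power (W) needed to hold ground speed v (m/s) against resistive forces."""
    theta = math.atan(grade)
    air_speed = v + headwind
    aero = 0.5 * rho * cda * air_speed * abs(air_speed)
    rolling = crr * mass_kg * G * math.cos(theta)
    gravity = mass_kg * G * math.sin(theta)
    return (aero + rolling + gravity) * v


def solve_speed(power_w, cda, crr, mass_kg, grade=0.0, headwind=0.0,
                rho=1.2, v0=10.0, tol=1e-9, max_iter=50):
    """Find v such that power_required(v) == power_w, by Newton-Raphson."""
    theta = math.atan(grade)
    constant_force = mass_kg * G * (crr * math.cos(theta) + math.sin(theta))
    v = v0
    for _ in range(max_iter):
        f = power_required(v, cda, crr, mass_kg, grade, headwind, rho) - power_w
        air_speed = v + headwind
        # Analytic derivative of power_required with respect to v.
        df = 0.5 * rho * cda * (air_speed * abs(air_speed)
                                + 2 * abs(air_speed) * v) + constant_force
        step = f / df
        v -= step
        if abs(step) < tol:
            return v
    raise RuntimeError("Newton-Raphson did not converge")
```

For a rider producing 200 W with CdA 0.25 m², Crr 0.004 and 80 kg total mass on flat ground without wind, the solver converges to a speed between 10 and 11 m/s; steep descents or strong tailwinds would need extra safeguards that this sketch omits.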
Modeled with explicit penalty: Heat (WBGT), elevation, head/tailwind, bike-run coupling, glycogen depletion.
Partially modeled: Drafting (average factor by level), bike position (TT vs road).
Not modeled: Ocean current, road conditions, mechanical issues, tactical race management, weather conditions beyond temperature (rain, extreme humidity).
To measure each sub-module's contribution, the benchmark is re-run while disabling one module at a time. Two naive estimators quantify the physics model's added value:
| Variant | n | MAPE | MAE | Bias | MedAE | P90 | % >30min |
|---|---|---|---|---|---|---|---|
| Full model | 511 | 6.1 % | 27 min | −1 min | 17 min | 1h05 | 29.9 % |
| No heat | 511 | 6.1 % | 27 min | −8 min | 17 min | 1h06 | 34.4 % |
| No coupling | 511 | 6.0 % | 26 min | −5 min | 17 min | 1h04 | 31.3 % |
| No glycogen | 511 | 6.1 % | 27 min | −1 min | 17 min | 1h05 | 29.9 % |
| Individual naive | 353 | 6.0 % | 27 min | +1 min | 19 min | 59 min | 34.8 % |
| Population naive | 511 | 7.9 % | 34 min | −5 min | 25 min | 1h10 | 42.1 % |
Elite (n=196)
| Variant | n | MAPE | MAE | Bias | MedAE | P90 | % >30min |
|---|---|---|---|---|---|---|---|
| Full model | 196 | 4.6 % | 20 min | −15 min | 14 min | 45 min | 23.0 % |
| No heat | 196 | 5.4 % | 25 min | −21 min | 16 min | 58 min | 33.7 % |
| No coupling | 196 | 4.9 % | 22 min | −17 min | 15 min | 47 min | 27.0 % |
| Individual naive | 149 | 5.4 % | 24 min | +6 min | 18 min | 52 min | 31.5 % |
| Population naive | 196 | 7.5 % | 30 min | −3 min | 24 min | 1h03 | 39.8 % |
Competitive (n=260)
| Variant | n | MAPE | MAE | Bias | MedAE | P90 | % >30min |
|---|---|---|---|---|---|---|---|
| Full model | 260 | 5.5 % | 24 min | ≈ 0 | 16 min | 1h00 | 24.6 % |
| No heat | 260 | 5.3 % | 24 min | −6 min | 15 min | 56 min | 26.2 % |
| No coupling | 260 | 5.4 % | 24 min | −3 min | 16 min | 1h00 | 24.2 % |
| Individual naive | 169 | 6.3 % | 29 min | ≈ 0 | 18 min | 1h08 | 37.3 % |
| Population naive | 260 | 7.9 % | 34 min | −4 min | 25 min | 1h19 | 42.7 % |
Age-group (n=55)
| Variant | n | MAPE | MAE | Bias | MedAE | P90 | % >30min |
|---|---|---|---|---|---|---|---|
| Full model | 55 | 14.0 % | 1h00 | +36 min | 58 min | 1h42 | 80.0 % |
| No heat | 55 | 12.3 % | 53 min | +26 min | 52 min | 1h31 | 76.4 % |
| No coupling | 55 | 12.8 % | 55 min | +28 min | 51 min | 1h32 | 80.0 % |
| Individual naive | 35 | 6.9 % | 33 min | −12 min | 24 min | 1h05 | 37.1 % |
| Population naive | 55 | 9.5 % | 44 min | −15 min | 27 min | 2h09 | 47.3 % |
Full Ironman (n=199)
| Variant | n | MAPE | MAE | Bias | MedAE | P90 | % >30min |
|---|---|---|---|---|---|---|---|
| Full model | 199 | 6.2 % | 39 min | −8 min | 28 min | 1h36 | 46.2 % |
| No heat | 199 | 6.7 % | 43 min | −20 min | 34 min | 1h30 | 58.3 % |
| No coupling | 199 | 6.2 % | 39 min | −13 min | 31 min | 1h29 | 51.8 % |
| Individual naive | 163 | 6.4 % | 40 min | +6 min | 30 min | 1h30 | 50.3 % |
| Population naive | 199 | 7.4 % | 47 min | −6 min | 39 min | 1h42 | 58.8 % |
70.3 (n=312)
| Variant | n | MAPE | MAE | Bias | MedAE | P90 | % >30min |
|---|---|---|---|---|---|---|---|
| Full model | 312 | 6.0 % | 18 min | +2 min | 13 min | 42 min | 19.6 % |
| No heat | 312 | 5.7 % | 18 min | −1 min | 12 min | 38 min | 19.2 % |
| No coupling | 312 | 5.9 % | 18 min | ≈ 0 | 13 min | 42 min | 18.3 % |
| Individual naive | 190 | 5.6 % | 17 min | −2 min | 12 min | 41 min | 21.6 % |
| Population naive | 312 | 8.3 % | 26 min | −4 min | 22 min | 52 min | 31.4 % |
Heat module: most significant impact for Elite (MAPE 4.6 % → 5.4 % without heat, +11 pts in >30min error rate). Reverse effect for Competitive (5.5 % → 5.3 % without heat): the thermal penalty slightly overcorrects for this population.
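Coupling module: moderate impact on elites (MAPE +0.3 pts, +4 pts in >30min error rate) and negligible on competitive (MAPE 5.5 % → 5.4 % without coupling). For age-group, disabling coupling lowers MAPE (14.0 % → 12.8 %) while the >30min error rate stays at 80.0 %.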
Glycogen module: no measurable impact at any level. This module adds no value in its current form.
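The one-module-at-a-time protocol behind these observations can be sketched as a small harness. Everything here is illustrative: the `predict` callable, the dictionary-based race records and the toy penalty values are assumptions, not the production interface.

```python
MODULES = ("heat", "coupling", "glycogen")


def ablation_variants(modules=MODULES):
    """Full model plus one variant per module, with that module disabled."""
    variants = {"full": frozenset(modules)}
    for name in modules:
        variants[f"no_{name}"] = frozenset(modules) - {name}
    return variants


def mape(predicted, observed):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(p - o) / o for p, o in zip(predicted, observed)) / len(observed)


def run_ablation(races, predict, score=mape):
    """Score every variant on the same races with the same metric."""
    observed = [race["observed_min"] for race in races]
    return {
        variant: score([predict(race, enabled) for race in races], observed)
        for variant, enabled in ablation_variants().items()
    }


def toy_predict(race, enabled):
    # Toy stand-in for the engine: base time plus the enabled penalties.
    return race["base_min"] + sum(race[f"{m}_min"] for m in enabled)


toy_races = [{"observed_min": 300.0, "base_min": 280.0,
              "heat_min": 15.0, "coupling_min": 5.0, "glycogen_min": 0.0}]
toy_results = run_ablation(toy_races, toy_predict)
```

In this toy case, disabling heat raises MAPE from 0 % to 5 %, while disabling the zero-valued glycogen penalty leaves the score identical to the full model, which is the pattern a module with no measurable impact would show.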
The ablation compares two types of naive estimators, clarifying the physics model's value depending on context:
| Criterion | Physics model | Individual naive | Population naive |
|---|---|---|---|
| History required | None | Yes (same format) | None |
| Decomposition | By discipline + segments | Total only | Total only |
| Course adaptation | Profile, weather, conditions | None | None |
| Overall MAPE | 6.1 % | 6.0 % (n=353) | 7.9 % |
| Overall MedAE | 17 min | 19 min | 25 min |
| % errors >30 min | 29.9 % | 34.8 % | 42.1 % |
Key takeaways:
Across 511 real results covering 93 athletes and 24 courses, the Wattness engine shows generally solid accuracy, particularly for elite and competitive profiles, with a clear gain when personal history is available. Results should be interpreted with caution for age-groupers due to a still limited sample size.
| Profile | Free (MAPE) | Free (MedAE) | Personalized (MAPE) | Best case |
|---|---|---|---|---|
| Elite | 4.6 % | 14 min | 4.4 % | 3.3 % |
| Competitive | 5.5 % | 17 min | 5.4 % | 4.9 % |
| Age-group | 14.0 % | 59 min | 11.8 % | 7.3 % (n=15) |
The engine's value goes beyond its overall MAPE. The ablation study confirms that the physics model clearly outperforms a naive estimator without history (MAPE 6.1 % vs 7.9 %, >30min error rate 30 % vs 42 %), and that heat and coupling modules deliver real value for elites. Its value also lies in its per-discipline decomposition (enabling an actionable race plan), its course-specific adaptation (elevation, weather, technicality), and its ability to work without history — three properties a simple statistical estimator cannot offer.
The model is under continuous improvement. Priority areas are historical threshold storage (to eliminate temporal bias, the main error driver for age-groupers), improved extreme heat modeling (competitive overcorrection), and expanding the dataset for age-group profiles.
[1] Rust, C.A. et al. (2021). "What Is the Best Discipline to Predict Overall Triathlon Performance?" Frontiers in Physiology, 12, 654552.
[2] Ely, M.R. et al. (2007). "Impact of Weather on Marathon-Running Performance." Medicine & Science in Sports & Exercise, 39(3), 487-493.
[3] Minetti, A.E. et al. (2002). "Energy cost of walking and running at extreme uphill and downhill slopes." Journal of Applied Physiology, 93(3), 1039-1046.
[4] Hausswirth, C. & Brisswalter, J. (2008). "Strategies for improving performance in long duration events." Sports Medicine, 38(11), 881-891.
[5] Millet, G.P. & Vleck, V.E. (2000). "Physiological and biomechanical adaptations to the cycle to run transition in Olympic triathlon." British Journal of Sports Medicine, 34(5), 384-390.
[6] Blocken, B. et al. (2018). "CFD simulations of the aerodynamic drag of two drafting cyclists." Computers & Fluids, 171, 209-229.
[7] Chatard, J.C. et al. (1998). "Analysis of body composition, swimming performance and estimated energy expenditure." European Journal of Applied Physiology, 78(2), 109-113.
[8] Coggan, A.R. (2003). "Training and racing using a power meter." Training Peaks whitepaper.