Wattness Prediction Engine Benchmark v1

1. Context and Objective

This document presents an evaluation of the Wattness race time prediction engine, which combines a multi-discipline physics model (swim, bike, run) with a per-athlete coefficient personalization system.

The evaluation covers 511 real race results from 93 athletes across 24 courses (Ironman and 70.3), from 2017 to 2025. The protocol faithfully reproduces production conditions: chronological leave-future-out validation, where personal coefficients are only computed from prior races.

Two prediction modes are evaluated:

Baseline (free version): physics model only, using the athlete's thresholds (FTP, CSS, CS) and weight. No race history required.
Adjusted (personalized version): baseline corrected by individual coefficients (median of observed/predicted ratios from prior races). Requires a minimum of 5 prior races.

2. Methodology

2.1 Protocol

Each prediction is generated by the exact production pipeline: calculateRacePacing() produces the physics baseline, then fetchPersonalCoefficients() adjusts per discipline if sufficient history exists. No parameters are tuned after the fact.

2.2 Leave-future-out validation

For each athlete's race, personal coefficients are computed only from prior races (exactly as in production). Future races are never visible, eliminating any risk of data leakage.
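The chronological split can be sketched as follows. This is a minimal illustration, not the production code: the Race record and the leave_future_out() helper are hypothetical names, and the sketch assumes that "prior" means a strictly earlier race date for the same athlete.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Race:
    athlete_id: str
    race_date: date
    observed_min: float  # observed finish time, in minutes


def leave_future_out(races):
    """Yield (target, prior) pairs in chronological order.

    `prior` contains only the same athlete's races with a strictly
    earlier date, so no future result can leak into a prediction.
    """
    history_by_athlete = {}
    for race in sorted(races, key=lambda r: r.race_date):
        history = history_by_athlete.setdefault(race.athlete_id, [])
        # Assumption: a race on the same day does not count as prior history.
        prior = [r for r in history if r.race_date < race.race_date]
        yield race, prior
        history.append(race)
```

Each target race would then be predicted using coefficients derived from its `prior` list only.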

2.3 Inclusion criteria

Known courses only: only courses with validated elevation profiles are included (24 courses on Ironman and 70.3 circuits)
No DNF: only completed races are retained
Outlier filter: personal coefficients filtered [0.8 – 1.2], standard deviation ≤ 0.1
Coverage: 177/511 predictions (35%) have active personal coefficients; the remaining 334 are pure baseline

2.4 Metrics

Metric	Definition
MAPE	Mean absolute percentage error relative to observed time
MAE	Mean absolute error (in minutes)
MedAE	Median absolute error (robust to outliers)
P90	90th percentile of absolute error (upper bound for 90% of predictions)
Bias	Mean signed error. Positive = predicted too slow; negative = predicted too fast
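As an illustration, the five metrics above can be computed from paired predicted and observed times as in the sketch below. The function names are hypothetical, and two conventions are assumed rather than taken from the benchmark: the signed error is predicted minus observed (so a positive bias means predicted too slow, as defined above), and P90 uses linear interpolation between ranks.

```python
import math
from statistics import mean, median


def percentile(values, q):
    """Percentile with linear interpolation between ranks (q in [0, 100])."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def error_metrics(predicted_min, observed_min):
    """Return MAPE (%), MAE, MedAE, P90 and bias (all in minutes except MAPE)."""
    # Signed error: positive means the prediction was slower than reality.
    errors = [p - o for p, o in zip(predicted_min, observed_min)]
    abs_errors = [abs(e) for e in errors]
    return {
        "mape_pct": 100 * mean(a / o for a, o in zip(abs_errors, observed_min)),
        "mae_min": mean(abs_errors),
        "medae_min": median(abs_errors),
        "p90_min": percentile(abs_errors, 90),
        "bias_min": mean(errors),
    }
```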

2.5 Athlete category definitions

Athletes are classified into three levels based on a multi-criteria algorithm combining physiological thresholds (FTP, CSS, CS), training volume, competition history, and consistency:

Level	Typical profile	n (dataset)
Elite	Advanced athletes (ADV_* categories): high thresholds, consistent volume, proven history	196
Competitive	Confirmed and regular (CMP_*, EST_RESILIENT categories): solid base, race experience	260
Age-group	Developing or irregular (DEV_*, EST_FRAGILE/STANDARD categories): variable profile, limited history	55

2.6 Evaluation grid

Rating	MAPE threshold	Interpretation
Good	≤ 5 %	Sufficient accuracy for reliable race planning
Fair	5 – 8 %	Usable with caution, noticeable gap on long races
Needs work	> 8 %	Significant gap, prediction should be taken as a rough estimate
Inconclusive	n < 5	Sample too small to draw conclusions
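The grid translates directly into a small lookup. The sketch below uses a hypothetical function name and assumes that the sample-size check takes precedence over the MAPE thresholds, and that a MAPE of exactly 8 % still rates as Fair.

```python
def rate_mape(mape_pct, n):
    """Map a MAPE (in percent) and a sample size to the evaluation grid."""
    if n < 5:
        return "Inconclusive"  # sample too small, whatever the MAPE
    if mape_pct <= 5.0:
        return "Good"
    if mape_pct <= 8.0:
        return "Fair"
    return "Needs work"
```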

3. Physics Model Results (baseline / free version)

The baseline is the engine's foundation: a deterministic physics model that predicts times without personal history. This is the version available to all users, including new ones.

3.1 Overall accuracy by level

Level	n	MAPE	MAE	MedAE	P90	Bias
Elite	196	4.6 %	21 min	14 min	46 min	-15 min
Competitive	260	5.5 %	24 min	17 min	1h00	≈ 0
Age-group	55	14.0 %	1h01	59 min	1h42	+37 min

3.2 By race format

Level	Format	n	MAPE	MAE	Bias
Elite	70.3	106	4.1 %	12 min	-8 min
Elite	Full Ironman	90	4.7 %	29 min	-19 min
Competitive	70.3	174	4.8 %	15 min	-3 min
Competitive	Full Ironman	86	6.6 %	44 min	-20 min
Age-group	70.3	32	13.4 %	45 min	+11 min
Age-group	Full Ironman	23	9.6 %	1h04	+38 min

3.3 By discipline (baseline)

Level	Swim MAPE	Bike MAPE	Run MAPE	Swim Bias	Bike Bias	Run Bias
Elite	8.9 %	4.3 %	6.9 %	-2 min	≈ 0	-8 min
Competitive	11.1 %	8.3 %	8.6 %	-2 min	+11 min	-9 min
Age-group	8.5 %	18.3 %	11.3 %	-2 min	+31 min	-1 min

Known baseline limitations: The model applies the athlete's current thresholds (FTP, CSS) to past races (2017-2025). If an athlete has improved or declined, this introduces a temporal bias, particularly visible for age-groupers (bike +31 min). Resolving this bias (storing historical thresholds) is improvement priority #1.

4. Personalization Impact (individual coefficients)

4.1 Principle

Personal coefficients are the median of observed / baseline ratios computed from the athlete's prior races (minimum 5, outlier-filtered). They correct the baseline per discipline, capturing systematic individual deviations.
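A minimal sketch of this computation for one discipline follows. It is one possible reading of the outlier filter from section 2.3, not the behavior of fetchPersonalCoefficients(): the names are hypothetical, and the sketch assumes that ratios outside [0.8, 1.2] are dropped, that at least 5 ratios must remain, and that the coefficient is disabled when the population standard deviation of the kept ratios exceeds 0.1.

```python
from statistics import median, pstdev

MIN_PRIOR_RACES = 5        # minimum history, per the text above
RATIO_RANGE = (0.8, 1.2)   # assumed to apply to each individual ratio
MAX_STD = 0.1              # assumed to apply to the kept ratios


def personal_coefficient(observed, baseline):
    """Median of observed/baseline ratios, or None if history is unusable."""
    ratios = [o / b for o, b in zip(observed, baseline) if b > 0]
    kept = [r for r in ratios if RATIO_RANGE[0] <= r <= RATIO_RANGE[1]]
    if len(kept) < MIN_PRIOR_RACES:
        return None  # not enough prior races: stay on the pure baseline
    if pstdev(kept) > MAX_STD:
        return None  # history too inconsistent to trust a single coefficient
    return median(kept)


def adjust(baseline_time, coefficient):
    """Apply the coefficient, falling back to the baseline when it is None."""
    return baseline_time if coefficient is None else baseline_time * coefficient
```

For example, an athlete whose prior bike splits were consistently about 3 % slower than the baseline would get a coefficient near 1.03, and a 200 min baseline prediction would become roughly 206 min.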

4.2 Results (n=177, athletes with sufficient history)

Level	n	MAPE baseline	MAPE adjusted	Gain	Bias baseline	Bias adjusted
Elite	86	3.8 %	3.3 %	-0.5 pts	-9 min	-4 min
Competitive	76	5.2 %	4.9 %	-0.3 pts	-4 min	-14 min
Age-group	15	15.2 %	7.3 %	-7.9 pts	+46 min	+6 min

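Key observation: The largest gain is seen for age-groupers (MAPE 15.2 % → 7.3 %, bias +46 min → +6 min). This indicates that personal coefficients effectively compensate for baseline bias in this profile, though the sample remains limited (n=15). For elite and competitive athletes the MAPE gain is small (-0.5 and -0.3 pts), and for competitive athletes the bias widens (-4 min → -14 min).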

5. Overall Summary (baseline + adjusted)

Level	n	Mode	MAPE	MAE	MedAE	P90	Bias
Elite	196	Baseline	4.6 %	21 min	14 min	46 min	-15 min
Elite	196	Adjusted	4.4 %	20 min	12 min	46 min	-13 min
Competitive	260	Baseline	5.5 %	24 min	17 min	1h00	≈ 0
Competitive	260	Adjusted	5.4 %	25 min	15 min	1h00	-5 min
Age-group	55	Baseline	14.0 %	1h01	59 min	1h42	+37 min
Age-group	55	Adjusted	11.8 %	53 min	51 min	1h42	+22 min

6. Scientific Foundations

The Wattness engine is built on physics and physiology models documented in the scientific literature:

Module	Foundation	Reference
Swim hydrodynamics	Drag/velocity relationship in open water	Chatard et al. (1998) [7]
Bike power solver	Power-balance equation with resistive forces (CdA, Crr, gravity, wind), solved by Newton-Raphson	Coggan (2003) [8], Blocken et al. (2018) [6]
Run (elevation)	Energy cost as a function of gradient	Minetti et al. (2002) [3]
Heat penalty (WBGT)	Temperature impact on marathon performance	Ely et al. (2007) [2]
Bike → Run coupling	Transition and pre-run fatigue effect	Hausswirth & Brisswalter (2008) [4], Millet & Vleck (2000) [5]
Triathlon decomposition	Relative contribution of each discipline	Rust et al. (2021) [1]

The model is not a black box: each prediction is decomposable segment by segment, with penalties explicitly attributed (heat, elevation, coupling, glycogen).
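To make the bike power solver row concrete, the sketch below solves a textbook power balance for speed with Newton-Raphson. It is an illustration under stated assumptions, not the Wattness implementation: function names, default air density, and the starting guess are hypothetical, drivetrain losses are ignored, and headwind is expressed in m/s along the direction of travel.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def required_power(v, mass_kg, cda, crr, gradient, headwind=0.0, rho=1.225):
    """Power (W) needed to hold speed v (m/s) against drag, rolling, gravity."""
    theta = math.atan(gradient)  # gradient is rise/run, e.g. 0.05 for 5 %
    air = v + headwind           # air speed relative to the rider
    drag = 0.5 * rho * cda * air * abs(air)
    rolling = crr * mass_kg * G * math.cos(theta)
    gravity = mass_kg * G * math.sin(theta)
    return (drag + rolling + gravity) * v


def solve_speed(power_w, mass_kg, cda, crr, gradient, headwind=0.0,
                rho=1.225, tol=1e-9, max_iter=50):
    """Find v such that required_power(v) == power_w using Newton-Raphson."""
    theta = math.atan(gradient)
    linear = mass_kg * G * (crr * math.cos(theta) + math.sin(theta))
    v = 10.0  # starting guess in m/s (assumption)
    for _ in range(max_iter):
        air = v + headwind
        f = required_power(v, mass_kg, cda, crr, gradient, headwind, rho) - power_w
        # Analytic derivative of the power balance with respect to v.
        df = 0.5 * rho * cda * (air * abs(air) + 2 * abs(air) * v) + linear
        if df == 0:
            raise ZeroDivisionError("flat derivative, cannot take a Newton step")
        step = f / df
        v -= step
        if abs(step) < tol:
            return v
    raise RuntimeError("Newton-Raphson did not converge")
```

Dividing a segment's length by the solved speed gives that segment's time, which is what makes a prediction decomposable segment by segment.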

7. Limitations and Unmodeled Factors

Modeled with explicit penalty: Heat (WBGT), elevation, head/tailwind, bike-run coupling, glycogen depletion.

Partially modeled: Drafting (average factor by level), bike position (TT vs road).

Not modeled: Ocean current, road conditions, mechanical issues, tactical race management, weather conditions beyond temperature (rain, extreme humidity).

8. Ablation Study (contribution of each module)

To measure each sub-module's contribution, the benchmark is re-run while disabling one module at a time. Two naive estimators quantify the physics model's added value:

Individual naive: median of the athlete's prior times (same format) — requires personal history
Population naive: median by format × level (no individual data) — fair comparator vs baseline without history
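Both naive estimators are simple medians and can be sketched as below. The function names and the dictionary keys ("format", "level", "time_min") are hypothetical, and the sketch does not decide whether the population median excludes the race being predicted, since the protocol above does not specify it.

```python
from statistics import median


def individual_naive(prior_times_same_format):
    """Median of the athlete's own prior times in the same format (minutes)."""
    if not prior_times_same_format:
        return None  # no history: this estimator cannot predict
    return median(prior_times_same_format)


def population_naive(results, race_format, level):
    """Median time over all results sharing the same format and level."""
    times = [
        r["time_min"]
        for r in results
        if r["format"] == race_format and r["level"] == level
    ]
    return median(times) if times else None
```

The individual estimator returning None for athletes without same-format history is consistent with its smaller sample in the tables that follow (n=353 rather than 511).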

8.1 Overall results

Variant	n	MAPE	MAE	Bias	MedAE	P90	% >30min
Full model	511	6.1 %	27 min	−1 min	17 min	1h05	29.9 %
No heat	511	6.1 %	27 min	−8 min	17 min	1h06	34.4 %
No coupling	511	6.0 %	26 min	−5 min	17 min	1h04	31.3 %
No glycogen	511	6.1 %	27 min	−1 min	17 min	1h05	29.9 %
Individual naive	353	6.0 %	27 min	+1 min	19 min	59 min	34.8 %
Population naive	511	7.9 %	34 min	−5 min	25 min	1h10	42.1 %

8.2 By athlete level

Elite (n=196)

Variant	n	MAPE	MAE	Bias	MedAE	P90	% >30min
Full model	196	4.6 %	20 min	−15 min	14 min	45 min	23.0 %
No heat	196	5.4 %	25 min	−21 min	16 min	58 min	33.7 %
No coupling	196	4.9 %	22 min	−17 min	15 min	47 min	27.0 %
Individual naive	149	5.4 %	24 min	+6 min	18 min	52 min	31.5 %
Population naive	196	7.5 %	30 min	−3 min	24 min	1h03	39.8 %

Competitive (n=260)

Variant	n	MAPE	MAE	Bias	MedAE	P90	% >30min
Full model	260	5.5 %	24 min	≈ 0	16 min	1h00	24.6 %
No heat	260	5.3 %	24 min	−6 min	15 min	56 min	26.2 %
No coupling	260	5.4 %	24 min	−3 min	16 min	1h00	24.2 %
Individual naive	169	6.3 %	29 min	≈ 0	18 min	1h08	37.3 %
Population naive	260	7.9 %	34 min	−4 min	25 min	1h19	42.7 %

Age-group (n=55)

Variant	n	MAPE	MAE	Bias	MedAE	P90	% >30min
Full model	55	14.0 %	1h00	+36 min	58 min	1h42	80.0 %
No heat	55	12.3 %	53 min	+26 min	52 min	1h31	76.4 %
No coupling	55	12.8 %	55 min	+28 min	51 min	1h32	80.0 %
Individual naive	35	6.9 %	33 min	−12 min	24 min	1h05	37.1 %
Population naive	55	9.5 %	44 min	−15 min	27 min	2h09	47.3 %

8.3 By race format

Full Ironman (n=199)

Variant	n	MAPE	MAE	Bias	MedAE	P90	% >30min
Full model	199	6.2 %	39 min	−8 min	28 min	1h36	46.2 %
No heat	199	6.7 %	43 min	−20 min	34 min	1h30	58.3 %
No coupling	199	6.2 %	39 min	−13 min	31 min	1h29	51.8 %
Individual naive	163	6.4 %	40 min	+6 min	30 min	1h30	50.3 %
Population naive	199	7.4 %	47 min	−6 min	39 min	1h42	58.8 %

70.3 (n=312)

Variant	n	MAPE	MAE	Bias	MedAE	P90	% >30min
Full model	312	6.0 %	18 min	+2 min	13 min	42 min	19.6 %
No heat	312	5.7 %	18 min	−1 min	12 min	38 min	19.2 %
No coupling	312	5.9 %	18 min	≈ 0	13 min	42 min	18.3 %
Individual naive	190	5.6 %	17 min	−2 min	12 min	41 min	21.6 %
Population naive	312	8.3 %	26 min	−4 min	22 min	52 min	31.4 %

8.4 Ablation conclusions

Heat module: most significant impact for Elite (MAPE 4.6 % → 5.4 % without heat, +11 pts in >30min error rate). The effect reverses for Competitive (5.5 % → 5.3 % without heat) and Age-group (14.0 % → 12.3 % without heat), which suggests the thermal penalty overcorrects these populations.

Coupling module: moderate impact on elites (MAPE +0.3 pts, +4 pts in >30min error rate). Negligible on competitive (5.5 % → 5.4 % without coupling); for age-group, removing it lowers MAPE (14.0 % → 12.8 %).

Glycogen module: no measurable impact at any level. This module adds no value in its current form.

8.5 Physics model vs naive estimators

The ablation compares two types of naive estimators, clarifying the physics model's value depending on context:

Criterion	Physics model	Individual naive	Population naive
History required	None	Yes (same format)	None
Decomposition	By discipline + segments	Total only	Total only
Course adaptation	Profile, weather, conditions	None	None
Overall MAPE	6.1 %	6.0 % (n=353)	7.9 %
Overall MedAE	17 min	19 min	25 min
% errors >30 min	29.9 %	34.8 %	42.1 %

Key takeaways:

Vs population naive (fair comparison, no history): the physics model is clearly superior (MAPE 6.1 % vs 7.9 %, MedAE 17 vs 25 min, >30min rate 30 % vs 42 %). Individual physics (thresholds + course) delivers real value.
Vs individual naive (with history): roughly tied on overall MAPE (6.1 % vs 6.0 %, the latter computed on the n=353 subset with history), but the model is better on MedAE (17 vs 19 min) and on the >30min error rate (30 % vs 35 %). The naive estimator beats the model for Age-group (6.9 % vs 14.0 %), which is likely an artifact of the temporal bias affecting the physics model.
For elite/competitive, the physics model beats both naive estimators: MAPE 4.6 %/5.5 % vs 5.4 %/6.3 % (individual naive) and 7.5 %/7.9 % (population naive).

9. Summary

Across 511 real results covering 93 athletes and 24 courses, the Wattness engine shows generally solid accuracy, particularly for elite and competitive profiles, with a further gain when personal history is available (modest for elite and competitive profiles, large for age-groupers). Results should be interpreted with caution for age-groupers due to a still limited sample size.

Profile	Free (MAPE)	Free (MedAE)	Personalized (MAPE)	Personalized, active coefficients only (MAPE)
Elite	4.6 %	14 min	4.4 %	3.3 %
Competitive	5.5 %	17 min	5.4 %	4.9 %
Age-group	14.0 %	59 min	11.8 %	7.3 % (n=15)

The engine's value goes beyond its overall MAPE. The ablation study confirms that the physics model clearly outperforms a naive estimator without history (MAPE 6.1 % vs 7.9 %, >30min error rate 30 % vs 42 %), and that heat and coupling modules deliver real value for elites. Its value also lies in its per-discipline decomposition (enabling an actionable race plan), its course-specific adaptation (elevation, weather, technicality), and its ability to work without history — three properties a simple statistical estimator cannot offer.

The model is under continuous improvement. Priority areas are historical threshold storage (to eliminate temporal bias, the main error driver for age-groupers), improved extreme heat modeling (competitive overcorrection), and expanding the dataset for age-group profiles.

Scientific References

[1] Rust, C.A. et al. (2021). "What Is the Best Discipline to Predict Overall Triathlon Performance?" Frontiers in Physiology, 12, 654552.

[2] Ely, M.R. et al. (2007). "Impact of Weather on Marathon-Running Performance." Medicine & Science in Sports & Exercise, 39(3), 487-493.

[3] Minetti, A.E. et al. (2002). "Energy cost of walking and running at extreme uphill and downhill slopes." Journal of Applied Physiology, 93(3), 1039-1046.

[4] Hausswirth, C. & Brisswalter, J. (2008). "Strategies for improving performance in long duration events." Sports Medicine, 38(11), 881-891.

[5] Millet, G.P. & Vleck, V.E. (2000). "Physiological and biomechanical adaptations to the cycle to run transition in Olympic triathlon." British Journal of Sports Medicine, 34(5), 384-390.

[6] Blocken, B. et al. (2018). "CFD simulations of the aerodynamic drag of two drafting cyclists." Computers & Fluids, 171, 209-229.

[7] Chatard, J.C. et al. (1998). "Analysis of body composition, swimming performance and estimated energy expenditure." European Journal of Applied Physiology, 78(2), 109-113.

[8] Coggan, A.R. (2003). "Training and racing using a power meter." Training Peaks whitepaper.

Technical Validation Report v1