Tour de France: Catch the Breaks Predictor
Published August 20, 2018
Goal
To develop a machine learning model that predicts whether a single rider or a small group of cyclists that has broken away from the peloton (the main group of riders) during a stage of the Tour de France will maintain their lead or be caught before the end of the stage.
Preamble
This project was driven by the need to provide a more immersive experience for television viewers of the Tour de France cycling race. The Catch the Breaks predictor was devised as one of a suite of four machine learning models intended to provide a range of analytics to broadcasters and give commentators extra talking points during their coverage.
Data
Training Data
Two years of historical readings from wireless transmitters fitted beneath the bicycle seats during previous races provided a variety of measurements, including speed, gradient, latitude and longitude, altitude and humidity. Transmitters mounted on motorcycles that preceded and followed the cyclists provided additional data, such as the relative speeds of the two rider groups, the size of the peloton, the size of the gap between the peloton and the leading group, and the wind direction and speed. Other data included the weather conditions, the stage terrain (flat, rolling hills or mountainous) and information about each of the cyclists in the race.
Transmitter data was provided in JSON format stored in a SQL Server database, the historical weather data was downloaded as CSV files from The Weather Channel website, and the data about each cyclist was scraped from procyclingstats.com.
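As a rough illustration, assembling the training set in R might look like the sketch below. The table name transmitter_readings, the payload column, the connection details and the weather_history.csv file name are placeholders rather than the project's actual names.

```r
library(DBI)
library(odbc)
library(jsonlite)

# Connect to the SQL Server database holding the raw transmitter payloads
# (connection details are placeholders)
con <- dbConnect(odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "historical-db",
                 Database = "tdf",
                 UID      = "reader",
                 PWD      = Sys.getenv("TDF_DB_PWD"))

# Each row stores one reading as a JSON document in a text column
raw <- dbGetQuery(con, "SELECT payload FROM transmitter_readings")
dbDisconnect(con)

# Flatten the JSON documents into one data frame of readings
transmitter <- do.call(rbind, lapply(raw$payload, function(p) {
  as.data.frame(fromJSON(p), stringsAsFactors = FALSE)
}))

# Historical weather observations downloaded as CSV files
weather <- read.csv("weather_history.csv", stringsAsFactors = FALSE)
```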
Live Data
A live stream of real-time transmitter data in JSON format was delivered during the race to an IBM Bluemix cloud platform via Apache Kafka stream processing, where it was ingested by a custom-coded server module for processing. Additionally, real-time weather data from the IBM weather service was ingested via a web API and stored in a SQL database.
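To give a sense of the payload structure, the sketch below parses a single incoming JSON message into a one-row data frame of the kind the server module could append to the live feature store. The field names (riderId, speed, gradient and so on) are illustrative assumptions, not the actual message schema.

```r
library(jsonlite)

# Turn one JSON transmitter message (as delivered over the Kafka stream)
# into a single-row data frame
parse_reading <- function(msg) {
  x <- fromJSON(msg)
  data.frame(
    rider_id  = x$riderId,
    timestamp = as.POSIXct(x$timestamp, tz = "UTC"),
    speed     = x$speed,
    gradient  = x$gradient,
    latitude  = x$lat,
    longitude = x$lon,
    altitude  = x$altitude,
    stringsAsFactors = FALSE
  )
}

# Example message
msg <- '{"riderId": 101, "timestamp": "2018-07-14 14:03:05", "speed": 46.2,
         "gradient": 1.4, "lat": 44.85, "lon": 0.48, "altitude": 212}'
parse_reading(msg)
```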
Method
The data was analysed using exploratory techniques in R; many of the more than 200 features were discarded either because they contained too many missing values or because they added no significant information beyond what was already available in other features.
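A simplified version of that screening, assuming the joined training data sits in a data frame called features, might look like this (the 30% missing-value cutoff and the 0.9 correlation cutoff are illustrative thresholds, not the values used in the project):

```r
library(caret)

# 1. Drop features with a high proportion of missing values
missing_rate <- colMeans(is.na(features))
features <- features[, missing_rate < 0.3]

# 2. Drop features that are near-duplicates of others (high pairwise correlation)
numeric_cols <- sapply(features, is.numeric)
cor_matrix   <- cor(features[, numeric_cols], use = "pairwise.complete.obs")
redundant    <- findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)
features     <- features[, !(names(features) %in% redundant)]
```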
A number of engineered features were also generated, such as the estimated effort being expended by each rider and the effect of wind on rider speed.
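The sketches below show the flavour of those engineered features; the formulas are deliberately simplified stand-ins rather than the exact calculations used in the project, and the column names are assumptions.

```r
# Headwind component of the wind relative to the rider's direction of travel
# (positive = headwind, negative = tailwind)
headwind <- function(wind_speed, wind_dir_deg, rider_heading_deg) {
  wind_speed * cos((wind_dir_deg - rider_heading_deg) * pi / 180)
}

# Crude effort proxy: riding fast uphill costs more than riding fast on the flat
effort_proxy <- function(speed_kmh, gradient_pct) {
  speed_kmh * (1 + gradient_pct / 10)
}

features$headwind <- headwind(features$wind_speed, features$wind_dir,
                              features$rider_heading)
features$effort   <- effort_proxy(features$speed, features$gradient)
```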
Due to the sheer volume of information arriving every second, the data was aggregated into 10-minute windows, so that the mean value of each feature over each window became the input to the machine learning model. A number of model types were trialled before settling on an approach, including linear regression, logistic regression and random forest models. After many iterations over both model type and hyperparameters, a random forest using around 30 features was found to give the best predictive results.
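A minimal sketch of the windowing and model fit is shown below, assuming a data frame features with a POSIXct timestamp column, the retained numeric predictors and a Yes/No outcome column caught; the ntree value and random seed are arbitrary choices for illustration.

```r
library(dplyr)
library(randomForest)

# Aggregate the per-second readings into 10-minute windows of feature means
windowed <- features %>%
  mutate(window = cut(timestamp, breaks = "10 mins")) %>%
  group_by(window) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),
            caught = last(caught)) %>%
  ungroup()

# Fit a random forest classifier on the windowed features
train <- windowed %>%
  select(-window) %>%
  mutate(caught = factor(caught, levels = c("No", "Yes")))

set.seed(42)
rf_model <- randomForest(caught ~ ., data = train,
                         ntree = 500, importance = TRUE)
```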
The predictions were written to a SQL Server database every 1.5 minutes, keyed by timestamp and stored both as probabilities and as a Yes/No classification (Yes: the peloton will catch the breakaway; No: the peloton will not catch the breakaway).
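A sketch of that prediction-and-store step, run on a schedule, might look like the following; the live_window data frame, the breakaway_predictions table name, the connection details and the 0.5 decision threshold are all assumptions.

```r
library(DBI)
library(odbc)

# Probability that the peloton catches the breakaway, from the fitted forest
prob_caught <- predict(rf_model, newdata = live_window, type = "prob")[, "Yes"]

prediction <- data.frame(
  timestamp      = Sys.time(),
  prob_caught    = prob_caught,
  classification = ifelse(prob_caught >= 0.5, "Yes", "No")
)

# Append the latest prediction to the SQL Server results table
con <- dbConnect(odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "live-db",
                 Database = "tdf",
                 UID      = "writer",
                 PWD      = Sys.getenv("TDF_DB_PWD"))
dbWriteTable(con, "breakaway_predictions", prediction, append = TRUE)
dbDisconnect(con)
```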
Results
After running and re-training the model after each of the 21 stages, accuracy averaged 98-99%, meaning the model made the correct call roughly 99 times out of 100. This was impressive considering that the number of training instances with actual breakaways that were either caught or not caught was somewhat limited. The results held up across the different terrain types, and other measures of model performance, such as precision and recall, were similarly strong.
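For reference, those per-stage metrics can be derived from a simple confusion matrix of predicted versus actual outcomes, as in this sketch (predicted_labels and actual_labels are assumed to be Yes/No vectors for the held-out stage):

```r
cm <- table(predicted = predicted_labels, actual = actual_labels)

accuracy  <- sum(diag(cm)) / sum(cm)               # correct calls / all calls
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])   # of predicted catches, how many happened
recall    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # of actual catches, how many were predicted
```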
The other measure of success was the requirement to produce a new prediction every 2 minutes. This was also achieved, helped in large part by aggregating the feature data into means over 10-minute windows.